
Patterns

Patterns let you find documents, sentences, or tokens according to rules that describe token attributes and the relations between tokens. Attributes are pieces of information about a token, such as its part-of-speech tag, raw text, dependency tag, or entity type. Patterns follow a dictionary structure and can be loaded from JSON files.
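
Since patterns are plain dictionaries, a JSON file can hold them directly. The following is a minimal sketch using only Python's standard `json` module; the `load_patterns` helper and the inline JSON string are hypothetical and stand in for whatever file you keep your patterns in. The shape check is an assumption about what a well-formed patterns dictionary looks like (pattern names mapping to lists of rules), not a description of the library's own validation.

```python
import json

# Hypothetical file content: pattern names map to lists of token rules.
PATTERNS_JSON = """
{
    "patients": [
        {"LEMMA": "patient"}
    ],
    "date": [
        {"ENT_TYPE": "duration"}
    ]
}
"""

def load_patterns(raw: str) -> dict:
    """Parse a JSON string into a patterns dictionary and check its shape."""
    patterns = json.loads(raw)
    if not isinstance(patterns, dict):
        raise ValueError("patterns must be a JSON object")
    for name, steps in patterns.items():
        if not isinstance(steps, list):
            raise ValueError(f"pattern {name!r} must be a list of rules")
    return patterns

patterns = load_patterns(PATTERNS_JSON)
print(sorted(patterns))  # ['date', 'patients']
```

To read from disk instead, pass the result of `open(path).read()` to the same helper.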

Token Patterns

Token patterns are simple patterns that do not use dependency parsing. They consist of a sequence of attribute-related rules, one per pattern step.

Attributes

Attributes are the properties of a token after analysis by the comprehension API. When a pattern defines an attribute, only tokens whose value for that attribute satisfies the rule are matched.

Attribute         Description
ORTH              Text of the token.
TEXT              Text of the token.
LOWER             Text of the token in lowercase.
LEMMA             Lemma of the token.
POS               Part-of-speech tag of the token.
DEP               Dependency tag of the token.
ENT_TYPE          Entity type of the token (see NER API).
CATEGORY_SUPER    Category "SUPER" of the token (see Meaning API).
CATEGORY_SUB      Category "SUB" of the token (see Meaning API).
LENGTH            Length of the token.
IS_ALPHA          Token is composed of alphabetic characters.
IS_ASCII          Token is composed of ASCII characters.
IS_DIGIT          Token is composed of digits.
IS_LOWER          Token is in lowercase.
IS_UPPER          Token is in uppercase.
IS_TITLE          Token is in titlecase.
IS_PUNCT          Token is punctuation.
IS_SPACE          Token is whitespace.
IS_SENT_START     Token is the first in the sentence.
LIKE_NUM          Token is a number.
LIKE_URL          Token has entity type URL.
LIKE_EMAIL        Token has entity type email.
OP                Operator to modify matching behavior.

Modifiers

Each attribute can map either to a single value or to a dictionary that allows modifiers for more complex behaviors.

Attribute

Modifier            Description
IN                  Attribute value is a member of this list.
NOT_IN              Attribute value is not a member of this list.
ISSUBSET            Attribute value is a subset of (part of) this list.
ISSUPERSET          Attribute value is a superset of this list.
==, >, <, >=, <=    For integer comparisons: attribute value is equal to, greater than, or lower than the given value.
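
To make the modifier semantics concrete, here is a minimal pure-Python sketch of how a single attribute value could be checked against a rule value. It is an illustration, not the library's implementation: the `value_matches` function is hypothetical, it assumes that several modifiers in one dictionary must all hold, and it assumes that ISSUBSET and ISSUPERSET apply to list-valued attributes.

```python
import operator

# Comparison modifiers mapped to Python's operator functions.
COMPARATORS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

def value_matches(value, spec):
    """Return True if an attribute value satisfies a rule value.

    `spec` is either a plain value (exact match) or a dict of modifiers.
    Assumption: when several modifiers are given, all of them must hold.
    """
    if not isinstance(spec, dict):
        return value == spec
    for modifier, expected in spec.items():
        if modifier == "IN":
            ok = value in expected
        elif modifier == "NOT_IN":
            ok = value not in expected
        elif modifier == "ISSUBSET":
            ok = set(value) <= set(expected)  # assumes a list-valued attribute
        elif modifier == "ISSUPERSET":
            ok = set(value) >= set(expected)  # assumes a list-valued attribute
        elif modifier in COMPARATORS:
            ok = COMPARATORS[modifier](value, expected)
        else:
            raise ValueError(f"unknown modifier: {modifier}")
        if not ok:
            return False
    return True

print(value_matches("JJ", {"IN": ["RB", "JJ"]}))  # True
print(value_matches(7, {">=": 3, "<": 10}))       # True
print(value_matches("NN", {"NOT_IN": ["NN"]}))    # False
```

In an actual pattern the same dictionaries appear as attribute values, for example `{"LENGTH": {">=": 3}}`.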

Operators

Operators work like regular-expression quantifiers: they control how many times a pattern step may match a token.

Operator    Description
?           Pattern step is optional and can match 0 or 1 token.
+           Pattern step can match 1 or more tokens.
*           Pattern step can match 0 or more tokens.
!           Pattern step is negated; it must match 0 times.
.           Default operator; pattern step must match exactly 1 token.
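# `nlp` is assumed to be an already initialized client object.
patterns = {
    # One token, then zero or more modifier tokens, then a token with lemma "patient".
    "patients":
        [
            {"POS": {"IN": ["CD", "ENTITY"]}},
            {"POS": {"IN": ["RB", "JJ", "PUNCT", "ENTITY"]}, "OP": "*"},
            {"LEMMA": "patient"}
        ],
    # A single token whose entity type is "duration".
    "date":
        [
            {"ENT_TYPE": "duration"}
        ]
}

text = "3416 consecutive patients were reviewed and 1476 finally enrolled (65.9 ± 20.9 years, 57.3% male). 76 (5.1%) patients had NAEs. Of 444 patients, 76% were male. They had a mean age of 69 ± 10 years."
nlp.add_document(text)

# Match at sentence level: each result pairs a sentence with its matches.
for s, matches in nlp.match_pattern(patterns, level='sentence'):
    print(s.str)
    print(matches)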
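
The way operators drive sequence matching can be sketched in plain Python. The block below is a toy matcher, not the library's algorithm: `rule_matches`, `match_at`, and `find_matches` are hypothetical names, the tokens are hand-written dictionaries rather than API output, attribute values are compared by exact equality only (no modifiers), and the `!` operator is left out. It assumes greedy matching with backtracking, as in regular expressions.

```python
def rule_matches(token, rule):
    """Exact-match check of every attribute in the rule (OP is not an attribute)."""
    return all(token.get(k) == v for k, v in rule.items() if k != "OP")

# (min, max) repetitions for each operator; None means unbounded.
BOUNDS = {".": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def match_at(tokens, steps, start):
    """Return the end index of a greedy match starting at `start`, or None."""
    if not steps:
        return start
    rule, rest = steps[0], steps[1:]
    low, high = BOUNDS[rule.get("OP", ".")]
    # Count how many consecutive tokens this step could consume.
    count = 0
    while (start + count < len(tokens)
           and (high is None or count < high)
           and rule_matches(tokens[start + count], rule)):
        count += 1
    # Try the longest run first, then back off one token at a time.
    for used in range(count, low - 1, -1):
        end = match_at(tokens, rest, start + used)
        if end is not None:
            return end
    return None

def find_matches(tokens, steps):
    """Return (start, end) spans, at most one per start position."""
    spans = []
    for start in range(len(tokens)):
        end = match_at(tokens, steps, start)
        if end is not None and end > start:
            spans.append((start, end))
    return spans

# Hand-written tokens for "3416 consecutive patients were" (illustrative tags).
tokens = [
    {"TEXT": "3416", "POS": "CD", "LEMMA": "3416"},
    {"TEXT": "consecutive", "POS": "JJ", "LEMMA": "consecutive"},
    {"TEXT": "patients", "POS": "NNS", "LEMMA": "patient"},
    {"TEXT": "were", "POS": "VBD", "LEMMA": "be"},
]
steps = [{"POS": "CD"}, {"POS": "JJ", "OP": "*"}, {"LEMMA": "patient"}]
print(find_matches(tokens, steps))  # [(0, 3)]
```

Because the middle step uses `*`, the same pattern also matches "444 patients", where no adjective separates the number from the noun.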

Dependency Patterns

Dependency patterns use dependency parsing, which constructs a grammatical tree of the sentence, to allow complex matching patterns. Attribute matching works as in Token Patterns, with one exception: "?" is the only operator available. Dependency patterns also have relation operators specific to dependency matching. Matching between the pattern and the sentence relies on the dependency relations between tokens rather than on token order, as it does for Token Patterns.

A Dependency Pattern consists of a list of dictionaries, each formatted in this way:

Name           Description
LEFT_ID        Name of the left node in the relation; it must have been defined in an earlier node.
REL_OP         Operator that describes the relation between the left and right nodes.
RIGHT_ID       Name of the right node in the relation (the current node).
RIGHT_ATTRS    The attributes that the right node must match; they are defined as for Token Patterns.

All fields must be completed except for the root node which only needs 'RIGHT_ID' and 'RIGHT_ATTRS' fields. Each pattern must have one root node.
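
These structural rules are easy to check before sending a pattern for matching. The following is a minimal sketch; `validate_dependency_pattern` is a hypothetical helper, not part of the API, and it assumes that "one root node" means exactly one and that a root is recognized by the absence of both 'LEFT_ID' and 'REL_OP'.

```python
def validate_dependency_pattern(nodes):
    """Raise ValueError if the node list breaks the structural rules above."""
    defined = set()  # RIGHT_ID names seen so far
    roots = 0
    for i, node in enumerate(nodes):
        if "RIGHT_ID" not in node or "RIGHT_ATTRS" not in node:
            raise ValueError(f"node {i}: RIGHT_ID and RIGHT_ATTRS are required")
        if "LEFT_ID" not in node and "REL_OP" not in node:
            roots += 1  # a root node has no relation to an earlier node
        elif "LEFT_ID" not in node or "REL_OP" not in node:
            raise ValueError(f"node {i}: LEFT_ID and REL_OP must be given together")
        elif node["LEFT_ID"] not in defined:
            raise ValueError(f"node {i}: LEFT_ID {node['LEFT_ID']!r} is not defined earlier")
        defined.add(node["RIGHT_ID"])
    if roots != 1:
        raise ValueError(f"expected exactly one root node, found {roots}")

# A root node followed by one node attached to it.
nodes = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "V"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
validate_dependency_pattern(nodes)  # passes silently
```

Running such a check first gives a clearer error message than a failed match when a node refers to a name that has not been defined yet.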

Relation Operators

Relation Operator    Description
<                    A is a direct dependant of B.
>                    A is the immediate head of B.
<<                   A is a dependant of B, directly or indirectly.
>>                   A is a head of B, directly or indirectly.
.                    A directly precedes B: A.idx == B.idx - 1.
.*                   A is located before B: A.idx < B.idx.
;                    A directly follows B: A.idx == B.idx + 1.
;*                   A is located after B: A.idx > B.idx.
$+                   A is a sibling of B (same parent/head) and is located directly before B: A.idx == B.idx - 1.
$-                   A is a sibling of B (same parent/head) and is located directly after B: A.idx == B.idx + 1.
$++                  A is a sibling of B and is located before B: A.idx < B.idx.
$--                  A is a sibling of B and is located after B: A.idx > B.idx.
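
The definitions in this table can be tried out on a toy tree. The sketch below encodes each operator as a predicate over two token positions A and B; it is an illustration only. The `heads` list is a hand-annotated parse written for this example (an assumption, not output of the parser), and `RELATIONS` and `holds` are hypothetical names.

```python
# Hand-annotated toy parse: heads[i] is the index of the head of token i,
# and None marks the root. This is an assumption, not parser output.
words = ["Megasoft", "sues", "PearCorp", "for", "copyright", "infringement"]
heads = [1, None, 1, 5, 5, 1]

def ancestors(i):
    """Yield the head of token i, then the head's head, up to the root."""
    while heads[i] is not None:
        i = heads[i]
        yield i

def siblings(a, b):
    """True if a and b are distinct tokens sharing the same head."""
    return a != b and heads[a] is not None and heads[a] == heads[b]

# Each relation operator as a predicate over token indices a (A) and b (B).
RELATIONS = {
    "<": lambda a, b: heads[a] == b,
    ">": lambda a, b: heads[b] == a,
    "<<": lambda a, b: b in ancestors(a),
    ">>": lambda a, b: a in ancestors(b),
    ".": lambda a, b: a == b - 1,
    ".*": lambda a, b: a < b,
    ";": lambda a, b: a == b + 1,
    ";*": lambda a, b: a > b,
    "$+": lambda a, b: siblings(a, b) and a == b - 1,
    "$-": lambda a, b: siblings(a, b) and a == b + 1,
    "$++": lambda a, b: siblings(a, b) and a < b,
    "$--": lambda a, b: siblings(a, b) and a > b,
}

def holds(word_a, op, word_b):
    """Check `A op B` for two words of the toy sentence."""
    return RELATIONS[op](words.index(word_a), words.index(word_b))

print(holds("sues", ">", "Megasoft"))       # True: sues is the head of Megasoft
print(holds("sues", ">", "copyright"))      # False: not a direct dependant
print(holds("sues", ">>", "copyright"))     # True: reached through infringement
print(holds("Megasoft", "$++", "PearCorp")) # True: same head, Megasoft comes first
```

Note the difference between `$+` and `$++` here: "Megasoft" and "PearCorp" share a head but are not adjacent, so only the second relation holds.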

Dependency matching should not be used on Subsentence objects, since they do not have a complete dependency tree.

nlp.add_document("Megasoft sues PearCorp for copyright infringement.")
pattern = {
  'pattern1':[
    {
      "RIGHT_ID": "verb",
      "RIGHT_ATTRS": {"POS": "V", "ORTH":" "}
    },
    {
      "LEFT_ID": "verb",
      "REL_OP": ">",
      "RIGHT_ID": "subject",
      "RIGHT_ATTRS": {"DEP": "nsubj"}
    },
    {
      "LEFT_ID": "verb",
      "REL_OP": ">",
      "RIGHT_ID": "object",
      "RIGHT_ATTRS": {"DEP": "obj"}
    }
  ]
}

for document, match_results in nlp.match_pattern(pattern, level='document'):
  print(document)
  # Use a distinct loop variable so the `pattern` dictionary is not shadowed.
  for pattern_name, matches in match_results.items():
    print(pattern_name, matches)

Output:

[Megasoft sues PearCorp for copyright infringement.]
pattern1 [[sues, Megasoft, PearCorp]]