Overview of the query language


Basic components

Queries in the CorpusSearch query language are built from the following basic components:

Search function calls

The most basic query consists of a single search function call with associated arguments. For instance, the following query searches for nodes labeled "NP-PRD" (predicate noun phrase) that immediately dominate nodes labeled "N" (noun):
(NP-PRD iDominates N)
Here is a sentence that matches the query.
/~*
For he 's a jolly good fellow.
*~/

/*
    1 IP-MAT: 9 NP-PRD, 16 N
*/

(0 (1 IP-MAT (2 CONJ For)
             (4 NP-SBJ (5 PRO he))
             (7 BEP 's)
             (9 NP-PRD (10 D a) (12 ADJ jolly) (14 ADJ good) (16 N fellow)))
             (18 . .))
In addition to "iDominates", CorpusSearch has many other search functions, which are described in detail in their own section.
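
To make the notion of "immediately dominates" concrete, here is a minimal Python sketch that checks the same relation on the example tree. It is not CorpusSearch code: the functions parse and i_dominates are hypothetical names introduced for illustration, the node numbers are stripped from the tree, and labels are compared exactly (no wildcards).

```python
import re

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples.
    Words (leaves) are kept as plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # skip ")"
        return (label, children)

    return node()

def i_dominates(tree, parent_label, child_label):
    """Collect (parent label, child node) pairs where a node labeled
    parent_label has a direct child node labeled child_label."""
    hits = []
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            if label == parent_label and child[0] == child_label:
                hits.append((label, child))
            hits.extend(i_dominates(child, parent_label, child_label))
    return hits

# The example sentence, without node numbers (a simplification).
TREE = ("(IP-MAT (CONJ For) (NP-SBJ (PRO he)) (BEP 's) "
        "(NP-PRD (D a) (ADJ jolly) (ADJ good) (N fellow)) (. .))")

hits = i_dominates(parse(TREE), "NP-PRD", "N")
```

In this sketch the only hit is the NP-PRD node with its child "N fellow"; NP-SBJ produces no hit because its only child is PRO.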

Logical operators

Search function calls can be combined using logical operators. There are two types:

Argument operators

Argument "or"

Two adjacent pipes (||), whether in a command file or a definition file, will abort any search.

Any number of arguments to a search function may be linked together into a list using "|" (argument "or" or "pipe"). For instance,

(BE*|DO*|HV*|MD|VB* iPrecedes NP-SBJ*)
means "BE* or DO* or HV* or MD or VB* immediately precedes NP-SBJ*" and will find sentences like this:
/~*
+Tan was pompe & pryde cast down
(CMKEMPE,2.12)
*~/

/*
1 IP-MAT: 5 BED was, 7 NP-SBJ
*/

(0 (1 IP-MAT (2 ADVP-TMP (3 ADV +Tan))
             (5 BED was)
             (7 NP-SBJ (8 N pompe) (10 CONJ &) (12 N pryde))
             (14 VAN cast)
             (16 RP down))
   (18 ID CMKEMPE,2.12))
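
The way a "|" list is read can be sketched in Python as follows. This is an illustrative model, not the CorpusSearch implementation: label_matches is a hypothetical helper, and it handles only "|" and the "*" wildcard.

```python
import re

def label_matches(argument, label):
    """True if label matches at least one alternative of a '|'-separated list."""
    for alternative in argument.split("|"):
        # "*" stands for any string, including the empty one;
        # every other character is matched literally.
        regex = ".*".join(re.escape(part) for part in alternative.split("*"))
        if re.fullmatch(regex, label):
            return True
    return False

verbs = "BE*|DO*|HV*|MD|VB*"
```

With this model, label_matches(verbs, "BED") is true (via the alternative "BE*"), which corresponds to node 5 in the example above, while label_matches(verbs, "NP-SBJ") is false.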

Argument "not"

"!" (argument "not", read as "bang") negates (= returns the complement of) arguments to search functions. For instance, the following query finds sentences with subjects that immediately dominate some category other than a pronoun.
(NP-SBJ* iDoms !PRO*)
/~*
a runde fot & +ticke bi-come+t an hors wel.
(CMHORSES,87.17)
*~/

/*
1 IP-MAT: 2 NP-SBJ, 15 ADJ +ticke
*/

(0 (1 IP-MAT (2 NP-SBJ (3 D a)
                       (5 ADJP (6 ADJ runde)
                               (8 CONJP *ICH*-1))
                       (10 N fot)
                       (12 CONJP-1 (13 CONJ &))
                                   (15 ADJ +ticke))
             (17 VBP bi-come+t)
             (19 NP-OB1 (20 D an) (22 N hors))
             (24 ADVP (25 ADV wel))
             (27 E_S .))
      (29 ID CMHORSES,87.17))

Argument "not" negates entire lists

"!" cannot be used to negate individual members of a list constructed with "|", and queries of the following form will abort:
(NP-SBJ* iDoms N|!D|!ADJ)
Instead, "!" must precede lists, and it is interpreted as negating the entire list (not just the first member). So the following query:
(NP-SBJ* iDoms !N|D|ADJ)
means "NP-SBJ* immediately dominates some node other than N, D, or ADJ", and it finds sentences with bare pronoun subjects like the following, which are not intended by the ill-formed query:
 
/~*
and he made them grete chere out of mesure
(CMMALORY-M4,2.13)
*~/
/*
1 IP-MAT:  4 NP-SBJ, 5 PRO
*/

(0  (1 IP-MAT (2 CONJ and)
	      (4 NP-SBJ (5 PRO he))
	      (7 VBD made)
	      (9 NP-DTV (10 PRO them))
	      (12 NP-ACC (13 ADJ grete) (15 N chere))
	      (17 ADVP (18 ADV out)
		       (20 PP (21 P of)
			      (23 NP (24 N mesure)))))
    (26 ID CMMALORY-M4,2.13))

The intent of the ill-formed query can be expressed in CorpusSearch by splitting the single condition into two conditions combined with AND, with "!" preceding the whole list in the second condition:

    (NP-SBJ* iDoms N)
AND (NP-SBJ* iDoms !D|ADJ)
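
The rule that "!" negates a whole list can be modeled with a short Python sketch. The helpers parse_argument and accepts are hypothetical; the sketch compares labels exactly (wildcards are left out), and it raises an error where CorpusSearch would abort.

```python
def parse_argument(argument):
    """Split an argument into (negated, alternatives).
    A leading "!" negates the entire list."""
    negated = argument.startswith("!")
    body = argument[1:] if negated else argument
    alternatives = body.split("|")
    if any(alt.startswith("!") for alt in alternatives):
        # "!" on an individual list member is not allowed.
        raise ValueError('"!" may only precede the whole list: ' + argument)
    return negated, alternatives

def accepts(argument, label):
    """True if label satisfies the (possibly negated) argument."""
    negated, alternatives = parse_argument(argument)
    in_list = label in alternatives    # exact comparison, no wildcards
    return not in_list if negated else in_list
```

Under this model, accepts("!N|D|ADJ", "PRO") is true and accepts("!N|D|ADJ", "D") is false, while parse_argument("N|!D|!ADJ") raises an error.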

One argument "not" per search function

CorpusSearch does not allow you to negate both arguments to a single search function. So queries of the following form will abort:
(!NP-SBJ* iPrecedes !VBD)

Queries of the following form are fine.

    (NP-SBJ* iDoms !PRO*)
AND (NP-SBJ* iPrecedes !VBD)
It's true that the entire query contains more than one "!" operator, but each individual search function (iDoms, iPrecedes) contains only one.

Argument "not" before prefix indices

If you need to use both "!" and prefix indices, "!" goes before the indices. This makes sense because the prefix index tells CorpusSearch which node is meant, and then "!" tells CorpusSearch to exclude that node from the output.

For instance, suppose you're looking for sentences with a subject that precedes the object, and where neither the subject nor the object is a pronoun. The following query:

    (NP-SBJ* precedes NP-OB1*)
AND (NP-SBJ* iDoms ![1]PRO*)
AND (NP-OB1* iDoms ![2]PRO*)
will find sentences like this one:
/~*
& +tat schal be a good hors.
(CMHORSES,85.9)
*~/

/*
1 IP-MAT: 4 NP-SBJ, 11 NP-OB1, 5 D +tat, 16 N hors
*/

(0 (1 IP-MAT (2 CONJ &)
             (4 NP-SBJ (5 D +tat))
             (7 MD schal)
             (9 BE be)
             (11 NP-OB1 (12 D a) (14 ADJ good) (16 N hors))
             (18 E_S .))
   (20 ID CMHORSES,85.9))
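
The ordering of "!" and the prefix index can be illustrated by a small Python sketch that splits an argument into its three parts. The function split_argument and its regular expression are assumptions made for illustration; in particular, treating the order "[1]!PRO*" as an error is a choice of this sketch, not documented CorpusSearch behavior.

```python
import re

# Optional "!", then an optional "[n]" prefix index, then the label (or list).
ARGUMENT = re.compile(r"(?P<neg>!)?(?:\[(?P<index>\d+)\])?(?P<label>.+)")

def split_argument(argument):
    """Return (negated, index, label) for arguments such as "![1]PRO*"."""
    match = ARGUMENT.fullmatch(argument)
    if match is None:
        raise ValueError("empty argument")
    label = match.group("label")
    if label.startswith("!"):
        # "!" appeared after the prefix index, e.g. "[1]!PRO*".
        raise ValueError('"!" must come before the prefix index: ' + argument)
    index = match.group("index")
    return (match.group("neg") is not None,
            int(index) if index is not None else None,
            label)
```

For example, split_argument("![1]PRO*") yields (True, 1, "PRO*"), and split_argument("NP-SBJ*") yields (False, None, "NP-SBJ*").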

Search function operators

AND

AND returns trees in which both conjuncts match within a single boundary node. For instance, this query:
node: IP*

query:     (NP-TMP* iDoms ADV*)
       AND (TO iPrecedes VB)
finds this sentence:
/*
4 IP-INF-SBJ:  5 NP-TMP, 6 ADV+NS, 8 TO, 10 VB
*/

(0 (1 IP-MAT (2 CONJ but)
             (4 IP-INF-SBJ (5 NP-TMP (6 ADV+NS oftymes))
                           (8 TO to)
                           (10 VB rede)
                           (12 NP-OB1 (13 PRO it)))
             (15 MD shal)
             (17 VB cause)
             (19 NP-OB1 (20 PRO it))
             (22 IP-INF (23 ADVP (24 ADV wel))
                        (26 TO to)
                        (28 BE be)
                        (29 VAN vnderstande))
          (31 E_S /))
   (33 ID CMREYNAR,6.10))

OR (deprecated)

OR is not yet fully tested and is likely to yield unexpected results. Its functionality can normally be achieved with argument "or". When argument "or" is not sufficient, coding queries will achieve the desired effect.

OR encodes inclusive disjunction. "(FOO) OR (BAR)" returns all subtrees rooted in an instance of the query's selected node boundary with either the property "FOO" or the property "BAR" or both. "FOO" and "BAR" may consist of single search functions or may themselves be built up out of conjunctions, disjunctions and negations of simple search functions.

NOT (deprecated)

NOT is not yet fully tested. It does not yet work correctly in any but the simplest cases and should be avoided except for testing purposes. Its functionality can be achieved with coding queries, notably via the ELSE function.

NOT returns trees rooted in the node boundary that do not contain the described structure. It differs from argument "not" in that none of the arguments need to appear in the domain defined by the boundary node.

For instance,

NOT(NP* iPrecedes VB*)
returns trees that do not contain the structure "(NP* iPrecedes VB*)", including those that contain neither NP* nor VB*.

By contrast,

(NP* iPrecedes !VB*)
returns trees that must contain NP* (namely ones that iPrecede any node except VB*), and
(!NP* iPrecedes VB*)
returns trees that must contain VB* (namely ones that are iPreceded by any node except NP*).
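
The difference between NOT and argument "not" can be seen in a deliberately simplified Python model. Here a "tree" is reduced to a flat sequence of node labels, "iPrecedes" is read as adjacency in that sequence, and "NP*" and "VB*" are modeled as the prefixes "NP" and "VB". All three function names are hypothetical.

```python
def adjacent_pairs(labels):
    """All (left, right) pairs of neighbouring labels."""
    return list(zip(labels, labels[1:]))

def search_not(labels, first, second):
    """NOT(first* iPrecedes second*): no adjacent pair has that shape."""
    return not any(a.startswith(first) and b.startswith(second)
                   for a, b in adjacent_pairs(labels))

def arg_not_second(labels, first, second):
    """(first* iPrecedes !second*): some first* is followed by a non-second*."""
    return any(a.startswith(first) and not b.startswith(second)
               for a, b in adjacent_pairs(labels))

def arg_not_first(labels, first, second):
    """(!first* iPrecedes second*): some second* is preceded by a non-first*."""
    return any(not a.startswith(first) and b.startswith(second)
               for a, b in adjacent_pairs(labels))

no_np_or_vb = ["ADVP", "PP"]
```

In this model, search_not(no_np_or_vb, "NP", "VB") is true even though the sequence contains neither NP* nor VB*, whereas both argument "not" variants are false for it because each requires one of the two labels to be present.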

Regular expressions

CorpusSearch allows the use of basic regular expression syntax in the arguments to functions, but does not support full regular expression syntax. (This is because of weaknesses in the version of Java that CorpusSearch is written in.)

Square brackets

The use of square brackets discussed in what follows is distinct from their use in connection with prefix indices. As the name implies, prefix indices precede node labels. The square brackets discussed here are part of the specification of the node labels themselves.

Expressions enclosed in square brackets specify alternatives. For instance, "[xyz]" stands for a single character that is either an "x", a "y", or a "z". (Equivalently, one could search for "x|y|z". Similarly, "BE[DP]" and "BED|BEP" amount to the same.)

CorpusSearch doesn't support ranges such as "[A-Z]", "[a-z]", and "[0-9]". These must be expressed as complete lists of alternatives (for instance, "[0123456789]").
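
Since ranges have to be written out in full, a small helper can generate the complete lists. The following Python sketch is one possible way to do this outside CorpusSearch; expand_ranges is a hypothetical name, and the function only rewrites ranges between letters or digits inside square brackets.

```python
import re

def expand_ranges(pattern):
    """Rewrite ranges such as [0-9] as explicit lists such as [0123456789]."""
    def expand_class(match):
        body = match.group(1)
        # Replace each "x-y" inside the brackets by every character from x to y.
        expanded = re.sub(
            r"(\w)-(\w)",
            lambda r: "".join(chr(c) for c in
                              range(ord(r.group(1)), ord(r.group(2)) + 1)),
            body,
        )
        return "[" + expanded + "]"

    return re.sub(r"\[([^\]]+)\]", expand_class, pattern)
```

For example, expand_ranges("[0-9]") gives "[0123456789]", while a pattern without ranges, such as "BE[DP]", is returned unchanged. Hyphens outside square brackets (as in "NP-SBJ") are not touched.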

Period

The wildcard character "." (period) matches any letter or digit.

Asterisk

The wildcard character "*" (asterisk) matches any string, including the null string. "*" may be used anywhere in the string specifying the function argument - beginning, middle or end. For instance, "CP*" matches any label beginning with the letters "CP" ("CP", "CP-ADV", "CP-QUE-SPE", etc.). "*-SPE" matches any label ending in "-SPE", and "*hersum*" matches any string containing the substring "hersum" ("hersum" itself, "hersumnesse", "unhersumnesse", etc.).

"NP*" matches all types of noun phrases ("NP-SBJ", "NP-OB1", etc.), but also "NPR" and its variants. The latter matches are not usually intended. To avoid this problem, either add a hyphen ("NP-*") or make reference to appropriate definitions from a definition file. A more radical solution is to replace "NPR" by a non-interfering label like "NR" via a revision query or a sed script.

As mentioned above, regular expression support in Java is not perfect. In particular, after square brackets, ".*" and "*" behave erratically.

gi[uv]*            ← matches "giu" and "giv", but not "giue", "giuen", ... "give", "given", ...
                     (the asterisk is unexpectedly inert)

gi[uv].*           ← matches "giue", "giuen", ... "give", "given", ... but not "giu" or "giv"
                     (the asterisk unexpectedly disallows zero instances of the wildcard period)
In general, ".*" after square brackets gives the desired result, but refuses to match zero instances of the wildcard character ".", contrary to expectation. If those matches are important, searches need to include disjunctions along the following lines (here, the first disjunct doesn't contain an asterisk, since including it is both unnecessary and confusing):
gi[uv]|gi[uv].*    ← matches "giu", "giue", "giuen", ... "giv", "give", "given", ...

Escaping special expressions

Asterisk

As just discussed, "*" is ordinarily interpreted as a wildcard character. In searches for "*" as a literal character in the text, the wildcard interpretation needs to be disabled or "escaped". The standard way of doing this is by prefixing the asterisk with a backslash (\). For instance, the following query with unescaped asterisks:
(NP* dominates *con*)
finds noun phrases containing words like "contradiction" or "unconstitutional" or "Rubicon". If the search is intended to find noun phrases with subjects elided under conjunction ("*con*"), where the asterisks are part of the text, the asterisks need to be escaped:
(NP* dominates \*con\*)
It is also possible to search for literal asterisks by enclosing them in square brackets.
(NP* dominates [*]con[*])

Escaped and unescaped asterisks can be - and very often are - combined in a single search. For instance, here are two alternative ways to search for traces of the form "*T*-1", "*T*-2", etc. The two queries are not completely equivalent, as indicated in the example.

(NP* dominates \*T*)       ← matches *T*-12, *T34, *T*; doesn't match T*, T-5 

(NP* dominates \*T\*-*)    ← matches *T*-12; doesn't match *T34, *T*, T*, T-5

Empty categories in general (which always begin with an asterisk in the PPCHE) can be searched for with "\**".
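
The intended interaction of escaped and unescaped asterisks can be checked with a Python sketch that translates a search argument into an ordinary regular expression. The functions to_regex and matches are hypothetical, and the translation covers only the three cases discussed here: a backslash escape, "[*]" as a literal asterisk, and a bare "*" as a wildcard. Other square-bracket expressions and the period wildcard are not modeled.

```python
import re

def to_regex(pattern):
    """Translate a search argument into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
        elif pattern.startswith("[*]", i):
            out.append(re.escape("*"))             # bracketed asterisk is literal
            i += 3
        elif ch == "*":
            out.append(".*")                       # bare asterisk is a wildcard
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

def matches(pattern, text):
    """True if the whole text matches the translated pattern."""
    return re.fullmatch(to_regex(pattern), text) is not None
```

With this model, matches(r"\*T*", "*T34") is true but matches(r"\*T\*-*", "*T34") is false, in line with the two annotated queries above.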

Period

In searches for literal periods in the text, the period needs to be escaped, either by prefixing it with a backslash (\.) or by enclosing it in square brackets ([.]).

If you fail to find literal periods when you expect to, recall that period is on CorpusSearch's ignore_nodes list by default. So you will have to explicitly "unignore" it by editing the "ignore_nodes" command, either in your query file or by changing the default setting in your preference file.

Integers

Integer arguments are expected for certain search functions (for instance, among others, iDomsNumber), but not allowed for others. For instance, "exists" doesn't take an integer argument, and so the following query crashes.
query: (1929 exists)           ←  CRASH !
To find tokens containing integers, the integer needs to be escaped.
query: (\1929 exists)

Same instance

If a CorpusSearch query contains (exactly) the same label more than once, CorpusSearch assumes by default that each occurrence of the label refers to the same node in the tree, and we say that CorpusSearch enforces "same instance". Thus, the following query:
    (ADVP precedes HV*|MD|VB*)
AND (HV*|MD|VB* precedes NP-SBJ*)
matches tokens with an adverb preceding a modal, and that same modal preceding a subject.

"Same instance" is triggered by strings that are identical character by character. Strings that are not character-by-character identical do not trigger "same instance", even if the strings are extensionally equivalent. For instance, the following query:

    (ADVP precedes HV*|MD|VB*)          ← HV* < MD
AND (MD|HV*|VB* precedes NP-SBJ*)       ← MD < HV*
matches tokens with an adverb preceding a modal in one clause and a form of HAVE preceding a subject in some other clause. (It also matches the tokens returned by the earlier query.) The reason that "same instance" isn't enforced in the second query is that "HV*" and "MD" appear in different orders in the two clauses of the query.

When search expressions become complex, it rapidly becomes difficult to tell whether they are character-by-character identical and whether they will trigger "same instance" or not. This is one of the many reasons to use definition files.
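
As a rough aid for spotting such cases, the following Python sketch flags pairs of arguments that list the same alternatives but are not character-by-character identical. It is a hypothetical helper (near_duplicates is not part of CorpusSearch) and compares only "|"-separated lists.

```python
def near_duplicates(arguments):
    """Return pairs of arguments that contain the same alternatives but are
    written differently, so that "same instance" would not be triggered."""
    pairs = []
    for i, first in enumerate(arguments):
        for second in arguments[i + 1:]:
            same_string = first == second
            same_set = set(first.split("|")) == set(second.split("|"))
            if same_set and not same_string:
                pairs.append((first, second))
    return pairs

query_arguments = ["ADVP", "HV*|MD|VB*", "MD|HV*|VB*", "NP-SBJ*"]
flagged = near_duplicates(query_arguments)
```

Here flagged contains the single pair ("HV*|MD|VB*", "MD|HV*|VB*"), the two lists that differ only in the order of their members.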

Prefix indices

Certain queries require not triggering "same instance". For instance, if you are searching for two sister noun phrases, the following query:
(NP* hasSister NP*)
will not give the desired result. In fact, it will not generate any hits at all because "hasSister" is not defined as a reflexive relation (in other words, CorpusSearch doesn't treat a node as its own sister).

When it is necessary to avoid "same instance" interpretations, the intended reference of arguments needs to be explicitly specified with prefix indices, as in the following query:

([1]NP* hasSister [2]NP*)
Prefix indices are enclosed by square brackets ([ ]). Arguments with the same index refer to the same node (in other words, they explicitly trigger "same instance"). Arguments with distinct indices are forced to refer to distinct nodes.

As with "same instance", labels that are intended to refer to distinct nodes in the tree must be character-by-character identical (except for the prefix index).

Here is a slightly more complex example of a query with prefix indices:

    ([1]NP* hasSister [2]NP*) 
AND ([1]NP* iDoms [3]PRO)
AND ([2]NP* iDoms [4]PRO)
This query finds sentences with two sister NPs, each of which immediately dominates (its own) PRO.
/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/

/*
1 IP-MAT: 7 NP-SBJ-1, 12 NP-OB2, 8 PRO it, 13 PRO him
*/

(0 (1 IP-MAT (2 CONJ And)
             (4 ADVP-LOC (5 ADV +tere))
             (7 NP-SBJ-1 (8 PRO it))
             (10 VBD lykede)
             (12 NP-OB2 (13 PRO him))
             (15 IP-INF-1 (16 TO to)
                          (18 VB suffre)
                          (20 NP-OB1 (21 Q many)
                                     (23 NS repreuynges)
                                     (25 CONJP (26 CONJ and)
                                               (28 NX (29 NS scornes))))
                          (31 PP (32 P for)
                                 (34 NP (35 PRO vs)))))
      (36 ID CMMANDEV,1.4))
If the prefix indices on the two instances of PRO were omitted, the query would return no hits, since it would be requiring one and the same PRO node to be the child of the two distinct NP nodes.
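
The effect of distinct prefix indices can be mimicked in Python by requiring the matched nodes to be different objects. This is a toy model under stated assumptions: trees are nested (label, children) tuples without node numbers, "NP*" is modeled as the prefix "NP", and the function names are hypothetical.

```python
from itertools import combinations

def nodes(tree):
    """Yield every non-leaf node of a (label, children) tree."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from nodes(child)

def has_pro_child(node):
    """True if the node immediately dominates a node labeled PRO."""
    return any(isinstance(c, tuple) and c[0] == "PRO" for c in node[1])

def sister_nps_with_pro(tree):
    """Label pairs of two distinct sister NPs, each dominating its own PRO."""
    hits = []
    for parent in nodes(tree):
        nps = [c for c in parent[1]
               if isinstance(c, tuple) and c[0].startswith("NP")]
        # combinations() pairs each NP only with other NPs, never with itself.
        for np1, np2 in combinations(nps, 2):
            if has_pro_child(np1) and has_pro_child(np2):
                hits.append((np1[0], np2[0]))
    return hits

# Simplified version of the example sentence above.
TREE = ("IP-MAT", [("CONJ", ["And"]),
                   ("ADVP-LOC", [("ADV", ["+tere"])]),
                   ("NP-SBJ-1", [("PRO", ["it"])]),
                   ("VBD", ["lykede"]),
                   ("NP-OB2", [("PRO", ["him"])])])

hits = sister_nps_with_pro(TREE)
```

In this sketch hits contains the pair ("NP-SBJ-1", "NP-OB2"). A tree with only one NP yields no hits, because a node is never paired with itself.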

Here's a final example:

query: (IP-SMC iDoms [1]NP*) AND ([1]NP* iDoms [3]\**)
   AND (IP-SMC iDoms [2]NP*) AND ([2]NP* iDoms [4]\**)
This query searches for an IP-SMC node which immediately dominates two distinct NP* nodes, each immediately dominating (its own) trace. The two mentions of "IP-SMC" refer to the same node in the tree (because "same instance" is enforced by default). But "[1]NP*" and "[2]NP*" refer to different nodes (because of the distinct prefix indices), and the same is true of the traces "[3]\**" and "[4]\**".

The above query finds sentences like this one:

/~*
+After +t+am L+acedemonie gecuron him to ladteowe, Ircclidis w+as haten,
(OR4,1.53.30.12)
*~/

/*
23 IP-SMC: 34 NP-NOM, 34 NP-NOM, 35 *-2
23 IP-SMC: 36 NP-NOM-PRD, 36 NP-NOM-PRD, 37 *ICH*-1
*/

(0 (1 CODE )
   ...
   (23 IP-MAT-PRN (24 NP-NOM-1 *pro*)
                  (26 NP-NOM-2 (27 NPR^N Ircclidis))
                  (29 BEDI w+as)
                  (31 VBN haten)
                  (33 IP-SMC (34 NP-NOM *-1)
                             (36 NP-NOM-PRD *ICH*-2)))
                  (38 . ,))
  (40 ID OR4,1.53.30.12))