(NP-PRD iDominates N)
Here is a sentence that matches the query.
/~* For he 's a jolly good fellow. *~/
/* 1 IP-MAT: 9 NP-PRD, 16 N */
(0 (1 IP-MAT (2 CONJ For) (4 NP-SBJ (5 PRO he)) (7 BEP 's) (9 NP-PRD (10 D a) (12 ADJ jolly) (14 ADJ good) (16 N fellow))) (18 . .))
In addition to "iDominates", CorpusSearch has many other search functions, which are described in detail in their own section.
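To make the relation concrete, the following Python sketch re-implements immediate domination over a simplified copy of the tree above. It is not CorpusSearch code: the helpers parse and i_dominates are hypothetical, the node numbers and final punctuation are dropped, and labels are compared with Python's fnmatch module rather than with CorpusSearch's own matcher.

```python
from fnmatch import fnmatchcase

def parse(text):
    """Parse a bracketed tree into (label, children) tuples; words stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                              # skip ")"
        return (label, children)

    return node()

def i_dominates(tree, parent, child):
    """Return (parent label, child label) for every immediate-domination hit."""
    label, children = tree
    hits = []
    for c in children:
        if isinstance(c, tuple):
            if fnmatchcase(label, parent) and fnmatchcase(c[0], child):
                hits.append((label, c[0]))
            hits.extend(i_dominates(c, parent, child))
    return hits

# Simplified version of the example tree (assumption: numbering removed).
TREE = parse("(IP-MAT (CONJ For) (NP-SBJ (PRO he)) (BEP 's) "
             "(NP-PRD (D a) (ADJ jolly) (ADJ good) (N fellow)))")

print(i_dominates(TREE, "NP-PRD", "N"))
print(i_dominates(TREE, "NP-SBJ", "N"))
```

The first call reports one hit, because N is a child of NP-PRD; the second reports none, because the subject dominates only a pronoun.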
Two adjacent pipes (||), whether in a command file or a definition file, will abort any search.
Any number of arguments to a search function may be linked together into a list using "|" (argument "or" or "pipe"). For instance,
(BE*|DO*|HV*|MD|VB* iPrecedes NP-SBJ*)
means "BE* or DO* or HV* or MD or VB* immediately precedes NP-SBJ*" and will find sentences like this:
/~* +Tan was pompe & pryde cast down (CMKEMPE,2.12) *~/
/* 1 IP-MAT: 5 BED was, 7 NP-SBJ */
(0 (1 IP-MAT (2 ADVP-TMP (3 ADV +Tan)) (5 BED was) (7 NP-SBJ (8 N pompe) (10 CONJ &) (12 N pryde)) (14 VAN cast) (16 RP down)) (18 ID CMKEMPE,2.12))
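An argument prefixed with "!" (argument "not") matches any node other than the one described. For instance,
(NP-SBJ* iDoms !PRO*)
finds subjects that immediately dominate some node other than a pronoun, as in this sentence:
/~* a runde fot & +ticke bi-come+t an hors wel. (CMHORSES,87.17) *~/
/* 1 IP-MAT: 2 NP-SBJ, 15 ADJ +ticke */
(0 (1 IP-MAT (2 NP-SBJ (3 D a) (5 ADJP (6 ADJ runde) (8 CONJP *ICH*-1)) (10 N fot) (12 CONJP-1 (13 CONJ &)) (15 ADJ +ticke)) (17 VBP bi-come+t) (19 NP-OB1 (20 D an) (22 N hors)) (24 ADVP (25 ADV wel)) (27 E_S .)) (29 ID CMHORSES,87.17))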
"!" cannot be attached to individual members of a list, so the following query is ill-formed:
(NP-SBJ* iDoms N|!D|!ADJ)
Instead, "!" must precede the list, and it is interpreted as negating the entire list (not just the first member). So the following query:
(NP-SBJ* iDoms !N|D|ADJ)
means "NP-SBJ* immediately dominates some node other than N, D, or ADJ", and it finds sentences with bare pronoun subjects like the following, which are not intended by the ill-formed query:
/~* and he made them grete chere out of mesure (CMMALORY-M4,2.13) *~/
/* 1 IP-MAT: 4 NP-SBJ, 5 PRO */
(0 (1 IP-MAT (2 CONJ and) (4 NP-SBJ (5 PRO he)) (7 VBD made) (9 NP-DTV (10 PRO them)) (12 NP-ACC (13 ADJ grete) (15 N chere)) (17 ADVP (18 ADV out) (20 PP (21 P of) (23 NP (24 N mesure))))) (26 ID CMMALORY-M4,2.13))
The search intended by the ill-formed query can be expressed in CorpusSearch by splitting the single condition into two conditions combined with AND:
(NP-SBJ* iDoms N) AND (NP-SBJ* iDoms !D|ADJ)
A single search function may contain only one "!", so the following query is ill-formed:
(!NP-SBJ* iPrecedes !VBD)
Queries of the following form are fine.
(NP-SBJ* iDoms !PRO*) AND (NP-SBJ* iPrecedes !VBD)
It's true that the entire query contains more than one "!" operator, but each individual search function (iDoms, iPrecedes) contains only one.
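The whole-list reading of "!" can be illustrated with a small Python sketch. It is a model of the rules stated above, not CorpusSearch's implementation: parse_argument, matches and i_doms are hypothetical names, label matching is delegated to Python's fnmatch, and a subject is reduced to the list of labels of its immediate children.

```python
from fnmatch import fnmatchcase

def parse_argument(argument):
    """Split an argument into (negated, alternatives), rejecting ill-formed lists."""
    negated = argument.startswith("!")
    body = argument[1:] if negated else argument
    alternatives = body.split("|")
    if any(alt == "" for alt in alternatives):
        raise ValueError("empty alternative (adjacent pipes?)")
    if any(alt.startswith("!") for alt in alternatives):
        raise ValueError('"!" may only precede the whole list')
    return negated, alternatives

def matches(label, argument):
    """A leading "!" negates the entire list, not just its first member."""
    negated, alternatives = parse_argument(argument)
    hit = any(fnmatchcase(label, alt) for alt in alternatives)
    return hit != negated

def i_doms(child_labels, argument):
    """True if some immediate child satisfies the argument."""
    return any(matches(label, argument) for label in child_labels)

print(i_doms(["PRO"], "!N|D|ADJ"))            # bare pronoun subject
print(i_doms(["D", "ADJ", "N"], "!N|D|ADJ"))  # only N, D and ADJ children
```

Under these assumptions the bare pronoun subject satisfies "!N|D|ADJ", while a subject whose children are all N, D or ADJ does not. The ill-formed list "N|!D|!ADJ" raises an error in the sketch instead of being silently reinterpreted.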
For instance, suppose you're looking for sentences with a subject that precedes the object, and where neither the subject nor the object is a pronoun. The following query:
(NP-SBJ* precedes NP-OB1*) AND (NP-SBJ* iDoms ![1]PRO*) AND (NP-OB1* iDoms ![2]PRO*)
will find sentences like this one:
/~* & +tat schal be a good hors. (CMHORSES,85.9) *~/
/* 1 IP-MAT: 4 NP-SBJ, 11 NP-OB1, 5 D +tat, 16 N hors */
(0 (1 IP-MAT (2 CONJ &) (4 NP-SBJ (5 D +tat)) (7 MD schal) (9 BE be) (11 NP-OB1 (12 D a) (14 ADJ good) (16 N hors)) (18 E_S .)) (20 ID CMHORSES,85.9))
node: IP*
query: (NP-TMP* iDoms ADV*) AND (TO iPrecedes VB)
finds this sentence:
/* 4 IP-INF-SBJ: 5 NP-TMP, 6 ADV+NS, 8 TO, 10 VB */
(0 (1 IP-MAT (2 CONJ but) (4 IP-INF-SBJ (5 NP-TMP (6 ADV+NS oftymes)) (8 TO to) (10 VB rede) (12 NP-OB1 (13 PRO it))) (15 MD shal) (17 VB cause) (19 NP-OB1 (20 PRO it)) (22 IP-INF (23 ADVP (24 ADV wel)) (26 TO to) (28 BE be) (29 VAN vnderstande)) (31 E_S /)) (33 ID CMREYNAR,6.10))
OR is not yet fully tested and is likely to yield unexpected results. Its functionality can normally be achieved with argument "or". When argument "or" is not sufficient, coding queries will achieve the desired effect.
OR encodes inclusive disjunction. "(FOO) OR (BAR)" returns all subtrees
rooted in an instance of the query's selected node boundary with either
the property "FOO" or the property "BAR" or both. "FOO" and "BAR"
may consist of single search functions or may themselves be built up out
of conjunctions, disjunctions and negations of simple search functions.
NOT (deprecated)
NOT is not yet fully tested. It does not yet work correctly in any but the simplest cases and should be avoided except for testing purposes. Its functionality can be achieved with coding queries, notably via the ELSE function.
NOT returns trees rooted in the node boundary that do not contain the described structure. It differs from argument "not" in that none of the arguments need to appear in the domain defined by the boundary node.
For instance,
NOT(NP* iPrecedes VB*)
returns trees that do not contain the structure "(NP* iPrecedes VB*)", including those that contain neither NP* nor VB*.
By contrast,
(NP* iPrecedes !VB*)
returns trees that must contain NP* (namely ones that iPrecede any node except VB*), and
(!NP* iPrecedes VB*)
returns trees that must contain VB* (namely ones that are iPreceded by any node except NP*).
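The difference between NOT and argument "not" can be sketched in Python. This is a deliberately reduced model and not CorpusSearch's algorithm: a "tree" is flattened to a list of sister labels, i_precedes and query_not are hypothetical helpers, and labels are matched with Python's fnmatch.

```python
from fnmatch import fnmatchcase

def i_precedes(labels, first, second):
    """True if some adjacent pair of labels satisfies both arguments.
    A leading "!" on an argument is read as argument "not"."""
    def ok(label, argument):
        if argument.startswith("!"):
            return not fnmatchcase(label, argument[1:])
        return fnmatchcase(label, argument)
    return any(ok(a, first) and ok(b, second)
               for a, b in zip(labels, labels[1:]))

def query_not(labels, first, second):
    """NOT(...): true whenever the described structure is absent."""
    return not i_precedes(labels, first, second)

np_verb = ["NP-SBJ", "VBD", "NP-OB1"]   # NP* immediately precedes VB*
np_modal = ["NP-SBJ", "MD", "VB"]       # NP* and VB* present, not adjacent
neither = ["CONJ", "ADVP"]              # no NP*, no VB*

for clause in (np_verb, np_modal, neither):
    print(clause,
          query_not(clause, "NP*", "VB*"),
          i_precedes(clause, "NP*", "!VB*"),
          i_precedes(clause, "!NP*", "VB*"))
```

In this model only NOT accepts the clause that contains neither NP* nor VB*; both argument "not" queries reject it, because each still requires its non-negated argument to be present.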
Expressions enclosed in square brackets specify alternatives. For instance, "[xyz]" stands for a single character that is either an "x", a "y", or a "z". (Equivalently, one could search for "x|y|z". Similarly, "BE[DP]" and "BED|BEP" amount to the same.)
CorpusSearch doesn't support ranges such as "[A-Z]", "[a-z]", and "[0-9]".
These must be expressed as complete lists of alternatives (for instance,
"[0123456789]").
Period
The wildcard character "." (period) matches any letter or digit.
Asterisk
The wildcard character "*" (asterisk) matches any string, including the
null string. "*" may be used anywhere in the string specifying the
function argument - beginning, middle or end. For instance, "CP*" matches
any label beginning with the letters "CP" ("CP", "CP-ADV", "CP-QUE-SPE",
etc.). "*-SPE" matches any label ending in "-SPE", and "*hersum*" matches
any string containing the substring "hersum" ("hersum" itself,
"hersumnesse", "unhersumnesse", etc.).
"NP*" matches all types of noun phrases ("NP-SBJ", "NP-OB1", etc.), but also "NPR" and its variants. The latter matches are not usually intended. To avoid this problem, either add a hyphen ("NP-*") or make reference to appropriate definitions from a definition file. A more radical solution is to replace "NPR" by a non-interfering label like "NR" via a revision query or a sed script.
As mentioned above, regular expression support in Java is not perfect. In particular, after square brackets, ".*" and "*" behave erratically.
gi[uv]* ← matches "giu" and "giv", but not "giue", "giuen", ... "give", "given", ... (the asterisk is unexpectedly inert)
gi[uv].* ← matches "giue", "giuen", ... "give", "given", ... but not "giu" or "giv" (the asterisk unexpectedly disallows zero instances of the wildcard period)
In general, ".*" after a square bracket gives the desired result, but refuses to match zero instances of the wildcard character ".", contrary to expectation. If those matches are important, searches need to include disjunctions along the following lines (here, the first disjunct doesn't contain an asterisk, since including it is both unnecessary and confusing):
gi[uv]|gi[uv].* ← matches "giu", "giue", "giuen", ... "giv", "give", "given", ...
(NP* dominates *con*)
finds noun phrases containing words like "contradiction" or "unconstitutional" or "Rubicon". If the search is intended to find noun phrases with subjects elided under conjunction ("*con*"), where the asterisks are part of the text, the asterisks need to be escaped:
(NP* dominates \*con\*)
It is also possible to search for literal asterisks by enclosing them in square brackets.
(NP* dominates [*]con[*])
Escaped and unescaped asterisks can be - and very often are - combined in a single search. For instance, here are two alternative ways to search for traces of the form "*T*-1", "*T*-2", etc. The two queries are not completely equivalent, as indicated in the example.
(NP* dominates \*T*) ← matches *T*-12, *T34, *T*; doesn't match T*, T-5
(NP* dominates \*T\*-*) ← matches *T*-12; doesn't match *T34, *T*, T*, T-5
Empty categories in general (which always begin with an asterisk in the PPCHE) can be searched for with "\**".
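The interplay of literal and wildcard asterisks can be checked with a short Python sketch. It models only the three notations discussed here (wildcard "*", escaped "\*" and bracketed "[*]") using Python's re module. The helpers star_regex and matches are hypothetical, and CorpusSearch's own matcher may differ in other respects.

```python
import re

def star_regex(argument):
    """Translate "*", "\\*" and "[*]" into a regex; all other text is literal."""
    pieces = re.split(r"(\\\*|\[\*\]|\*)", argument)
    out = []
    for piece in pieces:
        if piece == "*":                  # wildcard: any string
            out.append(".*")
        elif piece in ("\\*", "[*]"):     # literal asterisk
            out.append(r"\*")
        else:
            out.append(re.escape(piece))
    return re.compile("".join(out))

def matches(argument, text):
    """True if the whole text matches the argument."""
    return star_regex(argument).fullmatch(text) is not None

candidates = ["*T*-12", "*T34", "*T*", "T*", "T-5"]
for argument in (r"\*T*", r"\*T\*-*"):
    print(argument, [c for c in candidates if matches(argument, c)])
```

Under these assumptions the two queries select the same strings as listed in the example above: the first accepts "*T*-12", "*T34" and "*T*", the second only "*T*-12".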
If you fail to find literal periods when you expect to, recall that period is on CorpusSearch's ignore_nodes list by default. So you will have to explicitly "unignore" it by editing the "ignore_nodes" command, either in your query file or by changing the default setting in your preference file.
To find tokens containing integers, the integer needs to be escaped.
query: (1929 exists) ← CRASH !
query: (\1929 exists)
(ADVP precedes HV*|MD|VB*) AND (HV*|MD|VB* precedes NP-SBJ*)
matches tokens with an adverb preceding a modal, and that same modal preceding a subject.
"Same instance" is triggered by strings that are identical character by character. Strings that are not character-by-character identical do not trigger "same instance", even if the strings are extensionally equivalent. For instance, the following query:
(ADVP precedes HV*|MD|VB*) ← HV* < MD
AND (MD|HV*|VB* precedes NP-SBJ*) ← MD < HV*
matches tokens with an adverb preceding a modal in one clause and a form of HAVE preceding a subject in some other clause. (It also matches the tokens returned by the earlier query.) The reason that "same instance" isn't enforced in the second query is that "HV*" and "MD" appear in different orders in the two clauses of the query.
When search expressions become complex, it rapidly becomes difficult to
tell whether they are character-by-character identical and whether they
will trigger "same instance" or not. This is one of the many reasons to
use definition files.
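Because the rule is a plain string comparison, a few lines of Python are enough to check two arguments before running a query. The sketch only restates the rule given above. The function names are hypothetical, and canonical is a convenience one might use while writing queries, not a CorpusSearch feature.

```python
def triggers_same_instance(arg_a, arg_b):
    """ "Same instance" requires character-by-character identity."""
    return arg_a == arg_b

def same_alternatives(arg_a, arg_b):
    """Extensional equivalence: the same set of alternatives in any order."""
    return set(arg_a.split("|")) == set(arg_b.split("|"))

def canonical(argument):
    """Sort the alternatives so that equivalent lists are written identically."""
    return "|".join(sorted(argument.split("|")))

first, second = "HV*|MD|VB*", "MD|HV*|VB*"
print(triggers_same_instance(first, second))   # False: the order differs
print(same_alternatives(first, second))        # True: same alternatives
print(triggers_same_instance(canonical(first), canonical(second)))  # True
```

Writing every list in one fixed order, or naming it once in a definition file, removes this source of accidental mismatches.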
Prefix indices
Certain queries require not triggering "same instance". For
instance, if you are searching for two sister noun phrases, the
following query:
(NP* hasSister NP*)
will not give the desired result. In fact, it will not generate any hits
at all because "hasSister" is not defined as a reflexive relation
(in other words, CorpusSearch doesn't treat a node as its own sister).
When it is necessary to avoid "same instance" interpretations, the
intended reference of arguments needs to be explicitly specified with
prefix indices, as in the following query:
([1]NP* hasSister [2]NP*)
Prefix indices are enclosed by square brackets ([ ]). Arguments with the
same index refer to the same node (in other words, they explicitly trigger
"same instance"). Arguments with distinct indices are forced to refer to
distinct nodes.
As with "same instance", labels that are intended to refer to distinct
nodes in the tree must be character-by-character identical (except for
the prefix index).
Here is a slightly more complex example of a query with prefix indices:
([1]NP* hasSister [2]NP*)
AND ([1]NP* iDoms [3]PRO)
AND ([2]NP* iDoms [4]PRO)
This query finds sentences with two sister NPs, each of which
immediately dominates (its own) PRO.
If the prefix indices on the two instances of PRO were omitted, the query
would return no hits, since it would be requiring one and the same PRO
node to be the child of the two distinct NP nodes.
The above query finds sentences like this one:
/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/
/*
1 IP-MAT: 7 NP-SBJ-1, 12 NP-OB2, 8 PRO it, 13 PRO him
*/
(0 (1 IP-MAT (2 CONJ And)
(4 ADVP-LOC (5 ADV +tere))
(7 NP-SBJ-1 (8 PRO it))
(10 VBD lykede)
(12 NP-OB2 (13 PRO him))
(15 IP-INF-1 (16 TO to)
(18 VB suffre)
(20 NP-OB1 (21 Q many)
(23 NS repreuynges)
(25 CONJP (26 CONJ and)
(28 NX (29 NS scornes))))
(31 PP (32 P for)
(34 NP (35 PRO vs)))))
(36 ID CMMANDEV,1.4))
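The effect of prefix indices on "hasSister" can be mimicked in Python on a cut-down copy of the tree above. This is a sketch under stated assumptions, not CorpusSearch's search procedure: trees are (label, children) tuples without words or node numbers, sister_hits is a hypothetical helper, and the distinct flag stands in for writing "[1]NP*" and "[2]NP*" instead of "NP*" twice.

```python
from fnmatch import fnmatchcase

# Top-level constituents of the example sentence (assumption: simplified).
TREE = ("IP-MAT", [
    ("CONJ", []),
    ("ADVP-LOC", [("ADV", [])]),
    ("NP-SBJ-1", [("PRO", [])]),
    ("VBD", []),
    ("NP-OB2", [("PRO", [])]),
])

def sister_hits(node, pattern, distinct):
    """Pairs (a, b) of nodes matching pattern with a hasSister b.

    distinct=False models "same instance": both arguments name one node.
    distinct=True models distinct prefix indices: two different nodes.
    """
    label, children = node
    hits = []
    matching = [c for c in children if fnmatchcase(c[0], pattern)]
    for a in matching:
        for b in matching:
            same_node = a is b
            if same_node == distinct:     # violates the index constraint
                continue
            if not same_node:             # hasSister is not reflexive
                hits.append((a[0], b[0]))
    for c in children:
        hits.extend(sister_hits(c, pattern, distinct))
    return hits

print(sister_hits(TREE, "NP*", distinct=False))
print(sister_hits(TREE, "NP*", distinct=True))
```

Without distinct indices the sketch returns no hits, since a node is never its own sister. With them it pairs NP-SBJ-1 with NP-OB2. The sketch lists the pair in both orders, whereas CorpusSearch's output format reports hits differently.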
Here's a final example:
query: (IP-SMC iDoms [1]NP*) AND ([1]NP* iDoms [3]\**)
AND (IP-SMC iDoms [2]NP*) AND ([2]NP* iDoms [4]\**)
This query searches for an IP-SMC node which immediately dominates two
distinct NP* nodes, each immediately dominating (its own) trace. The two
mentions of "IP-SMC" refer to the same node in the tree (because "same
instance" is enforced by default). But "[1]NP*" and "[2]NP*" refer to
different nodes (because of the distinct prefix indices), and the same is
true of the traces "[3]\**" and "[4]\**".
/~*
+After +t+am L+acedemonie gecuron him to ladteowe, Ircclidis w+as haten,
(OR4,1.53.30.12)
*~/
/*
23 IP-SMC: 34 NP-NOM, 34 NP-NOM, 35 *-2
23 IP-SMC: 36 NP-NOM-PRD, 36 NP-NOM-PRD, 37 *ICH*-1
*/
(0 (1 CODE