Because CorpusSearch expects its input to be in Penn Treebank format by default, POS-tagged files must be specially flagged. This is done by including a format flag as the first line of the POS-tagged file, as illustrated below:
#!FORMAT=POS_1
This first line may be followed by philological or other header information. The beginning of the text to be searched is indicated by the following delimiter:
<text>
Thereafter, the file must contain word/tag pairs formatted according to standard conventions. Space functions as the delimiter between word/tag pairs, so neither words nor tags can themselves contain spaces. The delimiter between a word and its tag is "/" (forward slash).
The delimiter between sentences is a blank line. In addition, every sentence must end with a sentence-final punctuation tag. By default, this tag is "." (period). Alternatively, the sentence-final punctuation tag may be "PONFP". The PONFP alternative must be indicated in the file-initial format line by replacing the default "POS_1" format with "POS_0". The terminal node ("word") associated with the sentence-final punctuation tag need not be a period. See the template file for examples.
Sentence-internal punctuation is also treated as a word/tag pair, but the tag must differ from the sentence-final punctuation tag, as illustrated in the following examples, which distinguish "," (comma) and "." (period) or PON and PONFP.
After/P the/D party/N ,/, we/PRO all/Q went/VBD back/ADV ./.
Après/P la/D fête/N ,/PON nous/PRO sommes/BEP tous/Q retornés/VBN ./PONFP
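The word/tag conventions described above can be checked mechanically. The following is a minimal Python sketch, not part of CorpusSearch: the function name parse_sentence is hypothetical, and splitting each token on its last slash is an assumption about how a word that itself contains "/" should be handled.

```python
def parse_sentence(line, final_tag="."):
    """Split one sentence into (word, tag) pairs and check its punctuation tags.

    final_tag is "." for POS_1 files and "PONFP" for POS_0 files.
    """
    pairs = []
    for token in line.split():  # space delimits word/tag pairs
        # Assumption: split on the LAST slash, so "./." gives (".", ".").
        word, sep, tag = token.rpartition("/")
        if not (sep and word and tag):
            raise ValueError(f"not a word/tag pair: {token!r}")
        pairs.append((word, tag))
    if not pairs or pairs[-1][1] != final_tag:
        raise ValueError(f"sentence does not end with tag {final_tag!r}")
    # Sentence-internal punctuation must not reuse the sentence-final tag.
    if any(tag == final_tag for _, tag in pairs[:-1]):
        raise ValueError(f"tag {final_tag!r} used sentence-internally")
    return pairs


english = parse_sentence(
    "After/P the/D party/N ,/, we/PRO all/Q went/VBD back/ADV ./."
)
french = parse_sentence(
    "Après/P la/D fête/N ,/PON nous/PRO sommes/BEP tous/Q retornés/VBN ./PONFP",
    final_tag="PONFP",
)
```

Passing the French example without final_tag="PONFP" raises a ValueError, which mirrors the requirement that the PONFP alternative be declared in the format line.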
Finally, the end of the text is indicated by the closing counterpart of the opening <text> delimiter:
</text>
#!FORMAT=POS_0
This file contains the best edition of the "Chanson de Roland", with emendations by Christiane Marchello-Nizia as noted.
<text>
word1/tag1 word2/tag2 word3/tag3 .... ./PONFP

word1/tag1 word2/tag2 word3/tag3 .... ?/PONFP

word1/tag1 word2/tag2 word3/tag3 .... ,/PONFP
.
.
.
</text>
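The overall file layout (format line, optional header, <text> ... </text>, blank-line-separated sentences) can be read with a few lines of Python. This is an illustrative sketch under stated assumptions, not CorpusSearch's own reader: read_pos_file is a hypothetical name, the delimiters are assumed to stand alone on their lines, and the sample string is made-up placeholder data in the style of the template.

```python
def read_pos_file(text):
    """Return (final_tag, header_lines, sentences) for a POS-tagged file.

    Each sentence is returned as a list of raw "word/tag" tokens.
    """
    final_tags = {"#!FORMAT=POS_1": ".", "#!FORMAT=POS_0": "PONFP"}
    lines = [ln.strip() for ln in text.splitlines()]
    flag = lines[0] if lines else ""
    if flag not in final_tags:
        raise ValueError(f"unrecognized format line: {flag!r}")
    # list.index raises ValueError if either delimiter is missing.
    start = lines.index("<text>")
    end = lines.index("</text>", start)
    header = [ln for ln in lines[1:start] if ln]

    sentences, current = [], []
    for ln in lines[start + 1:end]:
        if ln:
            current.extend(ln.split())
        elif current:  # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return final_tags[flag], header, sentences


sample = """#!FORMAT=POS_0
Header information goes here.
<text>
word1/tag1 word2/tag2 ./PONFP

word1/tag1 word2/tag2 ?/PONFP
</text>
"""
final_tag, header, sentences = read_pos_file(sample)
```

The returned final_tag tells a caller which sentence-final tag to expect when validating each sentence.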
query: (MD0 exists)
finds this sentence:
/~*
I shal not conne wel goo thyder ./.
(ID CMREYNAR,14.261)
*~/
/*
4 MD0 conne
*/
( (PRO I) (MD shal) (NEG not) (MD0 conne) (ADV wel) (VB goo) (ADV thyder) (ID CMREYNAR,14.261) )
query: (PRO iDominates he) AND (FP iDominates ane)
finds this sentence:
/~*
Sythen he ledes +tam by +tar ane,
(CMROLLEP,118.978)
*~/
/*
2 PRO he, 7 FP ane
*/
( (ADV Sythen) (PRO he) (VBP ledes) (PRO +tam) (P by) (PRO$ +tar) (FP ane) (. ,) (CMROLLEP,118.978) )
query: (as iPrecedes sone) AND (sone iPrecedes P)
finds this sentence:
/~*
and as sone as he myght he toke his horse .
(CMMALORY,206.3401)
*~/
/*
2 as, 3 sone, 4 P as
*/
( (CONJ and) (ADVR as) (ADV sone) (P as) (PRO he) (MD myght) (PRO he) (VBD toke) (PRO$ his) (N horse) (. .) (CMMALORY,206.3401) )
query: (whoreson neighborhood 2 wilt)
returns all tokens in the corpus in which the word "whoreson" is separated from the word "wilt" by at most two words, as in the following sentence:
/~*
why thou whoreson when wilt thou be maried?
(DELONEY,79.296)
*~/
/*
3 whoreson, 5 wilt
*/
( (WADV why) (PRO thou) (N whoreson) (WADV when) (MD wilt) (PRO thou) (BE be) (VAN maried) (. ?) (ID DELONEY,79.296) )
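The "at most two words" condition can be made concrete with a short Python sketch. It only illustrates the distance check on a flat list of words and is not CorpusSearch's implementation. The function name in_neighborhood is hypothetical, and two points are assumptions: that the two words may occur in either order, and that every terminal, punctuation included, counts as a word.

```python
def in_neighborhood(words, a, b, n):
    """True if some occurrence of a and some occurrence of b
    have at most n words between them (in either order)."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    # k intervening words means the positions differ by k + 1.
    return any(0 < abs(i - j) <= n + 1 for i in pos_a for j in pos_b)


words = ["why", "thou", "whoreson", "when", "wilt",
         "thou", "be", "maried", "?"]
hit = in_neighborhood(words, "whoreson", "wilt", 2)
miss = in_neighborhood(words, "whoreson", "wilt", 0)
```

In the sentence above, only "when" stands between "whoreson" and "wilt", so the check succeeds for a neighborhood of 2 and fails for a neighborhood of 0.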
Example query:
query: (VB precedes N)
Example output:
/~*
thenne have ye cause to make myghty werre upon hym.
(CMMALORY,2.25)
*~/
/*
6 VB make, 8 N werre
*/
( (ADV thenne) (HV have) (PRO ye) (N cause) (TO to) (VB make) (ADJ myghty) (N werre) (P upon) (PRO hym) (. .) (ID CMMALORY,2.25))
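For flat word/tag sequences like the ones in these examples, linear precedence can be sketched in Python as follows. This is a simplified illustration, not CorpusSearch's search function, which operates on trees. The names matches and precedes are hypothetical, and treating a query argument as matching either the word or the tag is an assumption based on the queries shown above.

```python
def matches(pair, label):
    """A label matches a (word, tag) pair if it equals the word or the tag."""
    word, tag = pair
    return label in (word, tag)


def precedes(pairs, x, y, immediate=False):
    """Return 1-based position pairs (i, j) where x comes before y.

    With immediate=True, y must directly follow x (as in iPrecedes).
    """
    hits = []
    for i, p in enumerate(pairs, start=1):
        if not matches(p, x):
            continue
        for j, q in enumerate(pairs, start=1):
            ok = (j == i + 1) if immediate else (j > i)
            if ok and matches(q, y):
                hits.append((i, j))
    return hits


sentence = [("thenne", "ADV"), ("have", "HV"), ("ye", "PRO"), ("cause", "N"),
            ("to", "TO"), ("make", "VB"), ("myghty", "ADJ"), ("werre", "N"),
            ("upon", "P"), ("hym", "PRO"), (".", ".")]
result = precedes(sentence, "VB", "N")
```

For this flat sentence, the 1-based positions returned by the sketch agree with the numbers shown in the example output (6 for "make", 8 for "werre"). The earlier N "cause" at position 4 is not reported because it comes before the VB.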