Searching POS-tagged corpora

Formatting conventions for POS-tagged corpora
- Template file
Search functions
- exists
- iDominates
- iPrecedes
- neighborhood
- precedes

Formatting conventions for POS-tagged corpora

CorpusSearch can search files that are POS-tagged but not further annotated for syntactic structure. In this section, the special formatting conventions for POS-tagged files are first discussed and then illustrated in a template file.

Because CorpusSearch expects its input to be in Penn Treebank format by default, POS-tagged files must be specially flagged. This is done by including a format flag as the first line of the POS-tagged file, as illustrated below:

#!FORMAT=POS_1

This first line may be followed by philological or other header information. The beginning of the text to be searched is indicated by the following delimiter:

<text>

Thereafter, the file must contain word/tag pairs formatted according to standard conventions. Space functions as the delimiter between word/tag pairs, so neither words nor tags can themselves contain spaces. The delimiter between a word and its tag is "/" (forward slash).

The delimiter between sentences is a blank line. In addition, every sentence must end with a sentence-final punctuation tag. By default, this tag is "." (period). Alternatively, the sentence-final punctuation tag may be "PONFP". The PONFP alternative must be indicated in the file-initial format line by replacing the default "POS_1" format with "POS_0". The terminal node ("word") associated with the sentence-final punctuation tag need not be a period. See the template file for examples.

Sentence-internal punctuation is also treated as a word/tag pair, but its tag must differ from the sentence-final punctuation tag. The following examples illustrate this: the first (POS_1 format) distinguishes "," (comma) from "." (period), and the second (POS_0 format) distinguishes PON from PONFP.

After/P the/D party/N ,/, we/PRO all/Q went/VBD back/ADV ./.

Après/P la/D fête/N ,/PON nous/PRO sommes/BEP tous/Q retournés/VBN ./PONFP

Finally, the end of the text is indicated by the closing counterpart of the opening <text> delimiter:

</text>
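
The conventions above can be sketched as a small reader. The following Python is only an illustration and is not part of CorpusSearch: the function name parse_pos_file is hypothetical, and splitting each pair at its last "/" is an assumption about how a word that itself contains a slash would be handled.

```python
def parse_pos_file(text):
    """Return (format, sentences); each sentence is a list of (word, tag)."""
    lines = [line.strip() for line in text.splitlines()]
    if not lines or not lines[0].startswith("#!FORMAT="):
        raise ValueError("first line must be the #!FORMAT= flag")
    fmt = lines[0][len("#!FORMAT="):]
    # POS_1 uses "." as the sentence-final tag; POS_0 uses "PONFP".
    final_tags = {"POS_1": ".", "POS_0": "PONFP"}
    if fmt not in final_tags:
        raise ValueError("unknown format: " + fmt)

    # Header information before <text> is ignored.
    try:
        start = lines.index("<text>")
        end = lines.index("</text>")
    except ValueError:
        raise ValueError("missing <text> or </text> delimiter") from None

    sentences, current = [], []
    # The extra "" flushes a last sentence not followed by a blank line.
    for line in lines[start + 1:end] + [""]:
        if line:
            for token in line.split():
                # Assumption: the LAST slash separates word from tag.
                word, _, tag = token.rpartition("/")
                if not word or not tag:
                    raise ValueError("not a word/tag pair: " + token)
                current.append((word, tag))
        elif current:
            if current[-1][1] != final_tags[fmt]:
                raise ValueError("sentence lacks a sentence-final tag")
            sentences.append(current)
            current = []
    return fmt, sentences


sample = """#!FORMAT=POS_1
Header information may go here.
<text>

After/P the/D party/N ,/, we/PRO all/Q went/VBD back/ADV ./.

</text>
"""
fmt, sentences = parse_pos_file(sample)
print(fmt, sentences[0])
```

The sketch rejects a sentence whose last pair does not carry the sentence-final tag, mirroring the requirement stated above. How CorpusSearch itself reports such errors is not covered here.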

Template file

The following template file illustrates the formatting conventions just discussed, using the POS_0 alternative.

#!FORMAT=POS_0

This file contains the best edition of the "Chanson de Roland",
with emendations by Christiane Marchello-Nizia as noted.

<text>

word1/tag1 word2/tag2 word3/tag3 .... ./PONFP

word1/tag1 word2/tag2 word3/tag3 .... ?/PONFP

word1/tag1 word2/tag2 word3/tag3 .... ,/PONFP

.
.
.

</text>

Search functions

CorpusSearch treats POS-tagged files as containing sentences parsed with a completely flat structure, with every word/tag pair as an immediate daughter of the root node. Query files for POS-tagged corpora are therefore essentially the same as for parsed corpora. In particular:

The node boundary in queries for POS-tagged files is always $ROOT.

The tag for a word is treated as its mother. In other words, a query like "(N iDoms king)" returns sentences containing the word/tag pair "king/N".

Because of the flat structure of a POS-tagged file, many CorpusSearch functions cannot be used. The table of contents for this section lists those that are ordinarily appropriate. Conversely, the neighborhood function works only on POS-tagged files.
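
The flat-structure view can be made concrete with a short Python sketch. It is an illustration only, not CorpusSearch's implementation: the helper names to_flat_tree and idoms are hypothetical, the sample sentence is invented, and exact, case-sensitive matching is an assumption.

```python
def to_flat_tree(pairs):
    """Render (word, tag) pairs as one flat bracketed tree.

    Every pair becomes an immediate daughter of the root, and each tag
    is the mother of its word.
    """
    return "( " + " ".join("(%s %s)" % (tag, word) for word, tag in pairs) + " )"


def idoms(pairs, tag, word):
    """True if some pair has this tag immediately dominating this word."""
    return any(t == tag and w == word for w, t in pairs)


# Invented sentence, corresponding to: the/D king/N sleeps/VBP ./.
sentence = [("the", "D"), ("king", "N"), ("sleeps", "VBP"), (".", ".")]
tree = to_flat_tree(sentence)
print(tree)
print(idoms(sentence, "N", "king"))
```

Under these assumptions, the check idoms(sentence, "N", "king") corresponds to the query "(N iDoms king)" described above.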

exists (variant: Exists)

Exists searches for a POS tag or text anywhere in the sentence. For instance, this query:

query: (MD0 exists)

finds this sentence:

/~*
I shal not conne wel goo thyder ./. (ID CMREYNAR,14.261)
*~/

/*
    4 MD0 conne
*/

( (PRO I) (MD shal) (NEG not) (MD0 conne) (ADV wel) (VB goo) (ADV thyder)
  (ID CMREYNAR,14.261) )

iDominates (variants: idominates, iDoms, idoms)

As mentioned above, POS tags immediately dominate the words of the text. So this query:

query:     (PRO iDominates he)
       AND (FP iDominates ane)

finds this sentence:

/~*
Sythen he ledes +tam by +tar ane,
(CMROLLEP,118.978)
*~/

/*
    2 PRO he, 7 FP ane
*/
( (ADV Sythen) (PRO he) (VBP ledes) (PRO +tam) (P by) (PRO$ +tar) (FP ane) (. ,) 
  (CMROLLEP,118.978) )


iPrecedes (variants: iprecedes, iPres, ipres)

"iPrecedes" is true if and only if its first argument immediately precedes its second argument in the sentence. So this query:

query:     (as iPrecedes sone) 
       AND (sone iPrecedes P)

finds this sentence:

/~*
and as sone as he myght he toke his horse .
(CMMALORY,206.3401)
*~/
/*
2 as, 3 sone, 4 P as
*/

( (CONJ and) (ADVR as) (ADV sone) (P as) (PRO he) (MD myght) (PRO he)
  (VBD toke) (PRO$ his) (N horse) (. .)
  (CMMALORY,206.3401) )
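
The condition that "iPrecedes" tests can be sketched in Python using the sentence above. This is a hedged illustration rather than CorpusSearch's own code: the helper names matches and iprecedes are hypothetical, an argument is assumed to match a pair if it equals either the word or the tag, and each conjunct of the query is evaluated independently, which is a simplification.

```python
def matches(pair, label):
    """Assumption: a query argument matches either the word or the tag."""
    word, tag = pair
    return label == word or label == tag


def iprecedes(pairs, x, y):
    """True if some pair matching x is directly followed by one matching y."""
    return any(matches(a, x) and matches(b, y)
               for a, b in zip(pairs, pairs[1:]))


# The sentence from CMMALORY,206.3401 as (word, tag) pairs.
sentence = [("and", "CONJ"), ("as", "ADVR"), ("sone", "ADV"), ("as", "P"),
            ("he", "PRO"), ("myght", "MD"), ("he", "PRO"), ("toke", "VBD"),
            ("his", "PRO$"), ("horse", "N"), (".", ".")]

hit = iprecedes(sentence, "as", "sone") and iprecedes(sentence, "sone", "P")
print(hit)
```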

neighborhood (variant: Neighborhood)

The "neighborhood" function is available only for POS-tagged files. It takes three arguments (two words or tags, and a number) and searches for sentences in which the two words or tags are separated from each other by at most the specified number of words. For instance, this query:

query: (whoreson neighborhood 2 wilt)

returns all tokens in the corpus in which the word "whoreson" is separated from the word "wilt" by at most two words, as in the following sentence:

/~*
why thou whoreson when wilt thou be maried?
(DELONEY,79.296)
*~/
/*
3 whoreson,  5 wilt
*/

( (WADV why) (PRO thou) (N whoreson) (WADV when) (MD wilt) (PRO thou) (BE be) (VAN maried) (. ?)
  (ID DELONEY,79.296) )
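
A minimal Python sketch of the distance test follows. It is not CorpusSearch's implementation: the helper names are hypothetical, and it assumes that the order of the two arguments does not matter, that punctuation pairs count as words, and that an argument matches either a word or a tag.

```python
def matches(pair, label):
    """Assumption: a query argument matches either the word or the tag."""
    word, tag = pair
    return label == word or label == tag


def neighborhood(pairs, x, n, y):
    """True if an x and a y have at most n pairs between them."""
    xs = [i for i, p in enumerate(pairs) if matches(p, x)]
    ys = [j for j, p in enumerate(pairs) if matches(p, y)]
    # abs(i - j) - 1 is the number of pairs strictly between the two.
    return any(i != j and abs(i - j) - 1 <= n for i in xs for j in ys)


# The sentence from DELONEY,79.296 as (word, tag) pairs.
sentence = [("why", "WADV"), ("thou", "PRO"), ("whoreson", "N"),
            ("when", "WADV"), ("wilt", "MD"), ("thou", "PRO"),
            ("be", "BE"), ("maried", "VAN"), ("?", ".")]

print(neighborhood(sentence, "whoreson", 2, "wilt"))
```

In this sentence only "when" intervenes between "whoreson" and "wilt", so the test succeeds for any distance of one or more.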

precedes (variants: Precedes, Pres, pres)

"Precedes" imposes a less strict condition than "iPrecedes". "x precedes y" means "x comes before y in the sentence but not necessarily immediately".

Example query:

query: (VB precedes N)

Example output:

/~*
thenne have ye cause to make myghty werre upon hym.
(CMMALORY,2.25)
*~/

/*
    6 VB make, 8 N werre
*/

( (ADV thenne) (HV have) (PRO ye) (N cause) (TO to) (VB make) (ADJ myghty) (N werre) (P upon)
  (PRO hym) (. .)
  (ID CMMALORY,2.25))
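
The difference between "precedes" and "iPrecedes" can be sketched in Python with the sentence above. As with any such illustration, this is not CorpusSearch's own code: the helper names are hypothetical, and an argument is assumed to match a pair if it equals either the word or the tag.

```python
def matches(pair, label):
    """Assumption: a query argument matches either the word or the tag."""
    word, tag = pair
    return label == word or label == tag


def precedes(pairs, x, y):
    """True if some pair matching x comes anywhere before one matching y."""
    return any(matches(pairs[i], x) and matches(pairs[j], y)
               for i in range(len(pairs))
               for j in range(i + 1, len(pairs)))


def iprecedes(pairs, x, y):
    """Stricter variant: the pair matching y must follow directly."""
    return any(matches(a, x) and matches(b, y)
               for a, b in zip(pairs, pairs[1:]))


# The sentence from CMMALORY,2.25 as (word, tag) pairs.
sentence = [("thenne", "ADV"), ("have", "HV"), ("ye", "PRO"), ("cause", "N"),
            ("to", "TO"), ("make", "VB"), ("myghty", "ADJ"), ("werre", "N"),
            ("upon", "P"), ("hym", "PRO"), (".", ".")]

print(precedes(sentence, "VB", "N"), iprecedes(sentence, "VB", "N"))
```

Here "make" comes before "werre" with "myghty" in between, so the looser test holds while the stricter one does not.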