Compatibility requirements for parsed corpora


In order for CorpusSearch to run on a parsed corpus, the corpus needs to satisfy certain formal requirements.

Complete parse

CorpusSearch expects sentences to be completely parsed. In particular, it expects every terminal node to be uniquely associated with an immediately dominating annotation label. The remainder of the structure must also be a well-formed tree structure.

Labels must be single words

CorpusSearch expects labels to be orthographic words - that is, strings without spaces (" "). If your intended label contains a space, CorpusSearch interprets the first string as a label and the next string as text. Any subsequent strings (for instance, the actual word in the text) will trigger a run-time error. For instance, if you try to use "PROPER NOUN" as a label, CorpusSearch will interpret "PROPER" as the label and "NOUN" as text, and then crash on the next item (the actual noun in the text). On the other hand, "PROPER_NOUN" with an underscore instead of space will be properly interpreted as a label and associated with the next word in the input.
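
The space-to-underscore repair described above can be sketched in Python. This is only an illustration: the function name `normalize_label` is hypothetical and not part of CorpusSearch, and it assumes that labels are cleaned up before they are written into the corpus.

```python
def normalize_label(label: str) -> str:
    """Join the whitespace-separated parts of a label with underscores."""
    # str.split() with no argument splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return "_".join(label.split())


print(normalize_label("PROPER NOUN"))  # PROPER_NOUN
```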

Labels must not begin with digits

Labels must not begin with digits ("0", "1", ..., "9"). Such digits are interpreted as prefix indices left over from a previous search and are ignored. Labels are allowed to end with digits, however, so "PP1" is a legal label, even though "1PP" is not.
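
A corpus's label inventory could be screened for this problem with a few lines of Python. The helper name `starts_with_digit` and the sample label list are assumptions made for illustration only.

```python
import string


def starts_with_digit(label: str) -> bool:
    """Return True if the label begins with one of the ASCII digits 0-9."""
    return label != "" and label[0] in string.digits


# Hypothetical label inventory; only "1PP" violates the requirement.
labels = ["PP1", "1PP", "NP-SBJ"]
illegal = [label for label in labels if starts_with_digit(label)]
print(illegal)  # ['1PP']
```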

Round parentheses

CorpusSearch expects the matching brackets indicating syntactic structure to be round parentheses ("(", ")"). If your corpus uses other sorts of matching brackets to indicate syntactic structure - such as curly brackets ("{", "}"), square brackets ("[", "]") or some other system - you will have to convert them to round parentheses. By the same token, if your corpus contains round parentheses as part of the raw text (the text being annotated, rather than the annotation system), they will need to be replaced by something like "OPEN-PAREN" or "LPAREN" for the lefthand variants, and analogously for the righthand counterparts. Of course, the sequences that you choose should not occur elsewhere in the corpus.
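
As a rough sketch, the replacement of parentheses in the raw text might look like the following in Python. It assumes that the text has already been split into tokens with each parenthesis as a token of its own. "LPAREN" is one of the names suggested above; "RPAREN" is an assumed righthand counterpart, and the function name `escape_raw_parens` is hypothetical.

```python
# "LPAREN" comes from the text above; "RPAREN" is an assumed counterpart.
PAREN_PLACEHOLDERS = {"(": "LPAREN", ")": "RPAREN"}


def escape_raw_parens(tokens):
    """Replace parenthesis tokens in raw text with placeholder words."""
    # The chosen sequences must not already occur elsewhere in the corpus.
    for placeholder in PAREN_PLACEHOLDERS.values():
        if placeholder in tokens:
            raise ValueError(f"{placeholder!r} already occurs in the text")
    return [PAREN_PLACEHOLDERS.get(tok, tok) for tok in tokens]


print(escape_raw_parens(["he", "(", "the", "king", ")", "said"]))
```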

No square brackets

This item follows from the one immediately preceding, but is repeated for clarity. Square brackets ("[" and "]") are used by CorpusSearch to enclose prefix indices. If your corpus contains square brackets, you will need to convert them to some other sequence of characters, with one sequence for the lefthand variants and another for the righthand counterparts. Again, the sequences that you choose should not occur elsewhere in the corpus.

Wrapper parens

CorpusSearch expects every sentence to be enclosed in a "wrapper", that is, a pair of unlabeled parentheses surrounding the sentence. The wrapper is a useful place to store items that are associated with the sentence, but not with its internal structure - for instance, ID nodes or sociolinguistic information about the author (such as the META nodes in the Parsed Corpus of Early English Correspondence). In the following example, the wrapper consists of the first and last parentheses:
( (IP-MAT (ADVP-TMP (ADV Thenne))
          (NP-SBJ (NPR quene) (NPR Igrayne))
          (VBD waxid)
          (ADVP-TMP (ADV dayly))
          (ADJP (ADJR gretter) (CONJ and) (ADJR gretter))
          (. .))
  (ID CMMALORY,5.120))

If necessary, the wrapper parens can be referenced in queries with the expression "$METAROOT".
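
Whether a token has a wrapper can be checked mechanically. The Python sketch below is not how CorpusSearch itself reads trees; the names `tokenize`, `parse` and `has_wrapper` are hypothetical. It treats a bracketed node as an unlabeled wrapper when every one of its children is itself a bracketed node, so that no bare label follows the opening parenthesis.

```python
def tokenize(text):
    """Split a bracketed string into parentheses and words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Turn a token list into nested lists; returns the top-level trees."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced: extra ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced: missing ')'")
    return stack[0]


def has_wrapper(tree):
    """True if the outermost parens are unlabeled (all children are nodes)."""
    return (isinstance(tree, list) and len(tree) > 0
            and all(isinstance(child, list) for child in tree))


wrapped = "( (IP-MAT (VBD waxid) (. .)) (ID CMMALORY,5.120))"
print(has_wrapper(parse(tokenize(wrapped))[0]))  # True
```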

ID nodes

CorpusSearch can function without token identifier nodes (labeled "ID"), but it's useful to include them. For instance, when CorpusSearch searches the output of a previous search, it uses the ID nodes to keep statistics for the header, footer and summary blocks. Here's an example of an ID node:
(ID CMMALORY,5.120)
Here, CMMALORY identifies the source file, 5 is the page number, and 120 is the running token number for that file. In general, an ID node should have this form:
(ID <source_name>,<optional_information>.<running_token_number>)

CorpusSearch expects the material following the label "ID" to be a single orthographic word. In other words, that material must not contain spaces (" ").

The optional information between the comma and the period above is not referenced by CorpusSearch. It could be used to store page numbers (as in the Penn Parsed Corpora of Historical English), or some other information, or not used at all. The important thing is that the ID string must include a comma (which delimits the preceding material as the source name) and also a period (which delimits the following material as the running token number).

CorpusSearch expects to find the ID node just after the sentence, but inside the sentence wrapper (in other words, as the last child of $METAROOT).
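
The form of the ID string can be illustrated with a small Python parser. This is a sketch under stated assumptions, not CorpusSearch's own code: the names `TokenID` and `parse_id` are hypothetical, the source name is taken to end at the first comma, the running token number is taken to follow the last period, and the token number is assumed to consist of digits.

```python
from typing import NamedTuple


class TokenID(NamedTuple):
    source: str        # material before the comma
    extra: str         # optional information, e.g. a page number
    token_number: int  # running token number after the period


def parse_id(id_string: str) -> TokenID:
    """Split an ID string of the form <source>,<optional>.<number>."""
    if any(ch.isspace() for ch in id_string):
        raise ValueError("ID string must not contain spaces")
    source, comma, rest = id_string.partition(",")
    extra, period, number = rest.rpartition(".")
    if not comma or not period:
        raise ValueError("ID string needs both a comma and a period")
    return TokenID(source, extra, int(number))


print(parse_id("CMMALORY,5.120"))
```

With this sketch, a string such as "RO,1" is rejected because it has no period and therefore no running token number.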

Example of a transition

In 1994, Beatrice Santorini of the University of Pennsylvania built a corpus of parsed and annotated Yiddish texts. Like Phase 1 of the Middle English corpus, this Yiddish corpus was parsed only to the first level of constituents below the clause level. In those ancient days, these structures were searched using Perl scripts that matched regular expressions.

One passage from the corpus tells a joke that begins this way:

When you tell a story to a peasant, he laughs three times.
The first time, he laughs when someone tells him the story.
The second time, when it is explained to him.
And the third time, when he understands the story.

Let's examine one sentence from that passage:
The first time, he laughs when someone tells him the story.

Here it is as it appeared in the corpus. (For this discussion, we don't need the definitions of the words and their labels, so they are omitted. Most of the labels in this corpus begin with lowercase letters. This is unorthodox, but entirely compatible with CorpusSearch's formal requirements.)

( [t dem ershtn mol ]
  [v0 lakht ] 
  [s er ] 
  ,
  [B [c ven ] [s men ] [v0 dertseylt ] [i im ] [d di mayse ] , B]
)
(RO,1)
The first problem is the use of square brackets ("[", "]") to represent syntactic structure, which CorpusSearch doesn't recognize (or rather, would want to interpret as prefix indices). So the first task is to convert the square brackets to round parentheses. At the same time, let's close up the space between the words and the close parens:
( (t dem ershtn mol)
  (v0 lakht)
  (s er)
  ,
  (B (c ven) 
     (s men)
     (v0 dertseylt)
     (i im)
     (d di mayse)
     , 
  B)
)
(RO,1)
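
The bracket conversion just shown can be done mechanically. The following Python sketch uses a hypothetical function name, `square_to_round`, and assumes that square brackets occur in the corpus only as structural brackets. It leaves line breaks and indentation untouched.

```python
import re


def square_to_round(text: str) -> str:
    """Convert structural square brackets to round parentheses."""
    # Opening brackets can be replaced directly.
    text = text.replace("[", "(")
    # Drop any whitespace before a closing bracket so that the
    # close paren sits right after the preceding word.
    return re.sub(r"\s*\]", ")", text)


print(square_to_round("[t dem ershtn mol ]"))  # (t dem ershtn mol)
```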
The second problem is that the sentence isn't delimited by wrapper parens. (The outermost parens enclose the sentence, but CorpusSearch expects a further pair of parens.) The following structure has the wrapper parens; we've also added a default IP label to the original highest parens.
( (IP (t dem ershtn mol)
      (v0 lakht)
      (s er)
      ,
      (B (c ven) 
         (s men)
         (v0 dertseylt)
         (i im)
         (d di mayse)
         ,
      B))
  (RO,1))
This form of the sentence can be partly searched by CorpusSearch. For instance, this query:
node:   IP*
query:  (v0 iPrecedes s)
finds the structure "(v0 lakht) (s er)", as expected.

But the sentence is still not completely parsed. For instance, the phrase "dem ershtn mol" ('the first time') has been parsed as one object. So if you run this query:

node:    IP*
query:  (ershtn precedes mol) 
CorpusSearch (if it runs at all) will not find the sentence. This is because CorpusSearch expects every terminal node in the tree to be uniquely associated with an immediately dominating annotation label. Also, the leading "B" on the parenthesis that marks the end of the B-labeled clause is redundant.

We therefore remove the leading "B" and give each terminal a tag of its own. In the following structure, we have given the punctuation marks linguistically appropriate tags, and we have given the words without tags of their own the dummy tag "x", which is enough to keep CorpusSearch from complaining. (Eventually, of course, we might want to replace those tags with linguistically appropriate ones, but that is independent of whether CorpusSearch will run properly on the corpus and give the expected results. It is worth noting that much fine research can be, and has been, done on structures like the one immediately following.)

( (IP (t (x dem) (x ershtn) (x mol))
      (v0 lakht)
      (s er) 
      (, ,)
      (B (c ven)
         (s men) 
         (v0 dertseylt)
         (i im) 
         (d (x di) (x mayse))
         (. ,)))
 (RO,1))
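
The dummy tagging could also be automated. The Python sketch below uses the hypothetical names `parse`, `add_dummy_tags` and `to_text`. It assumes a single, balanced, labeled tree from which the redundant trailing "B" has already been removed. Unlike the hand-edited structure above, it gives every untagged item the tag "x", punctuation included, so punctuation would still need to be retagged by hand.

```python
def parse(text):
    """Parse one balanced bracketed tree into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def add_dummy_tags(node, dummy="x"):
    """Give each bare word that has sisters a dummy tag of its own."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return node  # already a (tag word) pair
    tagged = []
    for child in children:
        if isinstance(child, str):
            tagged.append([dummy, child])
        else:
            tagged.append(add_dummy_tags(child, dummy))
    return [label] + tagged


def to_text(node):
    """Write nested lists back out as a bracketed string."""
    parts = [to_text(c) if isinstance(c, list) else c for c in node]
    return "(" + " ".join(parts) + ")"


print(to_text(add_dummy_tags(parse("(t dem ershtn mol)"))))
```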

Finally, there is the node (RO,1), which identifies the sentence as part of the first story in the Royte Pomerantsen collection. When we added the wrapper parens, we already placed this node correctly inside them, but it still needs to be given the standard CorpusSearch ID node form. As it stands, there is no running token number associated with the sentence. For purposes of illustration, we've manually added the number 2:
(ID RO,1.2)

Now the annotation is fully compatible with CorpusSearch, and the following query:

node:   IP*
query:  (ershtn precedes mol) 
finds the structure as expected:
/~*
dem ershtn mol lakht er , ven men dertseylt im di mayse ,
*~/

/*
    1 t: 3 x ershtn, 4 x mol
*/

( (IP (t (x dem) (x ershtn) (x mol))
      (v0 lakht)
      (s er)
      (, ,)
      (B (c ven)
         (s men)
         (v0 dertseylt)
         (i im)
         (d (x di) (x mayse))
         (. ,)))
  (ID RO,1.2))

Further revisions to this structure - for instance, moving the final punctuation up a level (out of the B clause) - would make it compatible with the annotation guidelines for the PPCHE. These and other linguistically more interesting revisions can be performed automatically using CorpusSearch's corpus revision functionality. But they are not necessary for compatibility with CorpusSearch's purely formal requirements.