In order for CorpusSearch to run on a parsed corpus, the corpus needs to satisfy certain formal requirements.
If necessary, the wrapper parens can be referenced in queries with the
expression "$METAROOT".
CorpusSearch expects the material following the label "ID" to be a single
orthographic word. In other words, that material must not contain spaces
(" ").
The optional information between the comma and the period above is not
referenced by CorpusSearch. It could be used to store page numbers (as in
the Penn Parsed Historical Corpora of English), or some other information,
or not used at all. The important thing is that the ID string must
include a comma (which delimits the preceding material as the source name)
and also a period (which delimits the following material as the running
token number).
CorpusSearch expects to find the ID node just after the sentence, but
inside the sentence wrapper (in other words,
as the last child of $METAROOT).
One passage from the corpus tells a joke that begins this way:
Here it is as it appeared in the corpus. (For this discussion, we don't
need the definitions of the words and their labels, so they are omitted.
Most of the labels in this corpus begin with lowercase letters. This is
unorthodox, but entirely compatible with CorpusSearch's formal requirements.)
But the sentence is still not completely parsed. For instance, the
phrase "dem ershtn mol" ('the first time') has been parsed as one
object. So if you run this query:
We therefore remove the leading "B" and give each terminal a tag of its
own. In the following structure, we have given the punctuation marks
linguistically appropriate tags, and we have given the words without
tags of their own the dummy tag "x", which is enough to keep
CorpusSearch from complaining. (Eventually, of course, we might want to
replace those tags with linguistically appropriate ones, but that is
independent of whether CorpusSearch will run properly on the corpus and
give results as expected. It is worth noting that much fine research
can be, and has been, done on structures like the one immediately
following.)
Finally, there is the node (RO,1), which identifies the sentence as part
of the first story in the Royte Pomerantsen collection. In connection
with adding the wrapper parens, we have already correctly included the
node inside the wrapper parens, but it still needs to be given the
standard CorpusSearch ID node form. As it stands,
there is no running token number associated with the sentence. For
purposes of illustration, we've manually added the number 2:
Now the annotation is fully compatible with CorpusSearch, and the
following query:
Further revisions to this structure, such as moving the final
punctuation up a level (out of the B clause), would make it compatible
with the annotation guidelines for the PPCHE. These and other
linguistically more interesting revisions can be performed automatically
using CorpusSearch's corpus
revision functionality. But they are not necessary for compatibility
with CorpusSearch's purely formal requirements.
Complete parse
CorpusSearch expects sentences to be completely parsed. In particular,
it expects every terminal node to be uniquely associated with an
immediately dominating annotation label. The remainder of the structure
must also be a well-formed tree structure.
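The requirement can be checked mechanically before running CorpusSearch. The following Python sketch is our own illustration, not part of CorpusSearch; the function names `parse` and `ill_formed_nodes` are hypothetical, and the check encodes one reading of the requirement: the wrapper contains only subtrees, and every other node is either a label plus one word or a label plus subtrees. It also assumes the token has balanced parentheses.

```python
def parse(text):
    """Parse one parenthesized token into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening "("
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return children

    return node()


def ill_formed_nodes(tree, is_wrapper=True):
    """Return every node that is not (label word) or (label subtree...)."""
    strings = [c for c in tree if isinstance(c, str)]
    subtrees = [c for c in tree if isinstance(c, list)]
    if is_wrapper:
        ok = not strings  # the wrapper is unlabeled and holds only subtrees
    elif subtrees:
        ok = len(strings) == 1 and isinstance(tree[0], str)  # label first
    else:
        ok = len(strings) == 2  # exactly one label and one word
    problems = [] if ok else [tree]
    for sub in subtrees:
        problems.extend(ill_formed_nodes(sub, is_wrapper=False))
    return problems


if __name__ == "__main__":
    sample = "( (IP (t dem ershtn mol) (v0 lakht)) (ID RO,1.2))"
    for bad in ill_formed_nodes(parse(sample)):
        print("needs attention:", bad)
```

A node such as (t dem ershtn mol) is reported because three words share one label; giving each word a tag of its own makes the report empty.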
Labels must be single words
CorpusSearch expects labels to be orthographic words - that is, strings
without spaces (" "). If your intended label contains a space,
CorpusSearch interprets the first string as a label and the next string
as text. Any subsequent strings (for instance, the actual word in the
text) will trigger a run-time error. For instance, if you try to use
"PROPER NOUN" as a label, CorpusSearch will interpret "PROPER" as the
label and "NOUN" as text, and then crash on the next item (the actual
noun in the text). On the other hand, "PROPER_NOUN" with an underscore
instead of space will be properly interpreted as a label and associated
with the next word in the input.
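A quick label check can catch this problem before CorpusSearch does. The sketch below is our own illustration (the function name `label_problem` is hypothetical). It tests a proposed label for whitespace and, since CorpusSearch also forbids labels that begin with a digit, for a leading digit as well.

```python
import re


def label_problem(label):
    """Return a short description of the problem, or None if the label is fine."""
    if re.search(r"\s", label):
        return "contains whitespace"
    if re.match(r"[0-9]", label):
        return "begins with a digit"
    return None


if __name__ == "__main__":
    for candidate in ["PROPER NOUN", "PROPER_NOUN", "PP1", "1PP"]:
        print(candidate, "->", label_problem(candidate))
```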
Labels must not begin with digits
Labels must not begin with digits ("0", "1", ..., "9"). (Such digits are
interpreted as prefix
indices left over from a previous search and ignored.) But labels are
allowed to end with digits. So "PP1" is a legal label, even though
"1PP" is not.
Round parentheses
CorpusSearch expects the matching brackets indicating syntactic structure
to be round parentheses ("(", ")"). If your corpus uses other sorts of
matching brackets to indicate syntactic structure, such as curly brackets
("{", "}"), square brackets ("[", "]") or some other system, you
will have to convert them to round parentheses. By the same token, if
your corpus contains round parentheses as part of the raw text (the text
being annotated, rather than as part of the annotation system), they will
need to be replaced by something like "OPEN-PAREN" or "LPAREN" or the like
for the lefthand variants, and analogously for the righthand counterparts.
Of course, the sequences that you choose should not occur elsewhere in the
corpus.
No square brackets
This item follows from the one immediately preceding, but is repeated for
clarity. Square brackets ("[" and "]") are used by CorpusSearch to
enclose prefix
indices. If your corpus contains square brackets, you will need to
convert them to something else, just as with round parentheses in the
raw text.
Wrapper parens
CorpusSearch expects every sentence to be enclosed in a "wrapper", that
is, a pair of unlabeled parentheses surrounding the sentence. The wrapper
is a useful place to store items that are associated with the sentence,
but not with its internal structure - for instance, ID
nodes or sociolinguistic information about the author (such as the META
nodes in the Parsed Corpus of Early English Correspondence). In the
following example, the wrapper consists of the first and last parentheses:
( (IP-MAT (ADVP-TMP (ADV Thenne))
(NP-SBJ (NPR quene) (NPR Igrayne))
(VBD waxid)
(ADVP-TMP (ADV dayly))
(ADJP (ADJR gretter) (CONJ and) (ADJR gretter))
(. .))
(ID CMMALORY,5.120))
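To see whether a file is organized into wrapped tokens, one can split it at top-level parentheses. The following Python sketch is our own illustration (the function name `top_level_chunks` is hypothetical). It assumes that any round parentheses in the raw text have already been replaced, so that every parenthesis is structural.

```python
def top_level_chunks(text):
    """Split corpus text into its outermost parenthesized units."""
    chunks = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level unit begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                chunks.append(text[start:i + 1])
    return chunks


if __name__ == "__main__":
    corpus = (
        "( (IP-MAT (VBD waxid)) (ID CMMALORY,5.120))\n"
        "( (IP-MAT (VBD waxid)) (ID CMMALORY,5.121))\n"
    )
    for chunk in top_level_chunks(corpus):
        print(chunk)
```

In a properly wrapped corpus, each chunk is one sentence together with its ID node. If an ID node shows up as a chunk of its own, it is sitting outside the wrapper.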
ID nodes
CorpusSearch can function without token identifier nodes (labeled "ID"),
but it's useful to include them. For instance, when CorpusSearch searches
the output of a previous search, it uses the ID nodes to keep statistics
for the header, footer and summary blocks. Here's an example of an ID
node:
(ID CMMALORY,5.120)
Here, CMMALORY identifies the source file, 5 is the page number, and 120
is the running token number for that file. In general, an ID node should
have this form:
(ID <source_name>,<optional_information>.<running_token_number>)
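The ID format lends itself to a simple check with a regular expression. The Python sketch below is our own illustration, not CorpusSearch's actual parsing code; the names `ID_PATTERN` and `parse_id` are hypothetical. It assumes that the source name contains no comma, that the ID string contains no spaces, and that the running token number is the run of digits after the last period.

```python
import re

# source name, comma, optional information, period, running token number
ID_PATTERN = re.compile(r"^(?P<source>[^,\s]+),(?P<info>\S*)\.(?P<token>\d+)$")


def parse_id(id_string):
    """Return the parts of an ID string as a dict, or None if it is malformed."""
    match = ID_PATTERN.match(id_string)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "info": match.group("info"),
        "token": int(match.group("token")),
    }


if __name__ == "__main__":
    print(parse_id("CMMALORY,5.120"))
    print(parse_id("RO,1"))  # no period, so no running token number
```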
Example of a transition
In 1994, Beatrice Santorini of the University of Pennsylvania built a
corpus of parsed and annotated Yiddish texts. Like Phase 1 of the Middle
English corpus, this Yiddish corpus was parsed only to the first level of
constituents below the clause level. In those ancient days, these
structures were searched using Perl scripts that matched regular
expressions.
Let's examine one sentence from that passage:
When you tell a story to a peasant, he laughs three times.
The first time, he laughs when someone tells him the story.
The second time, when it is explained to him.
And the third time, when he understands the story.
The first time, he laughs when someone tells him the story.
The first problem is the use of square brackets ("[", "]") to represent
syntactic structure, which CorpusSearch doesn't recognize (or rather,
would want to interpret as prefix indices). So the first task is to
convert the square brackets to round parentheses. At the same time,
let's close up the space between the words and the close parens:
( [t dem ershtn mol ]
[v0 lakht ]
[s er ]
,
[B [c ven ] [s men ] [v0 dertseylt ] [i im ] [d di mayse ] , B]
)
(RO,1)
The second problem is that the sentence isn't delimited by wrapper parens.
(The outermost parens enclose the sentence, but CorpusSearch expects a
further pair of parens.) The following structure has the wrapper parens;
we've also added a default IP label to the original highest parens.
( (t dem ershtn mol)
(v0 lakht)
(s er)
,
(B (c ven)
(s men)
(v0 dertseylt)
(i im)
(d di mayse)
,
B)
)
(RO,1)
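The conversion from square brackets to round parentheses is easy to automate. The following Python sketch is our own illustration (the function name `square_to_round` is hypothetical). It assumes that square brackets occur in the file only as structural brackets, never as part of the raw text, and it closes up spaces and tabs before a close paren while leaving line breaks alone.

```python
import re


def square_to_round(text):
    """Replace square brackets with round parentheses and tidy the spacing."""
    text = text.replace("[", "(").replace("]", ")")
    # Remove spaces or tabs between a word and the close paren that follows it.
    return re.sub(r"[ \t]+\)", ")", text)


if __name__ == "__main__":
    print(square_to_round("[t dem ershtn mol ]"))
    print(square_to_round("[B [c ven ] [s men ] , B]"))
```

This step leaves the redundant label on the closing bracket (as in "B)") untouched, so it still has to be removed separately.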
This form of the sentence can be partly searched by CorpusSearch. For
instance, this query:
( (IP (t dem ershtn mol)
(v0 lakht)
(s er)
,
(B (c ven)
(s men)
(v0 dertseylt)
(i im)
(d di mayse)
,
B))
(RO,1))
finds the structure "(v0 lakht) (s er)", as expected.
node: IP*
query: (v0 iPrecedes s)
CorpusSearch (if it runs at all) will not find the sentence. This is
because CorpusSearch expects every terminal node in the tree to be
uniquely associated with an immediately dominating annotation label.
Also, the leading "B" on the parenthesis that marks the end of the
B-labeled clause is redundant.
node: IP*
query: (ershtn precedes mol)
( (IP (t (x dem) (x ershtn) (x mol))
(v0 lakht)
(s er)
(, ,)
(B (c ven)
(s men)
(v0 dertseylt)
(i im)
(d (x di) (x mayse))
(. ,)))
(RO,1))
finds the structure as expected:
node: IP*
query: (ershtn precedes mol)
/~*
dem ershtn mol lakht er , ven men dertseylt im di mayse ,
*~/
/*
1 t: 3 x ershtn, 4 x mol
*/
( (IP (t (x dem) (x ershtn) (x mol))
(v0 lakht)
(s er)
(, ,)
(B (c ven)
(s men)
(v0 dertseylt)
(i im)
(d (x di) (x mayse))
(. ,)))
(ID RO,1.2))