General introduction
-
As with the Penn Historical
Corpora, our primary goal has been to create an annotation system
that facilitates automated searches rather than to give a correct
linguistic analysis of each sentence, which in many cases is unworkable
and in some cases (due to structural and morphosyntactic ambiguity,
inaudible material, or other reasons) downright impossible.
-
We have tried to plan our system so that at each stage of the
annotation, information can be added in a monotonic way. That is, we
want any future revisions of the bracketed structures always to add
information, never to change it. This goal requires us to avoid
judgments that are subjective or error-prone, since those are the
judgments most likely to need revising later.
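-
One way to make the "add, never change" requirement concrete is to check that a revised bracketing keeps the same words and every constituent of the earlier one. The sketch below is only an illustration under assumptions: the helper names (parse, spans, is_monotonic), the sample sentence, and the labels are hypothetical, and treating a label extended with a dash tag (NP becoming NP-SBJ) as added information is our assumption, not a rule stated in this section.

```python
def parse(s):
    """Parse a bracketed string into (label, children); leaves are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def spans(tree):
    """Return the list of words and the set of (label, start, end) constituents."""
    leaves, found = [], set()

    def walk(t):
        if isinstance(t, str):
            leaves.append(t)
            return
        label, children = t
        start = len(leaves)
        for child in children:
            walk(child)
        found.add((label, start, len(leaves)))

    walk(tree)
    return leaves, found


def is_monotonic(old, new):
    """True if `new` keeps the words and every constituent of `old`.

    Assumption: a label may gain a dash tag (NP -> NP-SBJ) but not otherwise change.
    """
    old_leaves, old_spans = spans(parse(old))
    new_leaves, new_spans = spans(parse(new))
    if old_leaves != new_leaves:
        return False
    return all(
        any(ns == s and ne == e and (nl == ol or nl.startswith(ol + "-"))
            for nl, ns, ne in new_spans)
        for ol, s, e in old_spans
    )


# Hypothetical example bracketings, not taken from the corpus.
old = "(IP (NP (D the) (N dog)) (VBD barked))"
refined = "(IP (NP-SBJ (D the) (N dog)) (VBD barked))"   # adds a function tag
rebracketed = "(IP (NP (D the)) (N dog) (VBD barked))"   # moves an NP boundary
```

Under these assumptions, is_monotonic(old, refined) holds, while is_monotonic(old, rebracketed) does not, because the original NP constituent no longer exists in the revision.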
-
As much as possible, we have tried to avoid making decisions that would
be controversial, whether with regard to text interpretation or to
linguistic theory. In doubtful cases, we either avoid specifying
structure, or we use default rules to decide the case for search
purposes. An example of the first strategy concerns VPs. These are
generally not indicated in the corpus, since VP boundaries are normally
indeterminate. This is clearly the case in Middle English, which allows
scrambling and where the internal structure of the VP is variable and
changing. But even in modern English, there are many cases in which it
is not clear whether some phrase attaches as a daughter of VP or higher
up in the tree. An example of the second strategy concerns PP
attachment. Whenever it is unclear where a PP attaches, we attach it by
default as high as possible.
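-
The default rule for PP attachment can be sketched as picking, among the nodes where the PP could plausibly attach, the one closest to the root. This is a minimal illustration rather than the procedure actually used in annotation: the function names, the tree representation (nested [label, children] lists), the part-of-speech labels, and the example sentence are all hypothetical.

```python
def attach_high(tree, pp, candidate_paths):
    """Attach `pp` as the last daughter of the shallowest candidate node.

    Each candidate is a path of child indices from the root; the empty
    path () stands for the root itself. Ties go to the first candidate.
    """
    path = min(candidate_paths, key=len)
    node = tree
    for index in path:
        node = node[1][index]
    node[1].append(pp)
    return tree


def render(node):
    """Render a [label, children] tree as a bracketed string."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(render(c) for c in children) + ")"


# Hypothetical ambiguous case: "she saw the man with a telescope".
tree = ["IP", [
    ["NP", [["PRO", ["she"]]]],
    ["VBD", ["saw"]],
    ["NP", [["D", ["the"]], ["N", ["man"]]]],
]]
pp = ["PP", [["P", ["with"]], ["NP", [["D", ["a"]], ["N", ["telescope"]]]]]]

# The PP could modify the object NP (path (2,)) or attach at clause level (path ()).
result = render(attach_high(tree, pp, [(2,), ()]))
```

Here the clause-level site wins, so the PP ends up as a daughter of IP rather than inside the object NP. A search that wants PPs modifying the object must then also consider PPs attached higher up.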