At times, orthographic conventions conflict with the requirements of
morphosyntactic annotation. The conflicts are resolved by splitting and
joining orthographic words (sequences of characters without whitespace)
according to conventions that allow morphosyntactic annotation to
proceed, while at the same time preserving information that allows the
conventional orthographic form to be restored, either with reference to
a short list (in the case of clitics) or
completely algorithmically (otherwise).
Splitting words
When orthographic words are split, the site of the split is marked by a
trailing AT sign (@) on the first part and a leading AT sign (@) on the
second.
( (IP-MAT (NP-SBJ (PRO We@))
          (VP (BEP @'re)
              (VP (VAG gon@)
                  (IP-INF (TO @na)
                          (VP (VB go)
                              (ADVP-TMP (ADV now))))))
          (PUNC .)))
( (IP-IMP (VP (VBI Lem@)
              (IP-ECM (NP-SBJ (PRO @me))
                      (VP (VB go))))
          (PUNC .)))
( (IP-IMP (VP (VBI Let@)
              (IP-ECM (NP-SBJ (PRO @'s))
                      (VP (VB go))))
          (PUNC .)))
( (IP-MAT (NP-SBJ (D That))
          (VP (MD wo@)
              (NEG @n't)
              (VP (DO do)))
          (PUNC .)))
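The AT-sign convention above is designed so that the original orthographic word can be restored mechanically. The following is a minimal sketch of that restoration, not part of the corpus tooling: the function name `rejoin_split_words` is hypothetical, and it assumes the input is a flat list of leaf tokens in which the two halves of a split word are adjacent.

```python
def rejoin_split_words(tokens):
    """Merge adjacent tokens marked with the AT-sign split convention.

    A token ending in '@' followed by a token starting with '@' is merged
    into one orthographic word, e.g. 'We@' + "@'re" -> "We're".
    Assumption: the two halves are adjacent in the token list.
    """
    words = []
    pending = False  # True when the previous token ended with '@'
    for tok in tokens:
        starts = tok.startswith("@")
        ends = tok.endswith("@")
        core = tok[1:] if starts else tok
        core = core[:-1] if ends else core
        if starts and pending and words:
            # Second half of a split word: glue it onto the first half.
            words[-1] += core
        else:
            # Ordinary token, or an unmatched half kept as its own word.
            words.append(core)
        pending = ends
    return words


leaves = ["We@", "@'re", "gon@", "@na", "go", "now", "."]
print(rejoin_split_words(leaves))
```

Halves separated by intervening material (false starts, CODE nodes) are not merged by this sketch; such cases would need to be handled separately.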
Joining words
It is not always obvious when to join two orthographic words and when to treat them as a compound word. When standard orthography is not a guide, as it is in the case of "mother-in-law" below, a helpful rule of thumb is to consider the function of the sequence in context and ask what POS label best fits that function. If none of the POS tags for the constituents matches that label, joining is the more intuitive option. For instance, in the first example below, "ten quart" modifies the noun, and ADJ is the best-fit label for the sequence. But neither "ten" nor "quart" is an adjective, so it is best to join. By contrast, "high school" has the same distribution as the noun "school", and one of its constituents is a noun. In this case, it is best not to join but to treat the sequence as a compound noun.
In addition to single orthographic words that need to be split, there is also the converse case of multi-word sequences that must be (or are best) treated as single orthographic words. There are two cases to consider: either the desired orthographic word reflects standard orthographic conventions or not.
In the first case, the text is revised in accordance with standard orthography by adding a hyphen.
In the second case, we join the individual parts into a single orthographic word with an EQUALS sign (=). The original orthographic form can be recovered algorithmically by replacing a word-internal EQUALS sign with SPACE. (The EQUALS sign has a further use, described in Clitics, which precludes globally replacing EQUALS with SPACE.)
( (NP (D a)
      (ADJP (ADJ ten-quart))  ← like this
      (N bottle)))
( (NP (D a)
      (ADJP (ADJ-COMP (NUM ten) (N quart)))  ← not like this
      (N bottle)))
( (NP (NP-POS (PRO$ her))
      (N mother-in-law)))  ← like this
( (NP (NP-POS (PRO$ her))
      (N mother)  ← not like this
      (PP (P in)
          (NP (N law)))))
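The recovery of joined words by replacing a word-internal EQUALS sign with SPACE can be sketched as follows. This is an illustration only: the function name `restore_joined` is hypothetical, and the sketch assumes that an EQUALS sign counts as word-internal when it has an ordinary character on both sides, so that clitic markers (leading or trailing EQUALS) and EQUALS signs adjacent to an AT sign are left untouched.

```python
import re

# An '=' is treated as word-internal only if both neighbours are
# non-space characters other than '=' and '@' (assumption of this sketch).
INTERNAL_EQUALS = re.compile(r"(?<=[^\s=@])=(?=[^\s=@])")


def restore_joined(token):
    """Replace word-internal EQUALS signs with SPACE; leave clitic markers alone."""
    return INTERNAL_EQUALS.sub(" ", token)


for tok in ["you=all", "how=come", "your=all's", "=uz", "a="]:
    print(repr(tok), "->", repr(restore_joined(tok)))
```

Restricting the pattern to word-internal positions is what avoids the global replacement that the clitic use of EQUALS rules out.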
The following table is an exhaustive list of the joined words and morphemes in the APPCAppE, with their POS tags; it can serve as a guide for the other corpora. The "Item" column gives citation forms; the forms that actually occur may be inflected (notably, for plural and possessive).
Item | POS tag | Examples and notes |
---|---|---|
=all | WPRO PRO PRO$ | what=all you=all your=all's |
ever=how | WADV | |
ever=what, ever=which, every=who | WD, WPRO | |
every=which | Q | |
=ful | N | bag=ful, can=ful, glass=ful, jar=ful, pan=ful, pot=ful |
=here | D | these=here, this=here |
how=come | WADV | |
how=much | N | as a synonym of AMOUNT; not as the mass noun counterpart to HOW MANY |
kind=of | ADV | also: kindly=of |
like=to | ADV | |
no=count | ADJ | |
no=how | ADV | |
sort=of | ADV | |
such=like | ADJR | |
=there | C, D | that=there, them=there |
=un | N | old=un, young=un |
=uns | PRO | them=uns, we=uns, you=uns |
up=raise | VB | presumably a nonce use |
used=to | ADV | |
you=guys | PRO | |
Certain difficult cases defy the conventions, and we annotate them as best we can (here, using X).
( (CP-QUE-MAT (CODE <KShepherd_xmin=344.64>)
              (INTJ Um)
              (CODE <$$KShepherd_xmax=345.03>)
              (CODE <KShepherd_xmin=346.24>)
              (WADVP-1 (WADV when))
              (IP-QQQ (IGNORE-DOD did)
                      (NP-SBJ (PRO you=@)
                              (DIS-FS (FS-FS m-))
                              (CODE <$$KShepherd_xmax=346.93>)
                              (CODE <KShepherd_xmin=347.93>)
                              (X @all))
                      (VP (GT get)
                          (VP (ADVP-TMP *T*-1)
                              (VAN married))))
              (PUNC ?)
              (CODE <$$KShepherd_xmax=348.75>)))
In contrast to initials in (nick)names of persons, the letters in acronyms are grouped together as far as feasible. In APPCAppE and CoNyCE, the letters are not followed by trailing periods.
( (NP (N IV)))
( (NP (N TB)))
( (NP (N TV)))
( (NP (N-COMP (NPR US) (NPR Steel))))
( (NP (D the) (NPR UMWA)))
( (NP (D the) (N-COMP (NPR UMW) (PP (P of) (NP (NPR A))))))
( (NP (D the) (NPR USA)))
( (NP (D the) (N-COMP (NPR US) (PP (P of) (NP (NPR A))))))
Clitics
The EQUALS sign is also used to indicate the clitic status of items that are phonologically reduced, but do not form a lexical item with an adjacent orthographic word.
Apart from the cases in Joining words, we do not join clitic forms with their hosts because doing so would massively complicate the annotation system as well as the queries necessary to search the corpus. For instance, if "THIS =UN" were instead represented as "THIS=UN", the POS tag system would have to be complicated to include, in this case, a compound tag D+N. This would have the unwelcome result that CorpusSearch queries targeting D or N would need to include the new compound tag. Enforcing consistency across queries would also be more difficult.
The items treated as clitics in the APPCAppE are given below:
 | Corpus form | Conventional orthography |
---|---|---|
Proclitic (leans right) | a= | a- joined with following verb |
Enclitic (leans left) | =ud | 'd |
Enclitic (leans left) | =un(s) | 'un(s) |
Enclitic (leans left) | =uv | 've |
Enclitic (leans left) | =uz | 's |
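Because clitics are restored with reference to a short list rather than purely algorithmically, the restoration amounts to a table lookup. The sketch below illustrates this with the items listed above. The function name `restore_clitics` is hypothetical, the lookup is case-insensitive by assumption, and attaching every enclitic directly to its host without a space (including 'un and 'uns) is likewise an assumption of the sketch rather than a documented convention.

```python
# Lookup tables transcribed from the clitic list above.
ENCLITICS = {"=ud": "'d", "=un": "'un", "=uns": "'uns", "=uv": "'ve", "=uz": "'s"}
PROCLITICS = {"a=": "a-"}


def restore_clitics(tokens):
    """Restore conventional orthography for clitic tokens in a flat token list.

    Assumptions: enclitics attach to the preceding token with no space;
    the proclitic a= is joined to the following token as 'a-'.
    """
    out = []
    prefix = None  # conventional form of a proclitic waiting for its host
    for tok in tokens:
        if tok.lower() in PROCLITICS:
            prefix = PROCLITICS[tok.lower()]
            continue
        if prefix is not None:
            tok = prefix + tok
            prefix = None
        if tok.lower() in ENCLITICS and out:
            out[-1] += ENCLITICS[tok.lower()]
        else:
            out.append(tok)
    if prefix is not None:
        # Proclitic with no following host: keep it as its own token.
        out.append(prefix)
    return out


print(restore_clitics(["I", "=uv", "been", "a=", "huntin"]))
```

Keeping the list in plain dictionaries means that another corpus with a different clitic inventory would only need different tables, not different logic.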
( (NP (N-COMP (NPR F) (NPR D) (NPR R))))
( (NP (N-COMP (NPR J) (NPR C) (NPR Hall))))