Word tokenization


Certain items in later English are fusions of earlier multi-word phrases. Because the diachronic corpora cover a long period and word division in early texts is not always well represented, these items are difficult to treat consistently across time. We attempt to adhere to the following principles and strategies.

Always split

When an orthographic word in the original text belongs to different sentence-level or phrasal constituents, the word is always split. The location of the split is marked by a trailing "@" on the first component and a leading "@" on the second component.

To facilitate lemmatization (possibly at a later stage), determiners and possessive pronouns are also always split, with very rare exceptions.
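The "@" convention can be sketched in a few lines of Python. This is only an illustration: the function names split_token and rejoin are hypothetical and not part of any corpus tooling, and the word "themperor" is an invented example rather than a citation from the corpora.

```python
from __future__ import annotations


def split_token(word: str, index: int) -> tuple[str, str]:
    """Split an orthographic word at `index`, marking the split with "@".

    The first component gets a trailing "@", the second a leading "@".
    """
    if not 0 < index < len(word):
        raise ValueError("split point must fall inside the word")
    return word[:index] + "@", "@" + word[index:]


def rejoin(first: str, second: str) -> str:
    """Recover the original orthographic word from two marked components."""
    if not (first.endswith("@") and second.startswith("@")):
        raise ValueError("components are not marked as a split pair")
    return first[:-1] + second[1:]


# Hypothetical example: a determiner fused with the following noun.
first, second = split_token("themperor", 2)
print(first, second)          # th@ @emperor
print(rejoin(first, second))  # themperor
```

Because both halves carry the marker, the original orthographic word can be recovered from the annotated text by removing the two "@" signs and concatenating the components.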

Always unitary

The items in this category are generally spelled in the original texts as a single orthographic word. When spelled apart in the original, they are joined by underscores in the annotated version.
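The underscore convention can be illustrated in the same way. The sketch below assumes the components are already available as a list of strings; the function names join_unitary and split_unitary are hypothetical.

```python
from __future__ import annotations


def join_unitary(parts: list[str]) -> str:
    """Join the separately written parts of a unitary item with underscores."""
    return "_".join(parts)


def split_unitary(token: str) -> list[str]:
    """Recover the separately written parts of an underscore-joined token."""
    return token.split("_")


print(join_unitary(["al", "be", "it"]))  # al_be_it
print(join_unitary(["to", "day"]))       # to_day
print(split_unitary("after_noon"))       # ['after', 'noon']
```

A token spelled as a single word in the original contains no underscore, so split_unitary returns it unchanged as a one-element list.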

Treated as written

Differences among the corpora

Item: AFTERNOON
    PPCME2: Phrasal
        (PP (P after)
            (NP (N noon)))
    Later corpora: Unitary N
        (NP (D the) (N after_noon))

Item: ALL BE IT, ALBEIT (see ALBEIT in concessive clauses)
    PPCME2: Split
        (Q all) (BEP be) (PRO it)
    Later corpora: Unitary ADV or P
        (ADV/P albeit), (ADV/P al_be_it)

Item: HOW BE IT, HOWBEIT (see HOWBEIT in concessive clauses)
    PPCME2: Split
        (WADV how) (BEP be) (PRO it)
    Later corpora: Unitary ADV or P
        (ADV/P howbeit), (ADV/P how_be_it)

Item: TODAY, TONIGHT
    PPCME2: Phrasal
        (PP (P to)
            (NP (N day)))
        (PP (P to)
            (NP (N night)))
    Later corpora: Unitary N
        (N to_day), (N$ to_day's)
        (N to_night), (N$ to_night's)