Word tokenization

Always split
Always unitary
Treated as written
Differences among the corpora
Possessive clitic (see Dollar tag, Genitive/possessive modifier of N)

Certain items in later English are fusions of earlier multi-word phrases. Given the length of time covered by the diachronic corpora and because word division in early texts is not always well represented, these items are difficult to treat in a consistent way across time. We attempt to adhere to the following principles and strategies.

Certain forms are always split due to the requirements of the annotation or of (possibly future) lemmatization.
Items that fused before Middle English are treated as unitary.
Items that are not completely fused by Middle English are generally treated as written. That is, the components receive separate POS tags when they are spelled apart. When spelled together, the resulting word receives a complex POS tag in the PPCME2 to reflect its phrasal origin. In the later corpora, such words generally receive a simple tag, except if the modern counterpart receives a compound tag, as in cases like (Q+ONE someone).
A handful of items are treated differently in the later corpora than in the PPCME2.
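
The principles above can be read as a small decision procedure. The following is a minimal sketch only: the function name, its boolean parameters, and the returned labels are hypothetical and not part of the annotation scheme, and any corpus value other than "PPCME2" is assumed to stand for one of the later corpora. The handful of items that differ among the corpora are not covered, and "generally" in the principles means that individual items may deviate.

```python
def treatment(always_split, fused_before_me, spelled_together,
              corpus="PPCME2", modern_compound_tag=False):
    """Return an illustrative label for how an item is tokenized.

    All parameters are hypothetical inputs that an annotator would
    have to decide by hand; they are not features of the corpus files.
    """
    if always_split:
        # Required by the annotation or by lemmatization.
        return "split"
    if fused_before_me:
        # Fused before Middle English.
        return "unitary"
    if not spelled_together:
        # Treated as written: components keep separate POS tags.
        return "separate tags"
    if corpus == "PPCME2" or modern_compound_tag:
        # Spelled together: complex tag reflecting the phrasal origin.
        return "complex tag"
    # Later corpora, no compound tag on the modern counterpart.
    return "simple tag"


print(treatment(False, False, True))
print(treatment(False, False, True, corpus="later"))
print(treatment(False, False, True, corpus="later", modern_compound_tag=True))
```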

Always split

When an orthographic word in the original text belongs to different sentence-level or phrasal constituents, the word is always split. The location of the split is marked by a trailing "@" on the first component and a leading "@" on the second component.

FOR + TO

(FOR for-@) (TO @to)
(FOR for@) (TO @te)

MD + BE

(MD shal@) (BE @be)
(MD wol@) (BE @bee)
(MD wyl@) (BE @be)

MD + NEG

(MD can@) (NEG @not)
(MD ca@) (NEG @nt)
(MD ca@) (NEG @n't)
(MD wo@) (NEG @n't)

MD or verb + PRO

(MD maist@) (PRO @tow)
(VBP Pre@) (PRO @the)
(VBP Pri@) (PRO @thee)
(VBP quoth@) (PRO @a)
(VBP keth@) (PRO @a)		← spelling variant of QUOTHA (= QUOTH HE)

NEG + MD or verb (in Middle English)

(NEG n@) (BEP @is)		← NE + IS
(NEG n@) (BED @aes)		← NE + WAS
(NEG n@) (HVP @abbeo+d)		← NE + HAVETH
(NEG n@) (MD @ulle)		← NE + WILL
(NEG n@) (MD @alde)		← NE + WOULD
(NEG n@) (VBD @uste)		← NE + WUSTE

P + multiword complement or PRO

(PP (P o'@)
    (NP (PRO$ @my) (N life)))

(PP (P on@)
    (NP (PRO @'t)))

PRO + MD or verb

(PRO i@) (MD @challe)		← dialect form of I SHALL 
(PRO it@) (BEP @s)
(PRO it@) (BEP @'s)
(PRO me@) (VBP @thinckes)
(PRO me@) (VBP @thynketh)
(EX ther@) (BEP @s)
(PRO they@) (MD @'l)
(PRO 'T@) (BEP @is)
(PRO t'@) (BEP @is)
(PRO t@) (MD @will)

To facilitate (possibly future) lemmatization, determiners and possessive pronouns that are written together with the following word are also always split off, with very rare exceptions.

D or PRO$ + N

(D A@) (ADJ @wedded) (N mon)
(D t@) (ONE @one)                ← cf. THE TONE
(D t@) (OTHERS @others)          ← cf. THE OTHERS
(D t@) (N @abbesse)
(D +t@) (NPR$ @er+tes)
(D +t@) (NPR @alde) (NPR testament)

(PRO$ mi@) (NPR @lorde)

(WD what-@) (N @time)

Exceptions:

(D+OTHER another) (N way)        ← ANOTHER
(D+ADJ thilke)                   ← THILKE
(D the) (D+ONE tone)             ← TONE < THE ONE
(D the) (D+OTHER tothir)         ← TOTHER < THE OTHER
(PRO$+N yourself)                ← reflexive pronoun
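
Because every split is marked with a trailing "@" on the first component and a leading "@" on the second, the original orthographic word can be rebuilt from the leaves. The following is a minimal sketch, assuming leaves are written as `(TAG word)` with a single space and that "@" occurs only as a split marker; the names `LEAF`, `leaves`, and `rejoin` are illustrative and do not belong to any corpus tool.

```python
import re

# Matches a leaf "(TAG word)"; phrase-level brackets such as "(PP" are
# skipped because they are followed by another "(" rather than a word.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaves(annotation):
    """Return (tag, word) pairs for every leaf in a bracketed string."""
    return LEAF.findall(annotation)


def rejoin(annotation):
    """Rebuild orthographic words from leaves that were split with '@'."""
    words = []
    glue = False  # True when the previous leaf ended in '@'
    for _tag, word in leaves(annotation):
        starts_joined = word.startswith("@")
        ends_joined = word.endswith("@")
        core = word[1:] if starts_joined else word
        core = core[:-1] if ends_joined else core
        if words and glue and starts_joined:
            words[-1] += core  # continue the previous orthographic word
        else:
            words.append(core)
        glue = ends_joined
    return words


print(rejoin("(MD ca@) (NEG @n't)"))
print(rejoin("(PP (P o'@) (NP (PRO$ @my) (N life)))"))
```

Items listed under the exceptions carry no "@" and pass through unchanged.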

Always unitary

The items in this category are generally spelled in the original texts as a single orthographic word. When spelled apart in the original, they are joined by underscores in the annotated version.

Unitary adjective

(ADJ all_mightie)
(ADJ a_lone)
(ADJ back_ward)			← analogously: other adjectives ending in -WARD
(ADJ fore_sayd)
(ADJ glad_ful)
(ADJ inner_most)		← analogously: other adjectives ending in -MOST
(ADJ under_hand)
(ADJ up_right)
(ADJ well_come)

This category includes apparent compounds with 'false participles' (parasynthetic compounds in the terminology used by the OED).

(ADJ feather_footed)
(ADJ ill_natured)
(ADJ mild_hearted)
(ADJ two_toothed)

Unitary adverb or preposition

apon (but not upon)
asswa (but not ALSWA)
unto (but not into)

adverbs ending in -MOST and -WARD

a+det, about, above, abroad, afore, again, against, almost, already, although(inwith), 
altogether, always, alwhat, among, amore, anon, aright, away
before, behind, beneath, beside(s), betime(s), between, betwixt, beyond, bimong,
eftsoon, evermore,
fornigh, forthright,
forto, fromward(tofore), furthermore, furtherover,
henceforward,
intil, inwith,
la(n)hure,
maybe, mayfortune, mayhap, moreover,
na+gtuor+tan, natforthi, ne+taget, nethelatter & variants, nevermore & variants,
   nevertheless & variants, nonetheless & variants, notwithstanding,
onward, outake(n), overal, overmete,
peradventure, percase, perchance, perhaps,
thenceforth, there(to)against, throughout, tilinto, tilto, toeke(n), tofore(hand),
   togains, together, toward, towhether,
umbestunde, underhand, upright,
whatforthi, withal, within, without(forth),
+te+get, +tewhether, +tohhswa+tehh

Unitary noun

Nouns with degree OVER are treated as unitary items. By contrast, other categories with degree OVER are treated as written.

(ADVR+N over_fondness)
(ADVR+N over_Hastinesse)

(N a_do)			← A = northern infinitival marker
(N farewell)
(N inside)
(N outside)
(N to_do)
(N to_morrow)
(N$ to_Morrows)
(N yester_day)
(N$ yester_day's)
(N yester_night)
(NPR Wadenes_day)
(N well_come)

Unitary verb
- Verb with separable/inseparable prefix
  Because separable and inseparable prefixes cannot be reliably distinguished when they precede the verb, we treat all preverbal prefixes as part of the verb. By contrast, separable prefixes that follow the verb are tagged RP.
```
(VBD by_shone)
(VBD to_brake)
(VB with_say)
```
- Verbs with A (overwhelmingly in Middle English)
  In most of these verbs, A is originally a prefix (adding "intensity").
```
(VBP a_kel+t)
(VBD a_resunede)
(VBD a_seide)
(VBP a_turne+t)
```
- Verbs with the perfective prefix GE-, I-, Y-, etc. (only in Middle English)
```
(VB i_heren)
(VAN ycleped)
(VBP +ge_bette)
```

Number

All numbers, whether cardinal or ordinal, are treated as unitary (even in Middle English).

(NUM a_hundred_and_fifty)
(NUM two_thousand_and_twenty_three)
(NUM two=m=_twen=tie=_three)
(NUM twenty-three)
(NUM three_and_twenty)
(NUM .xx_iij.)
(ADJ one_and_fiftieth)
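
Since underscores in unitary items mark the places where the original text had a word division, the original spelling can be recovered from the annotated token. This is a minimal sketch, assuming that an underscore never occurs in a token for any other reason; the function names are illustrative.

```python
def was_spelled_apart(token):
    """True if the annotated token joins parts spelled apart in the text."""
    return "_" in token


def original_spelling(token):
    """Restore the word divisions of the original text."""
    return token.replace("_", " ")


for token in ["three_and_twenty", "yester_day's", "ycleped"]:
    print(token, was_spelled_apart(token), original_spelling(token))
```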

Treated as written

A- words (A < IN, ON) (including the A HUNTING construction)

(PP (P a)                           (PP (P+ADV+WARD abackward))
    (ADVP (ADV+WARD backward)))

ABOARD

(PP (P a)                           (PP (P+RP adown))
    (ADVP (RP down)))

(PP (P a)                           (PP (P+ADJ afar))
    (ADJP (ADJ far)))

(PP (P a)                           (PP (P+N ahunting))
    (NP (N hunting)))

(PP (P a)                           (PP (P+N alive))
    (NP (N live)))

AMID

(PP (P a)                           (PP (P+N asleep))
    (NP (N sleep)))

(PP (P a)                           (PP (P+ADJ asunder))
    (ADJP (ADJ sunder)))

(PP (P a)                           (PP (P+NUM atwo))
    (NP (NUM two)))

analogously: ABED, ADAY, AFIRE, AFOOT, AFRESH, AMORROW, ANIGHT, APACE, APIECE, ASIDE, A+TRE, ...

Compound with degree OVER + ADJ, ADV, Q

Note: Nouns with degree OVER are treated as unitary.

(ADJP (ADVR over) (ADJ ripe))		(ADJP (ADVR+ADJ overripe))

(QP (ADVR over) (Q muche))		(QP (ADVR+Q over-much))

Compound with THERE- or WHERE-

(PP (ADVP (ADV there))			(PP (ADV+P therewith))
    (P with))

(WPP (WADV where)			(WPP (WADV+P whereby))
     (P by))

Other

 
ALSWA

(NP (D an) (OTHER othyr) (N man))   (NP (D+OTHER another) (N man))

(PP (P at)		   	    (PP (P+ADV atonce))
    (ADVP (ADV once)))

(PP (P be)			    (PP (P+N bycaus)
    (NP (N cause)                       (CP-ADV ...))
        (CP-THT ...)))                                          ← note different dash tags

(PP (P before)			    (PP (P+N beforehand))	← analogously: AFOREHAND, 
    (NP (N hand)))                                                             BEHINDHAND

(NP-TMP (ADV before) (N time))	    (NP-TMP (ADV+N beforetime))	← analogously: BEFORETIMES

(NP (ADJ English) (N man))	    (NP (ADJ+N Englishman))     ← analogously: DUTCHMAN, 
                                                                               FRENCHMAN,
                                                                               GENTLEMAN,
                                                                               NOBLEMAN

FORASMUCH                                                       ← analogously: INASMUCH,
                                                                               INSOMUCH

(PP (P for)			    (PP (P+ADV forever))
    (ADVP (ADV ever)))

(PP (P for)			    (PP (P+N forsooth))		← analogously: INSOOTH
    (NP (N sooth)))

(PP (P for) (D thi)		    (PP (P+D forthi)
    (CP-ADV ...))		        (CP-ADV ...))

(PP (P for) (WADV whi)		    (PP (P+WADV forwhi)		← when used as subordinator
    (CP-ADV ...))		        (CP-ADV ...))

(PP (P in)			    (PP (P+N indeed))
    (NP (N deed)))

(PP (P in)			    (PP (P+N instead))
    (NP (N stead)))

(PP (RP in)			    (PP (P into)                ← unlike unitary unto
    (P to)                              (NP ...))
    (NP ...))

(PP (P o')			    (PP (P+N o'clock))
    (NP (N clock)))

(PP (RP up)			    (PP (P up-on)               ← unlike unitary apon
    (P on)                              (NP ...))
    (NP ...))

Differences among the corpora

Item	PPCME2	Later corpora
AFTERNOON	Phrasal (PP (P after) (NP (N noon)))	Unitary N (NP (D the) (N after_noon))
ALL BE IT, ALBEIT (see ALBEIT in concessive clauses)	Split (Q all) (BEP be) (PRO it)	Unitary ADV or P (ADV/P albeit), (ADV/P al_be_it)
HOW BE IT, HOWBEIT (see HOWBEIT in concessive clauses)	Split (WADV how) (BEP be) (PRO it)	Unitary ADV or P (ADV/P howbeit), (ADV/P how_be_it)
TODAY, TONIGHT	Phrasal (PP (P to) (NP (N day))) (PP (P to) (NP (N night)))	Unitary N (N to_day), (N$ to_day's) (N to_night), (N$ to_night's)