In most of the examples below, the issues concerning tokenization arise from our annotation conventions. See Clausal conjunction and Direct speech for details. The final section covers several further minor cases.
Throughout, token breaks are indicated by a blank line.
Where possible, interjections and vocatives are incorporated into adjacent
tokens, rather than split off.
In plays, where possible, a character's name and the line that the character
then speaks count as a single token.
Where possible, headings and titles are split into separate tokens.
Where possible, material tagged with CODE counts as a separate token.
However, font change indications are never separated from their associated
text.
In diaries and letters, the place and date of writing each count as
separate tokens.
Conjunction-related
No token break with conjoined simple verbs
Julius Caesar came , saw , and conquered . ← conjoined intransitive verbs
( (IP-MAT (NP-SBJ (NPR Julius) (NPR Caesar))
(VBD (VBD came) (, ,) (VBD saw) (, ,) (CONJ and) (VBD conquered))
(. .)))
They spotted , approached , and greeted the queen . ← conjoined transitive verbs with shared argument
( (IP-MAT (NP-SBJ (PRO They))
(VBD (VBD spotted) (, ,) (VBD approached) (, ,) (CONJ and) (VBD greeted))
(NP-OB1 (D the) (N queen))
(. .)))
Token break with VP conjunction
Because our annotation scheme in general has no VP node, we cannot treat
cases like the following as a single token with VP conjunction. Instead, we
are forced to treat the two conjuncts as separate tokens.
They sang ,

and danced the polka .
( (IP-MAT (NP-SBJ (PRO They))
(VBD sang)
(. ,)))
( (IP-MAT (CONJ and)
(NP-SBJ *con*)
(VBD danced)
(NP-OB1 (D the) (N polka))
(. .)))
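As a rough illustration of how the token count can be read off the bracketed trees, the following Python sketch treats each outermost unlabeled wrapper "( ... )" as one token. The function names (parse_tokens, root_label, leaves) are hypothetical and not part of any corpus tool, and the sketch assumes that no word in the text is itself a parenthesis.

```python
import re

def parse_tokens(text):
    """Parse bracketed text into nested lists, one list per token wrapper."""
    # Parentheses are structural; everything else is split on whitespace.
    pieces = re.findall(r"\(|\)|[^\s()]+", text)
    stack, tokens = [], []
    for piece in pieces:
        if piece == "(":
            stack.append([])
        elif piece == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                tokens.append(node)  # closed an outermost wrapper: one token
        else:
            stack[-1].append(piece)
    return tokens

def root_label(token):
    """Label of the tree inside the unlabeled wrapper, e.g. IP-MAT."""
    return token[0][0]

def leaves(node):
    """Words of a tree, skipping empty categories such as *con*."""
    children = [child for child in node if isinstance(child, list)]
    if not children:
        word = node[-1]  # terminal node: [tag, word]
        return [] if word.startswith("*") else [word]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

# The VP-conjunction example from above (hypothetical variable name).
sample = """
( (IP-MAT (NP-SBJ (PRO They))
(VBD sang)
(. ,)))
( (IP-MAT (CONJ and)
(NP-SBJ *con*)
(VBD danced)
(NP-OB1 (D the) (N polka))
(. .)))
"""

tokens = parse_tokens(sample)
print(len(tokens))                      # 2
print([root_label(t) for t in tokens])  # ['IP-MAT', 'IP-MAT']
print(" ".join(leaves(tokens[1])))      # and danced the polka .
```

Run on the conjoined simple verb examples earlier in this section, the same sketch would report a single token, since each of those trees sits inside one wrapper.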
A tricky case
They came and left ten years later . ← one token if the two events occur in the same year
( (IP-MAT (NP-SBJ (PRO They))
(VBD (VBD came) (CONJ and) (VBD left))
(ADVP-TMP (NP-MSR (NUM ten) (NS years))
(ADVR later))
(. .)))
They came , ← two tokens if the two events are separated by a decade

and left ten years later .
( (IP-MAT (NP-SBJ (PRO They))
(VBD came)
(. ,)))
( (IP-MAT (CONJ and)
(NP-SBJ *con*)
(VBD left)
(ADVP-TMP (NP-MSR (NUM ten) (NS years))
(ADVR later))
(. .)))
Direct speech
No token break with first direct speech clause
Caesar said , " I came . "
( (IP-MAT (NP-SBJ (NPR Caesar))
(VBD said)
(PUNC ,)
(PUNC ")
(IP-MAT-SPE (NP-SBJ (PRO I))
(VBD came))
(PUNC ")
(PUNC .)))
Token break with continuation
Caesar said , " I came ,

I saw ,

I conquered . "
( (IP-MAT (NP-SBJ (NPR Caesar))
(VBD said)
(PUNC ,)
(PUNC ")
(IP-MAT-SPE (NP-SBJ (PRO I))
(VBD came))
(PUNC ,)))
( (IP-MAT-SPE (NP-SBJ (PRO I))
(VBD saw)
(PUNC ,)))
( (IP-MAT-SPE (NP-SBJ (PRO I))
(VBD conquered)
(PUNC .)
(PUNC ")))
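One consequence of this convention is that continuation tokens are rooted directly in IP-MAT-SPE rather than in a matrix clause containing the verb of saying. The following Python sketch lists such tokens. It rests on two assumptions that go beyond the text above: that every token wrapper opens with "( (" at the start of a line, as in the examples here, and that a root label ending in -SPE marks a continuation of direct speech. The function names are hypothetical.

```python
import re

def root_labels(text):
    """Return the root label of each token, in order."""
    # Assumption: a token wrapper opens with "( (" at the start of a line.
    return re.findall(r"^\(\s+\((\S+)", text, flags=re.MULTILINE)

def speech_continuations(text):
    """Indices of tokens whose root label ends in -SPE."""
    return [i for i, label in enumerate(root_labels(text))
            if label.endswith("-SPE")]

# The continuation example from above (hypothetical variable name).
sample = '''( (IP-MAT (NP-SBJ (NPR Caesar))
(VBD said)
(PUNC ,)
(PUNC ")
(IP-MAT-SPE (NP-SBJ (PRO I))
(VBD came))
(PUNC ,)))
( (IP-MAT-SPE (NP-SBJ (PRO I))
(VBD saw)
(PUNC ,)))
( (IP-MAT-SPE (NP-SBJ (PRO I))
(VBD conquered)
(PUNC .)
(PUNC ")))'''

print(root_labels(sample))           # ['IP-MAT', 'IP-MAT-SPE', 'IP-MAT-SPE']
print(speech_continuations(sample))  # [1, 2]
```

The first token is not flagged, because its IP-MAT-SPE is embedded under the matrix IP-MAT rather than standing at the root.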
Clause-adjoined relative
For simplicity, we indicate only relevant details of the structures below.
No token break in simple case
She baked brownies , which made everyone jump for joy .
( (IP-MAT (She baked brownies)
(PUNC ,)
(CP-CAR (which made everyone jump for joy))
(PUNC .)))
No token break with first direct speech clause
She announced , " I will bake brownies , " which made everyone jump for joy .
( (IP-MAT (She announced)
(PUNC ,)
(PUNC ")
(IP-MAT-SPE (I will bake brownies))
(PUNC ")
(PUNC ,)
(CP-CAR (which made everyone jump for joy))
(PUNC .)))
Token break with continuation of direct speech
She announced , " I will bake brownies ,

and I will start right now , "

which made everyone jump for joy .
( (IP-MAT (She announced)
(PUNC ,)
(PUNC ")
(IP-MAT-SPE (I will bake brownies))
(PUNC ,)))
( (IP-MAT-SPE (and I will start right now)
(PUNC ,)
(PUNC ")))
( (CP-CAR (which made everyone jump for joy)
(PUNC .)))
Other cases
Interjections and vocatives
Plays
Falstaff . ← period does not trigger token break
It is I .
Headings and titles
The remaining cases below can be thought of as extending the principle of
default high attachment to tokenization.
Section 102 , ← heading

Soldering tips ← separate token for title
CODE material
<text>/CODE
<heading>/CODE
Chapter 22
<$$heading>/CODE
Here is a sentence.

<P_234>/CODE ← separate token

And here is another one.

But lo,
<P_235>/CODE ← part of surrounding token
the preceding page number is not
a separate token !
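The blank-line convention for token breaks can be sketched in Python as follows. This is a minimal illustration, not a corpus tool: the function names are hypothetical, and the placement of the blank lines in the sample is our assumption about how the example above is meant to be divided (the first page number on its own, the second inside the surrounding token).

```python
import re

def split_tokens(text):
    """Split running text into tokens at blank lines."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [chunk.splitlines() for chunk in chunks]

def is_code_only(token_lines):
    """True if every line of the token is material tagged with CODE."""
    return all(line.strip().endswith("/CODE") for line in token_lines)

# Assumed blank-line placement for the page-number example.
sample = """Here is a sentence.

<P_234>/CODE

And here is another one.

But lo,
<P_235>/CODE
the preceding page number is not
a separate token !"""

tokens = split_tokens(sample)
print(len(tokens))                        # 4
print([is_code_only(t) for t in tokens])  # [False, True, False, False]
```

Only the first page number comes out as a token of its own. The second cannot be split off without also breaking the sentence around it, so it stays inside that token.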
Diaries and letters
( (NP-LOC (NPR London)))
( (NP-TMP (NPR June) (ADJ 21st) (NUM 1706)))
( (CP-QUE-MAT (NP-VOC (ADJS Dearest) (NPR Delilah))
(, ,)
(WADVP-1 (WADV How))
(IP-SUB (ADVP *T*-1)
(BEP are)
(NP-SBJ (PRO you)))
(. ?)))