Division into sentence tokens


As mentioned in the general introduction, the text in the corpora is divided into tokens, each of which generally corresponds to a main clause together with any subordinate clauses that it contains. In legal texts and statutes, this definition can yield excessively long tokens. In such cases, we sometimes split off WHEREAS clauses and similar units on an ad hoc basis for convenience. In what follows, we ignore these special cases.

In most of the examples below, the issues concerning tokenization arise from our annotation conventions. See Clausal conjunction and Direct speech for details. The final section covers several further minor cases.

Throughout, token breaks are indicated by a blank line.
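
The blank-line convention can be made concrete with a minimal Python sketch. It is an illustration only and not part of any corpus tooling; the function name split_tokens is hypothetical, and the sketch assumes the examples are given as plain text with words separated by whitespace.

```python
import re

def split_tokens(text: str) -> list[str]:
    """Split example text into sentence tokens at blank lines."""
    # A token break is a line that is empty or contains only whitespace.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Line breaks inside a token are not token breaks, so join them with spaces.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

example = """They sang ,

and danced the polka ."""

for token in split_tokens(example):
    print(token)
```

A line break that is not followed by a blank line stays inside the current token, so multi-line tokens are returned as a single string.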

Conjunction-related

No token break with conjoined simple verbs

Julius Caesar came , saw , and conquered .                 ← conjoined intransitive verbs

( (IP-MAT (NP-SBJ (NPR Julius) (NPR Caesar))
          (VBD (VBD came) (, ,) (VBD saw) (, ,) (CONJ and) (VBD conquered))
          (. .)))

They spotted , approached , and greeted the queen .        ← conjoined transitive verbs with shared argument

( (IP-MAT (NP-SBJ (PRO They))
          (VBD (VBD spotted) (, ,) (VBD approached) (, ,) (CONJ and) (VBD greeted))
          (NP-OB1 (D the) (N queen))
          (. .)))

Token break with VP conjunction

Because our annotation scheme generally has no VP node, we cannot treat cases like the following as a single token with VP conjunction. Instead, the two conjuncts are treated as separate tokens.

They sang ,

and danced the polka .

( (IP-MAT (NP-SBJ (PRO They))
          (VBD sang)
          (. ,)))

( (IP-MAT (CONJ and)
          (NP-SBJ *con*)
          (VBD danced)
          (NP-OB1 (D the) (N polka))
          (. .)))
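
As a rough illustration of how such split conjuncts can be recognized mechanically, the following Python sketch reads bracketed trees into nested lists and flags tokens that open with CONJ and contain an elided *con* subject. The helper names parse_trees and is_conjunct_token are hypothetical, and the sketch assumes well-formed bracketing in which no word is itself a parenthesis.

```python
import re

def parse_trees(text):
    """Parse bracketed text into nested lists, one list per token wrapper."""
    stack, trees = [], []
    # Pieces are "(", ")", or any run of non-space, non-bracket characters.
    for piece in re.findall(r"\(|\)|[^\s()]+", text):
        if piece == "(":
            stack.append([])
        elif piece == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)   # attach to the parent node
            else:
                trees.append(node)       # a complete token wrapper
        else:
            stack[-1].append(piece)      # a label or a word
    return trees

def is_conjunct_token(wrapper):
    """True if the token opens with CONJ and has an elided *con* subject."""
    root = wrapper[0]                    # e.g. ["IP-MAT", child, child, ...]
    children = [c for c in root[1:] if isinstance(c, list)]
    starts_with_conj = bool(children) and children[0][0] == "CONJ"
    return starts_with_conj and ["NP-SBJ", "*con*"] in children

example = """
( (IP-MAT (NP-SBJ (PRO They))
          (VBD sang)
          (. ,)))

( (IP-MAT (CONJ and)
          (NP-SBJ *con*)
          (VBD danced)
          (NP-OB1 (D the) (N polka))
          (. .)))
"""

trees = parse_trees(example)
print(len(trees))
print([is_conjunct_token(t) for t in trees])
```

On the example above, the sketch finds two tokens and flags only the second as a conjunct with an elided subject.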

A tricky case

They came and left ten years later .            ← one token if the two events occur in the same year

( (IP-MAT (NP-SBJ (PRO They))
          (VBD (VBD came) (CONJ and) (VBD left))
          (ADVP-TMP (NP-MSR (NUM ten) (NS years))
                    (ADVR later))
          (. .)))

They came ,                                     ← two tokens if the two events are separated by a decade

and left ten years later .

( (IP-MAT (NP-SBJ (PRO They))
          (VBD came)
          (. ,)))

( (IP-MAT (CONJ and)
          (NP-SBJ *con*)
          (VBD left)
          (ADVP-TMP (NP-MSR (NUM ten) (NS years))
                    (ADVR later))
          (. .)))

Direct speech

No token break with first direct speech clause

Caesar said , " I came . "

( (IP-MAT (NP-SBJ (NPR Caesar))
          (VBD said)
          (PUNC ,)
          (PUNC ")
          (IP-MAT-SPE (NP-SBJ (PRO I))
                      (VBD came))
          (PUNC ")
          (PUNC .)))

Token break with continuation

Caesar said , " I came ,

I saw ,

I conquered . "

( (IP-MAT (NP-SBJ (NPR Caesar))
          (VBD said)
          (PUNC ,)
          (PUNC ")
          (IP-MAT-SPE (NP-SBJ (PRO I))
                      (VBD came))
          (PUNC ,)))

( (IP-MAT-SPE (NP-SBJ (PRO I))
              (VBD saw)
              (PUNC ,)))

( (IP-MAT-SPE (NP-SBJ (PRO I))
              (VBD conquered)
              (PUNC .)
              (PUNC ")))
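
One consequence of this convention is that a single quotation can span several tokens, with the opening quotation mark in the first token and the closing one in the last. The following Python sketch illustrates this by reporting which tokens each quotation covers. The function name quotation_spans is hypothetical, and the sketch assumes straight double quotes that appear as separate words, as in the examples here.

```python
def quotation_spans(tokens):
    """Return (first, last) token index pairs, one pair per quotation."""
    spans = []
    start = None      # index of the token holding the opening quote
    inside = False    # whether a quotation is currently open
    for i, token in enumerate(tokens):
        for word in token.split():
            if word != '"':
                continue
            if not inside:
                inside, start = True, i      # opening quotation mark
            else:
                inside = False               # closing quotation mark
                spans.append((start, i))
    return spans

tokens = [
    'Caesar said , " I came ,',
    'I saw ,',
    'I conquered . "',
]
print(quotation_spans(tokens))
```

For the three tokens above, the quotation opens in the first token and closes in the third, so the two continuation tokens fall inside the span.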

Clause-adjoined relative

For simplicity, we indicate only relevant details of the structures below.

No token break in simple case

She baked brownies , which made everyone jump for joy .

( (IP-MAT (She baked brownies)
          (PUNC ,)
          (CP-CAR which made everyone jump for joy)
          (PUNC .)))

No token break with first direct speech clause

She announced , " I will bake brownies , " which made everyone jump for joy .

( (IP-MAT (She announced)
          (PUNC ,)
          (PUNC ")
          (IP-MAT-SPE (I will bake brownies))
          (PUNC ")
          (PUNC ,)
          (CP-CAR which made everyone jump for joy)
          (PUNC .)))

Token break with continuation of direct speech

She announced , " I will bake brownies ,

and I will start right now , "

which made everyone jump for joy .

( (IP-MAT (She announced)
          (PUNC ,)
          (PUNC ")
          (IP-MAT-SPE (I will bake brownies))
          (PUNC ,)))

( (IP-MAT-SPE (and I will start right now)
              (PUNC ,)
              (PUNC ")))

( (CP-CAR (which made everyone jump for joy)
          (PUNC .)))

Other cases

Interjections and vocatives

Where possible, interjections and vocatives are incorporated into adjacent tokens, rather than split off.

Plays

Where possible, a character's name and the line that the character speaks immediately afterwards count as a single token.

Falstaff .                                      ← period does not trigger token break
It is I .

Headings and titles

The remaining cases below can be thought of as extending the principle of default high attachment to tokenization.

Where possible, headings and titles are split into separate tokens.

Section 102 ,                                   ← heading

Soldering tips                                  ← separate token for title

CODE material

Where possible, material tagged with CODE counts as a separate token. However, font change indications are never separated from their associated text.

<text>/CODE

<heading>/CODE

Chapter 22

<$$heading>/CODE

Here is a sentence.

<P_234>/CODE                                    ← separate token

And here is another one.

But lo,
<P_235>/CODE                                    ← part of surrounding token
the preceding page number is not
a separate token !

Diaries and letters

In diaries and letters, the place and date of writing each count as separate tokens.

( (NP-LOC (NPR London)))

( (NP-TMP (NPR June) (ADJ 21st) (NUM 1706)))

( (CP-QUE-MAT (NP-VOC (ADJS Dearest) (NPR Delilah))
              (, ,)
              (WADVP-1 (WADV How))
              (IP-SUB (ADVP *T*-1)
                      (BEP are)
                      (NP-SBJ (PRO you)))
              (. ?)))