It is wise to distinguish between a private (work-in-progress) version of the corpus and a public release version. The public release version should conform to the relevant annotation guidelines (apart from known issues). The private version can (and in general does) contain more information than is available in the public version. In particular, it can contain provisional distinctions that facilitate the correction phase and that are collapsed in the public version. It can also serve as a testbed for annotation ideas and implement those ideas inconsistently or otherwise imperfectly. In understanding the relation between the private and the public version, it is important to recognize that, as we know from historical linguistics, mergers are irreversible. Once a distinction is merged in the private version, it is not recoverable. Collapsing distinctions and losing information in the private version is therefore not a step to take lightly, since restoring the information would by definition require duplicating the human effort involved in annotating the distinction the first time around.
For the foreseeable future, human correction of parsed corpora is necessary. However, compared to automatic processing, it is both time-consuming and error-prone. When building a corpus, it is therefore wise to be on the lookout for ways to add information at early stages of processing in order to inform later stages and to minimize human case-by-case review at those later stages.
When adding information, it is extremely useful to think in terms of
normalization in the mathematical sense. Normalized structures are ones
where conceptually similar cases are treated in a uniform and
exceptionless way. Though normalized structures may diverge from
standard orthographic convention, they are well worth introducing since
they are conceptually and computationally so much more tractable than
their non-normalized counterparts.
In audio-aligned corpora, punctuation is not part of the underlying
audio signal and must be added. In addition to its obvious purpose of
providing conventional transcripts for human consumption, the process of
adding punctuation to the text (henceforth, "punctuating") is fertile
ground for adding information of the type just mentioned. Conventions
concerning punctuation can greatly facilitate or inhibit downstream
stages of processing and correcting, and it is therefore sensible and
efficient to introduce conventions that diverge from standard
orthographic practice when they result in normalized representations
that are not completely outlandish-looking. By way of example, we
discuss question marks and quotation marks below.
Question marks. In a series of questions that would
standardly form a single orthographic sentence, standard orthographic
practice would generally delimit the constituent questions (except for
the last) using commas. It is also possible to delimit each individual
constituent question using a question mark. The latter option is the
normalized form (and in this particular case contains more information,
though that is not always true of normalized forms), and it is therefore
the one we implement.
In connection with dangling conjunctions followed by BREAK, provisional
token-internal question marks should be used to guide the parser in
postulating a root CP-QUE-MAT. The token-internal question mark can be
replaced by a comma in the published version of the corpus if desired.
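The replacement of provisional token-internal question marks lends itself to a script that runs when the public version is generated. The following is a minimal sketch, assuming that each token is available as a bracketed string and that question marks appear as terminals of the form (PUNC ?). The function name is hypothetical and not part of any existing tool.

```python
import re

# A (PUNC ?) counts as token-final if only whitespace and closing
# brackets follow it. Any other (PUNC ?) is token-internal.
TOKEN_INTERNAL_QUE = re.compile(r"\(PUNC \?\)(?!(?:\s|\))*$)")


def demote_internal_question_marks(token: str) -> str:
    """Replace token-internal question marks by commas (assumed convention)."""
    return TOKEN_INTERNAL_QUE.sub("(PUNC ,)", token)
```

A token-final question mark is left untouched, so ordinary questions keep their punctuation.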
Quotation marks. Another example concerns the addition of
quotation marks in places where standard (modern) orthographic practice
does not have them. See Double quotes for the annotation convention and
Quotations below for considerations of implementation. Other ways that
punctuation can be used to inform later stages of the annotation, not
necessarily related to normalization, are discussed below.
Information can also be added at the tagging stage. Here are two examples:
In principle, empty terminals can be added at the punctuating stage,
but the tagging stage is probably more sensible from an ergonomic point
of view.
SEMI-COLON (;) is used to delimit integrated adnominal IP-MATs. Once it
has served its function, it is replaced by COMMA.
It is possible to shave some time off the addition of double quotes
at the punctuating stage by adding only initial quotes in tokens that
will eventually be dominated by a root QTP. Any automatic procedure
that encloses tokens delimited by initial and final quotation marks in
QTP can be revised to add a final quotation mark in addition to the QTP.
But it is possible to err on the side of too much cleverness, and it may
be wiser and less confusing to everyone concerned to add both open and
close quotes explicitly by hand.
It is important to understand that once the QTP brackets are part of
the parsed structure, the quotation marks themselves can be deleted
without loss of information. If they are kept, they are redundant. If
deleted, they can be restored algorithmically (that is, by a script and
without case-by-case human review).
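To illustrate why the quotation marks are recoverable from the brackets alone, here is a minimal sketch of such a restoration script. It assumes that trees are bracketed strings, that the label is exactly QTP followed by a space, and that quotation marks are represented as (PUNC ") terminals. All three are assumptions made for illustration, and the function name is hypothetical.

```python
def restore_quotes_from_qtp(tree: str, quote: str = '"') -> str:
    """Insert a quote terminal just inside both edges of every QTP."""
    out = []
    stack = []  # one entry per open bracket: True if that bracket is a QTP
    i = 0
    while i < len(tree):
        ch = tree[i]
        if ch == "(":
            is_qtp = tree.startswith("(QTP ", i)
            stack.append(is_qtp)
            if is_qtp:
                # Emit the label and the opening quote together.
                out.append(f"(QTP (PUNC {quote}) ")
                i += len("(QTP ")
                continue
        elif ch == ")":
            # Closing a QTP: add the closing quote before the bracket.
            if stack and stack.pop():
                out.append(f" (PUNC {quote})")
        out.append(ch)
        i += 1
    return "".join(out)
```

Because the script relies only on bracket matching, it needs no information beyond the QTP label itself.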
A workaround is to provisionally replace EXCLAMATION POINT in the
private version of the corpus with some innocuous expression that is not
part of the corpus itself (say, XBANGX) and to replace XBANGX by the
punctuation mark in the script that generates the public release
version of the corpus.
An alternative is to use XBANGX during the parsing correction phase
to attach an appropriate dash tag to the root of the token in question,
yielding, say, IP-MAT-EXL. Once this is done, XBANGX can be replaced by
PERIOD, since the information it contains is now represented on the dash
tag.
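If the dash tag is the same in every case, this alternative could in principle be scripted. The sketch below is only an illustration under several assumptions: each token is a bracketed string wrapped in an outer pair of brackets, the placeholder appears as the terminal (PUNC XBANGX), and the dash tag is EXL. The function name is hypothetical.

```python
import re

# Matches the outer wrapper bracket plus the root label, e.g. "( (IP-MAT".
ROOT_LABEL = re.compile(r"^\(\s*\((\S+)")


def record_exclamation(token: str, dash_tag: str = "EXL") -> str:
    """Move the information carried by XBANGX onto the root label."""
    if "(PUNC XBANGX)" not in token:
        return token
    # Append the dash tag to the root label only (count=1).
    token = ROOT_LABEL.sub(lambda m: f"{m.group(0)}-{dash_tag}", token, count=1)
    # The placeholder is now redundant and becomes an ordinary PERIOD.
    return token.replace("(PUNC XBANGX)", "(PUNC .)")
```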
Using punctuation to guide the parser
( (CP-QUE-MAT (WNP-1 (WPRO What))
(IP-SUB (MD would)
(NP-SBJ (PRO you))
(VP (VB like)
(NP-OB1 *T*-1)))
(PUNC ?))) ← like this; not COMMA or PERIOD
( (CP-QUE-MAT (CONJ And)
(WADVP-1 (WADV when))
(IP-SUB (MD would)
(NP-SBJ (PRO you))
(VP (VB like)
(IP-ECM (NP-SBJ (PRO it))
(VP (ADVP-TMP *T*-1)
(VAN delivered)))))
(PUNC ?)))
( (CP-QUE-MAT (IP-SUB (BEP Are)
(NP-SBJ (PRO they))
(VP (VAG coming)))
(PUNC ?)
(CONJ or)
(CODE <BREAK>)))
red/ADJ
,/CONJ ← CONJ rather than PUNC marks left CONJP boundary
white/ADJ
,/CONJ ← tagging as CONJ here is redundant
and/CONJ
blue/ADJ
He/PRO
said/VBD
0/C-THT ← silent head of ordinary THAT complement (CP-THT)
he/PRO
would/MD
buy/VB
the/D
book/N
0/WPRO ← silent relative pronoun
you/PRO
recommended/VBD
./PUNC
so/ADVR
loud/ADJ
0/C-DEG ← silent head of degree complement (CP-DEG)
it/PRO
hurt/VBD
my/PRO
ears/NS
Marking constructions during corpus building
Compound words
Since compound word formation is productive, it is not really possible
to list all compound words in a
script to ensure that they are surrounded by the properly labelled
brackets in the parsed structure. Moreover, it would be extremely
useful to have compound words already identified as such in
the input to the parser. A simple solution is to use a
designated symbol - say, CARET (^) - in order to join words temporarily,
as illustrated in the flowchart below:
community
college
two
thousand
three
hundred
and
sixty
four
point
three
United
States
attorney
general

community^college
two^thousand^three^hundred^and^sixty^four^point^three
United^States
attorney^general

community^college/N
two^thousand^three^hundred^and^sixty^four^point^three/NUM
United^States/NPRS
attorney^general/N ← exception to Righthand Head Rule

(N community^college)
(NUM two^thousand^three^hundred^and^sixty^four^point^three)
(NPRS United^States)
(N attorney^general)

(N-COMP (N community) (N college))
(NUM-COMP (NUM two) (NUM thousand) (NUM three) (NUM hundred) (NUM and) (NUM sixty) (NUM four) (NUM point) (NUM three))
(N-COMP (NPR United) (NPRS States)) ← number mismatch within compound
(N-COMP (N attorney) (ADJ general)) ← exception to Righthand Head Rule
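The final step of the flowchart, in which a caret-joined terminal is expanded into a labelled compound, can be sketched as follows. This is an illustration rather than the actual script: the function and its parameters are hypothetical. By default every part inherits the tag of the whole and the compound label is that tag plus -COMP. Exceptions such as the last two cases above must be supplied explicitly, since they cannot be derived from the joined form alone.

```python
def expand_compound(tag, word, part_tags=None, label=None):
    """Expand a caret-joined terminal into a bracketed compound."""
    parts = word.split("^")
    if len(parts) == 1:
        # Not a compound: leave the terminal as it is.
        return f"({tag} {word})"
    tags = part_tags if part_tags is not None else [tag] * len(parts)
    if len(tags) != len(parts):
        raise ValueError("need exactly one tag per part of the compound")
    compound_label = label if label is not None else f"{tag}-COMP"
    inner = " ".join(f"({t} {p})" for t, p in zip(tags, parts))
    return f"({compound_label} {inner})"
```

For example, `expand_compound("N", "community^college")` yields the first compound in the last stage above, while `expand_compound("N", "attorney^general", part_tags=["N", "ADJ"])` handles the exception to the Righthand Head Rule.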
Elaborations and clause-adjoined constituents
COLON (:) can be used to demarcate elaborations (ELAB) or
clause-adjoined (-CAR) constituents. Depending on each corpus's
conventions, COLON is either retained in the public version or replaced
by COMMA.
( (NP (D a)
(N problem)
(PUNC :)
(ELAB (NP (NP-POS (PRO$ her))
(N inability)
(IP-INF (TO to)
(VP (VB play)
(NP-OB1 (N tennis))))))))
( (IP-MAT (NP-CAR (D a)
(ADJP (ADJ good))
(N example))
(PUNC :)
(NP-SBJ (QP (Q some))
(N people))
(VP (MD do@)
(NEG @n't)
(VP (VB own)
(NP-OB1 (D a)
(N car))))
(PUNC .)))
Elided forms
APOSTROPHE can be used to mark elided forms, notably word-initial
truncation. This can be useful to distinguish an elided form like 'MOST
(< ALMOST) from the otherwise homonymous quantifier MOST and to
assist the tagger in giving them the proper tags (ADV and QS,
respectively). Depending on each project's transcription conventions,
the elided forms may be replaced by the full forms in the public corpus,
or the elided forms with APOSTROPHE may be retained.
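For projects that replace elided forms in the public corpus, the replacement can be table-driven. The sketch below assumes a project-specific lookup table. The table contents and the function name are hypothetical, with only the 'MOST example taken from the text above.

```python
# Hypothetical project-specific table of elided forms and their full forms.
FULL_FORMS = {"'most": "almost"}


def expand_elided(words):
    """Replace apostrophe-marked elided forms by their full forms."""
    return [FULL_FORMS.get(word.lower(), word) for word in words]
```

Because the lookup key includes the APOSTROPHE, the quantifier MOST is never affected.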
Integrated adnominal IP-MAT
Quotations (QTP)
In the parsed structure, direct
speech is enclosed in QTP brackets.
Since current parsers are unable to tell whether a particular sequence
is direct speech or not, it is useful to delimit the left and right
periphery of (what will eventually become) the QTP by adding DOUBLE
QUOTEs. The annotator can add DOUBLE QUOTEs at any time before the
parsing stage, probably most conveniently during punctuating. Since the
transcript needs to be punctuated anyway, no significant human effort is
added. But if marking direct speech and adding QTP is left to the
parsing correction stage, the human effort is considerable, since the
annotator there has to essentially duplicate the effort of the
punctuating pass by reading each token to determine if it is an instance
of direct speech.
Splitting and joining words
In general, splitting and joining words should be done by script and not
by hand. However, for cases that have been overlooked by the relevant
scripts, splitting and joining can be done at any stage of processing in
accordance with the applicable
conventions. It goes without saying that the cases should be added
to the relevant scripts to expedite future processing.
you ← need to join
all
don't ← need to split
you=all ← joined
do@ ← split
@n't
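A script for the two cases illustrated above might look like the following minimal sketch. It assumes a one-word-per-item list as input, a hypothetical table of word pairs to be joined with EQUALS, and the rule that every word ending in n't is split before the n. A real script would need to follow the applicable conventions in full.

```python
# Hypothetical list of adjacent word pairs that are joined with "=".
JOIN_PAIRS = {("you", "all")}


def join_and_split(words):
    """Join listed word pairs and split off contracted negation."""
    out = []
    i = 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in JOIN_PAIRS:
            # Join the two words, preserving their original case.
            out.append("=".join(words[i:i + 2]))
            i += 2
            continue
        word = words[i]
        if word.lower().endswith("n't") and len(word) > 3:
            # don't -> do@ @n't
            out.extend([word[:-3] + "@", "@" + word[-3:]])
        else:
            out.append(word)
        i += 1
    return out
```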
Tag questions
TAG is used as a provisional disfluency marker to enclose tag questions
(as opposed to other parenthetical questions). Once tag questions are
marked with a TAG dash tag (CP-QUE-TAG), the TAG disfluency marker is
replaced by PAREN.
Problematic characters
Exclamation point
As noted in the section on punctuation, the use
of EXCLAMATION POINT (affectionately referred to as "bang" in
programming circles) is deprecated as it interacts with the syntax of
shells and therefore often leads to problems in connection with regular
expression searches.
Quotation marks
Quotation marks, both single and double, can raise more or less serious
complications for formatting and searching in the course of corpus
building. The workaround is to replace them in the private corpus with
innocuous sequences (SINGLE_QUOTE and DOUBLE_QUOTE, SQUO and DQUO, or
the like) and to restore them in the public version.
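The masking and restoring can be done by a pair of inverse functions. This is a minimal sketch, assuming that quotation marks are separate items in a one-word-per-item list and that the sequences DQUO and SQUO never occur as words of the corpus itself. The function names are hypothetical.

```python
MASKS = {'"': "DQUO", "'": "SQUO"}
UNMASKS = {mask: quote for quote, mask in MASKS.items()}


def mask_quotes(words):
    """Private version: replace standalone quotation marks by placeholders."""
    return [MASKS.get(word, word) for word in words]


def unmask_quotes(words):
    """Public version: restore the quotation marks."""
    return [UNMASKS.get(word, word) for word in words]
```

Only standalone quotation marks are masked, so apostrophes inside words such as @n't are left alone.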