General introduction


Philosophy and goals

File formats

Each text in the corpora comes in three different formats, each with a characteristic filename extension:

Text files (.txt)

Text files have the extension .txt. Besides the text, they contain text-level codes, inherited from the Helsinki Corpus and converted into HTML-type codes, as described in Text markup. The original page layout is not retained. Rather, the text is divided into tokens, which generally correspond to a main clause together with any subordinate clauses that it contains. Each token has a token ID, enclosed in parentheses, which contains the name of the file, a page reference to the printed text (possibly including a volume reference), and a running token number that locates the token within the computer file. Tokens may also consist entirely of text-level codes. Such tokens do not have IDs in the file, but they are counted by the token counter, which can lead to gaps in the running token numbers.

Material that has been split or separated in the course of annotation (punctuation, contractions, cliticized articles, and so on) remains separated in the text files in order to simplify searches and tallies. See Word tokenization for details.

<P_2>

<heading>

I . (CMMALORY,2.3)

Merlin (CMMALORY,2.4)

<$$heading>

HIT befel in the dayes of Uther Pendragon , when he was kynge of all
Englond and so regned , that there was a myghty duke in Cornewaill that
helde warre ageynst hym long tyme . (CMMALORY,2.6)

and the duke was called the duke of Tyntagil . (CMMALORY,2.7)

And so by meanes kynge Uther send for this duk chargyng hym to brynge
his wyf with hym . (CMMALORY,2.8)

for she was called a fair lady and a passynge wyse . (CMMALORY,2.9)

and her name was called Igrayne . (CMMALORY,2.10)
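
The token ID layout described above can be pulled apart with a short script. The following is a minimal sketch, not part of the corpus tools: the function name split_token and the regular expression are illustrative assumptions, and the pattern assumes that the file name ends at the first comma and that the running token number follows the last period inside the parentheses.

```python
import re

# Assumed shape of a token ID at the end of a token: "(FILENAME,PAGE.NUMBER)".
# PAGE is kept as an opaque string, since it may include a volume reference.
TOKEN_ID = re.compile(
    r"\((?P<file>[^,()\s]+),(?P<page>[^()\s]+)\.(?P<num>\d+)\)\s*$"
)

def split_token(token_text):
    """Return (text, file, page, number) for one token, or None if it has no ID."""
    m = TOKEN_ID.search(token_text)
    if m is None:
        # For example, a token that consists entirely of text-level codes.
        return None
    # Collapse line breaks and repeated spaces in the text part.
    text = " ".join(token_text[:m.start()].split())
    return text, m.group("file"), m.group("page"), int(m.group("num"))

sample = "and the duke was called the duke of Tyntagil . (CMMALORY,2.7)"
print(split_token(sample))
```

Because tokens made up only of text-level codes carry no ID, the sketch returns None for them rather than guessing a number.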

Part-of-speech (POS) tagged files (.pos)

Part-of-speech (POS) tagged texts have the extension .pos. They contain the material in the text files with a POS tag added to each word. Editorial material such as the text-level codes is given the tag CODE, and token IDs are given the tag ID. Each text element is separated from its POS tag by a forward slash (/).

The text is divided into tokens in the same way as in the text files.

<P_2>/CODE

<heading>/CODE

I/NUM ./PUNC CMMALORY,2.3/ID

Merlin/NPR CMMALORY,2.4/ID

<$$heading>/CODE

HIT/PRO befel/VBD in/P the/D dayes/NS of/P Uther/NPR Pendragon/NPR
,/PUNC when/P he/PRO was/BED kynge/N of/P all/Q Englond/NPR and/CONJ 
so/ADV regned/VBD ,/PUNC that/C there/EX was/BED a/D myghty/ADJ duke/N 
in/P Cornewaill/NPR that/C helde/VBD warre/N ageynst/P hym/PRO long/ADJ
tyme/N ,/PUNC CMMALORY,2.6/ID

and/CONJ the/D duke/N was/BED called/VAN the/D duke/N of/P Tyntagil/NPR
./PUNC CMMALORY,2.7/ID

And/CONJ so/ADV by/P meanes/NS kynge/NPR Uther/NPR send/VBD for/P
this/D duk/N chargyng/VAG hym/PRO to/TO brynge/VB his/PRO$ wyf/N with/P
hym/PRO ,/PUNC CMMALORY,2.8/ID

for/CONJ she/PRO was/BED called/VAN a/D fair/ADJ lady/N and/CONJ a/D
passynge/ADV wyse/ADJ ,/PUNC CMMALORY,2.9/ID

and/CONJ her/PRO$ name/N was/BED called/VAN Igrayne/NPR ./PUNC CMMALORY,2.10/ID
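
As an illustration of the word/TAG layout, the sketch below splits one POS-tagged token into (word, tag) pairs and picks out its ID. The function name parse_pos_token is hypothetical, and splitting on the last slash of each item is an assumption made so that a slash inside a text element would not be mistaken for the delimiter.

```python
def parse_pos_token(token_text):
    """Split one .pos token into a list of (word, tag) pairs and its ID (or None)."""
    pairs, token_id = [], None
    for item in token_text.split():
        # Split on the LAST slash, so any slash inside the element itself is kept.
        word, sep, tag = item.rpartition("/")
        if not sep:
            continue  # no delimiter found; ignored in this sketch
        if tag == "ID":
            token_id = word
        else:
            pairs.append((word, tag))
    return pairs, token_id

sample = """and/CONJ the/D duke/N was/BED called/VAN the/D duke/N of/P Tyntagil/NPR
./PUNC CMMALORY,2.7/ID"""
pairs, token_id = parse_pos_token(sample)
print(token_id)
print(pairs)
```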

Parsed files (.psd)

Parsed files have the extension .psd. They contain a labeled bracketing of the text, in which the innermost labeled parentheses around each word repeat the information from the POS-tagged files. The division into tokens in the parsed files is the same as in the text and POS files. Each token is enclosed, together with its ID, in a set of unlabeled parentheses.

( (CODE <BEGIN_cmmalory-m4>))

( (CODE <P_2>))

( (CODE <heading>))

( (LS (NUM I) (PUNC .))
  (ID CMMALORY-M4,2.4))

( (NP (NPR Merlin))
  (ID CMMALORY-M4,2.5))

( (CODE <$$heading>))

( (IP-MAT (NP-SBJ=1 (PRO HIT))
	  (VBD befel)
	  (PP (P in)
	      (NP (D the)
		  (NS dayes)
		  (PP (P of)
		      (NP (NPR Uther) (NPR Pendragon)))))
	  (PUNC ,)
	  (PP (P when)
	      (CP-ADV (C 0)
		      (IP-SUB (IP-SUB (NP-SBJ (PRO he))
				      (BED was)
				      (NP-PRD (N kynge)
					      (PP (P of)
						  (NP (Q all) (NPR Englond)))))
			      (CONJP (CONJ and)
				     (IP-SUB (NP-SBJ *con*)
					     (ADVP (ADV so))
					     (VBD regned))))))
	  (PUNC ,)
	  (CP-THT-1 (C that)
		    (IP-SUB (NP-SBJ=2 (EX there))
			    (BED was)
			    (NP-2 (D a)
				  (ADJ myghty)
				  (N duke)
				  (CP-REL *ICH*-3))
			    (PP (P in)
				(NP (NPR Cornewaill)))
			    (CP-REL-3 (WNP-4 0)
				      (C that)
				      (IP-SUB (NP-SBJ *T*-4)
					      (VBD helde)
					      (NP-ACC (N warre))
					      (PP (P ageynst)
						  (NP (PRO hym)))
					      (NP-MSR (ADJ long) (N tyme))))))
	  (PUNC ,))
  (ID CMMALORY-M4,2.7))

( (IP-MAT (CONJ and)
	  (NP-SBJ-1 (D the) (N duke))
	  (BED was)
	  (VAN called)
	  (IP-SMC (NP-SBJ *-1)
		  (NP-PRD (D the)
			  (N duke)
			  (PP (P of)
			      (NP (NPR Tyntagil)))))
	  (PUNC .))
  (ID CMMALORY-M4,2.8))

( (IP-MAT (CONJ And)
	  (ADVP (ADV so))
	  (PP (P by)
	      (NP (NS meanes)))
	  (NP-SBJ (NPR kynge) (NPR Uther))
	  (VBD send)
	  (PP (P for)
	      (NP (D this) (N duk)))
	  (IP-PPL (VAG chargyng)
		  (NP-ACC (PRO hym))
		  (IP-INF (TO to)
			  (VB brynge)
			  (NP-ACC (PRO$ his) (N wyf))
			  (PP (P with)
			      (NP (PRO hym)))))
	  (PUNC ,))
  (ID CMMALORY-M4,2.9))

( (IP-MAT (CONJ for)
	  (NP-SBJ-1 (PRO she))
	  (BED was)
	  (VAN called)
	  (IP-SMC (NP-SBJ *-1)
		  (NP-PRD (NP (D a) (ADJ fair) (N lady))
			  (CONJP (CONJ and)
				 (NP (D a)
				     (ADJP (ADV passynge) (ADJ wyse))))))
	  (PUNC ,))
  (ID CMMALORY-M4,2.10))

( (IP-MAT (CONJ and)
	  (NP-SBJ-1 (PRO$ her) (N name))
	  (BED was)
	  (VAN called)
	  (IP-SMC (NP-SBJ *-1)
		  (NP-PRD (NPR Igrayne)))
	  (PUNC .))
  (ID CMMALORY-M4,2.11))
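
The labeled bracketing can be read into a nested structure with a small recursive parser. This is a minimal sketch under stated assumptions, not a replacement for the corpus search tools: the names parse_tree and leaves are hypothetical, and the input is assumed to be a single well-formed token whose labels and words contain no parentheses or whitespace.

```python
import re

def parse_tree(text):
    """Parse one bracketed token into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return items

    return node()

def leaves(tree):
    """Yield (tag, word) pairs for the innermost nodes of the form (TAG word)."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        yield tree[0], tree[1]
    else:
        for child in tree:
            if isinstance(child, list):
                yield from leaves(child)

sample = """( (IP-MAT (CONJ and)
    (NP-SBJ-1 (PRO$ her) (N name))
    (BED was)
    (VAN called)
    (IP-SMC (NP-SBJ *-1)
        (NP-PRD (NPR Igrayne)))
    (PUNC .))
  (ID CMMALORY-M4,2.11))"""

tree = parse_tree(sample)
pairs = list(leaves(tree))
print(pairs)
```

Note that empty categories such as *-1 come out as leaves too, and the outer unlabeled parentheses become a list whose two members are the parse and the ID node.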

Text markup

In general, it has not been possible to retain the markup conventions of the Helsinki Corpus in their original form because of conflicts with the annotation system. The major changes made are as follows: