Lexicon creation


What is a lexicon?

A lexicon is a list of the words occurring in a file or corpus. Each word is followed by the total number of times it occurs, and then by each POS tag it occurs with, together with the number of times it occurs with that tag.

For instance, the line:

a-boute 11: [9 P] [1 RP] [1 ADV]
means that the word "a-boute" was found a total of 11 times, 9 times with the POS tag "P", 1 time with the POS tag "RP", and 1 time with the POS tag "ADV".
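
The line format described above can be read mechanically. The following Python sketch is not part of CorpusSearch; the function name parse_lexicon_line and the regular expressions are assumptions made for illustration, based only on the example lines shown in this document.

```python
import re

# One lexicon line: spelling(s), total count, colon, then "[count TAG]" groups.
LINE_RE = re.compile(r"^(?P<spellings>.+?)\s+(?P<total>\d+):\s*(?P<tags>.*)$")
TAG_RE = re.compile(r"\[(\d+)\s+([^\]\s]+)\]")


def parse_lexicon_line(line):
    """Return (spellings, total, {tag: count}) for one lexicon line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon line: {line!r}")
    spellings = match.group("spellings").split()
    total = int(match.group("total"))
    tag_counts = {
        tag: int(count) for count, tag in TAG_RE.findall(match.group("tags"))
    }
    return spellings, total, tag_counts


spellings, total, tag_counts = parse_lexicon_line("a-boute 11: [9 P] [1 RP] [1 ADV]")
print(spellings, total, tag_counts)
# ['a-boute'] 11 {'P': 9, 'RP': 1, 'ADV': 1}
```

The per-tag counts add up to the total (9 + 1 + 1 = 11), which gives a simple consistency check on any parsed line.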

CorpusSearch determines word identity by spelling. Spellings that differ only in capitalization, or in the presence of the initial "$" that marks editorial emendations in the PPCHE, are listed on the same line.

zoroaster Zoroaster ZOROASTER 57: [57 NPR]

is Is IS $is 10689: [10689 BEP]
CorpusSearch performs no other spelling normalization or morphological analysis, so other variant spellings are listed on separate lines.
a-bak 1: [1 P+ADV]
a-bakke 1: [1 P+ADV]

zacari 1: [1 NPR]
zacharie 1: [1 NPR]
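
The grouping rule described above can be sketched in Python as follows. This is an illustration only, not CorpusSearch's own code: the names lexicon_key and group_spellings are hypothetical, and lowercasing via str.lower() is an assumption about how "differ only in capitalization" could be implemented.

```python
from collections import defaultdict


def lexicon_key(word):
    """Key under which spellings share a line: drop one leading "$", ignore case."""
    if word.startswith("$"):
        word = word[1:]
    return word.lower()


def group_spellings(words):
    """Map each key to the distinct spellings seen, in first-seen order."""
    groups = defaultdict(list)
    for word in words:
        variants = groups[lexicon_key(word)]
        if word not in variants:
            variants.append(word)
    return dict(groups)


groups = group_spellings(["is", "Is", "IS", "$is", "a-bak", "a-bakke", "Zoroaster"])
for key, variants in groups.items():
    print(key, variants)
```

Under this sketch "is", "Is", "IS" and "$is" share one key, while "a-bak" and "a-bakke" keep separate keys because no further normalization is applied.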

make_lexicon

This is the basic command that causes a lexicon to be built. A query file containing only the following command generates a lexicon that includes every word in the input file(s):
make_lexicon: t
Lexicon queries do not contain a boundary node. Like any other command file, however, a lexicon query file must have a name ending in .q, and its output is written to a .out file.

For reasons that are unclear, lexicon queries return no output when a label contains square brackets enclosing alternatives. Such expressions need to be reformulated using "or" ("|"). For instance, a POS label like "AD[JV]" needs to be rewritten as "ADJ|ADV", and a text label like "[tT]ha" needs to be rewritten as "tha|Tha".
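
The reformulation can be done mechanically. The Python sketch below is a hypothetical helper, not a CorpusSearch feature: it expands every bracketed set of single characters into "|"-separated alternatives. It assumes the brackets contain only literal characters (no ranges such as "a-z" and no negation).

```python
import itertools
import re

BRACKET_RE = re.compile(r"\[([^\]]+)\]")


def expand_alternative(pattern):
    """Expand every [xyz] group in one alternative into all plain spellings."""
    # re.split with one capture group alternates: literal, group chars, literal, ...
    parts = BRACKET_RE.split(pattern)
    choices = [
        list(part) if index % 2 else [part] for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


def expand_label(label):
    """Rewrite a label so that it uses only "|" and no square brackets."""
    alternatives = []
    for alternative in label.split("|"):
        alternatives.extend(expand_alternative(alternative))
    return "|".join(alternatives)


print(expand_label("AD[JV]"))   # ADJ|ADV
print(expand_label("[tT]ha"))   # tha|Tha
```

A label that already consists of several alternatives is expanded piece by piece, so "*[eiy]th|*[eiy]+t" becomes "*eth|*ith|*yth|*e+t|*i+t|*y+t".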

pos_labels

The lexicon can be restricted to words with certain POS tags. For instance, the following query creates a list of all words labeled as modals or verbs:
make_lexicon: t

pos_labels: MD*|V*

text_labels

The lexicon can be restricted to certain words in the text. For instance, the following query finds all words beginning with "th" or "+t", regardless of case:
make_lexicon: t

text_labels: [tT]h*|+[tT].*

A single query can contain both pos_labels and text_labels. For instance, the following query lists all verbs ending in "[eiy]th" or "[eiy]+t":

make_lexicon: t

pos_labels: V*

text_labels: *[eiy]th|*[eiy]+t

Sample output

Here is sample output that results from running "make_lexicon" on (a superseded version of) the PPCME2. For reasons of space, much of the actual output is omitted.
/*
PREFACE:
CorpusSearch copyright Beth Randall 2000.
Date:  Tue Sep 21 09:55:12 EDT 2004

command file:     lex.q
output file:      lex.out

Lexicon:
*/

/*  ~A~  */
a A $a 3713: [3421 D] [10 FW] [104 HV] [15 VAN21] [24 ADV21] [25 P21] [8 VBD21]
[15 P] [1 RP21] [1 N21] [4 CONJ] [5 VB21] [6 N] [4 ADJ21] [68 INTJ] [1 VBN21] [1 NUM21]
a+gen 15: [9 ADV] [6 P]
a+gennyst 1: [1 P]
a+gens 4: [4 P]
a+genst 2: [2 P]
a+geyne 10: [10 ADV]
a-+gen 63: [52 ADV] [11 P]
a-+gens 12: [12 P]
a-bak 1: [1 P+ADV]
a-bakke 1: [1 P+ADV]
a-baschyd 2: [2 VAN]
a-basshed 1: [1 VAN]
a-basshyd 1: [1 VAN]
a-beyn 1: [1 VB]
a-bod 1: [1 VBD]
a-bode 5: [5 VBD]
a-bood 4: [4 VBD]
a-boode 1: [1 VBD]
a-bouen 1: [1 P]
a-boute 11: [9 P] [1 RP] [1 ADV]
.
.
.
.
.
.
.
/*  ~Z~  */
zacari 1: [1 NPR]
zacharie 1: [1 NPR]
zaram 1: [1 NPR]
zebede 1: [1 NPR]
zelator 1: [1 N]
zelatoris 1: [1 NS]
zele 6: [6 N]
zelose 2: [2 ADJ]
zelously 1: [1 ADV]
zeno 1: [1 NPR]
zenocrates 1: [1 NPR]
zenon 1: [1 NPR]
zepherine 1: [1 NPR]
zorobabel Zorobabel 3: [3 NPR]
zorobabell Zorobabell 4: [4 NPR]
zozime 1: [1 NPR]
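
Output of this shape can be post-processed with a short script. The following Python sketch is an illustration under stated assumptions, not part of CorpusSearch: the name read_lexicon is hypothetical, comments are assumed to be delimited by /* and */ as in the sample, and a line that opens with "[" is assumed to continue the tag list of the preceding entry, as in the entry for "a" above.

```python
import re

ENTRY_RE = re.compile(r"^(.+?)\s+(\d+):\s*(.*)$")
TAG_RE = re.compile(r"\[(\d+)\s+([^\]\s]+)\]")


def read_lexicon(text):
    """Yield (spellings, total, {tag: count}) for each entry in lexicon output."""
    # Drop /* ... */ comment blocks such as the preface and the letter headers.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    logical_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and logical_lines:
            # A line opening with "[" continues the previous entry's tag list.
            logical_lines[-1] += " " + line
        else:
            logical_lines.append(line)
    for line in logical_lines:
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # e.g. the "." lines that mark omitted output in the sample
        tags = {tag: int(count) for count, tag in TAG_RE.findall(match.group(3))}
        yield match.group(1).split(), int(match.group(2)), tags


sample = """/*  ~A~  */
a A $a 3713: [3421 D] [10 FW] [104 HV] [15 VAN21] [24 ADV21] [25 P21] [8 VBD21]
[15 P] [1 RP21] [1 N21] [4 CONJ] [5 VB21] [6 N] [4 ADJ21] [68 INTJ] [1 VBN21] [1 NUM21]
a+gen 15: [9 ADV] [6 P]
.
"""
entries = list(read_lexicon(sample))
for spellings, total, tags in entries:
    # The per-tag counts should add up to the total given before the colon.
    print(spellings, total, sum(tags.values()) == total)
```

For the two entries in this excerpt the per-tag counts sum to the stated totals (3713 and 15), so the check prints True for both.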