Penn Parsed Corpora of Historical English
The Penn Parsed Corpora of Historical English consist of running texts
and text samples of British English prose across its history, from the
earliest Middle English documents up to the First World War. The series
comprises three corpora:
- the Penn-Helsinki Parsed Corpus of Middle English, second edition
(PPCME2),
- the Penn-Helsinki Parsed Corpus of Early Modern English
(PPCEME), and
- the Penn Parsed Corpus of Modern British English, second edition
(PPCMBE2).
The texts come in three forms: simple text, part-of-speech tagged text
and syntactically annotated text. The syntactic annotation (parsing)
permits searching not only for words and word sequences, but also for
abstract syntactic structures. All of the annotation has been carefully
reviewed by expert human annotators for accuracy and consistency. The
corpora are designed for the use of students and scholars of the history
of English, especially the historical syntax of the language, and they
are publicly available to individuals, research groups, and libraries.
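
To make concrete what searching for a syntactic structure rather than a word sequence involves, here is a minimal Python sketch. It assumes the parsed texts use Penn Treebank-style labelled bracketing. The sample sentence, its labels, and the helper names (parse_tree, find_dominating, leaves) are made up for illustration; they are not taken from the corpora or from any search tool mentioned on this page.

```python
def parse_tree(text):
    """Parse one labelled-bracket tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return read_node()


def find_dominating(node, parent_label, child_label):
    """Yield each node labelled parent_label that has an immediate
    child constituent labelled child_label."""
    label, children = node
    subtrees = [c for c in children if isinstance(c, tuple)]
    if label == parent_label and any(c[0] == child_label for c in subtrees):
        yield node
    for c in subtrees:
        yield from find_dominating(c, parent_label, child_label)


def leaves(node):
    """Return the words of a node in left-to-right order."""
    words = []
    for c in node[1]:
        if isinstance(c, tuple):
            words.extend(leaves(c))
        else:
            words.append(c)
    return words


# Hypothetical example tree; the labels are illustrative only.
SAMPLE = ("(IP-MAT (NP-SBJ (PRO He)) (VBD gave) "
          "(NP-OB2 (PRO her)) (NP-OB1 (D a) (N book)))")

tree = parse_tree(SAMPLE)
hits = list(find_dominating(tree, "IP-MAT", "NP-OB2"))
for hit in hits:
    print(" ".join(leaves(hit)))
```

The query here asks for clauses that immediately dominate an indirect-object noun phrase, a pattern that cannot be expressed as a fixed word sequence. A dedicated search program supports far richer queries than this sketch does.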
The 2016 release adds 2 million words to the Modern British English
corpus, for a total of 3 million words, and includes a substantial
number of corrections to the other corpora in the series. In addition,
several small changes have been made to streamline the
annotation guidelines.
As of July 2020, the 2016 release of the Penn Parsed Corpora of
Historical English is being distributed by the Linguistic Data
Consortium (LDC). The LDC catalog number is LDC2020T16.
Potential new users, whether individuals or institutions, should contact
the LDC at ldc AT ldc DOT upenn DOT edu. Past users who wish to update
license agreements dating from before the 2016 release should do the
same, making clear in their request that they are past licensees.
If you already hold a license for the 2016 release, the new mode
of distribution does not affect you, except that you will no longer
be charged annual subscription fees.
Please note that the local web server that was distributed with the 2016
release on CD-ROM is not being distributed by the LDC and is no longer
being maintained. Users who have installed the web server are free to
continue to use it. The corpora can also be searched using Corpus
Search, an open source program written by Beth Randall, which can be
downloaded from its SourceForge project webpage. Note, however, that the
documentation is no longer maintained there; it is now maintained
elsewhere.
Questions concerning the annotation or search of the PPCHE should be
sent to Beatrice Santorini at beatrice DOT santorini AT gmail DOT com.
Reports of annotation errors should be sent to the same address, so
that we can continue to improve the quality of the corpora.
Acknowledgments
- The PPCME2 was created with the support of the National Science
Foundation (Grants BNS 89-19701 and SBR 95-11368), with supplementary
support from the University of Pennsylvania Research Foundation.
- The PPCEME was created with the support of the National Endowment
for the Humanities (Grant PA 23382-99) and the National Science
Foundation (Grant BCS 99-05488).
- The PPCMBE2 was created with the support of the National Science
Foundation (Grants BCS 05-08731 and BCS 11-47499).
With respect to the above-listed grants, any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Endowment for the Humanities or the National Science Foundation.
Byland Abbey, Yorkshire. It was at abbeys like Byland,
throughout Britain, that the manuscripts on which our knowledge of
Middle English is based were largely written, copied, and preserved. The
monastic orders that built and inhabited these monasteries were
dissolved by Henry VIII, whereupon the buildings were dismantled for
building materials by the landlords who succeeded to the monastic
estates. Most of the abbeys' manuscripts were lost, but some came into
private hands and so survived. Photo © A. Kroch 1998.