|
|
# Description of a Corpus of Character References in German Novels [DROC Deutsches ROman Corpus ] - DROC
|
|
|
**Description of a Corpus of Character References in German Novels [DROC Deutsches ROman Corpus ] - DROC**
|
|
|
|
|
|
## Motivation
|
|
|
# Motivation
|
|
|
Nowadays, large collections of literary texts are available in many languages and enable new approaches in literary studies like network analysis, topic modeling or stylometry.
|
|
|
Especially the analysis of networks of literary characters has become either a goal in itself or a
|
|
|
building block in larger contexts [Elson, Dames, McKeown 2010, Moretti 2011, Gyeong-Mi Park
|
... | ... | @@ -18,7 +18,7 @@ To the best of our knowledge there is no comparable corpus available to the acad |
|
|
The paper is structured as follows: first, a brief overview of existing corpora for named entities and coreference resolution is given, followed by the description of the textual sources of the fragments. We continue with a detailed description of our annotation guidelines and the annotation process, including the inter-annotator agreement (IAA). We then explain the two formats in which we release our data and conclude with a brief description of the statistics found in our corpus.
|
|
|
|
|
|
|
|
|
## Related Work
|
|
|
# Related Work
|
|
|
|
|
|
Within the field of coreference resolution there are a number of related resources, though none in the domain of (German) literary texts. In this section we restrict the presentation to German and English corpora available for academic research, starting with the latter.
|
|
|
The best-known corpora for English were released in the scope of the MUC-6 and MUC-7 conferences [TODO,TODO]. Each of these corpora comprises about 30,000 tokens and contains articles from the Wall Street Journal (WSJ) and articles about airplane crashes. The corpus released for ACE-2005 [TODO] has about 400,000 annotated tokens and contains a mix of news, blog and web articles. With about 1,500,000 tokens, OntoNotes 5.0 [TODO] is currently the largest available resource for coreference resolution and comprises news articles, conversations and web articles.
|
... | ... | @@ -26,12 +26,12 @@ For German there are currently two corpora available. The first is the Potsdam c |
|
|
This overview shows that there is currently no resource for (German) literary texts and that most documents in the existing resources are much shorter than an average novel. Longer texts give rise to new phenomena that statistical methods have to account for, which underlines the importance of releasing DROC, a resource comprising 90 fragments of German novels published between 1650 and 1950. No other resource has manually marked direct speech passages along with the respective speaker and addressee, of which DROC has more than 2,000.
|
|
|
|
|
|
|
|
|
## Description of the textual sources
|
|
|
# Description of the textual sources
|
|
|
The texts of the novels that form the basis of our corpus come from a large collection of German literary texts available as full texts, the TextGrid repository [TextGrid 2015]. The texts found in this repository stem from one of the first large-scale digitization projects in the German language. The digitization was carried out in separate steps over the course of ten years by a commercial company, Directmedia, which sold the digital texts on CDs and DVDs. It is important to understand that the TextGrid collection comprises three different groups of texts. The first group, by far the largest, consists of canonized texts of German literature. These are usually based on scholarly editions that academics have used for decades. In most editions the spelling has been normalized; in our context this mainly means that “th” has been replaced by “t” (for example “Tür” instead of “Thür”) and “ey” by “ei” (for example “sei” instead of “sey”).
|
|
|
|
|
|
The second group was part of a collection called Deutsche Literatur von Frauen (German literature by women), which tried to collect as much literature by female authors as possible. Because many of these texts are not part of the literary canon, there are no scholarly editions of them, and the creators of the collection had to base their digital texts on first prints or unchanged reprints of first prints. Therefore the collection is neither balanced nor representative of the literary production of the period it covers. The collection is free of copyright and was released in TEI markup on TextGridRep some years ago under a very generous Creative Commons license (CC BY 3.0).
|
|
|
|
|
|
## Creation of the corpus
|
|
|
# Creation of the corpus
|
|
|
The corpus DROC comprises 90 fragments of different novels. The novels were randomly selected from the 450 novels available in the TextGrid repository. We applied the Apache OpenNLP sentence detection component [TODO], trained on the TIGER corpus [TODO], to annotate sentence boundaries in the 90 selected novels. Then, for each novel, we randomly sampled a sentence index and extended the selection in both directions until the beginning and the end of the enclosing chapter were reached. On some occasions, where no structural information about chapters was available, our annotators manually selected sentences that indicate the beginning of a coherent passage in the novel and therefore simulate an artificial chapter border. The resulting fragments had an average length of 201 sentences. We implemented this procedure because we wanted to make sure that, for all references, either the proper nouns or the common nouns were part of the selected sentences.
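The sampling step described above can be illustrated with a minimal sketch. It assumes that chapter boundaries are available as a list of sentence indices; the function name `sample_fragment` and the `chapter_starts` input are hypothetical and do not come from the DROC tooling.

```python
import bisect
import random


def sample_fragment(num_sentences, chapter_starts, rng):
    """Pick a random sentence and return it with its enclosing chapter span.

    chapter_starts holds the sentence indices at which chapters begin
    (a hypothetical input; the paper does not specify a data structure).
    Returns (seed, start, end) where `end` is exclusive.
    """
    if num_sentences <= 0:
        raise ValueError("document must contain at least one sentence")
    # Index 0 always opens a span, even if no chapter markup is available.
    starts = sorted({0, *(s for s in chapter_starts if 0 <= s < num_sentences)})
    seed = rng.randrange(num_sentences)
    # Rightmost chapter start that is <= the sampled sentence index.
    pos = bisect.bisect_right(starts, seed) - 1
    start = starts[pos]
    end = starts[pos + 1] if pos + 1 < len(starts) else num_sentences
    return seed, start, end


rng = random.Random(0)
seed, start, end = sample_fragment(20, [0, 5, 12], rng)
print(f"sampled sentence {seed}, fragment covers sentences {start}..{end - 1}")
```

The manual fallback for novels without chapter markup would correspond to supplying annotator-chosen indices in `chapter_starts`.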
|
|
|
|
|
|
The annotation process can be depicted as follows:
|
... | ... | @@ -42,11 +42,11 @@ We annotated our novels in ATHEN, a selfmade desktop application based on the ec |
|
|
|
|
|
After our annotators finished their pass over the documents, resulting inconsistencies were resolved together in order to get a clean version of the annotations.
|
|
|
|
|
|
## Annotation Guidelines
|
|
|
# Annotation Guidelines
|
|
|
|
|
|
We describe our annotation guidelines in three steps. First we describe which references were annotated, followed by a description of the resulting phenomena we had to deal with in terms of coreference resolution. We conclude the guidelines section with our guidelines for the annotation of direct speech utterances along with their speakers and addressees.
|
|
|
|
|
|
### Annotated character references
|
|
|
## Annotated character references
|
|
|
|
|
|
The annotation of character references follows a single rule:
|
|
|
|
... | ... | @@ -93,7 +93,7 @@ Cromwell and Gott, are identified as pseudos, because both are not taking part i |
|
|
* Uncertain: A boolean flag that could be set by the annotator if the decision is unclear
|
|
|
|
|
|
|
|
|
### Annotated coreferences
|
|
|
## Annotated coreferences
|
|
|
|
|
|
Given the definition of character references, our annotators had the task of assigning a unique identifier to each entity in the text and reusing this ID for every mention of that entity.
|
|
|
To make it easier to compare DROC with existing corpora with annotated coreference, we go through a selected list of coreferential linguistic phenomena and state whether we marked them as coreferent or not.
|
... | ... | @@ -119,7 +119,7 @@ We conclude this section with a prototypical example taken from DROC: |
|
|
|
|
|
“Bekannte (ID=1, AppTdfW, plural) traten zu ihnen (ID=2, pron, plural) heran und das Gespräch war unterbrochen. Michael (ID=3, core) fuhr mit Käthe (ID=4, core) in einer offenen Droschke, in der milden Märznacht, nach Hause. Ihre (ID=4, pron) Blicke hingen am gestirnten Himmel, die seinen (ID=3, pron) an ihrem (ID=4, pron) Antlitz. In Beiden (ID=2, pron, plural) klang die Stimmung von Tristan (ID=5, core, pseudo) und Isolde (ID=6, core, pseudo) nach.”
|
|
|
|
|
|
### Borderline cases:
|
|
|
## Borderline cases:
|
|
|
|
|
|
During the annotation process some borderline cases were discovered. Some of them are explained by way of example in the following.
|
|
|
|
... | ... | @@ -135,7 +135,7 @@ In rare cases a definitive decision is not possible - because of lacking knowled |
|
|
“Sollte ich etwa mit gebundenen Händen immer weiter zusehen, wie man mir mein Leben zertritt, bis die Jugend vorbei ist und alles zu spät?”
|
|
|
|
|
|
|
|
|
### Annotated direct speech
|
|
|
## Annotated direct speech
|
|
|
|
|
|
Direct speech and every text section enclosed by single or double quotation marks are annotated. Each annotation ranges from the opening quotation mark to the closing one, both included. In most cases these are French quotation marks, infrequently dashes. Each annotation is assigned one (or, in rare cases, more than one) speaker and addressee, both given as character references. If it was not possible to determine who speaks or who is addressed, they are marked as “unknown”.
|
|
|
|
... | ... | @@ -145,7 +145,7 @@ Next to real direct speeches every text section within quotation marks is annota |
|
|
|
|
|
Sometimes direct speech is not marked by quotation marks. In rare cases it is introduced by dashes or appears without any marker at all.
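As an illustration of the span convention (quotation marks included, speaker and addressee defaulting to “unknown”), the following sketch locates quoted passages automatically. It is only a rough pre-annotation heuristic and not the manual procedure used for DROC; it assumes German-style guillemets (»...«), and the class and function names are hypothetical. Passages introduced by dashes or lacking any marker are not found by it.

```python
import re
from dataclasses import dataclass

# Assumed quotation style: German-style guillemets, opening » and closing «.
QUOTE_PATTERN = re.compile(r"»[^»«]*«")


@dataclass
class DirectSpeech:
    begin: int                   # offset of the opening quotation mark
    end: int                     # offset just past the closing quotation mark
    speaker: str = "unknown"     # to be filled in by an annotator
    addressee: str = "unknown"   # to be filled in by an annotator


def find_quoted_spans(text):
    """Return one DirectSpeech span per passage enclosed in »...«."""
    return [DirectSpeech(m.start(), m.end()) for m in QUOTE_PATTERN.finditer(text)]


text = "Er sagte: »Komm her!« Sie schwieg. »Nein«, rief er dann."
for span in find_quoted_spans(text):
    print(text[span.begin:span.end], span.speaker, span.addressee)
```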
|
|
|
|
|
|
## Inter-Annotator Agreement
|
|
|
# Inter-Annotator Agreement
|
|
|
|
|
|
There are multiple ways to measure inter-annotator agreement (IAA). We used 12 documents that were labelled by both of our annotators, starting from the same initial conditions, and measured the IAA for character reference annotation and for coreference resolution.
|
|
|
|
... | ... | @@ -156,12 +156,12 @@ On the same documents, we measured the IAA of the assigned coreference clusterin |
|
|
|
|
|
The code for the evaluation, as well as the documents that were used for the measurements, can be downloaded from DROC's Git repository.
|
|
|
|
|
|
## Release Formats
|
|
|
# Release Formats
|
|
|
|
|
|
DROC is available in two formats. The first format is XMI. These files are the standard serialization for Apache UIMA and come with a type system definition, which is required to open the DROC files.
|
|
|
The second format is TEI-XML [TODO]. This section gives a brief overview of the representation used in these formats. A more thorough definition can be found on the homepage of the project Kallimachos [TODO].
|
|
|
|
|
|
### XMI-format:
|
|
|
## XMI-format:
|
|
|
|
|
|
In the UIMA format, each annotation is stored with at least two features: a begin feature indicating the character offset where the annotation starts and an end feature indicating where it ends. Additionally, each annotation has its own type, defined in a separate XML descriptor file. For DROC we defined two types:
|
|
|
|
... | ... | @@ -174,12 +174,12 @@ Type DirectSpeech: |
|
|
|
|
|
[TODO table]
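The offset-based representation described in this section can be pictured with a small stand-off annotation sketch. It is not the DROC type system: the class `Annotation` and the type name `CharacterReference` are hypothetical and used for illustration only. The sketch follows the UIMA convention that `end` is exclusive. One caveat: UIMA counts offsets in UTF-16 code units, whereas Python strings are indexed by code point, so offsets can differ for text containing characters outside the Basic Multilingual Plane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type_name: str  # hypothetical type label, not taken from the DROC type system
    begin: int      # character offset where the annotation starts (inclusive)
    end: int        # character offset where the annotation ends (exclusive)

    def covered_text(self, document_text):
        """Return the stretch of the document this annotation points at."""
        return document_text[self.begin:self.end]


document_text = "Michael fuhr mit Käthe nach Hause."
begin = document_text.index("Käthe")
mention = Annotation("CharacterReference", begin, begin + len("Käthe"))
print(mention.covered_text(document_text))
```

Because the annotation stores only offsets, the document text itself stays untouched, and several annotation layers can refer to overlapping spans.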
|
|
|
|
|
|
### TEI-XML:
|
|
|
## TEI-XML:
|
|
|
|
|
|
The second format in which DROC is available is TEI-XML. Within the <body> element of each document, a sequence of <w> elements is added, one for each token. Character references are encoded using the <persName> element, and direct speech utterances using the <quote> element with embedded speech elements <sp> that point to the speaker of the utterance. Sentence and paragraph boundaries are added as virtual elements at the end of each document. We use the “prev” attribute of the persName element to refer to the first appearance of the entity that a character reference belongs to. Speakers are encoded using the “who” attribute, which refers to the xml:id of the speaking character reference.
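The following sketch shows how coreference chains and speakers could be read from such a file with Python's standard library. The embedded sample document is made up for illustration and is not taken from DROC; the sketch also assumes that “prev” and “who” hold references of the form `#id` and that the elements live in the TEI namespace, neither of which is confirmed by the description above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document, not an excerpt from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <persName xml:id="m1"><w>Michael</w></persName> <w>sagte</w>
  <quote><sp who="#m1"><w>Komm</w></sp></quote>
  <persName xml:id="m2" prev="#m1"><w>Er</w></persName> <w>sah</w>
  <persName xml:id="m3"><w>Käthe</w></persName>
  <persName xml:id="m4" prev="#m1"><w>seine</w></persName>
</body></text></TEI>"""


def coreference_chains(root):
    """Group mention ids by the id of the first mention of their entity."""
    chains = defaultdict(list)
    for mention in root.iter(TEI + "persName"):
        own_id = mention.get(XML_ID)
        # A mention without "prev" is taken to be the first one of its entity.
        first_id = mention.get("prev", "#" + own_id).lstrip("#")
        chains[first_id].append(own_id)
    return dict(chains)


def speakers(root):
    """Return the referenced speaker id of every <sp> element, in order."""
    return [sp.get("who", "").lstrip("#") for sp in root.iter(TEI + "sp")]


root = ET.fromstring(SAMPLE)
print(coreference_chains(root))
print(speakers(root))
```

Grouping by the first mention keeps every chain addressable through a single stable id, which matches the way “prev” is described above.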
|
|
|
|
|
|
|
|
|
## Corpus Statistics
|
|
|
# Corpus Statistics
|
|
|
|
|
|
DROC contains 90 fragments of different novels.
|
|
|
The corpus comprises about 393,000 tokens, as determined by the tokenizer script of the TreeTagger [TODO]. On average each fragment consists of 4368±2334 tokens and 202±131 sentences. We manually annotated 52079 character references, of which the majority, about 65% (34060), are pronouns. About 23% (12005) of the references have the type “appellative” assigned and the remaining 12% (6013) are “core” references.
|
... | ... | @@ -189,7 +189,7 @@ Compared to the statistics from the study in [Kabadjov 2007], pronouns in DROC a |
|
|
|
|
|
[TODO table]
|
|
|
|
|
|
## License
|
|
|
# License
|
|
|
|
|
|
The corpus is licensed under the Creative Commons license CC-BY [see XXX]. Please cite this text if you use the corpus in your work.
|
|
|
|
... | ... | |