# Description of a Corpus of Character References in German Novels [DROC Deutsches ROman Corpus] - DROC
## Motivation
Nowadays, large collections of literary texts are available in many languages and enable new approaches in literary studies such as network analysis, topic modeling or stylometry. In particular, the analysis of networks of literary characters has become either a goal in itself or a building block in larger contexts [Elson, Dames, McKeown 2010, Moretti 2011, Gyeong-Mi Park ...

... To the best of our knowledge there is no comparable corpus available to the acad ...

The paper is structured as follows: first, we give a brief overview of existing corpora for named entities and coreference resolution, followed by a description of the textual sources of the fragments. We continue with a detailed description of our annotation guidelines and the annotation process, including the inter-annotator agreement (IAA). We then explain the two formats in which we release our data and conclude with a brief description of the statistics found in our corpus.
## Related Work
A comparison with existing corpora for coreference resolution yields a number of related resources, though none in the domain of (German) literary texts. In this section we restrict the presentation to German and English corpora available for academic research, starting with the latter.

The best-known corpora for English were released in the scope of the MUC-6 and MUC-7 conferences [TODO,TODO]. Each of these corpora comprises about 30,000 tokens and contains articles from the Wall Street Journal (WSJ) and reports on airplane crashes. The corpus released for ACE-2005 [TODO] has about 400,000 annotated tokens and contains a mix of news, blog and web articles. With about 1,500,000 tokens, OntoNotes 5.0 [TODO] is currently the largest available resource for coreference resolution; it comprises news articles, conversations and web articles.

For German there are currently two corpora available. The first is the Potsdam Commentary Corpus [TODO], comprising 33,000 tokens in 176 newspaper commentaries. The other resource for German coreference resolution is the TüBa-D/Z corpus, released by the University of Tübingen. This part of the corpus consists of 1,700 newspaper articles with about 640,000 tokens.

This overview shows that there is currently no resource for (German) literary texts, and that most documents in the existing resources are much shorter than an average novel. Longer texts give rise to new phenomena that statistical methods have to account for, which underlines the importance of releasing DROC, a resource comprising 90 fragments of German novels published between 1650 and 1950. We know of no other resource that has manually marked direct speech passages along with the respective speaker and addressee, of which DROC has more than 2,000.
## Description of the textual sources
The texts of the novels that form the basis of our corpus come from the TextGrid repository [TextGrid 2015], a large collection of German literary texts available as full texts. The texts in this repository stem from one of the first large-scale digitization projects for the German language. The digitization was undertaken in separate steps over the course of ten years by a commercial company, Directmedia, which sold the digital texts on CDs and DVDs. It is important to understand that the TextGrid collection comprises three different groups of texts. The first group, by far the largest, consists of canonized texts of German literature. These are usually based on scholarly editions that academics have used for decades. In most editions the spelling has been normalized; in our context this mainly means that “th” has been replaced by “t” (for example “Tür” instead of “Thür”) and “ey” by “ei” (for example “sei” instead of “sey”).

The second group was part of a collection called Deutsche Literatur von Frauen (German literature by women), which tried to collect as much literature by female authors as possible. Because many of these texts are not part of the literary canon, there are no scholarly editions of them, and the creators of the collection had to base their digital texts on first prints or unchanged reprints of first prints. The collection is therefore not balanced or representative of the literary production of the period it covers. The collection is free of copyright and was released in TEI markup on TextGridRep some years ago under a very generous Creative Commons license (CC BY 3.0).
## Creation of the corpus
The corpus DROC comprises 90 fragments of different novels. The novels were randomly selected from the 450 novels available in the TextGrid repository. We applied the Apache OpenNLP sentence detection component [TODO], trained on the TIGER corpus [TODO], to annotate sentence boundaries in the 90 selected novels. Then, for each novel, we randomly sampled a sentence index and extended the selection in both directions until the beginning and the end of the surrounding chapter were reached. On some occasions, where no structural information about chapters was available, our annotators manually selected sentences that indicate the beginning of a coherent passage in the novel and therefore serve as artificial borders. The resulting fragments had an average length of 201 sentences. We implemented this procedure because we wanted to make sure that for all references either the proper nouns or the common nouns were part of the selected sentences.

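
The sampling step described above can be sketched as follows. This is a minimal illustration, not the code used to build DROC: the function names, the representation of chapters as a sorted list of sentence indices (`chapter_starts`, assumed to begin with 0) and the half-open `(start, end)` result are all assumptions made for this example.

```python
import bisect
import random


def chapter_span(index, chapter_starts, n_sentences):
    """Return (start, end) of the chapter containing sentence `index`.

    `chapter_starts` is a sorted list of sentence indices at which a chapter
    begins (hypothetical input, first entry assumed to be 0); `end` is exclusive.
    """
    pos = bisect.bisect_right(chapter_starts, index) - 1
    start = chapter_starts[pos]
    if pos + 1 < len(chapter_starts):
        end = chapter_starts[pos + 1]
    else:
        end = n_sentences
    return start, end


def sample_fragment(n_sentences, chapter_starts, rng):
    """Sample a random sentence index and extend it to its whole chapter."""
    index = rng.randrange(n_sentences)
    return chapter_span(index, chapter_starts, n_sentences)


if __name__ == "__main__":
    # Invented novel with 500 sentences and three chapters.
    print(sample_fragment(500, [0, 120, 310], random.Random(42)))
```
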
The annotation process can be described as follows:

First, we preprocessed the documents with a rule-based script, developed with UIMA Ruta [TODO], in order to generate suggestions that both of our annotators could later either accept or change. Our corpus was therefore created semi-automatically, with these suggestions as initial support.

We annotated our novels in ATHEN, a self-developed desktop application based on the Eclipse RCP 4 framework. The perspective for character reference annotation can be seen in figure 1.

[TODO figure]

After our annotators finished their pass over the documents, the resulting inconsistencies were resolved jointly in order to obtain a clean version of the annotations.
## Annotation Guidelines
We describe our annotation guidelines in three steps. First we describe which references were annotated, followed by a description of the resulting phenomena we had to deal with in terms of coreference resolution. We conclude the guidelines section with our guidelines for the annotation of direct speech utterances along with their speakers and addressees.
### Annotated character references
The annotation of character references follows a single rule:

Mark every text snippet in the novel that references a (literary) character.

Furthermore, we decided not to mark the complete nominal phrase surrounding the reference; instead, we only marked the heads of the phrases. An example is given in figure 2.

Following this rule, the resulting phrases can be classified into the following subcategories:

1. Proper nouns

   Proper nouns, for example forenames, surnames or family names. These names can also refer to entities that are not part of the fictional world (e.g. another author, historical persons, etc.). In our schema, the text snippets representing proper nouns are marked as “Core”. Sometimes a “Core” snippet is only a part of a reference (shown in figure 2, where “von Padden” is the Core snippet of “Ritterschaftsrätin von Padden”).

2. Heads of common noun phrases

   A head of a common noun phrase can be an arbitrary composite consisting of:

   * Occupational titles (e.g. “Bäcker” - “baker”)
   * Relational expressions (e.g. “Mutter” - “mother”)
   * Gender terms (e.g. “Mann” - “man”)
   * Other titles (e.g. “Graf” - “earl”)
   * Action terms (e.g. “Spaziergänger” - “stroller”)
   * Pejorative terms (e.g. “Idiot” - “idiot”)
   * Nominalized verbs (e.g. “Rufende” - “shouter”)
   * Nominalized adjectives (e.g. “Schöne” - “beauty”)

   This listing is not complete, which shows the complexity of this class. Annotations of this kind were marked as “AppTdfW” (Appellativ, Teil der fiktionalen Welt; appellative, part of the fictional world) if they are part of the fictional world, or as “AppA” (Appellativ, Abstraktum; appellative, abstract) if they refer to generic or abstract entities that are not part of the fictional world.

3. Pronouns

   This category, marked as “Pron”, comprises all sorts of pronouns. The most prominent examples are:

   * Personal pronouns (e.g. “er”, “sie” - “he”, “she”)
   * Possessive pronouns (e.g. “seine”, “ihre” - “his”, “her”)
   * Reflexive pronouns (e.g. “sich” - “himself”, “herself”, “themselves”)
   * Relative pronouns (e.g. “der”, “die” - “who”)

For each resulting character reference, we marked the following features:

* Type: one of “Core”, “Pron”, “AppTdfW” or “AppA”, as described above.

* Range: (used only for cores) the span of character offsets that identifies the core text snippet.

* Number: singular or plural.

* ID: a unique identifier for each entity appearing in the text, used to represent coreference.

* Pseudo: a flag indicating that the person is mentioned in the text but does not really take part in the action or does not exist in reality. An example of this case is “War nicht auch Cromwell erst in hohem Alter nach vergeudeter Jugend erweckt worden zum Dienste Gottes?” (“Was not Cromwell, too, only awakened to the service of God in old age, after a squandered youth?”, Bleibtreu: Größenwahn). Both Cromwell and Gott (God) are marked as pseudo, because neither takes part in the action of this novel.

* Uncertain: a boolean flag that the annotator could set if the decision was unclear.
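
As an illustration, the feature set listed above could be modelled as a small record type. This is only a sketch: the class name, the field names and the validation rules are assumptions made for this example and do not reproduce the type system shipped with DROC.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REFERENCE_TYPES = ("Core", "Pron", "AppTdfW", "AppA")
NUMBERS = ("singular", "plural")


@dataclass(frozen=True)
class CharacterReference:
    """One annotated character reference (hypothetical field names)."""

    begin: int                 # character offset where the reference starts
    end: int                   # character offset where the reference ends
    type: str                  # one of REFERENCE_TYPES
    number: str                # "singular" or "plural"
    entity_id: int             # shared by all mentions of the same entity
    core_range: Optional[Tuple[int, int]] = None  # only used for "Core"
    pseudo: bool = False       # mentioned, but not part of the action
    uncertain: bool = False    # annotator was unsure about the decision

    def __post_init__(self):
        if self.type not in REFERENCE_TYPES:
            raise ValueError(f"unknown reference type: {self.type}")
        if self.number not in NUMBERS:
            raise ValueError(f"unknown number: {self.number}")
        if self.core_range is not None and self.type != "Core":
            raise ValueError("core_range is only used for Core references")


text = "In Beiden klang die Stimmung von Tristan und Isolde nach."
begin = text.index("Tristan")
tristan = CharacterReference(
    begin=begin,
    end=begin + len("Tristan"),
    type="Core",
    number="singular",
    entity_id=5,
    core_range=(begin, begin + len("Tristan")),
    pseudo=True,
)
```
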
### Annotated coreferences
Given the character references defined above, our annotators had the task of assigning a unique identifier to each entity in the text and of reusing this ID for each mention of that entity.

To enable an easier comparison of DROC with existing corpora with annotated coreference, we go through a selected list of coreferential linguistic phenomena and state whether we marked them as coreferent or not.

* Coordination and plural references: Plural references are included if the phrase that is required to mark them does not consist of multiple smaller references. Our annotations are therefore not hierarchical.

* Split antecedents: Split antecedents are not marked.

* Expletives: Expletives are not included in DROC.

* Appositions and predicatives: References in apposition as well as references in predicative position are (usually) marked as coreferent.

* Bridging anaphora: Bridging anaphora are not marked within DROC.

* Discourse: Whether an entity is discourse-new has to be derived from the ID feature of the references.

We conclude this section with a prototypical example taken from DROC:

“Bekannte (ID=1, AppTdfW, plural) traten zu ihnen (ID=2, pron, plural) heran und das Gespräch war unterbrochen. Michael (ID=3, core) fuhr mit Käthe (ID=4, core) in einer offenen Droschke, in der milden Märznacht, nach Hause. Ihre (ID=4, pron) Blicke hingen am gestirnten Himmel, die seinen (ID=3, pron) an ihrem (ID=4, pron) Antlitz. In Beiden (ID=2, pron, plural) klang die Stimmung von Tristan (ID=5, core, pseudo) und Isolde (ID=6, core, pseudo) nach.”
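
Since discourse status and coreference chains are only implicit in the ID feature, a short sketch may help to show how they can be recovered. The mention list below transcribes the surface forms and IDs of the example above in text order; the function names are assumptions made for this illustration.

```python
# (surface form, entity ID) pairs in text order, transcribed from the example.
MENTIONS = [
    ("Bekannte", 1), ("ihnen", 2), ("Michael", 3), ("Käthe", 4), ("Ihre", 4),
    ("seinen", 3), ("ihrem", 4), ("Beiden", 2), ("Tristan", 5), ("Isolde", 6),
]


def discourse_new_flags(mentions):
    """Flag a mention as discourse-new if its entity ID was not seen before."""
    seen = set()
    flags = []
    for _, entity_id in mentions:
        flags.append(entity_id not in seen)
        seen.add(entity_id)
    return flags


def coreference_clusters(mentions):
    """Group the surface forms of all mentions by their entity ID."""
    clusters = {}
    for surface, entity_id in mentions:
        clusters.setdefault(entity_id, []).append(surface)
    return clusters


if __name__ == "__main__":
    for (surface, _), is_new in zip(MENTIONS, discourse_new_flags(MENTIONS)):
        print(surface, "new" if is_new else "given")
    print(coreference_clusters(MENTIONS))
```
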
### Borderline cases

During the annotation process we encountered a number of borderline cases, some of which are explained below.

Usually the annotated entities are human beings, but in some cases animals or even objects can play an important role in the plot of a story. Such protagonists are annotated as well, as shown in the following example: “Eine Woche später und der Alraun war in seiner Art völlig ausgewachsen, etwa dreieinenhalben Fuß hoch;” (“A week later, the mandrake was fully grown for its kind, about three and a half feet tall;”)

The mandrake (“Alraun”) is labeled as an entity because, in the course of the story, the plant comes to life, is named Cornelius Nepos and is able to move and talk. It therefore becomes an important agent.

Sometimes a novel is interrupted by a passage of stichomythia, which then resembles a drama. In this case the names that introduce the speeches are regarded as entities, e.g.

“Einsiedel: Wie heißest du?

Simpl.: Ich heiße Bub.”

(“Hermit: What is your name? Simpl.: My name is Boy.”)

Both Einsiedel and Simpl. (short for Simplicissimus) are tagged as named entities.

In rare cases a definitive decision is not possible, because of missing knowledge or imprecise references. In the following example, it cannot be clearly determined what “man” (“someone”) refers to. It could refer to a concrete person or group, to humankind in general or to nothing at all.

“Sollte ich etwa mit gebundenen Händen immer weiter zusehen, wie man mir mein Leben zertritt, bis die Jugend vorbei ist und alles zu spät?” (“Was I supposed to keep watching with my hands tied while my life was being trampled on, until my youth was over and everything was too late?”)
### Annotated direct speech
Direct speech, as well as every other text section enclosed by single or double quotation marks, is annotated. Such annotations range from the opening quotation mark to the closing one, both included. In most cases these are French quotation marks (guillemets), less frequently dashes. Each annotation is assigned one (or, in rare cases, more than one) speaker and addressee character reference. If it was not possible to determine who speaks or who is addressed, they are marked as “unknown”.

The annotation process obeys strict rules. If the speaker and addressee references are connected to the relevant direct speech by a communication verb, these references are labelled. If not, we looked for direct forms of address within the direct speech that are not pronouns (e.g. “..., my dear friend”). If no such form exists either, the last mention of the speaker and/or addressee that lies outside of any direct speech was annotated, regardless of whether it is a noun or a pronoun.

Besides real direct speech, every text section within quotation marks is annotated. These might be names of places, quotations or thoughts. In these cases the category, which is set to “directspeech” by default, was changed. The following categories are defined: “thought”, “citation” (e.g. quotations of absent characters or of other fictional works), “fictionalspeech” (speech of a text entity that is not labeled as a named entity, e.g. “my heart says…”, “roses say…”), “name” (e.g. place names) and “other” (if further classification is not possible, e.g. a word highlighted with quotation marks by the author for emphasis).

Sometimes direct speech is not marked by quotation marks. In rare cases, it is introduced by dashes or carries no marker at all.
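
The priority order of the attribution rules described in this section can be summarized in a few lines of code. This sketch only mirrors the order of the rules; the annotation itself was done manually, and the function name, the parameter names and the record layout are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CATEGORIES = ("directspeech", "thought", "citation", "fictionalspeech", "name", "other")
UNKNOWN = "unknown"


def attribute(verb_linked: Optional[str] = None,
              direct_address: Optional[str] = None,
              last_outside: Optional[str] = None) -> str:
    """Pick a speaker or addressee reference following the rule order.

    1. a reference connected to the speech by a communication verb,
    2. a non-pronominal direct form of address inside the speech,
    3. the last mention outside of any direct speech,
    4. otherwise "unknown".
    """
    for candidate in (verb_linked, direct_address, last_outside):
        if candidate is not None:
            return candidate
    return UNKNOWN


@dataclass
class DirectSpeech:
    """One quoted passage (hypothetical field names)."""

    begin: int
    end: int
    category: str = "directspeech"
    speakers: List[str] = field(default_factory=list)
    addressees: List[str] = field(default_factory=list)


speech = DirectSpeech(begin=0, end=42)
speech.speakers.append(attribute(last_outside="er"))
speech.addressees.append(attribute(direct_address="Freund", last_outside="sie"))
```
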
## Inter-Annotator Agreement
There are multiple ways to measure inter-annotator agreement (IAA). We used 12 documents that were labelled by both of our annotators, starting from the same initial conditions, and measured the IAA for character reference annotation and for coreference resolution.

For evaluating the quality of the character reference annotation we only took the annotated span into account and calculated Cohen's kappa [TODO]. We did this on a per-token basis and converted the output of each annotator into a sequence of B-I-O labels. A per-token measurement rewards the annotators for agreeing that a token is not a character reference, in addition to rewarding them for marking the same span. This yields 31,185 instances and resulted in a kappa (κ) of 94.3%.

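
To make the per-token measurement concrete, the following sketch converts annotated spans into B-I-O labels and computes Cohen's kappa from two label sequences. It is a minimal reimplementation for illustration, not the evaluation code from the DROC repository; the function names and the token-index span format (end exclusive) are assumptions, and the toy spans are invented.

```python
from collections import Counter


def spans_to_bio(n_tokens, spans):
    """Convert (start, end) token spans (end exclusive) into B-I-O labels."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy document with six tokens, labelled by two annotators.
annotator_a = spans_to_bio(6, [(0, 1), (3, 5)])
annotator_b = spans_to_bio(6, [(0, 1), (3, 4)])
kappa = cohens_kappa(annotator_a, annotator_b)
```
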
On the same documents, we measured the IAA of the assigned coreference clustering with MUC-6 and B-Cubed scores [TODO]. We obtained a MUC-6 F1 of 88.5% and a B-Cubed F1 of 69%. Since both evaluation metrics require the number of references to be equal, we added references where necessary and treated them as singletons. The B-Cubed metric penalizes singleton clusters, which explains the much lower score compared to the MUC evaluation. Removing unmatchable annotations yields a MUC-6 F1 of 92.4% and a B-Cubed F1 of 76%. The documents were annotated by one annotator, and afterwards both annotators revised the documents together to guarantee a corpus of high quality.

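
For readers unfamiliar with the two metrics, the sketch below gives a compact implementation of the link-based MUC score and the mention-based B-Cubed score. It assumes, as described above, that both annotations cover the same set of mentions, and it represents each annotation as a mapping from mention to entity ID. The function names and the toy data are assumptions made for this illustration; this is not the evaluation code from the DROC repository.

```python
def _clusters(assignment):
    """Turn a mention -> entity ID mapping into a list of mention sets."""
    clusters = {}
    for mention, entity_id in assignment.items():
        clusters.setdefault(entity_id, set()).add(mention)
    return list(clusters.values())


def _f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def muc_recall(key, response):
    """Link-based recall: how many key links survive in the response."""
    numerator = denominator = 0
    for cluster in _clusters(key):
        partitions = {response[mention] for mention in cluster}
        numerator += len(cluster) - len(partitions)
        denominator += len(cluster) - 1
    return numerator / denominator if denominator else 0.0


def muc_f1(key, response):
    # Precision is recall with the roles of key and response swapped.
    return _f1(muc_recall(response, key), muc_recall(key, response))


def b_cubed_f1(key, response):
    """Mention-based score averaged over all mentions."""
    key_of = {m: cluster for cluster in _clusters(key) for m in cluster}
    res_of = {m: cluster for cluster in _clusters(response) for m in cluster}
    n = len(key)
    recall = sum(len(key_of[m] & res_of[m]) / len(key_of[m]) for m in key) / n
    precision = sum(len(key_of[m] & res_of[m]) / len(res_of[m]) for m in key) / n
    return _f1(precision, recall)


# Invented toy example: five mentions, clustered differently by two annotators.
annotator_a = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
annotator_b = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
```
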
The code for the evaluation, as well as the documents that were used for the measurements, can be downloaded from DROC's git repository.
## Release Formats
DROC is available in two formats. The first format is XMI. XMI files are the standard serialization of Apache UIMA and come with a type system definition, which is required to open DROC.

The second format is TEI-XML [TODO]. This section gives a brief overview of the representation used within these formats. A more thorough definition can be found on the homepage of the project Kallimachos [TODO].
### XMI format

In the UIMA format, each annotation is stored with at least two features: a begin feature indicating the character offset where the annotation starts, and an end feature indicating where the annotation ends. Additionally, each annotation has its own type, defined in a separate XML descriptor file. For DROC we defined two types:

Type NamedEntity:

This type represents a character reference. One annotation is created for every character reference. Table TODO gives an overview of the features used.

[TODO table]

Type DirectSpeech:

[TODO table]
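
The stand-off principle of the XMI files can be illustrated with a few lines of Python. The sample document below is invented: the `droc` namespace URI and the `ID` attribute name are placeholders for this sketch, and the real names have to be taken from the type system definition shipped with the corpus. Only the begin and end features and the type name NamedEntity come from the description above. Offsets are used here as plain Python string indices, which is assumed to be sufficient for the sample text.

```python
import xml.etree.ElementTree as ET

# Invented sample; namespace URI of the `droc` prefix and the `ID` attribute
# are assumptions, not the real DROC type system.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:droc="http:///example/droc.ecore"
         xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text" sofaString="Michael fuhr mit Käthe nach Hause."/>
  <droc:NamedEntity xmi:id="10" sofa="1" begin="0" end="7" ID="3"/>
  <droc:NamedEntity xmi:id="11" sofa="1" begin="17" end="22" ID="4"/>
</xmi:XMI>"""


def read_references(xmi_text, type_name="NamedEntity"):
    """Return (covered text, entity ID) pairs for all annotations of a type."""
    root = ET.fromstring(xmi_text)
    document_text = ""
    offsets = []
    for element in root:
        local_name = element.tag.rsplit("}", 1)[-1]  # strip the namespace
        if local_name == "Sofa":
            document_text = element.get("sofaString", "")
        elif local_name == type_name:
            offsets.append(
                (int(element.get("begin")), int(element.get("end")), element.get("ID"))
            )
    return [(document_text[b:e], entity_id) for b, e, entity_id in offsets]


references = read_references(SAMPLE_XMI)
```
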
### TEI-XML

The second format in which DROC is available is TEI-XML. Within the `<body>` element of each document, a sequence of `<w>` elements is added, one for each token. Character references are encoded using the `<persName>` element, and direct speech utterances using the `<quote>` element with embedded speech elements `<sp>` that point to the speaker of the utterance. Sentence and paragraph borders have been added as virtual elements at the end of each document. We used the “prev” attribute of the persName element to refer to the first appearance of the entity that a character reference belongs to. Speakers are encoded using the “who” attribute, which refers to the xml:id of the speaking character reference.
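
A short sketch shows how coreference clusters could be rebuilt from the “prev” pointers. The sample below is invented and heavily simplified; in particular, the `#id` pointer syntax of the “prev” values and the nesting of `<w>` inside `<persName>` are assumptions that should be checked against the released files.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented, simplified sample document.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<persName xml:id="p1"><w>Michael</w></persName> <w>fuhr</w> <w>mit</w>
<persName xml:id="p2"><w>Käthe</w></persName> <w>.</w>
<persName xml:id="p3" prev="#p2"><w>Ihre</w></persName> <w>Blicke</w>
</body></text></TEI>"""


def entity_clusters(tei_text):
    """Group persName surface forms by the xml:id of their first mention."""
    root = ET.fromstring(tei_text)
    clusters = {}
    for pers_name in root.iter(TEI + "persName"):
        own_id = pers_name.get(XML_ID)
        # A reference without "prev" is treated as the first mention itself.
        first_id = pers_name.get("prev", "#" + own_id).lstrip("#")
        surface = " ".join(w.text for w in pers_name.iter(TEI + "w"))
        clusters.setdefault(first_id, []).append(surface)
    return clusters


tei_clusters = entity_clusters(SAMPLE_TEI)
```
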
## Corpus Statistics
DROC contains 90 fragments of different novels. The corpus comprises about 393,000 tokens, as determined by the tokenizer script of the TreeTagger [TODO]. On average, each fragment consists of 4368±2334 tokens and 202±131 sentences. We manually annotated 52079 character references, the majority of which (65%, 34060) are pronouns. About 23% (12005) of the references have the type “appellative” assigned and the remaining 12% (6013) are “core” references.

These 52081 references are clustered into 5288 entities, which amounts to about 10 references per entity on average and 59±31 entities per document.

Compared to the statistics from the study in [Kabadjov 2007], pronouns appear more frequently in DROC, with a proportion of 65% compared to the 44% reported by Kabadjov, while the share of proper nouns is almost constant, with a small increase from 10% to 12% in DROC.

35 of the fragments were written by female authors and the remaining 55 by male authors, resulting in a slightly imbalanced 40%-60% gender ratio.

[TODO table]
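
The rounded proportions in this section can be recomputed from the raw counts given in the text. The snippet below is a small arithmetic check using only those counts; the variable and function names are chosen for this illustration.

```python
# Raw counts as reported in this section.
counts = {"Pron": 34060, "appellative": 12005, "Core": 6013}
total_references = 52079
entities = 5288
fragments = 90


def percentage(part, whole):
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


shares = {name: percentage(count, total_references) for name, count in counts.items()}
references_per_entity = total_references / entities
entities_per_fragment = entities / fragments

if __name__ == "__main__":
    print(shares)
    print(round(references_per_entity, 1), round(entities_per_fragment, 1))
```
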
## License
The corpus is licensed under the Creative Commons license CC-BY [see XXX]. Please cite this text if you use the corpus in your work.

If you use the corpus, please use the following citation: [TODO]