**Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus]**

# Motivation

Large collections of literary texts are now available in many languages and enable new approaches in literary studies such as network analysis, topic modeling and stylometry. In particular, the analysis of networks of literary characters has become either a goal in itself or a building block in larger contexts [@elson2010; @park2013; @trilcke2013][Moretti 2011-TODO]. In such networks, characters usually constitute the nodes, while their interaction, for example the amount of conversation, is modeled as edges, often weighted by the amount of interaction.

To create such networks, the first step is to find all references to characters in a text. However, applying a state-of-the-art named entity recognizer (NER) such as the Stanford NER [@stamfordNER] is not sufficient to detect all references to a character. A reference can appear in one of three broad syntactic categories: 1) a proper noun; 2) a nominal phrase; 3) a pronoun. Detecting categories 2) and 3) is usually beyond the scope of named entity recognition. Furthermore, current state-of-the-art NER systems were trained on newspaper articles, which leaves another gap to bridge towards the domain of novels. Detecting character references alone is not enough to build a complete pipeline for creating social networks from literary texts; each reference must also be resolved to the entity it denotes in the fictional world of the text, a process called coreference resolution.
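
The network model described above can be illustrated with a minimal sketch. It assumes that character references have already been detected and resolved to entity identifiers, and it uses co-occurrence within a passage as a simple stand-in for interaction. The function name `build_character_network`, the input format and the character names are invented for illustration and are not part of DROC.

```python
from collections import Counter
from itertools import combinations


def build_character_network(passages):
    """Build a weighted, undirected character network.

    `passages` is a list of lists; each inner list holds the entity ids of
    the characters referenced in one passage (an assumed input format).
    Nodes are characters; an edge weight counts the passages in which
    both characters occur together.
    """
    edges = Counter()
    for entity_ids in passages:
        # Sort so that (a, b) and (b, a) map to the same undirected edge.
        for a, b in combinations(sorted(set(entity_ids)), 2):
            edges[(a, b)] += 1
    return edges


# Toy input (invented): three passages with resolved character references.
passages = [
    ["anna", "karl", "anna"],
    ["anna", "marie"],
    ["karl", "anna"],
]
network = build_character_network(passages)
for (a, b), weight in sorted(network.items()):
    print(f"{a} -- {b}: weight {weight}")
```

Counting passages rather than individual references is only one possible weighting; the amount of conversation between two characters would be another.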

This work presents the corpus DROC (Deutsches ROman Corpus), which comprises 90 carefully and manually annotated fragments of German novels and includes the following annotations:

1. Each character reference has been marked.
2. Each reference was assigned to one of four subcategories.

# Related Work

Comparing our corpus with others in the field of coreference resolution yields a number of related resources, though none in the domain of (German) literary texts. In this section we restrict the presentation to German and English corpora available for academic research, starting with the latter.

The best known corpora for English were released in the scope of the MUC-6 and MUC-7 conferences [@muc6; @muc7]. These corpora each comprise about 30,000 tokens and contain articles from the Wall Street Journal (WSJ) and articles about airplane crashes. The corpus released for ACE-2005 [TODO] has about 400,000 annotated tokens and contains a mix of news, blog and web articles. With about 1,500,000 tokens, OntoNotes 5.0 [@ontonotes] is currently the largest available resource for coreference resolution and comprises news articles, conversations and web articles.

For German, two corpora are currently available. The first is the Potsdam Commentary Corpus [@potsdam], comprising 33,000 tokens in 176 newspaper commentaries. The other resource for German coreference resolution is the TüBa-D/Z corpus, released by the University of Tübingen. This part of the corpus consists of 1,700 newspaper articles with about 640,000 tokens.

This overview shows that there is currently no resource for (German) literary texts, and that most documents in the existing resources are much shorter than an average novel. Novels therefore exhibit new phenomena that statistical methods have to account for, which underlines the importance of releasing DROC, a resource comprising 90 fragments of German novels published between 1650 and 1950. No other resource has manually marked direct speech passages along with the respective speaker and addressee, of which DROC contains more than 2,000.

# Description of the textual sources

The texts of the novels that form the basis of our corpus come from a large collection of German literary texts available as full texts, the TextGrid repository [TextGrid 2015]. The texts found in this repository are part of one of the first large-scale digitization projects in the German language. The digitization was carried out in separate steps over the course of ten years by a commercial company, Directmedia, which sold the digital texts on CDs and DVDs. It is important to understand that the TextGrid collection comprises three different groups of texts. The first group, by far the largest, consists of canonized texts of German literature. These are usually based on scholarly editions that academics have used for decades. In most editions the spelling has been normalized; in our context this mainly means that “th” has been replaced by “t” (for example “Tür” instead of “Thür”) and “ey” by “ei” (for example “sei” instead of “sey”).

The second group was part of a collection called Deutsche Literatur von Frauen (German literature by women), which tried to collect as much literature by female authors as possible. Because many of these texts are not part of the literary canon, no scholarly editions exist for them, and the creators of the collection had to base their digital texts on first prints or unchanged reprints of first prints. Therefore the collection is not balanced or representative of the literary production of the period it covers. The collection is copyright free and was released in TEI markup on TextGridRep some years ago under a very generous Creative Commons license (CC-by 3.0).

# Creation of the corpus

The corpus DROC comprises 90 fragments of different novels. The novels were randomly selected from the 450 novels available in the TextGrid repository. We applied the Apache openNLP sentence detection component [@openNLP], trained on the TIGER corpus [@tiger], to annotate sentence boundaries in the 90 selected novels. Then, for each novel, we randomly sampled a sentence index and extended the selection in both directions until the beginning and the end of the surrounding chapter were reached. In some cases, where no structural information about chapters was available, our annotators manually selected sentences that indicate the beginning of a coherent passage in the novel and thereby serve as an artificial boundary. The resulting fragments had an average length of 201 sentences. We implemented this procedure because we wanted to make sure that, for all references, either the proper nouns or the common nouns were part of the selected sentences.
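
The sampling step described above can be sketched as follows. This is a minimal illustration, not the script used to build DROC: it assumes that sentence boundaries are already known and that the index of the first sentence of each chapter is available. The function name `sample_fragment` and the chapter positions are hypothetical.

```python
import random


def sample_fragment(chapter_starts, num_sentences, rng):
    """Return (start, end) sentence indices of one sampled fragment.

    `chapter_starts` lists the index of the first sentence of each chapter
    (it must contain 0); `end` is exclusive. A sentence index is drawn
    uniformly at random and the selection is extended in both directions
    to the borders of the chapter that contains it.
    """
    seed_index = rng.randrange(num_sentences)
    # The fragment starts at the last chapter start at or before the seed.
    start = max(s for s in chapter_starts if s <= seed_index)
    # It ends where the next chapter starts, or at the end of the novel.
    later = [s for s in chapter_starts if s > seed_index]
    end = min(later) if later else num_sentences
    return start, end


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
start, end = sample_fragment([0, 120, 310, 505], num_sentences=700, rng=rng)
print(f"fragment covers sentences {start} to {end - 1}")
```

Because the seed sentence is drawn uniformly, longer chapters are more likely to be selected than shorter ones under this sketch.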

The annotation process was as follows:

First, we preprocessed the documents with a rule-based script, developed with UIMA RUTA [@kluegl2016uima], in order to generate suggestions that both of our annotators could later either accept or change. Our corpus was therefore created semi-automatically, with these suggestions as initial support.

We annotated our novels in ATHEN, a desktop application developed in-house on top of the Eclipse RCP4 framework. The perspective for character reference annotation is shown in Figure 1.

[TODO figure]

To enable an easier comparison of DROC with existing corpora with annotated coreference, we go through a selected list of coreferential linguistic phenomena and state whether we marked them as coreferent or not.

Coordination and plural references

Plural references are included only if the phrase needed to mark them does not itself consist of multiple smaller references. Our annotations are therefore not hierarchical.

Split Antecedents

Split antecedents are not marked.

During the annotation process some borderline cases were discovered.

Usually entities are human beings, although in some cases animals or even objects can play an important role in the plot of a story. In such cases these protagonists are also annotated, as shown in the following example: “Eine Woche später und der Alraun war in seiner Art völlig ausgewachsen, etwa dreieinenhalben Fuß hoch;” (roughly: “A week later the mandrake was fully grown for its kind, about three and a half feet tall;”)

The mandrake is labeled as an entity because, in the course of the story, the plant comes to life, is named Cornelius Nepos, and is able to move and talk. It thereby becomes an important agent.

Sometimes a novel is interrupted by a passage of stichomythia, which then resembles a drama. In this case the names that introduce each speech are regarded as entities, e.g.

“Einsiedel: Wie heißest du?

Simpl.: Ich heiße Bub.”

(roughly: “Hermit: What is your name? Simpl.: My name is Boy.”)

Both Einsiedel and Simpl. (short for Simplicissimus) are tagged as named entities.

In rare cases a definitive decision is not possible because of missing knowledge or imprecise references. In the following example, it cannot be clearly determined what “man” (“someone”) refers to. It could refer to a specific person or group, to the human species in general, or to nothing at all.

Sometimes direct speech is not marked by quotation marks.


There are multiple ways to measure inter-annotator agreement (IAA). We used 12 documents that were labelled by both of our annotators, starting from the same initial conditions, and measured the IAA for character reference annotation and for coreference resolution.

To evaluate the quality of the character reference annotation, we took only the annotated span into account and calculated Cohen's kappa [@kappa].

We did this on a per-token basis and converted the output of each annotator into a sequence of B-I-O labels. A per-token measurement rewards the annotators for agreeing that a token is not part of a character reference, in addition to rewarding them for marking the same span. This yields 31,185 instances and resulted in a kappa (κ) of 94.3%.
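
The per-token agreement computation can be sketched as follows. This is an illustrative reimplementation under simple assumptions (spans given as token offsets with an exclusive end, no overlapping spans), not the evaluation code released with DROC; the function names and the toy annotations are invented.

```python
from collections import Counter


def spans_to_bio(num_tokens, spans):
    """Convert (start, end) token spans, end exclusive, to B-I-O labels."""
    labels = ["O"] * num_tokens
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy example: 10 tokens, two annotators who disagree on one span border.
annotator_1 = spans_to_bio(10, [(0, 2), (5, 6)])
annotator_2 = spans_to_bio(10, [(0, 1), (5, 6)])
kappa = cohens_kappa(annotator_1, annotator_2)
print(annotator_1)
print(annotator_2)
print(round(kappa, 3))
```

In the toy example nine of ten tokens agree, but most of that agreement comes from the frequent `O` label, which is why kappa is noticeably lower than the raw agreement.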

On the same documents, we measured the IAA of the assigned coreference clustering with MUC-6 and B-Cube scores [@luo2005coreference]. We obtained a MUC-6 F1 of 88.5% and a B-Cube F1 of 69%. Since both evaluation metrics require the number of references to be equal, we added references where necessary and treated them as singletons. The B-Cube metric punishes singleton clusters, which explains the much lower score compared to the MUC evaluation. Removing unmatchable annotations yields a MUC F1 of 92.4% and a B-Cube F1 of 76%. The documents were annotated by one annotator, and afterwards both annotators revised the documents together to guarantee a corpus of high quality.
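
For readers unfamiliar with the two metrics, the following sketch implements their commonly used textbook definitions. It assumes that both clusterings cover exactly the same set of mentions, which corresponds to the situation after unmatched references have been added as singletons. It is not the evaluation code from the DROC repository, and the function names and toy clusterings are invented.

```python
def muc(key, response):
    """MUC recall of `response` against `key` (lists of sets of mentions).

    Swap the arguments to obtain precision.
    """
    cluster_of = {m: i for i, cluster in enumerate(response) for m in cluster}
    numerator = denominator = 0
    for cluster in key:
        # Number of response clusters needed to cover this key cluster.
        partitions = len({cluster_of[m] for m in cluster})
        numerator += len(cluster) - partitions
        denominator += len(cluster) - 1
    return numerator / denominator if denominator else 0.0


def b_cubed(key, response):
    """B-Cubed recall of `response` against `key`; swap for precision."""
    cluster_of = {m: cluster for cluster in response for m in cluster}
    scores = [
        len(cluster & cluster_of[m]) / len(cluster)
        for cluster in key
        for m in cluster
    ]
    return sum(scores) / len(scores)


def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy clusterings of five mentions by two annotators.
annotator_1 = [{1, 2, 3}, {4, 5}]
annotator_2 = [{1, 2}, {3, 4, 5}]
muc_f1 = f1(muc(annotator_2, annotator_1), muc(annotator_1, annotator_2))
b3_f1 = f1(b_cubed(annotator_2, annotator_1), b_cubed(annotator_1, annotator_2))
print(f"MUC F1: {muc_f1:.3f}, B-Cubed F1: {b3_f1:.3f}")
```

MUC counts only links, so a singleton cluster contributes nothing to its numerator or denominator, whereas B-Cubed scores every mention individually, including singletons.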

The code for the evaluation, as well as the documents used for the measurements, can be downloaded from DROC's git repository.

# Release Formats

DROC is available in two formats. The first format is XMI. These files are the standard serialization of Apache UIMA and come with a type system definition, which is required to open DROC.

The second format is TEI-XML [@tei-schrott]. This section gives a brief overview of the representation used in these formats. A more thorough definition can be found on the homepage of the Kallimachos project [TODO].

## XMI-format:


# Corpus Statistics

DROC contains 90 fragments of different novels.

The corpus comprises about 393,000 tokens, as determined by the tokenizer script of the TreeTagger [TODO]. On average, each fragment consists of 4368±2334 tokens and 202±131 sentences. We manually annotated 52079 character references, the majority of which (65%, 34060) are pronouns. About 23% (12005) of the references have the type “appellative” assigned, and the remaining 12% (6013) are “core” references.

These 52081 references are clustered into 5288 entities, resulting in an average of about 10 references per entity and 59±31 entities per document.

Compared to the statistics from the study in [Kabadjov 2007], pronouns appear more frequently in DROC, with a proportion of 65% compared to the 44% reported by Kabadjov, while the proportion of proper nouns is almost constant, with a small increase from 10% to 12% in DROC.

35 of those fragments were written by female authors and the remaining 55 by male authors, resulting in a slightly imbalanced gender ratio of roughly 40% to 60%.
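
The rounded proportions can be rechecked from the reported counts with a few lines of arithmetic. The snippet below only restates numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the text above.
total_references = 52079
by_type = {"pronoun": 34060, "appellative": 12005, "core": 6013}
entities = 5288
fragments_by_gender = {"female": 35, "male": 55}

# Share of each reference type among all annotated references.
type_share = {
    name: 100 * count / total_references for name, count in by_type.items()
}
for name, share in type_share.items():
    print(f"{name}: {share:.1f}%")

references_per_entity = total_references / entities
print(f"references per entity: {references_per_entity:.1f}")

# Share of fragments by author gender.
total_fragments = sum(fragments_by_gender.values())
for gender, count in fragments_by_gender.items():
    print(f"{gender}: {100 * count / total_fragments:.1f}%")
```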

[TODO table]

| Epoch            | 1651-1700 | 1701-1750 |   |   |   |   |   |
|------------------|-----------|-----------|---|---|---|---|---|
| Number of novels | 2         | 3         |   |   |   |   |   |

Table: *The third table.* (The table caption is partly\
*italic*.)

# License

If you use the corpus, please use the following citation.

\bibliography