---
title: Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus]
author:
- Markus Krug
- Frank Puppe
- Isabella Reger
- Lukas Weimer
- Luisa Macharowsky
- Stephan Feldhaus
- Fotis Jannidis
longauthor:
- Markus Krug¹
- Frank Puppe¹
- Isabella Reger²
- Lukas Weimer²
- Luisa Macharowsky²
- Stephan Feldhaus²
- Fotis Jannidis²
institute:
- ¹Wuerzburg University, Chair of Artificial Intelligence and Applied Computer Science
- ²Wuerzburg University, Chair of Literary Computing
lang: en
report-number: 123
abstract: |
  In this work, we present DROC, a corpus consisting of 90 fragments of German novels published between the 17th and 20th century. DROC contains more than 50,000 carefully annotated character references as well as their coreferences. Additionally, we annotated the direct speech instances contained in the fragments, along with the corresponding speakers and addressees. The corpus is released in TEI-XML and Apache UIMA .xmi. Both formats are described in this contribution.
keywords-en:
- character references
- coreferences
- novel
- speaker
- direct speech
- corpus
- annotation
- gold
- german
keywords-de:
- Figurenreferenzen
- Koreferenzen
- Roman
- Sprecher
- Direkte Reden
- Korpus
- Annotation
- Gold
- deutsch
date: 2017
wpno: 4711
...

# Motivation

Nowadays, large collections of literary texts are available in many languages and enable new approaches to literary studies like network analysis, topic modeling or stylometry. Especially the analysis of networks of literary characters has become either a goal in itself or a building block in larger contexts [@elson2010; @park2013; @trilcke2013; @moretti2011]. In such networks, characters usually constitute the nodes, while their interaction, for example the amount of conversation, is modeled as edges, often using the amount of interaction as weight. In order to create such networks, the first step is to find all references to characters in a text. However, in order to detect all character references to an entity, it is not sufficient to apply a state-of-the-art named entity recognizer (NER) such as Stanford NER [@stamfordNER]. In a literary text, a reference can appear in one of three broad syntactic categories: 1) as a proper noun; 2) as a nominal phrase; 3) as a pronoun. Detecting categories 2) and 3) is usually beyond the scope of named entity recognition. Furthermore, current state-of-the-art NER systems were trained on newspaper articles, which leaves another gap to bridge towards the domain of novels. The mere identification of character references is not sufficient to create a complete pipeline with the goal of extracting social networks from literary texts; each reference also has to be resolved to its corresponding entity in the fictional world of a text, a process called coreference resolution.

This work presents the corpus DROC (Deutsches ROman Corpus), which consists of 90 carefully manually annotated fragments of German novels and includes the following annotations:

1. Each character reference has been marked.
2. Each reference was assigned to one of four subcategories.
3. Each character reference has an assigned entity identifier, which represents the annotation of coreference resolution.
4. Each direct speech has been manually annotated.
5. The speaker and addressee of each direct speech have been manually marked.

To the best of our knowledge, there is no comparable corpus available to the academic community in the domain of literary texts, especially for German. DROC comprises about 393,000 annotated tokens with more than 50,000 labelled character references.

The paper is structured as follows: first, a brief overview of existing corpora for named entities and coreference resolution is given, followed by a description of the textual sources of the fragments. We continue with a detailed description of our annotation guidelines and the annotation process, including the inter-annotator agreement (IAA). We then explain the two formats in which we release our data and conclude with a brief description of the statistics found in our corpus.

# Related Work

Comparing our corpus to existing resources in the field of coreference resolution yields a number of related corpora - though none in the domain of (German) literary texts. In this section we restrict the presentation to German and English corpora available for academic research, starting with the latter.

The best known corpora for English were released in the scope of the MUC-6 and MUC-7 conferences [@muc6; @muc7]. Those corpora each comprise about 30,000 tokens and contain articles of the Wall Street Journal (WSJ) and reports on airplane crashes. The corpus released for ACE-2005 [@walker2006ace] had about 400,000 annotated tokens and contains a mix of news, blog and web articles. With about 1,500,000 tokens, OntoNotes 5.0 [@ontonotes] is currently the largest available resource for coreference resolution and consists of news articles, conversations and web articles.

For German, there are currently two corpora available. The first is the Potsdam Commentary Corpus [@potsdam], comprising 33,000 tokens derived from 176 newspaper commentaries. The other resource for German coreference resolution is the TüBa-D/Z corpus, released by the University of Tübingen. It is made of about 3,400 newspaper articles, with about 1,500,000 tokens.

This overview shows that there is currently no resource for (German) literary texts and that most articles in the aforementioned resources tend to be much shorter than an average novel - yielding new phenomena to explain with statistical methods and therefore underlining the importance of the release of DROC, a resource comprising 90 fragments of German novels published between 1650 and 1950. Moreover, no other resource has manually marked direct speech passages along with the respective speaker and addressee, of which DROC has more than 2,000.

# Description of the Textual Sources

The texts of the novels which form the basis for our corpus come from a large collection of German literary texts available as full texts, the TextGrid repository [TextGrid 2015]. The texts found in this repository are part of one of the first large-scale digitization projects in the German language. The digitization was undertaken in separate steps over the course of ten years by a commercial company, Directmedia, which sold digital texts on CDs and DVDs. It is important to understand that the TextGrid collection comprises two different groups of texts. The first group, by far the largest, consists of canonized texts of German literature. These are usually based on scholarly editions used for decades by academics. In most editions the writing has been normalized: in our context this means mainly that “th” has been replaced by “t” (for example “Tür” instead of “Thür”) and “ey” by “ei” (for example “sei” instead of “sey”).

The second group has been part of a collection called *Deutsche Literatur von Frauen* (German literature by women), which tried to collect as much literature by female authors as possible. As many of these texts are not part of the literary canon, there are no scholarly editions and the creators of the collection had to base their digital texts on first prints or unchanged reprints of first prints. The collection is therefore not balanced or representative for the literary production of the period it covers. It is copyright free and has been released in TEI markup on TextGridRep with a very generous Creative Commons license (CC-BY 3.0).

# Creation of the Corpus

The corpus DROC comprises 90 fragments of different novels. The novels were randomly selected from the 450 novels available in the TextGrid repository. We applied the Apache OpenNLP sentence detection component [@openNLP], trained on the TIGER corpus [@tiger], to annotate sentence boundaries in the selected novels. Then, for each novel, we randomly sampled a sentence index and extended the fragment in both directions until the beginning and the end of a chapter were reached. On some occasions, where no structural information about chapters was available, our annotators manually selected sentences that indicate the beginning of a coherent passage in the novel and therefore simulate an artificial border. The resulting fragments have an average length of 201 sentences. We implemented this procedure because we wanted to make sure that for all references either the proper nouns or the common nouns were part of the selected sentences.

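
The sampling step can be sketched as follows; `sentences` and `chapter_starts` are hypothetical inputs standing in for the OpenNLP sentence output and the chapter markup (or the manually chosen borders), not part of any released tooling.

```python
import random

def sample_fragment(sentences, chapter_starts):
    """Pick a random sentence and extend it to the enclosing chapter.

    sentences: list of sentence strings of one novel, in text order.
    chapter_starts: sorted sentence indices that open a chapter
                    (manually chosen borders when no chapter markup exists).
    """
    idx = random.randrange(len(sentences))
    # Walk back to the last chapter start at or before the sampled index.
    begin = max((s for s in chapter_starts if s <= idx), default=0)
    # Walk forward to the next chapter start (exclusive fragment end).
    end = min((s for s in chapter_starts if s > idx), default=len(sentences))
    return sentences[begin:end]
```

With chapter starts at indices 0, 3 and 6 of an eight-sentence novel, every sampled fragment is exactly one of the three chapters.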

The annotation process can be depicted as follows:

First, we preprocessed the documents with a rule-based script, developed with UIMA RUTA [@kluegl2016uima], in order to generate suggestions that both of our annotators could later either accept or change. Our corpus was therefore created semi-automatically with initial support.

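
The original suggestions were generated with UIMA RUTA rules; purely as an illustration of the idea (and not the actual rule set), a pre-annotation pass could flag title-plus-name patterns for the annotators to accept or change. The title list here is a made-up toy gazetteer.

```python
import re

# Toy stand-in for the UIMA RUTA pre-annotation: flag candidate
# character references so annotators only need to accept or correct them.
TITLE = r"(?:Herr|Frau|Graf|Gräfin|Baron)"
PATTERN = re.compile(rf"\b{TITLE}\s+[A-ZÄÖÜ]\w+")

def suggest_references(text):
    """Return (start, end, snippet) suggestions for one text fragment."""
    return [(m.start(), m.end(), m.group()) for m in PATTERN.finditer(text)]
```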

We annotated our novels in ATHEN^[https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/Athen], a purpose-built desktop application based on the Eclipse RCP4 framework. The perspective for character reference annotation can be seen in figure 1.

![The main user interface for the coreference annotation in ATHEN. The left side shows the main editor with the currently opened document (in this case Effi Briest by Theodor Fontane); on the right is the view to accept or change a selected annotation.](ATHEN.PNG "ATHEN coreferenceview")

After our annotators finished their pass over the documents, remaining inconsistencies were resolved jointly in order to obtain a clean version of the annotations.

# Annotation Guidelines

We present our annotation guidelines in three steps. First, we describe which references were annotated, followed by a description of the resulting phenomena we had to deal with in terms of coreference resolution and some borderline cases in DROC. We conclude the guidelines section with our guidelines for the annotation of direct speech utterances along with their speakers and addressees.

## Annotated Character References

The annotation of character references follows a single rule:

***Mark every text snippet in the novel that references a (literary) character.***

Furthermore, we decided not to mark the complete nominal phrase surrounding the reference and only marked the heads instead. An example is given in figure 2.

![A text snippet, taken from “Effi Briest” by Theodor Fontane. The picture shows the marked head “Ritterschaftsrätin von Padden”, which is embedded in the nominal phrase “der alten Ritterschaftsrätin von Padden”. Analogously, the snippet “Frau von Titzewitz” is marked as the head of the phrase “einer etwas jüngeren Frau von Titzewitz”.](exampleCR.PNG "ATHEN coreferenceview_ex")
Following this rule, the resulting phrases can be classified into the following subcategories:

1. **Proper noun**

   Proper nouns, for example forenames, surnames or family names. These names can also refer to entities that are not part of the fictional world (e.g. another author, historic persons, etc.). In our schema, the text snippets representing proper nouns are marked as “Core”. Sometimes a “Core” snippet is only part of a reference (as shown in figure 2, where “von Padden” is the Core snippet of “Ritterschaftsrätin von Padden”).

2. **Heads of common noun phrases**

   A head of a common noun phrase can be an arbitrary composite consisting of:

   * Occupational titles (e.g. “Bäcker” - “baker”)
   * Relational expressions (e.g. “Mutter” - “mother”)
   * Gender terms (e.g. “Mann” - “man”)
   * Different titles (e.g. “Graf” - “earl”)
   * Action terms (e.g. “Spaziergänger” - “stroller”)
   * Defamations (e.g. “Idiot” - “idiot”)
   * Substantival verbs (e.g. “Rufende” - “shouter”)
   * Substantival adjectives (e.g. “Schöne” - “beauty”)

This listing is not exhaustive, which illustrates the complexity of this class. Annotations of this kind were marked as “AppTdfW” (Appellativ, Teil der fiktionalen Welt) if they are part of the fictional world or as “AppA” (Appellativ, Abstraktum) if they refer to generic or abstract entities that are not part of the fictional world.

3. **Pronouns**

   This category, marked as “Pron”, comprises all sorts of pronouns, the most prominent examples being:

   * Personal pronouns (e.g. “er”, “sie” - “he”, “she”)
   * Possessive pronouns (e.g. “seine”, “ihre” - “his”, “her”)
   * Reflexive pronouns (e.g. “sich” - “himself”, “herself”, “themselves”)
   * Relative pronouns (e.g. “der”, “die” - “who”)

For each resulting character reference, we marked the following features:

* Type: one of “Core”, “Pron”, “AppTdfW” or “AppA”, as described above

* Range: (used only for Cores) span of character offsets for the identification of the core text snippet

* Number: singular or plural

* ID: a unique identifier for each entity appearing in the text, used to represent coreference.

* Pseudo: This means that the person is mentioned in the text but does not really take part in the action or does not exist in reality. An example for this case is “War nicht auch Cromwell erst in hohem Alter nach vergeudeter Jugend erweckt worden zum Dienste Gottes?” (“Only in his old age and after wasting his youth Cromwell was called to serve God, wasn’t he?”^[For better clarity, all German citations were manually translated into English.], Bleibtreu: Größenwahn). Both *Cromwell* and *Gott* are identified as pseudos, because neither takes part in this novel’s action.

* Uncertain: A boolean flag that can be set by the annotator if the decision is unclear.
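
The feature set above can be pictured as a simple record type; the class and field names here are illustrative and do not reproduce the attribute names of the released TEI or UIMA schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CharacterReference:
    """One annotated character reference with the features listed above."""
    ref_type: str                                  # "Core", "Pron", "AppTdfW" or "AppA"
    begin: int                                     # character offset of the marked head
    end: int
    number: str                                    # "singular" or "plural"
    entity_id: int                                 # shared by all mentions of one entity
    core_range: Optional[Tuple[int, int]] = None   # only set for Cores
    pseudo: bool = False                           # mentioned, but not part of the action
    uncertain: bool = False                        # annotator flagged the decision
```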

## Annotated Coreferences

Given the definition of the character references, our annotators had the task of assigning a unique identifier to each entity in the text and of reusing this ID for each mention of that entity.

To enable an easier comparison of DROC to existing corpora with annotated coreference, we discuss a selected list of coreferential linguistic phenomena and elaborate whether we marked them as coreferent or not.

**Coordination and Plural References**

Plural references are included if the phrase that is required to mark them does not consist of multiple smaller references. Therefore, our annotations are not hierarchical.

**Split Antecedents**

Split antecedents, that is, plural references which can only be mapped to more than one reference, are not marked (e.g. “sie” in “Effi und Innstetten planten eine Reise, sie...”^[Effi and Innstetten planned a trip, they...]).

**Expletives**

Expletives, such as “it” in “it is raining”, are not included in DROC.

**Appositions and Predicatives**

Appositional references (e.g. “Otto, ihr ältester Sohn, ...”^[Otto, her oldest son, ...]) as well as references in predicative position (e.g. “Er ist Bäcker”^[He is a baker.]) are (usually) marked as coreferent.

**Bridging Anaphora**

Bridging anaphora, such as the relation between tyre and bicycle in “I bought a bicycle. A tyre was already flat”, are not marked within DROC.

**Discourse**

The information whether an entity is discourse-new has to be parsed from the ID feature of the references.

----------
We conclude this section with a prototypical example taken from DROC:

“Bekannte (ID=1, AppTdfW, plural) traten zu ihnen (ID=2, Pron, plural) heran und das Gespräch war unterbrochen. Michael (ID=3, Core) fuhr mit Käthe (ID=4, Core) in einer offenen Droschke, in der milden Märznacht, nach Hause. Ihre (ID=4, Pron) Blicke hingen am gestirnten Himmel, die seinen (ID=3, Pron) an ihrem (ID=4, Pron) Antlitz. In Beiden (ID=2, Pron, plural) klang die Stimmung von Tristan (ID=5, Core, pseudo) und Isolde (ID=6, Core, pseudo) nach.”^[“Friends came up to them and their conversation stopped. Michael went home with Käthe in an open hansom through the mild March night. Her eyes focussed on the starry sky, his eyes on her countenance. In both of them the mood of Tristan and Isolde still resonated.”, Dohm: Wie Frauen werden.]
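
Given ID annotations like those in the example above, coreference chains can be recovered simply by grouping mentions on the ID feature; the first mention of each ID is also the discourse-new one. A minimal sketch, with the mention list transcribed from the example:

```python
from collections import defaultdict

# (surface form, entity ID) in text order, transcribed from the example.
mentions = [("Bekannte", 1), ("ihnen", 2), ("Michael", 3), ("Käthe", 4),
            ("Ihre", 4), ("seinen", 3), ("ihrem", 4), ("Beiden", 2),
            ("Tristan", 5), ("Isolde", 6)]

# Group mentions into coreference chains by their entity ID.
chains = defaultdict(list)
for surface, entity_id in mentions:
    chains[entity_id].append(surface)

# The first mention of each entity is the discourse-new one.
discourse_new = {eid: chain[0] for eid, chain in chains.items()}
```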

## Borderline Cases

During the annotation process, some borderline cases were discovered; a few of them are explained exemplarily in the following.

Character references are usually human beings, but in some cases animals or even inanimate things can play an important role in the plot of a story. In such instances these protagonists are also annotated, as shown in the following example: “Eine Woche später und der Alraun war in seiner Art völlig ausgewachsen, etwa dreieinenhalben Fuß hoch;”^[“One week later the mandrake was fully grown, approximately three and a half feet tall;” Arnim: Isabella von Ägypten]

The mandrake is labeled as an entity because, in the course of the story, the plant comes to life, is named *Cornelius Nepos* and is able to move and talk. It therefore becomes an important agent.

Sometimes a novel is partly interrupted by a stichomythia, which then resembles a drama. In this case the names which introduce the speech are regarded as entities, e.g.

“Einsiedel: Wie heißest du?

Simpl.: Ich heiße Bub.”^[“Einsiedel: What is your name?; Simpl.: My name is boy.” Grimmelshausen: Der abenteuerliche Simplicissimus Teutsch]

Both Einsiedel and Simpl. (short for Simplicissimus) are marked.

In rare cases a definitive decision is not possible - due to a lack of knowledge or imprecise references. In the following example, it is not clearly determinable what “man” (“someone”) refers to. It could refer to a concrete person or group, to the human species in general or to nothing at all.

“Sollte ich etwa mit gebundenen Händen immer weiter zusehen, wie man mir mein Leben zertritt, bis die Jugend vorbei ist und alles zu spät?”^[“Should I really keep on watching with tied hands how someone tramples down my life, until my youth is over and everything is too late?”, Reventlow: Ellen Olestjerne]

## Annotated Direct Speech

Direct speech passages and every other text section enclosed by single or double quotation marks are annotated. Such annotations range from the opening quotation mark to the closing one - both included. In most cases these are French quotation marks, infrequently dashes. To every annotation, one speaker and one addressed character reference (or in rare cases more) are assigned. If it was not possible to determine who speaks or who is addressed, they are marked as “unknown”.
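
As a sketch of how such quoted spans can be located automatically (the annotation itself was manual, and the mark inventory below is only an assumption about typical German editions):

```python
import re

# Typical opening/closing quotation mark pairs in German print editions;
# which marks a given novel actually uses is edition-dependent.
QUOTE_PAIRS = [("\u00bb", "\u00ab"),   # »...« (French marks in German usage)
               ("\u201e", "\u201c"),   # „...“
               ('"', '"')]             # straight double quotes

def quoted_spans(text):
    """Return (start, end) character spans, quotation marks included."""
    spans = []
    for open_q, close_q in QUOTE_PAIRS:
        pattern = (re.escape(open_q)
                   + "[^" + re.escape(open_q + close_q) + "]*"
                   + re.escape(close_q))
        spans.extend(m.span() for m in re.finditer(pattern, text))
    return sorted(spans)
```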
|
|
|
|
|
|
The annotation process obeys strict rules. If speaker and addressed reference are connected to the relevant direct speech with a communication verb, then these entities are labelled. If not, we looked for direct addresses within the direct speech which are not pronouns (e.g. “..., my dear friend”). If something like that does not exist either, the last mention of speaker and/or addressed person which lies outside of direct speeches was annotated, independent of being a noun or pronoun.
|
|
|
The annotation process obeys strict rules. If the speaker and the addressed reference are connected to the relevant direct speech by a communication verb, these entities are labelled. If not, we looked for direct addresses within the direct speech that are not pronouns (e.g. “..., my dear friend”). If no such address exists either, the last mention of the speaker and/or addressed person outside of any direct speech was annotated, regardless of whether it is a noun or a pronoun.
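These rules form a simple decision cascade. The following sketch only illustrates the order in which the rules fire; the arguments stand for the candidates a human annotator found for each rule, and the function is not part of DROC's tooling:

```python
def resolve_reference(verb_candidate, address_candidate, last_mention):
    """Pick the speaker/addressee following the annotation guideline:
    1. a reference attached via a communication verb,
    2. a non-pronominal direct address inside the speech,
    3. the last mention outside any direct speech (noun or pronoun).
    Each argument is the candidate found for that rule, or None
    if the rule does not apply."""
    for candidate in (verb_candidate, address_candidate, last_mention):
        if candidate is not None:
            return candidate
    return "unknown"

# "..., my dear friend" with no communication verb: rule 2 fires.
print(resolve_reference(None, "my dear friend", "Ellen"))  # my dear friend
print(resolve_reference(None, None, None))                 # unknown
```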
|
|
|
|
|
|
|
|
|
Apart from real direct speech, every text section within quotation marks is annotated. This might be a place name, a quotation or a thought. In such cases the category, which is set to “directspeech” by default, was changed. The following categories are defined: “thought”, “citation” (e.g. quotations of absent characters or of other fictional works), “fictionalspeech” (speech of a text entity that is not labeled as a reference, e.g. “my heart says…”, “roses say…”), “name” (e.g. place names) and “other” (if further classification is not possible, e.g. a word highlighted with quotation marks by the author for accentuation).
|
|
|
|
|
|
Sometimes direct speech is not marked by quotation marks at all; in these rare cases it is indicated by dashes or appears without any marker.
|
|
|
|
|
|
|
|
|
# Inter-Annotator Agreement
|
|
|
|
|
|
|
|
|
There are multiple ways to measure inter-annotator agreement (IAA). We used 12 documents that were labelled by both of our annotators and measured the IAA for character reference annotation and for coreference resolution.
|
|
|
|
|
|
For evaluating the quality of the character reference annotation, we only took the annotated span into account and calculated Cohen's Kappa [@kappa].
|
|
|
|
|
|
We did this on a per-token basis and converted the output of each annotator into a sequence of B-I-O labels. A measurement on a per-token basis rewards the annotators for agreeing not to mark a token as a character reference, in addition to rewarding them for marking the same span. This yields 31.185 instances and resulted in a Cohen's κ of 94.3%.
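The per-token kappa computation can be sketched as follows; this is a minimal illustration on invented toy B-I-O sequences, not the evaluation code from the repository:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two parallel label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators assign
    # the same label to a token by chance.
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented toy sequences over 8 tokens: the annotators disagree
# on the extent of the second reference.
ann_a = ["B", "I", "O", "O", "B", "O", "O", "O"]
ann_b = ["B", "I", "O", "O", "B", "I", "O", "O"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.789
```

Note that tokens both annotators leave as "O" count towards the observed agreement, which is exactly the effect described above.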
|
|
|
|
|
|
|
|
|
On the same documents, we measured the IAA of the assigned coreference clustering with MUC-6 and B-Cube scores [@luo2005coreference]. Our evaluation resulted in a MUC-6 F1 of 88.5% and a B-Cube F1 of 69%. Since both evaluation metrics require the number of references to be equal, we added missing references where necessary and treated them as singletons. The B-Cube metric punishes singleton clusters, which explains the much lower score compared to the MUC evaluation. Removing unmatchable annotations yields a MUC-6 F1 of 92.4% and a B-Cube F1 of 76%. For the final version of DROC, the documents were annotated by one annotator, and afterwards both annotators revised the documents together to guarantee a corpus of high quality.
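The B-Cube computation can be illustrated with a small sketch on invented toy clusterings; this is not the evaluation code from DROC's repository:

```python
def b_cubed(gold, pred):
    """B-Cubed precision, recall and F1 for two coreference clusterings,
    each given as a {mention_id: cluster_id} dict over the same mentions."""
    def avg_overlap(sys, ref):
        # For every mention: overlap of its sys-cluster with its
        # ref-cluster, relative to the size of the sys-cluster.
        total = 0.0
        for m in sys:
            sys_cluster = {x for x in sys if sys[x] == sys[m]}
            ref_cluster = {x for x in ref if ref[x] == ref[m]}
            total += len(sys_cluster & ref_cluster) / len(sys_cluster)
        return total / len(sys)
    p = avg_overlap(pred, gold)  # precision: averaged over system clusters
    r = avg_overlap(gold, pred)  # recall: averaged over gold clusters
    return p, r, 2 * p * r / (p + r)

# Invented toy example: gold merges mentions 1-3, the system splits them.
gold = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
pred = {1: "X", 2: "X", 3: "Y", 4: "Y", 5: "Y"}
p, r, f1 = b_cubed(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.733 0.733 0.733
```

Because every singleton cluster contributes an overlap of at most its single mention, padding a clustering with singletons drags the average down, which is why B-Cube scores are lower than MUC scores here.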
|
|
|
|
|
|
|
|
|
The code for the evaluation, as well as the documents that were used for the measurements, can be downloaded from DROC's git repository^[https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release].
|
|
|
|
|
|
# Release Formats
|
|
|
|
|
|
DROC is available in two formats. The first format is XMI. These files are the standard format of Apache UIMA and come with a type system definition that is required to open DROC.
|
|
|
|
|
|
The second format is TEI-XML [@tei-schrott]. This section gives a brief overview of the representation used within these formats. A more thorough definition can be found on the homepage of the project Kallimachos^[http://kallimachos.de/kallimachos/index.php/Hauptseite].
|
|
|
|
|
|
## XMI Format
|
|
|
|
|
|
|
|
|
In the UIMA format, each annotation is stored with at least two features: a begin feature indicating the character offset where the annotation starts and an end feature indicating where it ends. Additionally, each annotation has its own type, defined in a separate descriptor XML file. For DROC, we defined two types:
|
|
|
|
|
|
|
|
|
**Type NamedEntity**
|
|
|
|
|
|
|
|
|
|
|
|
This type represents a character reference. One annotation is created for every character reference. Table 1 gives an overview of the features used.
|
|
|
|
|
|
|
|
|
\begin{table}[h]
|
|
|
\centering
|
|
|
\caption{Overview of the UIMA type NamedEntity used in the .xmi encoding of DROC.}
|
|
|
\label{tab:namedentity}
|
|
|
\begin{tabular}{l|lll}
|
|
|
\hline
|
|
|
\rowcolor[HTML]{EFEFEF}
|
|
|
Featurename & Range & Featurevalues & Description \\ \hline
|
|
|
ID & String & any & \begin{tabular}[c]{@{}l@{}}A unique id, referring to the entity\\ this reference belongs to\end{tabular} \\
|
|
|
Pseudo & String & true, false & \\
|
|
|
Numerus & String & Pl (plural), Si (singular) & A string referring to the number of the reference \\
|
|
|
NEType & String & AppTdfW, Core & The type of the reference \\
|
|
|
Uncertain & String & true, false & \\
|
|
|
CoreRange & String & {[}from:to{]} or null & \begin{tabular}[c]{@{}l@{}}A string, used for core references\\ to show the text that is a proper noun\end{tabular}
|
|
|
\end{tabular}
|
|
|
\end{table}
|
|
|
**Type DirectSpeech**
|
|
|
|
|
|
|
|
|
This type represents an instance of a direct speech. It stores information about the speaker and the addressee as well as its category.
|
|
|
|
|
|
|
|
|
\begin{table}[h]
|
|
|
\centering
|
|
|
\caption{Overview of the UIMA type DirectSpeech used in the .xmi encoding of DROC.}
|
|
|
\label{tab:directspeech}
|
|
|
\begin{tabular}{llll}
|
|
|
\hline
|
|
|
\rowcolor[HTML]{EFEFEF}
|
|
|
Featurename & Range & Featurevalues & Description \\ \hline
|
|
|
\multicolumn{1}{l|}{Speaker} & Annotation & &\begin{tabular}[c]{@{}l@{}}An annotation of type NamedEntity,\\ depicting the speaker \end{tabular} \\
|
|
|
\multicolumn{1}{l|}{SpokenTo} & Annotation & & \begin{tabular}[c]{@{}l@{}}An annotation of type NamedEntity,\\ depicting the addressee \end{tabular} \\
|
|
|
\multicolumn{1}{l|}{category} & String & \begin{tabular}[c]{@{}l@{}}directspeech, thought, citation,\\ fictionalspeech, name, other\end{tabular} & The category of the passage \\ \hline
|
|
|
\end{tabular}
|
|
|
\end{table}
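For illustration, a heavily abbreviated and hypothetical XMI fragment combining both types might look roughly as follows; the actual namespaces, IDs and attribute spelling are defined by the type system descriptor shipped with the corpus:

```xml
<types:NamedEntity xmi:id="42" begin="103" end="108"
                   ID="7" NEType="Core" Numerus="Si"
                   Pseudo="false" Uncertain="false"/>
<types:DirectSpeech xmi:id="43" begin="98" end="162"
                    Speaker="42" SpokenTo="44"
                    category="directspeech"/>
```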
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## TEI-XML
|
|
|
|
|
|
The second format in which DROC is available is TEI-XML. Within the `<body>` element of each document, a `<w>` element is added for each token. Character references have been encoded using the `<persName>` element, and direct speech utterances using the `<quote>` element with embedded speech elements `<sp>` that point to the speaker of the utterance. Sentence and paragraph boundaries have been added as virtual elements at the end of each document. We used the “prev” attribute of the `<persName>` element to refer to the first appearance of the corresponding entity of a character reference. Speakers have been encoded using the “who” attribute, which refers to the xml:id of the speaking character reference.
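A hypothetical, simplified TEI fragment following these conventions might look as follows; the token texts and IDs are invented, and the released files may nest elements differently:

```xml
<body>
  <persName xml:id="w57">Effi</persName>
  <!-- ... -->
  <quote>
    <sp who="#w102"/>
    <w>Komm</w>
    <w>bald</w>
    <w>wieder</w>
  </quote>
  <w>sagte</w>
  <persName xml:id="w102" prev="#w57">Effi</persName>
</body>
```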
|
|
|
|
|
|
|
|
|
# Corpus Statistics
|
|
|
|
|
|
DROC contains 90 fragments of different novels.
|
|
|
|
|
|
The corpus comprises about 393.000 tokens, determined by the tokenizer script of the TreeTagger [@schmid2013]. On average, each fragment consists of 4368±2334 tokens and 202±131 sentences. We manually annotated 52079 character references, the majority of which, 65%, are pronouns (34060). About 23% (12005) of the references belong to the type “appellative”, and the remaining 12% (6013) are “Core” references.
|
|
|
These references are clustered into 5288 entities, which amounts to about 10 references per entity on average and 59±31 entities per document.
|
|
|
|
|
|
Compared to the statistics from the study in [@kabadjov2007], pronouns appear more frequently in DROC, with a proportion of 65% compared to the 44% reported by Kabadjov, while the share of proper nouns is almost constant, with a small increase from 10% to 12% in DROC.
|
|
|
Of the 90 fragments, 35 were written by female authors and the remaining 55 by male authors, resulting in a slightly imbalanced 40%-60% gender ratio.
|
|
|
|
|
|
|
|
|
\begin{table}[h]
|
|
|
\centering
|
|
|
\caption{Overview of the number of novels published per epoch of 50 years. It can be seen that most novels were published during the 19th century.}
|
|
|
\label{tab:epochs}
|
|
|
\begin{tabular}{l|lllllll}
|
|
|
\hline
|
|
|
\rowcolor[HTML]{EFEFEF}
|
|
|
Epoch & \begin{tabular}[c]{@{}l@{}}1651 -\\ 1700\end{tabular} & \begin{tabular}[c]{@{}l@{}}1701 -\\ 1750\end{tabular} & \begin{tabular}[c]{@{}l@{}}1751 - \\ 1800\end{tabular} & \begin{tabular}[c]{@{}l@{}}1801 -\\ 1850\end{tabular} & \begin{tabular}[c]{@{}l@{}}1851 -\\ 1900\end{tabular} & \begin{tabular}[c]{@{}l@{}}1901 -\\ 1950\end{tabular} & \begin{tabular}[c]{@{}l@{}}1951-\\ 2000\end{tabular} \\ \hline
|
|
|
Number of novels & 2 & 3 & 4 & 31 & 35 & 14 & 1 \\ \hline
|
|
|
\end{tabular}
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# License
|
|
|
|
|
|
|
|
|
The corpus is licensed under the Creative Commons license CC-BY^[https://creativecommons.org/licenses/by/3.0/de/]. Please quote this text if you use the corpus in your work.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# References
|
|
|
|
|
|
\bibliography |