#Description of a Corpus of Character References in German Novels [DROC Deutsches ROman Corpus] - DROC
##Motivation
Large collections of literary texts are now available in many languages and enable new approaches in literary studies, such as network analysis, topic modeling, and stylometry. In particular, the analysis of networks of literary characters has become either a goal in itself or a building block in larger contexts [Elson, Dames, McKeown 2010, Moretti 2011, Gyeong-Mi Park et al. 2013, Trilcke 2013]. In such networks, characters usually constitute the nodes, while their interaction, for example the amount of conversation, is modeled as edges, often with the amount of interaction as the edge weight. The first step in creating such networks is to find all references to characters in a text. However, applying a state-of-the-art named entity recognizer (NER) such as the Stanford NER [TODO] is not sufficient to detect all references to a character. A reference can appear in one of three broad syntactic categories: 1) a proper noun; 2) a nominal phrase; 3) a pronoun. Detecting categories 2) and 3) is usually beyond the scope of named entity recognition. Furthermore, current state-of-the-art NER systems were trained on newspaper articles, which leaves another gap to bridge towards the domain of novels. Detecting character references alone is also not enough for a complete pipeline that creates social networks from literary texts: each reference must be resolved to its corresponding entity in the fictional world (of a text/work), a process called coreference resolution.
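The network representation described above (characters as nodes, interactions as weighted edges) can be illustrated with a minimal sketch. It assumes that character references have already been resolved to entity identifiers and that each interaction is available as a (speaker, addressee) pair. The function name `build_character_network` and the identifiers `A`, `B`, `C` are hypothetical and not part of DROC or its release formats.

```python
from collections import Counter


def build_character_network(interactions):
    """Count interactions per unordered pair of characters.

    `interactions` is an iterable of (speaker, addressee) pairs of
    entity identifiers (hypothetical input format for illustration).
    Returns a Counter mapping (char_a, char_b) -> edge weight.
    """
    weights = Counter()
    for speaker, addressee in interactions:
        if speaker == addressee:
            continue  # ignore self-directed speech
        # Sort the pair so that A->B and B->A share one undirected edge.
        edge = tuple(sorted((speaker, addressee)))
        weights[edge] += 1
    return weights


# Hypothetical resolved interactions between three characters.
interactions = [("A", "B"), ("B", "A"), ("A", "C")]
network = build_character_network(interactions)

# The nodes are all characters that take part in at least one edge.
nodes = sorted({name for edge in network for name in edge})
```

Here the edge weight is simply the number of interactions. Other weightings, such as the number of tokens spoken, would fit the same structure.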
This work presents the corpus DROC (Deutsches ROman Corpus), which comprises 90 carefully and manually annotated fragments of German novels and includes the following annotations:
1. Each character reference has been marked.
2. Each reference has been assigned to one of four subcategories.
3. Each character reference has been assigned an entity identifier, which corresponds to a coreference annotation.
4. Each passage of direct speech has been manually annotated.
5. The speaker and addressee of each passage of direct speech have been manually marked.
To the best of our knowledge, there is no comparable corpus available to the academic community in the domain of literary texts, especially for German. DROC comprises about 393,000 annotated tokens with more than 50,000 labelled character references.
The paper is structured as follows: first, a brief overview of existing corpora for named entities and coreference resolution is given, followed by a description of the textual sources of the fragments. We continue with a detailed description of our annotation guidelines and the annotation process, including the inter-annotator agreement (IAA). We then explain the two formats in which we release our data and conclude with a brief description of the statistics found in our corpus.