TEI Compatibility for ATHEN
In digital humanities communities, TEI-XML is the usual standard for encoding meta-information. Taken in its entirety, the format is extremely complicated. This feature should enable ATHEN to read and write TEI documents.
TEI to XMI Conversion
A TEI document comes as a plain .xml file. If one is available, the user can provide an XML schema in addition to the input document.
The conversion into an .xmi file would proceed as follows:
- If a schema is provided, it is converted into a UIMA typesystem (which is itself an XML file); otherwise, the typesystem is generated by analyzing all the different XML elements and their attributes. Elements are converted into types and attributes into features.
- The resulting typesystem is merged with the existing typesystem of ATHEN.
- All XML elements, along with their attributes, are removed from the text and stored as UIMA annotations with the correct span. All annotations will be of a special type, TEI-XML-Type. This step will fail if the XML document has no text at all: there are XML formats (such as the TueBa/DZ XML format) that store the entire text as attributes in the <word> tag.
- In the first pass through the annotations, all TEI-XML-Type annotations are converted into the corresponding type of the UIMA typesystem. This mapping is logged and saved by the application in order to guarantee reversibility.
- In a second pass through the annotations, all features are interpreted and stored. Two passes are required because features might be references to other annotations.
The resulting document is stored alongside the mapping of TEI elements and attributes to UIMA Types and features.
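The element-stripping step and the schema-less typesystem inference described above can be sketched as follows. This is only an illustration in plain Python using the standard library; it does not use the UIMA API, and the names Annotation, strip_markup and infer_typesystem are hypothetical, not part of ATHEN.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type_name: str   # XML element name (hypothetical stand-in for a UIMA type)
    begin: int       # offset of the first covered character
    end: int         # offset one past the last covered character
    features: dict = field(default_factory=dict)  # raw XML attributes


def strip_markup(xml_string):
    """Return (plain_text, annotations) for an XML string."""
    root = ET.fromstring(xml_string)
    chunks = []       # text pieces in document order
    annotations = []
    offset = 0

    def visit(element):
        nonlocal offset
        # Note: for namespaced documents, element.tag has the form "{uri}name".
        annotation = Annotation(element.tag, offset, offset, dict(element.attrib))
        annotations.append(annotation)
        if element.text:
            chunks.append(element.text)
            offset += len(element.text)
        for child in element:
            visit(child)
            # Tail text follows the child but lies inside the parent span.
            if child.tail:
                chunks.append(child.tail)
                offset += len(child.tail)
        annotation.end = offset

    visit(root)
    return "".join(chunks), annotations


def infer_typesystem(annotations):
    """Collect, per element name, the set of attribute names that were seen."""
    types = {}
    for annotation in annotations:
        types.setdefault(annotation.type_name, set()).update(annotation.features)
    return types


text, annos = strip_markup('<p>Hello <name ref="#w1">World</name>!</p>')
print(text)                                   # Hello World!
print(text[annos[1].begin:annos[1].end])      # World
```

Each element name becomes a candidate type and each attribute name a candidate feature, which mirrors the fallback used when no schema is provided.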
Note to step 5
In this step, the converter should try to find links to other annotations, either by analyzing the schema or by comparing the attribute value to existing element ids.
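The id-comparison variant of this idea could look roughly like the sketch below. It assumes that annotations are plain dictionaries, that ids are stored in attributes named xml:id or id, and that references may carry a leading "#"; all of these are assumptions made for illustration, and resolve_links is a hypothetical name.

```python
ID_ATTRIBUTES = ("xml:id", "id")  # assumption: attribute names that carry element ids


def resolve_links(annotations):
    """Return {(annotation_index, feature_name): target_index} for every
    attribute value that matches the id of an annotation."""
    # First collect all known ids, which is why a separate pass is needed.
    index_by_id = {}
    for index, annotation in enumerate(annotations):
        for name in ID_ATTRIBUTES:
            if name in annotation["features"]:
                index_by_id[annotation["features"][name]] = index

    links = {}
    for index, annotation in enumerate(annotations):
        for name, value in annotation["features"].items():
            if name in ID_ATTRIBUTES:
                continue
            # Pointers are often written as "#id", so drop the leading "#".
            target = index_by_id.get(value.lstrip("#"))
            if target is not None:
                links[(index, name)] = target
    return links


annotations = [
    {"type": "person", "features": {"xml:id": "p1"}},
    {"type": "name", "features": {"ref": "#p1", "type": "forename"}},
]
links = resolve_links(annotations)
```

Here the ref feature of the second annotation is recognized as a link to the first one, while type stays an ordinary string value because no element has the id "forename".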
In summary, the process can be depicted as follows:
XMI to TEI Conversion
The reverse process is a little bit harder:
- If a previously created mapping is available, we can use it to revert the process (as long as a schema is available).
- If neither a mapping nor a schema is available => ????
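For the first case, where a saved mapping exists, re-inlining the annotations might be sketched like this. The sketch assumes that annotations are properly nested (no crossing spans), that the mapping is a simple dictionary from UIMA type name to XML element name, and that all feature values are strings; to_inline_xml is a hypothetical name. The order of zero-length elements relative to their neighbours is not guaranteed to match the original document.

```python
from xml.sax.saxutils import escape, quoteattr


def to_inline_xml(text, annotations, type_to_element):
    """Serialize properly nested stand-off annotations as inline XML.

    annotations: list of (begin, end, uima_type, features) tuples
    type_to_element: saved mapping, UIMA type name -> XML element name
    """
    # Outer annotations first: earlier begin, and for equal begin the longer span.
    ordered = sorted(annotations, key=lambda a: (a[0], -a[1]))
    parts = []
    open_ends = []  # stack of (end, element_name) for currently open elements
    position = 0    # offset up to which the text has been written

    def close_until(limit):
        nonlocal position
        while open_ends and open_ends[-1][0] <= limit:
            end, name = open_ends.pop()
            parts.append(escape(text[position:end]))
            position = end
            parts.append(f"</{name}>")

    for begin, end, uima_type, features in ordered:
        close_until(begin)
        if open_ends and end > open_ends[-1][0]:
            raise ValueError("crossing annotations cannot be nested as XML")
        parts.append(escape(text[position:begin]))
        position = begin
        name = type_to_element[uima_type]
        attributes = "".join(
            f" {key}={quoteattr(value)}" for key, value in features.items()
        )
        parts.append(f"<{name}{attributes}>")
        open_ends.append((end, name))

    close_until(len(text))
    parts.append(escape(text[position:]))
    return "".join(parts)


xml = to_inline_xml(
    "Hello World!",
    [(0, 12, "P", {}), (6, 11, "Name", {"ref": "#w1"})],
    {"P": "p", "Name": "name"},
)
print(xml)  # <p>Hello <name ref="#w1">World</name>!</p>
```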
Special functionalities
- If we happen to have multiple documents with the same text, it should be possible to aggregate the information stored in those documents one after another and save it in a single .xmi. This can afterwards be converted into an aggregate TEI document (if the problems regarding the backwards conversion are solved).
- A convenience method to convert annotations in ATHEN into other types. The user needs to be able to create a mapping of types and features that get converted into each other. This would, for example, make it possible to parse TEI documents in a primitive way and then convert the resulting annotations into existing types (such as the ones used in DKPro or in ATHEN).
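Both ideas can be illustrated with a small sketch. It assumes documents and annotations are plain dictionaries and that the user mapping has the shape source type -> (target type, feature renames); the function names and the type names in the example are hypothetical and do not refer to actual DKPro or ATHEN types.

```python
def aggregate(documents):
    """Merge the annotations of documents that share exactly the same text."""
    text = documents[0]["text"]
    merged = []
    for document in documents:
        if document["text"] != text:
            raise ValueError("documents must contain identical text")
        for annotation in document["annotations"]:
            if annotation not in merged:   # skip exact duplicates
                merged.append(annotation)
    return {"text": text, "annotations": merged}


def convert_types(annotations, type_mapping):
    """Rename types and features according to a user-defined mapping.

    type_mapping: source type -> (target type, {source feature: target feature})
    Annotations whose type is not in the mapping are kept unchanged.
    """
    converted = []
    for annotation in annotations:
        if annotation["type"] not in type_mapping:
            converted.append(annotation)
            continue
        target_type, feature_mapping = type_mapping[annotation["type"]]
        features = {
            feature_mapping.get(name, name): value
            for name, value in annotation["features"].items()
        }
        converted.append({**annotation, "type": target_type, "features": features})
    return converted


doc_a = {"text": "Hello World!", "annotations": [
    {"type": "persName", "begin": 6, "end": 11, "features": {"ref": "#w1"}}]}
doc_b = {"text": "Hello World!", "annotations": [
    {"type": "s", "begin": 0, "end": 12, "features": {}}]}

combined = aggregate([doc_a, doc_b])
mapping = {"persName": ("example.Person", {"ref": "identifier"})}  # hypothetical types
result = convert_types(combined["annotations"], mapping)
```

Keeping unmapped annotations unchanged is a design choice of this sketch; dropping them or reporting them to the user would be equally plausible.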