This collection consists of multiple files containing MQM-annotated translations for the following language pairs: English→Spanish, Spanish→English, English→German, and German→English. The data was annotated by professional human translators. All data sets support filtering by text content or by the MQM issue types annotated in them.
The following data sets are available:
- Round 1. This set contains data compiled in the first annotation round. Annotation for this round used a set of MQM issue types that was modified extensively for Round 2 based on analysis of Round 1. Therefore the two data sets are not directly comparable.
  - Spanish→English (includes “adjudicated” data, i.e., data in which various annotations were reconciled to provide an ideal annotation)
  - English→German (includes 18 WMT alternatives for WMT components)
  - German→English (includes “adjudicated” data, i.e., data in which various annotations were reconciled to provide an ideal annotation)
- Round 2. This set contains data compiled in the second annotation round. Annotation for this round used a set of MQM issues fully compliant with the current definition (with the addition of custom extensions for the “Function words” issue type).
All data sets are available as XML. A schema for validating the files is available at http://qt21.eu/deliverables/annotations/qtlp-annotations.xsd. See the comments in the schema for an explanation of the semantics of the various elements and attributes. For assistance with these data sets, please contact email@example.com.
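As a rough illustration of filtering by issue type, the sketch below tallies issue types in a small XML sample and lists the segments that contain a given issue. The element and attribute names used here (`annotations`, `segment`, `target`, `id`, and a `class` attribute carrying the issue type) are placeholders chosen for this example and are not taken from the schema; consult the schema comments for the actual names, and note that real files may use XML namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data: the layout and all names are assumptions
# made for illustration, not the structure defined by the schema.
SAMPLE = """<annotations>
  <segment id="1">
    <target>Das ist <span class="Mistranslation">ein Test</span>.</target>
  </segment>
  <segment id="2">
    <target><span class="Word order">Heute ich <span class="Spelling">gehe</span></span> nach Hause.</target>
  </segment>
</annotations>"""


def count_issue_types(xml_text, attr="class"):
    """Count how often each issue type occurs on <span> elements."""
    root = ET.fromstring(xml_text)
    return Counter(
        span.get(attr) for span in root.iter("span") if span.get(attr)
    )


def segments_with_issue(xml_text, issue, attr="class"):
    """Return the ids of segments containing at least one span of `issue`."""
    root = ET.fromstring(xml_text)
    return [
        seg.get("id")
        for seg in root.iter("segment")
        if any(span.get(attr) == issue for span in seg.iter("span"))
    ]


if __name__ == "__main__":
    print(count_issue_types(SAMPLE))
    print(segments_with_issue(SAMPLE, "Spelling"))
```

Because nested spans are ordinary child elements, `iter("span")` visits them at any depth, so nested issues are counted alongside their enclosing ones.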
Version 2.0 (2014-09-17). This version makes the following changes from previous versions:
- All results are available as XML files.
- The tagging style was switched from pseudo-XML in the annotated target segments to the use of HTML <span> elements.
- Previously the underlying data used empty elements to indicate the start and end of spans, which allowed error annotations to overlap. In this version, annotations were restructured so that they nest as well-formed XML. This restructuring resolved all but three cases, which featured legitimate overlaps. These three cases were resolved by splitting one of the overlapping spans into multiple issues so that the issues could be properly nested. Note that overlaps were apparent in the previous version because the individual files used improperly nested pseudo-XML to display them.
- Spacing within issue tagging was normalized (i.e., leading and trailing spaces were moved out of tagged spans unless the spacing was significant to the annotated issue). This change makes comparison of tagged spans more straightforward, since whitespace differences do not have to be accounted for.
- The HTML output is now produced directly from an XML source using an XSL style sheet, guaranteeing compatibility with the XML source.
- Target segments were renumbered to address cases where the numbering was non-sequential due to reordering of segments. As a result, numbering for Round 1 files is not compatible with the earlier version.
- Version 1.0 files are available at http://qt21.eu/deliverables/annotations/archived/
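The splitting of legitimately overlapping spans described above can be illustrated with a minimal sketch. It assumes spans are represented as half-open character offsets with an issue label; the `Span` type and the `split_to_nest` function are hypothetical names introduced for this example and do not describe the tooling actually used to produce the data. When two spans cross, the later-starting one is cut at the end of the earlier one, so that one piece nests inside the earlier span and the other follows it.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Half-open character range [start, end) tagged with an issue label."""
    start: int
    end: int
    issue: str


def split_to_nest(a: Span, b: Span) -> List[Span]:
    """Return spans covering the same text as a and b, without crossing.

    If the two spans cross (one starts inside the other and ends after
    it), the later-starting span is split at the end of the earlier one.
    Nested or disjoint spans are returned unchanged, earlier span first.
    """
    # Order by start; on equal starts, the longer span comes first.
    first, second = sorted((a, b), key=lambda s: (s.start, -s.end))
    crossing = first.start < second.start < first.end < second.end
    if not crossing:
        return [first, second]
    inside = Span(second.start, first.end, second.issue)  # nests in first
    after = Span(first.end, second.end, second.issue)     # follows first
    return [first, inside, after]


if __name__ == "__main__":
    # Illustrative labels only.
    for span in split_to_nest(Span(0, 10, "Mistranslation"),
                              Span(5, 15, "Word order")):
        print(span)
```

Splitting turns one annotated issue into two, so issue counts derived from split spans can be slightly higher than the number of issues the annotator originally marked.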