Contributors: Aljoscha Burchardt (DFKI), Kim Harris (text&form), Arle Lommel (DFKI), Katrin Marheinecke (text&form),
Maja Popović (DFKI), Thomas Senf (text&form), Nicole Tielker (DFKI), Hans Uszkoreit (DFKI)
This resource contains a table of machine-translated segments that exhibit errors. Its purpose is to provide a set of segments showing typical MT issues and errors, so that MT developers can compare how well their systems perform on the same input.
It consists of two parts:
Each test suite contains two sorts of segments:
For the TSNLP data, one MT result was selected from among those systems that were considered to exhibit barriers. This segment was the one judged by the group of linguists to come the closest to “getting it right”. For the corpus data, the translation in the corpus was used. In both cases the translation was annotated using MQM to identify issues and post-edited to show one possible way to resolve the issues. (Note that the post-editing was intended to be minimal, with only enough changes to make the sentence grammatical and acceptable. Full post-editing in many cases would result in more substantive changes in sentence structure, but the goal was not to create a stylistically perfect text.)
For the corpus data, no information is provided as to which system type translated the segment, for which system type(s) the segments proved to be a barrier, or the TSNLP class.
The data in the test suite files are arranged in the following columns:
The data were annotated primarily using the same set of MQM issues used for the second round of the QTLaunchPad MQM Annotated Corpora. The list of issues and guidelines for annotators are available at http://qt21.eu/downloads/annotatorsGuidelines-2014-06-11.pdf. In addition, examples of selected additional MQM issue types were added. For the full list of issues found, please see the filter settings for the specific test suites.
The data sets provide filtering options. It is possible to show corpus data, TSNLP data, or both. In addition, a full-text search can be run over the annotated segments (the search covers source texts, target texts, and post-edited segments), and results can be filtered by the combination of MQM issues annotated in their content. Clicking on a row header sorts the currently visible results (e.g., by the type of system for which the content is a barrier).
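For readers who want to reproduce this kind of filtering on an exported copy of the data, the following is a minimal sketch in Python, not the code behind the data sets. The `Segment` record and all of its field names (`origin`, `source`, `target`, `post_edit`, `issues`, `barrier_for`) are hypothetical stand-ins for the actual columns. The sketch also assumes that filtering by a combination of MQM issues means a segment must carry every selected issue.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    # Hypothetical record layout; the field names are illustrative only
    # and do not reflect the actual column names in the test suite files.
    origin: str        # "corpus" or "TSNLP"
    source: str        # source-language segment
    target: str        # machine-translated segment
    post_edit: str     # minimally post-edited segment
    issues: frozenset = field(default_factory=frozenset)  # MQM issue names
    barrier_for: str = ""  # system type; left empty for corpus data


def filter_segments(segments, origins=("corpus", "TSNLP"), query="", issues=()):
    """Keep segments that match the origin, the text query, and all issues."""
    needle = query.casefold()
    wanted = set(issues)
    result = []
    for seg in segments:
        if seg.origin not in origins:
            continue
        # The full-text search covers source, target, and post-edited text.
        haystack = " ".join((seg.source, seg.target, seg.post_edit)).casefold()
        if needle and needle not in haystack:
            continue
        # Assumption: every selected issue must be annotated on the segment.
        if not wanted <= seg.issues:
            continue
        result.append(seg)
    return result


def sort_by_barrier(segments):
    """Sort the visible results by the system type they are a barrier for."""
    return sorted(segments, key=lambda seg: seg.barrier_for)
```

A call such as `filter_segments(rows, origins=("TSNLP",), issues=["Agreement"])` would keep only TSNLP segments annotated with that issue. Requiring every selected issue is one possible reading of the filter; an any-of match would test `wanted & seg.issues` instead.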
NOTE: These resources will be updated from time to time. The date of the latest update can be found in each data set.
The following changes were made on the dates listed:
This resource was prepared as part of the Coordination and Support Action “Preparation and Launch of a Large-scale Action for Quality Translation Technology (QTLaunchPad)” (Deliverable 1.4.1. QT Test Suite). This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 296347.