Resources
Information on the Multidimensional Quality Metrics (MQM) can be found here.
The central results generated in the QTLaunchPad project consist of tools and corpora:
Tools
translate5
The translate5 tool is an open-source editing, revision, and quality annotation environment. It provides a full-fledged, configurable implementation of MQM and was used to support the annotation campaign in QTLaunchPad. Features include:
- Support for span-level annotation with MQM issue sets
- Import of CSV data, including both bilingual (source-target) and multilingual (multiple alternative translations) data sets
- MQM statistics for both source and target text
A demo of translate5 can be accessed here. Current source code for translate5 is available here (note, however, that the translate5 install and update kit is not yet complete, and only experienced PHP developers and server administrators should attempt to install translate5).
A META-SHARE installation of translate5 that hosts a number of MT data sets (accessible for public viewing) is available here.
A basic tutorial on using translate5 for annotation tasks is available here.
translate5 CSV → HTML converter
A simple open-source PHP tool that takes CSV files exported from translate5 and converts them to a “pretty print” format similar to the one used in the MQM annotated corpora. The tool can be accessed via the QT21 META-SHARE repository here (requires PHP 5 and a server environment or a server emulation environment such as XAMPP).
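To illustrate the kind of conversion described above, here is a minimal Python sketch that turns CSV text into an HTML table. It is not the PHP converter itself. The function name `csv_to_html_table` and the column names in the usage example are hypothetical, and the actual translate5 export layout and the real "pretty print" output may differ.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str) -> str:
    """Render CSV text as a plain HTML table.

    Assumption: the first row is a header row. The real translate5
    export format is not reproduced here.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"

    header, body = rows[0], rows[1:]
    lines = ["<table>"]
    # Escape every cell so that markup inside segments is shown literally.
    lines.append(
        "  <tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>"
    )
    for row in body:
        lines.append(
            "  <tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-column export: one source segment and one translation.
    sample = 'source,target\n"a < b","x, y"\n'
    print(csv_to_html_table(sample))
```

Escaping each cell with `html.escape` matters for translation data, because segments often contain angle brackets or ampersands that would otherwise be interpreted as markup.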
ILSP Focused Crawler
ILSP-FC is a modular, open-source focused crawler for the automatic acquisition of monolingual or bilingual corpora from the Web. Depending on its configuration, it collects either domain-specific or general corpora.
The open-source software can be downloaded here.
To learn more about the ILSP-FC, visit the ILSP-FC homepage.
Relevant public deliverables: D4.3.1. Multilingual corpus acquisition software
QuEst - An open source tool for translation quality estimation
As Machine Translation (MT) systems become widely adopted both for gisting purposes and for producing professional-quality translations, automatic methods are needed for predicting the quality of a translated segment. This task is referred to as Quality Estimation (QE). Unlike standard MT evaluation metrics, QE metrics do not have access to reference (human) translations; they are aimed at MT systems in use. Developed as part of WP2, QuEst is a framework for building and testing models of translation quality prediction. It consists of two main modules: a feature extraction module and a machine learning module.
The open-source software can be downloaded from here.
To learn more about Quality Estimation, click here.
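The two-module design described above (feature extraction followed by machine learning) can be sketched in a few lines of Python. This is a toy illustration only: it does not use QuEst's code, feature set, or learning algorithms. The function names, the three length-based features, and the training scores are all assumptions made for the example. Note that no reference translation is used, only the source and the MT output.

```python
def extract_features(source: str, target: str) -> list[float]:
    """Feature extraction step: map a (source, MT output) pair to numbers.

    Toy features (assumed for illustration): source length, target length,
    and target/source length ratio, all in whitespace-separated tokens.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    return [
        float(len(src_tokens)),
        float(len(tgt_tokens)),
        len(tgt_tokens) / max(len(src_tokens), 1),
    ]


def fit_least_squares(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Learning step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def predict(model: tuple[float, float], x: float) -> float:
    """Apply a fitted (slope, intercept) model to one feature value."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical training data: (source, MT output, human quality score).
    training = [
        ("das ist gut", "this is good", 1.0),
        ("das ist ein langer Satz", "this long", 3.0),
        ("ein Haus", "a house is a house is", 4.0),
    ]
    # Use only the length-ratio feature (index 2) to keep the model tiny.
    xs = [extract_features(s, t)[2] for s, t, _ in training]
    ys = [score for _, _, score in training]
    model = fit_least_squares(xs, ys)
    print(predict(model, extract_features("ein Test", "a test")[2]))
```

Keeping the two steps separate mirrors the modular design: the feature extractor can be extended with richer features, and the learner can be swapped for a stronger regression model, without either change affecting the other.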
MQM Scorecard
The MQM Scorecard is a lightweight open-source tool for annotating aligned texts with MQM issues at the segment level. The MQM Scorecard requires PHP 5.x or later and MySQL.
Features include:
- Support for uploading a translate5 XML metrics definition file to configure the issues available for annotation
- Ergonomic UI for easy annotation of both source and target segments
- Import of tab-delimited translation files with support for limited inline markup
- Integrated help displays MQM definitions and examples
- Robust reporting features
- Support for displaying XML structured translation specifications files
- Span-level highlighting
- Notes at the segment level
NOTE: The Scorecard is currently (as of November 2014) being upgraded with user management features and is temporarily unavailable in a public demo version. To arrange for access to the development server for trial/testing, please contact info [at] qt21.eu. Current source code is available here.
A basic tutorial on using the Scorecard is available here.
Corpora and Test Suites
The QTLaunchPad project provided the following annotated corpora and test suites:
MQM Annotated Corpora
The eight MQM annotated corpora provide the results of expert annotation using an MQM-compliant metric developed for QTLaunchPad in four language pairs (DE→EN, EN→DE, EN→ES, and ES→EN) for both research (WMT) and customer data. These corpora make it possible to study the comparative performance of various types of MT systems, to compare different sorts of data, and to compare the results produced by different annotators. The set of MQM issue types was modified extensively for Round 2 based on analysis of Round 1. As the analysis led to structural changes in MQM, the two data sets are not 100% comparable, but overall trends can be compared. Individual segments were annotated by between one and five annotators.
- Round 1:
  - English→Spanish
  - Spanish→English (includes “adjudicated” data, i.e., data in which various annotations were reconciled to provide an ideal annotation)
  - English→German (includes 18 WMT alternatives for segments taken from WMT)
  - German→English (includes “adjudicated” data, i.e., data in which various annotations were reconciled to provide an ideal annotation)
- Round 2:
Relevant public deliverables: 1.1.2. TQ Error Corpus • 1.3.1. Barriers for HQMT
MT Test Suite
The MT Test Suite consists of two corpora (EN→DE and DE→EN) containing source segments and their translations that proved difficult for state-of-the-art MT systems. Segments are categorized by the type of system for which they prove difficult. These corpora can be used to test the performance of MT systems against known types of errors. Both corpora are available as filterable HTML and as XML files. The test suite contains data from corpora as well as sentences taken from the TSNLP grammar test suite, which helps to augment the suites with a wide variety of grammatical phenomena.
Relevant public deliverables: 1.4.1. TQ Test Suite
Domain-Specific Corpora
These data sets contain documents acquired from the web and automatically classified as being in the indicated language(s) and relevant to the listed domain. All data are available under a Creative Commons license. Each document has been classified into one of five genre categories: "Reference", "News/Journalism", "Discussion", "Commercial", and "Other". Bilingual data sets include automatically aligned sentences that were extracted from pairs of parallel documents.
- Monolingual Corpora
- Medical Domain
- Automotive Domain
- Bilingual Corpora
Additional tools and resources may be accessed via the QT21 META-SHARE Repository.