Resources

Information on the Multidimensional Quality Metrics (MQM) can be found here.

The central results generated in the QTLaunchPad project consist of tools and corpora:

Tools

translate5

The translate5 tool is an open-source editing, revision, and quality annotation environment. It provides a full-fledged, configurable implementation of MQM and was used to support the annotation campaign in QTLaunchPad. Features include:

  • Support for span-level annotation with MQM issue sets
  • Import of CSV data, including both bilingual (source-target) and multilingual (multiple alternative translations) data sets
  • MQM statistics for both source and target text

A demo of translate5 can be accessed here. Current source code for translate5 is available here (note, however, that the translate5 install and update kit is not yet complete, and only experienced PHP developers and server administrators should attempt to install translate5).

A META-SHARE installation of translate5 that hosts a number of MT data sets (accessible for public viewing) is available here.

A basic tutorial on using translate5 for annotation tasks is available here.


translate5 CSV → HTML converter

A simple open-source PHP tool that takes CSV files exported from translate5 and converts them to a “pretty print” format similar to the one used in the MQM annotated corpora. The tool can be accessed via the QT21 META-SHARE repository here (requires PHP 5 and a server environment or a server emulation environment such as XAMPP).


ILSP Focused Crawler

ILSP-FC is a modular, open-source focused crawler for the automatic acquisition of corpora from the Web. Depending on its configuration, it collects monolingual or bilingual corpora that are either domain-specific or general.

The open-source software can be downloaded here.

To learn more about the ILSP-FC, visit the ILSP-FC homepage.

Relevant public deliverables: D4.3.1. Multilingual corpus acquisition software


QuEst - An open-source tool for translation quality estimation

As Machine Translation (MT) systems become widely adopted both for gisting purposes and for producing professional-quality translations, automatic methods are needed for predicting the quality of a translated segment. This task is referred to as Quality Estimation (QE). Unlike standard MT evaluation metrics, QE metrics do not have access to reference (human) translations; they are aimed at MT systems in use. Developed as part of WP2, QuEst is a framework for building and testing models of translation quality prediction. It consists of two main modules: a feature extraction module and a machine learning module.

The open-source software can be downloaded here.
To learn more about Quality Estimation, click here.
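
The two-module design described above (feature extraction followed by machine learning) can be illustrated with a minimal Python sketch. This is not QuEst's code or API: the function names, the three surface features, and the single-feature least-squares model are all assumptions chosen for brevity, and the training scores are invented for illustration only. Note that no reference translation is used anywhere; the prediction relies only on the source and the MT output.

```python
from statistics import mean


def extract_features(source: str, target: str) -> list[float]:
    """Toy feature extraction (hypothetical features, not QuEst's feature set)."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return [
        float(len(src_tokens)),                     # source length in tokens
        float(len(tgt_tokens)),                     # target length in tokens
        len(tgt_tokens) / max(len(src_tokens), 1),  # target/source length ratio
    ]


def fit_simple_regression(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Toy learning step: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    variance = sum((x - x_bar) ** 2 for x in xs)
    if variance == 0:
        # All inputs identical: fall back to predicting the mean score.
        return 0.0, y_bar
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict(model: tuple[float, float], x: float) -> float:
    """Apply a fitted (slope, intercept) model to one feature value."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Invented (source, MT output, quality score) triples, for illustration only.
    training = [
        ("das ist ein Test", "this is a test", 0.9),
        ("das ist ein sehr langer Satz", "this long", 0.3),
        ("guten Morgen", "good morning", 0.8),
    ]
    ratios = [extract_features(src, tgt)[2] for src, tgt, _ in training]
    scores = [score for _, _, score in training]
    model = fit_simple_regression(ratios, scores)
    new_ratio = extract_features("wie geht es dir", "how are you")[2]
    print(predict(model, new_ratio))
```

A real QE system would use many more features and a stronger learner; the sketch only shows how the output of one module feeds the other.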
 


MQM Scorecard

The MQM Scorecard is a lightweight open-source tool for annotating aligned texts with MQM issues at the segment level. It requires PHP 5.x or later and MySQL.

Features include:

  • Support for uploading a translate5 XML metrics definition file to configure the issues available for annotation
  • Ergonomic UI for easy annotation of both source and target segments
  • Import of tab-delimited translation files with support for limited inline markup
  • Integrated help that displays MQM definitions and examples
  • Robust reporting features
  • Support for displaying XML-structured translation specification files
  • Span-level highlighting
  • Notes at the segment level

NOTE: The Scorecard is currently (as of November 2014) being upgraded with user management features and is temporarily unavailable in a public demo version. To arrange for access to the development server for trial/testing, please contact info [at] qt21.eu. Current source code is available here.

A basic tutorial on using the Scorecard is available here.


Corpora and Test Suites

The QTLaunchPad project provided the following annotated corpora and test suites:

MQM Annotated Corpora

The eight MQM annotated corpora provide the results of expert annotation using an MQM-compliant metric developed for QTLaunchPad in four language pairs (DE→EN, EN→DE, EN→ES, and ES→EN) for both research (WMT) and customer data. These corpora make it possible to study the comparative performance of various types of MT systems, to compare different sorts of data, and to compare the results produced by different annotators. The set of MQM issue types was modified extensively for Round 2 based on analysis of Round 1. As the analysis led to structural changes in MQM, the two data sets are not 100% comparable, but overall trends can be compared. Each segment was annotated by between one and five annotators.

Relevant public deliverables: 1.1.2. TQ Error Corpus; 1.3.1. Barriers for HQMT


MT Test Suite

The MT Test Suite consists of two corpora (EN→DE and DE→EN) containing source segments and their translations that proved difficult for state-of-the-art MT systems. Segments are categorized by the type of system for which they prove difficult. These corpora can be used to test the performance of MT systems against known types of errors. Both corpora are available as filterable HTML and as XML files. The test suite contains data from corpora as well as sentences taken from the TSNLP grammar test suite, which helps to augment the suite with a wide variety of grammatical phenomena.

Relevant public deliverables: 1.4.1. TQ Test Suite  


Domain-Specific Corpora

These data sets contain documents acquired from the web and automatically classified as being in the indicated language(s) and relevant to the listed domain. All data are available under a Creative Commons license. Each document has been classified into one of the following genre categories: "Reference", "News/Journalism", "Discussion", "Commercial" and "Other". Bilingual data sets include automatically aligned sentences that were extracted from pairs of parallel documents.


Additional tools and resources may be accessed via the QT21 META-SHARE Repository.