Data formats

Last updated: 2015 December 09

The MQM Scorecard uses three kinds of external files: bitexts, XML metrics definition files, and XML structured translation specifications files. This document describes each of them.

Bitexts (required)

The MQM Scorecard is used to assess aligned bitexts, i.e., source and target texts aligned as source-target pairs of segments. The Scorecard does not define a segmentation method; it uses whatever segmentation is present in the file. A bitext to be evaluated MUST meet the following requirements:

  • The bitext must be saved either as UTF-8 or as ASCII with non-ASCII characters written as HTML character entities. Other encodings will not display correctly.
  • Segments are separated by newline characters. As a result, source and target segments must not contain newline characters, although the <br />/<br> tag may be included within segments.
  • Source and target segments are separated by a tab character. As a result, segments must not contain tab characters. If a tab character is present within a segment, that segment will not display properly.
  • Segments may contain limited HTML inline markup. Most HTML markup is stripped because it could introduce security risks (e.g., via uploaded JavaScript) or break Scorecard functionality (e.g., via included links that would take the user to other locations). The following tags are allowed:
    • <span> with a style attribute
    • <em> or <i>
    • <strong> or <b>
    • <sub>
    • <sup>
    • <code>

Note that use of a style attribute can lead to unpredictable or undesirable results and this capability should be used with caution.
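
As an illustration of the tag restrictions above, the following Python sketch reports tags that are not on the allowed list. It is not the Scorecard's own sanitizer, whose behavior is not documented here; the names ALLOWED_TAGS and disallowed_tags are hypothetical, and the sketch checks tag names only (it does not verify that <span> carries only a style attribute).

```python
import re

# Tag names taken from the allowed list above. <br> is included because
# the segment rules state that it may appear within segments.
ALLOWED_TAGS = {"span", "em", "i", "strong", "b", "sub", "sup", "code", "br"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_PATTERN = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def disallowed_tags(segment: str) -> list[str]:
    """Return the sorted names of tags in `segment` that are not allowed."""
    found = (match.group(1).lower() for match in TAG_PATTERN.finditer(segment))
    return sorted({name for name in found if name not in ALLOWED_TAGS})
```

Running such a check before upload can show which markup is likely to be stripped, for example links or script elements.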

Example

The following would be a valid tab-delimited bitext file for the scorecard containing three segments:

This is the first line	Ez az első sor
This line has an <span style="font-size:150%;">HTML tag</span>	Ennek a sornak van egy <span style="font-size:150%;">HTML-elem</span>
And this one has a<br />line break	Még ennek van egy<br />sortörés
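
A file in this layout can be checked before upload with a short script. The Python sketch below is only an illustration of the rules above, not part of the Scorecard; the function name parse_bitext is hypothetical, and skipping blank lines is an assumption rather than documented Scorecard behavior.

```python
def parse_bitext(text: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split tab-delimited bitext into (source, target) pairs.

    Returns the pairs plus a list of messages for lines that do not
    contain exactly one tab character.
    """
    pairs = []
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 1 tab, found {len(fields) - 1}"
            )
            continue
        pairs.append((fields[0], fields[1]))
    return pairs, problems
```

To check a file on disk, read it with open(path, encoding="utf-8") and pass the contents to parse_bitext; any entry in the returned problem list points to a line that would not display properly.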

XML metrics definition file (required)

The Scorecard currently defines a metric using an XML file in the same format as the one used to configure translate5. Issue types are nested to express the issue hierarchy. A sample configuration file is provided here:

<issues>
   <issue type="Accuracy">
      <issue type="Mistranslation">
         <issue type="Terminology"/>
      </issue>
      <issue type="Omission"/>
      <issue type="Addition"/>
      <issue type="Untranslated"/>
   </issue>
   <issue type="Fluency">
      <issue type="Register"/>
      <issue type="Style"/>
      <issue type="Inconsistency"/>
      <issue type="Spelling"/>
      <issue type="Typography"/>
      <issue type="Grammar"/>
      <issue type="Locale violation"/>
      <issue type="Unintelligible"/>
   </issue>
   <issue type="Verity">
      <issue type="Completeness"/>
      <issue type="Legal requirements"/>
      <issue type="Locale applicability"/>
   </issue>
</issues>

Note that this file format will be updated in future versions of the Scorecard to support issue weights.
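
To inspect the hierarchy a metrics file defines, the nested issue elements can be flattened into paths. The following Python sketch uses the standard library's xml.etree.ElementTree; the function name issue_paths is hypothetical, and the sketch assumes every issue element has a type attribute, as in the sample above.

```python
import xml.etree.ElementTree as ET


def issue_paths(xml_text: str) -> list[str]:
    """Return each issue type as a path, e.g. 'Accuracy/Mistranslation'."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, prefix):
        # Visit direct <issue> children only, so nesting depth is preserved.
        for child in element.findall("issue"):
            name = child.get("type")
            path = f"{prefix}/{name}" if prefix else name
            paths.append(path)
            walk(child, path)

    walk(root, "")
    return paths
```

Applied to the sample file, the first entries would be Accuracy, Accuracy/Mistranslation, and Accuracy/Mistranslation/Terminology.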

XML Structured Translation Specifications (optional)

Although a specifications file is not required to use the Scorecard, it is recommended because it lets the annotator consult the original specifications to determine whether a problem violates them. The file provides a full translation specification corresponding to the specifications defined in ASTM F2575:2014. While a full description of the XML file format and contents is beyond the scope of this document, the following example will help users create their own specifications.

<structuredTranslationSpecifications id="5c30540f-2e6a-463d-8cc6-f6e217498d01" type="standalone">
   <source>
      <textualCharacteristics>
         <language status="accepted">en-029</language>
         <textType status="accepted">standard, Word document (presented as plain text for this assessment)</textType>
         <audience status="accepted">agricultural producers, exporters/importers of mangos, inspectors, public (secondary)</audience>
         <purpose status="accepted">inform audience of regulatory requirements for quality of agricultural products</purpose>
      </textualCharacteristics>
      <specializedLanguage>
         <subjectField status="accepted">agricultural products</subjectField>
         <terminology status="accepted">general agricultural terminology</terminology>
      </specializedLanguage>
      <volume status="accepted">1839 words / 9863 characters</volume>
      <complexity status="accepted">No particular complexity other than some tabular data</complexity>
      <origin>
         <author status="accepted">CROSQ standards committee</author>
         <isTranslation status="accepted">no</isTranslation>
      </origin>
   </source>
   <target>
      <targetLanguageInformation>
         <language status="accepted">fr-FR</language>
         <targetTerminology status="accepted">No specified termbase</targetTerminology>
      </targetLanguageInformation>
      <audience status="accepted">Same as source</audience>
      <purpose status="accepted">Same as source</purpose>
      <contentCorrespondence status="accepted">Full, covert translation</contentCorrespondence>
      <register status="accepted">The text should be understandable to and written at a level appropriate for general readers with a high-school-level education.</register>
      <fileFormat status="accepted">Same as source</fileFormat>
      <modality status="accepted">Print</modality>
      <style>
         <styleGuide status="accepted">None, follow source</styleGuide>
         <styleRelevance status="accepted">low</styleRelevance>
      </style>
      <layout status="accepted">Same as source</layout>
   </target>
   <production>
      <typicalTasks>
         <preparation status="accepted">The translator should read through the entire source text prior to translation and resolve any questions prior to translation</preparation>
         <initialTranslation status="accepted">Human translation. MT is not acceptable.</initialTranslation>
         <qualityAssurance status="accepted">In-house revision with feedback from creator.<br />Translation will be assessed by third-party human reviewers using an online “scorecard” application.</qualityAssurance>
      </typicalTasks>
      <additionalTasks status="accepted">Terminology adherence shall be corrected as part of the revision process</additionalTasks>
   </production>
   <environment>
      <technology status="accepted">Translation to be done in Microsoft Word. Terminology to be viewed in Excel</technology>
      <referenceMaterials status="accepted">Other CROSQ standards may be consulted as necessary.</referenceMaterials>
      <workplaceRequirements status="accepted">None</workplaceRequirements>
   </environment>
   <relationships>
      <permissions>
         <copyright>CROSQ</copyright>
         <recognition status="undetermined"/>
         <restrictions>None</restrictions>
      </permissions>
      <submissions>
         <qualifications status="undetermined"/>
         <deliverables status="undetermined"/>
         <delivery status="undetermined"/>
         <deadline status="undetermined"/>
      </submissions>
      <expectations status="undetermined">
         <compensation status="undetermined"/>
         <communication>Consult with manager</communication>
      </expectations>
   </relationships>
</structuredTranslationSpecifications>
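
When drafting a specifications file, it can help to see which fields are still open. The Python sketch below lists the element names that carry a given status value. The function name fields_with_status is hypothetical, and "accepted" and "undetermined" are simply the two values that appear in the example above; the full set of permitted values is not described here.

```python
import xml.etree.ElementTree as ET


def fields_with_status(xml_text: str, status: str) -> list[str]:
    """Return, in document order, the names of elements whose status
    attribute equals `status`."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter() if element.get("status") == status]
```

For the example above, calling fields_with_status(xml_text, "undetermined") would include recognition, qualifications, deliverables, delivery, deadline, expectations, and compensation.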

NOTE: The old QTLaunchPad Metric Builder tool’s output does not conform to the requirements for the current Scorecard. A new Metric Builder is under development.