Generating a translation quality score with MQM

Last updated: 2014 November 21

MQM does not mandate a specific scoring model (in fact, scoring is not required to implement MQM). However, MQM does have a default scoring mechanism, which should be used where feasible in order to provide interoperability between scores. This document outlines the process of generating relevant translation quality scores using MQM.

IMPORTANT NOTE: MQM scores are scores of the translation product (i.e. translated text or content), not the translator. MQM does not assign blame for problems; rather, it is used to indicate that problems occur in the translated product or source text. Translation quality scores generated using MQM should not be blindly applied to translator assessment, although the results may be carefully considered in assessing a translator’s performance. For example, if Internationalization problems occur, they will negatively impact the quality score for the translated content, even though they are beyond the control of the translator. Any metrics used for translator assessment MUST exclude issue types that are beyond the control of the translator.

Severity levels

When applied in an analytic, error-count fashion, MQM provides scores for each issue type used, as well as an overall score and scoring for the major branches used in a metric. Default MQM scoring utilizes the notion of severity, with three default severity levels, each with a multiplier:

Minor (multiplier: 1)
      Minor errors do not impede understanding of the text. A normal reader will “repair” the problem in his/her mind and move on.

Major (multiplier: 10)
      Major errors make the text difficult to understand. Understandability is compromised to the point that the user will not necessarily be able to determine the intended meaning correctly. However, a major error does not make the text as a whole unfit for purpose.

Critical (multiplier: 100)
      Critical errors are ones that, by themselves, cause the translation to fail to meet specifications and render it unusable for the intended audience and purpose. Critical errors MUST be repaired for the translation to be usable.

Using the multipliers, it is apparent that a Major error is counted as 10 times more serious than a Minor error, and a Critical error is 100 times more serious. Most errors should be categorized as Minor or Major, with Critical reserved for errors that, by themselves, would cause a translation to fail to meet specifications.

Penalties

Penalties (P) for an issue type are calculated according to the following default formula, with P normally expressed as a percentage value.

P = (Issues_minor + 10 × Issues_major + 100 × Issues_critical) ÷ Word count

Alternative scoring models may be used, but use of the default scoring model is recommended to promote interoperability and compatibility between metrics.

P values can be calculated for any single issue or group of issues. All penalty values are summed up within a branch. For example, if a metric checks Terminology and Mistranslation in the Accuracy branch and penalties for Terminology = 1.2% and penalties for Mistranslation = 1.4%, then P for the Accuracy branch would equal 2.6%.
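As a sketch, the default penalty calculation can be expressed in Python as follows (the function and variable names here are illustrative, not part of the MQM specification):

```python
# Illustrative sketch of the default MQM penalty formula.
# Names are hypothetical; MQM does not prescribe an implementation.

SEVERITY_MULTIPLIERS = {"minor": 1, "major": 10, "critical": 100}

def penalty(minor, major, critical, word_count):
    """Return the penalty P as a fraction (multiply by 100 for a percentage)."""
    weighted = (SEVERITY_MULTIPLIERS["minor"] * minor
                + SEVERITY_MULTIPLIERS["major"] * major
                + SEVERITY_MULTIPLIERS["critical"] * critical)
    return weighted / word_count

# Example: 6 minor errors and 2 major errors in a 1,000-word text
p = penalty(minor=6, major=2, critical=0, word_count=1000)
print(f"P = {p:.2%}")  # 26 weighted penalty points / 1,000 words = 2.60%
```

Penalties for individual issue types computed this way can then be summed to give the penalty for a branch, as in the Accuracy example above.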

Subscores

Subscores are calculated for top-level branches by summing the penalties for all issues contained within a branch, following the formula given above. If a target-only evaluation is conducted, all penalty points lower the score below an ideal 100%.

However, since penalties can be assessed for both target and source texts (except for Accuracy, for which no source score can be calculated), it is possible to generate separate scores for source and target texts if the source text is also evaluated. If the source text is assessed and significant problems are encountered (which the translator would have to account for), it is recommended that the subscore for a branch be a net penalty, found by subtracting source penalty points from target penalty points, per this formula:

P = Penalties_target − Penalties_source

This formula recognizes that translators may deal with deficient source texts and rewards them for doing so. For example, if a source text has a Fluency P of 2.5% but the target text has Fluency P of 1.2%, then the net penalty for Fluency would be -1.3% (note that lower values of P indicate higher quality). Situations like this one indicate that the translated product shows higher quality than the source content. (Strictly speaking, the score applies to the product, not the translator, but improvements in quality of course indicate that the translator has done a good job.)

Because quality scores are traditionally presented on a percentage scale with 100% the ideal, the quality score for any issue or branch is found by subtracting P from 100%. For example, if Accuracy P is 3.2%, then the quality score for Accuracy is 96.8% (= 100% − 3.2%). For the previous example with a net Fluency P, the quality score for Fluency would be 101.3% (= 100% − (−1.3%)). Quality scores in excess of 100% are possible if the quality of the target text exceeds the quality of the source text.
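The net-penalty and quality-score steps can be sketched as follows, using the Fluency example above (function names are illustrative, not part of the MQM specification):

```python
# Illustrative sketch: net penalty for a branch when the source text is also
# assessed, and the resulting quality score on a percentage scale.
# Names are hypothetical; MQM does not prescribe an implementation.

def net_penalty(target_penalty, source_penalty=0.0):
    """Net penalty: target penalties minus source penalties (if source assessed)."""
    return target_penalty - source_penalty

def quality_score(p):
    """Quality score for an issue or branch: 100% minus the penalty P."""
    return 1.0 - p

# Example from the text: source Fluency P = 2.5%, target Fluency P = 1.2%
p_fluency = net_penalty(target_penalty=0.012, source_penalty=0.025)
print(f"net Fluency P = {p_fluency:.1%}")            # -1.3%
print(f"Fluency score = {quality_score(p_fluency):.1%}")  # 101.3%
```

A negative net penalty, as here, simply means the target text scored better than the source for that branch, which is why scores above 100% are possible.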

Generating an overall score

An overall score can be calculated by subtracting from 100% all net penalties found in calculating the subscores:

TQ = 100% − P_Accuracy − P_Fluency − P_Verity − P_Design − P_Internationalization − P_Compatibility

Where:

TQ = quality score
      The overall rating of quality

P_Accuracy = Penalties for Accuracy (target only)
      Sum of all weighted penalty points assigned in the Accuracy branch

P_Fluency = Net penalties for Fluency
      Sum of all weighted penalty points assigned in the Fluency branch

P_Verity = Net penalties for Verity
      Sum of all weighted penalty points assigned in the Verity branch

P_Design = Net penalties for Design
      Sum of all weighted penalty points assigned in the Design branch

P_Internationalization = Net penalties for Internationalization
      Sum of all weighted penalty points assigned in the Internationalization branch

P_Compatibility = Net penalties for Compatibility
      Sum of all weighted penalty points assigned in the Compatibility branch (NOT RECOMMENDED FOR USE)

Note that because net penalties for all branches except Accuracy can be negative if the source is assessed and found problematic, the overall score in exceptional cases may be greater than 100%.
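Putting the pieces together, the overall TQ calculation can be sketched as follows (the branch penalty values here are invented purely for illustration; names are not part of the MQM specification):

```python
# Illustrative sketch of the overall TQ calculation: 100% minus the (net)
# penalty for each branch used in the metric. Values below are invented.

def overall_score(branch_penalties):
    """TQ = 100% minus the sum of all branch (net) penalties."""
    return 1.0 - sum(branch_penalties.values())

penalties = {
    "Accuracy": 0.032,              # target only
    "Fluency": -0.013,              # net penalty; negative if target beats source
    "Verity": 0.0,
    "Design": 0.005,
    "Internationalization": 0.0,
    "Compatibility": 0.0,
}
tq = overall_score(penalties)
print(f"TQ = {tq:.1%}")  # 100% - 3.2% + 1.3% - 0.5% = 97.6%
```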