AI Eval

Frequently Asked Questions

MT systems do not always produce accurate translations. They often lack common-sense knowledge, which can lead them to produce grammatically correct yet nonsensical output. They may fail to handle idioms and complex sentences properly, and they can miss cultural context, producing culturally inappropriate translations. Furthermore, due to limited resources and training data, they may struggle to produce output in non-standard dialects.
Different evaluation metrics are designed for different tasks and work in different ways. For example, BLEU looks for exact word matches between the reference and candidate files, but it cannot match different forms of the same word. chrF++ looks for character matches and so captures subtleties that BLEU may overlook. METEOR also looks for word matches, but it can reduce a word to its base form before matching. Thus, every metric functions differently and suits a different purpose. The field of evaluation is constantly evolving, with researchers working to improve existing metrics and develop new ones.
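For readers who want to reproduce such scores locally, the sketch below shows one way to compute BLEU, chrF++, and METEOR in Python, assuming the sacrebleu and nltk packages are installed. The example sentences and variable names are purely illustrative and are not the tool's own code.

```python
# Minimal sketch: computing BLEU, chrF++, and METEOR with sacrebleu and nltk.
import sacrebleu
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet for stem/synonym matching
nltk.download("omw-1.4", quiet=True)

references = ["The cat is sleeping on the mat."]
candidates = ["The cats are sleeping on the mat."]

# Corpus-level BLEU and chrF++ (chrF with word n-grams of order 2).
bleu = sacrebleu.corpus_bleu(candidates, [references])
chrf = sacrebleu.corpus_chrf(candidates, [references], word_order=2)

# METEOR works on tokenised sentences and can match inflected forms via stemming.
meteor = meteor_score([references[0].split()], candidates[0].split())

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf.score:.2f}")
print(f"METEOR: {meteor:.2f}")
```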
Transcription is the process of converting speech to text. This can involve converting either recorded speech or live speech to text. Automatic Speech Recognition (ASR) systems recognise speech and produce transcripts of it.
Word Error Rate (WER) is calculated by summing the substitutions, insertions, and deletions in the ASR-generated transcript compared to the original/reference transcript, and dividing by the number of words in the reference. A substitution is a changed word, e.g. the ASR showing 'paint it' when the speaker said 'painting'. An insertion is when the ASR adds hallucinatory words that the speaker never uttered. A deletion is when the ASR fails to recognise something the speaker said, so it is missing from the transcript.
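The sketch below illustrates this calculation in Python (not the tool's own implementation): it aligns the reference and ASR transcripts word by word using edit distance, counts the edits, and divides by the number of reference words. The example sentences are hypothetical.

```python
# Minimal sketch of the WER computation described above.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()

    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # exact match, no edit
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j - 1],             # substitution
                    d[i - 1][j],                 # deletion
                    d[i][j - 1],                 # insertion
                )

    return d[len(ref)][len(hyp)] / len(ref)

# The speaker said "painting"; the ASR heard "paint it".
print(word_error_rate("she is painting the wall", "she is paint it the wall"))
# 2 edits (one substitution + one insertion) over 5 reference words -> 0.4
```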
BLEU looks for exact word matches between the candidate and reference files. It counts a word as a mismatch even when it shares the same stem as the reference word but carries different affixes. This rigid comparison criterion tends to make BLEU scores lower than those of other metrics.
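As a small illustration, the snippet below uses sacrebleu's sentence-level BLEU (an assumed tool, not the evaluator's own code): an exact copy of the reference scores 100, while a candidate that differs only by an inflected form drops sharply.

```python
# Illustration of BLEU's exact-match behaviour at sentence level.
import sacrebleu

reference = ["The children are playing in the garden."]

exact = sacrebleu.sentence_bleu("The children are playing in the garden.", reference)
inflected = sacrebleu.sentence_bleu("The children are played in the garden.", reference)

print(exact.score)      # 100.0: every word matches exactly
print(inflected.score)  # noticeably lower: "played" shares the stem "play" but is not an exact match
```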
No, there is no limit. The model has been tested on 6000 sentences.
Aligning the candidate and reference files means that the sentences in both files appear in the same order and format. This is required because the metrics read the files in parallel, line by line, to look for matches.
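A quick way to sanity-check alignment before uploading is to confirm that both files contain the same number of lines, as in the sketch below; the file names reference.txt and candidate.txt are placeholders.

```python
# Illustrative alignment check: both files must have the same number of lines,
# with line N of one corresponding to line N of the other.
with open("reference.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("candidate.txt", encoding="utf-8") as f:
    candidates = [line.strip() for line in f]

assert len(references) == len(candidates), (
    f"Files are not aligned: {len(references)} reference lines "
    f"vs {len(candidates)} candidate lines"
)

# With aligned files, each (reference, candidate) pair is scored together.
print(references[0], "<->", candidates[0])
```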
Both uploaded files should have the .txt extension.
The BLEU, chrF++, and METEOR metrics, though widely used, have some limitations. BLEU does not measure meaning; it focuses on matching word combinations and therefore disregards semantics. It also heavily penalises synonyms that are not listed in the reference translations, or treats them as unknown if they do not occur at least twice in the test set. chrF++ can sometimes rate nonsensical translations as precise, or as close to human translations, if the relevant two-word combinations appear in the reference translations. chrF++ has also not been exhaustively tested on Indian languages, which are exceptionally morphologically rich; this raises concerns about its reliability for translation software dealing with Indian languages. METEOR, though more sophisticated, is computationally expensive, requires language-specific resources, and can be overly lenient with word order.
It's very easy! Just upload your aligned reference and candidate files, click Upload, and you'll get your BLEU, chrF++, and METEOR scores. To check the accuracy of an ASR system, follow the same process: upload your aligned reference and candidate transcripts and you will get the Word Error Rate.