AI Eval

About

AI Eval is designed to calculate accuracy scores for Machine Translation (MT) systems and Automatic Speech Recognition (ASR) systems. The aim is to raise public awareness of the quality and performance of MT and ASR systems. Its components are described below.

Machine Translation Evaluation

With advances in technology and in the tools needed to process data in many native languages, many platforms now offer translation features, making information accessible to speakers of multiple languages. However, these translation tools are not perfect: they sometimes make mistakes or produce awkward phrasing that fails to capture the original meaning (Gala et al., 2023). This is where translation evaluation comes in. We present three tools, BLEU, METEOR and chrF++, which provide scores that can be used to measure the accuracy of translations produced by different software and platforms.
Each metric has its strengths and limitations. A translation might sound fine to a human reader yet receive a low score, or vice versa, because these metrics measure specific aspects of a translation rather than how natural it sounds. Since each metric evaluates translations differently, their scores for the same translation can differ considerably. Before proceeding with the accuracy calculation, keep these general trends in mind. For ease of comparison, all scores are reported on a scale of 0 to 100.

NLTK (Natural Language Toolkit) is a Python library for working with human language data. The developers used functions such as meteor_score(), sentence_bleu() and corpus_chrf() from the nltk.translate package, together with word_tokenize() from nltk.tokenize, to calculate the accuracy scores.
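As a rough illustration (not the project's actual code), these NLTK functions can be combined as follows to score a single candidate sentence against a human reference; the example sentences and the 0-100 rescaling are assumptions made for demonstration:

    # A minimal sketch combining the NLTK functions named above to score one
    # candidate sentence against one human reference.
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from nltk.translate.chrf_score import corpus_chrf

    # One-time resource downloads: tokenizer models and WordNet for METEOR
    # (some NLTK versions also need 'punkt_tab' and 'omw-1.4').
    nltk.download("punkt")
    nltk.download("wordnet")

    reference = "The weather is pleasant today."   # human translation
    candidate = "Today the weather is pleasant."   # machine translation

    ref_tokens = word_tokenize(reference)
    cand_tokens = word_tokenize(candidate)

    # BLEU: n-gram precision against the reference (smoothed for short sentences).
    bleu = sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)

    # METEOR: unigram alignment that also credits stems and WordNet synonyms.
    meteor = meteor_score([ref_tokens], cand_tokens)

    # chrF: character n-gram F-score computed over the raw strings.
    chrf = corpus_chrf([reference], [candidate])

    # Report every score on a 0-100 scale, as the website does.
    for name, score in [("BLEU", bleu), ("METEOR", meteor), ("chrF", chrf)]:
        print(f"{name}: {score * 100:.1f}")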

BLEU (Bilingual Evaluation Understudy) looks at small parts (n-grams) of the machine's translation and checks how many of them appear in the human translations; the more matches it finds, the higher the score (Papineni, Roukos, Ward, & Zhu, 2002). BLEU scores are usually lower than chrF++ and COMET scores because BLEU only counts exact word matches between the machine translation and the human translation (Post, 2018). It disregards subtler similarities between sentences, since it only counts part of a translation as correct if the exact words are used.
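For reference, the standard BLEU formulation from Papineni et al. (2002) combines the modified n-gram precisions p_n (typically up to N = 4) with a brevity penalty BP that penalises candidates shorter than the reference:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
    \qquad
    \mathrm{BP} =
    \begin{cases}
    1 & \text{if } c > r \\
    e^{\,1 - r/c} & \text{if } c \le r
    \end{cases}

Here c is the candidate length, r the reference length, and the weights w_n are usually uniform (w_n = 1/N).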
chrF++ breaks translations into sequences of characters (Popović, 2017). For example, in the word "translation," the character 3-grams would be "tra," "ran," "ans," and so on. This captures subtle differences in spelling and word forms that word-level metrics might miss. chrF++ scores tend to be higher than BLEU scores because, instead of looking for matching words, it looks for matching character sequences between the machine-generated and human translations, and so captures many similarities that BLEU overlooks.
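The character n-grams that chrF++ operates on can be illustrated with a few lines of Python (a toy sketch of the units being compared, not the metric itself):

    # Toy illustration: the character 3-grams that chrF-style metrics compare
    # between candidate and reference text.
    def char_ngrams(text: str, n: int = 3) -> list[str]:
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("translation"))
    # ['tra', 'ran', 'ans', 'nsl', 'sla', 'lat', 'ati', 'tio', 'ion']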
METEOR (Metric for Evaluation of Translation with Explicit ORdering) also looks at word and phrase matches between human and machine translations. However, unlike BLEU, it can detect synonyms and count a word as correct if it means the same thing as the reference word. METEOR scores are therefore higher than BLEU scores, as they are based on how well words and phrases align, taking into account many contextual similarities between the machine and human translations that BLEU disregards (Banerjee & Lavie, 2005).
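A small, hedged example of this behaviour, using NLTK's meteor_score and the WordNet synonym pair "sofa"/"couch" (toy sentences chosen purely for illustration):

    # Sketch: METEOR credits WordNet synonyms that BLEU treats as mismatches.
    # Requires NLTK's WordNet data to be downloaded.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score

    reference = ["the", "couch", "is", "very", "comfortable"]
    candidate = ["the", "sofa", "is", "very", "comfortable"]

    bleu = sentence_bleu([reference], candidate,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([reference], candidate)

    # METEOR aligns "sofa" with "couch" through WordNet, so it scores this
    # candidate noticeably higher than BLEU does.
    print(f"BLEU:   {bleu * 100:.1f}")
    print(f"METEOR: {meteor * 100:.1f}")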
MultEval is a toolkit, available as a repository on GitHub, for calculating accuracy scores of MT systems with evaluation metrics such as BLEU, chrF++ and METEOR. The website uses it as a guide, presenting these evaluation metrics through a web interface to make them more accessible.
Two input files are required (see the sketch below):
Candidate file: a file in .txt format containing the output of the Machine Translation software of your choice, with one translated sentence per line.
Reference file: a file in .txt format containing the human translations of the same sentences. The sentences in both files should be parallel, i.e. in the same order.
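A minimal sketch of how such a pair of files could be read and scored at the corpus level with NLTK; the file names candidate.txt and reference.txt are placeholders, and the NLTK resources from the earlier example are assumed to be installed:

    # Sketch of scoring two parallel .txt files at the corpus level with NLTK.
    from nltk.tokenize import word_tokenize
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
    from nltk.translate.chrf_score import corpus_chrf

    with open("candidate.txt", encoding="utf-8") as f:
        candidates = [line.strip() for line in f if line.strip()]
    with open("reference.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f if line.strip()]

    # The files must be parallel: line i of one corresponds to line i of the other.
    assert len(candidates) == len(references), "Files must have the same number of lines"

    bleu = corpus_bleu([[word_tokenize(r)] for r in references],
                       [word_tokenize(c) for c in candidates],
                       smoothing_function=SmoothingFunction().method1)
    chrf = corpus_chrf(references, candidates)

    print(f"Corpus BLEU: {bleu * 100:.1f}")
    print(f"Corpus chrF: {chrf * 100:.1f}")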

Automatic Speech Recognition

Automatic Speech Recognition (ASR) enables machines to understand and process human speech. By converting spoken language into text, ASR systems support a wide range of applications, from voice-activated assistants and transcription services to language translation and accessibility tools. ASR relies on deep learning models that analyse sound waves, identify phonetic patterns, and match them with corresponding words in a given language. Several factors influence ASR performance: the number of speakers, the nature of the speech, vocabulary size, and spectral bandwidth (Alharbi et al., 2021). Despite these advances, ASR technology still faces challenges, such as accurately recognising accents, dialects, and everyday speech, and handling background noise and multiple speakers. Word Error Rate (WER) is the metric used here to measure the errors made by ASR systems and check their accuracy. The developers used the distance() function from the Levenshtein library to calculate the edit distance.

Word Error Rate (WER) is a metric that measures how different the ASR output is from what was actually uttered. It compares the reference transcription with the transcription generated by the ASR system and sums the number of word insertions, deletions and substitutions made by the ASR; this total is divided by the number of words in the reference to give the WER, i.e. WER = (insertions + deletions + substitutions) / number of reference words.
Levenshtein is a Python library whose distance() function calculates the minimum number of edits (insertions, deletions or substitutions) required to transform one string into another. Here it is used to identify and measure the discrepancies between the ASR output and the reference transcription, providing the edit counts needed to compute the word error rate.
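A minimal sketch of the WER calculation. It uses a small dynamic-programming edit-distance routine over words rather than the Levenshtein library's character-level distance(), but it follows the same insertion/deletion/substitution logic described above:

    # Minimal sketch of WER: count the word-level insertions, deletions and
    # substitutions needed to turn the ASR output into the reference
    # transcript, then divide by the number of reference words.
    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.split()
        hyp = hypothesis.split()

        # dp[i][j] = edits needed to match the first j hypothesis words
        # against the first i reference words.
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                                  # deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                                  # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution / match
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 edit / 6 words ≈ 0.17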
Two input files are required (see the sketch below):
Candidate file: a file in .txt format containing the text produced by the ASR system of your choice, with one transcribed sentence per line.
Reference file: a file in .txt format containing the original transcription of the same sentences. The sentences in both files should be parallel, i.e. in the same order.
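Building on the wer() sketch above, a corpus-level WER over two such files could be computed as follows; the file names asr_output.txt and transcript.txt are placeholders:

    # Sketch of corpus-level WER over the two input files, reusing the wer()
    # routine from the previous sketch.
    with open("asr_output.txt", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f if line.strip()]
    with open("transcript.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f if line.strip()]

    assert len(hypotheses) == len(references), "Files must be parallel"

    # Weight each line's WER by its number of reference words so the result
    # equals total edits divided by total reference words.
    total_words = sum(len(r.split()) for r in references)
    total_edits = sum(wer(r, h) * len(r.split()) for r, h in zip(references, hypotheses))

    print(f"Corpus WER: {100 * total_edits / total_words:.1f}%")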