TL;DR
Applying the best available model out of the box does not provide perfect results; retraining with additional task-specific data and/or applying grammar correction seems necessary to obtain better results.
Introduction
First things first: what is speech to text, or STT? Speech to text is the process of generating a transcription of the words spoken in an input audio file.
We wanted to evaluate the out-of-the-box capabilities of the wav2vec 2.0 model on audio not previously seen by the model, in both Spanish and English.
Evaluation metric (WER)
Word error rate (WER) is a common metric for evaluating the performance of a speech recognition or machine translation system.
The general difficulty of measuring performance for speech to text lies in the fact that the recognized word sequence can have a different length from the reference word sequence (the one assumed to be correct). The WER metric is derived from the Levenshtein distance, working at the word level instead of the phoneme level.
The general idea of this metric is to count the number of substitutions (S), deletions (D), and insertions (I) required to transform the generated text into the target text, normalized by the number of words (N) in the reference: WER = (S + D + I) / N.
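As an illustration, here is a minimal, self-contained sketch of a word-level WER computation (libraries such as jiwer implement the same metric; the lowercasing shown is an assumption about how the texts are normalized before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (word-level Levenshtein distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("agédrez" vs "ajedrez") over 6 words -> 1/6 ≈ 0.17
print(wer("Es un jugador de ajedrez español",
          "es un jugador de agédrez español"))
```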
Audios evaluated (Spanish and English)
For this test we computed the WER metric on the following audios:
Easy audios
Spanish audio 1 (“Es un jugador de ajedrez español”, “He is a Spanish chess player”) from Mozilla Common Voice:
English audio 1 (“I would like to order a medium pizza with pepperoni and mushrooms”), a generated audio:
Difficult audios
Spanish audio 2 (“Quien quiere una galleta”, “Who wants a cookie”):
English audio 2 (“it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.”) from Babylon 5 Season 4:
Results (out of the box)
We used one version of wav2vec 2.0 trained on Spanish data and another one trained on English data.
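For reference, here is a minimal sketch of how a wav2vec 2.0 checkpoint can be run with the Hugging Face transformers library. The checkpoint names, the input file name, and the greedy decoding are assumptions for illustration, not necessarily our exact setup:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed English checkpoint; a Spanish one such as
# "facebook/wav2vec2-large-xlsr-53-spanish" can be swapped in.
MODEL_NAME = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Load the audio and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("audio.wav")  # hypothetical input file
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

inputs = processor(waveform.squeeze().numpy(),
                   sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```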
Spanish 1
Output text: es un jugador de agédrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.17
Spanish 2
Output text: yeera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.25
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
From the generated outputs we can see that each one contains errors in some of the words, errors that an autocorrector might be able to fix.
Results (regular autocorrector, syntactic distance)
We tested an openly available autocorrector to fix these errors, obtaining the following results (a minimal sketch of this kind of word-level correction follows the results below):
Spanish 1
Output text: es un jugador de agédrez español
Corrected text: es un jugador de ajedrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.0
Spanish 2
Output text: yeera una galleta
Corrected text: meera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Corrected text: i would like to order a medium pizza with pepper oni and mushrooms
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.17
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Corrected text: it was the year of the birth the year of great sadness the year of pain cat year of jot did once nowi
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
We can see that in two of the four cases the WER score decreased by using syntactic autocorrection.
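Here is the promised sketch of word-level syntactic correction. It uses the pyspellchecker package as one openly available example; it illustrates the approach rather than our exact tool:

```python
from spellchecker import SpellChecker  # pip install pyspellchecker

def autocorrect(text: str, language: str = "en") -> str:
    """Replace each word with the closest dictionary word by edit distance."""
    spell = SpellChecker(language=language)  # "es" is also supported
    corrected = []
    for word in text.lower().split():
        # correction() returns the most likely candidate within a small edit
        # distance; fall back to the original word if nothing is found.
        corrected.append(spell.correction(word) or word)
    return " ".join(corrected)

print(autocorrect("i would like to order a medium piza with pepper oni and mushrooms"))
```

Note that each word is corrected in isolation, which is why a split like "pepper oni" cannot be merged back into "pepperoni".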
Results (phoneme autocorrector, phoneme distance)
We also wanted to try a phoneme autocorrector, which finds the closest word by phoneme distance instead of syntactic (character-level) distance. The same results were obtained in this case too. The motivation for using a phoneme autocorrector instead of a syntactic one is that phoneme distance should produce better matches for words whose spelling and pronunciation diverge. An example would be the word “league”: if it were transcribed as “lig”, it would be closer to the real word in phoneme distance than in syntactic distance.
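Below is a minimal sketch of this idea, using the jellyfish package's Metaphone encoding as a simple stand-in for a full phoneme transcription (the tiny vocabulary is a placeholder; a real corrector would use a complete dictionary):

```python
import jellyfish  # pip install jellyfish

# Tiny illustrative vocabulary; a real corrector would use a full dictionary.
VOCABULARY = ["league", "log", "leg", "like", "lick"]

def phoneme_correct(word: str) -> str:
    """Pick the vocabulary word whose phonetic code is closest to the input's,
    instead of comparing raw spellings."""
    code = jellyfish.metaphone(word)
    return min(
        VOCABULARY,
        key=lambda w: jellyfish.levenshtein_distance(code, jellyfish.metaphone(w)),
    )

# "lig" and "league" map to the same Metaphone code, so their phoneme
# distance is 0 even though their spellings differ substantially.
print(phoneme_correct("lig"))  # -> league
```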
Spanish 1
Output text: es un jugador de agédrez español
Corrected text: es un jugador de ajedrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.0
Spanish 2
Output text: yeera una galleta
Corrected text: meera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Corrected text: i would like to order a medium pizza with pepper oni and mushrooms
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.17
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Corrected text: it was the year of the birth the year of great sadness the year of pain cat year of jot did once nowi
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
Conclusions
Spanish 1: initial WER 0.17 -> WER after autocorrection: 0.00
Spanish 2: initial WER 0.50 -> WER after autocorrection: 0.50
English 1: initial WER 0.25 -> WER after autocorrection: 0.17
English 2: initial WER 0.46 -> WER after autocorrection: 0.46
From the results obtained we can conclude that, out of the box, wav2vec 2.0 generates some errors in its outputs even for reasonable-quality inputs. We saw that some of these errors can be fixed in some cases by using either regular (syntactic) autocorrectors or phoneme autocorrectors; since both produced identical scores here, further testing would be needed to determine whether phoneme autocorrectors provide better results.
Using additional methods such as transformer-based grammar correctors could further improve the results (“PEPPER ONI” in the English example wasn't fixed because autocorrectors work at the single-word level). Another way of improving the results would be to fine-tune the model with additional task-specific training data, as the model seems to perform worse when the intonation differs from plain, neutral speech.
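As an illustration of the grammar-correction idea, here is a minimal sketch using the transformers text2text pipeline. The checkpoint "vennify/t5-base-grammar-correction" is one publicly available grammar-correction model, and its "grammar:" prompt prefix is specific to it; both are assumptions for illustration rather than a recommendation:

```python
from transformers import pipeline

# Assumed checkpoint: any seq2seq model fine-tuned for grammatical
# error correction could be dropped in here.
corrector = pipeline("text2text-generation",
                     model="vennify/t5-base-grammar-correction")

asr_output = "it was the year of the birth the year of great sadness"
# This particular model expects a "grammar:" task prefix on its input.
result = corrector("grammar: " + asr_output, max_length=64)
print(result[0]["generated_text"])
```

Unlike a word-level autocorrector, a sequence-to-sequence corrector sees the whole sentence, so it can in principle merge a split like “pepper oni” into “pepperoni” or repair word order.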
That’s it! Thanks for reading this post about speech to text (STT) in both Spanish and English. If you have an audio project that requires ML, or a machine learning project in general, feel free to reach out to us at hello@dynamindlabs.ai or by filling out the contact form at https://dynamindlabs.ai so we can help you make your project a reality! Until next time!