TL;DR
Applying the best available model out of the box does not provide perfect results; retraining with additional task-specific data and/or applying grammar correction seems necessary to obtain better results.
Introduction
First things first: what is speech to text, or STT? Speech to text is the process of generating a transcription of the words spoken in an input audio file.
We wanted to evaluate the out-of-the-box capabilities of the wav2vec 2.0 model on audio not previously seen by the model, in both Spanish and English.
Evaluation metric (WER)
Word error rate (WER) is a common metric for evaluating the performance of a speech recognition or machine translation system.
The general difficulty of measuring performance for speech to text lies in the fact that the recognized word sequence can have a different length from the reference word sequence (the one assumed to be correct). The WER metric is derived from the Levenshtein distance, working at the word level instead of the phoneme level.
The general idea of this metric is to count the number of substitutions (S), deletions (D), and insertions (I) required to transform the generated text into the target text, normalized by the number of words (N) in the reference: WER = (S + D + I) / N.
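As an illustration, here is a minimal, self-contained sketch of a word-level WER computation (libraries such as jiwer implement the same metric; the lowercasing shown is an assumption about how the texts are normalized before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (word-level Levenshtein distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("agédrez" vs "ajedrez") over 6 words -> 1/6 ≈ 0.17
print(wer("Es un jugador de ajedrez español",
          "es un jugador de agédrez español"))
```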
Audios evaluated (Spanish and English)
For this test we computed the WER metric on the following audios:
Easy audios
Spanish audio 1 (“Es un jugador de ajedrez español”, “He is a Spanish chess player”) from Mozilla Common Voice:
English audio 1 (“I would like to order a medium pizza with pepperoni and mushrooms”), a generated audio:
Difficult audios
Spanish audio 2 (“Quien quiere una galleta”, “Who wants a cookie”):
English audio 2 (“it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.”) from Babylon 5 Season 4:
Results (out of the box)
We used one version of wav2vec 2.0 trained on Spanish data and another one trained on English data.
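For reference, here is a minimal sketch of how a wav2vec 2.0 checkpoint can be run with the Hugging Face transformers library. The checkpoint names, the input file name, and the greedy decoding are assumptions for illustration, not necessarily our exact setup:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed English checkpoint; a Spanish one such as
# "facebook/wav2vec2-large-xlsr-53-spanish" can be swapped in.
MODEL_NAME = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Load the audio and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("audio.wav")  # hypothetical input file
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

inputs = processor(waveform.squeeze().numpy(),
                   sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```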
Spanish 1
Output text: es un jugador de agédrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.17
Spanish 2
Output text: yeera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.25
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
From the generated outputs we can see that each one contains errors in some of the words, errors that an autocorrector might be able to fix.
Results (regular autocorrector, syntactic distance)
We tested an openly available autocorrector to fix these errors, obtaining the following results (a minimal sketch of this kind of word-level correction follows the results below):
Spanish 1
Output text: es un jugador de agédrez español
Corrected text: es un jugador de ajedrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.0
Spanish 2
Output text: yeera una galleta
Corrected text: meera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Corrected text: i would like to order a medium pizza with pepper oni and mushrooms
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.17
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Corrected text: it was the year of the birth the year of great sadness the year of pain cat year of jot did once nowi
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
We can see that in two of the four cases the WER score decreased by using syntactic autocorrection.
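Here is the promised sketch of word-level syntactic correction. It uses the pyspellchecker package as one openly available example; it illustrates the approach rather than our exact tool:

```python
from spellchecker import SpellChecker  # pip install pyspellchecker

def autocorrect(text: str, language: str = "en") -> str:
    """Replace each word with the closest dictionary word by edit distance."""
    spell = SpellChecker(language=language)  # "es" is also supported
    corrected = []
    for word in text.lower().split():
        # correction() returns the most likely candidate within a small edit
        # distance; fall back to the original word if nothing is found.
        corrected.append(spell.correction(word) or word)
    return " ".join(corrected)

print(autocorrect("i would like to order a medium piza with pepper oni and mushrooms"))
```

Note that each word is corrected in isolation, which is why a split like "pepper oni" cannot be merged back into "pepperoni".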
Results (phoneme autocorrector, phoneme distance)
We also wanted to try a phoneme autocorrector, which finds the closest word by phoneme distance instead of syntactic (character-level) distance. The same results were obtained in this case too. The motivation for using a phoneme autocorrector instead of a syntactic one is that phoneme distance should produce better matches for words whose spelling and pronunciation diverge. An example would be the word “league”: if it were transcribed as “lig”, it would be closer to the real word in phoneme distance than in syntactic distance.
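Below is a minimal sketch of this idea, using the jellyfish package's Metaphone encoding as a simple stand-in for a full phoneme transcription (the tiny vocabulary is a placeholder; a real corrector would use a complete dictionary):

```python
import jellyfish  # pip install jellyfish

# Tiny illustrative vocabulary; a real corrector would use a full dictionary.
VOCABULARY = ["league", "log", "leg", "like", "lick"]

def phoneme_correct(word: str) -> str:
    """Pick the vocabulary word whose phonetic code is closest to the input's,
    instead of comparing raw spellings."""
    code = jellyfish.metaphone(word)
    return min(
        VOCABULARY,
        key=lambda w: jellyfish.levenshtein_distance(code, jellyfish.metaphone(w)),
    )

# "lig" and "league" map to the same Metaphone code, so their phoneme
# distance is 0 even though their spellings differ substantially.
print(phoneme_correct("lig"))  # -> league
```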
Spanish 1
Output text: es un jugador de agédrez español
Corrected text: es un jugador de ajedrez español
Ground truth text: Es un jugador de ajedrez español
WER score: 0.0
Spanish 2
Output text: yeera una galleta
Corrected text: meera una galleta
Ground truth text: Quien quiere una galleta
WER score: 0.50
English 1
Output text: I WOULD LIKE TO ORDER A MEDIUM PIZA WITH PEPPER ONI AND MUSHROOMS
Corrected text: i would like to order a medium pizza with pepper oni and mushrooms
Ground truth text: I would like to order a medium pizza with pepperoni and mushrooms
WER score: 0.17
English 2
Output text: IT WAS THE YEAR OF THE BIRTH THE YEAR OF GREAT SADNESS THE YEAR OF PAIN CAT YEAR OF JOT DID ONCE ONNOWI
Corrected text: it was the year of the birth the year of great sadness the year of pain cat year of jot did once nowi
Ground truth text: it was year of rebirth, the year of great sadness, the year of pain and the year of joy. It was a new age.
WER score: 0.46
Conclusions
Spanish 1: initial WER 0.17 -> WER after autocorrection: 0.00
Spanish 2: initial WER 0.50 -> WER after autocorrection: 0.50
English 1: initial WER 0.25 -> WER after autocorrection: 0.17
English 2: initial WER 0.46 -> WER after autocorrection: 0.46
From the results obtained we can conclude that, out of the box, wav2vec 2.0 generates some errors in its outputs even for reasonable-quality inputs. We saw that some of these errors can be fixed in some cases by using either regular (syntactic) autocorrectors or phoneme autocorrectors; since both produced identical scores here, further testing would be needed to determine whether phoneme autocorrectors provide better results.
Using additional methods such as transformer-based grammar correctors could further improve the results (“PEPPER ONI” in the English example wasn't fixed because autocorrectors work at the single-word level). Another way of improving the results would be to fine-tune the model with additional task-specific training data, as the model seems to perform worse when the intonation differs from plain, neutral speech.
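As an illustration of the grammar-correction idea, here is a minimal sketch using the transformers text2text pipeline. The checkpoint "vennify/t5-base-grammar-correction" is one publicly available grammar-correction model, and its "grammar:" prompt prefix is specific to it; both are assumptions for illustration rather than a recommendation:

```python
from transformers import pipeline

# Assumed checkpoint: any seq2seq model fine-tuned for grammatical
# error correction could be dropped in here.
corrector = pipeline("text2text-generation",
                     model="vennify/t5-base-grammar-correction")

asr_output = "it was the year of the birth the year of great sadness"
# This particular model expects a "grammar:" task prefix on its input.
result = corrector("grammar: " + asr_output, max_length=64)
print(result[0]["generated_text"])
```

Unlike a word-level autocorrector, a sequence-to-sequence corrector sees the whole sentence, so it can in principle merge a split like “pepper oni” into “pepperoni” or repair word order.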
That’s it! Thanks for reading this post about speech to text (STT) in both Spanish and English. If you have an audio project that requires ML, or a machine learning project in general, feel free to reach out to us at hello@dynamindlabs.ai or by filling out the contact form at https://dynamindlabs.ai so we can help you make your project a reality! Until next time!