Spoken language recognition on Mozilla Common Voice — Audio Transformations. | by Sergey Vilov | Aug, 2023

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. One very common way to get extra data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider 5 popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. It contains the sentence "The burning fire had been extinguished."

import librosa as lr
import IPython.display

# load signal
signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast')
IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.

Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), defined here as the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, each specified by its SNR, and see how they change the signal.

import numpy as np

SNRs = (5, 10, 100, 1000)  # signal-to-noise ratio: max amplitude over noise std
noisy_signal = {}
for snr in SNRs:
    noise_std = max(abs(signal)) / snr  # get noise std
    noise = noise_std * np.random.randn(len(signal))  # generate noise with given std
    noisy_signal[snr] = signal + noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))
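To get a feel for what these SNR values mean, it can help to relate the amplitude-based definition used here to the more conventional power-based SNR in decibels. The following is a minimal sketch, not part of the tutorial notebook: the helper names `add_noise` and `power_snr_db` are hypothetical, and a synthetic 440 Hz tone stands in for the speech sample so that the snippet runs without the audio file.

```python
import numpy as np

def add_noise(signal, snr, rng=None):
    """Add white Gaussian noise, with snr = max|signal| / noise std
    (the amplitude-based definition used in this article)."""
    rng = np.random.default_rng() if rng is None else rng
    noise_std = np.max(np.abs(signal)) / snr
    noise = noise_std * rng.standard_normal(len(signal))
    return signal + noise

def power_snr_db(clean, noisy):
    """Conventional SNR in decibels: 10*log10(signal power / noise power)."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

# Assumption: a synthetic stand-in for a speech clip,
# 1 second of a 440 Hz tone sampled at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

rng = np.random.default_rng(0)  # fixed seed for reproducibility
for snr in (5, 10, 100, 1000):
    noisy = add_noise(tone, snr, rng)
    print(f"amplitude SNR {snr:>4}: {power_snr_db(tone, noisy):.1f} dB")
```

Because the amplitude-based SNR uses the peak of the signal rather than its average power, the decibel value depends on how "peaky" the signal is. Speech has long quiet stretches, so for a real recording the power-based SNR would likely come out lower than for the pure tone used in this sketch.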