Forced Alignment using HuggingFace Wav2Vec2 Models: Part 1
Introduction
Karaoke is one of the things from the Philippines that most Filipinos abroad miss a lot.
Yesterday, I found myself window shopping for Magic Sing microphones. Not only are they expensive (upwards of 700 euros), but you also have to buy extra chips if you want more songs!
Karaoke Videos: A Dying Breed
One alternative is to simply go to YouTube. I see at least two downsides though.
First, you are at the mercy of what is and what is not uploaded. Unless a song is VERY popular (and especially if it is a French song), it probably does not have a karaoke version.
Second, if such videos exist, they are these soulless text-only videos. I don’t know about you, but I grew up with a different kind of karaoke video. I want my karaoke videos with a beach resort background and possibly a vague love story going on! Is that too much to ask?
Enter AI
With all the AI developments going on, I thought: is it already possible to automate generating these karaoke videos?
Audio and lyrics are readily available online. However, lyrics that are synced to the audio at the word level, which is essential for karaoke, are not.
Thankfully, GitHub user mikezzb has already implemented this. All that is left for me is to understand what it does and modify it to my liking.
Outline
Spoiler: This turned out to be quite a long post so I decided to split this into several parts so I can get a break.
The problem essentially boils down to: given the lyrics and the audio of a song, find the timestamps where each word is said.
Experts call this a Forced Alignment problem.
The GitHub repo that I followed tackles this problem in several steps:
- Isolate the vocals of the audio.
- Use a model to recognize the different phonemes in the vocals.
- Use a dynamic programming algorithm to guess the appropriate timestamps.
We will go through this step-by-step.
In this post, we will discuss how to isolate the vocals and how to run the model to know what sound or what letter the model thinks is being uttered at every timestamp.
Isolating the Vocals
To isolate the vocals, the repo uses htdemucs, an advanced model that separates audio into its individual stems: vocals, drums, bass, and "other".
I was tempted to go deeper into knowing how this was trained. However, this is a rabbit hole that I am not entering for the moment. We have a karaoke mission to finish.
Loading the Pretrained htdemucs Model
The first step is to load the model into the device.
The device I am using is the CPU, because apparently I have no choice:
import torch
# Use the GPU if one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
device
Next, it is time to load the pretrained model:
from demucs.pretrained import get_model
# Load the pretrained Hybrid Transformer Demucs model and switch it to evaluation mode.
model = get_model(name="htdemucs", repo=None)
model.to(device)
model.eval()
The method .to moves the model onto our chosen device (in my case, the CPU). If you are not consistent about which device your model and data live on, you will run into errors.
Moreover, .eval() tells the model that we are using it for inference, as opposed to training.
Stems and Sample Rates
Now that the model is loaded, we can look at what it can do for us:
model.sources
gives us ['drums', 'bass', 'other', 'vocals']. These are the stems that the model could provide for us.
Next, we can also look at the sample rate the model was trained on:
model.samplerate
It returns 44100. What does this number even mean?
Apparently, and I am not an expert, every second of audio is divided into n data points called samples; here n is the sample rate, so 44100 samples per second. Each sample is the amplitude of the sound wave at that instant. And surprisingly, that’s all the data you need to recreate the audio!
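To make this concrete, here is a tiny sketch of my own (not from the repo): one second of a 440 Hz tone at 44100 Hz is just an array of 44100 amplitude values, and that is enough to write a playable file. The /tmp/tone.wav path is arbitrary.
import numpy as np
import soundfile as sf
sr = 44100
t = np.arange(sr) / sr                     # 44100 time points covering one second
tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # amplitude of an A4 sine wave at each time point
sf.write('/tmp/tone.wav', tone, sr)        # a playable one-second beep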
Loading the Track
Indeed, if you use the built-in load_track function from the demucs library, you get a tensor of what I think are floats.
from demucs.separate import load_track
sr = model.samplerate  # 44100
audio_fn = "/home/guissmo/tmp/lyrics-sync/dataset/chocolatine.wav"
audio = load_track(audio_fn, 2, sr)
The 2 means we want two channels, i.e., stereo.
Hence, the shape audio.shape of audio would be:
torch.Size([2, 9068136])
This makes sense because the song I’ve just loaded is Joe Dassin’s Chocolatine, which is about 3 minutes and 26 seconds long.
Therefore we would expect to have around 206 × 44100 ≈ 9.1 million samples. Which we do.
Technically, the file we are using is 9068136 / 44100 ≈ 205.6 seconds long, according to the number of samples we have.
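We can let Python redo this arithmetic for us (a sanity check of my own, not from the repo):
n_samples = audio.shape[1]            # 9068136 samples per channel
print(n_samples / model.samplerate)   # about 205.6 seconds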
Separating the Stems
Before separating the stems, the code normalizes the tensor using the mean and the standard deviation on each channel.
ref = audio.mean(0)                                    # mono mix: average of the two channels
audio_normalized = (audio - ref.mean()) / ref.std()    # shift and scale using the mono mix's statistics
In this code, ref is the mono mix of the two channels; audio is then shifted and scaled by the mean and standard deviation of that mono reference, so the normalized audio has roughly zero mean and unit standard deviation.
Why? Because it keeps the computations numerically well-behaved, prevents rounding errors from getting out of hand, and makes sure the loudness of the track does not greatly impact our results.
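As a quick illustrative check (my own, not in the repo), the mono mix of the normalized tensor should indeed come out with mean roughly 0 and standard deviation roughly 1:
mono = audio_normalized.mean(0)   # mono mix of the normalized channels
print(mono.mean(), mono.std())    # approximately 0 and 1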
Now that all this preheating is done, we can put our data into the oven, which we call the demucs model.
from demucs.apply import apply_model
with torch.no_grad():
    sources = apply_model(
        model, audio_normalized[None], device=device, shifts=1, split=True, overlap=0.25, progress=False)
In this code, we use torch.no_grad() because torch tracks gradients (aka multidimensional derivatives) by default. These are only needed for training, so turning them off saves memory and speeds things up.
The shifts keyword tells demucs to run the separation that many times on randomly time-shifted copies of the input and average the results, a test-time trick that slightly improves quality at the cost of doing the work several times; shifts=1 keeps it to a single pass.
The split keyword lets demucs cut the audio into shorter, more manageable chunks, which are blended back together with an overlap of 0.25 (i.e., 25%).
Acapella
At this point, sources will be a tensor of shape:
torch.Size([1, 4, 2, 9068136])
Our example looks like this:
tensor([[[[0.0130, 0.0202, 0.0192, ..., 0.0230, 0.0226, 0.0210],
[0.0158, 0.0225, 0.0212, ..., 0.0237, 0.0231, 0.0215]],
[[0.0013, 0.0013, 0.0013, ..., 0.0096, 0.0095, 0.0091],
[0.0015, 0.0014, 0.0014, ..., 0.0096, 0.0096, 0.0092]],
[[0.0190, 0.0143, 0.0132, ..., 0.0155, 0.0173, 0.0171],
[0.0176, 0.0137, 0.0124, ..., 0.0147, 0.0169, 0.0168]],
[[0.0022, 0.0019, 0.0022, ..., 0.0047, 0.0046, 0.0045],
[0.0020, 0.0019, 0.0020, ..., 0.0047, 0.0046, 0.0045]]]])
To get the amplitudes of the vocals, we need to take the last of the four stems, because they are ordered as listed in:
model.sources # ['drums', 'bass', 'other', 'vocals']
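To avoid remembering magic indices, we could also pair each stem name with its tensor; this is a convenience of my own, not something the repo does:
stems = dict(zip(model.sources, sources[0]))   # {'drums': ..., 'bass': ..., 'other': ..., 'vocals': ...}
stems['vocals'].shape                          # torch.Size([2, 9068136])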
Hence, if we want a WAV file with a sample rate of 44100 Hz (i.e., the original, better-sounding quality), we could save it as follows:
import soundfile as sf
audio_name = "chocolatine"
vocals = sources[0][3][0, ...].cpu().numpy()   # vocals stem, left channel, as a NumPy array
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals.wav', vocals, 44100)
This takes the left channel (as is done by [0, ...]) and writes it into a wav file with a sample rate of 44100 Hz.
If we wanted to be fancy, we could take both channels but this seems to suffice for our purposes.
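For completeness, here is what the fancy stereo version could look like; this is a sketch of my own (soundfile expects the data as (frames, channels), hence the transpose), and the _vocals_stereo filename is just something I made up:
vocals_stereo = sources[0][3].cpu().numpy().T   # shape (9068136, 2): both channels
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals_stereo.wav', vocals_stereo, 44100)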
Post-Processing
For the next step, the model we will use needs the data at a 16000 Hz sample rate, presumably because it’s good enough quality to train on but not so high that the data becomes enormous.
So, we will resample it using the librosa library:
import librosa
vocals_pp = librosa.resample(vocals, orig_sr=44100, target_sr=16000)
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals_pp.wav', vocals_pp, 16000)
As with the previous case, writing it into a file is optional and just for fun.
Detour: Instrumentals
As an aside, I thought: since demucs is supposedly separating the sound into different stems, shouldn’t adding the numbers together via
sources.sum(dim=1)
recreate the audio_normalized tensor?
Well, I did just that and added all the non-vocal arrays together like so:
import soundfile as sf
instruments = sources[0][0][0, ...] + sources[0][1][0, ...] + sources[0][2][0, ...]   # drums + bass + other, left channel
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_instrumental.wav', instruments, sr)
And surprisingly, the resulting wav file gives a decent-sounding instrumental, though clearly with some degradation.
Apparently, this process of separating the stems using demucs is lossy: some information gets lost in its attempt to isolate the different components of the song.
So we can’t perfectly recreate the instrumentals but we have something decent enough.
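To put a number on that degradation, here is a quick check of my own (not from the repo) comparing the summed stems against the normalized input:
reconstruction = sources.sum(dim=1)                              # add drums, bass, other and vocals back together
error = (reconstruction - audio_normalized[None]).abs().mean()
print(error)                                                     # small but nonzero: the separation is lossy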
What Does the Neural Network Hear?
The Wav2Vec 2.0 model architecture was developed by Facebook and is used to process raw audio inputs (think WAV files) for automatic speech recognition (ASR) tasks.
We will use two things from this architecture, the model and the processor.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model_name = 'facebook/wav2vec2-large-xlsr-53-french'
model = Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
As you can see from our code, we use facebook/wav2vec2-large-xlsr-53-french as the name of our model.
The corresponding model that I use for English is facebook/wav2vec2-large-960h-lv60-self.
For newbs like me: these model names are simply the models’ identifiers on HuggingFace.
Alphabet
The processor has a tokenizer and the French version has an alphabet of “characters”:
processor.tokenizer.get_vocab()
<pad> <s> </s> <unk> | E S A I T N R U L O D M C P É V ' Q F G B H J À X È Y - Ê Z Â Ç Î Ô Û Ù K Œ Ï W Ë Ü Æ Ÿ
The model will go through segments of our post-processed vocal track and, for each segment, return one number per character in the alphabet. A high number for a character means that the model strongly believes that the segment corresponds to that character. These values are called the logits of that segment, as determined by the model.
For example, if the model returns a large value for A, a smaller one for E, and tiny values for all the other characters, then that segment of the vocal track sounds most like an A!
Sidenote: it seems that data scientists call this the tokenizer’s vocabulary; I’m more familiar with the math term alphabet.
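Out of curiosity, here is a small sketch of my own to check how big this vocabulary is and to map ids back to characters:
vocab = processor.tokenizer.get_vocab()             # dict: character -> integer id
print(len(vocab))                                   # 49 for this French model
id2char = {idx: ch for ch, idx in vocab.items()}    # lets us turn predicted ids back into characters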
Chopping Up The Input To Segments
The model takes as input this array of amplitudes from the WAV file. For our Joe Dassin’s Chocolatine song, we have 9068136 data points, one for each 1/44100 of a second. We have post-processed it down to a sample rate of 16000 Hz, which leaves us with about 3.29 million samples.
That’s still a lot for my aging Thinkpad to handle at a time. And so we cut it up into 15-second chunks.
We define the following function:
import numpy as np
window_size = int(16000 * 15)   # 15 seconds of audio at 16000 Hz
hop_length = window_size        # non-overlapping windows
def get_audio_segments(audio):
    # Pad the last incomplete frame so the length is a multiple of the window size.
    pad_length = (window_size - (len(audio) % window_size)) % window_size
    padded_audio = np.pad(audio, (0, pad_length), mode='constant')
    return librosa.util.frame(padded_audio, frame_length=window_size, hop_length=hop_length, axis=0)
With this function we expect to have 14 segments (about 3.29 million samples divided into windows of 240000, rounding up). And indeed we do:
segs = get_audio_segments(vocals_pp)
Running the above code, segs.shape will be (14, 240000).
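As a quick sanity check of my own, the expected number of windows matches:
expected_segments = int(np.ceil(len(vocals_pp) / window_size))
print(expected_segments)   # 14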
Finally Getting The Logits
The model processes the input 320 points at a time, according to its config:
model.config.inputs_to_logits_ratio # 320
Hence, for each 240000-sample segment, we expect to get 240000 / 320 = 750 logits. Including the padding we did in get_audio_segments, we expect to get 14 × 750 = 10500 logits in total.
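Spelling out that arithmetic (a sanity check of my own, not from the repo):
samples_per_segment = window_size                                                      # 240000 samples per 15-second chunk
logits_per_segment = int(samples_per_segment / model.config.inputs_to_logits_ratio)   # 750
print(14 * logits_per_segment)                                                         # 10500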
Here is the code that we will use:
import torch
def get_logits_from_seg(seg):
    # Turn the raw samples into the tensor format the model expects.
    input_vals = processor(seg, return_tensors="pt", padding="longest", sampling_rate=16000).input_values
    with torch.no_grad():
        seg_logits = model(input_vals).logits
    return seg_logits

logits = get_logits_from_seg(segs[0])
for seg in segs[1:]:
    logits_seg = get_logits_from_seg(seg)
    logits = torch.cat((logits, logits_seg), dim=1)   # stack the logits of all segments along the time axis
The code basically uses the processor to prepare each segment and the model to compute the corresponding logits, segment by segment. I’m sure there is a lot of heavy lifting going on behind the scenes, but this is how simple it is to code.
After waiting a bit and running this code, we find that logits has shape torch.Size([1, 10486, 49]), which is slightly less than the 10500 logits our calculation expected.
I investigated a bit: get_logits_from_seg only returns 749 logits per segment instead of 750, and I don’t know why. This has briefly been brought up here but I did not find a good answer.
If you know why, contact me because I’m dying to know!
Getting the Transcription
Now that we have the logits, we can apply the softmax function, which normalizes each vector of 49 values so that its components are nonnegative and sum to 1, as in a probability distribution.
The repo uses log_softmax though. The result is no longer a probability distribution, but working on a log scale keeps the values numerically well-behaved: tiny probabilities become moderately negative numbers instead of values that underflow. More on why we use this later, when we actually use the text alignment algorithm.
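A toy example (my own, not from the repo) makes the difference concrete:
toy = torch.tensor([2.0, 1.0, 0.1])
print(torch.softmax(toy, dim=-1))       # tensor([0.6590, 0.2424, 0.0986]) -- sums to 1
print(torch.log_softmax(toy, dim=-1))   # tensor([-0.4170, -1.4170, -2.3170]) -- the logs of the values above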
For now, here is the code:
emission = torch.log_softmax(logits, dim=-1)[0].cpu().detach()
At this point, we also have enough information to see what the model thinks those arrays of amplitudes are saying.
We take the highest-scoring logit for each chunk we inspected and use the processor to turn it into a human-readable transcription.
Here is the code to make it happen:
def get_pred(logits):
    # Pick the highest-scoring character at every time step.
    return torch.argmax(logits, dim=-1)

def get_transcription(pred):
    # Turn the predicted character ids back into text.
    return processor.batch_decode(pred)

pred = get_pred(logits)
transcription = get_transcription(pred)
The beginning of the transcription reads:
"tous les matins il lachetait sentit fin au cla..."
Conclusion: Part 1
Highly-trained neural networks do not call the pastry a pain au chocolat and would rather predict it wrongly as fin au cla.
In the next part, we will discuss the dynamic programming algorithm that is the heart of aligning the text to the correct timestamp!