Forced Alignment using HuggingFace Wav2Vec2 Models: Part 1

Introduction

Karaoke is one of the things from the Philippines that most Filipinos abroad miss a lot.

Yesterday, I found myself window shopping for Magic Sing microphones. Not only are they expensive (upwards of 700+ euros), you also have to buy extra chips if you want more songs!

Karaoke Videos: A Dying Breed

One alternative is to simply go to YouTube. I see at least two downsides though.

First, you are at the mercy of what is and what is not uploaded. Unless the song is VERY popular, songs (especially French songs) do not have karaoke versions.

Second, if such videos exist, they are these soulless text-only videos. I don’t know about you but I grew up with these types of karaoke videos. I want my karaoke videos with a beach resort background and possibly a vague love story going on! Is that too much to ask?

Enter AI

With all the AI developments going on, I thought: is it already possible to automate generating these karaoke videos?

Audio and lyrics are readily available online. However, lyrics that are synced to the audio at the word level, which is essential for karaoke, are not.

Thankfully, Github user mikezzb has already implemented this. All that is left for me is to understand what it does and modify it to my liking.

Outline

Spoiler: This turned out to be quite a long post so I decided to split this into several parts so I can get a break.

The problem essentially boils down to: given the lyrics and the audio of a song, find the timestamps where each word is said.

Experts call this a Forced Alignment problem.

The Github repo that I followed tackles this problem in several steps:

  • Isolate the vocals of the audio.
  • Use a model to recognize the different phonemes in the vocals.
  • Use a dynamic programming algorithm to guess the appropriate timestamps.

We will go through this step-by-step.

In this post, we will discuss how to isolate the vocals and how to run the model to know what sound or what letter the model thinks is being uttered at every timestamp.

Isolating the Vocals

To isolate the vocals, the repo uses htdemucs, an advanced model that you can use to separate the audio into its individual stems such as vocals, drums, bass, and other.

I was tempted to go deeper into knowing how this was trained. However, this is a rabbit hole that I am not entering for the moment. We have a karaoke mission to finish.

Loading the Pretrained htdemucs Model

The first step is to load the model into the device.

The device I am using is the CPU, because apparently I have no choice:

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
device

Next, it is time to load the pretrained model:

from demucs.pretrained import get_model
model = get_model(name="htdemucs", repo=None)
model.to(device)
model.eval()

The method .to moves the model onto the device we chose earlier (the CPU, in my case). If you are not consistent about which device your models and tensors live on, you will run into errors.

Moreover, .eval tells the model that we are using it for evaluation (inference) as opposed to training.
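
To make the “same device” point concrete, here is a tiny sketch (with a made-up placeholder tensor, not the actual audio) of moving data onto the same device as the model:

import torch

# Made-up placeholder: one second of silent stereo audio at 44100 Hz.
dummy_audio = torch.zeros(2, 44100)

# Move it to the same device as the model before doing anything with it.
dummy_audio = dummy_audio.to(device)
print(dummy_audio.device)  # cpu (in my case)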

Stems and Sample Rates

Now that the model is loaded, we can look at what it can do for us:

model.sources

gives us ['drums', 'bass', 'other', 'vocals']. These are the stems that the model could provide for us.

Next, we can also look at the sample rate the model was trained on:

model.samplerate

It returns 44100. What does this number even mean?

Apparently, and I am not an expert, every second of audio can be divided into n data points called samples. These n numbers represent the amplitude of the signal at each point in time. And surprisingly, that’s all the data you need to recreate the audio!
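
As a quick illustration (my own, not from the repo), one second of a 440 Hz sine wave sampled at 44100 Hz is nothing more than an array of 44100 amplitude values:

import numpy as np

sr_demo = 44100
t = np.arange(sr_demo) / sr_demo          # 44100 points in time over one second
sine = 0.5 * np.sin(2 * np.pi * 440 * t)  # the amplitude at each sample
print(sine.shape)                         # (44100,)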

Loading the Track

Indeed, if you use the built-in load_track function from the demucs library, you get a tensor of what I think are floats.

from demucs.separate import load_track
sr = model.samplerate  # 44100
audio_fn = "/home/guissmo/tmp/lyrics-sync/dataset/chocolatine.wav"
audio = load_track(audio_fn, 2, sr)

The 2 means we want two channels, i.e. stereo.

Hence, audio.shape would be:

torch.Size([2, 9068136])

This makes sense because the song I’ve just loaded is Joe Dassin’s Chocolatine, which is about 205 seconds long.

Therefore we would expect to have around 9040500 samples. Which we do.

Technically, the file we are using is $9068136 \div 44100 \approx 205.63$ seconds long, according to the number of samples we have sampled.
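
Here is a quick sanity check (my own addition) that recovers this duration from the tensor itself:

# Number of samples divided by the sample rate gives the duration in seconds.
n_channels, n_samples = audio.shape
print(n_samples / sr)  # 9068136 / 44100 ≈ 205.63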

Separating the Stems

Before separating the stems, the code normalizes the tensor using the mean and the standard deviation of the channel-averaged signal.

ref = audio.mean(0)
audio_normalized = (audio - ref.mean()) / ref.std()

In this code, ref averages the two channels into a single mono reference, and audio is then normalized using the mean and standard deviation of that reference, so that overall the signal has mean 0 and standard deviation 1.

Why? It keeps the numbers in a well-behaved range (which helps convergence and avoids rounding errors), and it makes sure that the loudness of the track does not greatly impact our results.
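
As a small check (again my own addition, not from the repo), the channel-averaged reference of the normalized audio should indeed come out with mean roughly 0 and standard deviation roughly 1:

ref_normalized = audio_normalized.mean(0)
print(ref_normalized.mean().item())  # ≈ 0
print(ref_normalized.std().item())   # ≈ 1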

Now that all this preheating is done, we can put our data into the oven which we call the demucs model.

from demucs.apply import apply_model
with torch.no_grad():
    sources = apply_model(
        model, audio_normalized[None], device=device, shifts=1, split=True, overlap=0.25, progress=False)

In this code, we use torch.no_grad() because torch computes gradients (aka multidimensional derivatives) by default. These are only needed for training, so turning them off speeds things up.

The shifts keyword, according to the documentation, is the number of random time shifts applied to the input: the audio is shifted, separated, and the predictions are averaged. Higher values give slightly better separations at the cost of speed.

The split keyword permits demucs to split the audio into shorter, more manageable chunks, with an overlap of 0.25 between consecutive chunks (the overlap keyword).
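
To get a feel for what this means, here is a rough sketch (my own illustration, not demucs’s actual implementation) of where overlapping chunks would start for a hypothetical 10-second segment length and 25% overlap:

# Rough illustration only: start positions of fixed-length chunks with 25% overlap.
def chunk_starts(n_samples, segment_len, overlap=0.25):
    stride = int(segment_len * (1 - overlap))
    return list(range(0, n_samples, stride))

starts = chunk_starts(9068136, 10 * 44100)  # our track, hypothetical 10 s segments
print(len(starts), starts[:3])              # 28 chunks; [0, 330750, 661500]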

Acapella

At this point, sources will be a tensor of shape:

torch.Size([1, 4, 2, 9068136])

Our example looks like this:

tensor([[[[0.0130, 0.0202, 0.0192,  ..., 0.0230, 0.0226, 0.0210],
          [0.0158, 0.0225, 0.0212,  ..., 0.0237, 0.0231, 0.0215]],

         [[0.0013, 0.0013, 0.0013,  ..., 0.0096, 0.0095, 0.0091],
          [0.0015, 0.0014, 0.0014,  ..., 0.0096, 0.0096, 0.0092]],

         [[0.0190, 0.0143, 0.0132,  ..., 0.0155, 0.0173, 0.0171],
          [0.0176, 0.0137, 0.0124,  ..., 0.0147, 0.0169, 0.0168]],

         [[0.0022, 0.0019, 0.0022,  ..., 0.0047, 0.0046, 0.0045],
          [0.0020, 0.0019, 0.0020,  ..., 0.0047, 0.0046, 0.0045]]]])

To get the amplitudes of the vocals, we need to take the last of the four stems, because they are ordered as listed in:

model.sources # ['drums', 'bass', 'other', 'vocals']

Hence, if we want a WAV file with a sample rate of 44100 (i.e. better sounding), we could save it as follows:

import soundfile as sf
audio_name = "chocolatine"
vocals = sources[0][3][0, ...].cpu().numpy()  # stem 3 = vocals, channel 0 = left
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals.wav', vocals, 44100)

This takes the left channel (via [0, ...]) and writes it into a WAV file with a sample rate of 44100.

If we wanted to be fancy, we could take both channels but this seems to suffice for our purposes.
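
For the record, here is what the fancier version could look like (my own tweak, not the repo’s code): look up the index of 'vocals' from model.sources instead of hard-coding 3, and keep both channels:

# Look up the stem index instead of hard-coding it, and keep both channels.
vocals_idx = model.sources.index("vocals")
vocals_stereo = sources[0, vocals_idx].cpu().numpy().T  # shape (samples, 2) for soundfile
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals_stereo.wav', vocals_stereo, sr)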

Post-Processing

For the next step, the model we will use needs the data at a 16000 sample rate, presumably because it’s good enough quality to train on but not so good that it’s enormous.

So, we will resample it using the librosa library:

import librosa

vocals_pp = librosa.resample(vocals, orig_sr=44100, target_sr=16000)
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_vocals_pp.wav', vocals_pp, 16000)

As with the previous case, writing it into a file is optional and just for fun.
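
A quick way to check that nothing weird happened (my own sanity check): the resampled track should have roughly 16000/44100 as many samples as the original.

print(len(vocals))     # 9068136 samples at 44100 Hz
print(len(vocals_pp))  # about 3290027 samples at 16000 Hz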

Detour: Instrumentals

As an aside, I thought: since demucs is supposedly separating the sound into different stems, shouldn’t adding the numbers together via

sources.sum(dim=1)

recreate the audio_normalized tensor?

Well, I did just that and added all the non-vocal arrays together like so:

import soundfile as sf
# Sum the non-vocal stems: drums (0) + bass (1) + other (2), left channel.
instruments = sources[0][0][0, ...] + sources[0][1][0, ...] + sources[0][2][0, ...]
sf.write(f'/home/guissmo/tmp/lyrics-sync/output/{audio_name}_instrumental.wav', instruments.cpu().numpy(), sr)

And surprisingly, the resulting wav file gives a decent-sounding instrumental, though clearly with some degradation.

Apparently, this process of separating the stems using demucs is lossy: some information is lost in its attempt to isolate the different components of the song.

So we can’t perfectly recreate the instrumentals but we have something decent enough.
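
Out of curiosity, here is a small check (my own addition) of how far the sum of all four stems actually is from the normalized input:

# Compare the sum of all stems against the normalized input.
reconstruction = sources.sum(dim=1)  # shape [1, 2, n_samples]
error = (reconstruction - audio_normalized[None]).abs().mean()
print(error.item())  # nonzero: the separation is lossy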

What Does the Neural Network Hear?

The Wav2Vec 2.0 model architecture was developed by Facebook and is used to process raw audio inputs (think WAV files) for automatic speech recognition (ASR) tasks.

We will use two things from this architecture, the model and the processor.

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model_name = 'facebook/wav2vec2-large-xlsr-53-french'
model = Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

As you can see from our code, we use facebook/wav2vec2-large-xlsr-53-french as the name of our model.

The corresponding model that I use for English is facebook/wav2vec2-large-960h-lv60-self.

For newbs like me: these model names are simply the names under which the models are published on HuggingFace.

Alphabet

The processor has a tokenizer and the French version has an alphabet of 49 “characters”:

processor.tokenizer.get_vocab()
 <pad> <s> </s> <unk> | E S A I T N R U L O D M C P É V ' Q F G B H J À X È Y - Ê Z Â Ç Î Ô Û Ù K Œ Ï W Ë Ü Æ Ÿ 

The model will go through segments of our post-processed vocal track and, for each segment, it will return a number for each of the characters in the alphabet. A high number for a character means that the model strongly believes that that segment corresponds to that character. These values are called the corresponding logits of that segment, as determined by the model.

For example, if the model returns 420 for A, 69 for E, and 0 for the 47 other characters, then that segment of the vocal track sounds most like an A!

Sidenote: It seems that data scientists call this the tokenizer’s vocabulary. I’m more familiar with the math term alphabet.
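
To see how a vector of logits maps back to a character, here is a small toy example (the logit values 420 and 69 are the made-up ones from above):

# Toy example with made-up logits: the index with the highest logit wins.
vocab = processor.tokenizer.get_vocab()                 # token -> index
id_to_token = {idx: tok for tok, idx in vocab.items()}  # index -> token

fake_logits = torch.zeros(len(vocab))
fake_logits[vocab["A"]] = 420.0
fake_logits[vocab["E"]] = 69.0
print(id_to_token[torch.argmax(fake_logits).item()])    # 'A'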

Chopping Up The Input Into Segments

The model takes as input this array of amplitudes from the WAV file. For our Joe Dassin’s Chocolatine song, we have 9068136 data points, each corresponding to $\frac{1}{44100}$th of a second. We have post-processed it and now have a sample rate of 16000, which leaves us with

$$\left\lceil 9068136 \cdot \frac{16000}{44100} \right\rceil = 3290027$$

data points. That’s still a lot for my 6-year-old Thinkpad to process at a time. And so we cut it up into 15-second chunks.

We define the following function:

import numpy as np

window_size = int(16000 * 15)  # 15 seconds at 16000 Hz = 240000 samples
hop_length = window_size       # no overlap between consecutive windows

def get_audio_segments(audio):

    # Pad the last incomplete frame with zeros.
    pad_length = (window_size - (len(audio) % window_size)) % window_size
    padded_audio = np.pad(audio, (0, pad_length), mode='constant')

    return librosa.util.frame(padded_audio, frame_length=window_size, hop_length=hop_length, axis=0)

With this function we expect to have

$$\left\lceil \frac{3290027}{240000} \right\rceil = 14$$

segments. And indeed we do:

segs = get_audio_segments(vocals_pp)

Running the above code, segs.shape will be (14, 240000).

Finally Getting The Logits

The model processes the input 320 points at a time, according to its config.

model.config.inputs_to_logits_ratio # 320

Hence, for each chunk of 240000 samples, we expect to get 750 logits. Including the padding we did in get_audio_segments, we expect 14 × 750 = 10500 logits in total.
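
In code, that back-of-the-envelope calculation looks like this (my own check):

ratio = model.config.inputs_to_logits_ratio  # 320
print(window_size / ratio)                   # 240000 / 320 = 750
print(len(segs) * window_size / ratio)       # 14 * 750 = 10500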

Here is the code that we will use:

import torch

def get_logits_from_seg(seg):
    input_vals = processor(seg, return_tensors="pt", padding="longest", sampling_rate=16000).input_values
    with torch.no_grad():
        seg_logits = model(input_vals).logits
    return seg_logits

# Compute the logits segment by segment and concatenate them along the time dimension.
logits = get_logits_from_seg(segs[0])
for seg in segs[1:]:
    logits_seg = get_logits_from_seg(seg)
    logits = torch.cat((logits, logits_seg), dim=1)

The code basically uses the processor to prepare each segment and the model to compute the corresponding logits. I’m sure there is a lot of heavy lifting going on behind the scenes, but this is how simple it is to code.

After waiting a bit and running this code, we find that logits has shape torch.Size([1, 10486, 49]), which is 14 less than the 10500 our calculation expected.

I investigated a bit: get_logits_from_seg only returns 749 logits each time, and I don’t know why. This has briefly been brought up here but I did not find a good answer.

If you know why, contact me because I’m dying to know!
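
For what it’s worth, here is a sketch of my best guess (assuming the convolutional feature extractor configuration stored in model.config, with its conv_kernel and conv_stride lists): each convolutional layer floors its output length, and the rounding adds up to 749 frames per 240000-sample chunk instead of a clean 750.

# Sketch of a guess, not an authoritative answer: walk the feature extractor's
# conv layers; each one computes floor((length - kernel) / stride) + 1.
def conv_output_length(n_samples, config):
    length = n_samples
    for kernel, stride in zip(config.conv_kernel, config.conv_stride):
        length = (length - kernel) // stride + 1
    return length

print(conv_output_length(240000, model.config))       # 749
print(14 * conv_output_length(240000, model.config))  # 10486, matching logits.shape[1]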

Getting the Transcription

Now that we have the logits, we can apply the softmax algorithm which normalizes our vector of size 49 such that the sum of its components is 1, as in a probability distribution.

The repo uses log_softmax though. The result is no longer a probability distribution (the values don’t sum to 1), but it makes the values “closer”, so the difference between 10% and 1% is less pronounced. More on why we use this later, when we actually use the text alignment algorithm.

For now, here is the code:

emission = torch.log_softmax(logits, dim=-1)[0].cpu().detach()
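
To make the “10% vs 1%” remark a bit more concrete, here is a tiny illustration with made-up numbers:

# A 10x ratio in probability becomes an additive gap of log(10) ≈ 2.3 in log space.
probs = torch.tensor([0.10, 0.01])
print(probs)             # tensor([0.1000, 0.0100])
print(torch.log(probs))  # tensor([-2.3026, -4.6052])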

At this point, we also have enough information to see what the model thinks those amplitude values are saying.

We take the highest-scoring character for each time step and use the processor to transcribe the result into a human-readable format.

Here is the code to make it happen:

def get_pred(logits):
    return torch.argmax(logits, dim=-1)
def get_transcription(pred):
    return processor.batch_decode(pred)

pred = get_pred(logits)
transcription = get_transcription(pred)

The beginning of the transcription reads:

"tous les matins il lachetait sentit fin au cla..."

Conclusion: Part 1

Highly-trained neural networks do not call the pastry a pain au chocolat and would rather predict it wrongly as fin au cla.

In the next part, we will discuss the dynamic programming algorithm that is the heart of aligning the text to the correct timestamp!