Forced Alignment using HuggingFace Wav2Vec2 Models: Part 2
Reintroduction
To recall, the goal of this weekend project is:
Given a WAV file containing a song with vocals and a TXT file containing the lyrics of this song, output a LRC file that tells us at which timestamps each word in the lyric file is uttered in the sound file.
In the first part of this series, we isolated the vocals through the htdemucs library.
Then, we used a neural network to give us the following information:
For each chunk (say some milliseconds) of audio, we have a dictionary of letters (and some special characters) and scores.
The higher the score for the letter, the more strongly the model believes that this letter was being said on this timestamp.
For example, at some time , the letter B might have a score of 0.5 and V might have a score of 0.4. This basically means that the model is like: “Yeah, it sounds like a B or a V but I’m not so sure… but if I had to choose I would choose 0.5.”
Please refer to the first part linked above for more details.
As a reminder, I referred to this repo of Github user mikezzb and basically tried to follow and explain what they did as I understand it.
A Different Problem?
Statement
Let’s forget the original problem a bit and consider another problem.
Say, we have the following table:
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| 5 | 5 | 5 | 5 | 5 | 5 | 8 |
You need to make a path that goes from the top-left cell (labelled ) to the bottom-right cell (labelled ).
The only moves you can make are go to the right (stay on the same row) or go right and down (move to the next row).
For example, one valid path would be
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| 5 | 5 | 5 | 5 | 5 | 5 | 8 |
The total obtained by this path is . But we can find a path with a higher total.
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| 5 | 5 | 5 | 5 | 5 | 5 | 8 |
The above path’s total is . A much higher total than the previous one.
Of course, you can try to go through that tempting at the th column. But unfortunately, there is no path going through there that ends on the . So that’s a no-go.
The problem is to find a valid path with maximal sum and this is a somewhat easy dynamic programming problem.
Possible Solutions
Brute Force
We can loop through all the paths from the top-left to bottom-right corner. How many such paths are there? It will be something like:
where is the number of columns and is the number of rows.
This is because you need to choose a direction (stay in the row, or move to the next one) times because these are the number of gaps between the columns. Moreover, exactly of these directions must be to move to the next row to assure that the path ends at the bottom-most row.
A Less Brute-Forcey Algorithm
We can do better than that.
Let’s Get Mathematical
Let’s use some math notation to be more precise. Suppose is the number on the th row and th column of the table. For example, .
We also define a function that should tell us the highest sum possible across all paths that go from the top-left corner to , the cell on the th row and th column.
For example, using our table above, we know that and is at least .
Unpassable Cells
There is no way to get to so let’s set .
In fact, in general, there is no way so to to a cell if .
So we know that when .
The Recursive Formula
But how do we solve in general? Say for and ?
Well, we know that any path that ends in either passed through (meaning that the last direction was to switch rows) or (meaning the last direction was to stay at the current row).
One could convince themselves that must be the maximum between and and add . Thus
Altogether Now
Combining all the facts that we know about , we have
And so we get the following table for :
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 3 | 6 | 10 | 15 | 21 | 28 |
| 101 | 101 | 601 | 601 | 1501 | 1501 | |
| 301 | 351 | 611 | 621 | 1511 | ||
| 306 | 356 | 616 | 629 |
Apparently, the highest sum any path from the top-left to the bottom-right cell is . But how can we recover the path?
Backtracking
Recovering the path is as easy as looking at which was the maximum among the two choices.
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 3 | 6 | 10 | 15 | 21 | 28 |
| 101 | 101 | 601 | 601 | 1501 | 1501 | |
| 301 | 351 | 611 | 621 | 1511 | ||
| 306 | 356 | 616 | 629 |
Superimposing this path to the original table, we have:
| A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| 5 | 5 | 5 | 5 | 5 | 5 | 8 |
Back To Our Alignment Problem
For the sake of the example, suppose that the WAV file we have only contains the word TOUT. Suppose further that the audio file is seconds. Now, we can relabel our original table as follows:
| A | B | C | D | E | F | G | |
|---|---|---|---|---|---|---|---|
| 0:00 | 0:01 | 0:02 | 0:03 | 0:04 | 0:05 | 0:06 | |
| T | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| O | 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| U | 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| T | 5 | 5 | 5 | 5 | 5 | 5 | 8 |
This time we can interpret the table as the . These could be the scores given by the neural network for the letter at time .
As we have established, the good path for this table is:
| A | B | C | D | E | F | G | |
|---|---|---|---|---|---|---|---|
| 0:00 | 0:01 | 0:02 | 0:03 | 0:04 | 0:05 | 0:06 | |
| T | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| O | 0 | 100 | 0 | 500 | 0 | 900 | 0 |
| U | 2 | 2 | 200 | 50 | 10 | 10 | 10 |
| T | 5 | 5 | 5 | 5 | 5 | 5 | 8 |
This means that T was pronounced in the first second, O was pronounced for three seconds after that, then U pronounced two seconds and then the final T was pronounced at the final seconds.
At least that’s what the neural network most likely thinks.
Conclusion
Admittedly, some details of the real algorithm used in the original problem of audio alignment was swept under the rug in this explanation. If there is enough interest, I could write a detailed post on how to adapt this algorithm to the original problem.
For now, what I hope is that I have explained the general idea of this dynamic programming algorithm, which by the way has a time complexity of , much less than whatever would be.