Forced Alignment using HuggingFace Wav2Vec2 Models: Part 2

Reintroduction

To recall, the goal of this weekend project is:

Given a WAV file containing a song with vocals and a TXT file containing the lyrics of this song, output a LRC file that tells us at which timestamps each word in the lyric file is uttered in the sound file.

In the first part of this series, we isolated the vocals through the htdemucs library.

Then, we used a neural network to give us the following information:

For each chunk (say some milliseconds) of audio, we have a dictionary of letters (and some special characters) and scores.

The higher the score for the letter, the more strongly the model believes that this letter was being said on this timestamp.

For example, at some time tt, the letter B might have a score of 0.5 and V might have a score of 0.4. This basically means that the model is like: “Yeah, it sounds like a B or a V but I’m not so sure… but if I had to choose I would choose 0.5.”

Please refer to the first part linked above for more details.

As a reminder, I referred to this repo of Github user mikezzb and basically tried to follow and explain what they did as I understand it.

A Different Problem?

Statement

Let’s forget the original problem a bit and consider another problem.

Say, we have the following table:

ABCDEFG
1234567
0100050009000
2220050101010
5555558

You need to make a path that goes from the top-left cell (labelled 11) to the bottom-right cell (labelled 88).

The only moves you can make are go to the right (stay on the same row) or go right and down (move to the next row).

For example, one valid path would be

ABCDEFG
1234567
0100050009000
2220050101010
5555558

The total obtained by this path is 8181. But we can find a path with a higher total.

ABCDEFG
1234567
0100050009000
2220050101010
5555558

The above path’s total is 379379. A much higher total than the previous one.

Of course, you can try to go through that tempting 900900 at the 66th column. But unfortunately, there is no path going through there that ends on the 88. So that’s a no-go.

The problem is to find a valid path with maximal sum and this is a somewhat easy dynamic programming problem.

Possible Solutions

Brute Force

We can loop through all the paths from the top-left to bottom-right corner. How many such paths are there? It will be something like:

(n1k1)=(n1)!(k1)!(nk)!\binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)!(n-k)!}

where nn is the number of columns and kk is the number of rows.

This is because you need to choose a direction (stay in the row, or move to the next one) n1n-1 times because these are the number of gaps between the columns. Moreover, exactly k1k-1 of these directions must be to move to the next row to assure that the path ends at the bottom-most row.

A Less Brute-Forcey Algorithm

We can do better than that.

Let’s Get Mathematical

Let’s use some math notation to be more precise. Suppose ci,jc_{i,j} is the number on the iith row and jjth column of the table. For example, c2,6=900c_{2,6} = 900.

We also define a function f(i,j)f(i, j) that should tell us the highest sum possible across all paths that go from the top-left corner (1,1)(1, 1) to (i,j)(i, j), the cell on the iith row and jjth column.

For example, using our table above, we know that f(1,1)=1f(1, 1) = 1 and f(4,7)f(4, 7) is at least 379379.

Unpassable Cells

There is no way to get to (4,1)(4, 1) so let’s set f(4,1)=f(4, 1) = -\infty.

In fact, in general, there is no way so to to a cell (i,j)(i', j') if i>ji' > j'.

So we know that f(i,j)=f(i, j) = -\infty when i>ji > j.

The Recursive Formula

But how do we solve f(i,j)f(i, j) in general? Say for i=4i = 4 and j=7j = 7?

Well, we know that any path that ends in (i,j)(i, j) either passed through (i1,j1)(i-1, j-1) (meaning that the last direction was to switch rows) or (i,j1)(i, j-1) (meaning the last direction was to stay at the current row).

One could convince themselves that f(i,j)f(i, j) must be the maximum between f(i1,j1)f(i-1, j-1) and f(i,j1)f(i, j-1) and add ci,jc_{i,j}. Thus

f(i,j)=max(f(i1,j1),f(i,j1))+ci,j.f(i, j) = \max(f(i-1, j-1), f(i, j-1)) + c_{i,j}.

Altogether Now

Combining all the facts that we know about ff, we have

f(i,j)={1if (i,j)=(0,0)if i>jmax(f(i1,j1),f(i,j1))+ci,jotherwise.f(i, j) = \begin{cases} 1 & \text{if } (i, j) = (0, 0) \\ -\infty & \text{if } i > j \\ \max(f(i-1, j-1), f(i, j-1)) + c_{i,j}& \text{otherwise}. \\ \end{cases}

And so we get the following table for f(i,j)f(i, j):

ABCDEFG
13610152128
-\infty10110160160115011501
-\infty-\infty3013516116211511
-\infty-\infty-\infty306356616629

Apparently, the highest sum any path from the top-left to the bottom-right cell is 629629. But how can we recover the path?

Backtracking

Recovering the path is as easy as looking at which was the maximum among the two choices.

ABCDEFG
13610152128
-\infty10110160160115011501
-\infty-\infty3013516116211511
-\infty-\infty-\infty306356616629

Superimposing this path to the original table, we have:

ABCDEFG
1234567
0100050009000
2220050101010
5555558

Back To Our Alignment Problem

For the sake of the example, suppose that the WAV file we have only contains the word TOUT. Suppose further that the audio file is 77 seconds. Now, we can relabel our original table as follows:

ABCDEFG
0:000:010:020:030:040:050:06
T1234567
O0100050009000
U2220050101010
T5555558

This time we can interpret the table as the ci,jc_{i,j}. These could be the scores given by the neural network for the letter ii at time jj.

As we have established, the good path for this table is:

ABCDEFG
0:000:010:020:030:040:050:06
T1234567
O0100050009000
U2220050101010
T5555558

This means that T was pronounced in the first second, O was pronounced for three seconds after that, then U pronounced two seconds and then the final T was pronounced at the final seconds.

At least that’s what the neural network most likely thinks.

Conclusion

Admittedly, some details of the real algorithm used in the original problem of audio alignment was swept under the rug in this explanation. If there is enough interest, I could write a detailed post on how to adapt this algorithm to the original problem.

For now, what I hope is that I have explained the general idea of this dynamic programming algorithm, which by the way has a time complexity of O(nk)O(nk), much less than whatever O((nk))O(\binom{n}{k}) would be.