Forced Alignment using HuggingFace Wav2Vec2 Models: Part 2

Reintroduction

To recall, the goal of this weekend project is:

Given a WAV file containing a song with vocals and a TXT file containing the lyrics of this song, output a LRC file that tells us at which timestamps each word in the lyric file is uttered in the sound file.

In the first part of this series, we isolated the vocals through the htdemucs library.

Then, we used a neural network to give us the following information:

For each chunk (say some milliseconds) of audio, we have a dictionary of letters (and some special characters) and scores.

The higher the score for the letter, the more strongly the model believes that this letter was being said on this timestamp.

For example, at some time $t$ , the letter B might have a score of 0.5 and V might have a score of 0.4. This basically means that the model is like: “Yeah, it sounds like a B or a V but I’m not so sure… but if I had to choose I would choose 0.5.”

Please refer to the first part linked above for more details.

As a reminder, I referred to this repo of Github user mikezzb and basically tried to follow and explain what they did as I understand it.

A Different Problem?

Statement

Let’s forget the original problem a bit and consider another problem.

Say, we have the following table:

A	B	C	D	E	F	G
1	2	3	4	5	6	7
0	100	0	500	0	900	0
2	2	200	50	10	10	10
5	5	5	5	5	5	8

You need to make a path that goes from the top-left cell (labelled $1$ ) to the bottom-right cell (labelled $8$ ).

The only moves you can make are go to the right (stay on the same row) or go right and down (move to the next row).

For example, one valid path would be

A	B	C	D	E	F	G
1	2	3	4	5	6	7
0	100	0	500	0	900	0
2	2	200	50	10	10	10
5	5	5	5	5	5	8

The total obtained by this path is $81$ . But we can find a path with a higher total.

A	B	C	D	E	F	G
1	2	3	4	5	6	7
0	100	0	500	0	900	0
2	2	200	50	10	10	10
5	5	5	5	5	5	8

The above path’s total is $379$ . A much higher total than the previous one.

Of course, you can try to go through that tempting $900$ at the $6$ th column. But unfortunately, there is no path going through there that ends on the $8$ . So that’s a no-go.

The problem is to find a valid path with maximal sum and this is a somewhat easy dynamic programming problem.

Possible Solutions

Brute Force

We can loop through all the paths from the top-left to bottom-right corner. How many such paths are there? It will be something like:

\binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)!(n-k)!}

where $n$ is the number of columns and $k$ is the number of rows.

This is because you need to choose a direction (stay in the row, or move to the next one) $n-1$ times because these are the number of gaps between the columns. Moreover, exactly $k-1$ of these directions must be to move to the next row to assure that the path ends at the bottom-most row.

A Less Brute-Forcey Algorithm

We can do better than that.

Let’s Get Mathematical

Let’s use some math notation to be more precise. Suppose $c_{i,j}$ is the number on the $i$ th row and $j$ th column of the table. For example, $c_{2,6} = 900$ .

We also define a function $f(i, j)$ that should tell us the highest sum possible across all paths that go from the top-left corner $(1, 1)$ to $(i, j)$ , the cell on the $i$ th row and $j$ th column.

For example, using our table above, we know that $f(1, 1) = 1$ and $f(4, 7)$ is at least $379$ .

Unpassable Cells

There is no way to get to $(4, 1)$ so let’s set $f(4, 1) = -\infty$ .

In fact, in general, there is no way so to to a cell $(i', j')$ if $i' > j'$ .

So we know that $f(i, j) = -\infty$ when $i > j$ .

The Recursive Formula

But how do we solve $f(i, j)$ in general? Say for $i = 4$ and $j = 7$ ?

Well, we know that any path that ends in $(i, j)$ either passed through $(i-1, j-1)$ (meaning that the last direction was to switch rows) or $(i, j-1)$ (meaning the last direction was to stay at the current row).

One could convince themselves that $f(i, j)$ must be the maximum between $f(i-1, j-1)$ and $f(i, j-1)$ and add $c_{i,j}$ . Thus

f(i, j) = \max(f(i-1, j-1), f(i, j-1)) + c_{i,j}.

Altogether Now

Combining all the facts that we know about $f$ , we have

f(i, j) = \begin{cases} 1 & \text{if } (i, j) = (0, 0) \\ -\infty & \text{if } i > j \\ \max(f(i-1, j-1), f(i, j-1)) + c_{i,j}& \text{otherwise}. \\ \end{cases}

And so we get the following table for $f(i, j)$ :

A	B	C	D	E	F	G
1	3	6	10	15	21	28
$-\infty$	101	101	601	601	1501	1501
$-\infty$	$-\infty$	301	351	611	621	1511
$-\infty$	$-\infty$	$-\infty$	306	356	616	629

Apparently, the highest sum any path from the top-left to the bottom-right cell is $629$ . But how can we recover the path?

Backtracking

Recovering the path is as easy as looking at which was the maximum among the two choices.

A	B	C	D	E	F	G
1	3	6	10	15	21	28
$-\infty$	101	101	601	601	1501	1501
$-\infty$	$-\infty$	301	351	611	621	1511
$-\infty$	$-\infty$	$-\infty$	306	356	616	629

Superimposing this path to the original table, we have:

A	B	C	D	E	F	G
1	2	3	4	5	6	7
0	100	0	500	0	900	0
2	2	200	50	10	10	10
5	5	5	5	5	5	8

Back To Our Alignment Problem

For the sake of the example, suppose that the WAV file we have only contains the word TOUT. Suppose further that the audio file is $7$ seconds. Now, we can relabel our original table as follows:

	A	B	C	D	E	F	G
	0:00	0:01	0:02	0:03	0:04	0:05	0:06
T	1	2	3	4	5	6	7
O	0	100	0	500	0	900	0
U	2	2	200	50	10	10	10
T	5	5	5	5	5	5	8

This time we can interpret the table as the $c_{i,j}$ . These could be the scores given by the neural network for the letter $i$ at time $j$ .

As we have established, the good path for this table is:

	A	B	C	D	E	F	G
	0:00	0:01	0:02	0:03	0:04	0:05	0:06
T	1	2	3	4	5	6	7
O	0	100	0	500	0	900	0
U	2	2	200	50	10	10	10
T	5	5	5	5	5	5	8

This means that T was pronounced in the first second, O was pronounced for three seconds after that, then U pronounced two seconds and then the final T was pronounced at the final seconds.

At least that’s what the neural network most likely thinks.

Conclusion

Admittedly, some details of the real algorithm used in the original problem of audio alignment was swept under the rug in this explanation. If there is enough interest, I could write a detailed post on how to adapt this algorithm to the original problem.

For now, what I hope is that I have explained the general idea of this dynamic programming algorithm, which by the way has a time complexity of $O(nk)$ , much less than whatever $O(\binom{n}{k})$ would be.

«Forced Alignment using HuggingFace Wav2Vec2 Models: Part 1 Recognizing Text from Screenshots in Ubuntu»

GUISSMO