Finally Getting The Data Right

I’ve been a bit quiet in posting recently, not so much for a lack of work as for a lack of progress. But this weekend I did finally manage to get my data sorted.

Those of you following along so far may know that I’ve had difficulty with my data format – that is, the stuff I actually give to the computer to learn from. There are two ways of doing things: I can use a Mel spectrogram – a mathematical conversion of sound into an image – or I can use a normal uncompressed wav file.

The MEL way seemed good because, on the face of it, I managed to get the whole process working. I was able to train a Grateful Dead discriminator, followed by a producer that seemed to put out pretty good MEL representations. Here was new Grateful Dead music! To remind ourselves, I got this:

Real Grateful Dead audio on the left; machine-learnt audio on the right

But… you can’t hear a MEL image, can you? These couldn’t be converted into sound. With wav files I had the exact opposite problem: turning them into sound is trivial, but I couldn’t get the machine learning to learn at all (a common problem with wavs in the ML world). Over the summer I tried various ways to make the wav method work, but it never did. It became obvious that I had to go the MEL route, since it actually worked. This meant turning a MEL image back into audio, and that in turn meant tackling the maths.

Anatomy of a MEL

So how is a MEL made? The first step is the hardest to understand, although it is easy enough to express. We use a Fourier transform (some fancy maths) to take some sound and decompose it into its constituent frequencies: give it some audio and you end up with a collection of sine waves. We don’t do this for the entire audio at once – we do one Fourier transform for, say, every 1/5th of a second, ending up with a series of time blocks, for each of which we have all the sine waves generated by the band in that period.
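As a rough sketch of that step, here’s how the short-time Fourier transform could look in Python using librosa (the file name and the roughly-1/5th-of-a-second frame length are illustrative assumptions, not my actual settings):

```python
import librosa

# Load the audio as mono at 22050Hz (half CD quality, as used later in the post).
y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# Roughly 1/5th of a second at 22050Hz is about 4410 samples;
# we round to a power of two for the FFT.
n_fft = 4096
hop_length = n_fft // 2

# Short-time Fourier transform: one Fourier transform per (overlapping) time block.
# Each column describes the sine waves present in one slice of time.
stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

print(stft.shape)  # (n_fft // 2 + 1 frequency bins, number of time blocks)
```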

The final stage is to adjust this so that the frequencies our ears pick up easily are increased in power, and those that the ear finds harder to hear are diminished. We want the machine to “hear” the same as we do (this is actually the “MEL” part of “MEL spectrogram”).
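In librosa this whole recipe – slicing into time blocks, Fourier transform, then mapping onto the perceptual MEL scale – is wrapped up in a single call. A minimal sketch, with the parameter values again just assumptions for illustration:

```python
import librosa
import numpy as np

y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# Power spectrogram mapped onto 128 MEL bands, weighting the frequencies
# roughly the way the human ear perceives them.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=4096, hop_length=2048, n_mels=128
)

# Convert power to decibels, compressing the huge dynamic range into
# something a neural network (or an image) can represent sensibly.
mel_db = librosa.power_to_db(mel, ref=np.max)
```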

The Hard Work

When I first used MELs, my approach was to construct the data in the following way (there’s a code sketch of this after the list):

  • Cut the wav files up into slices
  • Normalise the audio so it’s all roughly the same volume
  • Turn the short audio files into MELs
  • Turn each MEL into an image file, to give to a neural network
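A rough sketch of that original pipeline, assuming librosa for the audio work and Pillow for the image export (slice length, file names and MEL settings are placeholders rather than my real values):

```python
import librosa
import numpy as np
from PIL import Image

SLICE_SECONDS = 12  # length of each training slice - an assumption for illustration

y, sr = librosa.load("show_1978_04_16.wav", sr=22050, mono=True)

# 1. Cut the wav into fixed-length slices.
samples_per_slice = SLICE_SECONDS * sr
slices = [y[i:i + samples_per_slice]
          for i in range(0, len(y) - samples_per_slice + 1, samples_per_slice)]

for idx, audio in enumerate(slices):
    # 2. Normalise the audio so every slice peaks at roughly the same volume.
    audio = librosa.util.normalize(audio)

    # 3. Turn the short slice into a MEL spectrogram (in decibels).
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128),
        ref=np.max,
    )

    # 4. Scale to 0-255 and save as a greyscale image for the neural network.
    pixels = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
    Image.fromarray(pixels.astype(np.uint8)).save(f"mel_{idx:04d}.png")
```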

My job was to reverse that process. However, an important detail: I didn’t actually give my neural networks an image as such – it was just that the software library I use (Keras) has some useful functionality that makes it easy to feed it images. Keras does what we call “data normalization” – essentially converting each colour value into a number between 0 and 1.
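The kind of Keras convenience I mean looks something like this (the folder path and image size are placeholders, not necessarily what I used):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Keras rescales each 0-255 pixel value into the 0-1 range:
# this is the "data normalization" mentioned above.
datagen = ImageDataGenerator(rescale=1.0 / 255)

images = datagen.flow_from_directory(
    "mel_images/",           # folder of MEL spectrogram images (placeholder path)
    target_size=(128, 128),  # assumed image size
    color_mode="grayscale",
    class_mode=None,         # no labels needed for this purpose
    batch_size=32,
)
```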

Now, if I was feeding the neural net normalised MEL images, then I would be getting back the same thing. This meant I could skip the image-creating part altogether – that was the key for me. Now my process could be:

  • Cut the wav files into slices
  • Normalise the audio
  • Turn audio into MEL files
  • Normalise the MEL files

Luckily, it turned out the last step was just a bit of maths fiddling, and once that was done, I was able to move back and forth between audio and MEL files easily. Finally!
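To give a flavour of that maths fiddling, here is one way the round trip could look – normalise the MEL into the 0–1 range for the network, then undo it and let librosa rebuild audio from the MEL (my exact normalisation and settings may well differ; the constants below are assumptions):

```python
import librosa
import numpy as np
import soundfile as sf

N_FFT, HOP = 4096, 2048   # assumed analysis settings
TOP_DB = 80.0             # assumed dynamic range mapped onto the 0-1 scale

y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# Forward: audio -> MEL power -> decibels -> values in 0-1 for the network.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP)
mel_db = librosa.power_to_db(mel, ref=np.max, top_db=TOP_DB)
mel_norm = (mel_db + TOP_DB) / TOP_DB          # everything now lies in [0, 1]

# Backward: 0-1 -> decibels -> power -> audio. librosa estimates the missing
# phase with the Griffin-Lim algorithm under the hood.
mel_db_back = mel_norm * TOP_DB - TOP_DB
mel_power = librosa.db_to_power(mel_db_back)
audio_back = librosa.feature.inverse.mel_to_audio(
    mel_power, sr=sr, n_fft=N_FFT, hop_length=HOP
)

# Absolute volume is lost in the round trip, so re-normalise before saving.
sf.write("reconstructed.wav", librosa.util.normalize(audio_back), sr)
```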

The Results Are In…

I took 12 seconds of Grateful Dead, from Scarlet Begonias on 16th April 1978. It’s been mixed down to mono at 22050Hz (half CD quality) and 16-bit (same as CD). It sounds like this:

Now here’s the same 12 seconds of audio, after being converted to a MEL spectrogram and then back again:

Now that doesn’t sound good, does it – so what’s going on? Well, remember that our Fourier transform turns audio into a set of sine waves – the fault lies there. Let’s look at a sine wave. Here the horizontal axis is time:

Now we take the music from a certain point in time (say, the first 1/5th of a second). The sine waves we get from it are accurate, but we lose the information about where each one starts – that is, where the wave should be at the beginning of the time slice. We don’t know where it is on the vertical axis. Since we lose this information, when we reverse the process we have to start every single sine wave we reconstruct from point 0, that is, the middle of the vertical axis.

This problem is not unknown in machine learning audio analysis, and it is said that the image is “out of phase” – you can certainly hear it in the audio. But I hope you can agree it’s still the good old Grateful Dead.
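For the curious, the phase loss is easy to demonstrate in a few lines: keep only the magnitudes of the Fourier transform and resynthesise, and you get exactly this kind of smeared sound. A sketch (not my exact pipeline – the file name and settings are placeholders):

```python
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# The full short-time Fourier transform is complex-valued: it carries both the
# strength of each sine wave (magnitude) and where it starts (phase).
stft = librosa.stft(y, n_fft=4096, hop_length=1024)
magnitude = np.abs(stft)   # keep the strengths, throw away the phase

# Resynthesise with no phase: every sine wave is forced to start from zero,
# which is exactly the "out of phase" effect described above.
zero_phase = librosa.istft(magnitude, hop_length=1024)
sf.write("zero_phase.wav", zero_phase, sr)

# Griffin-Lim iteratively guesses a plausible phase from the magnitudes alone -
# one common (non-machine-learning) way of cleaning this up.
estimated = librosa.griffinlim(magnitude, n_fft=4096, hop_length=1024)
sf.write("griffinlim.wav", estimated, sr)
```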

Moving Forward

Some researchers have seemingly been able to use machine learning to “re-phase” the audio and clear this mess up, so the answer to our machine learning problem is likely more machine learning. However, I’ll look at this another time. For now, it’s back to my original experiment: I need to build a discriminator that works with these slightly new MELs. If that can be done (and I should be able to find that out quite soon), we won’t be far away from fully new, synthesised Grateful Dead audio. It’ll sound terrible, but I will then have something to show for my efforts.