Finally Getting The Data Right

I’ve been a bit quiet in posting recently, not so much for a lack of work, more a lack of progress. But this weekend I did indeed finally manage to get my data sorted.

Those of you following along so far may know that I’ve had difficulty with my data format, that is, the stuff I actually give to the computer to learn from. There are two ways of doing things: I can use a Mel spectrogram – a mathematical conversion of sound into an image – or I can use a normal uncompressed wav file.

The MEL way seemed good because, on the face of it, I managed to get the whole process working. I was able to train a Grateful Dead discriminator, followed by a producer that seemed to put out pretty good MEL representations. Here was new Grateful Dead music! To remind ourselves, I got this:

Real Grateful Dead audio on the left; machine-learnt audio on the right

But… you can’t hear a MEL image, can you? These couldn’t be converted into sound. With wav files, the exact opposite problem: turning them into sound is trivial, but I couldn’t get the machine learning actually learning at all (a common problem with wavs in the ML world). Over the summer I tried various ways to get the wav method to work, but it never did. It became obvious that I had to go the MEL route, since it actually worked. That meant turning a MEL image back into audio, and that in turn meant tackling the maths.

Anatomy of a MEL

So how is a MEL made? The first step is the hardest to understand, although it is easy enough to express. We use a Fourier transform (some fancy maths) to take some sound and decompose it into its constituent frequencies. Give it some audio and you end up with a collection of sine waves. We don’t do this for the entire recording in one go – we do one Fourier transform for, say, every 1/5th of a second, ending up with a series of time blocks, each holding all the sine waves the band produced in that short slice of time.

The final stage is to adjust this so that the frequencies our ears pick up well are boosted in power, while those the ear finds harder to hear are diminished. We want the machine to “hear” the same things that we do (this is actually the “MEL” part of “MEL spectrogram”).
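To make that concrete, here’s a minimal sketch of both steps in Python using the librosa library – librosa is just one way to do this, the file name is made up, and the frame size simply follows the 1/5th-of-a-second example above:

```python
import librosa
import numpy as np

# Load some audio (the file name is just an example), resampled to 22050 Hz mono.
y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# Step 1: short-time Fourier transforms. At 22050 Hz, a 1/5th-of-a-second
# window is 22050 * 0.2 = 4410 samples; each column of the result describes
# the frequencies present in one such slice of time.
stft = librosa.stft(y, n_fft=4410, hop_length=4410)

# Step 2: re-weight those frequencies onto the Mel scale, which mimics how
# sensitive the human ear is at different frequencies.
mel = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=sr, n_fft=4410, n_mels=128)
print(mel.shape)  # (number of Mel bands, number of time slices)
```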

The Hard Work

When I first used MELs, my approach was to construct the data in the following way:

  • Cut the wav files up into slices
  • Normalise the audio so it’s all roughly the same volume
  • Turn the short audio files into MELs
  • Turn each MEL into an image file, to give to a neural network
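For the curious, a rough sketch of that original pipeline might look like this – the file names, slice length and image size are all illustrative, and librosa plus Pillow are just one possible set of tools, not my exact code:

```python
import librosa
import numpy as np
from PIL import Image

SLICE_SECONDS = 12   # length of each audio slice (illustrative)
SR = 22050           # sample rate used throughout

def wav_to_mel_images(path):
    # Cut the wav file up into fixed-length slices.
    y, _ = librosa.load(path, sr=SR, mono=True)
    samples_per_slice = SLICE_SECONDS * SR
    for i in range(0, len(y) - samples_per_slice, samples_per_slice):
        chunk = y[i:i + samples_per_slice]

        # Normalise the audio so every slice peaks at roughly the same volume.
        chunk = chunk / (np.max(np.abs(chunk)) + 1e-9)

        # Turn the short audio slice into a Mel spectrogram (in decibels).
        mel = librosa.feature.melspectrogram(y=chunk, sr=SR, n_mels=128)
        mel_db = librosa.power_to_db(mel)

        # Turn each Mel into a greyscale image file for the neural network.
        scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-9)
        img = Image.fromarray((scaled * 255).astype(np.uint8))
        img.save(f"mel_{i // samples_per_slice}.png")
```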

My job was to reverse that process. However, there was a wrinkle: my neural networks never actually see an image as such – it’s just that the software library I use (Keras) has some useful functionality that makes it easy to feed them images. Keras does what we call “data normalisation” – essentially converting each pixel’s colour value into a number between 0 and 1.
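To give a flavour of the Keras functionality I mean, something along these lines (a sketch only, with a placeholder folder name – it expects the image files to sit inside a sub-folder):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Keras will happily read image files from disk and rescale every pixel
# from the usual 0-255 colour range down to a number between 0 and 1.
datagen = ImageDataGenerator(rescale=1.0 / 255)
mels = datagen.flow_from_directory(
    "mel_images/",          # expects the PNGs to sit in a sub-folder of this
    color_mode="grayscale",
    class_mode=None,        # no labels - a GAN only needs the images themselves
    batch_size=32,
)
```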

Now, if what I was really feeding the neural net was normalised MEL data, then normalised MEL data is what I would be getting back out. This meant I could skip the image-creating part entirely. That was the key insight for me. Now my process could be:

  • Cut the wav files into slices
  • Normalise the audio
  • Turn audio into MEL files
  • Normalise the MEL files

Luckily, it turned out the last step was just a bit of maths fiddling, and once that was done, I was able to move back and forth between audio and MEL files easily. Finally!
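Here’s a sketch of that fiddling – the fixed decibel range is an assumption I’ve made so the scaling is exactly reversible, and librosa is again just one way to do the inversion:

```python
import librosa
import numpy as np

SR = 22050
MIN_DB, MAX_DB = -80.0, 0.0   # assumed fixed decibel range

def normalise_mel(mel_db):
    # Squash decibel values into the 0-to-1 range the network deals in.
    return np.clip((mel_db - MIN_DB) / (MAX_DB - MIN_DB), 0.0, 1.0)

def denormalise_mel(mel_norm):
    # Undo the squashing to get decibels back.
    return mel_norm * (MAX_DB - MIN_DB) + MIN_DB

def mel_to_wav(mel_norm):
    mel_power = librosa.db_to_power(denormalise_mel(mel_norm))
    # librosa estimates a waveform from the Mel spectrogram; the phase is
    # gone, so it has to guess it - which is why the result sounds strange.
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=SR)
```

Using a fixed decibel range (rather than each file’s own minimum and maximum) means the scaling can be undone without having to store anything extra per file.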

The Results Are In…

I took 12 seconds of Grateful Dead, from Scarlet Begonias on 16th April 1978. It’s been mixed down to mono, 22050Hz (half CD quality) at 16-bit (same as CD). It sounds like this:

Now here’s the same 12 seconds of audio, after being converted to a MEL spectrogram and then back again:

Now that doesn’t sound good, does it – what’s going on? Well, if you remember our Fourier transform – it turns audio into a set of sine waves – the fault lies there. Let’s look at a sine wave. Here the horizontal axis is time:

Now we take the music from a certain point in time (say, the first 1/5th of a second). The sine waves we get from that are accurate, but we lose the information about where each one starts – that is, where the wave should be at the beginning of the time slice. We don’t know where it is on the vertical axis. This lost starting position is the wave’s phase. Since we lose this information, when we reverse the process we have to start every single sine wave we reconstruct from point 0, the middle of the vertical axis.
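You can demonstrate the problem without any Mel step at all: throw away the phase of a plain Fourier transform and rebuild the audio with every wave starting from zero. A sketch, again with librosa and an illustrative file name:

```python
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("scarlet_begonias_slice.wav", sr=22050, mono=True)

# The forward transform: each entry is a complex number carrying both the
# strength of a frequency AND its phase (where its wave starts).
stft = librosa.stft(y)

# Keep only the strengths - this is exactly what a spectrogram keeps.
magnitude = np.abs(stft)

# Rebuild from the strengths alone: every sine wave now starts from zero,
# which is what produces that watery, smeared sound.
y_no_phase = librosa.istft(magnitude.astype(np.complex64))
sf.write("no_phase.wav", y_no_phase, sr)
```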

This problem is not unknown in machine learning audio analysis, and the reconstructed audio is said to be “out of phase” – you can certainly hear it. But I hope you can agree it’s still the good old Grateful Dead.

Moving Forward

Some researchers seem to have been able to use machine learning to “re-phase” the audio and clear this mess up. So the answer to our machine learning problem is likely more machine learning. However, I’ll look at this another time. For now, it’s back to my original experiment: I need to build a discriminator that works with these slightly different MELs. If that can be done (and I should be able to find out quite soon), we won’t be far away from fully new, synthesised Grateful Dead audio. It’ll sound terrible, but I will then have something to show for my efforts.

Reaching 2000 Epochs

In machine learning, an “epoch” is one complete pass in which the machine examines all of your input data and learns from it. More epochs = more learning. But, unfortunately, more data also = slower epochs. Now, at the time of the last post one epoch meant working through some 25,000 Mel images. With this much data, my poor little laptop was struggling to do 50 epochs. And yet clearly, even after 100 epochs (as evidenced in the last post) the generated images were not acceptable in any way. Even if they were, the resolution would be too small. So the time came to invest: I went and bought a chunky new desktop, complete with fancy graphics cards (a must in serious machine learning) to give me a speed boost. My goal? 2,000 epochs or bust.
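For the avoidance of doubt, the epoch count is literally just a number handed to Keras when training. A toy, self-contained sketch (random placeholder data and a throwaway model, nothing like my real setup) just to show where that number goes:

```python
import numpy as np
from tensorflow import keras

# Random placeholder "images" standing in for the 25,000 Mels (far fewer here).
training_images = np.random.rand(256, 128, 128, 1).astype("float32")
dummy_targets = np.zeros(256)

# A throwaway model, only here to show where the epoch count is used.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(128, 128, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# One epoch = one complete pass over every item in training_images.
# More images per epoch means each epoch takes longer to run.
model.fit(training_images, dummy_targets, epochs=2000, batch_size=64)
```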

Now that sounds great, but I then discovered that getting the correct graphics drivers set up was like completing the trials of Hercules – and I’m a paid IT professional. There were three weekends of arduous trial and error until finally it was all done and set up. But it was worth it, because when I ran my first test – the one with the images of Jerry Garcia – instead of taking 9 hours it took 5 minutes! A staggering 100x faster. Now I can really forge ahead, I thought! So, how does 2,000 epochs look? Like this:

Being Better Just Brings Bigger Problems

It was here that the real problems began. The first thing I noticed was trivial but important: due to the way my data was structured and loaded, half the machine’s memory was being wasted. This caused major slowdowns, as data had to be read from disk. The other problem was more serious though: quite often, my GANs would stop learning after a small number of generations.

It seems that this is because the discriminator was getting too good – it was learning so fast that the creator could not keep up. This process was random as well, so it took a load of runs to get to 2,000 epochs. In a way this is a reassuring result, because it is a well-known problem with the technique I’m using, although it likely also indicates I’m at least partially on the wrong track. All said and done though, I thought the final result wasn’t bad this early in the experiments.

Beyond the problems of low resolution, a discriminator that learns too quickly, and managing all the data on the local machine, there is a much larger problem: I have no method for turning the spectrograms back into audio. Since that is the ultimate showstopper when the aim is to produce audio, this is the next issue to solve. Stay tuned for updates!