First Steps

This blog was started a month or two after I began experimenting, so I have some catching up to do. After coming up with my plan, I had a few things to do. Like learn how to do machine learning. The first thing I did was take some existing code and train it on some (any) data. Since I was going to be using a technique that works with images, it made sense to start with some simple images. So I downloaded an example DCGAN and, after a few weeks of trying to understand what was going on, I managed to train it to pop out pictures of Jerry Garcia:

Output of DCGANS trained on pictures of Jerry Garcia

Now this worked, but I only had a few images of Jerry to work with (well, 32, but that’s not a lot in the world of machine learning). In fact, the machine was not so much learning to draw Jerry as memorising the images it was shown. But this was enough to show that the technique essentially worked, and that I had a decent base to start from.

Learn, and learn again

What did I get from this first foray? Quite a lot, but the main points were:

  • You need to really understand the data you provide
  • The data needs a huge amount of processing
  • There is potentially more code in getting the data into the right format than in the DCGAN itself!
  • The process is SLOW. The image above took 9 hours to compute.

Go Grab Some Data

The next step was data collection. I took five Grateful Dead shows and roughly the same amount of other audio, and split it all up into ten-second sound slices. I then turned every slice into a Mel image. I hit a problem here: the code I had only worked with 128×128 images, and it already took forever to train at that size, so to start with I simply resized all my Mels to 128×128. This would be awful for audio quality (probably even worse than some of those dreadful summer ’70 audience tapes), but you have to start somewhere.
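For the curious, the slicing-and-Mel step looked roughly like the sketch below. It uses librosa and Pillow; the file name, sample rate and Mel settings are placeholders rather than the exact values I used.

```python
# Sketch: slice audio into ten-second chunks and save each as a 128x128 Mel image.
# Assumes librosa, numpy and Pillow are installed; paths and parameters are illustrative.
import numpy as np
import librosa
from PIL import Image

SR = 22050          # sample rate (assumed)
SLICE_SECONDS = 10  # ten-second slices
N_MELS = 128        # Mel bands (vertical resolution of the image)

def audio_to_mel_images(path, out_prefix):
    y, sr = librosa.load(path, sr=SR)
    samples_per_slice = SR * SLICE_SECONDS
    n_slices = len(y) // samples_per_slice
    for i in range(n_slices):
        chunk = y[i * samples_per_slice:(i + 1) * samples_per_slice]
        # Power Mel spectrogram, converted to decibels so quiet detail is visible
        mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=N_MELS)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Scale the dB values to 0-255 greyscale and squash to 128x128,
        # throwing away time resolution (this is the "awful for quality" part)
        img = (255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())).astype(np.uint8)
        Image.fromarray(img).resize((128, 128)).save(f"{out_prefix}_{i:04d}.png")

audio_to_mel_images("some_gd_show.flac", "gd_slice")  # hypothetical file name
```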

I should note that the work in that simple paragraph took about two to three weeks, on and off. Life does indeed get in the way. However, at the end of a pretty long learning session, I was able to post this image onto Reddit for a few people to look at:

First proper results

So there you go. I think you’re looking at the first computer-generated Grateful Dead, although ideally you’d be listening to it. Problems? Well, you’ll see the real image is larger, has a different aspect ratio and, beyond some colour matching, is pretty much nothing like the generated image on the right. Still, it’s a step in the right direction. It just needs a lot more training.

Starting With The Basics

Over their career, the Grateful Dead played some 2,318 shows. Each of them is unique. Fortunately, the majority of them were recorded for posterity.

My ambition is to create new shows using machine learning. That is, entire new shows that, to the listener, are indistinguishable from the real thing, except they will be created inside a computer.

Since this is somewhat of a non-trivial task, we’ll start this blog with a little explanation of the method we will begin with. First, note that generating a whole show is a serious piece of work, so I’m going to start by reducing it to a slightly different task: produce ten seconds of audio that sounds like the Grateful Dead.

So, how do we even do that? The answer lies in deep convolutional generative adversarial networks (DCGAN). Big words, but we can break them down.

To make a DCGAN, we need two pieces. The first piece is a discriminator. You give it a piece of music, and it tries to answer one question: is this music Grateful Dead? The discriminator is trained using machine learning techniques: we give it two large sets of data (lots of ten-second Grateful Dead snippets, and lots of ten-second non-Grateful Dead snippets) and let it learn to tell the difference.
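To make the idea of a discriminator concrete, here is a minimal PyTorch sketch of a DCGAN-style discriminator for 128×128 single-channel Mel images. The layer sizes follow the standard DCGAN recipe and are assumptions, not the exact network from the example code I started from.

```python
# Sketch: a DCGAN-style discriminator for 128x128 single-channel Mel images (PyTorch).
# Layer sizes follow the common DCGAN recipe; they are illustrative, not my exact code.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Halve the spatial resolution each time (stride-2 convolution)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),  # 128 -> 64
    conv_block(64, 128),          # 64 -> 32
    conv_block(128, 256),         # 32 -> 16
    conv_block(256, 512),         # 16 -> 8
    conv_block(512, 1024),        # 8 -> 4
    nn.Conv2d(1024, 1, 4, 1, 0),  # 4 -> 1: a single score per image
    nn.Sigmoid(),                 # probability that the input "is Grateful Dead"
)

fake_batch = torch.randn(8, 1, 128, 128)   # 8 random stand-in "Mel images"
print(discriminator(fake_batch).view(-1))  # 8 probabilities between 0 and 1
```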

The other part is the creator (usually called the generator). This is a network that learns to create Grateful Dead audio. It does this by trying to beat the discriminator. When training starts, the discriminator is really bad at telling the difference between the two sets of audio, so the creator can easily fool it. As the discriminator improves by learning, so should the creator.
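And here is the matching sketch for the creator: it takes a vector of random noise and repeatedly upsamples it until it becomes a 128×128 Mel image. Again, this is the textbook DCGAN layout, shown only to make the idea concrete.

```python
# Sketch: a DCGAN-style creator/generator that turns random noise into a 128x128 Mel image.
# Textbook DCGAN layout, for illustration only.
import torch
import torch.nn as nn

LATENT = 100  # size of the random noise vector (assumed)

def up_block(c_in, c_out):
    # Double the spatial resolution each time (stride-2 transposed convolution)
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

creator = nn.Sequential(
    nn.ConvTranspose2d(LATENT, 1024, 4, 1, 0),
    nn.BatchNorm2d(1024), nn.ReLU(inplace=True),  # 1 -> 4
    up_block(1024, 512),                 # 4 -> 8
    up_block(512, 256),                  # 8 -> 16
    up_block(256, 128),                  # 16 -> 32
    up_block(128, 64),                   # 32 -> 64
    nn.ConvTranspose2d(64, 1, 4, 2, 1),  # 64 -> 128
    nn.Tanh(),                           # pixel values in [-1, 1], matching normalised Mels
)

noise = torch.randn(8, LATENT, 1, 1)
fake_mels = creator(noise)               # shape: (8, 1, 128, 128)

# The adversarial game, in one line each:
#   discriminator loss: push D(real Mel) towards 1 and D(creator(noise)) towards 0
#   creator loss:       push D(creator(noise)) towards 1, i.e. learn to fool the discriminator
```

The comments at the end summarise the game: the discriminator is rewarded for telling real Mels from created ones, while the creator is rewarded for fooling it.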

Sounds Simple!

If only. The main problem is the data. DCGANs have proven to be very successful with images. As an example, go and look at some images of humans generated using this method: https://thispersondoesnotexist.com/. I hope you’ll agree that’s quite impressive. However, experiments with raw audio have NOT been so successful, so to start with our experiments will be with images, NOT audio. However, we will use a very special kind of image: a Mel spectrogram.

Mel Spectrograms

A Mel spectrogram is essentially a special kind of graph. The horizontal axis represents time, the vertical axis represents frequency, and the volume is given by the colour. It’s easier to understand by looking at one. Here’s a Mel of ten seconds of Grateful Dead audio (specifically, ten seconds of Sugar Magnolia from 16th April 1978):

You can kinda see a beat as the vertical lines on the Mel.
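If you want to make a Mel like that yourself, the sketch below does it with librosa; the file name and settings are placeholders, not the exact ones behind the image above.

```python
# Sketch: compute and plot a Mel spectrogram of a ten-second clip with librosa.
# File name and parameters are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("sugar_magnolia_10s.wav", sr=22050, duration=10.0)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # loudness in decibels

fig, ax = plt.subplots()
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")   # colour = volume
ax.set_title("Time across, frequency up, colour = loudness")
plt.show()
```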

Plans For The Future

So the plan is pretty simple:

  • Go and grab a load of audio
  • Slice it into ten second files
  • Generate a Mel image for each slice
  • Train the discriminator on these images
  • Train the creator against the discriminator
  • Take the image output of the creator and turn it back into sound (a rough sketch of that last step is just below)
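That last inversion step is the sketchiest part. One way to approximate it is librosa's Griffin-Lim based Mel inversion, shown below under the assumption that the creator's 128×128 output has been mapped back to decibel values; the settings are illustrative and the result will sound rough.

```python
# Sketch: turn a 128x128 Mel "image" back into audio with librosa's Griffin-Lim inversion.
# Assumes the image is mapped back to decibel values first; settings are illustrative.
import numpy as np
import librosa
import soundfile as sf

SR = 22050

def mel_image_to_audio(mel_img_uint8, db_range=80.0):
    # Undo the 0-255 greyscale scaling back to a (negative) dB range, then to power
    mel_db = mel_img_uint8.astype(np.float32) / 255.0 * db_range - db_range
    mel_power = librosa.db_to_power(mel_db)
    # Griffin-Lim phase reconstruction from the Mel spectrogram
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=SR)

fake_mel = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)  # stand-in for creator output
audio = mel_image_to_audio(fake_mel)
sf.write("generated_clip.wav", audio, SR)
```

Note that because the time axis was squashed down to 128 frames, the reconstructed clip comes out shorter than ten seconds unless you stretch the Mel back out first.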

And then, voila! We have created something that hopefully sounds like the Grateful Dead. Stick around to see what results we get!