2020 Vision & Goals

I’ve been a bit inactive of late, a combination of getting a new machine learning computer and work. Still, it’s given me time to have a rethink about my approach to generating new Grateful Dead audio.

I have two new approaches I want to try out this year. You could say one of them has already been done – that is, someone has already generated “fake Grateful Dead” audio, and you’ve listened to it and accepted it as the real thing for years!

Filling In The Gaps

Some older shows recorded on digital (typically early 80s shows) can suffer from digi-noise, that is, pops and hiccups caused by the tape simply being a little old and losing some of its digital information. For a (really bad) example of this, see https://archive.org/details/gd1981-09-25.sbd.miller.88822.sbeok.flac16 (and check out the weird split Sugar Mag if you can get past the sound issues).

Not all tapes are this bad, and in fact in most cases when this happens someone like Charlie Miller will cover up the noise with some editing. But think about that sentence again: “cover up the noise with some editing” – hey, that sounds like putting in fake audio. You have a piece of music with a discontinuity, and you have to fill it with something that sounds nicer and sounds like the GD, right?

Well, almost. Look at a typical sample of digi-noise:

Digi-noise plotted as a waveform

This digi-noise would be pretty loud compared with the rest of the audio (which is why we want to hide it). However, if we zoom in:

Zoomed in digi-noise

We can see that in fact the time for this audio event is from ~3.0195s to 3.0226s, i.e. 0.0031s. That’s just 3 thousandths of a second, which is probably easy enough to fix in a studio.

But this problem is ideal for generating new GD audio. Up to now, my effort has been to “teach” the computer by feeding it a large amount of GD, and then asking it to make some original audio. The problem with this approach is that the test is very difficult for the computer. If I first asked you to read every single Stephen King novel and then tasked you with writing a new paragraph in the same style, you would find that difficult. If however I asked you to start by filling in a missing word, well that would be a lot easier. Or if that was too much, start with a single letter.

And that, in a nutshell, is the new approach. Instead of asking the computer to generate new audio from scratch, we instead ask it to fill in the missing audio. At first this will be something like a few thousandths of a second. When that works, I simply ask it to fill in larger and larger gaps.

This approach has been tried for images, and the results are pretty good.

GANS network filling in image after training on human faces

As you can see, the computer is able to generate many images to fill in the blank space.
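
To make the idea concrete, here is a minimal sketch (Python, using numpy and librosa; the file name and gap position are just examples, not my actual setup) of how a training pair for this “fill in the gap” task could be built: blank out a few thousandths of a second of a clip, and keep the untouched original as the target the network should learn to reproduce.

import numpy as np
import librosa

# Load a clip as a mono, 22,050 Hz waveform (the file name is made up)
audio, sr = librosa.load("gd_clip.wav", sr=22050, mono=True)

# Blank out roughly 3 thousandths of a second, about the length of a digi-noise pop
gap_len = int(0.003 * sr)                      # ~66 samples at 22,050 Hz
gap_start = 3 * sr                             # put the gap 3 seconds in, for example

damaged = audio.copy()
damaged[gap_start:gap_start + gap_len] = 0.0   # the hole the network must fill

# Training pair: the damaged clip is the input, the original clip is the target
x_train, y_train = damaged, audio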

Style Transfer

The second thing I shall try this year is a “style-transfer” with GD audio. These are best explained with images. Example: I have some photos. I also have digital copies of many paintings. I train the computer to recognise the style of the painter and “transfer” my image into the style of the painter.

Basic image on left, styles applied on right

So what styles are there in GD audio? Of all the tapes I have ever listened to, they are almost always one of two styles: audience or soundboard. So I will train the computer to tell the difference between them, and then ask it to render audience audio in the style of a soundboard. I hasten to add that quite a few people prefer audience tapes (especially compared with the somewhat dry soundboard tapes of the early 80s), and that the style transfer could easily go the other way.
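
As a very rough sketch of the “tell the difference” part, a small Keras network could be trained to label mel spectrograms as audience or soundboard. The input shape and layer sizes below are placeholders rather than my actual settings, and this is only the classifier half, not the style transfer itself.

from tensorflow import keras
from tensorflow.keras import layers

# Input: a mel spectrogram treated as a one-channel image
# (128 mel bands x ~430 time frames is roughly 10 seconds of audio)
inputs = keras.Input(shape=(128, 430, 1))
x = layers.Conv2D(16, 3, activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # 0 = audience, 1 = soundboard

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(mel_clips, labels, epochs=10)   # mel_clips and labels are assumed to exist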

Time Dependency

This last point is a technical issue, but one which could easily offer the best results.

So, a sound file is – to the computer – a linear series of numbers (each number being the volume at a given point in time).

What we are really asking the machine to do is to continue generating a series of new numbers based on the numbers so far.

But think how you might do this. To accurately guess what comes next, we work on a number of differing timescales. Note in the scale? Chord in the sequence? Verse to be sung? Is it a Bobby number next? All my attempts so far have really concentrated on the “next note”, because music generates a lot of data, so we only check a local window of time, otherwise the computation gets really slow. In effect, to generate the next second of music, my code so far only looks at the previous 2-4 seconds. But to produce longer samples, we will need the computer to understand a lot more about the structure of the song.

I don’t want to get super-technical here, but Google researchers have a partial solution to this, which they used for creating realistic human voices (paper here: https://arxiv.org/abs/1609.03499).

It essentially means my software will be able to take inferences from a much, much larger area of the song. Generating a longer section of audio might not get any quicker, but no longer will the computer have the memory of a goldfish. I’m really interested in this approach because it’s been tried and tested. Here, for example, is a section of music generated by a computer that has been trained on piano recitals.

Piano recital sample generated purely from the mind of a computer

My point being: If it can be done with piano recitals, it can be done with the Grateful Dead.
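
The key trick in that paper is the dilated causal convolution: each layer looks twice as far back in time as the one before it, so a short stack of layers covers seconds of audio rather than milliseconds. Here is a minimal Keras sketch of just that idea (not the full WaveNet architecture, and the filter counts are placeholders):

from tensorflow import keras
from tensorflow.keras import layers

# Raw audio in, one prediction per sample out
inputs = keras.Input(shape=(None, 1))
x = inputs

# Each layer doubles its dilation rate, so the receptive field grows
# exponentially: these 10 layers already "see" over 1,000 samples back.
for dilation in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    x = layers.Conv1D(32, kernel_size=2, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)

outputs = layers.Conv1D(1, kernel_size=1)(x)   # guess the next sample
model = keras.Model(inputs, outputs)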

Finally, Results

Yes, you read that headline correctly: I have results! However, those expecting authentic-sounding Grateful Dead – in whatever form that may take – will probably have to wait a lot longer. But if we view this whole process as akin to cooking a meal, I have at least sorted out the ingredients and cutlery, even if the food so far is somewhat lacking.

Again, our friend phase

So the basic approach was as outlined in the previous blog post. We build 2 computer programs, 1 to detect Grateful Dead music and the other to create some music. Then we put these 2 machines in an evolutionary arms race, where they should slowly get better at their jobs, and ultimately the generator should be able to create new Grateful Dead music (or at least, new music that you cannot tell is different).
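
For anyone curious what that arms race looks like in code, here is a heavily simplified Keras sketch of the two programs and one round of the contest. The layer sizes, slice length and 100-number noise input are placeholders, not my actual settings.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SLICE = 1024   # length of one audio slice, purely for illustration

# The creator: turns 100 random numbers into a fake slice of "audio"
generator = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(100,)),
    layers.Dense(SLICE, activation="tanh"),
])

# The detector: guesses whether a slice is real Grateful Dead (1) or fake (0)
discriminator = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(SLICE,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Chain them together; the creator is trained to fool the (frozen) detector
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_round(real_slices):
    batch = len(real_slices)
    noise = np.random.normal(size=(batch, 100))
    fakes = generator.predict(noise, verbose=0)
    # 1) teach the detector: real slices are 1, fakes are 0
    discriminator.train_on_batch(real_slices, np.ones((batch, 1)))
    discriminator.train_on_batch(fakes, np.zeros((batch, 1)))
    # 2) teach the creator: it "wins" when the detector calls its fakes real
    gan.train_on_batch(noise, np.ones((batch, 1)))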

The approach taken is not to use actual music (because this is difficult), but instead to do some processing on the sound beforehand. We actually split the audio into hundreds of time slices, and then work out all the sine waves of each slice. Since this is the format we input, it is also the format we output.

This gives us a problem with the output. Let’s imagine a piece of audio composed of 2 sine waves. It would look like this:

2 sine waves, representing a single sound

You can see that the starting point for both of these sine waves – the left hand side of the graph – is different. The red sine wave starts high up and the blue low down. This information is known as the phase of the signal.

The problem our generator has is that it generates sine waves but no phase information. Our generator then starts all sine waves at point 0 – the black line. But it has to do this for all the times we slice the audio up. The result is a set of broken sine waves, where the wave is “reset” at the start of every time slice:

Audio file with phase=0 at start of every timeslice

As you can see, this is most definitely not going to sound like what we want!

Simple solutions for the win

I really took some time to try and fix this issue. It is a known problem in audio processing and there is basically no perfect fix for it. There have been some attempts using machine learning, but that would involve another huge amount of work. So instead, I did something very simple and it seemed to work. Quite simply, I just randomised the phase information, instead of setting it to 0 all the time. As can be seen in the diagram above, there is a repetition in the phase information at the start of every time step. This repetition really sticks in the ear. If you randomise the phase information, then you get something more like this:

Phase at every time stamp randomised instead of constant zero

This is not perfect, but now the results sound a lot better, and phase randomisation turned out to be not that hard to implement. With that out of the way, let’s move on to the results.
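
Before moving on, for the technically minded, here is roughly what that randomisation looks like in code, assuming numpy and librosa and a set of generated magnitudes (one per frequency per time slice). The example magnitudes below come from a test tone rather than from my generator.

import numpy as np
import librosa

# Stand-in for the generator's output: one magnitude per frequency per time slice
magnitudes = np.abs(librosa.stft(librosa.tone(440, sr=22050, duration=2.0)))

# Phase zero everywhere: every sine wave restarts at the beginning of each slice
flat_phase = np.zeros_like(magnitudes)
clicky = librosa.istft(magnitudes * np.exp(1j * flat_phase))

# Random phase everywhere: no repeating "reset" for the ear to latch onto
random_phase = np.random.uniform(-np.pi, np.pi, size=magnitudes.shape)
smoother = librosa.istft(magnitudes * np.exp(1j * random_phase))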

What we don’t expect

So essentially, the computer code is trying to replicate a certain style of music. If it were unable to incrementally get better, we might expect it to just produce random noise. In particular, we might expect white noise.

3 second sample of white noise

Or even pink noise, which is apparently what the sound crew used to test the GD’s audio system before a gig (pink noise has equal energy per octave, so it rolls off gently towards the higher frequencies):

3 second sample of pink noise
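
For the curious, both of these can be knocked up in a few lines: white noise is just random samples, and pink noise can be approximated by scaling each frequency’s strength by one over the square root of its frequency, which gives that equal power per octave. A quick numpy sketch:

import numpy as np

sr, seconds = 22050, 3
n = sr * seconds

# White noise: equal power at every frequency
white = np.random.normal(size=n)

# Pink noise: reshape white noise so power falls off as 1/f
spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1 / sr)
spectrum[1:] /= np.sqrt(freqs[1:])     # leave the zero-frequency bin alone
pink = np.fft.irfft(spectrum, n)
pink /= np.max(np.abs(pink))           # scale into the -1..1 range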

So for any success, we don’t want to sound like these 2. What we do want is something that sounds like the Grateful Dead. So here is a sample of the Grateful Dead in exactly the same format as my results – a 22050Hz mono audio sample:

Bonus points for guessing year, show or song!

The Results

After many hours of rendering audio, my program produced some samples. To the best of my knowledge, this is the first computationally created Grateful Dead audio to ever be generated (more epochs should be better):

Render after 1000 epochs
Render after 1500 epochs
Render after 2000 epochs

So – is this progress? It is certainly not the Grateful Dead, but on the other hand it is not white or pink noise. It is also – unlike my previous posts – actual audio. I count this as a partial success, but also an indicator the final goal is some distance away.

The Future

The obvious thing to do now is to increase everything – the amount of music I use as data, the length of time I spend processing, and the size of the output data. This will take some time. I have another approach up my sleeve that involves generating songs from a very low resolution and then increasing the resolution, as opposed to starting with a small piece of music and trying to make it longer. But that’s for another post!

Finally Getting The Data Right

I’ve been a bit quiet in posting recently, not so much for a lack of work, more a lack of progress. But this weekend I did indeed finally manage to get my data sorted.

Those of you following along so far may know that I’ve had difficulty with my data format, that is, the stuff I actually give to the computer to learn from. There are 2 ways of doing things. I can use a Mel spectrogram – which is a mathematical conversion of sound into an image – or I could use a normal uncompressed wav file.

The MEL way seemed good because, on the face of it, I managed to get the whole process working. I was able to train a Grateful Dead discriminator, followed by a producer that seemed to put out pretty good MEL representations. Here was new Grateful Dead music! To remind ourselves, I got this:

Real Grateful Dead audio on the left: machine learnt audio on the right

But… you can’t hear a MEL image, can you? These couldn’t be converted into sound. With wav files, I had the exact opposite problem: turning them into sound is trivial, but I couldn’t get the machine learning actually learning at all (a common problem with wavs in the ML world). Over the summer I tried various ways to make the wav method work, but it never did. It became obvious that I had to go the MEL route since it actually worked. This meant turning a MEL image back into audio, and that in turn meant tackling the maths.

Anatomy of a MEL

So how is a MEL made? The first step is the hardest to understand, although it is easy enough to express. We use a Fourier transform (some fancy maths) to take some sound and decompose it into its constituent frequencies. Give it some audio and you end up with a collection of sine waves. We don’t do this for the entire audio – we do one Fourier transform for, say, every 1/5th of a second, ending up with a series of time blocks for which we have all the sine waves generated by the band in that time period.

The final stage is to adjust this so that the frequencies our ears are most sensitive to are boosted in power, and those that the ear finds harder to hear are diminished. We want the machine to “hear” the same things that we do (this is actually the “MEL” part of “MEL Spectrogram”).
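
In code, both steps – the sliced-up Fourier transforms and the ear-shaped weighting – are a single library call. A sketch with librosa, using typical settings rather than my exact ones:

import librosa

# Mono audio at 22,050 Hz (the file name is made up)
audio, sr = librosa.load("gd_clip.wav", sr=22050, mono=True)

# n_fft: how much audio goes into each Fourier transform (~93 ms here)
# hop_length: how far we move between transforms (~23 ms here)
# n_mels: how many ear-weighted frequency bands we keep
mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=2048, hop_length=512, n_mels=128)
# mel.shape is (128, number_of_time_slices)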

The Hard Work

When I first used MELs, my approach was to construct the data in the following way:

  • Cut the wav files up into slices
  • Normalise the audio so it’s all roughly the same volume
  • Turn the short audio files into MELs
  • Turn each MEL into an image file, to give to a neural network

My job was to reverse that process. However, a wrinkle was that I wasn’t really giving my neural networks an image at all; it was just that the software library I use (Keras) has some useful functionality that makes it easy to feed it images. Keras does what we call “data normalisation” – essentially converting each colour value into a number between 0 and 1.

Now if I was feeding the neural net normalised MEL images, then I would be getting back the same thing. This meant I could skip the image-creating part. This was the key insight for me. Now my process could be:

  • Cut the wav files into slices
  • Normalise the audio
  • Turn audio into MEL files
  • Normalise the MEL files

Luckily, it turned out the last step was just a bit of maths fiddling, and once that was done, I was able to move back and forth between audio and MEL files easily. Finally!
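
The maths fiddling boils down to something like the sketch below: squash the MEL values into the 0-to-1 range the network expects, and keep enough information to undo it, so that librosa can rebuild audio from the MEL afterwards. The -80 dB floor here is an assumption, not necessarily my exact value.

import numpy as np
import librosa

def mel_to_unit_range(mel, floor_db=-80.0):
    # Power mel spectrogram -> values between 0 and 1
    db = librosa.power_to_db(mel, ref=np.max)      # roughly floor_db .. 0
    return (db - floor_db) / -floor_db

def unit_range_to_audio(normed, floor_db=-80.0, sr=22050):
    # Values between 0 and 1 -> power mel -> audio (the phase is guessed)
    db = normed * -floor_db + floor_db
    mel = librosa.db_to_power(db)
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr,
                                                n_fft=2048, hop_length=512)

# Round trip: audio -> MEL -> 0..1 -> MEL -> audio
audio, sr = librosa.load("gd_clip.wav", sr=22050, mono=True)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048, hop_length=512)
rebuilt = unit_range_to_audio(mel_to_unit_range(mel), sr=sr)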

The Results Are In…

I took 12 seconds of Grateful Dead, from the Scarlet Begonias on 16th April 1978. It’s been mixed down to mono, 22050Hz (half CD quality) at 16-bit (same as CD). It sounds like this:

Now here’s the same 12 seconds of audio, after being converted to a MEL spectrogram and then back again:

Now that doesn’t sound good, does it – what’s going on? Well, if you remember our Fourier transform – it turns audio into a set of sine waves – the fault lies there. Let’s look at a sine wave. Here the horizontal axis is time:

Now we take the music from a certain point in time (say, the first 1/5th of a second). The sine waves we get from that are accurate, but we lose the information about where they start, that is, where each wave should be at the beginning of the time slice. We don’t know where it is on the vertical axis. Since we lose this information, when we reverse the process we have to start every single sine wave we reconstruct from point 0, that is, the middle of the vertical axis.

This problem is not unknown in machine learning audio analysis, and it is said that the signal is “out of phase” – you can certainly hear it in the audio. But I hope you can agree it’s still the good old Grateful Dead.

Moving Forward

Some researchers have seemingly been able to use machine learning to “re-phase” the audio and clear this mess up. So the answer to our machine learning problem is likely more machine learning. However I’ll look at this another time. For now, it’s back to my original experiment. I need to build a discriminator that works with these slightly new MELs. If that can be done (and I should be able to find that out quite soon), we won’t be far away from fully new, synthesised Grateful Dead audio. It’ll sound terrible, but I will then have something to show for my efforts.

Reducing a Jerry solo to numbers

My inbox is filled with literally no emails asking about how a neural network detects things.

Possibly that’s because no one is interested, or maybe they think it is too complex, or maybe they think I should just talk about something actually interesting to them. But this project will never be complete without the help of others, even if that help is of the form “you’ve mucked that up again Chris”. So here is a simple explanation of how my current network is intended to work. Well – it’s about as simple as I can make it.

After just a few months delving into machine learning, I can tell you the hardest part is DATA. Specifically, getting that data into exactly the right format. And specific you must be, for should any part of it be wrong, the computer will taunt you with many a horrible error message.

So let’s start at the easy end of the spectrum, with something trivial: sound. Let’s visualise sound as a simple waveform:

Part of a waveform from a version of Tough Mama by the Jerry Garcia Band

Now since we start on a computer, we don’t have sound but we have a WAV file. That’s technology that has been around since 1991. So how do we go from sound to a WAV file?

The answer is to move along the waveform from left to right. At regular points along the waveform, we take a sample of its value. Now we will lose some information doing this, but we sample at a really high rate so that shouldn’t be a major issue unless you are a purist. In the case of CDs and, in particular, my WAV files, we sample 44,100 times a second – the sample rate. Remember that number – it will bite us later.

Sampling the data at regular intervals

Now we have merely reduced the waveform to a fancy bar chart. How to simplify it more? Well, the next stage is to realise that the waveform moves around a centre, sometimes being high up and sometimes being low down. Now, a quirk of recording means that generally there is a maximum level above or below zero that a microphone can handle. So knowing there is a maximum, we can simply assign a number in the range -maximum to +maximum to every point on our bar chart. We could make “maximum” 5, which would mean all of our samples would be between -5 and +5.

Reducing to values from -5 to +5. Notice some accuracy is lost with such a small range

Computers, for reasons we don’t need to go into, choose some strange numbers. In my WAV files, they vary from -32,768 to +32,767. So now we have reduced our music down to just a series of numbers. That is what is stored in the WAV file.

Luckily for us, machine learning requires a series of numbers – but just the right amount, and in the right format.

How much is just enough?

Actually, we are already pretty close with the WAV file. The biggest change we need to make is that the machine is rather fussy about its numbers. It requires values between 0 and 1. We have a series of integers from -32,768 to +32,767. It turns out to be pretty easy to convert between the 2 though: we simply take the number from the WAV file, add 32,767 to it and then divide by 65,535. For example, the number -13,270 would actually be:

(-13,270 + 32,767) / 65,535 = 0.29750515
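
In code that conversion is a one-liner. The sketch below assumes the samples have been read out of the WAV file as 16-bit integers, for example with scipy, and the file name is made up:

import numpy as np
from scipy.io import wavfile

sr, samples = wavfile.read("gd_clip.wav")      # 16-bit ints, -32,768 to +32,767

# Shift and scale every sample into the 0..1 range the machine wants
scaled = (samples.astype(np.float64) + 32767) / 65535

print(scaled.min(), scaled.max())              # roughly 0.0 and 1.0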

The final tricky part is in reducing the data. You see, machine learning is a little slow. The more information – that is, the more numbers – we give the computer the slower it is. And machine learning is not just slow, it can be positively glacial at times. So we always want to try and reduce the data.

So what IS a reasonable amount of data? Well, let’s define our units. We’ll say “the total number of numbers we have to give to the machine for each piece of data”. A “piece of data” meaning, in this case, some sound. In my efforts with spectrogram images, I used a 10 second audio sample and the resulting image had 320 x 240 x 3 = 230,400 numbers. With our WAV file, 10 seconds of audio works out at

10s * 44,100 sample rate * 2 stereo channels = 882,000

That’s almost 4 times the data we used with the spectrograms, so what can we do to reduce that? Well, does it really need to be stereo? Probably not – mono would be fine. Finally, does it need to be sampled at 44.1kHz? Likely not – if you sample at half that, the quality is still good enough. With that in mind, look at those numbers again:

10s * 22,050 sample rate * 1 mono channel = 220,500

Now we have 10 seconds of audio reduced to just under a quarter of a million numbers from 0 to 1. In a future post, we’ll look at how machine learning uses those numbers to learn about the Grateful Dead.

The tricky question of audio

When I started writing this blog, content was pretty easy, since I already had a backlog of results complete and ready to write about. Now of course, updates are in real time, as they come. They’re unlikely to sound as confident, as progress is likely going to be slow. So for this post I’m going to look at a problem I have, and try to share some solutions with you. That problem is how to represent sound accurately and easily.

This is really a 2 part problem. Firstly I need to feed the computer pieces of audio. The longer the piece of audio, the more data it consumes and the longer it takes for the machine to study and learn from it. So I have to think very carefully about the data size of this audio otherwise experiments will be very slow. Secondly, there are 2 main ways to feed the machine the audio – it could be digital wav files or a spectrogram (a graph image) of the sound file. The spectrogram is the current standard for audio recognition with machine learning; however, whatever format I give the data in is the format I get back as results. So feeding it a spectrogram image doesn’t give me back audio – it gives me back an image that should be a spectrogram.

For the last few weeks I tried to solve this by turning a spectrogram back into audio. However, my efforts have not been that successful. The first problem was that even with a perfect setup, turning sound into a graph and back again loses you a lot of sonic information. Imagine creating an MP3 with a really terrible bitrate and then converting it back to a wav. The second problem was that my code didn’t give me a spectrogram – it gave me an image which was meant to fool a spectrogram detector. Subtle, but it meant that the code to turn the graph back into sound failed in a lot of cases because the output was not a pure spectrogram.

With all that in mind, I decided to turn my attention to actually using sound. Whilst that sounds logical, any cursory examination of the machine learning field will show you that raw audio has not been that successful. Well, no problem – “no success” is currently the default with regards to audio results. And if I use audio, I will get audio results, which is obviously better than what I have now. Let’s do this!

Now, back to the right amount of data. With data we only have one unit – numbers. My machine requires me to pass it a set of numbers, and this set of numbers must be of equal quantity and of the same dimension. The same dimension? This is easier to understand if we think about my spectrograms. They were images, right? And the size of each image was 320 x 240 pixels. But every image had three parts to it – the red, green and blue parts. So the dimensions of my data were 320 x 240 x 3 = 230,400 numbers. So this is the amount of data I was comfortable with processing in my last experiments. So let’s try and get a similar number for the raw audio.

Audio is encoded digitally by sampling the music frequently and assigning a single number to each sample.

Audio data (red line) sampled over time – results in blue.

The sample rate for CD audio is 44.1kHz, which also happens to be the highest sample rate of the raw data I have. However, there are 2 channels – left and right. Let’s say we took 10 seconds of audio (the same amount of audio we used for the spectrograms); we get 44,100 * 2 * 10 = 882,000 numbers. Almost four times larger. Well, let’s make some cuts to this. For a start, we probably don’t need stereo, and we can easily cut the sample rate down to half as well, getting us

22,050 samples/second * 1 mono channel * 10 seconds = 220,500 numbers

We are in the right ballpark. In actual fact, I ended up choosing 8 seconds of audio to be safe, and also to allow me to get more samples out of the data I had. A number, by the way, requires 4 bytes to be stored, so each of these 8 second clips ends up taking 841k.
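
In practice the whole reduction – mono, half the sample rate, a fixed clip length – can be done while loading the file. A sketch with librosa (the file name is made up):

import librosa

# Load as mono, resampled down to 22,050 Hz, keeping just the first 8 seconds
clip, sr = librosa.load("gd_tape.wav", sr=22050, mono=True, duration=8.0)

print(len(clip))   # 176,400 numbers (22,050 samples/second * 8 seconds)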

Now onto the next question. Can the discriminator differentiate between GD and non GD given just 8 seconds of audio? Check back in a week or so to see the result.