
AI will change music production, but not in the way you think

Metro Boomin’s new diss track, built around an AI-generated sample, highlights the transformative potential of AI in music production. Discover how this technology is reshaping the industry.

Earlier this month, amidst the media frenzy of the rap battle between Kendrick Lamar and Drake, a surprising twist occurred. The renowned hip-hop producer Metro Boomin released a diss track titled ‘BBL Drizzy’.

On first listen, the track resembles classic G-funk. But after a few listens, the Motown sample starts to stand out. It contains the phrase “BBL Drizzy”, sung as a catchy melody, which is striking because that phrase held no cultural relevance until Rick Ross used it to land a return shot on Drake on 14 April.

Creating such a tailored sample is no small task. Hiring session musicians and a skilled funk composer, then producing the music with vintage emulation techniques, would be an impressive effort for a quick jab in a rap feud. On closer inspection, it becomes clear that artificial intelligence was involved: Metro Boomin sourced the sample from an AI-generated track created by King Willonious.

Following Rick Ross’s remarks about Drake, King Willonious prompted an AI tool to create a song in the style of 60s Motown, incorporating the phrase “BBL Drizzy”. Willonious then uploaded the song to YouTube, where it currently has over 1 million views.

‘BBL Drizzy’ marks a historic milestone in social acceptance and signals a growing normalisation of AI music technology. But with increased acceptance comes a greater need to understand the tech. In this article, I explain its fundamental concepts and aim to give you the knowledge to participate with confidence in conversations about AI, both in general and specifically in the music space.

Metro Boomin’s diss track ‘BBL Drizzy’ featured an AI-generated Motown sample created by King Willonious

Yes, you can understand neural nets

Most people overestimate the difficulty of understanding neural nets. Often, it’s simply because of a lack of interest or confidence. But the reality is that you don’t need to be a programmer or mathematician to get a high-level understanding. You just need a bit of deliberate attention. Do that, and you will have sufficient comprehension of the inner workings of AIs to feel comfortable discussing and speculating on the topic.

In this article, we’re concerned with generative AIs. A generative AI is a program capable of creating text, images, videos, music, or other content based on an existing dataset and a prompt from the end user. A great example is the classic autocomplete algorithm.

The autocomplete algorithm is a simple system that predicts the next word a user types. It does so by calculating the probability of the next item in a sequence using reference data—the words we type. It shares basic principles with more complex models like ChatGPT but is much easier to grasp.

Imagine that we want to create a Kendrick-Lamar-style autocomplete algorithm. We start by gathering all the raw data: every piece of written text by Kendrick Lamar that we can access on the internet. That’s our dataset. Next, we organise and clean the data with the following procedures (a short code sketch follows the list):

  • Make a list of every unique word in the dataset
  • Count the number of times each word appears
  • Reorder the list so it’s sorted from the most common word to the least common
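
To make those three steps concrete, here is a minimal Python sketch. It assumes the collected lyrics live in a plain-text file; the filename kendrick_lyrics.txt is purely illustrative.

```python
from collections import Counter

# Load the raw dataset: all of the Kendrick Lamar text we collected.
# "kendrick_lyrics.txt" is a hypothetical filename used for illustration.
with open("kendrick_lyrics.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

# Count how many times each unique word appears...
word_counts = Counter(words)

# ...and reorder the list from the most common word to the least common.
ordered_words = [word for word, count in word_counts.most_common()]

print(ordered_words[:10])  # the ten most frequent words in the dataset
```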

Now, we have our ordered list of words used by Kendrick Lamar. Next, we do another round of processing; this time, we analyse the relationships between words. We start with the word at the top of our list, the most frequent one. First, we go back to the dataset and see which words most frequently follow it. Next, we log the information we find (a process called embedding). We then repeat this for every word in our ordered list. Now we have embedded relational data about how frequently each word appears in relation to the other words in the dataset.
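
A toy version of this embedding step could look like the sketch below, which simply walks the dataset in pairs and counts which words follow which (again using the illustrative kendrick_lyrics.txt file).

```python
from collections import Counter, defaultdict

# Same hypothetical dataset file as in the previous sketch.
with open("kendrick_lyrics.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

# For every word, log which words follow it and how often.
next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(words, words[1:]):
    next_word_counts[current_word][following_word] += 1

def most_likely_next(word):
    """Return the most frequent successor of `word`, or None if it was never seen."""
    followers = next_word_counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```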

Congrats! You have created rich data reflecting Kendrick Lamar’s unique style and can predict what he will say next. However, we’re limited to predicting only the next word. To predict longer sequences, we must run the process multiple times, feeding each result back in.

The process of training a generative AI model involves several key steps, including data collection, processing, and model training, in this example using Kendrick Lamar’s lyrics

Each time, we update the input with the output of the last run before going another round. In other words, we use the last autocompletion to generate the next autocompletion, and so on. (A related term you will often see is ‘epoch’, which strictly refers to one full pass over the training data and is commonly used as a metric for evaluating the state of a model in training.)

Now, we can input any combination of words, and the algorithm will find the path through the embedded data that is most likely to follow that input, autocompleting the text in the style of Kendrick Lamar. The process of building up all this embedded prediction information is called ‘model training’; our dataset is the ‘training data’; and the ‘model’ is the final program that contains it all.

If we prompt our model with the most common word from the dataset, we will get the most probable string of words. But what if we use a word that’s not in the dataset? Well, depending on how the developer decides to solve the problem, we will either get an error or some gibberish.
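
Building on the next_word_counts table from the earlier sketch, here is one hedged way the prediction loop and the unknown-word case might be handled; stopping quietly on an unseen word is just one of several choices a developer could make.

```python
# Builds on the `next_word_counts` successor table from the earlier sketch.

def autocomplete(prompt: str, length: int = 10) -> str:
    """Extend `prompt` word by word using the most likely successor each time."""
    generated = prompt.lower().split()
    for _ in range(length):
        followers = next_word_counts.get(generated[-1])
        if not followers:
            # The last word never appears in the dataset (or has no successor):
            # here we simply stop, but a developer could also raise an error or
            # fall back to a random word, which is where the gibberish comes in.
            break
        # Always taking the single most common successor keeps the sketch simple;
        # real systems usually sample from the probability distribution instead.
        generated.append(followers.most_common(1)[0][0])
    return " ".join(generated)

print(autocomplete("look at", length=8))
```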

Of course, state-of-the-art AI models are not quite as simple as autocomplete algorithms. The biggest difference is how models are configured to prioritise certain words or tokens during training. While we described simple relationships only by their frequency of appearance, models like ChatGPT use the ‘transformer’ architecture. This more advanced method allows the model to capture abstract structural correlations between data points and reveal trends. But that’s out of scope for us here.

If you feel like you get the autocomplete model, you’ve gained a solid understanding of the fundamentals of generative AI models. Let’s build on that and talk about music. 

Suno, Udio and ElevenLabs — how it works

Of late, three generative AI music companies—Suno, Udio, and ElevenLabs—have stolen the PR spotlight and caused quite a stir in the music space. Every week, we learn about a new revolution: anyone can now prompt a machine, and a full-length, high-quality song comes out. How is this possible?

It’s actually not that different from our autocomplete model. Audio files contain sound information that we can process much like we process raw text. We analyse these files as waveforms: representations of amplitude over time. They are extraordinarily detailed data, typically with 44,100 or 48,000 data points per second.

Individual samples function like audio pixels, with values typically ranging from -1.0 to 1.0. Data scientists, physicists, and musicians often discuss this data in terms of time and frequency domains. Once decoded, clear patterns in the music become apparent, even to the naked eye. In tandem with the audio files, we can enhance models with textual descriptions of each file. With enough data, these models start to discern and replicate complex structural relationships between text and audio. Once trained, they can recreate the structured context of their training data and generate music.
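
As a rough illustration of those two views of the same data, the sketch below synthesises one second of a 440 Hz tone rather than loading a real recording (it assumes NumPy is installed).

```python
import numpy as np

SAMPLE_RATE = 48_000  # samples (data points) per second

# Time domain: one second of a 440 Hz sine wave, 48,000 amplitude values
# between -1.0 and 1.0.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
waveform = np.sin(2 * np.pi * 440 * t)
print(len(waveform))  # 48000 data points for a single second of audio

# Frequency domain: a Fourier transform reveals which frequencies are present.
spectrum = np.abs(np.fft.rfft(waveform))
freqs = np.fft.rfftfreq(len(waveform), d=1 / SAMPLE_RATE)
print(freqs[np.argmax(spectrum)])  # ~440.0 Hz, the tone we synthesised
```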

The music they generate reflects the training data, just like our autocompletion model. For instance, a model trained exclusively on Kendrick Lamar’s music would only reflect his musical styles. Train a model on a diverse array of music, however, and the model begins to understand and generate a broader spectrum of musical behaviours. That’s how we get from simple word completion to music generation like Suno, Udio, and ElevenLabs.


READ MORE: Overtune | The AI-assisted app that turns your voice into a star’s


It might seem that these music models are appearing out of the blue, but they are not. They are built on the same technology that powers language, image, and video generators—all of which have been progressively refined over the years. It’s worth noting that Meta’s open-source project MusicGen is a robust model that seems to have played an influential part in spurring this wave of new companies.

It is, therefore, hard to say how much of the tech used by these companies is proprietary. From an outside perspective, it seems they have trained their models on open-source architectures, with only minor adjustments to the input layer: combining audio with lyrics, enriching the language descriptions, and using better-structured data. One can even question the legality of some of these practices.

What we can say with certainty, though, is that audio has caught up with visuals when it comes to generative AI.

What to expect in the future?

When envisioning the future of AI in music, we already have dramatic precedents that are quite telling. On 12 July 2022, the image generation tool Midjourney was released to the public. At the time, speculation about the future of stock photographers and designers was bleak. And while the generic and repetitive functions in those fields have felt the impact, the outcome has not been as bleak as predicted.

Generative AI tools excel at tasks with the largest and clearest representation in the dataset on which they are trained. But that’s also precisely their main limitation: they can’t surpass the quality or intention of the data they mirror.

As magical as these models are, when left to stand alone they often fail to deliver once users become familiar with their flaws. That’s where Adobe, for example, has done well by integrating generative tools into their editing suite. They’ve positioned generative AI as an interactive, sci-fi stock image service, emphasising user control as a key value proposition.

This is not to say there is no need for one-shot generation models. Midjourney is still used by many to generate affordable images for ads, blogs, and other media. But it did not replace designers or photographers. While we may see substantial demand for one-shot background and meditation music, artificial intelligence is unlikely to have a truly existential impact on commercial music.

There will nonetheless be a real long-term adaptation. It will reflect the one we are already seeing in tools used by designers and creatives—where people demand control to create specific things with intention and precision.

Intent, precision, and AI in music creation

Over the past couple of years, Overtune has patiently watched the development of AI in the music space. All the while, we have been researching and developing our own models, ever searching for the right use case. And now we have found it: assistive AI.

Overtune’s upcoming beat editor integrates assistive AI to support precise and intentional music creation

AI’s purpose is not to replace serious composers, producers, or songwriters. The datasets are not there, and they may never be; that is the fundamental limitation of generative AI models. You can still do incredible things, don’t get me wrong. But we are still early on the curve, and we have already hit a point of diminishing returns on quality.

There’s an ethical question at play here, too. Although the music industry found a balance with sampling technologies in the late 90s via royalty systems, some projects violate this precedent. These companies train their models on music or image catalogues that have not been cleared. Doing so does not fall under fair use and violates the terms set by the authors. Claims that these models are merely ‘inspired’ by copyrighted content are not only incorrect on a technical basis but also amount to gaslighting those with legitimate claims. AI has astounding potential, but these practices discredit its real value.

Its value lies in support; it is, as said before, assistive. The more precise a generation is, and the more aligned it is with the intent behind the dataset, the greater its value. For serious music, it won’t be about fulfilling demands like “make a pop song about my hamster”. It will be about solving problems like “if only I could turn this guitar loop into some funky keys” by making precise generations that fit into a greater context.

Overtune has built an ethical loop dataset of unmatched quality in terms of precision and intent in music production. It’s a library custom-made to support precisely this function: to aid the creator by providing new loops that fit a wider vision. We will soon launch a closed beta of a model fine-tuned on this dataset and embedded into our beat editor. Let’s call it the musical equivalent of Adobe’s Generative Fill. If you’re interested in experiencing the true potential of generative AI in music creation, sign up as a beta tester here and help us bring it to life.



