Devlog: Groovematch (AI)

Intro

Machine learning is a subject that has always caught my interest. Computers making decisions on their own - how cool is that? At first, I was totally baffled that it was even possible: what kind of complicated code do you have to write to accomplish machine learning tasks?

Up to this point, I had only worked with ML in academic and research environments. Through the few projects I completed in both settings, I felt I had gained enough experience to start an ML project of my own.

While looking around for datasets, I found a couple that caught my eye, each containing many Spotify songs and their properties.1 What interested me most about them, though, was that they had columns for audio features - including energy, acousticness, and instrumentalness - along with each track’s genre.

That led me to an idea: a music recommendation service that takes in a user’s song(s) of choice and recommends songs with similar audio features. It’s a generic idea that has been done many times before, but I figured it would be a great opportunity to educate myself further on the ML engineering process.

Starting off

Data processing

Luckily for me, both of these datasets contain a wide variety of songs spanning many genres. I also figured I might eventually want to fetch songs from my own playlists, and maybe even ask my friends to contribute theirs.

I started off with Python and the pandas library, writing a few functions to handle merging datasets. The first merges an existing CSV dataset of Spotify songs into a “master” dataset to be used in model training:

import pandas as pd

MASTER_DATASET_FILE = 'master.csv'

def import_csv_from_file(
    file_path: str,
    cols=['track_id', 'track_name', 'artist', 'popularity', 'danceability',
          'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
          'liveness', 'valence', 'genre']
):
    # Read only the shared columns from the new dataset, append the
    # existing master dataset, and write the combined result back out.
    dataset_new = pd.read_csv(file_path, usecols=cols)
    dataset_master = pd.read_csv(MASTER_DATASET_FILE)
    updated = pd.concat([dataset_new, dataset_master])
    updated.to_csv(MASTER_DATASET_FILE, index=False)

This worked pretty well; the only real issue was that different Spotify datasets may name their columns differently, so I had to rename columns manually before running the code. In some cases they may be ordered differently too, but pandas’s concat function aligns columns by name, so order wasn’t a problem.
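For example, a rename step along these lines (the source column names here are hypothetical) brings an incoming dataset in line with the master schema before merging:

# Map dataset-specific column names onto the master schema's names.
dataset_new = dataset_new.rename(columns={'artists': 'artist', 'track_genre': 'genre'})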

I also wrote a script to remove duplicates, since both of these datasets share some songs. Using pandas’s drop_duplicates function, duplicates were dropped first by identical Spotify URIs (their track IDs), then by identical artist and track name pairs.
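The core of that script boils down to two drop_duplicates passes - a minimal sketch:

def remove_duplicates(dataset: pd.DataFrame) -> pd.DataFrame:
    # First pass: drop rows sharing the same Spotify track ID.
    dataset = dataset.drop_duplicates(subset=['track_id'])
    # Second pass: drop rows sharing the same artist and track name.
    return dataset.drop_duplicates(subset=['artist', 'track_name'])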

A twist and change of course

The next step in my process was to set up the ML side, though before that I wanted to write the logic for adding my own Spotify songs and playlists to the master dataset. This would involve using the Spotify API more directly, specifically through the spotipy Python library.

However, I would soon be grateful that I had written only a little bit of code so far. I quickly discovered that Spotify had recently deprecated the Audio Features endpoint, meaning that fetching these properties for my own songs was impossible.

This frustrating realization dealt a huge blow to my ambitions of finishing this project, but it also sparked a new idea. Instead of users inputting songs they like and getting similar recommendations, I revised the app so that users would input their audio preferences directly (e.g. acousticness, energy, danceability). My “new” idea was that users could adjust sliders for these properties (all expressed as decimal values) and get recommended songs that fit those preferences.

The recommendation aspect

The ML side of this project really just boiled down to taking in the user’s audio preferences and returning the songs that match those preferences best. This is a job for a content-based recommendation system (CBRS).

My CBRS takes in a user-supplied feature vector: the audio preferences, represented as numbers. Every song in the dataset has a feature vector of its own, so a similarity measure (such as cosine similarity) can be used to find which songs’ properties are closest to the user’s input.
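As a minimal sketch of that comparison in NumPy (the user and songs names are illustrative; songs would be an (n_songs, n_features) matrix):

import numpy as np

def cosine_similarities(user: np.ndarray, songs: np.ndarray) -> np.ndarray:
    # Compare the direction of each song's feature vector with the user's
    # preference vector; 1.0 means identical proportions.
    norms = np.linalg.norm(songs, axis=1) * np.linalg.norm(user)
    return (songs @ user) / norms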

One of the drawbacks of using only a CBRS, however, was that it would perform poorly on a dataset as large as mine: it would have to make over 100,000 vector comparisons per request, which was bad for a web app where fast response times were ideal.

This is where k-means clustering comes into play.

Working on clustering

k-means clustering takes the items in a dataset and groups similar ones into a chosen number of clusters based on their features, without knowing in advance what each group should contain. To keep it simple: the algorithm takes in a number of clusters to use for grouping, then initializes that many centroids, the “centers” of each cluster. It repeatedly assigns each item to its closest centroid and updates each centroid to the mean of its assigned items’ features, until the assignments stop changing.

The result is a set of groups of similar items, with each group relatively well separated from the others.
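To make that loop concrete, here is a from-scratch sketch in NumPy - for illustration only, not the implementation the project actually relies on:

import numpy as np

def kmeans(points: np.ndarray, k: int, n_iters: int = 100):
    # Initialize centroids by picking k random points from the dataset.
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = None
    for _ in range(n_iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing, so we are done
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = points[assignments == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return centroids, assignments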

An example of k-means clustering. Notice how there are four distinct groups of data, each sharing a single color; these are the clusters. For display purposes, I reduced the data to the two most 'important' features.

While k-means is typically used for tasks like categorization and compression rather than recommendation, it still had its perks here. Essentially, I could use it to drastically cut down how many songs I had to compare the user’s input against, keeping only the most similar ones.

One major caveat of k-means clustering is that you must specify in advance how many clusters you want to group the data into.

At first, this led me to consider clustering methods that determine the cluster count automatically, such as HDBSCAN2. However, not only did it run extremely slowly on my data (well over 2 minutes each time), it also determined the “ideal” cluster count to be just 3, far too few for my use case.
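For reference, the attempt looked roughly like this (scikit-learn 1.3+ ships an HDBSCAN implementation; the min_cluster_size value is illustrative, and features stands for the numeric feature matrix):

from sklearn.cluster import HDBSCAN

labels = HDBSCAN(min_cluster_size=100).fit_predict(features)
# HDBSCAN labels noise points as -1, so exclude that "cluster" when counting.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)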

Later I discovered the elbow method. Used in conjunction with k-means, it finds the cluster count n beyond which increasing n yields only diminishing reductions in the within-cluster sum of squares (WCSS)3 - the “elbow” of the curve.

The elbow curve for my k-means clustering model. The slope tapers off around n=10, but I ended up using n=20 clusters, where the curve was a bit flatter
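A sketch of how such an elbow curve can be computed with scikit-learn (assuming features is the feature matrix used for training; the candidate range is illustrative):

from sklearn.cluster import KMeans

wcss = []
for n in range(2, 31):
    model = KMeans(n_clusters=n, n_init=10, random_state=42).fit(features)
    wcss.append(model.inertia_)  # inertia_ is the WCSS for this cluster count
# Plotting wcss against n reveals the elbow where improvements taper off.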

Overall, k-means training ran extremely fast - under half a second every time. Since the training process would run every time a user asked for recommendations, this was great to see.

Integrating genre

At first, I left genre out of model training, aiming to give users a wide variety of song recommendations not confined to a single genre. However, even though the recommended songs’ feature values were similar to the input, they did not have the same “vibe” or energy at all when I listened to them.

I ended up filtering the dataset used for training down to the user-selected genre. With this, another issue arose: the dataset contained about 115 distinct genres across all its songs. I decided to combine sets of similar ones into more general groupings, so that, for example, “metal” and “heavy rock” fall under one larger genre. As I worked on this, I found many, many redundant genres, so cleaning them up was a much-needed task.

# Collapse a set of related genre labels into one umbrella genre.
dataset_master.loc[
    dataset_master['genre'].isin(['rock', 'alt-rock', 'alternative', 'punk', 'punk-rock', 'grunge', 'emo', 'indie', 'singer-songwriter', 'psych-rock', 'j-rock', 'rock-n-roll']),
    'genre'
] = 'Rock & Alternative'

Onwards with recommending

Now, with the data preprocessing pretty much done and the k-means clustering tuned well enough, it was time to work on the actual recommendation part. To describe the development process that follows, I will use a hard-coded sample of user input, with Electronic & Dance as the desired genre. The values are an exact match for the audio features of The Prodigy - Smack My B* Up.4

import numpy as np

# Audio-feature values, in the order given in footnote 4.
new_sample = np.array([
    0.6131979695431472, 0.995, 0.8488245195420157, 0.16994818652849744,
    0.003072289156626506, 0.626, 0.109, 0.2633165829145728
]).reshape(1, -1)  # reshape to a single-row 2D array for prediction

When a user inputs their desired audio features, the dataset is first filtered to their selected genre and the clustering model is trained; I then put the user’s values into a numpy array like the one above and use the resulting model to predict which cluster of similar songs the sample belongs to.
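A sketch of that flow, assuming FEATURE_COLS is a list of the audio-feature column names in the same order as the sample vector (the names are illustrative):

from sklearn.cluster import KMeans

# Filter to the user's genre, train on those songs, then look up the
# cluster that the user's preferences fall into.
genre_songs = dataset_master[dataset_master['genre'] == 'Electronic & Dance']
kmeans = KMeans(n_clusters=20, n_init=10, random_state=42).fit(genre_songs[FEATURE_COLS])
cluster_id = kmeans.predict(new_sample)[0]
candidates = genre_songs[kmeans.labels_ == cluster_id]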

As a result, we go from over 100,000 candidate songs (across all genres), to about 18,000 (after genre filtering), to a mere 1,000 (after cluster assignment).

Now, with a much smaller set of songs to work with, it was time to calculate similarity. There are several ways to measure it, but the two “main” ones used in content-based recommendation systems are cosine similarity and Euclidean distance.

In summary, the former works well when the vectors’ magnitudes are irrelevant to similarity (i.e. when it’s the proportions between the two datapoints’ values that matter), while the latter works better when the actual distance between the datapoints matters (i.e. when the values themselves should be close).5 Since I wanted the songs’ audio feature values themselves (the vibes) to match, Euclidean distance was my pick.

Computing the Euclidean distance between the user’s desired audio features and each song yields a similarity score for that song: the lower the value, the closer the match.
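A sketch of that calculation over the cluster’s songs, reusing the illustrative candidates frame and FEATURE_COLS list from above:

import numpy as np

# Distance from the user's preference vector to every candidate song;
# broadcasting subtracts new_sample from each row of the feature matrix.
dists = np.linalg.norm(candidates[FEATURE_COLS].to_numpy() - new_sample, axis=1)
recommendations = candidates.assign(similarity=dists).sort_values('similarity')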

As we can see, the most similar song to the sample input values here is Virtual Gaming.

The final task is simply to sort this dataframe in ascending order of similarity and recommend a small sample of songs at first; the user can ask for more recommendations later.

A view of the most similar songs to the user input. Note that a real recommendation will almost never contain a perfect match; the 0.0 similarity score for the first song appears only because I used input values that exactly match that song's. Also note how close together the audio feature values are at the top of the list.

Overall, it works well and is pretty fast! That is the core part of this project done. In the next entry related to this project, I will describe how I set up the web side of things.


  1. https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs and https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset ↩︎

  2. Hierarchical Density-Based Spatial Clustering of Applications with Noise ↩︎

  3. WCSS is defined as the sum of squared distances between each data point and its cluster’s centroid. The lower the value, the tighter the clusters. ↩︎

  4. The fields in order are: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence ↩︎

  5. https://cmry.github.io/notes/euclidean-v-cosine ↩︎