Devlog: Groovematch (AI)

This is one part of a two-part entry - see the Web Development post after this one!

Intro

Machine learning is a subject that has always caught my interest. Computers making decisions on their own - how cool is that? At first, I was totally baffled at how that was even possible: what kind of complicated code do you have to write to accomplish ML-related tasks?

Up to this point, I had worked with ML only in academic and research environments. Through the few projects I completed in both settings, I felt I had gained enough experience to start work on my own ML project.

Initially, my idea was to recommend songs on Spotify based on a user’s song(s) of choice, using a few datasets from Kaggle1. But upon finding that Spotify had deprecated the audio features endpoint in their API (meaning I couldn’t retrieve these features for any song outside those datasets), the scope of the project changed partway through.

My idea was now to recommend songs based on users’ preferred audio features instead.

Data Processing

I started off by using Python and the pandas library to write a few functions to merge datasets. I defined a function to merge an existing CSV dataset of Spotify songs into a “master” dataset, to be used in model training:

import pandas as pd

MASTER_DATASET_FILE = 'master.csv'

def import_csv_from_file(
    file_path: str,
    cols=['track_id', 'track_name', 'artist', 'popularity', 'danceability',
          'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
          'liveness', 'valence', 'genre']
):
    # Read the new dataset, append the master's rows, and write the
    # combined result back out as the new master dataset
    dataset_new = pd.read_csv(file_path, usecols=cols)
    dataset_master = pd.read_csv(MASTER_DATASET_FILE)
    updated = pd.concat([dataset_new, dataset_master])
    updated.to_csv(MASTER_DATASET_FILE, index=False)

I also handled removal of duplicates, since the datasets have overlapping songs. Using pandas’s drop_duplicates function, I dropped duplicates first by identical Spotify URIs (their track IDs), then by identical artist and track name pairs.
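The two-pass de-duplication can be sketched like this. The column names match the merge function above; the sample rows are made up purely for illustration.

```python
import pandas as pd

# Toy stand-in for the merged master dataset
df = pd.DataFrame({
    'track_id':   ['a1', 'a1', 'b2', 'c3'],
    'artist':     ['X', 'X', 'Y', 'Y'],
    'track_name': ['Song A', 'Song A', 'Song B', 'Song B'],
})

# First pass: rows with the same Spotify track ID are the same song
df = df.drop_duplicates(subset=['track_id'])

# Second pass: identical artist + track name pairs (e.g. re-releases
# under a different ID) are also treated as duplicates
df = df.drop_duplicates(subset=['artist', 'track_name'])
```

The ID pass catches exact re-imports across datasets, while the artist/name pass catches the same recording listed under different IDs.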

Recommendation

The core of the ML side consisted of taking in user audio preferences, and returning what songs match those preferences best. This was something that could be accomplished using a content-based recommendation system, which essentially compares similarity between the user preferences as a feature vector, and the vector of each song’s audio features.

One of the drawbacks of using only a CBRS, however, was that each request would require over 100,000 vector comparisons, which was too slow for an app where fast responses matter. This is where k-means clustering came into play: to reduce how many comparisons I’d have to make.

Adding in Clustering

k-means clustering involves taking items from a dataset and grouping similar ones based on their features, without actually knowing the explicit grouping criteria. It takes in a number of clusters to use for grouping, then initializes the “centers” of each cluster. Then it continuously assigns items to their closest center, updating those centers based on the mean of their members’ feature values, until items stop bouncing between clusters.

At the end, the result is a set of separated groups of similar items.
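A toy version of that loop, using scikit-learn’s KMeans (an assumption on my part - the post doesn’t name a specific library) on a handful of 2-D points instead of full audio-feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two obvious groups; in the real model each row
# would be a song's audio-feature vector instead of x/y coordinates
points = np.array([
    [0.10, 0.20], [0.15, 0.25],   # group near the origin
    [0.90, 0.80], [0.95, 0.85],   # group in the opposite corner
])

# n_clusters is the cluster count you must choose up front
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

labels = kmeans.labels_  # cluster index assigned to each point
```

The fit call runs the assign-then-update loop described above until the assignments stop changing.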

An example of k-means clustering. Notice how there are four distinct groups of data, each shown in its own color; these are the clusters. For display purposes, I reduced the data to the two most 'important' features.

One of the major aspects of k-means clustering is that you must specify in advance how many clusters to group the data into. This led me to also consider clustering methods that determine the cluster count automatically, such as HDBSCAN2. These performed poorly, however, and generated too few clusters for my use case.

Later, for k-means, I discovered the elbow method. This finds the number of clusters n past which increasing n yields only marginal reductions in the within-cluster sum of squares (WCSS) - the “elbow” of the curve.3
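Computing the elbow curve is just a loop over candidate cluster counts. Here is a sketch using scikit-learn (an assumption), which exposes the WCSS of a fitted model as inertia_; the random data stands in for the real song feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.random((200, 8))  # stand-in for the audio-feature matrix

# Fit k-means for each candidate n and record the WCSS; plotting
# wcss against n gives the elbow curve
wcss = []
for n in range(2, 21):
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(data)
    wcss.append(km.inertia_)
```

You then eyeball (or programmatically detect) where the curve stops dropping steeply - that bend is the elbow.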

The elbow curve for my k-means clustering model. The slope tapers around n=10, but I ended up using n=20 clusters, where the curve was a bit flatter.

Overall, k-means training ran extremely fast - under half a second every time.

Genre as a Feature

At first, I did not integrate genre into model training, hoping to give users a wide variety of song recommendations. But while the audio features were indeed similar, the recommended songs did not have the same “vibe” or energy when I actually listened to them.

My new idea was to filter the training data to only the user’s selected genre. But another issue arose: the dataset spanned about 115 distinct genres.

To mitigate that, I combined sets of similar genres into more general groupings.

dataset_master.loc[
    dataset_master['genre'].isin(['rock', 'alt-rock', 'alternative', 'punk', 'punk-rock', 'grunge', 'emo', 'indie', 'singer-songwriter', 'psych-rock', 'j-rock', 'rock-n-roll']),
    'genre'
] = 'Rock & Alternative'

Onwards with Recommending

To best describe the development process that follows, I will use a hard-coded example of user input, using Electronic & Dance as the desired genre. These values are an exact match of one of the songs in the dataset.4

new_sample = np.array([0.6131979695431472, 0.995, 0.8488245195420157, 0.16994818652849744, 0.003072289156626506, 0.626, 0.109, 0.2633165829145728]).reshape(1, -1)

When a user inputs audio features, I put that input into a numpy array like the one above. It essentially gets treated as a new song to be clustered, so I find which cluster that hypothetical “song” fits in.
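That assignment step can be sketched with a trained k-means model’s predict method (again assuming scikit-learn; the model here is fit on random data just so the example runs end to end):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the genre-filtered song feature matrix
rng = np.random.default_rng(0)
training_features = rng.random((500, 8))

# Stand-in for the trained clustering model (n=20 as in the post)
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(training_features)

# The user's audio preferences, treated as a new "song" to assign
new_sample = np.array([0.613, 0.995, 0.849, 0.170,
                       0.003, 0.626, 0.109, 0.263]).reshape(1, -1)

cluster_id = int(kmeans.predict(new_sample)[0])
```

Only the songs sharing cluster_id then need to be compared against the user’s vector.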

Throughout that process, we go from 100,000+ possible songs for comparison (across all genres), to about 18,000 (after genre filtering), to a mere 1,000 (after cluster assignment).

There are several options of measuring similarity, but the two “main” ones used in CBRS are cosine similarity and Euclidean distance.

The former works well when the proportions between the two data points’ values matter more, while the latter works better when similarity should be based on the values themselves.5 Euclidean distance was my pick, since I was looking for the closest match in raw values.
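The distance computation itself is a one-liner with numpy broadcasting. A minimal sketch, with made-up three-feature vectors in place of the real eight:

```python
import numpy as np

# Each row is one candidate song's feature vector (made-up values)
features = np.array([
    [0.60, 0.95, 0.80],   # close to the user's preferences
    [0.10, 0.20, 0.30],   # far from them
])

# The user's preference vector
user = np.array([0.61, 0.99, 0.85])

# Euclidean distance from the user to every song at once
distances = np.linalg.norm(features - user, axis=1)
```

Computing all distances in one vectorized call is what keeps the per-request comparison over ~1,000 songs fast.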

For each song, a lower distance value meant a closer match.

As we can see, the most similar song to the sample input values in here is Virtual Gaming.

The final task is just to sort the dataframe in ascending order of similarity.
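A sketch of that final sort, with the distance stored in a similarity column (the column and track names here mirror the screenshots but are otherwise assumptions):

```python
import pandas as pd

# Candidate songs from the matched cluster with their distances
results = pd.DataFrame({
    'track_name': ['Other Song', 'Virtual Gaming'],
    'similarity': [1.3, 0.0],
})

# Lower distance = better match, so sort ascending and keep the top hits
results = results.sort_values('similarity', ascending=True).head(10)
```

The first rows of the sorted frame are the recommendations returned to the user.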

A view of the most similar songs to the user input. Note that a real recommendation will almost certainly never contain a perfect match (the 0.0 similarity score for the first song appears only because I used values that exactly match that song's). Also note how close the audio feature values are to each other to begin with.

Overall, it works well and is pretty fast! That is the core part of this project done. In the next entry related to this project, I will describe how I set up the web side of things.


  1. https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs and https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset ↩︎

  2. Hierarchical Density-Based Spatial Clustering of Applications with Noise ↩︎

  3. WCSS is defined as the total variance of the data points within each cluster. The lower the value, the less varied the data is, and the tighter the cluster is. ↩︎

  4. The fields in order are: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence ↩︎

  5. https://cmry.github.io/notes/euclidean-v-cosine ↩︎