This article is the second feature in the Projects of UpCode Series I, where UpCode Academy instructors lead teams of graduates to work on (hopefully) fun projects. You can find the first one here and second one here.

Music is an undeniable part of our lives: we listen to it during our commute, at work, or even during the sad hours of the morning. The best songs often end up on the Billboard 200 chart, but here’s the big question: is there a method to the madness of becoming a chart-topper? Great songs are released year after year, but only a select few end up on the coveted list.

What if we could predict whether a song will end up on Billboard? UpCode graduates Tan Jin, Arthur, and Yue Feng believed it was possible and decided to do just that. Led by instructor Dan (with occasional meddling from instructor Jackie), they embarked on a journey to build an AI that can accurately predict song popularity.

Practical considerations

As with all Projects of UpCode, we had to use what we learned from the courses, mainly Python. The project was within reach of what we already knew, yet challenging enough to push us out of our comfort zones to learn new things and concepts.

Exploratory assessment and feasibility

Our idea needed several parts:

  1. A website for users to submit songs for prediction
  2. A server to identify the song and predict whether it will top the chart
  3. A database of songs to look up the song that the user submitted

We reasoned that whenever a user submits a song for prediction, we’d have to be able to retrieve that song’s information. This implied building our own database of songs, which was both impractical and technically challenging.

Fortunately, Spotify has an API that allows users to retrieve a song’s information, so we chose to use that. For more information on what an API is, we wrote about it at length in our previous feature here. In short, the API is a tool that we can use to retrieve the information we need, be it song title, artist, etc., directly from Spotify’s database. With the Spotify API, we could do away with building our own song servers and instead tap into theirs.

Project Billboard workflow

Project Billboard workflow to predict song popularity

We decided to call our project Project Billboard. The image above shows how we envisioned it: a user enters a song name on our website and gets back the probability of that song getting into the Billboard charts.

In detail, this is what happens:

  1. The user submits the name of the song whose chart-topping ability they want predicted, and the request goes to the Project Billboard server
  2. The server then makes an API call to the Spotify server
  3. Spotify then returns technical information about the song, e.g. length, tempo, cadence, etc.
  4. The server runs the information through the machine learning model and obtains a prediction score
  5. The server then returns the prediction score back to the user
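
The five steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not the team’s actual server code: `handle_prediction_request`, `fake_fetch_features` and `ConstantModel` are hypothetical names, and the Spotify call is replaced by a stand-in function so the example runs without network access or API credentials.

```python
from typing import Callable, Dict, Protocol


class ProbabilityModel(Protocol):
    """Anything that turns a dictionary of song features into a probability."""

    def predict_probability(self, features: Dict[str, float]) -> float: ...


def handle_prediction_request(
    song_name: str,
    fetch_features: Callable[[str], Dict[str, float]],
    model: ProbabilityModel,
) -> Dict[str, object]:
    """Run one request through the workflow (step 1 is the call itself)."""
    features = fetch_features(song_name)         # steps 2 and 3: ask for song features
    score = model.predict_probability(features)  # step 4: run the model
    return {"song": song_name, "score": score}   # step 5: answer the user


# Stand-ins for illustration only (assumptions, not the real Spotify call or model).
def fake_fetch_features(song_name: str) -> Dict[str, float]:
    return {"tempo": 120.0, "duration_ms": 210000.0}


class ConstantModel:
    def predict_probability(self, features: Dict[str, float]) -> float:
        return 0.5


response = handle_prediction_request("Example Song", fake_fetch_features, ConstantModel())
print(response)
```

Passing the feature lookup and the model in as arguments keeps the flow testable: the real Spotify client and a trained model can be swapped in later without changing the function.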

On the surface, it seemed straightforward. However, there is a component in this workflow that is deceptively small but very important – the machine learning model that can predict song popularity! As such, we divided our project into three phases: data retrieval, machine learning, and finally website building.

Phase I: Data retrieval

What is machine learning? There are many definitions of machine learning, but in short it is learning from data to make inferences and derive patterns without being explicitly programmed to.

To have a working machine learning model, one must first collect the data needed for model training. In our case, we had to collect both ordinary songs and chart-toppers, then let our machine learning model infer what makes a chart-topper a chart-topper.

Which means we need data. Where do we find the data to begin with? Fortunately, the Spotify API has us covered. We were able to retrieve songs from their servers.

The following shows our data preparation:

Data retrieval strategy to build the machine learning model that can predict song popularity

All in all, our raw dataset contained 5,878 songs, of which 1,502 (roughly a quarter) had appeared on the Billboard Top 100 chart at least once. We then cleaned the dataset to remove irregularities and rows containing missing values.

You might be wondering what “features” means: features are the technical qualities of a song, such as tempo, danceability (yes, we can measure how danceable a song is), length, etc. Our hypothesis was that the technical information of a song can predict whether or not it makes the chart. You can find more details on Spotify’s site here.
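
As a minimal sketch of the “remove rows with missing values” step, the snippet below filters a list of song records. The column names (`tempo`, `danceability`, `on_chart`) and the three sample rows are made up for illustration; the team’s real cleaning code lives in their repository and may well use a library such as pandas instead.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def drop_incomplete_rows(rows: List[Row], required: List[str]) -> List[Row]:
    """Keep only rows in which every required column is present and not None."""
    return [
        row for row in rows
        if all(row.get(column) is not None for column in required)
    ]


# Hypothetical sample data: on_chart is 1 for a charted song and 0 otherwise.
raw_rows: List[Row] = [
    {"tempo": 120.0, "danceability": 0.8, "on_chart": 1},
    {"tempo": None, "danceability": 0.5, "on_chart": 0},  # missing tempo value
    {"tempo": 95.0, "on_chart": 0},                       # missing danceability column
]

clean_rows = drop_incomplete_rows(raw_rows, ["tempo", "danceability", "on_chart"])
print(len(raw_rows), "rows before cleaning,", len(clean_rows), "after")
```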

Phase II: Machine learning and optimization

Once we had gathered the dataset, we got to work training our machine learning models. We performed an 80-20 split, i.e. we used 80% of the data for model training and 20% for model validation to assess our models’ accuracy. Evaluating on held-out data tells us how well a model generalizes to songs it has not seen, rather than just how well it fits our original data.
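
For readers who have not seen a train/validation split before, here is a small standard-library sketch of the idea. The function name `train_validation_split` and the fixed seed are our own choices for the example; in a scikit-learn workflow, `train_test_split` with `test_size=0.2` does the same job, though the article does not say which approach the team used.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    rows: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the rows, then cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


songs = list(range(10))  # ten placeholder "songs"
train, validation = train_validation_split(songs)
print(len(train), "for training,", len(validation), "for validation")
```

Shuffling before cutting matters: if the data were sorted (say, chart-toppers first), an unshuffled split would put very different songs in the two parts.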

Given that we were predicting probabilities of a song appearing on the chart, we chose three suitable models for training:

  1. Logistic regression
  2. K-nearest neighbours
  3. XGBoost

We provided a handy writeup of our models on the site as well.

After training the three models on the dataset, we found that the accuracy of all three was around 50%, which was no better than tossing a coin.

To improve the three models’ accuracy, we set about tweaking their input parameters. The details can be found in our GitHub repository, but in short, tuned input parameters can yield better performance than the default settings. After extensive optimization efforts, the accuracies were 66.2%, 61.2%, and 68.7% for logistic regression, K-nearest neighbours, and XGBoost respectively.
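
To make “tweaking input parameters” concrete, the sketch below tries several values of k for a tiny hand-written K-nearest-neighbours classifier and keeps the one with the best validation accuracy. It assumes that “input parameters” refers to model settings such as k; the data points are invented, and the team’s actual search (described in their repository) may differ, for example by using scikit-learn’s `GridSearchCV`.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature values, label: 1 = charted)


def knn_predict(train: List[Example], features: Sequence[float], k: int) -> int:
    """Majority vote among the k training examples closest to `features`."""
    nearest = sorted(
        train,
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], features)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def validation_accuracy(train: List[Example], validation: List[Example], k: int) -> float:
    """Fraction of validation examples that the k-NN vote labels correctly."""
    hits = sum(1 for features, label in validation if knn_predict(train, features, k) == label)
    return hits / len(validation)


def search_k(
    train: List[Example], validation: List[Example], candidates: List[int]
) -> Tuple[int, Dict[int, float]]:
    """Score every candidate k and return the best one together with all scores."""
    scores = {k: validation_accuracy(train, validation, k) for k in candidates}
    best = max(scores, key=lambda k: scores[k])
    return best, scores


# Invented one-feature data; ([0.15], 1) is a deliberately noisy point.
train_set = [([0.0], 0), ([0.1], 0), ([0.2], 0), ([0.15], 1), ([1.0], 1), ([1.1], 1)]
validation_set = [([0.04], 0), ([0.13], 0), ([1.04], 1)]

best, scores = search_k(train_set, validation_set, candidates=[1, 3])
print("best k:", best, "scores:", scores)
```

On this toy data, k = 1 is misled by the noisy point while k = 3 outvotes it, which is the kind of gain over a default setting that parameter tuning is after.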

After model training, we were ready to use the models in production, i.e. hosted on a server for proper use. We saved the three optimized models for use on our server with Python’s pickle module. Pickling is the act of saving models so that they can be used straight away, without going through the tedious process of retraining them every time we need to make predictions.
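
A minimal sketch of the save-then-load round trip with the standard `pickle` module follows. The “model” here is just a dictionary of made-up weights and the file name `model.pkl` is hypothetical; a trained scikit-learn or XGBoost object would be dumped and loaded with the same two calls.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: made-up weights, not real results.
trained_model = {
    "name": "toy-logistic",
    "weights": {"tempo": 0.01, "danceability": 2.0},
    "bias": -1.5,
}

with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "model.pkl"

    # Training side: write the model to disk once.
    with model_path.open("wb") as handle:
        pickle.dump(trained_model, handle)

    # Server side: load it back at start-up instead of retraining.
    with model_path.open("rb") as handle:
        restored_model = pickle.load(handle)

print("round trip preserved the model:", restored_model == trained_model)
```

One caveat worth knowing: unpickling can execute arbitrary code, so a server should only load pickle files that it produced itself or that come from a trusted source.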

Phase III: Website to predict song popularity

For the front-end of Project Billboard, we used the Bootstrap 4 template. For the backend, we set up a Flask server. We loaded all three of the pickled models onto the server.

Project Billboard website

When users visit the site, they are greeted with a page that contains a form, along with a drop-down menu next to it. Users first enter the song that they’re interested in, then choose which model they want a prediction from. We chose to make all three machine learning models available in case users are curious how the probabilities differ between models.
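
One simple way to wire a drop-down choice to the matching model is a dictionary lookup, sketched below. The keys, the `predict_with_choice` helper and the constant scores are all placeholders invented for this example; on the real site, each entry would be one of the unpickled models and the function would be called from the Flask route that handles the form.

```python
from typing import Callable, Dict

Features = Dict[str, float]
Predictor = Callable[[Features], float]

# Placeholder predictors returning made-up constants (not real model output).
MODELS: Dict[str, Predictor] = {
    "logistic_regression": lambda features: 0.60,
    "k_nearest_neighbours": lambda features: 0.55,
    "xgboost": lambda features: 0.65,
}


def predict_with_choice(choice: str, features: Features) -> float:
    """Look up the model picked in the drop-down and return its score."""
    try:
        predictor = MODELS[choice]
    except KeyError:
        # Reject values that are not in the menu instead of guessing a default.
        raise ValueError(f"Unknown model choice: {choice!r}") from None
    return predictor(features)


sample_features = {"tempo": 120.0, "danceability": 0.7}
for name in MODELS:
    print(name, predict_with_choice(name, sample_features))
```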

Scroll lower, and users will see a writeup on the different models that we used. We provided details and the intricacies of each model, which you can see when you click “Learn More”.

Project Billboard timeline

After that, users can see the timeline of the project, from commencement to completion.

Project Billboard Team

And finally, the members of the team that built Project Billboard and their socials.

Conclusion

While the accuracy seems low, when we examined similar projects that attempt to predict song popularity, we found that our best model accuracy of 68.7% is comparable to theirs.

Our hypothesis was that a song’s popularity depends on its audio features, but one thing that we did not include in our training dataset is the lyrics. That could be another area to explore in the future to improve the machine learning models.

This project is, to the best of our knowledge, the first song popularity predictor in production. Once again, head over to our site containing the predictor and our profiles here, and the GitHub repositories for the machine learning here and production here.