Classify Genre of Music Using Random Forests

Katerina Falkner and Ishani Pidara
katerinafalkner2026@u.northwestern.edu, ishanipidara2026@u.northwestern.edu

Northwestern University

Professor Bryan Pardo

COMP_SCI 352: Machine Perception of Music & Audio

GitHub Repository

Motivation

Our goal was to build a random forest model that classifies the genre of an audio clip. Beyond building the model, we wanted to explore the GTZAN dataset itself: which features are most important, and which genres are most often misclassified as others. We also wanted to measure how removing the repeated files in the GTZAN dataset affects model accuracy. Automatic music classification is a common problem in the music technology field and one that we were particularly interested in exploring. The GTZAN dataset is also very widely used, so we wanted to examine some of the limitations and common issues found within it.

Project Overview

To accomplish our goal, we implemented a random forest model and trained and tested it on the GTZAN dataset. We then analyzed the results using several metrics: overall accuracy, a confusion matrix, and feature importance. These metrics let us examine how well the model predicts each genre and which features are most useful when splitting in the forest. We computed each metric on both the original GTZAN dataset and a modified version that removes known duplicate files, then compared the two.

Methodology

Random Forest Implementation

We implemented a random forest model using the sklearn library in Python, specifically the RandomForestClassifier class. The classifier builds a collection of decision trees, training each one on a random sample of the data rows and considering a random subset of the features at each split. The final prediction of the random forest aggregates the predictions of all the individual trees through a vote (in sklearn, by averaging each tree's class probabilities and choosing the class with the highest mean).
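To make the voting step concrete, here is a minimal sketch on synthetic data. It is not the project's code: the data comes from make_classification as a stand-in for the GTZAN features, and n_estimators and random_state are illustrative assumptions. It reproduces the forest's prediction by averaging the per-tree class probabilities by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the GTZAN feature table: 10 classes, 59 features.
X, y = make_classification(
    n_samples=1000, n_features=59, n_informative=20,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)

# Hyperparameters here are illustrative, not the project's settings.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_.
# Average the trees' class probabilities for the first five samples,
# then pick the class with the highest mean probability.
tree_probs = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)
manual_pred = forest.classes_[mean_probs.argmax(axis=1)]

# The forest's own prediction for the same samples.
forest_pred = forest.predict(X[:5])
print(manual_pred, forest_pred)
```

The two printed arrays should agree, since the forest's predict method performs the same averaging internally.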



GTZAN Dataset

We used the GTZAN dataset for our model. It contains 1000 audio clips across 10 different genres. Each audio clip is 30 seconds long and has been preprocessed to extract 59 features, including chroma features, spectral centroids, harmony features, and MFCCs. However, the GTZAN dataset has several known issues, including repeated audio files. To address this, we also created a modified version of the dataset with the 51 repeated audio files identified in previous research removed.

The following table shows the number of files removed from each genre:

Genre Jazz Reggae Metal Pop Disco Hip-hop Rock
Removed 13 11 9 9 6 2 1
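The removal step described above can be sketched as a filter over the feature table. This is only an illustration: the rows, the filenames, and the duplicate list below are hypothetical placeholders, not the actual 51 files.

```python
from collections import Counter

# Toy feature table: one row per clip (real rows would also hold the features).
rows = [
    {"filename": "jazz.00001.wav", "label": "jazz"},
    {"filename": "jazz.00002.wav", "label": "jazz"},
    {"filename": "pop.00001.wav", "label": "pop"},
    {"filename": "pop.00002.wav", "label": "pop"},
    {"filename": "rock.00001.wav", "label": "rock"},
]

# Hypothetical duplicate list; the real list comes from previous research.
duplicates = {"jazz.00002.wav", "pop.00002.wav"}

# Keep every row whose file is not flagged as a duplicate.
kept = [row for row in rows if row["filename"] not in duplicates]

# Count removed files per genre, as in the table above.
removed_per_genre = Counter(
    row["label"] for row in rows if row["filename"] in duplicates
)
print(len(kept), dict(removed_per_genre))
```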


Training and Testing

We trained our model on 80% of the dataset and tested it on the remaining 20%, using the train_test_split function from sklearn to ensure that the two sets were mutually exclusive. We then used the trained model's predict method to make genre predictions on the test set.

The following table shows the number of samples per genre in the train and test sets:

Set Blues Classical Country Jazz Reggae Metal Pop Disco Hip-hop Rock
Train 80 87 73 65 66 84 78 73 83 78
Test 20 13 27 22 23 25 13 21 15 21
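An 80/20 split followed by prediction might look like the sketch below. It uses random stand-in data rather than the GTZAN features, and random_state and n_estimators are assumptions made for reproducibility. An index array is split alongside the data so that the disjointness of the two sets can be checked directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random stand-in data: 1000 clips, 59 features, 10 genres of 100 clips each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 59))
y = np.repeat(np.arange(10), 100)
indices = np.arange(len(y))

# Split features, labels, and row indices together (80% train, 20% test).
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=0
)

# No clip index should appear in both sets.
overlap = set(idx_train) & set(idx_test)

# Fit on the training set only, then predict genres for the test set.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(len(X_train), len(X_test), len(overlap))
```

Without the stratify argument, the number of test samples per genre varies from split to split, which would be consistent with the uneven counts in the table above. Whether the project stratified its split is not stated.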


Success Metrics

We then analyzed the results using several metrics, including overall accuracy values, a confusion matrix, and feature importance.
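As a rough illustration of how these three metrics can be computed with sklearn, consider the sketch below. The data is synthetic and the feature names are placeholders; only accuracy_score, confusion_matrix, and the feature_importances_ attribute are taken from the library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with placeholder feature names.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy: fraction of test samples predicted correctly.
accuracy = accuracy_score(y_test, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Feature importance: impurity-based scores that sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print(f"accuracy: {accuracy:.3f}")
print(cm)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```

The diagonal of the confusion matrix holds the correct predictions, so its sum divided by the total number of test samples equals the overall accuracy.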

Results

Original Dataset Evaluation

Accuracy Score: 75%

Feature Importance

Feature Importance Graph

Confusion Matrix

Confusion Matrix

The random forest model trained on the original dataset had an accuracy of 75%. The confusion matrix shows that several genres, such as classical and jazz, are classified with relatively high accuracy, whereas genres like pop and rock are misclassified more often. The model is better at distinguishing some genres than others, potentially because certain genres share similar acoustic features that make them harder to tell apart. For example, classical music was classified correctly 100% of the time, potentially due to its unique lack of percussive instrumentation. Rock, on the other hand, had the worst accuracy (10/21, or 48%) and was often misidentified as country or metal. Additionally, because this model was trained on the full dataset, duplicate files may appear in both the training and testing sets, so some of these results may reflect the model overfitting to repeated samples.
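Per-genre figures like the ones above are read from the rows of a confusion matrix: the diagonal entry divided by the row total. The sketch below shows the calculation on a small hypothetical matrix (the labels A, B, C and their counts are made up for illustration), then checks the rock figure quoted above.

```python
def per_class_accuracy(matrix, labels):
    """Return {label: correct / total} for each row (true class) of the matrix."""
    result = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        result[label] = matrix[i][i] / total if total else float("nan")
    return result


# Hypothetical matrix: rows are true classes, columns are predicted classes.
labels = ["A", "B", "C"]
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]
rates = per_class_accuracy(matrix, labels)
print(rates)

# The rock figure from the text: 10 correct out of 21 test clips.
rock_accuracy = 10 / 21
print(f"{rock_accuracy:.1%}")
```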

The feature importance graph shows that perceptual variance, chroma stft mean, and RMS mean were among the most important features, suggesting that harmonic and melodic characteristics were integral to differentiating genres.



Modified Dataset Evaluation

Accuracy Score: 78.42%

Feature Importance

Feature Importance Graph

Confusion Matrix

Confusion Matrix

The random forest model trained on the modified dataset improved accuracy from 75% to 78.42%. The confusion matrix shows that the model improved significantly at classifying pop (from which 9 files were removed) and improved slightly on classical, blues, and country (none of which had files removed). On the other hand, jazz accuracy decreased significantly (from 91% to 45%) after 13 duplicates were removed, and reggae accuracy decreased slightly after 11 duplicates were removed. This suggests that the duplicated audio files had inflated the measured accuracy for jazz and reggae through overfitting, and that after their removal the model is better able to predict other genres. The feature importance graph shows that perceptual variance, chroma stft mean, and RMS mean were still among the most important features.

Try It Out!

Disclaimer: The analysis might take a few moments to load, due to hosting constraints.

Pick a Genre


Audio Sample

A file with the specified genre was picked randomly from a subset of our test data.

File Name:

Audio Player:


Results

Predicted Genre:

The genre predicted for this clip by our random forest model, which was trained on the modified dataset.

Actual Genre:

The actual label from the GTZAN dataset.

Low Baseline:

A baseline that always predicts the most common genre in the training set, used to contextualize our model's performance.

High Baseline:

This baseline was taken from an online genre identifier that has an accuracy of 84%.
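The low baseline described above can be sketched in a few lines. The label lists here are toy values chosen for illustration, not the project's train and test sets, and most_common_baseline is a hypothetical helper name.

```python
from collections import Counter


def most_common_baseline(train_labels, test_labels):
    """Always predict the most frequent training genre; return it and its test accuracy."""
    most_common, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == most_common)
    return most_common, correct / len(test_labels)


# Toy labels: jazz is the most common genre in the training set.
train_labels = ["blues"] * 3 + ["jazz"] * 5 + ["rock"] * 4
test_labels = ["blues", "jazz", "rock", "rock"]

genre, baseline_accuracy = most_common_baseline(train_labels, test_labels)
print(genre, baseline_accuracy)
```

Because this baseline ignores the audio entirely, its accuracy depends only on how often the most common training genre appears in the test set, which gives a floor that any trained model should beat.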