Alright, you’ve seen those YouTube videos about how to make a Christmas song. The ones that tell you which jingles to jangle. Which bells to ring. How to sprinkle in a cozy fireplace crackle.

Yeah we’re not doing that. We’re doing data science to write the perfect Christmas hit.

When we first started this channel I got really excited because I had a great idea: I’d create the perfect emo song using data science. I’m a former emo kid. And current one too.

But you know how this works. Halloween happened, schedule slipped and here I am doing it for Christmas. But I thought “that’s okay, it happens, I can just build the model and reuse it forever. Boom, a whole series”. You ready to find out how unbelievably wrong I was?

Because strap in, we’re doing some real world data science on real world data so it’s a real world pain in the *sleigh bells*

The Metadata Problem…

So what goes into making a song? There are a huge number of ways that I could take this. There’s the lyrics and how much we connect to sorrowful tales of last Christmas and plaintive pleas that all I want for Christmas is you. But I could take the exact same lyrics that tug your heart strings with one melody and put them over something else and it’d completely ruin the song for you.

I mean imagine if I’d put “It goes like this, the fourth, the fifth” over a descending chord progression instead. *plays it going down with a different melody*

So it’s not just lyrics we need to analyse to understand a song. There’s other stuff in there. How about the key, which can affect how easy or hard it is for people to sing along? The speed, which can drive whether we’re feeling chill and cosy or ready to get up and dance? Oh and danceability, that’s a thing they measure too.

There’s also metadata about the songs. A huge number of Christmas songs are covers, because who doesn’t like to do their own version of a Christmas classic? We also need to know duration, and as I’m looking to build a literal data science model to create the perfect Christmas hit, the metric I’m using is going to be streams on Spotify. It’d be pretty unfair to compare raw streams of White Christmas to something released in 2025, so I need to normalise by the year of release instead. Or rather the year it went on Spotify. We’ll get to that part later.
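To make the normalisation concrete, here is a rough sketch of a streams-per-year figure. It isn't the code from the video: the function name, the choice to count the first year as a full year, and the numbers are all assumptions for illustration.

```python
def streams_per_year(total_streams: int, year_on_spotify: int, as_of_year: int) -> float:
    """Average streams per year since the track appeared on Spotify."""
    # Count the first year too, so a brand-new track divides by 1 rather than 0.
    years_available = max(as_of_year - year_on_spotify + 1, 1)
    return total_streams / years_available


# Made-up numbers: an older track and a brand-new one.
print(streams_per_year(1_000_000, 2016, 2025))  # spread over 10 years
print(streams_per_year(50_000, 2025, 2025))     # spread over 1 year
```

With this, a decade-old song and a new release are at least compared on the same footing, even if the figure is still crude.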

It was as I was thinking this problem through that I realised what I had in front of me: Another HUGE data problem. Well, I’m a data scientist. 90% of our time is acquiring and cleaning up data anyway. Let’s get stuck in.

*A binary tree made out of musical nodes… I mean notes*

Why the Spotify API Wasn’t Enough…

So on my first day of looking at this problem I struck gold in discovering the Spotify for developers API. I distinctly remember going for lunch telling the entire office how excited I was to get stuck in because I had everything I needed to get my model going. Oh you sweet summer child.

Here’s a list of problems I found with the Spotify API in no particular order:

I found the API documentation very confusing for set up so I decided to vibe code it… and everything the AI suggested to me didn’t work. It was so frustrating and reminded me again why I just don’t do this. Write code for yourself, kids. It’s so much less painful.
The stuff I thought I needed from the API was DEPRECATED IN NOVEMBER 2024!!! Yes you heard that right. I did the modelling for this in November so everything I so excitedly needed was removed exactly a year prior.
Spotify decided that it’d make an endpoint for playlists but REMOVE ALL SPOTIFY CURATED playlists. So I ended up picking the first playlist, thinking “this looks good, there are loads of subscribers or followers or whatever Spotify calls them, I’ll use this”, then found out DAYS LATER that there was an even bigger playlist that I should have used instead.

So as you can imagine I was quite annoyed that day. So much so that Liv texted me that evening to ask if I was okay. I was embarrassed to say that I was just doing data science…

But in the end I did manage to use Spotify for a few things. Literally the release year and duration. And even that was missing for 4 songs, which in a list of 200 is a fair chunk! I couldn’t even get raw play counts! Spotify gives you a popularity score, which has a recency bias, but I was doing this analysis in November and didn’t have time to wait for the popularity score to catch up with what I needed it for.
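For what it's worth, pulling those two fields out of a track response might look something like the sketch below. It assumes the track has already been fetched and parsed into a dict, and that the field names (`duration_ms`, `album.release_date`) match the Spotify Web API track object as I understand it, so check the current docs before relying on them. The helper name and the sample data are made up.

```python
from typing import Optional


def release_year_and_duration(track: dict) -> tuple[Optional[int], Optional[float]]:
    """Return (release year, duration in seconds), with None for anything missing."""
    release_date = (track.get("album") or {}).get("release_date")
    # release_date can be "YYYY", "YYYY-MM" or "YYYY-MM-DD"; the year always comes first.
    year = int(release_date[:4]) if release_date else None
    duration_ms = track.get("duration_ms")
    seconds = duration_ms / 1000 if duration_ms is not None else None
    return year, seconds


# Hypothetical track dicts, one complete and one with the fields missing.
sample = {"name": "Example Song", "duration_ms": 223000, "album": {"release_date": "1994-10-28"}}
print(release_year_and_duration(sample))
print(release_year_and_duration({"name": "Missing Everything"}))
```

Returning None rather than raising makes it easy to list the handful of songs that need filling in by hand.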

*Algorithms below, music above, a human at the edge of it all.*

From Lyrics to MP3s: Gathering the Data

So in my “Leigh is a robot” for loop from earlier I also had a step to grab the lyrics of the song. I usually had to clean them up later with a function, because they were formatted differently depending on which site I got them from. Actually this wasn’t that painful. Maybe took a couple of hours over two days. Musical features though? Oh it got worse.
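A cleaning function of that sort could be as small as the sketch below. The specific rules (dropping bracketed tags like [Chorus], unifying apostrophes and line endings, squeezing whitespace) are my guesses at typical site-to-site differences, not the actual rules used here.

```python
import re


def clean_lyrics(raw: str) -> str:
    """Normalise lyrics scraped from different sites into plain, non-empty lines."""
    # Unify Windows newlines and curly apostrophes.
    text = raw.replace("\r\n", "\n").replace("\u2019", "'")
    # Drop section tags such as [Chorus] or [Verse 1].
    text = re.sub(r"\[[^\]\n]*\]", "", text)
    # Collapse runs of whitespace inside each line and trim the ends.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
    # Keep only lines that still contain something.
    return "\n".join(line for line in lines if line)


raw = "[Verse 1]\r\nSnow is  falling\r\n\r\n[Chorus]\r\nLet it snow "
print(clean_lyrics(raw))
```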

Because it turns out there was NOTHING online that I could scrape for music features. So I had to do it myself. I stumbled across the Essentia package, which is useful for understanding an MP3 file when you have it, then rang round all the Christmas lovers I know to analyse their MP3s. I’m so glad my Mum loves Christmas.

And then when I looked triumphantly at my list I realised:

Where’s Slade?

The Pogues?

The Darkness?

Band Aid?

It was only then that I learned all these bands were British and I was missing out on my cultural Christmas hits. So as this is my data science video and I am British, I went and did the same thing all over again on a UK Christmas classics playlist as well.

So cool, after two weeks we have lyrics, mp3s to analyse and metadata. Now what do we actually do with it?

*The Pogues’ Fairytale of New York, Christmas classic hit!*

Quantifying Christmas Songs

I got about five years into my quant career before I embarrassingly found out about feature engineering by misunderstanding someone and thinking they said future engineering. So I’m sorry if this is dumbing it down but I’m going to take a second to explain what feature engineering is:

So at a high level this is choosing which features we believe are salient and would like to see used in the model. So for example, I could feed the lyrics as a whole to some kind of model and hope for some sensible output, but with such a small number of songs this was going to be more data than my model could meaningfully handle, so I had to work out what the important information it needed was.

Lyrical data is made up of words of course, so I needed a way to interpret those words. One way to do it was to find out how common these words are, but words like “the” and “and” come up all over the place so I used a useful bit of tech called the TF-IDF.

Caroline did a great explanation of the TF-IDF in a previous video but the cliff notes version is that the TF-IDF gets all words that are common in this song at a higher rate than the general lexicon. So words like “Christmas”, “snow” and “Santa” were common candidates, so much so that I decided to put the TF-IDF score of these words into my model itself, along with the other two top-five candidates, “let” and “baby”. And for good measure I gave each song a Christmas score based on a kids’ festive words learning document Caroline found. What, it’s a good data source?
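If you want to see the idea in code, here is a minimal textbook TF-IDF in plain Python, run on made-up lyrics. It's a sketch under my own assumptions rather than what was used for the video: the baseline here is simply the other songs in the list, and libraries such as scikit-learn apply smoothing and normalisation on top, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(songs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Term frequency within a song times log(number of songs / songs containing the term)."""
    tokens = {title: tokenize(lyrics) for title, lyrics in songs.items()}
    n_songs = len(tokens)
    # Document frequency: in how many songs does each word appear at least once?
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    scores = {}
    for title, words in tokens.items():
        counts = Counter(words)
        scores[title] = {
            word: (count / len(words)) * math.log(n_songs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


songs = {  # made-up lyrics
    "a": "the snow the snow the snow",
    "b": "the bells and the reindeer",
    "c": "the fire and the tree",
}
scores = tf_idf(songs)
print(scores["a"]["the"])   # "the" is in every song, so its weight collapses to zero
print(scores["a"]["snow"])  # "snow" only appears in song "a", so it scores highly there
```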

I had a think about other useful features I could pull from the dataset. Christmas songs have to be sing-along-able so the reading level was probably important. No words like “onomatopoeia” in these songs, just a lot of actual onomatopoeia instead. The Flesch reading function was useful for this.
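As a rough illustration, the standard Flesch reading ease formula is 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), where higher means easier. The sketch below applies it to lyrics with two assumptions of mine: each line counts as a sentence, and syllables are estimated by counting vowel runs, which is crude. A proper readability library would do better.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough heuristic: each run of vowels counts as one syllable."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def flesch_reading_ease(lyrics: str) -> float:
    """Flesch reading ease, treating each non-empty line as a sentence."""
    lines = [line for line in lyrics.splitlines() if line.strip()]
    words = re.findall(r"[A-Za-z']+", lyrics)
    if not lines or not words:
        raise ValueError("need at least one line containing a word")
    syllables = sum(count_syllables(word) for word in words)
    return 206.835 - 1.015 * (len(words) / len(lines)) - 84.6 * (syllables / len(words))


simple = "Let it snow\nLet it snow"
fancy = "Onomatopoeia permeates festivities\nUndeniably"
print(flesch_reading_ease(simple))  # short words, high score
print(flesch_reading_ease(fancy))   # long words, much lower score
```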

Other features that are common in songs are rhyming and repetition and by making use of nltk features I could get a proportion of lines that rhymed and count word repetitions.
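Here is one way those two features could be approximated without any libraries. To be clear, this swaps in a much cruder technique than nltk: it treats two lines as rhyming when their last words share the same final letters, which misses plenty of real rhymes and invents a few false ones, and it counts a repeated line ending as a rhyme. The function names and the two-letter default are my own choices.

```python
import re
from collections import Counter


def last_word(line: str) -> str:
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1] if words else ""


def rhyme_proportion(lines: list[str], suffix_len: int = 2) -> float:
    """Share of lines whose last word shares its ending with another line's last word."""
    endings = [last_word(line)[-suffix_len:] for line in lines if last_word(line)]
    if not endings:
        return 0.0
    counts = Counter(endings)
    return sum(1 for ending in endings if counts[ending] > 1) / len(endings)


def repeated_word_count(lines: list[str]) -> int:
    """Number of word occurrences beyond the first use of each word."""
    words = re.findall(r"[a-z']+", " ".join(lines).lower())
    return len(words) - len(set(words))


verse = ["Bells will ring", "Children sing", "Snow on the ground", "Joy all around"]
print(rhyme_proportion(verse))
print(repeated_word_count(["Let it snow", "Let it snow"]))
```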

Lyrics, Music, and What Matters

Next I thought about some of the most common features of songs. Mariah Carey’s All I Want For Christmas is You has something very important to consider: pronouns. Do people like it when we talk about ourselves in Christmas songs or would they rather be addressed directly? Stick I, we, you and he/she/they in the feature bin as well.
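Counting those pronoun groups is simple enough to sketch. The grouping below follows the four bins just mentioned; expressing each as a share of all words (rather than a raw count) is my assumption, as are the names.

```python
import re
from collections import Counter

# The four bins: I, we, you, and he/she/they lumped together.
PRONOUN_GROUPS = {
    "i": {"i"},
    "we": {"we"},
    "you": {"you"},
    "third_person": {"he", "she", "they"},
}


def pronoun_rates(lyrics: str) -> dict[str, float]:
    """Fraction of all words in the lyrics that fall in each pronoun group."""
    # Splitting on non-letters means "I'm" still counts towards "i".
    words = re.findall(r"[a-z]+", lyrics.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty lyrics
    return {
        group: sum(counts[word] for word in members) / total
        for group, members in PRONOUN_GROUPS.items()
    }


rates = pronoun_rates("All I want for Christmas is you")
print(rates)
```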

Finally for lyrics I wanted to understand whether Christmas songs tend to be mostly positive, negative or neutral. My instinct said that most would be positive but then I realised, Last Christmas isn’t exactly a cheerful song. I needed to reassess that assumption.

I also didn’t want to just do a single sentiment analysis of the song. Part of the beauty of a well-crafted song is the tension and resolution. We need excitement to build so that we earn the release, and we need those moments of calm as well so we can have a moment’s break to hit those like and subscribe buttons.
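One simple way to capture an arc rather than a single number is to split the song into chunks and score each chunk separately. The sketch below does that with a tiny word list that I made up purely for illustration; a real sentiment model or lexicon would replace it, and the three-chunk split is also just an assumption.

```python
import re

# Toy lexicon, invented for this example only.
POSITIVE = {"joy", "merry", "love", "bright", "warm"}
NEGATIVE = {"cold", "alone", "tears", "gone", "lonely"}


def segment_score(lines: list[str]) -> float:
    """(positive words - negative words) / total words for one chunk of lines."""
    words = re.findall(r"[a-z]+", " ".join(lines).lower())
    if not words:
        return 0.0
    hits = sum((word in POSITIVE) - (word in NEGATIVE) for word in words)
    return hits / len(words)


def sentiment_arc(lyrics: str, n_segments: int = 3) -> list[float]:
    """Split the song into up to n_segments chunks of lines and score each one."""
    lines = [line for line in lyrics.splitlines() if line.strip()]
    size = max(-(-len(lines) // n_segments), 1)  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    return [segment_score(chunk) for chunk in chunks]


lyrics = (
    "Merry and bright\nWarm by the fire\n"
    "Joy in the air\nLove all around\n"
    "Now you are gone\nCold and alone"
)
arc = sentiment_arc(lyrics)
print(arc)  # starts positive, ends negative: a sad ending
```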

Alright, that covers lyrics. What about the music itself? Well, if you didn’t know it already by the end of this video you’ll know I’m a complete music nerd so I had ambitions of mapping the chord progressions but it turns out this was overcomplicating things a lot and I’m not going to lie I was running out of time to make the video so I moved on.

Instead we used beats per minute (a measure of how fast paced the song is), danceability, chord change rate, crunchy dissonance against smooth consonance, and also data on the key and tonality (happy or sad) of the music.

I one-hot encoded the keys and set tonality to binary, along with the “is cover” and “is instrumental” flags I grabbed earlier. Everything else was numerical, including duration.

Right, we have all the data we need to model. Should we actually do some modelling then?

*Image showing a typical song structure*

Preparing the Model

So initially I was going to build a bog standard multilinear regression model, but I have a little bit of a problem because I have a complication in my model: multicollinearity. Basically what that means is that I can’t treat, for example, the use of the word snow and the Christmas score as if they’re independent variables, because they’re not. Snow is generally regarded as a Christmas word so the two are very much collinear.

So instead I wanted to add an upgrade to the model to reduce potential overfitting as a result of these features and I had two main options: Ridge regression and Lasso regression.

Ridge prevents overfitting by adding an L2 penalty to the model that reduces the size of large coefficients but it actually keeps all the features I mentioned earlier in the model. That’s actually not desirable in my case. Constraints are good for the creative process but too many means I’m optimising for the constraints and not creatively writing a song.

LASSO is the other main option. It will force some coefficients to zero, selecting only the key features. But we only have 191 songs, which is a small sample, and a lot of multicollinearity, which means Lasso could pick out snow over Christmas score and send every other candidate that’s collinear with it to zero. This means we wouldn’t get to see the impact of our other words if the model chose snow to dominate, and with such a small sample size we can’t be sure that choosing snow is the right decision.

Fortunately, it’s possible to have the best of both worlds. Elastic Net regression combines both penalties into one handy equation, with the LASSO part dropping features that we don’t need and the ridge part keeping hold of the correlated features. Awesome. Let’s get feeding it.
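For reference, that one handy equation can be written as below. This is the parameterisation scikit-learn documents for its elastic net, which I'm assuming is close to what was used here; the symbols are my own labels, with n songs, feature vector x_i, target y_i, overall penalty strength α and mixing weight ρ between the two penalties.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  \;+\; \alpha\,\rho\,\lVert\beta\rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert\beta\rVert_2^2
```

Setting ρ = 1 leaves only the L1 term, which is Lasso, and ρ = 0 leaves only the L2 term, which is ridge. Anything in between blends the two.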

A few minor points before we do. Some of our data is pretty huge in absolute terms, like repetition count versus reading level, so we need to scale it to make it more fair. We use StandardScaler from scikit-learn for that.

We’re also going to log transform the streams per year because this is highly skewed to All I Want For Christmas Is You and Last Christmas and we want to make the data look more normal. This is particularly important because another +10 streams per year on a song that has been out for a decade isn’t impactful. But for one that came out in 2025? That’s huge.
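Putting the scaling and the log transform together, a minimal sketch with scikit-learn might look like this. The data is randomly generated stand-in data, not the real dataset, and the `alpha` and `l1_ratio` values are arbitrary placeholders. Using `log1p` (log of 1 + x) rather than a plain log is also my choice, so that a zero would not blow up.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up stand-in data: 191 songs, 5 numeric features on very different scales.
n_songs = 191
X = rng.normal(size=(n_songs, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 50.0])
# A heavily right-skewed target, like streams per year dominated by a few huge hits.
streams_per_year = np.exp(rng.normal(loc=12.0, scale=2.0, size=n_songs))

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0 and std 1
y = np.log1p(streams_per_year)                # compress the long right tail

# Placeholder hyperparameters; in practice they would be tuned, e.g. with ElasticNetCV.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_scaled, y)
print(model.coef_)
```

Because the features are all on the same scale after StandardScaler, the sizes of the fitted coefficients can be compared with each other directly.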

This is also particularly important because things can be skewed by the artist. People love Mariah Carey and I can’t get her to sing my song for this experiment.

Finally we need to work some magic with our keys. Unless we specify otherwise, one-hot encoding them leads to a matrix in which every row sums to 1, which is exactly what the intercept already does. Linear regression can’t invert a perfectly collinear matrix, and while elastic net will handle it numerically, the coefficients won’t be uniquely determined, which makes them hard to interpret. To get around this we drop one column. In this case let’s drop C, pianists hate that key anyway. Then if all the key columns are zero we know that C is the one.
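The dropped-column trick is easy to show in a few lines. This is a plain Python sketch of the idea, with key spellings that I've assumed (scikit-learn's OneHotEncoder has a `drop` option that does the same job in a pipeline).

```python
# Assumed spellings for the twelve keys; C is the baseline that gets no column.
KEYS = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]
BASELINE = "C"
COLUMNS = [key for key in KEYS if key != BASELINE]


def encode_key(key: str) -> list[int]:
    """One-hot encode a key with the baseline column dropped."""
    if key not in KEYS:
        raise ValueError(f"unknown key: {key}")
    return [int(key == column) for column in COLUMNS]


print(COLUMNS)
print(encode_key("Ab"))  # a single 1, in the Ab column
print(encode_key("C"))   # all zeros: C is the baseline
```

Each remaining key coefficient then reads as "how much better or worse than C", which is what makes the output interpretable.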

Okay we’re finally ready to run the model. Ready to see the exact formula that makes the perfect Christmas song?

*Santa mixing up the best Christmas song!*

Do’s, Don’ts, and Snow

So what we wanted from this model was for a significant chunk of the terms to zero out and fortunately we have that, particularly as the ones that aren’t zero are pretty close to it. So I can scan down this list and pick out only the important features to talk about. Those are the ones that tell me what to do and importantly, what NOT to do.

Let’s talk about what NOT to do first because the model is very clear on some things. First and foremost. DON’T do a cover. Which is good because I planned this whole video to end with me writing the song and it’d be pretty boring if it turned out to be just a cover. Secondly, NEVER write in F#. I agree. Who wants 6 sharps anyway?

In terms of words, the model believes pretty strongly that I should talk about snow a lot. Excellent. I’m writing this one down. Santa and Christmas also get votes but “let” is a term we should never use. Let’s move on from that one then. *wink*

Interestingly enough, the Christmas score was something that was zeroed out for no effect. Probably because I was only scanning Christmas music so they all had a pretty high Christmas score. Also covers would have affected this as well. We’re writing Christmas music so we’ll get a good Christmas score and move on.

My duration leans slightly longer than the average on better performing songs, telling me I should write a song about 3 minutes and 43 seconds long. BPM is pretty much zero, which means I should roughly go for the average of my dataset, which is 125 beats per minute. Pronouns are pretty close to zero with a slight preference for “we”. Isn’t that lovely?

The stuff that gets interesting is when it comes to sentiment and keys and tonality. Overall the song should have a positive sentiment which makes sense, I think we like to be uplifted at Christmas, but what was really interesting was the coefficients for arc because the model was pretty unambiguous: The song should have a sad ending.

*The Do’s and Don’ts for putting together a Christmas hit!*

The Moment of Musical Truth…

Well okay, I can do that. I can write something bittersweet. But what was also interesting was when I looked at the tonality there was a decent tug towards minor, as in sad sounding. Hm... a conundrum.

I then looked at my keys and there was a pretty strong preference there. Ab was by far the preferred key with a coefficient of 1.29, a high number that shows a strong preference, followed closely behind by C#. Now I can assure you no one is writing in Ab minor because again it has far too many flats so I decided to take C# minor instead which is a reasonable key to write in and get to work.

So I put together a chord progression, wrote a bit of music and it sounded SO DEPRESSING and not Christmassy at all. So I decided: it’s my model, I can take a bit of liberty with how to decode it, and decided that the song would be in a combination of Ab major and C# minor. Which means I’m blending two different keys to keep it festive. Which isn’t musically straightforward, so thank you very much Slade for helping me transition between the two.

And finally, I had to add a moderate amount of dissonance, so it’s time to remember my jazz guitar and be excited I get to include my favourite chord, C#m9. Add in a few suspensions and dominant sevenths and we’ve got something that has the jazziness that Christmas needs without being over the top.

So that’s it, the formula for the perfect song. Take C# minor and Ab major and jam them together, have a song with generally positive sentiment but a depressing ending to finish the arc, repeat a bunch, particularly of the word snow and never ever do a cover.

Well… I bet you want to know what that sounds like now, don’t you? Well click here to listen and watch the final music video and whilst you’re at it like and subscribe won’t you! Merry Christmas!

For more of this, come on the journey with us and keep being Evil

Can Data Science Create the Next Christmas Hit?
