Live Data Science Walkthrough: Making Google Trends Data Usable

The Hidden Problem with Google Trends Data

Have you ever tried using Google Trends for market research or machine learning?

You jump into Google Trends, type something like YouTube, and think:

“Wow this is incredibly granular. I can get data every 15–16 minutes. This is perfect for machine learning.”

So you download a month of data…..And suddenly… it’s daily.

Not ideal, but still workable.

Then you try a year.

Now everything is aggregated weekly.

Instead of 365 daily data points, you’ve got 52. That’s nowhere near enough for most time series models.

So how do we turn Google Trends data into something granular enough for machine learning?

That’s what this post is about.

Why You Can’t Just Stitch 90 Day Windows Together

You might think:

“Easy. I’ll just pull the maximum daily window (90 days), slide it forward, and stitch the results together.”

Seems reasonable, right?

But here’s the problem, and I’ve used Facebook as our example

Google normalises each query window independently.

For every request:

  • The highest value in that window is set to 100

  • Everything else is scaled relative to that

So if you naively stitch multiple 90 day windows together, you get something that looks like this:

Graph of raw google trends output for Facebook

Graph of raw google trends output for Facebook

  • Normal search behaviour

  • Then suddenly the entire world stops searching…

  • Except for one day?

  • Then back to normal.

Obviously, the world didn’t coordinate a global “don’t search” event.

What actually happened is this:

  • A single day within one 90 day window had a relatively higher spike (Facebook had gone down on this particular day and this is why).

  • That day was set to 100.

  • Everything else in that window was scaled down.

  • When stitched to adjacent windows, the scales don’t match.

So we need a smarter approach.

The Fix: Rolling Windows With Overlap

Instead of stitching independent windows, we:

  1. Pull 90 day windows

  2. Move forward 60 days each time

  3. Create a 30 day overlap

  4. Use the overlap to calculate a scaling factor

This allows us to normalise each new batch against the previous one and eventually normalise everything back to the first batch. lets explain how this works

Step 1: Pulling the Data

We use PyTrends, an unofficial API for Google Trends.

It works well ….. but you will hit rate limits.

So the setup includes:

  • pytrends

  • datetime

  • time (for sleep delays)

  • A dataframe library (I used Polars; Pandas works fine)

  • matplotlib.pyplot for plotting

screenshot of the function Imports required

Imports required

We build a TrendReq, define a payload with:

  • Search term (e.g., Facebook)

  • Time range

  • Geography (e.g., GB)

Then call .interest_over_time().

Simple… until rate limiting kicks in.

screenshot of the function Basic pytrends request

Basic pytrends request

Handling Rate Limits (The Real Data Science Experience)

To deal with rate limits:

  • Wrap the call in try/catch

  • Log failures

  • Sleep for 60 seconds

  • Retry (multiple times)

Eventually, you’ll get your window.

screenshot for Function for Rate limits

Function for Rate limits

Then I wrapped everything in another function that:

  • Converts Pandas to Polars

  • Adds a batch number

  • Saves intermediate results

screenshot for Function to convert to polars and added batch numbers

Function to convert to polars and added batch numbers

Why?

Because when rate limiting killed the script halfway through, I lost everything.

So I split the job into:

  • Part 1

  • Part 2

  • Part 3

Each saves to CSV.

screenshot for Function for running and saving

Function for running and saving

screenshot to show how to Control for re-running parts

Control for re-running parts

Is this beautiful production code?

No.

Is this what real world data science often looks like?

Yes.

You optimise modelling code.
Data collection code? Often hacked together until it works.

Step 2: Combining the Batches

Now we:

  • Load all partial CSVs

  • Merge them

  • Restore date formats

  • Fix floats and integers

  • Combine terms (YouTube, Amazon, Facebook, etc.) into one table

At this stage, the data still looks broken because we haven’t normalised the rolling windows.

Now comes the important part.

screenshot to show the Function to combine

Function to combine

So, examining the output of this function, we see… Here are the Google Trends values for each of the specified dates across the different terms. As I mentioned, if we focus on the Facebook graph, its current pattern looks like this

Graph of raw google trends output for Facebook

Graph of raw google trends output for Facebook

We've not done any processing to join up with that moving window so it's in the same world.  So lets explain that now

Step 3: Finding the Overlapping Dates

We:

  1. Add a column called count = 1

  2. Group by date

  3. Sum the count

This tells us how many times each date appears.

  • Dates appearing once - not overlapping

  • Dates appearing twice - overlapping windows

We only care about the duplicates. In my code i’m calling that dupe_dates

screenshot of Adding and grouping by count

Adding and grouping by count

screenshot of Filtering to just duplicate dates

Filtering to just duplicate dates

When I analyse time based data I’m not interested in isolated dates that appear only once. Those single occurrences don’t help calculate overlapping windows, instead we:

  • Remove everything that appears once

  • Keep the dates that are overlapping

  • Filter all the data on dates that contribute to overlapping windows.

screenshot of Filtering the raw data to only the dates that appear twice

Filtering the raw data to only the dates that appear twice

Step 4 : calculating the scaling factor between batches

To do this, I perform another group_by on the filtered table.

This allows me to collect the batch numbers associated with each remaining date.

Because of how the data was originally constructed, I already know two important things:

  • Each overlapping date appears in exactly two batches.

  • Those batches are already in chronological order.

screenshot of Code to identify which batches each date is part of

Code to identify which batches each date is part of

This allows me to  filter down to the dates that I  want.  So if I'm working on the scaling factor between batch i and j, I can filter down to the relevant dates like this:

screenshot of Code to filter down the dates for the relevant batch

Code to filter down the dates for the relevant batch

Now you might be thinking…

"Why are you doing this complicated explode and filter logic?, Why not just use something simple like this?":

screenshot of Simple filtering to batch i and j

Simple filtering to batch i and j

And the answer is because we want dates that are only in the window between batches 1 and 2. If we filter with an OR we'll grab dates that are in batch 2 but not in batch 1, namely the ones in the window between batches 2 and 3.

So if we use the correct filter, we can grab only the dates we want, and do a group_by with a mean to get the average of the window dates from the batch 1 dataset, and the average of the window dates from the batch 2 dataset.

screenshot of filtering and grouping by to get the average of the window in each batch

filtering and grouping by to get the average of the window in each batch

Dividing between the two and correcting the batch_number column allows us to calculate the scaling factor between the batches

screenshot of Calculating the scaling factor between batch 1 and 2

Calculating the scaling factor between batch 1 and 2

I packaged all of that logic into a single function. Inside it, I loop through each element we discussed and build one consolidated scaling factor table.

If you’ve been working in data science for a while, you might be wondering:

“Why not just vectorise this?”

That’s a fair question.

The short answer is: clarity.

For the purpose of teaching and walking through the logic step by step (especially in a video format), writing it out explicitly makes the transformation easier to follow. You can see exactly how each piece contributes to the final result.

That said, this implementation can absolutely be vectorised. In fact, I’d encourage you to try. Refactoring this into a vectorised approach is a great exercise if you’re looking to level up your programming skills.

screenshot of Function for calculating the scaling factor between batches

Function for calculating the scaling factor between batches

Above, the scaling factor table looks something like this,
for each batch, we calculate a scaling factor that normalises it to the previous batch.

For example, the scaling factor in row four tells us how to adjust Batch 4 so it matches the scale of Batch 3.
However that’s not what I ultimately want.

Instead of chaining each batch to the previous one, I want every batch normalised to Batch 1.

To get the scaling factor between Batch 3 and Batch 1, I simply multiply:

  • The factor from Batch 3 - Batch 2

  • By the factor from Batch 2- Batch 1

This gives us one combined scaling factor that puts every batch on the same scale

screenshot of Using a cumprod to calculate the scaling factors with batch 1 and finishing our data cleaning step

Using a cumprod to calculate the scaling factors with batch 1 and finishing our data cleaning step

The data then looks like this below. You can see that the scaling factors are null for batch 1 to batch 1 but kick in for batch 2 onwards.

screenshot of Data after scaling has been applied.

Data after scaling has been applied.

Step 5 : Examining the results

On this graph below, we can see the original data alongside the scaled data. The orange line shows the small drop offs:

“The odd spikes that look like, well, the world suddenly stopped Googling”

In contrast, the scaled data forms a much smoother, continuous line.

screenshot of Graph of the unscaled data versus our scaled data

Graph of the unscaled data versus our scaled data

I’m going to take that line in the graph as it is and examine it for a moment, then compare it with Google Trends data over the same five year period. At first glance, they look quite similar the peaks occur at the same times. I double checked the dates, and the peaks do align.

However, you might notice that the scale is different:

The peak in our data is much higher than the peak in the Google Trends data.

At first glance, this might make it seem like our scaling factors exaggerated that particular peak, which could raise concerns about the accuracy of the data.

image of Scaled Facebook data

Scaled Facebook data

screenshot of Google trends facebook data

Google trends facebook data

Remember….

that Google Trends averages data on a weekly basis.

If I take the weekly data for this time period and calculate the average, a value around 100 would indicate that my data is in the right ballpark.

I did that calculation for this specific date and got 102.8, which shows that our data not only behaves like Google Trends data but also gives us daily values that can be used directly in models.

screenshot of Our average over this date is reasonable

Our average over this date is reasonable

This is how we transformed Google Trends data into a format suitable for machine learning. The code is available on GitHub, so why not click the link and try it out yourself.

For more of this, come on the journey with us and keep being Evil

Next
Next

Domain Knowledge: The Machine Learning Unlock