The Hidden Problem with Google Trends Data

Have you ever tried using Google Trends for market research or machine learning?

You jump into Google Trends, type something like YouTube, and think:

“Wow this is incredibly granular. I can get data every 15–16 minutes. This is perfect for machine learning.”

So you download a month of data…..And suddenly… it’s daily.

Not ideal, but still workable.

Then you try a year.

Now everything is aggregated weekly.

Instead of 365 daily data points, you’ve got 52. That’s nowhere near enough for most time series models.

So how do we turn Google Trends data into something granular enough for machine learning?

That’s what this post is about.

Why You Can’t Just Stitch 90 Day Windows Together

You might think:

“Easy. I’ll just pull the maximum daily window (90 days), slide it forward, and stitch the results together.”

Seems reasonable, right?

But here’s the problem, and I’ve used Facebook as our example

Google normalises each query window independently.

For every request:

The highest value in that window is set to 100
Everything else is scaled relative to that

So if you naively stitch multiple 90 day windows together, you get something that looks like this:

*Graph of raw google trends output for Facebook*

Normal search behaviour
Then suddenly the entire world stops searching…
Except for one day?
Then back to normal.

Obviously, the world didn’t coordinate a global “don’t search” event.

What actually happened is this:

A single day within one 90 day window had a relatively higher spike (Facebook had gone down on this particular day and this is why).
That day was set to 100.
Everything else in that window was scaled down.
When stitched to adjacent windows, the scales don’t match.

So we need a smarter approach.

The Fix: Rolling Windows With Overlap

Instead of stitching independent windows, we:

Pull 90 day windows
Move forward 60 days each time
Create a 30 day overlap
Use the overlap to calculate a scaling factor

This allows us to normalise each new batch against the previous one and eventually normalise everything back to the first batch. lets explain how this works

Step 1: Pulling the Data

We use PyTrends, an unofficial API for Google Trends.

It works well ….. but you will hit rate limits.

So the setup includes:

pytrends
datetime
time (for sleep delays)
A dataframe library (I used Polars; Pandas works fine)
matplotlib.pyplot for plotting

screenshot of the function Imports required — *Imports required*

We build a TrendReq, define a payload with:

Search term (e.g., Facebook)
Time range
Geography (e.g., GB)

Then call .interest_over_time().

Simple… until rate limiting kicks in.

screenshot of the function Basic pytrends request — *Basic pytrends request*

Handling Rate Limits (The Real Data Science Experience)

To deal with rate limits:

Wrap the call in try/catch
Log failures
Sleep for 60 seconds
Retry (multiple times)

Eventually, you’ll get your window.

screenshot for Function for Rate limits — *Function for Rate limits*

Then I wrapped everything in another function that:

Converts Pandas to Polars
Adds a batch number
Saves intermediate results

screenshot for Function to convert to polars and added batch numbers — *Function to convert to polars and added batch numbers*

Why?

Because when rate limiting killed the script halfway through, I lost everything.

So I split the job into:

Part 1
Part 2
Part 3

Each saves to CSV.

screenshot for Function for running and saving — *Function for running and saving*

screenshot to show how to Control for re-running parts — *Control for re-running parts*

Is this beautiful production code?

No.

Is this what real world data science often looks like?

Yes.

You optimise modelling code.
Data collection code? Often hacked together until it works.

Step 2: Combining the Batches

Now we:

Load all partial CSVs
Merge them
Restore date formats
Fix floats and integers
Combine terms (YouTube, Amazon, Facebook, etc.) into one table

At this stage, the data still looks broken because we haven’t normalised the rolling windows.

Now comes the important part.

screenshot to show the Function to combine — *Function to combine*

So, examining the output of this function, we see… Here are the Google Trends values for each of the specified dates across the different terms. As I mentioned, if we focus on the Facebook graph, its current pattern looks like this

We've not done any processing to join up with that moving window so it's in the same world. So lets explain that now

Step 3: Finding the Overlapping Dates

We:

Add a column called count = 1
Group by date
Sum the count

This tells us how many times each date appears.

Dates appearing once - not overlapping
Dates appearing twice - overlapping windows

We only care about the duplicates. In my code i’m calling that dupe_dates

screenshot of Adding and grouping by count — *Adding and grouping by count*

screenshot of Filtering to just duplicate dates — *Filtering to just duplicate dates*

When I analyse time based data I’m not interested in isolated dates that appear only once. Those single occurrences don’t help calculate overlapping windows, instead we:

Remove everything that appears once .
Keep the dates that are overlapping
Filter all the data on dates that contribute to overlapping windows.

screenshot of Filtering the raw data to only the dates that appear twice — *Filtering the raw data to only the dates that appear twice*

Step 4 : calculating the scaling factor between batches

To do this, I perform another group_by on the filtered table.

This allows me to collect the batch numbers associated with each remaining date.

Because of how the data was originally constructed, I already know two important things:

Each overlapping date appears in exactly two batches.
Those batches are already in chronological order.

screenshot of Code to identify which batches each date is part of — *Code to identify which batches each date is part of*

This allows me to filter down to the dates that I want. So if I'm working on the scaling factor between batch i and j, I can filter down to the relevant dates like this:

screenshot of Code to filter down the dates for the relevant batch — *Code to filter down the dates for the relevant batch*

Now you might be thinking…

"Why are you doing this complicated explode and filter logic?, Why not just use something simple like this?":

screenshot of Simple filtering to batch i and j — *Simple filtering to batch i and j*

And the answer is because we want dates that are only in the window between batches 1 and 2. If we filter with an OR we'll grab dates that are in batch 2 but not in batch 1, namely the ones in the window between batches 2 and 3.

So if we use the correct filter, we can grab only the dates we want, and do a group_by with a mean to get the average of the window dates from the batch 1 dataset, and the average of the window dates from the batch 2 dataset.

screenshot of filtering and grouping by to get the average of the window in each batch — *filtering and grouping by to get the average of the window in each batch*

Dividing between the two and correcting the batch_number column allows us to calculate the scaling factor between the batches

screenshot of Calculating the scaling factor between batch 1 and 2 — *Calculating the scaling factor between batch 1 and 2*

I packaged all of that logic into a single function. Inside it, I loop through each element we discussed and build one consolidated scaling factor table.

If you’ve been working in data science for a while, you might be wondering:

“Why not just vectorise this?”

That’s a fair question.

The short answer is: clarity.

For the purpose of teaching and walking through the logic step by step (especially in a video format), writing it out explicitly makes the transformation easier to follow. You can see exactly how each piece contributes to the final result.

That said, this implementation can absolutely be vectorised. In fact, I’d encourage you to try. Refactoring this into a vectorised approach is a great exercise if you’re looking to level up your programming skills.

screenshot of Function for calculating the scaling factor between batches — *Function for calculating the scaling factor between batches*

Above, the scaling factor table looks something like this,
for each batch, we calculate a scaling factor that normalises it to the previous batch.

For example, the scaling factor in row four tells us how to adjust Batch 4 so it matches the scale of Batch 3.
However that’s not what I ultimately want.

Instead of chaining each batch to the previous one, I want every batch normalised to Batch 1.

To get the scaling factor between Batch 3 and Batch 1, I simply multiply:

The factor from Batch 3 - Batch 2
By the factor from Batch 2- Batch 1

This gives us one combined scaling factor that puts every batch on the same scale

screenshot of Using a cumprod to calculate the scaling factors with batch 1 and finishing our data cleaning step — *Using a cumprod to calculate the scaling factors with batch 1 and finishing our data cleaning step*

The data then looks like this below. You can see that the scaling factors are null for batch 1 to batch 1 but kick in for batch 2 onwards.

screenshot of Data after scaling has been applied. — *Data after scaling has been applied.*

Step 5 : Examining the results

On this graph below, we can see the original data alongside the scaled data. The orange line shows the small drop offs:

“The odd spikes that look like, well, the world suddenly stopped Googling”

In contrast, the scaled data forms a much smoother, continuous line.

screenshot of Graph of the unscaled data versus our scaled data — *Graph of the unscaled data versus our scaled data*

I’m going to take that line in the graph as it is and examine it for a moment, then compare it with Google Trends data over the same five year period. At first glance, they look quite similar the peaks occur at the same times. I double checked the dates, and the peaks do align.

However, you might notice that the scale is different:

The peak in our data is much higher than the peak in the Google Trends data.

At first glance, this might make it seem like our scaling factors exaggerated that particular peak, which could raise concerns about the accuracy of the data.

image of Scaled Facebook data — *Scaled Facebook data*

screenshot of Google trends facebook data — *Google trends facebook data*

Remember….

that Google Trends averages data on a weekly basis.

If I take the weekly data for this time period and calculate the average, a value around 100 would indicate that my data is in the right ballpark.

I did that calculation for this specific date and got 102.8, which shows that our data not only behaves like Google Trends data but also gives us daily values that can be used directly in models.

screenshot of Our average over this date is reasonable — *Our average over this date is reasonable*

This is how we transformed Google Trends data into a format suitable for machine learning. The code is available on GitHub, so why not click the link and try it out yourself.

For more of this, come on the journey with us and keep being Evil

Live Data Science Walkthrough: Making Google Trends Data Usable

The Hidden Problem with Google Trends Data

Why You Can’t Just Stitch 90 Day Windows Together

The Fix: Rolling Windows With Overlap

Step 1: Pulling the Data

Handling Rate Limits (The Real Data Science Experience)

Step 2: Combining the Batches

Step 3: Finding the Overlapping Dates

Step 4 : calculating the scaling factor between batches

Step 5 : Examining the results

Remember….

Evil Works

Live Data Science Walkthrough: Making Google Trends Data Usable

The Hidden Problem with Google Trends Data

Why You Can’t Just Stitch 90 Day Windows Together

The Fix: Rolling Windows With Overlap

Step 1: Pulling the Data

Handling Rate Limits (The Real Data Science Experience)

Step 2: Combining the Batches

Step 3: Finding the Overlapping Dates

Step 4 : calculating the scaling factor between batches

Step 5 : Examining the results

Remember….

Domain Knowledge: The Machine Learning Unlock

Evil Works

Join Our Mailing List

Thank you!