Perfect Horror Movie EVER

What actually makes a horror movie good? Is it how many times you jump out of your chair or peek through your fingers? The plot? That creeping sound that grabs you from behind?
Is it what the Academy says… or what we feel in the dark? By the end, you’ll know exactly why The Ring chills differently than Jaws, and which levers directors pull to make you want to look away.

Let’s test it with data science.

We’ll build a dataset: jump-scare counts, pacing, runtime, jump-scare intensity, critic and audience scores. Then we’ll turn it into a clear, testable recipe for the perfect horror movie.

Less guessing. More evidence.
Let’s do some Evil Work.

First, we need the dataset.

Good news: the jump-scare data already exists, thanks to a public jump-scare database.
It gives us a jump-scare rating, the total jump-scare count, exact timestamps, whether each scare is “MAJOR,” plus runtime and year.

Next, we’ll layer in IMDb data for the rest: plot, genres, poster, director, awards, box office, actors, Rotten Tomatoes scores, and IMDb ratings and votes.

Then we merge both datasets. Genres, actors, writers? They come as lists, so we keep them as lists. All the numeric fields? We convert to clean floats/ints. Now it’s analysis-ready. Once it’s cleaned, we can finally see the shape of the horror landscape.
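Here’s a minimal pandas sketch of that cleanup. The file names and column names (jump_scares.csv, imdb_details.csv, genres, imdbRating, imdbVotes) are assumptions for illustration, not the exact ones from the project:

```python
import pandas as pd

# Hypothetical file and column names; adjust to your own extracts.
scares = pd.read_csv("jump_scares.csv")    # jump-scare database dump
imdb = pd.read_csv("imdb_details.csv")     # IMDb metadata pull

# Merge on title + year so remakes don't collide.
df = scares.merge(imdb, on=["title", "year"], how="inner")

# List-like fields stay as lists.
for col in ["genres", "actors", "writers"]:
    df[col] = df[col].str.split(", ")

# Numeric fields become clean floats/ints.
df["imdbRating_num"] = pd.to_numeric(df["imdbRating"], errors="coerce")
df["imdbVotes_num"] = pd.to_numeric(
    df["imdbVotes"].str.replace(",", ""), errors="coerce"
)
```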

Time for some fun visuals to make this easy to read. When we plot the movies, it’s no surprise: most are horror. But some films include jump scares without actually being horror. Since we’re building the recipe for the perfect horror movie, we cut them. Nearly 1 in 5 weren’t horror.

IMDb Ratings by decade

Turns out a well-timed creep behind your shoulder can make you jump… even outside the genre.

Now we define the recipe.

Sugar = imdbRating_num

Spice = Awards

Everything nice = no, we don’t need that

Chemical X = jump_scare_rating

Those are the ingredients to create the perfect horror movie.

Double, double, toil and trouble

We’ve got 666 movies (yes… fitting). For each one, we’ve got imdbRating_num and imdbVotes. Some titles have amazing ratings but with barely any votes. That’s noisy. So we borrow IMDb’s approach for the Top 250 and use a Bayesian weighted rating to keep things fair:

The Bayesian average is a way to rate things more fairly when some items don’t have many votes. Instead of just using a raw average, it blends two numbers: the item’s own average score and the overall average across everything. 

WR = (v / (v + m)) × R + (m / (v + m)) × C
where R is the movie’s average rating, v is its vote count, m is the minimum votes needed to qualify (we’ll use 25,000), and C is the global mean rating.
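That formula is only a couple of lines of code. This sketch continues from the earlier cleaning example, so imdbRating_num and imdbVotes_num are assumed column names:

```python
m = 25_000                           # minimum votes to qualify
C = df["imdbRating_num"].mean()      # global mean rating across the whole set

def weighted_rating(R, v, m=m, C=C):
    # Blend the movie's own rating R (with v votes) toward the global mean C.
    return (v / (v + m)) * R + (m / (v + m)) * C

df["WR"] = weighted_rating(df["imdbRating_num"], df["imdbVotes_num"])
top = df.sort_values("WR", ascending=False).head(250)   # our own "Top 250"
```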

It pulls the pretenders back into the grave and leaves only the films with real consensus standing. That becomes our own “Top 250.” With the rankings secured, we can now summon the awards.

Pretenders Lie Here

Now that we’ve locked our movie list, let’s see what the critics think.

We’ll use awards as a proxy, but not all trophies are equal. Wins and nominations both count, and Oscars and BAFTAs get extra weight. We roll those into a single prestige score and normalize it to 0–100. Coverage is strong: 93.4% of titles have award data. 
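For the curious, the roll-up could look something like the sketch below. It assumes the raw Awards field follows OMDb-style free text (“Won 1 Oscar. 9 wins & 20 nominations total”), and the weights (nominations ×1, wins ×2, Oscar/BAFTA mentions ×3) are illustrative placeholders rather than the exact ones behind the numbers that follow:

```python
import re

def prestige_score(text):
    # Illustrative weights: nominations count once, wins double,
    # and explicit Oscar/BAFTA mentions add a big-ticket bonus.
    if not isinstance(text, str):
        return 0.0
    noms = sum(int(n) for n in re.findall(r"(\d+) nomination", text, flags=re.I))
    wins = sum(int(n) for n in re.findall(r"(\d+) win", text, flags=re.I))
    big = sum(int(n) for n in re.findall(r"(\d+) (?:Oscar|BAFTA)", text, flags=re.I))
    return 1 * noms + 2 * wins + 3 * big

df["prestige_raw"] = df["Awards"].apply(prestige_score)
df["prestige"] = 100 * df["prestige_raw"] / df["prestige_raw"].max()   # 0-100 scale
```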

After scaling, the distribution is super skewed: mean 4.26, median 2.07, p10 0.24, p90 8.74, and a few elite titles spike up to 100.

Translation: most movies get little awards love, while a handful get all the glory.

That kind of lopsided data makes plots messy and models unreliable, because the outliers dominate. Box–Cox transformation fixes this: it squeezes the long tail, balances the spread, and pulls the distribution closer to normal. After that, patterns are clearer, relationships are more stable, and the stats play nicer, making the whole analysis much easier and more trustworthy.
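In code, that’s nearly a one-liner with SciPy. The small shift is only there because Box–Cox needs strictly positive values; this is a sketch against the prestige column from the previous example:

```python
from scipy.stats import boxcox

# Box-Cox requires strictly positive input, so nudge zero-award titles off zero.
shifted = df["prestige"] + 1e-3
df["prestige_bc"], lam = boxcox(shifted)
print(f"fitted lambda: {lam:.3f}")
```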

Prestige is one signal. It measures applause, not fear. To gauge the shock in your spine, we head back to the scares.

Golden relics awaiting their next worthy sacrifice

We measure each film’s jump-scare rhythm: how evenly scares are spaced, how soon the first one hits, how long the quiet tail is, and how many are “major.”
When we test what drives jump_scare_rating, the pattern’s clear: more scares and tighter pacing lift the rating, while longer gaps, a later first scare, and a long quiet finish pull it down. Big, “major” scares help.
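To make “rhythm” concrete, here’s a sketch of the kind of pacing features we derive per film. It assumes we have each film’s scare timestamps in minutes, its runtime, and its “MAJOR” flags; the names are illustrative:

```python
import numpy as np

def rhythm_features(scare_times_min, runtime_min, is_major):
    """Pacing features for one film: timestamps in minutes, runtime in minutes,
    and booleans flagging the 'MAJOR' scares. Names are illustrative."""
    times = np.sort(np.asarray(scare_times_min, dtype=float))
    gaps = np.diff(times)
    return {
        "scares_per_hour": 60 * len(times) / runtime_min,
        "first_scare_min": float(times[0]),
        "quiet_tail_frac": (runtime_min - times[-1]) / runtime_min,
        # Coefficient of variation of the gaps: 0 = metronome, higher = erratic.
        "gap_irregularity": float(gaps.std() / gaps.mean()) if len(gaps) else 0.0,
        "major_ratio": float(np.mean(is_major)),
    }

print(rhythm_features([12, 33, 48, 61, 79, 95], runtime_min=104,
                      is_major=[False, True, False, True, True, False]))
```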

Fun fact: jump scares really ramped up in the ’80s. Before that, films averaged about three jump scares total; by the 2020s, that’s roughly doubled. Before the ’80s, you’d wait about 40 minutes for the first scare. Now they barely let you settle: around 15 minutes in, mid-popcorn bite… boom, jump scare.

  • Fastest time to first scare: 2000s → 16.03 min

  • Highest average IMDb rating by decade: 1970s → 7.61

First-Scare Build-Up by Decade

Time to pick our measuring stick. Let’s be better than Professor Utonium and get the mix right.

We blend three signals: IMDb (what people think), Awards (prestige), and Scare (how hard it hits). Heavier on scares: 0.35 IMDb / 0.10 Awards / 0.55 Scare. For each movie, we take a weighted average of whatever’s available (skipping the missing bits), then standardise it (z-score) so everything sits on the same scale.

That final number is PerfectScore: our “can’t look, can’t look away” meter.
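A rough sketch of that blend, assuming the three signals have already been rescaled to comparable ranges (the column names below are illustrative):

```python
import pandas as pd

weights = pd.Series({
    "imdb_scaled": 0.35,     # what people think
    "awards_scaled": 0.10,   # prestige
    "scare_scaled": 0.55,    # how hard it hits
})

signals = df[weights.index]

# Weighted average of whatever's available per movie: missing signals are
# skipped by renormalising the weights over the columns that are present.
num = (signals * weights).sum(axis=1)
den = signals.notna().mul(weights).sum(axis=1)
blend = num / den

# Standardise (z-score) so everything sits on the same scale.
df["PerfectScore"] = (blend - blend.mean()) / blend.std()
```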

We’re teaching the model to call the perfect fright, but it keeps its hands off the sugar (IMDb), spice (awards), and Chemical X (scare score). Those are just the fuel; the model only plays with the dials we, as filmmakers, can actually twist:


– How many scares per hour.
– How soon the first one lands.
– How even or irregular the gaps are.
– The runtime sweet spot.
– Director style and even plot themes, pulled out of the plot text with TF-IDF and SVD.

Professor Utonium mixing the perfect combination

For plots, we don’t just dump the text in. We use TF-IDF (“term frequency–inverse document frequency”), basically a way to score which words are important to this movie compared to all the others.

TF (Term Frequency): In The Ring, the phrase “seven days” shows up multiple times in the plot/dialogue, so it’s bright for this movie.

IDF (Inverse Document Frequency): Across most other horror plots, “seven days” barely appears — so it’s rare, which cranks the spotlight even brighter.

Put together, TF-IDF pushes “seven days” to centre stage while generic words like “blood,” “night,” or “house” fade into the dark.
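In scikit-learn terms, that’s a TfidfVectorizer over the plot summaries. The column name and the cutoffs below are assumptions, not the project’s exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Plot summaries -> sparse TF-IDF matrix. English stop words drop pure filler;
# IDF then dims words that show up in most plots anyway.
tfidf = TfidfVectorizer(stop_words="english", min_df=2, max_features=10_000)
X_plots = tfidf.fit_transform(df["plot"].fillna(""))
print(X_plots.shape)   # (n_movies, n_terms)
```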

Then we compress. Picture a haunted library: thousands of books, every word from every plot scattered across endless shelves, impossible to read, right? Truncated SVD is the ghost librarian who tidies the chaos. She doesn’t copy the books; she collapses them into a few theme shelves: “haunted house,” “killer in the woods,” “zombie outbreak.”

Corridors filled with Spooky Stories

Now, instead of 10,000 loose words, we hold a handful of potent themes. The details are condensed, but the scary essence survives. The model doesn’t “read” like a human, it receives compact theme vectors: tiny coordinates that map each film’s neighbourhood in fear.
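The ghost librarian, in code: a TruncatedSVD over the TF-IDF matrix from the previous sketch. The 50 shelves here are an assumption, not the article’s exact setting:

```python
from sklearn.decomposition import TruncatedSVD

# Collapse thousands of plot terms into a handful of latent "theme shelves".
svd = TruncatedSVD(n_components=50, random_state=13)
plot_themes = svd.fit_transform(X_plots)     # (n_movies, 50) theme coordinates
print(f"variance kept: {svd.explained_variance_ratio_.sum():.1%}")
```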

Enter the Random Forest. Think of it as an ensemble, not one shaky decision tree, but hundreds. Each tree explores a different haunted hallway, trained on a bootstrap sample of the data. At every split, we don’t let the tree peek at all features, just a random subset. That randomness does two things: it decorrelates the trees, and forces them to consider different paths through the house.

Alone, each tree might overfit, screaming at shadows. Together, through bagging and majority voting, the forest reduces variance, keeps bias low, and resists overfitting.

Horror data is messy and nonlinear: scares cluster, themes overlap, directors twist conventions. Linear regression would blink at the first jump scare. But Random Forests? They thrive on interaction effects and non-monotonic relationships, mapping the maze without getting lost. Plus, we get feature importance out the other side — which scare knobs really drive the PerfectScore.


We lock in a Random Forest tuned to 3,000 trees, depth ≈ 24, max_features ≈ 0.5, and max_samples = 0.8. Using permutation importance and partial dependence, we optimise only the controllable knobs (scares per hour, first-scare timing, gap irregularity, runtime, and major-scare ratio) while holding audience and awards signals out of the inputs. A randomised search over realistic bounds finds a peak around 105–110 minutes, first scare at roughly 12–16 minutes, 5–6 scares per hour with about 40% majors, moderate irregularity, and a short 3–6% quiet tail. That’s our evidence-backed “can’t look / can’t look away” mix.
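A condensed sketch of that setup, using only the pacing knobs for brevity. The column names are the illustrative ones from earlier, and the real model also folds in director style and the SVD plot themes:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

knobs = ["scares_per_hour", "first_scare_min", "gap_irregularity",
         "runtime_min", "major_ratio"]                   # controllable dials only
X, y = df[knobs], df["PerfectScore"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=13)

rf = RandomForestRegressor(n_estimators=3000, max_depth=24,
                           max_features=0.5, max_samples=0.8,
                           random_state=13, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Permutation importance: how much held-out accuracy drops when each knob is shuffled.
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=13)
for name, score in sorted(zip(knobs, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>18}: {score:.3f}")
```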

Scare Schedule

Now we can turn the model into a script blueprint. Not line-by-line dialogue, but a spine of beats. Set the timing knobs, place the shocks: where the first scare hits, how many in total, which ones are majors, and how the tension climbs.

This schedule runs 121 minutes, with 14 jumps, 6 of them majors, and the first scare at ~5 minutes. Above is the scare schedule plotted: orange stars for major jumps, blue dots for minors. Notice how Act II cuts back, then Act III stacks. The rhythm is engineered, not guessed.
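For a feel of what the blueprint looks like as data, here’s a toy generator that spaces 14 scares over 121 minutes with the first beat around 5 minutes and 6 of them flagged as majors. The real beat placement came from the model’s search; this only illustrates the format:

```python
import numpy as np

def scare_schedule(runtime=121, n_scares=14, n_major=6, first=5, seed=13):
    """Toy blueprint: roughly even beats from the first scare to a ~5% quiet tail,
    jittered for irregularity, with a random subset flagged as majors."""
    rng = np.random.default_rng(seed)
    last = runtime * 0.95                                  # leave a short quiet tail
    beats = np.linspace(first, last, n_scares)
    beats = np.sort(beats + rng.normal(0, 1.5, n_scares))  # moderate irregularity
    major_idx = set(rng.choice(np.arange(1, n_scares), size=n_major, replace=False))
    return [(round(float(t), 1), i in major_idx) for i, t in enumerate(beats)]

for minute, is_major in scare_schedule():
    print(f"{minute:6.1f} min  {'MAJOR' if is_major else 'minor'}")
```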

Less guessing. More evidence.
Roll credits… and don’t look behind you.

Come on the journey with us by joining the Evil Lair.
