My hardest Pokémon battle was with the data

The Pokémon Trading Card Game (TCG) first launched in Japan on October 20, 1996, with its original set of 102 cards. It’s wild to realize Pokémon is actually as old as I am. I still remember being a kid and seeing those first cards and decks hit the shelves—the rush of opening a new pack, the thrill of pulling a secret rare, and that urge to collect them all. If only my mom had known what those cards would be worth today, she probably would’ve thought twice before threatening to toss them out! So now, it’s time to answer the question: was it ever really worth opening those packs, or should we have just kept them sealed all these years?

If you prefer watching over reading, you can do so right here:

Why Pokémon cards, you ask? Well, the thing is, there’s so much data out there about Pokémon. Seriously, between all the different sets, the endless list of individual cards, and the huge collector community, you can find just about any info you need. There are datasets ready to go, APIs, price trackers, forums. And when it comes to trading cards, Pokémon is easily one of the biggest franchises around, with products going all the way back to the '90s and new sets still coming out today. On top of that, the resale market is massive: some cards sell for hundreds or even thousands. So I figured, how hard could it be to analyze? Turns out, it’s way messier than you’d expect. I thought I knew Pokémon, but let me tell you, I had a lot to learn.

I began by exploring the product names, where I discovered a significant pattern: the majority were sealed products, not individual cards. Crucially, many titles explicitly stated the number of packs. This revelation was a breakthrough, as it allowed me to extract the pack count using regex for my analysis.

For anyone unfamiliar, regex (regular expression) is basically a search pattern you can use to match text in really flexible ways. It’s like a smart filter for text, whether you’re looking for specific numbers, words, or even more complex patterns. In my case, I used regex to scan the product names and pull out numbers that represented the pack count. So instead of manually checking every title, I could just write a rule that does the work for me.
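Here’s a minimal sketch of the kind of regex I’m describing. The product titles below are made up for illustration; the real dataset’s names are messier.

```python
import re
import pandas as pd

# Hypothetical product titles; the real data is messier than this.
products = pd.Series([
    "Scarlet & Violet: Paldea Evolved Booster Box (36 Packs)",
    "Sword & Shield Elite Trainer Box - 8 Booster Packs",
    "Charizard ex Single Card",
])

# Grab the number that appears right before "pack(s)" or "booster pack(s)".
pack_pattern = re.compile(r"(\d+)\s*(?:booster\s*)?packs?\b", flags=re.IGNORECASE)

def extract_pack_count(title: str):
    """Return the pack count found in a product title, or None if there isn't one."""
    match = pack_pattern.search(title)
    return int(match.group(1)) if match else None

print(products.apply(extract_pack_count).tolist())  # [36, 8, None]
```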

As I kept exploring, another pattern stood out: usually there’s a collection name, then a colon, then the Pokémon’s name. I thought, “Great, I’ll just split those up.” But then I realized I could scrape a list of official collections from the website and match those to product names instead. In theory, it sounded perfect. In practice, not so much. The website isn’t always consistent. So I ended up building some custom dictionaries to fill in the gaps.
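To give a rough idea of that approach, here’s a toy version of the colon heuristic plus a manual override dictionary. Both the collection list and the overrides here are invented examples, not the actual data I scraped.

```python
# Hypothetical collections scraped from the official site, plus manual fixes
# for product names the site doesn't cover consistently.
known_collections = {"Paldea Evolved", "Obsidian Flames", "Lost Origin"}
manual_overrides = {"Celebrations Ultra-Premium Collection": "Celebrations"}

def split_collection(title: str):
    """Try the 'Collection: rest of the name' pattern, then fall back to overrides."""
    if title in manual_overrides:
        return manual_overrides[title], title
    if ":" in title:
        prefix, rest = title.split(":", 1)
        if prefix.strip() in known_collections:
            return prefix.strip(), rest.strip()
    return None, title  # no collection identified

print(split_collection("Obsidian Flames: Charizard ex Premium Collection"))
# ('Obsidian Flames', 'Charizard ex Premium Collection')
```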

But matching Pokémon names brought up a new issue. I grabbed a dataset from Kaggle with all the Pokémon names and assumed they were all just one word. Easy, right? Not so fast. Leigh, my resident Pokémon expert, quickly set me straight—there are now plenty of Pokémon with double-barrelled names! My Kaggle list was outdated.

So, I needed a better way to check if a name was actually a Pokémon, a trainer, or maybe an expansion name. That’s when I started using the Pokémon API to properly verify names. Lesson learned: always double-check your assumptions, especially with Pokémon.
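The post just says “the Pokémon API”; assuming the public PokéAPI (pokeapi.co), a simple name check could look something like this.

```python
import requests

def is_pokemon(name: str) -> bool:
    """Return True if PokéAPI recognises the name as a Pokémon."""
    # PokéAPI uses lowercase, hyphenated slugs (e.g. "mr-mime", "tapu-koko").
    slug = name.strip().lower().replace(" ", "-").replace(".", "").replace("'", "")
    response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{slug}", timeout=10)
    return response.status_code == 200

print(is_pokemon("Mr. Mime"))   # True
print(is_pokemon("Tapu Koko"))  # True
print(is_pokemon("Booster"))    # False
```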

A big pile of Pokémon cards!

This was a journey I did NOT expect to go on…

At this point, I realized I needed a more general solution. So, I wrote a flexible function for the matching process. It goes through each product name and checks if it contains any collection name from my list—starting with the longest names first to avoid accidental short matches. When it finds a match, it strips that part out, leaving the rest for further analysis.

The best part? The function gives me back both a cleaned product list and a list of products that didn’t match any collection. That way, I can easily review the leftovers to look for patterns and to spot products that don’t even need a collection name in the first place.
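Here’s a rough sketch of what such a function might look like; the actual implementation probably differs, but the idea is the same: match the longest reference names first, strip them out, and hand back the leftovers.

```python
import pandas as pd

def match_and_strip(titles: pd.Series, reference_names: list[str]):
    """Assign the longest matching reference name to each title and strip it out.

    Returns (matches, cleaned_titles, unmatched_titles).
    """
    # Longest names first so "Sword & Shield: Lost Origin" wins over "Sword & Shield".
    ordered = sorted(reference_names, key=len, reverse=True)

    matches, cleaned, unmatched = [], [], []
    for title in titles:
        found = next((name for name in ordered if name.lower() in title.lower()), None)
        matches.append(found)
        if found:
            start = title.lower().find(found.lower())
            cleaned.append((title[:start] + title[start + len(found):]).strip(" :-"))
        else:
            cleaned.append(title)
            unmatched.append(title)

    return (pd.Series(matches, index=titles.index),
            pd.Series(cleaned, index=titles.index),
            unmatched)
```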

It’s actually pretty flexible! I can use the exact same function to assign Pokémon names, or even other categories, just by giving it a different reference dataset. It matches and cleans in the same way, always keeping track of what didn’t get matched so I can give those cases a second look if needed. It’s a huge help for catching products that don’t quite fit the usual patterns.

With those main matches sorted, I turned my attention to cleaning up “other versions” like “EU VERSION,” “Chinese,” and similar labels. The reason is simple: these versions can affect price and there’s usually little reliable info about them, so they just add confusion.
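A quick sketch of that cleanup, with a made-up list of labels; the real list covered more variants.

```python
import re

# Hypothetical set of version labels to strip out of product names.
version_labels = ["EU VERSION", "US VERSION", "CHINESE", "JAPANESE", "KOREAN"]
version_pattern = re.compile("|".join(re.escape(v) for v in version_labels), re.IGNORECASE)

def strip_versions(title: str) -> str:
    """Remove region/language labels and tidy up leftover whitespace and brackets."""
    cleaned = version_pattern.sub("", title)
    return re.sub(r"\s{2,}", " ", cleaned).strip(" -()")

print(strip_versions("Paldea Evolved Booster Box (EU VERSION)"))
# Paldea Evolved Booster Box
```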

For the products that didn’t match (and weren’t simply left as NaN, since not every product actually needs a Pokémon name or collection), I wrote a function called get_most_common_words. This function takes a series of text, cleans and lowercases it, splits it into individual words, and generates n-grams if needed. It then counts how frequently each word or word combination appears using collections.Counter. This makes it easy to quickly identify the most common terms or phrases in the dataset and see which patterns or keywords show up the most.
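Based on that description, the function probably looks something like this; treat it as a sketch rather than the exact code.

```python
import re
from collections import Counter
import pandas as pd

def get_most_common_words(texts: pd.Series, n: int = 1, top: int = 20):
    """Count the most frequent words (or n-grams) in a series of text."""
    counter = Counter()
    for text in texts.dropna():
        # Lowercase and keep only alphanumeric tokens.
        words = re.findall(r"[a-z0-9]+", text.lower())
        # Build n-grams; for n=1 these are just the words themselves.
        grams = zip(*(words[i:] for i in range(n)))
        counter.update(" ".join(gram) for gram in grams)
    return counter.most_common(top)

leftovers = pd.Series(["Mini Tin", "Mini Tin Display", "Sleeved Booster Pack"])
print(get_most_common_words(leftovers, n=2, top=5))
```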

Even with all that, there were some stubborn gaps. That’s when I turned to the Bulbapedia API, which pretty much became my go-to source for anything Pokémon-related that I couldn’t figure out myself. It was a good reminder of why I picked Pokémon cards in the first place: I was confident the information would be out there somewhere.
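Bulbapedia runs on MediaWiki, so I’m assuming the standard MediaWiki search endpoint here; the post doesn’t say exactly which calls were used.

```python
import requests

# Assumption: Bulbapedia exposes the standard MediaWiki Action API.
BULBAPEDIA_API = "https://bulbapedia.bulbagarden.net/w/api.php"

def search_bulbapedia(term: str, limit: int = 3) -> list[str]:
    """Return the titles of the top Bulbapedia pages matching a search term."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    response = requests.get(BULBAPEDIA_API, params=params, timeout=10)
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]

print(search_bulbapedia("Elite Trainer Box"))
```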

A birthday Pikachu Pokémon card

We thought shiny was rare when we started opening packs for this video. Found out quickly that it was not.

I searched for the leftover words I couldn’t match, looking for any extra details or clues. Sure enough, using Bulbapedia helped me fill in some missing data and clean things up even further.

But as the project progressed, I realized my initial method for finding the pack count wasn’t cutting it. So I developed a new function that uses both words and regex patterns to accurately assign the pack_count in each product name. It can handle single or multiple patterns, supports both literal and regex matching, and allows you to choose whether all or any patterns should trigger a match. Once it finds a match, it updates the pack count and cleans up the text, leaving the data much tidier.
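Here’s a hedged sketch of what that combined word-and-regex matcher might look like; the parameter and column names are mine, not necessarily what’s in the real code.

```python
import re
import pandas as pd

def assign_pack_count(df: pd.DataFrame, patterns, pack_count, *,
                      use_regex=False, match_all=False,
                      name_col="product_name", count_col="pack_count") -> pd.DataFrame:
    """Set the pack count where a product name matches the given pattern(s),
    then strip the matched text out of the name."""
    if isinstance(patterns, str):
        patterns = [patterns]
    compiled = [re.compile(p if use_regex else re.escape(p), re.IGNORECASE)
                for p in patterns]

    def matches(title: str) -> bool:
        hits = [p.search(title) is not None for p in compiled]
        return all(hits) if match_all else any(hits)

    # Only fill products that don't already have a pack count.
    mask = df[name_col].apply(matches) & df[count_col].isna()
    df.loc[mask, count_col] = pack_count
    for p in compiled:
        df.loc[mask, name_col] = (df.loc[mask, name_col]
                                  .str.replace(p, "", regex=True)
                                  .str.strip())
    return df

# Hypothetical rule: anything described as a "single pack" counts as one pack.
# df = assign_pack_count(df, ["single pack"], 1)
```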

Getting an accurate pack count is crucial because it’s the foundation of any prediction model focused on pull rates. The number of packs in a sealed product determines how many chances you have to pull a valuable card, which directly affects calculations for expected value and the likelihood of landing something rare. Without this information, it’s almost impossible to make reliable predictions about the value you might get from opening a box, or to fairly compare different sealed products.

While I was at it, I started thinking about other terms that might be important—stuff like whether a product is “sleeved” (which could affect price) or if it’s part of an “EX” set (some of the rarest, priciest cards out there).

Ultimately, seeing how pack count is tied to the whole “gamblefication” aspect of Pokémon cards, I decided to focus on which product categories actually have the most packs and leave the deep-dive analysis for another post. The interesting part is that a single special card can sometimes be worth far more than the cost of a sealed box. Each pack typically includes 1 Rare, 3 Uncommon, and 6 Common cards, and some of those rares can sell for over £1,000. This brings up the big question: do you take the risk and open the box, or is it smarter to keep it sealed?
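To make the open-or-keep question concrete, here’s a toy expected value calculation. Every number below except the 1 rare / 3 uncommon / 6 common split is invented purely for illustration.

```python
# Toy numbers: pull rate and prices are made up, not real market data.
packs_in_box = 36          # hypothetical sealed booster box
chase_card_price = 1000.0  # £, the single special card worth more than the box
chase_pull_rate = 1 / 400  # hypothetical chance per pack of pulling that card
baseline_pack_value = 1.5  # £, rough resale value of the other cards in a pack

expected_value = packs_in_box * (chase_pull_rate * chase_card_price + baseline_pack_value)
print(f"Expected value of opening the box: £{expected_value:.2f}")
# Compare this against the sealed box's market price to decide whether to open it.
```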

This kind of analysis lets us try to predict the price of a Pokémon pack based on the pull rates and potential cards inside. Plus, it opens up the same “gacha” or loot box style questions for other collectibles, like Labubu dolls, which are trending right now. Are they the next Pokémon, or will their value fade over time? That’s what I want to find out.

Come on the journey with us by joining the Evil Lair.
