What Is Incremental Computing? The Data Science Game Changer
How to use Incremental Computing to Speed Up Data Science
In data science we work with massive datasets and complex models. The problem? Every time the data changes we have to rerun everything from scratch, wasting hours of time and killing more trees with unnecessary computation.
What if we didn’t have to do that?
Today I’m going to talk about a game changer in data science: incremental computing.
The Reality of Quant Life
So I started my career as a quant, expecting to put my probability theory and risk-neutral pricing measures to work, but I soon found out that the dream I’d been sold during my degree is very dead. Quant life isn’t about coming up with clever maths. It’s about dealing with data. And a lot of it.
I also started out in a team with heavy oversight, so much so that we had big testing requirements. The models we had were trained on years’ worth of daily data across interest rates, foreign exchange, equities and commodities. There were tens of different oil time series alone, and that’s before you start looking at all the different stocks.
Every single night we’d run the same process, the exact same calculation in a big batch on data that was almost exactly the same. The only thing that changed was the incremental data from today, such as the latest market prices of shares.
But that’s how data science works. When the data changes you rerun your model. That’s just the way things are. But what if it didn’t have to be that way?
Incremental computing is the idea that if the data hasn’t changed, we don’t rerun it. A simple guiding principle, and easy for a human to understand, but as with much of programming, translating it into something a computer can understand is a whole different kettle of fish.
Constant re-runs get you feeling like…
Why “Just Re-Run It” Doesn’t Scale
So let’s look at it with an example. Say you have all your data in a table and you want to start with a really basic data cleaning step: find all the nulls and replace them with zero. The basic way to do this is to take the table as a whole and do something like a pandas fillna on it, and easy, you’ve done it in one line of code.
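As a rough sketch (the table and column names here are made up for illustration), the naive nightly job looks something like this:

```python
import pandas as pd

# Hypothetical daily price table with a few gaps in it
prices = pd.DataFrame(
    {"WTI": [71.2, None, 72.8], "Brent": [None, 76.1, 77.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# The one-liner: clean the *entire* table, every single night
cleaned = prices.fillna(0)
```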
The issue with that is that when you do the same thing tomorrow, you’re recalculating a load of work you’ve already done. You only need to do that for today’s data. Now this example is very simple, you could just do the fillna on the row you need and that’d be fine and dandy if that was the only processing you’d need to do on the data. But the reality is much more complicated.
Because in most cases, data processing isn’t only done once. Some processing is universal like filling nulls, but some models might need their own specific processing for their particular assets, for example currency conversion in foreign exchange products. Often this processing is so far abstracted from the business use case that the concept of “this row of data is the latest day” isn’t even present.
But yes, in theory, if you took things use case by use case, you could write code so that only the most recent day of data gets included. But that covers just one scenario. Error corrections, new assets: things like this are unpredictable, and each one means another rerun is required.
Trying to capture all the potential edge cases is an exercise in futility. What if instead we had a way to detect when something has changed and flag exactly what needs to be run? That’s the fundamental concept behind incremental computing.
Pandas fillna for data cleaning
What Does “Incremental” Mean in Mathematical Terms?
We can frame this in terms of mathematics. We have a function f with parameters x1, ..., xn: our data. Say we introduce a change in parameter xk, so instead of xk we now have xk + Δxk. Traditional programming would just rerun the function f with the new set of inputs. Incremental programming means calculating the differential, df/dxk, and using it to update the result we already have.
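Loosely, and to first order only, the idea is:

```latex
f(x_1, \dots, x_k + \Delta x_k, \dots, x_n)
  \approx f(x_1, \dots, x_k, \dots, x_n)
  + \frac{\partial f}{\partial x_k}\,\Delta x_k
```

The first term is yesterday’s result, already computed and stored; the second term is the only new work we actually have to do.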
One clarification before we dive into the next section. When I talk about the differential below, I mean the literal mathematical differential. Incremental computing as a concept, however, abstracts out to also mean “just recalculate the bit that has changed”: this can involve recomputing parts of the function and caching the rest rather than doing any actual differentiation. Here I’ll be talking about the literal mathematical differential, and we’ll dive deeper into both of these topics in our follow-up blog.
Which already raises an interesting question. We know how to differentiate maths. How on earth do we differentiate a programming function?
The easiest way to get some of the power of incremental computing is to use partial evaluation, and our fillna example from earlier is a perfect fit. We can partially evaluate it by computing and storing the result of fillna applied to all previous days, and only calculating the result of applying fillna to the latest day of data.
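As a minimal sketch of that idea (assuming we persist yesterday’s cleaned table somewhere; all the names below are invented for illustration):

```python
import pandas as pd

def clean_incrementally(cached_clean: pd.DataFrame, raw_full: pd.DataFrame) -> pd.DataFrame:
    """Partial evaluation by hand: reuse the stored result for all previous days
    and only run fillna over the rows that are actually new."""
    new_rows = raw_full.loc[~raw_full.index.isin(cached_clean.index)]
    cleaned_new = new_rows.fillna(0)              # the only work we do tonight
    return pd.concat([cached_clean, cleaned_new])

# Yesterday's batch output, loaded from wherever we persisted it
cached = pd.DataFrame({"WTI": [71.2, 0.0]},
                      index=pd.to_datetime(["2024-01-02", "2024-01-03"]))

# Today's full raw table: the two old rows plus one new one with a gap in it
raw = pd.DataFrame({"WTI": [71.2, None, None]},
                   index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]))

print(clean_incrementally(cached, raw))           # only 2024-01-04 gets processed
```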
If you’ve ever written something in your code like full_sum += sum then you’re already doing a bit of incremental computation yourself. But much like differentiation is easy when our functions look like x + y, the real challenge is when the maths gets a little more complex.
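That running total is the whole idea in miniature:

```python
values = [3.0, 1.5, 2.2]

# From scratch: walk over every value each time something new arrives
full_sum = sum(values)

# Incrementally: one addition per new value, same answer
new_value = 4.1
full_sum += new_value
```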
When you were learning differentiation at school you probably started with things like f(x) = x^2 + 2x + 1 and were taught to bring the exponent down in front, reduce the power and zero out the constants, so x^2 becomes 2x, 2x becomes 2, 1 goes to zero, and df/dx = 2x + 2. But when it came to functions like f = log(x) you had to learn that df/dx = 1/x.
When it comes to data science, our programming functions are complex, multiplying and dividing rows of data, taking exponents and applying functions that aren’t polynomials. As a result, they generally fall under the category of “not easy to differentiate”. Instead we need a programmatic way to differentiate these functions. Enter AAD: Adjoint Automatic Differentiation.
The programming technique of recursion
The Chain Rule for Programs
Do you remember the chain rule of differentiation? For those of you who need a quick refresher, it’s the concept that if we have a function y = f(g(x)) then dy/dx = df/dg · dg/dx. A programming function is, at its most basic level, a function of functions, so if we can differentiate all of its components, we can differentiate the entire function using the chain rule.
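As a toy instance of this, take g(x) = x² and f(u) = log(u), so y = log(x²):

```latex
\frac{dy}{dx} = \frac{df}{dg}\cdot\frac{dg}{dx}
             = \frac{1}{x^2}\cdot 2x
             = \frac{2}{x}
```

A program is just this pattern nested many layers deep.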
Automatic differentiation can be done in two ways: bottom up, known as tangent (or forward) mode, or top down, known as adjoint (or reverse) mode, i.e. we start with the function we have, the top level, and move down through its component functions, computing the derivative of each in the same way. This is effectively the programming technique of recursion.
The main reason adjoint mode is used is that it is generally far more computationally efficient than the bottom-up method when the function has more inputs than outputs, which is generally the case in programming. That’s because it needs only one sweep of the dependency graph to calculate the derivatives with respect to all of the inputs.
However, it does have greater demands in the form of storage: keeping track of intermediate variables and the instructions that produced them, the so-called tape of operations. There are a few techniques that can make this more efficient, but for incremental computing this storage is actually desirable, because the stored values amount to partial evaluation, letting us reuse previous results instead of recalculating them.
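To make the tape a little less abstract, here is a deliberately tiny sketch of adjoint (reverse-mode) AD. It is nowhere near a production AAD engine, and every name in it is invented for illustration, but it shows the two halves: each operation records its parents and local derivatives (the tape), and a single backwards sweep pushes adjoints through them with the chain rule.

```python
import math

class Var:
    """A value that remembers how it was produced, so we can sweep backwards."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent Var, local derivative) pairs: our tiny "tape"
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def log(x):
    return Var(math.log(x.value), [(x, 1.0 / x.value)])

def backward(output):
    """One top-down sweep: seed d(output)/d(output) = 1, then push adjoints
    through the recorded operations, children before parents."""
    order, seen = [], set()
    def topo(node):                      # order nodes so each adjoint is complete
        if id(node) not in seen:         # before we propagate it further down
            seen.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(output)

    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad

# f(x, y) = log(x * y) + x  =>  df/dx = 1/x + 1,  df/dy = 1/y
x, y = Var(2.0), Var(5.0)
f = log(x * y) + x
backward(f)
print(x.grad, y.grad)   # 1.5  0.2
```

Every intermediate Var hangs around after the forward pass, which is exactly the memory cost described above, and exactly the kind of cached state an incremental engine wants to reuse.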
A data scientist’s toolkit
What Evil Works Can Do for You…
So in a world in which we had full AAD on our codebase, incremental computing comes more or less for free. The challenge, of course, comes from the implementation. Building graphs that trace dependencies is complex and delicate work, full of loops, nested function calls and complicated data sets. Calculating exactly how to propagate derivatives through all of those operations is not an easy task at all.
And as we mentioned earlier, differentiation is not always straightforward. Conditional statements, indexing and external library calls can easily scupper even well written AAD attempts.
There’s also the question of memory. The bigger and more complex the problem, the more tape needs to be stored, and managing it becomes a problem of its own. So incremental computing sounds amazing in theory, but how do we use it in practice?
Well, I’m talking about this because I have a little something up my sleeve. At Evil Works we’ve been hard at work implementing exactly this: a data science toolkit that has AAD baked in from the start. No more waiting for massive chunks of untouched code to rerun; we’ve taken care of all the tricky details so that you can focus on what you’re good at: running code.
We’ve also added many features that should just be standard in data science. Try to add metres to feet? We’ve got you covered, with all compatible units converted under the hood. Adding feet to degrees Celsius? We’ll error you out. We’ll even make it so that you can’t accidentally add numbers in different currencies without converting, or at least make it so that you have to explicitly ignore the warning. You can do the addition knowingly, but you can’t say we didn’t warn you!
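We can’t show the toolkit itself just yet, but if you want a feel for the kind of dimensional checking we mean, here’s a rough stand-in using the open-source pint library (this is pint’s API, not ours, and we use kelvin rather than Celsius to sidestep pint’s special handling of offset units):

```python
import pint

ureg = pint.UnitRegistry()

# Compatible units: converted under the hood
distance = 3 * ureg.metre + 2 * ureg.foot
print(distance)                      # ~3.61 meter

# Incompatible dimensions: an error instead of a silently nonsensical number
try:
    nonsense = 3 * ureg.metre + 20 * ureg.kelvin
except pint.DimensionalityError as err:
    print(err)
```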
These features and more are all part of our toolkit, which you can get your hands on this February. Click the link to find the full list of features and sign up for the closed beta.
In the meantime, let us know any questions you have about this or data science in general by visiting our Discord, head over to our YouTube channel to like and subscribe, and we’ll see you in the next one.
For more of this, come on the journey with us and keep being Evil