Using bitmaps to run interactive retention analyses over billions of events for less than $100/mo
Instead of paying Mixpanel over $2K/month for retention reporting functionality in 2012 Amir Salihefendic (founder of Doist, makers of Todoist) built a clever hack to get the functionality he needed. He estimates Doist has saved millions on product analytics tools through 2022 yet has run interactive retention analyses over billions of events.
Amir's hack was to use bitmaps and bitwise operations to answer common questions about cohorts of users. He built and open-sourced bitmapist, a Python library that tracks user events in Redis bitmaps and can generate retention/cohort reports.
It looks like Doist still uses bitmapist in production ten years later. As of this writing, Amir himself pushed the latest commit three weeks ago.
This post walks you through how bitmapist works and its limitations/pitfalls. I won't cover how to read and interpret a retention report. If you'd like a primer, here's a pretty good one.
The key idea: each cell in a retention report represents an intersection of two sets of users.
The retention report above shows:
- On January 27th, 1,257 users signed up for our app.
- On Day 2, 246 (19.6%) of the January 27th cohort opened the app.
Day 2 relative to the January 27th cohort is January 29th.
Thousands of users may have opened our app on January 29th, but 246 of them also signed up for our app on January 27th.
Here's another way to state the above:
A = Set of users who signed up on January 27th B = Set of users who opened the app on January 29th size(A) = 1257 size(B) = ? size(A ∩ B) = 246 size(A ∩ B) / size(A) = 19.6%
Every cell in a retention report is computed the same way.
The hack: store sets of users in bitmaps for fast set operations and high compression
Bitmaps are an ideal data structure for this use case:
- Bitwise operations provide fast set intersections and unions.
- Bitmaps compress this data well. A set of 8 million users fits in 1 megabyte.
Bitmapist maintains user bitmaps in Redis. Each bitmap is keyed on a
(time bin, event) pair. If user ID 542 opens the app some time on January 25th 2022, then the bit at index 542 will be set on the bitmap with key
20220125:opened app in the example below.
Figure 2 shows daily time bins and two distinct events, but the idea generalizes to any time bin (hourly, daily, weekly, monthly, etc.) and any number of distinct events.
Bitwise ANDs enable high-performance set intersections to compute counts of distinct users who did our start event on one date and our return event on another date. If you're unfamiliar with how a bitwise AND operation works, Wikipedia has a good example. Redis implements bitwise operations via its BITOP command.
Bitmapist provides a handy Python API for clients to track events (
mark_event("opened app", 542)) and transparently maintains hour/day/week/month-binned bitmaps in the background. It also provides a library to generate retention reports interactively:
Using Redis bitmaps for retention analyses is fast but extremely space-inefficient.
The Redis BITOP docs link to this article from 2011 simulating these techniques with 128MM users. Some relevant metrics from that article:
- Set counts and intersections over 128MM users are fast. Interactive retention analyses are easily achievable with this technique.
- 50ms to count all the bits set in one bitmap.
- 392ms to intersect 7 bitmaps (a week of daily bins).
- 1624ms to intersect 30 bitmaps (a month of daily bins).
- A bitmap of 128MM users takes up 16MB of memory. It's nice to see this play out predictably. You can confirm this easily on your own machine.
- Remember: Each bit is a distinct user.
- 1MB stores 8 million bits.
- Multiply that by 16 and you get 16MB and 128 million bits.
This technique is fast and compresses well, so what's the catch?
There are two:
- Sparse bitmaps are space-inefficient in Redis. Redis will allocate as much memory as it needs to set the Nth bit. If your user's ID is 8,000,000, bitmapist will set that bit in Redis, which will allocate 1MB of RAM to set it. You can see this behavior in action here.
- This technique stores lots of bitmaps. Let's say you track 100 distinct events and you bin hourly (24 * 30 bins/month), daily (30 bins/month), and monthly (1 bin/month). Worst case, you'll have ~75,000 bitmaps by the end of the month. If each bitmap is 1MB, Redis will allocate 75GB RAM.
As of 2020, Todoist had tens of millions of signups and over one million active users. That Todoist might set a bit index in the tens of millions range every day while setting few bits at a lower range (0 - 1,000,000) is highly likely.
Luckily, there's an easy fix for this that relies on more efficient bitmap representations.
Using roaring bitmaps fixes the space-inefficiencies introduced by big, sparse Redis bitmaps.
In 2017 Doist built a standalone bitmap server to address the memory issues described above. It implements Redis' wire protocol and the subset of Redis bitmap commands used by bitmapist so it's easy to swap in.
It is also three orders of magnitude more space-efficient than using Redis bitmaps under a heavily used bitmapist setup:
Memory in use reported by Redis (matches RSS of the redis-server process): 129.48G.
With the same dataset migrated to standalone bitmapist server under the same load: RSS reported at about 300M.
It achieves this result by using roaring bitmaps, compressed bitmaps that still offer high-performance bitwise operations.
If you're curious about Roaring bitmaps, I read the papers and wrote a primer on what they are and how they work.
Interactive retention analyses over billions of events for under $100/mo?
I haven't tested this out but it's well within the realm of possibility:
- In 2015, Doist was storing hundreds of millions events using bitmapist.
- In 2017, Doist's roaring bitmap-based server compressed at least hundreds of millions of events from ~130GB to ~300MB.
- A factor of 10 increase in the above event volume is billions of events. Assuming memory requirements scale linearly, you'll need 3GB RAM.
- Many AWS EC2 instance types will fit 3GB comfortably in physical RAM for under $100/mo.
I may test this out with the Github Archive dataset (~3 billion events) one day. If you take a crack at testing these assumptions before I do, please let me know so I can link to your post.
And if you work at Doist, let us know what hardware you run this on!
This technique nets you affordable and efficient distinct counts, set operations, and retention analyses at scale. But the benefits end there.
Spending $100/mo for interactive retention analyses over billions of events is incredibly affordable.
With an event volume in the 10's to 100's of millions per month you'd be forking over $1000's/month to Amplitude, Heap, or Mixpanel today. Amplitude's free tier ends at a 10MM event/mo. At that point, their annual contracts begin at ~$30K/year.
But these companies also provide a useful suite of tools beyond retention analyses that the techniques underlying bitmapist do not support. If you need funnel or path analyses, for example, you'll need to pony up the cash – this "swiss army knife" model of product analytics unfortunately commands a premium.
- The blog post introducing bitmapist
- Doist's bitmapist library on Github
- Doist's roaring bitmap-backed bitmap server
- An article from 2011 on real-time metrics using Redis bitmaps
- Redis bitmaps documentation
ShareThanks to Gaurav Oberoi, Chuck Groom, and Pierre Jambet.