Using bitmaps to run interactive retention analyses over billions of events for less than $100/mo

Headshot of
                Vikram Oberoi

Vikram Oberoi

·

August 10, 2022

vikramoberoi.com

· 7 min read

Using bitmaps to run interactive retention analyses over billions of events for less than $100/mo

A cartoon of a woman being quoted $100K to run a retention analysis in a product analytics tool.
A dramatic encounter between a sales prospect and an account executive at a major product analytics SaaS company. Credit: katiebcartoons.com

Instead of paying Mixpanel over $2K/month for retention reporting functionality in 2012 Amir Salihefendic (founder of Doist, makers of Todoist) built a clever hack to get the functionality he needed. He estimates Doist has saved millions on product analytics tools through 2022 yet has run interactive retention analyses over billions of events.

Amir’s hack was to use bitmaps and bitwise operations to answer common questions about cohorts of users. He built and open-sourced bitmapist, a Python library that tracks user events in Redis bitmaps and can generate retention/cohort reports.

It looks like Doist still uses bitmapist in production ten years later. As of this writing, Amir himself pushed the latest commit three weeks ago.

This post walks you through how bitmapist works and its limitations/pitfalls. I won’t cover how to read and interpret a retention report. If you’d like a primer, here’s a pretty good one.


An illustration of a daily retention report for an app. Users enter cohorts when they sign up for the app. They return when they open the app.
Figure 1. A daily retention report for an app. Users enter cohorts when they sign up for the app. They return when they open the app.

The key idea: each cell in a retention report represents an intersection of two sets of users.

The retention report above shows:

Day 2 relative to the January 27th cohort is January 29th.

Thousands of users may have opened our app on January 29th, but 246 of them also signed up for our app on January 27th.

Here’s another way to state the above:

A = Set of users who signed up on January 27th
B = Set of users who opened the app on January 29th

size(A) = 1257
size(B) = ?
size(A ∩ B) = 246

size(A ∩ B) / size(A) = 19.6%

Every cell in a retention report is computed the same way.

The hack: store sets of users in bitmaps for fast set operations and high compression

Bitmaps are an ideal data structure for this use case:

Bitmapist maintains user bitmaps in Redis. Each bitmap is keyed on a (time bin, event) pair. If user ID 542 opens the app some time on January 25th 2022, then the bit at index 542 will be set on the bitmap with key 20220125:opened app in the example below.

An illustration of how one might store user events in Redis bitmaps with daily time bins to generate the retention report in Figure 1.
Figure 2. How one might store user events in Redis bitmaps with daily time bins to generate the retention report in Figure 1.

Figure 2 shows daily time bins and two distinct events, but the idea generalizes to any time bin (hourly, daily, weekly, monthly, etc.) and any number of distinct events.

Bitwise ANDs enable high-performance set intersections to compute counts of distinct users who did our start event on one date and our return event on another date. If you’re unfamiliar with how a bitwise AND operation works, Wikipedia has a good example. Redis implements bitwise operations via its BITOP command.

Bitmapist provides a handy Python API for clients to track events (mark_event("opened app", 542)) and transparently maintains hour/day/week/month-binned bitmaps in the background. It also provides a library to generate retention reports interactively:

An screenshot of a retention report generated by Bitmapist. Image from https://github.com/Doist/bitmapist.
Figure 3. A retention report generated by Bitmapist. Image from https://github.com/Doist/bitmapist.

Using Redis bitmaps for retention analyses is fast but extremely space-inefficient.

The Redis BITOP docs link to this article from 2011 simulating these techniques with 128MM users. Some relevant metrics from that article:

This technique is fast and compresses well, so what’s the catch?

There are two:

  1. Sparse bitmaps are space-inefficient in Redis. Redis will allocate as much memory as it needs to set the Nth bit. If your user’s ID is 8,000,000, bitmapist will set that bit in Redis, which will allocate 1MB of RAM to set it. You can see this behavior in action here.
  2. This technique stores lots of bitmaps. Let’s say you track 100 distinct events and you bin hourly (24 * 30 bins/month), daily (30 bins/month), and monthly (1 bin/month). Worst case, you’ll have ~75,000 bitmaps by the end of the month. If each bitmap is 1MB, Redis will allocate 75GB RAM.

As of 2020, Todoist had tens of millions of signups and over one million active users. That Todoist might set a bit index in the tens of millions range every day while setting few bits at a lower range (0 - 1,000,000) is highly likely.

Luckily, there’s an easy fix for this that relies on more efficient bitmap representations.

Using roaring bitmaps fixes the space-inefficiencies introduced by big, sparse Redis bitmaps.

In 2017 Doist built a standalone bitmap server to address the memory issues described above. It implements Redis’ wire protocol and the subset of Redis bitmap commands used by bitmapist so it’s easy to swap in.

It is also three orders of magnitude more space-efficient than using Redis bitmaps under a heavily used bitmapist setup:

Memory in use reported by Redis (matches RSS of the redis-server process): 129.48G.

With the same dataset migrated to standalone bitmapist server under the same load: RSS reported at about 300M.

It achieves this result by using roaring bitmaps, compressed bitmaps that still offer high-performance bitwise operations.

If you’re curious about Roaring bitmaps, I read the papers and wrote a primer on what they are and how they work.

Interactive retention analyses over billions of events for under $100/mo?

I haven’t tested this out but it’s well within the realm of possibility:

I may test this out with the Github Archive dataset (~3 billion events) one day. If you take a crack at testing these assumptions before I do, please let me know so I can link to your post.

And if you work at Doist, let us know what hardware you run this on!

This technique nets you affordable and efficient distinct counts, set operations, and retention analyses at scale. But the benefits end there.

Spending $100/mo for interactive retention analyses over billions of events is incredibly affordable.

With an event volume in the 10’s to 100’s of millions per month you’d be forking over $1000’s/month to Amplitude, Heap, or Mixpanel today. Amplitude’s free tier ends at a 10MM event/mo. At that point, their annual contracts begin at ~$30K/year.

But these companies also provide a useful suite of tools beyond retention analyses that the techniques underlying bitmapist do not support. If you need funnel or path analyses, for example, you’ll need to pony up the cash – this “swiss army knife” model of product analytics unfortunately commands a premium.

Further reading

Thanks to Gaurav Oberoi, Chuck Groom, and Pierre Jambet.