· 10 min read
Funnel analysis in SQL using window functions, range frames, and regular expressions
👋
I wrote this while investigating a product opportunity in the product analytics space.
We started and stopped this work in the first half of 2023, but I’m leaving this here for posterity.
Related posts:
- Introducing Baxter: deep-dive event analytics on your data warehouse
- Baxter’s free funnel analysis SQL generation tool is out!
All these posts were on our product blog.
A few weeks ago we released Baxter’s free funnel analysis SQL generator (no longer available) supporting nine different SQL dialects. You can read about how the tool works and the assumptions we make about your data in this introductory blog post
We encourage you to copy/paste the queries our tool generates for your use cases! Funnel analysis queries are tough to write and the ones we generate are well-tested and will save you time.
But we’re also proponents for understanding how the code you’re using actually works. In that spirit, we’re going to share the ideas behind the query strategies we employ for users and SQL enthusiasts alike.
This is is the first post in a series. In it, I’m going to talk about how to write funnel analysis queries using three modern, common SQL features:
- Window functions
- Range frames (instead of row frames)
- Regular expressions
These queries can analyze loosely ordered funnels with a total sequence time constraint. These terms are defined here.
We use this strategy to generate funnel analysis queries for BigQuery, Spark, ClickHouse, DuckDB, PostgreSQL, and MySQL.
Idea #1: You can use regular expressions to match funnels if you collapse sequences of events into strings.
Here’s how you might collapse a sequence of events into a string using DuckDB’s SQL dialect:
select
user_id,
string_agg(
case
when event_name = 'View Product'
then 'A'
when event_name = 'Add to Cart'
then 'B'
when event_name = 'Place Order'
then 'C'
end
) as event_sequence
from events
where ...
group by user_id
-- user_id | event_sequence
-- ------------------------
-- 1 'ABAAC'
-- 2 'AAAAAAB'
-- ...
If our funnel steps are “View Product” → “Add to Cart” → “Place Order”, the query above collapses each event into single-character identifiers and turns our funnel steps into “A” → “B” → “C”.
With strings of event sequences, using regular expressions to match how far users have progressed through a funnel is trivial and efficient.
The following query assumes we have a table called event_sequences
with columns user_id
and event_sequence
, generated by our query above.
select
-- If there's an A, a user progressed to step 1.
(select count(*)
from event_sequences
where regexp_match(event_sequence, 'A') as step_1_count,
-- If B eventually follows A, a user progressed to step 2.
(select count(*)
from event_sequences
where regexp_match(event_sequence, 'A.*B') as step_2_count,
-- ... and if C eventually follows B, a user progressed to step 3.
(select count(*)
from event_sequences
where regexp_match(event_sequence, 'A.*B.*C') as step_3_count,
from event_sequences
-- step_1_count | step_2_count | step_3_count
-- ------------------------------------------
-- 142 93 20
Note that our regular expressions permit any other events to occur between each funnel steps. This satisfies our loose ordering constraint.
Recall that we need to consider two important constraints:
- All the events must occur within a specified period: a start and end time.
- All the events must occur within a specified duration. For example, all events A, B, and C might need to occur in a 24 hour period in our analysis. (This is our total sequence time constraint.)
#1 is easy – you can filter for events that fall within the start and end time in your where clause.
#2 is trickier and where window functions with range frames come into play.
Idea #2: Assemble a sliding window of events that occur within your funnel’s duration by using window functions and range frames.
👉
Let’s assume these are our funnel analysis parameters:
- The events must occur in the month of January 2023.
- Our funnel steps are: View Product, Add to Cart, and Place Order.
- The funnel duration is 24 hours.
Let’s start with the following query, which helps us satisfy parameters #1 and #2 and collapses event sequences into strings. We’ll build on it next to satisfy parameter #3.
This query is written in DuckDB’s SQL dialect.
select
user_id,
string_agg(
case
when event_name = 'View Product'
then 'A'
when event_name = 'Add to Cart'
then 'B'
when event_name = 'Place Order'
then 'C'
end
) as event_sequence
from events
where event_timestamp > '2023-01-01' and
event_timestamp < '2023-02-01' and
(event_name = 'View Product' or
event_name = 'Add to Cart' or
event_name = 'Place Order');
group by user_id;
-- user_id | event_sequence
-- ------------------------
-- 1 'ABAAC'
-- 2 'AAAAAAB'
-- ...
We’re not doing anything fancy yet. This query assembles event sequences that:
- … match our funnel steps (View Product, Add to Cart, and Place Order)
- … and occur within the period that we’re analyzing (Jan 2023)
If we used regular expressions to count users who progressed through our funnel using the query results above, our counts would be incorrect. We have strings of event sequences, but there is no guarantee that these event sequences occurred within a 24-hour period.
The following query (using DuckDB’s SQL dialect) builds on the query above using window functions and range frames to make that guarantee.
select
user_id,
string_agg(
case
when event_name = 'View Product'
then 'A'
when event_name = 'Add to Cart'
then 'B'
when event_name = 'Place Order'
then 'C'
end
) over funnel_duration as preceding_event_sequence,
case
when event_name = 'View Product'
then 'A'
when event_name = 'Add to Cart'
then 'B'
when event_name = 'Place Order'
then 'C'
end as event_identifier
from events
where event_timestamp > '2023-01-01' and
event_timestamp < '2023-02-01' and
(event_name = 'View Product' or
event_name = 'Add to Cart' or
event_name = 'Place Order')
window funnel_duration as (
partition by user_id
order by event_timestamp
range between interval 24 hours preceding and
current row
)
-- user_id | preceding_event_sequence | event_identifier
-- -----------------------------------------------------
-- 1 '' 'A'
-- 1 'A' 'B'
-- 1 'AB' 'A'
-- 1 'ABA' 'A'
-- 1 'AA' 'C'
Unlike our first query, which had a row for every user, there exists a row for every event in this result set. And preceding_event_sequence
contains all the events that occurred within the 24 hours preceding each event!
Our first query showed that user 1’s sequence of events over the course of January 2023 was ABAAC
. Examine the event_identifier
column in the results above and you’ll find you can reconstruct user 1’s ABAAC
event sequence.
But from these results you’ll also find:
- User 1 progressed through A and B within a 24 hour period.
- User 1 did not progress through A, B, and C within a 24 hour period.
The range frame in our window specification ensures that preceding_event_sequence
maintains a sliding window of events within the 24 hours preceding each event:
...
window funnel_duration as (
partition by user_id
order by event_timestamp
-- This is the range frame.
range between interval 24 hours preceding and
current row
)
...
If you’ve used window functions before, you’re likely familiar with row frames. Row frames allow users to assemble a sliding window of some number of rows relative to the current row. Range frames are different: they allow users to assemble sliding windows of rows that satisfy some range of values relative to the current row.
In our case, those rows must have event_timestamp
values within the 24 hours preceding our current row’s event_timestamp
value.
Window functions are already useful, but range frames make them wildly powerful. They might be my favorite modern SQL feature to-date.
Assemble the rest of the query with your SQL bag-of-tricks: CTEs, aggregate functions, etc.
The ideas above decompose our funnel analysis problem into two steps that get us most of the way:
- Collapse event sequences into strings to match funnels.
- Only collect event sequences that occur within a 24 hour window.
There are a lot of ways you might assemble your final result and they rely on the techniques you use writing everyday SQL. I won’t be going into the details here, but you can poke around our funnel SQL generator (no longer available) to see how we do it.
But make no mistake: the SQL will be verbose, impossible to parameterize without using a template language like Jinja (used by DBT), and it won’t make for light bedtime reading.
A note on our MySQL implementation
Our MySQL implementation works with MySQL 8. But it’s a bit of hack that uses undocumented behavior. This is not the case for any of our other implementations!
Let me explain.
Unlike other dialects, there is no string aggregation function that works in a windowing context in MySQL. group_concat
aggregates values into a string, but it does not work as a window function so we can’t use it to implement this query strategy.
MySQL doesn’t have a native list or array data type, either! With ClickHouse and Spark, we’re able to a) aggregate event sequences into a list, and then b) concatenate all elements in that array. So we can generate ['A', 'B', 'C']
and then turn that into 'ABC'
. We can’t do that in MySQL.
MySQL does have a JSON type. And we can aggregate event sequences into a JSON array. These look like ['A', 'B', 'C']
, but there’s no equivalent function for us to turn that into 'ABC'
.
Under the hood, however, MySQL’s regular expression matching actually works on JSON arrays. It does so in a peculiar way – by matching against the string ["A", "B", "C"]
Here’s how that works:
-- 'a' eventually followed by 'b' matches
select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b');
+------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b') |
+------------------------------------------------------+
| 1 |
+------------------------------------------------------+
1 row in set (0.00 sec)
-- 'b' follows 'a' in this array. But 'ab' does not match!
mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab');
+----------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab') |
+----------------------------------------------------+
| 0 |
+----------------------------------------------------+
1 row in set (0.00 sec)
-- The string we're matching is actually:
-- ["a", "b", "c", "c", "d"]
--
-- So let's try matching 'a", "b` instead.
mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b');
+--------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b') |
+--------------------------------------------------------+
| 1 |
+--------------------------------------------------------+
1 row in set (0.00 sec)
I know this is hack! But:
- We suspect there is a small minority of users who are using MySQL for this task.
- Our MySQL queries pass all our tests.
- This strategy is significantly faster than our fallback strategy (self-joins).
If you run into issues using our MySQL implementation, or you’d like us to fall back to another strategy for it, please reach out to us.
Why can’t Redshift, Snowflake, or SQLite use this query strategy?
Different reasons!
- Redshift doesn’t support range frames in window functions. Its dialect is ancient and stuck somewhere in Postgres 8, the database it’s forked from.
- Snowflake has a rich, modern SQL dialect. But it doesn’t doesn’t support sliding range frames, which the window function/range frame strategy that we’re using requires.
- SQLite actually supports this strategy! But it won’t work out of the box. While SQLite has dedicated regular expression-matching syntax (where
REGEXP ), it actually just calls a function that users must supply. Most implementations do not provide one, so we use our fallback strategy instead.
Some observations on the above:
- It is frustrating how much Redshift’s SQL dialect lacks in features.
- Snowflake has this one odd omission in its support for window ranges – but it does implement
MATCH_RECOGNIZE
, a powerful and expressive operator that’s perfect for funnel analysis. - I am confused as to why SQLite dedicates language syntax to regular expression matching without giving users the facility to use it out of the box.
Next: stay tuned to learn how to use Snowflake’s MATCH_RECOGNIZE
to implement funnel analysis queries
Snowflake’s dialect is powerful. You can see MATCH_RECOGNIZE
in action in our funnel SQL query generator (no longer available). I’m excited to share how that works.