Funnel analysis in SQL using window functions, range frames, and regular expressions

Headshot of
                Vikram Oberoi

Vikram Oberoi

·

February 22, 2023

vikramoberoi.com

· 10 min read

Funnel analysis in SQL using window functions, range frames, and regular expressions

A few weeks ago we released Baxter’s free funnel analysis SQL generator (no longer available) supporting nine different SQL dialects. You can read about how the tool works and the assumptions we make about your data in this introductory blog post

We encourage you to copy/paste the queries our tool generates for your use cases! Funnel analysis queries are tough to write and the ones we generate are well-tested and will save you time.

But we’re also proponents for understanding how the code you’re using actually works. In that spirit, we’re going to share the ideas behind the query strategies we employ for users and SQL enthusiasts alike.

This is is the first post in a series. In it, I’m going to talk about how to write funnel analysis queries using three modern, common SQL features:

These queries can analyze loosely ordered funnels with a total sequence time constraint. These terms are defined here.

We use this strategy to generate funnel analysis queries for BigQuery, Spark, ClickHouse, DuckDB, PostgreSQL, and MySQL.

Idea #1: You can use regular expressions to match funnels if you collapse sequences of events into strings.

Here’s how you might collapse a sequence of events into a string using DuckDB’s SQL dialect:


select
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) as event_sequence
from events
where ...
group by user_id

-- user_id | event_sequence
-- ------------------------
--       1   'ABAAC'
--       2   'AAAAAAB'
-- ...

If our funnel steps are “View Product” → “Add to Cart” → “Place Order”, the query above collapses each event into single-character identifiers and turns our funnel steps into “A” → “B” → “C”.

With strings of event sequences, using regular expressions to match how far users have progressed through a funnel is trivial and efficient.

The following query assumes we have a table called event_sequences with columns user_id and event_sequence, generated by our query above.

select
    -- If there's an A, a user progressed to step 1.
    (select count(*)
     from event_sequences
     where regexp_match(event_sequence, 'A') as step_1_count,

    -- If B eventually follows A, a user progressed to step 2.
    (select count(*)
     from event_sequences
     where regexp_match(event_sequence, 'A.*B') as step_2_count,

    -- ... and if C eventually follows B, a user progressed to step 3.
    (select count(*)
     from event_sequences
     where regexp_match(event_sequence, 'A.*B.*C') as step_3_count,

from event_sequences

-- step_1_count | step_2_count | step_3_count
-- ------------------------------------------
--          142             93             20

Note that our regular expressions permit any other events to occur between each funnel steps. This satisfies our loose ordering constraint.

Recall that we need to consider two important constraints:

  1. All the events must occur within a specified period: a start and end time.
  2. All the events must occur within a specified duration. For example, all events A, B, and C might need to occur in a 24 hour period in our analysis. (This is our total sequence time constraint.)

#1 is easy – you can filter for events that fall within the start and end time in your where clause.

#2 is trickier and where window functions with range frames come into play.

Idea #2: Assemble a sliding window of events that occur within your funnel’s duration by using window functions and range frames.

Let’s assume these are our funnel analysis parameters:

  1. The events must occur in the month of January 2023.
  2. Our funnel steps are: View Product, Add to Cart, and Place Order.
  3. The funnel duration is 24 hours.

Let’s start with the following query, which helps us satisfy parameters #1 and #2 and collapses event sequences into strings. We’ll build on it next to satisfy parameter #3.

This query is written in DuckDB’s SQL dialect.

select
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) as event_sequence
from events
where event_timestamp > '2023-01-01' and
      event_timestamp < '2023-02-01' and
      (event_name = 'View Product' or
       event_name = 'Add to Cart' or
       event_name = 'Place Order');
group by user_id;

-- user_id | event_sequence
-- ------------------------
--       1   'ABAAC'
--       2   'AAAAAAB'
-- ...

We’re not doing anything fancy yet. This query assembles event sequences that:

If we used regular expressions to count users who progressed through our funnel using the query results above, our counts would be incorrect. We have strings of event sequences, but there is no guarantee that these event sequences occurred within a 24-hour period.

The following query (using DuckDB’s SQL dialect) builds on the query above using window functions and range frames to make that guarantee.

select
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) over funnel_duration as preceding_event_sequence,
    case
        when event_name = 'View Product'
        then 'A'
        when event_name = 'Add to Cart'
        then 'B'
        when event_name = 'Place Order'
        then 'C'
    end as event_identifier
from events
where event_timestamp > '2023-01-01' and
      event_timestamp < '2023-02-01' and
      (event_name = 'View Product' or
       event_name = 'Add to Cart' or
       event_name = 'Place Order')
window funnel_duration as (
    partition by user_id
    order by event_timestamp
    range between interval 24 hours preceding and
                  current row
)

-- user_id | preceding_event_sequence | event_identifier
-- -----------------------------------------------------
--       1   ''                         'A'
--       1   'A'                        'B'
--       1   'AB'                       'A'
--       1   'ABA'                      'A'
--       1   'AA'                       'C'

Unlike our first query, which had a row for every user, there exists a row for every event in this result set. And preceding_event_sequence contains all the events that occurred within the 24 hours preceding each event!

Our first query showed that user 1’s sequence of events over the course of January 2023 was ABAAC. Examine the event_identifier column in the results above and you’ll find you can reconstruct user 1’s ABAAC event sequence.

But from these results you’ll also find:

The range frame in our window specification ensures that preceding_event_sequence maintains a sliding window of events within the 24 hours preceding each event:

...
window funnel_duration as (
    partition by user_id
    order by event_timestamp

    -- This is the range frame.
    range between interval 24 hours preceding and
                  current row
)
...

If you’ve used window functions before, you’re likely familiar with row frames. Row frames allow users to assemble a sliding window of some number of rows relative to the current row. Range frames are different: they allow users to assemble sliding windows of rows that satisfy some range of values relative to the current row.

In our case, those rows must have event_timestamp values within the 24 hours preceding our current row’s event_timestamp value.

Window functions are already useful, but range frames make them wildly powerful. They might be my favorite modern SQL feature to-date.

Assemble the rest of the query with your SQL bag-of-tricks: CTEs, aggregate functions, etc.

The ideas above decompose our funnel analysis problem into two steps that get us most of the way:

  1. Collapse event sequences into strings to match funnels.
  2. Only collect event sequences that occur within a 24 hour window.

There are a lot of ways you might assemble your final result and they rely on the techniques you use writing everyday SQL. I won’t be going into the details here, but you can poke around our funnel SQL generator (no longer available) to see how we do it.

But make no mistake: the SQL will be verbose, impossible to parameterize without using a template language like Jinja (used by DBT), and it won’t make for light bedtime reading.

A note on our MySQL implementation

Our MySQL implementation works with MySQL 8. But it’s a bit of hack that uses undocumented behavior. This is not the case for any of our other implementations!

Let me explain.

Unlike other dialects, there is no string aggregation function that works in a windowing context in MySQL. group_concat aggregates values into a string, but it does not work as a window function so we can’t use it to implement this query strategy.

MySQL doesn’t have a native list or array data type, either! With ClickHouse and Spark, we’re able to a) aggregate event sequences into a list, and then b) concatenate all elements in that array. So we can generate ['A', 'B', 'C'] and then turn that into 'ABC'. We can’t do that in MySQL.

MySQL does have a JSON type. And we can aggregate event sequences into a JSON array. These look like ['A', 'B', 'C'], but there’s no equivalent function for us to turn that into 'ABC'.

Under the hood, however, MySQL’s regular expression matching actually works on JSON arrays. It does so in a peculiar way – by matching against the string ["A", "B", "C"]

Here’s how that works:

-- 'a' eventually followed by 'b' matches

select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b');
+------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b') |
+------------------------------------------------------+
|                                                    1 |
+------------------------------------------------------+
1 row in set (0.00 sec)



-- 'b' follows 'a' in this array. But 'ab' does not match!

mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab');
+----------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab') |
+----------------------------------------------------+
|                                                  0 |
+----------------------------------------------------+
1 row in set (0.00 sec)



-- The string we're matching is actually:
-- ["a", "b", "c", "c", "d"]
--
-- So let's try matching 'a", "b` instead.

mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b');
+--------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b') |
+--------------------------------------------------------+
|                                                      1 |
+--------------------------------------------------------+
1 row in set (0.00 sec)

I know this is hack! But:

If you run into issues using our MySQL implementation, or you’d like us to fall back to another strategy for it, please reach out to us.

Why can’t Redshift, Snowflake, or SQLite use this query strategy?

Different reasons!

Some observations on the above:

Next: stay tuned to learn how to use Snowflake’s MATCH_RECOGNIZE to implement funnel analysis queries

Snowflake’s dialect is powerful. You can see MATCH_RECOGNIZE in action in our funnel SQL query generator (no longer available). I’m excited to share how that works.