I Asked 3 Data Analysts The Same Question. The Answers Made Me Start My Startup
István Mészáros
In my previous company, I led a team of data analysts and engineers at a medium-sized SaaS product development company.
Basically, product managers, marketers, salespeople, and others came to us with requests, and we tried to deliver answers ASAP. We collected the usage data from the web application into our data warehouse with Segment.
We avoided Google Analytics, Mixpanel, and similar tools for security and privacy reasons. Also, transferring the data to external tools would have increased our operating costs significantly.
All this meant that data analysts answered every product-related question directly in the data warehouse.
Measure With (Guilty) Pleasure
One of the main metrics the company wanted to track was cohort user retention. With that, we wanted to see what impact the application changes had on newly acquired users.
Measuring cohort retention using SQL seemed simple enough as we already had all user behavioural data in our data warehouse. However, a devil sitting on my shoulder nudged me to do a little experiment.
I asked the three data analysts in my team the same question without them knowing about it:
“What is the week-over-week retention of our users who visited our website in the week of 2023-01-02?”
Who Was Right?
To my surprise, all three data analysts came back with different results. Below you can see the differences for the 2023-01-02 cohort.
The charts show the percentage of users who visited the application week over week after the initial week of 2023-01-02.
Here are the 3 SQL snippets. I simplified them a bit so they are easier to read.
Junior Analyst’s Code
WITH weekly_users AS (
    SELECT
        date_trunc('WEEK', pe.event_time) AS week,
        count(DISTINCT pe.user_id) AS user_count
    FROM page_visits pe
    GROUP BY 1
),
first_week AS (
    SELECT
        user_count,
        week
    FROM weekly_users
    WHERE week = DATE('2023-01-02')
)
SELECT
    fw.week AS cohort_week,
    wu.week AS retention_week,
    fw.user_count * 100.0 / wu.user_count AS retention_rate
FROM weekly_users wu
JOIN first_week fw ON wu.week >= fw.week
This one has an obvious flaw: the JOIN ignores user IDs entirely. The final SELECT doesn't check whether the same users who visited in the initial week came back later; it only compares weekly visitor totals.
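For comparison, here is a minimal sketch of how the same idea could track actual returning users by keeping the cohort's user IDs and joining on them (table and column names follow the snippets above; this is not the query the team ended up using):
WITH cohort_users AS (
    -- users who visited during the cohort week
    SELECT DISTINCT user_id
    FROM page_visits
    WHERE date_trunc('WEEK', event_time) = DATE('2023-01-02')
),
weekly_visits AS (
    SELECT DISTINCT
        user_id,
        date_trunc('WEEK', event_time) AS week
    FROM page_visits
)
SELECT
    wv.week AS retention_week,
    -- share of the original cohort still active in the given week
    count(DISTINCT wv.user_id) * 100.0 / (SELECT count(*) FROM cohort_users) AS retention_rate
FROM weekly_visits wv
JOIN cohort_users cu ON cu.user_id = wv.user_id
WHERE wv.week >= DATE('2023-01-02')
GROUP BY 1
ORDER BY 1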
First Senior Analyst’s Code
WITH events AS (
    SELECT DISTINCT
        user_id,
        date_trunc('WEEK', event_time) AS week
    FROM page_visits
)
SELECT
    c.week AS cohort_week,
    (count(DISTINCT w1.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_1,
    (count(DISTINCT w2.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_2,
    (count(DISTINCT w3.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_3,
    (count(DISTINCT w4.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_4,
    (count(DISTINCT w5.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_5,
    (count(DISTINCT w6.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_6
FROM events c
LEFT JOIN events w1 ON c.user_id = w1.user_id AND w1.week = c.week + interval '7' day
LEFT JOIN events w2 ON c.user_id = w2.user_id AND w2.week = c.week + interval '14' day
LEFT JOIN events w3 ON c.user_id = w3.user_id AND w3.week = c.week + interval '21' day
LEFT JOIN events w4 ON c.user_id = w4.user_id AND w4.week = c.week + interval '28' day
LEFT JOIN events w5 ON c.user_id = w5.user_id AND w5.week = c.week + interval '35' day
LEFT JOIN events w6 ON c.user_id = w6.user_id AND w6.week = c.week + interval '42' day
WHERE c.week = DATE('2023-01-02')
GROUP BY 1
This one was cool as it was very easy to read. However, my main problem with it is that it is difficult to extend and iterate on: every additional week means another self-join and another column. For the same reason it is not well suited for BI tools.
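To illustrate, stretching this query by just one more week would roughly require these two additions (a hypothetical week_7 extension, following the naming above):
-- one more column in the SELECT list
(count(DISTINCT w7.user_id) * 100.0) / count(DISTINCT c.user_id) AS week_7,
-- and one more self-join before the WHERE clause
LEFT JOIN events w7 ON c.user_id = w7.user_id AND w7.week = c.week + interval '49' day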
Second Senior Analyst's Code
WITH cohort AS (
    SELECT
        user_id,
        DATE_TRUNC('week', MIN(event_time)) AS cohort_week
    FROM page_visits
    GROUP BY 1
),
retention AS (
    SELECT
        cohort.cohort_week,
        DATE_TRUNC('week', event_time) AS event_week,
        COUNT(DISTINCT page_visits.user_id) AS weekly_active_users
    FROM cohort
    JOIN page_visits ON cohort.user_id = page_visits.user_id
    GROUP BY 1, 2
)
SELECT
    retention.cohort_week,
    retention.event_week,
    round(retention.weekly_active_users * 100.0 / cohort_size, 2) AS retention_rate
FROM retention
JOIN (
    SELECT
        cohort_week,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM cohort
    GROUP BY 1
) AS cohort_size ON retention.cohort_week = cohort_size.cohort_week
WHERE retention.cohort_week = DATE('2023-01-02')
ORDER BY 1, 2;
I liked this one much more, as it was robust and easy to extend. For example, switching from week-over-week to month-over-month retention would require only two changes and a couple of renamings.
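Roughly, those two changes are the truncation granularities in the cohort and retention CTEs (a sketch; the rest of the query stays the same apart from renaming the week columns to months):
-- in the cohort CTE
DATE_TRUNC('month', MIN(event_time)) AS cohort_month
-- in the retention CTE
DATE_TRUNC('month', event_time) AS event_month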
And The Winner Is! But…
We kept the second senior analyst’s solution. However, it got me thinking:
It took (on average) 1 hour to write the queries. Is this the most efficient way to get these results?
The second analyst’s results were very close to the first’s. Who was correct? (hint: neither of them, which I will cover in a future blog post 😃)
We get many similar product usage-related questions about user funnels and segmentation every day. Can't these tasks be automated?
My First Startup Project
With this short story, I wanted to highlight 3 problems with product analytics in the data warehouse:
It can be tediously slow. Product managers, sales, and marketing people in an organisation have hundreds of similar questions daily, and data analysts have more critical tasks than answering them all.
The results depend on who you ask: analysts with different backgrounds and experience will answer the same product analytics question differently.
I believe non-technical people should be able to gain insights from company data without a middleman and without copying data to external tools like GA or Mixpanel.
These problems motivated me to start working on my first startup project mitzu.io.
Mitzu is:
No-code
Self-serve
Product analytics web application
Supports: Snowflake, Databricks, Redshift, Athena, Postgres, MySQL, Trino, etc.
Handles retention, user funnel, and event segmentation questions (and more)
Cohort user retention in Mitzu
Putting together a cohort user retention metric takes less than a minute.
Mitzu generates SQL that has been reviewed by multiple data analysts and data scientists.
WITH anon_1 AS
(SELECT DISTINCT ecommerce.page_events.user_id AS _cte_user_id,
date_trunc('MINUTE', ecommerce.page_events.event_time) AS _cte_datetime,
NULL AS _cte_group
FROM ecommerce.page_events
WHERE ecommerce.page_events.event_name = 'page_visit'
AND ecommerce.page_events.event_time >= timestamp '2021-10-01 00:00:00'
AND ecommerce.page_events.event_time <= timestamp '2021-12-01 00:00:00'
AND TRUE),
anon_3 AS
(SELECT 0 AS _ret_index
UNION SELECT 1 AS _ret_index
UNION SELECT 2 AS _ret_index
UNION SELECT 3 AS _ret_index
UNION SELECT 4 AS _ret_index
UNION SELECT 5 AS _ret_index
UNION SELECT 6 AS _ret_index
UNION SELECT 7 AS _ret_index
UNION SELECT 8 AS _ret_index),
anon_2 AS
(SELECT DISTINCT ecommerce.checkout_events.user_id AS _cte_user_id,
date_trunc('MINUTE', ecommerce.checkout_events.event_time) AS _cte_datetime,
NULL AS _cte_group
FROM ecommerce.checkout_events
WHERE ecommerce.checkout_events.event_name = 'checkout'
AND ecommerce.checkout_events.event_time >= timestamp '2021-10-01 00:00:00'
AND ecommerce.checkout_events.event_time <= timestamp '2021-12-08 00:00:00'
AND TRUE)
SELECT date_trunc('WEEK', anon_1._cte_datetime) AS _datetime,
NULL AS _group,
ret_indeces._ret_index,
count(DISTINCT anon_1._cte_user_id) AS _user_count_1,
count(DISTINCT anon_2._cte_user_id) AS _user_count_2,
(count(DISTINCT anon_2._cte_user_id) * 100.0) / count(DISTINCT anon_1._cte_user_id) AS _agg_value
FROM anon_1
JOIN anon_3 AS ret_indeces ON TRUE
LEFT OUTER JOIN anon_2 ON anon_1._cte_user_id = anon_2._cte_user_id
AND anon_2._cte_datetime > date_add('week', ret_indeces._ret_index, anon_1._cte_datetime)
AND anon_2._cte_datetime <= date_add('week', 1, date_add('week', ret_indeces._ret_index, anon_1._cte_datetime))
GROUP BY 1,
2,
3
It is hard to read, but that is not the goal here.
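If you are curious, the pattern behind it is roughly this: build a small table of retention-period indices, cross join it to the cohort events, and left join the return events whose timestamps fall into the corresponding period window. A simplified sketch of that pattern (not the exact generated SQL):
WITH cohort AS (
    -- one row per user and cohort week
    SELECT DISTINCT user_id, date_trunc('week', event_time) AS cohort_week
    FROM ecommerce.page_events
    WHERE event_name = 'page_visit'
),
ret_index AS (
    -- retention periods to evaluate
    SELECT 0 AS idx UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
),
returns AS (
    SELECT DISTINCT user_id, event_time
    FROM ecommerce.checkout_events
    WHERE event_name = 'checkout'
)
SELECT
    c.cohort_week,
    r.idx,
    count(DISTINCT ret.user_id) * 100.0 / count(DISTINCT c.user_id) AS retention_rate
FROM cohort c
CROSS JOIN ret_index r
LEFT JOIN returns ret
    ON ret.user_id = c.user_id
    AND ret.event_time > date_add('week', r.idx, c.cohort_week)
    AND ret.event_time <= date_add('week', r.idx + 1, c.cohort_week)
GROUP BY 1, 2
ORDER BY 1, 2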
Conclusion
Product analytics with SQL over the data warehouse:
can be very slow
produces inconsistent results
scales poorly since data analysts are a bottleneck
Mitzu is a tool that can help any product development company analyze behavioural usage data natively from the data warehouse.