October 31, 2023

  • 10 mins

Identifying Users In The Data Warehouse

István Mészáros

User identification in data is a recurring problem for warehouse-first B2C companies. A single user can be identified with multiple aliases in the datasets:

✅ user_id - ID after sign-in
✅ device_id - Mobile or desktop device's ID
✅ anonymous_id - Tracking cookie
✅ customer_id - Stripe
✅ contact_id - Hubspot
✅ email - Used for sign-in
... and many more.

Picking a single identifier for user identification is risky. You may end up:
❌ not being able to use that identifier for all your users
❌ multiple identifiers for the same user

ℹ️ The correct solution is to merge all these IDs into a single merged ID and create a user alias table where all aliases are associated with their merged IDs.

This approach is called retroactive user recognition and it simplifies data modelling significantly.
You won't have to think about which ID to use in joins. You can safely use the merged ID once they are backfilled to the all tables.

Retroactive User Recognition Example

It is the simplest to explain this challenge with an example.  Imagine you have a dataset that contains page visits of users. 

It has four columns:

  • event_time
  • anonymous_id
  • email
  • url


Clickstream events table

The anonymous_id is always filled from the browser's cookie. However, it changes if the browser cookie is refreshed. The email is filled once the user logs into the application. However, it changes if the user logs in with another email. Now, you need to answer a simple question: 

What was the first URL that a user visited?

As you can see, no single column can uniquely identify the user. In fact, all of its anonymous_ids and emails identify the users. The anonymous_id and the emails are only aliases for the same user. What you need is a user_alias table that contains all aliases. The alias table is the association of all the users' aliases to a single merged_id that can be used to track that user across all datasets.

Example aliases table


Creating and backfilling all aliases where it is missing is the process of retroactive user recognition.

Creating The User Aliases Table

Creating the user aliases table is way more complex than you think. Most people approach it with some special joins and group bys. However, building that alias table is impossible with only simple joins and group bys with SQL. 

Why? Because the problem of user aliases is, in fact, a graph problem. 

User with anonymous_id aid_1 signs in with email (john.doe@gmail.com). The browser clears the cookie at some point, so the user is also signed out. The user signs in again while having the new cookie. At another point, the user switches to the company email (john.doe@professional.com), which is typical for SaaS businesses. We are still talking about the same user who has four aliases.

Here is the change graph of alias associations for our user:

User alias association graph


As you can see, none of the anonymous_ids or emails are an excellent identifier for the user. We need to define a new ID uniquely associated with the user. The best way to do this is to have the whole graph as the source of the unique ID.

Once we have a unique ID, we can backfill it to all events in the table and be able to support the two original requirements.


Walking The Graph

To map out all the aliases, we must walk through all edges in this graph. Graph walking in SQL is only possible to do with recursive CTEs. 
In this example we are using Postgres SQL, with minor changes this SQL can ported to all SQL dialects.

First, we need to define the edges:



create table default.user_aliases as 
WITH RECURSIVE 
-- collecting the graph edges
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b -- making sure users without sign up are also included
  from page_visits  
),

-- making sure the graph is bidirectional
bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
)

Once we have, we must apply the so-called "connected components" algorithm to walk through the graph edges and detect if the nodes are part of the same sub-graph.



-- The connected components algorithm in SQL:
-- walking the graph to collect all aliases
walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)

Here is the result of the Walks CTE. You can see which node (a) are reachable from the nodes (b) in the graph. Of course in our single example every node is reachable.

All possible association with connected components

We need some additional cleaning and finalization:



create table default.user_aliases as 
WITH RECURSIVE 
-- collecting the graph edges
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b -- making sure users without sign up are also included
  from page_visits 
),

-- making sure the graph is bidirectional
bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
),

-- The connected components algorithm in SQL:
-- walking the graph to collect all aliases
walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)

-- picking the merged_id for all aliases
select 
  min(b) merged_id,
  a as alias
from walks
group by 2

Below, you can see what our final user_aliases table will look like (notice the merged_id is one of the aliases):


Simple aliases table

Now, you can join this table from the events and do a backfill.


Supporting All Tables

In our example, we had only a table and two user alias types. The added benefit of this approach is that it is extendable to any number of tables and aliases. Extendibility is the beauty of the connected components algorithm.

Here is an example of how a typical SaaS company would use this algorithm to recognize all of its users retroactively. In this example, we have:

  • Hubspot contact_id and email pairs
  • Stripe customer_id and email pairs
  • Product anonymous_id and email pairs

By adding the additional edges, we will have a more complete alias table with Stripe customer IDs and Hubspot contact IDs.

Extended aliases table


The connected components covering all these IDs enable the data team to join all datasets. 

Joining across all data sets from different domains unlocks access to the entire user lifecycle:

  • Landing page visit
  • Sign up
  • Engagement with the product
  • Subscription started (e.g. in Stripe)
  • The sales team contacted the person for upselling
  • Subscription cancelled

Here is an example how to extend this SQL to support different datasources. With this creating the user aliases table has never been easier before:


create table default.user_aliases as 
WITH RECURSIVE 
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b 
  from page_visits
  union all
  select 
    contat_id as a,
    email as b
  from hubspot_contacts -- extending with Hubspot
  union all
  select 
    customer_id as a,
    email as b
  from stripe_customers  -- extending with Stripe
),

bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
),

walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)

select 
  min(b) merged_id,
  a as alias
from walks
group by 2

Final Thoughts

‍With the help of the connected components algorithm, we can map out correctly all the aliases of every user into a single table. We can then use this table to backfill all event (fact) tables with the merged_id.
Once we backfill the merged IDs to the event tables, we can easily track the users across all datasets. With that, a whole new world of data analytics reveals itself as joining across datasets becomes a simple exercise.

Explore warehouse native product analytics

See how you can benefit from warehouse native product analytics

Blogs for your growth