January 24, 2024

  • 10 minutes

Modeling Product Events In The Data Warehouse

István Mészáros

Modeling Product Events In The Data Warehouse

Teaser

Collecting, storing, and modeling product events in a data lake or warehouse has many benefits.

  • In 2024, most data lake and warehouse solutions have usage-based pricing models. This means the analytics computation drives up the cost, not the amount of data.
  • Your data team is in total control. Fixing quality issues, consolidating user identification, etc., is fine when you can update your data.
  • A single source of truth, the same data that describes the business, can be reused by all stakeholders.

Switching to warehouse-native product event tracking makes much sense for mature businesses. The devil is in the details, as usual. In this post, I will cover 3 models of how you can store product events in the data warehouse and what factors you should consider before deciding on the chosen model.

Factors To Consider For A Product Event Model

  1. Cost - how much will it cost you to manage and maintain the model? This factor is about the human effort that needs to be put in place.
  2. Ease of usage - will the data analytics and other stakeholders be able to understand your chosen model? How hard will it be to onboard new people to it? How hard it will be to use this model? Overall mental capacity to understand the model.
  3. Speed - will your model be responsive and fast to query? How long does an SQL query run before the result is ready?

These three factors are present in each of the product event models, and as you will soon see, there is no silver bullet solution. It depends on the business which you should choose.

2+1 Models Of Product Events

There are 2+1 product event models that most companies should consider. I am 100% sure there are more models. However, most companies should get along with these three.

1. Model: Single Table Per Event Type

A single table per event type is a model that most likely comes first to your mind as a data engineer. There will be precisely one table in the data warehouse for every event type the client application produces.

Product Event Tables (Single event per table)
Single event table

The greatness of this model lies in its simplicity. It is easy to explain and, for the most part, to query. Every table has its own set of columns corresponding to the events' properties. It is also fast to query as each table can be partitioned or indexed by the event's timestamps. The name of the table describes the type of event precisely.

The problem comes when the client application produces hundreds or thousands of different event types. Just imagine a complex mobile application or a game. Maintaining many tables in the modeling infrastructure like DBT gets exponentially more challenging.

Just imagine renaming an event property (column) or fixing column values in each of the 1000 tables.

Scores (the higher, the better)

  • Cost of maintenance: 6/10 (average)
    - 2/10
    - for complex applications for more than 50 event types
    - 10/10
    - for simple applications with less than 50 event types.
  • Clarity 10/10
  • Speed 10/10

Wide Multi Event Table

This product event model is the natural answer to the disadvantages of the single-wide event table model. If the data team knows that the client application will produce 1000 different event types, they can decide to store them in a single giant table.

Wide multi product event table

The upside of this product event model is that the data team can maintain it relatively easily compared to the 1000 event tables.

However, this solution has multiple issues, rendering it impractical.

The Problems:

  • Each event type has event properties, resulting in an ever-growing number of columns in the table. Imagine 1000 event types, where every event type has at least one event property that is unique to that event. This will result in at least 1000 different columns in the table.
  • Partitioning and indexing this table is more complicated as well. The natural date partitioning will yield poor query performance. Why? Because most queries will filter on a set of event types only. However, those queries will scan all event types. Partitioning on the event type column will result in a more complex ingestion process. Especially in the case of a data lake, this will result in many small files.
  • Understanding the 1000 columns in this table is a challenge, especially because most event types will have most of these columns filled with Nulls. Forget your lazy select * from queries, which will cost a fortune to execute.

Score (the higher, the better)

  • Cost of maintenance: 8/10
  • Clarity 3/10
  • Speed 8/10

Short Multi Event Table

The biggest downside of the wide multi-event tables is that most of the thousand columns are filled with NULLs for most event types. SQL language dialects don't give an option to ignore null-filled columns. This makes it impractical for data analysts to understand and use this product event model. Determining which columns contain values for which event type is a constant struggle.

The short multi-event table model is a variation of the wide multi-event table mode that solves the clarity problem.

In this model, we store every event type in the same table. However, we keep the individual event properties in a single column with a dynamic type such as JSON (in some cases MAP).

Modeling event properties with a JSON column is excellent because every event can have its unique schema inside the same table.

Short multi product event table

The downside of this solution is that the data analysts will need to use JSON operations in every SQL query they write. This requires some education, and it decreases the performance of the table, as filtering on JSON columns will always require extra CPU time.

Score (the higher, the better):

  • Cost of maintenance: 8/10
  • Clarity 8/10
  • Speed 6/10

Conclusion

Most companies will choose one of the three models that I described above. Here is a quick guide on which to choose for your scenario.

  • Single table for every event - It is best to use when the client-side application produces up to 50 event types.
  • Wide multi-event table - It is rarely a good choice. Choose this model only if your data warehouse doesn't support dynamic types such as JSON or MAP.
  • Short multi-event table - It is the best choice for products that produce many event types. It is only usable in data warehouses and lakes where dynamic types are supported.

Final Comparison Table

Final comparison of the three models

Explore warehouse native product analytics

See how you can benefit from warehouse native product analytics

Blogs for your growth