Collecting, storing, and modeling product events in a data lake or warehouse has many benefits.
Switching to warehouse-native product event tracking makes much sense for mature businesses. The devil is in the details, as usual. In this post, I will cover 3 models of how you can store product events in the data warehouse and what factors you should consider before deciding on the chosen model.
These three factors are present in each of the product event models, and as you will soon see, there is no silver bullet solution. It depends on the business which you should choose.
There are 2+1 product event models that most companies should consider. I am 100% sure there are more models. However, most companies should get along with these three.
A single table per event type is a model that most likely comes first to your mind as a data engineer. There will be precisely one table in the data warehouse for every event type the client application produces.
The greatness of this model lies in its simplicity. It is easy to explain and, for the most part, to query. Every table has its own set of columns corresponding to the events' properties. It is also fast to query as each table can be partitioned or indexed by the event's timestamps. The name of the table describes the type of event precisely.
The problem comes when the client application produces hundreds or thousands of different event types. Just imagine a complex mobile application or a game. Maintaining many tables in the modeling infrastructure like DBT gets exponentially more challenging.
Just imagine renaming an event property (column) or fixing column values in each of the 1000 tables.
This product event model is the natural answer to the disadvantages of the single-wide event table model. If the data team knows that the client application will produce 1000 different event types, they can decide to store them in a single giant table.
The upside of this product event model is that the data team can maintain it relatively easily compared to the 1000 event tables.
However, this solution has multiple issues, rendering it impractical.
The biggest downside of the wide multi-event tables is that most of the thousand columns are filled with NULLs for most event types. SQL language dialects don't give an option to ignore null-filled columns. This makes it impractical for data analysts to understand and use this product event model. Determining which columns contain values for which event type is a constant struggle.
The short multi-event table model is a variation of the wide multi-event table mode that solves the clarity problem.
In this model, we store every event type in the same table. However, we keep the individual event properties in a single column with a dynamic type such as JSON (in some cases MAP).
Modeling event properties with a JSON column is excellent because every event can have its unique schema inside the same table.
The downside of this solution is that the data analysts will need to use JSON operations in every SQL query they write. This requires some education, and it decreases the performance of the table, as filtering on JSON columns will always require extra CPU time.
Most companies will choose one of the three models that I described above. Here is a quick guide on which to choose for your scenario.
Final Comparison Table
See how you can benefit from warehouse native product analytics