Building a data platform might initially seem daunting, but the right tools can simplify the process considerably. CDPs, ELT tools, a scalable storage medium, and simple data modeling tools can make the investment worthwhile.
We live in an age where you can set up a whole data platform in under an hour, and a typical startup can reap the benefits without hiring a team of data experts or spending money.
A data platform is a piece of infrastructure that any data-driven company uses to collect all its relevant data. The data may come from the product, any backend solution, or the external services the company uses for operations (sales, marketing, and finance tools and services).
The purpose of the data platform is to collect and centralize the data in a data warehouse or data lake where data analysts and engineers can process it and derive valuable information.
This post uncovers the best options for a startup to build its first data platform. Before diving into the details, let's examine the pros and cons of maintaining a data platform and warehouse for startups.
Okay, with that out of the way, let's look at the best practices for getting started with a data platform.
There are countless ways to build a data platform. I will focus on keeping operational and maintenance costs to a minimum. Since this post is aimed at startups, I will assume that you have a limited number of tracked data points (under 100 million). However, I will keep in mind that your startup might grow, so the solution needs to be scalable and robust.
The other goal is to have it ready in under an hour.
I suggest you do not pick a single "magic" technology that seemingly solves every data problem. It might become costly later, or you will hit a wall for some use cases.
There are four essential infrastructure pieces for a data platform: a data warehouse, a customer data platform (CDP), an ELT tool, and a data modeling tool.
As mentioned before, it is best to pick a serverless solution for a startup.
What is a serverless data warehouse, you might ask?
In short, you pay only for the time your SQL queries actually run. This is great, especially since most offerings come with a free tier. So, you might stay in the free tier for years.
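To make the pay-per-use idea concrete, here is a minimal sketch of how a usage-based bill with a free allowance could be estimated. The function name, the free allowance, and the price per unit are all hypothetical placeholders, not any vendor's actual price list; a "unit" stands for whatever your vendor bills on (for example, compute seconds or data scanned).

```python
def monthly_query_cost(units_used: float, free_units: float, price_per_unit: float) -> float:
    """Estimate a monthly bill: usage inside the free allowance costs nothing."""
    billable_units = max(0.0, units_used - free_units)
    return billable_units * price_per_unit


# Hypothetical numbers for illustration only.
FREE_UNITS = 1.0
PRICE_PER_UNIT = 5.0

early_stage = monthly_query_cost(0.2, FREE_UNITS, PRICE_PER_UNIT)
growing = monthly_query_cost(3.0, FREE_UNITS, PRICE_PER_UNIT)

print(f"early stage: ${early_stage:.2f}, growing: ${growing:.2f}")
```

As long as usage stays under the allowance, the bill is zero, which is why a small startup can plausibly remain on a free tier for a long time.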
PostgreSQL (Multiple very cheap offerings)
Generally, I don't recommend starting with Snowflake or Databricks as your first data warehouse solution. They typically require costly specialists, while the operation expenses are also high.
Rule of thumb: If you are just starting with warehousing, use PostgreSQL (for example, Neon.tech) or BigQuery. Both will serve you well, essentially without cost, until you reach millions of data points and thousands of active users.
A Customer Data Platform (CDP) is a specialized service that aggregates and organizes customer data from multiple sources into a single, unified database. The primary use case for CDP is to collect usage events from the startup's product.
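As a rough illustration, a usage event is usually just a small structured record describing who did what and when. The sketch below builds one as a plain dictionary. The helper name and the field names (`userId`, `event`, `properties`, `timestamp`) are assumptions modeled on the shape commonly used for "track" calls; check your CDP's own specification for the exact schema.

```python
import json
from datetime import datetime, timezone


def build_track_event(user_id: str, event: str, properties: dict) -> dict:
    """Build a hypothetical product-usage event, ready to be sent to a CDP."""
    return {
        "userId": user_id,          # who performed the action
        "event": event,             # what happened
        "properties": properties,   # free-form details about the action
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened (UTC)
    }


event = build_track_event("user_123", "Report Exported", {"format": "csv", "rows": 1200})
print(json.dumps(event, indent=2))
```

The CDP's job is to receive records like this from your product and land them in the data warehouse as rows you can query.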
Rule of thumb: If you start with warehousing and have many startup credits for Segment, then go with Segment. Beware, you might receive a surprise bill once your credit runs out.
If you don't have Segment credits, RudderStack and Jitsu are great options. I love that they are open source. That gives you extra peace of mind that vendor lock-in is not an issue.
An ELT (extract, load, transform) solution is a service that moves data from external tools and services to your data warehouse. Why would you need this?
You need the data from your external services in the data warehouse to create a complete picture of your users. To understand which users are paying, you typically need data from a billing service such as Stripe, Paddle, or Chargebee, so you need to bring that data into your data warehouse. Another excellent use case is marketing analytics. You want to connect your cost data to customer payment data to understand the return on investment.
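Here is a minimal sketch of the marketing use case, using an in-memory SQLite database as a stand-in for the warehouse. The table names, columns, and numbers are invented for illustration, and the sketch assumes each payment has already been attributed to an acquisition channel, which in practice is its own modeling problem.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Loaded by the ELT tool from the ad platforms (hypothetical schema).
    CREATE TABLE ad_spend (channel TEXT, cost REAL);
    -- Loaded by the ELT tool from the billing service (hypothetical schema).
    CREATE TABLE payments (user_id TEXT, channel TEXT, amount REAL);

    INSERT INTO ad_spend VALUES ('search', 500.0), ('social', 200.0);
    INSERT INTO payments VALUES
        ('u1', 'search', 400.0),
        ('u2', 'search', 350.0),
        ('u3', 'social', 100.0);
""")

# ROI per channel = (revenue - cost) / cost
roi_rows = conn.execute("""
    SELECT s.channel,
           s.cost,
           COALESCE(SUM(p.amount), 0) AS revenue,
           (COALESCE(SUM(p.amount), 0) - s.cost) / s.cost AS roi
    FROM ad_spend AS s
    LEFT JOIN payments AS p ON p.channel = s.channel
    GROUP BY s.channel, s.cost
    ORDER BY s.channel
""").fetchall()

for channel, cost, revenue, roi in roi_rows:
    print(f"{channel}: cost={cost}, revenue={revenue}, roi={roi:.0%}")
```

The query is only possible because both data sets live in the same place, which is the whole point of loading external data into the warehouse.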
There is an ocean of ELT solutions out there. The game is about the covered list of integrations. You can be sure that at least one or two of your external services won't be covered, no matter your chosen solution.
Some CDPs can also work as ELT/ETL solutions, and using the same service for both could make sense. The only problem is that CDPs will likely cover even fewer integrations than most ELT solutions. You will probably end up picking a dedicated ELT solution anyway, and suddenly you will maintain and pay for two services.
Rule of thumb: I suggest only using a CDP for ELT if you are sure you will only need sources they offer. As an ELT solution for startups, I recommend the option that covers most of your external sources. It is unlikely that an ELT solution will break the bank if you are a startup. Sales, marketing, and finance data in a data warehouse is tiny compared to product data. As an engineer, I would tilt towards Airbyte because they have an open-source version. However, Fivetran's free tier option is excellent as well.
Once you have your product data (with the help of CDPs) and the data from your external services (with the use of ELT tools) in the data warehouse, you need a data modeling tool that can help:
Another blog post will cover the data modeling process and best practices.
The data modeling space is far less crowded than the ELT or CDP space.
Cloud provider options (AWS Glue, GCP Cloud Dataflow, Azure Data Factory)
Build your own solution
Rule of thumb: Stick to dbt Cloud. You don't need to reinvent the wheel. It takes less than 5 minutes to set up.
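To give a feel for what a data model is, here is a minimal sketch: a model is essentially a named SELECT statement that turns raw rows into a clean, reusable table or view. This is not dbt itself. It uses an in-memory SQLite database as a stand-in for the warehouse, and the table, view, and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw product events as a CDP might land them (hypothetical schema).
    CREATE TABLE raw_events (user_id TEXT, event TEXT, occurred_at TEXT);
    INSERT INTO raw_events VALUES
        ('u1', 'login',  '2024-01-01T09:00:00'),
        ('u1', 'export', '2024-01-01T09:05:00'),
        ('u2', 'login',  '2024-01-01T10:00:00'),
        ('u1', 'login',  '2024-01-02T08:30:00');

    -- The "model": a view that derives daily active users from raw events.
    CREATE VIEW daily_active_users AS
    SELECT date(occurred_at)       AS day,
           COUNT(DISTINCT user_id) AS active_users
    FROM raw_events
    GROUP BY date(occurred_at);
""")

dau = conn.execute(
    "SELECT day, active_users FROM daily_active_users ORDER BY day"
).fetchall()
print(dau)
```

A modeling tool adds the parts this sketch leaves out, such as dependency ordering between models, scheduling, and testing.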
As you can see, there are many great options to choose from when building your data stack. Most of them have free or low-cost pricing options. Because all of these solutions are cloud-based, you can set them up easily.
Here is a sample stack you can set up in less than 30 minutes:
This stack can stay free for a long time. All of these components have a free tier that you can leverage. Later, the price will scale reasonably with your usage.
Setting up your startup's data platform is easier than ever in the current ecosystem. You can choose between many great building blocks for your stack based on your own preferences and those of your business.
You can even start for free with the solution that I described above. Depending on your use cases, this setup will remain without cost for a long time.
The best part is that once a single component in your data stack becomes too expensive or won't support the changing requirements, you can swap it out for another solution. Changing infra in production can be challenging, but it can remain manageable with the four different components.