Data Platform For Startups: An Overview

A hands-on guide for startups to select the right infrastructure when building a data platform to streamline analytics, and build a solid foundation for growth.

István Mészáros

October 19, 2023

Building a Scalable Data Platform for Startups in Under an Hour

Building a data platform might initially seem daunting, but the right tools can simplify the process considerably. CDPs, ELT tools, a scalable storage medium, and simple data modeling tools can make the investment worthwhile.

We live in an age where you can set up an entire data platform in under an hour, and a typical startup can reap the benefits without needing a team of data experts or spending money.

What is a Data Platform? Centralizing Data for Better Insights

A data platform is a piece of infrastructure that a data-driven company uses to collect all its relevant data. The data may come from the product, any backend system, or the external services the company uses for operations (sales, marketing, and finance tools).

The purpose of the data platform is to collect and centralize the data in a data warehouse or data lake where data analysts and engineers can process it and derive valuable information.

Pros and Cons of Maintaining a Data Platform for Startups

This post uncovers the best options for a startup to build its first data platform. Before diving into the details, let's examine the pros and cons of maintaining a data platform and warehouse for startups.

Pros:

Single source of truth - A startup can store and access all of its data about the business in a single place. Typically, without a data warehouse, data is scattered in various tools. Decentralized data leads to the "disjoint investigation" problem (data is impossible to join), which slows decision-making.
Easy accessibility - If all data is in a single place, stakeholders can access the data easily with the right tools. There is no need to learn multiple tools and semantics for the same business concept.
Low cost - Traditionally, data warehouses are associated with high costs. I would argue that this is not true anymore, especially with the introduction of serverless data warehouses. Even with millions of rows of data in the data warehouse, a startup can keep operational and maintenance costs near zero.

Cons:

Need for data experts - Some companies are lucky enough to have a core team that is SQL native. For the rest, introducing a data warehouse brings an additional hiring burden. In my experience, an analytics engineer is the best first data hire.
Data quality maintenance - Keeping data quality at a level where stakeholders trust the data is challenging. Luckily, DBT and similar data modeling tools have built-in testing and validation capabilities.
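
To make the data quality point concrete, here is a minimal sketch of two checks in the spirit of DBT's generic `unique` and `not_null` tests. It is not DBT's implementation. It uses Python's built-in `sqlite3` as a stand-in warehouse, and the `users` table, its columns, and the helper function names are assumptions made up for illustration.

```python
import sqlite3


def count_null_rows(conn, table, column):
    """Number of rows where the column is NULL (0 means the check passes)."""
    # Table and column names are trusted constants in this sketch.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]


def count_duplicated_values(conn, table, column):
    """Number of distinct non-NULL values that appear more than once."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT {column}
            FROM {table}
            WHERE {column} IS NOT NULL
            GROUP BY {column}
            HAVING COUNT(*) > 1
        )
    """
    return conn.execute(sql).fetchone()[0]


# Hypothetical sample data: one duplicated user_id and one missing email.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT, email TEXT);
    INSERT INTO users VALUES
        ('u1', 'a@example.com'),
        ('u2', NULL),
        ('u2', 'b@example.com');
""")

results = {
    "user_id_unique": count_duplicated_values(conn, "users", "user_id"),
    "email_not_null": count_null_rows(conn, "users", "email"),
}
print(results)
```

Any non-zero count means the check failed. In a real setup, you would likely declare such tests in your modeling tool rather than hand-roll them.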

With that out of the way, let's look at the best practices for getting started with a data platform.

Setting the Goal: Low-Cost, Scalable, and Ready in One Hour

There are millions of ways to build a data platform. I will focus on keeping operational and maintenance costs to a minimum. As this blog post is aimed at startups, I will assume that you have a limited number of tracked data points (under 100 million). However, I will keep in mind that your startup might grow, so the solution needs to be scalable and robust.

The other goal is to have it ready in under 1 hour.

I suggest that companies avoid picking a single "magic" technology that seemingly solves all data problems, as it might become costly later, or you might hit a wall for some use cases.

The Four Key Components of a Startup Data Platform

There are four essential infrastructure pieces for a data platform.

Data storage (data warehouse or data lake)
Customer Data Platform
ELT solution
Data modeling solution

Choosing the Right Data Storage: Serverless Solutions for Startups

As mentioned before, it is best to pick a serverless solution for a startup.

What is a serverless data warehouse, you might ask?

In short, you pay only for the time your SQL queries actually run, not for idle servers. This is great, especially since most offerings come with a free tier. So, you might stay in the free tier for years.
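
As a rough illustration of why usage-based pricing can stay at zero for a small startup, here is a back-of-the-envelope sketch. The unit (compute hours, terabytes scanned, and so on), the price, and the free allowance are all hypothetical placeholders, not any vendor's real pricing, so check the pricing page of the warehouse you choose.

```python
def estimate_monthly_cost(units_used: float, price_per_unit: float, free_units: float) -> float:
    """Estimate a usage-based monthly bill after subtracting the free allowance."""
    billable_units = max(0.0, units_used - free_units)
    return billable_units * price_per_unit


# Hypothetical numbers for illustration only.
PRICE_PER_UNIT = 5.0  # assumed price per billed unit
FREE_UNITS = 1.0      # assumed monthly free allowance

small_startup = estimate_monthly_cost(0.2, PRICE_PER_UNIT, FREE_UNITS)
growing_startup = estimate_monthly_cost(3.0, PRICE_PER_UNIT, FREE_UNITS)

print(small_startup)    # usage below the free allowance
print(growing_startup)  # only the usage above the allowance is billed
```

The point of the sketch is the shape of the curve: nothing is billed while usage stays under the allowance, and the bill then grows with usage rather than with provisioned capacity.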

Best technologies for startups

PostgreSQL (Multiple very cheap offerings)

Cloud options: Neon.tech, AWS RDS, Digital Ocean, etc. Neon.tech has a free tier, which makes it very appealing.
Postgres is a well-known and documented database technology
Great community
Great feature set
Integrates very well with other tools

BigQuery (Google)

Free for a long time
Easy to use, hard to master.

Clickhouse

Extremely fast and scalable solution.
It starts very cheap
Data analysts tend not to like it due to its unique JOIN mechanisms and missing SQL features

AWS Athena

It requires a specialized data engineering team
Very generous pricing

Generally, I don't recommend starting with Snowflake or Databricks as your first data warehouse solution. They typically require costly specialists, and their operational expenses are also high.

Rule of thumb: If you are just starting with warehousing, use PostgreSQL (for example, Neon.tech) or BigQuery. Both will serve you well, essentially without cost, until you reach millions of data points and thousands of active users.

Customer Data Platforms (CDPs): Collecting Usage Events Efficiently

A Customer Data Platform (CDP) is a specialized service that aggregates and organizes customer data from multiple sources into a single, unified database. The primary use case for CDP is to collect usage events from the startup's product.
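
To give a feel for what a usage event looks like by the time it lands in the warehouse, here is a minimal sketch. The field names are loosely modeled on the common "track" event shape (a user ID, an event name, free-form properties, and a timestamp). The `TrackEvent` class and the `to_warehouse_row` helper are hypothetical and do not belong to any CDP's SDK.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class TrackEvent:
    """A single product usage event (hypothetical shape for illustration)."""
    user_id: str
    event: str
    properties: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_warehouse_row(event: TrackEvent) -> dict:
    """Flatten an event into a row, storing properties as a JSON string."""
    row = asdict(event)
    row["properties"] = json.dumps(row["properties"], sort_keys=True)
    return row


signup = TrackEvent(
    user_id="user_123",
    event="signed_up",
    properties={"plan": "free"},
    timestamp="2023-10-19T12:00:00+00:00",  # fixed here so the output is stable
)
row = to_warehouse_row(signup)
print(row)
```

A real CDP handles the delivery, batching, and schema management for you; the sketch only shows the kind of record you can expect to query afterwards.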

Best technologies for startups

Segment

Industry-leading solution
Free trial only
After the trial, it starts very cheap but can be costly later on. I have seen multiple companies having to get rid of Segment due to its high cost.

Rudderstack

Warehouse-native CDP. It stores all customer data only in the data warehouse, whereas Segment holds metadata about the users internally.
Open-source and cloud offering
They offer a great free tier.

Jitsu

Open-source, you can self-host it. Self-hosting might be more expensive than their cloud offering.
It has a free tier that is great to start with
Great connector options
Event transformations with JavaScript
Segment SDK compatibility

Rule of thumb: If you are just starting with warehousing and have plenty of startup credits for Segment, then go with Segment. Beware, you might receive a surprise bill once your credits run out.
If you don't have Segment credits, RudderStack and Jitsu are great options. I love that they are open source. That gives extra peace of mind that vendor lock-in is not an issue.

ELT Solutions: Moving External Service Data to Your Warehouse

An ELT (extract, load, transform) solution is a service that can move data from external tools and services to your data warehouse. Why would you need this?

You need the data from your external services in the data warehouse to create a complete picture of your users. For example, to understand which users are paying, you typically need data from a billing service such as Stripe, Paddle, or Chargebee, so you have to bring that data into your data warehouse. Another excellent use case is marketing analytics. You want to connect your cost data to customer payment data to understand the return on investment.
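
Here is a minimal sketch of the kind of question that becomes answerable once product events and payment data sit side by side. It uses Python's built-in `sqlite3` as a stand-in warehouse. The `events` and `payments` tables and their columns are assumptions for illustration and do not reflect the schema that any particular CDP or ELT tool produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Product usage events, as a CDP might deliver them (hypothetical schema).
    CREATE TABLE events (user_id TEXT, event TEXT);
    -- Payments, as an ELT tool might load them from a billing service.
    CREATE TABLE payments (user_id TEXT, amount_usd REAL);

    INSERT INTO events VALUES
        ('u1', 'signed_up'), ('u1', 'created_report'),
        ('u2', 'signed_up'),
        ('u3', 'signed_up');
    INSERT INTO payments VALUES ('u1', 49.0), ('u1', 49.0), ('u3', 99.0);
""")

# Aggregate each source first, then join, so payments are not
# multiplied by the number of events per user.
query = """
    SELECT e.user_id, e.event_count, COALESCE(p.revenue_usd, 0.0) AS revenue_usd
    FROM (
        SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id
    ) AS e
    LEFT JOIN (
        SELECT user_id, SUM(amount_usd) AS revenue_usd FROM payments GROUP BY user_id
    ) AS p ON p.user_id = e.user_id
    ORDER BY e.user_id
"""
rows = conn.execute(query).fetchall()
for user_id, event_count, revenue_usd in rows:
    print(user_id, event_count, revenue_usd)
```

The same join is impossible while the events live in one tool and the payments in another, which is the whole argument for bringing both into one warehouse.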

There is an ocean of ELT solutions out there. What sets them apart is the list of integrations they cover. You can be sure that at least one or two of your external services won't be covered, no matter which solution you choose.

Best technologies for startups

Fivetran

Industry-leading ELT solution
Somewhat pricier than the others, but not significantly
It has a free tier!
It is hard to estimate how much credit you are going to use.

Airbyte

Open source alternative
Fair pricing
Some connectors are not yet production-ready, which can cause some headaches.
Vendor lock-in is less of an issue here as their core connectors are open-source.

StitchData

Most connectors are based on the Singer open-source library.
Predictable pricing
In my experience, they offer the fewest connectors.
Simple UX, with only a few options for anything other than copying data to your data warehouse. Most likely, though, that is the only thing you need.

Case of CDPs as ELT solutions

Some CDPs can work as ELT/ETL solutions. Using the same service as both your CDP and your ELT tool could make sense. The only problem is that CDPs will likely cover even fewer integrations than most ELT solutions. You will probably end up picking an ELT solution anyway, and suddenly, you will maintain and pay for two services.

Rule of thumb: I suggest only using a CDP for ELT if you are sure you will only need sources they offer. As an ELT solution for startups, I recommend the option that covers most of your external sources. It is unlikely that an ELT solution will break the bank if you are a startup. Sales, marketing, and finance data in a data warehouse is tiny compared to product data. As an engineer, I would tilt towards Airbyte because they have an open-source version. However, Fivetran's free tier option is excellent as well.
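
The "pick the option that covers most of your sources" rule can be sketched as a simple set comparison. The tool names and connector lists below are made-up placeholders, not the real catalogs of any vendor, so fill them in from each vendor's connector page.

```python
def rank_by_coverage(required: set, catalogs: dict) -> list:
    """Rank tools by how many required sources they cover, best first."""
    ranked = []
    for tool, connectors in catalogs.items():
        missing = sorted(required - connectors)
        covered_count = len(required) - len(missing)
        ranked.append((tool, covered_count, missing))
    # Most covered sources first; ties broken alphabetically by tool name.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical inputs for illustration only.
required_sources = {"stripe", "paddle", "chargebee", "ads_platform"}
catalogs = {
    "tool_a": {"stripe", "paddle", "ads_platform"},
    "tool_b": {"stripe", "chargebee"},
    "tool_c": {"stripe", "paddle", "chargebee", "ads_platform"},
}

ranking = rank_by_coverage(required_sources, catalogs)
for tool, covered_count, missing in ranking:
    print(tool, covered_count, missing)
```

The `missing` list is the useful part: it tells you which sources you would have to load by hand or through a second service.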

Data Modeling Tools: Cleaning and Transforming Your Data

Once you have your product data (with the help of CDPs) and the data from your external services (with the use of ELT tools) in the data warehouse, you need a data modeling tool that can help:

Clean the data in case there are some mistakes in it.
Transform and aggregate it to learn about your customers or the business.
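
As a small illustration of these two steps, the sketch below cleans a raw events table (drops duplicates and rows without a user, normalizes event names) and then aggregates it into daily active users. It uses Python's built-in `sqlite3` as a stand-in warehouse. The table, view, and column names are assumptions for illustration, and a modeling tool would typically manage such SQL models for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw data with typical mistakes: an exact duplicate row,
# inconsistent capitalization of event names, and a row with no user_id.
conn.executescript("""
    CREATE TABLE raw_events (user_id TEXT, event TEXT, event_date TEXT);
    INSERT INTO raw_events VALUES
        ('u1', 'Signed_Up',      '2023-10-18'),
        ('u1', 'Signed_Up',      '2023-10-18'),
        ('u2', 'signed_up',      '2023-10-18'),
        (NULL, 'signed_up',      '2023-10-19'),
        ('u1', 'created_report', '2023-10-19');

    -- Step 1: clean.
    CREATE VIEW clean_events AS
    SELECT DISTINCT user_id, LOWER(event) AS event, event_date
    FROM raw_events
    WHERE user_id IS NOT NULL;

    -- Step 2: transform and aggregate.
    CREATE VIEW daily_active_users AS
    SELECT event_date, COUNT(DISTINCT user_id) AS active_users
    FROM clean_events
    GROUP BY event_date;
""")

clean_count = conn.execute("SELECT COUNT(*) FROM clean_events").fetchone()[0]
dau = conn.execute(
    "SELECT event_date, active_users FROM daily_active_users ORDER BY event_date"
).fetchall()
print(clean_count)
print(dau)
```

Layering the aggregate on top of the cleaned view, rather than on the raw table, means every downstream metric inherits the same cleaning rules.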

Another blog post will cover the data modeling process and best practices.
The data modeling space is far less crowded than the ELT or CDP space.

Essentially, you have only a couple of options as a startup:

Cloud provider options (AWS Glue, GCP Cloud Dataflow, Azure Data Factory)

Usually needs a lot of engineering effort to maintain
It may sound like a good idea to use them since you are already a customer of that cloud provider. However, in the long run, you may find it hard to hire someone to maintain your data modeling pipelines.

DBT Cloud

DBT is an industry-standard modeling tool
Easy to learn, easy to hire for
It has an open-source version.
SQL + Jinja (and, more recently, some Python capabilities) is well understood.
DBT is not just a tool. It is a framework for how data should be governed and managed. It is much easier to get the modeling right with DBT.

Build your own solution

Generally, this is the worst option for startups, as you don't have the time or resources. Unless someone on your team can copy-paste an existing solution, I advise against it.

Rule of thumb: Stick to DBT Cloud. You don't need to reinvent the wheel. It takes less than 5 minutes to set up.

Sample Startup Data Platform: A Free and Scalable Stack

As you can see, there are many great options to choose from when building your data stack. Most of them have free or low-cost pricing options. Thanks to the cloud nature of all these solutions, you can easily set them up.

Here is a sample stack you can set up in less than 30 minutes:

This stack is free for a long time. All of these components have a free tier that you can leverage. Later, the price will scale reasonably with your usage.

Conclusion: Simplifying Startup Analytics with Modern Tools

[Figure: Modern data stack with Snowflake, Rudderstack and Mitzu]

Setting up your startup data platform in the current ecosystem is easier than ever. You can choose between many great building blocks for your stack based on your preferences and those of your business.

You can even start for free with the solution that I described above. Depending on your use cases, this setup will remain without cost for a long time.

The best part is that once a single component in your data stack becomes too expensive or no longer supports your changing requirements, you can swap it out for another solution. Changing infrastructure in production can be challenging, but keeping the four components separate makes it manageable.
