
Michał Puchała
30 April 2025, 9 min read

Part 1 - Starting Simple
The first article in a series of tutorials about building a scalable data project, with orchestration in Airflow (we'll talk about alternatives) and data pipelines managed in dbt (we'll also look at the competition in this space).
Modern data stack
For at least 5 years now there's been plenty of talk about what constitutes a "modern" data stack. If you browse through the many articles, blog posts and company pitches that reference the term (1, 2, 3, 4, 5, just to link a few), you'll notice a few consistent themes:
- Cloud-based In order to achieve scalability, flexibility and access to practically unlimited resources, cloud-based solutions have emerged as the clear winner over the last 10 years. Most modern data warehousing or "lakehousing" solutions, such as Databricks, Snowflake and BigQuery, are cloud-only.
- Merging many data sources One of the key benefits of the modern data stack is the ability to use not only your internal datasets, like the contents of your transactional database and CRM, but also to pull in external sources, from marketing channels and user analytics to financial markets data and weather forecasts. A great variety of data ingestion solutions has emerged to support this, with Fivetran, Stitch and Airbyte to name a few.
- ELT over ETL With storage becoming cheaper and cheaper and compute representing the vast majority of cost for most data solutions, it's become the default to pull in and keep as much raw data as possible and only apply the Transform step once you have everything you need in one place. The benefit of having the source material at hand if you want to apply changes to your pipelines usually outweighs the incremental cost of storing more.
- Version control and IaC With tools like Terraform, CloudFormation and GDM on one side and dbt and SQLMesh on the other, it's now possible to keep your entire data project in a Git repository and control the whole process end-to-end. Throw in Docker and something like Airflow to orchestrate it all, and you have everything you need to pull, process and store your data in one place. Another big advantage of the frameworks that have emerged in the last few years is the community around them. You could define everything in code before, but with dbt you now have an industry standard for data pipeline definitions that most data engineers are familiar with, and that any data professional will be able to browse easily thanks to the SQL-based syntax.
- AI, BI and beyond Once you have all of this data clean and ready in your database, the world is your oyster. Tableau, Power BI, Looker - there's a myriad of BI solutions for sharing insights with analysts and end users. For machine learning you can choose between SageMaker, Spark and many others to train and deploy your models. And why not use your clean, processed data in your applications? With the variety of connectors available, reverse ETL is getting more popular and is starting to blur the line between the data stack and production environments.
Looking back 5-10 years, when a lot of Big Data projects ran on Hadoop with custom interfaces built in Scala and many ML projects ran in R or Java, there has been a clear shift towards the ideas of the Modern Data Stack, coupled with Python becoming the language of choice for these types of projects. Not only does it empower smaller, leaner teams of data engineers, analysts and scientists to build more powerful applications - it's also starting to drive the core of the business, with data-driven decision making and ML support for core applications.
Humble beginnings
There is a catch in all of this. The scalability and flexibility are great, but a lot of the tools that promise them are expensive, and the costs ramp up quickly as your datasets grow. Want to host your data in Snowflake? Pay attention to the warehouse sizes and the queries your analysts are running, or you're likely to spend thousands of dollars in a matter of hours. Interested in a dashboard for your business analysts in Looker? Sure, it'll be just $50k for the first year.
The idea behind this series of articles is to show you the journey from a small, basic data project based on free-to-use or fully open source tools all the way to a full-fledged "modern data stack" with shiny cloud solutions. Of course, a locally deployed Postgres database is not of much use to your company. It is, however, a starting point for you to build some useful data pipelines and prototype your data connectors and orchestration in order to get buy-in from your organization to start investing in data. And there's quite a lot that you can build locally that will then be easy to deploy to a cloud environment and scale if needed.
Toolset
- Docker There’s definitely a learning curve to Docker and if you’re coming from a pure data background of running SQL queries in a GUI it will be confusing at times. It’s hard to argue though that Docker is essential to keeping environment consistency between local prototyping and production deployment. It’s also very convenient in our project for managing multiple environments for Python, which can come in very handy if your data connector is only available for Python 3.5 and older but your dbt unit tests require Python 3.10 or newer. Once you’ve been there and you’ve fixed it with Docker, you won’t be looking back. In our project, we’ve also attempted to keep the Docker as something you can play with if you want to finetune the project, but not something you need to engage with from the get-go to make it work. $$$ Docker is not entirely free. Docker engine is, but Docker desktop that you’ll use for local development is only free for smaller teams and businesses. In most cases you should either fall in the free category or have no problem securing the budget though.
- Airflow Similar to Docker, Airflow is not necessarily a "plug and play" solution that you'll master in a couple of days. It's very flexible and a lot can be done with it, but this comes at the cost of complexity. In the initial example we'll use it to provide you with a web interface to launch your scripts, as well as a scheduler to run them automatically on a daily basis (a minimal DAG sketch follows this list). In the upcoming articles we'll see how this can be developed further, and we'll also cover the upcoming Airflow 3 features and the migration to it from Airflow 2. $$$ Airflow is fully open source and free to use.
- dbt dbt, standing for "data build tool", has seen massive growth in adoption over the last ~5 years. It's arguably the go-to framework for maintaining data transformations on structured data warehouses. The convenience of using SQL syntax for pretty much everything you'll do there (with some Jinja and YAML sprinkled on top) means that most "data people" will hit the ground running and can get straight to work instead of learning the ins and outs of custom solutions (a sketch of invoking dbt programmatically also follows this list). $$$ The dbt-core package for Python, as well as the data connectors for it, is open source and therefore free to use if you deploy it yourself. dbt Cloud, on the other hand, which includes orchestration of your models and a bunch of other features, is not free and can become quite pricey.
- PostgreSQL When it comes to databases there's a ton of solutions out there, and which one is best for you really depends on your budget and your data's type and scale. It's hard to argue, though, that at this price point (free) anything beats PostgreSQL (mostly known simply as "Postgres") for flexibility and scalability. For most projects you'll prefer a structured database with a SQL interface (and dbt and Python connectors out of the box). The community and the amount of resources are huge, so you'll be able to find a solution for most problems you might face. It also scales pretty well for both OLTP and OLAP applications, although for massive data warehouses there are definitely better solutions that we'll talk about later. $$$ PostgreSQL is fully open source and free to use.
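To make the Docker and Airflow pieces a bit more concrete, below is a minimal sketch of a DAG that launches a containerized Python script on a daily schedule. It assumes Airflow 2.4+ with the Docker provider installed; the DAG id, image name and command are placeholders, not the values used in the template repository (which wraps this pattern in its own helper).

```python
# Minimal sketch: a daily DAG that runs a Python script inside a Docker container.
# Assumes Airflow 2.4+ and the apache-airflow-providers-docker package;
# the DAG id, image and command below are placeholders, not the repo's actual values.
import pendulum
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_ingest_daily",         # hypothetical DAG id
    schedule="@daily",                     # run once per day
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = DockerOperator(
        task_id="run_ingest_script",
        image="my-python-scripts:latest",  # placeholder: image built from python_scripts/
        command="python /app/ingest.py",   # placeholder entrypoint inside the image
    )
```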
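dbt itself is driven by SQL, but since dbt-core 1.5 it also ships a programmatic entry point, so the same script-first workflow applies to the models. This is only a sketch; in the template repository dbt is run through its CLI inside a container, and the "staging" selector below is a made-up example.

```python
# Sketch: invoking dbt from Python via dbt-core's programmatic API (dbt-core >= 1.5).
# The "--select staging" selector is a hypothetical example, not a model set from the repo.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Equivalent to running `dbt run --select staging` from the command line.
result: dbtRunnerResult = runner.invoke(["run", "--select", "staging"])

if not result.success:
    raise RuntimeError("dbt run failed")
```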
The “hello world” of modern data stack
So here we are with the first iteration of this GitHub repository - sunscrapers/airflow-dbt-template/part-1-starting-simple. This is the basis from which we’ll build out the project, but it already has the core pieces in place:
- The database into which our raw data can be dumped and where the further processing will happen.
- Airflow to orchestrate it all.
- A simple Python data connector pulling from an API (a rough sketch of what such a connector might look like follows this list).
- dbt to process the raw inputs into a table that can be useful for analytics, with tests and snapshot examples.
- Docker to run it all in dedicated environments that don’t really care where they are run.
- A Makefile to make the initial setup even easier.
- Some helper env files and such to streamline the setup process.
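Since the data connector is the piece you'll most likely replace first, here's a rough sketch of what a minimal API-to-Postgres connector can look like. The endpoint URL, table name and the DATABASE_URL variable are illustrative assumptions, not the repository's actual code.

```python
# Rough sketch of a minimal API-to-Postgres connector. The endpoint URL, table
# name and DATABASE_URL variable are placeholders, not the repository's actual code.
import os

import pandas as pd
import requests
from sqlalchemy import create_engine

API_URL = "https://example.com/api/daily-metrics"  # hypothetical source API


def run() -> None:
    # Pull the raw payload from the API and fail fast on HTTP errors.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    # Keep the data as raw as possible (ELT): just flatten the records into a DataFrame.
    df = pd.DataFrame(response.json())

    # The connection string is injected as an environment variable, e.g.
    # postgresql://user:password@host:5432/dbname (requires a driver such as psycopg2).
    engine = create_engine(os.environ["DATABASE_URL"])

    # Append into a table in the raw schema for dbt to transform later.
    df.to_sql("daily_metrics_raw", engine, schema="raw", if_exists="append", index=False)


if __name__ == "__main__":
    run()
```

Keeping the script free of hardcoded credentials is what makes it trivial to run the same code locally, inside the Airflow-managed container, or later in the cloud.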
By pulling or forking this repo and doing a 5-minute setup you can get started: customize the data connectors to your actual data sources and start building your dbt models on top of them. Some suggestions:
- Make sure to keep your connectors and Python scripts in the python_scripts/ directory and only call them in DAGs with the create_python_script_docker_operator() helper.
- Make sure to store things like usernames and passwords only in your .env file, which is not pushed to Git, and pass the values to your Python and dbt containers as environment variables (see the sketch after this list). You can have a look at the setup for our Postgres database to check how it's done.
- For quick prototyping, create a notebooks/ directory and use Jupyter. It's in .gitignore, so you won't pollute your future Git repository with notebooks. Make sure to create a virtual environment that closely mirrors the one in your Python Docker container. We'll go into the details of managing Python environments in a future article, in which we'll cover Poetry.
- Try adding new schemas to your dbt project and playing around with merging multiple data sources, for example by adding some additional datasets from the Eurostat API and building some more interesting analytics around them.
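On the secrets point above, the pattern in Python boils down to reading every credential from the environment at runtime rather than hardcoding it. A small sketch, assuming variable names like POSTGRES_USER match what you define in your .env file (the template may name them differently):

```python
# Sketch: build a Postgres connection purely from environment variables so that
# no credentials ever end up in the Git repository. The variable names are
# assumptions; match them to whatever you define in your .env file.
import os

import psycopg2  # any Postgres driver works; psycopg2 is just a common choice


def get_connection():
    return psycopg2.connect(
        host=os.environ.get("POSTGRES_HOST", "localhost"),
        port=int(os.environ.get("POSTGRES_PORT", "5432")),
        dbname=os.environ["POSTGRES_DB"],      # fail loudly if a required value is missing
        user=os.environ["POSTGRES_USER"],
        password=os.environ["POSTGRES_PASSWORD"],
    )


if __name__ == "__main__":
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone())
```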
What’s next
In the next article we'll cover the immediate next step you'd want to take if your prototype gets some traction: deployment to a production environment where you can run your project and share it with other users. Once we have that out of the way, a world of opportunities opens up. Some topics that we'll definitely want to cover are:
- dbt alternatives and best practices
- Airflow features and migration to Airflow 3
- Structure of a data warehouse and the data lakehouse philosophy
In case you’re looking for someone to implement a data solution in your organization for you - contact us and we’ll propose a custom implementation for your needs and development plans.