Modern Data Stack with Airflow and dbt - going into the cloud (part 2)

Michał Puchała

16 June 2025, 6 min read

What's inside

  1. Part 2 - going into the cloud
  2. The Cloud
  3. Setting it up
  4. Continuous improvement
  5. Where to next

Part 2 - going into the cloud

In the first article of this series we’ve established that the #1 characteristic of a modern data stack is that it’s cloud-based. This opens up a set of questions - which cloud to use? How to put our data system into it? How to reliably update it over time? Let’s look into these topics as we develop our template repository on GitHub.

The Cloud

Since the creation of Amazon Web Services in 2002, and especially since the release of EC2 in 2006, we’ve really only seen a one-way trend: more and more applications are deployed into “the cloud”. This is just as true for the data landscape. Very few companies would opt for running custom databases on their own servers when presented with the myriad of cloud options, from simpler solutions like AWS RDS, through BigQuery, to Snowflake and Databricks. Although the cost equation can be more complex and not every data solution will come out cheaper when hosted on cloud servers, being able to focus on the core tasks of data integration and processing while abstracting away most of the infrastructure and DevOps work usually makes it a no-brainer.

So which cloud to go for? The answer is predictable: “it depends”. Here are the main factors I would take into consideration:

  1. Probably the most important: are you already using a cloud provider and happy with your choice? If so, you’ll most likely want to add your data services to the same cloud and save yourself some hassle. Among the key players in the space - AWS, GCP and Azure - but also among the smaller cloud providers, you are quite likely to find what you need to get set up.
  2. Do you have specific needs that require specific technologies? Although most cloud providers offer managed services for both OLTP and OLAP databases, there are differences that might matter. You might also want to integrate with external providers whose services play nicer with one cloud than with the others. On the other hand, technologies like Snowflake give you options when it comes to hosting, so you’re not locked into a specific ecosystem.
  3. Legal and organizational aspects - regulations like GDPR in the EU or CCPA in California might force you to store your data in a specific geographic location, which can limit your choices; you might even be required to keep it within your country or state. Make sure to check this before you decide, because although most big players are present on all continents and in the majority of large markets, there’s still some risk that you’ll end up having to migrate to another data center or region.

For our project we decided to go with AWS, for the simple reason that we have the most experience with this vendor. They offer good solutions both for OLTP/smaller datasets (RDS, Aurora) and for big data (Redshift), as well as a selection of additional data-focused services.

For hosting our Airflow service we’ll go the simple route of an EC2 instance, and our databases will move out of a Docker container and onto RDS Postgres. Depending on your past usage, you might even find that this setup is free for you, since AWS offers a free tier for both EC2 and RDS (one year on the smallest instance type for both).

Setting it up

Although it’s very much possible to “click your way” into having your EC2 and RDS instances up and running via the AWS console, the sooner you realize the advantages of an Infrastructure as Code (IaC) approach, the better. Being able to version-control your infrastructure, create templates and abstractions to easily replicate and scale services, and collaborate seamlessly with others on developing your cloud is definitely worth having - and the learning curve is not that steep now that every IDE has an LLM assistant built in.

Here again, some choices need to be made. Your default options for AWS are CloudFormation and AWS CDK, and they’re definitely a respectable choice. The main issue is that you’re limited to one cloud provider, and if you’d like to add some elements of your ecosystem on a different platform, you’ll have to bring in another IaC solution. That’s where Terraform steps in: with support for over 5000 providers, you’re most likely to find anything you’ll need. Even though, for now, we’ll only set up AWS services, we will use Terraform for it.
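
Just to give a feel for the entry point of such a configuration, here is a minimal sketch of how providers are declared in Terraform. The version constraint and region below are placeholder assumptions rather than the values pinned in our template repository:

```hcl
# Declare the providers this configuration depends on.
# The version constraint and region are placeholders - check the
# template repository for the values it actually pins.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-central-1"
}
```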

And even though our initial setup is relatively simple (one Airflow instance, one database), you’ll see that you actually need to set up quite a few things (a simplified Terraform sketch follows the list):

  • The EC2 instance for Airflow
  • The main database where we’ll store our raw imports and dbt-processed tables
  • The Airflow database storing DAG statuses and run metadata
  • IAM users and policies to allow different services to interact with each other
  • Virtual Private Cloud setup for networking within our cloud
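
To make this a bit more concrete, below is a heavily simplified, hedged sketch of the two central resources - the EC2 instance for Airflow and the RDS Postgres database. The names, sizes, AMI ID and password variable are placeholder assumptions; the actual template repository splits the setup into modules and adds the IAM and VPC pieces listed above.

```hcl
# Minimal sketch of the two central resources (all values are placeholders).
# The real setup also needs the VPC, security groups, IAM policies and the
# separate Airflow metadata database listed above.

variable "db_password" {
  type      = string
  sensitive = true
}

# EC2 instance that will run Airflow.
resource "aws_instance" "airflow" {
  ami           = "ami-00000000000000000" # placeholder AMI ID
  instance_type = "t3.micro"              # smallest, free-tier eligible size

  tags = {
    Name = "airflow-server"
  }
}

# RDS Postgres instance for raw imports and dbt-processed tables.
resource "aws_db_instance" "main" {
  identifier        = "main-db"
  engine            = "postgres"
  engine_version    = "16"
  instance_class    = "db.t3.micro" # smallest, free-tier eligible size
  allocated_storage = 20

  db_name  = "warehouse"
  username = "dbadmin"
  password = var.db_password

  skip_final_snapshot = true
}
```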

On top of that, we’ll actually set up one aspect of our cloud manually - the backend for Terraform, which stores the state of our infrastructure and ensures that it stays consistent when multiple people are making changes at the same time.
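
On AWS a common way to do this is an S3 bucket for the state file combined with a DynamoDB table for locking. The sketch below uses placeholder names - check the repository documentation for the backend configuration it actually expects:

```hcl
# Remote backend for Terraform state (placeholder names).
# The S3 bucket and DynamoDB lock table are the pieces created
# manually, before the first `terraform init`.
terraform {
  backend "s3" {
    bucket         = "my-project-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
```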

Continuous improvement

Once the codebase is ready in your local environment, there’s one more thing to do - push it to the cloud. In theory you can do it manually: pause your services, push the files to the server and restart everything. But anything done manually takes more time and is more prone to human error. That’s where the last element of today’s focus comes into play - the CI/CD pipeline.

There’s a variety of services to choose from in this space, such as GitHub Actions, Jenkins and CloudBees, while most cloud providers also have native offerings like AWS CodePipeline, Azure DevOps and GCP Cloud Build. In our case we need something lightweight and ideally free, so we’re opting for CircleCI: it has a pretty simple setup, a decently equipped free tier and a nice web UI for our pipeline.

We set up 4 steps in the pipeline to run:

  1. Check the code and run unit tests
  2. Plan the infrastructure changes (if any are needed) in Terraform
  3. Apply the required changes
  4. Deploy our code and initialize the services again

There are a couple of things you need to set up in the CircleCI interface, such as a few environment variables and pipeline triggers (details in the repo documentation), but after that it’s a very reliable way to monitor your deployments and ensure that no broken code makes it into your production environment.

So in the end our system looks like this: [architecture diagram]

Where to next

Now that you have a full local development environment, your infrastructure neatly set up in Terraform and a proper CI/CD pipeline for deployment, the foundations are there. In the next articles in this series we’ll explore the available toolset (starting with a look at the freshly released Airflow 3 and its new capabilities) and different approaches to structuring and processing your data depending on the business context.

You’re welcome to use the GitHub repository we’ve built for this article for your own purposes. In case of any questions, feel free to reach out at [email protected]!
