
Piotr Mierzejewski
10 May 2025, 10 min read

What's inside
- The problem: Why data testing matters & how dbt solves it
- Introducing the Example Project & Dataset
- Testing the staging layer
- Conclusion
The problem: Why data testing matters & how dbt solves it
Testing is a well-established discipline in software engineering, with entire careers dedicated to ensuring software reliability and correctness. However, data engineering lacks the same level of maturity in testing practices. While software engineers rigorously apply unit tests, integration tests, and continuous testing methodologies, data engineers often overlook testing—or apply it inconsistently.
Yet, data engineers are also building software. A well-structured data pipeline is just as critical as a well-written application, as flawed data can lead to incorrect business decisions, compliance risks, and operational failures. Given this, a strong argument can be made that a good data engineer should also embrace software engineering best practices—including testing methodologies.
The engineers at dbt Labs must have come to a similar conclusion, because dbt provides a structured, SQL-native approach to testing data at various stages of transformation. By integrating testing directly into the analytics workflow, dbt makes it easy to catch issues before they impact downstream use cases.
In this series of articles, we will explore various testing approaches applicable to dbt projects. We’ll build the project layer by layer, introducing and explaining different test types as their need naturally arises within the workflow.
Introducing the Example Project & Dataset
To better understand the concept of testing, we’ll put it into the context of a project. By doing so, we can demonstrate that testing isn’t just an additional burden for engineers—it’s a valuable tool that enhances data reliability and simplifies development. When implemented effectively, a robust testing suite becomes a strategic investment that improves both efficiency and confidence in the data pipeline.
In this project, we will analyze drug labelers in the U.S. and extract some metrics about each of them. Specifically, we aim to determine the following (a rough sketch of the kind of query involved appears right after this list):
- The number of distinct medical products registered under each labeler
- The number of unique active ingredients used across all their products
- The total count of generic drugs they produce
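Purely to make these metrics concrete, here is a rough sketch of the kind of aggregation the final model might perform. The table and column names are assumptions based on the dataset introduced below, the JSON-unnesting syntax assumes a Postgres-style warehouse, and counting ANDA filings as generics is only one possible proxy, so treat this as an illustration rather than the eventual implementation:
-- Illustrative sketch only: the table name, column names, and the
-- "generic" definition are assumptions, not the project's final logic.
select
    labeler_name,
    count(distinct product_ndc) as distinct_products,
    count(distinct ingredient ->> 'name') as unique_active_ingredients,
    -- one possible proxy for "generic": products filed under an ANDA
    count(distinct product_ndc) filter (where marketing_category = 'ANDA') as generic_drugs
from ndc_products
left join lateral jsonb_array_elements(active_ingredients) as ingredient on true
group by labeler_name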
We will base our analysis on the FDA NDC dataset, which lists the medical products registered with the FDA for marketing in the U.S. To better illustrate key concepts, the project has been intentionally built out beyond what a dataset of this size would normally call for.
First, let’s take a closer look at the structure of the dataset. Each drug in the dataset is stored as a JSON object within a list. A single drug entry looks like this:
{
    "product_ndc": "71335-1974",
    "generic_name": "Fluvoxamine Maleate",
    "labeler_name": "Bryant Ranch Prepack",
    "brand_name": "Fluvoxamine Maleate",
    "active_ingredients": [
        {
            "name": "FLUVOXAMINE MALEATE",
            "strength": "50 mg/1"
        }
    ],
    ...
    "listing_expiration_date": "20251231",
    "marketing_category": "ANDA",
    "dosage_form": "TABLET",
    "product_type": "HUMAN PRESCRIPTION DRUG",
    "route": [ "ORAL" ],
    "marketing_start_date": "20010710",
    "product_id": "71335-1974_ee35b0d2-a375-4176-be7d-23783cf8eb22",
    "application_number": "ANDA075902",
    "brand_name_base": "Fluvoxamine Maleate",
    "pharm_class": [
        "Serotonin Reuptake Inhibitor [EPC]",
        "Serotonin Uptake Inhibitors [MoA]"
    ]
}
The ingestion script loads this file and inserts each JSON object as a separate record in a raw table.
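The staging SQL shown later reads everything from a single JSON column. Assuming the ingestion script lands each object in a jsonb column called result (the name used in that query), the raw table might look roughly like this:
-- Hypothetical shape of the raw table: one row per drug, with the whole
-- JSON object stored in a single jsonb column. Schema and table names are
-- assumptions chosen to match the source registered below.
create table fda.ndc (
    result jsonb
);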
To incorporate this dataset into our dbt project, we need to define it as a source. Sources are a special kind of object in dbt—they are not built by the framework itself but are registered within the project, which allows them to serve as dependencies for downstream dbt-managed models.
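Registering the source happens in a YAML file. A minimal sketch might look like the following, assuming the raw data sits in a schema called fda and a table called ndc; adjust the names to wherever your ingestion script actually writes:
version: 2

sources:
  - name: fda
    schema: fda   # assumption: the schema the ingestion script writes to
    tables:
      - name: ndc
        description: "Raw FDA NDC dataset, one JSON object per row"
This is what the source('fda', 'ndc') call in the staging model below resolves against.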
Following best practices, each source should have its own staging model, which acts as a one-to-one mapping between the source and the dbt-managed layer. In our case, the first staging model will be stg_fda__ndc, serving as an intermediary between the raw dataset and further transformations.
Two key files sit between the raw data and the final model:
- stg_fda__ndc.sql – contains the SQL logic that defines how the staging model is created and populated.
- stg_fda__ndc.yml – describes the structure of the model, including metadata, column definitions, and tests.
If we use a restaurant metaphor, the SQL file is like a recipe that the cook follows to prepare the dish, while the YAML file is like a photo of the finished dish with quality standards that the chef uses to ensure it meets expectations before serving.
So, what do these files contain?
For staging models, the code is relatively simple. We use the source macro to reference the raw data and write a SELECT query that performs light processing—mainly extracting the columns needed for downstream models.
select
    (result ->> 'product_id') as product_id,
    (result ->> 'product_ndc') as product_ndc,
    (result ->> 'brand_name') as brand_name,
    (result -> 'pharm_class') as pharm_class,
    (result -> 'openfda' -> 'is_original_packager') as is_original_packager,
    (result ->> 'generic_name') as generic_name,
    (result ->> 'dosage_form') as dosage_form,
    (result ->> 'route') as route,
    (result ->> 'marketing_category') as marketing_category,
    (result ->> 'dea_schedule') as dea_schedule,
    (result -> 'active_ingredients') as active_ingredients,
    (result ->> 'product_type') as product_type,
    (result ->> 'labeler_name') as labeler_name
from {{ source('fda', 'ndc') }}
And the corresponding YAML file:
version: 2

models:
  - name: stg_fda__ndc
    description: "Unpacking values from raw ndc dataset in json format"
    columns:
      - name: product_id
        data_type: text
        description: "ProductID is a concatenation of the NDC product code and SPL documentID."
      - name: product_ndc
        data_type: text
        description: "The labeler manufacturer code and product code segments of the NDC number, separated by a hyphen."
      - name: brand_name
        data_type: text
        description: "Brand or trade name of the drug product."
      - name: pharm_class
        data_type: jsonb
        description: "These are the reported pharmacological class categories corresponding to the SubstanceNames listed above."
      - name: is_original_packager
        data_type: jsonb
        description: "Whether or not the drug has been repackaged for distribution."
      - name: generic_name
        data_type: text
        description: "Generic name(s) of the drug product."
      - name: dosage_form
        data_type: text
        description: "The drug's dosage form. There is no standard, but values may include terms like `tablet` or `solution for injection`."
      - name: route
        data_type: text
        description: "The route of administration of the drug product."
      - name: marketing_category
        data_type: text
        description: "Product types are broken down into several potential Marketing Categories, such as NDA/ANDA/BLA, OTC Monograph, or Unapproved Drug."
      - name: dea_schedule
        data_type: text
        description: "This is the assigned DEA Schedule number as reported by the labeler. Values are CI, CII, CIII, CIV, and CV."
      - name: active_ingredients
        data_type: jsonb
        description: "List of JSON objects containing the name and strength of each active ingredient."
      - name: product_type
        data_type: text
        description: "Type of drug product."
      - name: labeler_name
        data_type: text
        description: "Manufacturer (labeler) name."
Whoa! That was a lot to take in, and for this particular example, it might seem like overkill. However, there are several benefits to structuring our project this way:
- These definitions are used by automation tools for documentation, lineage tracking, and other metadata-driven processes.
- They make it easier for other team members to understand both this model and its dependencies.
- As the project scales, maintaining well-structured metadata becomes increasingly valuable. Adding it upfront, while you’re still in context, is much easier than retrofitting it later.
- YAML files can store model configurations that would otherwise clutter the SQL file, improving readability.
- And most importantly for this article—they are used to define tests!
And the drawbacks? I can think of only one—but it’s a big one:
- You have to maintain it!
Testing the staging layer
Built-in tests
So, we have a model in staging. Is it worth testing? Absolutely. Testing should start as early as possible to minimize the chances of debugging complex logic in downstream models, only to realize the issue originates from the source model.
This is similar to spending hours troubleshooting why nothing is displaying on a screen—only to discover the monitor isn’t plugged in.
At this stage, testing will be minimal since we don’t yet have full knowledge of what the downstream logic will require. Following the YAGNI (You Ain’t Gonna Need It) principle, we focus only on key invariants. Specifically, we ensure that product_id is unique and not null in every row and that product_ndc is not null.
This can be easily accomplished using built-in tests by modifying the column definitions in the stg_fda__ndc.yml file, as shown below:
columns:
  - name: product_id
    data_type: text
    description: "ProductID is a concatenation of the NDC product code and SPL documentID."
    tests:
      - not_null
      - unique
  - name: product_ndc
    data_type: text
    description: "The labeler manufacturer code and product code segments of the NDC number, separated by a hyphen."
    tests:
      - not_null
Once the model is created, dbt automatically checks whether these conditions hold and stops with an error if they are not met, with output like the following:
13:22:26 1 of 1 START sql view model dev_staging.stg_fda__ndc .......................... [RUN]
13:22:26 1 of 1 OK created sql view model dev_staging.stg_fda__ndc ..................... [CREATE VIEW in 0.19s]
13:22:26 2 of 2 START test not_null_stg_fda__ndc_product_id ............................ [RUN]
13:22:26 2 of 2 FAIL 138449 not_null_stg_fda__ndc_product_id ........................... [FAIL 138449 in 0.61s]
13:22:26 3 of 3 START test not_null_stg_fda__ndc_product_ndc ........................... [RUN]
13:22:27 3 of 3 PASS not_null_stg_fda__ndc_product_ndc ................................. [PASS in 0.12s]
13:22:27 4 of 4 START test unique_stg_fda__ndc_product_id .............................. [RUN]
13:22:27 4 of 4 PASS unique_stg_fda__ndc_product_id .................................... [PASS in 0.03s]
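An interleaved log like this is typical of dbt build, which runs each model's tests right after building the model; the number next to FAIL is the count of offending rows. Under the hood, every built-in test compiles to a query that selects the failing rows, and the test passes only when that query returns nothing. Conceptually, the not_null test on product_id boils down to something like this simplified sketch of the generated SQL:
-- Simplified sketch of the query behind the generated not_null test:
-- the test passes only when this returns zero rows.
select product_id
from dev_staging.stg_fda__ndc
where product_id is null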
Contracts
The next type of test I'd like to cover is the contract test. Contract tests check whether the results of running a SQL script in the database align with the definitions specified in the YAML file. This type of test is useful for preventing discrepancies between the model and its configuration. For example, if a developer adds a new column to the query but forgets to update the YAML file, the project will fail to build, helping catch such oversights.
In this example, enforcing a contract primarily ensures schema validation, but contracts can offer additional benefits, such as maintaining a consistent interface between different parts of the pipeline. However, this comes at the cost of making the pipeline more rigid.
Consider a scenario where we pivot a table based on the values in the marketing_category column. If a new category appears—one that wasn't previously defined—the pipeline would break. Diagnosing the issue might require investigation, as the failure wouldn’t be immediately obvious.
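To make the scenario concrete, here is one way such a pivot might look, assuming the dbt_utils package; this downstream model is purely hypothetical and not part of our project. Because the category columns are generated from whatever values exist in the data, a brand-new marketing_category would add an unexpected column, and an enforced contract on this model would fail the build rather than let the change slip through silently:
-- Hypothetical downstream model: one column per marketing category,
-- generated at compile time from the values currently in the data.
select
    labeler_name,
    {{ dbt_utils.pivot(
        'marketing_category',
        dbt_utils.get_column_values(ref('stg_fda__ndc'), 'marketing_category')
    ) }}
from {{ ref('stg_fda__ndc') }}
group by labeler_name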
Is this a good or bad thing? It depends. In some cases, enforcing a strict schema is beneficial, ensuring data consistency and preventing silent errors. In this instance, we prioritize stability and want to keep the schema frozen to guarantee that downstream models operate as expected.
Defining a contract on a model is done within the YAML file by adding the enforced: true setting in the model's contract config section. Note that once a contract is enforced, dbt requires every column to declare both its name and data_type in the YAML, which our column definitions above already do.
models:
  - name: stg_fda__ndc
    description: "Unpacking values from raw ndc dataset in json format"
    config:
      contract:
        enforced: true
We can verify that the contract is enforced by changing either the SQL query or the YAML definition without updating the other. If the contract is working correctly, this mismatch should trigger an error.
Below is an example of what happens when we remove the brand_name column from the YAML definition:
13:30:26 Compilation Error in model stg_fda__ndc (models/staging/stg_fda__ndc.sql)
This model has an enforced contract that failed.
Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
| column_name | definition_type | contract_type | mismatch_reason |
| ----------- | --------------- | ------------- | ------------------- |
| brand_name | TEXT | | missing in contract |
And here’s what happens in the opposite situation—when brand_name is removed from the SQL query but still exists in the YAML definition:
13:51:06 Compilation Error in model stg_fda__ndc (models/staging/stg_fda__ndc.sql)
This model has an enforced contract that failed.
Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
| column_name | definition_type | contract_type | mismatch_reason |
| ----------- | --------------- | ------------- | --------------------- |
| brand_name | | TEXT | missing in definition |
Conclusion
In this article, we introduced our example project and demonstrated how built-in dbt tests and contracts help maintain key invariants and ensure schema consistency in the early stages of development.
As we continue this series, we will explore more advanced testing strategies, including custom SQL tests, unit tests, and CI/CD integration, explaining their purpose and applying them within the context of our project.
The goal is to provide you with the insights needed to determine whether testing is necessary for your project and, if so, which tests are worth implementing. Stay tuned for the next installment!