Testing in dbt - part 2

Piotr Mierzejewski

15 April 2025, 14 min read


What's inside

  1. Introduction
  2. Solving the problem using the model
  3. Testing the model
  4. Conclusion

Introduction

We continue our exploration of testing in dbt by building an example project, introducing different testing concepts, and explaining their usage and relevance as we progress. So far, we have created the staging layer and secured it with built-in tests and contracts to enforce data integrity and schema consistency.

Now, we’re moving deeper into the transformation process by constructing the intermediate layer. This stage presents new challenges that require additional types of validations, such as unit tests and singular tests, to ensure the correctness of transformations before data reaches the final reporting layer.

So, without further delay, let’s dive in!

Solving the problem using the model

We have a staging model (staging.stg_fda__ndc) that extracts key columns from the JSON data. The next step is to transform this raw data into something meaningful.

If we revisit the requirements for our final report, we need to determine:

  • The number of distinct medical products registered under each labeler
  • The number of unique active ingredients used across all their products
  • The total count of generic drugs they produce

Since this analysis is heavily focused on labelers, let’s explore the available data and see what information we can extract about them.

To achieve this, we can refer to the YAML configuration of the staging model. The time invested in writing clear and meaningful descriptions is now paying off, as it eliminates the need for extensive investigation into what information each column contains. While the labeler_name column is fairly self-explanatory, this process has also helped us identify other useful pieces of information that might be relevant to our analysis.

  - name: product_ndc
    data_type: text
    description: "The labeler manufacturer code and product code segments of the NDC number, separated by a hyphen."

It appears that the NDC code contains the labeler manufacturer code, which might be useful later. We'll keep this in mind as we proceed with our analysis, but for now, let's examine the labeler_name column.

Running the following query:

select count(1)
from dev_staging.stg_fda__ndc;

returns:

138,499

While this query:

select count(distinct labeler_name)
from dev_staging.stg_fda__ndc;

returns:

9,851

This looks promising! The dataset contains 138,499 records, but only 9,851 unique labeler names, suggesting that each labeler is linked to multiple medical products. In theory, we could use this column to distinguish labelers. After all, this is an official dataset from a well-known organization, so why not?

However, before making that assumption, we should still verify the consistency of the labeler_name column. Variations in formatting, typos, or inconsistencies in the source data could lead to duplicate labelers appearing under slightly different names.

These small variations in names can easily be identified by ordering the distinct records. One effective way to do this is by using a GROUP BY query:

select labeler_name, count(1)
from dev_staging.stg_fda__ndc
group by labeler_name
order by labeler_name

returns data like this:

labeler_name      count
7-eleven          1
7-Eleven          7
7-Eleven Inc.     3
7-Eleven, Inc.    2
7-ELEVEN, INC.    1
7-ELEVEN          2

This is problematic because it means that the labeler_name field is essentially free text, and a quick glance through other records confirms this. The example above can be resolved with simple transformations, such as converting to lowercase, replacing Inc. and commas with an empty string, and trimming leading/trailing spaces. However, this won't be sufficient to handle typos in names.
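To make those simple transformations concrete, here is a minimal sketch of such a naive normalization. It is illustrative only: it collapses the 7-Eleven variants above into a single name, but it would not survive typos or other corporate suffixes.

-- Naive normalization: lowercase, drop "inc." and commas, trim spaces
select distinct
    trim(replace(replace(lower(labeler_name), 'inc.', ''), ',', '')) as labeler_name_normalized
from dev_staging.stg_fda__ndc
where lower(labeler_name) like '7-eleven%';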

Another issue is that we can't easily validate whether all records are correct after the transformations. We could select a subset of names and apply a set of transformations to clean them, but what if, for example, one company’s name actually starts with Inc. and we accidentally remove it?

This makes the problem non-trivial. Fortunately, we’ve discovered an additional piece of information—the labeler code, which is part of the NDC. This might provide a more reliable reference and could potentially help us resolve the inconsistencies.

Querying the product_ndc column confirms that the documentation was correct. We can see that the NDC codes are composed of two parts, separated by a hyphen. The first part of the code represents the manufacturer.

"product_ndc"
"71335-2057"
"71335-2068"
"71335-2104"
"71335-2111"

In PostgreSQL, we can easily split the two parts of the NDC code by locating the position of the hyphen and then extracting the substring from the start of the string up to the position of the hyphen. Like this:

select
    labeler_name,
    substring(product_ndc, 0, position('-' in product_ndc)) as labeler_code
from dev_staging.stg_fda__ndc

We can further expand this query to examine the different names associated with each labeler.


with extracted_labeler_code as (
    select
        lower(labeler_name) as labeler_name,
        substring(product_ndc, 0, position('-' in product_ndc)) as labeler_code
    from dev_staging.stg_fda__ndc
)

select labeler_code, labeler_name, count(1)
from extracted_labeler_code
where labeler_code = '0363'
group by labeler_code, labeler_name
order by labeler_code, count(1) desc

Which returns something like this:

labeler_code   labeler_name             count
0363           walgreen company         473
0363           walgreens                345
0363           walgreen co.             92
0363           walgreens company        63
0363           walgreens co.            58
0363           walgreen co              34
0363           walgreen                 13
0363           walgreens co             11
0363           walgreen's corporation   1
0363           walgreen, co.            1
0363           walgreen's co            1
0363           walgreens co,            1
0363           walgreens, co.           1

This highlights the challenge we previously encountered: free-text fields are inherently inconsistent and cannot be reliably cleaned with a fixed set of string transformations. That is not to say the problem is impossible; ambitious engineers tackle much harder tasks, and machine learning could even be an option. However, since we have an alternative way to identify the labeler and only need the labeler name for readability, I propose a much simpler solution that will keep us moving forward.

I propose a heuristic approach that applies simple rule-based transformations and then selects the most frequently used name. The assumption is that, in most cases, the most commonly used name is the correct one.

with labeler_names_cleaned as (
    select
        product_ndc,
        product_id,
        trim(
            regexp_replace(
                regexp_replace(
                    replace(
                        replace(lower(labeler_name), '&', ' and '),
                    '`', ''''),
                '[,\.]? ?(llc|ltd|inc|s\.?l|s\.?a|company)[\. ]?', ' ', 'g'),
            ' +', ' ', 'g')
        ) as labeler_name
    from {{ ref('stg_fda__ndc') }}
),

labeler_code_extracted as (
    select
        labeler_name,
        product_id,
        substring(product_ndc, 0, position('-' in product_ndc)) as labeler_code
    from labeler_names_cleaned
),

count_of_names_grouped_by_labeler_code as (
    select
        labeler_code,
        labeler_name,
        count(*) as number_of_occurrences
    from labeler_code_extracted
    group by labeler_code, labeler_name
),

labeler_name_ranked_by_occurrences as (
    select
        labeler_code,
        labeler_name,
        number_of_occurrences,
        row_number() over (
            partition by labeler_code
            order by number_of_occurrences desc, length(labeler_name) asc
        ) as rnk
    from count_of_names_grouped_by_labeler_code
),

assign_each_labeler_its_cleaned_name as (
    select
        labeler_code,
        labeler_name
    from labeler_name_ranked_by_occurrences
    where rnk = 1
)

select
    labeler_code_extracted.product_id,
    names_cleaned.labeler_name,
    names_cleaned.labeler_code
from assign_each_labeler_its_cleaned_name as names_cleaned
inner join labeler_code_extracted
    on names_cleaned.labeler_code = labeler_code_extracted.labeler_code

Let's break down this block of code.

The query is divided into Common Table Expressions (CTEs), which are individual queries that can be treated as temporary tables within the same query. This approach helps break down complex logic into smaller, more manageable chunks, making the query easier to write and understand. Additionally, the names of the CTEs serve as useful context, helping to clarify the purpose of each calculation.

The CTEs form a Directed Acyclic Graph (DAG), flowing from top to bottom, with each step building upon the previous one:

  1. labeler_names_cleaned – Applies basic rule-based cleaning. While this may evolve in the future, it currently converts all names to lowercase, removes terms like LLC, LTD, Inc., etc., and trims leading and trailing spaces.
  2. labeler_code_extracted – Extracts the first part of the NDC code, up to the position of the hyphen.
  3. count_of_names_grouped_by_labeler_code – Counts how many times each labeler name appears within a given labeler group.
  4. labeler_name_ranked_by_occurrences – Ranks labeler names by frequency within each group.
  5. assign_each_labeler_its_cleaned_name – Selects the most frequently used name as the standardized labeler name.
  6. Finally, the determined correct name is assigned to each drug entry in the dataset.

Notably, well-named CTEs can often serve as implicit documentation, reducing the need for additional comments.

Now, we can query the resulting model to evaluate how well this approach worked using our previous example. To do this, we first need to save the above query in a file named int_ndc_labelers_deduplicated.sql, then build the dbt project by running: dbt build
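If you prefer not to rebuild the entire project, dbt's node selection syntax can narrow the run to just this model and its associated tests, for example:

dbt build --select int_ndc_labelers_deduplicated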

Once the build completes, we can query the resulting model to inspect the output.

select labeler_name, count(1)
from dev_intermediate.int_ndc_labelers_deduplicated
where labeler_code = '0363'
group by labeler_name

Which produces:

labeler_name   count
walgreen       1094

Unfortunately, the wrong name was selected. It turns out that the official name, "Walgreens," is used less frequently than some of its variations. However, this outcome is not entirely unexpected—we rely on the assumption that the most commonly used name is correct in the majority of cases.
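That said, the deduplication itself holds by construction: every product within a labeler_code group receives the same name. A quick ad-hoc sketch against the dev schema used above can confirm that no duplicates remain:

-- Should return zero rows: each labeler_code maps to exactly one cleaned name
select labeler_code, count(distinct labeler_name) as distinct_names
from dev_intermediate.int_ndc_labelers_deduplicated
group by labeler_code
having count(distinct labeler_name) > 1;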

Testing the model

Built-in tests

With the model in place, the next step is to ensure it behaves as expected and establish safeguards for future changes. A good first step in achieving this is to create a YAML configuration.

version: 2

models:
  - name: int_ndc_labelers_deduplicated
    description: "A deduplicated list of drug labelers, with standardized names and extracted labeler codes."
    config:
      contract:
        enforced: true
    columns:
      - name: product_id
        data_type: text
        description: "A unique identifier for each product, created by concatenating the NDC product code with the SPL document ID."
        tests:
          - not_null
          - unique
      - name: labeler_name
        data_type: text
        description: "The standardized name of the drug labeler, derived from the raw dataset through cleaning and deduplication."
        tests:
          - not_null
      - name: labeler_code
        data_type: text
        description: "A numeric code identifying the labeler, extracted from the first segment of the NDC code."
        tests:
          - not_null

In this configuration, I have already included some built-in tests, as explained in the previous part of this series. These tests run after the model is created and are particularly useful for downstream models. With these safeguards in place, subsequent models can confidently rely on the integrity of the data.

When defining these tests, I considered the following:

  • The primary invariant for the entire processing pipeline is product_id, so it must be validated as both unique and not null, even if no transformations were applied to this column.
  • Since labeler_name and labeler_code have been modified or derived and are essential for each drug, I am enforcing a not-null constraint on them to ensure data completeness.
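These data tests will now run automatically as part of dbt build. To execute just this model's data tests on their own, the selection syntax works here too, for example:

dbt test --select int_ndc_labelers_deduplicated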

Unit tests

What are unit tests?

This is a great point to introduce a new type of test—unit tests. Unlike traditional data tests, unit tests focus on verifying logic rather than the data itself. Their purpose is to ensure that a specific block of logic behaves as expected.

In dbt, unit tests are black-box tests, meaning they do not rely on knowledge of the internal implementation of the code being tested. Instead, we define inputs and the expected outputs, and the testing framework executes the code with those inputs, verifying that the actual output matches what the developer has specified.

When to use unit tests?

Unit tests are especially valuable when a model contains conditional logic, regex functions, or other complex processing that goes beyond standard SQL functionality. However, they should not be used to validate whether built-in functions like AVG or MAX work correctly, as these are already well-tested within the database engine.

I think that the model we're currently working on contains logic complex enough to justify their use.

How to configure unit tests in dbt?

In dbt, unit tests are configured in a YAML file. While unit tests don't have to be defined within the configuration file of a specific model, I recommend keeping them there for better organization and navigation.

The following YAML structure is required to define a unit test:

unit_tests:
  - name: {{name_of_unit_test}}
    model: {{model_to_be_tested}}
    given:
      - input: {{mock_input_model}}
        rows:
          - { column_1: "1", column_2: "a" }
          # Add as many rows as needed to cover test cases

    expect:
      rows:
        - { result_col: "1" }
        # Define the expected output rows

Key Components of a Unit Test Configuration:

  • name: The name of the test should follow a consistent naming convention. I prefer descriptive names starting with test_should_, followed by what the test is verifying, e.g. test_should_extract_labeler_code_correctly.
  • model: The name of the model being tested.
  • given: Defines mock input data for the test. You don’t need to specify all columns from the actual table—only those relevant to the test. Add as many rows as needed to cover all edge cases.
  • expect: Specifies the expected output based on the given input. The framework will run the model with the defined input and verify if the output matches the expected results.

Writing the unit tests

To determine which parts of the code require testing, let's review the key operations performed in this model:

  1. Rule-based cleaning of the labeler_name.
  2. Extracting the labeler_code.
  3. Grouping drugs by labeler and counting how many times each name appears.
  4. Identifying the most frequently used name.
  5. Assigning the most common name to each drug.
  6. Producing the final output, which includes product_id, labeler_code, and labeler_name.

While most of the computation focuses on cleaning and selecting the labeler name, this is actually not the most critical outcome of the model. The labeler_code is what will be used in downstream logic, making its correctness essential.

So, what needs to be tested?

  1. Is the labeler_code extracted correctly?
  2. Is the labeler_name cleaned as expected?

Now, should we combine these into a single test or keep them separate?

Typically, I would opt for a single test. However, the rule-based cleaning logic has the potential to grow significantly as we analyze the data and discover new edge cases that were not initially considered. If both aspects were tested together, new test cases for labeler_name cleaning would also include columns for labeler_code extraction, even though the latter is much simpler to verify. This could lead to redundant test values appearing multiple times, reducing readability.

For better clarity and maintainability, it makes sense to keep these as separate unit tests.

Test: test_should_extract_labeler_code_correctly

This test ensures that the NDC code is correctly split and the labeler_code extracted as expected. Since the labeler_code is derived from the first segment of the NDC (before the hyphen), the test cases cover labeler codes of varying lengths.

unit_tests:
  - name: test_should_extract_labeler_code_correctly
    model: int_ndc_labelers_deduplicated
    given:
      - input: ref('stg_fda__ndc')
        rows:
          - { product_id: "1", product_ndc: "1-2" }
          - { product_id: "2", product_ndc: "11-2" }
          - { product_id: "3", product_ndc: "111-4" }
          - { product_id: "4", product_ndc: "1111-5" }
    expect:
      rows:
        - { product_id: "1", labeler_code: "1" }
        - { product_id: "2", labeler_code: "11" }
        - { product_id: "3", labeler_code: "111" }
        - { product_id: "4", labeler_code: "1111" }

Test: test_should_clean_and_select_most_commonly_used_name

This test ensures that the labeler name is properly cleaned and standardized, before selecting the most commonly used version. While the test covers multiple steps—initial cleaning transformations, grouping, and selecting the correct name—the most error-prone logic lies within the initial regex-based transformations. That’s where we will focus the majority of our testing efforts.

Why Is This Test Important?

  • The initial transformation logic has high potential for growth as more edge cases are discovered.
  • Regex-based cleaning can become increasingly complex and difficult to maintain over time.
  • Every modification must support not only new cases but also preserve correctness for previously handled cases.
  • Without proper testing, developers may avoid modifying the function out of fear of breaking existing functionality, leading to stagnant, suboptimal code.

  - name: test_should_clean_and_select_most_commonly_used_name
    model: int_ndc_labelers_deduplicated
    given:
      - input: ref('stg_fda__ndc')
        rows:
          - { product_ndc: "1-1", labeler_name: "a company" }
          - { product_ndc: "1-1", labeler_name: "a co." }
          - { product_ndc: "1-1", labeler_name: "a, co." }
          - { product_ndc: "1-1", labeler_name: "a, ltd." }
          - { product_ndc: "1-1", labeler_name: "a, inc." }
          - { product_ndc: "1-1", labeler_name: "a, company" }
          - { product_ndc: "1-1", labeler_name: "a, corporation." }
          - { product_ndc: "1-1", labeler_name: "a s.l." }
          - { product_ndc: "1-1", labeler_name: "a s.a." }
          - { product_ndc: "1-1", labeler_name: "aa " }
          - { product_ndc: "2-1", labeler_name: "b inc" }
          - { product_ndc: "2-1", labeler_name: "b " }
    expect:
      rows:
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "1", labeler_name: "a" }
        - { labeler_code: "2", labeler_name: "b" }
        - { labeler_code: "2", labeler_name: "b" }

Running the unit tests

Now that all the setup is complete, it's time to run the tests and verify our implementation. To execute only the unit tests, use the following command:

dbt test --select test_type:unit

This ensures that only the unit tests are executed, and produces:

16:40:18  Running with dbt=1.9.1
16:40:18  Registered adapter: postgres=1.9.0
16:40:18  Found 5 models, 15 data tests, 1 source, 569 macros, 2 unit tests
16:40:18  
16:40:18  Concurrency: 1 threads (target='dev')
16:40:18  
16:40:19  1 of 2 START unit_test int_ndc_labelers_deduplicated::test_should_clean_and_select_most_commonly_used_name  [RUN]
16:40:19  1 of 2 PASS int_ndc_labelers_deduplicated::test_should_clean_and_select_most_commonly_used_name  [PASS in 0.18s]
16:40:19  2 of 2 START unit_test int_ndc_labelers_deduplicated::test_should_extract_labeler_code_correctly  [RUN]
16:40:19  2 of 2 PASS int_ndc_labelers_deduplicated::test_should_extract_labeler_code_correctly  [PASS in 0.07s]
16:40:19  
16:40:19  Finished running 2 unit tests in 0 hours 0 minutes and 0.41 seconds (0.41s).
16:40:19  
16:40:19  Completed successfully
16:40:19  
16:40:19  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
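When iterating on a single test, you can also select it by name. Assuming a dbt version of 1.8 or later, where unit tests are addressable nodes, a command like this runs just one of them:

dbt test --select test_should_extract_labeler_code_correctly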

Conclusion

In this article, we expanded our project and introduced a new type of test to our toolkit—Unit Tests. Unlike other dbt tests, unit tests validate the logic rather than the data itself. This is especially important when building complex transformations that may evolve over time. By implementing unit tests early, we create a safety net that allows us to confidently modify and extend our logic without unintentionally breaking existing functionality.

Unit tests are just one component of a comprehensive testing strategy, and we're still in the early stages of developing this project. There are many improvements and additions ahead. Stay tuned for the next part, where we'll take a closer look at singular tests and explore how they can further strengthen data validation.
