Why I like dbt and why you should too

Dbt (or data build tool) transforms the way you work with SQL. If you, like me, come from a software engineering background, you will feel right at home with the dbt workflow. So let's get to it: first, what is dbt?

Dbt is an open-source command line tool that helps analysts and engineers transform data in their warehouse more effectively.

Ok, this is what dbt is according to Wikipedia, but let's try to get a little better understanding than that. Working with dbt reminds me of working with a compiler: I write my dbt code, compile it into SQL and then execute my queries or tests right from the terminal using the dbt CLI. But don't think that is all dbt does; while the compiler is part of the core of dbt, it is much more than just that. I also mentioned that I can run my compiled SQL in the terminal, and this is another core part of dbt. By compiling and then running the code itself, dbt understands the dependencies between the different models and can do powerful things like run the models in order of their dependencies. These are some of the things I will cover in this article.

Dbt code

First, let's talk about dbt code and what it is. Dbt code is, simply put, SQL combined with Jinja and some extra features sprinkled on top.

Using Jinja together with SQL enables you to write things like loops and if statements, which can be great tools when creating models. However, it is those extra features that really let dbt shine. Probably the most important of these is the ref() function. The ref() function references other models in the dbt project, and it lets you reference a model using only the model name (same as the file name) without having to specify the database or schema.

select
  *
from {{ ref('model_name') }}

Instead, dbt lets you specify the database and schema in a config file. This makes it easy to change database and schema depending on whether you are in a development or production environment.
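For example, dbt_project.yml lets you set a custom schema per folder of models. A minimal sketch (the project and folder names are placeholders; the base schema comes from the profile target discussed later):

models:
  my_project:
    staging:
      +schema: staging   # models under models/staging/ get this custom schema
    marts:
      +schema: marts

# note: by default dbt appends the custom schema to the target schema,
# e.g. dev_staging in development and prod_staging in production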

By letting dbt handle the references between models, you allow dbt to build a dependency graph while compiling them. Using this dependency graph, dbt can then run all the models in order of their dependencies.

Another important function in dbt is the source() function. It works like the ref() function except that it references a source table instead of another dbt model, and just like with ref(), a .yml config file defines the sources.

select
  *
from {{ source('source_name', 'table_name') }}
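The corresponding source definition could look something like this (a minimal sketch; all names are placeholders):

version: 2

sources:
  - name: source_name        # the name used in the source() call
    database: raw_database   # where the raw data actually lives
    schema: raw_schema
    tables:
      - name: table_name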

Profiles

Before we start talking about the dbt CLI I have to mention profiles, which are only needed when using the dbt CLI, not when using dbt Cloud (which I will not discuss in this article because, honestly, I have no experience using it).

Dbt uses profiles and targets to know which data warehouse to connect to. Profiles and targets are defined in the file profiles.yml, which can live either inside the project directory or inside the ~/.dbt/ directory.

The beauty of this is that you can reuse the same profile and profiles.yml file for projects that use the same warehouse, and you can use different targets depending on your environment. A typical workflow includes at least one dev environment target and a prod environment target.

profile_name:              # a profile
  target: target_name      # defines the default target
  outputs:
    target_name:           # a target
      type: <bigquery | postgres | redshift | snowflake | other>
      schema: schema_name
      database: database_name
      account: account_name
      role: role_name
      user: user_name
      password: very_secure_password
      threads: 4
      warehouse: warehouse_name
      ### database-specific connection details
      ...

    second_target_name:    # a second target
      ...

second_profile_name:       # a second profile
  ...
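With the profile in place, you pick which target to run against from the command line; for example (the target name is a placeholder):

dbt run                 # uses the default target defined under `target:`
dbt run --target prod   # overrides the default, e.g. to run against production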

Materialization strategies

Another great feature of dbt is that it supports four different materialization strategies. A materialization strategy tells dbt how to persist dbt models in your warehouse: basically, how to build a model and what to do if the model already exists. The four strategies dbt supports are table, view, incremental and ephemeral.

Table

This materialization strategy makes dbt build a table, overwriting any existing table with the same name. A table is fast to query, so it is a good choice for models that are queried often, are used by downstream models, or where query latency is important. However, it's important to remember that a table doesn't always reflect the latest data from its source, because the source may have been updated after the table was built.
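The strategy is set per model, typically in a config block at the top of the model's .sql file. A minimal sketch (the upstream model name is a placeholder):

{{ config(materialized='table') }}

select
  *
from {{ ref('some_upstream_model') }}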

View

In many ways a view is the opposite of a table, but let's start with what they have in common. Like a table, a view is rebuilt entirely every time the model runs; it has no "memory" of previous runs. A view differs in that it doesn't store any data: whenever the view is queried, its underlying query is executed. This means it always reflects the latest data from its source, at the cost of speed. Stacking views on top of each other can make queries take a long time, so be careful not to use views too often.

Incremental

Using the incremental materialization strategy results in the model creating a table, just like the table strategy. Unlike table, however, incremental will not overwrite an existing table created by the model; instead it will either update existing rows or append new rows to the existing table.

Because you are updating or appending to an existing table, you don't have to transform data that the model has already transformed. This can greatly increase the speed of your models, because you don't process the same data more than once.

However, to be able to update rows in a table, dbt needs to know which rows to update and which rows to append. To do this we give each row a unique_key, which can consist of one or more columns. How to pick the unique_key depends on the data, but it's important that no column that is part of the unique_key contains null values.

For example, let's say we run a fruit store, and at the end of every day we want to update a table with information about how much fruit we have sold. Say we have the columns date, fruit_name and amount_sold. In this case it makes sense to set the unique_key to the combination of date and fruit_name, because within a single day each fruit_name is unique, and each date is unique because that is how dates work.

date        fruit_name  amount_sold
2023-01-10  Apple       10
2023-01-10  Banana      12
2023-01-10  Orange      8
2023-01-09  Apple       9
2023-01-09  Banana      10
2023-01-09  Orange      9
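As a sketch, an incremental model for this fruit store could look something like the following (the source names and the is_incremental() filter are illustrative assumptions, not taken from a real project):

{{ config(
    materialized='incremental',
    unique_key=['date', 'fruit_name']
) }}

select
    date,
    fruit_name,
    sum(amount) as amount_sold
from {{ source('store', 'sales') }}
{% if is_incremental() %}
  -- on incremental runs, only process days newer than what the table already holds
  where date > (select max(date) from {{ this }})
{% endif %}
group by 1, 2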

Ephemeral

The ephemeral materialization strategy is a bit different from the previous ones (to be honest, I have not used it myself). With this strategy no table is actually built in the warehouse. In that way it's similar to a view, but while it is still possible to select from a view, it is not possible to select from an ephemeral model. Instead, dbt interpolates the ephemeral model as a common table expression (CTE) into the models that depend on it.

Using the ephemeral materialization strategy can be a great way to reduce clutter in your warehouse, but be careful not to overuse it, as it can make your code difficult to debug.

The dbt CLI

Now you should have some understanding of the benefits of dbt and how to write dbt code. Next I want to talk about how to use dbt from the terminal to do things like compiling, testing and running your dbt code. I mentioned earlier that functions like ref() are crucial to get the most out of dbt; if you don't use dbt Cloud, the dbt CLI is just as crucial.

dbt compile

dbt compile is a dbt CLI command that compiles your dbt code and outputs the compiled code (pure SQL) to the target/ directory. This makes it much easier to debug the models, especially if you use a lot of Jinja code or macros. It can also save you time while developing models, because it gives you a quick feedback loop for any compile errors in your dbt code. Because the compilation happens on your local machine, you don't have to wait for the code to be uploaded to wherever you run your models.

The following is an example of dbt code and what it looks like after compilation.

-- dbt code
SELECT
  *
FROM {{ ref('model_name') }}

-- compiled SQL
SELECT
  *
FROM database_name.schema_name.model_name

Another benefit of inspecting the compiled code is that you can copy it and execute it directly, which makes for a faster feedback loop when debugging your models. Compilation is also the stage where dbt generates the dependency graph (as long as you use the ref() function), which is what allows dbt to run your models in order of their dependencies.

Figure 1: example DAG, source https://docs.getdbt.com/terms/elt

dbt run

Once you have your dbt code it's time to run it, and that is what the command dbt run does. Note, however, that you do not need to run dbt compile before dbt run: dbt run makes sure the models compile, which includes not just compiling the dbt code but also building the dependency graph that determines the order in which the models run. Running dbt run without any arguments runs all your models in dependency order, but it's also possible to run only specific models. This is done with the --select flag, which takes an argument specifying which model(s) to select. The following examples show different ways to use it.

dbt run --select my_dbt_project_name   # runs all models in your project
dbt run --select my_dbt_model          # runs a specific model
dbt run --select path.to.my.models     # runs all models in a specific directory
dbt run --select my_package.some_model # run a specific model in a specific package
dbt run --select tag:nightly           # run models with the "nightly" tag
dbt run --select path/to/models        # run models contained in path/to/models
dbt run --select path/to/my_model.sql  # run a specific model by its path

Finally, there is one more flag I want to mention: --full-refresh. As discussed earlier, dbt supports different materialization strategies. Imagine you are using the incremental strategy but, while developing, you want to rebuild the entire table instead of updating or appending to it. You could manually delete the table and then use dbt run to re-create it. This works fine, but dbt offers a better solution: the --full-refresh flag tells dbt to first drop the table and then build a new one, even when you are using the incremental materialization strategy.
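For example (the model name is a placeholder):

dbt run --select my_incremental_model --full-refresh   # drop and rebuild the table from scratch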

dbt test

In the beginning I mentioned that you can use the dbt CLI to run your model tests. I will cover how to define tests further down in this article; here I will just talk about how to run them. Once you have defined tests for your dbt models you can run them with the command dbt test. This works much like dbt run: running dbt test with no flags runs every test in your project, and just as with dbt run you can add the --select flag to specify which tests to run. The following examples show dbt test with different selectors.

dbt test --select one_specific_model                       # run tests for one_specific_model
dbt test --select some_package.*                           # run tests for all models in package
dbt test --select test_type:singular                       # run only tests defined singularly
dbt test --select test_type:generic                        # run only tests defined generically
dbt test --select one_specific_model,test_type:singular    # run singular tests limited to one_specific_model
dbt test --select one_specific_model,test_type:generic     # run generic tests limited to one_specific_model

dbt build

The last command I want to mention is dbt build. It doesn't really do anything the previous commands don't; what it does is combine the functionality of dbt compile, dbt run and dbt test. In other words, dbt build tells dbt to compile the code, run the models and run the tests. It also supports the same flags as the previous commands, most importantly --select and --full-refresh.
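For example (the selectors are placeholders):

dbt build --select tag:nightly              # compile, run and test all models tagged "nightly"
dbt build --select my_model --full-refresh  # rebuild one model from scratch, then test it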

Tests

Now it's finally time to discuss how to actually create tests in dbt. Dbt supports two kinds of tests: singular tests and generic tests. Singular tests are created for a specific model, which lets you tailor the test to that model and test logic specific to it. To reference which model to test you use the ref() function mentioned earlier. The following is an example of a singular test.

-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where this isn't true to make the test fail
select
    order_id,
    sum(amount) as total_amount
from {{ ref('fct_payments') }}
group by 1
having not(total_amount >= 0)

The other kind of test that dbt supports is the generic test, and these can be a little more interesting because they can run on any model. Because they can run on any model in your project they need to be, as the name implies, generic. An example of a generic test is one that makes sure a column contains no null values. To run a test like this you need two parameters: the model name and the column name to run the test on. Once the test is created, it can be used for any number of models and columns. Here is an example of a generic test.

{% test not_null(model, column_name) %}
    select *
    from {{ model }}
    where {{ column_name }} is null
{% endtest %}
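Once defined, a generic test is applied to a model's columns in a .yml file, and dbt fills in the model and column_name arguments. A minimal sketch (the model and column names are placeholders):

version: 2

models:
  - name: my_model
    columns:
      - name: customer_id
        tests:
          - not_null   # dbt passes my_model and customer_id into the test definition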

Some generic tests are so common that dbt ships with them built in. These are tests you will probably use on most of your dbt models, alongside your singular tests. The following sections describe these out-of-the-box tests.

Out of the box tests

unique

The first out-of-the-box test I will talk about is unique, which ensures that a column contains no duplicate values. The test takes a column name as input and makes sure every row has a unique value in that column; to check that a combination of columns is unique, you can for example test a surrogate key built from those columns. This test is important if you are using the incremental materialization strategy, since that requires each row to have a unique key combination.

Let's say you have a model that runs every day: it reads from its sources and calculates how much each customer has spent that day. In this example you probably want the combination of a unique customer identifier and the date to be unique. Because each row tells how much a customer has spent, we don't want two rows for the same customer on the same day.

not_null

Another test we will talk about is the not_null test, which checks whether the specified columns contain any null values. If you have columns that should never be null, just add this test. Like the unique test, it works well together with the incremental materialization strategy, because each column in the unique key must have a non-null value.

This test also pairs well with the unique test. Think back to the previous example, which makes sure that each customer appears only once per day: if the date or the customer identifier is null, the unique test will eventually start failing once there are multiple rows with these null values. So adding not_null tests for the columns you use in the unique combination is highly recommended.

accepted_values

The accepted_values test takes a column name and an array of accepted values, and makes sure that every row in that column contains one of those values. In the fruit store example, you could use it to make sure the fruit_name column only ever contains 'Apple', 'Banana' or 'Orange'. Note that this is a membership check against a fixed list; validating numeric ranges (say, making sure a recorded speed is never negative) is a job for a custom test instead.

relationships

The final out-of-the-box test I will discuss is relationships, which validates referential integrity between models. Going back to our customer example: we can use the relationships test to make sure that every customer id in our model exists in the customers source table, so the model never contains an unknown customer. Attaching the test in the other direction, on the source column pointing at the model, instead verifies that the model doesn't miss any customers.
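To tie these together, here is a sketch of what a schema.yml for the customer-spend example might look like (all model, column and source names are hypothetical):

version: 2

models:
  - name: daily_customer_spend
    columns:
      - name: customer_day_id    # e.g. a surrogate key built from customer_id and date
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: source('crm', 'customers')   # every customer_id must exist in the source
              field: customer_id
      - name: currency
        tests:
          - accepted_values:
              values: ['SEK', 'EUR', 'USD']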

Documentation

The last thing I want to mention is that dbt can generate a static documentation site for your models. Dbt automatically gathers information about your models, such as a DAG showing your model dependencies, columns, column types, table sizes and more. Besides the automatically generated information, you can add your own descriptions to both columns and tables. This can greatly help the consumers of these models understand them better.

To generate the documentation, just run the command dbt docs generate, but make sure to run dbt run first so you get up-to-date information about your models. Dbt also comes with the command dbt docs serve to let you try out the documentation locally. However, it's recommended to deploy the documentation with a CI/CD pipeline, so the documentation is updated whenever new changes are pushed to production. At Tele2 we use CI/CD to deploy our model documentation to GitLab Pages.
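The full local workflow is just a few commands:

dbt run            # build the models so the docs reflect their current state
dbt docs generate  # compile the documentation information into the target/ directory
dbt docs serve     # preview the documentation site locally (port 8080 by default)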

Figure 2: example documentation, source https://docs.getdbt.com/terms/elt

Conclusion

There are a lot of features I have barely mentioned and some I have not mentioned at all, but this article is getting longer and longer and I feel I could keep writing forever. Hopefully you now have some understanding of what dbt is and how you can use it in your projects. Dbt comes with plenty of great features that, once you have used them, you will not want to go without. I will end with a short summary of my favorite features mentioned in this article.

  • The workflow, driven from the CLI, reminds me of traditional software development.
  • I can configure profiles to set up connections to my warehouses across multiple projects.
  • The ref() and source() functions make it easy to reference tables inside and outside your dbt project.
  • It's easy to set up different materialization strategies for your models.
  • It comes with some handy out-of-the-box tests, especially if you are using the incremental materialization strategy.


By [Jens], 2023-07-18