Simulating CDC with dbt snapshots

In data modeling, it's common to encounter problems where it's not sufficient to know the data values of an entity right now; you also need to know what the values were at a previous point in time. This is most commonly a data analytics problem, which means this type of model usually won't be present in native domain tables.

Data changes over time

Let's break down the problem. I'm going to use tables / scenarios from Microsoft's ubiquitous Contoso model. This is fairly well-known, quite generic, and easy to obtain. In fact, if you're interested in downloading data sets in various sizes for these tables (and more), they're available here, and a tool that lets you generate sample data of your own can be found here (might be worth checking this out for generating test data for your own schema, too).

I'm going to focus on some hypothetical changes to the Product table. The Contoso tables are fairly well-engineered for "demo" applications, and the Product table is no different. This is a good baseline for what you'd expect to see in a table like this in a real application. The columns are pretty straightforward, and in this exercise, I'm going to focus on changes that might occur in cost or price over time. Some of these columns are really unlikely to change without resulting in a new SKU.

Column Name      Data Type  Notes
---------------  ---------  --------------------------
productkey       int4       synthetic key
productcode      int4       natural key
productname      text
manufacturer     text
brand            text
color            text
weightunit       text
weight           float8
cost             float8     likely to change over time
price            float8     likely to change over time
categorykey      int4
categoryname     text
subcategorykey   int4
subcategoryname  text

In our imaginary data warehouse scenario, we'd like to be able to see changes in columns like cost, price, category, and so on over time. Perhaps we're interested in the historic performance of products from different manufacturers (or fill in your own hypothetical). Since these changes over time aren't already present in the source data, we're going to show how to generate these rows using dbt.

In data warehousing, visibility into changes like this over time is known as a slowly changing dimension (SCD), and there are a number of ways to model one. In this example, we're going to show type 2, which creates a new row when changes are detected. To do this sort of change detection, you'd need to be able to tell when a row has changed, implying the need for a row ID. Then, each field in the row would have to be compared to see if any of them have changed. Not sophisticated, but unwieldy. Luckily, dbt can do this for us with some simple declarative syntax.
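The field-by-field comparison described above can be sketched in a few lines. This is an illustration of the idea only, not dbt's actual implementation (dbt performs the equivalent comparison in SQL), and the column names are hypothetical:

```python
# Sketch of "check"-style change detection: hash the watched columns so a
# single comparison per row detects a change in any of them.
import hashlib

def row_hash(row: dict, check_cols: list) -> str:
    """Hash the watched columns into one comparable value."""
    payload = "|".join(str(row[c]) for c in check_cols)
    return hashlib.md5(payload.encode()).hexdigest()

def detect_changes(previous: dict, current: dict, check_cols: list) -> list:
    """Return the unique keys whose watched columns changed since last run."""
    return [key for key, row in current.items()
            if key in previous
            and row_hash(row, check_cols) != row_hash(previous[key], check_cols)]

# Hypothetical before/after state keyed by productkey
check_cols = ["cost", "price"]
before = {1: {"cost": 6.62, "price": 12.99}}
after = {1: {"cost": 6.62, "price": 15.99}}
print(detect_changes(before, after, check_cols))  # → [1]
```

A snapshot run that finds a changed key would then close out the old row (set its end date) and insert the new version.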

dbt Snapshots

If you're building a data warehouse, there's a good chance you're already using dbt. This open-source tool is available in host-it-yourself form (dbt Core) or as a SaaS platform (dbt Cloud). The example shown here is built for dbt Core, and is available on GitHub. This example uses a dbt seed to load the Contoso Product table - not the way you'd do this at scale, but it makes the demo more portable. Another nod to demo portability - we're using Postgres as our sample data warehouse. All the tools here are fairly agnostic with respect to database, so these techniques adapt to your tool of choice.

The dbt syntax to compute the type-2 SCD is called a snapshot. In our example, we name the output table (producthistory) and the source to be monitored (product). In the config section, note the strategy, which can be either "timestamp" if your source data has a last-updated-date column, or "check", which compares column values. Also note "check_cols" - the value of "all" here is quick and easy, but a list of specific columns is typically preferred, for the same reasons we don't use "select *".

snapshots:
  - name: producthistory
    relation: ref('CDC_with_dbt', 'product')
    description: Slowly-changing dimension view of product
    config:
      database: cdcdemo
      strategy: check
      unique_key: ProductKey
      check_cols: all
      hard_deletes: invalidate 

Let's take a look at these snapshots in action. We're going to make changes to the following two rows (shown here in JSON to list fields as rows), updating the productname and price columns:

{
"select * from dbt.product where productcode in (101001, 101002)": [
	{
		"productkey" : 1,
		"productcode" : 101001,
		"productname" : "Contoso 512MB MP3 Player E51 Silver",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Silver",
		"weightunit" : "ounces",
		"weight" : 4.8,
		"cost" : 6.62,
		"price" : 12.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3"
	},
	{
		"productkey" : 2,
		"productcode" : 101002,
		"productname" : "Contoso 512MB MP3 Player E51 Blue",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Blue",
		"weightunit" : "ounces",
		"weight" : 4.1,
		"cost" : 6.62,
		"price" : 12.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3"
	}
]}

When we update these fields and run dbt against the updated data, the producthistory table will now show four rows, corresponding to the two original rows and the updated version of each respective row:

{
"select * from dbt.producthistory where productcode in (101001, 101002)": [
	{
		"productkey" : 1,
		"productcode" : 101001,
		"productname" : "Contoso 512MB MP3 Player E51 Silver",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Silver",
		"weightunit" : "ounces",
		"weight" : 4.8,
		"cost" : 6.62,
		"price" : 12.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3",
		"dbt_scd_id" : "daef36db7500c9e531447a11e0e3107a",
		"dbt_updated_at" : "2025-01-17T00:35:19.923Z",
		"dbt_valid_from" : "2025-01-17T00:35:19.923Z",
		"dbt_valid_to" : "2025-01-17T00:37:05.581Z"
	},
	{
		"productkey" : 2,
		"productcode" : 101002,
		"productname" : "Contoso 512MB MP3 Player E51 Blue",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Blue",
		"weightunit" : "ounces",
		"weight" : 4.1,
		"cost" : 6.62,
		"price" : 12.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3",
		"dbt_scd_id" : "4b8654fbe0b5aad13f42ec52524dbe2b",
		"dbt_updated_at" : "2025-01-17T00:35:19.923Z",
		"dbt_valid_from" : "2025-01-17T00:35:19.923Z",
		"dbt_valid_to" : "2025-01-17T00:37:05.581Z"
	},
	{
		"productkey" : 1,
		"productcode" : 101001,
		"productname" : "Contoso 512MB MP3 Player E52 Silver",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Silver",
		"weightunit" : "ounces",
		"weight" : 4.8,
		"cost" : 6.62,
		"price" : 15.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3",
		"dbt_scd_id" : "11d26c8c79e5e304ec85f86913a440b2",
		"dbt_updated_at" : "2025-01-17T00:37:05.581Z",
		"dbt_valid_from" : "2025-01-17T00:37:05.581Z",
		"dbt_valid_to" : null
	},
	{
		"productkey" : 2,
		"productcode" : 101002,
		"productname" : "Contoso 512MB MP3 Player E52 Blue",
		"manufacturer" : "Contoso, Ltd",
		"brand" : "Contoso",
		"color" : "Blue",
		"weightunit" : "ounces",
		"weight" : 4.1,
		"cost" : 6.62,
		"price" : 15.99,
		"categorykey" : 1,
		"categoryname" : "Audio",
		"subcategorykey" : 101,
		"subcategoryname" : "MP4&MP3",
		"dbt_scd_id" : "fd80b5a5d49a315dea7ca9f302065ec7",
		"dbt_updated_at" : "2025-01-17T00:37:05.581Z",
		"dbt_valid_from" : "2025-01-17T00:37:05.581Z",
		"dbt_valid_to" : null
	}
]}

Note the "dbt*" fields added to each record here:

  • dbt_scd_id - a computed unique value based on the primary key and the updated-at timestamp.
  • dbt_updated_at - when dbt last touched this record. In our case, "touch" is always going to be "add".
  • dbt_valid_from - the beginning of the effective date window for each record. In the example above, note how dbt_valid_from is set to "now" when the model was initially built. If you need to synthetically backdate initial records to a datetime.min value, you'd need to do that in a subsequent update.
  • dbt_valid_to - the end of the effective date window. Note that for the current record, this value is left null to indicate no expiration date. It won't be set until / unless a new version of that record is created.
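To see how the effective date window is used, here's a sketch of an "as-of" lookup against producthistory. This is a Python stand-in for the SQL you'd actually run, with a couple of illustrative rows and simplified timestamps:

```python
# Point-in-time lookup: find the version of a product that was effective
# at a given moment, using the dbt_valid_from / dbt_valid_to window.
from datetime import datetime

# Illustrative history rows (abridged to the fields that matter here)
history = [
    {"productkey": 1, "price": 12.99,
     "dbt_valid_from": datetime(2025, 1, 17, 0, 35),
     "dbt_valid_to": datetime(2025, 1, 17, 0, 37)},
    {"productkey": 1, "price": 15.99,
     "dbt_valid_from": datetime(2025, 1, 17, 0, 37),
     "dbt_valid_to": None},  # current version: no expiration
]

def as_of(history, key, when):
    """Return the version of the product that was effective at `when`."""
    for row in history:
        if (row["productkey"] == key
                and row["dbt_valid_from"] <= when
                and (row["dbt_valid_to"] is None or when < row["dbt_valid_to"])):
            return row
    return None

print(as_of(history, 1, datetime(2025, 1, 17, 0, 36))["price"])  # → 12.99
```

The null dbt_valid_to is what makes the current record match any "as of now" query.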

Limits of SCD

The example above is an illustration of synthesizing type-2 SCD records in a worst-case scenario, and as such, it shows some of the pitfalls of this approach. The biggest issue, of course, is that the assignment of effective dates (begin and end) is tied directly to when (and how often) dbt is run. This can lead to some misaligned time-based relationships in the warehouse.

If your source data happens to have a last-updated date, of course, you can use the "timestamp" strategy in dbt's snapshot configuration, which narrows the window for problems, but won't eliminate them. One of the keys here is to remember that this is a strategy for slowly changing dimensions - the more active these data sets become, the more susceptible you are to errors.
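For reference, the snapshot shown earlier could be switched to the timestamp strategy with a small config change. A sketch, assuming the source had a hypothetical lastmodified column:

```yaml
snapshots:
  - name: producthistory
    relation: ref('CDC_with_dbt', 'product')
    config:
      database: cdcdemo
      strategy: timestamp
      unique_key: ProductKey
      updated_at: lastmodified   # hypothetical last-updated column
      hard_deletes: invalidate
```

With this strategy, dbt compares updated_at values instead of column contents, so the effective dates come from the source system rather than from when dbt happened to run.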

If you need better accuracy for quickly-changing dimensions, consider modeling relationships in your transactional data. This is something you can see to an extent in the Sales table of the Contoso dataset, where prices and costs of products are saved right in the sales transaction. This captures the attributes that are subject to change in the Product dataset and are also of interest to Sales, reducing the need to link to a specific version of a Product. You can also use a near-real-time CDC tool like Fivetran, which is capable of monitoring source datasets for changes and triggering data events that can feed your warehouse with much higher fidelity than the solution shown here. Each of these more accurate techniques comes with a cost, though, so it's worth keeping this dbt option in your toolbox.