How to Use dbt’s Model Governance Features in Large Projects

Published in

synvert Datadrivers

5 min readJun 8, 2023

With version 1.5 dbt introduced new features which makes model governance possible. This enhancement is a great to make collaboration between teams much easier, transparent and reliable. In this post I want to show how I implemented model governance in a dbt project with about 1000 models and multiple collaboration teams.

The need for model governance

Software engineers already know the advantages of interfaces, reliable APIs and internal classes or functions. Consider a big software project with 1000 public classes where all teams can access all objects any time without the knowledge of the team which owns the object. The result would be a mess and spaghetti code would be guaranteed. In dbt this was best practice until end of April 2023. Luckily they introduced some features to tackle this issue:

Groups can be added to each model to bulk them together. All models in the same group are accessible, even like functions in the same class. Each model in a group has the “access” property available, which is defined if the model is private to the group, usable in the same project (protected) or public (useable in other dbt projects).
Contracts reminds the model owning team about models and projects which may rely on them. This creates stability for the whole project and works like an interface in software engineering.
Obviously there have to be breaking changes in the schema of a model. This is where versioning comes into play. A team can publish their new model in a different version and ask depending teams to migrate, after a while.

Beside those features, we developed conventions which forbid the usage of models in some layers, as you can read in the next section. Obviously, they are another option to handle access management, although their compliance can not be tested automatically.

Who should access what?!

A good practice that works very well for me is to divide the data warehouse into these logical data layers:

Source: Data which is materialized in object storages can be referenced with external tables, which are defined here. Only the base layer is allowed to read data from this layer.
Base: These are incremental models or snapshots (for SCD-2 historization) which materializes the data inside the data warehouse without any transformation. Only the cleansing layer is allowed to read data from this layer.
Cleanse: The cleanse views deduplicates data, casts types and do some other cleansing stuff that might be useful. Only the core layer is allowed to read data from this layer. Up to this point, all the layers mentioned are modeled as specified by the source systems.
Core: Materialized entities are created here that are based on the same source systems. An example from an e-commerce company is inventory data. The logistics partner transmits data from different storage locations and types, which must be merged at this point.
Business: Models in this layer are materialized and uses the core models as sources. Its allowed to join across multiple source systems. From my Data Engineers point of view, this is where the magic happens and added value is created which is very useful for the business. To stay with the example above, inventory data from several logistic partners could be merged here.
Marts: Mart models are flat, materialized and query optimized, to serve as a performant source for depending applications, like reporting tools or ML pipelines. They only read from core and business models.

The three layers Source, Base and Cleanse are layers which are maintained by Data Engineers in a team, which is usually responsible for the dbt project as a whole. They are responsible to provide new data or update schemas, if the underlying source system has any changes. So, the first interface is between the Cleansing (access: protected) and the Core layer. Any model should also be versioned to make any schema change smooth, because many teams may rely on the some model, which makes it even harder to adopt changes in all dependent models.

Multiple teams work in Core and Business layer, which are not allowed to create any references among each other in this layer. On this way they can focus on the business logic only, they are very flexible regarding refactoring tasks and they don’t need to care of any other models which may rely on them.

Models they do want to share, living in the marts layer, which is versioned, has contracts and are accessible in a protected or public manner. Other teams which need to reference data from another team always rely on models they decided to share in the marts.

Overview of the access and group policies. Models between teams are only shareable with versioned, contracted data marts

Implementing model governance

The way to implement contracts and versioning is straight forward and I simply followed their advices and recommendations. If the project already has well maintained model documentation, it would be easy to just set the “contract” property. Otherwise one should do the homework and add documentation. I mean, you should have done this anyway — with or without contracts.

The interesting part is how to implement the group definitions and access modalities. While it is possible to define the model group in dbt_project.yml, one has to define the access configuration for each model itself.

The group “bi_platform” is responsible for source, base and cleansing layer.

Models in base layer are private inside the bi_platform group.

Models in cleansing layer are protected inside the bi_platform group and define contracts.

Conclusion

In my point of view the usage of model governance features increases the stability of the project, but it obviously comes at a price. It would be much easier to define the access property in the dbt_project.yml for all models, as this is possible for groups. For the moment, one might use custom macros which sets the access property in the model config macro for each model, depending on the logic layer.
Teams which developed project conventions which regulate layer access, would love the new features dbt provides. One more enhancement would be, when models might be accessible from a specific group but not from the whole project. This would support the enforcement of layer rules even more and support the refactoring from big dbt projects to small ones which can be depend on each other (as dbt published, this will be possible in future releases of dbt).

How to Use dbt’s Model Governance Features in Large Projects

The need for model governance

Who should access what?!

Implementing model governance

Conclusion

Written by Julian Frenzel