Tanay Arora
Scaling dbt · Part 4

The dbt project
structure that scales.

A dbt project that nobody can navigate is a liability, not an asset. The decisions that prevent that from happening — layer boundaries, environment isolation, CI that actually runs fast enough to use — is worth being deliberate about.

Layer boundaries

The first decision on a blank project is how to organise the layers. Each layer needs a clear job — and models should only know about the layer below them. Without that constraint, models start doing too much and the project becomes hard to reason about as it grows.

Model layer contracts
Staging
One model per source table. Rename, cast, light cleaning only. No joins, no business logic. Nothing downstream should bypass this.
Intermediate
Ephemeral — compiles inline, never materialises. Where joins live. Can't be queried directly, which keeps the layer honest.
Marts
Business-entity grain. Stable, trusted. Knows nothing about BI tools or MCP — just models the business correctly.
BI
Materialised tables built for the BI tool. Optimised for query performance, not reuse. Consumers don't reach past this layer.
MCP
Curated layer for AI agents — pre-joined, columns restricted, access-controlled. Covered in Part 3.
The rule: a model only knows about the layer below it. Nothing reaches across layers.
Fig 1 — Six model layers, each with a clear contract about what it's allowed to do

Schema isolation: one macro, three environments

The most useful thing in the repo is a single Jinja macro that controls schema naming across all three environments. In development, each engineer gets their own isolated namespace — nobody steps on each other's work. In CI, schemas are scoped to the PR number. In production, schemas are clean with no prefix.

Schema naming by environment
Dev
dev_tanay_bi
dev_tanay_staging
dev_tanay_marts
Per-engineer namespace. Nobody shares schemas.
CI
pr_123_bi
pr_123_staging
pr_123_marts
PR-scoped. Torn down after every run.
Production
bi
staging
marts
Clean. No prefix. One macro controls all three.
Fig 2 — One macro derives the schema name from the target environment. No manual configuration per engineer.

The useful thing about this: A new engineer doesn't configure anything — they run dbt against their dev target and their schemas appear namespaced correctly. CI gets its own isolated space per PR, which means concurrent CI runs never conflict. And the cleanup macro drops everything scoped to a PR number after the run finishes, keeping the CI database clean.

A CI pipeline that only runs what changed

Running everything on every PR is what makes CI slow enough that engineers stop waiting for it. The first thing the pipeline does at Lyrebird is check whether any dbt files actually changed. If not, every downstream job is skipped — a Terraform-only PR doesn't wait for a dbt build, and the branch protection check still passes.

CI pipeline on every pull request
1
Change detection
Did any dbt files change? If no — skip everything. Branch protection still passes.
2
Lint changed files only
SQLFluff runs against the diff — not the whole project. Failures are always attributable to this PR.
3
Build in PR-scoped schema
dbt build --fail-fast in an isolated schema. If a model fails, CI stops immediately.
4
Drop PR schemas
A cleanup macro tears down all schemas prefixed with this PR number. Pass or fail — always runs.
A sentinel job acts as the branch protection check — passes whether build succeeded or was skipped
Fig 3 — Four steps, each with a clear purpose. Non-dbt PRs skip steps 2–4 entirely.

Branch protection requires a passing check — but if dbt didn't change, that check should still pass. A sentinel job handles this: it passes whether the build ran or was skipped. Non-dbt PRs are never blocked waiting for a check that won't run.

Deployment: merge to ship

Deployment triggers automatically on merge to main when dbt files changed. There's one non-obvious design decision: full-refresh runs are controlled by the PR title. If the title contains a specific flag, the deploy runs with full-refresh. Otherwise it's incremental. No separate workflow, no manual trigger — just a naming convention the pipeline reads before deciding how to build.

This matters because full refreshes on large incremental models are expensive and occasionally necessary. Having it controlled by the PR title means the decision is visible in the git history, reviewable in the PR, and doesn't require anyone to remember to flip a configuration somewhere.

Code style: lowercase everything

The linting config enforces a small set of rules that apply to every SQL file in the repo. The most important is the simplest: everything lowercase. Keywords, identifiers, functions, literals — all lowercase. Not because it's objectively correct, but because consistency removes a whole category of review comment. A PR that touches SQL should be reviewed for its logic, not its formatting.

Rule 1
Lowercase everything
Keywords, identifiers, functions, literals. No exceptions. The linter enforces it so reviewers don't have to.
Rule 2
120 character line limit
Long enough to be practical, short enough that side-by-side diffs are readable. Enforced by SQLFluff, not convention.
Rule 3
!= not <>
One not-equal style across the project. Picked c_style. Could have been either — what matters is it's consistent.

What a PR has to prove

Every PR uses a shared template with the same sections in the same order: what changed, why, how it was tested, and a security check confirming no credentials were hardcoded. The template is short enough that it doesn't slow anyone down, but specific enough to catch what usually gets missed — particularly the testing evidence.

The combination of isolated schemas, diff-based CI, and a consistent PR template means review conversations focus on modelling decisions. Not formatting, not missing tests, not "did you check this against prod?" — those are handled structurally before the review starts.

More from this series
01
AI & Data Strategy
From dashboards to decisions
02
Self-Service Analytics
Data at the point of decision
03
AI & Data Governance
Governed infrastructure for agentic AI
Tanay Arora
Senior Data Engineer · Melbourne, AU
LinkedIn GitHub Get in touch →
← AI & Data Governance Home →