Tanay Arora
AI & Data Governance · Part 3

Governed infrastructure for agentic AI.

The first time you connect an AI agent to a live warehouse, the question isn't whether it works. It's what stops it from seeing things it shouldn't. A system prompt is not an answer — it can be overridden, misunderstood, or simply ignored. The boundary has to live somewhere more permanent than that.

Why raw access isn't enough

Pointing an AI agent at raw warehouse tables is the same as handing someone a database connection with no data platform in front of it. You get results, but you have no control over what gets queried, no guarantee the definitions are consistent, and no isolation between the agent's workload and everything else running on that compute. It's not a starting point — it's a known failure mode.

The design constraint was clear from the start: the agent would query through the same governed interface built for human analysts — curated models, defined metrics, explicit access controls. Anything less wasn't a simpler version — it was no data platform at all.

The governing principle
AI agent (asks questions via MCP) → [governed] Read-only role (explicit grants, allowlist only) → [curated] Semantic layer (one definition per metric) → [filtered] Source data (never touched directly)
Isolation at every layer: compute · database · role · table · column
Fig 1: The agent never touches raw data directly. Every layer between them enforces a boundary.

Isolate everything

The first decision is physical separation. The agentic layer lives in its own database, runs on its own compute, and is entirely separate from the warehouse your ETL and BI tools use. This isn't just about security — it means an expensive agent query can't saturate the compute your production pipelines depend on, and a misconfigured grant in the agentic layer can't accidentally expose something in the main analytics database.

In practice this means: dedicated database, dedicated warehouse (auto-suspending when idle), and two roles — one for building, one for querying — with nothing shared between them.
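A minimal sketch of that separation as a configuration check. Every name here (`AGENTIC_DB`, `AGENTIC_WH`, the role names, the suspend timeouts) is hypothetical and stands in for whatever the real platform uses. The only point is that the agentic layer and the analytics stack share no database, warehouse, or role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One isolated slice of the platform (all names are hypothetical)."""
    name: str
    database: str
    warehouse: str
    roles: frozenset[str]
    auto_suspend_seconds: int


def shared_resources(a: Workload, b: Workload) -> list[str]:
    """List every database, warehouse, or role that both workloads use."""
    shared = []
    if a.database == b.database:
        shared.append(f"database:{a.database}")
    if a.warehouse == b.warehouse:
        shared.append(f"warehouse:{a.warehouse}")
    shared.extend(f"role:{role}" for role in sorted(a.roles & b.roles))
    return shared


agentic = Workload(
    name="agentic",
    database="AGENTIC_DB",
    warehouse="AGENTIC_WH",
    roles=frozenset({"AGENT_BUILDER", "AGENT_READER"}),
    auto_suspend_seconds=60,  # assumed value; the text only says "auto-suspending when idle"
)
analytics = Workload(
    name="analytics",
    database="ANALYTICS_DB",
    warehouse="ETL_WH",
    roles=frozenset({"TRANSFORMER", "BI_READER"}),
    auto_suspend_seconds=300,  # assumed value
)

# An empty list means the two workloads are fully separated.
print(shared_resources(agentic, analytics))
```

A check like this could run in CI against the declared infrastructure, so a change that quietly reuses the ETL warehouse for the agent fails review before it ships.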

Two roles, one job each

The access model has exactly two roles. The first is for the data team — it can create objects, manage grants, and configure the agentic infrastructure. The second is what the AI agent authenticates as at runtime. It can only read, and only from an explicit allowlist.

Builder role: Used by the data team to create semantic views, configure MCP servers, and manage grants. Never used at query time. (Create objects · Manage grants · Data team only)
Agent role: What the AI agent authenticates as. Read-only. Scoped to an explicit allowlist of approved tables and views. (SELECT only · Allowlisted tables · Semantic views)

The agent cannot query a table it was never granted access to. That boundary lives at the database level — not in a system prompt that could be overridden or ignored in a future conversation.
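One way to keep the two roles from drifting is a review-time check over proposed grants. The sketch below assumes hypothetical role names (`AGENT_BUILDER`, `AGENT_READER`), hypothetical object names, and a simplified privilege vocabulary. It does not replace the database's own enforcement; it only flags a grant that would give the agent role anything beyond SELECT.

```python
BUILDER_ROLE = "AGENT_BUILDER"  # hypothetical role names
AGENT_ROLE = "AGENT_READER"

# Privileges each role may hold. The agent role gets SELECT and nothing else.
ALLOWED_PRIVILEGES = {
    BUILDER_ROLE: {"CREATE", "GRANT", "SELECT"},
    AGENT_ROLE: {"SELECT"},
}


def validate_grants(grants: list[tuple[str, str, str]]) -> list[str]:
    """Return one message per (role, privilege, object) grant that gives
    a role a privilege outside its allowed set."""
    violations = []
    for role, privilege, obj in grants:
        if privilege not in ALLOWED_PRIVILEGES.get(role, set()):
            violations.append(f"{role} must not hold {privilege} on {obj}")
    return violations


proposed = [
    (BUILDER_ROLE, "CREATE", "AGENTIC_DB.SEMANTIC"),
    (AGENT_ROLE, "SELECT", "AGENTIC_DB.SEMANTIC.USAGE_METRICS"),
    (AGENT_ROLE, "INSERT", "AGENTIC_DB.SEMANTIC.USAGE_METRICS"),  # should be flagged
]

for problem in validate_grants(proposed):
    print(problem)
```

Unknown roles are flagged too, because a role with no entry in `ALLOWED_PRIVILEGES` is allowed nothing.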

What the agent can and can't see

The agent role has an explicit allowlist. Querying any table outside that list returns an access error, not a result. The list is deliberately narrow: user behaviour data, product usage metrics, and sales call records. Billing records, row-level revenue tables, and raw source tables are never granted; financial data is exposed only as aggregated metrics.

| Data type | What it contains | Access |
|---|---|---|
| Behavioural data | Curated activity events, pre-joined with context, with sensitive columns stripped before the agent sees them | Granted |
| Business metrics | Aggregated product usage and engagement, modelled at a safe grain rather than as raw transactions | Granted |
| Operational intelligence | Sales, support, and engagement signals for theme and pattern analysis, not individual records | Granted |
| Financial data | Aggregated revenue metrics only; individual billing records, subscription details, and raw financial data are blocked | Aggregated |
| Personal information | Contact details, identifiers, and anything that could surface an individual; stripped at the data layer | Blocked |
| Raw source tables | Unmodelled production data; the agent queries pre-joined, aggregated fact tables, never raw sources directly | Blocked |
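To make "an access error, not a result" concrete, here is a small runnable illustration using SQLite's authorizer callback as a stand-in for warehouse role grants. That substitution is an assumption made purely for the demo: the real boundary is the agent role's grants in the warehouse, and the table names below are hypothetical.

```python
import sqlite3

# Hypothetical allowlist, standing in for the objects granted to the agent role.
ALLOWLIST = {"usage_events", "engagement_metrics"}


def agent_authorizer(action, arg1, arg2, db_name, source):
    """Allow SELECT statements, and column reads only on allowlisted tables."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWLIST:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY  # writes, DDL, and reads of any other table


conn = sqlite3.connect(":memory:")
# Set-up runs before the authorizer is installed, standing in for the builder role.
conn.executescript("""
    CREATE TABLE usage_events (user_key TEXT, event TEXT);
    CREATE TABLE billing (user_key TEXT, amount REAL);
    INSERT INTO usage_events VALUES ('u1', 'login');
    INSERT INTO billing VALUES ('u1', 99.0);
""")
conn.commit()
conn.set_authorizer(agent_authorizer)  # from here on, the connection acts as the agent


def run_as_agent(sql: str):
    """Return rows, or the string 'access error' if the database refuses the query."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return "access error"


print(run_as_agent("SELECT event FROM usage_events"))  # allowlisted table
print(run_as_agent("SELECT amount FROM billing"))      # not on the allowlist
print(run_as_agent("DELETE FROM usage_events"))        # a write, so denied
```

The refusal comes from the database engine when it compiles the statement, not from anything the caller was told to do, which is the property the allowlist relies on.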

A curated data layer between agent and source

Even within the allowlisted tables, the agent doesn't query raw data directly. Every approved table goes through a curated layer first — a set of models built specifically for AI consumers that pre-join context, strip sensitive columns, and expose only the fields the agent genuinely needs.

01 · Pre-join: The agent doesn't navigate table relationships. The curated layer does that upstream, joining user context, practice details, and event metadata into a single clean table before the agent ever sees it.
02 · Column restriction: Only the columns the agent needs are selected. Billing fields, contact details, and internal IDs are stripped at this layer, not filtered by the agent's instructions, which can be overridden.
03 · Automatic grants on deploy: Each curated model automatically grants read access to the agent role when deployed to production, and only production. In development and CI environments, no grant is issued. The access control ships with the model, not as a separate manual step that could be forgotten.
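The three steps above can be sketched in a few lines. This uses an in-memory SQLite database so it runs anywhere; the schema, the view name `agent_activity`, the role name, and the Snowflake-style GRANT text are all assumptions made for illustration. In a dbt project the conditional grant might live in a `grants` config or a post-hook keyed on the target name, but the text above doesn't say which.

```python
import sqlite3

AGENT_ROLE = "AGENT_READER"  # hypothetical role name


def grant_on_deploy(model: str, target: str) -> list[str]:
    """Return the grant statements to run after a model deploys.
    Only the production target issues a grant; dev and CI issue none."""
    if target != "prod":
        return []
    return [f"GRANT SELECT ON {model} TO ROLE {AGENT_ROLE}"]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw sources (hypothetical schema) that the agent never queries directly.
    CREATE TABLE raw_users (user_id INTEGER, email TEXT, plan TEXT);
    CREATE TABLE raw_events (event_id INTEGER, user_id INTEGER, event_name TEXT);
    INSERT INTO raw_users VALUES (1, 'a@example.com', 'pro');
    INSERT INTO raw_events VALUES (10, 1, 'consult_started');

    -- Curated model: the join happens here, and only safe columns are selected.
    CREATE VIEW agent_activity AS
    SELECT e.event_name, u.plan
    FROM raw_events AS e
    JOIN raw_users AS u ON u.user_id = e.user_id;
""")

cursor = conn.execute("SELECT * FROM agent_activity")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)  # email and the internal IDs are absent from the curated model
print(rows)
print(grant_on_deploy("agent_activity", "prod"))
print(grant_on_deploy("agent_activity", "ci"))
```

Even a `SELECT *` against the curated model cannot return `email` or `user_id`, because those columns were never selected into it.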

Semantic views: one definition per metric

On top of the curated data layer sits a semantic layer — views that add business-friendly names, metric definitions, and natural language descriptions so the agent understands the data without guessing from column names or inventing its own logic.

The critical property: every metric is defined exactly once. Total consults, activation rate, average generation time — every consumer of the semantic layer gets the same definition. There's no version of "activation rate" that calculates differently depending on which tool is querying. The agent reads the semantic view; the semantic view enforces the definition.
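As an illustration of "defined exactly once", here is a toy registry that refuses a second definition for the same metric name. It is not how any particular semantic-layer product works; the class, the column `is_activated`, and the formulas are invented for the example.

```python
class MetricRegistry:
    """Holds exactly one SQL expression per metric name."""

    def __init__(self) -> None:
        self._definitions: dict[str, str] = {}

    def define(self, name: str, expression: str) -> None:
        """Register a metric; a second definition for the same name is rejected."""
        if name in self._definitions:
            raise ValueError(f"metric {name!r} is already defined")
        self._definitions[name] = expression

    def select(self, metrics: list[str], source: str) -> str:
        """Build a query from registered definitions; unknown metrics raise KeyError."""
        columns = ", ".join(f"{self._definitions[m]} AS {m}" for m in metrics)
        return f"SELECT {columns} FROM {source}"


registry = MetricRegistry()
# Illustrative formulas only; these are not the article's actual definitions.
registry.define("total_consults", "COUNT(*)")
registry.define("activation_rate", "AVG(is_activated)")

print(registry.select(["activation_rate"], "agent_activity"))

try:
    registry.define("activation_rate", "SUM(is_activated)")
except ValueError as err:
    print(err)  # the competing definition is rejected
```

Every consumer that asks for `activation_rate` gets the same expression, and a competing definition fails loudly instead of silently coexisting.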

The full deployment stack
AI agent: queries semantic views via MCP server · authenticates as agent role
Semantic layer: governed metric definitions · business-readable names · one source of truth
Curated data layer (dbt): pre-joined · columns restricted · auto-grants on prod deploy only
Source tables (allowlisted): 7 approved tables · revenue and PII blocked · agent role never touches raw sources
Isolated compute & database: dedicated warehouse · auto-suspends when idle · no shared resources with ETL or BI
Fig 2: The full governance stack, with isolation at compute, database, role, table, and column level

What this prevents

This design makes three failure modes structurally impossible, not just unlikely:

01 · Data leakage: The agent tries to query a revenue or PII table. The database returns an access error, not a result. No prompt engineering is required to hold that boundary. It exists at the role level and cannot be overridden by a conversation.
02 · Metric hallucination: The agent invents its own definition of "activation rate." The semantic view already defines what that metric means, and the agent queries the view rather than the raw table, so the invented definition never reaches a result.
03 · Resource contention: An agentic query generates unexpected load. It hits an isolated warehouse that auto-suspends when idle, so it can't consume compute budgeted for production ETL or BI queries, and cost stays predictable.

The whole pattern is version-controlled in Terraform — roles, grants, semantic view definitions, warehouse config. That means access controls are reviewable in a pull request like any other infrastructure change, not managed through a UI where history is hard to audit.

The principle that generalises

The components ended up being straightforward: a dedicated role scoped to an explicit allowlist, a curated data layer that pre-filters before the agent sees anything, a semantic layer that enforces metric definitions, and isolated compute. What took time was the design — deciding what the agent should and shouldn't see, and making that decision live in the infrastructure rather than in a conversation.

What makes it hold up over time is that none of the boundaries depend on the agent behaving correctly. They're enforced by the database, the role system, and the data model. The agent can't accidentally or deliberately step outside them — and when a new team member asks "what can the agent see?", the answer lives in version-controlled Terraform, not someone's memory.

Tanay Arora
Senior Data Engineer · Melbourne, AU