From Python Notebooks to YAML Contracts: How a Declarative Ingestion Framework Scaled Data Lake Operations

TL;DR

  1. We put a declarative ingestion stack for the Data Lake into production, based on YAML contracts.
  2. Today we operate a massive data footprint with about 7 PB of data, ~8,000 transactional tables, and ~850 declarative YAMLs.
  3. We moved from a scattered model of local implementations to one based on 1 table : 1 YAML and 2 core notebooks.
  4. The new flow already covers about 85% of the Source → Bronze → Silver path.
  5. The estimated time to put a new ingestion into production dropped from days to hours.

The Scale Problem That Became an Architecture Problem

For a long time, the problem was not getting data into the Data Lake. The problem was growing without turning every new ingestion into more structural cost.

Today, CERC operates a platform with about 7 PB of data and ~8,000 transactional tables. At that scale, ingestion stops being a script. It becomes platform infrastructure.

When the operation was smaller, the old model seemed acceptable. Each domain created its own notebooks, its own standards, and, in some cases, its own repository. That gave local freedom. It also created structural divergence.

Over time, the bill came due. Maintenance effort started growing faster than the value delivered by each new source. The real cost was not only compute. It was engineering time spent repeating structure, reviewing variations of the same idea, and rebuilding context for every new ingestion.

This problem was more visible in the Source → Bronze → Silver flow, which concentrates a large part of the Data Lake’s operational surface. In that stretch, small implementation differences became more review, more maintenance, and less speed.

The pain showed up on four fronts:

Too much repeated code

Each new ingestion repeated the same structural base, with variations that were hard to govern.

Low speed

Creating a new source took days because the work was implementing a pipeline, not declaring an ingestion.

Weak governance

The expected standard was not always the executed standard because each implementation had too much freedom.

High cognitive cost

Every change required understanding local decisions before touching anything.

This was no longer a style question. It was an operability question.


The Model Change

Reducing the number of notebooks was not enough. We needed to change the ingestion development paradigm.

The goal was to move from a model where each team described how to execute an ingestion to one where the team declared what had to be ingested and the platform handled the rest.

In practice, that meant centralizing in the stack core what had been spread out before: contract validation, environment resolution, Bronze and Silver publishing, delete handling, and schema rules.

The criteria were straightforward:

  1. Standardize most workflows without leaving too much room for structural exceptions.
  2. Reduce the platform’s maintenance surface.
  3. Speed up the onboarding of new sources into the Data Lake.
  4. Strengthen governance without turning the platform team into a manual bottleneck.

When we framed the problem that way, the decision became clear. The bottleneck was not a lack of notebooks. It was an excess of structural freedom.


The Declarative Contract

The philosophy of the new stack can be summarized in one sentence: make the right thing the easy thing.

A new ingestion no longer starts with a Python notebook. It starts with a YAML contract. That contract describes metadata, source, destination, schema, and publishing rules. The YAML became the platform’s human interface. The runtime remained reusable code.

In broad terms, an ingestion follows this pattern:

metadata:
  table_description: "Functional description of the table"
  table_source_owner: "source-owner-team"
  table_datalake_owner: "datalake-owner-team"
  ingestion_type: batch
  ingestion_mode: full

workflow:
  name: source-bronze-silver-table-name
  schedule_america_sp: "25 03 * * *"

ingestion:
  bronze:
    source:
      prd:
        format: cloud-spanner
        dynamic_configs:
          project_id: "prd-project"
          instance_id: "source-instance"
          database_id: "source-database"
          table: "source_table_name"
    destination:
      format: parquet
      unity:
        schema_unity: "domain_bronze"
        table_unity: "bronze_table_name"

  silver:
    destination:
      format: delta
      unity:
        schema_unity: "domain_silver"
        table_unity: "TB_SILVER_TABLE_NAME"
    schema_config:
      partition_by: ["CuratedDt"]
      columns:
        - source_name: source_id
          silver_name: Id
          datatype: STRING
          primary_key: true
        - source_name: operation_date
          silver_name: OperationDate
          datatype: DATE
          primary_key: false
        - source_name: financial_amount
          silver_name: FinancialAmount
          datatype: FLOAT
          primary_key: false
        - source_name: payment_date
          silver_name: PaymentDate
          datatype: DATE
          primary_key: false

The important point is this: the YAML does not describe only the table name. It describes the ingestion contract for a table.

In the new model, this is the main authorship unit: 1 table : 1 YAML. The engineer describes the ingestion. The platform decides how to execute it.


How the Stack Executes the Contract

The YAML does not go straight to production. Before that, the stack validates the contract and turns it into valid execution parameters.

In practice, the flow follows this order:

  1. An engineer creates or updates a YAML spec.
  2. The spec goes through structural and semantic validation.
  3. The platform turns the spec into execution parameters by loading the YAML as a dictionary at runtime.
  4. Two core notebooks execute the contract in Bronze and Silver with the parameters from step 3.
  5. The ingestion runs with standardized paths, formats, and rules based on the parameters extracted from the YAML.

This design reduces a classic platform mistake: the pipeline works, but each team implements it in a different way.
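
To make steps 2 and 3 concrete, here is a minimal sketch of how a contract can be loaded and turned into parameters, assuming PyYAML; the file path and key access are illustrative, not the platform's actual code:

import yaml

# Step 3: load the declarative contract as a plain dictionary.
with open("contracts/domain/source_table_name.yaml") as f:
    spec = yaml.safe_load(f)

# Parameters the two core notebooks receive, extracted from the dictionary.
bronze_params = spec["ingestion"]["bronze"]
silver_params = spec["ingestion"]["silver"]
workflow_name = spec["workflow"]["name"]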

At the runtime core, the split is simple:

  1. The Bronze notebook reads the source and writes the data to a standardized path in the bronze Google Cloud Storage bucket.
  2. The Silver notebook reads Bronze, applies schema, casting, and deduplication, and publishes the final table to the silver Google Cloud Storage bucket (a sketch follows).
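
As a rough illustration of what the Silver step does with the contract's schema_config, here is a hedged PySpark sketch; the function name is ours, and the real notebook also handles trimming, computed columns, and publishing:

from pyspark.sql import functions as F

def apply_schema_config(df, schema_config: dict):
    # Rename and cast each column according to the contract.
    for col in schema_config["columns"]:
        df = df.withColumn(col["silver_name"],
                           F.col(col["source_name"]).cast(col["datatype"].lower()))
        if col["source_name"] != col["silver_name"]:
            df = df.drop(col["source_name"])
    # Deduplicate on the declared primary keys.
    pk = [c["silver_name"] for c in schema_config["columns"] if c["primary_key"]]
    return df.dropDuplicates(pk)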

This centralization changes the economics of maintenance. When a structural rule evolves, it evolves in a shared core, not in hundreds of nearly identical notebooks.


Governance and Operations at the Center of the Stack

An important part of this story is not in the YAML. It is in what prevents the YAML from becoming a mess.

Before any execution, the spec goes through a validation layer built with Pydantic. This layer checks accepted source formats, required fields, cross-field coherence, per-environment consistency, and schema rules.
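
To give a flavor of that layer, here is a minimal sketch assuming Pydantic v2; the field names mirror the contract shown above, but the real models and rule set are much richer:

from enum import Enum
from pydantic import BaseModel, field_validator, model_validator

class IngestionType(str, Enum):
    batch = "batch"
    streaming = "streaming"

ALLOWED_SOURCE_FORMATS = {"cloud-spanner", "bigquery", "delta", "pubsub"}

class Metadata(BaseModel):
    table_description: str
    table_source_owner: str
    table_datalake_owner: str
    ingestion_type: IngestionType
    ingestion_mode: str

class BronzeSource(BaseModel):
    format: str
    dynamic_configs: dict

    @field_validator("format")
    @classmethod
    def format_in_allowlist(cls, v: str) -> str:
        # Allowlist: only known source formats enter the platform.
        if v not in ALLOWED_SOURCE_FORMATS:
            raise ValueError(f"source format '{v}' is not allowed")
        return v

class Contract(BaseModel):
    metadata: Metadata
    # ... ingestion: bronze/silver blocks omitted for brevity

    @model_validator(mode="after")
    def coherent_mode(self):
        # Illustrative cross-field rule: streaming must be incremental.
        if (self.metadata.ingestion_type is IngestionType.streaming
                and self.metadata.ingestion_mode != "incremental"):
            raise ValueError("streaming ingestion requires incremental mode")
        return self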

In practice, governance appears through concrete mechanisms:

  1. Required fields and enums block invalid configurations at the entry point.
  2. Allowlists ensure that projects, formats, and certain behaviors follow known conventions.
  3. Guardrails prevent dangerous uses, such as overwrite write modes outside approved flows.
  4. Cross-field rules validate coherence between ingestion mode and the configured filter.
  5. Ownership and metadata make explicit who owns the source and who owns the table in the Data Lake.

This is the point where the stack trades freedom for operability. Convention stops being a recommendation. It becomes an entry criterion.

This layer also takes the stack beyond “copying data.” The runtime already includes validation, data quality, and operational controls that used to be scattered across local implementations.


GhostBuster: Deletes Became a Platform Flow

GhostBuster is the stack mechanism that ensures deletions made in the transactional source are correctly reflected in the silver layer of the Data Lake.

In the declarative contract, this behavior can be enabled directly in the YAML spec. From that point on, deletes stop being exceptions handled case by case for each table and become part of the platform’s standard operation.

In day-to-day work, this changes ingestion in four ways:

  1. The table is created with an explicit rule for handling deletions.
  2. In reprocessing flows, the stack prevents records already removed from the source from reappearing in silver.
  3. When validation finds IDs pending removal, the case enters a controlled deletion flow.
  4. That flow stays registered in an operational trail until the hard delete runs.

The practical effect was reducing a recurring type of operational friction. Before, deletes in silver would usually open manual requests and extend the inconsistency window between source and Data Lake. Now, much of that work is absorbed by the stack itself.
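
The core reconciliation idea can be sketched as an anti-join, assuming PySpark on Databricks; the table and column names come from the batch example above, the ops table is hypothetical, and the real GhostBuster flow is richer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # the notebook's session

# Silver keys that no longer exist in the source are "ghost" records.
silver_keys = (spark.table("domain_silver.TB_SILVER_TABLE_NAME")
               .select(F.col("Id").alias("source_id")))
source_keys = spark.table("domain_bronze.bronze_table_name").select("source_id")

ghost_keys = silver_keys.join(source_keys, on="source_id", how="left_anti")

# Register pending removals in an operational trail until the hard delete runs.
(ghost_keys.withColumn("detected_at", F.current_timestamp())
 .write.mode("append").saveAsTable("ops.ghostbuster_pending_deletes"))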


Streaming: The Same Contract, Different Pace

Batch and streaming are usually treated as separate worlds. Different pipelines, different tools, different logic. In the declarative stack, the YAML contract is the same. The difference starts with one field: ingestion_type: streaming.

From that point on, the platform executes the right flow. The engineer declares the ingestion. The stack decides how to process it.

Source: Google Cloud Pub/Sub

For streaming, the main source we operate is Google Cloud Pub/Sub. Instead of reading transactional tables by polling, the stack consumes messages published to a topic. Each message carries a binary payload that the platform persists in the Bronze layer before any transformation.

The path is analogous to batch, but adapted for the event-driven model:

Pub/Sub → Bronze (Delta) → Silver (Delta)

Two Core Notebooks (Again)

Just like batch, the streaming runtime is centralized. There is no notebook per topic. There are two core notebooks that the platform instantiates with parameters extracted from the YAML contract:

  • Bronze Streaming: reads the Pub/Sub topic via Apache Spark Structured Streaming and persists the data in the Bronze layer in Delta format, partitioned by ingestion date.
  • Silver Streaming: reads the Bronze streaming table, applies column renaming, casting, trimming, and computed columns, and publishes the result to the Silver layer.

The same centralization logic from batch applies here. A single runtime change impacts all streaming contracts at once.
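
In spirit, the Bronze streaming core looks like the sketch below, assuming the Databricks Pub/Sub connector; the option values come from the contract's dynamic_configs, and authentication options are omitted:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream.format("pubsub")
       .option("projectId", "prd-project")
       .option("subscriptionId", "subscription-name")
       .option("topicId", "topic-name")
       .load())

# Persist the binary payload first; transformation happens only in Silver.
(raw.withColumn("dt_ingestion", F.current_date())
    .writeStream.format("delta")
    .option("checkpointLocation", "gs://bucket-checkpoints/bronze/domain/table")
    .trigger(availableNow=True)
    .partitionBy("dt_ingestion")
    .toTable("domain_bronze.tb_table_name_bronze"))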

The Streaming YAML Contract

The difference between a batch YAML and a streaming YAML is in three places: the ingestion_type field, the source format (pubsub), and a streaming block that defines the checkpoint and trigger mode.

metadata:
  table_description: "Functional description of the streaming table"
  table_source_owner: "source-owner-team"
  table_datalake_owner: "datalake-owner-team"
  ingestion_type: streaming
  ingestion_mode: incremental

workflow:
  name: streaming-bronze-silver-table-name
  schedule_america_sp: "*/30 * * * *"

ingestion:
  bronze:
    source:
      prd:
        format: pubsub
        dynamic_configs:
          project_id: "prd-project"
          subscription_id: "subscription-name"
          topic_id: "topic-name"
          max_records_per_fetch: 10000
    destination:
      format: delta
      unity:
        schema_unity: "domain_bronze"
        table_unity: "tb_table_name_bronze"
        partition_by:
          - "dt_ingestion"
      destination_columns_schema:
        messageId: "string"
        payload: "binary"
        dt_ingestion: "date"
      streaming:
        trigger:
          available_now: true
        check_point_location: "gs://bucket-checkpoints/bronze/domain/table"

  silver:
    streaming:
      trigger:
        available_now: true
    destination:
      format: delta
      unity:
        schema_unity: "domain_silver"
        table_unity: "TB_TABLE_NAME_SILVER"
    schema_config:
      partition_by:
        - "CuratedDt"
      columns:
        - source_name: messageId
          silver_name: MessageId
          datatype: string
          primary_key: true

Trigger available_now: true

The default mode we operate is available_now: true. It instructs Spark Structured Streaming to process all data available at the time of execution and then shut down the job. The behavior is similar to a controlled micro-batch: it consumes what is in the queue, finishes, and releases the cluster.

This mode works well with schedulers like Airflow because the job has a predictable start and end, without needing a dedicated cluster running continuously.

Checkpoint: Managed by the Contract

The checkpoint location is the mechanism that ensures Spark Structured Streaming knows exactly where to resume processing after a failure or restart. In the YAML contract, it can be declared explicitly or left for the platform to generate automatically from the table name and environment:

gs://bucket-checkpoints/{env}/streaming_checkpoints/silver/{schema}/{table}

When the checkpoint is not specified in the YAML, the platform fills in this path automatically. This prevents checkpoints from being lost due to oversight or inconsistent manual configuration.
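
A sketch of that convention, with a hypothetical helper name:

def default_checkpoint(env: str, schema: str, table: str) -> str:
    # Deterministic path derived from environment, schema, and table name.
    return (f"gs://bucket-checkpoints/{env}/streaming_checkpoints/"
            f"silver/{schema}/{table}")

default_checkpoint("prd", "domain_silver", "TB_TABLE_NAME_SILVER")
# -> "gs://bucket-checkpoints/prd/streaming_checkpoints/silver/domain_silver/TB_TABLE_NAME_SILVER"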

The Same Governance

The streaming block goes through the same Pydantic validations as the rest of the contract. Required fields are checked, path formats are validated, and cross-environment consistency is guaranteed before any execution. The platform does not open structural exceptions for streaming: the declarative model is the same.


Generative AI Adoption at Scale

The stack became the operational standard for ingestion once the declarative contract became the platform’s main authorship unit. Generative AI helped us get there faster.

Today, we operate with about 850 YAMLs in production. That number matters less because of the volume itself and more because of what it proves: the declarative contract stopped being a new pattern and became the default way to ship ingestion.

We used AI agents to accelerate the most repetitive parts of the migration, such as creating and updating specs. They reduced mechanical work, but they did not change the central logic of the design. The structural gain came from the declarative stack. The repository includes several skills, instructions, and prompts to help agents create and evolve YAMLs, reducing work that used to take days down to hours.

Migration: From 530 Notebooks to 530 YAMLs

This change did not happen in a vacuum. About 530 legacy notebooks had to be converted to the new declarative contract. That migration was the step required to replace the old model with a flow where the platform can evolve through a shared core.

AI agents helped throughout the migration process, from identifying candidate notebooks to creating the first YAML versions.

What mattered was not only converting code. It was converting the logic of each ingestion to the declarative model, which required modeling decisions and adjustments for edge cases. The result was a faster, more consistent migration that left the stack ready to operate at scale.

Migrating 530 notebooks to 530 YAMLs was not a volume question. It was a question of changing how ingestion is designed, written, and maintained. The declarative contract became the new center of operations, and the migration was the step that got us there.

Public Data: Full Coverage in a Separate Repository

The AI asset coverage model is not limited to the declarative stack. The repository that ingests Brazilian public datasets — CGU, CVM, IBGE, Receita Federal (Brazil’s IRS), IBAMA, and others — is also fully covered.

There, engineers do not write YAML contracts to describe pipelines. The pattern is different: each source has a Databricks notebook that reads the public origin, generates a unique ID per record, and writes the data to Google Cloud Storage. What is the same is the philosophy: make the right thing the easy thing.

The repository is covered with four types of Copilot assets:

  1. 1 specialist agent (black-belt.agent.md) with full repository context.
  2. 5 skills covering the most common scenarios: notebook structure, GCS interaction, multithreaded download, primary key discovery, and workflow YAML configuration.
  3. 4 instruction files with required code patterns, naming conventions, and organization rules.
  4. 3 prompts for recurring tasks: adding a new source, modifying an existing ingestion, and diagnosing a broken workflow.

With those assets, an agent can create a complete notebook for a new public source — with retry logic, logging, ID generation, and GCS upload — without manual guidance at each step.
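
A hedged skeleton of what such a generated notebook contains, assuming the requests and google-cloud-storage libraries; the URL, bucket, and helper names are illustrative:

import time
import requests
from google.cloud import storage

def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    # Retry with exponential backoff, then give up loudly.
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            if i == attempts - 1:
                raise
            time.sleep(backoff ** i)

data = fetch_with_retry("https://example.gov.br/open-data/dataset.csv")
# ... parse records and generate one ID per record (see the skill below) ...

bucket = storage.Client().bucket("public-data-bucket")
bucket.blob("source-name/dataset.csv").upload_from_string(data)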

A Skill in Action: Primary Key Discovery

Public data rarely has a guaranteed unique ID at the source. A Receita Federal file has no UUID. An IBGE dataset has no explicit primary key. Without an ID per record, deduplication and traceability break down.

The primary-key-discovery skill solves this with a three-path decision tree. Before deciding, the agent checks about 200 rows of real data from the source. That sample determines the ID strategy before any code is written:

  1. Does the source already have a globally unique ID (API UUID, database PK)? → reuse the existing field.
  2. Do immutable natural keys exist (CNPJ, CPF, reference date)? → generate a deterministic SHA-256 hash.
  3. No natural keys? → use UUID v4.

When the path is SHA-256, the generated function follows this pattern:

import hashlib
from typing import Dict, Any, List

def generate_record_id(dict_record: Dict[str, Any], list_key_fields: List[str]) -> str:
    # Join the key-field values in a fixed order, then hash deterministically.
    str_composite = "|".join(str(dict_record.get(field, "")) for field in list_key_fields)
    return hashlib.sha256(str_composite.encode()).hexdigest()

# Example: CNPJ + reference date as composite key
list_key_fields = ["cnpj", "reference_date"]
dict_record = {"cnpj": "12345678000190", "reference_date": "2025-01", "company_name": "Example"}
dict_record["id"] = generate_record_id(dict_record, list_key_fields)

# Uniqueness check before GCS upload (list_data holds every parsed record)
list_ids = [generate_record_id(r, list_key_fields) for r in list_data]
assert len(list_ids) == len(set(list_ids)), "Duplicate IDs detected before upload"

The ID is deterministic: the same record always produces the same hash. This is essential for reprocessing — the stack does not create duplicates when the same ingestion runs twice.

The skill also defines what not to do: MD5 for record keys (collision risk), mutable fields in the hash (status, counters), and timestamp as the sole identifier. Those rules live in the skill file. The agent applies them automatically.

The result is that every new public data source is born with a traceable, validated ID consistent with all others. No manual instruction. No case-by-case review.


What the Stack Covers Today

The declarative stack now governs about 850 YAMLs in production and covers roughly 85% of the workflows in the Source → Bronze → Silver flow.

Inside that main path, the stack already standardizes:

  1. The main batch flow.
  2. Support for multiple source formats, including Spanner, BigQuery, Delta, and files.
  3. Explicit configuration by environment, with stg, int, and prd treated as part of the contract.
  4. Streaming via Google Cloud Pub/Sub with Spark Structured Streaming, using the same declarative model described above.

This matters because it shows the model’s real boundary. The stack covers most of the operation without pretending every edge case fits the same path. The gain comes from standardizing what is recurring and making clear where the edge begins.

And What About Sustainment?

The declarative stack eliminated a large part of case-by-case sustainment and changed the kind of sustainment we do. Before, each notebook could be a different case. Now, there is a common core to evolve and improve. Sustainment today focuses on evolving the runtime, improving the validation layer, and ensuring the contract remains the platform’s human interface. When we make a structural improvement, it impacts the whole stack, not just one specific case.

Adding a new column coming from a transactional migration, for example, is no longer a notebook case. It is a contract evolution that can be applied to hundreds of YAMLs with the same adjustment. The result is that sustainment evolves from reactive maintenance work into proactive platform evolution.
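
As an illustration, a contract evolution of that kind can be scripted over the whole repository; this sketch assumes PyYAML, and the paths and the added column are hypothetical:

import glob
import yaml

new_column = {"source_name": "updated_at", "silver_name": "UpdatedAt",
              "datatype": "TIMESTAMP", "primary_key": False}

for path in glob.glob("contracts/**/*.yaml", recursive=True):
    with open(path) as f:
        spec = yaml.safe_load(f)
    schema_config = spec.get("ingestion", {}).get("silver", {}).get("schema_config")
    if not schema_config:
        continue
    columns = schema_config["columns"]
    # Idempotent: skip contracts that already declare the column.
    if not any(c["source_name"] == "updated_at" for c in columns):
        columns.append(new_column)
        with open(path, "w") as f:
            yaml.safe_dump(spec, f, sort_keys=False)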

Combine that with AI agents and we get a scenario where sustainment is faster, more consistent, and more focused on evolving the platform than on maintaining specific cases. The declarative contract became the center of operations, and sustainment became the center of platform evolution.

Can Anyone Create a New Ingestion?

Yes. That is the idea. The declarative model and the validation layer were designed so that any engineer can create a new ingestion by following the contract, with validation blocking invalid or dangerous configurations. Creating new ingestions becomes self-service, without depending on a central platform team for each new source. The contract was designed to be accessible even to people with no previous experience with the stack: the goal is to democratize ingestion creation while preserving governance and operability.

Internal teams have already started opening PRs to create new ingestions following the declarative model, and the response has been positive. The process is faster, more predictable, and less error-prone than before, and the platform can grow without repeating the structural costs of the past.

A common example is ingesting public tables that teams discover and want to bring into the Data Lake. They create a YAML following the contract, and the platform handles the rest. Onboarding new sources becomes faster and less dependent on manual intervention, which accelerates Data Lake growth without compromising governance or operability.


The Results

The table below summarizes what changed in the development and operating model:

| Aspect | Before | After |
| --- | --- | --- |
| Development paradigm | Imperative, focused on the “how” | Declarative, focused on the “what” |
| Main authorship surface | Python notebooks, in the 2 notebooks : 1 table model, with 1 bronze table and 1 silver table | Declarative YAMLs, in the 1 YAML : 1 table model, with 1 bronze table and 1 silver table |
| Estimated time for a new ingestion | Days per new source | Hours per new source |
| Current stack scale | Logic spread across isolated notebook implementations | ~850 centralized YAMLs |
| Execution core | Distributed implementations | 2 core notebooks |
| Governance | Varied by implementation | Validated by contract |
| Delete handling | Local solutions and manual intervention | GhostBuster with a standardized and traceable flow |
| Organization | Multiple local patterns | Unified ingestion model |

When ingestion authorship moves from hundreds of free-form implementations to validated contracts, the platform drastically reduces the number of places where it can diverge from itself.

That gain appears on four dimensions at the same time:

  1. Less repeated code to write and review.
  2. Less structural variation across workflows.
  3. More predictability in operations.
  4. More speed to put new sources into production.

What We Learned

This was not a frictionless change. The simplification was worth it, but it brought important lessons.

1. Adopting a declarative model required a change in authorship.

Standardizing the technology was the most direct part. The harder part was the change in authorship. Teams used to building the full ingestion had to move to a flow where the main decision is no longer the notebook but the contract.

2. Not every workflow enters the new model at the same pace.

The 85% coverage already represents major progress. It also showed that the contract needs a clear limit. When the exception becomes the rule, the stack loses its standardization power.

3. Simplifying implementation does not remove the need for good modeling.

The declarative model reduces implementation cost. It does not remove the need for correct decisions about schema, source, deduplication, deletes, and publishing. When the contract is poorly modeled, the stack only scales the mistake faster.


What Comes Next

With 850 YAMLs in production, the next phase is expanding the platform’s capabilities for new use cases and integrations.

  1. Expand coverage beyond the current 85%.
  2. Evolve AI-assisted authorship to reduce manual work in the creation and evolution of specs.
  3. Expand connectors, formats, and edge cases inside the same declarative model.
  4. Make the creation of new ingestions increasingly self-service for teams.
  5. Ingest more transactional tables into the Data Lake, accelerating the onboarding of new sources.

The important point is that the foundation changed. We now have a simpler base to grow without repeating the structural costs of the past.


Technologies

| Layer | Technology |
| --- | --- |
| Ingestion specification | YAML |
| Processing | Databricks + Apache Spark |
| Bronze layer | Centralized generic notebook |
| Silver layer | Centralized generic notebook |
| Validation and governance | Python + declarative models + allowlists |
| Deletes and operational control | GhostBuster + Validator + Data Quality |
| Creation acceleration | AI agents + Asset Inventory + automated validation |
| Stack organization | Unified ingestion repository |

CERC operates the infrastructure of the Brazilian financial market for financial asset registration. Building data platforms in this context means working with real scale, real impact, and engineering decisions that need to be operable the next day. If you want to work on problems like this, we are hiring.


This post was written by CERC’s Data Engineering team: Davi Campos, André Tayer, and Guilherme Oliveira.