
CERC’s journey from BigQuery on-demand to lower costs without sacrificing resilience

TL;DR — At CERC, we moved away from BigQuery on-demand after a human error triggered five hours of continuously running queries and caused a severe cost impact. From that incident onward, we redesigned the operation around simplicity, operational efficiency, and resilience: first with environment-based reservations, then by testing and discarding a custom autoscaling approach that did not deliver the expected performance gains, and later by adopting fixed capacity with annual commitments, reducing BigQuery costs by 40%. We later refined the model again to isolate critical workloads with a regulatory reservation that could use idle slots from other reservations and autoscaling only during specific windows. The end result was a more predictable, more efficient operation that was better aligned with the criticality of our processes.



In platform engineering, almost every good choice has an expiration date.

The model that solves today’s problem well can become risky as the company grows, as operations become more sensitive, or when mistakes stop being mere inconveniences and start having real financial impact.

That is exactly what happened to us at CERC with BigQuery.

At first, we operated in the on-demand model. For the stage we were in, that choice made sense: it was simple, required little cloud maturity, and avoided the need to size capacity too early.

It worked. Until the day it didn’t.

A human error, in March 2022, caused queries to run continuously for about five hours. The result was catastrophic billing. In just a few hours, we doubled our cloud bill and learned, in the most expensive way possible, an important lesson: convenience without predictability comes with interest.

From that point on, our question changed.

It was no longer “how should we use BigQuery?” It became “how should we operate BigQuery in a way that matches the level of control, resilience, and efficiency that CERC needs?”


The three assumptions that guided the redesign

After the incident, we defined three criteria to evaluate any new architecture:

  • Simplicity: the design needed to be clear enough to operate safely.
  • Operational efficiency: we did not want to trade financial risk for an operation that was too complex.
  • Resilience: critical workloads needed to keep running predictably.

These assumptions sound obvious. The problem is that when pressure shows up, it is common to sacrifice one of them without noticing.

We tried not to do that.


Evolution at a glance

Evolution of BigQuery operations at CERC


Phase 1: the comfort of on-demand

The on-demand model gave us three clear advantages:

  • zero need to plan slots;
  • low operational complexity;
  • fast adoption.

For a company that was growing and still maturing in cloud, this was extremely useful.

But the model also hid a risk: it takes capacity planning off your plate, but it does not eliminate the need for predictability. When a workload behaves abnormally, the bill follows right behind it.

That is what the incident made painfully clear.


Phase 2: reservations by environment

Our first response was to move to the reservation model.

We created a dedicated project to centralize slots and split capacity across four main reservations:

1) Staging

An internal testing environment with fewer slots. Here, cost efficiency mattered most. Slower queries were acceptable.

2) Homologation

An environment more sensitive to latency because it concentrates customer certification and validation operations. It received more capacity.

3) Production

An environment with the greatest need for compute power, speed, and predictability. We also enabled the use of idle slots coming from other reservations.

4) All

A low-slot reservation for exploratory use across the organization. It also worked as a kind of safety net to prevent new projects from appearing outside the governance model.

Environment-based reservation model
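
To make the split concrete, here is a minimal sketch of how these four reservations could be provisioned with the BigQuery Reservation API. It is illustrative, not our actual provisioning code: the admin project, location, consumer project names, slot numbers, and the idle-slot settings of the non-production reservations are all assumptions.

from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()

# Hypothetical central project that owns the slots; location is illustrative.
parent = "projects/slots-admin-project/locations/US"

environments = {
    "staging":      {"slots": 100, "ignore_idle_slots": True},   # cost first, speed second
    "homologation": {"slots": 300, "ignore_idle_slots": True},   # latency-sensitive
    "production":   {"slots": 500, "ignore_idle_slots": False},  # may borrow idle slots
    "all":          {"slots": 50,  "ignore_idle_slots": True},   # exploratory safety net
}

for env, cfg in environments.items():
    client.create_reservation(
        parent=parent,
        reservation_id=env,
        reservation=reservation.Reservation(
            slot_capacity=cfg["slots"],
            ignore_idle_slots=cfg["ignore_idle_slots"],
        ),
    )
    # Route queries from the matching consumer project to this reservation.
    client.create_assignment(
        parent=f"{parent}/reservations/{env}",
        assignment=reservation.Assignment(
            assignee=f"projects/{env}-project",  # hypothetical consumer project
            job_type=reservation.Assignment.JobType.QUERY,
        ),
    )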

What this change solved

With this design, we stopped operating with open-ended consumption and started operating within a predefined capacity range. We gained:

  • cost predictability;
  • basic isolation across contexts;
  • more platform control.

At that moment, it looked like the problem was solved.

It wasn’t.


Phase 3: the assumption that seemed right

After moving to reservations, an almost intuitive idea emerged:

If slots represent compute capacity, then increasing slots dynamically should make queries faster.

Based on that assumption, we built a custom autoscaling mechanism.

The logic was simple (sketched in code below):

  • monitor slot usage in production;
  • increase capacity when consumption approached peak levels;
  • deallocate slots when pressure dropped.
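
For illustration, a hedged sketch of what that mechanism could look like: a loop that estimates recent slot consumption from INFORMATION_SCHEMA and resizes the production reservation through the Reservation API. The reservation path, region, thresholds, and slot values are stand-ins, not our real configuration.

import time

from google.cloud import bigquery
from google.cloud import bigquery_reservation_v1 as reservation
from google.protobuf import field_mask_pb2

bq = bigquery.Client()
res_client = reservation.ReservationServiceClient()

# Illustrative names and values only.
RESERVATION = "projects/slots-admin-project/locations/US/reservations/production"
BASELINE, CEILING, STEP = 500, 1500, 100  # slots

# Average slots consumed over the last 5 minutes (region qualifier is illustrative).
USAGE_SQL = """
SELECT SUM(period_slot_ms) / (5 * 60 * 1000) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
"""

def set_capacity(slots: int) -> None:
    res_client.update_reservation(
        reservation=reservation.Reservation(name=RESERVATION, slot_capacity=slots),
        update_mask=field_mask_pb2.FieldMask(paths=["slot_capacity"]),
    )

while True:
    avg_slots = list(bq.query(USAGE_SQL).result())[0].avg_slots or 0
    current = res_client.get_reservation(name=RESERVATION).slot_capacity
    if avg_slots > 0.8 * current and current < CEILING:
        set_capacity(min(current + STEP, CEILING))   # approaching peak: add slots
    elif avg_slots < 0.5 * current and current > BASELINE:
        set_capacity(max(current - STEP, BASELINE))  # pressure dropped: release slots
    time.sleep(60)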

On paper, it looked elegant. Dynamic. Smart. Economically efficient.

In practice, costs remained high.

That was when we decided to test the assumption instead of continuing to take it for granted.


Phase 4: we turned autoscaling off — and nothing got worse

We disabled our scaling mechanism and started operating with a fixed number of slots.

We expected to see performance degradation.

It never came.

Queries did not become materially slower.

This was one of the most important moments in the journey because it dismantled an assumption that seemed very reasonable. We cannot say with absolute certainty what caused that behavior, since BigQuery’s internal slot mechanics are proprietary. But our hypotheses started to revolve around two points:

  • there may be some activation cost, or “cold start,” when new slots come into play;
  • a significant share of the workloads was not parallelizable enough to benefit linearly from more slots.

The practical effect

We made a simple decision: remove custom autoscaling from the architecture.

That brought two immediate benefits:

  • it simplified the operation;
  • it reduced cost.

With fixed capacity, we started purchasing slots on annual commitments and reduced BigQuery costs by 40%.

That was a valuable lesson: sometimes the best optimization is to stop over-optimizing.


Phase 5: a new problem appeared — the noisy neighbor

A year later, we noticed another limitation in the design.

Our reservations were separated by environment, not by process criticality.

In practice, that meant different production projects could compete for the same slots. For ordinary workloads, that was already bad. For regulatory workloads, it was dangerous.

The risk here was not just latency. It was missing critical processing windows.

The solution was to create a new reservation: the regulatory reservation.

There, we concentrated all regulatory processes into their own project, with operational precedence over other workloads.

From noisy neighbor to regulatory isolation

What changed with that

We started isolating the right workload with the right criterion.

It was no longer just “production versus homologation.” Now it was:

  • critical workloads with their own reservation;
  • less sensitive workloads sharing another capacity layer.

This adjustment may seem small, but it completely changes how the platform responds to internal contention.


Phase 6: bringing scaling back, now guided by windows

Even with the regulatory reservation, one important question remained:

how do we increase capacity during critical moments without falling back into continuous scaling?

The answer was to reintroduce scaling, but with a different rationale.

Instead of allocating and deallocating slots all the time based on momentary usage, we started expanding capacity during predefined regulatory windows.

That meant:

  • before the critical window, we increased slots;
  • during execution, we kept the extra capacity;
  • once it was over, we reduced it again.

And there was one more refinement.

If the regulatory process finished earlier than expected, the application itself would publish a Pub/Sub message indicating that the additional slots could be removed.

Scaling stopped responding to consumption noise and started responding to a real business event.
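
A minimal sketch of that early-release path, under assumed names (the subscription, the admin project, the reservation path, and the baseline value are all illustrative): when the regulatory application publishes its “finished early” event, a small consumer shrinks the reservation back to its baseline.

from google.cloud import pubsub_v1
from google.cloud import bigquery_reservation_v1 as reservation
from google.protobuf import field_mask_pb2

res_client = reservation.ReservationServiceClient()

# Illustrative names and values only.
REGULATORY = "projects/slots-admin-project/locations/US/reservations/regulatory"
BASELINE_SLOTS = 100  # capacity kept outside regulatory windows

def on_message(message) -> None:
    # The regulatory application publishes this event when it finishes early,
    # so the extra slots allocated for the window can be released immediately.
    res_client.update_reservation(
        reservation=reservation.Reservation(name=REGULATORY, slot_capacity=BASELINE_SLOTS),
        update_mask=field_mask_pb2.FieldMask(paths=["slot_capacity"]),
    )
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("slots-admin-project", "regulatory-finished")
subscriber.subscribe(subscription, callback=on_message).result()

The same handler could just as well run as a Pub/Sub-triggered Cloud Function; the important part is that the trigger is a business event, not a consumption metric.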


Phase 7: BigQuery Editions changed the problem again

When BigQuery Editions arrived, we had to redesign the operation once more.

The product now offered native autoscaling, but under a different cost model than before. So the question stopped being “can we scale?” and became “in what order should capacity be consumed?”

Our final design followed this logic:

  1. use the pre-allocated slots from the regulatory reservation itself;
  2. if that is not enough, use idle slots from other reservations;
  3. only if neither of those is enough, fall back to native autoscaling.

Final logic with BigQuery Editions

Why this order matters

Because it turns autoscaling into a last resort, not the default behavior.

That detail is essential. If you let autoscaling act freely all the time, you risk ending up continuously operating with expanded capacity — and losing the predictability you were trying to gain in the first place.

That is why, even in the Editions model, we kept the same principle as before: the autoscaling ceiling is raised only during predefined windows and lowered again afterward.
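
Sketched in code, with the reservation path and slot values as illustrative assumptions, the window logic amounts to raising and lowering the reservation’s autoscale ceiling through the Reservation API (which is also why the Terraform shown below deliberately ignores drift on that field):

from google.cloud import bigquery_reservation_v1 as reservation
from google.protobuf import field_mask_pb2

client = reservation.ReservationServiceClient()

# Illustrative reservation path.
REGULATORY = "projects/slots-admin-project/locations/US/reservations/regulatory"

def set_autoscale_ceiling(max_slots: int) -> None:
    client.update_reservation(
        reservation=reservation.Reservation(
            name=REGULATORY,
            autoscale=reservation.Reservation.Autoscale(max_slots=max_slots),
        ),
        update_mask=field_mask_pb2.FieldMask(paths=["autoscale.max_slots"]),
    )

# Before the regulatory window opens (e.g. from a scheduled job): raise the ceiling.
set_autoscale_ceiling(1400)

# ... regulatory processing runs, consuming baseline and idle slots first ...

# After the window, or on the early-finish Pub/Sub event: lower it again so
# autoscaling goes back to being a last resort (values are illustrative).
set_autoscale_ceiling(0)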


How we implemented it

We described this entire operation in Terraform and YAML.

Instead of depending on manual configuration or tacit knowledge, we started codifying the most important platform decisions:

  • baseline capacity;
  • whether idle slots should be used;
  • autoscaling limits;
  • assignees by project.

A simplified configuration example:

reservation-regulatory:
  slot_capacity: 100           # baseline slots owned by the reservation
  ignore_idle_slots: false     # may use idle slots from other reservations
  autoscale_max_slots: 1400    # autoscaling ceiling, used only as a last resort
  assignees:
    - id: projects/<project_name>

And the Terraform that materializes this pattern:

resource "google_bigquery_reservation" "reservations" {
  provider          = google-beta
  for_each          = local.reservations
  project           = each.value.project_id
  name              = each.value.name
  location          = each.value.location
  edition           = each.value.edition
  concurrency       = each.value.concurrency
  ignore_idle_slots = each.value.ignore_idle_slots
  slot_capacity     = each.value.slot_capacity
  scaling_mode      = each.value.scaling_mode
  max_slots         = each.value.max_slots

  dynamic "autoscale" {
    for_each = each.value.autoscale_max_slots != null ? [true] : []
    content {
      max_slots = each.value.autoscale_max_slots
    }
  }

  lifecycle {
    ignore_changes = [autoscale[0].max_slots]
  }
}

The gain here was not just automation. It was operational consistency.


What we learned

If we had to summarize the journey in a few points, they would be these:

1) The right initial model can stop being the right model

On-demand was useful at the stage the company was in. The mistake would have been insisting on it after operations changed.

2) Intuitive performance assumptions need to be tested

“More slots = more speed” sounded obvious. It wasn’t.

3) Environment-based isolation is not enough for workloads with different levels of criticality

At some point, the unit of isolation needs to reflect the business process.

4) Autoscaling is not automatically a sign of maturity

Without operational context, it can become just an expensive way to hide inefficiency.

5) Real efficiency comes from balancing cost, simplicity, and resilience

If a design improves one of those by destroying the other two, it is probably not mature yet.


What changed in our platform

At CERC, this BigQuery journey was not just a shift from one pricing model to another.

It was the evolution of a data platform toward a more intentional operation.

We started with convenience. We went through an incident. We built a first response. We disproved an assumption that seemed correct. We reduced cost. We refined isolation. We reintroduced elasticity in the right place. And in the end, we arrived at a better design not because it was more sophisticated, but because it was more aligned with how the operation actually works.

That kind of result rarely appears all at once.

It appears when a platform team is willing to revisit assumptions, simplify what became too complex, and redesign the foundation before the system starts charging too high a price for it.


Want to work on problems like this?

CERC’s Infrastructure Center of Excellence exists to build the platforms that allow the company to grow with efficiency, order, and resilience. That means designing the foundation on which applications, teams, and critical operations can evolve with safety, predictability, and autonomy.

This is the kind of work where architecture does not stay in the diagram. It directly impacts cost, performance, governance, operational risk, and the company’s ability to scale without losing control.

If you enjoy building platforms, automating operations, designing resilient systems, and making engineering decisions with real-world impact, this is exactly the kind of challenge we work on here.


CERC operates infrastructure for the Brazilian financial market to register receivables — a system where correctness, scale, and reliability are not optional. We build the data platform on which the financial system runs. If you want to work on problems like this — real scale, real consequences, and the autonomy to design the right solution — we’re hiring.


This post was written by the Infrastructure Center of Excellence team: Felipe Trucolo, Demetrius Moro, and André Santos.