SaaS · Case study

Puppet CloudOps

A multi-cloud FinOps platform: ingest cloud billing across five hyperscalers, surface sub-second spend analytics, and auto-generate Terraform pull requests so customers apply savings through GitOps.

End-to-end owner · LLD author · 2024 → 2026

Go · ClickHouse · Kafka · KEDA · GraphQL Federation · TypeScript · React · Terraform · Kubernetes

The problem

Cloud spend is scattered: every hyperscaler bills in its own format, at its own cadence, in volumes measured in terabytes a day. A FinOps platform has to ingest all of it without losing a row, normalise it into one comparable shape, answer analytical questions over billions of rows fast enough to feel interactive, and then close the loop: turn a recommendation into a change a customer can actually apply. And it has to do all of that multi-tenant and SOC 2-ready from day one.

What I did

I owned Puppet CloudOps end-to-end and authored its low-level design. It ships through Puppet’s Early Access programme.

Sources: AWS · Azure · GCP · OCI · Kubernetes, FOCUS-normalised

Ingest: Scheduled + SQS event triggers, Kafka, KEDA parser fleet 0→N, DLQ & retry

Store: ClickHouse db-per-tenant + materialised views · PostgreSQL schema-per-tenant · Redis

API: GraphQL Apollo Federation supergraph over ~7 Go subgraphs

Act: React + TanStack UI · auto-generated Terraform PRs → GitOps

Five billing formats in, an applyable Terraform pull request out.

Multi-tenant by design. Database-per-tenant in ClickHouse, schema-per-tenant in PostgreSQL, async Kafka messaging, token-auth REST for service-to-service. Tenant isolation was baked in for SOC 2 from the first schema.
Ingest at scale. Terabytes a day through dual triggers (scheduled and SQS event-driven), with KEDA scaling the parser fleet 0→N on queue backlog under backpressure, plus dead-letter queues and retry. Zero production data loss.
Sub-second analytics. ClickHouse materialised views, pre-aggregation, Redis caching and query rewriting took p95 from ~20s to sub-second over billions of rows.
GraphQL platform. An Apollo Federation supergraph over ~7 Go (gqlgen) subgraphs, with generated TypeScript typings feeding a React + TanStack UI and reactive state over Redis pub/sub and WebSockets.
Closing the loop. Optimisation recommendations become auto-generated Terraform pull requests, so a customer applies savings through their normal GitOps review, the same way they ship every other change to their infrastructure-as-code.

Impact

p95 analytics latency from ~20 seconds to sub-second over billions of rows.
Zero production data loss across terabyte-a-day ingest, with elastic 0→N scaling absorbing spikes in billing volume.
Savings delivered as reviewable Terraform PRs, keeping customers’ infra in GitOps where the rest of their changes already live.
A multi-tenant, SOC 2-ready platform spanning five hyperscalers from one normalised FOCUS model.

Note: Product UI screenshots are internal and pending clearance; the architecture above is the cleared view. Captures will be added here once approved.