Puppet CloudOps
A multi-cloud FinOps platform: ingest cloud billing across five hyperscalers, surface sub-second spend analytics, and auto-generate Terraform pull requests so customers apply savings through GitOps.
The problem
Cloud spend is scattered: every hyperscaler bills in its own format, at its own cadence, in volumes measured in terabytes a day. A FinOps platform has to ingest all of it without losing a row, normalise it into one comparable shape, answer analytical questions over billions of rows fast enough to feel interactive, and then close the loop by turning a recommendation into a change a customer can actually apply. And it has to do all of that as a multi-tenant, SOC 2-ready system from day one.
What I did
I owned Puppet CloudOps end-to-end and authored its low-level design. It ships through Puppet’s Early Access programme.
Five billing formats in, an applyable Terraform pull request out.
- Multi-tenant by design. Database-per-tenant in ClickHouse, schema-per-tenant in PostgreSQL, async Kafka messaging, token-auth REST for service-to-service. Tenant isolation was baked in for SOC 2 from the first schema.
- Ingest at scale. Terabytes a day, dual-trigger (scheduled + SQS event-driven), with KEDA scaling the parser fleet 0→N on queue backlog under backpressure, plus dead-letter queues and retries. Zero production data loss.
- Sub-second analytics. ClickHouse materialised views, pre-aggregation, Redis caching and query rewriting took p95 query latency from ~20 seconds to sub-second over billions of rows.
- GraphQL platform. An Apollo Federation supergraph over ~7 Go (gqlgen) subgraphs, with generated types feeding a React + TanStack UI and reactive state pushed over Redis pub/sub and WebSockets.
- Closing the loop. Optimisation recommendations become auto-generated Terraform pull requests, so a customer applies savings through their normal GitOps review, the same way they ship every other infrastructure change.
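To make the "closing the loop" step concrete, here is a minimal sketch of how a recommendation could be rendered as a reviewable Terraform change. It is written in Python for brevity and is not the production code: the `Recommendation` fields, the function names and the PR title format are all hypothetical, and the line-based patching assumes a simply formatted `.tf` file rather than parsing HCL properly.

```python
import difflib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical rightsizing recommendation; all field names are illustrative."""
    file_path: str
    resource_address: str  # e.g. "aws_instance.api"
    attribute: str         # e.g. "instance_type"
    current_value: str
    proposed_value: str
    monthly_saving_usd: float


def apply_recommendation(source: str, rec: Recommendation) -> str:
    """Change one string attribute, but only inside the targeted resource block."""
    rtype, rname = rec.resource_address.split(".", 1)
    header = re.compile(
        r'^resource\s+"%s"\s+"%s"\s*\{' % (re.escape(rtype), re.escape(rname))
    )
    attr = re.compile(
        r'^(\s*%s\s*=\s*)"%s"'
        % (re.escape(rec.attribute), re.escape(rec.current_value))
    )
    out, inside, depth = [], False, 0
    for line in source.splitlines(keepends=True):
        if not inside and header.match(line):
            inside, depth = True, 0
        if inside:
            line = attr.sub(
                lambda m: f'{m.group(1)}"{rec.proposed_value}"', line, count=1
            )
            # Naive brace counting: enough for a sketch, not a real HCL parser.
            depth += line.count("{") - line.count("}")
            if depth <= 0:
                inside = False
        out.append(line)
    return "".join(out)


def render_pull_request(source: str, rec: Recommendation) -> dict:
    """Build the title, body and unified diff a reviewer would see."""
    patched = apply_recommendation(source, rec)
    diff = "".join(
        difflib.unified_diff(
            source.splitlines(keepends=True),
            patched.splitlines(keepends=True),
            fromfile=f"a/{rec.file_path}",
            tofile=f"b/{rec.file_path}",
        )
    )
    return {
        "title": f"Rightsize {rec.resource_address}: "
                 f"{rec.current_value} -> {rec.proposed_value}",
        "body": f"Estimated saving: ${rec.monthly_saving_usd:,.2f}/month.",
        "diff": diff,
        "patched": patched,
    }
```

The point of the design is that the output is an ordinary diff: the customer's existing review, plan and apply pipeline stays the only path to production, and the platform never mutates infrastructure directly.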
Impact
- p95 analytics latency cut from ~20 seconds to sub-second over billions of rows.
- Zero production data loss across terabyte-a-day ingest, with elastic 0→N scaling absorbing ingest spikes.
- Savings delivered as reviewable Terraform PRs, keeping customers’ infra in GitOps where the rest of their changes already live.
- A multi-tenant, SOC 2-ready platform spanning five hyperscalers from one normalised FOCUS model.
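As an illustration of what "one normalised FOCUS model" means in practice, the sketch below maps two provider-specific billing rows into a single record shape. It is a simplified Python example, not the platform's parser: the `FocusRow` type and function names are hypothetical, only a handful of FOCUS-style columns are shown, and the raw field names are modelled loosely on AWS Cost and Usage Report and GCP billing export columns.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class FocusRow:
    """A small, illustrative subset of FOCUS-style columns."""
    provider_name: str
    service_name: str
    billed_cost: Decimal
    billing_currency: str
    charge_period_start: datetime


def _parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and return it as timezone-aware UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are already UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def from_aws(row: dict) -> FocusRow:
    # Flat, slash-separated column names, costs delivered as strings.
    return FocusRow(
        provider_name="AWS",
        service_name=row["product/ProductName"],
        billed_cost=Decimal(row["lineItem/UnblendedCost"]),
        billing_currency=row["lineItem/CurrencyCode"],
        charge_period_start=_parse_utc(row["lineItem/UsageStartDate"]),
    )


def from_gcp(row: dict) -> FocusRow:
    # Nested records, costs delivered as numbers.
    return FocusRow(
        provider_name="Google Cloud",
        service_name=row["service"]["description"],
        billed_cost=Decimal(str(row["cost"])),
        billing_currency=row["currency"],
        charge_period_start=_parse_utc(row["usage_start_time"]),
    )


NORMALISERS = {"aws": from_aws, "gcp": from_gcp}


def normalise(provider: str, row: dict) -> FocusRow:
    """Dispatch a raw billing row to its provider-specific mapper."""
    try:
        mapper = NORMALISERS[provider]
    except KeyError:
        raise ValueError(f"no normaliser registered for provider {provider!r}")
    return mapper(row)
```

Once every provider lands in the same shape, downstream queries can sum `billed_cost` or group by `service_name` without caring where a row came from. Costs are kept as `Decimal` so that aggregation does not accumulate floating-point error.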
Note: Product UI screenshots are internal and pending clearance; the architecture above is the cleared view. Internal UI captures will be added here once approved.