Monitoring the World: Scaling Thanos in Dynamic Prometheus Environments
Year: 2024
Labels: monitoring, prometheus, thanos
Monitoring the World: Scaling Thanos in Dynamic Prometheus Environments
Speaker(s): Colin Douch, Cloudflare
Video URL: https://youtu.be/ofhvbG0iTjU?si=be7Df_-4ryWneqiZ
Summary: Cloudflare scaled their Thanos metrics stack to 600+ data centers by moving to R2 storage and automating bucket creation for every new site. They optimized performance by running resource-heavy compaction on idle edge CPUs during off-peak hours and using a custom proxy to prevent slow, global queries.
Timestamps
00:00- Introduction to the Talk & Speaker01:22- Cloudflare's Thanos Journey05:20- Tooling Developed to Handle Scaling & Improve UX06:15- Storage12:10- Compactors14:35- Stores17:15- Queries20:20- Outro with Q&A
Key Takeaways
Cloudflare moved from Nagios/OpenTSDB to Prometheus and Thanos in 2018.
They abandoned OpenTSDB primarily because they lacked Java expertise and found the Hadoop/HBase stack difficult to maintain.
Initial Scale (2018): ~100 data centers, <1 billion active time series. Current Scale: 600+ data centers, 1,000+ sidecars, 50TB+ metrics daily, and 50 petabytes of long-term storage.
Storage: Internal Ceph storage was unmaintained and centralized. Migrated to Cloudflare R2 (S3-compatible object storage). Since Thanos sidecars only allow one upload per block, they patched Thanos to allow configurable metadata filenames. This let them run two sidecars simultaneously to dual-write to both Ceph and R2 during the transition.
Compactors: Run on bare-metal edge nodes using Nomad. They run compaction jobs as "Chron" tasks during night. Compactor is a background worker that merges Prometheus' 2-hour metrics blocks into larger blocks, e.g. 2 weeks, to reduce the number of files.
Stores: For optimal SLOs, a single Thanos store should handle roughly 5 terabytes of data blocks. They built a tool called "Store Sinker" that automatically chunks buckets into 5TB segments and provisions a dedicated store for each chunk.
Queries: By default, Thanos queries every store API it knows, meaning a query is only as fast as the slowest connected data center. So they built a Label Enforcer reverse proxy to enforce including an "external label", e.g. data center name or region.
Questions/Discussion Points
- Speaker mentions they want to meet a "decently respectable set of SLOs", what could they be? Query latency? Availability/Up-time? Data freshness (how quickly data is available in long-term storage)? Retention integrity?