Intro to Scaling Prometheus with Cortex
Year: 2020
Labels: prometheus, cortex, monitoring
Intro to Scaling Prometheus with Cortex
Speaker(s): Tom Wilkie, Grafana Labs & Ken Haines, Microsoft
Video URL: https://youtu.be/oHGoZd-9jWA
Summary: Cortex is a horizontally scalable, push-based backend that allows multiple Prometheus instances to aggregate metrics into a single, multi-tenant cluster for global observability. It eliminates the vertical scaling limits and security hurdles of traditional federation by using distributed systems logic and cost-effective object storage for long-term data retention.
Timestamps
00:00- Introduction to Speakers and Agenda01:15- Prometheus Model & Grafana04:30- Prometheus' Answer to Global Observability = Federation07:10- Why Cortex?08:15- About Cortex11:00- History of Cortex14:15- Demo: Collecting metrics from various Prometheus instances and send to a Central Cortex installation to do global aggregates and reports via Grafana19:15- Outro
Key Takeaways
- Prometheus is designed to be co-located with the jobs it monitors (pull-based). This becomes difficult to manage when software is deployed across multiple disparate regions or clusters.
- Standard Prometheus federation requires edge servers to be exposed for scraping (security risk), scales vertically (limited by single-node RAM), and often requires complex pre-aggregation to avoid overwhelming the central server.
- Cortex can be scaled out by adding more nodes to handle increased cardinality and ingestion rates (Horizontal Scalability).
- Cortex uses Prometheus’s remote_write feature. Edge Prometheus instances push data to the central Cortex cluster, eliminating the need to open firewall ports at the edge for pulling metrics.
- Through a Distributed Hash Table (DHT) and replication, Cortex ensures there are no gaps in data even if individual nodes fail.
- Cortex offloads data to object stores like Amazon S3, Google Bigtable, or Cassandra.
Questions/Discussion Points
- This talk is a few years old and even though Cortex is a "Graduated" CNCF project, stable and reliable, the industry seems to have shifted its focus to Grafana Mimir, VictoriaMetrics and Thanos.
- Cortex and its peers now natively ingest OpenTelemetry (OTLP) so we aren't limited to Prometheus metrics anymore.
- The "Chunks" storage (DynamoDB/Cassandra) that is mentioned in the talk is considered legacy now and we use Object Storage (S3/GCS) because it is cheaper and scales better.