Back to Main Table

Intro to Scaling Prometheus with Cortex

Year: 2020

Labels: prometheus, cortex, monitoring

Intro to Scaling Prometheus with Cortex

Speaker(s): Tom Wilkie, Grafana Labs & Ken Haines, Microsoft

Video URL: https://youtu.be/oHGoZd-9jWA

Summary: Cortex is a horizontally scalable, push-based backend that allows multiple Prometheus instances to aggregate metrics into a single, multi-tenant cluster for global observability. It eliminates the vertical scaling limits and security hurdles of traditional federation by using distributed systems logic and cost-effective object storage for long-term data retention.

Timestamps

  • 00:00 - Introduction to Speakers and Agenda
  • 01:15 - Prometheus Model & Grafana
  • 04:30 - Prometheus' Answer to Global Observability = Federation
  • 07:10 - Why Cortex?
  • 08:15 - About Cortex
  • 11:00 - History of Cortex
  • 14:15 - Demo: Collecting metrics from various Prometheus instances and send to a Central Cortex installation to do global aggregates and reports via Grafana
  • 19:15 - Outro

Key Takeaways

  • Prometheus is designed to be co-located with the jobs it monitors (pull-based). This becomes difficult to manage when software is deployed across multiple disparate regions or clusters.
  • Standard Prometheus federation requires edge servers to be exposed for scraping (security risk), scales vertically (limited by single-node RAM), and often requires complex pre-aggregation to avoid overwhelming the central server.
  • Cortex can be scaled out by adding more nodes to handle increased cardinality and ingestion rates (Horizontal Scalability).
  • Cortex uses Prometheus’s remote_write feature. Edge Prometheus instances push data to the central Cortex cluster, eliminating the need to open firewall ports at the edge for pulling metrics.
  • Through a Distributed Hash Table (DHT) and replication, Cortex ensures there are no gaps in data even if individual nodes fail.
  • Cortex offloads data to object stores like Amazon S3, Google Bigtable, or Cassandra.

Questions/Discussion Points

  • This talk is a few years old and even though Cortex is a "Graduated" CNCF project, stable and reliable, the industry seems to have shifted its focus to Grafana Mimir, VictoriaMetrics and Thanos.
  • Cortex and its peers now natively ingest OpenTelemetry (OTLP) so we aren't limited to Prometheus metrics anymore.
  • The "Chunks" storage (DynamoDB/Cassandra) that is mentioned in the talk is considered legacy now and we use Object Storage (S3/GCS) because it is cheaper and scales better.

Links/Resources