Prometheus Deep Dive
Year: 2019
Labels: prometheus, observability
Speaker(s): Ben Kochie
Video URL: https://youtu.be/Me-kZi4xkEs
Summary: This talk covers the design choices behind the Prometheus monitoring architecture, along with best practices and recommendations for data collection and scaling strategies. The speaker also showcases enhancements that were recent as of 2019.
Timestamps
00:00 - Introduction to Speaker and Prometheus
02:25 - Prometheus Design
04:40 - Prometheus Data Collection
10:10 - Scaling Strategies
13:40 - Q&A: Prometheus' Storage
16:10 - Q&A: Retroactively Evaluating Recording Rules
17:25 - Q&A: Experience with Thanos
22:20 - Q&A: Facing Problems with Prometheus
23:25 - Q&A: Application-specific Metrics
25:50 - Q&A: Vertical Compaction
27:10 - Q&A: Prometheus Web Interface
28:30 - Q&A: TSDB Sizing
29:10 - Q&A: Config Management
Key Takeaways
- Prometheus is designed to be the most reliable component on the network by minimizing dependencies and running locally. This way, Prometheus maintains visibility even if external network dependencies fail. The write-ahead log (WAL) protects data during operation and across restarts, and the immutability of the time series database prevents data corruption.
- Prometheus should be deployed close to the targets to ensure accurate monitoring without relying on broader network stability.
- Vertically scale Prometheus instances before considering horizontal scaling; at the point where that no longer suffices, e.g. deployments of 1,000 pods or similar scale, split Prometheus instances by service to improve manageability and performance (see the first sketch after this list).
- Take capacity planning seriously: a server handling 100,000 samples/second needs an explicit plan for how that volume will be stored and processed (see the sizing sketch after this list).
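
One way to carry out the per-service split mentioned above is to give each per-service Prometheus instance its own file-based service discovery (file_sd) target file. The sketch below only illustrates that idea; the service names, target addresses, and output paths are assumptions, not details from the talk.

    # Sketch: group scrape targets by service so each per-service Prometheus
    # instance only loads its own file_sd target file.
    # The inventory below is hypothetical and purely illustrative.
    import json
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical inventory: (service, "host:port") pairs.
    TARGETS = [
        ("api", "api-0.example.internal:9100"),
        ("api", "api-1.example.internal:9100"),
        ("billing", "billing-0.example.internal:9100"),
    ]

    def write_per_service_target_files(out_dir: str = "targets") -> None:
        """Write one file_sd JSON file per service."""
        grouped = defaultdict(list)
        for service, addr in TARGETS:
            grouped[service].append(addr)

        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        for service, addrs in grouped.items():
            # file_sd format: a list of {"targets": [...], "labels": {...}} groups.
            payload = [{"targets": sorted(addrs), "labels": {"service": service}}]
            (out / f"{service}.json").write_text(json.dumps(payload, indent=2))

    if __name__ == "__main__":
        write_per_service_target_files()

Each per-service Prometheus instance would then point a file_sd_configs entry at its own JSON file, keeping every instance's scrape configuration small and independently manageable.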
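
For the capacity-planning point, a back-of-envelope disk estimate multiplies retention time by ingestion rate and average bytes per sample (the Prometheus storage documentation quotes roughly 1-2 bytes per compressed sample). The sizing sketch below applies that rule of thumb to the 100,000 samples/second figure from the talk; the retention window and bytes-per-sample value are assumptions to adjust for your own data.

    # Back-of-envelope TSDB disk sizing for ~100,000 samples/second.
    RETENTION_DAYS = 15          # Prometheus' default retention window
    SAMPLES_PER_SECOND = 100_000
    BYTES_PER_SAMPLE = 1.5       # assumed average after compression (typically 1-2)

    retention_seconds = RETENTION_DAYS * 24 * 3600
    needed_bytes = retention_seconds * SAMPLES_PER_SECOND * BYTES_PER_SAMPLE
    print(f"~{needed_bytes / 1e9:.0f} GB of TSDB storage")   # ~194 GB

The actual ingestion rate of a running server can be read from its own metrics, e.g. rate(prometheus_tsdb_head_samples_appended_total[5m]), which makes the estimate easy to refresh as load grows.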