Monitoring Serverless Workloads with OpenTelemetry and Prometheus
Year: 2024
Labels: cloud native, monitoring, observability, opentelemetry, prometheus, serverless
Monitoring Serverless Workloads with OpenTelemetry and Prometheus
Speaker(s): Ridwan Sharif, Google
Video URL: https://www.youtube.com/watch?v=r6qb3OivFyQ
Summary: The talk explains the challenges of observability in serverless environments, such as the transient nature of instances. To address these issues, Google introduced sidecars in Cloud Run to run alongside main containers for collecting telemetry data, and they implemented lifecycle dependencies to ensure proper telemetry collection during startup and shutdown. Adjustments were also made to support both push-based (e.g., OpenTelemetry) and pull-based (e.g., Prometheus) telemetry systems in serverless environments.
Timestamps
00:00
- Introduction and Agenda02:00
- Focus on Serverless05:00
- Serverless Observability (O11y)07:25
- Problem: serverless instances are expected to be short-lived08:20
- Problem: the need to normalize cumulative metrics08:50
- Problem: prometheus-based scraping expect metrics to be pulled10:45
- Problem: need more control of how to configure collections11:40
- Cloud Run infrastructure changes needed to accommodate o11y14:35
- Adjustments needed for push-based OpenTelemetry Protocol (OTLP) ingest15:05
- Adjustments needed for pull-based ingest (Prometheus metrics)17:25
- How the productionize (deploy and configure) the collectors19:00
- Next steps
Key Takeaways
- Traditional monitoring tools face difficulties due to the ephemeral nature of serverless instances.
- Google introduced sidecars and lifecycle dependencies in Cloud Run to improve observability.
- Ongoing efforts include reducing resource overhead and addressing CPU throttling issues.
Questions/Discussion Points
- These were the changes done in Cloud Run to accommodate o11y - what about other serverless services?