When it comes to Prometheus and alerts, the typical use case is to send alerts to Alertmanager for handling (deduplication, grouping) and routing to services such as Slack and PagerDuty. However, there are situations where we need to analyze alert patterns, and being able to visualize how often alerts fire can be very useful. In this post, I will share how we can visualize alert metrics in Grafana using various PromQL operators and functions.
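As a rough illustration of the kind of query involved, here is a minimal sketch of a Prometheus recording rule that counts firing alerts per alert name, using the built-in `ALERTS` series. The group name and the recorded series name are made up for this example, and the expression is only one possible starting point, not necessarily the approach taken in the post.

```yaml
# Sketch of a Prometheus rules file. The group and record names are hypothetical.
groups:
  - name: alert_visualization_sketch
    rules:
      # Prometheus exposes active alerts as the synthetic ALERTS series;
      # this counts the currently firing instances of each alert.
      - record: alertname:alerts_firing:count
        expr: count by (alertname) (ALERTS{alertstate="firing"})
```

The same `expr` can also be pasted directly into a Grafana panel query without defining a recording rule at all.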
Last week at work, I encountered an alert that was misfiring. Or so I thought…
Alertmanager is an application that handles alerts sent by client applications such as Prometheus. It can also perform alert grouping, deduplication, silencing and inhibition. Definitely a useful addition to any modern monitoring infrastructure. That being said, configuring it can be a little daunting, given the many options available and the somewhat vague explanations of some of the terms. While configuring Alertmanager, I came across three confusing terms: group_wait, group_interval and repeat_interval.
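To show where these three settings live, here is a minimal sketch of an Alertmanager route. The receiver name and the `group_by` label are assumptions made for illustration; the durations match Alertmanager's documented defaults, and the comments paraphrase the official descriptions.

```yaml
# Sketch of an Alertmanager config fragment. The receiver name is hypothetical.
route:
  receiver: team-slack
  group_by: ['alertname']
  # How long to wait before sending the first notification for a new group,
  # so alerts that start firing around the same time are batched together.
  group_wait: 30s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified.
  group_interval: 5m
  # How long to wait before re-sending a notification that was already
  # sent successfully, while the alerts keep firing.
  repeat_interval: 4h

receivers:
  - name: team-slack
```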
For those who aren’t familiar, node-exporter is a Prometheus exporter that exposes hardware and OS metrics from *NIX kernels. To get it up and running, there’s a simple guide in the official Prometheus docs. The issue with that approach is that running node-exporter by executing the binary directly isn’t the most reliable option in a production environment, as nothing ensures that the node_exporter process keeps running. This is where systemd comes in.