Visualizing alerts metrics on Grafana

When it comes to Prometheus and alerts, the typical use case is to send alerts to Alertmanager for handling (deduplication, grouping) and routing to services such as Slack and PagerDuty. However, there are situations where we need to analyze alert patterns, and being able to visualize how often alerts fire can be very useful. In this post, I will share how we can visualize alert metrics on Grafana using various PromQL operators and functions.
Read more →

Debugging a misfiring Prometheus alert

Last week at work, I encountered an alert that was misfiring. Or so I thought…
Read more →

Understanding the differences between Alertmanager’s group_wait, group_interval and repeat_interval

Alertmanager is an application that handles alerts sent by client applications such as Prometheus. It can also perform alert grouping, deduplication, silencing, and inhibition, which makes it a useful addition to any modern monitoring infrastructure. That said, configuring it can be a little daunting given the many configuration options available and the somewhat vague explanations of some of the terms. While configuring Alertmanager, I came across three confusing terms: group_wait, group_interval and repeat_interval.
Read more →