In my previous post, I shared about my debugging process using various Linux tools and debugger. During the process, I came across the analysis technique using flame graphs and thought it will be interesting to see what information I can get out of it. What are flame graphs? Flame graphs, as the name suggests, are graphs that look like flames because of the shape and color (usually red-yellowish hues). It was invented by Brendan Gregg for the purpose of analyzing performance issue and understand CPU usage quickly.
Recently one of our nodejs application (responsible for scraping metrics for external services) running in our EKS cluster was experiencing high CPU usage and memory leak and I was tasked to figure out the root cause. In this post, I will share my troubleshooting process and interesting stuff I discovered along the way. It all began with an alert notifying us of the application experiencing CPU throttling. Looking at the dashboard, it became apparent that high CPU usage isn’t the only issue; it was also experiencing memory leak and oddly high incoming and outgoing traffic.
Alertmanager is an application that handles alerts sent by client applications such as Prometheus. It can also perform alert grouping, deduplication, silencing, inhibition. Definitely a useful addition to any modern monitoring infrastructure. That being said, configuring it can be a little daunting with the many different configurations available and somewhat vague explanations on some of the terms. While configuring Alertmanager, I came across these 3 confusing terms: group_wait, group_interval and repeat_interval.
For those who aren’t familiar, node-exporter is a Prometheus exporter that exposes hardware and OS metrics from *NIX kernels. To get it up and running, there’s a simple guide on Prometheus official docs. The issue with the approach is that running node-exporter by executing binary directly isn’t the most reliable approach in a production environment as there’s no way to ensure that the node_exporter process will run continuously. This is where systemd comes in.