Debugging containers using nsenter

If you have ever managed a Kubernetes cluster, chances are you have encountered pods that just don’t want to behave the way they are supposed to.

You checked the logs and traced it back to the source code. Logic checks out ✅

You started narrowing down the causes. Networking issue? Configuration issue?

You entered the container and decided to use ping to identify network connectivity issues.

/ $ ping google.com
PING google.com (142.251.12.138): 56 data bytes
ping: permission denied (are you root?)

Or maybe you wanted to install another tool like tcpdump to observe network traffic.


Visualizing alerts metrics on Grafana

When it comes to Prometheus and alerts, the typical use case is to send alerts to Alertmanager for handling (deduplication, grouping) and routing to services such as Slack and PagerDuty.

However, there are situations where we need to analyze alert patterns, and being able to visualize how often alerts fire can be very useful.

In this post, I will share how we can visualize alert metrics on Grafana using various PromQL operators and functions.


Debugging a misfiring Prometheus alert

Last week at work, I encountered an alert that was misfiring. Or so I thought…

Node.js application CPU profile analysis with Flame Graphs

In my previous post, I shared my debugging process using various Linux tools and a debugger. Along the way, I came across flame graph analysis and thought it would be interesting to see what information I could get out of it.


What are flame graphs?

Flame graphs, as the name suggests, are graphs that look like flames because of their shape and color (usually red and yellow hues). They were invented by Brendan Gregg for analyzing performance issues and understanding CPU usage quickly.


Debugging high CPU usage and a memory leak in a Node.js application

Recently, one of our Node.js applications (responsible for scraping metrics for external services) running in our EKS cluster was experiencing high CPU usage and a memory leak, and I was tasked with figuring out the root cause. In this post, I will share my troubleshooting process and the interesting things I discovered along the way.

It all began with an alert notifying us that the application was experiencing CPU throttling. Looking at the dashboard, it became apparent that high CPU usage wasn’t the only issue; the application was also experiencing a memory leak and oddly high incoming and outgoing traffic.
