A little trick I learned for copying files out of a (somewhat) locked down EC2 worker node for EKS

A process is basically a program in execution, and a program is a piece of code, anywhere from a single line to millions of lines long, written in a programming language.

When a UNIX machine is powered up, the kernel is loaded and completes its initialization. Once initialization is done, the kernel creates a set of processes in user space, including the system management daemon (usually named init), which has PID 1 and is responsible for running the right complement of services and daemons at any given time.

Read more →

Managing multiple EKS clusters access using Apiservers’ private endpoints with AWS VPN

2021-10-18 — 3 min read

#aws #eks #dns

I manage multiple EKS clusters (multi-env, multi-tenant) at work, and access to them is via Bastion instances deployed within each cluster's VPC.

However, this approach can become unmaintainable over time, as the number of Bastion instances grows with the number of clusters we manage. This means additional effort is required to monitor and maintain each of those Bastion instances.

This led to the idea of removing all Bastion instances and configuring direct access to the Apiservers instead.

Read more →

Debugging containers using nsenter

2021-10-11 — 2 min read

If you have ever managed a Kubernetes cluster, chances are you have encountered pods that just don't want to behave the way they are supposed to.

You checked the logs and traced it back to the source code. Logic checks out ✅

You started narrowing down the causes. Networking issue? Configuration issue?

You entered the container and decided to use ping to identify network connectivity issues.

/ $ ping google.com
PING google.com (142.251.12.138): 56 data bytes
ping: permission denied (are you root?)

Or maybe you wanted to install another tool like tcpdump to observe network traffic.

Read more →

Visualizing alerts metrics on Grafana

2021-09-26 — 3 min read

#prometheus #monitoring #grafana

When it comes to Prometheus and alerts, the typical use case is to send alerts to Alertmanager for handling (deduplication, grouping) and routing to various services such as Slack, PagerDuty, etc.

However, there are situations where we need to analyze alert patterns, and being able to visualize how often alerts fire can be very useful.

In this post, I will share how we can visualize the alert metrics on Grafana using the various PromQL operators and functions.

Read more →

Debugging a misfiring Prometheus alert

2021-09-20 — 6 min read

#prometheus #monitoring

Last week at work, I encountered an alert that was misfiring. Or so I thought…

Read more →

Nodejs application CPU profile analysis with Flame Graphs

2021-09-06 — 5 min read

#debug #performance analysis #nodejs

In my previous post, I shared my debugging process using various Linux tools and a debugger. During the process, I came across an analysis technique that uses flame graphs and thought it would be interesting to see what information I could get out of it.

What are flame graphs?

Flame graphs, as the name suggests, are graphs that look like flames because of their shape and color (usually red-yellowish hues). They were invented by Brendan Gregg for the purpose of analyzing performance issues and understanding CPU usage quickly.

Read more →

Debugging high CPU usage and memory leak on Nodejs application

2021-09-04 — 7 min read

#debug #performance analysis #nodejs

Recently, one of our nodejs applications (responsible for scraping metrics for external services) running in our EKS cluster was experiencing high CPU usage and a memory leak, and I was tasked with figuring out the root cause. In this post, I will share my troubleshooting process and the interesting things I discovered along the way.

It all began with an alert notifying us that the application was experiencing CPU throttling. Looking at the dashboard, it became apparent that high CPU usage wasn't the only issue; the application also had a memory leak and oddly high incoming and outgoing traffic.

Read more →

Understanding the differences between alertmanager’s group_wait, group_interval and repeat_interval

2021-08-27 — 2 min read

#alertmanager #monitoring #prometheus

Alertmanager is an application that handles alerts sent by client applications such as Prometheus. It can also perform alert grouping, deduplication, silencing, and inhibition. It is definitely a useful addition to any modern monitoring infrastructure.

That being said, configuring it can be a little daunting, given the many configuration options available and the somewhat vague explanations of some of the terms.

While configuring Alertmanager, I came across these 3 confusing terms: group_wait, group_interval and repeat_interval.
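As a rough illustration of how these three timers might interact, here is a small Python simulation for a single alert group. It is a simplified sketch, not Alertmanager's actual implementation: the function name `notification_times` and all of its parameters are hypothetical, times are plain seconds, and alerts are assumed never to resolve. The model assumes the first notification goes out `group_wait` after the first alert, the group is then re-evaluated every `group_interval`, and an unchanged group is re-notified only once `repeat_interval` has elapsed since the last notification.

```python
def notification_times(alert_arrivals, group_wait, group_interval,
                       repeat_interval, horizon):
    """Simulate when one alert group would send notifications.

    Hypothetical, simplified model (all values in seconds):
    - the first evaluation happens group_wait after the first alert arrives
    - the group is then evaluated every group_interval
    - an evaluation sends a notification if new alerts joined the group,
      or if repeat_interval has passed since the last notification
    """
    if not alert_arrivals:
        return []

    arrivals = sorted(alert_arrivals)
    sent = []
    notified_count = 0   # alerts included in the last notification
    last_sent = None     # time of the last notification, if any

    flush = arrivals[0] + group_wait
    while flush <= horizon:
        current_count = sum(1 for t in arrivals if t <= flush)
        changed = current_count > notified_count
        repeat_due = (last_sent is not None
                      and flush - last_sent >= repeat_interval)
        if changed or repeat_due:
            sent.append(flush)
            last_sent = flush
            notified_count = current_count
        flush += group_interval
    return sent


if __name__ == "__main__":
    # Two alerts arrive at t=0s and t=100s; 30s wait, 5m interval, 1h repeat.
    print(notification_times([0, 100], 30, 300, 3600, 4000))
```

In this model, the alert arriving at t=100 misses the first notification and has to wait for the next `group_interval` evaluation, while the repeat only fires on the first evaluation at or after `repeat_interval` has elapsed.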

Read more →

Node-exporter setup with Systemd

2021-08-19 — 3 min read

#linux #monitoring #node-exporter #systemd

For those who aren’t familiar, node-exporter is a Prometheus exporter that exposes hardware and OS metrics from *NIX kernels.

To get it up and running, there's a simple guide in the official Prometheus docs. The issue is that running node-exporter by executing the binary directly isn't the most reliable approach in a production environment, as there's no way to ensure that the node_exporter process will run continuously.

This is where systemd comes in. systemd is an init system and system manager that comes with a management tool called systemctl, which is used for managing processes, checking statuses, managing configuration, and changing system states.

Read more →