kubectl
A little trick I learned for copying files out of a (somewhat) locked down EC2 worker node for EKS
How Wireshark saved my sanity while setting up OpenVPN on Raspberry Pi
Difference between TCP window size & MTU
Using multiple SSH configurations for git operations
Configuring conntrack limits for EKS worker nodes
Orphan vs Zombie vs Daemon processes
What are processes?
A process is a program in execution, and a program is a piece of code, anywhere from a single line to millions of lines long, written in a programming language.
When a UNIX machine is powered up, the kernel is loaded and completes its initialization. Once initialization is done, the kernel creates a set of processes in user space, including the system management daemon (usually named init), which has PID 1 and is responsible for running the right complement of services and daemons at any given time.
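To make the idea of PIDs and parent processes a bit more concrete, here is a minimal sketch in Python (not taken from the original post). It prints the PID and parent PID of the running interpreter, then starts a child process that reports who its parent is. The helper names `describe_current_process` and `spawn_child` are made up for illustration, and the sketch assumes `sys.executable` points at a usable Python interpreter.

```python
import os
import subprocess
import sys


def describe_current_process():
    """Return (pid, ppid) of the running interpreter."""
    return os.getpid(), os.getppid()


def spawn_child():
    """Start a child Python process and return the parent PID it observed."""
    # Assumption: sys.executable is a valid interpreter path on this machine.
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getppid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


pid, ppid = describe_current_process()
child_reported_parent = spawn_child()

print(f"this process: pid={pid}, parent pid={ppid}")
print(f"child saw parent pid={child_reported_parent}")
```

The child should report the PID of the script that launched it, since the parent is still alive and waiting for it. Following the chain of parent PIDs upward on a typical Linux machine eventually leads back to PID 1.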
Managing multiple EKS clusters access using Apiservers’ private endpoints with AWS VPN
I manage multiple EKS clusters (multi-environment, multi-tenant) at work, and access to them goes through Bastion instances deployed in each cluster's VPC.
However, this approach can become unmaintainable over time, as the number of Bastion instances grows with the number of clusters we manage. Each of those instances needs additional monitoring and maintenance effort.
This led to the idea of removing all Bastion instances and configuring direct access to the Apiservers instead.
Debugging containers using nsenter
If you have ever managed a Kubernetes cluster, chances are you have encountered pods that just don’t want to behave the way they are supposed to.
You checked the logs and traced it back to the source code. Logic checks out ✅
You started narrowing down the causes. Networking issue? Configuration issue?
You entered the container and decided to use ping to identify network connectivity issues.
/ $ ping google.com
PING google.com (142.251.12.138): 56 data bytes
ping: permission denied (are you root?)
Or maybe you wanted to install another tool like tcpdump to observe network traffic.
Visualizing alerts metrics on Grafana
When it comes to Prometheus and alerts, the typical use case is to send alerts to Alertmanager for handling (deduplication, grouping) and routing to various services such as Slack, PagerDuty, etc.
However, there are situations where we need to analyze alert patterns, and being able to visualize how often alerts are firing can be very useful.
In this post, I will share how we can visualize the alert metrics on Grafana using the various PromQL operators and functions.
Debugging a misfiring Prometheus alert
Nodejs application CPU profile analysis with Flame Graphs
In my previous post, I shared my debugging process using various Linux tools and a debugger. Along the way, I came across flame graphs as an analysis technique and thought it would be interesting to see what information I could get out of them.
What are flame graphs?
Flame graphs, as the name suggests, are graphs that look like flames because of their shape and color (usually red-yellowish hues). They were invented by Brendan Gregg for analyzing performance issues and understanding CPU usage quickly.
Debugging high CPU usage and memory leak on Nodejs application
Recently, one of our Nodejs applications (responsible for scraping metrics for external services) running in our EKS cluster was experiencing high CPU usage and a memory leak, and I was tasked with figuring out the root cause. In this post, I will share my troubleshooting process and the interesting things I discovered along the way.
It all began with an alert notifying us that the application was experiencing CPU throttling. Looking at the dashboard, it became apparent that high CPU usage wasn’t the only issue; the application was also leaking memory and showing oddly high incoming and outgoing traffic.
Understanding the differences between alertmanager’s group_wait, group_interval and repeat_interval
Alertmanager is an application that handles alerts sent by client applications such as Prometheus. It can also perform alert grouping, deduplication, silencing, and inhibition. It is definitely a useful addition to any modern monitoring infrastructure.
That being said, configuring it can be a little daunting, given the many configuration options available and the somewhat vague explanations of some of the terms.
While configuring Alertmanager, I came across these 3 confusing terms: group_wait, group_interval and repeat_interval.
Node-exporter setup with Systemd
For those who aren’t familiar, node-exporter is a Prometheus exporter that exposes hardware and OS metrics from *NIX kernels.
To get it up and running, there’s a simple guide in the official Prometheus docs. The issue is that running node-exporter by executing the binary directly isn’t the most reliable approach in a production environment, as there’s no way to ensure that the node_exporter process will run continuously.
This is where systemd comes in. systemd is an init system and system manager that comes with a management tool called systemctl, used for managing processes, checking their status, handling configuration and changing system state.