It all started with this alert.

Processes experience elevated CPU throttling.
25.91% throttling of CPU in namespace vault for container consul in pod consul-server-2.

I checked the dashboards and saw that CPU usage was periodically peaking above resources.requests.cpu (the red line). Perhaps I should increase the CPU requests a little.

But throttling? Why was the process facing CPU throttling when there’s still quite a bit more to go before hitting the resources.limits.cpu (orange line)?

Isn’t the CPU throttling logic simply:

if cpuUsage > cpuLimit {
    throttle()
} else {
    run()
}
Well…not so simple…

Container requests and limits

Pods are the smallest deployable units of computing that one can create and manage in Kubernetes.

Within pods are containers. A pod can run one or more containers, which we specify in the containers list under the PodSpec.

Under the container specs, we can specify the CPU and memory resources of the container(s) that will be running within a pod.

Setting the requests and limits

apiVersion: v1
kind: Pod
metadata:
  name: ...
spec:
  containers:
  - name: ...
    resources:
      limits:
        cpu: "1"
        memory: 400Mi
      requests:
        cpu: 500m
        memory: 200Mi

These spec.containers[].resources specifications are then used by Kubernetes for workload scheduling and resource limiting.

requests are used by the kube-scheduler to decide which worker node to assign a pod to.

limits are used by the kubelet to restrict how much of a resource a container can use.

Put simply, we can view them as soft limits (requests) and hard limits (limits).

How are these CPU requests and limits mechanisms implemented by Kubernetes?

For those who aren’t aware, a “container” isn’t a first-class concept in Linux. It is made up of Linux features like cgroups (to control the resources available to processes) and namespaces (to isolate processes).

A cgroup is basically a grouping of processes and consists of two parts: the core and controllers.

Taken from the Linux kernel’s cgroup documentation:

cgroup is largely composed of two parts - the core and controllers. cgroup core is primarily responsible for hierarchically organizing processes. A cgroup controller is usually responsible for distributing a specific type of system resource along the hierarchy although there are utility controllers which serve purposes other than resource distribution.

Under the hood, requests.cpu and limits.cpu are implemented using the cpu cgroup controller (which groups processes) and the CFS scheduler (which allocates CPU time based on those groupings).

Though the Kubernetes configurations for requests.cpu and limits.cpu look similar, they are implemented using different mechanisms.

The related configurations and files are located under the /sys/fs/cgroup/kubepods directory.

From the directory tree below, we can see how the cgroup directories are structured to control the CPU resources for each pod and container.

/sys/fs/cgroup/kubepods
|__ ...cgroup related files
|__ /pod700d3573-6918-4f34-a802-facd3d7c6228
|   |__ ...cgroup related files
|   |__ /<container ID>
|       |__ ...cgroup related files
|__ /pod7c6497d6-c5c7-497f-8f8f-b54d9010ea49
|__ /pod90c3c7ed-d488-4e2e-8aaf-edaa935f31b9
|__ /podb8e4fe2d-6ca9-4ba1-bfc9-a4dfb40e9544
|__ /podba6a8975-37a7-4c9c-a365-347844d069e6

We can also see that there are 10 pods (directories prefixed with pod) running on this particular node, which matches the output of kubectl get pods -o wide | grep <node name>.

Then there are the directories nested under each pod directory, with alphanumeric hashes as their names. These are for the containers within the pod.

We can verify this by comparing the directory names against the container IDs in the pod details.

CPU requests via cpu.shares

CPU requests are implemented using cpu.shares. The CFS scheduler looks at the cpu.shares value configured for each process grouping to determine the relative amount of CPU time the group can use.

This file can be found at the /sys/fs/cgroup/cpu,cpuacct directory of a container:

/ $ cd /sys/fs/cgroup/cpu,cpuacct/
/sys/fs/cgroup/cpu,cpuacct $ cat cpu.shares

The cpu.shares value is derived from the container’s resources.requests.cpu value (in this case, cpu: 50m). Note that it is a converted value, not the raw milliCPU number: the kubelet computes shares = milliCPU × 1024 / 1000, so a request of 50m shows up as a cpu.shares value of 51.

It is important to note that the value represents the relative share of CPU time a container will receive when there is contention for CPU resources. It does not represent the actual CPU time each container will receive.

In Kubernetes, one CPU (1000m) is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors.
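As a sketch, the request-to-shares conversion described above can be written out as follows (this mirrors the conversion the kubelet applies, roughly shares = milliCPU × 1024 / 1000 with integer truncation; the function name here is my own, not a Kubernetes API):

```python
# Convert a Kubernetes CPU request (in milliCPU) into a cgroup cpu.shares value.
# 1024 shares correspond to one full CPU (1000m).

def milli_cpu_to_shares(milli_cpu: int) -> int:
    SHARES_PER_CPU = 1024    # cpu.shares value for 1 full CPU
    MILLI_CPU_PER_CPU = 1000
    return (milli_cpu * SHARES_PER_CPU) // MILLI_CPU_PER_CPU

print(milli_cpu_to_shares(50))    # request of 50m  -> 51
print(milli_cpu_to_shares(1000))  # request of 1 CPU -> 1024
```

This is why the cpu.shares file of a container requesting 50m reads 51 rather than 50.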

Assuming we are deploying 2 containers on a single core node and there’s contention for CPU resources:

Scenario A: containers configured with similar requests.cpu values

container A:
  requests.cpu: 1000m
container B:
  requests.cpu: 1000m

Both containers will receive the same amount of CPU time.

Scenario B: containers configured with different requests.cpu values

container A:
  requests.cpu: 1000m
container B:
  requests.cpu: 2000m

In this scenario, container B will receive twice as much CPU time as container A.

What happens if only container A is running?

In this case, container A gets all the available CPU time, since there are no other processes contending for CPU resources.
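The proportional-share behaviour in the two scenarios above can be sketched with a small helper (a simplified model of CFS under full contention; the function and container names are illustrative, not from Kubernetes):

```python
# Under contention, each cgroup receives shares_i / sum(shares) of the CPU.

def cpu_time_split(shares: dict[str, int]) -> dict[str, float]:
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}

# Scenario A: equal requests -> equal CPU time
print(cpu_time_split({"A": 1024, "B": 1024}))

# Scenario B: container B requests twice as much -> twice the CPU time
print(cpu_time_split({"A": 1024, "B": 2048}))
```

Note that these fractions only matter when both containers want CPU at the same time; an idle neighbour frees its share for the others.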

That being said, there are cases where we want to put a hard cap on the amount of CPU time a set of processes can use (e.g. a hostile workload consuming excessive CPU time, or limiting resource usage when performing a load test), which brings us to the next section.

CPU limits via CFS quota

CPU limits are implemented using the CFS bandwidth controller (a subsystem/extension of the CFS scheduler), which uses the values specified in cpu.cfs_period_us and cpu.cfs_quota_us (us = µs, microseconds) to control how much CPU time is available to each control group.

cpu.cfs_period_us: length of each accounting period, in microseconds. Kubernetes configures this to 100,000 (100ms).

cpu.cfs_quota_us: amount of CPU time (in microseconds) available to the group during each accounting period. This value is derived from limits.cpu.

1 vCPU == 1000m == 100,000us
0.5vCPU == 500m == 50,000us
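The limit-to-quota conversion above can be sketched as follows (assuming the Kubernetes default period of 100,000us; the function name is illustrative):

```python
# Convert a Kubernetes CPU limit (in milliCPU) into a cpu.cfs_quota_us value,
# given the default 100,000us accounting period.

CFS_PERIOD_US = 100_000

def milli_cpu_to_quota_us(milli_cpu: int) -> int:
    return (milli_cpu * CFS_PERIOD_US) // 1000

print(milli_cpu_to_quota_us(1000))  # 1 vCPU -> 100000
print(milli_cpu_to_quota_us(500))   # 500m   -> 50000
```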

Similar to cpu.shares, the files can be found at /sys/fs/cgroup/cpu,cpuacct directory of a container:

/sys/fs/cgroup/cpu,cpuacct # cat cpu.cfs_quota_us
/sys/fs/cgroup/cpu,cpuacct # cat cpu.cfs_period_us

Let’s say a web service container is the only process running and has the following requests.cpu set:

web service container:
  requests.cpu: 1000m

Assuming it takes 200ms of CPU time to respond to a request, and since there’s no contention for CPU time, it gets the full 200ms of CPU time uninterrupted.

What if we now set the limits.cpu?

web service container:
  requests.cpu: 1000m
  limits.cpu: 500m

The same request will now take 350ms to respond!

This is because instead of being able to use 200ms of uninterrupted CPU time, the process now has a quota of only 50,000us (500m / 1000m × 100,000us) per 100,000us period. Once the quota is depleted, the process is throttled for the remainder of the period.
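The 350ms figure can be reproduced with a small simulation of the run/throttle cycle (a simplified model that assumes the process runs alone and uses CPU continuously; the function name is my own):

```python
# Simulate CFS bandwidth throttling: run until the per-period quota is
# depleted, then sleep (throttled) until the next period starts.

def wall_time_ms(cpu_needed_ms: float, quota_ms: float, period_ms: float) -> float:
    elapsed = 0.0
    remaining = cpu_needed_ms
    while remaining > 0:
        run = min(remaining, quota_ms)  # CPU time granted this period
        remaining -= run
        if remaining > 0:
            elapsed += period_ms        # throttled for the rest of the period
        else:
            elapsed += run              # finished mid-period
    return elapsed

# 200ms of work, 50ms quota per 100ms period (limits.cpu: 500m)
print(wall_time_ms(200, 50, 100))  # -> 350.0
```

The process runs 50ms, sits throttled for 50ms, and repeats; the last 50ms of work completes at the 350ms mark.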

Throttling metrics can be found in the cpu.stat file:

/sys/fs/cgroup/cpu,cpuacct $ cat cpu.stat
nr_periods 258700
nr_throttled 107792
throttled_time 8635080132047

nr_periods: total number of enforcement periods that have elapsed (throttled or not)

nr_throttled: number of periods a process was throttled

throttled_time: total time a thread in cgroup was throttled

throttled_percentage: (rate of change of nr_throttled) / (rate of change of nr_periods). This is not a field in cpu.stat but a derived metric, and it can give you an idea of how badly a process is being throttled.
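Computing that derived metric from two samples of cpu.stat can be sketched like this (the second sample matches the cpu.stat output shown above; the first sample and the function name are hypothetical, for illustration):

```python
# Derive throttled_percentage from two cpu.stat samples taken some time apart.

def throttled_percentage(prev: dict, curr: dict) -> float:
    d_periods = curr["nr_periods"] - prev["nr_periods"]
    d_throttled = curr["nr_throttled"] - prev["nr_throttled"]
    return 100.0 * d_throttled / d_periods if d_periods else 0.0

prev = {"nr_periods": 258_600, "nr_throttled": 107_770}  # earlier sample (hypothetical)
curr = {"nr_periods": 258_700, "nr_throttled": 107_792}  # from cpu.stat above

print(throttled_percentage(prev, curr))  # 22 of the last 100 periods throttled -> 22.0
```

This is essentially what monitoring stacks do with the container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total counters.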

What can you do about throttled applications/processes?

Fix the application OR increase/remove the limits!


requests.cpu and limits.cpu seem similar, but they are implemented using very different mechanisms!

Just because requests.cpu < limits.cpu does not mean that the process/application/container will not be throttled.


This was by far one of the most complicated topics I have researched, bringing me down several rabbit holes and diving into kernel documentation, articles, videos, etc.

For those interested, these are the resources that helped me greatly on this topic: