Causes and Countermeasures for CPU Steal Time

CPU steal is one of the most common questions from people who use cloud services. The metric exists because a cloud instance, unlike a physical server, shares its hardware with other instances. CPU steal time tells you how much CPU time your virtual machine was ready to use but did not get, because the hypervisor gave the physical CPU to another virtual machine. High CPU steal time can cause web services to fail. Let's take a look at CPU steal time.

CPU Steal Time

CPU Steal Time is the amount of time, expressed as a percentage, that a virtual CPU waits for a physical CPU while the hypervisor services another virtual processor. Virtual machines (VMs) operating in a virtual environment share resources with other instances on a single host. CPU Steal Time tells you how long the CPUs running in a VM are waiting to be allocated resources from the physical machine.

How do you check CPU Steal Time?

First, run the top command in Linux to see a real-time view of key performance metrics. Below are the values when the top command was executed.

top - 10:00:00 up 120 days, 7:00, 3 users, load average: 1.15, 0.88, 0.86
Tasks: 122 total, 10 running, 112 sleeping, 0 stopped, 0 zombie
%Cpu(s): 40.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 70.0%st

The top command shows only what is happening right now. If you want to see historical metrics after a service issue, you should use a monitoring service, such as the Tap Infrastructure Monitoring Service, that records metrics so you can look at the time of the issue. Most monitoring services collect the CPU Steal metric around the clock.
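When no monitoring service is in place, steal time can also be sampled directly. The sketch below is an illustration and not part of any monitoring product: it assumes a Linux guest, where the first line of /proc/stat holds cumulative CPU tick counters with steal as the eighth value, and it computes the steal percentage between two samples. The function names are hypothetical.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    # Older kernels report fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Share of the CPU ticks elapsed between two samples that were stolen."""
    # guest and guest_nice are already counted inside user and nice, so skip them.
    counted = FIELDS[:8]
    total = sum(after[f] - before[f] for f in counted)
    if total <= 0:
        return 0.0
    return 100.0 * (after["steal"] - before["steal"]) / total


def sample_steal(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return steal_percent(first, second)
```

Calling sample_steal() from a scheduled job and appending the result to a log file with a timestamp gives a crude history to look back on, although a proper monitoring service remains the better option.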

The fields on the %Cpu(s) line are described below.

us(user) : Percentage of CPU time spent running user applications (non-kernel code)
sy(system) : Percentage of CPU time spent running kernel code
ni(nice) : Percentage of CPU time spent running user processes with a lowered priority (nice values between 1 and 19)
id(idle) : Percentage of time the CPU was idle
wa(I/O wait) : Percentage of time the CPU was idle while waiting for I/O to complete
hi(hardware interrupt) : Percentage of CPU time spent servicing hardware interrupts, which are handled immediately
si(software interrupt) : Percentage of CPU time spent servicing software interrupts, which are deferred and handled later
st(steal time) : Percentage of time the virtual CPU was ready to run but was not given a physical CPU by the hypervisor (involuntary wait)

CPU Steal Time is the very last field on the %Cpu(s) line. If the system is not running as a guest in a virtualized environment, CPU Steal Time is meaningless.
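To pull the steal value out of captured top output, the summary line can be parsed with a small regular expression. This is a minimal sketch, assuming the line looks like the sample in this article ("70.0%st") or like the layout of newer top versions ("0.3 st"); parse_top_cpu_line is a hypothetical helper name.

```python
import re

# Matches value/label pairs such as "40.0%us" or "40.0 us".
_FIELD = re.compile(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})\b")


def parse_top_cpu_line(line):
    """Return a dict such as {'us': 40.0, ..., 'st': 70.0} from a '%Cpu(s):' line."""
    # Everything after the first colon holds the comma-separated fields.
    _, _, rest = line.partition(":")
    return {name: float(value) for value, name in _FIELD.findall(rest)}


sample = "%Cpu(s): 40.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 70.0%st"
print(parse_top_cpu_line(sample)["st"])  # prints 70.0
```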

What causes high CPU Steal Time?

CPU steal occurs for one of two reasons: either the physical host the VM runs on is short of resources to begin with, or the host has enough resources but too little CPU is allocated to the VM. The first happens when too many VMs are running on the host; the second happens when the administrator has set the resource limits for each VM incorrectly. Alternatively, the physical equipment may be aging and unable to keep up with the hosting service.

What happens when CPU steal time is high?

First of all, for jobs that run for a long time in the background, such as batch jobs, this is usually not a problem. CPU Steal Time will not stop the job in these cases; the job will just finish a little later because it shares CPU cycles with other VMs.

However, this can be problematic for web applications, which must process customer requests in real time. As CPU steal time increases, response performance decreases, and eventually the service fails because requests can no longer be answered in time.
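A rough way to see why steal hurts latency-sensitive work: if a request needs a fixed amount of CPU time and the virtual CPU only gets the physical CPU for part of each second, the wall-clock time stretches accordingly. The sketch below is a simplified model and not a measurement. It assumes the request is purely CPU-bound and that steal is spread evenly over time; the function name is hypothetical.

```python
def stretched_latency_ms(cpu_ms, steal_pct):
    """Estimate wall-clock time for CPU-bound work under a steady steal percentage."""
    if not 0 <= steal_pct < 100:
        raise ValueError("steal_pct must be at least 0 and below 100")
    # Fraction of wall-clock time in which the vCPU actually runs.
    available = 1.0 - steal_pct / 100.0
    return cpu_ms / available
```

Under this model, a request that needs 30 ms of CPU takes roughly 100 ms at 70% steal. A batch job absorbs that kind of slowdown, but a web request with a tight timeout may not.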

Troubleshooting at the cloud service vendor

Adjust the resource throttling settings for the VMs running on a particular server. If resource utilization is not actually managed on a per-VM basis, a few resource-hungry VMs can starve the other VMs on the same host.
Upgrade the hypervisor. If you are using a hypervisor that lacks the capability to allocate resources to your VMs properly, upgrade the hypervisor to a newer version or replace it with better software.
Upgrade the equipment on the physical server. Increase the resources available to VMs by upgrading to a processor with more processing power or adding cores.
Move VMs to offset the load. Understand the utilization of each VM so that you can equalize resource utilization across hosts. By placing CPU-heavy VMs on separate physical hosts and spreading their work across more CPUs, you can reduce the load on virtual CPUs.
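The last measure, moving VMs to equalize load, can be illustrated with a simple greedy placement. This is a sketch under stated assumptions, not how any particular hypervisor or cloud scheduler works: it assumes the average CPU demand of each VM is known, treats all hosts as identical, and always places the next-heaviest VM on the currently least-loaded host. The function and variable names are hypothetical.

```python
import heapq


def spread_vms(vm_cpu_demand, host_count):
    """Assign VMs to hosts so that heavy CPU consumers end up on different hosts.

    vm_cpu_demand maps a VM name to its average CPU demand (any consistent unit).
    Returns one {"load": total, "vms": [names]} entry per host.
    """
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    # Min-heap of (current load, host index): the least-loaded host pops first.
    heap = [(0.0, i) for i in range(host_count)]
    heapq.heapify(heap)
    hosts = [{"load": 0.0, "vms": []} for _ in range(host_count)]
    # Place the largest VMs first so they are separated before hosts fill up.
    by_demand = sorted(vm_cpu_demand.items(), key=lambda kv: kv[1], reverse=True)
    for name, demand in by_demand:
        load, idx = heapq.heappop(heap)
        hosts[idx]["vms"].append(name)
        hosts[idx]["load"] = load + demand
        heapq.heappush(heap, (load + demand, idx))
    return hosts
```

For example, with demands {"a": 8, "b": 7, "c": 3, "d": 2} and two hosts, the two heaviest VMs land on different hosts and both hosts end up with a total load of 10. A real placement decision would also have to weigh memory, storage, network, and the cost of migrating a running VM.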

Troubleshooting as a cloud customer

As an end user, there is very little you can do to directly address the issue of high CPU Steal Time. If CPU Steal Time is impacting your service, first check with your hosting provider that the VMs you are paying for are delivering the resources specified in your contract, although most cloud service providers will tell you that they are. Beyond that, you will need to take one of two actions:

Purchase a more powerful instance.
Redeploy the application on another instance.

Wrapping up

When using the cloud, you cannot afford to neglect monitoring. Whether you are a developer, operator, or planner, you need to have the tools in place to analyze the cause if something goes wrong. Ideally, you also have a hotline to a cloud specialist or expert you can reach at any time.