vSphere Performance Troubleshooting Part 1: CPU

Even though vSphere tells you that the overall CPU utilization is no more than 60%, this doesn’t mean that your VMs aren’t running low on CPU resources.
In this part we’re going to take a deep dive into troubleshooting CPU performance issues in your vSphere environment. Although I’m using vSphere 5.0 for my screenshots, most of the options shown also apply to vSphere 3 and 4.

The key tool for CPU performance troubleshooting is esxtop or, if you are running it remotely, resxtop.
In my examples I’m using esxtop directly on the vSphere ESXi console, but all of these commands also work with resxtop. You just have to provide an extra parameter, --server, to specify the server on which you want to run the command. Of course, you also have to provide a username and password to get access to that server.
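
As a rough illustration of the remote invocation, the sketch below assembles a resxtop command line in Python. The helper name `build_resxtop_command` and the host and user values are placeholders of my own; only the `--server` and `--username` options come from resxtop itself, which asks for the password when it starts.

```python
import shlex


def build_resxtop_command(server, username):
    """Return the argument list for a remote resxtop session.

    Hypothetical helper: the host and user values are placeholders.
    The password is not passed here; resxtop prompts for it.
    """
    return ["resxtop", "--server", server, "--username", username]


if __name__ == "__main__":
    cmd = build_resxtop_command("esxi01.example.local", "root")
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run` unchanged if you want to start the session from a script.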

If we start esxtop on our vSphere ESXi host we will get the following screen:

esxtopcpu01

First let me explain what we see here.

esxtopcpu02

The first three lines:

 1:06:05pm  The current time of your ESXi server. Notice that the time is in UTC.
 up 18 days 3:35  How long your ESXi server has been up.
 307 worlds  The number of worlds. A world is a process that’s running in your VMkernel.
 5 VMs  The number of VMs running on your ESXi host.
 12 vCPUs  The number of vCPUs provided to the VMs.
 CPU load average  The average CPU load over the last 1, 5 and 15 minutes. A load average of 1.00 means that all physical CPUs are fully utilized; if the load average stays above that, your system does not have enough CPU resources.
 PCPU USED(%)  Real-time CPU usage per CPU core, in percent. As you can see, my system is an 8-core system (2 quad-core CPUs). AVG is the average of all pCPU cores: (4.3 + 2.4 + 0.0 + 0.3 + 5.1 + 1.7 + 2.7 + 3.3) / 8 ≈ 2.5
 PCPU UTIL(%)  Real-time CPU utilization per CPU core, in percent. AVG is the average of all pCPU cores: (4.4 + 7.2 + 100 + 5.2 + 5.9 + 2.3 + 2.4 + 4.4) / 8 ≈ 16
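
The AVG column is simply the arithmetic mean of the per-core values. A minimal Python sketch, using the eight per-core readings quoted above (the function and variable names are my own):

```python
def average(values):
    """Arithmetic mean of the per-core percentages."""
    return sum(values) / len(values)


# Per-core values as quoted from the example screenshot.
pcpu_used = [4.3, 2.4, 0.0, 0.3, 5.1, 1.7, 2.7, 3.3]
pcpu_util = [4.4, 7.2, 100.0, 5.2, 5.9, 2.3, 2.4, 4.4]

print(f"PCPU USED AVG: {average(pcpu_used):.1f}")
print(f"PCPU UTIL AVG: {average(pcpu_util):.1f}")
```

Note how a single busy core (100% in the UTIL row) barely moves the average, which is why the per-core values deserve a look as well.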

After the first three lines you will see a table like the following:

esxtopcpu03_0

So let me explain what we see there:

ID  The resource world ID. A world is an ESXi VMkernel schedulable entity, similar to a process or thread in other operating systems.
GID  The resource group ID of the world. A group contains one or more worlds. If you press e in esxtop and enter the GID, the group expands into the individual worlds that belong to it. Every VM consists of at least 4 worlds:

  1. vmx: This world serves the vCPU worlds described under vmx-vcpu-#.
  2. vmast.#: This world is used for memory scanning.
  3. vmx-mks: This world is used for mouse, keyboard and screen.
  4. vmx-vcpu-#: One world for every vCPU of the VM. The number of vCPU worlds equals the number of vCPUs configured for the VM.
NAME  The name of the world or resource pool.
NWLD  The number of worlds in the resource pool.
%USED  The percentage of physical CPU core cycles used by the resource pool/world.
  %USED = #vCPU * 100% indicates that the VM is using all the CPU cycles it can get, in other words the VM is running at 100%.
%RUN  The percentage of time scheduled. This value can be twice as large as %USED.
  %RUN > %USED: the pCPU is not running at its rated clock frequency, probably due to power saving.
%SYS  The percentage of time spent in the ESXi VMkernel on behalf of the resource pool/world to process interrupts and to perform other system activities.
  If higher than 25%, the VM is a high-I/O VM. If you are aware of this, OK. If not, check the other statistics.
%WAIT  The total percentage of time the resource pool/world spent in the wait state.
%VMWAIT  = %WAIT - %IDLE
%RDY  The percentage of time the resource pool/world was ready to run but was waiting to be scheduled.
  >20% indicates that the number of pCPU cores is too low.
%IDLE  The percentage of time the resource pool/world was idle.
%OVRLP  The percentage of system time that was spent on behalf of some other resource pool/world while this resource pool/world was scheduled.
%CSTP  The percentage of time the resource pool/world spent in the ready, co-deschedule state.
  >5%: this occurs when a VM has multiple vCPUs and one vCPU has to wait for another vCPU to catch up.
%MLMTD  The percentage of time the ESXi VMkernel deliberately did not run the resource pool/world because that would violate the resource pool/world’s limit setting.
%SWPWT  The percentage of time the resource pool/world spent waiting for the ESXi VMkernel to swap memory. This time is included in %WAIT.
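
A few of the counters in this list are defined in terms of each other. The Python sketch below spells out those relationships; the `WorldStats` class and the sample readings are invented for illustration and are not something esxtop provides.

```python
from dataclasses import dataclass


@dataclass
class WorldStats:
    """Hypothetical container for one esxtop row (all values in percent)."""
    vcpus: int
    used: float
    run: float
    wait: float
    idle: float

    def vmwait(self):
        # %VMWAIT = %WAIT - %IDLE, as defined in the list above.
        return self.wait - self.idle

    def is_saturated(self):
        # %USED reaching #vCPU * 100% means the VM uses every cycle it can get.
        return self.used >= self.vcpus * 100.0

    def runs_below_rated_clock(self):
        # %RUN above %USED hints at a pCPU running below its rated frequency.
        return self.run > self.used


# Made-up sample values for a 2-vCPU VM, for illustration only.
vm = WorldStats(vcpus=2, used=150.0, run=160.0, wait=380.0, idle=370.0)
print(vm.vmwait(), vm.is_saturated(), vm.runs_below_rated_clock())
```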

This picture (which I have borrowed from a VMworld presentation) explains the overall relationship between the different variables.

esxtopcpu04

OK, now we know what the different variables stand for and how they relate to each other. The next question is: which variables do I have to monitor, and what are their thresholds?

Variable  Threshold  Resolution
%RDY      >10%       If higher than 10% for a long time, add more CPU cores to your vSphere host.
%CSTP     >5%        This only occurs in a VM with more than 1 vCPU. Add more pCPUs to the host or decrease the number of vCPUs in the VM.
%MLMTD    >0%        If higher than 0%, the vCPU is being throttled because of CPU limits.
%SYS      >20%       If higher than 20%, the VM is likely a high-I/O VM. Check the guest OS for problems.
%RUN      >%USED     The pCPU is not running at its rated clock frequency, probably due to power saving.
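
To make the table easier to apply, here is a minimal Python sketch that flags counters above these thresholds. The `check_counters` function, the dictionary layout and the sample readings are assumptions of mine for illustration; the numbers in `THRESHOLDS` are the ones from the table.

```python
# Thresholds copied from the table above (percent).
THRESHOLDS = {"%RDY": 10.0, "%CSTP": 5.0, "%MLMTD": 0.0, "%SYS": 20.0}


def check_counters(sample):
    """Return the counters in `sample` that exceed their threshold.

    `sample` is a hypothetical dict mapping esxtop column names to values.
    """
    flagged = [name for name, limit in THRESHOLDS.items()
               if sample.get(name, 0.0) > limit]
    # %RUN is compared against %USED rather than against a fixed number.
    if sample.get("%RUN", 0.0) > sample.get("%USED", 0.0):
        flagged.append("%RUN")
    return flagged


# Made-up readings for one VM, for illustration only.
sample = {"%RDY": 12.5, "%CSTP": 0.0, "%MLMTD": 0.0,
          "%SYS": 3.1, "%RUN": 40.0, "%USED": 42.0}
print(check_counters(sample))
```

A single reading above a threshold is not a problem by itself; as the table says for %RDY, it is the values that stay high for a long time that matter.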

So that’s what you need to know about monitoring your vSphere ESXi host with esxtop.

About Michael
Michael Wilmsen is an experienced VMware Architect with more than 20 years in the IT industry. His main focus is VMware vSphere, Horizon View and hyper-converged infrastructure, with a deep interest in performance and architecture. Michael is VCDX 210 certified, has held the vExpert title since 2011, and is a Nutanix Tech Champion and a Nutanix Platform Professional.
