The topic of NUMA and vNUMA has come up quite a bit recently for me. I’ve had in-depth discussions with many of my customers, co-workers and people that I have interviewed for positions where I work. I like to engage people during discussions and I always try to gauge their understanding of the topic first before anything else. It creates less confusion when it comes time for me to explain it.
NUMA and vNUMA are an integral part of my customer discussions for designing, sizing or troubleshooting performance-related concerns or issues. One example involves a customer who recently needed a server hardware refresh. Covering this topic with them was very important because their legacy compute hardware contained dual socket 12-core CPUs whereas the new hardware utilized dual socket 10-core CPUs. Do you see the potential impact/risk here regarding NUMA as well as the negative impact it could have on overall performance? We will take a look at this more in a moment but keep this in the back of your mind for the time being.
My goal in this blog article is to provide you with a foundational understanding of NUMA. Start with the basics and fundamentals here, and then you can go further with it later on your own.
What is NUMA?
We are going to make this a “Basics 101” type of discussion. First, NUMA is an acronym that stands for “Non-Uniform Memory Access”. Non-uniform meaning memory is being accessed…
The underlying system hardware has more than one system bus; it is a shared memory architecture found in today’s SMP systems. Each physical CPU is assigned “local” memory. Local memory access delivers low latency and high performance (bandwidth). If a CPU has to access memory owned by another CPU, latency increases and performance (bandwidth) drops. This type of access is called “remote” memory access, and it is a situation you want to avoid.
Next, some of you may be asking what is a NUMA node? This part is very simple…it is nothing more than a logical grouping of CPU and memory resources. Take a look at the graphic below. Here you will see two (2) NUMA nodes; each with access to local memory. Each physical CPU socket represents a node.
Each NUMA node above contains 12 CPU cores. More importantly, I want you to pay very close attention to the “balance” of resources: each node has the same amount. Equal CPU and memory resources result in a balanced host configuration. Balanced configurations enable you to achieve stability. I love the word “balanced” when describing NUMA and you should too. Balance is critical when looking to achieve optimal performance.
Next I’m going to throw out an arbitrary amount of memory resources so let’s say the configuration for this server contains 512 GB of memory. Assume the DIMMs installed in the server are balanced equally between the NUMA nodes; an equal number of DIMMs per socket and per channel.
Now another question…how much memory (in GB) does each NUMA node have in this example? The answer is 256 GB of memory. Take the amount of memory you have in the server (GB) and divide that by the number of sockets. It’s a very simple equation.
512 GB Memory / 2 Sockets = 256 GB of memory per NUMA node
Now let’s assume I have a server with 512 GB of memory but it is configured with four (4) sockets. How much memory do I have per NUMA node?
512 GB Memory / 4 Sockets = 128 GB of memory per NUMA node
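If you want to play with the numbers yourself, here is a minimal Python sketch of the same equation. The function name memory_per_numa_node is just something I made up for illustration, and it assumes the DIMMs are spread evenly across the sockets.

```python
def memory_per_numa_node(total_memory_gb: float, sockets: int) -> float:
    """Return the GB of local memory per NUMA node.

    Assumes one NUMA node per socket and DIMMs balanced evenly across sockets.
    """
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return total_memory_gb / sockets


print(memory_per_numa_node(512, 2))  # 256.0
print(memory_per_numa_node(512, 4))  # 128.0
```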
I think you are getting the picture here! 🙂 Each one of these NUMA constructs is commonly referred to as a NUMA Home Node; the physical CPU and its local memory. Understanding the NUMA node size is very important when sizing and allocating CPU and memory resources for your VMs which we will talk about very shortly.
Is it possible to have an unbalanced NUMA configuration? Yes, it is very possible. It all comes down to designing your underlying server very carefully. If your solution requires both memory capacity and optimal memory performance then the memory rank configuration is going to be very important in your server config.
An unbalanced NUMA node configuration could be a system that has an uneven number of DIMMs per CPU channel. For example, let’s say I have a smaller ESXi host configured with dual sockets and this time only 96GB of memory (8GB DIMMs x 12). NUMA Node 0 will likely have access to eight (8) of the 8GB DIMMs (64GB of memory) whereas NUMA Node 1 will have access to only four (4) of the 8GB DIMMs (32GB of memory). Again, this is an unbalanced hardware configuration.
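The 96GB example can be sketched in a few lines of Python. This is only an illustration: the helper names node_memory_gb and is_balanced are hypothetical, and the check only compares DIMM counts per node, nothing about channels or ranks.

```python
from typing import List


def node_memory_gb(dimms_per_node: List[int], dimm_size_gb: int) -> List[int]:
    """Return the local memory (GB) of each NUMA node from its DIMM count."""
    return [count * dimm_size_gb for count in dimms_per_node]


def is_balanced(dimms_per_node: List[int]) -> bool:
    """True when every NUMA node has the same number of DIMMs."""
    return len(set(dimms_per_node)) <= 1


layout = [8, 4]  # Node 0 gets eight DIMMs, Node 1 gets four (12 x 8GB = 96GB)
print(node_memory_gb(layout, 8))  # [64, 32]
print(is_balanced(layout))        # False
print(is_balanced([6, 6]))        # True
```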
I can also have an unbalanced “channel” configuration. Basically, the CPUs (NUMA nodes) may have an equal number of DIMMs assigned to them but the DIMMs are spread unevenly across the channels of each CPU (node).
So be very careful when designing your ESXi servers and take into account the CPU and memory configuration of that hardware! DIMMs per channel, Load Reduced DIMMs (LRDIMMs) and Ranks will all play a factor. Rely on your vendor or distributor to ensure you have a balanced CPU and memory configuration. Properly configure the server resources to guarantee performance levels for the VMs. Failure to do so can have an immediate negative impact and create a great deal of stress over the long term of your investment.
It is important for you to understand that having HT enabled on your ESXi host does not automatically equate to more logical cores per socket. The reason is that, by default, the NUMA scheduler does not count HT logical processors within a NUMA node. Can an administrator override this and expose HT logical processors? Yes. Configure the following advanced setting on a VM that would benefit from memory and cache sharing. This is on a per-VM basis:
numa.vcpu.preferHT = TRUE
Should you rush to change this on every single VM? No. I would not recommend doing so. The most important thing here is to fully understand the workload (application) within the VM and whether or not it would benefit from this setting. Remember, not all applications benefit from HT. Multi-threaded applications will realize the benefit of HT whereas single-threaded apps will not. So make sure you fully understand the requirements and implications before enabling this advanced setting as it can have a positive or negative outcome depending on how it is used.
A very important rule of thumb here comes from Frank Denneman and Niels Hagoort in their latest book ‘VMware vSphere 6.5 Host Resources Deep Dive’, and that is “count threads, not cores.”
vNUMA and Properly Sizing VMs
The VMware hypervisor has been NUMA aware for a very long time now. Always make sure you know and understand your ESXi hosts’ physical NUMA topology in your current vSphere environment, and even more so when doing a hardware refresh, due to the potential impact it can have on the overall NUMA architecture.
So what is vNUMA? vNUMA is when the underlying physical NUMA topology (the ESXi host) becomes visible to the guests above. Plain and simple. However, two conditions must be met before vNUMA becomes enabled on a VM:
- The virtual machine is configured with nine (9) or more vCPUs.
- vCPU count exceeds the physical core count of the NUMA node. The ‘numa.vcpu.preferHT = True’ setting can play a potential role here.
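Here is a rough Python sketch of those two conditions as I have stated them. It is not how ESXi actually makes the decision internally; the function vnuma_exposed and its parameters are hypothetical, and treating ‘numa.vcpu.preferHT’ as "count two threads per core" is my simplifying assumption.

```python
MIN_VCPUS_FOR_VNUMA = 9  # the "nine or more vCPUs" condition from the list above


def vnuma_exposed(vcpus: int, cores_per_node: int,
                  prefer_ht: bool = False, threads_per_core: int = 2) -> bool:
    """Return True when both conditions from the article are met.

    Assumption: with prefer_ht set, the node's capacity is counted in
    HT threads (cores * threads_per_core) instead of physical cores.
    """
    capacity = cores_per_node * threads_per_core if prefer_ht else cores_per_node
    return vcpus >= MIN_VCPUS_FOR_VNUMA and vcpus > capacity


print(vnuma_exposed(12, 10))                  # True: 9+ vCPUs and wider than the node
print(vnuma_exposed(12, 12))                  # False: fits within the node's cores
print(vnuma_exposed(8, 4))                    # False: fewer than nine vCPUs
print(vnuma_exposed(16, 10, prefer_ht=True))  # False: 16 fits within 20 threads
```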
I have always sized my VMs according to the NUMA node size. You will experience the best possible performance for that VM because the memory is always accessed LOCALLY and not remotely. If my ESXi hosts are configured with dual socket, 10-core CPUs and 128 GB of memory, I know right off the bat that the maximum possible size for any large VM would be 10 vCPUs and 64 GB of memory. That particular NUMA node size becomes my ceiling, my constraint for allocating CPU and memory resources to my virtual machines. Anything beyond that and I begin to enter what I call “Monster VM” territory (commonly known as a ‘wide VM’). I refer to it as a “Monster VM” not simply due to its size but more so because of the application latency risk, and that is just “straight up UGLY” to deal with (the Shrek VM).
What if the application requires a large number of CPUs…12, 16 or greater? Simple. Make sure the CPUs in your server hardware contain that CPU core count at a minimum. The same principle applies to sizing for your memory requirements. Just do whatever is possible to avoid a “Monster VM” configuration. If my VM has a 12 vCPU and 64 GB of RAM resource requirement then I would make sure my ESXi host hardware contains 12-core CPUs and at least 128 GB of memory.
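To tie this back to the hardware refresh example from the beginning, here is a small Python sketch of the "NUMA node as a ceiling" check. The Host class and fits_in_one_node function are hypothetical names, the 128 GB on the legacy host is assumed purely for illustration, and the sketch assumes balanced DIMMs and ignores any memory overhead.

```python
from dataclasses import dataclass


@dataclass
class Host:
    sockets: int
    cores_per_socket: int
    memory_gb: int

    @property
    def node_memory_gb(self) -> float:
        # One NUMA node per socket, memory balanced evenly across sockets.
        return self.memory_gb / self.sockets


def fits_in_one_node(host: Host, vcpus: int, vm_memory_gb: int) -> bool:
    """True when the VM stays within a single NUMA node (not a "wide VM")."""
    return vcpus <= host.cores_per_socket and vm_memory_gb <= host.node_memory_gb


legacy = Host(sockets=2, cores_per_socket=12, memory_gb=128)     # memory size assumed
refreshed = Host(sockets=2, cores_per_socket=10, memory_gb=128)

print(fits_in_one_node(legacy, 12, 64))     # True
print(fits_in_one_node(refreshed, 12, 64))  # False: 12 vCPUs now span two nodes
print(fits_in_one_node(refreshed, 10, 64))  # True
```

The second line is the risk hiding in that refresh: a 12 vCPU VM that sat comfortably inside one node on the old 12-core hosts becomes a wide VM on the new 10-core hosts.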
Would I build all of my VMs from a template that is configured with my maximum NUMA node size? No. That’s just ridiculous. Remember one thing, maximizing the amount of CPU and memory resources for your VMs does not always equate to better VM performance. Lean on your high-level management and monitoring tools like vRealize Operations, Turbonomic or Veeam ONE to help you “right size” all of your VMs.
Tools for NUMA
There are some useful tools available for viewing NUMA configurations within your VMs. The Microsoft Sysinternals ‘CoreInfo’ tool can be used to expose the underlying CPU architecture. This tool is run from the command line. Another free tool to download and view the CPU core info is ‘Numa Explorer’ which can be found HERE.
Inside a Red Hat or CentOS virtual machine, use the ‘numactl’ command to view the NUMA topology. For other operating systems refer to your deputy administrator…GOOGLE!
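If you would rather peek at the topology from a script inside a Linux guest, here is a minimal Python sketch. It is not a replacement for ‘numactl’; it simply reads the per-node ‘cpulist’ files that Linux exposes under /sys/devices/system/node. The function names parse_cpulist and read_numa_nodes are my own, and the sketch returns nothing on systems where that directory does not exist.

```python
from pathlib import Path
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a Linux cpulist string such as "0-3,8-11" into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node with no CPUs
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(root: str = "/sys/devices/system/node") -> Dict[int, List[int]]:
    """Map each NUMA node number to its CPUs; empty dict if sysfs is missing."""
    nodes: Dict[int, List[int]] = {}
    base = Path(root)
    if not base.is_dir():
        return nodes
    for node_dir in sorted(base.glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


for node, cpus in read_numa_nodes().items():
    print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```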
Last but not least, the greatest tool of all time for understanding anything performance related in vSphere (specific to vSphere 6.5), and especially anything NUMA related, is the hottest book on the market…
VMware vSphere 6.5 Host Resources Deep Dive
by Frank Denneman & Niels Hagoort
I’m still reading the copy I bought from Amazon a few weeks ago and it is simply AMAZING! The title “Deep Dive” doesn’t do it justice! The content of this book goes into the deepest depth possible with NUMA and many, many other performance related topics. Optimize everything in your vSphere environment….CPU, memory, storage resources (including vSAN), networking and more.
The great part about this book is that it isn’t written in what would seem like a “foreign language”, the way some deep dive books can be. These guys break it down for you and make it very easy to understand. The only way this book would be “too much” for you or over your head is if you are entirely new to vSphere. You should certainly have at least VCP-level knowledge before picking this book up. It’ll be much more enjoyable.
Don’t waste time…buy this thing now! As I said a moment ago, I ordered my copy using Amazon Prime and had it in less than 48 hours. Now that you’ve read my “baby blog” article about NUMA and vNUMA, you are ready for their book and ready to dive deeper into NUMA. Much, much deeper!
You can also follow these guys on Twitter!