
Sunday, February 5, 2017

NUMA configuration issues on VMs running SQL Server

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).¹

When configuring a SQL Server environment, it is very important to understand the underlying NUMA architecture. SQL Server has its own scheduling system, the User Mode Scheduler (UMS), which is NUMA aware.

 Taking this into account, we have to be aware of two rules:
  1. Parallel queries do not exceed the boundaries of a NUMA node. Ensure that the ‘max degree of parallelism’ setting (MAXDOP) is set to at most the number of physical CPUs per NUMA node, or to a divisor of that number. For example, for 8 CPUs per NUMA node the valid MAXDOP values would be 8, 4, 2 or 1. The idea behind this is to avoid remote memory access across NUMA nodes (a configuration sketch follows the queries below).
  2. CPUs should be evenly distributed across NUMA nodes. Because of core licensing, the CPU affinity mask or the virtual machine configuration, the number of CPUs used per NUMA node can differ. This must be adjusted in order to prevent scheduling and performance issues.
Query to get the number of CPUs per NUMA node used by SQL Server:
SELECT parent_node_id 'NUMANode',COUNT(*) 'ActiveCPUs'
FROM sys.dm_os_schedulers
WHERE scheduler_id<1048576 AND is_online=1
GROUP BY parent_node_id
Query to get the number of CPUs per NUMA node:
SELECT parent_node_id 'NUMANode', COUNT(*) 'Number of CPUs'
FROM sys.dm_os_schedulers
WHERE scheduler_id<1048576
GROUP BY parent_node_id
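
As a reference for rule 1, MAXDOP can be set through sp_configure. This is only a minimal sketch; the value 8 is an assumption based on the example of 8 CPUs per NUMA node above and has to be adapted to the output of the queries:
-- Sketch: set MAXDOP to the number of CPUs per NUMA node (8 assumed here) or a divisor of it
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;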

In this article, I am going more in depth into the VM configuration in order to successfully align it with the host NUMA configuration. In order to get information about the NUMA configuration of the host, you can use the Sysinternals Coreinfo tool (https://technet.microsoft.com/en-us/sysinternals/cc835722.aspx).

The most important sections to understand the architecture:
· Logical to Physical Processor Map: you can see if hyper-threading is enabled and how the logical processors are mapped to the physical ones
· Logical Processor to Socket Map
· Logical Processor to NUMA Node Map
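
The NUMA layout that SQL Server itself sees can be cross-checked from inside the instance as well. The query below is only a sketch against sys.dm_os_nodes; the exact set of columns depends on the SQL Server version:
-- Sketch: SQL Server's view of the NUMA nodes (column availability may vary by version)
SELECT node_id, node_state_desc, memory_node_id,
       online_scheduler_count, cpu_affinity_mask
FROM sys.dm_os_nodes
WHERE node_state_desc NOT LIKE '%DAC%' -- exclude the dedicated admin connection node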
We are taking as an example a host which has the following configuration:
  · 4 NUMA nodes
  · 8 CPUs per NUMA node
  · 80 GB RAM per NUMA node


Scenario 1
If we create a VM with 8 cores and 80 GB of RAM, the resources in green will be used by the VM:

Scenario 2
If we modify the VM and, instead of 80 GB, assign 81 GB of RAM (anything between 80 and 160 GB), the NUMA topology seen by the VM will be completely different. The resources used are going to look as follows:

Note that not all of the spanned memory is actually going to be used: out of the 160 GB covered by the two NUMA nodes, only 81 GB are used. Also take into account that the memory is now distributed over two NUMA nodes.
The setup is still OK, but it would be better to use the previous one. Parallel queries should use at most 4 cores (according to rule 1, MAXDOP should be set to a maximum of 4) instead of 8.
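
To verify how the memory is spread over the NUMA nodes from SQL Server's point of view, something along the lines of the sketch below against sys.dm_os_memory_nodes can be used (the exact columns depend on the SQL Server version):
-- Sketch: memory reserved and committed per NUMA memory node
-- (pages_kb exists as of SQL Server 2012; older versions expose different columns)
SELECT memory_node_id,
       virtual_address_space_reserved_kb,
       virtual_address_space_committed_kb,
       pages_kb
FROM sys.dm_os_memory_nodes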

Scenario 3
Let us now discuss a VM with 16 CPUs and 160 GB of RAM.

An interesting situation occurs after increasing the RAM above 160 GB while keeping it at or below 240 GB: the VM now spans three NUMA nodes, and you cannot evenly distribute 16 CPUs across 3 NUMA nodes.

We end up having 5 or 6 CPUs per NUMA node. In this setup, neither of the two rules is satisfied. What MAXDOP value should you use? (Probably 1, which disables parallelism.)
From my experience, multiple performance issues can occur (some of them are hard to trace back to this root cause). Some examples:
  · Uneven distribution of the workload across schedulers (CPUs) (see the query sketch after this list)
  · Deadlocks
  · Parallel queries running longer
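
A sketch that can help surface an uneven distribution of the workload across schedulers (the interpretation of the numbers is left to the reader):
-- Sketch: tasks and load per scheduler, grouped by NUMA node
SELECT parent_node_id 'NUMANode',
       scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       load_factor
FROM sys.dm_os_schedulers
WHERE scheduler_id < 1048576 AND is_online = 1
ORDER BY parent_node_id, scheduler_id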

A more particular scenario I faced (I cannot generalize this) was that DBCC CHECKDB brought all CPUs (limited to the MAXDOP value) constantly to 100%, despite the fact that there was no other workload on the VM. A more in-depth investigation revealed that 40% of the CPU time was spent in spinlocks while allocating memory (this spinlock could not be found in the undocumented dynamic management view sys.dm_os_spinlock_stats). After evenly distributing the CPUs across the NUMA nodes, the CPU was no longer at 100%; it only showed spikes of no more than 70% on each core.
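
For reference, the overall spinlock activity of an instance can be inspected through the view mentioned above. A minimal sketch (the choice of columns and ordering is my own):
-- Sketch: top spinlocks by spin count (the view is undocumented and may change between versions)
SELECT TOP (10) name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC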

Takeaway: when running VMs, ensure that their NUMA configuration is aligned with that of the host.


¹ Source: Wikipedia
