Running R clusters on an AMD Threadripper 3990X in Windows 11 (update)

In November 2020, I wrote a post on using all cores and threads on the latest Threadripper 64 core processors under Windows 10. This is a quick update for Windows 11.

As some of you may know, I have been modelling the COVID-19 pandemic in a number of countries using a compartmental disease model. The model is written in the R language. R itself is a single-threaded language which means it does not in its native form make use of more than one core on modern multi-core CPUs. Even so, I manage to use parallel processing to do Monte Carlo analysis in R. In effect, I run the disease model thousands of times. To make this as fast as possible, I run many copies of the model at once using the R parallel package. The parallel package helps overcome the R single thread limitation.[1]

Typically, I set up a cluster of nodes to run the model on one or two Windows 10 workstations. Depending upon the type of model[2] I run between 1 and 2 copies of the model per physical core on each CPU. On the latest AMD Threadripper 3990X processor, which has 64 physical cores, I should be able to run a parallel cluster of 128 nodes, essentially 2 per physical core, 1 per simultaneous multithreading (SMT) virtual core.

I recently upgraded my main workstation to a Threadripper 3990X, relegating my old 32 core Threadripper to the role of slave workstation. For larger Monte Carlo runs, I run another 60 nodes on this slave workstation using OpenSSH to communicate between them.[3] The new workstation should have been able to run a cluster of 128 nodes. Unfortunately, when I ran the model with the instructions to set up a 128 node cluster, it crashed.[4] It turned out that under R version 4.0.3, the latest release, the maximum number of sockets (one is needed to communicate with each node) that could be included in a cluster was 125.[5]

The problem

So, I limited the number of nodes in the cluster to 124 (to be safe). The model ran fine but was not nearly as fast as I had hoped, only a little quicker than the 32 core processor running a 60 node cluster. When I looked at CPU utilisation, I found that 124 nodes had been set up in the cluster but half of them sat idle. All that money on 32 extra cores for no benefit!

Why was this happening?

Google explained the problem. Apparently, whenever Windows encounters more than 64 nodes or threads on a single physical processor, it allocates those threads equally into processor ‘groups’. On the Threadripper 3990X, which has 128 virtual cores or threads, it defines two processor groups (0 and 1) of 64 virtual cores each.[6] When a Windows process spawns a new process, such as the main R program spawning a copy of the model on a new cluster node, it will not normally allocate it to a virtual core outside the processor group of the main program.

So, if the parent R program is running in processor group 0, all the cluster nodes get allocated to group 0. Since there are only 64 virtual cores in a group but we try and set up 124 nodes, 60 of the nodes sit idle. Essentially, close to half of the processor is not used. You can read more about this here: https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3.

Some applications written for server environments are group aware. They are able to spawn new processes on idle cores outside the group of the calling parent. R is not one of those applications.

Solution

One solution is to use Linux. But Linux does not have Microsoft Office 365. A partial solution is to turn off SMT on the Threadripper so it has only 64 cores. Each of these cores is potentially more powerful than a virtual core under SMT so should be able to run a copy of the model faster. Indeed it can, but with only a small gain over the 32 core CPU running a 60 node cluster. Upgrading Windows 10 Pro to Widows 10 Pro for Workstation (which has some tweaks to its process scheduling that make it better at handling CPUs with many cores) improved matters a little. But Widows 10 Pro for Workstation did not deal with the main problem of the 60 idle virtual cores.

I managed to concoct a solution after many hours on Google. It turns out that when R sets up a parallel cluster, the ‘parallel’ package ‘makeCluster’ command starts a new copy of an R script processing executable (called RScript.exe) which runs some R code when it starts. That R code is responsible for establishing the communication between R running on the new node and the parent. It does this using a socket.

But you can tell makeCluster to start a different program including a Windows batch file. So I wrote two batch files, placed in the same directory as the RScript.exe executable, called BaseRScript.bat and BaseRScript_1.bat. The first has a single command:

start /B /NODE 0 /AFFINITY 0xFFFFFFFFFFFFFFFE C:\Progra~1\MS\ROpen\R-4.0.2\bin\x64\RScript.exe %*

The second:

start /B /NODE 1 /AFFINITY 0xFFFFFFFFFFFFFFFE C:\Progra~1\MS\ROpen\R-4.0.2\bin\x64\RScript.exe %*

This is Windows command shell script. The start command simply starts an executable (in this case RScript.exe). The /B switch opens the executable in the current window. %* ensures that any command line arguments passed from the parent to RScript.exe running on the cluster node are included.

The /NODE and /AFFINITY switches dictate on which processor group and across which cores the program can run. The allowable cores are given by the hexadecimal number (in these examples 0xFFFFFFFFFFFFFFFE, which means every one of the 64 virtual cores in the group bar core 0).[7] You will need administrator privileges for this to work. I then just need to start 62 cluster nodes with one batch file, and 62 with the second. The following R code does this, resulting in the first 62 nodes allocated to processor group 0, the remainder to processor group 1.

The R code to set up the cluster

a <- list( host='localhost', user='', ncore = 62,
 rscript = 'C:\\Progra~1\\MS\\ROpen\\R-  4.0.2\\bin\\x64\\BaseRScript.bat')

b <- list( host='localhost', user='', ncore = 62,
 rscript = 'C:\\Progra~1\\MS\\ROpen\\R-4.0.2\\bin\\x64\\BaseRScript_1.bat')

machineAddresses <- list( a, b )

spec <- lapply(machineAddresses,
   function(machine) { rep(list(list( host=machine$host,
    rscript=machine$rscript ) ), machine$ncore )})

spec <- unlist(spec,recursive=FALSE)

Cluster <- parallel::makeCluster(type='PSOCK', spec=spec)

Clunky, but it works, and I get the substantial improvements in performance I was expecting.

Windows 11 stopped this working

But! When I updated my OS to Windows 11 (Pro), it stopped working. Superficially, Windows 11 ‘appears’ to have done away with the clunky processor group scheduling approach. If you look in Task Manager, Windows 11 (in its default settings) shows 128 CPUs (2 virtual CPUs per physical core) in a single group, no separate NUMA nodes. The START command no longer recognises NUMA nodes other than 0 (zero) so my simple batch file approach does not work any more; the Monte Carlo model simply hangs.

You would think, then, that there is a single group of 128 CPUs, that it is possible to set up 128 parallel tasks and that makeCluster would do this. Of course not. When you call makeCluster asking for 128 nodes, it will create the cluster. But just as before, only 64 of the possible 128 virtual cores on the Threadripper will be used. In reality, then, Windows 11 merely flatters to deceive. It pretends that there is only a single processor group but, under the skin, it still creates two separate groups of 64 virtual cores for scheduling purposes.

Fortunately there is a work around. I found it here. There is a registry tweak that appears to make Windows 11 behave like Windows 10, preventing Windows from what Microsoft calls “node splitting.” Node splitting allows multiple groups to appear as a single node, so the 64 cores for Group 0 and and the 64 cores for Group 1 are, by default in Windows 11, allocated to a single node. You can precent this behaviour by a tweak to the registry. Create a REG_DWORD value named SplitLargeNodes with value 1 underneath HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\NUMA.

Hitting the 128 socket limit

But the 128 socket limit in R version 4.0.3 remains a problem. While my Threadripper 3990X Windows 10 workstation works fine, I cannot recruit my old workstation to add to the cluster. There is no easy solution to this as the 128 socket limit is defined as a static variable in the source code for the R language itself (NNCONECTIONS in the file connections.c found in the src\main folder of the source code, around line 127).

However, R is distributed under a General Public License and the source code and compilation tools are freely available. I simply changed NNCONECTIONS to 256, recompiled the R language itself (which was easier than I thought it would be).

I can now run my COVID-19 model on a cluster of 180 nodes. My power bill has gone up a little, but time spent watching a progress bar has fallen like a stone.

For more information on R visit https://www.r-bloggers.com/.

[1] There are other languages that make better use of modern multi-core CPUs. Microsoft’s own version of R includes multi-threaded libraries for handling some matrix operations. R, as an interpreted language, is also quite slow. However, R has is widely used in statistical analysis and has a large library of code under GNU licensing for handling a vast number of different statistical tasks. For example, there are packages that considerably simplify the development of compartmental disease models.

[2] On CPUs that support simultaneous multithreading (two virtual cores for each physical core),the optimum number to simultaneous model copies to run on each physical core (between 1 and 2) depends upon the characteristics of the CPU in terms of cache size and mode of access to main memory (number of channels, and direct or indirect path etc) and on the nature of the model. Models that use a large amount of data that are memory hungry typically run best on a dedicated physical core. Models that are less memory intensive run fastest with two copies allocated to each physical core, one to each virtual core.

[3] It is worth noting that for power system simulation Monte Carlo linear programming in R, I only allocate 1 copy of the model to each physical core because large linear programming task are extremely memory intensive. Running too many copies of an R LP model at once on a workstation creates memory contention issues which can slow performance to a crawl. Adding more memory to the workstation does not help as the bottleneck is cache and main memory bandwidth.

[4] With an error, all sockets in use.

[5] R v4.0.3 supports 128 sockets but uses 3 for its own purposes. The remainder can be used for a parallel processing cluster.

[6] These groups look to the operating system like NUMA (non-uniform memory architecture). When the operating system encounters NUMA, it schedules tasks within a NUMA group rather than across NUMA groups in order to maximise memory bandwidth.

[7] Setting the affinity switch to 0xFFFFFFFFFFFFFFFF does not work. It is simply ignored, along with the node switch, so we are back to original problem

Sam Lovick Consulting

Practical, timely and useful economics