Some time ago, I wrote a rather popular post, The MySQL “swap insanity” problem and the effects of the NUMA architecture (if you haven’t read it, stop now and do that!), which described using numactl --interleave=all to balance memory allocation across nodes in a NUMA system.
I should've titled it differently
In reality, the problem posed by uneven allocation across nodes under NUMA is not entirely a swapping problem. I chose the previous post's title, and framed its explanation the way I did, largely to address a specific problem seen in the MySQL community. However, the problem described actually has very little to do with swap itself. It is really about Linux's behavior under memory pressure, and specifically the pressure imposed by running a single NUMA node (especially node 0) completely out of memory.
When swap is disabled completely, problems are still encountered, usually in the form of extremely slow performance and failed memory allocations.
A more thorough solution
The original post also addressed only one part of the solution: using interleaved allocation. A complete and reliable solution actually requires three things, as we found when implementing this change for production systems at Twitter:
- Forcing interleaved allocation with numactl --interleave=all. This is exactly as described previously, and works well.
- Flushing Linux's buffer caches just before mysqld startup with sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache.
- Forcing the OS to allocate InnoDB's buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.
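The ordering of the first two steps can be sketched as a small launcher. This is only an illustration, not the wrapper we run in production: the function names are hypothetical, the mysqld path passed in is whatever your installation uses, and the third step (pre-faulting the buffer pool) happens inside mysqld itself, so it cannot be done from a wrapper like this. Flushing the caches requires root.

```python
import subprocess


def drop_caches_command():
    """Step 2: flush the page cache, dentries and inodes (requires root)."""
    return ["sysctl", "-q", "-w", "vm.drop_caches=3"]


def interleaved_command(mysqld_argv):
    """Step 1: prefix the daemon's command line with numactl so that its
    allocations are interleaved across all NUMA nodes."""
    return ["numactl", "--interleave=all"] + list(mysqld_argv)


def start_mysqld(mysqld_argv, run=subprocess.check_call):
    """Flush the caches first, then start mysqld under numactl.

    `run` is injectable so the sequence can be inspected without actually
    executing anything; by default each command is run and a non-zero
    exit status raises CalledProcessError.
    """
    run(drop_caches_command())
    run(interleaved_command(mysqld_argv))
```

Calling start_mysqld(["/usr/sbin/mysqld", "--user=mysql"]) (both arguments are placeholders) would flush the caches and then block until mysqld exits. The flush comes first so that the node allocation decisions made at startup are not skewed by whatever was sitting in the buffer cache.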
On a production machine with 144GB of RAM and a 120GB InnoDB buffer pool, all used memory has been allocated within 152 pages (0.00045%) of a perfect balance across the two NUMA nodes:
N0     :  16870335 ( 64.36 GB)
N1     :  16870183 ( 64.35 GB)
active :        81 (  0.00 GB)
anon   :  33739094 (128.70 GB)
dirty  :  33739094 (128.70 GB)
mapmax :       221 (  0.00 GB)
mapped :      1467 (  0.01 GB)
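The figures above can be checked with a few lines of arithmetic. This sketch assumes 4 KiB pages and treats the imbalance as the difference between the per-node page counts, expressed as a share of all pages on both nodes; the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: standard 4 KiB pages


def imbalance(pages_per_node):
    """Return (spread in pages, spread as a percentage of all pages)."""
    total = sum(pages_per_node)
    spread = max(pages_per_node) - min(pages_per_node)
    return spread, 100.0 * spread / total


def pages_to_gb(pages, page_size=PAGE_SIZE):
    """Convert a page count to GB (2**30 bytes, as in the summary)."""
    return pages * page_size / 2**30


spread, percent = imbalance([16870335, 16870183])
print(spread, f"{percent:.5f}%")           # 152 0.00045%
print(f"{pages_to_gb(16870335):.2f} GB")   # 64.36 GB
```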
The buffer pool itself was allocated within 4 pages of a perfect balance (line-wrapped for clarity):
2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486
  N0=16679245 N1=16679241
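A line like this one, which appears to be in the format of /proc/<pid>/numa_maps, can be checked mechanically. The following is a minimal sketch that pulls out the N<node>=<pages> fields and reports the spread between nodes; the function name is hypothetical, and the mapping is passed in as a single unwrapped line.

```python
import re

# Matches per-node page counts such as "N0=16679245".
NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_line):
    """Map each NUMA node number to the pages it holds for one mapping."""
    return {int(node): int(pages)
            for node, pages in NODE_FIELD.findall(numa_maps_line)}


line = ("2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486 "
        "N0=16679245 N1=16679241")
nodes = pages_per_node(line)
node_spread = max(nodes.values()) - min(nodes.values())
print(nodes, node_spread)
```

For the buffer pool mapping above this gives a spread of 4 pages, and the per-node counts add up to the anon total of 33358486.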
Much more importantly, these systems have been extremely stable and have not experienced the "random" stalls under heavy load that we had seen before.