Some time ago, I wrote a rather popular post The MySQL “swap insanity” problem and the effects of the NUMA architecture (if you haven’t read it, stop now and do that!), which described using numactl --interleave=all to balance memory allocation across nodes in a NUMA system.
I should’ve titled it differently
In reality, the problem posed by uneven memory allocation across nodes under NUMA is not entirely a swapping problem. I titled and framed the previous post the way I did largely to address a specific problem seen in the MySQL community, but the behavior described actually has very little to do with swap itself. The problem is really about Linux's behavior under memory pressure, and specifically the pressure imposed by running a single NUMA node (and especially node 0) completely out of memory.
When swap is disabled completely, problems are still encountered, usually in the form of extremely slow performance and failed memory allocations.
A more thorough solution
The original post also addressed only one part of the solution: using interleaved allocation. As we found when implementing this change for production systems at Twitter, a complete and reliable solution actually requires three things:
- Forcing interleaved allocation with numactl --interleave=all. This is exactly as described previously, and works well.
- Flushing Linux’s buffer caches just before mysqld startup with sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache.
- Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.
These changes are implemented in Twitter MySQL 5.5 as the mysqld_safe options numa-interleave and flush-caches, and mysqld option innodb_buffer_pool_populate, respectively.
On a production machine with 144GB of RAM and a 120GB InnoDB buffer pool, all used memory has been allocated within 152 pages (0.00045%) of perfectly balanced across both NUMA nodes:
N0       :      16870335 ( 64.36 GB)
N1       :      16870183 ( 64.35 GB)
active   :            81 (  0.00 GB)
anon     :      33739094 (128.70 GB)
dirty    :      33739094 (128.70 GB)
mapmax   :           221 (  0.00 GB)
mapped   :          1467 (  0.01 GB)
The buffer pool itself was allocated within 4 pages of balanced (line-wrapped for clarity):
2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486
  N0=16679245 N1=16679241
Much more importantly, these systems have been extremely stable and have not experienced the “random” stalls under heavy load that we had seen before.
Why is this preferable to switching the hardware from NUMA to SUMA in the BIOS?
Switching to SUMA in the BIOS causes the memory to be interleaved at a low level. This is basically the same as the configuration above, except that *everything* is interleaved, even things that would actually perform better if they were not.
It’s better to leave it as NUMA, and only interleave those things that need it.
Basically, the hardware is NUMA regardless of what the BIOS is set to. When you “turn off” NUMA in the BIOS you are just making it impossible for the OS to control anything related to NUMA. We’d rather control it appropriately at the OS level and, in the future, take specific advantage of it.
Very cool. Nice work.
I’m interested to know the behavior of the random stalls you were experiencing. Could you describe what you observe when they happen?
Very good and clear explanation.
Hello Jeremy! I presume you’re aware, but just in case: Vadim went ahead and implemented these in Percona Server :) http://www.percona.com/doc/percona-server/5.5/performance/innodb_numa_support.html
Nice work. Any more update? Also do you know anyone else who uses your customization? Many thanks in advance.
Could you explain the following in detail? MAP_POPULATE is a C function, from what I believe, and I am not really sure where to start. You said:
“3.Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.”
If you’re not comfortable with patching and building your own MySQL version from source, you would need to use some distribution that has already done so. Percona Server has support for these NUMA-related options.
The MAP_POPULATE mentioned is a flag to mmap, not a function.
Updated link for NUMA in PS 5.6 http://www.percona.com/doc/percona-server/5.6/performance/innodb_numa_support.html
Anybody smart enough to know whether these changes/solutions for NUMA/MySQL have also been implemented in MariaDB 10.2 for CentOS 7?
I'm fighting somewhat with a CentOS 7 box with more than 100GB of RAM and a 100GB InnoDB buffer pool, and it's swapping frantically.
Unfortunately CentOS 7 doesn't (yet?) have innodb_numa_interleave available, which I suspect might have helped, and I stumbled over this rather old discussion of the same issues.