A brief update on NUMA and MySQL

Some time ago, I wrote a rather popular post The MySQL “swap insanity” problem and the effects of the NUMA architecture (if you haven’t read it, stop now and do that!), which described using numactl --interleave=all to balance memory allocation across nodes in a NUMA system.

I should've titled it differently

In reality, the problem posed by uneven allocation across nodes under NUMA is not entirely a swapping problem. I titled and framed the previous post the way I did largely to address a specific problem seen in the MySQL community, but the problem described actually has very little to do with swap itself. It is really about Linux's behavior under memory pressure, and specifically the pressure imposed by running a single NUMA node (especially node 0) completely out of memory.

When swap is disabled completely, problems are still encountered, usually in the form of extremely slow performance and failed memory allocations.

A more thorough solution

The original post also addressed only one part of the solution: using interleaved allocation. As we found when implementing this change for production systems at Twitter, a complete and reliable solution actually requires three things:

  1. Forcing interleaved allocation with numactl --interleave=all. This is exactly as described previously, and works well.
  2. Flushing Linux's buffer caches just before mysqld startup with sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache.
  3. Forcing the OS to allocate InnoDB's buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.
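A minimal sketch of step 3 might look like the following. This is an illustration, not the actual Twitter MySQL patch: the function name alloc_populated is made up here, and the real logic lives inside InnoDB's buffer pool initialization.

```c
/* Sketch: pre-fault a large anonymous allocation so the kernel makes
 * its NUMA page placement decisions immediately, rather than lazily
 * on first touch. */
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static void *alloc_populated(size_t len)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    /* Linux 2.6.23+: ask the kernel to fault in every page up front. */
    flags |= MAP_POPULATE;
#endif
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifndef MAP_POPULATE
    /* Fallback for older kernels: touch every page with memset so the
     * allocation decisions are still made immediately, while the buffer
     * cache is clean from the flush in step 2. */
    memset(p, 0, len);
#endif
    return p;
}
```

Either way, by the time the call returns, every page of the buffer pool has physical backing, and with --interleave=all in effect those pages are spread evenly across nodes.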

These changes are implemented in Twitter MySQL 5.5 as the mysqld_safe options numa-interleave and flush-caches, and mysqld option innodb_buffer_pool_populate, respectively.
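For example, enabling all three in a Twitter MySQL 5.5 configuration might look like this (a sketch; the option names are as above, and I am assuming they are used as argument-less boolean options in the usual my.cnf style):

```
[mysqld_safe]
numa-interleave
flush-caches

[mysqld]
innodb_buffer_pool_populate
```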

The results

On a production machine with 144GB of RAM and a 120GB InnoDB buffer pool, all used memory has been allocated within 152 pages (0.00045%) of perfectly balanced across both NUMA nodes:

N0        :     16870335 ( 64.36 GB)
N1        :     16870183 ( 64.35 GB)
active    :           81 (  0.00 GB)
anon      :     33739094 (128.70 GB)
dirty     :     33739094 (128.70 GB)
mapmax    :          221 (  0.00 GB)
mapped    :         1467 (  0.01 GB)

The buffer pool itself was allocated within 4 pages of balanced (line-wrapped for clarity):

2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486
  N0=16679245 N1=16679241

Much more importantly, these systems have been extremely stable and have not experienced the "random" stalls under heavy load that we had seen before.

17 thoughts on “A brief update on NUMA and MySQL”

  1. The MySQL “swap insanity” problem and the effects of the NUMA architecture « Jeremy Cole

    • Switching to SUMA in the bios causes the memory to be interleaved at a low level.

      This is basically the same as the configuration above, except that *everything* is interleaved, even things that would actually perform better if they were not interleaved.

      It’s better to leave it as NUMA, and only interleave those things that need it.

    • Basically, the hardware is NUMA regardless of what the BIOS is set to. When you “turn off” NUMA in the BIOS you are just making it impossible for the OS to control anything related to NUMA. We’d rather control it appropriately at the OS level and, in the future, take specific advantage of it.

  2. Hi Jeremy,

    I’m interested to know the behavior of the random stalls you were experiencing. Could you describe what you observe when they happen?

  3. Clever uses of the mmap MAP_POPULATE flag | 非业余研究

  4. How MySQL can avoid using swap [copy] | 香草的技术博客

  5. Juris Technologies | NUMA and database headaches

    • Could you explain the following in detail, because MAP_POPULATE is a C function from what I believe, and I am not really sure where to start, you said:
      “3.Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.”

      Thanks.

      • James,

        If you’re not comfortable with patching and building your own MySQL version from source, you would need to use some distribution that has already done so. Percona Server has support for these NUMA-related options.

        The mentioned MAP_POPULATE is an option to mmap, not a function.

        Regards,

        Jeremy

  6. Linux performance tuning tips for MySQL

  7. The write cache: Swap insanity tome III
