The MySQL “swap insanity” problem and the effects of the NUMA architecture

Update: Also read A brief update on NUMA and MySQL.

The “swap insanity” problem, in brief

When running MySQL on a large system (e.g., 64GB RAM and dual quad core CPUs) with a large InnoDB buffer pool (e.g., 48GB), over time, Linux decides to swap out potentially large amounts of memory, despite appearing¹ to be under no real memory pressure. Monitoring reveals that at no time is the system in actual need of more memory than it has available; and memory isn’t leaking, mysqld‘s RSS is normal and stable.

Normally a tiny bit of swap usage could be OK (we’re really concerned about activity—swaps in and out), but in many cases, “real” useful memory is being swapped: primarily parts of InnoDB’s buffer pool. When it’s needed once again, a big performance hit is taken to swap it back in, causing random delays in random queries. This can cause overall unpredictable performance on production systems, and often once swapping starts, the system may enter a performance death-spiral.

While not every system, and not every workload experiences this problem, it’s common enough that it’s well known, and for those that know it well it can be a major headache.

The history of “swap insanity”

Over the past two to four years, there has been an off-and-on discussion about Linux swapping and MySQL, often titled “swap insanity” (I think coined by Kevin Burton). I have followed it closely, but I haven’t contributed much because I didn’t have anything new to add. The major contributors to the discussion over the past years have been:

Kevin Burton — Discussion of swappiness and MySQL on Linux.
Kevin Burton — Proposed IO_DIRECT as a solution (doesn’t work) and discussed memlock (may help, but not a full solution).
Peter Zaitsev — Discussed swappiness, memlock, and fielded a lot of discussion in the comments.
Don MacAskill — Proposed an innovative (albeit hacky) solution using swap on ramdisk, and a lot more interesting discussion in the comments.
Dathan Pattishall — Describes how Linux behavior can be even worse with swap disabled, and proposes using swapoff to clear it, but no real solution.
Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.
Kevin Burton — Discussion of Linux Split-LRU patch with some success.
Mark Callaghan — Discussion of vmstat and monitoring things, and a recap of a few possible solutions.
Kevin Burton — More discussion that Linux Split-LRU is essential.
Kevin Burton — Choosing the middle road by enabling swap, but with a small amount of space, and giving up the battle.
Peter Zaitsev — More discussion about why swapping is bad, but no solution.

Despite all the discussion, not much has changed. There are some hacky solutions to get MySQL to stop swapping, but nothing definite. I’ve known these solutions and hacks now for a while, but the core question was never really settled: “Why does this happen?” and it’s never sat well with me. I was recently tasked with trying to sort this mess out once and for all, so I’ve now done quite a bit of research and testing related to the problem. I’ve learned a lot, and decided a big blog entry might be the best way to share it. Enjoy.

There was a lot of discussion and some work went into adding the relatively new swappiness tunable a few years ago, and I think that may have solved some of the original problems, but at around the same time the basic architecture of the machine changed to NUMA, which I think introduced some new problems, with the very same symptoms, masking the original fix.

Contrasting the SMP/UMA and NUMA architectures

The SMP/UMA architecture

The SMP, or UMA architecture, simplified

When the PC world first got multiple processors, they were all arranged with equal access to all of the memory in the system. This is called Symmetric Multi-processing (SMP), or sometimes Uniform Memory Architecture (UMA, especially in contrast to NUMA). In the past few years this architecture has been largely phased out between physical socketed processors, but is still alive and well today within a single processor with multiple cores: all cores have equal access to the memory bank.

The NUMA architecture

The NUMA architecture, simplified

The new architecture for multiple processors, starting with AMD’s Opteron and Intel’s Nehalem² processors (we’ll call these “modern PC CPUs”), is a Non-Uniform Memory Access (NUMA) architecture, or more correctly Cache-Coherent NUMA (ccNUMA). In this architecture, each processor has a “local” bank of memory, to which it has much closer (lower latency) access. The whole system may still operate as one unit, and all memory is basically accessible from everywhere, but at a potentially higher latency and lower performance.

Fundamentally, some memory locations (“local” ones) are faster, that is, cost less to access, than other locations (“remote” ones attached to other processors). For a more detailed discussion of NUMA implementation and its support in Linux, see Ulrich Drepper’s article on LWN.net.

How Linux handles a NUMA system

Linux automatically understands when it’s running on a NUMA architecture system and does a few things:

Enumerates the hardware to understand the physical layout.
Divides the processors (not cores) into “nodes”. With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
Attaches each memory module in the system to the node for the processor it is local to.
Collects cost information about inter-node communication (“distance” between nodes).

You can see how Linux enumerated your system’s NUMA layout using the numactl --hardware command:

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 32276 MB
node 0 free: 26856 MB
node 1 size: 32320 MB
node 1 free: 26897 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

This tells you a few important things:

The number of nodes, and their node numbers — In this case there are two nodes numbered “0” and “1”.
The amount of memory available within each node — This machine has 64GB of memory total, and two physical (quad core) CPUs, so it has 32GB in each node. Note that the sizes aren’t exactly half of 64GB, and aren’t exactly equal, due to some memory being stolen from each node for whatever internal purposes the kernel has in mind.
The “distance” between nodes — This is a representation of the cost of accessing memory located in (for example) Node 0 from Node 1. In this case, Linux claims a distance of “10” for local memory and “21” for non-local memory.

How NUMA changes things for Linux

Technically, as long as everything runs just fine, there’s no reason that being UMA or NUMA should change how things work at the OS level. However, if you’re to get the best possible performance (and indeed in some cases with extreme performance differences for non-local NUMA access, any performance at all) some additional work has to be done, directly dealing with the internals of NUMA. Linux does the following things which might be unexpected if you think of CPUs and memory as black boxes:

Each process and thread inherits, from its parent, a NUMA policy. The inherited policy can be modified on a per-thread basis, and it defines the CPUs and even individual cores the process is allowed to be scheduled on, where it should be allocated memory from, and how strict to be about those two decisions.
Each thread is initially allocated a “preferred” node to run on. The thread can be run elsewhere (if policy allows), but the scheduler attempts to ensure that it is always run on the preferred node.
Memory allocated for the process is allocated on a particular node, by default “current”, which means the same node as the thread is preferred to run on. On UMA/SMP architectures all memory was treated equally, and had the same cost, but now the system has to think a bit about where it comes from, because accessing non-local memory has implications on performance and may cause cache coherency delays.
Memory allocations made on one node will not be moved to another node, regardless of system needs. Once memory is allocated on a node, it will stay there.

The NUMA policy of any process can be changed, with broad-reaching effects, very simply using numactl as a wrapper for the program. With a bit of additional work, it can be fine-tuned in detail by linking in libnuma and writing some code yourself to manage the policy. Some interesting things that can be done simply with the numactl wrapper are:

Allocate memory with a particular policy:
- locally on the “current” node — using --localalloc, and also the default mode
- preferably on a particular node, but elsewhere if necessary — using --preferred=node
- always on a particular node or set of nodes — using --membind=nodes
- interleaved, that is, spread evenly round-robin across all or a set of nodes — using --interleaved=all or --interleaved=nodes
Run the program on a particular node or set of nodes, in this case that means physical CPUs (--cpunodebind=nodes) or on a particular core or set of cores (--physcpubind=cpus).

What NUMA means for MySQL and InnoDB

InnoDB, and really, nearly all database servers (such as Oracle), present an atypical workload (from the point of view of the majority of installations) to Linux: a single large multi-threaded process which consumes nearly all of the system’s memory and should be expected to consume as much of the rest of the system resources as possible.

In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward. The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node. Additionally, since many different queries will be running at the same time, on both processors, neither individual processor necessarily has preferential access to any particular part of memory needed by a particular query.

It turns out that this seems to matter in one very important way. Using /proc/pid/numa_maps we can see all of the allocations made by mysqld, and some interesting information about them. If you look for a really big number in the anon=size, you can pretty easily find the buffer pool (which will consume more than 51GB of memory for the 48GB that it has been configured to use) [line-wrapped for clarity]:

2aaaaad3e000 default anon=13240527 dirty=13223315 
  swapcache=3440324 active=13202235 N0=7865429 N1=5375098

The fields being shown here are:

2aaaaad3e000 — The virtual address of the memory region. Ignore this other than the fact that it’s a unique ID for this piece of memory.
default — The NUMA policy in use for this region.
anon=number — The number of anonymous pages mapped.
dirty=number — The number of pages that are dirty because they have been modified. Generally memory allocated only within a single process is always going to be used, and thus dirty, but if a process forks it may have many copy-on-write pages mapped that are not dirty.
swapcache=number — The number of pages swapped out but unmodified since they were swapped out, and thus they are ready to be freed if needed, but are still in memory at the moment.
active=number — The number of pages on the “active list”; if this field is shown, some memory is inactive (anon minus active) which means it may be paged out by the swapper soon.
N0=number and N1=number — The number of pages allocated on Node 0 and Node 1, respectively.

The entire numa_maps can be quickly summarized by the a simple script numa-maps-summary.pl, which I’ve written while analyzing this problem:

N0        :      7983584 ( 30.45 GB)
N1        :      5440464 ( 20.75 GB)
active    :     13406601 ( 51.14 GB)
anon      :     13422697 ( 51.20 GB)
dirty     :     13407242 ( 51.14 GB)
mapmax    :          977 (  0.00 GB)
mapped    :         1377 (  0.01 GB)
swapcache :      3619780 ( 13.81 GB)

An couple of interesting and somewhat unexpected things pop out to me:

The sheer imbalance in how much memory is allocated in Node 0 versus Node 1. This is actually absolutely normal per the default policy. Using the default NUMA policy, memory was preferentially allocated in Node 0, but Node 1 was used as a last resort.
The sheer amount of memory allocated in Node 0. This is absolutely critical — Node 0 is out of free memory! It only contains about 32GB of memory in total, and it has allocated a single large chunk of more than 30GB to InnoDB’s buffer pool. A few other smaller allocations to other processes finish it off, and suddenly it has no memory free, and isn’t even caching anything.

The memory allocated by MySQL looks something like this:

Allocating memory severely imbalanced, preferring Node 0

Due to Node 0 being completely exhausted of free memory, even though the system has plenty of free memory overall (over 10GB has been used for caches) it is entirely on Node 1. If any process scheduled on Node 0 needs local memory for anything, it will cause some of the already-allocated memory to be swapped out in order to free up some Node 0 pages. Even though there is free memory on Node 1, the Linux kernel in many circumstances (which admittedly I don’t totally understand³) prefers to page out Node 0 memory rather than free some of the cache on Node 1 and use that memory. Of course the paging is far more expensive than non-local memory access ever would be.

A small change, to big effect

An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:

# numactl --interleave all command

We can use this with MySQL by making a one-line change to mysqld_safe, adding the following line (after cmd="$NOHUP_NICENESS"), which prefixes the command to start mysqld with a call to numactl:

cmd="/usr/bin/numactl --interleave all $cmd"

Now, when MySQL needs memory it will allocate it interleaved across all nodes, effectively balancing the amount of memory allocated in each node. This will leave some free memory in each node, allowing the Linux kernel to cache data on both nodes, thus allowing memory to be easily freed on either node just by freeing caches (as it’s supposed to work) rather than paging.

Performance regression testing has been done comparing the two scenarios (default local plus spillover allocation versus interleaved allocation) using the DBT2 benchmark, and found that performance in the nominal case is identical. This is expected. The breakthrough comes in that: In all cases where swap use could be triggered in a repeatable fashion, the system no longer swaps!

You can now see from the numa_maps that all allocated memory has been spread evenly across Node 0 and Node 1:

2aaaaad3e000 interleave=0-1 anon=13359067 dirty=13359067 
  N0=6679535 N1=6679532

And the summary looks like this:

N0        :      6814756 ( 26.00 GB)
N1        :      6816444 ( 26.00 GB)
anon      :     13629853 ( 51.99 GB)
dirty     :     13629853 ( 51.99 GB)
mapmax    :          296 (  0.00 GB)
mapped    :         1384 (  0.01 GB)

In graphical terms, the allocation of all memory within mysqld has been made in a balanced way:

Allocating memory balanced (interleaved) across nodes

An aside on `zone_reclaim_mode`

The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Subject to some clarifications from the linux-mm mailing list, it doesn’t seem to help in this case.

An even better solution?

It occurred to me (and was backed up by the linux-mm mailing list) that there is probably further room for optimization, although I haven’t done any testing so far. Interleaving all allocations is a pretty big hammer, and while it does solve this problem, I wonder if an even better solution would be to intelligently manage the fact that this is a NUMA architecture, using the libnuma library. Some thoughts that come to mind are:

Spread the buffer pool across all nodes intelligently in large chunks, or by index, rather than round-robin per page.
Keep the allocation policy for normal query threads to “local” so their memory isn’t interleaved across both nodes. I think interleaved allocation could cause slightly worse performance for some queries which would use a substantial amount of local memory (such as for large queries, temporary tables, or sorts), but I haven’t tested this.
Managing I/O in and out to/from the buffer pool using threads that will only be scheduled on the same node that the memory they will use is allocated on (this is a rather complex optimization).
Re-schedule simpler query threads (many PK lookups, etc.) on nodes with local access to the data they need. Move them actively when necessary, rather than keeping them on the same node. (I don’t know if the cost of the switch makes up for this, but it could be trivial if the buffer pool were organized by index onto separate nodes.)

I have no idea if any of the above would really show practical benefits in a real-world system, but I’d love to hear any comments or ideas.

Update 1: Changed the link for “Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.” to be a bit closer to my intention. The old link points to the message that started the thread, the new link points to the index of the messages in the thread.

Update 2: Added a link above provided by Kevin Closson about Oracle on NUMA systems.

Update 3: I should have included a warning about numa_maps. Simon Mudd notes correctly that reading the /proc/pid/numa_maps file stalls the pid process to generate the data it provides. It should be used carefully in production systems against mysqld as connection and query stalls will occur while it is reading. Do not monitor it minutely.

– – –

¹ Using free shows some memory free and lots of cache in use, and totalling up the resident set sizes from ps or top shows that the running processes don’t need more memory than is available.

² An article in Dr. Dobb’s Journal titled A Deeper Look Inside Intel QuickPath Interconnect gives pretty good high level coverage. Intel published a paper entitled Performance Analysis Guide for IntelÂ® Core^TM i7 Processor and IntelÂ® Xeon^TM 5500 processors which is quite good for understanding the internals of NUMA and QPI on Intel’s Nehalem series of processors.

³ I started a thread on the linux-mm mailing list related to MySQL on NUMA, and there are two other threads related on zone_reclaim_mode and on swapping.

157 thoughts on “The MySQL “swap insanity” problem and the effects of the NUMA architecture”

Tweets that mention jcoleâ€™s weblog: Jeremy Coleâ€™s take on life. Â» Blog Archive Â» The MySQL â€œswap insanityâ€ problem and the effects of the NUMA architecture -- Topsy.com
Terry A Davis says:

September 28, 2010 at 21:50

With LoseThos, I originally set aside some memory for each core’s heaps because I was worried they’d fight for access. I think I assumed there was a problem where there was none. It’s a whole lot nicer with all cores allocating from one pool, not worrying about partitioning memory.

Reply
Robert Foss says:

September 29, 2010 at 02:30

Really informative and interesting read. Thanks (add a flattr button so that I can donate some beer money).

Reply
Nathan Kurz says:

September 29, 2010 at 02:31

Are you sure that “Memory allocations made on one node will not be moved to another node, regardless of system needs”? This paper (http://www.kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf) from 2006 makes it sound like a done deal. Was “memory migration” pulled?

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 02:40
  
  Nathan: My understanding is that while page migration is *technically possible* it will never occur automatically. Perhaps I should say that “Memory allocations made on one node will not be *automatically* moved to another node, regardless of system needs”. My understanding, and what seems to be supported by that document as well, is that there are several manual interfaces to force movement of pages, but there is no automated logic anywhere to do it.
  
  Reply
DrGkill says:

September 29, 2010 at 02:38

Great article ! Congrats.

Reply
Lee Hambley says:

September 29, 2010 at 03:14

Thanks for a very informative article, it’s great to see someone persuing the idea of well performing MySQL! Your article is very well written, my only take-away is that it appears this is only a problem on NUMA systems, which effectively means multi-processor (possibly with multi-core?)

Reply
Quora
Andrew says:

September 29, 2010 at 07:22

This link is wrong (doesn’t match description):

Rik van Riel on the LKML â€” A few answers and proposal of the Split-LRU patch.

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 07:28
  
  Andrew: I guess the link wasn’t as clear as I intended it to be; Rik gets involved later on in the thread, but I figure you ought to read the context. I changed the link instead to point to the index of messages in the thread, so now it more closely matches my intentions.
  
  Reply
Justin Swanhart says:

September 29, 2010 at 07:28

Very cool post, thanks Jeremy.

The new InnoDB split buffer pool code should be modified to use numalib to allocate each of the smaller innodb buffer pools to use the local node. If you split the buffer pool into an even number of chunks, the entire buffer pool should be able to be allocated locally. Of course, you want to make sure you preserve space on each node for cache.

Reply
Justin Swanhart says:

September 29, 2010 at 07:32

I meant spread them across the local/remote nodes. Too early.

Reply
Mysql/Linux swap problems | Areth Foster-Webster
Alan says:

September 29, 2010 at 08:23

How does one figure out if a machine is NUMA or SMP/UMA? This looks like it might be a problem we’re seeing, but I don’t know if this is it or not.

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 08:24
  
  Alan: You can use “numactl –hardware”; if you see more than one “node”, you’re on a NUMA system.
  
  Reply
  - Pravata says:
    
    October 16, 2022 at 06:58
    
    Or we can check system messages as well like below.
    # dmesg | grep -i numa
    [ 0.000000] No NUMA configuration found
    
    Thanks, Jeremy for this detailed post, we recently implemented this in our Mysql NUMA servers which improved the performance a lot.
Justin Mason says:

September 29, 2010 at 08:46

Great writeup! Thanks for this.

It sounds like NUMA memory-pool introspection would be a fantastic feature for top(1), if it doesn’t already exist.

Reply
Dimitri says:

September 29, 2010 at 08:48

Jeremy,

Thanks for sharing this! Great post!

Rgds,
-Dimitri

Reply
Nick Wilkens says:

September 29, 2010 at 09:57

What about using the large-pages (huge pages) mysql option: http://dev.mysql.com/doc/refman/5.0/en/large-page-support.ht…

From the kernel documentation @ http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage…. : “Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.”

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 10:00
  
  Nick: Using large-pages will keep mysqld from being swapped out, but it won’t keep the system from swapping *something* out. And, if it really needs memory on a particular node, and can’t swap out pages for mysqld, it will swap something else which might be more important to the function of the system, fail the allocation, or OOM-kill a process (likely mysqld).
  
  Reply
Jonathan Moore says:

September 29, 2010 at 10:54

I have been working with Kevin Burton at spinn3r/tailrank for years and we did eventually solve the swap problem together but in a totally different way.

Before I tell you how, first answer the question to your self: why would you ever *want* to swap on a server?

We have come to the conclusion that we never want to swap at spinn3r; and I can’t come up with any good reasons for process on a server to swap at all.

So we don’t have swap. I don’t mean that we turned swap off with swapoff; that has known problems which I believe have to do with the way the memory allocator tries to find free pages. ( Though there has been some work on this in 2010. ) What we did is compile out swap. This is an idea we got form a friend at google who apparently run all their servers in this way. Not only does this keep you from ever swapping it also speeds up the memory management as there is a whole set of bookkeeping the kernel no longer has to do.

The observant reader might be thinking but what about over commit and the OOM killer. The short answer is don’t over provision your server; but in all likelihood you might do it by accident and then that 40G mysql process is going to look like a good target to the OOM killer. To keep your large warmed-up mysql process form getting killed you can set a its /proc/PID/oom_adj to “-17” which will keep the OOM killer form considering that process as a target. ( We also set our sshd similarly so it does not get killed. )

In this way we have a kernel that wont swap, and is faster, while still keeping important processes safe from the OOM killer.

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 11:18
  
  Jonathan: I think this makes sense if you have a sufficiently well-managed and carefully configured system, but I don’t think that’s a good solution for the general case. If you have swap either turned off or compiled out, that will disable memory overcommit, and severely limit how much mmap can be done, by my understanding.
  
  Reply
Tim says:

September 29, 2010 at 11:00

Jeremy: good write up. You’re nicely working counter to something I continually am asserting: “Nobody understands NUMA”.

Also I don’t think Dave Hansen referred you here, but there’s a wiki page where we try to clarify some of this more: http://linux-mm.org/Low_On_Memory
This is a very FAQ situation and the low on memory page could use updating for NUMA and zone reclaim. Especially as you’ve noted…most any system these days has NUMA effects and the allocation/reclaim behavior (and subsequent runtime performance) is only more complicated from a user perspective as a result.

Another note on large pages: keep in mind that using large pages on linux _may_ in the very near future not keep mysql or other large page using apps from being swapped out. Depending on whether or not the transparent huge page patches merge in, there may be additional interesting runtime behaviors coming soon.

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 11:21
  
  Tim: Thanks for the comment. I read the Low_On_Memory page, and it was very helpful, although incomplete, it’s a great resource. Good point on the system behavior changing in the future. The key point of my research and solution is to solve the *problem* (that the system thinks it needs to swap) rather than the symptom (the swapping itself).
  
  Reply
Anil Pal says:

September 29, 2010 at 11:10

Nice investigation. You mention that the NUMA code has parameters for the relative costs of local/remote access. But it sounds like it doesn’t use that information in making decisions. That should help automatically differentiate the two types of systems, and decide whether local swap or remote access is better for a specific allocation.

Reply
Kevin Burton says:

September 29, 2010 at 11:11

We’ve migrated to using noswap kernels on all our boxes. Works well. Not a solution for everyone.

I think your “An even better solution” is to run N mysql daemons (one per node) and then bind the memory allocations there.

This has been my goal for two years now but we haven’t been able to do it in our stack (yet).

The first change we’ll probably make is to do this to our robots which are a bit more shardable than our mysql boxes.

Balancing memory is a good solution for about 70% of people but one major application won’t work well in this role. If you’re reading large chunks of memory at a time, and processing them on the LOCAL box, then your application will be slower as nr_nodes-1/nr_nodes of your pages will be non-local.

For web applications where the biggest latencies are gigabit and CPU (and webapp) this probably won’t dominate though.

Reply
Jon Watte says:

September 29, 2010 at 11:38

Thanks for the blog post!
You’d think that “swapping” to a NUMA node would be preferable to swapping to disk. Perhaps we’ll get this in the 2.8 series.

Reply
Jonathan Moore says:

September 29, 2010 at 11:50

Jeremy Cole: I am sure that removing swap dose not remove over commit.

Over commit happens when the kernel believes that the a process wont actual need all the memory it has asked for; as a example when a process forks the kernel needs create copy of parents processes address space in the child but much of that memory will never diverge between the parent and child processes, so the kernel can use COW and over commit to make the memory use more effective. Compiling out swap dost not, and can not, turn off virtual memory.

To the well manicured system argument; if you have a production server that might start swapping at any time, or in general who’s behavior you do not understand, you are playing with fire. ( read you should expect down time randomly )

By the way if it was not clear I think this is really good and enlightening work you did hear, thanks.

Reply
links for 2010-09-29 : Bob Plankers, The Lone Sysadmin
Kevin Burton says:

September 29, 2010 at 12:01

Ideally, it would be possible to track how many memory reads were local vs non-local. Obviously, you want non-local reads to be near zero.

I’m tracking this down now but I don’t see a way……

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 12:03
  
  Kevin: See the numastat command and the various sys files. http://www.kernel.org/doc/Documentation/numastat.txt
  
  Reply
zer says:

September 29, 2010 at 12:18

Ran into this issue a few years ago (custom proprietary high performance dbs). A lot of details aside, we felt most of these performance based memory issues came down to the software architecture. Running batteries of tests on many different systems, the optimal configuration was always 1 process per socket (physical CPU) and then threads on each core. This always gave us the highest throughput and the lowest latency. Going too far on either extreme (not enough processes vs too many) resulted in poorer performance and could leave you disk bound which is never fun. Not sure if you have this level of process/thread control in your mysql/innodb architecture. Obviously this is the direction hardware is going and keeping this in mind when data architecting will leave you with a lot of performance flexibility.

Reply
Matthew Rogers says:

September 29, 2010 at 12:36

Skip using a swap partition. Write a script to activate one when the system hits the rams limits.

Reply
- Jeremy Cole says:
  
  September 29, 2010 at 12:40
  
  Matthew Rogers: I don’t think you’ve thought that through. When the system needs to swap, clearly it doesn’t have any memory available. Where’s it supposed to get the memory it needs to (a) detect that it’s out of memory, and (b) run a script to enable swap? The memory management under no-memory conditions is already marginal — it always takes a little bit more memory to free memory already.
  
  Reply
Linux memory management collides with MySql | Smash Company
The shared memory model is doomed (a teaser) | Generic Language
Philip Reynolds says:

September 29, 2010 at 23:15

@Jonathan

I’m interested in your assertion that turning off swap (via swapoff) is prone to problems with the memory allocator. Googling hasn’t helped me back this claim up, I’m wondering if you know where this is reported / documented / talked up about.

Rockin article btw, although we’re not on NUMA architecture, we do run with no swap to avoid this. (hence my curiosity about Jonathan’s comment)

Reply
- Jeremy Cole says:
  
  October 4, 2010 at 14:04
  
  Philip: My understanding is this: disabling swap completely disables memory overcommit, which can cause problems forking in large processes, amongst other things.
  
  Reply
Why is mysql swapping when I have plenty of RAM? | You Can Change It Later
Jonathan Moore says:

September 30, 2010 at 10:14

Philip Reynolds:

This memory manager bug not documented any where as far as I know but I have discussed this with kernel developers and they are aware of the issue, but know one has given them a reproducible test case. There has been some patches in one of the last few kernels that might address this issue but I have not had a chance to test. Running with swap off seems to make the memory bug worse then compiling out swap.

Reply
» links for 2010-09-30 (Dhananjay Nene)
Francis Ridder says:

October 1, 2010 at 10:55

It would be interesting to see how other OS’s handle this scenario. Specifically how FreeBSD or Solaris deal with NUMA.

Reply
mharrigan says:

October 4, 2010 at 04:00

Really excellent detective work, I’m not a MySQL user but it’s still very interesting.

Reply
Kevin Closson says:

October 4, 2010 at 11:59

Hi Jeremy,

Have you tested much with the system booted with numa=off? It appears you are using a Nehalem EP system. You might just boot it in SUMA mode and see how it performs.

While not mysql-related, your readers might find the following NUMA information useful:

http://kevinclosson.wordpress.com/2009/05/14/you-buy-a-numa-system-oracle-says-disable-numa-what-gives-part-ii/

Reply
- Jeremy Cole says:
  
  October 4, 2010 at 14:07
  
  Kevin: I haven’t done any testing thus far with the NUMA features disabled in the BIOS, but it would be interesting to try. Your articles are excellent — I will bring them up into my main article as another reference with more explanation about NUMA and big databases. It’s great to see a bit of the Oracle perspective on this.
  
  Reply
Numa | Enabling Technology
Matthew Rogers says:

October 13, 2010 at 21:29

@Jeremy Cole

You could simply run a cronjob and when ram usage hits 80% allocate more space via swap or simply send an alert. I don’t even think you should run ANY swap space.

If you system is running out of ram and is actively using swap during day to day operations the balancing is mis configured or your server is not sized properly.

Reply
- Jeremy Cole says:
  
  October 13, 2010 at 22:13
  
  Matthew Rogers: Clearly you did not read the article. Try again.
  
  Reply
bitplumber.net » A Cassandra Hardware Stack – Dell C1100′s – OCZ Vertex 2 SSD’s with Sandforce – Arista 7048′s
Sys Administrivia Podcasts » SysAdministrivia Episode 001
yiihsiaçš„åšå®¢ » æŠ€æœ¯é¢†åŸŸâ€”æµ·é‡å˜å‚¨è®¡ç®—
running mysql with “memlock” « Random Thoughts
How to start mysqld using numactl | Simon J Mudd's Blog
Administrating MongoDB - Obsolete Your Idols
4 performance fixes to MySQL on large servers | OpenLife.cc
Gawain's Jail - MySQL?????swap
Brain Matters » Blog Archive » Of Penguins, Rabbits and Buses
Howto – Avoid MySQL swapping on multi core CPUs
???MongoDB???? | ????
???MongoDB???? - ????
???? » ????MongoDB????
???MongoDB???? - ?????-??????-????-SNS-????-DBA
??????????? » Blog Archive » MySQL??????swap???
Ian East says:

January 12, 2012 at 12:06

This article really helped me improve mongodb performance. Perhaps you might want to make a version that is not specific to MySQL. Basically, this applies to any application that uses large amounts of memory.

Reply
Taavi Burns says:

January 31, 2012 at 07:54

I actually worked on NUMA memory performance issues at IBM a few years ago, that begat this patent:
http://www.google.com/patents/US7302533

It’s basically the same story, but we were working to reduce traffic on the interconnect fabric and reduce latency, binding workers to nodes along with the memory they usually work on. We were working with AIX, so the swappiness problem probably isn’t comparable. :)

Reply
Terrence Andrew Davis says:

April 8, 2012 at 04:48

In LoseThos, I want one memory map shared by all CPUs. If a system does not have that, I don’t care about it.

SWAPS/s on the wallpaper is how many times a task is changed.

During disk I/O, code like this executes:

while (INP(DRIVE_STATUS)!=READY)
YieldCPU2NextTask;

Reply
Jonathan DePrizio says:

April 8, 2012 at 06:46

I’m a bit curious as to why interleaving is the correct solution, rather than setting –preferred=node

Wouldn’t using –preferred=node still give you the intended benefit of NUMA, while eliminated the problem described in the article?

(I realize that your testing showed no performance degradation with interleaving, but I’m speaking hypothetically/theoretically).

Thanks for this research – it finally puts to rest a big “huh??” that I’ve been keeping in the back of my head!

Reply
- Jeremy Cole says:
  
  April 8, 2012 at 10:40
  
  Jonathan: Using preferred allocation wouldn’t help in this case because, since the total amount of memory being allocated is “most of the memory in the system” for a MySQL/InnoDB system, preferring any single node will surely run that node out of memory.
  
  Reply
John P says:

April 8, 2012 at 10:26

Really great article. Do you have any recommendations on articles, tutorials or books (prefer any of those 3 to a newsgroup/mailing list) to understand more about linux memory management?

Reply
- Jeremy Cole says:
  
  April 8, 2012 at 10:44
  
  John: The closest thing I could recommend is:
  
  This book is quite outdated (Published November 2005) but it is largely still relevant, and does a very good job of explaining the different subsystems, their purposes, and their interactions. As far as I remember, it doesn’t actually touch on NUMA at all (since it was not really relevant to most systems back in 2005).
  
  Reply
  - Andreas Paul (@xorpaul) says:
    
    April 12, 2012 at 11:03
    
    It does touch briefly on NUMA at the beginning of the Memory Management chapter (pages 297 and 298).
    I think that it has been added in the 3rd edition, so that’s maybe why you don’t remember it.
James says:

April 8, 2012 at 20:49

This was a great article. Thanks for the read… I noticed that the “cpu#” seems to spread across all physical cpu’s, and then loop back again. Can anyone explain why ?
Example:
[root@iron ~]# numactl –hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 16374 MB
node 0 free: 14968 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 16384 MB
node 1 free: 15809 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 16384 MB
node 2 free: 15678 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 16384 MB
node 3 free: 15903 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
[root@iron ~]#

Reply
- Jeremy Cole says:
  
  April 8, 2012 at 22:23
  
  James: I assume that’s a quad-cpu hex-core machines with hyperthreading enabled? This is really just a matter of how the physical CPUs and hyperthreaded CPUs are enumerated. On one sample machine (dual-cpu, hex-core, hyperthreading) I see:
  
  node0: 0,2,4,6,8,10,12,14,16,18,20,22
  node1: 1,3,5,7,9,11,13,15,17,19,21,23
  
  I think this is largely a combination of the kernel code that enumerates CPUs, and the hardware itself.
  
  Reply
ziwyz says:

April 9, 2012 at 02:12

Very informative article!

Does this also apply to high-memory VMs running on top of NUMA boxes?

Reply
- Jeremy Cole says:
  
  April 9, 2012 at 08:16
  
  Good question. I’d guess it primarily applies to the host OS side of that equation, but I honestly don’t know what the interplay is between NUMA and the virtualization stuff.
  
  Reply
Stuff The Internet Says On Scalability For April 13, 2012 | Krantenkoppen Tech
Tim Vaillancourt says:

April 13, 2012 at 17:17

This explains a lot of oddness I’ve experienced. I am interested to see how this will be solved by the MySQL community. I agree that using libnuma is likely a good long-term approach.

Reply
Istvan (@lix) says:

April 13, 2012 at 18:56

This is why users moved over to Postgres or another SQL server. :)

Reply
- Jeremy Cole says:
  
  April 13, 2012 at 19:59
  
  Istvan: Hah. Re-read the article, since you clearly skipped over it in your fervor to post something anti-MySQL, and you will notice that it is not really MySQL specific.
  
  Reply
A brief update on NUMA and MySQL « Jeremy Cole
atifali says:

April 16, 2012 at 20:03

Is this relevant if the memory allocated to mysql is less than even the total of 1 node in NUMA .
For example, if i have 2 socket , system with 32 GB memory .
and the total memory allocated to mysql is just 8GB .

Reply
bristolito says:

April 25, 2012 at 12:35

Hi Jeremy,
very interesting post, thanks for sharing!
We are experiencing most of the symptoms you describe here. We have an IBM x3650 server which is supposed to be a 2 quad core processor and 4 memory slots (16 Gb). But this is what I get when I run numactl –hardware:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16383 MB
node 0 free: 195 MB
node distances:
node 0
0: 10

Why do I see only 1 node as if it was a UMA/SMP architecture? And if it was really a UMA/SMP architecture, why do you think I could be experiencing the same symptoms you describe here? i.e swapping memory with still a lot of memory available making mysql to have random delays in random queries.
Thanks in advance.

Reply
- Jeremy Cole says:
  
  April 26, 2012 at 12:29
  
  Bristolito: It looks like that system is either UMA (older generation) or may have NUMA “disabled” in the BIOS. In that case you should not see degradations related to the above, but could see swapping/degradation for other reasons.
  
  Reply
CleverBlogName
Checking /proc//numa_maps can be dangerous for mysql client connections | Simon J Mudd's Blog
Non-Uniform Memory Access (NUMA) architecture with Oracle database by examples | IT World
NUMA 在 DB 上的一些问题 | Jasey Wang
SWAP的罪与罚 | 火丁笔记
MySQL-dump
Die wunderbare Welt von Isotopp
Zhang says:

December 13, 2012 at 06:26

So if I have 2 physical sockets (6 cores each), 96 GB memory total and want to run 2 database servers on it. Does it make scenes to let each MySQL server run on it’s own node?
e.g.
numactl –membind=0 –cpunodebind=0 $mysql_cmd_for_srv_1
numactl –membind=1 –cpunodebind=1 $mysql_cmd_for_srv_2

Each server has exact same configuration except file locations

Reply
Chris Jones says:

January 4, 2013 at 05:38

Hi Jeremy,

Might be a very dumb question, but does the value of innodb_buffer_pool_instances have any impact on the –interleaved option? Would multiple buffer pools per node actually achieve the same effect, by making MySQL spread its memory demands more evenly in the first place? Or is it irrelevant in this discussion?

Reply
- jaeheung says:
  
  August 25, 2014 at 16:40
  
  I had the same thought and tested with innodb_buffer_pool_instances=8, which is the default in MySQL 5.6 or later(I tested in MySQL 5.5 though), and confirmed that it achieves the same effect exactly as you wrote. I also expected some performance increase thanks to the reduced contention and the all-local node memory access. But I didn’t see any apparent difference in performance. Not sure why.
  Anyway, the users of MySQL 5.6 or later version might not understand what we’re talking about. :-)
  BTW, innodb_buffer_pool_instances is a relatively new MySQL configuration parameter (introduced in 5.5), and the author’s solution was probably the only solution when he was investigating the issue. Still an excellent article. Thank you.
  
  Reply
Planet MySQL has a long history … « Serge Frezefond 's blog
MySQL如何避免使用swap[copy] | 香草的技术博客
Adam Scott says:

April 12, 2013 at 14:08

Thank you Jeremy. Very helpful with great explanations!

Reply
The write cache: Swap insanity tome III
NUMA- linux tuning — this is humbling level of complexity | traversals of a schizoverted mind
Anand Jeyahar says:

October 9, 2013 at 23:35

Reblogged this on traversals of a schizoverted mind.

Reply
Linux Memory Management - Geek Base
Linux performance tuning tips for MySQL
针对MySQL的Linux性能调优技巧[翻译] | DBA的罗浮宫
on swapping and kernels | domas mituzas
Linux内存管理优化:低延迟和高吞吐量的数据库GraphDB - 过程&实践 - Barnett‘s Blog
Notes on the storage stack | InsideMySQL
Stewart Smith says:

June 3, 2014 at 18:44

I have a patch that may help even further, posted on http://bugs.mysql.com/bug.php?id=72811

Basically, it sets NUMA policy inside mysqld to interleave for all memory allocation performed before we start accepting connections (i.e. buffer pool, key cache etc) and then sets policy back to default.

the idea being that since a mysql connection runs in a single thread, all memory it allocates will be for processing that connection/query and so it may as well be node local to where the thread is scheduled to run.

Reply
Tatsuo Natsukawa says:

August 12, 2014 at 10:18

Great article. In addition to the imbalance in memory allocation among numa nodes, my concern had been memory fragmentation. Your suggestion works well:

https://drive.google.com/file/d/0B4HH7pZxARz8MW0xNi1PVEJPc0k/edit?usp=sharing

Thank you!

Reply
carloprad says:

August 19, 2014 at 03:03

This is a MUST read article, thanks for sharing!

Reply
Numa system performance issues – more than just swapping to consider | Scalemysql
MySQL 5.6参数说明 | cz
MySQL numa交叉模式启动说明 | cz
MySQL and NUMA | Tater's Tech Blog
The sad state of MySQL and NUMA | Ramblings
Multiplatform NUMA-aware software | aurisju
NUMA, swap insanity, MySQL | xdhyix
MySQL5.7.26 Release Note解读（innodb及复制模块） | Simple Life
SWAP存亡之道 – 金泽的樱花站
tux says:

February 8, 2016 at 20:12

I am seeing following error in MySQL Cluster Data Node, do you think this is because of NUMA?

2016-02-08 16:41:53 [ndbd] INFO — Watchdog: User time: 114 System time: 1156
2016-02-08 16:41:53 [ndbd] WARNING — Ndb kernel thread 0 is stuck in: Job Handling elapsed=14928
2016-02-08 16:41:53 [ndbd] INFO — Watchdog: User time: 114 System time: 1156
2016-02-08 16:41:53 [ndbd] WARNING — Ndb kernel thread 0 is stuck in: Job Handling elapsed=15028

Reply
NUMA体系结构 – 润土-用技术与世界交流
mysql,内存过高,排查 | sptty'blog
NUMA — 想说爱你不容易 | 歪布IT笔记
MongoDB&MySQL对于NUMA架构CPU的问题 – 运维那点事
cron.weekly issue #60: Debian, Vim, Gitlab, Jenkins, Piwik, Nginx, MySQL & more!
Linkblogging – Random Thoughts
CPU High Performance, INNODB Mysql – Blog Ops
CPU High Performance, INNODB Mysql – Blog Ops
Terry.lu says:

June 19, 2017 at 20:02

Learn a lot from this blog. Thank you!

“Normally a tiny bit of swap usage could be OK (we’re really concerned about activity—swaps in and out), but in many cases, “real” useful memory is being swapped: primarily parts of InnoDB’s buffer pool. ”

I am quite curious about how do you know the memory swapped out is part of innodb buffer memory?

Reply
SWAP的罪与罚 – 往事如烟
MySQL Swapping With Fsync | Official Pythian® Blog
Alta CPU en el server MySQL Fusion-io Servidor
MySQL and Memory: a love story (part 1) – lefred's blog: tribulations of a MySQL Evangelist
High CPU on MySql Fusion-io server – jtiyjg
趣记-SWAP的罪与罚
NUMA架构的CPU - 你真的用好了么？ - 程式筆記
2013年个人微博推荐技术资料汇总 | 3F Blog 2020
MongoDB Best Practices 2020 Edition - Percona Database Performance Blog
MySQL 和 swapping – FIXBBS
Adjusting MySQL 8.0 Memory Parameters - Percona Database Performance Blog
MySQL查询随机滞后|php问答
데이터 이동기가있는 Linux I / O 병목 현상 16 %이며 일반적으로 30 % 이상입니다. - How IT
MySQL 복제 성능 향상시킬 수 있습니까? 머신 B는 - How IT
Ajuste de los parámetros de memoria de MySQL 8.0 – Es mi PC
plantegg (@plantegg) says:

March 7, 2022 at 18:56

Normally if not have local memory then use remote memory, this is a kernel bug cause used swap replace the remote memroy: https://github.com/torvalds/linux/commit/4f9b16a64753d0bb607454347036dc997fd03b82

Now we should enable NUMA anyway.

Reply
Resolved: MariaDB threads optimalization - Resolved Problem
충분한 여유 메모리가 남아있을 때 스왑이 사용되는 이유는 무엇입니까? 꽤 좋은 웹 (전용) - IT HOW
mihvoi says:

May 6, 2022 at 01:04

Somewhat related: I encountered high swapping and cleared disk cache when Linux was automatically aggregated pages into hugepages.

This aggregation can block DRAM access for a short while even when there is no swapping involved. For some reason this process seems to forcefully free a lot of memory when it works, dramatically reducing the cache. This process is very low level and hard to observe if you don’t know where to look.

The effect can be a significant decrease in cache memory – even if there is enough free memory, followed by a big burst of disk access – because of cleared cache.

Especially for databases, it’s better to disable transparent_hugepage feature
echo never > /sys/kernel/mm/transparent_hugepage/enabled

Reply
내 애플리케이션이 널 검사를 수행하는 데 24 %의 시간을 소비하는 이유는 무엇입니까? - IT HOW
NUMA架构 – 源码巴士
How do you tune MySQL for a heavy InnoDB workload?
Very slow Insert and update on 20 core 250 Gb RAM dedicated MYSQL 5.5 server on Ubuntu [closed]
Capabilities of InnoDB INSERT Performance
MySQL using swap during conversion to InnoDB
MongoDB Best Practices: Security, Data Modeling, & Schema Design
InnoDB import performance
InnoDB Index Doesn't Fit in Memory
Adjusting MySQL 8.0 Memory Parameters
Week 3 of January 2025 – Last Week I Learned