Update: Also read A brief update on NUMA and MySQL.
The “swap insanity” problem, in brief
When running MySQL on a large system (e.g., 64GB RAM and dual quad core CPUs) with a large InnoDB buffer pool (e.g., 48GB), over time, Linux decides to swap out potentially large amounts of memory, despite appearing1 to be under no real memory pressure. Monitoring reveals that at no time is the system in actual need of more memory than it has available; and memory isn’t leaking, mysqld‘s RSS is normal and stable.
Normally a tiny bit of swap usage could be OK (we’re really concerned about activity—swaps in and out), but in many cases, “real” useful memory is being swapped: primarily parts of InnoDB’s buffer pool. When it’s needed once again, a big performance hit is taken to swap it back in, causing random delays in random queries. This can cause overall unpredictable performance on production systems, and often once swapping starts, the system may enter a performance death-spiral.
While not every system, and not every workload experiences this problem, it’s common enough that it’s well known, and for those that know it well it can be a major headache.
The history of “swap insanity”
Over the past two to four years, there has been an off-and-on discussion about Linux swapping and MySQL, often titled “swap insanity” (I think coined by Kevin Burton). I have followed it closely, but I haven’t contributed much because I didn’t have anything new to add. The major contributors to the discussion over the past years have been:
- Kevin Burton — Discussion of swappiness and MySQL on Linux.
- Kevin Burton — Proposed IO_DIRECT as a solution (doesn’t work) and discussed memlock (may help, but not a full solution).
- Peter Zaitsev — Discussed swappiness, memlock, and fielded a lot of discussion in the comments.
- Don MacAskill — Proposed an innovative (albeit hacky) solution using swap on ramdisk, and a lot more interesting discussion in the comments.
- Dathan Pattishall — Describes how Linux behavior can be even worse with swap disabled, and proposes using swapoff to clear it, but no real solution.
- Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.
- Kevin Burton — Discussion of Linux Split-LRU patch with some success.
- Mark Callaghan — Discussion of vmstat and monitoring things, and a recap of a few possible solutions.
- Kevin Burton — More discussion that Linux Split-LRU is essential.
- Kevin Burton — Choosing the middle road by enabling swap, but with a small amount of space, and giving up the battle.
- Peter Zaitsev — More discussion about why swapping is bad, but no solution.
Despite all the discussion, not much has changed. There are some hacky solutions to get MySQL to stop swapping, but nothing definite. I’ve known these solutions and hacks now for a while, but the core question was never really settled: “Why does this happen?” and it’s never sat well with me. I was recently tasked with trying to sort this mess out once and for all, so I’ve now done quite a bit of research and testing related to the problem. I’ve learned a lot, and decided a big blog entry might be the best way to share it. Enjoy.
There was a lot of discussion and some work went into adding the relatively new swappiness tunable a few years ago, and I think that may have solved some of the original problems, but at around the same time the basic architecture of the machine changed to NUMA, which I think introduced some new problems, with the very same symptoms, masking the original fix.
Contrasting the SMP/UMA and NUMA architectures
The SMP/UMA architecture

The SMP, or UMA architecture, simplified
When the PC world first got multiple processors, they were all arranged with equal access to all of the memory in the system. This is called Symmetric Multi-processing (SMP), or sometimes Uniform Memory Architecture (UMA, especially in contrast to NUMA). In the past few years this architecture has been largely phased out between physical socketed processors, but is still alive and well today within a single processor with multiple cores: all cores have equal access to the memory bank.
The NUMA architecture

The NUMA architecture, simplified
The new architecture for multiple processors, starting with AMD’s Opteron and Intel’s Nehalem2 processors (we’ll call these “modern PC CPUs”), is a Non-Uniform Memory Access (NUMA) architecture, or more correctly Cache-Coherent NUMA (ccNUMA). In this architecture, each processor has a “local” bank of memory, to which it has much closer (lower latency) access. The whole system may still operate as one unit, and all memory is basically accessible from everywhere, but at a potentially higher latency and lower performance.
Fundamentally, some memory locations (“local” ones) are faster, that is, cost less to access, than other locations (“remote” ones attached to other processors). For a more detailed discussion of NUMA implementation and its support in Linux, see Ulrich Drepper’s article on LWN.net.
How Linux handles a NUMA system
Linux automatically understands when it’s running on a NUMA architecture system and does a few things:
- Enumerates the hardware to understand the physical layout.
- Divides the processors (not cores) into “nodes”. With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
- Attaches each memory module in the system to the node for the processor it is local to.
- Collects cost information about inter-node communication (“distance” between nodes).
You can see how Linux enumerated your system’s NUMA layout using the numactl --hardware command:
# numactl --hardware available: 2 nodes (0-1) node 0 size: 32276 MB node 0 free: 26856 MB node 1 size: 32320 MB node 1 free: 26897 MB node distances: node 0 1 0: 10 21 1: 21 10
This tells you a few important things:
- The number of nodes, and their node numbers — In this case there are two nodes numbered “0″ and “1″.
- The amount of memory available within each node — This machine has 64GB of memory total, and two physical (quad core) CPUs, so it has 32GB in each node. Note that the sizes aren’t exactly half of 64GB, and aren’t exactly equal, due to some memory being stolen from each node for whatever internal purposes the kernel has in mind.
- The “distance” between nodes — This is a representation of the cost of accessing memory located in (for example) Node 0 from Node 1. In this case, Linux claims a distance of “10″ for local memory and “21″ for non-local memory.
How NUMA changes things for Linux
Technically, as long as everything runs just fine, there’s no reason that being UMA or NUMA should change how things work at the OS level. However, if you’re to get the best possible performance (and indeed in some cases with extreme performance differences for non-local NUMA access, any performance at all) some additional work has to be done, directly dealing with the internals of NUMA. Linux does the following things which might be unexpected if you think of CPUs and memory as black boxes:
- Each process and thread inherits, from its parent, a NUMA policy. The inherited policy can be modified on a per-thread basis, and it defines the CPUs and even individual cores the process is allowed to be scheduled on, where it should be allocated memory from, and how strict to be about those two decisions.
- Each thread is initially allocated a “preferred” node to run on. The thread can be run elsewhere (if policy allows), but the scheduler attempts to ensure that it is always run on the preferred node.
- Memory allocated for the process is allocated on a particular node, by default “current”, which means the same node as the thread is preferred to run on. On UMA/SMP architectures all memory was treated equally, and had the same cost, but now the system has to think a bit about where it comes from, because accessing non-local memory has implications on performance and may cause cache coherency delays.
- Memory allocations made on one node will not be moved to another node, regardless of system needs. Once memory is allocated on a node, it will stay there.
The NUMA policy of any process can be changed, with broad-reaching effects, very simply using numactl as a wrapper for the program. With a bit of additional work, it can be fine-tuned in detail by linking in libnuma and writing some code yourself to manage the policy. Some interesting things that can be done simply with the numactl wrapper are:
- Allocate memory with a particular policy:
- locally on the “current” node — using --localalloc, and also the default mode
- preferably on a particular node, but elsewhere if necessary — using --preferred=node
- always on a particular node or set of nodes — using --membind=nodes
- interleaved, that is, spread evenly round-robin across all or a set of nodes — using --interleaved=all or --interleaved=nodes
- Run the program on a particular node or set of nodes, in this case that means physical CPUs (--cpunodebind=nodes) or on a particular core or set of cores (--physcpubind=cpus).
What NUMA means for MySQL and InnoDB
InnoDB, and really, nearly all database servers (such as Oracle), present an atypical workload (from the point of view of the majority of installations) to Linux: a single large multi-threaded process which consumes nearly all of the system’s memory and should be expected to consume as much of the rest of the system resources as possible.
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward. The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node. Additionally, since many different queries will be running at the same time, on both processors, neither individual processor necessarily has preferential access to any particular part of memory needed by a particular query.
It turns out that this seems to matter in one very important way. Using /proc/pid/numa_maps we can see all of the allocations made by mysqld, and some interesting information about them. If you look for a really big number in the anon=size, you can pretty easily find the buffer pool (which will consume more than 51GB of memory for the 48GB that it has been configured to use) [line-wrapped for clarity]:
2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324 active=13202235 N0=7865429 N1=5375098
The fields being shown here are:
- 2aaaaad3e000 — The virtual address of the memory region. Ignore this other than the fact that it’s a unique ID for this piece of memory.
- default — The NUMA policy in use for this region.
- anon=number — The number of anonymous pages mapped.
- dirty=number — The number of pages that are dirty because they have been modified. Generally memory allocated only within a single process is always going to be used, and thus dirty, but if a process forks it may have many copy-on-write pages mapped that are not dirty.
- swapcache=number — The number of pages swapped out but unmodified since they were swapped out, and thus they are ready to be freed if needed, but are still in memory at the moment.
- active=number — The number of pages on the “active list”; if this field is shown, some memory is inactive (anon minus active) which means it may be paged out by the swapper soon.
- N0=number and N1=number — The number of pages allocated on Node 0 and Node 1, respectively.
The entire numa_maps can be quickly summarized by the a simple script numa-maps-summary.pl, which I’ve written while analyzing this problem:
N0 : 7983584 ( 30.45 GB) N1 : 5440464 ( 20.75 GB) active : 13406601 ( 51.14 GB) anon : 13422697 ( 51.20 GB) dirty : 13407242 ( 51.14 GB) mapmax : 977 ( 0.00 GB) mapped : 1377 ( 0.01 GB) swapcache : 3619780 ( 13.81 GB)
An couple of interesting and somewhat unexpected things pop out to me:
- The sheer imbalance in how much memory is allocated in Node 0 versus Node 1. This is actually absolutely normal per the default policy. Using the default NUMA policy, memory was preferentially allocated in Node 0, but Node 1 was used as a last resort.
- The sheer amount of memory allocated in Node 0. This is absolutely critical — Node 0 is out of free memory! It only contains about 32GB of memory in total, and it has allocated a single large chunk of more than 30GB to InnoDB’s buffer pool. A few other smaller allocations to other processes finish it off, and suddenly it has no memory free, and isn’t even caching anything.
The memory allocated by MySQL looks something like this:

Allocating memory severely imbalanced, preferring Node 0
Due to Node 0 being completely exhausted of free memory, even though the system has plenty of free memory overall (over 10GB has been used for caches) it is entirely on Node 1. If any process scheduled on Node 0 needs local memory for anything, it will cause some of the already-allocated memory to be swapped out in order to free up some Node 0 pages. Even though there is free memory on Node 1, the Linux kernel in many circumstances (which admittedly I don’t totally understand3) prefers to page out Node 0 memory rather than free some of the cache on Node 1 and use that memory. Of course the paging is far more expensive than non-local memory access ever would be.
A small change, to big effect
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
We can use this with MySQL by making a one-line change to mysqld_safe, adding the following line (after cmd="$NOHUP_NICENESS"), which prefixes the command to start mysqld with a call to numactl:
cmd="/usr/bin/numactl --interleave all $cmd"
Now, when MySQL needs memory it will allocate it interleaved across all nodes, effectively balancing the amount of memory allocated in each node. This will leave some free memory in each node, allowing the Linux kernel to cache data on both nodes, thus allowing memory to be easily freed on either node just by freeing caches (as it’s supposed to work) rather than paging.
Performance regression testing has been done comparing the two scenarios (default local plus spillover allocation versus interleaved allocation) using the DBT2 benchmark, and found that performance in the nominal case is identical. This is expected. The breakthrough comes in that: In all cases where swap use could be triggered in a repeatable fashion, the system no longer swaps!
You can now see from the numa_maps that all allocated memory has been spread evenly across Node 0 and Node 1:
2aaaaad3e000 interleave=0-1 anon=13359067 dirty=13359067 N0=6679535 N1=6679532
And the summary looks like this:
N0 : 6814756 ( 26.00 GB) N1 : 6816444 ( 26.00 GB) anon : 13629853 ( 51.99 GB) dirty : 13629853 ( 51.99 GB) mapmax : 296 ( 0.00 GB) mapped : 1384 ( 0.01 GB)
In graphical terms, the allocation of all memory within mysqld has been made in a balanced way:

Allocating memory balanced (interleaved) across nodes
An aside on zone_reclaim_mode
The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Subject to some clarifications from the linux-mm mailing list, it doesn’t seem to help in this case.
An even better solution?
It occurred to me (and was backed up by the linux-mm mailing list) that there is probably further room for optimization, although I haven’t done any testing so far. Interleaving all allocations is a pretty big hammer, and while it does solve this problem, I wonder if an even better solution would be to intelligently manage the fact that this is a NUMA architecture, using the libnuma library. Some thoughts that come to mind are:
- Spread the buffer pool across all nodes intelligently in large chunks, or by index, rather than round-robin per page.
- Keep the allocation policy for normal query threads to “local” so their memory isn’t interleaved across both nodes. I think interleaved allocation could cause slightly worse performance for some queries which would use a substantial amount of local memory (such as for large queries, temporary tables, or sorts), but I haven’t tested this.
- Managing I/O in and out to/from the buffer pool using threads that will only be scheduled on the same node that the memory they will use is allocated on (this is a rather complex optimization).
- Re-schedule simpler query threads (many PK lookups, etc.) on nodes with local access to the data they need. Move them actively when necessary, rather than keeping them on the same node. (I don’t know if the cost of the switch makes up for this, but it could be trivial if the buffer pool were organized by index onto separate nodes.)
I have no idea if any of the above would really show practical benefits in a real-world system, but I’d love to hear any comments or ideas.
Update 1: Changed the link for “Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.” to be a bit closer to my intention. The old link points to the message that started the thread, the new link points to the index of the messages in the thread.
Update 2: Added a link above provided by Kevin Closson about Oracle on NUMA systems.
- – -
1 Using free shows some memory free and lots of cache in use, and totalling up the resident set sizes from ps or top shows that the running processes don’t need more memory than is available.
2 An article in Dr. Dobb’s Journal titled A Deeper Look Inside Intel QuickPath Interconnect gives pretty good high level coverage. Intel published a paper entitled Performance Analysis Guide for Intel® CoreTM i7 Processor and Intel® XeonTM 5500 processors which is quite good for understanding the internals of NUMA and QPI on Intel’s Nehalem series of processors.
3 I started a thread on the linux-mm mailing list related to MySQL on NUMA, and there are two other threads related on zone_reclaim_mode and on swapping.
September 28, 2010 at 20:10 |
[...] This post was mentioned on Twitter by Programación Web, Mark Callaghan. Mark Callaghan said: I need to find the time to read this http://fb.me/yGLQ5OUU [...]
September 28, 2010 at 21:50 |
With LoseThos, I originally set aside some memory for each core’s heaps because I was worried they’d fight for access. I think I assumed there was a problem where there was none. It’s a whole lot nicer with all cores allocating from one pool, not worrying about partitioning memory.
September 29, 2010 at 02:30 |
Really informative and interesting read. Thanks (add a flattr button so that I can donate some beer money).
September 29, 2010 at 02:31 |
Are you sure that “Memory allocations made on one node will not be moved to another node, regardless of system needs”? This paper (http://www.kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf) from 2006 makes it sound like a done deal. Was “memory migration” pulled?
September 29, 2010 at 02:40 |
Nathan: My understanding is that while page migration is *technically possible* it will never occur automatically. Perhaps I should say that “Memory allocations made on one node will not be *automatically* moved to another node, regardless of system needs”. My understanding, and what seems to be supported by that document as well, is that there are several manual interfaces to force movement of pages, but there is no automated logic anywhere to do it.
September 29, 2010 at 02:38 |
Great article ! Congrats.
September 29, 2010 at 03:14 |
Thanks for a very informative article, it’s great to see someone persuing the idea of well performing MySQL! Your article is very well written, my only take-away is that it appears this is only a problem on NUMA systems, which effectively means multi-processor (possibly with multi-core?)
September 29, 2010 at 04:42 |
When is the Linux kernel going to be patched so that I do not have to swap to RAM disc to avoid pages from being written to disc when I have gigabytes of free memory!?…
Does it need patching? Have you set swappiness the way you want? https://help.ubuntu.com/community/SwapFaq#What%20is%20swappiness%20and%20how%20do%20I%20change%20it? http://kerneltrap.org/node/3000 Also, this article may help: http://jcole.us/blog/arc…
September 29, 2010 at 07:22 |
This link is wrong (doesn’t match description):
Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.
September 29, 2010 at 07:28 |
Andrew: I guess the link wasn’t as clear as I intended it to be; Rik gets involved later on in the thread, but I figure you ought to read the context. I changed the link instead to point to the index of messages in the thread, so now it more closely matches my intentions.
September 29, 2010 at 07:28 |
Very cool post, thanks Jeremy.
The new InnoDB split buffer pool code should be modified to use numalib to allocate each of the smaller innodb buffer pools to use the local node. If you split the buffer pool into an even number of chunks, the entire buffer pool should be able to be allocated locally. Of course, you want to make sure you preserve space on each node for cache.
September 29, 2010 at 07:32 |
I meant spread them across the local/remote nodes. Too early.
September 29, 2010 at 08:15 |
[...] http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ This entry was posted in Uncategorized. Bookmark the permalink. ← Cloning an Ubuntu distro [...]
September 29, 2010 at 08:23 |
How does one figure out if a machine is NUMA or SMP/UMA? This looks like it might be a problem we’re seeing, but I don’t know if this is it or not.
September 29, 2010 at 08:24 |
Alan: You can use “numactl –hardware”; if you see more than one “node”, you’re on a NUMA system.
September 29, 2010 at 08:46 |
Great writeup! Thanks for this.
It sounds like NUMA memory-pool introspection would be a fantastic feature for top(1), if it doesn’t already exist.
September 29, 2010 at 08:48 |
Jeremy,
Thanks for sharing this! Great post!
Rgds,
-Dimitri
September 29, 2010 at 09:57 |
What about using the large-pages (huge pages) mysql option: http://dev.mysql.com/doc/refman/5.0/en/large-page-support.ht…
From the kernel documentation @ http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage…. : “Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.”
September 29, 2010 at 10:00 |
Nick: Using large-pages will keep mysqld from being swapped out, but it won’t keep the system from swapping *something* out. And, if it really needs memory on a particular node, and can’t swap out pages for mysqld, it will swap something else which might be more important to the function of the system, fail the allocation, or OOM-kill a process (likely mysqld).
September 29, 2010 at 10:54 |
I have been working with Kevin Burton at spinn3r/tailrank for years and we did eventually solve the swap problem together but in a totally different way.
Before I tell you how, first answer the question to your self: why would you ever *want* to swap on a server?
We have come to the conclusion that we never want to swap at spinn3r; and I can’t come up with any good reasons for process on a server to swap at all.
So we don’t have swap. I don’t mean that we turned swap off with swapoff; that has known problems which I believe have to do with the way the memory allocator tries to find free pages. ( Though there has been some work on this in 2010. ) What we did is compile out swap. This is an idea we got form a friend at google who apparently run all their servers in this way. Not only does this keep you from ever swapping it also speeds up the memory management as there is a whole set of bookkeeping the kernel no longer has to do.
The observant reader might be thinking but what about over commit and the OOM killer. The short answer is don’t over provision your server; but in all likelihood you might do it by accident and then that 40G mysql process is going to look like a good target to the OOM killer. To keep your large warmed-up mysql process form getting killed you can set a its /proc/PID/oom_adj to “-17″ which will keep the OOM killer form considering that process as a target. ( We also set our sshd similarly so it does not get killed. )
In this way we have a kernel that wont swap, and is faster, while still keeping important processes safe from the OOM killer.
September 29, 2010 at 11:18 |
Jonathan: I think this makes sense if you have a sufficiently well-managed and carefully configured system, but I don’t think that’s a good solution for the general case. If you have swap either turned off or compiled out, that will disable memory overcommit, and severely limit how much mmap can be done, by my understanding.
September 29, 2010 at 11:00 |
Jeremy: good write up. You’re nicely working counter to something I continually am asserting: “Nobody understands NUMA”.
Also I don’t think Dave Hansen referred you here, but there’s a wiki page where we try to clarify some of this more: http://linux-mm.org/Low_On_Memory
This is a very FAQ situation and the low on memory page could use updating for NUMA and zone reclaim. Especially as you’ve noted…most any system these days has NUMA effects and the allocation/reclaim behavior (and subsequent runtime performance) is only more complicated from a user perspective as a result.
Another note on large pages: keep in mind that using large pages on linux _may_ in the very near future not keep mysql or other large page using apps from being swapped out. Depending on whether or not the transparent huge page patches merge in, there may be additional interesting runtime behaviors coming soon.
September 29, 2010 at 11:21 |
Tim: Thanks for the comment. I read the Low_On_Memory page, and it was very helpful, although incomplete, it’s a great resource. Good point on the system behavior changing in the future. The key point of my research and solution is to solve the *problem* (that the system thinks it needs to swap) rather than the symptom (the swapping itself).
September 29, 2010 at 11:10 |
Nice investigation. You mention that the NUMA code has parameters for the relative costs of local/remote access. But it sounds like it doesn’t use that information in making decisions. That should help automatically differentiate the two types of systems, and decide whether local swap or remote access is better for a specific allocation.
September 29, 2010 at 11:11 |
We’ve migrated to using noswap kernels on all our boxes. Works well. Not a solution for everyone.
I think your “An even better solution” is to run N mysql daemons (one per node) and then bind the memory allocations there.
This has been my goal for two years now but we haven’t been able to do it in our stack (yet).
The first change we’ll probably make is to do this to our robots which are a bit more shardable than our mysql boxes.
Balancing memory is a good solution for about 70% of people but one major application won’t work well in this role. If you’re reading large chunks of memory at a time, and processing them on the LOCAL box, then your application will be slower as nr_nodes-1/nr_nodes of your pages will be non-local.
For web applications where the biggest latencies are gigabit and CPU (and webapp) this probably won’t dominate though.
September 29, 2010 at 11:38 |
Thanks for the blog post!
You’d think that “swapping” to a NUMA node would be preferable to swapping to disk. Perhaps we’ll get this in the 2.8 series.
September 29, 2010 at 11:50 |
Jeremy Cole: I am sure that removing swap dose not remove over commit.
Over commit happens when the kernel believes that the a process wont actual need all the memory it has asked for; as a example when a process forks the kernel needs create copy of parents processes address space in the child but much of that memory will never diverge between the parent and child processes, so the kernel can use COW and over commit to make the memory use more effective. Compiling out swap dost not, and can not, turn off virtual memory.
To the well manicured system argument; if you have a production server that might start swapping at any time, or in general who’s behavior you do not understand, you are playing with fire. ( read you should expect down time randomly )
By the way if it was not clear I think this is really good and enlightening work you did hear, thanks.
September 29, 2010 at 12:01 |
[...] jcole’s weblog: Jeremy Cole’s take on life. » Blog Archive » The MySQL “swap insanity†pro… This is worth a separate blog post — people really don't understand how NUMA works (well, shoot, they barely understand how their base OS works, IT people in general need more CS knowledge). [...]
September 29, 2010 at 12:01 |
Ideally, it would be possible to track how many memory reads were local vs non-local. Obviously, you want non-local reads to be near zero.
I’m tracking this down now but I don’t see a way……
September 29, 2010 at 12:03 |
Kevin: See the numastat command and the various sys files. http://www.kernel.org/doc/Documentation/numastat.txt
September 29, 2010 at 12:18 |
Ran into this issue a few years ago (custom proprietary high performance dbs). A lot of details aside, we felt most of these performance based memory issues came down to the software architecture. Running batteries of tests on many different systems, the optimal configuration was always 1 process per socket (physical CPU) and then threads on each core. This always gave us the highest throughput and the lowest latency. Going too far on either extreme (not enough processes vs too many) resulted in poorer performance and could leave you disk bound which is never fun. Not sure if you have this level of process/thread control in your mysql/innodb architecture. Obviously this is the direction hardware is going and keeping this in mind when data architecting will leave you with a lot of performance flexibility.
September 29, 2010 at 12:36 |
Skip using a swap partition. Write a script to activate one when the system hits the rams limits.
September 29, 2010 at 12:40 |
Matthew Rogers: I don’t think you’ve thought that through. When the system needs to swap, clearly it doesn’t have any memory available. Where’s it supposed to get the memory it needs to (a) detect that it’s out of memory, and (b) run a script to enable swap? The memory management under no-memory conditions is already marginal — it always takes a little bit more memory to free memory already.
September 29, 2010 at 14:38 |
[...] Interesting. Source [...]
September 29, 2010 at 16:21 |
[...] memory model is doomed (a teaser) Posted on September 29, 2010 by genericlanguage This rather wonderful blog post on debugging a horrible memory management issue with MySQL databases illustrates one of the incredibly complex subtleties that the shared memory model [...]
September 29, 2010 at 23:15 |
@Jonathan
I’m interested in your assertion that turning off swap (via swapoff) is prone to problems with the memory allocator. Googling hasn’t helped me back this claim up, I’m wondering if you know where this is reported / documented / talked up about.
Rockin article btw, although we’re not on NUMA architecture, we do run with no swap to avoid this. (hence my curiosity about Jonathan’s comment)
October 4, 2010 at 14:04 |
Philip: My understanding is this: disabling swap completely disables memory overcommit, which can cause problems forking in large processes, amongst other things.
September 30, 2010 at 08:32 |
[...] Talented mysql consultant Jeremy Cole appears to have discovered the cause for this problem and writ… Go read it. [...]
September 30, 2010 at 10:14 |
Philip Reynolds:
This memory manager bug not documented any where as far as I know but I have discussed this with kernel developers and they are aware of the issue, but know one has given them a reproducible test case. There has been some patches in one of the last few kernels that might address this issue but I have not had a chance to test. Running with swap off seems to make the memory bug worse then compiling out swap.
September 30, 2010 at 13:05 |
[...] jcole’s weblog: Jeremy Cole’s take on life. » Blog Archive » The MySQL “swap insanity†pro… The MySQL “swap insanity†problem and the effects of the NUMA architecture http://ff.im/-rjEqN (tags: via:packrati.us) [...]
October 1, 2010 at 10:55 |
It would be interesting to see how other OS’s handle this scenario. Specifically how FreeBSD or Solaris deal with NUMA.
October 4, 2010 at 04:00 |
Really excellent detective work, I’m not a MySQL user but it’s still very interesting.
October 4, 2010 at 11:59 |
Hi Jeremy,
Have you tested much with the system booted with numa=off? It appears you are using a Nehalem EP system. You might just boot it in SUMA mode and see how it performs.
While not mysql-related, your readers might find the following NUMA information useful:
http://kevinclosson.wordpress.com/2009/05/14/you-buy-a-numa-system-oracle-says-disable-numa-what-gives-part-ii/
October 4, 2010 at 14:07 |
Kevin: I haven’t done any testing thus far with the NUMA features disabled in the BIOS, but it would be interesting to try. Your articles are excellent — I will bring them up into my main article as another reference with more explanation about NUMA and big databases. It’s great to see a bit of the Oracle perspective on this.
October 11, 2010 at 01:36 |
[...] http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ This entry was posted in default. Bookmark the permalink. ← Articles about REST [...]
October 13, 2010 at 21:29 |
@Jeremy Cole
You could simply run a cronjob and when ram usage hits 80% allocate more space via swap or simply send an alert. I don’t even think you should run ANY swap space.
If you system is running out of ram and is actively using swap during day to day operations the balancing is mis configured or your server is not sized properly.
October 13, 2010 at 22:13 |
Matthew Rogers: Clearly you did not read the article. Try again.
October 24, 2010 at 16:09 |
[...] Dell C1100′s did not come set by default in NUMA mode in the BIOS. CentOS 5.5 supports this and so we turned it on. I am not sure how much (if [...]
October 28, 2010 at 09:17 |
[...] MySQL Swap Insanity [...]
November 16, 2010 at 18:41 |
[...] The MySQL “swap insanity†problem and the effects of the NUMA architecture [...]
December 11, 2010 at 05:25 |
[...] out is very high. Even on optimally sized standalone database machines due to various reasons (Jeremy has a nice blog post on this), there are chances of the database memory buffers getting [...]
March 15, 2011 at 01:30 |
[...] also an article by Jeremy Cole. However his solution requires a change to mysqld_safe which while it’s only a [...]
March 17, 2011 at 14:11 |
[...] out about numactl. This is a good clearinghouse post about numactl as it applies to mysql. At large core + large memory sizes, it applies to mongodb, as [...]
May 11, 2011 at 00:36 |
[...] Linux is already swapping. Yes, seeing that happen will be very confusing! This problem has been investigated and explained brilliantly by Jeremy Cole. The solution is a one liner given at the end of his [...]
July 7, 2011 at 22:28 |
[...] The “swap insanity” problem, in brief [...]
July 12, 2011 at 18:10 |
[...] thread that is requesting it. This was a topic area we were already familiar with, but in terms of database applications that consume most of system RAM. That clearly was not the case here, as RabbitMQ barely consumed [...]
July 31, 2011 at 19:47 |
[...] resources: The MySQL “swap insanity” problem and the effects of the NUMA architecture Ubuntu [...]
August 9, 2011 at 08:14 |
[...] ?????????????????????????????????NUMA????????????????????????????????? This entry was posted in Technical and tagged MongoDB by ??. Bookmark the permalink. [...]
August 9, 2011 at 19:23 |
[...] ?????????????????????????????????NUMA??????????????????????????????????? [...]
August 16, 2011 at 02:17 |
[...] ??NUMA???????????????? [...]
October 17, 2011 at 00:23 |
[...] ??NUMA???????????????CPU?????NUMA???????????????CPU???????????????????????????????????????CPU????????????????MongoDB??????????????????????NUMA???????????????? [...]
October 29, 2011 at 02:12 |
[...] The MySQL “swap insanity” problem and the effects of the NUMA architecture [...]
January 12, 2012 at 12:06 |
This article really helped me improve mongodb performance. Perhaps you might want to make a version that is not specific to MySQL. Basically, this applies to any application that uses large amounts of memory.
January 31, 2012 at 07:54 |
I actually worked on NUMA memory performance issues at IBM a few years ago, that begat this patent:
http://www.google.com/patents/US7302533
It’s basically the same story, but we were working to reduce traffic on the interconnect fabric and reduce latency, binding workers to nodes along with the memory they usually work on. We were working with AIX, so the swappiness problem probably isn’t comparable.
April 8, 2012 at 04:48 |
In LoseThos, I want one memory map shared by all CPUs. If a system does not have that, I don’t care about it.
SWAPS/s on the wallpaper is how many times a task is changed.
During disk I/O, code like this executes:
while (INP(DRIVE_STATUS)!=READY)
YieldCPU2NextTask;
April 8, 2012 at 06:46 |
I’m a bit curious as to why interleaving is the correct solution, rather than setting –preferred=node
Wouldn’t using –preferred=node still give you the intended benefit of NUMA, while eliminated the problem described in the article?
(I realize that your testing showed no performance degradation with interleaving, but I’m speaking hypothetically/theoretically).
Thanks for this research – it finally puts to rest a big “huh??” that I’ve been keeping in the back of my head!
April 8, 2012 at 10:40 |
Jonathan: Using preferred allocation wouldn’t help in this case because, since the total amount of memory being allocated is “most of the memory in the system” for a MySQL/InnoDB system, preferring any single node will surely run that node out of memory.
April 8, 2012 at 10:26 |
Really great article. Do you have any recommendations on articles, tutorials or books (prefer any of those 3 to a newsgroup/mailing list) to understand more about linux memory management?
April 8, 2012 at 10:44 |
John: The closest thing I could recommend is:
http://www.amazon.com/Understanding-Linux-Kernel-Third-Edition/dp/0596005652/
This book is quite outdated (Published November 2005) but it is largely still relevant, and does a very good job of explaining the different subsystems, their purposes, and their interactions. As far as I remember, it doesn’t actually touch on NUMA at all (since it was not really relevant to most systems back in 2005).
April 12, 2012 at 11:03
It does touch briefly on NUMA at the beginning of the Memory Management chapter (pages 297 and 298).
I think that it has been added in the 3rd edition, so that’s maybe why you don’t remember it.
April 8, 2012 at 20:49 |
This was a great article. Thanks for the read… I noticed that the “cpu#” seems to spread across all physical cpu’s, and then loop back again. Can anyone explain why ?
Example:
[root@iron ~]# numactl –hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 16374 MB
node 0 free: 14968 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 16384 MB
node 1 free: 15809 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 16384 MB
node 2 free: 15678 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 16384 MB
node 3 free: 15903 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
[root@iron ~]#
April 8, 2012 at 22:23 |
James: I assume that’s a quad-cpu hex-core machines with hyperthreading enabled? This is really just a matter of how the physical CPUs and hyperthreaded CPUs are enumerated. On one sample machine (dual-cpu, hex-core, hyperthreading) I see:
node0: 0,2,4,6,8,10,12,14,16,18,20,22
node1: 1,3,5,7,9,11,13,15,17,19,21,23
I think this is largely a combination of the kernel code that enumerates CPUs, and the hardware itself.
April 9, 2012 at 02:12 |
Very informative article!
Does this also apply to high-memory VMs running on top of NUMA boxes?
April 9, 2012 at 08:16 |
Good question. I’d guess it primarily applies to the host OS side of that equation, but I honestly don’t know what the interplay is between NUMA and the virtualization stuff.
April 13, 2012 at 09:39 |
[...] The MySQL “swap insanity” problem and the effects of the NUMA architecture. Jeremy Cole with a spectacular article on the differences between NUMA and SMP/UMA systems and the mostly unsatisfactory tricks required to get MySQL to perform on NUMA systems. There are really two issues: the evils of an OS controlled swap and NUMA performance effects due to a single node (in the NUMA sense) running out of memory. This is the kind of stuff you only see when you push your systems to the edge. [...]
April 13, 2012 at 17:17 |
This explains a lot of oddness I’ve experienced. I am interested to see how this will be solved by the MySQL community. I agree that using libnuma is likely a good long-term approach.
April 13, 2012 at 18:56 |
This is why users moved over to Postgres or another SQL server.
April 13, 2012 at 19:59 |
Istvan: Hah. Re-read the article, since you clearly skipped over it in your fervor to post something anti-MySQL, and you will notice that it is not really MySQL specific.
April 16, 2012 at 13:22 |
[...] time ago, I wrote a rather popular post The MySQL “swap insanity” problem and the effects of the NUMA architecture (if you haven’t read it, stop now and do that!), which described using numactl [...]
April 16, 2012 at 20:03 |
Is this relevant if the memory allocated to mysql is less than even the total of 1 node in NUMA .
For example, if i have 2 socket , system with 32 GB memory .
and the total memory allocated to mysql is just 8GB .
April 25, 2012 at 12:35 |
Hi Jeremy,
very interesting post, thanks for sharing!
We are experiencing most of the symptoms you describe here. We have an IBM x3650 server which is supposed to be a 2 quad core processor and 4 memory slots (16 Gb). But this is what I get when I run numactl –hardware:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16383 MB
node 0 free: 195 MB
node distances:
node 0
0: 10
Why do I see only 1 node as if it was a UMA/SMP architecture? And if it was really a UMA/SMP architecture, why do you think I could be experiencing the same symptoms you describe here? i.e swapping memory with still a lot of memory available making mysql to have random delays in random queries.
Thanks in advance.
April 26, 2012 at 12:29 |
Bristolito: It looks like that system is either UMA (older generation) or may have NUMA “disabled” in the BIOS. In that case you should not see degradations related to the above, but could see swapping/degradation for other reasons.
April 27, 2012 at 13:06 |
The Wierdness of Hardware, or Why the Whole Stack Matters…
Over the last wee while I’ve been testing JBoss apps virtualised under RHEV, and this week I had a bizarre experience: my team-mates and I had been puzzling over the high standard deviations (and hence eccentric behaviour) of our web app, which w…