The customer is always wrong

“The customer is always right” – that’s something serious businesses actually say sometimes, with a serious and straight face. It’s one of those sayings rooted in good intentions and immediately polluted by the mess of the real world.

When you’re working in Database or even Network Engineering, the real answer is far more complicated and nuanced, but it’s much closer to “the customer is always wrong”. That is not to say that the Database or Network Engineer is right, either, though.

Usually the customer (in this case usually a Software Engineer) comes to us, or more likely pages us, with something like “the database is down”. Why? Because they got an error message like “The server has gone away” or “Too many connections” or “Host unreachable” or any number of error messages which are simultaneously too descriptive, and completely unhelpful. Such error messages are good at leading people astray.

For many reasons, not the least of which is that it’s impossible, error messages never say “Your query, yes the new one from ReallyExpensiveJob at line 52, added last Thursday, was really not a good idea” – that would be too simple. Instead the web page, or background job, or query quietly and efficiently consumes every resource available to it, or deadlocks some important process, or overwhelms the network, or any number of other secondary or tertiary symptoms. Maybe it even works fine most of the time but in some circumstances (say, when processing certain users’ data, or a specific shard) it goes absolutely wild.

More often than not, what that looks like both to the Software Engineer and the Database Engineer is that the “database is down”… because for all practical purposes, it is. Often though, the database is not really the problem. The database is the victim of some bad behavior: anywhere from a new bad query to an external end user clicking a button or a crawler/bot doing something.

When responding to these sorts of issues, in my experience, the priority is:

  1. Don’t make a bigger mess. Avoid taking any drastic actions to “fix” the underlying systems: don’t failover, don’t rebuild a host, don’t reboot.
  2. Provide first aid. Do what you need to: turn things off, block jobs, or even put the application in maintenance mode to give yourself breathing room.
  3. Find the root cause. Figure out what are primary symptoms and what are secondary symptoms.
  4. Patch temporarily. Get things mostly working again relatively quickly.
  5. Document the incident. Make sure future-you and your peers know what happened and what still needs to be done.
  6. Fix permanently. Implement a long-term solution.

Don’t make a bigger mess

When you log in and see that a particular thing is “broken”, it can be very tempting to try and immediately fix it. The graphs all show that db73 is broken, so it’s obvious that you should try replacing db73, right? The other folks responding in their own on-call rotations are likely also applying additional pressure for you to do something – anything – to fix it.

Resist the temptation to failover, rebuild, reboot, or restart things without a better understanding of the underlying problem. Sometimes doing so is the right answer, and you’ll feel stupid for not having done it sooner. More often than not, though, there are a lot of symptoms to sort through before you can figure out what the root cause is.

Why not, though? A failover or reboot in the middle of an incident caused by what turns out to be a bad query can not only distract you, it can also seriously exacerbate the actual problem. Changing key parts of your infrastructure in the middle of an incident may cause problems due to locks, cold caches, and killed or broken queries, and it generally introduces more change to an already unknown system state. It may also mask the problem (making you think you fixed it) because the bad actor itself gets temporarily broken or stopped by the change and takes some time to come back.

Provide first aid

Hopefully you have a few tools at your disposal to make things better without pushing new code or taking drastic measures in the underlying systems. These will usually be temporary measures to “stop the bleeding”: blocking specific kinds of jobs, IP blocks, locking specific users out, or even just putting the entire application or parts of it into maintenance mode.

The idea here is not to actually solve the problem, just to give yourself or other responders more breathing room and stop any escalation of harm. Don’t be afraid to take fairly extreme measures, so long as they are temporary and safe (and ideally well designed and documented). I promise you that users will not mind getting a maintenance page instead of a page that repeatedly times out loading.
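In a MySQL context, first aid often looks something like the sketch below. The time threshold, the processlist id, and the user name are placeholders, and the ACCOUNT LOCK statement requires MySQL 5.7.6 or later:

-- Find runaway queries that have been executing for more than 30 seconds.
SELECT id, user, host, time, LEFT(info, 80) AS query
  FROM information_schema.processlist
 WHERE command = 'Query' AND time > 30
 ORDER BY time DESC;

-- Terminate a specific offender by its processlist id.
KILL 12345;

-- Temporarily lock out a misbehaving application user (MySQL 5.7.6+ only).
ALTER USER 'batch_job'@'%' ACCOUNT LOCK;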

Find the root cause

Now that you’ve got some breathing room, it’s time to figure out what is going on. When things go terribly wrong, though, pretty much every graph for every system will look “wrong”. You will need to systematically work through the available data, logs, and metrics to figure out what are primary symptoms and what are secondary symptoms.

Some examples of secondary symptoms which can easily confuse your ability to make sense of the data and cause you to misunderstand the problem are:

  • When the system is overloaded, every query may get slower, time out, or error. Those can be conflicting signals that send you on a wild goose chase.
  • When queries are slower, generally fewer of them will get processed, and they will “stack up” more, running concurrently. This will introduce entirely new behaviors.
  • When the database is overloaded or behaving badly, upstream systems such as web servers will behave differently and look different. The number of workers may increase. They may see more timeouts or even OOMs.
  • Conversely: Maybe the increase in workers is actually the problem and they are now overloading the databases? Confusing, isn’t it?
  • As you’re digging through logs and metrics, you’re likely to find a few things that have “always” been there and always been bad: bad queries, weird user behaviors, abusive bots, etc. Keep that in mind, and try to figure out whether something you’re seeing is new or whether it changed during the incident window.
  • The monitoring systems that provide all of this data sometimes themselves cause extra load or bad behaviors when the system is under load. This can be due to increased logging rates, the size of some frequently queried internal table, locking required to access internal data, or any number of other reasons.
  • Automatic retries in components of your systems can cause escalating or increasingly stacked concurrent loads. For example, a system may retry an operation repeatedly after a short timeout. Then what happens when the operation starts to time out repeatedly because of an unrelated problem? This can make it seem like a particular system is the root cause, when it’s really a secondary symptom.
  • Users will often make things worse. When a page doesn’t load or produces an error, users will often try it again, and again, and again. They’re impatient and tenacious. Don’t be afraid to lock them out while you work to fix things.

It can be frustrating and challenging to figure out what is really going on when the system is overloaded and everything is broken.

Patch temporarily

Okay, now you’ve identified what you think is the root cause.

Resist the temptation to dig into the real fix for that yet. Often, there’s a temporary solution to get the system back to “mostly” working: Perhaps you can block only a specific type of job rather than all jobs; or perhaps you can early-return an error for a specific URL rather than having the entire site in maintenance mode. Be creative, but keep things simple and temporary.

Making a clever temporary patch will also help you to validate that you’ve identified the root cause correctly before you spend time implementing the permanent fix. It sucks to spend time writing a permanent fix with thorough review and tests only to realize later that the problem was not what you understood, and your fix doesn’t help.

Document the incident

Make sure that you document everything that you did as part of this investigation to the extent possible, as soon as possible after the fire is out (or ideally even work with someone else during the incident to document it in real time):

  • Temporary patches or configuration changes that need to be reversed.
  • A proposal for the permanent fix, if you have one.
  • Interesting alternative theories about the problem, in case your final identified root cause turns out to be wrong.
  • Unrelated bad or broken things you found during your investigation.
  • Missing monitoring or alerting that would’ve identified the problem more efficiently.
  • Missing protections in the system that could’ve prevented the problem.

Fix permanently

If you’ve gotten this far, the world is no longer on fire, and you should have a little bit more time to implement a permanent and more proper fix. Ideally, someone else is actually responsible for that, so you can go back to sleep or the fix can wait until Monday.

Reconsidering access paths for index ordering… a dangerous optimization… and a fix!

MySQL has had an interesting optimization for years now1, which has popped up from time to time: in certain circumstances, it may choose to use an index that is index-wise less efficient, but provides the resulting rows in order, to avoid a filesort of the result.

What does this typically look like in production? A query that seems simple and easy takes much longer than it should, sometimes. (Perhaps in production, the query gets killed by pt-kill or exceeds the max_execution_time provided.) The query could be very simple indeed:

SELECT ... WHERE `other_id` = 555 ORDER BY `id` ASC LIMIT 1

There’s an index on other_id, and when the query is run with an appropriate USE INDEX hint, it is fast. Even weirder, changing the query to use LIMIT 10 causes it to run lightning-fast! If anything, the LIMIT 1 should be faster… so, what gives?

Looking at the EXPLAIN, you may notice that the LIMIT 1 version is using access type index on the PRIMARY key (in this case id), whereas the LIMIT 10 version is using ref on a secondary key. Access type index means a full-index scan… on the primary key… which is the clustered key… which is a full table scan.

The optimization hopes that, with a small enough LIMIT n, execution will be able to complete early, without scanning many rows, as soon as the LIMIT is satisfied. This hope is often misplaced: there is no guarantee that any matching rows will appear early when the table is scanned in the order of the unwisely-chosen index, so the scan may end up reading most of the table before finding them. Hope is not a strategy.
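Until the optimizer behaves better, the usual workaround is to hint the useful index explicitly, as mentioned above. A sketch, with a hypothetical table name and index name:

SELECT ... FROM `t` USE INDEX (`index_other_id`) WHERE `other_id` = 555 ORDER BY `id` ASC LIMIT 1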

Although the underlying issue had been reported several times already, under various circumstances, since there were so many individual bugs reported, I filed a new bug with a summary of the situation… and a patch: MySQL Bug 97001.

On MySQL Bug 97001, I proposed a solution (and provided a patch, submitted to Oracle under the NDA) introducing a new optimizer_switch flag named reconsider_index_for_order (defaulting to on to duplicate the current behavior). Although this optimization might benefit some queries, it’s too dependent on the actual data in the involved tables, so it’s not a good candidate for a general optimizer feature in my opinion. Maybe the default could eventually become off, allowing users to opt in to this optimization when they want it, while the on default keeps behavior compatible for now.
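If the patch were merged, disabling the behavior would be a one-line change (hypothetical, since the flag exists only in the proposed patch):

SET GLOBAL optimizer_switch = 'reconsider_index_for_order=off';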

1 The underlying optimization actually appears to have roots in some very old code, most importantly probably one specific commit in 5.1.

Thanks, Oracle, for fixing the stupid and dangerous SET GLOBAL sql_log_bin!

As of MySQL 5.5.41, released on November 28 (last week), Oracle has fixed MySQL Bug 67433, which I filed on October 31, 2012 and wrote about 4 months ago in Stupid and dangerous: SET GLOBAL sql_log_bin. You can see the fix in Launchpad revision 4718.

The MySQL 5.5.41 release notes mention:

Replication: The global scope for the sql_log_bin system variable has been deprecated, and this variable can now be set with session scope only. The statement SET GLOBAL SQL_LOG_BIN now produces an error. It remains possible for now to read the global value of sql_log_bin, but you should act to remove from your applications any dependencies on reading this value, as the ability to do so will be removed in a future MySQL release. (Bug #67433, Bug #15868071)
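In practice, on 5.5.41 and later the difference looks something like this (a sketch; the exact error text varies by version):

SET GLOBAL sql_log_bin = 0;   -- now rejected with an error
SET SESSION sql_log_bin = 0;  -- still permitted for SUPER users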

Thanks for addressing this and making the world’s data (and DBAs’ jobs) a little safer, Oracle! Special thanks to Morgan Tocker (MySQL Community Manager at Oracle) who cared about it enough to keep pushing it through to a reasonable fix/resolution.

Visualizing the impact of ordered vs. random index insertion in InnoDB

[This post refers to innodb_ruby version 0.9.11 as of October 2, 2014.]

Many DBAs know that building indexes in “random” order (or really any order that differs greatly from key order) can be much less efficient. However, it’s often hard to really understand why that is. With the “-illustrate” visualization modes available in innodb_ruby, it’s possible to quite easily visualize the structure of indexes. The space-lsn-age-illustrate mode of innodb_space allows visualization of all pages in a space file by “LSN age”, generating something like a heatmap of the space file based on how recently each page was modified.

(Note that a small Ruby script generate_data_simple.rb was used to generate the test tables used below. You’ll also want to have a reasonable monospace font supporting Unicode block characters correctly, for which I’d heartily recommend Adobe’s Source Code Pro.)
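As a rough idea of what the test data looks like, here is a minimal SQL sketch with a hypothetical schema (the actual generate_data_simple.rb script may differ):

CREATE TABLE test.t (
  id BIGINT UNSIGNED NOT NULL,
  data VARCHAR(64) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE test.t_shuffle LIKE test.t;

-- Key-ordered insertion into test.t: ids arrive as 1, 2, 3, ...
INSERT INTO test.t (id, data) VALUES
  (1, REPEAT('x', 64)), (2, REPEAT('x', 64)), (3, REPEAT('x', 64));

-- Random-order insertion into test.t_shuffle: the same rows, with ids shuffled.
INSERT INTO test.t_shuffle (id, data) VALUES
  (48211, REPEAT('x', 64)), (173, REPEAT('x', 64)), (9096, REPEAT('x', 64));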

Building an index by insertion in key order

Inserting rows into an index in key order is much more efficient primarily for two reasons:

  1. Pages can be filled completely, and the database (with sufficient optimizations present) can detect the “bulk loading” behavior and efficiently split pages by creating new, empty pages, rather than splitting pages in half.
  2. Only the “edge” of the index is being written to; once a page has filled it will not be revisited. This can make the buffer pool, which caches database pages, much more effective.

Here’s an example of a table with only a PRIMARY KEY clustered index, built by inserting rows in key order:

$ innodb_space -s ibdata1 -T test/t space-lsn-age-illustrate

As you can see in the above image, the index pages are written to in nearly perfect order as they are allocated from the beginning of the file towards the end.

The first extent (the first line of the image) is allocated as a fragment extent, and contains single pages allocated for different purposes. You can clearly see the tablespace bookkeeping pages, FSP_HDR at page 0 and INODE at page 2, immediately followed by the root index page at page 3, all very recently modified. Following that are 32 individually allocated “fragment” pages which were allocated first in the index before it switched to allocating full extents. Then the index pages progress perfectly until the end of the used space.

Also note the LSN age histogram (printed in white above the colored legend at the bottom) shows all buckets equally filled.

Building an index by insertion in random order

Looking at an index built by insertion in key order was reassuring. What if it is built in random order instead? The result is perhaps a bit surprising:

$ innodb_space -s ibdata1 -T test/t_shuffle space-lsn-age-illustrate

Because the rows were inserted in completely random order, every page has an equal chance for insertion. This means in practice that every page is very recently modified, and this is clearly visible by the entire heatmap being purple. What this also means is that the entire table must be continually present in the buffer pool, and if it doesn’t fit, performance will suffer greatly. This is the main reason that performance of insertion in random order is terrible.

Additionally, you can see a few misbehaviors of InnoDB: Note the extents starting at page 1088, 1152, and 1216. Why do you think they look like that? Let me know your ideas in the comments.

Building a primary and secondary index in primary key order

What if you have multiple indexes? Looking at a very simple and typical case, inserting data in key order by the primary key, while a secondary index exists:

$ innodb_space -s ibdata1 -T test/t_index_before space-lsn-age-illustrate

Notice that this looks like the previous cases perfectly interleaved together, because it is exactly that. Since the primary key and secondary index contain completely different data, insertion is ordered by the primary key, but completely mis-ordered by the secondary index, resulting in the secondary index being built inefficiently.

Adding a secondary index to an existing table

The obvious answer then to the previous problem is to add the index after the data has been loaded, and this does in fact result in the expected outcome:

$ innodb_space -s ibdata1 -T test/t_index_after space-lsn-age-illustrate

When the index is built on the existing table (via ALTER TABLE ... ADD INDEX), it is built by scanning and sorting the data before insertion into the index, resulting in an optimal (and very fast) index build.
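In other words, the deferred build is just an ordinary ALTER TABLE once the load has finished (the column name here is hypothetical):

-- Load the data first, then build the secondary index in one sorted pass.
ALTER TABLE test.t_index_after ADD INDEX (data);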

Visualizing page fill rate

While the above illustrations show how recently each page was modified during the index build, it’s also possible to use space-extents-illustrate to visualize how full each page is. The key-ordered index looks like this:

$ innodb_space -s ibdata1 -T test/t space-extents-illustrate

Compared to the random-ordered index:

$ innodb_space -s ibdata1 -T test/t_shuffle space-extents-illustrate

The random-ordered insertion caused pages to be split more frequently, and in some cases to be severely under-filled, causing a very poor page fill rate on average.

Those of you who are particularly observant will have noticed that the index illustrations in the first pictures above showed that the ordered-insertion index is significantly smaller than the random one. You can see here that the random-insertion index is 41% larger at 1043 pages compared to just 737 pages for the ordered-insertion index. Additionally, 206 more pages are left unused, making the actual disk space usage 57% larger.

Stupid and dangerous: SET GLOBAL sql_log_bin

It’s been almost 4.5 years since, during some code refactoring, it was decided (or accidentally changed?) that sql_log_bin should become a GLOBAL variable as well as a SESSION one. Almost 2 years ago, during MySQL 5.5 upgrades at Twitter, I filed MySQL Bug 67433 describing in detail how stupid and dangerous this change was, and asking for a reversal.

Nothing has been changed or reversed, so SET GLOBAL sql_log_bin continues to be allowed. I continue to directly see, and hear about damage caused by incorrect use of SET GLOBAL sql_log_bin. So, let me describe just how stupid and dangerous it really is.

What really happens when you run SET GLOBAL sql_log_bin?

If you have a master-slave replication configuration, your master will have binary logging enabled (log_bin) and be recording each transaction to its local binary logs, which the slaves read (more or less in real time) in order to replicate. This log must be complete for replication to be useful and trustworthy. If it is incomplete, serious data loss may occur.

It’s been possible for a long time to use SET [SESSION] sql_log_bin=0 within a connection (with a SUPER user) to temporarily disable binary logging of a session. This can be quite useful to be able to run e.g. ALTER TABLE commands without having them immediately replicated to the slaves.
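For example, the session-scoped form is typically used like this (the table and column names are hypothetical):

-- Requires the SUPER privilege; affects only this session.
SET SESSION sql_log_bin = 0;
ALTER TABLE my_large_table ADD INDEX (some_column);  -- not written to the binary log
SET SESSION sql_log_bin = 1;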

Most people would guess that running the very similar SET GLOBAL sql_log_bin=0 would disable binary logging for all sessions (you know, GLOBAL?) immediately. However, that’s not quite the case.

As background, there are essentially three different scopes of variables in MySQL server:

  • Global-only: The variable only exists, and only makes sense, in a global scope. An example is innodb_buffer_pool_size — there is only one buffer pool, so it makes no sense in any session context. These variables may also be read-only, so that they can’t be changed at runtime.
  • Global and session: The global variable defines the session default, and is copied to the session variable on connect. A user may change the variable within their session if they want, and there may be restrictions on doing so. If the global variable is changed, it will only affect new sessions; existing sessions will keep the value they copied at session start. A typical example of this is max_allowed_packet (see the sketch after this list).
  • Session-only: There is no global variable associated with the session variable. A user may change the variable only within their session.
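A minimal illustration of the middle case, using max_allowed_packet as in the list above:

-- Changing the global value affects only sessions started after this point.
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;

-- An existing session still sees the value it copied when it connected.
SHOW SESSION VARIABLES LIKE 'max_allowed_packet';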

The sql_log_bin variable was previously Session-only; with the change described above, it was made Global and session.

There are a few implications of this:

  • When a session is initiated, it copies its sql_log_bin value from the global scope to its session scope copy.
  • If the DBA issues SET GLOBAL sql_log_bin=0, it only affects new sessions; any already-connected sessions will continue logging. This makes it pretty useless for managing a failover.
  • Changes to the global variable using SET GLOBAL do not even affect the currently connected session; issuing SET GLOBAL sql_log_bin=0 within a session does not prevent the following commands in that session from being binary logged (see the sketch after this list). This is often the first sign that something didn’t work as the DBA expected: they suddenly find all slaves to be executing something they thought they had prevented from being replicated.
  • If a connection is made while sql_log_bin=0 and the DBA then issues SET GLOBAL sql_log_bin=1, the sessions which started while it was 0 will continue not to log anything to the binary log for as long as they are allowed to stay connected. This also makes it pretty useless for managing a failover.
  • If the DBA accidentally issues SET GLOBAL sql_log_bin=0 briefly, even for milliseconds, the entire master-slave replication topology may be irreparably damaged, and if this happens without anyone noticing immediately, permanent and severe data loss may occur.
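A minimal demonstration of the second and third points above (assuming a SUPER connection on a master with binary logging enabled; the table name is hypothetical):

-- From an already-connected session on the master:
SET GLOBAL sql_log_bin = 0;

-- This statement is still written to the binary log: this session copied
-- sql_log_bin = 1 when it connected, and SET GLOBAL did not change that copy.
DELETE FROM important_table WHERE id = 42;

-- Meanwhile, any session that connects while the global value is 0 will
-- silently skip binary logging for everything it runs, even after the
-- global value is set back to 1.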

Is anyone putting it to good use?

The only comment that MySQL Bug 67433 has received from Oracle was from Sveta Smirnova, saying:

Regarding to “revert” I don’t agree: with global access it is possible to make a server master online, without bringing it down.

If you Google search for “set global sql_log_bin” you can find plenty of examples of people talking about it in various languages. I couldn’t find a single example of a script or person using or describing its use correctly, and I looked through pages and pages of examples.

So here’s a Hall of Shame instead:

Also notably, the MySQL manual text for sql_log_bin still does not adequately describe or warn about its behavior.

Your move, Oracle

Do the right thing, and revert this change. We even provided a patch.

Update: Morgan Tocker, MySQL Community Manager at Oracle is requesting feedback on the future of SET GLOBAL sql_log_bin. Go and make your opinion known!