Thanks, Oracle, for fixing the stupid and dangerous SET GLOBAL sql_log_bin!

As of MySQL 5.5.41, released on November 28 (last week), Oracle has fixed MySQL Bug 67433, which I filed on October 31, 2012 and wrote about 4 months ago in Stupid and dangerous: SET GLOBAL sql_log_bin. You can see the fix in Launchpad revision 4718.

The MySQL 5.5.41 release notes mention:

Replication: The global scope for the sql_log_bin system variable has been deprecated, and this variable can now be set with session scope only. The statement SET GLOBAL SQL_LOG_BIN now produces an error. It remains possible for now to read the global value of sql_log_bin, but you should act to remove from your applications any dependencies on reading this value, as the ability to do so will be removed in a future MySQL release. (Bug #67433, Bug #15868071)

Thanks for addressing this and making the world’s data (and DBAs’ jobs) a little safer, Oracle! Special thanks to Morgan Tocker (MySQL Community Manager at Oracle) who cared about it enough to keep pushing it through to a reasonable fix/resolution.

Stupid and dangerous: SET GLOBAL sql_log_bin

It’s been almost 4.5 years since, during some code refactoring, it was decided (or accidentally changed?) that sql_log_bin should become a GLOBAL variable as well as a SESSION one. Almost 2 years ago, during MySQL 5.5 upgrades at Twitter, I filed MySQL Bug 67433 describing in detail how stupid and dangerous this change was, and asking for a reversal.

Nothing has been changed or reversed, so SET GLOBAL sql_log_bin continues to be allowed. I continue to directly see, and hear about damage caused by incorrect use of SET GLOBAL sql_log_bin. So, let me describe just how stupid and dangerous it really is.

What really happens when you run SET GLOBAL sql_log_bin?

If you have a master-slave replication configuration, your master will have binary logging enabled (log_bin) and be recording each transaction to its local binary logs, which the slaves read (more or less in real time) in order to replicate. This log must be complete for replication to be useful and trustworthy. If it is incomplete, serious data loss may occur.

It’s been possible for a long time to use SET [SESSION] sql_log_bin=0 within a connection (with a SUPER user) to temporarily disable binary logging of a session. This can be quite useful to be able to run e.g. ALTER TABLE commands without having them immediately replicated to the slaves.

Most people would guess that running the very similar SET GLOBAL sql_log_bin=0 would disable binary logging for all sessions (you know, GLOBAL?) immediately. However, that’s not quite the case.

As background, there are essentially three different scopes of variables in MySQL server:

  • Global-only: The variable only exists, and only makes sense, in a global scope. An example is innodb_buffer_pool_size — there is only one buffer pool, so it makes no sense in any session context. These variables may also be read-only, so that they can’t be changed at runtime.
  • Global and session: The global variable defines the session default, and is copied to the session variable on connect. A user may change the variable within their session if they want, and there may be restrictions on doing so. If the global variable is changed, it will only affect new sessions; existing sessions will keep the variable they copied at session start. A typical example of this is max_allowed_packet.
  • Session-only: There is no global variable associated with the session variable. A user may change the variable only within their session.

The sql_log_bin variable was previously Session-only and with the commit mentioned above, it was made Global and session.

There are a few implications of this:

  • When a session is initiated, it copies its sql_log_bin value from the global scope to its session scope copy.
  • If the DBA issues SET GLOBAL sql_log_bin=0, it only affects new sessions, any already-connected sessions will continue logging. This makes it pretty useless for managing a failover.
  • Changes to the global variable using SET GLOBAL do not even affect the currently connected session; issuing SET GLOBAL sql_log_bin=0 within a session does not prevent the following commands in that session from being binary logged. This is actually often the first sign that something didn’t work as the DBA expected—they suddenly find all slaves to be executing something they thought they had prevented from being replicated.
  • If a connection is made while sql_log_bin=0 and the DBA then issues SET GLOBAL sql_log_bin=1 the sessions which started while it was 0 will continue to not log anything to the binary log for as long as they are allowed to stay connected. This also makes it pretty useless for managing a failover.
  • If the DBA accidentally issues SET GLOBAL sql_log_bin=0 briefly, even for milliseconds, the entire master-slave replication topology may be irreparably damaged and if this is done without immediately noticing it, permanent and severe data loss may occur.

Is anyone putting it to good use?

The only comment that MySQL Bug 67433 has received from Oracle was from Sveta Smirnova, saying:

Regarding to “revert” I don’t agree: with global access it is possible to make a server master online, without bringing it down.

If you Google search for “set global sql_log_bin” you can find plenty of examples of people talking about it in various languages. I couldn’t find a single example of a script or person using or describing its use correctly, and I looked through pages and pages of examples.

So here’s a Hall of Shame instead:

Also notably, the MySQL manual text for sql_log_bin also still does not adequately describe or warn about its behavior.

Your move, Oracle

Do the right thing, and revert this change. We even provided a patch.

Update: Morgan Tocker, MySQL Community Manager at Oracle is requesting feedback on the future of SET GLOBAL sql_log_bin. Go and make your opinion known!

InnoDB bugs found during research on InnoDB data storage

During the process of researching InnoDB’s storage formats and building the innodb_ruby and innodb_diagrams projects discussed in my series of InnoDB blog posts, Davi Arnaut and I found a number of InnoDB bugs. I thought I’d bring up a few of them, as they are fairly interesting.

These bugs were largely discoverable due to the innodb_space utility making important internal information visible in a way that it had never been visible in the past. Using it to examine production tables provided many leads to go on to find the bugs responsible. When we initially looked at a graphical plot of free space by page produced from innodb_space data, we were quite surprised to see so many pages less than half filled (including many nearly empty). After much research we were able to track down all of the causes for the anomalies we discovered.

Bug #67718: InnoDB drastically under-fills pages in certain conditions

Due to overly-aggressive attempts to optimize page split based on insertion order during insertion, InnoDB could leave pages under-filled with as few as one record in each page. This was observed in several production systems in two cases which I believe could be quite common for others:

  1. Mostly-increasing keys — Twitter uses Snowflake for ID generation in a distributed way. Overall it’s quite nice. Snowflake generates 64-bit mostly-incrementing IDs that contain a timestamp component. Insertion is typically happening via queues and other non-immediate mechanisms, so IDs will find their way to the database slightly out of order.
  2. Nearly-ordered keys — Another schema has a Primary Key and Secondary Key which are similarly—but not exactly— ordered. Insertion into a table to copy data in either order ends up nearly ordered by the other key.

Both of these circumstances ended up tripping over this bug and causing drastically under-filled pages to appear in production databases, consuming large amounts of disk space.

Bug #67963: InnoDB wastes 62 out of every 16384 pages

InnoDB needs to occasionally allocate some internal bookkeeping pages; two for every 256 MiB of data. In order to do so, it allocates an extent (64 pages), allocates the two pages it needed, and then adds the remainder of the extent (62 free pages) to a list of extents to be used for single page allocations called FREE_FRAG. Almost nothing allocates pages from that list, so these pages go to waste.

This is fairly subtle, wasting only 0.37% of disk space in any large InnoDB table, but nonetheless interesting and quite fixable.

Bug #68023: InnoDB reserves an excessive amount of disk space for write operations

InnoDB attempts to ensure write operations will always succeed after they’ve reached a certain point by pre-reserving 1% of the tablespace size for the write operation. This is an excessive amount; 1% of every large table in a production system really adds up. This should be capped at some reasonable amount.

Bug #68501: InnoDB fails to merge under-filled pages depending on deletion order

Depending on the order that records are deleted from pages, InnoDB may not merge multiple adjacent under-filled pages together, wasting disk space.

Bug #68545: InnoDB should check left/right pages when target page is full to avoid splitting

During an insertion operation, only one of two outcomes is currently possible:

  1. The record fits in the target page and is inserted without splitting the page.
  2. The record does not fit in the target page and the page is then split into two pages, each with half of the records on the original page. After the page is split, the insertion will happen into one of the two resulting pages two pages.

This misses a very common case in practice, when the target page is full but one or more of its adjacent pages have free space or may even be nearly empty. A more intelligent alternative would be to consider merging the adjacent pages in order to make free space on the target page, rather than split the target page, creating a completely new half-full page.

Bug #68546: InnoDB stores unnecessary PKV fields in unique SK non-leaf pages

Non-leaf pages in Secondary Keys need a key that is guaranteed to be unique even though there may be many child pages with the same minimum key value. InnoDB adds all Primary Key fields to the key, but when the Secondary Key is already unique this is unnecessary. For systems with unique Secondary Keys and a large Primary Key, this can add up to a lot of disk space to store the unnecessary fields. Fixing this in a compatible way would be complex, and most users are unaffected, so I’d say it’s unlikely to be fixed.

Bug #68868: Documentation for InnoDB tablespace flags for file format incorrect

As I wrote in How InnoDB accidentally reserved only 1 bit for table format, InnoDB purportedly reserved 6 bits of a field for storing the table format (Antelope, Barracuda, etc.), but due to a bug in the C #defines only reserved 1 bit.

How InnoDB accidentally reserved only 1 bit for table format

The MySQL 5.5 (and 5.6) documentation says, in Identifying the File Format in Use:

“… Otherwise, the least significant bit should be set in the tablespace flags, and the file format identifier is written in the bits 5 through 11. …”

This is incorrect for any version due to a bug in how the tablespace flags were stored (which caused only 1 bit to be reserved, rather than 6). This was all re-worked in MySQL 5.6, so someone obviously noticed it, but the documentation has been left incorrect for all versions, and the incorrect and misleading code has been left in MySQL 5.5. I filed MySQL Bug #68868 about the documentation.

File formats and names

There are file format names in the documentation and code for values 0 through 25 (letters “A” through “Z”), although only 0 (“Antelope”) and 1 (“Barracuda”) are currently used. They are all defined in storage/innobase/trx/trx0sys.c:

    97  /** List of animal names representing file format. */
    98  static const char*      file_format_name_map[] = {
    99          "Antelope",
   100          "Barracuda",
   101          "Cheetah",
   102          "Dragon",
   103          "Elk",
   104          "Fox",
   105          "Gazelle",
   106          "Hornet",
   107          "Impala",
   108          "Jaguar",
   109          "Kangaroo",
   110          "Leopard",
   111          "Moose",
   112          "Nautilus",
   113          "Ocelot",
   114          "Porpoise",
   115          "Quail",
   116          "Rabbit",
   117          "Shark",
   118          "Tiger",
   119          "Urchin",
   120          "Viper",
   121          "Whale",
   122          "Xenops",
   123          "Yak",
   124          "Zebra"
   125  };

How only one bit was reserved

The code to store the file format identifier into an InnoDB tablespace file’s tablespace flags is in storage/innobase/include/dict0mem.h and follows, with my commentary.

The first bit is reserved for 1 = compact, 0 = redundant format:

    70  /** Table flags.  All unused bits must be 0. */
    71  /* @{ */
    72  #define DICT_TF_COMPACT                 1       /* Compact page format.
    73                                                  This must be set for
    74                                                  new file formats
    75                                                  (later than
    76                                                  DICT_TF_FORMAT_51). */

The next 4 bits are reserved for the compressed page size:

    78  /** Compressed page size (0=uncompressed, up to 15 compressed sizes) */
    79  /* @{ */
    80  #define DICT_TF_ZSSIZE_SHIFT            1
    81  #define DICT_TF_ZSSIZE_MASK             (15 << DICT_TF_ZSSIZE_SHIFT)
    82  #define DICT_TF_ZSSIZE_MAX (UNIV_PAGE_SIZE_SHIFT - PAGE_ZIP_MIN_SIZE_SHIFT + 1)
    83  /* @} */

Next we’re supposed to reserve 6 bits for the file format (up to 64 formats):

    85  /** File format */
    86  /* @{ */
    87  #define DICT_TF_FORMAT_SHIFT            5       /* file format */
    88  #define DICT_TF_FORMAT_MASK             \
    89  ((~(~0 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT))) << DICT_TF_FORMAT_SHIFT)

Two values are currently defined, which correspond to Antelope and Barracuda (with rather strange names “51” and “ZIP” as defined):

    90  #define DICT_TF_FORMAT_51               0       /*!< InnoDB/MySQL up to 5.1 */
    91  #define DICT_TF_FORMAT_ZIP              1       /*!< InnoDB plugin for 5.1:
    92                                                  compressed tables,
    93                                                  new BLOB treatment */
    94  /** Maximum supported file format */
    95  #define DICT_TF_FORMAT_MAX              DICT_TF_FORMAT_ZIP
    96
    97  /** Minimum supported file format */
    98  #define DICT_TF_FORMAT_MIN              DICT_TF_FORMAT_51

This is where things get interesting. It is not clear if DICT_TF_BITS (defined below) is supposed to represent the total number of flag bits (11 so far!), or the number of bits for the format above (6, but then shouldn’t it be called DICT_TF_FORMAT_BITS?). However since 6 is larger than the non­-format related bits (5), and only 1 bit has actually been used for format in practice (0..1), nothing will blow up here, and the #error check passes cleanly.

   100  /* @} */
   101  #define DICT_TF_BITS                    6       /*!< number of flag bits */
   102  #if (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
   103  # error "DICT_TF_BITS is insufficient for DICT_TF_FORMAT_MAX"
   104  #endif
   105  /* @} */

Also note that the #error there is easy enough to calculate. It works out to:

  1. (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
  2. (1 << (6 - 5)) <= 1
  3. (1 << 1) <= 1
  4. 2 <= 1
  5. FALSE

The “6 - 5” in the calculation above represents essentially the number of bits reserved for the table format flag, which turns out to be only 1.

The above defines go on to be used by DICT_TF2 (another set of flags) which currently only uses a single bit:

   107  /** @brief Additional table flags.
   108
   109  These flags will be stored in SYS_TABLES.MIX_LEN.  All unused flags
   110  will be written as 0.  The column may contain garbage for tables
   111  created with old versions of InnoDB that only implemented
   112  ROW_FORMAT=REDUNDANT. */
   113  /* @{ */
   114  #define DICT_TF2_SHIFT                  DICT_TF_BITS
   115                                                  /*!flags. */
   117  #define DICT_TF2_TEMPORARY              1       /*!< TRUE for tables from
   118                                                  CREATE TEMPORARY TABLE. */
   119  #define DICT_TF2_BITS                   (DICT_TF2_SHIFT + 1)
   120                                                  /*!flags. */
   122  /* @} */

It’s very easy to see here that if DICT_TF2_SHIFT is DICT_TF_BITS, which is 6, the DICT_TF2_TEMPORARY flag is being stored at 1 << 6, which is only leaving the file format a single bit, when it should be reserving 6 bits.

The end result of this is that the DICT_TF2_TEMPORARY bit is being stored into a bit reserved for the table format, rather than after the table format. The DICT_TF2 stuff seems to only be stored in the data dictionary, and never in the IBD file, so this would I guess manifest when Cheetah would be implemented and a temporary table is created.

Why this could happen

This code is unnecessarily complex and confusing, and to make matters worse it is inconsistent. There is no concise description of the fields being stored; only the code documents the structure, and since it is badly written, its value as documentation is low.

The bug is two­-fold:

  1. There should be a DICT_TF_FORMAT_BITS define to capture the expected number of bits required to store the DICT_TF_FORMAT_* structure (dictionary, table flags, format) which is defined to 6, and that should be used in the masks associated with DICT_TF_FORMAT_*.
  2. The DICT_TF_BITS define should mean the total size of the DICT_TF structures (which precede the DICT_TF2 structures obviously), and should be 1 + 4 + 6 = 11 bits, but this should be defined only by summing the previous structures sizes.

Because of the way this is written, it’s actually quite difficult to discern that there is a bug present visually, so I am not surprised that this was not caught — however I am dismayed about the code quality and clarity, and that this passes any sort of code review.