How InnoDB accidentally reserved only 1 bit for table format

The MySQL 5.5 (and 5.6) documentation says, in Identifying the File Format in Use:

“… Otherwise, the least significant bit should be set in the tablespace flags, and the file format identifier is written in the bits 5 through 11. …”

This is incorrect for any version due to a bug in how the tablespace flags were stored (which caused only 1 bit to be reserved, rather than 6). This was all re-worked in MySQL 5.6, so someone obviously noticed it, but the documentation has been left incorrect for all versions, and the incorrect and misleading code has been left in MySQL 5.5. I filed MySQL Bug #68868 about the documentation.

File formats and names

There are file format names in the documentation and code for values 0 through 25 (letters “A” through “Z”), although only 0 (“Antelope”) and 1 (“Barracuda”) are currently used. They are all defined in storage/innobase/trx/trx0sys.c:

    /** List of animal names representing file format. */
    static const char*      file_format_name_map[] = {
            "Antelope",
            "Barracuda",
            "Cheetah",
            "Dragon",
            "Elk",
            "Fox",
            "Gazelle",
            "Hornet",
            "Impala",
            "Jaguar",
            "Kangaroo",
            "Leopard",
            "Moose",
            "Nautilus",
            "Ocelot",
            "Porpoise",
            "Quail",
            "Rabbit",
            "Shark",
            "Tiger",
            "Urchin",
            "Viper",
            "Whale",
            "Xenops",
            "Yak",
            "Zebra"
    };

How only one bit was reserved

The code to store the file format identifier into an InnoDB tablespace file’s tablespace flags is in storage/innobase/include/dict0mem.h and follows, with my commentary.

The first bit is reserved for 1 = compact, 0 = redundant format:

    /** Table flags.  All unused bits must be 0. */
    /* @{ */
    #define DICT_TF_COMPACT                 1       /* Compact page format.
                                                    This must be set for
                                                    new file formats
                                                    (later than
                                                    DICT_TF_FORMAT_51). */

The next 4 bits are reserved for the compressed page size:

    /** Compressed page size (0=uncompressed, up to 15 compressed sizes) */
    /* @{ */
    #define DICT_TF_ZSSIZE_SHIFT            1
    #define DICT_TF_ZSSIZE_MASK             (15 << DICT_TF_ZSSIZE_SHIFT)
    #define DICT_TF_ZSSIZE_MAX (UNIV_PAGE_SIZE_SHIFT - PAGE_ZIP_MIN_SIZE_SHIFT + 1)
    /* @} */

Next we’re supposed to reserve 6 bits for the file format (up to 64 formats):

    /** File format */
    /* @{ */
    #define DICT_TF_FORMAT_SHIFT            5       /* file format */
    #define DICT_TF_FORMAT_MASK             \
    ((~(~0 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT))) << DICT_TF_FORMAT_SHIFT)

Two values are currently defined, which correspond to Antelope and Barracuda (with rather strange names “51” and “ZIP” as defined):

    #define DICT_TF_FORMAT_51               0       /*!< InnoDB/MySQL up to 5.1 */
    #define DICT_TF_FORMAT_ZIP              1       /*!< InnoDB plugin for 5.1:
                                                    compressed tables,
                                                    new BLOB treatment */
    /** Maximum supported file format */
    #define DICT_TF_FORMAT_MAX              DICT_TF_FORMAT_ZIP

    /** Minimum supported file format */
    #define DICT_TF_FORMAT_MIN              DICT_TF_FORMAT_51

This is where things get interesting. It is not clear whether DICT_TF_BITS (defined below) is supposed to represent the total number of flag bits (11 so far!) or the number of bits for the format above (6, but then shouldn’t it be called DICT_TF_FORMAT_BITS?). However, since 6 is larger than the number of non-format bits (5), and only 1 bit has actually been needed for the format in practice (values 0..1), nothing blows up here, and the #error check passes cleanly.

    /* @} */
    #define DICT_TF_BITS                    6       /*!< number of flag bits */
    #if (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
    # error "DICT_TF_BITS is insufficient for DICT_TF_FORMAT_MAX"
    #endif
    /* @} */

Also note that the condition guarding that #error is easy enough to evaluate by hand. It works out to:

  1. (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
  2. (1 << (6 - 5)) <= 1
  3. (1 << 1) <= 1
  4. 2 <= 1
  5. FALSE (so the #error does not fire)

The “6 - 5” in the calculation above represents essentially the number of bits reserved for the table format flag, which turns out to be only 1.
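The same arithmetic can be checked at compile time. The following is a minimal sketch, not InnoDB code: FIELD_MASK, FORMAT_BITS_SHIPPED, FORMAT_BITS_INTENDED and max_format_id are hypothetical names introduced here for illustration, and the mask is rebuilt with unsigned arithmetic rather than the `~0` expression from the header. The value 6 for the intended width is the assumption made in this post.

```c
#include <assert.h>

enum {
	DICT_TF_FORMAT_SHIFT = 5,	/* from the 5.5 header excerpt */
	DICT_TF_BITS = 6,		/* from the 5.5 header excerpt */
	/* Hypothetical names: width of the format field as it works out... */
	FORMAT_BITS_SHIPPED = DICT_TF_BITS - DICT_TF_FORMAT_SHIFT,
	/* ...and the width this post assumes was intended. */
	FORMAT_BITS_INTENDED = 6
};

/* Hypothetical helper: a mask of `bits` bits starting at bit `shift`. */
#define FIELD_MASK(bits, shift)	(((1u << (bits)) - 1u) << (shift))

/* Largest format identifier that fits in a field of the given width. */
static unsigned max_format_id(unsigned format_bits)
{
	return (1u << format_bits) - 1u;
}

static_assert(FORMAT_BITS_SHIPPED == 1,
	      "6 - 5 leaves a single bit for the format");
static_assert(FIELD_MASK(FORMAT_BITS_SHIPPED, DICT_TF_FORMAT_SHIFT) == 0x020u,
	      "shipped mask covers bit 5 only");
static_assert(FIELD_MASK(FORMAT_BITS_INTENDED, DICT_TF_FORMAT_SHIFT) == 0x7E0u,
	      "a 6-bit field would cover bits 5 through 10");
```

With one bit, max_format_id(FORMAT_BITS_SHIPPED) is 1, which is exactly Barracuda; a 6-bit field would allow identifiers up to 63.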

The above defines go on to be used by DICT_TF2 (another set of flags) which currently only uses a single bit:

    /** @brief Additional table flags.

    These flags will be stored in SYS_TABLES.MIX_LEN.  All unused flags
    will be written as 0.  The column may contain garbage for tables
    created with old versions of InnoDB that only implemented
    ROW_FORMAT=REDUNDANT. */
    /* @{ */
    #define DICT_TF2_SHIFT                  DICT_TF_BITS
                                                    /*!< Shift value for
                                                    table->flags. */
    #define DICT_TF2_TEMPORARY              1       /*!< TRUE for tables from
                                                    CREATE TEMPORARY TABLE. */
    #define DICT_TF2_BITS                   (DICT_TF2_SHIFT + 1)
                                                    /*!< Total number of bits
                                                    in table->flags. */
    /* @} */

It’s very easy to see here that since DICT_TF2_SHIFT is DICT_TF_BITS, which is 6, the DICT_TF2_TEMPORARY flag is stored at 1 << 6. That leaves the file format only a single bit (bit 5), when 6 bits should have been reserved.

The end result is that the DICT_TF2_TEMPORARY bit is stored in a bit meant for the table format, rather than after the table format. The DICT_TF2 flags seem to be stored only in the data dictionary, and never in the IBD file, so I would guess this would manifest only once Cheetah was implemented and a temporary table was created.
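To make the overlap concrete, here is a small sketch that assumes a format identifier is written as a plain multi-bit number starting at DICT_TF_FORMAT_SHIFT, which is how the header reads. The functions temporary_bit, format_field and collides_with_temporary are hypothetical helpers, not part of InnoDB, and the sketch only models the combined flags value, not what is written to any file.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	DICT_TF_FORMAT_SHIFT = 5,	/* from the 5.5 header excerpt */
	DICT_TF_BITS = 6,		/* from the 5.5 header excerpt */
	DICT_TF2_SHIFT = DICT_TF_BITS,
	DICT_TF2_TEMPORARY = 1
};

static_assert(DICT_TF2_SHIFT == 6, "TF2 flags start right after bit 5");

/* Bit used for DICT_TF2_TEMPORARY in the combined flags. */
static unsigned temporary_bit(void)
{
	return (unsigned) DICT_TF2_TEMPORARY << DICT_TF2_SHIFT;
}

/* Where a format identifier lands if stored as a multi-bit number. */
static unsigned format_field(unsigned format_id)
{
	return format_id << DICT_TF_FORMAT_SHIFT;
}

/* True if the format identifier overlaps the temporary-table bit. */
static bool collides_with_temporary(unsigned format_id)
{
	return (format_field(format_id) & temporary_bit()) != 0;
}
```

Under these assumptions Antelope (0) and Barracuda (1) stay clear of the temporary bit, while Cheetah (2) would be indistinguishable from a temporary Antelope table.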

Why this could happen

This code is unnecessarily complex and confusing, and to make matters worse it is inconsistent. There is no concise description of the fields being stored; only the code documents the structure, and since it is badly written, its value as documentation is low.

The bug is two-fold:

  1. There should be a DICT_TF_FORMAT_BITS define, set to 6, to capture the number of bits required to store the DICT_TF_FORMAT_* value (dictionary, table flags, format). That define should be used in the masks associated with DICT_TF_FORMAT_*.
  2. The DICT_TF_BITS define should mean the total size of the DICT_TF structures (which obviously precede the DICT_TF2 structures). It should be 1 + 4 + 6 = 11 bits, but it should be defined only by summing the sizes of the preceding structures.
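A sketch of what those two points could look like follows. It is not the rework that went into MySQL 5.6, and DICT_TF_COMPACT_BITS, DICT_TF_ZSSIZE_BITS, DICT_TF_FORMAT_BITS, DICT_TF_MASK and DICT_TF2_TEMPORARY_MASK are hypothetical names used only for illustration. Every shift and the total are derived from the field widths, so they cannot drift apart.

```c
#include <assert.h>

enum {
	/* Hypothetical width constants, one per field. */
	DICT_TF_COMPACT_BITS = 1,
	DICT_TF_ZSSIZE_BITS = 4,
	DICT_TF_FORMAT_BITS = 6,

	/* Each shift is the sum of the widths that precede it. */
	DICT_TF_ZSSIZE_SHIFT = DICT_TF_COMPACT_BITS,
	DICT_TF_FORMAT_SHIFT = DICT_TF_ZSSIZE_SHIFT + DICT_TF_ZSSIZE_BITS,

	/* Total width of the DICT_TF fields; DICT_TF2 starts here. */
	DICT_TF_BITS = DICT_TF_FORMAT_SHIFT + DICT_TF_FORMAT_BITS,
	DICT_TF2_SHIFT = DICT_TF_BITS
};

/* Hypothetical helper: a mask of `bits` bits starting at bit `shift`. */
#define DICT_TF_MASK(bits, shift)	(((1u << (bits)) - 1u) << (shift))

#define DICT_TF_ZSSIZE_MASK \
	DICT_TF_MASK(DICT_TF_ZSSIZE_BITS, DICT_TF_ZSSIZE_SHIFT)
#define DICT_TF_FORMAT_MASK \
	DICT_TF_MASK(DICT_TF_FORMAT_BITS, DICT_TF_FORMAT_SHIFT)
#define DICT_TF2_TEMPORARY_MASK	(1u << DICT_TF2_SHIFT)

static_assert(DICT_TF_FORMAT_SHIFT == 5, "format still starts at bit 5");
static_assert(DICT_TF_BITS == 11, "1 + 4 + 6 bits in total");
static_assert((DICT_TF_FORMAT_MASK & DICT_TF2_TEMPORARY_MASK) == 0,
	      "the temporary flag no longer overlaps the format field");
```

Note that this layout moves the temporary flag from bit 6 to bit 11, so it would not be a drop-in change for anything that already relies on the old position.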

Because of the way this is written, it’s actually quite difficult to discern visually that there is a bug present, so I am not surprised that this was not caught. However, I am dismayed by the code quality and clarity, and that this passed any sort of code review.

9 thoughts on “How InnoDB accidentally reserved only 1 bit for table format”

  1. A complete overhaul of this code was completed in the summer of 2011 by the InnoDB team at Oracle and is being released in version 5.6. The newer defines have more accurate names and are easier to understand, with a lot more documentation in the code. Just to be clear, the code you are describing was put into 5.1 when the Barracuda format was introduced, and exists up to version 5.5.

    Both the ‘bugs’ you mention above are not really bugs because the code worked fine. You are just pointing out that newer row formats with higher numbers would be hard to fit in that scheme. But what you did not realize is that only 6 bits were stored on disk for these table flags. The flags2 that you describe above is only combined with the 6 bits in a ‘flags’ field that was stored in memory for each table. The on-disk format stored only the 6 bits of a table flag and zero in all other bits. So there was always room to expand if a newer row format was introduced.

    We recognized years ago that this code was confusing and needed to be refactored. Version 5.6 moves the code in a new direction while retaining the backward compatibility to all existing file structures. In addition, the changes in 5.6 go a long way toward clarifying the table flags fields wherever they are used. These changes now describe a distinction between these flags when they describe a table (table flags) and when they describe a tablespace (tablespace flags).

    This newer code no longer hints that a multi-bit number will be used to describe the row format. Instead, these flags describe features in the row format or file format. The list of ‘future’ row formats is still found in storage\innobase\trx\trx0sys.cc, if those are ever used. But those numbers are no longer targeted as a part of the table flags or tablespace flags.

    • Kevin,

      I understand that it was fixed already in 5.6 — I said that in the post. The new code is nicer. The post is merely describing how this happened and pointing out that the documentation is (and always has been) wrong. We could debate whether bugs are bugs if you haven’t hit them, but that wouldn’t be very interesting.

      Regards,

      Jeremy

      • Six bits is all that was needed in 5.1 through 5.5. The lowest order bit says whether the row format is redundant or compact. The next 4 bits contain either zero if not compressed or an ‘ssize’ if the row format is compressed. If the sixth bit is set, the row format is either Dynamic or Compressed. If the zip size is zero, it is Dynamic. So it is easy to see that the sixth bit also indicated whether the file format is Antelope (Redundant and Compact) or Barracuda (Dynamic and Compressed).

        But since no higher file format exists yet, no other bit was needed, and there is no use for any more than 1 bit for the file format. All other bits are set to zero in flags put on-disk. Upon reading these flags from disk, any bits higher than the sixth that are non-zero would cause the tablespace not to be loaded. This prevents an older engine from using any newer file formats or features.

  2. Latest data Industry news round up, Log Buffer #314

  3. InnoDB bugs found during research on InnoDB data storage – Jeremy Cole

  4. I have a question about the MySQL 5.5 Reference Manual:

    http://dev.mysql.com/doc/refman/5.5/en/innodb-file-format-identifying.html

    It gives the following description:
    Bit 0: Zero for Antelope and no other bits will be set. One for Barracuda, and other bits may be set.
    Bit 5: Same value as Bit 0, zero for Antelope, and one for Barracuda. If Bit 0 and Bit 5 are set and Bits 1 to 4 are not, the row format is “Dynamic”

    But Kevin Lewis’ comments:
    The lowest order bit says whether the row format is redundant or compact.

    I want to know bit 0 means file format(Antelope, Barracuda) or table format( redundant, compact).

    • “I want to know bit 0 means file format(Antelope, Barracuda) or table format( redundant, compact).”

      As pointed out by me above, there are really two versions of these flags, which we can call table flags and tablespace flags. This distinction is made clear in the refactoring of version 5.6. But even in 5.1 and 5.5, the difference existed. Table flags are stored on disk in each record of SYS_TABLES, a system table in ibdata1. Tablespace flags are stored in the header of each tablespace, be it ibdata1 or any IBD file.

      Bit 0 in table flags distinguishes between Redundant and Compact Row Format. Bit 5 indicates Antelope or Barracuda. Bit 0 is always set in the table flags if bit 5 is set. So the table flags look like this:

      Bits 543210 – Table Flags
      000000 Redundant
      000001 Compact
      100001 Dynamic
      100111 Compressed with compressed page size = 4k because ssize = 3 and page size = 2 ^ (ssize + 9)

      The difference between table flags and tablespace flags is in bit 0. The tablespace flags are written to each IBD file header as well as the ibdata1 file header. And the system tablespace can contain both Redundant and Compact row formats since it contains many tables. It did not make sense to use bit 0 to identify either Redundant or Compact in a tablespace that can contain either one. So in the tablespace flags, bit 0 always matches bit 5 since it is only set for Dynamic or Compressed.

      Bits 543210 – Tablespace Flags
      000000 Redundant
      000000 Compact
      100001 Dynamic
      100111 Compressed with compressed page size = 4k because ssize = 3 and page size = 2 ^ (ssize + 9)

      These tablespace flags are the same in ibdata1 as they are in the IBD files. So you cannot tell whether an IBD file contains a Redundant row formatted table or a Compact row formatted table by looking at its tablespace flags. You must look at the table flags for the table in that IBD tablespace. The table flags can be seen with:

      SELECT * FROM information_schema.innodb_sys_tables;

      In version 5.6, you can see the tablespace flags with:

      SELECT * FROM information_schema.innodb_sys_tablespaces;
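The table flags layout described in the reply above can be decoded mechanically. This is a minimal sketch based only on that description (bit 0 compact, bits 1 to 4 ssize, bit 5 Barracuda, page size = 2 ^ (ssize + 9)); the enum, the masks and both functions are hypothetical names, and no validation of higher bits is attempted.

```c
#include <assert.h>

/* Hypothetical names for the four row formats in the tables above. */
enum row_format {
	ROW_REDUNDANT,
	ROW_COMPACT,
	ROW_DYNAMIC,
	ROW_COMPRESSED
};

#define TF_COMPACT_MASK		0x01u	/* bit 0 */
#define TF_SSIZE_MASK		0x1Eu	/* bits 1 through 4 */
#define TF_SSIZE_SHIFT		1
#define TF_BARRACUDA_MASK	0x20u	/* bit 5 */

static_assert((TF_COMPACT_MASK | TF_SSIZE_MASK | TF_BARRACUDA_MASK) == 0x3Fu,
	      "the three fields together fill exactly six bits");

/* Decode the row format from a 6-bit table flags value. */
static enum row_format row_format_from_table_flags(unsigned flags)
{
	unsigned ssize = (flags & TF_SSIZE_MASK) >> TF_SSIZE_SHIFT;

	if (!(flags & TF_BARRACUDA_MASK)) {
		return (flags & TF_COMPACT_MASK) ? ROW_COMPACT : ROW_REDUNDANT;
	}
	return ssize ? ROW_COMPRESSED : ROW_DYNAMIC;
}

/* Compressed page size in bytes, or 0 if the table is not compressed. */
static unsigned compressed_page_size(unsigned flags)
{
	unsigned ssize = (flags & TF_SSIZE_MASK) >> TF_SSIZE_SHIFT;

	return ssize ? 1u << (ssize + 9) : 0u;
}
```

For example, the value 100111 (0x27) from the table decodes as Compressed with a 4096-byte compressed page size. Applied to tablespace flags, the same function cannot distinguish Redundant from Compact, since both are stored as 000000 there.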
