How does InnoDB behave without a Primary Key?

This afternoon, Arjen Lentz and I were discussing InnoDB’s behavior without a declared PRIMARY KEY, and the topic felt interesting enough and undocumented enough to warrant its own short post.

Background on InnoDB clustered keys

In The physical structure of InnoDB index pages I described how “Everything is an index in InnoDB”. This means that InnoDB must always have a “cluster key” for each table, which is normally the PRIMARY KEY. The manual has this to say in Clustered and Secondary Indexes:

If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.

I had previously assumed this meant that an invisible column would be used along with the same sequence generation code that is used to implement auto_increment (which itself has some scalability issues). However the reality is that they are completely different implementations.

Implementation of implicit Row IDs

As the manual says, if a table is declared with no PRIMARY KEY and no non-nullable UNIQUE KEY, InnoDB will automatically add a 6-byte (48-bit) integer column called ROW_ID to the table, and cluster the data based on that column. The column won’t be accessible to any queries, nor usable internally for anything such as row-based replication.

What the manual doesn’t mention is that all tables using such ROW_ID columns share the same global sequence counter (the manual says “increases monotonically” and doesn’t clarify), which is part of the data dictionary. The maximum used value for all row IDs (well, technically the next ID to be used) is stored in the system tablespace (e.g. ibdata1) in page 7 (type SYS), within the data dictionary header (field DICT_HDR_ROW_ID).

This global sequence counter is protected by dict_sys->mutex, even for incrementing (as opposed to using atomic increment). The implementation is in include/dict0boot.ic (many blank lines deleted):

    38  UNIV_INLINE
    39  row_id_t
    40  dict_sys_get_new_row_id(void)
    41  /*=========================*/
    42  {
    43          row_id_t        id;
    44  
    45          mutex_enter(&(dict_sys->mutex));
    47          id = dict_sys->row_id;
    49          if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
    51                  dict_hdr_flush_row_id();
    52          }
    54          dict_sys->row_id++;
    56          mutex_exit(&(dict_sys->mutex));
    57  
    58          return(id);
    59  }

(You may also notice that this code lacks any protection for overflowing the 48 bits allotted to row IDs. That is unnecessarily sloppy coding, but even at a continuous 1 million inserts per second [which is probably a bit optimistic ;)] it would take about 9 years to exhaust the ID space. I guess that’s okay.)
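
That estimate is easy enough to check:

id_space = 2**48                    # 281,474,976,710,656 possible row IDs
seconds  = id_space / 1_000_000.0   # at a sustained 1 million inserts/second
years    = seconds / (86_400 * 365)
puts years                          # => about 8.9 years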

Ensuring non-conflicting IDs are generated

The counter is flushed to disk every 256th ID generated (the define DICT_HDR_ROW_ID_WRITE_MARGIN above), by modifying the value in the SYS data dictionary page, which is logged to the transaction log. On startup, InnoDB will increase the DICT_HDR_ROW_ID stored on disk by at least 256, and at most 511. This ensures that any IDs generated will have been less than the new starting value, and thus there will not be any conflicts.
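
Here’s a Ruby sketch of that startup adjustment (my reconstruction of the logic described above, not InnoDB’s actual code):

WRITE_MARGIN = 256  # DICT_HDR_ROW_ID_WRITE_MARGIN

# Align the stored counter up to the next multiple of 256, then skip one
# more full margin. The result is 256-511 greater than the stored value,
# safely past any ID that was handed out but never flushed.
def next_safe_row_id(stored_row_id)
  aligned = ((stored_row_id + WRITE_MARGIN - 1) / WRITE_MARGIN) * WRITE_MARGIN
  aligned + WRITE_MARGIN
end

next_safe_row_id(512)  # => 768  (stored value already aligned: +256)
next_safe_row_id(700)  # => 1024 (stored value unaligned: +324)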

Performance and contention implications

Given how much other code within InnoDB is protected by dict_sys->mutex, I think it’s fair to say any tables with an implicit clustered key (ROW_ID) could expect to experience random insert stalls during operations like dropping (unrelated) tables. Parallel insertion into multiple tables with implicit keys could be performance-constrained, as it will be serialized on both the shared mutex and cache contention for the shared counter variable. Additionally, every 256th value generated will cause a log write (and flush) for the SYS page modification, regardless of whether the transaction has committed yet (or ever will).

InnoDB: A journey to the core: At the MySQL Conference

Next week is the Percona Live MySQL Conference and Expo 2013.

Davi Arnaut and I are co-presenting InnoDB: A journey to the core, based on my InnoDB blog series by the same name. We will (fairly quickly) cover InnoDB’s storage formats as described in those posts, but in an interactive format. There will be some new material that hasn’t been blogged yet (mostly stuff that is more difficult to explain or has been incompletely described in innodb_diagrams). Most importantly, Davi and I will be available for questions, and hopefully some of the InnoDB developers will stop by as well!

You might have seen my previous post about Julian Cash “white background” community photos at Percona Live MySQL Conference — Take a moment to help out by funding Julian’s photography at the conference, if you can! I’d really love to see a bunch of new MySQL community photos!

See you there!

InnoDB bugs found during research on InnoDB data storage

During the process of researching InnoDB’s storage formats and building the innodb_ruby and innodb_diagrams projects discussed in my series of InnoDB blog posts, Davi Arnaut and I found a number of InnoDB bugs. I thought I’d bring up a few of them, as they are fairly interesting.

These bugs were largely discoverable due to the innodb_space utility making important internal information visible in a way that it had never been visible in the past. Examining production tables with it provided many leads for tracking down the bugs responsible. When we initially looked at a graphical plot of free space by page produced from innodb_space data, we were quite surprised to see so many pages less than half filled (including many nearly empty). After much research we were able to track down all of the causes for the anomalies we discovered.

Bug #67718: InnoDB drastically under-fills pages in certain conditions

Due to overly aggressive attempts to optimize page splits based on insertion order, InnoDB could leave pages under-filled, with as few as one record in each page. This was observed in several production systems, in two cases which I believe could be quite common for others:

  1. Mostly-increasing keys — Twitter uses Snowflake for ID generation in a distributed way. Overall it’s quite nice. Snowflake generates 64-bit mostly-incrementing IDs that contain a timestamp component. Insertion is typically happening via queues and other non-immediate mechanisms, so IDs will find their way to the database slightly out of order.
  2. Nearly-ordered keys — Another schema has a PRIMARY KEY and a secondary key which are similarly, but not exactly, ordered. Copying data into the table in either key’s order results in insertion nearly ordered by the other key.

Both of these circumstances ended up tripping over this bug and causing drastically under-filled pages to appear in production databases, consuming large amounts of disk space.

Bug #67963: InnoDB wastes 62 out of every 16384 pages

InnoDB needs to occasionally allocate some internal bookkeeping pages; two for every 256 MiB of data. In order to do so, it allocates an extent (64 pages), allocates the two pages it needed, and then adds the remainder of the extent (62 free pages) to a list of extents to be used for single page allocations called FREE_FRAG. Almost nothing allocates pages from that list, so these pages go to waste.

This is fairly subtle, wasting only 0.37% of disk space in any large InnoDB table, but nonetheless interesting and quite fixable.

Bug #68023: InnoDB reserves an excessive amount of disk space for write operations

InnoDB attempts to ensure write operations will always succeed after they’ve reached a certain point by pre-reserving 1% of the tablespace size for the write operation. This is an excessive amount; 1% of every large table in a production system really adds up. This should be capped at some reasonable amount.

Bug #68501: InnoDB fails to merge under-filled pages depending on deletion order

Depending on the order that records are deleted from pages, InnoDB may not merge multiple adjacent under-filled pages together, wasting disk space.

Bug #68545: InnoDB should check left/right pages when target page is full to avoid splitting

During an insertion operation, only one of two outcomes is currently possible:

  1. The record fits in the target page and is inserted without splitting the page.
  2. The record does not fit in the target page, and the page is then split into two pages, each with half of the records of the original page. After the page is split, the insertion happens into one of the two resulting pages.

This misses a very common case in practice: the target page is full, but one or more of its adjacent pages have free space, or may even be nearly empty. A more intelligent alternative would be to consider merging the adjacent pages in order to make free space on the target page, rather than splitting the target page and creating a completely new half-full page.

Bug #68546: InnoDB stores unnecessary PKV fields in unique SK non-leaf pages

Non-leaf pages in Secondary Keys need a key that is guaranteed to be unique even though there may be many child pages with the same minimum key value. InnoDB adds all Primary Key fields to the key, but when the Secondary Key is already unique this is unnecessary. For systems with unique Secondary Keys and a large Primary Key, this can add up to a lot of disk space to store the unnecessary fields. Fixing this in a compatible way would be complex, and most users are unaffected, so I’d say it’s unlikely to be fixed.

Bug #68868: Documentation for InnoDB tablespace flags for file format incorrect

As I wrote in How InnoDB accidentally reserved only 1 bit for table format, InnoDB purportedly reserved 6 bits of a field for storing the table format (Antelope, Barracuda, etc.), but due to a bug in the C #defines, only 1 bit was actually reserved.

How InnoDB accidentally reserved only 1 bit for table format

The MySQL 5.5 (and 5.6) documentation says, in Identifying the File Format in Use:

“… Otherwise, the least significant bit should be set in the tablespace flags, and the file format identifier is written in the bits 5 through 11. …”

This is incorrect for any version due to a bug in how the tablespace flags were stored (which caused only 1 bit to be reserved, rather than 6). This was all re-worked in MySQL 5.6, so someone obviously noticed it, but the documentation has been left incorrect for all versions, and the incorrect and misleading code has been left in MySQL 5.5. I filed MySQL Bug #68868 about the documentation.

File formats and names

There are file format names in the documentation and code for values 0 through 25 (letters “A” through “Z”), although only 0 (“Antelope”) and 1 (“Barracuda”) are currently used. They are all defined in storage/innobase/trx/trx0sys.c:

    97  /** List of animal names representing file format. */
    98  static const char*      file_format_name_map[] = {
    99          "Antelope",
   100          "Barracuda",
   101          "Cheetah",
   102          "Dragon",
   103          "Elk",
   104          "Fox",
   105          "Gazelle",
   106          "Hornet",
   107          "Impala",
   108          "Jaguar",
   109          "Kangaroo",
   110          "Leopard",
   111          "Moose",
   112          "Nautilus",
   113          "Ocelot",
   114          "Porpoise",
   115          "Quail",
   116          "Rabbit",
   117          "Shark",
   118          "Tiger",
   119          "Urchin",
   120          "Viper",
   121          "Whale",
   122          "Xenops",
   123          "Yak",
   124          "Zebra"
   125  };

How only one bit was reserved

The code to store the file format identifier into an InnoDB tablespace file’s tablespace flags is in storage/innobase/include/dict0mem.h and follows, with my commentary.

The first bit is reserved for 1 = compact, 0 = redundant format:

    70  /** Table flags.  All unused bits must be 0. */
    71  /* @{ */
    72  #define DICT_TF_COMPACT                 1       /* Compact page format.
    73                                                  This must be set for
    74                                                  new file formats
    75                                                  (later than
    76                                                  DICT_TF_FORMAT_51). */

The next 4 bits are reserved for the compressed page size:

    78  /** Compressed page size (0=uncompressed, up to 15 compressed sizes) */
    79  /* @{ */
    80  #define DICT_TF_ZSSIZE_SHIFT            1
    81  #define DICT_TF_ZSSIZE_MASK             (15 << DICT_TF_ZSSIZE_SHIFT)
    82  #define DICT_TF_ZSSIZE_MAX (UNIV_PAGE_SIZE_SHIFT - PAGE_ZIP_MIN_SIZE_SHIFT + 1)
    83  /* @} */

Next we’re supposed to reserve 6 bits for the file format (up to 64 formats):

    85  /** File format */
    86  /* @{ */
    87  #define DICT_TF_FORMAT_SHIFT            5       /* file format */
    88  #define DICT_TF_FORMAT_MASK             \
    89  ((~(~0 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT))) << DICT_TF_FORMAT_SHIFT)

Two values are currently defined, which correspond to Antelope and Barracuda (with rather strange names “51” and “ZIP” as defined):

    90  #define DICT_TF_FORMAT_51               0       /*!< InnoDB/MySQL up to 5.1 */
    91  #define DICT_TF_FORMAT_ZIP              1       /*!< InnoDB plugin for 5.1:
    92                                                  compressed tables,
    93                                                  new BLOB treatment */
    94  /** Maximum supported file format */
    95  #define DICT_TF_FORMAT_MAX              DICT_TF_FORMAT_ZIP
    96
    97  /** Minimum supported file format */
    98  #define DICT_TF_FORMAT_MIN              DICT_TF_FORMAT_51

This is where things get interesting. It is not clear if DICT_TF_BITS (defined below) is supposed to represent the total number of flag bits (11 so far!), or the number of bits for the format above (6, but then shouldn’t it be called DICT_TF_FORMAT_BITS?). However, since 6 is larger than the number of non-format bits (5), and only 1 bit has actually been used for the format in practice (0..1), nothing will blow up here, and the #error check passes cleanly.

   100  /* @} */
   101  #define DICT_TF_BITS                    6       /*!< number of flag bits */
   102  #if (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
   103  # error "DICT_TF_BITS is insufficient for DICT_TF_FORMAT_MAX"
   104  #endif
   105  /* @} */

Also note that the #error there is easy enough to calculate. It works out to:

  1. (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
  2. (1 << (6 - 5)) <= 1
  3. (1 << 1) <= 1
  4. 2 <= 1
  5. FALSE

The “6 - 5” in the calculation above represents essentially the number of bits reserved for the table format flag, which turns out to be only 1.

The above defines go on to be used by DICT_TF2 (another set of flags) which currently only uses a single bit:

   107  /** @brief Additional table flags.
   108
   109  These flags will be stored in SYS_TABLES.MIX_LEN.  All unused flags
   110  will be written as 0.  The column may contain garbage for tables
   111  created with old versions of InnoDB that only implemented
   112  ROW_FORMAT=REDUNDANT. */
   113  /* @{ */
   114  #define DICT_TF2_SHIFT                  DICT_TF_BITS
   115                                                  /*!< Shift value for
   116                                                  table->flags. */
   117  #define DICT_TF2_TEMPORARY              1       /*!< TRUE for tables from
   118                                                  CREATE TEMPORARY TABLE. */
   119  #define DICT_TF2_BITS                   (DICT_TF2_SHIFT + 1)
   120                                                  /*!< Total number of bits
   121                                                  in table->flags. */
   122  /* @} */

It’s very easy to see here that if DICT_TF2_SHIFT is DICT_TF_BITS, which is 6, the DICT_TF2_TEMPORARY flag is being stored at 1 << 6, leaving only a single bit for the file format when it should be reserving 6 bits.

The end result of this is that the DICT_TF2_TEMPORARY bit is stored in a bit reserved for the table format, rather than after the table format. The DICT_TF2 flags seem to be stored only in the data dictionary, and never in the IBD file, so I guess this would only manifest once a later format (Cheetah, say) was implemented and a temporary table was created.
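
To make the overlap concrete, here is the bit arithmetic from the defines above, evaluated in Ruby (a quick sketch; the “intended” mask is the 6 bits starting at DICT_TF_FORMAT_SHIFT):

DICT_TF_FORMAT_SHIFT = 5
DICT_TF_BITS         = 6

# DICT_TF_FORMAT_MASK as the C preprocessor actually expands it:
format_mask = (~(~0 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT))) << DICT_TF_FORMAT_SHIFT
printf("actual format mask:   %011b\n", format_mask)    # 00000100000 (1 bit: bit 5)

# What was intended: 6 bits for the format, starting at bit 5.
intended_mask = 0b111111 << DICT_TF_FORMAT_SHIFT
printf("intended format mask: %011b\n", intended_mask)  # 11111100000 (bits 5-10)

# DICT_TF2_TEMPORARY is stored at 1 << DICT_TF2_SHIFT, i.e. 1 << 6:
tf2_temporary = 1 << DICT_TF_BITS
printf("DICT_TF2_TEMPORARY:   %011b\n", tf2_temporary)  # 00001000000 (bit 6)

(tf2_temporary & intended_mask) != 0  # => true: it collides with the format field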

Why this could happen

This code is unnecessarily complex and confusing, and to make matters worse it is inconsistent. There is no concise description of the fields being stored; only the code documents the structure, and since it is badly written, its value as documentation is low.

The bug is two-fold:

  1. There should be a DICT_TF_FORMAT_BITS define to capture the expected number of bits required to store the DICT_TF_FORMAT_* structure (dictionary, table flags, format) which is defined to 6, and that should be used in the masks associated with DICT_TF_FORMAT_*.
  2. The DICT_TF_BITS define should mean the total size of the DICT_TF structures (which precede the DICT_TF2 structures), and should be 1 + 4 + 6 = 11 bits, but it should be defined only by summing the sizes of the previous structures.

Because of the way this is written, it’s actually quite difficult to discern visually that a bug is present, so I am not surprised that it was not caught — however I am dismayed at the code quality and clarity, and that this passed any sort of code review.

Efficiently traversing InnoDB B+Trees with the page directory

[This post refers to innodb_ruby version 0.8.8 as of February 3, 2014.]

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages, the logical structure in B+Tree index structures in InnoDB, and the physical structure of records in The physical structure of records in InnoDB. Now we’ll look in detail at the “page directory” structure that has been seen several times already, but not yet described.

In this post, only COMPACT row format (for Barracuda table format) is considered.

The purpose of the page directory

As described in the posts mentioned above, all records in INDEX pages are linked together in a singly-linked list in ascending order. However, list traversal through a page with potentially several hundred records in it is very expensive: every record’s key must be compared, and this needs to be done at each level of the B+Tree until the record sought is found on a leaf page.

The page directory greatly optimizes this search by providing a fixed-width data structure with direct pointers to one of every 4-8 records, in order. Thus, it can be used for a traditional binary search of the records in each page, starting at the mid-point of the directory and progressively pruning the directory by half until only a single entry remains, and then linear-scanning from there. Since the directory is effectively an array, it can be traversed in either ascending or descending order, despite the records being linked in only ascending order.

The physical structure of the page directory

In The physical structure of InnoDB index pages, the page directory’s physical structure was briefly presented:

The structure is actually very simple. The number of slots (the page directory length) is specified in the first field of the INDEX header of the page. The page directory always contains an entry for the infimum and supremum system records (so the minimum size is 2 entries), and may contain 0 or more additional entries, one for each 4-8 additional records. A record is said to “own” another record if it represents it in the page directory. Each entry in the page directory “owns” the records between the previous entry in the directory, up to and including itself. The count of records “owned” by each record is stored in the record header that precedes each record.

The page-directory-summary mode of innodb_space can be used to view the page directory contents, in this case for a completely empty table (with the same schema as the 1 million row table used in A quick introduction to innodb_ruby), showing the minimum possible page directory:

$ innodb_space -f t_page_directory.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      1       

If we insert a single record, we can see that it gets owned by the record with a greater key than itself that has an entry in the page directory. In this case, supremum will own the record (as previously discussed, supremum represents a record higher than any possible key in the page):

$ innodb_space -f t_page_directory.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      2       

The infimum record always owns only itself, since no record can have a lower key. The supremum record always owns itself, but there is no minimum number of records it must own. Each additional entry in the page directory should own a minimum of 4 records (itself plus 3 others) and a maximum of 8 records (itself plus 7 others).

To illustrate, each record with an entry in the page directory (bolded) owns the records immediately prior to it in the singly-linked list (K = Key, O = Number of Records Owned):

Growth of the page directory

Once any page directory slot would exceed 8 records owned, the page directory is rebalanced to distribute the records into 4-record groups. If we insert 6 additional records into the table, supremum will now own a total of 8 records:

$ innodb_space -f t_page_directory.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      8       

The next insert will cause a re-organization:

$ innodb_space -f t_page_directory.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       191     conventional  4       
2       112     supremum      5       

Using a record describer with innodb_space can allow you to see the pointed-to record’s key for each entry in the directory, and I will use this describer for all future examples in this post:

$ innodb_space -f t_page_directory.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       191     conventional  4       (i=4)
2       112     supremum      5       

If a page is completely full, the page directory may look something like this one (now using the 1 million row table itself):

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 4 page-directory-summary

slot    offset  type          owned   key
0       99      infimum       1       
1       7297    conventional  5       (i=5)
2       5999    conventional  4       (i=9)
3       1841    conventional  5       (i=14)
4       14623   conventional  8       (i=22)
5       3029    conventional  4       (i=26)

<many lines omitted>

73      851     conventional  7       (i=420)
74      3183    conventional  6       (i=426)
75      1577    conventional  5       (i=431)
76      5405    conventional  5       (i=436)
77      455     conventional  5       (i=441)
78      112     supremum      6       

A logical view of the page directory

At a logical level, the page directory (and records) for a page with 24 records (with keys from 0 to 23) would look like this:

Take note that:

  • Records are singly linked from infimum to supremum through all 24 user records, as previously discussed.
  • Approximately each 4th record is entered into the page directory, represented in the illustration both by bolding that record and by noting its offset in the page directory array represented at the top of the illustration.
  • The page directory is stored “backwards” in the page, so is reversed in this illustration compared to its ordering on disk.

Efficiently searching using the B+Tree and page directory

Without the page directory, a large number of records would need to be compared in order to find the record being sought. Demonstrating actual code is probably the best way to show how efficient the B+Tree with its page directory can be. Using innodb_ruby, it is possible to search an actual InnoDB index, although it doesn’t have a nice command-line interface for doing so yet. Instead, irb, the interactive Ruby shell, can be used. (Note that this functionality in innodb_ruby is for illustrative and learning purposes only. It should not be considered for any serious use.)

An interactive shell can be set up similarly to the previous innodb_space commands’ configurations with:

$ irb -r rubygems -r innodb

irb> require "./simple_t_describer.rb"
irb> space = Innodb::Space.new("t.ibd")
irb> space.record_describer = SimpleTDescriber.new
irb> index = space.index(3)

Since we’re interested mostly in exploring here, debug output should be enabled so that the various index traversal operations can be seen:

irb> index.debug = true

The innodb_ruby library provides two methods for searching within the B+Tree:

  • index.linear_search(key) — Use only purely linear search on the singly-linked record lists to traverse the B+Tree. This is primarily intended as an inefficient counter-example to binary_search but is also useful to verify various algorithms (such as key comparison).
  • index.binary_search(key) — Use binary search on the page directory and linear search as appropriate in order to search efficiently. This is intended to mimic (although not exactly) InnoDB’s algorithm for efficient search.

Note that the key parameter to each method is an array of fields forming the key of the index (either primary key or secondary key).

Linear search

First, we’ll reset the internal statistics (counters) that the index tracks for debugging purposes:

irb> index.reset_stats

Next, initiate a linear search for key “10000” in our 1 million row table:

irb> index.linear_search([10000])

linear_search: root=3, level=2, key=(10000)
linear_search_from_cursor: page=3, level=2, start=(i=252)
linear_search_from_cursor: page=3, level=2, current=(i=252)
linear_search_from_cursor: page=36, level=1, start=(i=252)
linear_search_from_cursor: page=36, level=1, current=(i=252)
linear_search_from_cursor: page=36, level=1, current=(i=447)

<many lines omitted>

linear_search_from_cursor: page=36, level=1, current=(i=8930)
linear_search_from_cursor: page=36, level=1, current=(i=9381)
linear_search_from_cursor: page=36, level=1, current=(i=9830)
linear_search_from_cursor: page=424, level=0, start=(i=9830)
linear_search_from_cursor: page=424, level=0, current=(i=9830)
linear_search_from_cursor: page=424, level=0, current=(i=9831)

<many lines omitted>

linear_search_from_cursor: page=424, level=0, current=(i=9998)
linear_search_from_cursor: page=424, level=0, current=(i=9999)
linear_search_from_cursor: page=424, level=0, current=(i=10000)

I omitted many lines, but the full output can be seen in linear_search.txt. The basic algorithm is:

  1. Start at the root page of the index.
  2. Linear search from infimum until finding an individual record with the highest key that does not exceed the search key. If the current page is a leaf page, return the record. If the current page is a non-leaf page, load the child page this record points to, and return to step 2.
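
Expressed as a Ruby sketch (the method names here are illustrative, not innodb_ruby’s exact API), the algorithm looks like:

def linear_search(index, key)
  page = index.root
  loop do
    # Scan the singly-linked record list from infimum, stopping at the
    # record with the highest key that does not exceed the search key.
    # (Supremum has no key, which terminates the scan at page end.)
    rec = page.infimum
    rec = rec.next while rec.next.key && compare_key(rec.next.key, key) <= 0

    return rec if page.level == 0              # leaf page: done
    page = index.page(rec.child_page_number)   # non-leaf page: descend
  end
end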

We can check the stats that were collected:

irb> pp index.stats

{:linear_search=>1,
 :linear_search_from_cursor=>3,
 :linear_search_from_cursor_record_scans=>196,
 :compare_key=>589,
 :compare_key_field_comparison=>589}

So this has compared 589 records’ keys in order to find the key we were looking for. Not very efficient at all.

Binary search

Again, reset the stats:

irb> index.reset_stats

This time initiate a binary search using the page directory:

irb> index.binary_search([10000])

binary_search: root=3, level=2, key=(10000)
binary_search_by_directory: page=3, level=2, dir.size=1, dir[0]=()
linear_search_from_cursor: page=3, level=2, start=(i=252)
linear_search_from_cursor: page=3, level=2, current=(i=252)
binary_search_by_directory: page=36, level=1, dir.size=166, dir[82]=(i=258175)
binary_search_by_directory: page=36, level=1, dir.size=82, dir[40]=(i=122623)
binary_search_by_directory: page=36, level=1, dir.size=40, dir[19]=(i=52742)
binary_search_by_directory: page=36, level=1, dir.size=19, dir[9]=(i=20930)
binary_search_by_directory: page=36, level=1, dir.size=9, dir[4]=(i=8930)
binary_search_by_directory: page=36, level=1, dir.size=5, dir[2]=(i=12759)
binary_search_by_directory: page=36, level=1, dir.size=2, dir[0]=(i=8930)
linear_search_from_cursor: page=36, level=1, start=(i=8930)
linear_search_from_cursor: page=36, level=1, current=(i=8930)
linear_search_from_cursor: page=36, level=1, current=(i=9381)
linear_search_from_cursor: page=36, level=1, current=(i=9830)
binary_search_by_directory: page=424, level=0, dir.size=81, dir[40]=(i=10059)
binary_search_by_directory: page=424, level=0, dir.size=40, dir[19]=(i=9938)
binary_search_by_directory: page=424, level=0, dir.size=21, dir[10]=(i=9997)
binary_search_by_directory: page=424, level=0, dir.size=11, dir[5]=(i=10025)
binary_search_by_directory: page=424, level=0, dir.size=5, dir[2]=(i=10006)
binary_search_by_directory: page=424, level=0, dir.size=2, dir[0]=(i=9997)
linear_search_from_cursor: page=424, level=0, start=(i=9997)
linear_search_from_cursor: page=424, level=0, current=(i=9997)
linear_search_from_cursor: page=424, level=0, current=(i=9998)
linear_search_from_cursor: page=424, level=0, current=(i=9999)
linear_search_from_cursor: page=424, level=0, current=(i=10000)

That is the complete output. The algorithm here is only subtly different:

  1. Start at the root page of the index.
  2. Binary search using the page directory (repeatedly splitting the directory in half based on whether the current record is greater than or less than the search key) until a record is found via the page directory with the highest key that does not exceed the search key.
  3. Linear search from that record until finding an individual record with the highest key that does not exceed the search key. If the current page is a leaf page, return the record. If the current page is a non-leaf page, load the child page this record points to, and return to step 2.

In the above output you can see the directory size being repeatedly halved (dir.size), and the compared key (dir[x]) getting progressively nearer to the search key in the typical binary search pattern. In between binary searches you can see short linear searches once the nearest page directory entry is found (up to a maximum of 8 records).
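
The per-page portion of this can be sketched in Ruby as follows (again illustrative, not innodb_ruby’s exact code):

# dir is the page directory: an array of entries whose records are in
# ascending key order, with infimum first and supremum last.
def binary_search_page(page, key)
  dir = page.directory
  lo, hi = 0, dir.size - 1
  # Prune the directory by halves until a single entry remains: the one
  # pointing to the record with the highest key not exceeding the search key.
  while lo < hi
    mid = (lo + hi + 1) / 2
    if compare_key(dir[mid].key, key) <= 0
      lo = mid        # search key is at or after this entry
    else
      hi = mid - 1    # search key is before this entry
    end
  end
  # Finish with a short linear scan (at most 8 records) from the record
  # the winning directory entry points to.
  linear_search_from(dir[lo].record, key)
end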

The stats collected during the search also look a lot different:

irb> pp index.stats

{:binary_search=>1,
 :binary_search_by_directory=>14,
 :linear_search_from_cursor=>3,
 :linear_search_from_cursor_record_scans=>8,
 :compare_key=>40,
 :compare_key_field_comparison=>40,
 :binary_search_by_directory_recurse_left=>8,
 :binary_search_by_directory_recurse_right=>3,
 :binary_search_by_directory_linear_search=>2}

Especially notice that the compare_key operation is done only 40 times, compared to 589 times in the linear search. In terms of record comparisons, the binary search was 14x more efficient than the linear search (and this will vary quite a bit; depending on the exact value searched for, it could be 40x better).

The physical structure of records in InnoDB

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages, and the logical structure was described in B+Tree index structures in InnoDB. Now we’ll look in detail at the physical structure of records used in those pages.

In this post, only COMPACT row format (for Barracuda table format) is considered.

Record offsets

In previous posts, record offsets have come up in many structures that need to “point” to records. Record offsets point to the start of the record data itself, which is variable-length, but each record is preceded by a record header, which is also variable-length. In this post and its illustrations, I use N to mean the start of the record: the record data is at N and uses positive offsets (e.g. N+1), while the record header uses negative offsets (e.g. N-1). InnoDB often refers to the start of the record data, location N, as the “origin”.

The record header

In previous posts, the record header was mentioned a few times, but hasn’t been fully described. As above, the record header precedes the record itself, and is structured as follows:

The fields in the record header are (in order, backwards from N):

  1. Next Record Offset: A relative offset from the current record to the origin of the next record within the page in ascending order by key.
  2. Record Type: The type of the record, where currently only 4 values are supported: conventional (0), node pointer (1), infimum (2), and supremum (3).
  3. Order: The order in which this record was inserted into the heap. Heap records (which include infimum and supremum) are numbered from 0. Infimum is always order 0, supremum is always order 1. User records inserted will be numbered from 2.
  4. Number of Records Owned: The number of records “owned” by the current record in the page directory. This field will be further discussed in a future post about the page directory.
  5. Info Flags: A 4-bit bitmap to store boolean flags about this record. Currently only two flags are defined: min_rec (1) meaning this record is the minimum record in a non-leaf level of the B+Tree, and deleted (2) meaning the record is delete-marked (and will be actually deleted by a purge operation in the future).
  6. Nullable field bitmap (optional): A single bit per nullable field to store whether the field is NULL, rounded up to a whole number of bytes. If a field is NULL its field value will be eliminated from the key or row portion of the record. If no fields are nullable, this bitmap is absent.
  7. Variable field lengths array (optional): An array of 8-bit or 16-bit integers (depending on the maximum size of the field) per variable-length field to store the length of the field data for that field. If no fields are variable-length, this array is absent.

The record header is a minimum of 5 bytes per row, but for records with many variable-width fields, it could be a lot larger.
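
The fixed portion of the header can be unpacked with a short Ruby sketch (following the same bit layout innodb_ruby uses; bytes holds the five bytes from N-5 to N-1):

def parse_compact_record_header(bytes)
  flags_owned, bits_hi, bits_lo, next_hi, next_lo = bytes.unpack("C5")
  bits = (bits_hi << 8) | bits_lo
  next_rel = (next_hi << 8) | next_lo
  next_rel -= 0x10000 if next_rel >= 0x8000  # signed 16-bit relative offset
  {
    info_flags:  (flags_owned & 0xf0) >> 4,  # min_rec (1), deleted (2)
    n_owned:      flags_owned & 0x0f,        # records owned in the page directory
    heap_number: (bits & 0xfff8) >> 3,       # insertion order into the page heap
    type:         bits & 0x07,               # 0=conventional, 1=node pointer, 2=infimum, 3=supremum
    next:         next_rel                   # offset of next record, relative to this origin
  }
end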

Clustered indexes

The clustered key (PRIMARY KEY) has one of the more complex record structures:

The following fields are included in the record data:

  1. Cluster Key Fields: The cluster key fields, concatenated together (literally). InnoDB just concatenates the raw bytes of its internal storage formats per column type together into a single byte stream.
  2. Transaction ID: The 48-bit integer transaction ID of the transaction which last modified this record.
  3. Roll Pointer: A structure containing information about the location in the rollback segment of the undo records for the transaction which last modified this record. The fields of the roll pointer structure are: 1-bit “is insert” flag, 7-bit rollback segment ID, 4-byte page number and 2-byte page offset of the undo log location.
  4. Non-Key Fields: All non-key fields (the actual row data for all non-PRIMARY KEY fields) concatenated together into a single byte stream.

The structure of non-leaf page records is similar, but somewhat simpler:

Since non-leaf pages are not subject to MVCC, the Transaction ID and Roll Pointer fields are eliminated. Instead of the non-key fields, the child page number this node pointer points to is included. Since the cluster key cannot be NULL, the nullable field bitmap is absent.

Secondary indexes

Secondary indexes in InnoDB have an identical overall structure to the clustered key (PRIMARY KEY) but instead of containing non-key fields, they contain the clustered key fields, also known as a Primary Key Value or most commonly, PKV. If any fields overlap between the secondary key and the clustered key, the overlapping fields are removed from the clustered key stored in the secondary index records. For example, if a table has a PRIMARY KEY (a, b, c) and a secondary index KEY (a, d), the secondary key in the index will be as expected, (a, d) but the PKVs will contain only (b, c).
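
That elision rule is trivial to express in Ruby (a hypothetical helper, not innodb_ruby API):

# Fields already present in the secondary key are dropped from the
# stored PKV; the remaining primary key fields keep their order.
def pkv_fields(primary_key_fields, secondary_key_fields)
  primary_key_fields - secondary_key_fields
end

pkv_fields(%w[a b c], %w[a d])  # => ["b", "c"]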

Since secondary keys are allowed to include fields that are both non-unique and nullable, both the variable field lengths array and the nullable field bitmap may be present if necessary. Otherwise the leaf page structure is very simple:

The secondary key fields are concatenated together into a single byte stream as with the clustered key. The clustered key fields are concatenated together in exactly the same way to form the PKV.

The secondary index non-leaf pages will also look familiar:

There is one thing of note for secondary index non-leaf pages: the clustered key fields (PKV) are included in the record and are considered part of the record’s key, not its value. Secondary indexes may be non-unique, but each record in the page must have a unique identifier, so the PKV must be included in the record to ensure uniqueness. This means that records in non-leaf pages of secondary indexes will be 4 bytes larger than their leaf page counterparts, due to the 4-byte child page number.

An aside on per-row overhead

Looking at the above illustrations, you can easily calculate the per-row overhead InnoDB requires. Clustered key leaf pages require a minimum of 5 bytes for the record header, 6 bytes for the Transaction ID, and 7 bytes for the Roll Pointer, totaling 18 bytes per row. For very narrow tables (for example 2-3 integers, only 8-12 bytes of actual data) this overhead can be quite high.

In addition, there is substantial per-page overhead, and waste in inefficiently filling pages which can consume large amounts of space (for instance pages may be half-filled).

What’s next?

In the next post I will describe the page directory and its purpose in efficient record retrieval.

Update: I had incorrectly stated that the nullable field bitmap would not be present on clustered key leaf pages, but it is in fact present if any non-key fields are nullable. It is always absent on clustered key non-leaf pages because the clustered key must be NOT NULL.

B+Tree index structures in InnoDB

[This post refers to innodb_ruby version 0.8.8 as of February 3, 2014.]

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.

An aside on terminology: B+Tree, root, leaf, and level

InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.

Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.

InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.

Leaf and non-leaf pages

For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contains a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.

Leaf pages contain the non-key values as part of the “data” contained in each record:

Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:

Pages at the same level

Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:

Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).

A detailed look at a single-page table

Let’s take a look at most of what’s B+Tree related in a single index page:

Create and populate the table

The test table in use in the illustration above can be created and populated with (make sure you’re using innodb_file_per_table and the Barracuda file format):

CREATE TABLE t_btree (
  i INT NOT NULL,
  s CHAR(10) NOT NULL,
  PRIMARY KEY(i)
) ENGINE=InnoDB;

INSERT INTO t_btree (i, s)
  VALUES (0, "A"), (1, "B"), (2, "C");

While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal works.

Verify the basic structure of the space file

The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.

$ innodb_space -f t_btree.ibd space-page-type-regions
start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           FREE (ALLOCATED)    

The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:

$ innodb_space -f t_btree.ibd space-index-pages-summary
page        index   level   data    free    records 
3           18      0       96      16156   3       
4           0       0       0       16384   0       
5           0       0       0       16384   0       

(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)

The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:

$ innodb_space -f t_btree.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
18          3           internal    1           1           100.00%     
18          3           leaf        0           0           0.00%       

Set up a record describer

In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:

class SimpleTBTreeDescriber < Innodb::RecordDescriber
  type :clustered
  key "i", :INT, :NOT_NULL
  row "s", "CHAR(10)", :NOT_NULL
end

We need to note that this is the clustered key, provide the column descriptions for the key, and the column descriptions for the non-key (“row”) fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:

-r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber

Look at the record contents

The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 3 page-dump

Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>157,
   :type=>:conventional,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>false,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0, 0, 0, 0],
   :field_externs=>[false, false, false, false]},
 :next=>157,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
 :transaction_id=>"0000000f4745",
 :roll_pointer=>
  {:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
 :row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}

This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:

  • The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
  • The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
  • The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
  • The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.

Recurse the index

A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 3 index-recurse
ROOT NODE #3: 3 records, 96 bytes
  RECORD: (i=0) -> (s=A)
  RECORD: (i=1) -> (s=B)
  RECORD: (i=2) -> (s=C)

Building a non-trivial index tree

A multi-level index tree (overly simplified) in InnoDB looks like:

As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain “pointers” (containing the child page number) rather than non-key row data.

If we use the simpler table schema with 1 million rows created in A quick introduction to innodb_ruby, the tree structure looks a little more interesting:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 3 index-recurse

ROOT NODE #3: 2 records, 26 bytes
  NODE POINTER RECORD >= (i=252) -> #36
  INTERNAL NODE #36: 1117 records, 14521 bytes
    NODE POINTER RECORD >= (i=252) -> #4
    LEAF NODE #4: 446 records, 9812 bytes
      RECORD: (i=1) -> ()
      RECORD: (i=2) -> ()
      RECORD: (i=3) -> ()
      RECORD: (i=4) -> ()

<many lines omitted>

    NODE POINTER RECORD >= (i=447) -> #1676
    LEAF NODE #1676: 444 records, 9768 bytes
      RECORD: (i=447) -> ()
      RECORD: (i=448) -> ()
      RECORD: (i=449) -> ()
      RECORD: (i=450) -> ()

<many lines omitted>

    NODE POINTER RECORD >= (i=891) -> #771
    LEAF NODE #771: 512 records, 11264 bytes
      RECORD: (i=891) -> ()
      RECORD: (i=892) -> ()
      RECORD: (i=893) -> ()
      RECORD: (i=894) -> ()

This is a three-level index tree, which can be seen by the ROOT, INTERNAL, and LEAF lines above. We can see that some pages are nearly full, such as internal node #36, whose 1117 records consume almost 15 KiB of the 16 KiB page.

Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 36 page-dump

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>11877,
   :type=>:node_pointer,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>true,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0],
   :field_externs=>[false]},
 :next=>11877,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
 :child_page_number=>4}

The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.

The root page is a bit special

Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never be relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.

However, the root page itself can’t actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called “node pointers”), which in practice often means several hundred to more than a thousand.

B+Tree levels and increasing tree depth

As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:

Height  Non-leaf pages  Leaf pages      Rows            Size in bytes
1       0               1               468             16.0 KiB
2       1               1,203           > 563 thousand  18.8 MiB
3       1,204           1,447,209       > 677 million   22.1 GiB
4       1,448,413       1,740,992,427   > 814 billion   25.9 TiB

As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.
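
The figures in that table are easy to reproduce (a sketch using the 468 records-per-leaf-page and 1203 pointers-per-non-leaf-page figures from above; the total size is simply the page count times 16 KiB):

LEAF_RECORDS = 468   # records per leaf page
FANOUT       = 1203  # node pointers per non-leaf page

(1..4).each do |height|
  leaf_pages     = FANOUT**(height - 1)
  non_leaf_pages = height == 1 ? 0 : (0..height - 2).map { |l| FANOUT**l }.sum
  rows           = leaf_pages * LEAF_RECORDS
  puts format("height %d: %d non-leaf + %d leaf pages, %d rows",
              height, non_leaf_pages, leaf_pages, rows)
end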

What’s next?

Next we’ll take a look at the page directory structure in INDEX pages, which has been mentioned several times already, and then look at how to search efficiently within InnoDB indexes.