Idea: A “system” localization for MySQL

Currently the English error messages are embedded in all of the tests in MySQL. This means that you can’t really update the English translations without breaking a bunch of tests. I’m not sure if there’s a standard way to fix this, but it occurs to me that it would be quite easy to have a “system” localization which just prints a language-neutral version of the error, meaning that any version of it can be updated without breaking any tests.

For example, the following simple syntax error gives a message in English:

mysql> select foo;
ERROR 1054 (42S22): Unknown column 'foo' in 'field list'

This is based on the following definition in the errmsg-utf8.txt file:

ER_BAD_FIELD_ERROR 42S22 S0022
        eng "Unknown column '%-.192s' in '%-.192s'"

In a test case this might be codified as:

--echo # Test that ER_BAD_FIELD_ERROR works.
--error ER_BAD_FIELD_ERROR
SELECT foo;

The --error allows the error to be ignored (as an expected error); however the text of the English error message still ends up in the test result file:

# Test that ER_BAD_FIELD_ERROR works.
SELECT foo;
ERROR 42S22: Unknown column 'foo' in 'field list'

Changing the text of the message even in a trivial way (fixing a typo) will cause the test to fail due to a mismatch on the error message string, since the result files are just compared as text when running tests:

main.test_message                        [ fail ]
        Test ended at 2013-04-08 17:29:30

CURRENT_TEST: main.test_message
--- mysql-test/r/test_message.result	2013-04-09 03:26:27.516721785 +0300
+++ mysql-test/r/test_message.reject	2013-04-09 03:29:30.360718783 +0300
@@ -1,3 +1,3 @@
 # Test that ER_BAD_FIELD_ERROR works.
 SELECT foo;
-ERROR 42S22: Unknown column 'foo' in 'field list'
+ERROR 42S22: Unknown column 'foo' found in 'field list'

mysqltest: Result length mismatch

A sys “language” could easily be added, however:

ER_BAD_FIELD_ERROR 42S22 S0022
        sys "ER_BAD_FIELD_ERROR({%-.192s}, {%-.192s})"
        eng "Unknown column '%-.192s' in '%-.192s'"

Ideally, these could of course be auto-generated based on all the context present already. When running with this localization the same error would result in:

mysql> select foo;
ERROR 1054 (42S22): ER_BAD_FIELD_ERROR({foo}, {field list})

Thus preserving the language-neutrality of the tests and allowing the English versions of them to be tweaked for better readability without breaking the world.

This would of course require one massive commit to fix the tests when changing the language the tests run under to the new “sys” language…

What do you think? How do other systems (especially databases) handle this?

Regression in MySQL server localization from 5.0 to 5.6

MySQL server has supported localization of error messages since the very beginning, and its implementation has gone through a few revisions:

  • Through 4.1, a separate language/errmsg.txt file for each language, with one message per line, and each file in a language-appropriate character set.
  • In 5.0 and 5.1, a single errmsg.txt file with a group of translations for each message, but with different character sets for each language (making the file very difficult to edit with any editor).
  • In 5.5 and 5.6, a single errmsg-utf8.txt file with the same structure as in 5.0 and 5.1, but with all messages in UTF-8 (whew!).

In the early days, folks at MySQL tried to translate all of the error messages fairly frequently, keeping most localizations relatively up to date. Many volunteers also translated the messages into their favorite language and contributed those files.

In recent years, however, the vast majority of error messages added to the message file are in English only, or at most English and one other language1 (presumably the author’s native language, often German). While the number of unique error messages in English has increased from 481 to 862 between 5.0 to 5.6, all other translations with the exception of German, Swedish, and Japanese2 have been almost entirely stagnant.

This chart shows the percentage of translated messages available by MySQL version from 5.0 through 5.6:

This chart shows a few of the major supported languages, showing the utter stagnation of translated messages for all languages since 5.0, and for German since 5.5:

(Click on the graphs to see the Google Docs Spreadsheet, which is rendered a lot better than its image export.)

The best-supported language (after English) is German, but even it has fallen from 94% translated in MySQL 5.0 to only 77% translated in MySQL 5.6. Swedish, which was once one of the sacred translations, has fallen from 53% translated in 5.0 to only 39% in 5.6. Eliminating English and German (as high outliers) and Bulgarian (as a low outlier), the average translation completeness in MySQL 5.6 is less than 25%.

Is it useful to actually have multiple language support if it is this woefully incomplete3? For most of these languages, even if the user goes to the trouble to enable their alternate language, 75% on average of the messages they see will be in English. Is that really any better than 100%?

Have Oracle given up on maintaining the error message translations? Would community effort to get them all updated be welcome? Would it be useful to rip out this mess and start over with a more standardized and mature localization framework?

1 Bizarrely, in MySQL 5.6, Georgi Kodinov added Bulgarian as a supported language, with exactly one translated message supported.

2 It appears that Japanese got a major overhaul by Yasufumi Kinoshita, removing the unused “jps” variant and adding and adding a bunch more translations to the “jpn” variant. Alas, it is still quite incomplete at only 34% translated in 5.6.

3 Leaving aside the any discussion about the way that languages are implemented in MySQL currently, which is not awesome.

How InnoDB accidentally reserved only 1 bit for table format

The MySQL 5.5 (and 5.6) documentation says, in Identifying the File Format in Use:

“… Otherwise, the least significant bit should be set in the tablespace flags, and the file format identifier is written in the bits 5 through 11. …”

This is incorrect for any version due to a bug in how the tablespace flags were stored (which caused only 1 bit to be reserved, rather than 6). This was all re-worked in MySQL 5.6, so someone obviously noticed it, but the documentation has been left incorrect for all versions, and the incorrect and misleading code has been left in MySQL 5.5. I filed MySQL Bug #68868 about the documentation.

File formats and names

There are file format names in the documentation and code for values 0 through 25 (letters “A” through “Z”), although only 0 (“Antelope”) and 1 (“Barracuda”) are currently used. They are all defined in storage/innobase/trx/trx0sys.c:

    97  /** List of animal names representing file format. */
    98  static const char*      file_format_name_map[] = {
    99          "Antelope",
   100          "Barracuda",
   101          "Cheetah",
   102          "Dragon",
   103          "Elk",
   104          "Fox",
   105          "Gazelle",
   106          "Hornet",
   107          "Impala",
   108          "Jaguar",
   109          "Kangaroo",
   110          "Leopard",
   111          "Moose",
   112          "Nautilus",
   113          "Ocelot",
   114          "Porpoise",
   115          "Quail",
   116          "Rabbit",
   117          "Shark",
   118          "Tiger",
   119          "Urchin",
   120          "Viper",
   121          "Whale",
   122          "Xenops",
   123          "Yak",
   124          "Zebra"
   125  };

How only one bit was reserved

The code to store the file format identifier into an InnoDB tablespace file’s tablespace flags is in storage/innobase/include/dict0mem.h and follows, with my commentary.

The first bit is reserved for 1 = compact, 0 = redundant format:

    70  /** Table flags.  All unused bits must be 0. */
    71  /* @{ */
    72  #define DICT_TF_COMPACT                 1       /* Compact page format.
    73                                                  This must be set for
    74                                                  new file formats
    75                                                  (later than
    76                                                  DICT_TF_FORMAT_51). */

The next 4 bits are reserved for the compressed page size:

    78  /** Compressed page size (0=uncompressed, up to 15 compressed sizes) */
    79  /* @{ */
    80  #define DICT_TF_ZSSIZE_SHIFT            1
    81  #define DICT_TF_ZSSIZE_MASK             (15 << DICT_TF_ZSSIZE_SHIFT)
    82  #define DICT_TF_ZSSIZE_MAX (UNIV_PAGE_SIZE_SHIFT - PAGE_ZIP_MIN_SIZE_SHIFT + 1)
    83  /* @} */

Next we’re supposed to reserve 6 bits for the file format (up to 64 formats):

    85  /** File format */
    86  /* @{ */
    87  #define DICT_TF_FORMAT_SHIFT            5       /* file format */
    88  #define DICT_TF_FORMAT_MASK             \
    89  ((~(~0 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT))) << DICT_TF_FORMAT_SHIFT)

Two values are currently defined, which correspond to Antelope and Barracuda (with rather strange names “51″ and “ZIP” as defined):

    90  #define DICT_TF_FORMAT_51               0       /*!< InnoDB/MySQL up to 5.1 */
    91  #define DICT_TF_FORMAT_ZIP              1       /*!< InnoDB plugin for 5.1:
    92                                                  compressed tables,
    93                                                  new BLOB treatment */
    94  /** Maximum supported file format */
    95  #define DICT_TF_FORMAT_MAX              DICT_TF_FORMAT_ZIP
    96
    97  /** Minimum supported file format */
    98  #define DICT_TF_FORMAT_MIN              DICT_TF_FORMAT_51

This is where things get interesting. It is not clear if DICT_TF_BITS (defined below) is supposed to represent the total number of flag bits (11 so far!), or the number of bits for the format above (6, but then shouldn’t it be called DICT_TF_FORMAT_BITS?). However since 6 is larger than the non­-format related bits (5), and only 1 bit has actually been used for format in practice (0..1), nothing will blow up here, and the #error check passes cleanly.

   100  /* @} */
   101  #define DICT_TF_BITS                    6       /*!< number of flag bits */
   102  #if (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
   103  # error "DICT_TF_BITS is insufficient for DICT_TF_FORMAT_MAX"
   104  #endif
   105  /* @} */

Also note that the #error there is easy enough to calculate. It works out to:

  1. (1 << (DICT_TF_BITS - DICT_TF_FORMAT_SHIFT)) <= DICT_TF_FORMAT_MAX
  2. (1 << (6 - 5)) <= 1
  3. (1 << 1) <= 1
  4. 2 <= 1
  5. FALSE

The “6 - 5” in the calculation above represents essentially the number of bits reserved for the table format flag, which turns out to be only 1.

The above defines go on to be used by DICT_TF2 (another set of flags) which currently only uses a single bit:

   107  /** @brief Additional table flags.
   108
   109  These flags will be stored in SYS_TABLES.MIX_LEN.  All unused flags
   110  will be written as 0.  The column may contain garbage for tables
   111  created with old versions of InnoDB that only implemented
   112  ROW_FORMAT=REDUNDANT. */
   113  /* @{ */
   114  #define DICT_TF2_SHIFT                  DICT_TF_BITS
   115                                                  /*!flags. */
   117  #define DICT_TF2_TEMPORARY              1       /*!< TRUE for tables from
   118                                                  CREATE TEMPORARY TABLE. */
   119  #define DICT_TF2_BITS                   (DICT_TF2_SHIFT + 1)
   120                                                  /*!flags. */
   122  /* @} */

It’s very easy to see here that if DICT_TF2_SHIFT is DICT_TF_BITS, which is 6, the DICT_TF2_TEMPORARY flag is being stored at 1 << 6, which is only leaving the file format a single bit, when it should be reserving 6 bits.

The end result of this is that the DICT_TF2_TEMPORARY bit is being stored into a bit reserved for the table format, rather than after the table format. The DICT_TF2 stuff seems to only be stored in the data dictionary, and never in the IBD file, so this would I guess manifest when Cheetah would be implemented and a temporary table is created.

Why this could happen

This code is unnecessarily complex and confusing, and to make matters worse it is inconsistent. There is no concise description of the fields being stored; only the code documents the structure, and since it is badly written, its value as documentation is low.

The bug is two­-fold:

  1. There should be a DICT_TF_FORMAT_BITS define to capture the expected number of bits required to store the DICT_TF_FORMAT_* structure (dictionary, table flags, format) which is defined to 6, and that should be used in the masks associated with DICT_TF_FORMAT_*.
  2. The DICT_TF_BITS define should mean the total size of the DICT_TF structures (which precede the DICT_TF2 structures obviously), and should be 1 + 4 + 6 = 11 bits, but this should be defined only by summing the previous structures sizes.

Because of the way this is written, it’s actually quite difficult to discern that there is a bug present visually, so I am not surprised that this was not caught — however I am dismayed about the code quality and clarity, and that this passes any sort of code review.

Efficiently traversing InnoDB B+Trees with the page directory

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages, and the logical structure was described in B+Tree index structures in InnoDB, and the physical structure of records was described in The physical structure of records in InnoDB. Now we’ll look in detail at the “page directory” structure that has been seen several times already, but not yet described.

In this post, only COMPACT row format (for Barracuda table format) is considered.

(Note that if you’ve previously installed innodb_ruby, you should upgrade to at least version 0.7.10 for the examples in this post.)

The purpose of the page directory

As described in the posts mentioned above, all records in INDEX pages are linked together in a singly-linked list in ascending order. However, list traversal through a page with potentially several hundred records in it is very expensive: every record’s key must be compared, and this needs to be done at each level of the B+Tree until the record sought is found on a leaf page.

The page directory greatly optimizes this search by providing a fixed-width data structure with direct pointers to 1 of every 4-8 records, in order. Thus, it can be used for a traditional binary search of the records in each page, starting at the mid-point of the directory and progressively pruning the directory by half until only a single entry remains, and then linear-scanning from there. Since the directory is effectively an array, it can be traversed in either ascending or descending order, despite the records being linked in only ascending order.

The physical structure of the page directory

In The physical structure of InnoDB index pages, the page directory’s physical structure was briefly presented:

The structure is actually very simple. The number of slots (the page directory length) is specified in the first field of the INDEX header of the page. The page directory always contains an entry for the infimum and supremum system records (so the minimum size is 2 entries), and may contain 0 or more additional entries, one for each 4-8 system records. A record is said to “own” another record if it represents it in the page directory. Each entry in the page directory “owns” the records between the previous entry in the directory, up to and including itself. The count of records “owned” by each record is stored in the record header that precedes each record.

The page-directory-summary mode of innodb_space can be used to view the page directory contents, in this case for a completely empty table, showing the minimum possible page directory:

$ innodb_space -f test/t.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      1       

If we insert a single record, we can see that it gets owned by the record with a greater key than itself that has an entry in the page directory. In this case, supremum will own the record (as previously discussed, supremum represents a record higher than any possible key in the page):

$ innodb_space -f test/t.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      2       

The infimum record always owns only itself, since no record can have a lower key. The supremum record always owns itself, but has no minimum record ownership. Each additional entry in the page directory should own a minimum of 4 records (itself plus 3 others) and a maximum of 8 records (itself plus 7 others).

To illustrate, each record with an entry in the page directory (bolded) owns the records immediately prior to it in the singly-linked list (K = Key, O = Number of Records Owned):

Growth of the page directory

Once any page directory slot would exceed 8 records owned, the page directory is rebalanced to distribute the records into 4-record groups. If we insert 6 additional records into the table, supremum will now own a total of 8 records:

$ innodb_space -f test/t.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       112     supremum      8       

The next insert will cause a re-organization:

$ innodb_space -f test/t.ibd -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       221     conventional  4       
2       112     supremum      5       

Using a record describer with innodb_space can allow you to see the pointed-to record’s key for each entry in the directory, and I will use this describer for all future examples in this post:

$ innodb_space -f test/t.ibd -r ./describer_test.rb -d Test -p 3 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       221     conventional  4       (4)
2       112     supremum      5       

If a page is completely full, the page directory may look something like this one:

$ innodb_space -f test/t.ibd -r ./describer_test.rb -d Test -p 5 page-directory-summary
slot    offset  type          owned   key
0       99      infimum       1       
1       221     conventional  4       (238)
2       349     conventional  4       (242)
3       477     conventional  4       (246)
4       605     conventional  4       (250)
5       733     conventional  4       (254)

<many lines omitted>

112     14429   conventional  4       (682)
113     14557   conventional  4       (686)
114     14685   conventional  4       (690)
115     14813   conventional  4       (694)
116     14941   conventional  4       (698)
117     112     supremum      5       

A logical view of the page directory

At a logical level, the page directory (and records) for a page with 24 records (with keys from 0 to 23) would look like this:

Take note that:

  • Records are singly linked from infimum to supremum through all 24 user records, as previously discussed.
  • Approximately each 4th record is entered into the page directory, represented in the illustration both by bolding that record and by noting its offset in the page directory array represented at the top of the illustration.
  • The page directory is stored “backwards” in the page, so is reversed in this illustration compared to its ordering on disk.

Efficiently searching using the B+Tree and page directory

Without the page directory, a large number of records would need to be compared in order to find the record being sought. Demonstrating actual code is probably the best way to prove how efficient the B+Tree with page directory can be. Using innodb_ruby, it is possible to search an actual InnoDB index, although it doesn’t have a nice command-line interface for doing so yet. Instead, irb, the interactive Ruby shell can be used. (Note that this functionality in innodb_ruby is for illustrative and learning purposes only. It should not be considered for any serious use.)

An interactive shell can be set up similarly to the previous innodb_space commands’ configurations with:

$ irb -rrubygems -rinnodb
irb> require "./describer_test.rb"
irb> space = Innodb::Space.new("test/t.ibd")
irb> space.record_describer = Innodb::RecordDescriber::Test
irb> index = space.index(3)

Since we’re interested mostly in exploring here, debug output should be enabled so that the various index traversal operations can be seen:

irb> index.debug = true

The innodb_ruby library provides two methods for searching within the B+Tree:

  • index.linear_search(key) — Use only purely linear search on the singly-linked record lists to traverse the B+Tree. This is primarily intended as an inefficient counter-example to binary_search but is also useful to verify various algorithms (such as key comparison).
  • index.binary_search(key) — Use binary search on the page directory and linear search as appropriate in order to search efficiently. This is intended to mimic (although not exactly) InnoDB’s algorithm for efficient search.

Note that the key parameter to each method is an array of fields forming the key of the index (either primary key or secondary key).

Linear search

First, we’ll reset the internal statistics (counters) that the index tracks for debugging purposes:

irb> index.reset_stats

Next, initiate a linear search for key “10000″ in our 1 million row table:

irb> index.linear_search([10000])
linear_search: root=3, level=2, key=(10000)
linear_search_from_cursor: page=3, level=2, start=(1)
linear_search_from_cursor: page=3, level=2, current=(1)
linear_search_from_cursor: page=36, level=1, start=(1)
linear_search_from_cursor: page=36, level=1, current=(1)
linear_search_from_cursor: page=36, level=1, current=(235)
linear_search_from_cursor: page=36, level=1, current=(703)

<16 lines omitted>

linear_search_from_cursor: page=36, level=1, current=(8659)
linear_search_from_cursor: page=36, level=1, current=(9127)
linear_search_from_cursor: page=36, level=1, current=(9595)
linear_search_from_cursor: page=25, level=0, start=(9595)
linear_search_from_cursor: page=25, level=0, current=(9595)
linear_search_from_cursor: page=25, level=0, current=(9596)
linear_search_from_cursor: page=25, level=0, current=(9597)

<400 lines omitted>

linear_search_from_cursor: page=25, level=0, current=(9998)
linear_search_from_cursor: page=25, level=0, current=(9999)
linear_search_from_cursor: page=25, level=0, current=(10000)

I omitted many lines, but the full output can be seen in linear_search.txt. The basic algorithm is:

  1. Start at the root page of the index.
  2. Linear search from infimum until finding an individual record with the highest key that does not exceed the search key. If the current page is a leaf page, return the record. If the current page is a non-leaf page, load the child page this record points to, and return to step 2.

We can check the stats that were collected:

irb> pp index.stats
{:linear_search=>1,
 :linear_search_from_cursor=>3,
 :linear_search_from_cursor_record_scans=>429,
 :compare_key=>1288,
 :compare_key_field_comparison=>1288}

So this has compared 1,288 records’ keys in order to find the key we were looking for. Not very efficient at all.

Binary search

Again, reset the stats:

irb> index.reset_stats

This time initiate a binary search using the page directory:

irb> index.binary_search([10000])
binary_search: root=3, level=2, key=(10000)
binary_search_by_directory: page=3, level=2, dir.size=1, dir[0]=()
linear_search_from_cursor: page=3, level=2, start=(1)
linear_search_from_cursor: page=3, level=2, current=(1)
binary_search_by_directory: page=36, level=1, dir.size=151, dir[75]=(139699)
binary_search_by_directory: page=36, level=1, dir.size=75, dir[37]=(68563)
binary_search_by_directory: page=36, level=1, dir.size=37, dir[18]=(32995)
binary_search_by_directory: page=36, level=1, dir.size=18, dir[8]=(14275)
binary_search_by_directory: page=36, level=1, dir.size=8, dir[3]=(4915)
binary_search_by_directory: page=36, level=1, dir.size=5, dir[2]=(8659)
binary_search_by_directory: page=36, level=1, dir.size=3, dir[1]=(10531)
binary_search_by_directory: page=36, level=1, dir.size=1, dir[0]=(8659)
linear_search_from_cursor: page=36, level=1, start=(8659)
linear_search_from_cursor: page=36, level=1, current=(8659)
linear_search_from_cursor: page=36, level=1, current=(9127)
linear_search_from_cursor: page=36, level=1, current=(9595)
binary_search_by_directory: page=25, level=0, dir.size=117, dir[58]=(9826)
binary_search_by_directory: page=25, level=0, dir.size=59, dir[29]=(9942)
binary_search_by_directory: page=25, level=0, dir.size=30, dir[14]=(9998)
binary_search_by_directory: page=25, level=0, dir.size=16, dir[7]=(10026)
binary_search_by_directory: page=25, level=0, dir.size=7, dir[3]=(10010)
binary_search_by_directory: page=25, level=0, dir.size=3, dir[1]=(10002)
binary_search_by_directory: page=25, level=0, dir.size=1, dir[0]=(9998)
linear_search_from_cursor: page=25, level=0, start=(9998)
linear_search_from_cursor: page=25, level=0, current=(9998)
linear_search_from_cursor: page=25, level=0, current=(9999)
linear_search_from_cursor: page=25, level=0, current=(10000)

That is the complete output. The algorithm here is only subtly different:

  1. Start at the root page of the index.
  2. Binary search using the page directory (repeatedly splitting the directory in half based on whether the current record is greater than or less than the search key) until a record is found via the page directory with the highest key that does not exceed the search key.
  3. Linear search from that record until finding an individual record with the highest key that does not exceed the search key. If the current page is a leaf page, return the record. If the current page is a non-leaf page, load the child page this record points to, and return to step 2.

In the above output you can see the directory size being repeatedly halved (dir.size), and the compared key (dir[x]) getting repeatedly nearer to the search key in the typical binary search pattern. In between binary searches you can see short linear searches once the nearest page directory entry is found (up to a maximum of 8 records).

The stats collected during the search also look a lot different:

irb> pp index.stats
{:binary_search=>1,
 :binary_search_by_directory=>16,
 :linear_search_from_cursor=>3,
 :linear_search_from_cursor_record_scans=>7,
 :compare_key=>37,
 :compare_key_field_comparison=>37,
 :binary_search_by_directory_recurse_left=>8,
 :binary_search_by_directory_recurse_right=>5,
 :binary_search_by_directory_linear_search=>2}

Especially notice that the compare_key operation is done only 37 times, compared to 1,288 times in the linear search. In terms of record comparisons, the binary search was 34x more efficient than linear search.

The physical structure of records in InnoDB

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages, and the logical structure was described in B+Tree index structures in InnoDB. Now we’ll look in detail at the physical structure of records used in those pages.

In this post, only COMPACT row format (for Barracuda table format) is considered.

Record offsets

In previous posts, record offsets have been described in many structures that need to “point” to records. Record offsets point to the start of the record data itself, which is variable-length, but each record is preceded by a record header, which is also variable-length. In this post and the illustrations in it, I use N to mean the start of the record, where the record data is at N and using positive offsets (e.g. N+1), and the record header uses negative offsets (e.g. N-1). InnoDB often refers to the start of the record data, location N as the “origin”.

The record header

In previous posts, the record header was mentioned a few times, but hasn’t been fully described. As above, the record header precedes the record itself, and is structured as follows:

The fields in the record header are (in order, backwards from N):

  1. Next Record Offset: A relative offset from the current record to the origin of the next record within the page in ascending order by key.
  2. Record Type: The type of the record, where currently only 4 values are supported: conventional (0), node pointer (1), infimum (2), and supremum (3).
  3. Order: The order in which this record was inserted into the heap. Heap records (which include infimum and supremum) are numbered from 0. Infimum is always order 0, supremum is always order 1. User records inserted will be numbered from 2.
  4. Number of Records Owned: The number of records “owned” by the current record in the page directory. This field will be further discussed in a future post about the page directory.
  5. Info Flags: A 4-bit bitmap to store boolean flags about this record. Currently only two flags are defined: min_rec (1) meaning this record is the minimum record in a non-leaf level of the B+Tree, and deleted (2) meaning the record is delete-marked (and will be actually deleted by a purge operation in the future).
  6. Nullable field bitmap (optional): A single bit per nullable field to store whether the field is NULL, rounded up to a whole number of bytes. If a field is NULL its field value will be eliminated from the key or row portion of the record. If no fields are nullable, this bitmap is absent.
  7. Variable field lengths array (optional): An array of 8-bit or 16-bit integers (depending on the maximum size of the field) per variable-length field to store the length of the field data for that field. If no fields are variable-length, this array is absent.

The record header is a minimum of 5 bytes per row, but for records with many variable-width fields, it could be a lot larger.

Clustered indexes

The clustered key (PRIMARY KEY) has one of the more complex record structures:

Since clustered keys are unique and not nullable, the nullable field bitmap will be absent. Clustered keys may be of variable length, and if so the variable field length array will be present. The following fields are included in the record data:

  1. Cluster Key Fields: The cluster key fields, concatenated together (literally). InnoDB just concatenates the raw bytes of its internal storage formats per column type together into a single byte stream.
  2. Transaction ID: The 48-bit integer transaction ID of the transaction which last modified this record.
  3. Roll Pointer: A structure containing information about the location in the rollback segment of the undo records for the transaction which last modified this record. The fields of the roll pointer structure are: 1-bit “is insert” flag, 7-bit rollback segment ID, 4-byte page number and 2-byte page offset of the undo log location.
  4. Non-Key Fields: All non-key fields (the actual row data for all non-PRIMARY KEY fields) concatenated together into a single byte stream.

The structure of non-leaf page records is similar, but somewhat simpler:

Since non-leaf pages are not MVCC, the Transaction ID and Roll Pointer fields are eliminated. Instead of the non-key fields, the child page number this node pointer points to is included.

Secondary indexes

Secondary indexes in InnoDB have an identical overall structure to the clustered key (PRIMARY KEY) but instead of containing non-key fields, they contain the clustered key fields, also known as a Primary Key Value or most commonly, PKV. If any fields overlap between the secondary key and the clustered key, the overlapping fields are removed from the clustered key stored in the secondary index records. For example, if a table has a PRIMARY KEY (a, b, c) and a secondary index KEY (a, d), the secondary key in the index will be as expected, (a, d) but the PKVs will contain only (b, c).

Since secondary keys are allowed to include fields that are both non-unique and nullable, both the variable field lengths array and the nullable field bitmap may be present if necessary. Otherwise the leaf page structure is very simple:

The secondary key fields are concatenated together into a single byte stream as with the clustered key. The clustered key fields are concatenated together in exactly the same way to form the PKV.

The secondary index non-leaf pages will also look familiar:

There is one thing of note for secondary index non-leaf pages: the clustered key fields (PKV) are included in the record and is considered part of the record’s key, not its value. Secondary indexes may be non-unique, but each record in the page must have a unique identifier, so the PKV must be included in the record to ensure uniqueness. This will mean that records in non-leaf pages of secondary keys will be 4 bytes larger than their leaf page counterparts.

An aside on per-row overhead

Looking at the above illustrations, you can easily calculate the per-row overhead InnoDB requires. Clustered Key Leaf Pages require a minimum of 5 bytes for the header, 6 bytes for the Transaction ID, and 7 bytes for the Roll Pointer, totaling 18 bytes per row. For very narrow tables (for example 2-3 integers) this overhead can be quite high.

In addition, there is substantial per-page overhead, and waste in inefficiently filling pages which can consume large amounts of space (for instance pages may be half-filled).

What’s next?

In the next post I will describe the page directory and its purpose in efficient record retrieval.

B+Tree index structures in InnoDB

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.

An aside on terminology: B+Tree, root, leaf, and level

InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.

Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.

InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.

Leaf and non-leaf pages

For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contain a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.

Leaf pages contain the non-key values as part of the “data” contained in each record:

Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:

Pages at the same level

Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:

Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).

A detailed look at a single-page table

Let’s take a look at most of what’s B+Tree related in a single index page:

Create and populate the table

The test table in use in the illustration above can be created and populated with (make sure you’re using innodb_file_per_table and using Barracuda file format):

CREATE TABLE t (
  i INT NOT NULL,
  s CHAR(10) NOT NULL,
  PRIMARY KEY(i)
) ENGINE=InnoDB;

INSERT INTO t (i, s)
  VALUES (0, "A"), (1, "B"), (2, "C");

While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal works.

Verify the basic structure of the space file

The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.

$ innodb_space -f test/t.ibd space-page-type-regions
start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           ALLOCATED           

The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:

$ innodb_space -f test/t.ibd space-index-pages-summary
page        index   level   data    free    records 
3           21      0       96      16156   3       
4           0       0       0       16384   0       
5           0       0       0       16384   0       

(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)

The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:

$ innodb_space -f test/t.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
21          3           internal    1           1           100.00%     
21          3           leaf        0           0           0.00%       

Set up a record describer

In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:

class Innodb::RecordDescriber::T < Innodb::RecordDescriber
  def self.cursor_sendable_description(page)
    {
      :type => :clustered,
      :key => [
        [:INT, :NOT_NULL],
      ],
      :row => [
        ["CHAR(10)", :NOT_NULL],
      ],
    }
  end
end

We need to return that this is the clustered key, the column descriptions for the key, and the column descriptions for the non-key fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:

-r ./describer_t.rb -d T

(Note that the structure of these classes may change dramatically in the future, and is not well documented, but is somewhat self-explanatory.)

Look at the record contents

The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:

$ innodb_space -f test/t.ibd -r ./describer_t.rb -d T -p 3 page-dump

Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>32,
   :type=>:conventional,
   :order=>2,
   :n_owned=>0,
   :min_rec=>false,
   :deleted=>false,
   :null_bitmap=>nil,
   :variable_length=>[0, 0]},
 :next=>157,
 :type=>:clustered,
 :key=>[0],
 :transaction_id=>"0000002f1345",
 :roll_pointer=>
  {:is_insert=>true, :rseg_id=>24, :undo_log=>{:page=>330, :offset=>272}},
 :row=>["A"]}

This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:

  • The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
  • The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
  • The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
  • The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.

Recurse the index

A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:

$ innodb_space -f test/t.ibd -r ./describer_t.rb -d T -p 3 index-recurse
ROOT NODE #3: 3 records, 96 bytes
  RECORD: (0) -> (A)
  RECORD: (1) -> (B)
  RECORD: (2) -> (C)

Building a non-trivial index tree

A multi-level index tree (overly simplified) in InnoDB looks like:

As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain "pointers" (containing the child page number) rather than non-key row data.

If we use the previously described simple table schema, but instead of only three rows, we insert 1 million rows (numbered 1 to 1,000,000), the tree structure looks a little more interesting:

$ innodb_space -f test/t.ibd -r ./describer_t.rb -d T index-recurse
ROOT NODE #3: 3 records, 39 bytes
  NODE POINTER RECORD >= (1) -> #36
  INTERNAL NODE #36: 601 records, 7813 bytes
    NODE POINTER RECORD >= (1) -> #4
    LEAF NODE #4: 234 records, 7488 bytes
      RECORD: (1) -> (A)
      RECORD: (2) -> (A)
      RECORD: (3) -> (A)
      RECORD: (4) -> (A)

<many lines omitted>

    LEAF NODE #5: 468 records, 14976 bytes
      RECORD: (235) -> (A)
      RECORD: (236) -> (A)
      RECORD: (237) -> (A)
      RECORD: (238) -> (A)

<many lines omitted>

    LEAF NODE #6: 468 records, 14976 bytes
      RECORD: (703) -> (A)
      RECORD: (704) -> (A)
      RECORD: (705) -> (A)
      RECORD: (706) -> (A)

This is a three-level index tree, which can be seen by the ROOT, INTERNAL, LEAF lines above. We can see that some pages are completely full, with 468 records consuming almost 15 KiB of the 16 KiB page.

Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:

innodb_space -f test/t.ibd -r ./describer_t.rb -d T -p 36 page-dump

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>13,
   :type=>:node_pointer,
   :order=>2,
   :n_owned=>0,
   :min_rec=>true,
   :deleted=>false,
   :null_bitmap=>nil,
   :variable_length=>[0, 0]},
 :next=>138,
 :type=>:clustered,
 :key=>[1],
 :child_page_number=>4}

The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.

The root page is a bit special

Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.

However, the root page itself can't actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is "raised" a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called "node pointers"), which in practice often means several hundred to more than a thousand.

B+Tree levels and increasing tree depth

As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:

Height Non-leaf pages Leaf pages Rows Size in bytes
1 0 1 468 16.0 KiB
2 1 1203 > 563 thousand 18.8 MiB
3 1204 1447209 > 677 million 22.1 GiB
4 1448413 1740992427 > 814 billion 25.9 TiB

As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.

What's next?

Next we'll take a look at the page directory structure in INDEX pages which mentioned several times already, and then look at how to search efficiently within InnoDB indexes.

The physical structure of InnoDB index pages

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. (Note that each image below is linked to a higher resolution version of the same image.)

The basic structure of the space and each page was described in The basics of InnoDB space file layout, and we’ll now take a deeper look into how INDEX pages are physically structured. This will lay the ground work to discuss indexes at a logical (and much higher) level.

Everything is an index in InnoDB

Before diving into physical structures, it’s critical to understand that in InnoDB, everything is an index. What does that mean to the physical structure?

  1. Every table has a primary key; if the CREATE TABLE does not specify one, the first non-NULL unique key is used, and failing that, a 48-bit hidden “Row ID” field is automatically added to the table structure and used as the primary key. Always add a primary key yourself. The hidden one is useless to you but still costs 6 bytes per row.
  2. The “row data” (non-PRIMARY KEY fields) are stored in the PRIMARY KEY index structure, which is also called the “clustered key”. This index structure is keyed on the PRIMARY KEY fields, and the row data is the value attached to that key (as well as some extra fields for MVCC).
  3. Secondary keys are stored in an identical index structure, but they are keyed on the KEY fields, and the primary key value (PKV) is attached to that key.

When discussing “indexes” in InnoDB (as in this post), this actually means both tables and indexes as the DBA may think of them.

Overview of INDEX page structure

Every index page has an overall structure as follows:

The major sections of the page structure are (not in order):

  • The FIL header and trailer: This is typical and common to all page types. One aspect somewhat unique to INDEX pages is that the “previous page” and “next page” pointers in the FIL header point to the previous and next pages at the same level of the index, and in ascending order based on the index’s key. This forms a doubly-linked list of all pages at each level. This will be further described in the logical index structure.
  • The FSEG header: As described in Page management in InnoDB space files, the index root page’s FSEG header contains pointers to the file segments used by this index. All other index pages’ FSEG headers are unused and zero-filled.
  • The INDEX header: Contains many fields related to INDEX pages and record management. Fully described below.
  • System records: InnoDB has two system records in each page called infimum and supremum. These records are stored in a fixed location in the page so that they can always be found directly based on byte offset in the page.
  • User records: The actual data. Every record has a variable-width header and the actual column data itself. The header contains a “next record” pointer which stores the offset to the next record within the page in ascending order, forming a singly-linked list. Details of user record structure will be described in a future post.
  • The page directory: The page directory grows downwards from the “top” of the page starting at the FIL trailer and contains pointers to some of the records in the page (every 4th to 8th record). Details of the page directory logical structure and its purpose will be described in a future post.

The INDEX header

The INDEX header in each INDEX page is fixed-width and has the following structure:

The fields stored in this structure are (not in order):

  • Index ID: The ID of the index this page belongs to.
  • Format Flag: The format of the records in this page, stored in the high bit (0x8000) of the “Number of Heap Records” field. Two values are possible: COMPACT and REDUNDANT. Described fully below.
  • Maximum Transaction ID: The maximum transaction ID of any modification to any record in this page.
  • Number of Heap Records: The total number of records in the page, including the infimum and supremum system records, and garbage (deleted) records.
  • Number of Records: The number of non-deleted user records in the page.
  • Heap Top Position: The byte offset of the “end” of the currently used space. All space between the heap top and the end of the page directory is free space.
  • First Garbage Record Offset: A pointer to the first entry in the list of garbage (deleted) records. The list is singly-linked together using the “next record” pointers in each record header. (This is called “free” within InnoDB, but this name is somewhat confusing.)
  • Garbage Space: The total number of bytes consumed by the deleted records in the garbage record list.
  • Last Insert Position: The byte offset of the record that was last inserted into the page.
  • Page Direction: Three values are currently used for page direction: LEFT, RIGHT, and NO_DIRECTION. This is an indication as to whether this page is experiencing sequential inserts (to the left [lower values] or right [higher values]) or random inserts. At each insert, the record at the last insert position is read and its key is compared to the currently inserted record’s key, in order to determine insert direction.
  • Number of Inserts in Page Direction: Once the page direction is set, any following inserts that don’t reset the direction (due to their direction differing) will instead increment this value.
  • Number of Directory Slots: The size of the page directory in “slots”, which are each 16-bit byte offsets.
  • Page Level: The level of this page in the index. Leaf pages are at level 0, and the level increments up the B+tree from there. In a typical 3-level B+tree, the root will be level 2, some number of internal non-leaf pages will be level 1, and leaf pages will be level 0. This will be discussed in more detail in a future post as it relates to logical structure.

Record format: redundant versus compact

The COMPACT record format is new in the Barracuda table format, while the REDUNDANT record format is the original one in the Antelope table format (neither of which had a name officially until Barracuda was created). The COMPACT format mostly eliminated information that was redundantly stored in each record and can be obtained from the data dictionary, such as the number of fields, which fields are nullable, and which fields are dynamic length.

An aside on record pointers

Record pointers are used in several different places: the Last Insert Position field in the INDEX header, all values in the page directory, and the “next record” pointers in the system records and user records. All records contain a header (which may be variable-width) followed by the actual record data (which may also be variable-width). Record pointers point to the location of the first byte of record data, which is effectively “between” the header and the record data. This allows the header to be read by reading backwards from that location, and the record data to be read forward from that location.

Since the “next record” pointer in system and user records is always the first field in the header reading backwards from this pointer, this means it is also possible to very efficiently read through all records in a page without having to parse the variable-width record data at all.

System records: infimum and supremum

Every INDEX page contains two system records, called infimum and supremum, at fixed locations (offset 99 and offset 112 respectively) within the page, with the following structure:

The two system records have a typical record header preceding their location, and the literal strings “infimum” and “supremum” as their only data. A full description of record header fields will be provided in a future post. For now, it is important primarily to observe that the first field (working backwards from the record data, as previously described) is the “next record” pointer.

The infimum record

The infimum record represents a value lower than any possible key in the page. Its “next record” pointer points to the user record with the lowest key in the page. Infimum serves as a fixed entry point for sequentially scanning user records.

The supremum record

The supremum record represents a key higher than any possible key in the page. Its “next record” pointer is always zero (which represents NULL, and is always an invalid position for an actual record, due to the page headers). The “next record” pointer of the user record with the highest key on the page always points to supremum.

User records

The actual on-disk format of user records will be described in a future post, as it is fairly complex and will require a lengthy explanation itself.

User records are added to the page body in the order they are inserted (and may take existing free space from previously deleted records), and are singly-linked in ascending order by key using the “next record” pointers in each record header. A singly-linked list is formed from infimum, through all user records in ascending order, and terminating at supremum. Using this list, it is trivial to scan in through all user records in a page in ascending order.

Further, using the “next page” pointer in the INDEX header, it’s easy to scan from page to page through the entire index in ascending order. This means an ascending-order table scan is also trivial to implement:

  1. Start at the first (lowest key) page in the index. (This page is found through B+tree traversal, which will be described in a future post.)
  2. Read infimum, and follow its “next record” pointer.
  3. If the record is supremum, proceed to step 5. If not, read and process the record contents.
  4. Follow the “next record” pointer and return to step 3.
  5. If the “next page” pointer points to NULL, exit. If not, follow the “next page” pointer and return to step 2.

Since records are singly-linked rather than doubly-linked, traversing the index in descending order is not as trivial, and will be discussed in a future post.

The page directory

The page directory starts at the FIL trailer and grows “downwards” from there towards the user records. The page directory contains a pointer to every 4-8 records, in addition to always containing an entry for infimum and supremum.

The page directory is simply a dynamically-sized array of 16-bit offset pointers to records within the page. Its purpose will be more fully described in a future post dedicated to the page directory.

Free space

The space between the user records (which grow “upwards”) and the page directory (which grows “downwards”) is considered free space. Once the two sections meet in the middle, exhausting the free space (and assuming no space can be reclaimed by re-organizing to remove the garbage) the page is considered full.

What’s next?

Next we’ll look at the logical structure of an index, including some examples.