B+Tree index structures in InnoDB

[This post refers to innodb_ruby version 0.8.8 as of February 3, 2014.]

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.

An aside on terminology: B+Tree, root, leaf, and level

InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.

Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.

InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.

Leaf and non-leaf pages

For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contain a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.

Leaf pages contain the non-key values as part of the “data” contained in each record:

Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:

Pages at the same level

Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:

Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).

A detailed look at a single-page table

Let’s take a look at most of what’s B+Tree related in a single index page:

Create and populate the table

The test table in use in the illustration above can be created and populated with (make sure you’re using innodb_file_per_table and using Barracuda file format):

CREATE TABLE t_btree (
  i INT NOT NULL,
  s CHAR(10) NOT NULL,
  PRIMARY KEY(i)
) ENGINE=InnoDB;

INSERT INTO t_btree (i, s)
  VALUES (0, "A"), (1, "B"), (2, "C");

While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal works.

Verify the basic structure of the space file

The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.

$ innodb_space -f t_btree.ibd space-page-type-regions
start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           FREE (ALLOCATED)    

The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:

$ innodb_space -f t_btree.ibd space-index-pages-summary
page        index   level   data    free    records 
3           18      0       96      16156   3       
4           0       0       0       16384   0       
5           0       0       0       16384   0       

(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)

The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:

$ innodb_space -f t_btree.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
18          3           internal    1           1           100.00%     
18          3           leaf        0           0           0.00%       

Set up a record describer

In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:

class SimpleTBTreeDescriber < Innodb::RecordDescriber
  type :clustered
  key "i", :INT, :NOT_NULL
  row "s", "CHAR(10)", :NOT_NULL
end

We need to note that this is the clustered key, provide the column descriptions for the key, and the column descriptions for the non-key (“row”) fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:

-r -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber

Look at the record contents

The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 3 page-dump

Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>157,
   :type=>:conventional,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>false,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0, 0, 0, 0],
   :field_externs=>[false, false, false, false]},
 :next=>157,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
 :transaction_id=>"0000000f4745",
 :roll_pointer=>
  {:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
 :row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}

This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:

  • The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
  • The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
  • The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
  • The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.

Recurse the index

A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 3 index-recurse
ROOT NODE #3: 3 records, 96 bytes
  RECORD: (i=0) -> (s=A)
  RECORD: (i=1) -> (s=B)
  RECORD: (i=2) -> (s=C)

Building a non-trivial index tree

A multi-level index tree (overly simplified) in InnoDB looks like:

As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain “pointers” (containing the child page number) rather than non-key row data.

If we use the simpler table schema with 1 million rows created in A quick introduction to innodb_ruby, the tree structure looks a little more interesting:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 3 index-recurse

ROOT NODE #3: 2 records, 26 bytes
  NODE POINTER RECORD >= (i=252) -> #36
  INTERNAL NODE #36: 1117 records, 14521 bytes
    NODE POINTER RECORD >= (i=252) -> #4
    LEAF NODE #4: 446 records, 9812 bytes
      RECORD: (i=1) -> ()
      RECORD: (i=2) -> ()
      RECORD: (i=3) -> ()
      RECORD: (i=4) -> ()

<many lines omitted>

    NODE POINTER RECORD >= (i=447) -> #1676
    LEAF NODE #1676: 444 records, 9768 bytes
      RECORD: (i=447) -> ()
      RECORD: (i=448) -> ()
      RECORD: (i=449) -> ()
      RECORD: (i=450) -> ()

<many lines omitted>

    NODE POINTER RECORD >= (i=891) -> #771
    LEAF NODE #771: 512 records, 11264 bytes
      RECORD: (i=891) -> ()
      RECORD: (i=892) -> ()
      RECORD: (i=893) -> ()
      RECORD: (i=894) -> ()

This is a three-level index tree, which can be seen by the ROOT, INTERNAL, LEAF lines above. We can see that some pages are completely full, with 468 records consuming almost 15 KiB of the 16 KiB page.

Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 36 page-dump

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>11877,
   :type=>:node_pointer,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>true,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0],
   :field_externs=>[false]},
 :next=>11877,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
 :child_page_number=>4}

The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.

The root page is a bit special

Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.

However, the root page itself can’t actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called “node pointers”), which in practice often means several hundred to more than a thousand.

B+Tree levels and increasing tree depth

As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:

Height Non-leaf pages Leaf pages Rows Size in bytes
1 0 1 468 16.0 KiB
2 1 1203 > 563 thousand 18.8 MiB
3 1204 1447209 > 677 million 22.1 GiB
4 1448413 1740992427 > 814 billion 25.9 TiB

As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.

What’s next?

Next we’ll take a look at the page directory structure in INDEX pages which mentioned several times already, and then look at how to search efficiently within InnoDB indexes.

20 thoughts on “B+Tree index structures in InnoDB

  1. Hi, Cole
    As i learned, Max_Trx_Id in the index page is only used for coverage index scan, so it is valid only on the secondary index, in primary index, it’s useless.

  2. Hi,Cole

    I’ve fully learned the B+Tree Detailed Page Structure, I think i know all the attributes but two: Order and Min_rec in the user record, what’s these two attributes means? or where i can find these two attributes in the innodb source code?

    Look forward to your reply! Thanks!

    • He:

      Order is quite simple: It’s the order the record was inserted into the heap. Infimum is always Order: 0, Supremum is always Order: 1, and user records end up incrementing from there. Check for rec_get_heap_no_new function and REC_NEW_HEAP_NO define.

      Min_Rec is a true/false flag which is set on the lowest-key record at any non-leaf level. I’m not sure of its actual purpose. Check for rec_get_info_bits/rec_set_info_bits_new functions and REC_INFO_MIN_REC_FLAG define.

      • Thanks for your reply, i am clear now:)

        For Order, i named it heap_no, which is used to identify a record in the page, row lock is locked on this number.

        For Min_Rec, i am also puzzled. I’ll see the source for its actual purpose.

        I am also write some documents on InnoDB Index Structure, Page Structure, Row Format, and Operations on these structures(Insert/Update/Delete/Purge/SMO), unfortunately in chinese. Keeping in touch.

      • He: It may be that I should rename “Order” to “Heap Number” in my documentation. I believe the term “Order” came from some (very incomplete) InnoDB documentation that I referenced, but it is not used at all in the InnoDB source code. So that may be unnecessarily confusing.

  3. Hi, Cole
    I am trying to follow the example you give, but it seems there are some typos:

    $ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 3 index-recurse
    should be
    $ innodb_space -f t.ibd -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 3 index-recurse

    $ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber -p 36 page-dump
    should be
    $ innodb_space -f t.ibd -r ./ simple_t_btree_describer.rb -d SimpleTBTreeDescriber -p 36 page-dump

    Not quite sure about the above…Maybe I missed something.
    But thank you so much for the great article. It helps a lot to understand how the B+tree works.

  4. I have a question about “NULL” value. Could you let me know where is the NULL value stored in leaf node of secondary index?

    • 468 records per leaf page, or 1203 records per non-leaf page.
      I wonder how to caculate the number 468?

      page size: 16KB = 1024 * 16 = 16384
      16384 – 38(FIL Header) – 36(Index header) – 20 (FSEG header) – 26(system records) – 8 (FIL Trailer) = 16256

      CREATE TABLE t_btree (
      i INT NOT NULL,
      s CHAR(10) NOT NULL,
      PRIMARY KEY(i)
      ) ENGINE=InnoDB;

      user recode = 5(Record header) + 4(Primary key INT) + 6(TransactionID) +7 (Roll pointer) + 10(non-key field Char10) = 32

      page directory = (user record count / 8) * 2 or (user record count / 4) * 2

      16256 = 32 * x + (x / 8) * 2
      x = 504

      16256 = 32 * x + (x /4) * 2
      x = 500

      and not 468 , so i wonder why 468

      • I think because the maximum fill for a page is 15/16ths (approximately); this leaves margin for records to be modified in place without page split.

  5. Hi Cole,

    I tried to create an index in MySQL using phpmyadmin. When I look at the index info, it shows the index has been created using BTREE. Is this B+Tree likely you described on this page? Or it is just B Tree. Many thanks

  6. You say the tree is balanced. I wonder whether its balance algorithm is just like AVL tree?
    And balance operation is background or real time on write operation.

  7. in the ‘B+Tree levels and increasing tree depth’ section
    I would like to know how to calculate this value of 120,Can you teach me? Write the calculation formula if convenient

  8. each non-leaf page have 1203 records, how to calculate it? below is my algorithm.

    page size: 16KB, load factor: 15/16ths
    available: 16 * 1024 *(15/16) – 120 – 8 = 15232
    user reord: 5(record header) + 4(cluster key min) + 4(child page number) = 13

    page directory = (user record count / 8) * 2 or (user record count / 4) * 2
    1. asumming page direcory slot to every 4 records
    record count = 15232/ (13 + 0.5) = 1128(round down)
    2. asumming page direcory slot to every 8 records
    record count = 15232/ (13 + 0.25) = 1149(round down)

    which step is wrong?

What do you think?