The physical structure of InnoDB index pages

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. (Note that each image below is linked to a higher resolution version of the same image.)

The basic structure of the space and each page was described in The basics of InnoDB space file layout, and we’ll now take a deeper look into how INDEX pages are physically structured. This will lay the ground work to discuss indexes at a logical (and much higher) level.

Everything is an index in InnoDB

Before diving into physical structures, it’s critical to understand that in InnoDB, everything is an index. What does that mean to the physical structure?

Every table has a primary key; if the CREATE TABLE does not specify one, the first non-NULL unique key is used, and failing that, a 48-bit hidden “Row ID” field is automatically added to the table structure and used as the primary key. Always add a primary key yourself. The hidden one is useless to you but still costs 6 bytes per row.
The “row data” (non-PRIMARY KEY fields) are stored in the PRIMARY KEY index structure, which is also called the “clustered key”. This index structure is keyed on the PRIMARY KEY fields, and the row data is the value attached to that key (as well as some extra fields for MVCC).
Secondary keys are stored in an identical index structure, but they are keyed on the KEY fields, and the primary key value (PKV) is attached to that key.

When discussing “indexes” in InnoDB (as in this post), this actually means both tables and indexes as the DBA may think of them.

Overview of `INDEX` page structure

Every index page has an overall structure as follows:

The major sections of the page structure are (not in order):

The FIL header and trailer: This is typical and common to all page types. One aspect somewhat unique to INDEX pages is that the “previous page” and “next page” pointers in the FIL header point to the previous and next pages at the same level of the index, and in ascending order based on the index’s key. This forms a doubly-linked list of all pages at each level. This will be further described in the logical index structure.
The FSEG header: As described in Page management in InnoDB space files, the index root page’s FSEG header contains pointers to the file segments used by this index. All other index pages’ FSEG headers are unused and zero-filled.
The INDEX header: Contains many fields related to INDEX pages and record management. Fully described below.
System records: InnoDB has two system records in each page called infimum and supremum. These records are stored in a fixed location in the page so that they can always be found directly based on byte offset in the page.
User records: The actual data. Every record has a variable-width header and the actual column data itself. The header contains a “next record” pointer which stores the offset to the next record within the page in ascending order, forming a singly-linked list. Details of user record structure will be described in a future post.
The page directory: The page directory grows downwards from the “top” of the page starting at the FIL trailer and contains pointers to some of the records in the page (every 4th to 8th record). Details of the page directory logical structure and its purpose will be described in a future post.

The `INDEX` header

The INDEX header in each INDEX page is fixed-width and has the following structure:

The fields stored in this structure are (not in order):

Index ID: The ID of the index this page belongs to.
Format Flag: The format of the records in this page, stored in the high bit (0x8000) of the “Number of Heap Records” field. Two values are possible: COMPACT and REDUNDANT. Described fully below.
Maximum Transaction ID: The maximum transaction ID of any modification to any record in this page.
Number of Heap Records: The total number of records in the page, including the infimum and supremum system records, and garbage (deleted) records.
Number of Records: The number of non-deleted user records in the page.
Heap Top Position: The byte offset of the “end” of the currently used space. All space between the heap top and the end of the page directory is free space.
First Garbage Record Offset: A pointer to the first entry in the list of garbage (deleted) records. The list is singly-linked together using the “next record” pointers in each record header. (This is called “free” within InnoDB, but this name is somewhat confusing.)
Garbage Space: The total number of bytes consumed by the deleted records in the garbage record list.
Last Insert Position: The byte offset of the record that was last inserted into the page.
Page Direction: Three values are currently used for page direction: LEFT, RIGHT, and NO_DIRECTION. This is an indication as to whether this page is experiencing sequential inserts (to the left [lower values] or right [higher values]) or random inserts. At each insert, the record at the last insert position is read and its key is compared to the currently inserted record’s key, in order to determine insert direction.
Number of Inserts in Page Direction: Once the page direction is set, any following inserts that don’t reset the direction (due to their direction differing) will instead increment this value.
Number of Directory Slots: The size of the page directory in “slots”, which are each 16-bit byte offsets.
Page Level: The level of this page in the index. Leaf pages are at level 0, and the level increments up the B+tree from there. In a typical 3-level B+tree, the root will be level 2, some number of internal non-leaf pages will be level 1, and leaf pages will be level 0. This will be discussed in more detail in a future post as it relates to logical structure.

Record format: redundant versus compact

The COMPACT record format is new in the Barracuda table format, while the REDUNDANT record format is the original one in the Antelope table format (neither of which had a name officially until Barracuda was created). The COMPACT format mostly eliminated information that was redundantly stored in each record and can be obtained from the data dictionary, such as the number of fields, which fields are nullable, and which fields are dynamic length.

An aside on record pointers

Record pointers are used in several different places: the Last Insert Position field in the INDEX header, all values in the page directory, and the “next record” pointers in the system records and user records. All records contain a header (which may be variable-width) followed by the actual record data (which may also be variable-width). Record pointers point to the location of the first byte of record data, which is effectively “between” the header and the record data. This allows the header to be read by reading backwards from that location, and the record data to be read forward from that location.

Since the “next record” pointer in system and user records is always the first field in the header reading backwards from this pointer, this means it is also possible to very efficiently read through all records in a page without having to parse the variable-width record data at all.

System records: infimum and supremum

Every INDEX page contains two system records, called infimum and supremum, at fixed locations (offset 99 and offset 112 respectively) within the page, with the following structure:

The two system records have a typical record header preceding their location, and the literal strings “infimum” and “supremum” as their only data. A full description of record header fields will be provided in a future post. For now, it is important primarily to observe that the first field (working backwards from the record data, as previously described) is the “next record” pointer.

The infimum record

The infimum record represents a value lower than any possible key in the page. Its “next record” pointer points to the user record with the lowest key in the page. Infimum serves as a fixed entry point for sequentially scanning user records.

The supremum record

The supremum record represents a key higher than any possible key in the page. Its “next record” pointer is always zero (which represents NULL, and is always an invalid position for an actual record, due to the page headers). The “next record” pointer of the user record with the highest key on the page always points to supremum.

User records

The actual on-disk format of user records will be described in a future post, as it is fairly complex and will require a lengthy explanation itself.

User records are added to the page body in the order they are inserted (and may take existing free space from previously deleted records), and are singly-linked in ascending order by key using the “next record” pointers in each record header. A singly-linked list is formed from infimum, through all user records in ascending order, and terminating at supremum. Using this list, it is trivial to scan in through all user records in a page in ascending order.

Further, using the “next page” pointer in the FIL header, it’s easy to scan from page to page through the entire index in ascending order. This means an ascending-order table scan is also trivial to implement:

Start at the first (lowest key) page in the index. (This page is found through B+tree traversal, which will be described in a future post.)
Read infimum, and follow its “next record” pointer.
If the record is supremum, proceed to step 5. If not, read and process the record contents.
Follow the “next record” pointer and return to step 3.
If the “next page” pointer points to NULL, exit. If not, follow the “next page” pointer and return to step 2.

Since records are singly-linked rather than doubly-linked, traversing the index in descending order is not as trivial, and will be discussed in a future post.

The page directory

The page directory starts at the FIL trailer and grows “downwards” from there towards the user records. The page directory contains a pointer to every 4-8 records, in addition to always containing an entry for infimum and supremum.

The page directory is simply a dynamically-sized array of 16-bit offset pointers to records within the page. Its purpose will be more fully described in a future post dedicated to the page directory.

Free space

The space between the user records (which grow “upwards”) and the page directory (which grows “downwards”) is considered free space. Once the two sections meet in the middle, exhausting the free space (and assuming no space can be reclaimed by re-organizing to remove the garbage) the page is considered full.

What’s next?

Next we’ll look at the logical structure of an index, including some examples.

Exploring InnoDB page management with innodb_ruby

[This post refers to innodb_ruby version 0.8.8 as of February 3, 2014.]

In On learning InnoDB: A journey to the core I introduced a new library and command-line tool in the innodb_ruby project. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

In my last post, Page management in InnoDB space files, I described InnoDB’s extent, file segment, and free space management structures. Now I will provide a few demonstrations of using innodb_space to examine those structures in real tables.

A minimal, empty table

I created an empty table (the schema doesn’t matter) to illustrate the “minimal” state of InnoDB’s page management structures. The space-page-type-regions mode will summarize the type of all contiguous regions of the same page type:

$ innodb_space -f test/e.ibd space-page-type-regions
start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           FREE (ALLOCATED)

The table has allocated the standard pages for IBD space files: FSP_HDR, IBUF_BITMAP, INODE, and an INDEX page for the root of the (empty) index. There are also two FREE (ALLOCATED) pages which are unused.

The space-lists mode can be used to summarize the extent descriptor and inode lists in the space:

$ innodb_space -f test/e.ibd space-lists
name                length      f_page      f_offset    l_page      l_offset    
free                0           0           0           0           0           
free_frag           1           0           158         0           158         
full_frag           0           0           0           0           0           
full_inodes         0           0           0           0           0           
free_inodes         1           2           38          2           38

Only the free_frag extent descriptor list has any entries, and only a single extent is in it. The free_inodes list has the one INODE page seen above in it.

The contents of the free_frag list can be examined with the space-list-iterate mode, which will print a graphic illustrating the usage of pages within all extents in an extent list (“#” means the page is used, “.” means the page is free):

$ innodb_space -f test/e.ibd -L free_frag space-list-iterate
start_page  page_used_bitmap                                                
0           ####............................................................

The file segments in all indexes in the space can be summarized with the space-indexes mode:

$ innodb_space -f test/e.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
16          3           internal    1           1           100.00%     
16          3           leaf        0           0           0.00%

Only the internal file segment has any pages allocated, and it only has a single page allocated. The index-fseg-internal-lists mode will summarize the extent lists in the internal file segment:

$ innodb_space -f test/e.ibd -p 3 index-fseg-internal-lists
name                length      f_page      f_offset    l_page      l_offset    
free                0           0           0           0           0           
not_full            0           0           0           0           0           
full                0           0           0           0           0

All three lists are empty, because this empty table has not allocated any full extents. So where did the 1 page that is used go? It’s a “fragment” page, and those can be listed with the index-fseg-internal-frag-pages mode:

$ innodb_space -f test/e.ibd -p 3 index-fseg-internal-frag-pages
page        index   level   data    free    records 
3           16      0       0       16252   0

That’s the minimal state of things — mostly empty bookkeeping structures, and a single INDEX page. Let’s take a look at a table with some real data in it.

A table with one million rows

In A quick introduction to innodb_ruby, I created a table with 1 million rows in it. We’ll use the same table in the examples here.

There are a total of 2,165 pages, with the majority of them type INDEX as a typical table would be:

$ innodb_space -f test/t.ibd space-page-type-regions
start       end         count       type                
3           37          35          INDEX               
38          63          26          FREE (ALLOCATED)    
64          2188        2125        INDEX               
2189        2239        51          FREE (ALLOCATED)    
2240        2240        1           INDEX               
2241        2303        63          FREE (ALLOCATED)    
2304        2304        1           INDEX               
2305        2367        63          FREE (ALLOCATED)    
2368        2368        1           INDEX               
2369        2431        63          FREE (ALLOCATED)    
2432        2432        1           INDEX               
2433        2495        63          FREE (ALLOCATED)    
2496        2496        1           INDEX               
2497        2687        191         FREE (ALLOCATED)

Note that there are a few gaps of ALLOCATED (free) pages in between blocks of INDEX pages. InnoDB does not guarantee that it uses free pages sequentially, and many optimizations around bulk data loading will cause pages to be used out of order. (More on page splitting and these optimizations in a future post.)

Looking at the space’s lists, there are actually a few extents in free, as well as the usual one extent in free_frag:

$ innodb_space -f test/t.ibd space-lists
name                length      f_page      f_offset    l_page      l_offset    
free                2           0           1758        0           1798        
free_frag           1           0           158         0           158         
full_frag           0           0           0           0           0           
full_inodes         0           0           0           0           0           
free_inodes         1           2           38          2           38

The pages in the free extent descriptor list are all free, as expected:

$ innodb_space -f test/t.ibd -L free space-list-iterate
2560        ................................................................
2624        ................................................................

The free_frag extent descriptor list shows a number of “fragment” pages are used as well:

$ innodb_space -f test/t.ibd -L free_frag space-list-iterate
start_page  page_used_bitmap                                                
0           ######################################..........................

The index file segments show the bulk of the used pages are allocated to the leaf file segment, also as expected (there are only 3 non-leaf internal pages to manage the 2,137 leaf pages in the B+tree):

$ innodb_space -f test/t.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
15          3           internal    3           3           100.00%     
15          3           leaf        2162        2528        85.52%

You can also see that the leaf index file segment has more pages allocated than it’s actually using, showing an 85.52% fill factor. This is due to InnoDB’s “segment fill factor” which is fixed at 87.5% in stock MySQL, but is now configurable in Twitter MySQL thanks to a heads up from Facebook filing MySQL Bug 64673.

Since the internal index file segment has only three pages, as expected the file segment lists are all empty:

$ innodb_space -f test/t.ibd -p 3 index-fseg-internal-lists
name                length      f_page      f_offset    l_page      l_offset    
free                0           0           0           0           0           
not_full            0           0           0           0           0           
full                0           0           0           0           0

The three used pages are allocated as fragment pages:

$ innodb_space -f test/t.ibd -p 3 index-fseg-internal-frag-pages
page        index   level   data    free    records 
3           15      2       26      16226   2       
36          15      1       14521   1401    1117    
37          15      1       13585   2341    1045

The leaf index file segment lists are pretty busy, with 32 full extents and 6 not full extents:

$ innodb_space -f test/t.ibd -p 3 index-fseg-leaf-lists
name                length      f_page      f_offset    l_page      l_offset    
free                0           0           0           0           0           
not_full            6           0           1518        0           1718        
full                33          0           198         0           1478

In addition the leaf index file segment has allocated all 32 fragment pages possible (before any of the full extents above):

$ innodb_space -f test/t.ibd -p 3 index-fseg-leaf-frag-pages
page        index   level   data    free    records 
4           15      0       9812    6286    446     
5           15      0       15158   860     689     
6           15      0       10912   5170    496     
7           15      0       10670   5412    485     
8           15      0       12980   3066    590     
9           15      0       11264   4808    512     
10          15      0       4488    11690   204     
11          15      0       9680    6418    440     
12          15      0       9306    6800    423     
13          15      0       9658    6434    439     
14          15      0       10032   6062    456     
15          15      0       9988    6108    454     
16          15      0       9570    6530    435     
17          15      0       9130    6978    415     
18          15      0       8844    7266    402     
19          15      0       11770   4300    535     
20          15      0       9020    7092    410     
21          15      0       8646    7462    393     
22          15      0       9746    6354    443     
23          15      0       11066   5014    503     
24          15      0       8910    7204    405     
25          15      0       11748   4322    534     
26          15      0       10978   5094    499     
27          15      0       11132   4940    506     
28          15      0       9350    6750    425     
29          15      0       13508   2526    614     
30          15      0       14938   1082    679     
31          15      0       14520   1506    660     
32          15      0       9086    7016    413     
33          15      0       9724    6368    442     
34          15      0       10978   5102    499     
35          15      0       9504    6592    432

The full extents, as expected, are all full:

$ innodb_space -f test/t.ibd -p 3 -L full index-fseg-leaf-list-iterate
start_page  page_used_bitmap                                                
64          ################################################################
128         ################################################################
192         ################################################################
256         ################################################################
320         ################################################################
384         ################################################################
448         ################################################################
512         ################################################################
576         ################################################################
640         ################################################################
704         ################################################################
768         ################################################################
832         ################################################################
896         ################################################################
960         ################################################################
1024        ################################################################
1088        ################################################################
1152        ################################################################
1216        ################################################################
1280        ################################################################
1344        ################################################################
1408        ################################################################
1472        ################################################################
1536        ################################################################
1600        ################################################################
1664        ################################################################
1728        ################################################################
1792        ################################################################
1856        ################################################################
1920        ################################################################
1984        ################################################################
2048        ################################################################
2112        ################################################################

The not_full extents are all partially filled, as expected:

$ innodb_space -f test/t.ibd -p 3 -L not_full index-fseg-leaf-list-iterate
start_page  page_used_bitmap                                                
2176        #############...................................................
2240        #...............................................................
2304        #...............................................................
2368        #...............................................................
2432        #...............................................................
2496        #...............................................................

You can see the artifacts of InnoDB’s page split optimizations here: It has taken the first page out of an extent several times (due to page number “hinting” which is dubious at best) in an effort to lay out data in sequential order on disk. A deeper examination of this behavior will come in the future.

Page management in InnoDB space files

The basic structure of the space and each page was described in The basics of InnoDB space file layout, so now we’ll expand on that to describe InnoDB’s structures related to management of pages and extents, and “free space” management, and how it tracks pages allocated to the many different purposes for which pages are used.

Extents and extent descriptors

As described previously, InnoDB pages are normally 16 KiB, and are grouped into 1 MiB blocks of 64 contiguous pages, which is called an “extent”. InnoDB allocates FSP_HDR and XDES pages at fixed locations within the space, to keep track of which extents are in use, and which pages within each extent are in use. These pages have a pretty simple structure:

They contain the usual FIL header and trailer, an FSP header (discussed a bit later), and 256 “extent descriptors” or just “descriptors”. They also contain a sizable amount of unused space.

Extent descriptors have the following structure:

The purpose of the various fields in an extent descriptor are:

File Segment ID: The ID of the file segment to which the extent belongs, if it belongs to a file segment.
List node for XDES list: Pointers to previous and next extents in a doubly-linked extent descriptor list.
State: The current state of the extent, for which only four values are currently defined: FREE, FREE_FRAG, and FULL_FRAG, meaning this extent belongs to the space’s list with the same name; and FSEG, meaning this extent belongs to the file segment with the ID stored in the File Segment ID field. (More on these lists below.)
Page State Bitmap: A bitmap of 2 bits per page in the extent (64 x 2 = 128 bits, or 16 bytes). The first bit indicates whether the page is free. The second bit is reserved to indicate whether the page is clean (has no un-flushed data), but this bit is currently unused and is always set to 1.

Other structures that reference an extent do so using a combination of the page number of the FSP_HDR or XDES page in which the extent’s descriptor resides, and the byte offset within that page of the descriptor entry itself. For example, the extent referenced by “page 0 offset 150” is the first extent in the space, accounting for pages 0-63, while “page 16384 offset 270” accounts for pages 16576-16639.

List base nodes and list nodes

Lists (or “free lists” as InnoDB calls them) are a fairly generic structure that allows linking multiple related structures together. It consists of two complementary structures, which form a nicely featured on-disk doubly-linked list. The “list base node” has the following structure:

The base node is stored only once in some high level structure (such as the FSP header). It contains the length of the list, and pointers to the first and last list node in the list. The actual “list nodes” look very similar:

Instead of storing first and last pointers, of course, the list node stores previous and next pointers.

All pointers consist of a page number (which is required to be within the same space), and a byte offset within that page where the list node can be found. All pointers point to the beginning (that is, N+0) of the list node, not necessarily of the structure being linked together. For example, when extent descriptor entries are linked in a list, since the list node is at offset 8 within the XDES entry structure, the code reading a list entry must “know” that the descriptor structure starts 8 bytes before the offset of the list node, and read the structure from there. (It would’ve perhaps been a good idea to ensure that the list node is always first in any structure, but that is not the case.)

The file space header and extent lists

Aside from storing the extent descriptor entries themselves, the FSP_HDR page (which is always page 0 in a space) also stores the FSP header, which contains a lot of lists, so couldn’t easily be described earlier. The structure of the FSP header is as follows:

The purposes of the non-list related fields in an FSP header are (not in order):

Space ID: The space ID of the current space.
Highest page number in the space (size): The “size” is the highest valid page number, and is incremented when the file is grown. However, not all of these pages are initialized (some may be zero-filled), as extending a space is a multi-step process.
Highest page number initialized (free limit): The “free limit” is the highest page number for which the FIL header has been initialized, storing the page number in the page itself, amongst other things. The free limit will always be less than or equal to the size.
Flags: Storage of flags related to the space.
Next Unused Segment ID: The file segment ID that will be used for the next allocated file segment. (This is essentially an auto-increment integer.)
Number of pages used in the FREE_FRAG list: This is stored as an optimization to be able to quickly calculate the number of free pages in the FREE_FRAG list, without iterating through all extents in the list and summing the free pages available in each.

List base nodes for the following extent descriptor lists are also stored in the FSP header:

FREE_FRAG: Extents with free pages remaining that are allocated to be used in “fragments”, having individual pages allocated to different purposes rather than allocating the entire extent. For example, every extent with an FSP_HDR or XDES page will be placed on the FREE_FRAG list so that the remaining free pages in the extent can be allocated for other uses.
FULL_FRAG: Exactly like FREE_FRAG but for extents with no free pages remaining. Extents are moved from FREE_FRAG to FULL_FRAG when they become full, and moved back to FREE_FRAG if a page is released so that they are no longer full.
FREE: Extents that are completely unused and available to be allocated in whole to some purpose. A FREE extent could be allocated to a file segment (and placed on the appropriate INODE list), or moved to the FREE_FRAG list for individual page use.

File segments and inodes

File segments and inodes are perhaps where InnoDB terminology and documentation are murkiest. InnoDB overloads the term “inode”, which is in common use in filesystems, and uses it for both INODE entries (a single small structure) and INODE pages (a page type holding many INODE entries). Ignoring the naming confusion for a moment, an INODE entry in InnoDB merely describes a file segment (frequently called an FSEG), and will be referred to as a “file segment INODE” from now on. The INODE pages that contain them have the following structure:

Each INODE page contains 85 file segment INODE entries (for a 16 KiB page), each of which are 192 bytes. In addition, they contain a list node which is used in the following lists of INODE pages in the FSP_HDR‘s FSP header structure as described above:

FREE_INODES: A list of INODE pages which have at least one free file segment INODE entry.
FULL_INODES: A list of INODE pages which have zero free file segment INODE entries. When using “file per table” spaces, this list in each per-table space will be empty unless the table has more than 42 indexes, as each index consumes exactly two file segment INODE entries.

A file segment INODE entry has the following structure:

The purpose of each of the non-list fields in each INODE entry are:

File Segment ID: The ID of the file segment (FSEG) described by this file segment INODE entry. If the ID is 0, the entry is unused.
Magic Number: The value 97937874 is stored as a marker that this file segment INODE entry has been properly initialized.
Number of used pages in the NOT_FULL list: Exactly like the space’s FREE_FRAG list (in the FSP header), this field stores the number of pages used in the NOT_FULL list as an optimization to be able to quickly calculate the number of free pages in the list without iterating through all extents in the list.
Fragment Array: An array of 32 page numbers of pages allocated individually from extents in the space’s FREE_FRAG or FULL_FRAG list of “fragment” extents. Once this array becomes full, only full extents can be allocated to the file segment.

As a table grows it will allocate individual pages in each file segment until the fragment array becomes full, and then switch to allocating 1 extent at a time, and eventually to allocating 4 extents at a time.

List base nodes of extent descriptors are also present in each file segment INODE entry:

FREE: Extents that are completely unused and are allocated to this file segment.
NOT_FULL: Extents with at least one used page allocated to this file segment. When the last free page is used, the extent is moved to the FULL list.
FULL: Extents with no free pages allocated to this file segment. If a page becomes free, the extent is moved to the NOT_FULL list.

If the last used page is freed from an extent in the NOT_FULL list, the extent could be moved to the file segment’s FREE list, but in practice is actually moved directly back to the space’s FREE list instead.

How indexes use file segments

Although INDEX pages haven’t been described yet, one small aspect can be looked at now. The root page of each index’s FSEG header contains pointers to the file segment INODE entries which describe the file segments used by the index. Each index uses one file segment for leaf pages and one for non-leaf (internal) pages. This information is stored in the FSEG header structure (in the INDEX page):

The space IDs present are somewhat superfluous — they will always be the same as the current space. The page number and offset point to a file segment INODE entry in an INODE page. Both file segments will always be present, even though they may be completely empty.

For example in a newly created table, the only page that exists will be the root page, which is also a leaf page, but is present in the “internal” file segment (so that it doesn’t have to be moved later). The “leaf” file segment INODE lists and fragment array will all be empty. The “internal” file segment INODE lists will all be empty, and the single root page will be in the fragment array.

Tying it all together

The following diagram attempts to illustrate the entire multi-level structure for an index:

The index root page points to two inodes (file segments), each of which have a fragment array (pointing to up to 32 individual pages from a fragment list), as well as several lists of whole extents, which are linked together using list pointers in the extent descriptors. The extent descriptors are used both to reference an extent as well as to keep track of free pages within an extent. Easy!

What’s next?

Next we’ll look at the structure of one of the most important page types from a user’s perspective, INDEX pages. We’ll then look at how InnoDB structures its indexes at a high level.

The basics of InnoDB space file layout

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post.

InnoDB’s data storage model uses “spaces”, often called “tablespaces” in the context of MySQL, and sometimes called “file spaces” in InnoDB itself. A space may consist of multiple actual files at the operating system level (e.g. ibdata1, ibdata2, etc.) but it is just a single logical file — multiple physical files are just treated as though they were physically concatenated together.

Each space in InnoDB is assigned a 32-bit integer space ID, which is used in many different places to refer to the space. InnoDB always has a “system space”, which is always assigned the space ID of 0. The system space is used for various special bookkeeping that InnoDB requires. Through MySQL, InnoDB currently only supports additional spaces in the form of “file per table” spaces, which create an .ibd file for each MySQL table. Internally, this .ibd file is actually a fully functional space which could contain multiple tables, but in the implementation with MySQL, they will only contain a single table.

Pages

Each space is divided into pages, normally 16 KiB each (this can differ for two reasons: if the compile-time define UNIV_PAGE_SIZE is changed, or if InnoDB compression is used). Each page within a space is assigned a 32-bit integer page number, often called “offset”, which is actually just the page’s offset from the beginning of the space (not necessarily the file, for multi-file spaces). So, page 0 is located at file offset 0, page 1 at file offset 16384, and so on. (The astute may remember that InnoDB has a limit of 64TiB of data; this is actually a limit per space, and is due primarily to the page number being a 32-bit integer combined with the default page size: 2³² x 16 KiB = 64 TiB.)

A page is laid out as follows:

Every page has a 38-byte FIL header and 8-byte FIL trailer (FIL is a shortened form of “file”). The header contains a field which is used to indicate the page type, which determines the structure of the rest of the page. The structure of the FIL header and trailer are:

The FIL header and trailer contain the following structures (not in order):

The page type is stored in the header. This is necessary in order to parse the rest of the page data. Pages are allocated for file space management, extent management, the transaction system, the data dictionary, undo logs, blobs, and of course indexes (table data).
The space ID is stored in the header.
The page number is stored in the header once the page has been initialized. Checking that the page number read from that field matches what it should be based on the offset into the file is helpful to indicate that reading is correct, and this field being initialized indicates that the page has been initialized.
A 32-bit checksum is stored in the header, and an older format (and broken) 32-bit checksum is stored in the trailer. The older checksum could be deprecated and that space reclaimed at some point.
Pointers to the logical previous and next page for this page type are stored in the header. This allows doubly-linked lists of pages to be built, and this is used for INDEX pages to link all pages at the same level, which allows for e.g. full index scans to be efficient. Many page types do not use these fields.
The 64-bit log sequence number (LSN) of the last modification of the page is stored in the header, and the low 32-bits of the same LSN are stored in the trailer.
A 64-bit “flush LSN” field is stored in the header, which is actually only populated for a single page in the entire system, page 0 of space 0. This stores the highest LSN flushed to any page in the entire system (all spaces). This field is a great candidate for re-use in the rest of the space.

Space files

A space file is just a concatenation of many (up to 2³²) pages. For more efficient management, pages are grouped into blocks of 1 MiB (64 contiguous pages with the default page size of 16 KiB), and called an “extent”. Many structures then refer only to extents to allocate pages within a space.

InnoDB needs to do some bookkeeping to keep track of all of the pages, extents, and the space itself, so a space file has some mandatory super-structure:

The first page (page 0) in a space is always an FSP_HDR or “file space header” page. The FSP_HDR page contains (confusingly) an FSP header structure, which tracks things like the size of the space and lists of free, fragmented, and full extents. (A more detailed discussion of free space management is reserved for a future post.)

An FSP_HDR page only has enough space internally to store bookkeeping information for 256 extents (or 16,384 pages, 256 MiB), so additional space must be reserved every 16,384 pages for bookkeeping information in the form of an XDES page. The structure of XDES and FSP_HDR pages is identical, except that the FSP header structure is zeroed-out in XDES pages. These additional pages are allocated automatically as a space file grows.

The third page in each space (page 2) will be an INODE page, which is used to store lists related to file segments (groupings of extents plus an array of singly-allocated “fragment” pages). Each INODE page can store 85 INODE entries, and each index requires two INODE entries. (A more detailed discussion of INODE entries and file segments is reserved for a future post.)

Alongside each FSP_HDR or XDES page will also be an IBUF_BITMAP page, which is used for bookkeeping information related to insert buffering, and is outside the scope of this post.

The system space

The system space (space 0) is special in InnoDB, and contains quite a few pages allocated at fixed page numbers to store a wide range of information critical to InnoDB’s operation. Since the system space is a space like any other, it has the required FSP_HDR, IBUF_BITMAP, and INODE pages allocated as its first three pages. After that, it is a bit special:

The following pages are allocated:

Page 3, type SYS: Headers and bookkeeping information related to insert buffering.
Page 4, type INDEX: The root page of the index structure used for insert buffering.
Page 5, type TRX_SYS: Information related to the operation of InnoDB’s transaction system, such as the latest transaction ID, MySQL binary log information, and the location of the double write buffer extents.
Page 6, type SYS: The first rollback segment page. Additional pages (or whole extents) are allocated as needed to store rollback segment data.
Page 7, type SYS: Headers related to the data dictionary, containing root page numbers for the indexes that make up the data dictionary. This information is required to be able to find any other indexes (tables), as their root page numbers are stored in the data dictionary itself.
Pages 64-127: The first block of 64 pages (an extent) in the double write buffer. The double write buffer is used as part of InnoDB’s recovery mechanism.
Pages 128-191: The second block of the double write buffer.

All other pages are allocated on an as-needed basis to indexes, rollback segments, undo logs, etc.

Per-table space files

InnoDB offers a “file per table” mode, which will create a file (which as explained above is actually a space) for each MySQL table created. A better name for this feature may be “space per table” rather than “file per table”. The .ibd file created for each table has the typical space file structure:

Ignoring “fast index creation” which adds indexes at runtime, after the requisite 3 initial pages, the next pages allocated in the space will be the root pages of each index in the table, in the order they were defined in the table creation. Page 3 will be the root of the clustered index, Page 4 will be the root of the first secondary key, etc.

Since most of InnoDB’s bookkeeping structures are stored in the system space, most pages allocated in a per-table space will be of type INDEX and store table data.

What’s next?

Next we’ll look at free space management within InnoDB: extent descriptors, file segments (inodes), and lists.

A quick introduction to innodb_ruby

[This post refers to innodb_ruby version 0.8.8 as of February 3, 2014.]

In On learning InnoDB: A journey to the core I introduced a new library and command-line tool in the innodb_ruby project. Now I’ll show off a few of the things it can do. I won’t try to explain all of the InnoDB structures exposed, since that will get the demos here way off track. We’ll come back to those structures later on!

Installing `innodb_ruby`

If you’re familiar with Ruby and gems (or you just happen to have a well-configured Ruby installation), I regularly push innodb_ruby gems to RubyGems, so you should only need to:

gem install innodb_ruby

If that doesn’t work, you might want to check out The RubyGems manual to try and get your installation working. Or abandon all hope. :-D

When you have a working installation, you should have an innodb_space command in your path:

$ innodb_space 
Error: File must be provided with -f argument

Usage: innodb_space -f <file> [-p <page>] [-l <level>] <mode> [<mode>, ...]

Generating some data

For these examples, I need more than a few rows to exist in order to properly examine different data structures. Make sure you’re running a new enough server (MySQL 5.5 is good) with Barrracuda tables and that you have innodb_file_per_table enabled. Create and populate a very simple table with a small bit of Ruby:

#!/usr/bin/env ruby

require "mysql"

m = Mysql.new("127.0.0.1", "root", "", "test")

m.query("DROP TABLE IF EXISTS t")

m.query("CREATE TABLE t (i INT UNSIGNED NOT NULL, PRIMARY KEY(i)) ENGINE=InnoDB")

(1..1000000).to_a.shuffle.each_with_index do |i, index|
  m.query("INSERT INTO t (i) VALUES (#{i})")
  puts "Inserted #{index} rows..." if index % 10000 == 0
end

This should generate a table of 1 million rows (inserted in random order to make things more interesting), of about 48MiB, or 3,071 16KiB pages.

(Note that if you’re trying this at home, you’ll want to watch SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty' to wait for all dirty pages to be flushed before proceeding, since the tools below will be accessing the tablespace file on disk, with no coordination with a running InnoDB instance.)

Examining a tablespace file

One of the most high-level overviews possible with innodb_space is space-page-type-regions, which prints one line per contiguous block of a given page type:

$ innodb_space -f test/t.ibd space-page-type-regions
start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           37          35          INDEX               
38          63          26          FREE (ALLOCATED)    
64          2188        2125        INDEX               
2189        2239        51          FREE (ALLOCATED)    
2240        2240        1           INDEX               
2241        2303        63          FREE (ALLOCATED)    
2304        2304        1           INDEX               
2305        2367        63          FREE (ALLOCATED)    
2368        2368        1           INDEX               
2369        2431        63          FREE (ALLOCATED)    
2432        2432        1           INDEX               
2433        2495        63          FREE (ALLOCATED)    
2496        2496        1           INDEX               
2497        2687        191         FREE (ALLOCATED)

Without getting into too many of the InnoDB internal implementation details, you see some of InnoDB’s bookkeeping structures (FSP_HDR, IBUF_BITMAP, and INODE pages), actual table data (INDEX pages), and free space (FREE (ALLOCATED) pages).

A listing of space consumed, in pages, by each index (actually each “file segment”, or FSEG for each index) can be fairly interesting as well:

$ innodb_space -f test/t.ibd space-indexes
id          root        fseg        used        allocated   fill_factor 
15          3           internal    3           3           100.00%     
15          3           leaf        2162        2528        85.52%

Every index has an “internal” file segment, used for non-leaf pages, and a “leaf” file segment, used for leaf pages. Pages may be allocated to a file segment but currently unused (type FREE (ALLOCATED)), so “fill_factor” will show the ratio of used to unused. (Keep in mind this has no relation to how full the index pages are, that is another matter.)

Examining a single page

The page-dump mode dumps everything it knows about a single page. It currently leans heavily on the typical Ruby pretty-printer module pp to print the structures — that would be a great thing to clean up in the future. The innodb_ruby library initially parses pages using a minimal Innodb::Page class, and then using the type field present in the common header optionally hands off the different page types to specialized classes (such as Innodb::Page::Index for type INDEX) for further parsing.

A good page to start looking would be the first INDEX page, which is the root node of the index tree for the test table created above, and is located at page 3:

$ innodb_space -f test/t.ibd -p 3 page-dump

The initial line will tell you which class is handling this page:

#<Innodb::Page::Index:0x007fe304855360>:

The FIL header is printed next:

fil header:
{:checksum=>621772966,
 :offset=>3,
 :prev=>nil,
 :next=>nil,
 :lsn=>102947976,
 :type=>:INDEX,
 :flush_lsn=>0,
 :space_id=>1}

The FIL header (and footer) is common to all page types and contains primarily information about the page itself.

Additional information follows depending on the page type; for INDEX pages the following information is dumped:

the “page header”, information about the index page
the “fseg header”, information related to space management for the file segments (groups of extents) used by this index
a summary of sizes (in bytes) of different parts of the page: free space, data space, record size, etc.
the system records, infimum and supremum
the contents of the page directory, which is used to make record searches more efficient
the user records, the actual data stored by the user (the fields of which will not be parsed unless a record “describer” has been loaded)

Looking at index space consumption

It’s possible to see some of the most useful space-consumption related data for all index pages by using the space-index-pages-summary mode:

$ innodb_space -f test/t.ibd space-index-pages-summary | head -n 10
page        index   level   data    free    records 
3           15      2       26      16226   2       
4           15      0       9812    6286    446     
5           15      0       15158   860     689     
6           15      0       10912   5170    496     
7           15      0       10670   5412    485     
8           15      0       12980   3066    590     
9           15      0       11264   4808    512     
10          15      0       4488    11690   204     
11          15      0       9680    6418    440

This allows you to see the amount of data and free space, and a record count for the table with a minimal fuss.

If a working gnuplot is present and the Ruby gnuplot gem is installed, it’s also very easy to make an useful (although not very pretty) scatter plot of this information:

$ innodb_space -f test/t.ibd space-index-pages-free-plot
Wrote t_free.png

The plots produced by space-index-pages-free-plot look like:

Free Space Plot – The Y axis indicates the amount of free space in each page, while the X axis is the page number, and represents file offset as well. Click for a full-size version.

Making sense of row data

In order to be really useful at examining real tables, innodb_ruby needs to be provided with some way to understand the table schema. This is done in the form of a “describer” class which can be loaded dynamically. This is one aspect of the innodb_ruby library that is not terribly well documented (or well designed, yet). A simple describer class for the above table (i INT UNSIGNED NOT NULL, PRIMARY KEY (i) and no other columns or indexes) would look like:

class SimpleTDescriber < Innodb::RecordDescriber
  type :clustered
  key "i", :INT, :UNSIGNED, :NOT_NULL
end

If this class is saved in a file simple_t_describer.rb, it can be loaded (require‘ed) in innodb_space with -r <file> and enabled with -d <class> arguments:

$ innodb_space -f test/t.ibd -r /path/to/simple_t_describer.rb -d SimpleTDescriber <mode>

Having a working record describer loaded does primarily two things:

Enable record parsing and dumping in page-dump mode. This will cause :key and :row keys to be populated in the records dumped, as well as make the transaction ID and roll pointer keys available (they are stored in-between the key and non-key fields, so are not reachable without knowing how to parse at least the key fields).
Allow use of all index recursion functions, including the index-recurse mode. The ability to parse records is required in order to parse InnoDB’s internal B+tree “node pointer records” which link the B+tree pages together.

Some sample page dumps with full records printed are available: test_t_page_3_page_dump.txt (the index root page) and test_t_page_4_page_dump.txt (an index leaf page).

Recursing an index

Once a record describer is available, indexes can be recursed using index-recurse:

$ innodb_space -f test/t.ibd -r /path/to/simple_t_describer.rb -d SimpleTDescriber -p 3 index-recurse
ROOT NODE #3: 2 records, 26 bytes
  NODE POINTER RECORD >= (i=252) -> #36
  INTERNAL NODE #36: 1117 records, 14521 bytes
    NODE POINTER RECORD >= (i=252) -> #4
    LEAF NODE #4: 446 records, 9812 bytes
      RECORD: (i=1) -> ()
      RECORD: (i=2) -> ()
      RECORD: (i=3) -> ()
      RECORD: (i=4) -> ()
      RECORD: (i=5) -> ()

This will actually walk the B+tree in ascending order (basically a full-table scan) while printing out some information about each node (page) encountered and dumping user records on leaf pages. A larger sample (10k lines) of its output is available here: test_t_page_3_index_recurse.txt.

More to come in the future

I hope this has been a useful first introduction. There is a lot more to come in the future. Patches, comments, and advice are very welcome!

Update 1: Davi pointed out several typos and errors which have been corrected. ;) Make sure you’re using the newest code from examples above.

Everything is an index in InnoDB

Overview of INDEX page structure

The INDEX header

Record format: redundant versus compact

An aside on record pointers

System records: infimum and supremum

The infimum record

The supremum record

User records

The page directory

Free space

What’s next?

A minimal, empty table

A table with one million rows

Extents and extent descriptors

List base nodes and list nodes

The file space header and extent lists

File segments and inodes

How indexes use file segments

Tying it all together

What’s next?

Pages

Space files

The system space

Per-table space files

What’s next?

Installing innodb_ruby

Generating some data

Examining a tablespace file

Examining a single page

Looking at index space consumption

Making sense of row data

Recursing an index

More to come in the future

Overview of `INDEX` page structure

The `INDEX` header

Installing `innodb_ruby`