The basics of the InnoDB undo logging and history system

InnoDB implements multi-version concurrency control (MVCC), meaning that different users will see different versions of the data they are interacting with (sometimes called snapshots, which is a bit of a misleading term). This is done in order to allow users to see a consistent view of the system without expensive and performance-constraining locking which would limit concurrency. (This is where the “concurrency control” part of the term comes from; one alternative is locking everything the user may need.) Undo logging and InnoDB’s “history” system are the mechanisms that underlie its implementation of MVCC, but the way this works is generally very poorly understood.

InnoDB keeps a copy of everything that is changed

The key thing to know in InnoDB’s implementation of MVCC is that when a record is modified, the current (“old”) version of the data being modified is first stashed away as an “undo record” in an “undo log”. It’s called an undo log because it contains the information necessary to undo the change made by the user, reverting the record to its previous version.

Every record contains a reference to its most recent undo record, called a rollback pointer or ROLL_PTR, and every undo record contains a reference to its previous undo record (except for an initial record insert, which can be undone by simply deleting the record), forming a chain of all previous versions of a record. In this way, any previous version of a record can be easily constructed, as long as the undo records (the “history”) still exist in the undo logs.
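This version chain can be pictured with a small Python sketch. These are toy structures only; names like roll_ptr merely echo InnoDB’s terminology and are nothing like its actual record format:

```python
# Toy sketch of an undo chain: each record version keeps a "roll_ptr" to the
# undo record holding the previous version, forming a chain back in time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoRecord:
    old_value: str
    trx_id: int                      # transaction that wrote this old version
    prev: Optional["UndoRecord"]     # next-older undo record; None at insert

@dataclass
class Record:
    value: str
    trx_id: int                      # transaction that wrote the current version
    roll_ptr: Optional[UndoRecord]   # most recent undo record

def update(rec: Record, new_value: str, trx_id: int) -> None:
    # Stash the current version as an undo record before overwriting it.
    rec.roll_ptr = UndoRecord(rec.value, rec.trx_id, rec.roll_ptr)
    rec.value, rec.trx_id = new_value, trx_id

def version_visible_to(rec: Record, max_trx_id: int) -> Optional[str]:
    # Walk the undo chain until a version old enough to see is found.
    value, trx_id, undo = rec.value, rec.trx_id, rec.roll_ptr
    while trx_id > max_trx_id:
        if undo is None:
            return None  # record did not exist yet for this reader
        value, trx_id, undo = undo.old_value, undo.trx_id, undo.prev
    return value

rec = Record("v1", trx_id=10, roll_ptr=None)
update(rec, "v2", trx_id=20)
update(rec, "v3", trx_id=30)
print(version_visible_to(rec, 25))  # → v2
print(version_visible_to(rec, 5))   # → None (record "didn't exist" yet)
```

A reader that is only allowed to see transactions up to ID 25 walks past the version written by transaction 30 and reconstructs “v2”; a reader from before the insert sees nothing at all.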

Transactions always operate on the “live” data — there are no private copies

Any transaction1, no matter how small or temporary it may be, is always operating on the database. As records are added, modified, and deleted, this is done in the very same index structure that all other transactions and users are actively using. Although the data for these in-flight transactions may not be visible to other transactions (depending on their transaction isolation level), the effects—particularly the performance costs—associated with those modifications are immediately visible.

When reading an index, a transaction uses a “read view”, which controls what version of records a transaction is allowed to see. While reading records in the index, any recently modified record (modified by a transaction whose ID is newer than the reading transaction’s read view would allow it to see) must first be reverted to an old-enough version. (And this may cause the record to not be visible at all.)

When a transaction updates a record but has not yet committed, all other transactions using transaction isolation are immediately impacted: each must revert that record to an older version (one it is allowed to see) every time it encounters the record in a read.
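A minimal sketch of the visibility check a read view performs, assuming a simplified view that records only the transactions active at creation time and the next transaction ID to be assigned (InnoDB’s real rules have more cases):

```python
# Toy read-view visibility check (simplified from InnoDB's actual rules):
# a read view snapshots the set of transactions active at creation time.
class ReadView:
    def __init__(self, active_trx_ids, next_trx_id):
        self.active = set(active_trx_ids)  # uncommitted when the view was made
        self.low_limit = next_trx_id       # IDs >= this did not exist yet

    def sees(self, trx_id: int) -> bool:
        # A change is visible only if its transaction had already committed
        # when this read view was created.
        return trx_id < self.low_limit and trx_id not in self.active

view = ReadView(active_trx_ids={42, 47}, next_trx_id=50)
print(view.sees(30))  # True: committed before the view was created
print(view.sees(42))  # False: was still active (uncommitted)
print(view.sees(55))  # False: started after the view was created
```

Any record whose current version fails this check must be reverted through its undo chain before it can be returned to the reader.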

What about transaction isolation levels?

There are three transaction isolation levels of interest for undo logging, history, and multi-versioning:

  • READ UNCOMMITTED — Also known as “dirty read”, because it literally always uses the newest data in the index without regard to transaction isolation at all, potentially reading data which isn’t currently (and may never be) committed. Even within a single statement, transactional inconsistencies may be seen from one record to the next, because no record is ever reverted to a previous version during a read.
  • READ COMMITTED — A new read view is used for each statement, based on the current maximum committed transaction ID at statement start. Records read or returned within the statement will still be consistent with each other, but from statement to statement the user will see new data.
  • REPEATABLE READ — The default for MySQL/InnoDB. A read view is created at transaction start, and that read view is used for all statements within the transaction, allowing for a consistent view of the database from statement to statement. That is, reads of data are “repeatable” within the transaction.

(Additionally there is one more transaction isolation level supported by MySQL/InnoDB, called SERIALIZABLE, but it is primarily a difference in locking, not transaction visibility, compared to REPEATABLE READ.)
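The difference in read-view timing between READ COMMITTED and REPEATABLE READ can be illustrated with a toy simulation; the dict standing in for committed data and the snapshot function are purely illustrative:

```python
# Toy illustration of read-view timing: REPEATABLE READ keeps one snapshot
# for the whole transaction; READ COMMITTED takes a new one per statement.
committed = {"x": 1}  # stand-in for the committed state of the database

def snapshot():
    return dict(committed)

def transaction(isolation, statements):
    view = None
    results = []
    for _ in range(statements):
        if view is None or isolation == "READ COMMITTED":
            view = snapshot()        # fresh read view for this statement
        results.append(view["x"])    # REPEATABLE READ reuses the first view
        committed["x"] += 1          # another session commits a change
    return results

print(transaction("REPEATABLE READ", 3))  # [1, 1, 1]: repeatable reads
committed["x"] = 1
print(transaction("READ COMMITTED", 3))   # [1, 2, 3]: new data each statement
```

Under REPEATABLE READ the concurrent commits are invisible for the life of the transaction; under READ COMMITTED each statement sees whatever was committed when it started.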

In the normal course of accessing an index, some small number of records will need to be reverted to a previous version in order to satisfy the transaction isolation requirements imposed by the system. This has a cost, but as long as the read view of the transaction is fairly new, most records will not require reversion, and there is very little performance cost to doing this.

Long-running transactions and queries

It is common, but mostly unsubstantiated, wisdom that long-running transactions are “bad” in MySQL — but why is that? There are two reasons that long-running transactions can cause problems for MySQL:

  1. Extremely old read views. A long-running transaction (especially in the default REPEATABLE READ isolation level) will have an old read view. In a write-heavy database this may require reverting the version of very many rows to very old versions. This will slow down the transaction itself and in the worst case may mean that very long-running queries in a write-heavy database can never actually complete; the longer they run the more expensive their reads get. They can spiral into performance death eventually.
  2. Delaying purge. Because a long-running transaction has an old (potentially very old) read view, purging of undo logs (history) for the entire system will be stalled until the transaction completes. This can cause the total size of the undo logs to grow (rather than re-using the same space over and over as it normally would), causing the system tablespace (ibdata1) to grow—and of course due to other limitations, it can’t be shrunk later.

If a very long-running transaction (or query) is needed, it’s well worth considering whether it could use dirty reads in READ UNCOMMITTED isolation level in order to avoid these problems.

Deleting isn’t really deleting

Whenever a record is deleted, other transactions may still need to see the record as it existed, due to transaction isolation. If, upon delete, the record was immediately removed from the index, other transactions wouldn’t be able to find it, and thus would also not be able to find its reference to the previous record version they may need. (Keep in mind that any number of transactions may see the record in any number of versions, so five different transactions may see up to five separate versions of the record.) In order to handle this, DELETE doesn’t actually delete anything: instead it delete marks the record, flipping a “deleted” flag on.
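A toy sketch of delete-marking, with hypothetical structures and the undo-chain reversion step elided:

```python
# Toy delete-marking: DELETE flips a flag rather than removing the record,
# so older read views can still find the record and follow its undo chain.
records = {100: {"value": "hello", "deleted": False, "trx_id": 7}}

def delete(key, trx_id):
    rec = records[key]
    rec["deleted"] = True   # delete-mark only; purge removes it much later
    rec["trx_id"] = trx_id

def read(key, max_visible_trx_id):
    rec = records.get(key)
    if rec is None:
        return None
    if rec["trx_id"] > max_visible_trx_id:
        # A real implementation would revert via the undo chain here.
        return "needs-undo-revert"
    return None if rec["deleted"] else rec["value"]

delete(100, trx_id=20)
print(read(100, max_visible_trx_id=25))  # None: this reader sees the delete
print(read(100, max_visible_trx_id=10))  # "needs-undo-revert": older view
```

Because the record is still physically present in the index, the older reader can locate it and reconstruct the pre-delete version it is entitled to see.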

Global history and purge operations

In addition to every record having a reference to its previous version, there is also a global view of the history of the entire database, called the “history list”. As each transaction is committed, its history is linked into this global history list in transaction serialization (commit) order. The history list is used primarily for cleaning up after a transaction, once no existing read view still needs its history (all other transactions have completed).

In the background, InnoDB runs a continuous “purge” process which is responsible for two things:

  1. Actually deleting delete-marked records2, if the current version of the record in the index at the time of purge is still delete-marked and bears the same transaction ID. (That is, the record hasn’t been re-inserted.)
  2. Freeing undo log pages and unlinking them from the global history list to make them available for re-use.
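A rough sketch of this purge logic under the stated conditions, using toy data structures only:

```python
# Toy purge pass over the history list (oldest first): undo entries older
# than every open read view can be freed, and delete-marked records whose
# current version still carries the purged transaction's ID are removed.
def purge(history, records, oldest_active_view_trx_id):
    freed = []
    while history and history[0]["trx_id"] < oldest_active_view_trx_id:
        entry = history.pop(0)           # no read view can need it any more
        rec = records.get(entry["key"])
        if rec and rec["deleted"] and rec["trx_id"] == entry["trx_id"]:
            del records[entry["key"]]    # record was not re-inserted: remove
        freed.append(entry["trx_id"])
    return freed

records = {1: {"deleted": True, "trx_id": 10}, 2: {"deleted": False, "trx_id": 12}}
history = [{"key": 1, "trx_id": 10}, {"key": 2, "trx_id": 12}, {"key": 1, "trx_id": 30}]
print(purge(history, records, oldest_active_view_trx_id=20))  # [10, 12]
print(1 in records)  # False: the delete-marked record was purged
```

Note how the entry from transaction 30 is left alone: some read view might still need it, which is exactly how a long-running transaction stalls purge for the whole system.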

InnoDB exposes the total amount of history present in the system as a “History list length”, which can be seen in SHOW ENGINE INNODB STATUS. This is the count of all database modifications present in the undo logs, in units of undo logs (which may contain a single record modification or many).

What’s next?

Next, the physical structure of the undo logs, undo records, and history will be examined.

1 Note that InnoDB does not start a transaction internally when a BEGIN or START TRANSACTION is issued; this happens only after the first read, or immediately if START TRANSACTION WITH CONSISTENT SNAPSHOT is executed.

2 Of course actually deleting the record still doesn’t actually delete it — it just links the record into a “garbage” list, where record space may be reused. However nothing is guaranteed to be really removed from a page until the page is re-organized.

Speaking at this year’s MySQL conference

For this year’s Percona Live: MySQL Conference and Expo 2014 (PLMCE), Davi Arnaut and I submitted a couple of co-speaking sessions proposals, both of which were accepted. Please come and see us and learn some more about InnoDB at the following sessions:

InnoDB: A journey to the core II

In this new installment of the series, we explore the most crucial structures for crash recovery and consistency, the redo and undo logs. This presentation dissects the structure and organization of InnoDB’s redo and undo logs, providing insight into how data changes occur at the InnoDB layer and how it guarantees that it can recover to a consistent state in the event of a crash. This presentation will increase your understanding of how data travels through InnoDB.

InnoDB: A hands-on exploration of on-disk storage with innodb_ruby

Last year, we demoed innodb_ruby in InnoDB: A journey to the core with very small snippets; this year we’ll walk you through exploring InnoDB’s on-disk storage for yourself. We’ll provide sample data files and a working environment for you to explore along with us, and demonstrate usage of the tool’s many features in realistic scenarios.

The innodb_ruby library and tools can be used to:

  • Understand byte-by-byte the exact data stored in InnoDB files, pages, and records.
  • Make better informed decisions about database and schema design.
  • Discover the storage efficiency of InnoDB for different table schemas.
  • Uncover storage-related bugs and problems with different table schemas and access patterns.

In addition to the provided data files, feel free to bring your own computer, data files, and create table statements (nothing confidential, please!) and we’ll help you get set up to explore them on your own. We’ll be on-hand to explain and answer any questions you may have about InnoDB storage internals.

InnoDB Tidbit: The doublewrite buffer wastes 32 pages (512 KiB)

In my ongoing quest to completely understand InnoDB’s data storage, I came across a quite small and inconsequential waste, which is nevertheless fun to write about. I noticed the following block of pages which were allocated very early in the ibdata1 system tablespace but apparently unused (unnecessary lines removed from output):

$ innodb_space -f ibdata1 space-page-type-regions

start       end         count       type                
13          44          32          ALLOCATED           

Background on the doublewrite buffer

Most people using InnoDB have heard of the “doublewrite buffer”—part of InnoDB’s page flushing strategy. The doublewrite buffer is used as a “scratch area” to write (by default) 128 pages contiguously before flushing them out to their final destinations (which may be up to 128 different writes). The MySQL manual says, in “InnoDB Disk I/O”:

InnoDB uses a novel file flush technique involving a structure called the doublewrite buffer. It adds safety to recovery following an operating system crash or a power outage, and improves performance on most varieties of Unix by reducing the need for fsync() operations.

Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.

The doublewrite buffer needs to be accounted for, too

Normally the doublewrite buffer consists of two extents, each of which is 64 contiguous pages (1 MiB), for a total of 128 pages, or 2 MiB. However, InnoDB can’t just blindly borrow those two extents; it must account for them in the space file. To do this, it creates a file segment (aka Fseg) and uses an Inode to point to it. In Page management in InnoDB space files I described how file segments contain:

  • An array of up to 32 individually-allocated “fragment” pages
  • A list of “full” extents (no pages free)
  • A list of “not full” extents (partially allocated)
  • A list of “free” extents (no pages allocated)

Causing full extents to be allocated

Allocations to a file segment will always fill up the fragment array before allocating complete extents. The doublewrite buffer is strangely not special in this case. The code which allocates it uses the following loop in trx/trx0sys.c at line 335:

        for (i = 0; i < 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE
                     + FSP_EXTENT_SIZE / 2; i++) {

This is unfortunately written without any comments of note, but it is allocating a total of 160 pages:

  • FSP_EXTENT_SIZE / 2 → 64 / 2 → 32 pages
  • 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE → 2 * 64 → 128 pages

The initial 32 pages allocated are there purely to cause the fragment array to be filled up, thus forcing the fseg_alloc_free_page calls that follow to start allocating complete extents for the remaining 128 pages (that the doublewrite buffer actually needs). The code then checks which extents were allocated and adds the initial page numbers for those extents to the TRX_SYS header as the doublewrite buffer allocation. In a typical system, InnoDB would allocate the following pages:
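The fill-fragments-first behavior that drives this can be sketched with a toy allocator; this is not InnoDB’s code, and page addressing is heavily simplified:

```python
# Toy file-segment allocator mirroring the described policy: fill the
# 32-slot fragment array with single pages first, then allocate whole
# 64-page extents.
FRAG_SLOTS = 32
EXTENT_SIZE = 64

def allocate(fseg, next_page):
    """Allocate one page to the segment; returns how it was satisfied."""
    if len(fseg["frag"]) < FRAG_SLOTS:
        fseg["frag"].append(next_page)
        return "fragment"
    # Fragment array full: take (or continue filling) a whole extent.
    if not fseg["extents"] or fseg["extents"][-1]["used"] == EXTENT_SIZE:
        fseg["extents"].append({"start": next_page, "used": 0})
    fseg["extents"][-1]["used"] += 1
    return "extent"

fseg = {"frag": [], "extents": []}
kinds = [allocate(fseg, page) for page in range(160)]
print(kinds.count("fragment"))  # 32: the perpetually unused fragment pages
print(len(fseg["extents"]))     # 2: the extents the doublewrite buffer uses
```

Running the 160-iteration loop against this allocator yields exactly the outcome described: 32 fragment pages burned, and two full extents for the doublewrite buffer.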

  • Fragment pages 13-44 — Perpetually unused fragment pages, but left allocated to the file segment for the doublewrite buffer.
  • Extent starting at page 64, ending at page 127 — Block 1 of the doublewrite buffer in practice.
  • Extent starting at page 128, ending at page 191 — Block 2 of the doublewrite buffer in practice.
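The arithmetic behind the loop bound can be double-checked quickly:

```python
# Reproducing the allocation arithmetic: 32 fragment-array pages plus
# 2 * 64 extent pages = 160 page allocations in the loop, of which only
# the two full extents (128 pages) are used by the doublewrite buffer.
FSP_EXTENT_SIZE = 64
TRX_SYS_DOUBLEWRITE_BLOCK_SIZE = FSP_EXTENT_SIZE  # one block = one extent

total = 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE + FSP_EXTENT_SIZE // 2
wasted = FSP_EXTENT_SIZE // 2
print(total)   # 160 pages allocated by the loop
print(wasted)  # 32 pages wasted: 512 KiB at the default 16 KiB page size
```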

Using innodb_ruby to dump file segments (by inode)

I recently added new space-inodes-detail and space-inodes-summary modes to the innodb_space program in innodb_ruby which can handily show exactly which pages and extents are allocated to a given file segment (trimmed for clarity and reformatted for line wrap; normally printed on a single long line):

$ innodb_space -f ibdata1 space-inodes-detail

INODE fseg_id=15, pages=160,
  frag=32 pages (13, 14, ..., 43, 44),
  full=2 extents (64-127, 128-191),
  not_full=0 extents () (0/0 pages used),
  free=0 extents ()

Here you can clearly see the two complete extents in the file segment’s “full” list, along with the 32 fragment pages.


There are a few ways this could have been avoided, such as freeing the individual pages after the two extents were allocated, or adding a special “no fragments” allocation method. However, as I said at the start, it is pretty inconsequential, as it amounts to only 512 KiB per installation. The code could definitely use a rewrite for clarity, though, given the nuanced behavior. It could also stand to use one of the existing defines such as FSEG_FRAG_ARR_N_SLOTS or FSEG_FRAG_LIMIT rather than repeating the basically unexplained calculation FSP_EXTENT_SIZE / 2. Additionally, rewriting it to use a more meaningful loop structure would be helpful; there’s no reason it needs to allocate all three sets of pages in the same for loop (especially without comments).

How does InnoDB behave without a Primary Key?

This afternoon, Arjen Lentz and I were discussing InnoDB’s behavior without a declared PRIMARY KEY, and the topic felt interesting enough and undocumented enough to warrant its own short post.

Background on InnoDB clustered keys

In The physical structure of InnoDB index pages I described how “Everything is an index in InnoDB”. This means that InnoDB must always have a “cluster key” for each table, which is normally the PRIMARY KEY. The manual has this to say in Clustered and Secondary Indexes:

If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.

I had previously assumed this meant that an invisible column would be used along with the same sequence generation code that is used to implement auto_increment (which itself has some scalability issues). However the reality is that they are completely different implementations.

Implementation of implicit Row IDs

How this is actually implemented is, as the manual says, if a table is declared with no PRIMARY KEY and no non-nullable UNIQUE KEY, InnoDB will automatically add a 6-byte (48-bit) integer column called ROW_ID to the table, and cluster the data based on that column. The column won’t be accessible to any queries nor usable for anything internally such as row-based replication.

What the manual doesn’t mention is that all tables using such ROW_ID columns share the same global sequence counter (the manual says “increases monotonically” and doesn’t clarify), which is part of the data dictionary. The maximum used value for all row IDs (well, technically the next ID to be used) is stored in the system tablespace (e.g. ibdata1) in page 7 (type SYS), within the data dictionary header (field DICT_HDR_ROW_ID).

This global sequence counter is protected by dict_sys->mutex, even for incrementing (as opposed to using atomic increment). The implementation is in include/dict0boot.ic (many blank lines deleted):

    39  row_id_t
    40  dict_sys_get_new_row_id(void)
    41  /*=========================*/
    42  {
    43          row_id_t        id;
    45          mutex_enter(&(dict_sys->mutex));
    47          id = dict_sys->row_id;
    49          if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
    51                  dict_hdr_flush_row_id();
    52          }
    54          dict_sys->row_id++;
    56          mutex_exit(&(dict_sys->mutex));
    58          return(id);
    59  }

(You may also notice that this code lacks any protection for overflowing the 48 bits allotted to row IDs. That is unnecessarily sloppy coding, but even at a continuous 1 million inserts per second [which is probably a bit optimistic ;)] it would take about 9 years to exhaust the ID space. I guess that’s okay.)

Ensuring non-conflicting IDs are generated

The counter is flushed to disk every 256th ID generated (the define DICT_HDR_ROW_ID_WRITE_MARGIN above), by modifying the value in the SYS data dictionary page, which is logged to the transaction log. On startup, InnoDB will increase the DICT_HDR_ROW_ID stored on disk by at least 256, and at most 511. This ensures that any IDs generated will have been less than the new starting value, and thus there will not be any conflicts.
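A sketch of this flush-margin scheme in Python; it is simplified in that it flushes at the start of each 256-ID batch and assumes the stored value is always a multiple of the margin:

```python
# Sketch of the described persistence scheme: the counter is written to the
# data dictionary only every DICT_HDR_ROW_ID_WRITE_MARGIN IDs, and recovery
# rounds the stored value up past the margin so reissued IDs can't conflict.
DICT_HDR_ROW_ID_WRITE_MARGIN = 256

class RowIdCounter:
    def __init__(self, on_disk=0):
        self.row_id = on_disk   # in-memory next ID
        self.on_disk = on_disk  # value persisted in the data dictionary

    def next_id(self):
        new_id = self.row_id
        if new_id % DICT_HDR_ROW_ID_WRITE_MARGIN == 0:
            self.on_disk = new_id  # flush once per batch of 256 IDs
        self.row_id += 1
        return new_id

    def recover(self):
        # On startup, skip past any IDs that may have been handed out but
        # never flushed, guaranteeing freshly generated IDs are all new.
        margin = DICT_HDR_ROW_ID_WRITE_MARGIN
        self.row_id = (self.on_disk // margin + 1) * margin
        self.on_disk = self.row_id
        return self.row_id

c = RowIdCounter()
ids = [c.next_id() for _ in range(300)]  # "crash" with unflushed IDs
print(c.recover() > max(ids))  # True: no previously used ID is reissued
```

Even though the last 44 of those 300 IDs were never reflected on disk, the post-recovery starting value is safely above every ID that could have been handed out.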

Performance and contention implications

Given how much other code within InnoDB is protected by dict_sys->mutex I think it’s fair to say any tables with an implicit clustered key (ROW_ID) could expect to experience random insert stalls during operations like dropping (unrelated) tables. Parallel insertion into multiple tables with implicit keys could be performance-constrained, as it will be serialized on both the shared mutex and cache contention for the shared counter variable. Additionally, every 256th value generated will cause a log write (and flush) for the SYS page modification, regardless of whether the transaction has committed yet (or ever will).

MySQL Community Contributor of the Year 2013

First of all, thank you to everyone who nominated me, voted for me, and to those of you who shared kind words with me and congratulated me. It’s humbling to have been awarded one of the “MySQL Community Contributor of the Year” awards for 2013. Many people have asked or wondered without asking why I do what I do, and how I got here. Given the occasion, I thought I would share some thoughts on that.

Early days as a user in web development

I started working with MySQL (and before that, mSQL) back in 1998 while working with a web development company. MySQL worked quite well (and I pretty quickly forgot about mSQL), and I started to learn more and more about it. As many new users at the time, I hit a few bugs or quirks, and poked at the code from time to time to understand what was going on. I continued just being a user of MySQL into 1999, and started to build more and more complex applications.

A first bug report

In October 1999, I encountered a crashing bug (a floating point exception which wasn’t caught) with SELECT FLOOR(POW(2,63)) on MySQL 3.22 and FreeBSD 3.3, and I made my first MySQL bug report by emailing the mailing list. After a short discussion with Sinisa Milivojevic and Monty Widenius, Monty agreed to fix the bug. Of course I watched with bright eyes, I read the code for the fix, and I worked to understand it.

The mailing list and IRC drew me in

I was hooked. I found an actual problem, as a 17 year old hacker sitting in Kansas, and I worked with these nice folks who I’ve never met, halfway around the world in Cyprus and Finland, and they agreed to do work for me to fix it, and they didn’t even complain. They were genuinely happy to help me.

I joined the mailing list to report that bug, but I stayed subscribed to it and read every mail. I browsed the archives and learned how the (tiny) community worked at the time. I joined the #mysql IRC channel on EFnet and started listening there as well.

Helping out on the mailing list and on IRC

While lurking on the mailing list and on IRC, I quickly realized that there were a lot of people with problems and questions that I could help out with. I knew some of the answers! I answered things where I knew the answer, and I worked to answer questions that I didn’t know. Through experimentation and reading the MySQL documentation and source code to solve other people’s problems, I learned an amazing amount.

Improving the documentation

In the process of doing web development work, and of helping out answering other people’s questions, I found that the MySQL manual was moderately technically complete, but very messy, sometimes buggy, and strangely worded. I poked around until I could figure out how the manual itself worked. I learned about this weird Texinfo format it was written in. Once I got things to build, I undertook an initial editing of the MySQL manual by reading through the entire Texinfo file, fixing typos and rewording things. I checked examples in the manual against an actual server and cleaned up broken examples and incomplete documentation.

Hey, this is more fun than my real job

I was then working at a web development company in Nashville, and realized that I wasn’t very happy doing that work. At the same time, the company started to melt down, and I began interviewing elsewhere. I spent more and more time doing work on MySQL (sometimes instead of work I should’ve been doing). Contributing to MySQL and working with the MySQL community made me much happier than any other work I had done so far.

Monty??? Hire me!

I don’t actually remember how I initially contacted Monty about this (although he probably still has the email archives), but he and I exchanged emails. He offered that I should come to an upcoming developer meeting in Monterey, California in July of 2000, coinciding with OSCON 2000. I jumped at the chance. I mentioned the invitation to Indrek Siitan on IRC, and he invited me to join a planned road trip to Monterey with some of the earliest MySQL employees: himself, Matt Wagner, and Tim Smith.

No interstates, no hotels, nothing but love

Although I wasn’t an employee yet, and had never met any of them in person, Matt Wagner drove from Minnesota and Tim Smith drove from North Carolina to my house in Tennessee. We piled in Matt’s pickup truck and drove from there down to Louisiana to pick up Indrek. The four of us drove in two cars from New Orleans to Monterey for about 10 days, with a plan to use no interstates—only highways—and camp each night.

I was an almost completely broke and unemployed kid, and they paid for almost everything and took me along—as a friend—across the entire country. I got to know my first few MySQL employees through those many hours in the car talking about life, technology, MySQL, and anything that came up. We had a lot of fun and they showed literally nothing but love. We all became fast friends and they accepted me without hesitation. This became my canonical example of the MySQL community, and still is, even to this day.

Meeting the team

We arrived in Monterey and I (a random non-employee) got to sit in all of the internal company meetings and technical discussions. I got to have a say in how MySQL was being made, and I got to argue with the very creators of MySQL. They not only listened, but respected me and valued my opinion. I mostly just listened through these meetings and got to know everyone, but this was an amazing experience.

At some point later in the meeting, Monty and I met, and he offered me a job at MySQL. I accepted it without hesitation and jumped into my official MySQL career head-first. My first paycheck was wired directly from Monty’s personal bank account in Finland, because there was some trouble setting up payroll for me, and Monty was concerned about making sure I got paid quickly.

Documenting MySQL, and a foray into Swedish and Swenglish

My first tasks were all about making the MySQL documentation better. I made several complete passes through the manual, reading and correcting it. I did some fairly major restructuring of the order of the sections, and normalized the structure as much as possible. (I also got quite good at reading Texinfo documents unformatted and visualizing the formatting.)

I started studying Swedish in order to understand all of the source code comments, variable and function names, and the Swedish error messages. I translated many of these remnants of Swedish and Swenglish as some of my first contributions to the actual codebase, and I did a lot of other easy formatting and fixing work while learning how the code worked. I figured out where all the functions and syntax were defined in order to make sure all elements of the syntax were documented.

A new life as a MySQLer

While at MySQL, I initially worked on documentation and helped out with support, and when customers needed help in person, I flew around and consulted with them. Kaj Arnö’s company Polycon’s training group was acquired by MySQL, and I started helping out with that training. They needed someone to teach training classes, so I started doing that too, eventually managing the whole group.

Ever present in the MySQL world

Since then I have had the opportunity to be a part of a lot of amazing things, and have made sure that every new opportunity and every new job undertaken gives me ample opportunity and motivation to continue being part of the MySQL community. Why? It’s just a part of who I am. I have some gifts for communication, making dense material understandable, understanding the needs of database users, and building scalable and manageable database systems. I want to share with others and give back to the community to give them the same or better opportunities as I was given.

Thanks to you all

Where I am in the MySQL community, and where I am in my life and career would not be possible without amazing examples given to me by a bunch of amazing people. There’s not any one mentor who was my sole example, but rather a community of dozens of individuals, each of whom I admire and have aspired to learn various things from. I’d like to offer special thanks and acknowledgement to the following folks though:

  • Monty Widenius — Of course, Monty was the father of it all, but he has also acted as a father to me personally, taken care of me, and invited me into his home and his family. He has a huge heart and is both a personal and technical mentor to me.
  • Matt Wagner, Indrek Siitan, and Tim Smith — Matt, Indrek, and Tim offered a great example of how a team can be a family, and welcomed me into the community, into their lives, and into the company in an amazing fashion. In addition, they were also great technical mentors and taught me a lot about MySQL.
  • Sinisa Milivojevic, Sasha Pachev, Jani Tolonen, Miguel Solórzano, Tõnu Samuel, Sergei Golubchik, Paul DuBois, Kaj Arnö, Arjen Lentz, Mårten Mickos, Carsten Pedersen, Zak Greant, David Axmark, Brian Aker — These folks are a mix of developers, executives, peers and community, of all backgrounds and experiences. One thing they all have in common is that they helped me to learn what it takes to build software, to run a company, and to be a community. While we haven’t always gotten along or agreed on everything, I have always respected every one of them and keep track of as many of them as I can.
  • Countless others in the community — Others on the mailing lists and IRC, customers, partners, and peers. Thanks for all being here and being awesome!

On the award

In Henrik Ingo’s words:

Several people nominated Jeremy and indeed he has a long history with MySQL, pretty much back to the first release.

For example, people mentioned Jeremy’s insights shared on his blog, on issues such as Linux NUMA memory management. His recent work on innodb_ruby has been widely appreciated both for its educational value and perhaps even some potential usefulness.

Most of us will have used the SHOW PROFILE(S) commands created by Jeremy – and for a long time this was the only community contribution that actually made it into MySQL Server!

His consulting company Proven Scaling used to mirror the MySQL Enterprise binaries that were GPL but not otherwise publicly available. This grew into a historical archive of (almost) all MySQL binaries ever released. Related to his issues with the MySQL Enterprise process, and poor handling of community contributions, Proven Scaling was actually the first company to create a community fork of MySQL known as Dorsal Source.

You might also remember in 2008 Jeremy took a public stand against MySQL’s plans to offer new backup functionality only as closed source. This resulted in public outcry on Slashdot and elsewhere, and Sun eventually commanded MySQL executives to give up on those plans.

So any way we look at it, over the years he has really contributed a lot and always had the interests of the MySQL Community close to his heart.


I look forward to continuing to contribute my efforts and my skills to MySQL, and always making my work available to the community. There’s a lot of work left to do, and I hope my efforts in that will be useful to many.