Thanks, Oracle for fixing MySQL Bug 69477

Just under a year ago, on June 14, 2013, I reported MySQL Bug 69477 (and MariaDB Bug MDEV-4662) titled “InnoDB: Use of large externally-stored fields makes crash recovery lose data”. The gist of this bug is that if you use large BLOB fields (> ~10% of your InnoDB redo logs size), you can compromise crash recovery and suffer data loss after a crash.

The MySQL bug was marked as “Verified” in less than an hour (wow!), but since then there was not a single word about it. Suddenly a couple of days ago, I got an email with “[Com,Ver->Csd]” in the subject. My immediate thought was “Damn it, closed?! How can they close it?! It’s a bug!” — however, upon opening the email I noticed it was not closed as “Won’t fix”, but rather, it was fixed!

Daniel Price at Oracle left the following comment on the bug:

Fixed as of 5.6.20, 5.7.5, and here’s the changelog entry:

Redo log writes for large, externally stored “BLOB” fields could overwrite the most recent checkpoint. The 5.6.20 patch limits the size of redo log “BLOB” writes to 10% of the redo log file size. The 5.7.5 patch addresses the bug without imposing a limitation. For MySQL 5.5, the bug remains a known limitation.

The limitation has also been documented in the reference manual. The changes should appear soon, with the next published documentation build.

Thank you for the bug report.

Thanks, Oracle, for fixing1 this bug! I look forward to seeing the fix in both 5.6 and 5.7. I was just thinking the other day that I might have to fix it myself, because it seemed like nobody was working on it.

1 Maybe next time leave a comment that you’re working on it? Pretty please? :)

InnoDB with reduced page sizes wastes up to 6% of disk space

In InnoDB bugs found during research on InnoDB data storage I mentioned MySQL Bug #67963 which was then titled “InnoDB wastes 62 out of every 16384 pages”. I said:

InnoDB needs to occasionally allocate some internal bookkeeping pages; two for every 256 MiB of data. In order to do so, it allocates an extent (64 pages), allocates the two pages it needed, and then adds the remainder of the extent (62 free pages) to a list of extents to be used for single page allocations called FREE_FRAG. Almost nothing allocates pages from that list, so these pages go to waste.

This is fairly subtle, wasting only 0.37% of disk space in any large InnoDB table, but nonetheless interesting and quite fixable.

Wasting 0.37% of disk space was unfortunate, but not a huge problem…

MySQL 5.6 brings adjustable page sizes

Since MySQL 5.6, InnoDB supports adjustable page size through the new configuration parameter innodb_page_size1, allowing you to use 4 KiB or 8 KiB pages instead of the default 16 KiB pages. I won’t go into the reasons why you would want to reduce the page size here. Instead, coming back to MySQL Bug #67963… neither the number 62 nor 16384 are fixed; they are in fact variable.

The number 62 actually comes from the size of the extent, in pages. For 16 KiB pages, with 1 MiB extents, this works out to 1048576 / 16384 = 64 pages per extent. Since two pages are stolen for bookkeeping, that leaves the 62 pages above.

The number 16384 comes from InnoDB’s need to repeat these bookkeeping pages every so often — it uses the page size, in pages, for this frequency2, which means that for 16 KiB pages it repeats the bookkeeping pages every 16,384 pages.

If we use 8 KiB pages instead by setting innodb_page_size=8k in the configuration? The number of pages per extent changes to 1048576 / 8192 = 128 pages per extent. The frequency of the bookkeeping pages changes to every 8192 pages. So we now waste 126 / 8192 = ~1.5% of disk space for this bug.

If we use 4 KiB pages instead by setting innodb_page_size=4k in the configuration? The number of pages per extent changes to 1048576 / 4096 = 256 pages per extent. The frequency of the bookkeeping pages changes to every 4096 pages. So we now waste 254 / 4096 = ~6.2% of disk space for this bug.

An aside: When is an extent not an extent?

An interesting aside to all of this is that although the manual claims it is so, in InnoDB an extent is actually not always 1 MiB. It is actually (1048576 / innodb_page_size) * table_page_size. As far as I can tell this was more or less a mistake in the InnoDB compression code; it should have used the table’s actual page size (which comes from KEY_BLOCK_SIZE aka zip_size for compressed tables) rather than the system default page size (UNIV_PAGE_SIZE) which was at the time fixed at compile-time.

So, for a system with innodb_page_size=16k (the default), and a table created with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8, the “extent” is actually only 512 KiB.

The bug gets even worse if you mix InnoDB compression in…

If you mix the new configurable page size feature with InnoDB compression, due to the above weirdness with how extent size really works, you can get some pretty interesting results.

For a system with innodb_page_size=4k and a table created with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1, the system actually wastes 254 / 1024 = ~24.8% (!!!) of the disk space to this bug (in other words, every 4th extent will be an unusable fragment extent).

A new title for Bug #67963, and a conclusion

I updated Bug #67963 to add the above and changed the title to “InnoDB wastes almost one extent out of every innodb_page_size pages” to be slightly more accurate with the reality.

If you were thinking about using 4k pages in your systems, you may want to subscribe to the bug, and maybe hold off, unless you can afford to waste more than 6% of your disk space (in addition to all other waste).

1 And prior to MySQL 5.6, you could always have changed it by changing UNIV_PAGE_SIZE in the source code and recompiling.

2 As the page size is reduced, there is less disk space available to store the bitmaps that need to be stored in the XDES page, and reducing the amount of pages represented by each page proportionally with the page size is a good enough way to do it.

A little fun with InnoDB multi-versioning

Consider the following commands, executed in the MySQL CLI on a new connection with no special preparation (and pay special attention to the execution time):

mysql> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `a` int(10) unsigned NOT NULL,
  `b` int(10) unsigned NOT NULL,
1 row in set (0.00 sec)

mysql> select * from t;
Empty set (5.20 sec)

mysql> select count(*) from t;
| count(*) |
|        0 |
1 row in set (5.22 sec)

mysql> select * from t where a = 10;
Empty set (0.00 sec)

mysql> select * from t where a < 10;
Empty set (5.35 sec)

mysql> select * from t where a > 10;
Empty set (5.41 sec)

mysql> select * from t where a in (10, 20, 30);
Empty set (0.00 sec)

mysql> select * from t where a > 1000000;
Empty set (0.00 sec)

mysql> select * from t where a > 500000;
Empty set (2.60 sec)

What is happening? Why is it so slow? Why are some things slow and others not?

Looking for the culprit

The SHOW PROCESSLIST doesn’t show anything unusual:

mysql> show processlist;
| Id | User | Host            | db   | Command | Time | State | Info             |
|  6 | root | localhost:34840 | test | Query   |    0 | init  | show processlist |
|  8 | root | localhost:34842 | test | Sleep   | 2116 |       | NULL             |
2 rows in set (0.00 sec)

If we look at the SHOW ENGINE INNODB STATUS there’s a clue: a transaction that’s been busy being evil:

---TRANSACTION 1481, ACTIVE 2183 sec
3487 lock struct(s), heap size 570920, 2001743 row lock(s), undo log entries 11000000
MySQL thread id 8, OS thread handle 0x7fca27ba2700, query id 5000144 localhost root cleaning up  

It also shows up in information_schema.innodb_trx:

mysql> select * from information_schema.innodb_trx \G
*************************** 1. row ***************************
                    trx_id: 1481
                 trx_state: RUNNING
               trx_started: 2014-04-17 01:38:27
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 11003487
       trx_mysql_thread_id: 8
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 3487
     trx_lock_memory_bytes: 570920
           trx_rows_locked: 2001743
         trx_rows_modified: 11000000
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.01 sec)

How was the transaction evil?

I used the following script to INSERT one million rows, and then UPDATE each row ten times, leaving the transaction open and un-committed:

#!/usr/bin/env ruby

require "mysql"

m ="", "root", "", "test", 13000)



(1..1000000).each do |i|
  m.query("INSERT INTO t (a, b) VALUES (#{i}, 0)")
  puts "Inserted #{i} rows..." if i % 10000 == 0

(1..10).each do |i|
  m.query("UPDATE t SET b=#{i}")
  puts "Updated #{i} times..."

sleep 1000000

As described in The basics of the InnoDB undo logging and history system, these modifications are made to the database, and there is a single index structure used by both the uncommitted write transaction and my read transaction. Although my read transaction is not able to see any of the rows inserted and subsequently modified by the writing transaction, in order to figure that out, it needs to apply all the undo records to each row encountered (10 each). This, of course, takes time.

Leaky visibility

You can see some of the hidden visibility of the in-flight data “leaking” in the amount of time different operations take. For instance, scanning all rows takes more than 5 seconds, but using the condition a > 1000000 can be immediately evaluated. Suspiciously the condition a > 500000 takes approximately half as much time as scanning all rows (because it has half as much work to do). Of course, all of these queries return an empty set.

Applicability to the real world

While almost no one would intentionally do what my script does, I have actually seen the end result in production systems before: a very large transaction updated a row many times resulting in many slow queries trying to access the table. Some queries would be fast, some slow.

Serious implications for system administrators and DBAs

There are no limits to how much space a user can consume in the form of undo history. I have filed MySQL Bug #72362: “Users may use an unlimited amount of undo space” suggesting that configuration options be added to limit the amount of undo space which a user could cause to be consumed. I wrote the following:

As InnoDB is currently implemented, it is possible for users to use an unlimited amount of space for undo history by either:

  1. Creating a single or many large transactions writing to as few as one row many times, directly accumulating undo history.
  2. Leaving open a transaction (while minimally keeping it alive) with a read view, indirectly accumulating undo history by preventing purge.

(Additionally, the user could combine both of these approaches, leaving open a transaction with a read view while simultaneously generating a large number of small transactions which individually have a small amount of undo history. This may better prevent detection of the culprit.)

Both of these situations allow a regular user, without special privileges, to consume large amounts of disk space in the system tablespace, potentially causing the system tablespace to be expanded to consume all filesystem space and without an easy recourse from the system administrator.

And I’ve suggested that new configuration options be added for the following:

  • Limit the undo space consumed by a single transaction.
  • Limit the aggregate undo space consumed by a given user.
  • Limit the age of the transaction read view for a given user.

The basics of the InnoDB undo logging and history system

InnoDB implements multi-version concurrency control (MVCC), meaning that different users will see different versions of the data they are interacting with (sometimes called snapshots, which is a bit of a misleading term). This is done in order to allow users to see a consistent view of the system without expensive and performance-constraining locking which would limit concurrency. (This is where the “concurrency control” part of the term comes from; one alternative is locking everything the user may need.) Undo logging and InnoDB’s “history” system are the mechanisms that underly its implementation of MVCC, but the way this works is generally very poorly understood.

InnoDB keeps a copy of everything that is changed

The key thing to know in InnoDB’s implementation of MVCC is that when a record is modified, the current (“old”) version of the data being modified is first stashed away as an “undo record” in an “undo log”. It’s called an undo log because it contains the information necessary to undo the change made by the user, reverting the record to its previous version.

Every record contains a reference to its most recent undo record, called a rollback pointer or ROLL_PTR, and every undo record contains a reference to its previous undo record (except for an initial record insert, which can be undone by simply deleting the record), forming a chain of all previous versions of a record. In this way, any previous version of a record can be easily constructed, as long as the the undo records (the “history”) still exist in the undo logs.

Transactions always operate on the “live” data — there are no private copies

Any transaction1, no matter how small or temporary it may be, is always operating on the database. As records are added, modified, and deleted, this is done in the very same index structure that all other transactions and users are actively using. Although the data for these in-flight transactions may not be visible to other transactions (depending on their transaction isolation level), the effects—particularly the performance costs—associated with those modifications are immediately visible.

When reading an index, a transaction uses a “read view”, which controls what version of records a transaction is allowed to see. While reading records in the index, any recently modified record (modified by a transaction whose ID is newer than the reading transaction’s read view would allow it to see) must first be reverted to an old-enough version. (And this may cause the record to not be visible at all.)

When a transaction updates a record, without yet committing, all other transactions using transaction isolation are immediately impacted by having to revert the version of that record to an older version (that they are allowed to see) every time they encounter the record in a read.

What about transaction isolation levels?

There are three transaction isolation levels of interest for undo logging, history, and multi-versioning:

  • READ UNCOMMITTED — Also known as “dirty read”, because it literally always uses the newest data in the index without regard to transaction isolation at all, potentially reading data which isn’t currently (and may never be) committed. Even within a single statement, transactional inconsistencies may be seen from one record to the next, because no record is ever reverted to a previous version during a read.
  • READ COMMITTED — A new read view is used for each statement, based on the current maximum committed transaction ID at statement start. Records read or returned within the statement will still be consistent with each other, but from statement to statement the user will see new data.
  • REPEATABLE READ — The default for MySQL/InnoDB. A read view is created at transaction start, and that read view is used for all statements within the transaction, allowing for a consistent view of the database from statement to statement. That is, reads of data are “repeatable” within the transaction.

(Additionally there is one more transaction isolation level supported by MySQL/InnoDB, called SERIALIZABLE, but it is primarily a difference in locking, not transaction visibility, compared to REPEATABLE READ.)

In the normal course of accessing an index, some small number of records will need to be reverted to a previous version in order to satisfy the transaction isolation requirements imposed by the system. This has a cost, but as long as the read view of the transaction is fairly new, most records will not require reversion, and there is very little performance cost to doing this.

Long-running transactions and queries

It is common and mostly unsubstantiated wisdom that long-running transactions are “bad” in MySQL — but why is that? There are two reasons that long-running transactions can cause problems for MySQL:

  1. Extremely old read views. A long-running transaction (especially in the default REPEATABLE READ isolation level) will have an old read view. In a write-heavy database this may require reverting the version of very many rows to very old versions. This will slow down the transaction itself and in the worst case may mean that very long-running queries in a write-heavy database can never actually complete; the longer they run the more expensive their reads get. They can spiral into performance death eventually.
  2. Delaying purge. Because a long-running transaction has an old (potentially very old) read view, purging of undo logs (history) for the entire system will be stalled until the transaction completes. This can cause the total size of the undo logs to grow (rather than re-using the same space over and over as it normally would), causing the system tablespace (ibdata1) to grow—and of course due to other limitations, it can’t be shrunk later.

If a very long-running transaction (or query) is needed, it’s well worth considering whether it could use dirty reads in READ UNCOMMITTED isolation level in order to avoid these problems.

Deleting isn’t really deleting

Whenever a record is deleted, other transactions may still need to see the record as it existed, due to transaction isolation. If, upon delete, the record was immediately removed from the index, other transactions wouldn’t be able to find it, and thus would also not be able to find its reference to the previous record version they may need. (Keep in mind that any number of transactions may see the record in any number of versions, so five different transactions may see up to five separate versions of the record.) In order to handle this, DELETE doesn’t actually delete anything: instead it delete marks the record, flipping a “deleted” flag on.

Global history and purge operations

In addition to every record having a reference to its previous version, there is also a global view of the history of the entire database, called the “history list”. As each transaction is committed, its history is linked into this global history list in transaction serialization (commit) order. The history list is used primarily for cleaning up after a transaction, once no existing read view still needs its history (all other transactions have completed).

In the background, InnoDB runs a continuous “purge” process which is responsible for two things:

  1. Actually deleting delete-marked records2, if the current version of the record in the index at the time of purge is still delete-marked and bears the same transaction ID. (That is, the record hasn’t been re-inserted.)
  2. Freeing undo log pages and unlinking them from the global history list to make them available for re-use.

InnoDB exposes the total amount of history present in the system as a “History list length”, which can be seen in SHOW ENGINE INNODB STATUS. This is the count of all database modifications present in the undo logs, in units of undo logs (which may contain a single record modification or many).

What’s next?

Next, the physical structure of the undo logs, undo records, and history will be examined.

1 Note that InnoDB does not start a transaction internally when a BEGIN or START TRANSACTION is issued; this happens only after the first read, or immediately if START TRANSACTION WITH CONSISTENT SNAPSHOT is executed.

2 Of course actually deleting the record still doesn’t actually delete it — it just links the record into a “garbage” list, where record space may be reused. However nothing is guaranteed to be really removed from a page until the page is re-organized.

Speaking at this year’s MySQL conference

For this year’s Percona Live: MySQL Conference and Expo 2014 (PLMCE), Davi Arnaut and I submitted a couple of co-speaking sessions proposals, both of which were accepted. Please come and see us and learn some more about InnoDB at the following sessions:

InnoDB: A journey to the core II

In this new installment of the series, we explore the most crucial structures for crash recovery and consistency, the redo and undo logs. This presentation dissects the structure and organization of InnoDB’s redo and undo logs, providing insight into how data changes occur at the InnoDB layer and how it guarantees that it can recover to a consistent state in the event of a crash. This presentation will increase your understanding of how data travels through InnoDB.

InnoDB: A hands-on exploration of on-disk storage with innodb_ruby

Last year, we demoed innodb_ruby in InnoDB: A journey to the core with very small snippets; this year we’ll walk you through exploring InnoDB’s on-disk storage for yourself. We’ll provide sample data files and a working environment for you to explore along with us, and demonstrate usage of the tool’s many features in realistic scenarios.

The innodb_ruby library and tools can be used to:

  • Understand byte-by-byte the exact data stored in InnoDB files, pages, and records.
  • Make better informed decisions about database and schema design.
  • Discover the storage efficiency of InnoDB for different table schemas.
  • Uncover storage-related bugs and problems with different table schemas and access patterns.

In addition to the provided data files, feel free to bring your own computer, data files, and create table statements (nothing confidential, please!) and we’ll help you get set up to explore them on your own. We’ll be on-hand to explain and answer any questions you may have about InnoDB storage internals.