A little fun with InnoDB multi-versioning

Consider the following commands, executed in the MySQL CLI on a new connection with no special preparation (and pay special attention to the execution time):

mysql> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `a` int(10) unsigned NOT NULL,
  `b` int(10) unsigned NOT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> select * from t;
Empty set (5.20 sec)

mysql> select count(*) from t;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (5.22 sec)

mysql> select * from t where a = 10;
Empty set (0.00 sec)

mysql> select * from t where a < 10;
Empty set (5.35 sec)

mysql> select * from t where a > 10;
Empty set (5.41 sec)

mysql> select * from t where a in (10, 20, 30);
Empty set (0.00 sec)

mysql> select * from t where a > 1000000;
Empty set (0.00 sec)

mysql> select * from t where a > 500000;
Empty set (2.60 sec)

What is happening? Why is it so slow? Why are some things slow and others not?

Looking for the culprit

The SHOW PROCESSLIST doesn’t show anything unusual:

mysql> show processlist;
+----+------+-----------------+------+---------+------+-------+------------------+
| Id | User | Host            | db   | Command | Time | State | Info             |
+----+------+-----------------+------+---------+------+-------+------------------+
|  6 | root | localhost:34840 | test | Query   |    0 | init  | show processlist |
|  8 | root | localhost:34842 | test | Sleep   | 2116 |       | NULL             |
+----+------+-----------------+------+---------+------+-------+------------------+
2 rows in set (0.00 sec)

If we look at the SHOW ENGINE INNODB STATUS there’s a clue: a transaction that’s been busy being evil:

---TRANSACTION 1481, ACTIVE 2183 sec
3487 lock struct(s), heap size 570920, 2001743 row lock(s), undo log entries 11000000
MySQL thread id 8, OS thread handle 0x7fca27ba2700, query id 5000144 localhost 127.0.0.1 root cleaning up

It also shows up in information_schema.innodb_trx:

mysql> select * from information_schema.innodb_trx \G
*************************** 1. row ***************************
                    trx_id: 1481
                 trx_state: RUNNING
               trx_started: 2014-04-17 01:38:27
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 11003487
       trx_mysql_thread_id: 8
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 3487
     trx_lock_memory_bytes: 570920
           trx_rows_locked: 2001743
         trx_rows_modified: 11000000
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.01 sec)

How was the transaction evil?

I used the following script to INSERT one million rows, and then UPDATE each row ten times, leaving the transaction open and un-committed:

#!/usr/bin/env ruby

require "mysql"

m = Mysql.new("127.0.0.1", "root", "", "test", 13000)

m.query("DROP TABLE IF EXISTS t")
m.query("CREATE TABLE t (a INT UNSIGNED NOT NULL, b INT UNSIGNED NOT NULL, PRIMARY KEY (a)) ENGINE=InnoDB")

m.query("START TRANSACTION")

(1..1000000).each do |i|
  m.query("INSERT INTO t (a, b) VALUES (#{i}, 0)")
  puts "Inserted #{i} rows..." if i % 10000 == 0
end

(1..10).each do |i|
  m.query("UPDATE t SET b=#{i}")
  puts "Updated #{i} times..."
end

sleep 1000000

As described in The basics of the InnoDB undo logging and history system, these modifications are made to the database, and there is a single index structure used by both the uncommitted write transaction and my read transaction. Although my read transaction is not able to see any of the rows inserted and subsequently modified by the writing transaction, in order to figure that out, it needs to apply all the undo records to each row encountered (10 each). This, of course, takes time.

Leaky visibility

You can see some of the hidden visibility of the in-flight data “leaking” in the amount of time different operations take. For instance, scanning all rows takes more than 5 seconds, but using the condition a > 1000000 can be immediately evaluated. Suspiciously the condition a > 500000 takes approximately half as much time as scanning all rows (because it has half as much work to do). Of course, all of these queries return an empty set.

Applicability to the real world

While almost no one would intentionally do what my script does, I have actually seen the end result in production systems before: a very large transaction updated a row many times resulting in many slow queries trying to access the table. Some queries would be fast, some slow.

Serious implications for system administrators and DBAs

There are no limits to how much space a user can consume in the form of undo history. I have filed MySQL Bug #72362: “Users may use an unlimited amount of undo space” suggesting that configuration options be added to limit the amount of undo space which a user could cause to be consumed. I wrote the following:

As InnoDB is currently implemented, it is possible for users to use an unlimited amount of space for undo history by either:

Creating a single or many large transactions writing to as few as one row many times, directly accumulating undo history.

Leaving open a transaction (while minimally keeping it alive) with a read view, indirectly accumulating undo history by preventing purge.

(Additionally, the user could combine both of these approaches, leaving open a transaction with a read view while simultaneously generating a large number of small transactions which individually have a small amount of undo history. This may better prevent detection of the culprit.)

Both of these situations allow a regular user, without special privileges, to consume large amounts of disk space in the system tablespace, potentially causing the system tablespace to be expanded to consume all filesystem space and without an easy recourse from the system administrator.

And I’ve suggested that new configuration options be added for the following:

Limit the undo space consumed by a single transaction.
Limit the aggregate undo space consumed by a given user.
Limit the age of the transaction read view for a given user.

5 thoughts on “A little fun with InnoDB multi-versioning”

freshdaz says:

April 17, 2014 at 07:30

Hi Jeremy,
And from dev side, I guess the only current workaround is to keep transactions as small as possible ?

- Jeremy Cole says:
  
  April 17, 2014 at 09:00
  
  @freshdaz
  
  Yes, but that is also good general advice: Make transactions no larger than they need be.
  
  Otherwise, don’t worry about this problem/situation unless you have good cause to.
  
  Regards,
  
  Jeremy
  
- Baron says:
  
  April 19, 2014 at 14:25
  
  You can also use the READ COMMITTED isolation level.
  
Queen says:

May 5, 2014 at 02:49

Leaving open a transaction (while minimally keeping it alive) with a read view, indirectly accumulating undo history by preventing purge.

a good prank on our DBA.

hahaha.

Sudalai says:

April 29, 2021 at 06:51

Hi Jeremy,

Thanks for the detailed explanation. One small clarification.

> Although my read transaction is not able to see any of the rows inserted and subsequently modified by the writing transaction, in order to figure that out, it needs to apply all the undo records to each row encountered (10 each). This, of course, takes time.

So MySQL writes undo records for all subsequent updates of the row in same transaction, correct ? If any transaction rollback we have to revert the modified row to initial version(i.e row-version before it was modified by the transaction). I believe It is sufficient to write undo record for the first update for the row in a transaction, for subsequent update of the same row in same transaction undo logging is not required. Kindly let me know if it is for some purpose or my understanding is wrong.

Thanks,
Sudalai