InnoDB with reduced page sizes wastes up to 6% of disk space

In “InnoDB bugs found during research on InnoDB data storage”, I mentioned MySQL Bug #67963, which was then titled “InnoDB wastes 62 out of every 16384 pages”. I said:

InnoDB needs to occasionally allocate some internal bookkeeping pages; two for every 256 MiB of data. In order to do so, it allocates an extent (64 pages), allocates the two pages it needed, and then adds the remainder of the extent (62 free pages) to a list of extents to be used for single page allocations called FREE_FRAG. Almost nothing allocates pages from that list, so these pages go to waste.

This is fairly subtle, wasting only 0.37% of disk space in any large InnoDB table, but nonetheless interesting and quite fixable.

Wasting 0.37% of disk space was unfortunate, but not a huge problem…

MySQL 5.6 brings adjustable page sizes

Since MySQL 5.6, InnoDB supports an adjustable page size through the new configuration parameter innodb_page_size[1], allowing you to use 4 KiB or 8 KiB pages instead of the default 16 KiB pages. I won’t go into the reasons why you would want to reduce the page size here. Instead, coming back to MySQL Bug #67963: neither the number 62 nor the number 16384 is fixed; both in fact vary with the page size.

The number 62 actually comes from the size of the extent, in pages. For 16 KiB pages, with 1 MiB extents, this works out to 1048576 / 16384 = 64 pages per extent. Since two pages are stolen for bookkeeping, that leaves the 62 pages above.

The number 16384 comes from InnoDB’s need to repeat these bookkeeping pages every so often. It uses the numeric value of the page size (in bytes) as the interval, counted in pages[2], which means that for 16 KiB pages it repeats the bookkeeping pages every 16,384 pages.

What if we use 8 KiB pages instead, by setting innodb_page_size=8k in the configuration? The number of pages per extent changes to 1048576 / 8192 = 128 pages per extent. The frequency of the bookkeeping pages changes to every 8192 pages. So we now waste 126 / 8192 = ~1.5% of disk space to this bug.

What if we use 4 KiB pages instead, by setting innodb_page_size=4k in the configuration? The number of pages per extent changes to 1048576 / 4096 = 256 pages per extent. The frequency of the bookkeeping pages changes to every 4096 pages. So we now waste 254 / 4096 = ~6.2% of disk space to this bug.
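The arithmetic for the three page sizes can be reproduced with a few lines of Python. This is only a sketch of the calculation described above, not anything taken from the InnoDB source; the names EXTENT_BYTES, BOOKKEEPING_PAGES and wasted_fraction are made up for illustration.

```python
# Sketch of the waste calculation for uncompressed tables.
# All names here are illustrative, not InnoDB identifiers.

EXTENT_BYTES = 1048576    # 1 MiB extent
BOOKKEEPING_PAGES = 2     # pages actually used out of the allocated extent


def wasted_fraction(page_size):
    """Fraction of pages lost to the leftover fragment extent."""
    pages_per_extent = EXTENT_BYTES // page_size
    wasted_pages = pages_per_extent - BOOKKEEPING_PAGES
    # Bookkeeping pages repeat every `page_size` pages
    # (the byte value of the page size, counted in pages).
    interval_pages = page_size
    return wasted_pages / interval_pages


for page_size in (16384, 8192, 4096):
    fraction = wasted_fraction(page_size)
    print(f"{page_size // 1024} KiB pages: {fraction:.2%} of pages wasted")
```

Because the wasted page count grows as the page shrinks while the interval shrinks with it, halving the page size roughly quadruples the wasted fraction.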

An aside: When is an extent not an extent?

An interesting aside to all of this is that, although the manual claims an extent is always 1 MiB, in InnoDB that is not actually the case. The extent size is really (1048576 / innodb_page_size) * table_page_size. As far as I can tell this was more or less a mistake in the InnoDB compression code; when computing the number of pages per extent, it should have used the table’s actual page size (which comes from KEY_BLOCK_SIZE aka zip_size for compressed tables) rather than the system default page size (UNIV_PAGE_SIZE), which was at the time fixed at compile-time.

So, for a system with innodb_page_size=16k (the default), and a table created with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8, the “extent” is actually only 512 KiB.
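As a quick check of that figure, the extent size formula from this section can be written out directly. This is a sketch only; extent_bytes is a hypothetical helper name, and the function simply restates the formula rather than mirroring InnoDB code.

```python
# Sketch of the effective extent size formula:
#   (1048576 / innodb_page_size) * table_page_size
# `extent_bytes` is an illustrative name, not an InnoDB function.

def extent_bytes(innodb_page_size, table_page_size):
    """Effective extent size in bytes for a table."""
    pages_per_extent = 1048576 // innodb_page_size
    return pages_per_extent * table_page_size


# innodb_page_size=16k with KEY_BLOCK_SIZE=8 (8 KiB compressed pages)
print(extent_bytes(16384, 8192) // 1024, "KiB")

# Uncompressed tables use the system page size, giving the documented 1 MiB
print(extent_bytes(16384, 16384) // 1024, "KiB")
```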

The bug gets even worse if you mix InnoDB compression in…

If you mix the new configurable page size feature with InnoDB compression, due to the above weirdness with how extent size really works, you can get some pretty interesting results.

For a system with innodb_page_size=4k and a table created with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1, the system actually wastes 254 / 1024 = ~24.8% (!!!) of the disk space to this bug (in other words, every 4th extent will be an unusable fragment extent).
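The same calculation extends to compressed tables. The sketch below assumes, as the 254 / 1024 figure implies, that the number of pages per extent follows the system page size while the bookkeeping interval follows the table’s compressed page size; compressed_waste is a hypothetical name used only for illustration.

```python
# Sketch of the waste calculation when compression is mixed in.
# Assumption (inferred from the 254 / 1024 figure above): pages per extent
# come from innodb_page_size, while the bookkeeping interval comes from
# the table's compressed page size (KEY_BLOCK_SIZE).

def compressed_waste(innodb_page_size, key_block_size_kib):
    """Fraction of pages wasted for a compressed table."""
    table_page_size = key_block_size_kib * 1024
    pages_per_extent = 1048576 // innodb_page_size
    wasted_pages = pages_per_extent - 2   # two bookkeeping pages are used
    interval_pages = table_page_size      # repeats every table_page_size pages
    return wasted_pages / interval_pages


# innodb_page_size=4k with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1
print(f"{compressed_waste(4096, 1):.1%}")
```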

A new title for Bug #67963, and a conclusion

I updated Bug #67963 to add the above and changed the title to “InnoDB wastes almost one extent out of every innodb_page_size pages” to reflect reality slightly more accurately.

If you were thinking about using 4k pages in your systems, you may want to subscribe to the bug, and maybe hold off, unless you can afford to waste more than 6% of your disk space (in addition to all other waste).

[1] And prior to MySQL 5.6, you could always have changed it by changing UNIV_PAGE_SIZE in the source code and recompiling.

[2] As the page size is reduced, there is less space available to store the bitmaps that need to be stored in the XDES page, and reducing the number of pages represented by each XDES page proportionally with the page size is a good enough way to handle that.

5 thoughts on “InnoDB with reduced page sizes wastes up to 6% of disk space”

  1. Those are minor losses compared to other waste in InnoDB. (Still, InnoDB is reasonably efficient.)

    * 1/16 of each block is deliberately left unfilled, even when filling a table in the most efficient way.

    * Random actions in a BTree inherently leave the blocks an average of 69% full. (Sequential actions _may_ lead to 15/16 full.)

    * Some (not all) indexes are (were?) so sloppily built that the number of blocks is _several_ times what it should be for the index. (This is easier to see with MariaDB’s metrics.)

    * Bigger blobs/varchars/etc in a block will lead to more off-block data (plus the 20-byte pointer to it). I have not heard how much waste there is in the off-block storage. Jeremy, can you weigh in on that? Is the off-block storage allocated in the same units (16/8/4KB)?

    * Bigger rows are more likely to overflow a smaller block size. On average, think of the lost space (per block) as being 1/2 of the typical row size. So, 1KB rows lose about 3% in 16KB blocks; 12% in 4KB blocks.

    All of these add up to my Rule of Thumb: InnoDB takes 2x-3x the disk space of MyISAM. (Hot off the press: InnoDB FULLTEXT seems to be much more frugal in disk space than MyISAM.)

    With 16KB blocks, you “cannot” have rows bigger than “about 8K”. Does that mean that with 4KB blocks a row cannot be bigger than “about 2K”? (This excludes what can be put in off-block storage.)

    Here’s a strong(?) use case for a smaller block size: big table + SSDs + mostly random access. That is, if the table is too big to be fully cached in the buffer_pool, and your access is so random that a block is rarely reused before it is bumped from cache, then SSDs can be more efficient with a smaller block size. In this use case you need to do an I/O for nearly all actions; SSD I/O for 8KB is about 2x faster than for 16KB.

    • Rick, I know there is a ton of waste. We’ve been researching this for a long time. The main point here was that there is an *additional* 6% here. :)

      Yes, with 4k pages, maximum row size is only ~2k.

      I think there are some use cases for smaller (and larger) page sizes, for instance attempting to match SSD page size, but that’s a topic for a future post…

  2. Hi Jeremy,

    We run Activiti, which constantly inserts and deletes rows.
    The problem is that TABLE_ROWS in information_schema.tables differs from the real number of rows in the table, and the difference grows if the table is not reorganized periodically.
    I mean, TABLE_ROWS shows 300K rows when there are physically 0 rows.
    Is there any paper, note, or anything else that explains this?
    Thanks in advance,

    Fernando.

  3. Fernando…
    InnoDB does not (can not) give you the exact number of rows in a table. If you need that, do SELECT COUNT(*) FROM tbl — it will probably use the smallest index to perform the count.

    300K vs 0 — are you DELETEing _all_ the rows? If so, why not do TRUNCATE?

    Are you replacing the entire table? If so, this is much better:
    1. CREATE TABLE new LIKE real;
    2. INSERT … into new
    3. RENAME TABLE real TO old, new TO real;
    4. DROP TABLE old;

    • It is an Activiti table, with rows constantly inserted and deleted.
      Sometimes it has 0 real rows, sometimes not.
      But we’re sure the table statistics are off by two or three orders of magnitude.
