On 1U cases, PCI risers, and LSI MegaRAID

I’ve been working with a Proven Scaling customer that has run into some interesting issues recently involving InnoDB corruption, resulting in messages similar to these:

InnoDB: Page checksum 3156980109, prior-to-4.0.14-form checksum 577557610
InnoDB: stored checksum 741279449, prior-to-4.0.14-form stored checksum 577557610
InnoDB: Page lsn 0 2323869442, low 4 bytes of lsn at page end 2323869442
InnoDB: Page number (if stored to page already) 195716,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 2831
InnoDB: (index PRIMARY of table db/table)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 195716.

The problem was encountered when testing hardware for a move from software RAID to hardware RAID using LSI MegaRAID SCSI 320-2 cards. The servers are 1U machines with Tyan motherboards and a PCI riser card, into which the MegaRAID was plugged.

They were receiving the same messages on several different machines, ruling out a single bad piece of hardware. After spending weeks trying to figure out what the problem could be, testing different configurations and isolating variables, it was tracked down to the PCI riser cards. Searching for “lsi pci riser” shows quite a few people having similar issues.

It turns out that LSI “does not support” using their cards with PCI risers, at all. Maybe they should reword things a bit: “does not work” would be more accurate, since their cards evidently don’t work with PCI risers at all.

The scariest part of the whole exercise, though, is that the corruption was occurring completely silently: data comes in and is written to disk, but gets corrupted in flight. Since the OS wrote the data, it is now caching the correct copy, but the disks contain something different. The corruption is only discovered when the page is read back quite a bit later, after having been flushed from the cache.
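
One way to smoke this kind of problem out earlier is to force every page of a suspect table to be read back, and therefore checksum-verified, instead of waiting for pages to age out of the cache on their own. For InnoDB, CHECK TABLE does exactly that (shown here against the table named in the messages above), though be aware that InnoDB deliberately crashes the server if it does find a corrupt page:

mysql> CHECK TABLE db.table;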

You’d think that somewhere along the line, the OS or the RAID card would catch the corruption, wouldn’t you?

Leaving Yahoo!; storage, consulting

As many of you know already, I’ve decided to leave Yahoo!. Today was my last day. I’ve had a great time at Yahoo! and worked with a bunch of brilliant people. I will miss you all!

Where am I headed? That’s actually a complicated question. I’ll be working part time at Pivot3, a mature startup that’s doing great things in the storage industry. Contact me if you’re interested in cutting storage costs, getting some amazing management tools, and seeing some awesome performance. I would love to tell you more! MySQL will absolutely rock on this platform.

Yes, I said part time. I have a startup of my own, Proven Scaling, providing advanced MySQL consulting and training services. Looking to get more performance, scalability, or reliability out of MySQL? Need help with replication? Designing a database solution that really scales? Need help interviewing a candidate for a MySQL position? We can help.

Madness continues: SHOW STATUS

Last month I blogged about the silly changes to the default for the SHOW STATUS command in MySQL 5.0 in Stop the madness: SHOW STATUS, and I filed MySQL Bug #19093: SHOW STATUS defaults to SESSION.

Well, it looks like Monty has spoken and this will not be fixed. That’s unfortunate, but oh well, I tried.

You can read the full text of Monty’s reply in the bug report itself, but his reasons for supporting the current behaviour boil down to:

  • It’s similar to what SHOW VARIABLES does. — Yes, this is true, but neither command’s output really makes sense. If you do SHOW [SESSION] STATUS, you get a mix of session-scoped and global-scoped results back, and there’s no way to tell which is which. The same goes for SHOW [SESSION] VARIABLES. Ugh.
  • MySQL 5.0 is considered “stable” now, so it’s bad to change it. — I agree that it’s bad form to change things at this point, but at the same time, very few people are really using 5.0 yet, so this is actually a good time to change it, before people find out that it’s broken.
  • MySQL’s documentation has long suggested that people use FLUSH STATUS and SHOW STATUS to find out how a query was executed, which kind of works now. — Yes, the manual has suggested this, and a few people have made limited use of it (it’s sketched below), but come on, it’s impossible to use this in production, and it’s of limited value even on a test server.
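
For reference, the technique the manual suggests looks something like this; the table and query are hypothetical, and the counters only mean anything if yours is the only session doing work:

mysql> FLUSH STATUS;
mysql> SELECT * FROM t WHERE non_indexed_column = 42;
mysql> SHOW SESSION STATUS LIKE 'Handler_read%';

A large Handler_read_rnd_next count after a query like this, for example, betrays a full table scan.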

I suggested a compromise: add a Scope column to the output of the SHOW STATUS and SHOW VARIABLES commands, indicating the scope of each value in the output. What do you think?
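
Until then, the only safe habit is to specify the scope explicitly every time, rather than relying on the default:

mysql> SHOW GLOBAL STATUS LIKE 'Com_select';
mysql> SHOW SESSION STATUS LIKE 'Com_select';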

MySQL 5.1: Partitioning getting closer

Back in March, I wrote about Partitioning in MySQL 5.1 not being usable. Since then, a few releases have gone out and a few months have passed, so I am revisiting the issues I saw back then.

MySQL still doesn’t allow you to partition directly on a native DATE or DATETIME column; the workaround is to wrap the column in an integer-returning function, as sketched below. This is a major problem in my opinion, as it’s probably one of the most common uses for partitioning, but MySQL treats it as only a “feature request”. Hopefully that will get fixed sometime in the near future.
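
In the meantime, here’s a sketch of that workaround using TO_DAYS(); the table and columns are hypothetical, loosely modeled on the ontime table that appears below:

mysql> CREATE TABLE ontime_test (
    ->   flight_date DATE NOT NULL,
    ->   carrier CHAR(2) NOT NULL
    -> )
    -> PARTITION BY RANGE (TO_DAYS(flight_date)) (
    ->   PARTITION p_199609 VALUES LESS THAN (TO_DAYS('1996-10-01')),
    ->   PARTITION p_199610 VALUES LESS THAN (TO_DAYS('1996-11-01')),
    ->   PARTITION p_max    VALUES LESS THAN MAXVALUE
    -> );

It works, but it’s clumsy compared to simply partitioning on the column itself.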

I filed a couple more bugs today, the most serious of which was the inability to repair “crashed” partitions:

mysql> alter table ontime repair partition p_199609;
ERROR 145 (HY000): Table './flightstats/ontime#P#p_199609' is marked as
crashed and should be repaired

Here are the bugs I filed:

Partitioning still has quite a way to go before it’s really usable in my opinion, but it’s certainly improving!

On disk performance and MySQL replication

If you’re not using replication, the only thing you have to worry about is MyISAM and InnoDB performance. There are a lot of things you can easily do to get more (or less!) performance out of either of them. So, you get your system tuned and handling 1500 updates per second, easy as pie. You’ve got a single IDE, SATA, or SCSI disk, or maybe two using software RAID.

Then, due to natural progression, needed scalability, or catastrophic failure, you realize that you should probably be using replication, so that you can have more than one of these systems working at once.

The first step in setting up replication is normally to turn on binary logging (by enabling log-bin) on the future master. This ensures that every change to the database is logged so that it can be sent to your future slaves and executed there. Here, though, you find your first obstacle.
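
Enabling it is a small change in my.cnf; here’s a minimal sketch (the server-id value is an arbitrary example, but it must be set and be unique across the master and its slaves):

[mysqld]
log-bin   = mysql-bin
server-id = 1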

You enable binary logging, and all of a sudden, performance goes to hell. Your server was handling 1500 updates per second, but now it’s managing only 150. You freak out. You disable binary logging, and things return to normal. What’s happening?

In order to replicate to slaves, and to maintain consistency, MySQL’s binary logging mechanism writes the binary log “event” for a transaction as part of the commit process. Before a transaction can return “OK” for a COMMIT command (or an implicit commit), the transaction must have been logged to disk in the binary log. In order to ensure that the OS doesn’t cheat and cache the write, MySQL calls fsync on the binary log as part of this commit.

As long as the system is also doing some reads, a few other writes, and so on, you can assume that any time you write data and force it to disk with fsync, a disk seek is required to write it. Seeks take anywhere from 4-8ms on most disks, so a single disk can only do about 125-250 of them per second, on average.
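
To make the arithmetic concrete, assume a seek time of roughly 6.7ms, right in the middle of that range:

1000 ms per second / 6.7 ms per seek ≈ 150 seeks per second

That’s one fsync, and therefore one committed transaction, per seek: almost exactly the ~150 updates per second from the scenario above.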

Are things starting to come together now? Enabling binary logging drops your update rate to ~150 per second because each of those updates will need to be written to the binary log, and flushed to disk, and you’re limited by the number of seeks that the disk can physically do in a given second. The only way to get more updates per second through the binary logging mechanism is to do fewer disk seeks. There are a few ways to accomplish this:

  • Unsafe syncing — You can use the sync_binlog=n option, with n > 1, to avoid calling fsync for every single transaction (see the sketch after this list). This is the easiest option, but it means that if your server crashes (power failure, kernel panic, etc.), you can lose up to n committed transactions. That’s generally bad.
  • RAID with battery-backed write cache (BBWC) — This option, though the most expensive, is the most common. The BBWC is some amount of RAM, usually 128MB or so, on the RAID card itself, that is used to cache writes. The RAID card will acknowledge a write once it has been saved in that RAM, but before it has been written to the disk. Since the battery preserves the contents of the cache across a power loss, the acknowledged write is still safe even though it hasn’t hit the disk yet.
  • Separate disk — Put the binary log on its own disk(s)[1] using Linux’s ext3 filesystem with a sizable journal, mounted with the options data=journal,commit=10.
  • Ramdisk journal — Put the binary log on a shared disk, but on its own volume, using Linux’s ext3 filesystem, with the journal on a battery-backed hardware ramdisk device. I’ll be writing some more blog entries about this quite soon. :)
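
To make the first and third options concrete, here’s a minimal sketch; the two stanzas illustrate independent alternatives, not a single setup, and all paths and values are illustrative assumptions, not recommendations:

# /etc/my.cnf: sync the binary log only every 100th write, trading
# durability for fewer seeks; up to 100 committed transactions can
# be lost in a crash
[mysqld]
log-bin     = /binlog/mysql-bin
sync_binlog = 100

# /etc/fstab: a dedicated ext3 volume for the binary log, mounted
# with full data journaling and a 10-second commit interval
/dev/sdb1  /binlog  ext3  data=journal,commit=10  0 2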

I hope you’ve learned something. Feel free to leave me some feedback!

[1] Yes, its own disk. If anything else touches the disk, ever, this strategy fails.