I’ve been working with one Proven Scaling customer that has had some interesting issues recently, involving InnoDB corruption, resulting in messages similar to these:
InnoDB: Page checksum 3156980109, prior-to-4.0.14-form checksum 577557610
InnoDB: stored checksum 741279449, prior-to-4.0.14-form stored checksum 577557610
InnoDB: Page lsn 0 2323869442, low 4 bytes of lsn at page end 2323869442
InnoDB: Page number (if stored to page already) 195716,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 2831
InnoDB: (index PRIMARY of table db/table)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 195716.
The problem was encountered when testing hardware for a move from software RAID to hardware RAID using LSI MegaRAID SCSI 320-2 cards. The servers are 1U machines with Tyan motherboards, and a PCI riser card which the MegaRAID plugged into.
They were receiving the same messages on several different machines, ruling out a single bad piece of hardware. After spending weeks trying to figure out what the problem could be, testing different configurations and isolating variables, it was tracked down to the PCI riser cards. Searching for “lsi pci riser” shows quite a few people having similar issues.
It turns out that LSI “does not support” using their cards with PCI risers, at all. Maybe they should reword things a bit: “does not support” is quite an understatement if their cards simply don’t work with PCI risers.
The scariest part of the whole exercise, though, is that the corruption was occurring completely silently: data comes in, is written to disk, but gets corrupted in flight. Since the OS wrote the data itself, it caches the correct copy, while the disks contain something different. The corruption is only discovered when the page is read back much later, after having been flushed from the cache.
You’d think that somewhere along the line, the OS or the RAID card would catch the corruption?
Did you find any other cards which would work with 1U boxes ?
Regarding silent corruption – I’m not that surprised. I’m not sure the PCI bus itself has error detection, and I doubt the drivers compute anything like a checksum over I/O blocks that the RAID card could verify.
Yeah. Silent corruption is annoying, but there are quite a few other cases where you can see it – memory errors on machines without ECC, an overheating CPU, mainboard bugs, etc.
This is why InnoDB has checksums in the first place.
This is a prime reason for end to end CRC protection. Ethernet frames are CRC protected, but as soon as they jump to another network, the original header/crc is stripped away and a new header/crc is wrapped. While the data (TCP/Data) is sitting in RAM, it can become corrupted there as well.
Fibre Channel maintains a CRC across the whole network but then it’s stripped away once it is in an array.
iSCSI has a header/data digest which is a CRC that will follow the frame across all networks between the source/destination.
There is only one way I know of to protect data from the OS to disk and back to the OS: 520-byte blocks, where 8 of the bytes are used for CRC protection. This protects the data both at rest and in flight.
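The idea behind the extra 8 bytes can be sketched in a few lines of Python. This is a toy illustration only: real 520-byte formats such as T10 DIF pack a 16-bit guard CRC, an application tag, and a reference tag into those 8 bytes, whereas here a CRC-32 plus padding stands in for the protection information.

```python
import os
import struct
import zlib

SECTOR = 512      # payload bytes per block
PI_BYTES = 8      # protection information appended to each block

def protect(payload: bytes) -> bytes:
    # Append 8 bytes of protection info: a CRC-32 plus 4 bytes of
    # padding (a stand-in for the real guard/app/reference tag layout).
    assert len(payload) == SECTOR
    return payload + struct.pack(">II", zlib.crc32(payload), 0)

def verify(block: bytes) -> bytes:
    payload, pi = block[:SECTOR], block[SECTOR:]
    crc, _pad = struct.unpack(">II", pi)
    if zlib.crc32(payload) != crc:
        raise IOError("CRC mismatch: block corrupted between OS and disk")
    return payload

data = os.urandom(SECTOR)
block = protect(data)
assert verify(block) == data

corrupted = bytes([block[0] ^ 0x01]) + block[1:]   # one bit flipped in flight
try:
    verify(corrupted)
    raise AssertionError("corruption went undetected")
except IOError:
    pass   # detected, as intended
```

Because the CRC travels with the block all the way to the platter and back, a flip anywhere along the path is caught on the next verify rather than surfacing as a mystery later.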
The LSI card did not catch the problem because there is no CRC protection on the IO boundary. This is just “supposed to work”…
I’m sure you had a fun and tricky time figuring this out…
Oh yeah, I’m not too sure about the InnoDB checksum algorithm that Peter mentions. Checksums in general don’t catch all errors or all combinations of bit errors. A CRC catches far more (including all single-bit errors and burst errors up to the CRC width), but CRCs are computationally expensive to implement (especially in software…).
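A tiny example of the difference, nothing to do with InnoDB’s actual algorithm: a plain additive checksum cannot see two bytes being transposed in flight, while CRC-32 can.

```python
import zlib

def additive_checksum(data: bytes) -> int:
    # the weakest sort of checksum: a simple sum of all bytes
    return sum(data) & 0xFFFFFFFF

good = b"page contents AB"
bad  = b"page contents BA"   # two bytes transposed in flight

assert additive_checksum(good) == additive_checksum(bad)  # sum is blind to it
assert zlib.crc32(good) != zlib.crc32(bad)                # CRC-32 catches it
```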
Adaptec is the only other well-known vendor of PCI SCSI RAID with BBWC. I haven’t done any testing on those yet. The only other option is to go to bigger cases (yuck), in-built RAID controllers in the motherboard (a la Dell PERC and HP SmartArray), or to go to NAS or SAN.
Agreed, this could be avoided through the addition of end-to-end CRC. What scares me is that it isn’t implemented already. I guess I kind of assumed that the PCI bus would detect errors, as they can’t be all THAT uncommon, given the hard timing and high speed issues in play. It seems like without that protection, it’s impossible to *really* know your data, no?
I wonder if using O_DIRECT to avoid the OS caches, in combination with a read-back verify would catch this error?
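As a sketch, a read-back verify along those lines might look like this in Python. O_DIRECT is Linux-specific and requires aligned buffers, so the code falls back to a plain read where the platform or filesystem doesn’t support it; note it only proves the data was intact at the moment of the verify.

```python
import mmap
import os
import tempfile

BLK = 4096   # O_DIRECT wants block-aligned sizes and buffers

def write_and_verify(path: str, data: bytes) -> bool:
    assert len(data) % BLK == 0
    # Write the data and force it out to the device.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)                     # flush the OS cache to the device
    finally:
        os.close(fd)

    # Read it back, bypassing the page cache where possible.
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError:                      # e.g. tmpfs: no O_DIRECT support
        fd = os.open(path, os.O_RDONLY)
    buf = mmap.mmap(-1, len(data))       # page-aligned, as O_DIRECT requires
    try:
        os.readv(fd, [buf])
        return buf[:] == data
    finally:
        os.close(fd)
        buf.close()

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
assert write_and_verify(tmp.name, os.urandom(BLK))
os.unlink(tmp.name)
```

The catch is cost: doubling every write’s I/O, and a window remains in which the data can still rot on the platter after the verify passes.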
Regarding how InnoDB’s checksum currently works, Peter could probably provide more details on how InnoDB implements checksumming, and whether it is “good enough”.
This is the rationale for ZFS.
I never got an Adaptec 2120S RAID card working with a riser on a Tyan motherboard either – it would keep forgetting the array configuration – but I can’t be 100% sure the riser was the problem, since testing the card directly in a slot was physically impossible in the 2U case.
We see CRC-detected corruption routinely. I don’t recall seeing a case of corruption that neither of the InnoDB checksums detected. It has two: one covering the head and tail of a page, the other the whole page. I also don’t recall a case of random-location page corruption in InnoDB caused by InnoDB itself, though there have been a few systematic bugs in the way it manipulates its data.
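The “head and tail” check corresponds to fields visible in the error message at the top of the post: the page header stores the full LSN, and the last 8 bytes of the page store the old-style checksum plus the low 32 bits of that same LSN. A minimal sketch of just that quick consistency check, with offsets taken from the InnoDB FIL page layout (the full-page checksum is a separate computation):

```python
import struct

PAGE = 16 * 1024  # default InnoDB page size

def lsn_matches(page: bytes) -> bool:
    # Low 32 bits of the LSN in the page header (offset 16) must equal
    # the 4 bytes stored at the very end of the page.  A torn or
    # corrupted write usually breaks this equality.
    lsn = struct.unpack(">Q", page[16:24])[0]
    tail_lsn = struct.unpack(">I", page[PAGE - 4:PAGE])[0]
    return (lsn & 0xFFFFFFFF) == tail_lsn

# Build a fake page carrying the LSN from the error message above.
page = bytearray(PAGE)
struct.pack_into(">Q", page, 16, 2323869442)
struct.pack_into(">I", page, PAGE - 4, 2323869442)
assert lsn_matches(bytes(page))

page[PAGE - 1] ^= 0x01      # corrupt the tail
assert not lsn_matches(bytes(page))
```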
Given the regularity with which this corruption is found in database server hardware, I recommend not using a storage engine without CRC or better checking if you care about avoiding silent, gradual corruption of your data.
James Day, Support Engineer, MySQL AB.
Buses typically only do parity. Parity is OK, but it does not catch all errors or combinations of bit errors. Reading back data once it is committed to disk works too, but what happens over time is anyone’s guess. CRC is really the only way to do this, and it is circuitry that would be best implemented in the microprocessor itself. Are there any ASIC engineers from AMD reading this???
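A quick demonstration of the parity limitation: any even number of bit flips cancels out in a parity calculation, while CRC-32 still sees the damage. (Toy code over a whole buffer, not the actual PCI scheme, which computes parity per bus transfer.)

```python
import zlib

def parity(data: bytes) -> int:
    # XOR-accumulate all bytes, then count the 1 bits: overall bit parity.
    acc = 0
    for b in data:
        acc ^= b
    return bin(acc).count("1") & 1

good = b"some PCI payload"
# Flip one bit in each of the first two bytes: a double-bit error.
bad = bytes([good[0] ^ 0x01, good[1] ^ 0x01]) + good[2:]

assert parity(good) == parity(bad)          # parity is blind to it
assert zlib.crc32(good) != zlib.crc32(bad)  # CRC-32 sees it
```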
In the end, there is truly no way to “know” how safe your data is with today’s technology without totally bogging down your system with lots of ugly calculations…
Your comment on the PCI bus catching this error is tricky, because where is the circuitry that is supposed to catch it? Impedance mismatches and long trace lengths (relative to the bus frequency) cause oscillations and signals that fail to settle, which is typically specified as an IC’s setup/hold time. Once you’ve inserted a riser card, the PCI bus circuitry may see a “1” while the signal is still transitioning from a “0” to a “1” at the other end of the wire.
Ok, enough about hardware. I don’t do that anymore…
Sorry for the somewhat late post, but wouldn’t ZFS be a solution? You’d be forced to use Solaris (or FreeBSD, I think), but it should provide you with end-to-end integrity. Hope Linux gets something like this in the future.