I’ve been working with a Proven Scaling customer that recently ran into some interesting InnoDB corruption issues, which produced messages similar to these:
InnoDB: Page checksum 3156980109, prior-to-4.0.14-form checksum 577557610
InnoDB: stored checksum 741279449, prior-to-4.0.14-form stored checksum 577557610
InnoDB: Page lsn 0 2323869442, low 4 bytes of lsn at page end 2323869442
InnoDB: Page number (if stored to page already) 195716,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 2831
InnoDB: (index PRIMARY of table db/table)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 195716.
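To make the message a little easier to read, here is a rough Python sketch that pulls out the paired values and reports which ones agree. The function name `parse_corruption_report` and the dictionary keys are my own invention for illustration, and I am assuming the two numbers after "Page lsn" are the high and low 32-bit words of the LSN. It only compares the numbers InnoDB already printed; it does not recompute any checksum.

```python
import re

# First three lines of the error message shown above.
MESSAGE = """\
InnoDB: Page checksum 3156980109, prior-to-4.0.14-form checksum 577557610
InnoDB: stored checksum 741279449, prior-to-4.0.14-form stored checksum 577557610
InnoDB: Page lsn 0 2323869442, low 4 bytes of lsn at page end 2323869442
"""


def _find_int(pattern, text, group=1):
    """Return the integer captured by `pattern`, or raise if it is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("pattern not found: " + pattern)
    return int(match.group(group))


def parse_corruption_report(text):
    """Compare each calculated value in the report with its stored twin."""
    calc = _find_int(r"Page checksum (\d+)", text)
    stored = _find_int(r"InnoDB: stored checksum (\d+)", text)
    calc_old = _find_int(r"prior-to-4\.0\.14-form checksum (\d+)", text)
    stored_old = _find_int(r"prior-to-4\.0\.14-form stored checksum (\d+)", text)
    # Assumption: "Page lsn <high> <low>" prints two 32-bit words.
    lsn_low = _find_int(r"Page lsn (\d+) (\d+)", text, group=2)
    lsn_at_end = _find_int(r"low 4 bytes of lsn at page end (\d+)", text)
    return {
        "checksum_ok": calc == stored,
        "old_checksum_ok": calc_old == stored_old,
        "lsn_ok": lsn_low == lsn_at_end,
    }


report = parse_corruption_report(MESSAGE)
for name, ok in report.items():
    print(name, "matches" if ok else "MISMATCH")
```

For the message above, only the new-style checksum pair disagrees; the old-style checksum and the LSN copies agree with each other.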
The problem was encountered while testing hardware for a move from software RAID to hardware RAID using LSI MegaRAID SCSI 320-2 cards. The servers are 1U machines with Tyan motherboards and a PCI riser card that the MegaRAID plugs into.
They were receiving the same messages on several different machines, which ruled out a single bad piece of hardware. After weeks of testing different configurations and isolating variables, the problem was tracked down to the PCI riser cards. Searching for “lsi pci riser” shows quite a few people having similar issues.
It turns out that LSI “does not support” using their cards with PCI risers, at all. Maybe they should reword that a bit: their cards simply don’t work with PCI risers.
The scariest part of the whole exercise, though, is that the corruption was occurring completely silently: data comes in and is written to disk, but gets corrupted in flight. Because the OS wrote the data, its cache still holds the correct copy, while the disks contain something different. The corruption is only discovered when the page is read back from disk quite a bit later, after it has been flushed from cache.
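A crude way to shake out this kind of problem during hardware testing is to write known data, force it to the device, and then compare a digest of what comes back. Here is a minimal Python sketch of that idea, not what we actually ran. The function name `write_and_verify` is hypothetical, and the `posix_fadvise` call is only a hint: the kernel may still serve the read from cache, so a real test would want to write far more data than fits in RAM or read it back after a reboot.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, data):
    """Write data, fsync it, then re-read the file and compare digests."""
    expected = hashlib.sha256(data).hexdigest()

    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS cache to the device

    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to drop cached pages for this file (advisory only,
        # and not available on every platform).
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
        digest = hashlib.sha256()
        while True:
            chunk = os.read(fd, 1 << 20)
            if not chunk:
                break
            digest.update(chunk)
    finally:
        os.close(fd)

    return digest.hexdigest() == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "verify.bin")
        ok = write_and_verify(target, os.urandom(4 * 1024 * 1024))
        print("read-back matches" if ok else "CORRUPTION DETECTED")
```

On healthy hardware this should always report a match; the point is that neither the OS nor the RAID card performs this comparison on its own.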
You’d think that somewhere along the line, the OS or the RAID card would catch the corruption?