I just opened MySQL Bug #26123 to find out how many people are seeing this possible replication bug. A few Proven Scaling customers have hit it, but I haven’t been able to reproduce it, so I opened the bug as a feeler. It appears to have something to do with replicating BLOB or TEXT fields.
Are you seeing slaves stop with corrupted relay logs? Does restarting replication using CHANGE MASTER and the Exec_Master_Log_Pos from the stopped slave [1] work just fine? Do the master’s binary logs look perfectly OK? Leave a comment on the bug.
[1] This effectively forces the slave to re-download the exact same log events that it currently has in its relay logs. Since the corruption appears to happen either in the master’s binlog dump thread (the thread that feeds the slave) or in the slave’s replication IO thread, this gets things going again.
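For anyone who wants to script that workaround, here is a minimal sketch of how the statements could be assembled. It assumes you have already read the slave's SHOW SLAVE STATUS row into a dict. The function name `build_resync_statements` is made up for illustration. Note that the file name paired with Exec_Master_Log_Pos is Relay_Master_Log_File, not Master_Log_File, which only tracks how far the IO thread has read.

```python
def build_resync_statements(status):
    """Build the statements that restart replication from the last
    position the slave actually executed.

    `status` is assumed to be a dict holding the columns of one
    SHOW SLAVE STATUS row (hypothetical input; fetch it however you like).
    """
    # Exec_Master_Log_Pos is an offset into Relay_Master_Log_File,
    # the master binlog that the SQL thread was executing from.
    log_file = status["Relay_Master_Log_File"]
    log_pos = int(status["Exec_Master_Log_Pos"])

    return [
        "STOP SLAVE",
        # Giving an explicit file and position makes the slave discard
        # its relay logs and fetch the events from the master again.
        "CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d"
        % (log_file, log_pos),
        "START SLAVE",
    ]


if __name__ == "__main__":
    # Example values only; they are not taken from a real server.
    example = {
        "Master_Log_File": "mysql-bin.000043",
        "Relay_Master_Log_File": "mysql-bin.000042",
        "Exec_Master_Log_Pos": "1337",
    }
    for statement in build_resync_statements(example):
        print(statement + ";")
```

The sketch only builds strings; running them against the stopped slave is left to whatever client you already use.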
I’ve experienced it too, and running CHANGE MASTER TO … ; START SLAVE; fixes it every time. It probably happens about once every few months. This is with 4.0 and 4.1.
I opened bug http://bugs.mysql.com/bug.php?id=25737 to request checksums on binlog events. I see enough binlog corruption that I’m sure some corrupt events are reaching the slave and executing (maybe a byte is scrambled but the SQL is still parseable). Maybe I’m paranoid, but I think this probably explains at least some of the cases where a slave gets out of sync with the master.
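To illustrate why a per-event checksum would catch the "scrambled byte, still parseable" case, here is a rough sketch using CRC32. It is not MySQL's binlog format and not what the bug report specifies. The framing (payload followed by a 4-byte little-endian CRC) and the names `frame_event` and `verify_event` are assumptions made up for the example.

```python
import zlib


def frame_event(payload: bytes) -> bytes:
    """Append a CRC32 of the payload (hypothetical framing, not MySQL's)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def verify_event(framed: bytes) -> bool:
    """Recompute the CRC32 and compare it with the stored trailer."""
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


if __name__ == "__main__":
    event = frame_event(b"UPDATE t SET note='abc' WHERE id=1")

    # Scramble one byte in transit: 'abc' becomes 'abd'. The statement
    # is still valid SQL, so the slave would happily execute it.
    corrupted = event.replace(b"'abc'", b"'abd'")

    print(verify_event(event))
    print(verify_event(corrupted))
```

The corrupted statement would run without error on the slave, but the checksum comparison fails, so the event could be rejected and fetched again instead of silently diverging from the master.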