I have been considering a new way to think about, measure, graph, and monitor replication lag for a while. Previously I’ve primarily used replication delay (lag), in seconds. Initially this came from SHOW SLAVE STATUS’s Seconds_Behind_Master field, later replaced by mk-heartbeat’s delay information. (These are equivalent, but mk-heartbeat is less buggy and represents true lag even with relay slaves in the replication topology.) Using delay in seconds to monitor replication has a few problems though:
- The current number of seconds delayed provides no information about how long it might take to catch up.
- It’s very difficult to determine whether a slave is catching up at all, due to the scale. If a slave is a few thousand seconds behind it’s hard to tell whether it’s catching up, falling behind, or neither, at any particular moment.
- If multiple slaves are graphed together, they may have widely different absolute delay values, and thus scales, and it can be very difficult to compare them, or to see that a slave is having problems, until it’s too late. Two slaves may be falling behind at the same rate, but if one is 300 seconds behind, and one is 20,000 seconds behind, the graphs are difficult to interpret.
Given these problems, I determined that while we need the absolute delay information available, it’s not good enough by itself. I started to think about what it is that I’m really trying to determine when looking at the delay graphs.
Vector: Velocity and direction
The key bits of information missing from the delay information seem to be:
- Is the slave falling behind, catching up, or neither? We need a measure of the direction of replication’s delay.
- How fast is the slave falling behind or catching up? We need a measure of the velocity of replication’s performance.
Fortunately these two things can be combined into a single number representing the vector of replication. This can then be presented to the user (likely a DBA) in an easy-to-consume format. The graphs can be read as follows:
- Y = 0 means the slave is neither catching up nor falling behind. It is replicating in real time. I chose zero for this state in order to make the other two states a bit more meaningful and to make the graph symmetric by default.
- Y > 0 means the slave is catching up at Y seconds per second. There is no maximum rate at which a slave can catch up, but in practice velocities >1 sustained for extended periods are relatively uncommon in already busy systems.
- Y < 0 means the slave is falling behind at |Y| seconds per second. As a special case, Y = -1 means that the slave is completely stopped and playing no events. Lagging is a function of the passage of time, so it is not possible to lag faster than one second per second.
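As a rough illustration of how these readings can be used, here is a minimal Python sketch. The function names `describe_vector` and `seconds_to_catch_up` are hypothetical, and the catch-up estimate assumes the vector stays constant, which real workloads will not guarantee.

```python
from typing import Optional


def describe_vector(y: float) -> str:
    """Translate a vector value into the three states described above."""
    if y > 0:
        return "catching up"
    if y < 0:
        return "falling behind"
    return "real time"


def seconds_to_catch_up(delay_seconds: float, y: float) -> Optional[float]:
    """Estimate clock seconds until the delay reaches zero.

    Assumes the vector holds steady at y. Returns None when the slave
    is not catching up, since no finite estimate exists in that case.
    """
    if delay_seconds <= 0:
        return 0.0  # already at current time
    if y <= 0:
        return None  # holding steady or falling behind
    # Catching up at y seconds per second removes y seconds of delay
    # for every second of clock time.
    return delay_seconds / y
```

For example, a slave that is 3000 seconds behind with a steady vector of 0.5 would need about 6000 seconds to catch up, which is exactly the kind of information the absolute delay alone cannot provide.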
I like the symmetry of having the zero line be the center point, and having healthy hosts idle with a flat line at zero. Lag appears in the form of a meander away from zero into the negative, matched by a correction into the positive that always has equal area (footnote 1), though not necessarily a similar shape. In practice the Y-scale of graphs is fixed at [-1, +1] and the graphs are very easy and quick to interpret.
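The equal-area property can be checked numerically. The sketch below is an illustration only: it assumes the vector is computed as seconds of replicated time applied per second of clock time, minus one (which matches the three states described above), and it uses made-up sample data for a slave that starts and ends with zero delay. The name `signed_areas` is hypothetical.

```python
def signed_areas(samples):
    """Return (negative_area, positive_area) under the vector graph.

    samples is a list of (ctime, htime) pairs in Unix epoch seconds,
    ordered by ctime.
    """
    negative = positive = 0.0
    for (c_prev, h_prev), (c_cur, h_cur) in zip(samples, samples[1:]):
        elapsed = c_cur - c_prev
        # Assumed formula: replicated seconds per clock second, minus 1.
        y = (h_cur - h_prev) / elapsed - 1.0
        area = y * elapsed
        if area < 0:
            negative += area
        else:
            positive += area
    return negative, positive


# Made-up data: the slave falls 60 seconds behind over two minutes,
# then catches up completely in the final minute.
samples = [(0, 0), (60, 60), (120, 90), (180, 120), (240, 240)]
print(signed_areas(samples))  # (-60.0, 60.0)
```

The two areas cancel because each interval contributes the change in heartbeat time minus the change in clock time, so the total is simply the negated change in delay between the first and last sample.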
Most slaves are replicating real time; one slave fell behind for some time before catching up again.
Vector – A few small perturbations can be seen, and one slave replicated at less than real time for many hours, before finally crossing over zero and catching up at an increasing rate until current time was reached.
Delay – The small perturbations are difficult to see due to the scale imposed by the one very delayed slave. Although it’s easy to see on a day view that the slave did catch up quickly, that is less obvious when monitoring in real time.
Many slaves with different replication rates, and a lot of trouble.
Vector – Overall replication performance is quite poor, and shows evidence of being unlikely to catch up to current time or maintain real time replication in the future.
Delay – It’s difficult to know if things are getting better or worse. The replication performance of each host is almost impossible to compare.
In basic terms, the number of seconds of replication stream applied per second of real time should be measured frequently, and with reasonably good precision. I have mk-heartbeat writing heartbeat events into a heartbeat table once a second on the master (which has an NTP-synchronized clock), providing a ready source of the progression of “replicated heartbeat time” (htime; see footnote 2) to the slaves. The slaves of course have their own NTP-synchronized clocks providing a source of local “clock time” (ctime). Both of these are collected on each slave once a minute, as integers (Unix epoch timestamps). Both the current sample (subscript c) and the previous successful sample (subscript p) are available to the processing program. The vector is calculated, stored, and sent off to be graphed once per minute.
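The calculation is not spelled out here, so the following Python sketch is an assumption reconstructed from the description: the change in htime divided by the change in ctime gives the seconds of replication stream applied per second of real time, and subtracting one centers real-time replication on zero. The names `Sample` and `replication_vector` are hypothetical.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    htime: int  # replicated heartbeat time (Unix epoch seconds)
    ctime: int  # local clock time on the slave (Unix epoch seconds)


def replication_vector(prev: Sample, cur: Sample) -> Optional[float]:
    """Compute the vector between the previous (p) and current (c) samples.

    Returns None when no clock time has elapsed between the samples,
    because the rate is undefined in that case.
    """
    elapsed = cur.ctime - prev.ctime
    if elapsed <= 0:
        return None
    applied = cur.htime - prev.htime
    # Seconds of replication applied per clock second, shifted so that
    # real-time replication is 0 and a stopped slave is -1.
    return applied / elapsed - 1.0
```

With one-minute samples, a slave that applied 60 seconds of heartbeat time yields 0, a stopped slave yields -1, and a slave that applied 120 seconds yields +1. Because the result is a ratio over the elapsed clock time, the same function works if a sample is missed and the interval is longer than a minute.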
The implementation is actually quite simple, and tolerant of almost any sampling interval. In the future it could be extended to use millisecond resolution (although its resolution can never be finer than the interval at which the heartbeat is updated).
1 This is kind of an interesting point. Since the graph is nicely centered on zero, negative numbers represent the exact same scale as positive numbers, on the same dimensions, so the area below zero while falling behind equals the area above zero while catching up.
2 SELECT UNIX_TIMESTAMP(ts) AS ts FROM heartbeat WHERE id = 1