I have been considering a new way to think about, measure, graph, and monitor replication lag for a while. Previously I had always primarily used replication delay (lag), in seconds. Initially this came from SHOW SLAVE STATUS's Seconds_Behind_Master field, later replaced by mk-heartbeat's delay information. (These are equivalent, but mk-heartbeat is less buggy and correctly represents true lag through relay replicas in the replication topology.) Using delay in seconds to monitor replication has a few problems, though:
- The current number of seconds delayed provides no information about how long it might take to catch up.
- It’s very difficult to determine whether a replica is catching up at all, due to the scale: if a replica is a few thousand seconds behind, it’s hard to tell at any particular moment whether it’s catching up, falling behind, or neither.
- If multiple replicas are graphed together, they may have widely different absolute delay values, and thus scales, and it can be very difficult to compare them, or to see that a replica is having problems, until it’s too late. Two replicas may be falling behind at the same rate, but if one is 300 seconds behind, and one is 20,000 seconds behind, the graphs are difficult to interpret.
Given these problems, I determined that while we need the absolute delay information available, it’s not good enough by itself. I started to think about what it is that I’m really trying to determine when looking at the delay graphs.
Vector: Velocity and direction
The key bits of information missing from the delay information seem to be:
- Is the replica falling behind, catching up, or neither? We need a measure of the direction of replication’s delay.
- How fast is the replica falling behind or catching up? We need a measure of the velocity of replication’s performance.
Fortunately these two things can be combined into a single number representing the vector of replication. This can then be presented to the user (likely a DBA) in an easy-to-consume format. The graphs can be read as follows:
- Y = 0 means the replica is neither catching up nor falling behind. It is replicating in real time. I chose zero for this state in order to make the other two states a bit more meaningful and to make the graph symmetric by default.
- Y > 0 means the replica is catching up at Y seconds per second. There is no maximum rate a replica can catch up, but in practice seeing velocities >1 for extended periods is relatively uncommon in already busy systems.
- Y < 0 means the replica is falling behind at Y seconds per second. As a special case, Y = -1 means that the replica is completely stopped and playing no events. Lagging is a function of the passage of time, so it is not possible to lag faster than one second per second.
I like the symmetry of having the zero line be the center point, and having healthy hosts idle with a flat line at zero. Lag appears in the form of a meander away from zero into the negative, matched by an always equal-area [1] (but not necessarily similarly shaped) correction into the positive. In practice the Y-scale of graphs is fixed at [-1, +1] and the graphs are very easy and quick to interpret.
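One practical payoff of the vector form is that, combined with the absolute delay, it yields a rough time-to-catch-up estimate, which delay alone cannot provide. The following is a minimal sketch of that idea; the function name is mine, not from the original implementation, and it assumes the current vector holds steady.

```python
def seconds_to_catch_up(delay_seconds, vector):
    """Estimated clock seconds until delay reaches zero, assuming the
    current replication vector holds. Only meaningful while the replica
    is actually catching up (vector > 0)."""
    if vector <= 0:
        # Flat or falling behind: at this rate it never catches up.
        return float("inf")
    return delay_seconds / vector

# A replica 3600 seconds behind, catching up at 0.5 seconds per second,
# needs two hours of clock time to reach real time:
print(seconds_to_catch_up(3600, 0.5))   # 7200.0
```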
Most replicas are replicating real time; one replica fell behind for some time before catching up again.
Vector – A few small perturbations can be seen, and one replica replicated at less than real time for many hours, before finally crossing over zero and catching up at an increasing rate until current time was reached.
Delay – The small perturbations are difficult to see due to the scale imposed by the one very delayed replica. Although it’s easy to see on a day view that the replica did catch up quickly, that is less obvious when monitoring in real time.
Many replicas with different replication rates, and a lot of trouble.
Vector – Overall replication performance is quite poor, and shows evidence of being unlikely to catch up to current time or maintain real time replication in the future.
Delay – It’s difficult to know if things are getting better or worse. The replication performance of each host is almost impossible to compare.
In basic terms, the number of seconds of replication stream applied per second of real time should be measured frequently, and with reasonably good precision. I have mk-heartbeat writing heartbeat events into a heartbeat table once a second on the master (which has an NTP-synchronized clock), providing a ready source of the progression of “replicated heartbeat time” [2] (htime) to the replicas. The replicas, of course, have their own NTP-synchronized clocks providing a source of local “clock time” (ctime). Both of these are collected on each replica once a minute, as integers (Unix epoch timestamps). Both the current sample (subscript c) and the previous successful sample (subscript p) are available to the processing program. The vector is calculated, stored, and sent off to be graphed once per minute.
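The calculation described above can be sketched as follows. This is my own minimal reconstruction from the description, not the original implementation; it takes the current and previous (htime, ctime) samples as Unix timestamps and returns the vector.

```python
def replication_vector(htime_p, ctime_p, htime_c, ctime_c):
    """Seconds of replication stream applied per second of clock time,
    offset by -1 so that 0 = real time, -1 = stopped, > 0 = catching up."""
    dc = ctime_c - ctime_p  # elapsed clock time on the replica
    dh = htime_c - htime_p  # heartbeat time applied in that window
    if dc <= 0:
        return None         # bad or duplicate sample; skip this interval
    return dh / dc - 1.0

# Applied 60 heartbeat seconds in 60 clock seconds: real time.
print(replication_vector(1000, 1000, 1060, 1060))   # 0.0

# Applied 90 heartbeat seconds in 60 clock seconds: catching up at 0.5 s/s.
print(replication_vector(1000, 1000, 1090, 1060))   # 0.5

# Applied no heartbeat events at all: completely stopped.
print(replication_vector(1000, 1000, 1000, 1060))   # -1.0
```

Because the vector is a ratio over the sampled interval, the same formula works unchanged for any sampling interval, which is what makes the implementation tolerant of irregular collection.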
The implementation is actually quite simple, and tolerant of almost any sampling interval. In the future it could be extended to use millisecond resolution (although the resolution can never exceed the frequency at which the heartbeat is updated).
[1] This is kind of an interesting point. Since the graph is centered on zero, negative numbers represent the exact same scale as positive numbers, in the same dimensions: every second of lag accumulated below the line must eventually be repaid by an equal area above it.
[2] SELECT UNIX_TIMESTAMP(ts) AS ts FROM heartbeat WHERE id = 1