Vector: A new way to think about replication performance and replica lag

Background

I have been considering a new way to think about, measure, graph, and monitor replication lag for a while. Previously I have always primarily used replication delay (lag), in seconds. Initially this came from SHOW SLAVE STATUS’s Seconds_Behind_Master field, later replaced by mk-heartbeat’s delay information. (These are equivalent, but mk-heartbeat is less buggy and correctly represents true lag when there are relay replicas in the replication topology.) Using delay in seconds to monitor replication has a few problems though:

  • The current number of seconds delayed provides no information about how long it might take to catch up.
  • It’s very difficult to determine whether a replica is catching up at all, due to the scale. If a replica is a few thousand seconds behind it’s hard to tell whether it’s catching up, falling behind, or neither, at any particular moment.
  • If multiple replicas are graphed together, they may have widely different absolute delay values, and thus scales, and it can be very difficult to compare them, or to see that a replica is having problems, until it’s too late. Two replicas may be falling behind at the same rate, but if one is 300 seconds behind, and one is 20,000 seconds behind, the graphs are difficult to interpret.

Given these problems, I determined that while we need the absolute delay information available, it’s not good enough by itself. I started to think about what it is that I’m really trying to determine when looking at the delay graphs.

Vector: Velocity and direction

The key bits of information missing from the delay information seem to be:

  • Is the replica falling behind, catching up, or neither? We need a measure of the direction of replication’s delay.
  • How fast is the replica falling behind or catching up? We need a measure of the velocity of replication’s performance.

Fortunately these two things can be combined into a single number representing the vector of replication. This can then be presented to the user (likely a DBA) in an easy-to-consume format. The graphs can be read as follows:

  • Y = 0 means the replica is neither catching up nor falling behind. It is replicating in real time. I chose zero for this state in order to make the other two states a bit more meaningful and to make the graph symmetric by default.
  • Y > 0 means the replica is catching up at Y seconds per second. There is no maximum rate a replica can catch up, but in practice seeing velocities >1 for extended periods is relatively uncommon in already busy systems.
  • Y < 0 means the replica is falling behind at Y seconds per second. As a special case, Y = -1 means that the replica is completely stopped and playing no events. Lagging is a function of the passage of time, so it is not possible to lag faster than one second per second.

I like the symmetry of having the zero line be the center point, and having healthy hosts idle with a flat line at zero. Lag appears in the form of a meander away from zero into the negative, matched by an always equal-area [1] (but not necessarily similarly shaped) correction into the positive. In practice the Y-scale of graphs is fixed at [-1, +1] and the graphs are very easy and quick to interpret.
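
To make those reading rules concrete, a monitoring check could translate a vector value into words along these lines (an illustrative sketch only; the function is made up for this post):

def describe_vector(y):
    """Translate a replication vector (in seconds per second) into words."""
    if y == -1:
        return "stopped: no replication events are being applied at all"
    if y < 0:
        return "falling behind at %.2f seconds per second" % -y
    if y > 0:
        return "catching up at %.2f seconds per second" % y
    return "replicating in real time"

For example, describe_vector(-0.25) reports falling behind at a quarter of a second per second, while describe_vector(0.5) reports catching up at half a second per second.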

Example 1

Most replicas are replicating real time; one replica fell behind for some time before catching up again.


Vector – A few small perturbations can be seen, and one replica replicated at less than real time for many hours, before finally crossing over zero and catching up at an increasing rate until current time was reached.


Delay – The small perturbations are difficult to see due to the scale imposed by the one very delayed replica. Although it’s easy to see on a day view that the replica did catch up quickly, that is less obvious when monitoring in real time.

Example 2

Many replicas with different replication rates, and a lot of trouble.


Vector – Overall replication performance is quite poor, and shows evidence of being unlikely to catch up to current time or maintain real time replication in the future.


Delay – It’s difficult to know if things are getting better or worse. The replication performance of each host is almost impossible to compare.

Implementation

In basic terms, the number of seconds of replication stream applied per second of real time should be measured frequently, and with reasonably good precision. I have mk-heartbeat writing heartbeat events into a heartbeat table once a second on the master (which has an NTP-synchronized clock), providing a ready source of the progression of “replicated heartbeat time” [2] (htime) to the replicas. The replicas of course have their own NTP-synchronized clocks providing a source of local “clock time” (ctime). Both of these are collected on each replica once a minute, as integers (Unix epoch timestamps). Both the current sample (subscript c) and the previous successful sample (subscript p) are available to the processing program. The vector is calculated, stored, and sent off to be graphed once per minute.

The implementation is actually quite simple, and tolerant of almost any sampling interval. In the future it could be extended to use millisecond resolution (although it can never have higher resolution than the frequency at which the heartbeat is updated).
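
The core of the calculation, given the two samples described above, is just the ratio of replicated heartbeat time applied to clock time elapsed, minus one. Something along these lines does the job (a minimal sketch in Python; the names follow the subscripts above, and the error handling is only illustrative, not the production code):

def replication_vector(htime_p, ctime_p, htime_c, ctime_c):
    """Seconds of replication stream applied per second of real time, minus one.

    htime_*: replicated heartbeat time (Unix epoch) read from the heartbeat table
    ctime_*: local clock time (Unix epoch) at the moment of sampling
    _p, _c:  previous and current samples
    """
    elapsed = ctime_c - ctime_p   # real time that passed between the two samples
    applied = htime_c - htime_p   # replicated time applied during that window
    if elapsed <= 0:
        return None               # clock went backwards or duplicate sample; skip it
    return float(applied) / elapsed - 1.0

A replica that applied 45 seconds of heartbeat time during a 60-second sampling interval yields 45/60 - 1 = -0.25, falling behind at a quarter of a second per second; one that applied 90 seconds in the same interval yields +0.5, catching up at half a second per second; one that applied none at all yields -1, completely stopped.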

[1] This is kind of an interesting point. Since the graph is nicely centered on zero, negative numbers represent the exact same scale as positive numbers, on the same dimensions.

[2] SELECT UNIX_TIMESTAMP(ts) AS ts FROM heartbeat WHERE id = 1

January 2011 MySQL Meetup: Clustrix

Clustrix will be presenting at this month’s Silicon Valley MySQL Meetup in Sunnyvale, CA. Stop by if you can!

Paul Mikesell (CEO and VP Engineering) and Aaron Passey (CTO) will be discussing the unique architecture behind Clustrix’s massively scalable, distributed, MySQL-compatible database solution. They will talk about how the company has addressed the common challenges associated with achieving massive scale for transactional (OLTP) relational workloads.

Prior to developing the Clustrix solution over the past four years, Paul was a co-founder of Isilon Systems, which was acquired by EMC in December 2010 for $2.25B and which still has the largest clustering capability (>120 nodes) of any NAS solution.

Read more and RSVP on Meetup.com!

InnoDB online index add and “The table ‘t’ is full” error

While trying to add some indexes to a fairly large table today, on a server [1] I’d not worked on previously, I got the following error after some time (and while I was away from the computer):

mysql> ALTER TABLE t
    ->   ADD KEY `index_a` (`a`),
    ->   ADD KEY `index_b` (`b`),
    ->   ADD KEY `index_c` (`c`);
ERROR 1114 (HY000): The table 't' is full

The error log did not bring particular enlightenment (as usual, InnoDB is extremely verbose with the logs, without saying anything useful):

110105 16:22:30  InnoDB: Error: Write to file (merge) failed at offset 4 1387266048.
InnoDB: 1048576 bytes should have been written, only 888832 were written.
InnoDB: Operating system error number 0.
InnoDB: Check that your OS and file system support files of this size.
InnoDB: Check also that the disk is not full or a disk quota exceeded.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html
110105 16:23:00 [ERROR] /usr/sbin/mysqld: The table 't' is full

I had to re-run the command while keeping a close eye on things, and I discovered that it was writing significant amounts of data to the root file system (which isn’t very big, as usual). I looked in all the usual places, and didn’t see any files of note. However, on a hunch, I checked out /proc/<pid>/fd (which can be a lifesaver). I found these:

# ls -l /proc/`pidof mysqld`/fd | grep deleted
lrwx------ 1 root root 64 Jan  5 17:33 14 -> /var/tmp/ibEnEaSj (deleted)
lrwx------ 1 root root 64 Jan  5 17:33 5 -> /var/tmp/ibEoZHQc (deleted)
lrwx------ 1 root root 64 Jan  5 17:33 6 -> /var/tmp/ibHlWZb3 (deleted)
lrwx------ 1 root root 64 Jan  5 17:33 7 -> /var/tmp/ibUtVhxT (deleted)
lrwx------ 1 root root 64 Jan  5 17:33 8 -> /var/tmp/ibt1daDR (deleted)

I can only assume it’s one of these files that’s growing. Changing the setting of tmpdir fixed things up, and it’s writing its large data files to a place with significant space now (and on a much bigger and faster RAID array, to boot). However, this brings with it a few questions:

  1. Why does InnoDB need significant space in tmpdir? This is a new requirement with the InnoDB plugin (due to online index addition only?), but I don’t see it documented anywhere. [2]
  2. Why are the files deleted while in use? This makes it very painful for a DBA to manage and to see what’s using space. I know it’s a typical Unix paradigm, but cleaning up at startup and leaving the files linked while in use is much easier to manage.
  3. Why are the error messages useless? How else is a DBA supposed to track this down?

I could also note that using online index add removes the only previous way of getting some sort of a status update while adding indexes: watching the temporary file grow. Perhaps it’s time to bring back my patch to show progress (MySQL Bug #26182)?
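
That said, since the merge files remain visible through /proc even after being unlinked, it is possible to approximate a progress indicator by polling the sizes of the deleted file descriptors. A rough sketch (the deleted-file heuristic and ten-second interval are arbitrary choices, and it needs enough privilege to read /proc/<pid>/fd):

import os
import sys
import time

def deleted_fd_sizes(pid):
    """Return {fd: (target, size)} for files a process holds open but has unlinked."""
    fd_dir = "/proc/%d/fd" % pid
    sizes = {}
    for fd in os.listdir(fd_dir):
        path = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(path)
            if target.endswith(" (deleted)"):
                # stat() follows the fd link to the unlinked file, so st_size
                # still shows how large the invisible file has grown
                sizes[int(fd)] = (target, os.stat(path).st_size)
        except OSError:
            pass  # the fd was closed between listdir() and readlink()
    return sizes

if __name__ == "__main__":
    pid = int(sys.argv[1])  # e.g. the output of `pidof mysqld`
    while True:
        for fd, (target, size) in sorted(deleted_fd_sizes(pid).items()):
            print("fd %d -> %s: %d bytes" % (fd, target, size))
        time.sleep(10)

It is not a real progress bar, but watching which of those deleted files is still growing at least gives some indication that the index build is moving along.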

[1] Running MySQL-server-percona-5.1.42-9.rhel5 with innodb_version = 1.0.6-9.

[2] Perhaps it should go on Fast Index Creation in the InnoDB Storage Engine: Limitations at the very least?

Now a Database Architect at Twitter

Just a small announcement:

Starting today, I am now “MySQL Database Architect” at Twitter, where I am joining some old friends on the small but hard-working DBA and operations teams there. I’ll be working to help debug, support, and scale the MySQL databases, of course, and who knows what else. I’m looking forward to the challenges and fast-paced operations again. I’m also looking forward to writing a lot more on this blog about MySQL. I’ve had a Twitter account for a long time, but I suppose I’ll write a lot more on it now.

Since I will now be commuting from Sunnyvale to San Francisco nearly daily, I think my old “Commuting” category will get more of a work-out too.

On early MySQL development hostnames

While reading through the manual I ran across something I had totally forgotten about from the early MySQL days. Early on, Monty (or was it Jani?) decided to name many development servers variants of “bitch” in different languages. I have no idea what the back-story was, but maybe Monty or Jani can fill it in. All of these names live on all over the place, such as in the MySQL and InnoDB documentation, bug reports, and mailing list messages. See:

  • bitch.mysql.fi — English, of course.
  • hundin.mysql.fi — German
  • hynda.mysql.fi — Swedish
  • narttu.mysql.fi — Finnish
  • tik.mysql.fi — Swedish
  • tramp.mysql.fi — English, probably. Similar in meaning to some slang uses of “bitch”.

There are a few honorable mentions, which I’m not sure are variants of “bitch”, but very well could be:

  • donna.mysql.fi
  • mashka.mysql.fi
  • mishka.mysql.fi

It’s funny to see how those names live on in “infamy” on Google. Try searching for any of them like “+mysql +hundin”.