On iostat, disk latency; iohist onward!

Just a little heads-up and a bit of MySQL-related technical content for all of you still out there following along…

At Proven Scaling, we take on MySQL performance problems pretty regularly, I’m often in need of good tools to characterize current performance and find any issues. In the database world, you’re really looking for a few things of interest related to I/O: throughput in bytes, requests, and latency. The typical tool to get this information on Linux is iostat. You would normally run it like iostat -dx 1 sda and its output would be something like this, repeating every 1 second:

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 8.00 0.00 4.00 0.00 96.00 0.00 48.00 24.00 0.06 15.75 15.75 6.30

Most of the output of iostat is interesting and reasonable for its intended purpose, which is as a general purpose way to monitor I/O. The really interesting things for most database servers (especially those in trouble) are:

avgrq-sz — Average request size, in kilobytes.
avgqu-sz — Average I/O queue length, in requests.
await — Average waiting time (in queue and scheduler) before servicing a request, in milliseconds.
svctm — Average total service time for I/O requests, in milliseconds. This includes await, so should always be higher than await. This is the most interesting number for any write-heavy transactional database server, as it translates directly to transaction commit time.
%util — Approximate percent utilization for the device.

There are one major problem with using iostat to monitor MySQL/InnoDB servers: svctm and await combine reads and writes. With a reasonably configured InnoDB, on a server with RAID with a battery-backed write cache (BBWC), reads and writes will have very different behaviour. In general, with a non-filled cache, writes should complete (to the BBWC) in just about zero milliseconds. Reads should take approximately the theoretical average time possible on the underlying disk subsystem.

I’ve often times found myself scratching my head looking at a non-sensical svctm due to reads and writes being combined together. One day I was perplexed enough to do something about it: I opened up the code for iostat to see how it worked. It turns out that the core of what it does is quite simple (so much so, I wonder why it’s C instead of Perl) — it opens /proc/diskstats, and /proc/stat and does some magic to the contents.

What I really wanted is a histogram of the reads and writes (separately, please!) for the given device. I hacked up a quick script to do that, and noticed how incredibly useful it is. I recently had to extend it to address other customer needs, so I worked on it a bit more and now it looks pretty good. Here’s an example from a test machine (so not that realistic for a MySQL server):

util:   1.27% r_ios:     0  w_ios:     1  aveq:     0,
ms : r_svctm                     : w_svctm
 0 :                             :
 1 :                             :
 2 :                             :
 3 : x                           :
 4 : x                           :
 5 : xxx                         :
 6 : xxxx                        :
 7 :                             :
 8 : x                           : x
 9 : x                           : xx
10 : x                           : xxxxx
11 :                             : xxxxxxxxxxxxxxx
12 :                             : xxxxxxxxxxxxxxxxxxxxxxxxx
13 : xx                          : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
14 :                             : xxxxxxxxxxxxxxxxxxxxx
15 : xx                          : xxxxxxxxxxx
16 : x                           : xxxxx
17 : x                           : xxxxxx
18 :                             : xxxx
19 : x                           : xx
20 :                             : x
21 :                             : x
22 :                             : x
23 :                             : x
24 :                             : x
25 :                             :
26 :                             :
27 :                             :
28 : x                           :
29 :                             :
30 :                             :
++ : 0                           : 250

It uses Curses now to avoid redrawing the entire screen, and I’ve got a ton of ideas on how to improve it. I have a few more must-haves before I release it formally to the world, but I wonder what more features people would want from it. It is Linux-only for the foreseeable future.

What do you think?

20 thoughts on “On iostat, disk latency; iohist onward!”

vadim mysqlperformanceblog.com says:

May 8, 2007 at 23:19

Great stuff!

I would like to see the same info in old text format. It is easier to copy-paste.

Reply
Colin Charles says:

May 8, 2007 at 23:31

Agreed with vadim. Text format output should be supported as well. And this is great stuff, kudos!

Reply
Kevin Burton says:

May 9, 2007 at 00:56

Hm…….. I think you’re on to something… yeah… iostat is simple.

This would sort of be a TOP for IO……… hm…… I think the concept should be extended a bit but I generally like the idea…… w00t.

Reply
Sean says:

May 9, 2007 at 04:52

I think that would be great… I’ve been fighting with some I/O problems on an NFS server and I’d love that tool.

Other features would be a time series display (ie just like iostat is now, but with reads and writes separated and in graph form) and some form of logging. I’ve just started looking at I/O scheduler tuning and have been trying to figure out a way to measure how effective a given setting is.

thanks,

Sean

Reply
Jeremy Cole says:

May 9, 2007 at 08:02

Vadim, Colin:

By “text format” do you mean raw numbers instead of a graph?

Sounds easy enough. :)

Reply
Francois Saint-Jacques says:

May 9, 2007 at 15:57

Raw numbers! You can combine them with nagios/rrd/munin…

Reply
Farhan Mashraqi says:

May 9, 2007 at 16:12

I got your IM this morning. Thank you for writing this.

More insight into IO = Less time scratching head

Since the iostat on Solaris 10 has some different options, I will run all the commands you mentioned on Solaris to see how the same info can be extracted from Solaris iostat.

BTW, the writing comes at an ideal time because I am working on some DTrace stuff to track IO. The concepts you mentioned are very helpful.

Thanks again,

Frank
PS: Man, I was seeing chatting w/ Pascal and got to see pics from the Proven Scaling’s annual trip. It made me realize how much I missed it :)

Reply
Farhan Mashraqi says:

May 9, 2007 at 16:13

s/seeing chatting/chatting/
:)

Reply
Pythian Group Blog » Log Buffer #44: a Carnival of the Vanities for DBAs
Colin Charles says:

May 14, 2007 at 18:27

@Jeremy: yeah, raw numbers for sure. Should be an easy implementation, because its easier to manipulate numbers than it is to manipulate a graph

Reply
Blake Golliher says:

June 12, 2007 at 22:07

Any update on when / if you will release this?

Reply
Dean Thompson says:

November 5, 2007 at 04:57

So — any chance you would release this tool? Pretty please, with sugar on top?

Reply
Jeremy Zawodny says:

August 10, 2008 at 08:29

I can haz download?!

Reply
Yuri says:

April 9, 2009 at 05:37

The MAN says
From MAN

await The average time (in milliseconds) for I/O requests
issued to the device to be served. This includes the time
spent by the requests in queue and the time spent servic-
ing them.

svctm
The average service time (in milliseconds) for I/O
requests that were issued to the device.

You seem to have it backwards. Your svctm includes await. Is that the conclusion you drew from looking into iostat code or just a typo? My awaits are almost always higher than svctm

Reply
Measuring & Optimizing I/O Performance - igvita.com
Capacity Planning, Architecture, Scaling, Response time, Throughput « JZ Talk Blogger
Wei, Li says:

September 10, 2012 at 01:21

» avgrq-sz — Average request size, in kilobytes.
But form the Man page:
» avgrq-sz
The average size (in sectors) of the requests that were issued to the device.
seems it is not in KB, but in sector, could you confirm that?
Thx!

Reply
Andrew says:

April 13, 2015 at 06:30

Useful – thanks
Heres what i did with it …

Jeremy Cole r_svctm w_svctm

#!/usr/bin/perl
#
# ==============================================
# Source :
# ==============================================
# https://blog.jcole.us/2007/05/08/on-iostat-disk-latency-iohist-onward/
# https://github.com/jeremycole/iohist/blob/master/iohist#
#
# This is iohist which makes histograms about IO information from Linux’s /proc/diskstats information.
# A histogram of the reads and writes (separately) for the given device.
#
# APH This is a perl script which for a specified device will
# create a histogram of IO reads and writes vs duration in ms over execution period of script.
#
# ==============================================
# Creation / Invocation :
# ==============================================
# su –
# welcome
#
# cd ~ ; pwd ; ls -lrt iohist.pl ; cat -n iohist.pl
# vi ~/iohist.pl
# chmod 755 ~/iohist.pl
# cd ~ ; pwd ; ls -lrt iohist.pl ; cat -n iohist.pl
#
# cd /proc/ ; pwd ; ls -lrt diskstats ; cat -n diskstats ;
# cd /proc/ ; pwd ; ls -lrt stat ; cat -n stat ;
#
# which perl
# perl -v | awk NF | egrep -i ‘built’
#
# cd ~ ; pwd ; ls -lrt iohist.pl ; perl iohist.pl sda
#
# ==============================================
# Author Jeremy Coles notes :
# ==============================================
# — Tool to characterize current performance and find any issues.
# — In the database world, you’re really looking for a few things of interest related to I/O:
# — throughput in bytes,
# — requests, and
# — latency.
# — The typical tool to get this information on Linux is iostat.
# — You would normally run it like iostat -dx 1 sda and its output would be something like this, repeating every 1 second:
# — iostat -dx 1
# — >>> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
# — >>> sda 0.00 8.00 0.00 4.00 0.00 96.00 0.00 48.00 24.00 0.06 15.75 15.75 6.30
# —
# — Most of the output of iostat is interesting and reasonable for its intended purpose, which is as a general purpose way to monitor I/O.
# — The really interesting things for most database servers (especially those in trouble) are:
# — — •%util — Approximate percent utilization for the device.
# — — •avgrq-sz — Average request size, in kilobytes.
# — — avgrq-sz – The average size (in sectors) of the requests that were issued to the device.
# — — •avgqu-sz — Average I/O queue length, in requests.
# — — •await — Average waiting time (in queue and scheduler) before servicing a request, in milliseconds.
# — — •svctm — Average total service time for I/O requests, in milliseconds.
# — — This includes await, so should always be higher than await.
# — — This is the most interesting number for any write-heavy transactional database server, as it translates directly to transaction commit time.
# —
# — There are one major problem with using iostat to monitor MySQL/InnoDB servers:
# — svctm and await combine reads and writes.
# — With a reasonably configured InnoDB, on a server with RAID with a battery-backed write cache (BBWC),
# — reads and writes will have very different behaviour.
# — In general, with a non-filled cache, writes should complete (to the BBWC) in just about zero milliseconds.
# — Reads should take approximately the theoretical average time possible on the underlying disk subsystem.
# —
# — I’ve often times found myself scratching my head looking at a non-sensical svctm (Average total service time for I/O requests, in milliseconds)
# — due to reads and writes being combined together.
# — One day I was perplexed enough to do something about it:
# — I opened up the code for iostat to see how it worked.
# — It turns out that the core of what it does is quite simple (so much so, I wonder why it’s C instead of Perl)
# — — it opens /proc/diskstats, and /proc/stat and does some magic to the contents.
# —
# — What I really wanted is a histogram of the reads and writes (separately, please!) for the given device.
# — I hacked up a quick script to do that, and noticed how incredibly useful it is.
# — I recently had to extend it to address other customer needs, so I worked on it a bit more and now it looks pretty good.
# —
# — Here’s an example from a test machine (so not that realistic for a MySQL server):
# — >>> util: 1.27% r_ios: 0 w_ios: 1 aveq: 0,
# — >>> ms : r_svctm : w_svctm
# — >>> 0 : :
# — >>> 1 : :
# — >>> 2 : :
# — >>> 3 : x :
# — >>> 4 : x :
# — >>> 5 : xxx :
# — >>> 6 : xxxx :
# — >>> 7 : :
# — >>> 8 : x : x
# — >>> 9 : x : xx
# — >>> 10 : x : xxxxx
# — >>> 11 : : xxxxxxxxxxxxxxx
# — >>> 12 : : xxxxxxxxxxxxxxxxxxxxxxxxx
# — >>> 13 : xx : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# — >>> 14 : : xxxxxxxxxxxxxxxxxxxxx
# — >>> 15 : xx : xxxxxxxxxxx
# — >>> 16 : x : xxxxx
# — >>> 17 : x : xxxxxx
# — >>> 18 : : xxxx
# — >>> 19 : x : xx
# — >>> 20 : : x
# — >>> 21 : : x
# — >>> 22 : : x
# — >>> 23 : : x
# — >>> 24 : : x
# — >>> 25 : :
# — >>> 26 : :
# — >>> 27 : :
# — >>> 28 : x :
# — >>> 29 : :
# — >>> 30 : :
# — >>> ++ : 0 : 250
# —
# — It uses Curses now to avoid redrawing the entire screen, and I’ve got a ton of ideas on how to improve it.
# — I have a few more must-haves before I release it formally to the world, but I wonder what more features people would want from it.
# — It is Linux-only for the foreseeable future.
#
# ==============================================
# dlhobaptest01 @20150413 cd ~ ; pwd ; ls -lrt iohist.pl ; perl iohist.pl sda
# ==============================================
# >>> ms : r_svctm : w_svctm : a_svctm
# >>> 0 : : :
# >>> 1 : : :
# >>> 2 : xxx : :
# >>> 3 : xx : :
# >>> 4 : xx : :
# >>> 5 : xxxx : xxx : xxxxx
# >>> 6 : xx : xxxxxxxxxxxxxxxxxxxx : xxxxxxxxxxxxxxxxxxx
# >>> 7 : xx : xxxxxxxxxxxxxxxxxx : xxxxxxxxxxxxxxxxxxxx
# >>> 8 : x : xxxxxxxxxxxxx : xxxxxxxxxxxxx
# >>> 9 : x : xxxxxxxxxxxxx : xxxxxxxxxxxxx
# >>> 10 : : xxxxx : xxxxxx
# >>> 11 : : xxxxx : xxxx
# >>> 12 : : xxx : xxx
# >>> 13 : : xxx : xxxx
# >>> 14 : : xxxx : xxxx
# >>> 15 : : xx : xx
# >>> 16 : : xx : xx
# >>> 17 : : xx : x
# >>> 18 : : xxx : xxx
# >>> 19 : : xx : xx
# >>> 20 : : xx : xx
# >>> 21 : : xx : xx
# >>> 22 : : xx : xx
# >>> 23 : : x : x
# >>> 24 : : xx : xx
# >>> 25 : : x :
# >>> 26 : : :
# >>> 27 : : : x
# >>> 28 : : x : x
# >>> 29 : : x :
# >>> 30 : : :
# >>> 31 : : x : x
# >>> 32 : : :
# >>> 33 : : :
# >>> 34 : : :
# >>> 35 : : :
# >>> 36 : : x : x
# >>> ++ : 0 : 25 : 25
#
# ==============================================
# dlhobdbtest01 @20150413 cd ~ ; pwd ; ls -lrt iohist.pl ; perl iohist.pl sda
# ==============================================
# >>> ms : r_svctm : w_svctm : a_svctm
# >>> 0 : xx : :
# >>> 1 : x : :
# >>> 2 : xxxxx : :
# >>> 3 : xxxxxxxxxxxxxx : : xxx
# >>> 4 : xx : : x
# >>> 5 : xxxxxxxx : xxxxx : xxxxxxx
# >>> 6 : xxxxxx : xxxxxxxxxxxxxx : xxxxxxxxxxxxxxxxxxx
# >>> 7 : xxxxx : xxxxxxxxxxxxxxxxxxxx : xxxxxxxxxxxxxxxxxxxx
# >>> 8 : xxxxx : xxxxxxxxxxxxxxxxxxx : xxxxxxxxxxxxxxxxxxx
# >>> 9 : xxxxxx : xxxxxxxxxxx : xxxxxxxxxx
# >>> 10 : xxxxx : xxxxx : xxxxx
# >>> 11 : xx : x : xx
# >>> 12 : x : x : x
# >>> 13 : : : x
# >>> 14 : : x : x
# >>> 15 : : :
# >>> 16 : xx : :
# >>> 17 : : x :
# >>> 18 : : x : x
# >>> 19 : : : x
# >>> 20 : : x : x
# >>> 21 : : x : x
# >>> 22 : : : x
# >>> 23 : : :
# >>> 24 : x : :
# >>> 25 : x : x : x
# >>> 26 : : : x
# >>> 27 : : xx : x
# >>> 28 : : x : x
# >>> 29 : : x : x
# >>> 30 : : :
# >>> 31 : : x : x
# >>> 32 : : : x
# >>> 33 : : :
# >>> 34 : : :
# >>> 35 : : : x
# >>> 36 : : :
# >>> ++ : 6 : 265 : 256
#

use strict;
use Data::Dumper;

my $hist_width = 20;
my $hist_max = 37;

sub parse_stat
{
my $line = shift;
my %stat = ();

$line =~ /^\s*([0-9]+)\s+([0-9]+)\s+([a-zA-Z0-9\/\-]+)\s+(.*)$/;
$stat{‘major’} = $1;
$stat{‘minor’} = $2;
$stat{‘device’} = $3;
my $metric_line = $4;

$metric_line =~ s/[ ]+/|/g;
my @metrics = split /[|]/, $metric_line;

$stat{‘r_ios’} = $metrics[0];
$stat{‘r_merges’} = $metrics[1];
$stat{‘r_sectors’} = $metrics[2];
$stat{‘r_ticks’} = $metrics[3];
$stat{‘w_ios’} = $metrics[4];
$stat{‘w_merges’} = $metrics[5];
$stat{‘w_sectors’} = $metrics[6];
$stat{‘w_ticks’} = $metrics[7];
$stat{‘in_progress’} = $metrics[9];
$stat{‘ticks’} = $metrics[10];
$stat{‘aveq’} = $metrics[11];

return \%stat;
}

sub get_stats
{
my %stats = ();

open STATS, “</proc/diskstats";

while(my $line = )
{
my $stat = parse_stat($line);
$stats{$stat->{‘device’}} = $stat;
}

close STATS;
return \%stats
}

sub diff_stat
{
my $stat_a = shift;
my $stat_b = shift;
my $diff = {};

my @metrics = (
‘r_ios’, ‘r_merges’, ‘r_sectors’, ‘r_ticks’,
‘w_ios’, ‘w_merges’, ‘w_sectors’, ‘w_ticks’,
‘in_progress’, ‘ticks’, ‘aveq’
);

foreach my $metric (@metrics)
{
$diff->{$metric} = $stat_a->{$metric} – $stat_b->{$metric};
}

my $a_ios = ($diff->{‘r_ios’} + $diff->{‘w_ios’});
$diff->{‘r_svctm’} =
$diff->{‘r_ios’} ? $diff->{‘r_ticks’} / $diff->{‘r_ios’} : 0;
$diff->{‘w_svctm’} =
$diff->{‘w_ios’} ? $diff->{‘w_ticks’} / $diff->{‘w_ios’} : 0;
$diff->{‘a_svctm’} = $a_ios ? $diff->{‘ticks’} / $a_ios : 0;

return $diff;
}

sub max
{
my $a = shift;
my $b = shift;

return $a>$b?$a:$b;
}

sub main
{
my $old_stats = undef;
my $new_stats = &get_stats;

my %buckets = ();
my %biggest_bucket = ();
my %scaling_factor = ();

my @bucket_names = (
‘r_svctm’, ‘w_svctm’, ‘a_svctm’
);

my $device = shift @ARGV;
die(“Unknown device”) unless defined($new_stats->{$device});

while(1)
{
sleep 1;
$old_stats = $new_stats;
$new_stats = &get_stats;

my $diff = &diff_stat($new_stats->{$device}, $old_stats->{$device});

foreach my $bucket (@bucket_names)
{
my $report_ms = $diff->{$bucket};
if($report_ms > 0.0)
{
$biggest_bucket{$bucket} =
max($biggest_bucket{$bucket}, ++$buckets{$bucket}[int($report_ms)]);
}
if($biggest_bucket{$bucket} > $hist_width)
{
$scaling_factor{$bucket} = $biggest_bucket{$bucket} / $hist_width;
} else {
$scaling_factor{$bucket} = 1;
}
}

#printf “%10.2f%10.2f%10.2f\n”, $diff->{r_svctm}, $diff->{w_svctm}, $diff->{a_svctm};

print “ms “;
foreach my $bucket (@bucket_names)
{
printf “: %-20s “, $bucket;
}
print “\n”;
for(my $x=0; $x < $hist_max; $x++)
{
printf "%2i ", $x;
foreach my $bucket (@bucket_names)
{
printf ": %-20s ", "x" x int($buckets{$bucket}[$x] / $scaling_factor{$bucket});
}
print "\n";
}
printf "++ ";
foreach my $bucket (@bucket_names)
{
my $value = 0;
for(my $x=$hist_max; $x < 20000; $x++)
{
$value += $buckets{$bucket}[$x];
}
printf ": %-20i ", $value;
}
print "\n";
}
}

&main;

Reply
iostat 服务器性能监控 | 飞行日志
ankit kakkar says:

July 14, 2016 at 03:58

I think you have mixed up svctm and await.

svctm – The average service time (in milliseconds) for I/O requests that were issued to the device. and it doesn’t include the waiting time.

await – The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

http://linux.die.net/man/1/iostat

Reply