RAID: Alive and well in the real world

Kevin Burton wrote a sort-of-reply to my call for action in getting LSI to open source their CLI tool for the LSI MegaRAID SAS aka Dell PERC 5/i, where he asserted that “RAID is dying”. I’d like to assert otherwise. In my world, RAID is quite alive and well. Why?

  • RAID is cheap. Contrary to popular opinion, RAID isn’t really that expensive. The controller is cheap (only $299 for Dell’s PERC 5/i, with BBWC, if you pay full retail). The “2x” disk usage in RAID 10 is really quite debatable, since those disks aren’t just wasting space, they are also improving read (and consequently write) performance.
  • Latency. The battery-backed write cache is a necessity. If you want to safely store data quickly, you need a place to stash it that is reliable [1]. This is one of the main reasons (or even the only reason) for using hardware RAID controllers; some rough numbers follow this list.
  • Disks fail. Often. If anything, we should have learned that from Google. Automatic RAID rebuild is a proven and effective way to manage this without sinking a huge amount of time and/or resources into managing disk failures. RAID turns a disk failure into a non-event instead of a crisis.
  • Hot swap ability. If you forgo hardware RAID, but make use of multiple disks in the machine, there’s a very good chance you will not be able to hot swap a failed disk. Most hot-swappable disk controllers are RAID controllers. So, if you want to hot-swap your disks, you likely end up paying the cost for the controller anyway.
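
To put rough numbers on the latency point above: a 10,000 RPM disk needs about 3 ms of rotational latency plus a few more milliseconds of seek time, so a synchronous write that must reach the platter costs somewhere around 8 ms, while the same write acknowledged from a battery-backed cache returns in a small fraction of a millisecond. Those are ballpark figures rather than measurements, but the order-of-magnitude gap is the whole point.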

I don’t think it’s fair for anyone to say “Google doesn’t use RAID”. For a few reasons:

  1. I would be willing to bet there are a number of hardware RAIDs spread across Google (feel free to correct me if I’m wrong, Googlers, but I very much doubt I am). Google has many applications. Many applications with different needs.
  2. As pointed out by a commenter on Kevin’s entry, Google is, in many ways, its own RAID. So even in applications where they don’t use real RAID, they are sort of a special case.

In the latter half of his entry, Kevin mentions some crazy examples using single disks running multiple MySQL daemons, etc., to avoid RAID. He seems fixated on “performance” and talks about MBps, which is, in most databases, just about the least important aspect of “performance”. What his solution does not address, and in fact where it makes matters worse, is latency. Running four MySQL servers against four disks individually is going to make absolutely terrible use of those disks in the normal case.

One of the biggest concerns for our customers, and for many other companies, is power consumption. I like to think of hardware in terms of “critical” and “overhead” components. Most database servers are bottlenecked on disk IO, specifically on latency (seeks). This means that their CPUs, power supplies, etc., are all “overhead” — components necessary to support the “critical” component: disk spindles. The less overhead you have in your overall system, the better, obviously. This means you want to make the best possible use (in terms of seek capacity) of your disks, and minimize downtime, in order to get the most out of that fixed overhead.

RAID 10 helps in this case by making the best use of the available spindles, spreading IO across the disks so that as long as there is work to be done, in theory, no disk is underutilized. This is exactly something you cannot accomplish using single disks and crazy multiple-daemon setups. In addition, in your crazy setup, you will waste untold amounts of memory and CPU by handling the same logical connection multiple times. Again, more overhead.
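
To make that concrete with rough numbers: four spindles capable of roughly 150–200 random IOs per second each give a single RAID 10 volume somewhere in the range of 600–800 reads per second (and about half that for writes, since every write lands on both halves of a mirror), available to whichever connections need it at any given moment. Split those same four disks across four independent daemons and each one is capped at 150–200, with idle spindles unable to help their busier neighbors. The exact numbers are debatable; the shape of the argument is not.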

What do I think is the future, if RAID is not dying? Better RAID, faster disks (20k anyone? 30k? Bring it on!), bigger battery-backed write caches, and non-spinning storage, such as flash.

[1] There’s a lot to be said for treating the network as “reliable”, for instance with Google’s semi-synchronous replication, but that is not available at this time, and isn’t really a viable option for most applications. Nonetheless, I would still assert that RAID is cheap compared to the cost (in terms of time, wasted effort, blips, etc.) of rebuilding an entire machine/daemon due to a single failed disk.

Help convince Dell to leverage LSI to Open Source MegaCli

I’ve just submitted “Leverage LSI to Open Source MegaCli” to the Dell IdeaStorm website:

Dell makes some awesome and affordable hardware. Many new Dell machines have the PERC 5/i SAS RAID controller, which is a rebranded LSI MegaRAID SAS.

LSI makes some nice RAID cards. Dell likes LSI. Dell made a deal with LSI to provide the chips for their fancy new PERC 5/i cards.

We buy machines with these cards in them. We need to monitor our RAIDs, rebuild them, and do all manner of other maintenance tasks. We do not expect LSI to provide perfect tools. LSI is a hardware vendor, and it’s understandable that they provide terrible *software*. What is NOT understandable, though, is why LSI’s terrible tools are closed source.

What is even more incomprehensible is why Dell is willing to accept this situation on behalf of their enterprise customers. Has anyone from Dell even tried to use the tools LSI provides, and Dell recommends, to manage a RAID array on Linux?

MegaCli is the worst command-line utility I have ever seen, bar none. But we don’t expect LSI to make it better; we expect LSI to OPEN SOURCE it. That way, we software professionals can spend our own time making it better. We need better tools. We are willing to work for free. Give us the source, or give us good documentation, but give us something.

We’re willing to provide infinite amounts of value to both Dell and LSI. Dell has enough clout with LSI to make this happen. Please make it happen.

Signed,

Jeremy Cole
Open Source Database Guy

Please go there and “promote” this if you care about Dell and RAID!

I love you, Akismet

Blogging used to be fun.

Then it started to suck. Spam sucked. Life sucked.

Now life is good. Spam is no more. Matt told me to use Akismet; I was skeptical. I am no longer skeptical. I love you, Akismet.

Akismet has caught 501,725 spam for you since you first installed it.

Yup. Since January 15.

On iostat, disk latency; iohist onward!

Just a little heads-up and a bit of MySQL-related technical content for all of you still out there following along…

At Proven Scaling, we take on MySQL performance problems pretty regularly, so I’m often in need of good tools to characterize current performance and find any issues. In the database world, you’re really looking for a few things of interest related to I/O: throughput in bytes, throughput in requests, and latency. The typical tool to get this information on Linux is iostat. You would normally run it like iostat -dx 1 sda and its output would be something like this, repeating every second:


Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 8.00 0.00 4.00 0.00 96.00 0.00 48.00 24.00 0.06 15.75 15.75 6.30

Most of the output of iostat is interesting and reasonable for its intended purpose, which is as a general purpose way to monitor I/O. The really interesting things for most database servers (especially those in trouble) are:

  • avgrq-sz — Average request size, in sectors of 512 bytes (a worked example follows this list).
  • avgqu-sz — Average I/O queue length, in requests.
  • await — Average total time to complete a request, including time spent waiting in the queue and scheduler plus the device service time, in milliseconds. Since it includes svctm, it should never be lower than svctm. This is the most interesting number for any write-heavy transactional database server, as it translates directly to transaction commit time.
  • svctm — Average device service time for a request, excluding time spent waiting in the queue, in milliseconds.
  • %util — Approximate percent utilization for the device.
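
As a quick sanity check of those definitions against the sample output above: avgrq-sz works out to (rsec/s + wsec/s) / (r/s + w/s) = 96 / 4 = 24 sectors, which is the same 12 KB per request you get from wkB/s / w/s, and 4 writes per second at 15.75 ms apiece keeps the device busy for roughly 63 ms out of every second, matching the reported %util of 6.30.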

There is one major problem with using iostat to monitor MySQL/InnoDB servers: svctm and await combine reads and writes. With a reasonably configured InnoDB, on a server with RAID and a battery-backed write cache (BBWC), reads and writes will have very different behaviour. In general, as long as the cache isn’t full, writes should complete (to the BBWC) in just about zero milliseconds, while reads should take approximately the theoretical average time possible on the underlying disk subsystem.

I’ve often found myself scratching my head at a nonsensical svctm caused by reads and writes being combined together. One day I was perplexed enough to do something about it: I opened up the code for iostat to see how it worked. It turns out that the core of what it does is quite simple (so much so that I wonder why it’s C instead of Perl) — it opens /proc/diskstats and /proc/stat and does some magic to the contents.
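
The quick script itself isn’t reproduced here, but the core idea is easy to sketch. Here is a rough, hypothetical illustration (in Python, and definitely not the actual iohist code): read the cumulative per-device counters from /proc/diskstats twice, take the deltas over the interval, and report average read and write latency separately.

#!/usr/bin/env python
# Rough illustration only -- not the actual iohist code. Prints per-interval
# average read and write latency for one block device, derived from the
# cumulative counters in /proc/diskstats. Note: the "ms spent reading/writing"
# counters include time spent in the queue, so this is closer to a per-request
# await than a pure svctm.
import sys
import time

def read_counters(device):
    # Per-device fields: major minor name reads reads_merged sectors_read
    # ms_reading writes writes_merged sectors_written ms_writing ...
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 10 and fields[2] == device:
                return (int(fields[3]), int(fields[6]),
                        int(fields[7]), int(fields[10]))
    raise SystemExit("device %s not found in /proc/diskstats" % device)

def main(device, interval=1.0):
    prev = read_counters(device)
    while True:
        time.sleep(interval)
        cur = read_counters(device)
        d_reads, d_rms, d_writes, d_wms = [c - p for c, p in zip(cur, prev)]
        prev = cur
        r_lat = float(d_rms) / d_reads if d_reads else 0.0
        w_lat = float(d_wms) / d_writes if d_writes else 0.0
        print("r: %4d ios %7.2f ms/io    w: %4d ios %7.2f ms/io"
              % (d_reads, r_lat, d_writes, w_lat))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "sda")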

What I really wanted was a histogram of the reads and writes (separately, please!) for the given device. I hacked up a quick script to do that, and noticed how incredibly useful it is. I recently had to extend it to address other customer needs, so I worked on it a bit more and now it looks pretty good. Here’s an example from a test machine (so not that realistic for a MySQL server):

util:   1.27% r_ios:     0  w_ios:     1  aveq:     0,
ms : r_svctm                     : w_svctm
 0 :                             :
 1 :                             :
 2 :                             :
 3 : x                           :
 4 : x                           :
 5 : xxx                         :
 6 : xxxx                        :
 7 :                             :
 8 : x                           : x
 9 : x                           : xx
10 : x                           : xxxxx
11 :                             : xxxxxxxxxxxxxxx
12 :                             : xxxxxxxxxxxxxxxxxxxxxxxxx
13 : xx                          : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
14 :                             : xxxxxxxxxxxxxxxxxxxxx
15 : xx                          : xxxxxxxxxxx
16 : x                           : xxxxx
17 : x                           : xxxxxx
18 :                             : xxxx
19 : x                           : xx
20 :                             : x
21 :                             : x
22 :                             : x
23 :                             : x
24 :                             : x
25 :                             :
26 :                             :
27 :                             :
28 : x                           :
29 :                             :
30 :                             :
++ : 0                           : 250

It uses Curses now to avoid redrawing the entire screen, and I’ve got a ton of ideas on how to improve it. I have a few more must-haves to finish before I release it formally to the world, but I wonder what other features people would want from it. It is Linux-only for the foreseeable future.

What do you think?

Now Available: Profiling in MySQL

Back around November 2005, I started working on query profiling in MySQL via the SHOW PROFILE and SHOW PROFILES commands. It’s been an interesting ride, but profiling support is finally available in public releases of MySQL starting with MySQL Community 5.0.37!
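
The basic flow, if you want to try it, is: SET profiling = 1, run your queries, then SHOW PROFILES to list them and SHOW PROFILE FOR QUERY n for the per-status breakdown. As a hypothetical sketch of my own (not something that ships with the release), the same flow driven from Python with the MySQLdb module looks roughly like this, with the connection parameters obviously being placeholders:

# Hypothetical sketch, not from the 5.0.37 release itself: driving the new
# profiling commands from Python with the MySQLdb module. Connection
# parameters are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="", db="test")
cur = conn.cursor()

cur.execute("SET profiling = 1")   # profiling is enabled per-session
cur.execute("SELECT COUNT(*) FROM information_schema.tables")
cur.fetchall()

cur.execute("SHOW PROFILES")       # one row per profiled statement
for query_id, duration, query in cur.fetchall():
    print("#%d  %.6fs  %s" % (query_id, duration, query))

cur.execute("SHOW PROFILE FOR QUERY 1")   # per-status breakdown, absolute times
for status, duration in cur.fetchall():
    print("%-32s %.8f" % (status, duration))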

I had a few thoughts on the process and the feature that I’d like to share:

It’s been rough — Although everyone at MySQL who had seen the patch had wildly positive feedback about it, it took almost a year and a half to get things committed. Chad Miller took up my cause (back in December?) with the profiling patch as well as many others, and things actually started making progress. Thanks Chad!

Things were changed — In order to accept the feature, MySQL wanted a few things changed, which Chad handled. An interface using INFORMATION_SCHEMA was added, which I don’t entirely agree with, and the times and statistics returned were changed to absolute instead of cumulative. More on this below.

Absolute times are misleading — With SHOW PROFILE you will see rows like this:

| query end            | 0.00028300 | 

Does that mean it took 0.283ms to end the query? Not necessarily. The only way SHOW PROFILE knows when to cut off the timer is when the status next changes. Since the status messages were only meant to be informational, and in fact many of them were never meant to be seen in the first place, the status is not always changed in logical places in order to collect accurate timestamps this way.

My original patch only used cumulative numbers—they don’t imply any given amount of time spent in a particular place, just the total time or statistics collected at the moment the status was changed. I may submit a patch to once again reveal this information with e.g. SHOW CUMULATIVE PROFILE, as it seems very unlikely that the powers that be will allow it to be changed now.
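
To illustrate with made-up numbers: if the cumulative clock reads 0.000100, 0.000400, and 0.000683 seconds at three successive status changes, the current interface reports the deltas 0.000100, 0.000300, and 0.000283 as if each one measured the step that just ended. The cumulative view makes no such claim; it only tells you how much total time had elapsed (and how many resources had been consumed) when each status was set, and leaves the interpretation to you.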

Status messages need some updating — The last phase of the profiling patch that has yet to be done is to go through all of the status messages, cleaning them up where appropriate, and adding new messages to display more useful profiles. Perhaps I will have time to work on this soon.

Let me know how you like profiling and if you manage to make use of it!