Analyzing disk performance on Linux


Many tickets come in asking for help with storage or reporting poor performance, but with little preliminary analysis attached. Here are some things you can look at and do to paint a better picture, both internally and for clients.

Measuring performance

If you can't measure the performance, you can't say it's a performance problem, now can you? Gut feelings don't count. The only real command you need to be familiar with is iostat. This one command will give you a wealth of data to look at. It's part of the sysstat package, so if it isn't installed, install it. You don't need root privileges to run it.
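A minimal sketch of getting started (the package command assumes a RHEL-style system with yum; adjust for your package manager):

# Install the sysstat package if iostat is missing (run as root)
yum install sysstat
# Then run it as a regular user - no root privileges needed
iostat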
iostat displays CPU information, two different sets of device IO information and NFS information. You can also give it two numerical parameters: how many seconds to wait between samples and how many samples to take. Both are optional, but without at least one of them iostat's output is meaningless. This is because the first thing iostat outputs is the accumulated values since boot. If you're measuring IO, you want to measure deltas (changes over time), not aggregates.
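For example (the interval and count values here are arbitrary):

# First report = averages since boot; ignore it and watch the later ones
iostat 5          # sample every 5 seconds until interrupted
iostat 5 10       # sample every 5 seconds, take 10 samples, then exit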
iostat reports several values for CPU-related data: %user, %nice, %system, %iowait and %idle. If you're on a RHEL5 system, you'll also have %steal. These represent where the CPU is spending its time (remember, this is an average across all CPUs on the system).
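To watch just that section, the -c flag limits the report to CPU utilization:

iostat -c 2       # CPU utilization only, refreshed every 2 seconds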
For device-related information, iostat can report two different sets of data with the -d (default) and -x options. These options cannot be combined, so you may want to open two terminal windows and run iostat in each, one with -x and one without. It also helps to use the -k option to report data in kilobytes instead of blocks, since blocks are of indeterminate size.
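For example, one window running each report (the 2-second interval is just a placeholder):

# Terminal 1: basic device report, in kilobytes
iostat -d -k 2
# Terminal 2: extended device statistics, in kilobytes
iostat -x -k 2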
If you are on a RHEL5 system, you can view NFS performance data with the -n option. Sorry, RHEL4 - this data is only available in kernels 2.6.17 and newer, so downloading and building a newer version of sysstat won't help.
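On RHEL5 that looks something like:

iostat -n 2       # NFS statistics, refreshed every 2 seconds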

Output

Let's get this out of the way: iowait is IO from the CPU perspective, not the disk subsystem perspective. Additionally, most tools (iostat included) report the average iowait across all CPUs. On multi-CPU systems, this measure loses a lot of its meaning. Does this mean iowait is useless? No. It just means it might not be telling you the full story. Don't rely solely on iowait numbers.
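If you want a per-CPU view of iowait, mpstat (also part of the sysstat package) will break it out, assuming it's installed:

mpstat -P ALL 2   # per-CPU statistics, including %iowait, every 2 seconds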
The man page for iostat lists what each column measures, so that won't be repeated here. The most interesting pieces of data to look at are avgrq-sz, avgqu-sz, await, svctm and %util in the -x output. Here's sample output of just these columns from the command iostat -x sdb 1 during a large disk write operation:
Device:    avgrq-sz avgqu-sz   await  svctm  %util
sdb        705.43     3.85   16.38   2.55  59.41
sdb        894.57    56.02  214.46   3.22  83.94
sdb        772.82     6.44   26.34   2.51  62.40
sdb        722.84     3.20   19.10   4.63  78.14
Only the last column, %util, gives you a good idea of what's going on by itself, as it represents the utilization of the device (controller and all) at that point in time. Most of the other columns are only meaningful when compared to a baseline or a benchmark. You can't answer a question such as "is an await of 214.46 too high?" without comparing it to something else. A benchmark measures the maximum performance the device can deliver, while a baseline measures performance under typical application load. Ideally, you should have both to make the best assessment of performance.
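A rough sketch of capturing a baseline (the device name, interval and log file name are just examples):

# Log extended stats for sdb every 5 seconds for an hour during normal load
iostat -x -k sdb 5 720 > iostat-baseline-$(date +%F).log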

A note about SAN performance

The SAN is a shared resource. Currently, there is no QoS (Quality of Service) support but that is coming. This means clients are impacting each other's performance and there's little that can be done about it. Once QoS is deployed, the Storage Team will be able to give guaranteed levels of performance to clients and performance hogs will only hurt themselves.
