Re: raw/block device disc throughput

2012-05-25 Thread Edgar Fuß
 In this case, dd has to block after each disk write to wait for its
 buffer to be (unnecessarily, as it happens, though it can't know that)
 zeroed for the next write.  This both imposes additional delay and
 enforces a lack of overlap between each write and the next.
But can't we safely assume that reading /dev/zero takes essentially zero time?
I've checked: I do indeed get about 10GB/s from /dev/zero.


Re: raw/block device disc throughput

2012-05-25 Thread Edgar Fuß
Thanks for the most insightful explanation!

 Also keep in mind:
Yes, sure. That's why I would have expected the raw device to outperform even 
at lower block sizes.

Re: raw/block device disc throughput

2012-05-25 Thread Eduardo Horvath
On Fri, 25 May 2012, Edgar Fuß wrote:

 Thanks for the most insightful explanation!
 
  Also keep in mind:
 Yes, sure. That's why I would have expected the raw device to outperform even 
 at lower block sizes.

No, for small block sizes the overhead of the copyin() is more than offset 
by the larger buffercache block size.  And the I/O operation is 
asynchronous with respect to the write() system call.

With the character device the I/O operation must complete before the 
write() returns.  So the I/O operations cannot be combined and you suffer 
the overhead of each one.
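
In dd terms, the loop amounts to something like the sketch below (a
simplified stand-in for dd, not its actual source; the device path,
block size and count are placeholders).  On the raw device each write()
only returns once the transfer has completed, so nothing overlaps:

    /*
     * Simplified stand-in for dd's inner loop.  Writing to the raw
     * device, write() returns only after the data has reached the
     * disk, so the next read()/write() pair cannot start until then.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        const size_t bs = 64 * 1024;            /* placeholder "bs=64k" */
        char *buf = malloc(bs);
        int in = open("/dev/zero", O_RDONLY);
        int out = open("/dev/rsd0b", O_WRONLY); /* raw (character) device */

        if (buf == NULL || in == -1 || out == -1) {
            perror("setup");
            return 1;
        }
        for (int i = 0; i < 1024; i++) {        /* placeholder "count=1024" */
            ssize_t n = read(in, buf, bs);      /* refill the buffer */
            if (n <= 0)
                break;
            if (write(out, buf, n) != n) {      /* blocks until the I/O */
                perror("write");                /* has completed */
                return 1;
            }
        }
        return 0;
    }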

Eduardo

Re: raw/block device disc throughput

2012-05-24 Thread Paul Goyette

On Thu, 24 May 2012, Edgar Fuß wrote:


I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
of these discs, and, after the parity initialisation was complete, wrote
to [r]raid0b.
On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on bs.
On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
vnode for the whole run. Why is that so and why is throughput so low?


What is the partition alignment on the raid?

There's a discussion of the alignment impact in my raidframe how-to on
the NetBSD wiki.



-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------

Re: raw/block device disc throughput

2012-05-24 Thread Mouse
 It seems that I have to update my understanding of raw and block
 devices for discs.  [...performance oddities...]

Mostly I have nothing useful to say here.  But...

 2. I would have expected increasing the block size above MAXPHYS not
    to improve the performance.

There is at least one aspect of performance that will not be cut off by
MAXPHYS, that being syscall overhead.  I don't know your system (you
don't say which port you're running on, for example), but if syscall
overhead for your hardware is not ignorably small compared to the costs
of doing the disk transfer, then doing one syscall per 256K will be
four times as costly in syscall overhead as doing one syscall per 1M,
even though it is no more costly in terms of the actual disk transfers.
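
A quick way to see that part in isolation (a sketch; the 1 GB total and
the two block sizes are arbitrary choices) is to push the same amount of
data through /dev/null, where no disk is involved at all, so any
difference in run time is pure per-syscall cost:

    /*
     * Time the same 1 GB of writes to /dev/null with two different
     * block sizes.  No disk is involved, so the difference is the
     * per-syscall overhead.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double
    run(size_t bs)
    {
        size_t total = 1UL << 30;           /* 1 GB */
        char *buf = calloc(1, bs);
        int fd = open("/dev/null", O_WRONLY);
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (size_t done = 0; done < total; done += bs)
            write(fd, buf, bs);
        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }

    int
    main(void)
    {
        printf("256k writes: %.3f s\n", run(256 * 1024));
        printf("1M writes:   %.3f s\n", run(1024 * 1024));
        return 0;
    }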

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: raw/block device disc throughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Edgar Fuß wrote:

 It seems that I have to update my understanding of raw and block devices
 for discs.
 
 Using a (non-recent) 6.0_BETA INSTALL kernel and an ST9146853SS 15k SAS disc
 behind an LSI SAS 1068E (i.e. mpt(4)), I did a
   dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx.

What's od=?

 For the raw device, the throughput dramatically increased with the block size:
   Block size            16k   64k   256k    1M
   Throughput (MByte/s)    4    15     49   112
 For the block device, throughput was around 81MByte/s independent of block 
 size.
 
 This surprised me in two ways:
 1. I would have expected the raw device to outperform the block device
    with not-too-small block sizes.
 2. I would have expected increasing the block size above MAXPHYS not
    to improve the performance.
 
 So obviously, my understanding is wrong.

Not awfully surprising given your setup.  Keep in mind mpt uses a rather 
inefficient communication protocol and does tagged queuing.  The former 
means the overhead for each command is not so good, but the latter means 
it can keep lots of commands in the air at the same time. 

 I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
 of these discs, and, after the parity initialisation was complete, wrote
 to [r]raid0b.
 On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on bs.
 On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
 vnode for the whole run. Why is that so and why is throughput so low?

Now you're just complicating things 8^).

Let's see, RAID 1 is striping.  That means all operations are broken at 
64K boundaries so they can be sent to different disks.  And split 
operations need to wait for all the devices to complete before the master 
operation can be completed.  I expect you would probably get some rather 
unusual non-linear behavior in this sort of setup.  

Eduardo

Re: raw/block device disc throughput

2012-05-24 Thread Thor Lancelot Simon
On Thu, May 24, 2012 at 06:26:45PM +0200, Edgar Fuß wrote:
 It seems that I have to update my understanding of raw and block devices
 for discs.
 
 Using a (non-recent) 6.0_BETA INSTALL kernel and an ST9146853SS 15k SAS disc
 behind an LSI SAS 1068E (i.e. mpt(4)), I did a
   dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx.
 For the raw device, the throughput dramatically increased with the block size:
   Block size            16k   64k   256k    1M
   Throughput (MByte/s)    4    15     49   112
 For the block device, throughput was around 81MByte/s independent of block 
 size.
 
 This surprised me in two ways:
 1. I would have expected the raw device to outperform the block device
    with not-too-small block sizes.

The block device will cause readahead at the OS layer.  Since you are
accessing the disk sequentially, this will have a significant effect --
evidently greater than the overhead caused by memory allocation in the
cache layer under the block device.

I suspect that if you double-buffered at the client application layer
this effect might disappear, since the drive itself will already read ahead,
and if we can present it with enough requests at once, it should return
the results simultaneously.  Plain dd on the raw device will not do that
since it waits for every read to complete before issuing another, thus
increasing latency and reducing the number of transactions the drive can
effectively overlap.

 2. I would have expected increasing the block size above MAXPHYS not
    to improve the performance.

The increase is again tied to less latency in the synchronous dd read-write
loop.  The kernel breaks the large request down into many MAXPHYS-sized ones
and dispatches each in turn.  I can't remember whether it is really
asynchronous or whether it waits for each request to complete before issuing
the next; if the former, it's effectively double-buffering for you.

 
 I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
 of these discs, and, after the parity initialisation was complete, wrote
 to [r]raid0b.
 On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on bs.

What does read performance look like?  I would be particularly interested to
know what it looks like if you use a tool like buffer or ddd that
double-buffers the I/O for you.  It should be roughly twice the single-disk
rate or something is wrong with RAIDframe (or, at least, suboptimal).

 On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
 vnode for the whole run. Why is that so and why is throughput so low?

I would guess locking, or, somehow, extra buffering.  It may be waiting on
the vnode lock for the block device?

-- 
Thor Lancelot Simon  t...@panix.com
  The liberties...lose much of their value whenever those who have greater
   private means are permitted to use their advantages to control the course
   of public debate.   -John Rawls


Re: raw/block device disc throughput

2012-05-24 Thread Edgar Fuß
 I don't know your system (you don't say which port you're running on,
 for example)
Ah, sorry: amd64.


Re: raw/block device disc throughput

2012-05-24 Thread Mouse
  dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx.

(I've been assuming od= should be of=)

 The block device will cause readahead at the OS layer.

I thought of that too, but didn't mention it, because it's not
relevant.  dd isn't reading from the disk; it's writing to it.

 I suspect that if you double-buffered at the client application layer
 this effect might disappear,

I suspect this is a significant effect.  I was once using dd to copy
from one disk to another, and both drives happened to have activity
lights on them.  Watching each drive wait for the other convinced me dd
is an inefficient way to do that.  I built a program that uses two
processes, one reading and one writing, with a large chunk of memory
shared between them for buffer space.  Disk-to-disk copies (not on the
same spindle) got significantly faster. :)
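
From memory, the scheme was roughly the sketch below (a reconstruction,
not the original program; the buffer size, the use of pipes as token
queues and the argument handling are choices made just for the sketch).
The reader fills one buffer while the writer drains the other, so
neither disk has to wait for the other:

    /*
     * Two-process, double-buffered copy: reader and writer share two
     * large buffers in an anonymous shared mapping and pass "buffer
     * full"/"buffer empty" tokens over a pair of pipes.  Error
     * handling is minimal.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFSZ   (1024 * 1024)   /* 1 MB per buffer */
    #define NBUF    2               /* double buffering */

    struct slot {
        ssize_t len;                /* bytes valid in buf; <= 0 means EOF */
        char    buf[BUFSZ];
    };

    int
    main(int argc, char **argv)
    {
        int full[2], empty[2];      /* pipes used purely as token queues */
        struct slot *slots;
        char tok;

        if (argc != 3) {
            fprintf(stderr, "usage: %s infile outfile\n", argv[0]);
            return 1;
        }
        slots = mmap(NULL, NBUF * sizeof(struct slot), PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_ANON, -1, 0);
        if (slots == MAP_FAILED || pipe(full) == -1 || pipe(empty) == -1) {
            perror("setup");
            return 1;
        }
        for (tok = 0; tok < NBUF; tok++)    /* both slots start out empty */
            write(empty[1], &tok, 1);

        if (fork() == 0) {                  /* child: reader */
            int in = open(argv[1], O_RDONLY);
            while (read(empty[0], &tok, 1) == 1) {
                slots[(int)tok].len = read(in, slots[(int)tok].buf, BUFSZ);
                write(full[1], &tok, 1);    /* hand the slot to the writer */
                if (slots[(int)tok].len <= 0)
                    _exit(0);               /* EOF or error: stop reading */
            }
            _exit(0);
        }

        /* parent: writer */
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        while (read(full[0], &tok, 1) == 1) {
            if (slots[(int)tok].len <= 0)
                break;
            write(out, slots[(int)tok].buf, slots[(int)tok].len);
            write(empty[1], &tok, 1);       /* give the slot back */
        }
        wait(NULL);
        return 0;
    }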

In this case, dd has to block after each disk write to wait for its
buffer to be (unnecessarily, as it happens, though it can't know that)
zeroed for the next write.  This both imposes additional delay and
enforces a lack of overlap between each write and the next.

I speculate that the cooked device helps because it means that dd's
write finishes when the bits are in the buffer cache, rather than
waiting for them to hit disk.  Flushing from the buffer cache to the
disk then (a) gets overlapped with zeroing memory for the next cycle
and (b) allows writes to adjacent disk blocks to be collapsed as they
get pushed from the buffer cache to the drive.  Unless the host is
unusually slow, writing to the disk will be the limiting factor here,
meaning the buffer cache will have a large number of writes pending, so
coalescing writes is plausible, even likely.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: raw/block device disc throughput

2012-05-24 Thread Edgar Fuß
 What's od=?
Fat fingers. of=, of course.

 Not awfully surprising given your setup.
Ah, good.

 Keep in mind mpt uses a rather inefficient communication protocol and does
 tagged queuing.
You mean the protocol the main CPU uses to communicate with an MPT adapter is
inefficient? Or do you mean SAS is inefficient?

 The former means the overhead for each command is not so good, but the
 latter means it can keep lots of commands in the air at the same time. 
I'm sorry, but I can't see how this explains my results.

 Now you're just complicating things 8^).
Sometimes, probably.

 Let's see, RAID 1 is striping.
Sorry, RAID 1 is mirroring.
So everything needs to be sent to two discs in parallel. I can't believe
the CPU-HBA communication is the bottleneck?


Re: raw/block device disc throughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Edgar Fuß wrote:

  Keep in mind mpt uses a rather inefficient communication protocol and does
  tagged queuing.
 You mean the protocol the main CPU uses to communicate with an MPT adapter is
 inefficient? Or do you mean SAS is inefficient?

The protocol used to communicate between the CPU and the adapter is 
inefficient.  Not well designed.  They redesigned it for SAS2.

  The former means the overhead for each command is not so good, but the
  latter means it can keep lots of commands in the air at the same time. 
 I'm sorry, I'm unable to conclude why this explains my results.

dd will send the kernel individual write operations.  sd and physio() will 
break them up into MAXPHYS chunks.  Each chunk will be queued at the 
HBA.  The HBA will dispatch them all as fast as it can.  Tagged queuing 
will overlap them.  

With smaller transfers, the setup overhead becomes significant and you see 
poor performance.

With large transfers (larger than MAXPHYS) the writes are split up into 
MAXPHYS chunks and the disk handles them in parallel, hence the 
performance increase even beyond MAXPHYS.

Also keep in mind:

When using the block device the data is copied from the process buffer 
into the buffer cache and the I/O happens from the buffer cache pages.

When using the raw device the I/O happens directly from process memory, no 
copying involved.
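
One way to watch that difference from userland is to time the same
stream of writes through both device nodes, along the lines of the
sketch below (device names, block size and total size are placeholders;
point it at a scratch partition you can safely overwrite):

    /*
     * Time 64 MB of zero-filled writes through the block and the raw
     * device node of the same partition.  fsync() is included so the
     * buffer-cache (block device) case is charged for the flush.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BS      (64 * 1024)
    #define COUNT   1024                    /* 64 MB total */

    static double
    time_writes(const char *dev)
    {
        char *buf = calloc(1, BS);          /* zero-filled, like /dev/zero */
        int fd = open(dev, O_WRONLY);
        struct timeval t0, t1;

        if (fd == -1 || buf == NULL) { perror(dev); exit(1); }
        gettimeofday(&t0, NULL);
        for (int i = 0; i < COUNT; i++)
            if (write(fd, buf, BS) != BS) { perror("write"); exit(1); }
        fsync(fd);                          /* flush the buffered case */
        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }

    int
    main(void)
    {
        printf("block: %.2f s\n", time_writes("/dev/sd0b"));
        printf("raw:   %.2f s\n", time_writes("/dev/rsd0b"));
        return 0;
    }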

Eduardo

Re: raw/block device disc throughput

2012-05-24 Thread Edgar Fuß
 The block device will cause readahead at the OS layer.
But I'm writing, not reading!

 The increase is again tied to less latency in the synchronous dd read-write
 loop.  The kernel breaks the large request down into many MAXPHYS-sized ones
 and dispatches each in turn.  I can't remember whether it is really
 asynchronous or whether it waits for each request to complete before issuing
 the next; if the former, it's effectively double-buffering for you.
But would you expect an eightfold increase in throughput just from that?

 What does read performance look like?  I would be particularly interested to
 know what it looks like if you use a tool like buffer or ddd that
 double-buffers the I/O for you.  It should be roughly twice the single-disk
 rate or something is wrong with RAIDframe (or, at least, suboptimal).
I will test that. I can't right now because I have no physical access atm.

On an identical machine but with the RAID's SectorsPerSU=16, I get 99MByte/s
from the raw device at 16k blocks, 191 at 64k blocks, roughly the same at
1M blocks.
On the block device, I get 19MB/s independent of block size.
All with dd.


Re: raw/block device disc throughput

2012-05-24 Thread Thor Lancelot Simon
On Thu, May 24, 2012 at 05:31:43PM +, Eduardo Horvath wrote:
 
 With large transfers (larger than MAXPHYS) the writes are split up into 
 MAXPHYS chunks and the disk handles them in parallel, hence the 
 performance increase even beyond MAXPHYS.

Is this actually true?  For requests from userspace via the raw device,
does physio actually issue the smaller chunks in parallel?

Thor


Re: raw/block device disc throughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Thor Lancelot Simon wrote:

 On Thu, May 24, 2012 at 05:31:43PM +, Eduardo Horvath wrote:
  
  With large transfers (larger than MAXPHYS) the writes are split up into 
  MAXPHYS chunks and the disk handles them in parallel, hence the 
  performance increase even beyond MAXPHYS.
 
 Is this actually true?  For requests from userspace via the raw device,
 does physio actually issue the smaller chunks in parallel?

Depends... in this case it's true.  physio() breaks the iov into chunks and 
allocates a buf for each chunk and calls the strategy() routine on each 
buf without waiting for completion.  So on a controller that does tagged 
queuing they run in parallel.
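
You can also get that kind of overlap explicitly from userland with
POSIX AIO, along the lines of the sketch below (an illustration of the
effect, not of what physio() does internally; the 64k chunk size, the
1MB total and the device path are assumptions, and it scribbles zeros
on the partition, so use a scratch one):

    /*
     * Split one 1 MB write into 64k chunks and hand them all to the
     * kernel at once with lio_listio(), instead of issuing one
     * synchronous write() per chunk.
     */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK   (64 * 1024)     /* assume MAXPHYS == 64k */
    #define NCHUNK  16              /* 1 MB total */

    int
    main(void)
    {
        static char buf[NCHUNK][CHUNK];         /* zero-filled */
        struct aiocb cb[NCHUNK];
        struct aiocb *list[NCHUNK];
        int fd = open("/dev/rsd0b", O_WRONLY);  /* raw device */

        if (fd == -1) {
            perror("open");
            return 1;
        }
        memset(cb, 0, sizeof(cb));
        for (int i = 0; i < NCHUNK; i++) {
            cb[i].aio_fildes = fd;
            cb[i].aio_buf = buf[i];
            cb[i].aio_nbytes = CHUNK;
            cb[i].aio_offset = (off_t)i * CHUNK;
            cb[i].aio_lio_opcode = LIO_WRITE;
            list[i] = &cb[i];
        }
        /* Queue all the chunks at once and wait for the whole batch. */
        if (lio_listio(LIO_WAIT, list, NCHUNK, NULL) == -1)
            perror("lio_listio");
        close(fd);
        return 0;
    }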

Eduardo