Re: XFS regression?

2007-10-15 Thread David Chinner
On Mon, Oct 15, 2007 at 03:28:34PM +0530, Bhagi rathi wrote:
 Thanks Dave for the response. Thinking further, why is it that xfs_iunpin
 has to mark the inode dirty?

Because the inode has been modified, and instead of sprinkling
mark_inode_dirty_sync() all over the code, we can do it in a single
spot that catches all inode modifications. Doing it this way means we
don't have to think about it - inodes modified in transactions get
marked dirty for free.
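
To make that concrete, here's a minimal userspace sketch of the idea (the
struct and function names are invented for illustration, this is not the
kernel code): the dirtying is done once, in the unpin path, so every
transactional modification picks it up automatically.

/* Userspace model only - invented names, not the XFS kernel code. */
#include <stdio.h>

struct model_linux_inode { int dirty; };

struct model_xfs_inode {
	int pin_count;
	struct model_linux_inode *linux_inode;
};

/* stands in for mark_inode_dirty_sync() */
static void model_mark_inode_dirty_sync(struct model_linux_inode *inode)
{
	inode->dirty = 1;
}

/*
 * Every transaction that modifies the inode pins it and later unpins it,
 * so marking the linux inode dirty here catches all transactional
 * modifications in one spot instead of one call per transaction type.
 */
static void model_iunpin(struct model_xfs_inode *ip)
{
	if (--ip->pin_count == 0 && ip->linux_inode)
		model_mark_inode_dirty_sync(ip->linux_inode);
}

int main(void)
{
	struct model_linux_inode li = { 0 };
	struct model_xfs_inode xi = { 1, &li };

	model_iunpin(&xi);		/* last unpin of a modified inode */
	printf("linux inode dirty: %d\n", li.dirty);
	return 0;
}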

 All transactions generally modify the inode timestamps at one time or
 another, and xfs_ichgtime takes care of marking the inode dirty.

Sure, but there are plenty of other transactions that don't have such
a convenient hook.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS regression?

2007-10-14 Thread David Chinner
On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote:
 David, can you explain the use-after-free problem? I want to understand
 how the life cycles of the linux inode and the xfs inode are related to
 the log flush.

Log I/O completion:

  - xfs_trans_committed()
    - xfs_iunpin(xfs inode)
      - get linux inode from xfs inode
      - mark_inode_dirty_sync(linux inode)

Freeing the linux inode:

  clear_inode(linux inode)
    - xfs_inactive()
      - xfs_trans_commit() (e.g. freeing data associated with unlinked inode)
        - xfs_ipin()
    (link between xfs and linux inode broken)
  linux inode freed
 
So, in log I/O completion, we can be completing a previous
transaction at the same time as clear_inode() is running, and
hence in xfs_iunpin() we can race with the freeing of the
linux inode, as xfs_iunpin() does not hold any locks.
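
A rough userspace sketch of that hazard (invented names, not the kernel
code): the unpin path reaches through the xfs inode to the linux inode
without any lock, so nothing stops the free path from tearing the linux
inode down in between.

/*
 * Userspace model only - invented names. The two paths are run
 * sequentially here; in the kernel they can interleave on different CPUs.
 */
#include <stdlib.h>

struct model_linux_inode { int dirty; };
struct model_xfs_inode { struct model_linux_inode *linux_inode; };

/* log I/O completion path (xfslogd) */
static void model_unpin_path(struct model_xfs_inode *ip)
{
	/* no lock held here ... */
	struct model_linux_inode *li = ip->linux_inode;

	/* ... so the free path can free *li before this store lands */
	if (li)
		li->dirty = 1;		/* the potential use after free */
}

/* clear_inode()/xfs_inactive() path */
static void model_free_path(struct model_xfs_inode *ip)
{
	struct model_linux_inode *li = ip->linux_inode;

	ip->linux_inode = NULL;		/* break the xfs/linux inode link */
	free(li);			/* linux inode freed */
}

int main(void)
{
	struct model_xfs_inode xi = { calloc(1, sizeof(struct model_linux_inode)) };

	model_unpin_path(&xi);
	model_free_path(&xi);
	return 0;
}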

 Any pointers would also be of great help.

/me points at the code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS regression?

2007-10-14 Thread David Chinner
On Fri, Oct 12, 2007 at 12:36:01PM +0100, Andrew Clayton wrote:
 On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
  
  You can breathe again. Here's a test patch (warning - may harm
 
 heh
 
  kittens - not fully tested or verified) that solves both
  the use-after-free issue (by avoiding it altogether) as well as the
  unlink/create latency because the log force is no longer there.
  
  (yay! xfsqa test 016 passes again ;)
  
  It does have other possible side effects (triggering extra
  log forces elsewhere on inode writeback) and affects sync behaviour,
  so it's only a proof of concept at this point.
 
 What kernel is that against? I got rejects with 2.6.23.

The xfs-dev tree - i.e. the XFS that will be in 2.6.25 ;)

 However, I tried 2.6.18 on the file server and ran my test; it didn't
 show the problem. I then built 2.6.23 with the patch from my git
 bisect reverted.
 
 Doing the test with that kernel, while writing a 1GB file I saw only
 one > 1 second latency (1.2s) and only a few ~ 0.5 second latencies.
 
 However over the longer term I'm still seeing latencies > 1 second.

Sure - you've got a busy disk. If the truncate has to flush the log
and wait for space, then it's going to take some time for I/Os
to complete. Full queue + busy disk == unpredictable latency for all
operations.

 Just leaving my strace test running (no dd) on the raid filesystem I see
 the
 latencies come when the raid5 stripe cache fills up. So I think I'm
 perhaps seeing another problem here.

Software raid isn't good for latency, either ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS regression?

2007-10-12 Thread Andrew Clayton
On Fri, 12 Oct 2007 12:36:01 +0100, Andrew Clayton wrote:


Ignore the numbers below. Under normal conditions the raid array and
single drive show basically the same numbers. The raid array numbers
were done over NFS.

 raid array
 
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.122943
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.021620
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.014963
 
 system disk
 
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000190
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.39
 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000191

While we're talking numbers...

raid array, strace over NFS

open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.018330
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.026398
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 8.927449
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.284661
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.030078
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.021407
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.018660

raid array, strace locally

open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.69
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.44
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 8.995286
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.258810
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.225763
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.56
open(test2, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.92


Those were running concurrently. I think that rules out networking.
I'm getting more convinced it's a raid5 problem...


Andrew


Re: XFS regression?

2007-10-11 Thread Andrew Clayton
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

 
 Latencies are an order of magnitude lower at 60-70ms because the disks
 have less deep queues. This is expected - deep queues and multiple
 outstanding I/Os are the enemy of single I/O latency.
 
 If I remount with barriers enabled, the latency at nr_requests=128
 goes up to a consistent 2.2s. Not surprising, we're flushing the drive
 cache very regularly now and it points to the create or truncate
 transaction having to push log buffers to disk. The latency remains
 at 70-80ms at nr_requests=4.

Thanks for the info. I did try fiddling with nr_requests, but I made it
bigger. I'll try it lower.
 
  It seems this problem was introduced between 2.6.18 and 2.6.19. 
 
 When the new SATA driver infrastructure was introduced. Do you have
 NCQ enabled on more recent kernels and not on 2.6.18? If so, try
 disabling it and see if the problem goes away.

Unfortunately the drives in the file server don't support NCQ. Not sure
if it's supported in the machine I was testing on (it's certainly a few
years old).
 
  The other thing I've found is that if I do the dd to an ext3 fs (on
  the same disk at least) while running the test in the XFS fs then I
  also see the latencies.
 
 So it's almost certainly pointing at an elevator or driver change, not
 an XFS change.

OK, though it doesn't seem to affect ext3. I'm going to run a git
bisect to see what it comes up with.

 Cheers,
 
 dave.

Cheers,

Andrew


Re: XFS regression?

2007-10-11 Thread Andrew Clayton
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

 So it's almost certainly pointing at an elevator or driver change, not
 an XFS change.

heh, git bisect begs to differ :)

4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
commit 4c60658e0f4e253cf275f12b7c76bf128515a774
Author: David Chinner [EMAIL PROTECTED]
Date:   Sat Nov 11 18:05:00 2006 +1100

[XFS] Prevent a deadlock when xfslogd unpins inodes.

The previous fixes for the use after free in xfs_iunpin left a nasty log
deadlock when xfslogd unpinned the inode and dropped the last reference to
the inode. The ->clear_inode() method can issue transactions, and if the
log was full, the transaction could push on the log and get stuck trying
to push the inode it was currently unpinning.

To fix this, we provide xfs_iunpin a guarantee that it will always have a
valid xfs_inode <-> linux inode link or a particular flag will be set on
the inode. We then use log forces during lookup to ensure transactions are
completed before we recycle the inode. This ensures that xfs_iunpin will
never use the linux inode after it is being freed, and any lookup on an
inode on the reclaim list will wait until it is safe to attach a new linux
inode to the xfs inode.

SGI-PV: 956832
SGI-Modid: xfs-linux-melb:xfs-kern:27359a

Signed-off-by: David Chinner [EMAIL PROTECTED]
Signed-off-by: Shailendra Tripathi [EMAIL PROTECTED]
Signed-off-by: Takenori Nagano [EMAIL PROTECTED]
Signed-off-by: Tim Shimmin [EMAIL PROTECTED]

:04 04 1c47ba44bc404456c87c5a493d543a8d30696b88 
92319b34585d03c64e53890a80e550b579a0363d M  fs


If you'd like more info, don't hesitate to ask.

 Cheers,
 
 dave.

Cheers,

Andrew


Re: XFS regression?

2007-10-11 Thread David Chinner
On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
 On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
 
  So it's almost certainly pointing at an elevator or driver change, not an
  XFS change.
 
 heh, git bisect begs to differ :)
 
 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner [EMAIL 
 PROTECTED]
 Date:   Sat Nov 11 18:05:00 2006 +1100
 
 [XFS] Prevent a deadlock when xfslogd unpins inodes.

Oh, of course - I failed to notice the significance of
this loop in your test:

while [foo]; do
touch fred
rm fred
done

The inode allocator keeps reusing the same inode.  If the
transaction that did the unlink has not hit the disk before we
allocate the inode again, we have to force the log to get the unlink
transaction to disk to get the xfs inode unpinned (i.e. able to be
modified in memory again).

It's the log force I/O that's introducing the latency.
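
In outline, the reuse path behaves like this userspace sketch (invented
names; the real logic lives in the XFS inode lookup code): if the inode
being handed out again is still pinned by the not-yet-flushed unlink
transaction, a synchronous log force is issued, and that I/O wait is the
latency showing up in the touch/rm loop.

/* Userspace model only - invented names and trivial bookkeeping. */
#include <stdio.h>

struct model_inode {
	int pinned;	/* still held in memory by an unflushed transaction */
};

/* stands in for a synchronous log force: the real code waits for log I/O */
static void model_force_log_and_wait(struct model_inode *ip)
{
	printf("log force: waiting for the unlink transaction to hit disk\n");
	ip->pinned = 0;
}

/*
 * Reallocating an inode that the previous unlink transaction still pins:
 * it can't be modified in memory again until it is unpinned, and that
 * means pushing the log - synchronous I/O in the middle of a create.
 */
static void model_reuse_inode(struct model_inode *ip)
{
	if (ip->pinned)
		model_force_log_and_wait(ip);
	/* ... inode is now unpinned and safe to reinitialise ... */
}

int main(void)
{
	struct model_inode ino = { .pinned = 1 };	/* unlink not yet on disk */

	model_reuse_inode(&ino);
	return 0;
}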

If we don't force the log, then we have a possible use-after-free
of the linux inode because of a fundamental mismatch between
the XFS inode life cycle and the linux inode life cycle. The
use-after-free only occurs on large machines under heavy, heavy
metadata load to many disks and filesystems (requires enough
traffic to overload an xfslogd) and is very difficult to
reproduce (large machine, lots of disks and 20-30 hours MTTF).

I'll have a look at other ways to solve this problem, but it
took 6 months to find a solution to the race in the first place
so don't hold your breath.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS regression?

2007-10-11 Thread David Chinner
On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote:
 On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
  On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
  
   So it's almost certainly pointing at an elevator or driver change, not an
   XFS change.
  
  heh, git bisect begs to differ :)
  
  4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
  4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner [EMAIL 
  PROTECTED]
  Date:   Sat Nov 11 18:05:00 2006 +1100
  
  [XFS] Prevent a deadlock when xfslogd unpins inodes.
 
 Oh, of course - I failed to notice the significance of
 this loop in your test:
 
   while [foo]; do
   touch fred
   rm fred
   done
 
 The inode allocator keeps reusing the same inode.  If the
 transaction that did the unlink has not hit the disk before we
 allocate the inode again, we have to force the log to get the unlink
 transaction to disk to get the xfs inode unpinned (i.e. able to be
 modified in memory again).
 
 It's the log force I/O that's introducing the latency.
 
 If we don't force the log, then we have a possible use-after-free
 of the linux inode because of a fundamental mismatch between
 the XFS inode life cycle and the linux inode life cycle. The
 use-after-free only occurs on large machines under heavy, heavy
 metadata load to many disks and filesystems (requires enough
 traffic to overload an xfslogd) and is very difficult to
 reproduce (large machine, lots of disks and 20-30 hours MTTF).
 
 I'll have a look at other ways to solve this problem, but it
 took 6 months to find a solution to the race in the first place
 so don't hold your breath.

You can breathe again. Here's a test patch (warning - may harm
kittens - not fully tested or verified) that solves both
the use-after-free issue (by avoiding it altogether) as well as the
unlink/create latency because the log force is no longer there.

(yay! xfsqa test 016 passes again ;)

It does have other possible side effects (triggering extra
log forces elsewhere on inode writeback) and affects sync behaviour,
so it's only a proof of concept at this point.

The latency output looks like:

idle

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000281
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000184
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000188
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000182
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000225
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000182
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000185

start dd

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000405
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000224
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000199
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000163

writeback starts

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000145
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.817704
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000148
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000143
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000154
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000147
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000158
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000144
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000379
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000151
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000190
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000191
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000150
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.797099
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000139
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000163

writeback ends

So we still see some operations showing high latency - that will most likely
be due to the allocation transaction filling a log buffer and pushing it
to disk, then having to wait because I/O is in progress on all the other log
buffers.
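
For illustration, a minimal userspace model of that last point (invented
names; XFS keeps a small fixed set of in-core log buffers, selectable with
the logbufs mount option): once every buffer is out on I/O, the next
transaction commit has to stall until one completes.

/* Userspace model only - invented names, trivial bookkeeping. */
#include <stdio.h>

#define NUM_LOG_BUFFERS 8	/* a small fixed set, as selected by logbufs= */

struct model_log_buffer { int io_in_progress; };

static struct model_log_buffer log_buffers[NUM_LOG_BUFFERS];

/* stands in for waiting on log buffer I/O completion - the latency */
static void model_wait_for_buffer_io(struct model_log_buffer *bp)
{
	printf("all log buffers under I/O: transaction commit stalls here\n");
	bp->io_in_progress = 0;
}

static struct model_log_buffer *model_grab_log_buffer(void)
{
	int i;

	for (i = 0; i < NUM_LOG_BUFFERS; i++)
		if (!log_buffers[i].io_in_progress)
			return &log_buffers[i];

	/* every buffer is still being written: wait for one to come back */
	model_wait_for_buffer_io(&log_buffers[0]);
	return &log_buffers[0];
}

int main(void)
{
	int i;

	for (i = 0; i < NUM_LOG_BUFFERS; i++)
		log_buffers[i].io_in_progress = 1;	/* simulate a busy log */

	model_grab_log_buffer();
	return 0;
}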

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_iops.c |   16 
 fs/xfs/xfs_iget.c   |   18 --
 fs/xfs/xfs_inode.c  |   34 +-
 fs/xfs/xfs_inode.h  |1 +
 fs/xfs/xfs_inode_item.c |  

XFS regression?

2007-10-10 Thread Andrew Clayton
Hi,

(Seeing as I haven't been able to subscribe or post to the XFS mailing
list, I'll try here)

I'll try not to flood with information on the first post.

In trying to track down this issue here:
http://www.spinics.net/lists/raid/msg17195.html

I think I'm seeing a regression in XFS


If I run the following program (this was on an Athlon XP with a Seagate
IDE disk, running 2.6.23-rc9-git2):

/* fslattest.c */

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>


int main(int argc, char *argv[])
{
	char file[255];

	if (argc < 2) {
		printf("Usage: fslattest file\n");
		exit(1);
	}

	strncpy(file, argv[1], 254);
	printf("Opening %s\n", file);

	while (1) {
		int testfd = open(file,
				  O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
				  0600);
		close(testfd);
		unlink(file);
		sleep(1);
	}

	exit(0);
}


e.g. $ strace -T -e open fslattest test

And then after a few seconds run

$ dd if=/dev/zero of=bigfile bs=1M count=500

I see the following

Before dd kicks in

open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.005043
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000212
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.016844

while dd is running

open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.000348
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.594441
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.224636
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.074615

dd stopped

open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.013224
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.007109
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.007108


Doing the same thing with ext3 shows no such stalls, e.g. before, during
and after the above dd:

open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.015423
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.92
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.93
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.88
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000103
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.96
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.94
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000114
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.91
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000274
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000107


I have found a machine in the office to do more testing on that can
run older kernels.

It seems this problem was introduced between 2.6.18 and 2.6.19. 

The other thing I've found is that if I do the dd to an ext3 fs (on
the same disk at least) while running the test in the XFS fs then I
also see the latencies.

I can replicate this behaviour on at least 4 different machines so far.

Just ask if you'd like more details.

Cheers,

Andrew


Re: XFS regression?

2007-10-10 Thread David Chinner
On Wed, Oct 10, 2007 at 03:27:42PM +0100, Andrew Clayton wrote:
 Hi,
 
 (Seeing as I haven't been able to subscribe or post to the XFS mailing
 list, I'll try here)
 
 I'll try not to flood with information on the first post.
 
 In trying to track down this issue here:
 http://www.spinics.net/lists/raid/msg17195.html
 
 I think I'm seeing a regression in XFS

Perhaps not XFS. 2.6.23-rc4, cfq, ia64, 10k rpm scsi
disk, WCE, ctq enabled, nobarrier mount option.

fslattest output at defaults:

# cat /sys/block/sdb/queue/nr_requests
128

idle state:

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000160
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000155
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000155
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000156
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000153

run dd:

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000456
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000372
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000412
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000365

(writeback starts)

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.401471
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.544249
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.727238
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.569971
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.715226
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.561839
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.714825
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000389
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000480
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.441398
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.488799
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.701921

So we've got latencies of 500-700ms there while writeback is occurring,
i.e. the disk is busy.

# echo 4  /sys/block/sdb/queue/nr_requests
# cat /sys/block/sdb/queue/nr_requests
4

idle:

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000223
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000219
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000219
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000269
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000249
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000220
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000215
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000209
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000214
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000220
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000213
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000239
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000281
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000209
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000326

run dd:

open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.001026
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.001077
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000882
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000197
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000862
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000198
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000199
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000218
open(/mnt/scratch/testfile, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000216