Re: XFS regression?
On Mon, Oct 15, 2007 at 03:28:34PM +0530, Bhagi rathi wrote:
> Thanks Dave for the response. Thinking further, why is it that
> xfs_iunpin has to mark the inode dirty?

Because the inode has been modified, and instead of sprinkling
mark_inode_dirty_sync() all over the code, we can do it in a single
spot that catches all inode modifications. We don't have to think
about it by doing this - inodes in transactions get marked dirty for
free.

> All transactions generally modify one time or other; xfs_ichgtime
> takes care of marking the inode as dirty.

Sure, but there's plenty of other transactions that don't have such a
convenient hook.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
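To make the "single spot" point concrete, here's a minimal stand-alone
sketch (hypothetical stand-in structures, not the real kernel code):
every transactional modification pins the inode and later unpins it, so
marking the VFS inode dirty at unpin time covers every call site at once.

/*
 * Illustrative sketch only - hypothetical stand-ins, not the real
 * kernel code.  Unpin is a single chokepoint that every transactional
 * inode modification passes through, so marking the VFS inode dirty
 * there covers all call sites at once.
 */
#include <stdbool.h>
#include <stdio.h>

struct linux_inode { bool dirty; };
struct xfs_inode   { struct linux_inode *vfs_inode; int pin_count; };

/* stand-in for the real mark_inode_dirty_sync() */
static void mark_inode_dirty_sync(struct linux_inode *inode)
{
	inode->dirty = true;
}

/* runs when the last transaction holding the inode pinned completes */
static void iunpin_sketch(struct xfs_inode *ip)
{
	if (--ip->pin_count == 0 && ip->vfs_inode)
		mark_inode_dirty_sync(ip->vfs_inode);
}

int main(void)
{
	struct linux_inode li = { .dirty = false };
	struct xfs_inode xi = { .vfs_inode = &li, .pin_count = 1 };

	iunpin_sketch(&xi);			/* log I/O completion unpins... */
	printf("vfs inode dirty: %d\n", li.dirty); /* ...and it is now dirty */
	return 0;
}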
Re: XFS regression?
On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote:
> David,
>
> Can you let me know the use after free problem? I want to understand
> how the life cycle of linux inode and xfs inode are related to log
> flush.

Log I/O completion:
	- xfs_trans_committed
	- xfs_iunpin(xfs inode)
		- get linux inode from xfs inode
		- mark_inode_dirty_sync(linux inode)

Freeing the linux inode:
	clear_inode(linux_inode)
	- xfs_inactive()
	- xfs_trans_commit() (e.g. freeing data associated with
	  unlinked inode)
		- xfs_ipin()
	- (link between xfs and linux inode broken)
	linux inode freed

So, in log I/O completion, we can be completing a previous transaction
at the same time clear_inode() is running, and hence in xfs_iunpin() we
can race with the freeing of the linux inode as xfs_iunpin does not
hold any locks.

> Any pointer is also of great help.

/me points at the code.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
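Boiled down to a stand-alone sketch (pthreads and hypothetical stand-in
structures, not the real XFS code or its locking), the race looks like
this: one thread plays log I/O completion dereferencing the linux inode
through an unlocked pointer, the other plays clear_inode() breaking the
link and freeing it. The gap between loading the pointer and using it
is the use-after-free window.

/*
 * Illustrative sketch only - hypothetical stand-ins, not the real XFS
 * code or its locking.  Thread A models log I/O completion (xfs_iunpin
 * marking the VFS inode dirty through an unlocked pointer); thread B
 * models clear_inode() breaking the link and freeing the VFS inode.
 * Build with: cc -pthread race-sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct linux_inode { bool dirty; };
struct xfs_inode   { struct linux_inode *vfs_inode; };

static struct xfs_inode xip;

static void mark_inode_dirty_sync(struct linux_inode *inode)
{
	inode->dirty = true;
}

/* Thread A: log I/O completion path - no locks, no reference held */
static void *log_io_completion(void *arg)
{
	(void)arg;
	struct linux_inode *inode = xip.vfs_inode;	/* load the link...  */
	/* <-- clear_inode() can run right here and free 'inode'           */
	if (inode)
		mark_inode_dirty_sync(inode);		/* ...use-after-free */
	return NULL;
}

/* Thread B: VFS teardown (clear_inode -> xfs_inactive -> inode freed) */
static void *vfs_clear_inode(void *arg)
{
	(void)arg;
	struct linux_inode *inode = xip.vfs_inode;
	xip.vfs_inode = NULL;				/* break the link    */
	free(inode);					/* linux inode freed */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	xip.vfs_inode = calloc(1, sizeof(*xip.vfs_inode));

	pthread_create(&a, NULL, log_io_completion, NULL);
	pthread_create(&b, NULL, vfs_clear_inode, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* the window is tiny, so it rarely trips - which is why the real
	 * bug needed big machines and 20-30 hours to reproduce */
	return 0;
}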
Re: XFS regression?
On Fri, Oct 12, 2007 at 12:36:01PM +0100, Andrew Clayton wrote:
> On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
> > You can breathe again. Here's a test patch (warning - may harm
> > kittens - not fully tested or verified) that solves both the
> > use-after-free issue (by avoiding it altogether) as well as the
> > unlink/create latency because the log force is no longer there.
> > (yay! xfsqa test 016 passes again ;) It does have other possible
> > side effects - triggering extra log forces elsewhere on inode
> > writeback and affects sync behaviour - so it's only a proof of
> > concept at this point.
>
> heh
>
> What kernel is that against? I got rejects with 2.6.23.

The xfs-dev tree - i.e. the XFS that will be in 2.6.25 ;)

> However, I tried a 2.6.18 on the file server and ran my test, and it
> didn't show the problem. I then made a 2.6.23 but with the patch from
> my git bisect reverted. Doing the test with that kernel, while writing
> a 1GB file I saw only one 1 second latency (1.2s) and only a few
> ~0.5 second latencies. However, over the longer term I'm still seeing
> latencies > 1 second.

Sure - you've got a busy disk. If the truncate has to flush the log and
wait for space, then it's going to take some time for I/Os to complete.
Full queue + busy disk == unpredictable latency for all operations.

> Just leaving my strace test running (no dd) on the raid filesystem I
> see the latencies come when the raid5 stripe cache fills up. So I
> think I'm perhaps seeing another problem here.

Software raid isn't good for latency, either ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: XFS regression?
On Fri, 12 Oct 2007 12:36:01 +0100, Andrew Clayton wrote:

Ignore the numbers below. Under normal conditions the raid array and
single drive show basically the same numbers. The raid array numbers
were done over NFS.

> raid array
>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.122943
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.021620
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.014963
>
> system disk
>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000190
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.39
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000191

While we're talking numbers...

raid array, strace over NFS

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.018330
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.026398
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 8.927449
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.284661
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.030078
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.021407
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.018660

raid array, strace locally

open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.69
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.44
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 8.995286
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.258810
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 1.225763
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.56
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.92

Those were running concurrently. I think that rules out networking. I'm
getting more convinced it's a raid5 problem...

Andrew
Re: XFS regression?
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

> Latencies are an order of magnitude lower at 60-70ms because the disks
> have less deep queues. This is expected - deep queues and multiple
> outstanding I/Os are the enemy of single I/O latency.
>
> If I remount with barriers enabled, the latency at nr_requests=128
> goes up to a consistent 2.2s. Not surprising, we're flushing the drive
> cache very regularly now and it points to the create or truncate
> transaction having to push log buffers to disk. The latency remains at
> 70-80ms at nr_requests=4.

Thanks for the info. I did try fiddling nr_requests but I made it
bigger. I'll try with it lower.

> > It seems this problem was introduced between 2.6.18 and 2.6.19.
>
> When the new SATA driver infrastructure was introduced. Do you have
> NCQ enabled on more recent kernels and not on 2.6.18? If so, try
> disabling it and see if the problem goes away.

Unfortunately the drives in the file server don't support NCQ. Not sure
if it's supported in the machine I was testing on (it's certainly a few
years old).

> > The other thing I've found is that if I do the dd to an ext3 fs (on
> > the same disk at least) while running the test in the XFS fs then I
> > also see the latencies.
>
> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

OK, though it doesn't seem to affect ext3.

I'm going to run a git bisect to see what it comes up with.

> Cheers, Dave.

Cheers,

Andrew
Re: XFS regression?
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

heh, git bisect begs to differ :)

4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
commit 4c60658e0f4e253cf275f12b7c76bf128515a774
Author: David Chinner [EMAIL PROTECTED]
Date:   Sat Nov 11 18:05:00 2006 +1100

    [XFS] Prevent a deadlock when xfslogd unpins inodes.

    The previous fixes for the use after free in xfs_iunpin left a
    nasty log deadlock when xfslogd unpinned the inode and dropped the
    last reference to the inode. The ->clear_inode() method can issue
    transactions, and if the log was full, the transaction could push
    on the log and get stuck trying to push the inode it was currently
    unpinning.

    To fix this, we provide xfs_iunpin a guarantee that it will always
    have a valid xfs_inode <-> linux inode link or a particular flag
    will be set on the inode. We then use log forces during lookup to
    ensure transactions are completed before we recycle the inode.
    This ensures that xfs_iunpin will never use the linux inode after
    it is being freed, and any lookup on an inode on the reclaim list
    will wait until it is safe to attach a new linux inode to the xfs
    inode.

    SGI-PV: 956832
    SGI-Modid: xfs-linux-melb:xfs-kern:27359a

    Signed-off-by: David Chinner [EMAIL PROTECTED]
    Signed-off-by: Shailendra Tripathi [EMAIL PROTECTED]
    Signed-off-by: Takenori Nagano [EMAIL PROTECTED]
    Signed-off-by: Tim Shimmin [EMAIL PROTECTED]

:040000 040000 1c47ba44bc404456c87c5a493d543a8d30696b88 92319b34585d03c64e53890a80e550b579a0363d M	fs

If you'd like more info, don't hesitate to ask.

> Cheers, Dave.

Cheers,

Andrew
Re: XFS regression?
On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> > So it's almost certainly pointing at an elevator or driver change,
> > not an XFS change.
>
> heh, git bisect begs to differ :)
>
> 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
> commit 4c60658e0f4e253cf275f12b7c76bf128515a774
> Author: David Chinner [EMAIL PROTECTED]
> Date:   Sat Nov 11 18:05:00 2006 +1100
>
>     [XFS] Prevent a deadlock when xfslogd unpins inodes.

Oh, of course - I failed to notice the significance of this loop in
your test:

	while [foo]; do
		touch fred
		rm fred
	done

The inode allocator keeps reusing the same inode. If the transaction
that did the unlink has not hit the disk before we allocate the inode
again, we have to force the log to get the unlink transaction to disk
to get the xfs inode unpinned (i.e. able to be modified in memory
again). It's the log force I/O that's introducing the latency.

If we don't force the log, then we have a possible use-after-free of
the linux inode because of a fundamental mismatch between the XFS inode
life cycle and the linux inode life cycle. The use-after-free only
occurs on large machines under heavy, heavy metadata load to many disks
and filesystems (requires enough traffic to overload an xfslogd) and is
very difficult to reproduce (large machine, lots of disks and 20-30
hours MTTF).

I'll have a look at other ways to solve this problem, but it took 6
months to find a solution to the race in the first place so don't hold
your breath.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
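A rough stand-alone sketch of that behaviour (hypothetical names, not
the real xfs_iget()/log force interfaces): when allocation finds the
just-freed inode still pinned because its unlink transaction hasn't
reached disk, it has to force the log and wait - and that synchronous
I/O is the create latency.

/*
 * Conceptual sketch only - hypothetical names, not the real XFS inode
 * allocation or log force interfaces.  Reusing an inode whose unlink
 * transaction is still only in memory means forcing the log (real,
 * synchronous disk I/O) before the inode can be modified again.
 */
#include <stdbool.h>
#include <stdio.h>

struct inode_rec {
	unsigned long	ino;
	bool		pinned;	/* last transaction not yet on disk */
};

/* stand-in for a synchronous log force - in reality this is disk I/O */
static void force_log_and_wait(struct inode_rec *ip)
{
	printf("forcing log for inode %lu - this wait is the latency\n",
	       ip->ino);
	ip->pinned = false;
}

/* stand-in for the allocator handing back the inode just freed by rm */
static struct inode_rec *alloc_inode(struct inode_rec *recently_freed)
{
	if (recently_freed->pinned)
		force_log_and_wait(recently_freed);
	return recently_freed;
}

int main(void)
{
	struct inode_rec ino = { .ino = 12345, .pinned = false };

	/* touch fred; rm fred: the unlink is committed in memory only */
	ino.pinned = true;

	/* touch fred again: the allocator picks the same inode number,
	 * so the unlink transaction must be pushed to disk first */
	alloc_inode(&ino);
	return 0;
}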
Re: XFS regression?
On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote:
> On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> > > So it's almost certainly pointing at an elevator or driver change,
> > > not an XFS change.
> >
> > heh, git bisect begs to differ :)
> >
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
> > commit 4c60658e0f4e253cf275f12b7c76bf128515a774
> > Author: David Chinner [EMAIL PROTECTED]
> > Date:   Sat Nov 11 18:05:00 2006 +1100
> >
> >     [XFS] Prevent a deadlock when xfslogd unpins inodes.
>
> Oh, of course - I failed to notice the significance of this loop in
> your test:
>
> 	while [foo]; do
> 		touch fred
> 		rm fred
> 	done
>
> The inode allocator keeps reusing the same inode. If the transaction
> that did the unlink has not hit the disk before we allocate the inode
> again, we have to force the log to get the unlink transaction to disk
> to get the xfs inode unpinned (i.e. able to be modified in memory
> again). It's the log force I/O that's introducing the latency.
>
> If we don't force the log, then we have a possible use-after-free of
> the linux inode because of a fundamental mismatch between the XFS
> inode life cycle and the linux inode life cycle. The use-after-free
> only occurs on large machines under heavy, heavy metadata load to many
> disks and filesystems (requires enough traffic to overload an xfslogd)
> and is very difficult to reproduce (large machine, lots of disks and
> 20-30 hours MTTF).
>
> I'll have a look at other ways to solve this problem, but it took 6
> months to find a solution to the race in the first place so don't hold
> your breath.

You can breathe again. Here's a test patch (warning - may harm kittens -
not fully tested or verified) that solves both the use-after-free issue
(by avoiding it altogether) as well as the unlink/create latency because
the log force is no longer there. (yay! xfsqa test 016 passes again ;)

It does have other possible side effects - triggering extra log forces
elsewhere on inode writeback and affects sync behaviour - so it's only a
proof of concept at this point.
The latency output looks like:

idle

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000281
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000184
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000188
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000182
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000225
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000182
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000185

start dd

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000405
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000224
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000199
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000163

writeback starts

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000145
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.817704
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000148
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000143
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000154
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000147
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000158
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000144
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000379
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000151
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000190
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000191
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000150
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.797099
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000139
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 0.000163

writeback ends

So we still see some operations show high latency - that will most
likely be due to the allocation transaction filling a log buffer and
pushing it to disk, then having to wait because I/O is in progress on
all other log buffers.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_iops.c |   16
 fs/xfs/xfs_iget.c           |   18 --
 fs/xfs/xfs_inode.c          |   34 +-
 fs/xfs/xfs_inode.h          |    1 +
 fs/xfs/xfs_inode_item.c     |
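As a toy model of that effect (made-up numbers, not the real XFS log
code): with a small fixed pool of in-core log buffers, a commit that
fills a buffer while every other buffer is still under I/O has to wait
for the disk, which is what the occasional ~0.8s entries above look like.

/*
 * Toy model only - made-up numbers, not the real XFS log code.  XFS
 * keeps a small fixed pool of in-core log buffers; a commit is normally
 * just an in-memory copy, but if the buffer it fills has to be pushed
 * while all the other buffers are still under I/O, the commit waits for
 * the disk.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_LOG_BUFFERS 8	/* assumed pool size for the model */

/* returns the latency of one commit, in milliseconds */
static double commit_latency_ms(int buffers_under_io, bool fills_buffer)
{
	double latency = 0.2;			/* in-memory commit */

	if (fills_buffer && buffers_under_io == NUM_LOG_BUFFERS)
		latency += 800.0;		/* wait for a log buffer I/O */

	return latency;
}

int main(void)
{
	/* common case: a buffer is free, the commit is cheap */
	printf("3 buffers under I/O: %6.1f ms\n", commit_latency_ms(3, true));

	/* heavy writeback: every log buffer is already being written */
	printf("8 buffers under I/O: %6.1f ms\n", commit_latency_ms(8, true));
	return 0;
}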
XFS regression?
Hi,

(Seeing as I haven't been able to subscribe or post to the XFS mailing
list, I'll try here)

I'll try not to flood with information on the first post.

In trying to track down this issue here:
http://www.spinics.net/lists/raid/msg17195.html
I think I'm seeing a regression in XFS.

If I run the following program (this was on an Athlon XP, Seagate IDE,
with a 2.6.23-rc9-git2)

/* fslattest.c */

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>

int main(int argc, char *argv[])
{
	char file[255];

	if (argc < 2) {
		printf("Usage: fslattest file\n");
		exit(1);
	}

	strncpy(file, argv[1], 254);
	printf("Opening %s\n", file);

	while (1) {
		int testfd = open(file,
			O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600);
		close(testfd);
		unlink(file);
		sleep(1);
	}

	exit(0);
}

e.g.

$ strace -T -e open fslattest test

And then after a few seconds run

$ dd if=/dev/zero of=bigfile bs=1M count=500

I see the following.

Before dd kicks in

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.005043
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000212
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.016844

while dd is running

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.000348
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.594441
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.224636
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.074615

dd stopped

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.013224
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.007109
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.007108

Doing the same thing with ext3 shows no such stalls, e.g. before,
during and after the above dd

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.015423
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.92
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.93
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.88
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000103
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.96
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.94
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000114
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.91
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000274
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000107

I have found a machine in the office to do more testing on and that can
run older kernels. It seems this problem was introduced between 2.6.18
and 2.6.19.

The other thing I've found is that if I do the dd to an ext3 fs (on the
same disk at least) while running the test in the XFS fs then I also
see the latencies.

I can replicate this behaviour on at least 4 different machines so far.

Just ask if you'd like more details.

Cheers,

Andrew
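For what it's worth, here's a variant of the same test that times each
open() in-process with clock_gettime() instead of relying on strace -T
(the file name and output format here are just illustrative):

/* fslattest2.c - illustrative variant, not from the original mail.
 * Times each create with clock_gettime() so no strace is needed.
 * Build: cc -o fslattest2 fslattest2.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	struct timespec t0, t1;

	if (argc < 2) {
		fprintf(stderr, "Usage: fslattest2 file\n");
		exit(1);
	}

	while (1) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		int fd = open(argv[1],
			      O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
			      0600);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double dt = (t1.tv_sec - t0.tv_sec) +
			    (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("open() took %.6f s\n", dt);

		if (fd >= 0)
			close(fd);
		unlink(argv[1]);
		sleep(1);
	}
}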
Re: XFS regression?
On Wed, Oct 10, 2007 at 03:27:42PM +0100, Andrew Clayton wrote:
> Hi,
>
> (Seeing as I haven't been able to subscribe or post to the XFS mailing
> list, I'll try here)
>
> I'll try not to flood with information on the first post.
>
> In trying to track down this issue here:
> http://www.spinics.net/lists/raid/msg17195.html
> I think I'm seeing a regression in XFS.

Perhaps not XFS. 2.6.23-rc4, cfq, ia64, 10k rpm scsi disk, WCE, ctq
enabled, nobarrier mount option. lattest output at defaults:

# cat /sys/block/sdb/queue/nr_requests
128

idle state:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000160
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000155
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000155
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000159
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000156
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000158
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000153

run dd:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000456
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000372
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000412
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000365

(writeback starts)

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.401471
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.544249
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.727238
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.569971
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.715226
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.561839
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.714825
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000389
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000480
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.441398
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.488799
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.701921

So we've got latencies of 500-700ms there while writeback is occurring,
i.e. the disk is busy.
# echo 4 > /sys/block/sdb/queue/nr_requests
# cat /sys/block/sdb/queue/nr_requests
4

idle:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000223
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000219
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000219
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000269
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000249
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000220
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000215
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000209
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000214
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000220
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000213
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000239
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000281
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000209
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000326

run dd:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.001026
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.001077
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000882
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000197
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000862
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000198
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000199
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000218
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 0.000216