Re: Understanding I/O behaviour - next try
On Wed, 2007-08-29 at 01:15 -0700, Martin Knoblauch wrote:

> > > Another thing I saw during my tests is that when writing to NFS, the
> > > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> > > or a bug?
> >
> > What are the nr_unstable numbers? NFS has the concept of unstable
> > storage, that is a state where it is agreed the page has been
> > transferred to the remote server, but has not yet been written to disk.
>
> Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> numbers for the disk case. Good to know.
>
> For NFS, the nr_writeback numbers seem surprisingly high. They also go
> to 80-90k (pages ?). In the disk case they rarely go over 12k.

see: /proc/sys/fs/nfs/nfs_congestion_kb

That is the limit for when the nfs BDI is marked congested, so

  nfs_writeout + nfs_unstable <= nfs_congestion_kb

The nfs_dirty always being 0 just means that pages very quickly start
their writeout cycle.
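For anyone wanting to watch these counters while a copy is running, the relation above can be checked straight from procfs. A minimal Python sketch, assuming 4 KiB pages and the standard /proc paths; the helper names are made up for illustration:

```python
# Sketch: compare NFS writeback/unstable page counts against the
# nfs_congestion_kb limit. Counter names are those of /proc/vmstat on
# 2.6.2x kernels; a 4 KiB page size is assumed (x86_64 default).
import os

PAGE_KB = 4  # assumption: 4 KiB pages


def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def congestion_headroom_kb(stats, congestion_kb):
    """KiB left before the NFS BDI would be marked congested."""
    used = (stats.get("nr_writeback", 0) + stats.get("nr_unstable", 0)) * PAGE_KB
    return congestion_kb - used


if __name__ == "__main__":
    if os.path.exists("/proc/vmstat"):
        stats = parse_vmstat(open("/proc/vmstat").read())
        try:
            limit = int(open("/proc/sys/fs/nfs/nfs_congestion_kb").read())
        except OSError:
            limit = 0  # NFS client not loaded on this box
        print("nr_dirty:", stats.get("nr_dirty"),
              "headroom_kb:", congestion_headroom_kb(stats, limit))
```

With the 80-90k pages reported above, writeback plus unstable sits right around the congestion limit, which is consistent with the throttling described.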
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack, see
> if it makes a difference (and please verify in dmesg that it prints the
> message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
>  		if (board_id == products[i].board_id) {
>  			c->product_name = products[i].product_name;
>  			c->access = *(products[i].access);
> +#if 0
>  			c->nr_cmds = products[i].nr_cmds;
> +#else
> +			c->nr_cmds = 2;
> +			printk("cciss: limited max commands to 2\n");
> +#endif
>  			break;
>  		}
>  	}
>
> --
> Jens Axboe

Hi Jens,

how exactly is the queue depth related to the max # of commands? I ask,
because with the 2.6.22 kernel the "maximum queue depth since init" seems
to never go higher than 16, even with much higher outstanding commands. On
a 2.6.19 kernel, the maximum queue depth is much higher, just a bit below
"max # of commands since init".
[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
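A quick way to compare those counters across kernels over time is to scrape them periodically. A small illustrative parser; the field labels are copied from the proc output above, and this is not an official interface:

```python
# Sketch: pull the queue-depth counters out of /proc/driver/cciss/cciss0
# text. Works on both a line-per-field and a flattened dump, since it
# searches by label rather than by line position.
import re

FIELDS = {
    "Current Q depth": "cur_q",
    "Max Q depth since init": "max_q",
    "Current # commands on controller": "cur_cmds",
    "Max # commands on controller since init": "max_cmds",
}


def parse_cciss(text):
    """Return the labelled integer counters found in a cciss proc dump."""
    stats = {}
    for label, key in FIELDS.items():
        m = re.search(re.escape(label) + r":\s*(\d+)", text)
        if m:
            stats[key] = int(m.group(1))
    return stats
```

Sampling this once a second during a dd run would show directly whether the 2.6.22 kernel really caps the queue depth at 16 while commands pile up on the controller.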
Re: Understanding I/O behaviour - next try
--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> I saw a bulletin from HP recently that suggested disabling the
> write-back cache on some Smart Array controllers as a workaround because
> it reduced performance in applications that did large bulk writes.
> Presumably they are planning on releasing some updated firmware that
> fixes this eventually..
>
> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/

Robert,

just checked it out. At least with the "6i", you do not want to disable
the WBC :-) Performance really goes down the toilet for all cases. Do you
still have a pointer to that bulletin?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> >
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5 GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> > some other poor guys being in "D" state.
>
> Try booting with "mem=4096M", "mem=2048M", ...

Hmm. I tried 1024M a while ago and IIRC did not see a lot [any]
difference. But as it is no big deal, I will repeat it tomorrow. Just
curious - what are you expecting? Why should it help?

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
>
> The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
>
> The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB, we
> see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> some other poor guys being in "D" state.

Try booting with "mem=4096M", "mem=2048M", ...
Re: Understanding I/O behaviour - next try
Jens Axboe wrote:
> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> >  Keywords: I/O, bdi-v9, cfs
> >
> > Hi,
> >
> > a while ago I asked a few questions on the Linux I/O behaviour,
> > because I was (and still am) fighting some "misbehaviour" related to
> > heavy I/O.
> >
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5 GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> > some other poor guys being in "D" state.
> >
> > The data flows in basically three modes. All of them are affected:
> >
> > local-disk -> NFS
> > NFS -> local-disk
> > NFS -> NFS
> >
> > NFS is V3/TCP.
> >
> > So, I made a few experiments in the last few days, using three
> > different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.
> >
> > The first observation (independent of the kernel) is that we *should*
> > use O_DIRECT, at least for output to the local disk. Here we see about
> > 90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
> > threads to the same block device (through an ext2 FS) gives:
> >
> > O_DIRECT:     88 MB/s, 2x44, 3x29.5
> > non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
> >
> > - Observation 1a: IO schedulers are mostly equivalent, with CFQ
> >   slightly worse than AS and DEADLINE
> > - Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT
> >   performance goes [slightly] down. With three threads it is 3x10 MB/s.
> >   Ingo?
> > - Observation 1c: bdi-v9 does not help in this case, which is not
> >   surprising.
> >
> > The real question here is why the non-O_DIRECT case is so slow. Is
> > this a general thing? Is this related to the CCISS controller? Using
> > O_DIRECT is unfortunately not an option for us.
> >
> > When using three different targets (local disk plus two different NFS
> > filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
> > to be] limited to the speed of the slowest FS. With bdi-v9 we see a
> > considerable speedup.
> >
> > Just by chance I found out that doing all I/O in sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seems OK. Maybe a solution, as this can be controlled
> > via mount (would be great for O_DIRECT :-).
> >
> > In general 2.6.22 seems to be better than 2.6.19, but this is highly
> > subjective :-( I am using the following settings in /proc. They seem
> > to provide the smoothest responsiveness:
> >
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
> >
> > Another thing I saw during my tests is that when writing to NFS, the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> > or a bug?
> >
> > In any case, view this as a report for one specific loadcase that does
> > not behave very well. It seems there are ways to make things better
> > (sync, per device throttling, ...), but nothing "perfect" yet. Use once
> > does seem to be a problem.
>
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack, see
> if it makes a difference (and please verify in dmesg that it prints the
> message about limiting depth!):

I saw a bulletin from HP recently that suggested disabling the
write-back cache on some Smart Array controllers as a workaround because
it reduced performance in applications that did large bulk writes.
Presumably they are planning on releasing some updated firmware that
fixes this eventually..
--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
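For scale, the vm.dirty_* values quoted in the thread are percentages of (roughly) total memory, so on the 8 GB box dirty_ratio = 1 caps dirty pages at around 81 MB, close to the controller's 100 MB write cache. A back-of-the-envelope helper; note the "percent of total RAM" model is a simplification of the kernel's actual calculation, which works on dirtyable memory:

```python
# Sketch: approximate dirty-page ceiling implied by vm.dirty_ratio.
# This is a rough model; the kernel computes the limit from dirtyable
# memory, not raw total RAM.
def dirty_limit_bytes(total_mem_bytes, ratio_percent):
    """Bytes of dirty page cache allowed before writers are throttled."""
    return total_mem_bytes * ratio_percent // 100


GiB = 2 ** 30
# dirty_ratio = 1 on an 8 GB box:
print(dirty_limit_bytes(8 * GiB, 1) // 2 ** 20, "MB")  # -> 81 MB
```

That the limit roughly matches the array's write cache may explain why ratio = 1 gives the smoothest behaviour on this particular machine.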
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
>
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack, see
> if it makes a difference (and please verify in dmesg that it prints the
> message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
>  		if (board_id == products[i].board_id) {
>  			c->product_name = products[i].product_name;
>  			c->access = *(products[i].access);
> +#if 0
>  			c->nr_cmds = products[i].nr_cmds;
> +#else
> +			c->nr_cmds = 2;
> +			printk("cciss: limited max commands to 2\n");
> +#endif
>  			break;
>  		}
>  	}
>
> --
> Jens Axboe

Hi Jens,

thanks for the suggestion. Unfortunately the non-direct [parallel]
writes to the device got considerably slower. I guess the "6i"
controller copes better with higher values. Can nr_cmds be changed at
runtime? Maybe there is an optimal setting.
[ 69.438851] SCSI subsystem initialized
[ 69.442712] HP CISS Driver (v 3.6.14)
[ 69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level, low) -> IRQ 51
[ 69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[ 69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[ 69.494352]       blocks= 426759840 block_size= 512
[ 69.498350]       heads=255, sectors=32, cylinders=52299
[ 69.498352]
[ 69.498509]       blocks= 426759840 block_size= 512
[ 69.498602]       heads=255, sectors=32, cylinders=52299
[ 69.498604]
[ 69.498608]  cciss/c0d0: p1 p2

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
On Tue, Aug 28 2007, Martin Knoblauch wrote:
>  Keywords: I/O, bdi-v9, cfs
>
> Hi,
>
> a while ago I asked a few questions on the Linux I/O behaviour,
> because I was (and still am) fighting some "misbehaviour" related to
> heavy I/O.
>
> The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
>
> The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB, we
> see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> some other poor guys being in "D" state.
>
> The data flows in basically three modes. All of them are affected:
>
> local-disk -> NFS
> NFS -> local-disk
> NFS -> NFS
>
> NFS is V3/TCP.
>
> So, I made a few experiments in the last few days, using three
> different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.
>
> The first observation (independent of the kernel) is that we *should*
> use O_DIRECT, at least for output to the local disk. Here we see about
> 90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
> threads to the same block device (through an ext2 FS) gives:
>
> O_DIRECT:     88 MB/s, 2x44, 3x29.5
> non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
>
> - Observation 1a: IO schedulers are mostly equivalent, with CFQ
>   slightly worse than AS and DEADLINE
> - Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT
>   performance goes [slightly] down. With three threads it is 3x10 MB/s.
>   Ingo?
> - Observation 1c: bdi-v9 does not help in this case, which is not
>   surprising.
>
> The real question here is why the non-O_DIRECT case is so slow. Is
> this a general thing? Is this related to the CCISS controller? Using
> O_DIRECT is unfortunately not an option for us.
>
> When using three different targets (local disk plus two different NFS
> filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
> to be] limited to the speed of the slowest FS. With bdi-v9 we see a
> considerable speedup.
>
> Just by chance I found out that doing all I/O in sync-mode does
> prevent the load from going up. Of course, I/O throughput is not
> stellar (but not much worse than the non-O_DIRECT case). But the
> responsiveness seems OK. Maybe a solution, as this can be controlled
> via mount (would be great for O_DIRECT :-).
>
> In general 2.6.22 seems to be better than 2.6.19, but this is highly
> subjective :-( I am using the following settings in /proc. They seem
> to provide the smoothest responsiveness:
>
> vm.dirty_background_ratio = 1
> vm.dirty_ratio = 1
> vm.swappiness = 1
> vm.vfs_cache_pressure = 1
>
> Another thing I saw during my tests is that when writing to NFS, the
> "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> or a bug?
>
> In any case, view this as a report for one specific loadcase that does
> not behave very well. It seems there are ways to make things better
> (sync, per device throttling, ...), but nothing "perfect" yet. Use once
> does seem to be a problem.

Try limiting the queue depth on the cciss device, some of those are
notoriously bad at starving commands. Something like the below hack, see
if it makes a difference (and please verify in dmesg that it prints the
message about limiting depth!):

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 084358a..257e1c3 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 		if (board_id == products[i].board_id) {
 			c->product_name = products[i].product_name;
 			c->access = *(products[i].access);
+#if 0
 			c->nr_cmds = products[i].nr_cmds;
+#else
+			c->nr_cmds = 2;
+			printk("cciss: limited max commands to 2\n");
+#endif
 			break;
 		}
 	}

--
Jens Axboe
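The dd comparison quoted in this exchange is easy to script. A sketch that just assembles the invocations; the output path and sizes are hypothetical placeholders, so run the resulting commands only against scratch storage:

```python
# Sketch: build the dd command lines for the O_DIRECT vs. buffered write
# comparison described above. Paths and sizes are placeholders.
def dd_cmd(out_path, mb, direct):
    """Return a dd argv writing `mb` MiB of zeros to `out_path`."""
    cmd = ["dd", "if=/dev/zero", "of=%s" % out_path,
           "bs=1M", "count=%d" % mb]
    if direct:
        cmd.append("oflag=direct")  # bypass the page cache
    return cmd


def throughput_mb_s(mb_written, seconds):
    return mb_written / seconds


# e.g. three parallel writers, one file each, as in the 3-thread case:
cmds = [dd_cmd("/scratch/ddtest.%d" % i, 1024, direct=True) for i in range(3)]
```

Launching the three commands concurrently (e.g. via subprocess.Popen) and timing each reproduces the per-thread numbers in the report.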
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> >
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> >
> > > You are apparently running into the sluggish kupdate-style writeback
> > > problem with large files: huge amount of dirty pages are getting
> > > accumulated and flushed to the disk all at once when dirty background
> > > ratio is reached. The current -mm tree has some fixes for it, and
> > > there are some more in my tree. Martin, I'll send you the patch if
> > > you'd like to try it out.
> >
> > Hi Fengguang,
> >
> > Yeah, that pretty much describes the situation we end up in. Although
> > "sluggish" is much too friendly if we hit the situation :-)
> >
> > Yes, I am very interested to check out your patch. I saw your
> > postings on LKML already and was already curious. Any chance you have
> > something against 2.6.22-stable? I have reasons not to move to -23 or
> > -mm.
>
> Well, they are a dozen patches from various sources. I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...

Thanks. I understand the limited scope of the warranty :-) I will give
it a spin today.

> > > > Another thing I saw during my tests is that when writing to NFS,
> > > > the "dirty" or "nr_dirty" numbers are always 0. Is this a
> > > > conceptual thing, or a bug?
> > >
> > > What are the nr_unstable numbers?
> >
> > Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> >
> > For NFS, the nr_writeback numbers seem surprisingly high. They also go
> > to 80-90k (pages ?). In the disk case they rarely go over 12k.
>
> Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

No "nfsd" running on that box. It is just a client.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
>
> --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
>
> > You are apparently running into the sluggish kupdate-style writeback
> > problem with large files: huge amount of dirty pages are getting
> > accumulated and flushed to the disk all at once when dirty background
> > ratio is reached. The current -mm tree has some fixes for it, and
> > there are some more in my tree. Martin, I'll send you the patch if
> > you'd like to try it out.
>
> Hi Fengguang,
>
> Yeah, that pretty much describes the situation we end up in. Although
> "sluggish" is much too friendly if we hit the situation :-)
>
> Yes, I am very interested to check out your patch. I saw your
> postings on LKML already and was already curious. Any chance you have
> something against 2.6.22-stable? I have reasons not to move to -23 or
> -mm.

Well, they are a dozen patches from various sources. I managed to
back-port them. It compiles and runs, however I cannot guarantee
more...

> > > Another thing I saw during my tests is that when writing to NFS,
> > > the "dirty" or "nr_dirty" numbers are always 0. Is this a
> > > conceptual thing, or a bug?
> >
> > What are the nr_unstable numbers?
>
> Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> numbers for the disk case. Good to know.
>
> For NFS, the nr_writeback numbers seem surprisingly high. They also go
> to 80-90k (pages ?). In the disk case they rarely go over 12k.

Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

Fengguang
---

--- linux-2.6.22.orig/fs/fs-writeback.c
+++ linux-2.6.22/fs/fs-writeback.c
@@ -24,6 +24,148 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+/*
+ * Add @inode to its superblock's radix tree of dirty inodes.
+ *
+ * - the radix tree is indexed by inode number
+ * - inode_tree is not authoritative; inode_list is
+ * - inode_tree is a superset of inode_list: it is possible that an inode
+ *   get synced elsewhere and moved to other lists, while still remaining
+ *   in the radix tree.
+ */
+static void add_to_dirty_tree(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int e;
+
+	e = radix_tree_preload(GFP_ATOMIC);
+	if (!e) {
+		e = radix_tree_insert(&dt->inode_tree, inode->i_ino, inode);
+		/*
+		 * - inode numbers are not necessarily unique
+		 * - an inode might somehow be redirtied and resent to us
+		 */
+		if (!e) {
+			__iget(inode);
+			dt->nr_inodes++;
+			if (dt->max_index < inode->i_ino)
+				dt->max_index = inode->i_ino;
+			list_move(&inode->i_list, &sb->s_dirty_tree.inode_list);
+		}
+		radix_tree_preload_end();
+	}
+}
+
+#define DIRTY_SCAN_BATCH	16
+#define DIRTY_SCAN_ALL		LONG_MAX
+#define DIRTY_SCAN_REMAINING	(LONG_MAX-1)
+
+/*
+ * Scan the dirty inode tree and pull some inodes onto s_io.
+ * It could go beyond @end - it is a soft/approx limit.
+ */
+static unsigned long scan_dirty_tree(struct super_block *sb,
+				unsigned long begin, unsigned long end)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	struct inode *inodes[DIRTY_SCAN_BATCH];
+	struct inode *inode = NULL;
+	int i, j;
+	void *p;
+
+	while (begin < end) {
+		j = radix_tree_gang_lookup(&dt->inode_tree, (void **)inodes,
+						begin, DIRTY_SCAN_BATCH);
+		if (!j)
+			break;
+		for (i = 0; i < j; i++) {
+			inode = inodes[i];
+			if (end != DIRTY_SCAN_ALL) {
+				/* skip young volatile ones */
+				if (time_after(inode->dirtied_when,
+					jiffies - dirty_volatile_interval)) {
+					inodes[i] = 0;
+					continue;
+				}
+			}
+
+			dt->nr_inodes--;
+			p = radix_tree_delete(&dt->inode_tree, inode->i_ino);
+			BUG_ON(!p);
+
+			if (!(inode->i_state & I_SYNC))
+				list_move(&inode->i_list, &sb->s_io);
+		}
+		begin = inode->i_ino + 1;
+
+		spin_unlock(&inode_lock);
+		for (i = 0; i < j; i++)
+			if (inodes[i])
+				iput(inodes[i]);
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+
+	return begin;
+}
+
+/*
+ * Move a cluster of dirty inodes to the io dispatch queue.
+ */
+static void dispatch_cluster_inodes(struct super_block *sb,
+				unsigned long *older_than_this)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int scan_interval = dirty_expire_interval - dirty_volatile_interval;
+	unsigned long begin;
+	unsigned long end;
+
+	if (!older_than_this) {
+		/*
+		 * Be aggressive: either it is a sync(), or we fall into
+		 * background writeback because kupdate-style writebacks
+		 * could not catch up with fast writers.
+		 */
+		begin = 0;
+		end = DIRTY_SCAN_ALL;
+	} else if (time_after_eq(jiffies,
+				dt->start_jiffies + scan_interval)) {
+		begin = dt->next_index;
+		end = DIRTY_SCAN_REMAINING; /* complete this sweep */
+	} else {
+		unsigned long time_total = max(scan_interval, 1);
+		unsigned long time_delta = jiffies - dt->start_jiffies;
+		unsigned long scan_total = dt->max_index;
+		unsigned long scan_delta = scan_total * time_delta / time_total;
+
+		begin = dt->next_index;
+		end = scan_delta;
+	}
+
+	scan_dirty_tree(sb, begin, end);
+
+	if (end
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
> [...]
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5 GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> > some other poor guys being in "D" state.
> [...]
> > Just by chance I found out that doing all I/O in sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seems OK. Maybe a solution, as this can be controlled
> > via mount (would be great for O_DIRECT :-).
> >
> > In general 2.6.22 seems to be better than 2.6.19, but this is highly
> > subjective :-( I am using the following settings in /proc. They seem
> > to provide the smoothest responsiveness:
> >
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
>
> You are apparently running into the sluggish kupdate-style writeback
> problem with large files: huge amount of dirty pages are getting
> accumulated and flushed to the disk all at once when dirty background
> ratio is reached. The current -mm tree has some fixes for it, and
> there are some more in my tree. Martin, I'll send you the patch if
> you'd like to try it out.
>
Hi Fengguang,

Yeah, that pretty much describes the situation we end up in. Although
"sluggish" is much too friendly if we hit the situation :-)

Yes, I am very interested to check out your patch. I saw your postings
on LKML already and was already curious. Any chance you have something
against 2.6.22-stable? I have reasons not to move to -23 or -mm.

> > Another thing I saw during my tests is that when writing to NFS, the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> > thing, or a bug?
>
> What are the nr_unstable numbers?
>
Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
numbers for the disk case. Good to know.

For NFS, the nr_writeback numbers seem surprisingly high. They also go
to 80-90k (pages ?). In the disk case they rarely go over 12k.

Cheers
Martin

> Fengguang

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
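The "accumulate, then flush all at once" pattern described in this exchange can be illustrated with a toy model: pages dirty at a steady rate and writeback only starts once the total crosses the background threshold. All numbers below are arbitrary, purely for illustration of the bursty shape:

```python
# Toy model of kupdate-style writeback: dirty pages accumulate at
# `dirty_rate` per tick; once they exceed `bg_thresh`, flushing runs at
# `flush_rate` per tick until the backlog drains. Numbers are arbitrary.
def simulate(ticks, dirty_rate, flush_rate, bg_thresh):
    """Return the peak dirty-page backlog seen over the simulation."""
    dirty, flushing, peak = 0, False, 0
    for _ in range(ticks):
        dirty += dirty_rate
        peak = max(peak, dirty)
        if dirty > bg_thresh:
            flushing = True          # background writeback kicks in
        if flushing:
            dirty = max(0, dirty - flush_rate)
            if dirty == 0:
                flushing = False     # backlog drained, stop flushing
    return peak

# A large backlog builds before the first flush, then drains in a burst:
print(simulate(ticks=100, dirty_rate=10, flush_rate=25, bg_thresh=400))
```

The peak tracks the background threshold, which is why lowering dirty_background_ratio (as in the sysctl settings above) shrinks the stalls, and why spreading writeout over time, as the patch set aims to do, attacks the same problem from the other side.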
Re: Understanding I/O behaviour - next try
--- Fengguang Wu [EMAIL PROTECTED] wrote: On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote: [...] The basic setup is a dual x86_64 box with 8 GB of memory. The DL380 has a HW RAID5, made from 4x72GB disks and about 100 MB write cache. The performance of the block device with O_DIRECT is about 90 MB/sec. The problematic behaviour comes when we are moving large files through the system. The file usage in this case is mostly use once or streaming. As soon as the amount of file data is larger than 7.5 GB, we see occasional unresponsiveness of the system (e.g. no more ssh connections into the box) of more than 1 or 2 minutes (!) duration (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and some other poor guys being in D state. [...] Just by chance I found out that doing all I/O inc sync-mode does prevent the load from going up. Of course, I/O throughput is not stellar (but not much worse than the non-O_DIRECT case). But the responsiveness seem OK. Maybe a solution, as this can be controlled via mount (would be great for O_DIRECT :-). In general 2.6.22 seems to bee better that 2.6.19, but this is highly subjective :-( I am using the following setting in /proc. They seem to provide the smoothest responsiveness: vm.dirty_background_ratio = 1 vm.dirty_ratio = 1 vm.swappiness = 1 vm.vfs_cache_pressure = 1 You are apparently running into the sluggish kupdate-style writeback problem with large files: huge amount of dirty pages are getting accumulated and flushed to the disk all at once when dirty background ratio is reached. The current -mm tree has some fixes for it, and there are some more in my tree. Martin, I'll send you the patch if you'd like to try it out. Hi Fengguang, Yeah, that pretty much describes the situation we end up. Although sluggish is much to friendly if we hit the situation :-) Yes, I am very interested to check out your patch. I saw your postings on LKML already and was already curious. 
Any chance you have something against 2.6.22-stable? I have reasons not
to move to -23 or -mm.

> > Another thing I saw during my tests is that when writing to NFS, the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> > thing, or a bug?
>
> What are the nr_unstable numbers?

Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
numbers for the disk case. Good to know.

For NFS, the nr_writeback numbers seem surprisingly high. They also go
to 80-90k (pages?). In the disk case they rarely go over 12k.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour - next try
On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > You are apparently running into the sluggish kupdate-style writeback
> > problem with large files: huge amounts of dirty pages are getting
> > accumulated and flushed to the disk all at once when the dirty
> > background ratio is reached. The current -mm tree has some fixes for
> > it, and there are some more in my tree. Martin, I'll send you the
> > patch if you'd like to try it out.
>
> Hi Fengguang,
>
> yeah, that pretty much describes the situation we end up in. Although
> "sluggish" is much too friendly if we hit the situation :-)
>
> Yes, I am very interested to check out your patch. I saw your postings
> on LKML already and was curious. Any chance you have something against
> 2.6.22-stable? I have reasons not to move to -23 or -mm.

Well, they are a dozen patches from various sources. I managed to
back-port them. It compiles and runs, however I cannot guarantee
more...

> > > Another thing I saw during my tests is that when writing to NFS,
> > > the "dirty" or "nr_dirty" numbers are always 0. Is this a
> > > conceptual thing, or a bug?
> >
> > What are the nr_unstable numbers?
>
> Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> numbers for the disk case. Good to know.
>
> For NFS, the nr_writeback numbers seem surprisingly high. They also go
> to 80-90k (pages?). In the disk case they rarely go over 12k.

Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

Fengguang

--- linux-2.6.22.orig/fs/fs-writeback.c
+++ linux-2.6.22/fs/fs-writeback.c
@@ -24,6 +24,148 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+/*
+ * Add @inode to its superblock's radix tree of dirty inodes.
+ *
+ * - the radix tree is indexed by inode number
+ * - inode_tree is not authoritative; inode_list is
+ * - inode_tree is a superset of inode_list: it is possible that an inode
+ *   get synced elsewhere and moved to other lists, while still remaining
+ *   in the radix tree.
+ */
+static void add_to_dirty_tree(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int e;
+
+	e = radix_tree_preload(GFP_ATOMIC);
+	if (!e) {
+		e = radix_tree_insert(&dt->inode_tree, inode->i_ino, inode);
+		/*
+		 * - inode numbers are not necessarily unique
+		 * - an inode might somehow be redirtied and resent to us
+		 */
+		if (!e) {
+			__iget(inode);
+			dt->nr_inodes++;
+			if (dt->max_index < inode->i_ino)
+				dt->max_index = inode->i_ino;
+			list_move(&inode->i_list,
+					&sb->s_dirty_tree.inode_list);
+		}
+		radix_tree_preload_end();
+	}
+}
+
+#define DIRTY_SCAN_BATCH	16
+#define DIRTY_SCAN_ALL		LONG_MAX
+#define DIRTY_SCAN_REMAINING	(LONG_MAX-1)
+
+/*
+ * Scan the dirty inode tree and pull some inodes onto s_io.
+ * It could go beyond @end - it is a soft/approx limit.
+ */
+static unsigned long scan_dirty_tree(struct super_block *sb,
+				unsigned long begin, unsigned long end)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	struct inode *inodes[DIRTY_SCAN_BATCH];
+	struct inode *inode = NULL;
+	int i, j;
+	void *p;
+
+	while (begin < end) {
+		j = radix_tree_gang_lookup(&dt->inode_tree, (void **)inodes,
+						begin, DIRTY_SCAN_BATCH);
+		if (!j)
+			break;
+		for (i = 0; i < j; i++) {
+			inode = inodes[i];
+			if (end != DIRTY_SCAN_ALL) {
+				/* skip young volatile ones */
+				if (time_after(inode->dirtied_when,
+					jiffies - dirty_volatile_interval)) {
+					inodes[i] = 0;
+					continue;
+				}
+			}
+
+			dt->nr_inodes--;
+			p = radix_tree_delete(&dt->inode_tree, inode->i_ino);
+			BUG_ON(!p);
+
+			if (!(inode->i_state & I_SYNC))
+				list_move(&inode->i_list, &sb->s_io);
+		}
+		begin = inode->i_ino + 1;
+
+		spin_unlock(&inode_lock);
+		for (i = 0; i < j; i++)
+			if (inodes[i])
+				iput(inodes[i]);
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+
+	return begin;
+}
+
+/*
+ * Move a cluster of dirty inodes to the io dispatch queue.
+ */
+static void dispatch_cluster_inodes(struct super_block *sb,
+				unsigned long *older_than_this)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int scan_interval = dirty_expire_interval - dirty_volatile_interval;
+	unsigned long begin;
+	unsigned long end;
+
+	if (!older_than_this) {
+		/*
+		 * Be aggressive: either it is a sync(), or we fall into
+		 * background writeback because kupdate-style writebacks
+		 * could not catch up with fast writers.
+		 */
+		begin = 0;
+		end = DIRTY_SCAN_ALL;
+	} else if (time_after_eq(jiffies,
+				dt->start_jiffies + scan_interval)) {
+		begin = dt->next_index;
+		end = DIRTY_SCAN_REMAINING;	/* complete this sweep */
+	} else {
+		unsigned long time_total = max(scan_interval, 1);
+		unsigned long time_delta = jiffies - dt->start_jiffies;
+		unsigned long scan_total = dt->max_index;
+		unsigned long scan_delta = scan_total * time_delta / time_total;
+
+		begin = dt->next_index;
+		end = scan_delta;
+	}
+
+	scan_dirty_tree(sb, begin, end);
+
+	if (end < DIRTY_SCAN_REMAINING) {
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > You are apparently running into the sluggish kupdate-style
> > > writeback problem with large files: huge amounts of dirty pages
> > > are getting accumulated and flushed to the disk all at once when
> > > the dirty background ratio is reached. The current -mm tree has
> > > some fixes for it, and there are some more in my tree. Martin,
> > > I'll send you the patch if you'd like to try it out.
> >
> > Hi Fengguang,
> >
> > yeah, that pretty much describes the situation we end up in.
> > Although "sluggish" is much too friendly if we hit the situation :-)
> >
> > Yes, I am very interested to check out your patch. I saw your
> > postings on LKML already and was curious. Any chance you have
> > something against 2.6.22-stable? I have reasons not to move to -23
> > or -mm.
>
> Well, they are a dozen patches from various sources. I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...

Thanks. I understand the limited scope of the warranty :-) I will give
it a spin today.

> > > > Another thing I saw during my tests is that when writing to NFS,
> > > > the "dirty" or "nr_dirty" numbers are always 0. Is this a
> > > > conceptual thing, or a bug?
> > >
> > > What are the nr_unstable numbers?
> >
> > Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> >
> > For NFS, the nr_writeback numbers seem surprisingly high. They also
> > go to 80-90k (pages?). In the disk case they rarely go over 12k.
>
> Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

No nfsd running on that box. It is just a client.

Cheers
Martin
Re: Understanding I/O behaviour - next try
On Tue, Aug 28 2007, Martin Knoblauch wrote:
> Keywords: I/O, bdi-v9, cfs
>
> Hi,
>
> a while ago I asked a few questions on the Linux I/O behaviour,
> because I was (still am) fighting some "misbehaviour" related to heavy
> I/O.
>
> The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
>
> The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB,
> we see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and some other poor guys being in "D" state.
>
> The data flows in basically three modes. All of them are affected:
>
> local-disk -> NFS
> NFS -> local-disk
> NFS -> NFS
>
> NFS is V3/TCP.
>
> So, I made a few experiments in the last few days, using three
> different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.
>
> The first observation (independent of the kernel) is that we *should*
> use O_DIRECT, at least for output to the local disk. Here we see about
> 90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
> threads to the same block device (through an ext2 FS) gives:
>
> O_DIRECT:     88 MB/s, 2x44, 3x29.5
> non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
>
> - Observation 1a: IO schedulers are mostly equivalent, with CFQ
>   slightly worse than AS and DEADLINE
> - Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT
>   performance goes [slightly] down. With three threads it is
>   3x10 MB/s. Ingo?
> - Observation 1c: bdi-v9 does not help in this case, which is not
>   surprising.
>
> The real question here is why the non-O_DIRECT case is so slow. Is
> this a general thing? Is this related to the CCISS controller? Using
> O_DIRECT is unfortunately not an option for us.
> When using three different targets (local disk plus two different NFS
> filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
> to be] limited to the speed of the slowest FS. With bdi-v9 we see a
> considerable speedup.
>
> Just by chance I found out that doing all I/O in sync-mode does
> prevent the load from going up. Of course, I/O throughput is not
> stellar (but not much worse than the non-O_DIRECT case). But the
> responsiveness seems OK. Maybe a solution, as this can be controlled
> via mount (would be great for O_DIRECT :-).
>
> In general 2.6.22 seems to be better than 2.6.19, but this is highly
> subjective :-( I am using the following settings in /proc. They seem
> to provide the smoothest responsiveness:
>
> vm.dirty_background_ratio = 1
> vm.dirty_ratio = 1
> vm.swappiness = 1
> vm.vfs_cache_pressure = 1
>
> Another thing I saw during my tests is that when writing to NFS, the
> "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> thing, or a bug?
>
> In any case, view this as a report for one specific load case that
> does not behave very well. It seems there are ways to make things
> better (sync, per-device throttling, ...), but nothing "perfect" yet.
> "Use once" does seem to be a problem.

Try limiting the queue depth on the cciss device, some of those are
notoriously bad at starving commands.
Something like the below hack, see if it makes a difference (and please
verify in dmesg that it prints the message about limiting depth!):

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 084358a..257e1c3 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 		if (board_id == products[i].board_id) {
 			c->product_name = products[i].product_name;
 			c->access = *(products[i].access);
+#if 0
 			c->nr_cmds = products[i].nr_cmds;
+#else
+			c->nr_cmds = 2;
+			printk("cciss: limited max commands to 2\n");
+#endif
 			break;
 		}
 	}

-- 
Jens Axboe
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
>
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see if it makes a difference (and please verify in dmesg that it
> prints the message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
>  		if (board_id == products[i].board_id) {
>  			c->product_name = products[i].product_name;
>  			c->access = *(products[i].access);
> +#if 0
>  			c->nr_cmds = products[i].nr_cmds;
> +#else
> +			c->nr_cmds = 2;
> +			printk("cciss: limited max commands to 2\n");
> +#endif
>  			break;
>  		}
>  	}
>
> -- 
> Jens Axboe

Hi Jens,

thanks for the suggestion. Unfortunately the non-direct [parallel]
writes to the device got considerably slower. I guess the 6i controller
copes better with higher values. Can nr_cmds be changed at runtime?
Maybe there is an optimal setting.

[   69.438851] SCSI subsystem initialized
[   69.442712] HP CISS Driver (v 3.6.14)
[   69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level, low) -> IRQ 51
[   69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[   69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[   69.494352]       blocks= 426759840 block_size= 512
[   69.498350]       heads=255, sectors=32, cylinders=52299
[   69.498352]
[   69.498509]       blocks= 426759840 block_size= 512
[   69.498602]       heads=255, sectors=32, cylinders=52299
[   69.498604]
[   69.498608]  cciss/c0d0: p1 p2

Cheers
Martin
Re: Understanding I/O behaviour - next try
Jens Axboe wrote:
> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
> >
> > Hi,
> >
> > a while ago I asked a few questions on the Linux I/O behaviour,
> > because I was (still am) fighting some "misbehaviour" related to
> > heavy I/O.
> >
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about
> > 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files
> > through the system. The file usage in this case is mostly "use once"
> > or streaming. As soon as the amount of file data is larger than
> > 7.5 GB, we see occasional unresponsiveness of the system (e.g. no
> > more ssh connections into the box) of more than 1 or 2 minutes (!)
> > duration (kernels up to 2.6.19). Load goes up, mainly due to pdflush
> > threads and some other poor guys being in "D" state.
> >
> > The data flows in basically three modes. All of them are affected:
> >
> > local-disk -> NFS
> > NFS -> local-disk
> > NFS -> NFS
> >
> > NFS is V3/TCP.
> >
> > So, I made a few experiments in the last few days, using three
> > different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.
> >
> > The first observation (independent of the kernel) is that we
> > *should* use O_DIRECT, at least for output to the local disk. Here
> > we see about 90 MB/sec write performance. A simple "dd" using 1, 2
> > and 3 parallel threads to the same block device (through an ext2 FS)
> > gives:
> >
> > O_DIRECT:     88 MB/s, 2x44, 3x29.5
> > non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
> >
> > - Observation 1a: IO schedulers are mostly equivalent, with CFQ
> >   slightly worse than AS and DEADLINE
> > - Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT
> >   performance goes [slightly] down. With three threads it is
> >   3x10 MB/s. Ingo?
> > - Observation 1c: bdi-v9 does not help in this case, which is not
> >   surprising.
> >
> > The real question here is why the non-O_DIRECT case is so slow. Is
> > this a general thing? Is this related to the CCISS controller? Using
> > O_DIRECT is unfortunately not an option for us.
> > When using three different targets (local disk plus two different
> > NFS filesystems) bdi-v9 is a big winner. Without it, all threads are
> > [seem to be] limited to the speed of the slowest FS. With bdi-v9 we
> > see a considerable speedup.
> >
> > Just by chance I found out that doing all I/O in sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seems OK. Maybe a solution, as this can be controlled
> > via mount (would be great for O_DIRECT :-).
> >
> > In general 2.6.22 seems to be better than 2.6.19, but this is highly
> > subjective :-( I am using the following settings in /proc. They seem
> > to provide the smoothest responsiveness:
> >
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
> >
> > Another thing I saw during my tests is that when writing to NFS, the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> > thing, or a bug?
> >
> > In any case, view this as a report for one specific load case that
> > does not behave very well. It seems there are ways to make things
> > better (sync, per-device throttling, ...), but nothing "perfect"
> > yet. "Use once" does seem to be a problem.
>
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see if it makes a difference (and please verify in dmesg that it
> prints the message about limiting depth!):

I saw a bulletin from HP recently that suggested disabling the
write-back cache on some Smart Array controllers as a workaround,
because it reduced performance in applications that did large bulk
writes. Presumably they are planning on releasing some updated firmware
that fixes this eventually.
--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: Understanding I/O behaviour - next try
On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
>
> The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB,
> we see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and some other poor guys being in "D" state.

Try booting with mem=4096M, mem=2048M, ...
Re: Understanding I/O behaviour - next try
--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about
> > 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files
> > through the system. The file usage in this case is mostly "use once"
> > or streaming. As soon as the amount of file data is larger than
> > 7.5 GB, we see occasional unresponsiveness of the system (e.g. no
> > more ssh connections into the box) of more than 1 or 2 minutes (!)
> > duration (kernels up to 2.6.19). Load goes up, mainly due to pdflush
> > threads and some other poor guys being in "D" state.
>
> Try booting with mem=4096M, mem=2048M, ...

hmm. I tried 1024M a while ago and IIRC did not see a lot [any]
difference. But as it is no big deal, I will repeat it tomorrow. Just
curious - what are you expecting? Why should it help?

Thanks
Martin
Re: Understanding I/O behaviour - next try
On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote: [...] > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380 > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache. > The performance of the block device with O_DIRECT is about 90 MB/sec. > > The problematic behaviour comes when we are moving large files through > the system. The file usage in this case is mostly "use once" or > streaming. As soon as the amount of file data is larger than 7.5 GB, we > see occasional unresponsiveness of the system (e.g. no more ssh > connections into the box) of more than 1 or 2 minutes (!) duration > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and > some other poor guys being in "D" state. [...] > Just by chance I found out that doing all I/O inc sync-mode does > prevent the load from going up. Of course, I/O throughput is not > stellar (but not much worse than the non-O_DIRECT case). But the > responsiveness seem OK. Maybe a solution, as this can be controlled via > mount (would be great for O_DIRECT :-). > > In general 2.6.22 seems to bee better that 2.6.19, but this is highly > subjective :-( I am using the following setting in /proc. They seem to > provide the smoothest responsiveness: > > vm.dirty_background_ratio = 1 > vm.dirty_ratio = 1 > vm.swappiness = 1 > vm.vfs_cache_pressure = 1 You are apparently running into the sluggish kupdate-style writeback problem with large files: huge amount of dirty pages are getting accumulated and flushed to the disk all at once when dirty background ratio is reached. The current -mm tree has some fixes for it, and there are some more in my tree. Martin, I'll send you the patch if you'd like to try it out. > Another thing I saw during my tests is that when writing to NFS, the > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing, > or a bug? What are the nr_unstable numbers? 
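For reference, the counters discussed here can be watched directly. A
minimal sketch, assuming a Linux /proc/vmstat (field names vary across
kernel versions; nr_unstable was removed in later kernels, and the
fallback line is only there so the snippet degrades gracefully
elsewhere):

```shell
# Print the writeback-related counters from /proc/vmstat, if present.
# The trailing space in the pattern excludes nr_dirty_threshold etc.
grep -E '^nr_(dirty|writeback|unstable) ' /proc/vmstat 2>/dev/null \
    || echo 'nr_dirty (unavailable)'
```

Running this in a loop (e.g. under `watch`) during a large copy shows
the accumulate-then-flush pattern described above.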
Fengguang
Understanding I/O behaviour - next try
Keywords: I/O, bdi-v9, cfs

Hi,

a while ago I asked a few questions on the Linux I/O behaviour, because
I was (still am) fighting some "misbehaviour" related to heavy I/O.

The basic setup is a dual x86_64 box with 8 GB of memory. The DL380 has
a HW RAID5, made from 4x72GB disks and about 100 MB write cache. The
performance of the block device with O_DIRECT is about 90 MB/sec.

The problematic behaviour comes when we are moving large files through
the system. The file usage in this case is mostly "use once" or
streaming. As soon as the amount of file data is larger than 7.5 GB, we
see occasional unresponsiveness of the system (e.g. no more ssh
connections into the box) of more than 1 or 2 minutes (!) duration
(kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
some other poor guys being in "D" state.

The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

NFS is V3/TCP.

So, I made a few experiments in the last few days, using three
different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

The first observation (independent of the kernel) is that we *should*
use O_DIRECT, at least for output to the local disk. Here we see about
90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
threads to the same block device (through an ext2 FS) gives:

O_DIRECT:     88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ
  slightly worse than AS and DEADLINE
- Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT
  performance goes [slightly] down. With three threads it is 3x10 MB/s.
  Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not
  surprising.

The real question here is why the non-O_DIRECT case is so slow. Is this
a general thing? Is this related to the CCISS controller? Using
O_DIRECT is unfortunately not an option for us.
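A minimal sketch of the kind of dd run behind the numbers above. The
target path here is a hypothetical stand-in (the original tests wrote
through an ext2 filesystem on the cciss device, and the O_DIRECT
variant would add oflag=direct):

```shell
# Buffered write of 64 MiB; conv=fsync makes dd fsync at the end, so
# the reported time includes the actual writeout, not just dirtying
# the page cache.
OUT=/tmp/ddtest.img    # hypothetical target, not the path from the mail
dd if=/dev/zero of="$OUT" bs=1M count=64 conv=fsync 2>&1 | tail -n 1
rm -f "$OUT"
```

For the multi-threaded numbers, several such dd invocations would be
started in parallel (shell `&` and `wait`) against the same filesystem.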
When using three different targets (local disk plus two different NFS
filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
to be] limited to the speed of the slowest FS. With bdi-v9 we see a
considerable speedup.

Just by chance I found out that doing all I/O in sync-mode does prevent
the load from going up. Of course, I/O throughput is not stellar (but
not much worse than the non-O_DIRECT case). But the responsiveness
seems OK. Maybe a solution, as this can be controlled via mount (would
be great for O_DIRECT :-).

In general 2.6.22 seems to be better than 2.6.19, but this is highly
subjective :-( I am using the following settings in /proc. They seem to
provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1

Another thing I saw during my tests is that when writing to NFS, the
"dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
or a bug?

In any case, view this as a report for one specific load case that does
not behave very well. It seems there are ways to make things better
(sync, per-device throttling, ...), but nothing "perfect" yet. "Use
once" does seem to be a problem.

Cheers
Martin
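The four /proc settings quoted in this mail map to sysctl knobs. A
small sketch that just reports the current values (with an "n/a"
fallback for systems that lack a knob), assuming a Linux /proc/sys/vm:

```shell
# Report the current values of the four VM knobs tuned in this thread.
for knob in dirty_background_ratio dirty_ratio swappiness vfs_cache_pressure; do
    val=$(cat /proc/sys/vm/$knob 2>/dev/null || echo 'n/a')
    echo "vm.$knob = $val"
done
```

Applying the values from the mail would be e.g.
`sysctl -w vm.dirty_ratio=1` (as root); they persist across reboots via
/etc/sysctl.conf.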