Re: Temporary lockup on loopback block device
> > On 2.6.23 it could happen even without loopback
>
> Let's focus on this point, because we already know how the lockup
> happens _with_ loopback and any other kind of bdi stacking.
>
> Can you describe the setup? Or better still, can you reproduce it and
> post the sysrq-t output?

Hi

The trace is this, it is perfectly reproducible. It is a 128M machine,
Pentium 2 300MHz, host filesystem ext2, loop filesystems ext2 and spadfs
(both of them locked up).

But the problem is really over in 2.6.24, I think there is no more need
to investigate it.

Mikulas

Nov 10 19:34:45 gerlinda kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Nov 10 19:34:53 gerlinda kernel: SysRq : Show Blocked State
Nov 10 19:34:53 gerlinda kernel:  task         PC stack   pid father
Nov 10 19:34:54 gerlinda kernel: dd           D 0286     0  4603   2985
Nov 10 19:34:55 gerlinda kernel: c580bcdc 0086 c0308c20 0286 0286 c580bcec 002a4e87
Nov 10 19:34:55 gerlinda kernel: c580bd10 c0284bba c580bd1c c03775e0 c03775e0 002a4e87 c011d050
Nov 10 19:34:55 gerlinda kernel: c117c030 c03771a0 0064 c02f8eb4 c0283efe c580bd44 c0145ebc
Nov 10 19:34:55 gerlinda kernel: Call Trace:
Nov 10 19:34:55 gerlinda kernel:  [c0284bba] schedule_timeout+0x4a/0xc0
Nov 10 19:34:55 gerlinda kernel:  [c011d050] process_timeout+0x0/0x10
Nov 10 19:34:55 gerlinda kernel:  [c0283efe] io_schedule_timeout+0xe/0x20
Nov 10 19:34:55 gerlinda kernel:  [c0145ebc] congestion_wait+0x6c/0x90
Nov 10 19:34:55 gerlinda kernel:  [c01274e0] autoremove_wake_function+0x0/0x50
Nov 10 19:34:55 gerlinda kernel:  [c014135f] balance_dirty_pages_ratelimited_nr+0x11f/0x1e0
Nov 10 19:34:55 gerlinda kernel:  [c013cb98] generic_file_buffered_write+0x2f8/0x6f0
Nov 10 19:34:55 gerlinda kernel:  [c01198b7] irq_exit+0x47/0x70
Nov 10 19:34:55 gerlinda kernel:  [c01049e7] do_IRQ+0x47/0x80
Nov 10 19:34:55 gerlinda kernel:  [c0102cbf] common_interrupt+0x23/0x28
Nov 10 19:34:55 gerlinda kernel:  [c013d1e3] __generic_file_aio_write_nolock+0x253/0x540
Nov 10 19:34:55 gerlinda kernel:  [c012a87b] hrtimer_run_queues+0x6b/0x290
Nov 10 19:34:55 gerlinda kernel:  [c013d526] generic_file_aio_write+0x56/0xd0
Nov 10 19:34:55 gerlinda kernel:  [c012ed9f] tick_handle_periodic+0xf/0x70
Nov 10 19:34:55 gerlinda kernel:  [c015a1d6] do_sync_write+0xc6/0x110
Nov 10 19:34:55 gerlinda kernel:  [c01274e0] autoremove_wake_function+0x0/0x50
Nov 10 19:34:55 gerlinda kernel:  [c01c604f] clear_user+0x2f/0x50
Nov 10 19:34:55 gerlinda kernel:  [c012] ptrace_notify+0x30/0x90
Nov 10 19:34:55 gerlinda kernel:  [c015aa56] vfs_write+0xa6/0x140
Nov 10 19:34:55 gerlinda kernel:  [c8926310] SPADFS_FILE_WRITE+0x0/0x10 [spadfs]
Nov 10 19:34:55 gerlinda kernel:  [c015b031] sys_write+0x41/0x70
Nov 10 19:34:55 gerlinda kernel:  [c0102b16] syscall_call+0x7/0xb
Nov 10 19:34:55 gerlinda kernel: ===

> Thanks,
> Miklos

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Temporary lockup on loopback block device
> On 2.6.23 it could happen even without loopback

Let's focus on this point, because we already know how the lockup
happens _with_ loopback and any other kind of bdi stacking.

Can you describe the setup? Or better still, can you reproduce it and
post the sysrq-t output?

Thanks,
Miklos
Re: Temporary lockup on loopback block device
> > Why are there over-limit dirty pages that no one is writing?
>
> Please do a sysrq-t, and cat /proc/vmstat during the hang. Those
> will show us what exactly is happening.

I did, and I posted the relevant information from my finding --- it
looped in balance_dirty_pages.

> I've seen this type of hang many times, and I agree with Peter, that
> it's probably about loopback, and is fixed in 2.6.24-rc.

On 2.6.23 it could happen even without loopback --- loopback just made
it happen very often. 2.6.24 seems ok.

Mikulas

> Thanks,
> Miklos
Re: Temporary lockup on loopback block device
> > > Arguably we just have the wrong backing-device here, and what we should do
> > > is to propagate the real backing device's pointer through up into the
> > > filesystem. There's machinery for this which things like DM stacks use.
> > >
> > > I wonder if the post-2.6.23 changes happened to make this problem go away.
> >
> > The per BDI dirty stuff in 24 should make this work, I just checked and
> > loopback thingies seem to have their own BDI, so all should be well.
>
> This is not only about loopback (I think the lockup can happen even
> without loopback) --- the main problem is:
>
> Why are there over-limit dirty pages that no one is writing?

Please do a sysrq-t, and cat /proc/vmstat during the hang. Those will
show us what exactly is happening.

I've seen this type of hang many times, and I agree with Peter, that
it's probably about loopback, and is fixed in 2.6.24-rc.

Thanks,
Miklos
Re: Temporary lockup on loopback block device
> > > Arguably we just have the wrong backing-device here, and what we
> > > should do is to propagate the real backing device's pointer through
> > > up into the filesystem. There's machinery for this which things
> > > like DM stacks use.

Just thinking about the new implementation --- you shouldn't really
propagate the physical block device's backing_device into the loopback
device. If you leave it as is (each loop device has its own backing
store), you can nicely avoid the long-standing loopback deadlock coming
from the fact that flushing one page on a loopback device can generate
several more dirty pages on the filesystem.

If you let the loopback device and the physical device have the same
backing store, then it can go wild creating more and more dirty pages,
up to memory exhaustion. If you let them have different backing stores,
it can't happen --- loopback flushing will just wait until the pages on
the filesystem are written.

Mikulas

> So I compiled it and I don't see any more lock-ups. The writeback loop
> doesn't depend on any global page count, so the above scenario can't
> happen here. Good.
>
> Mikulas
Re: Temporary lockup on loopback block device
On Sun, 11 Nov 2007, Mikulas Patocka wrote:

> On Sat, 10 Nov 2007, Andrew Morton wrote:
>
> > On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> >
> > > Hi
> > >
> > > I am experiencing a transient lockup in 'D' state with the loopback
> > > device. It happens when a process writes to a filesystem in loopback
> > > with a command like
> > >
> > > dd if=/dev/zero of=/s/fill bs=4k
> > >
> > > CPU is idle, the disk is idle too, yet the dd process is waiting in 'D'
> > > in congestion_wait called from balance_dirty_pages.
> > >
> > > After about 30 seconds, the lockup is gone and dd resumes, but it locks
> > > up again soon.
> > >
> > > I added a printk to balance_dirty_pages:
> > >
> > > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n", nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk);
> > >
> > > and it shows this during the lockup:
> > >
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > >
> > > What apparently happens:
> > >
> > > writeback_inodes syncs inodes only on the given wbc->bdi, however
> > > balance_dirty_pages checks against global counts of dirty pages. So if
> > > there's nothing to sync on a given device, but there are other dirty
> > > pages so that the counts are over the limit, it will loop without doing
> > > any work.
> > >
> > > To reproduce it, you need a totally idle machine (no GUI, etc.) -- if
> > > something writes to the backing device, it flushes the dirty pages
> > > generated by the loopback and the lockup is gone. If you add the
> > > printk, don't forget to stop klogd, otherwise logging would end the
> > > lockup.
> >
> > erk.
> >
> > > The hotfix (that I verified to work) is to not set wbc->bdi, so that
> > > all devices are flushed ... but the code probably needs some redesign
> > > (i.e. either account per-device and flush per-device, or account
> > > globally and flush globally).
> > >
> > > Mikulas
> > >
> > > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > > --- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.0 +0200
> > > +++ mm/page-writeback.c 2007-11-10 20:32:43.0 +0100
> > > @@ -214,7 +214,6 @@
> > >
> > >  	for (;;) {
> > >  		struct writeback_control wbc = {
> > > -			.bdi		= bdi,
> > >  			.sync_mode	= WB_SYNC_NONE,
> > >  			.older_than_this = NULL,
> > >  			.nr_to_write	= write_chunk,
> >
> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem. There's machinery for this which things like DM stacks use.
>
> If you change the loopback backing-device, you just turn this nicely
> reproducible example into a subtle race condition that can happen whether
> you use loopback or not. Think what happens when a different process
> dirties memory:
>
> You have process "A" that dirtied a lot of pages on device "1" but has
> not started writing them.
>
> You have process "B" that is trying to write to device "2", sees the
> dirty page count over the limit, but can't do anything about it, because
> it is only allowed to flush pages on device "2" --- so it loops endlessly.
>
> If you want to use the current flushing semantics, you just have to audit
> the whole kernel to make sure that if some process sees an over-limit
> dirty page count, there is another process that is flushing the pages.
> Currently it is not true: the "dd" process sees an over-limit count, but
> there is no one writing.
>
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
>
> I will try 2.6.24-rc2, but I don't think the root cause of this went
> away. Maybe you just reduced the probability.
>
> Mikulas

So I compiled it and I don't see any more lock-ups. The writeback loop
doesn't depend on any global page count, so the above scenario can't
happen here. Good.

Mikulas
Re: Temporary lockup on loopback block device
> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem. There's machinery for this which things like DM stacks use.
> >
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
>
> The per BDI dirty stuff in 24 should make this work, I just checked and
> loopback thingies seem to have their own BDI, so all should be well.

This is not only about loopback (I think the lockup can happen even
without loopback) --- the main problem is:

Why are there over-limit dirty pages that no one is writing?

Mikulas
Re: Temporary lockup on loopback block device
On Sat, 10 Nov 2007, Andrew Morton wrote:

> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
>
> > Hi
> >
> > I am experiencing a transient lockup in 'D' state with the loopback
> > device. It happens when a process writes to a filesystem in loopback
> > with a command like
> >
> > dd if=/dev/zero of=/s/fill bs=4k
> >
> > CPU is idle, the disk is idle too, yet the dd process is waiting in 'D'
> > in congestion_wait called from balance_dirty_pages.
> >
> > After about 30 seconds, the lockup is gone and dd resumes, but it locks
> > up again soon.
> >
> > I added a printk to balance_dirty_pages:
> >
> > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n", nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk);
> >
> > and it shows this during the lockup:
> >
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> >
> > What apparently happens:
> >
> > writeback_inodes syncs inodes only on the given wbc->bdi, however
> > balance_dirty_pages checks against global counts of dirty pages. So if
> > there's nothing to sync on a given device, but there are other dirty
> > pages so that the counts are over the limit, it will loop without doing
> > any work.
> >
> > To reproduce it, you need a totally idle machine (no GUI, etc.) -- if
> > something writes to the backing device, it flushes the dirty pages
> > generated by the loopback and the lockup is gone. If you add the printk,
> > don't forget to stop klogd, otherwise logging would end the lockup.
>
> erk.
>
> > The hotfix (that I verified to work) is to not set wbc->bdi, so that all
> > devices are flushed ... but the code probably needs some redesign (i.e.
> > either account per-device and flush per-device, or account globally and
> > flush globally).
> >
> > Mikulas
> >
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > --- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.0 +0200
> > +++ mm/page-writeback.c 2007-11-10 20:32:43.0 +0100
> > @@ -214,7 +214,6 @@
> >
> >  	for (;;) {
> >  		struct writeback_control wbc = {
> > -			.bdi		= bdi,
> >  			.sync_mode	= WB_SYNC_NONE,
> >  			.older_than_this = NULL,
> >  			.nr_to_write	= write_chunk,
>
> Arguably we just have the wrong backing-device here, and what we should do
> is to propagate the real backing device's pointer through up into the
> filesystem. There's machinery for this which things like DM stacks use.

If you change the loopback backing-device, you just turn this nicely
reproducible example into a subtle race condition that can happen whether
you use loopback or not. Think what happens when a different process
dirties memory:

You have process "A" that dirtied a lot of pages on device "1" but has not
started writing them.

You have process "B" that is trying to write to device "2", sees the dirty
page count over the limit, but can't do anything about it, because it is
only allowed to flush pages on device "2" --- so it loops endlessly.

If you want to use the current flushing semantics, you just have to audit
the whole kernel to make sure that if some process sees an over-limit dirty
page count, there is another process that is flushing the pages. Currently
it is not true: the "dd" process sees an over-limit count, but there is no
one writing.

> I wonder if the post-2.6.23 changes happened to make this problem go away.

I will try 2.6.24-rc2, but I don't think the root cause of this went away.
Maybe you just reduced the probability.

Mikulas
Re: Temporary lockup on loopback block device
On Sat, 2007-11-10 at 14:54 -0800, Andrew Morton wrote:
> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
>
> > Hi
> >
> > I am experiencing a transient lockup in 'D' state with the loopback
> > device. It happens when a process writes to a filesystem in loopback
> > with a command like
> >
> > dd if=/dev/zero of=/s/fill bs=4k
> >
> > CPU is idle, the disk is idle too, yet the dd process is waiting in 'D'
> > in congestion_wait called from balance_dirty_pages.
> >
> > After about 30 seconds, the lockup is gone and dd resumes, but it locks
> > up again soon.
> >
> > I added a printk to balance_dirty_pages:
> >
> > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n", nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk);
> >
> > and it shows this during the lockup:
> >
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> >
> > What apparently happens:
> >
> > writeback_inodes syncs inodes only on the given wbc->bdi, however
> > balance_dirty_pages checks against global counts of dirty pages. So if
> > there's nothing to sync on a given device, but there are other dirty
> > pages so that the counts are over the limit, it will loop without doing
> > any work.
> >
> > To reproduce it, you need a totally idle machine (no GUI, etc.) -- if
> > something writes to the backing device, it flushes the dirty pages
> > generated by the loopback and the lockup is gone. If you add the printk,
> > don't forget to stop klogd, otherwise logging would end the lockup.
>
> erk.

known issue.

> > The hotfix (that I verified to work) is to not set wbc->bdi, so that all
> > devices are flushed ... but the code probably needs some redesign (i.e.
> > either account per-device and flush per-device, or account globally and
> > flush globally).

.24 will have the per-device solution.

> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > --- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.0 +0200
> > +++ mm/page-writeback.c 2007-11-10 20:32:43.0 +0100
> > @@ -214,7 +214,6 @@
> >
> >  	for (;;) {
> >  		struct writeback_control wbc = {
> > -			.bdi		= bdi,
> >  			.sync_mode	= WB_SYNC_NONE,
> >  			.older_than_this = NULL,
> >  			.nr_to_write	= write_chunk,
>
> Arguably we just have the wrong backing-device here, and what we should do
> is to propagate the real backing device's pointer through up into the
> filesystem. There's machinery for this which things like DM stacks use.
>
> I wonder if the post-2.6.23 changes happened to make this problem go away.

The per BDI dirty stuff in 24 should make this work, I just checked and
loopback thingies seem to have their own BDI, so all should be well.
Re: Temporary lockup on loopback block device
On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:

> Hi
>
> I am experiencing a transient lockup in 'D' state with the loopback
> device. It happens when a process writes to a filesystem in loopback
> with a command like
>
> dd if=/dev/zero of=/s/fill bs=4k
>
> CPU is idle, the disk is idle too, yet the dd process is waiting in 'D'
> in congestion_wait called from balance_dirty_pages.
>
> After about 30 seconds, the lockup is gone and dd resumes, but it locks
> up again soon.
>
> I added a printk to balance_dirty_pages:
>
> printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n", nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk);
>
> and it shows this during the lockup:
>
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
>
> What apparently happens:
>
> writeback_inodes syncs inodes only on the given wbc->bdi, however
> balance_dirty_pages checks against global counts of dirty pages. So if
> there's nothing to sync on a given device, but there are other dirty
> pages so that the counts are over the limit, it will loop without doing
> any work.
>
> To reproduce it, you need a totally idle machine (no GUI, etc.) -- if
> something writes to the backing device, it flushes the dirty pages
> generated by the loopback and the lockup is gone. If you add the printk,
> don't forget to stop klogd, otherwise logging would end the lockup.

erk.

> The hotfix (that I verified to work) is to not set wbc->bdi, so that all
> devices are flushed ... but the code probably needs some redesign (i.e.
> either account per-device and flush per-device, or account globally and
> flush globally).
>
> Mikulas
>
> diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> --- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.0 +0200
> +++ mm/page-writeback.c 2007-11-10 20:32:43.0 +0100
> @@ -214,7 +214,6 @@
>
>  	for (;;) {
>  		struct writeback_control wbc = {
> -			.bdi		= bdi,
>  			.sync_mode	= WB_SYNC_NONE,
>  			.older_than_this = NULL,
>  			.nr_to_write	= write_chunk,

Arguably we just have the wrong backing-device here, and what we should do
is to propagate the real backing device's pointer through up into the
filesystem. There's machinery for this which things like DM stacks use.

I wonder if the post-2.6.23 changes happened to make this problem go away.
Temporary lockup on loopback block device
Hi

I am experiencing a transient lockup in 'D' state with the loopback
device. It happens when a process writes to a filesystem in loopback with
a command like

dd if=/dev/zero of=/s/fill bs=4k

CPU is idle, the disk is idle too, yet the dd process is waiting in 'D'
in congestion_wait called from balance_dirty_pages.

After about 30 seconds, the lockup is gone and dd resumes, but it locks
up again soon.

I added a printk to balance_dirty_pages:

printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n", nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk);

and it shows this during the lockup:

wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522

What apparently happens:

writeback_inodes syncs inodes only on the given wbc->bdi, however
balance_dirty_pages checks against global counts of dirty pages. So if
there's nothing to sync on a given device, but there are other dirty
pages so that the counts are over the limit, it will loop without doing
any work.

To reproduce it, you need a totally idle machine (no GUI, etc.) -- if
something writes to the backing device, it flushes the dirty pages
generated by the loopback and the lockup is gone. If you add the printk,
don't forget to stop klogd, otherwise logging would end the lockup.

The hotfix (that I verified to work) is to not set wbc->bdi, so that all
devices are flushed ... but the code probably needs some redesign (i.e.
either account per-device and flush per-device, or account globally and
flush globally).

Mikulas

diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
--- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.0 +0200
+++ mm/page-writeback.c 2007-11-10 20:32:43.0 +0100
@@ -214,7 +214,6 @@

 	for (;;) {
 		struct writeback_control wbc = {
-			.bdi		= bdi,
 			.sync_mode	= WB_SYNC_NONE,
 			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
Re: Temporary lockup on loopback block device
On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka [EMAIL PROTECTED] wrote: Hi I am experiencing a transient lockup in 'D' state with loopback device. It happens when process writes to a filesystem in loopback with command like dd if=/dev/zero of=/s/fill bs=4k CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in congestion_wait called from balance_dirty_pages. After about 30 seconds, the lockup is gone and dd resumes, but it locks up soon again. I added a printk to the balance_dirty_pages printk(wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, pages_written %d, write_chunk %d\n, nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, write_chunk); and it shows this during the lockup: wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522 wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522 wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522 What apparently happens: writeback_inodes syncs inodes only on the given wbc-bdi, however balance_dirty_pages checks against global counts of dirty pages. So if there's nothing to sync on a given device, but there are other dirty pages so that the counts are over the limit, it will loop without doing any work. To reproduce it, you need totally idle machine (no GUI, etc.) -- if something writes to the backing device, it flushes the dirty pages generated by the loopback and the lockup is gone. If you add printk, don't forget to stop klogd, otherwise logging would end the lockup. erk. The hotfix (that I verified to work) is to not set wbc-bdi, so that all devices are flushed ... but the code probably needs some redesign (i.e. either account per-device and flush per-device, or account-global and flush-global). 
> Mikulas
>
> diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> --- ../x/linux-2.6.23.1/mm/page-writeback.c	2007-10-12 18:43:44.0 +0200
> +++ mm/page-writeback.c	2007-11-10 20:32:43.0 +0100
> @@ -214,7 +214,6 @@
>  	for (;;) {
>  		struct writeback_control wbc = {
> -			.bdi		= bdi,
>  			.sync_mode	= WB_SYNC_NONE,
>  			.older_than_this = NULL,
>  			.nr_to_write	= write_chunk,

Arguably we just have the wrong backing-device here, and what we should
do is to propagate the real backing device's pointer up into the
filesystem. There's machinery for this which things like DM stacks use.

I wonder if the post-2.6.23 changes happened to make this problem go
away.
Re: Temporary lockup on loopback block device
On Sat, 2007-11-10 at 14:54 -0800, Andrew Morton wrote:
> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
>
> > I am experiencing a transient lockup in 'D' state with a loopback
> > device. [...]
> >
> > What apparently happens: writeback_inodes syncs inodes only on the
> > given wbc->bdi, but balance_dirty_pages checks against global
> > counts of dirty pages. So if there is nothing to sync on a given
> > device, but there are other dirty pages pushing the counts over the
> > limit, it loops without doing any work.
>
> erk.

known issue.

> > The hotfix (that I verified to work) is to not set wbc->bdi, so
> > that all devices are flushed ... but the code probably needs some
> > redesign (i.e. either account per-device and flush per-device, or
> > account globally and flush globally).

.24 will have the per-device solution.
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > [...]
> > -			.bdi		= bdi,
> > [...]
>
> Arguably we just have the wrong backing-device here, and what we
> should do is to propagate the real backing device's pointer up into
> the filesystem. There's machinery for this which things like DM
> stacks use.
>
> I wonder if the post-2.6.23 changes happened to make this problem go
> away.

The per-BDI dirty stuff in .24 should make this work. I just checked,
and loopback thingies seem to have their own BDI, so all should be
well.
Re: Temporary lockup on loopback block device
> > Arguably we just have the wrong backing-device here, and what we
> > should do is to propagate the real backing device's pointer up into
> > the filesystem. There's machinery for this which things like DM
> > stacks use.
> >
> > I wonder if the post-2.6.23 changes happened to make this problem
> > go away.
>
> The per-BDI dirty stuff in .24 should make this work. I just checked,
> and loopback thingies seem to have their own BDI, so all should be
> well.

This is not only about loopback (I think the lockup can happen even
without loopback) --- the main problem is: why are there over-limit
dirty pages that no one is writing?

Mikulas
Re: Temporary lockup on loopback block device
On Sat, 10 Nov 2007, Andrew Morton wrote:

> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
>
> > I am experiencing a transient lockup in 'D' state with a loopback
> > device. [...]
> >
> > The hotfix (that I verified to work) is to not set wbc->bdi, so
> > that all devices are flushed ... but the code probably needs some
> > redesign (i.e. either account per-device and flush per-device, or
> > account globally and flush globally).
> > Mikulas
> >
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > [...]
> > -			.bdi		= bdi,
> > [...]
>
> Arguably we just have the wrong backing-device here, and what we
> should do is to propagate the real backing device's pointer up into
> the filesystem. There's machinery for this which things like DM
> stacks use.

If you change the loopback backing-device, you just turn this nicely
reproducible example into a subtle race condition that can happen
whether you use loopback or not. Think about what happens when a
different process dirties memory:

You have process A that dirtied a lot of pages on device 1 but has not
started writing them. You have process B that is trying to write to
device 2; it sees the dirty page count over the limit, but it can't do
anything about it, because it is only allowed to flush pages on device
2 --- so it loops endlessly.

If you want to keep the current flushing semantics, you have to audit
the whole kernel to make sure that whenever some process sees an
over-limit dirty page count, there is another process flushing those
pages. Currently that is not true: the dd process sees an over-limit
count, but no one is writing.

> I wonder if the post-2.6.23 changes happened to make this problem go
> away.

I will try 2.6.24-rc2, but I don't think the root cause of this went
away. Maybe you just reduced the probability.

Mikulas
Re: Temporary lockup on loopback block device
On Sun, 11 Nov 2007, Mikulas Patocka wrote:

> On Sat, 10 Nov 2007, Andrew Morton wrote:
>
> > On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> >
> > > I am experiencing a transient lockup in 'D' state with a loopback
> > > device. [...]
> > > Mikulas
> > >
> > > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > > [...]
> >
> > Arguably we just have the wrong backing-device here, and what we
> > should do is to propagate the real backing device's pointer up into
> > the filesystem. There's machinery for this which things like DM
> > stacks use.
>
> You have process A that dirtied a lot of pages on device 1 but has
> not started writing them. You have process B that is trying to write
> to device 2; it sees the dirty page count over the limit, but it
> can't do anything about it, because it is only allowed to flush pages
> on device 2 --- so it loops endlessly.
> [...]
>
> > I wonder if the post-2.6.23 changes happened to make this problem
> > go away.
>
> I will try 2.6.24-rc2, but I don't think the root cause of this went
> away. Maybe you just reduced the probability.

So I compiled it, and I don't see any more lock-ups. The writeback
loop doesn't depend on any global page count, so the above scenario
can't happen there. Good.

Mikulas
Re: Temporary lockup on loopback block device
> > Arguably we just have the wrong backing-device here, and what we
> > should do is to propagate the real backing device's pointer up into
> > the filesystem. There's machinery for this which things like DM
> > stacks use.

Just thinking about the new implementation --- you shouldn't really
propagate the physical block device's backing_device into the loopback
device. If you leave it as it is (each loop device has its own backing
store), you can nicely avoid the long-standing loopback deadlock that
comes from the fact that flushing one page on a loopback device can
generate several more dirty pages on the filesystem.

If you let the loopback device and the physical device share the same
backing store, flushing can go wild, creating more and more dirty pages
up to memory exhaustion. If you let them have different backing stores,
that can't happen --- the loopback flushing will just wait until the
pages on the filesystem are written.

Mikulas

> So I compiled it, and I don't see any more lock-ups. The writeback
> loop doesn't depend on any global page count, so the above scenario
> can't happen there. Good.
Re: Temporary lockup on loopback block device
> > > Arguably we just have the wrong backing-device here, and what we
> > > should do is to propagate the real backing device's pointer up
> > > into the filesystem. There's machinery for this which things like
> > > DM stacks use.
> > >
> > > I wonder if the post-2.6.23 changes happened to make this problem
> > > go away.
> >
> > The per-BDI dirty stuff in .24 should make this work. I just
> > checked, and loopback thingies seem to have their own BDI, so all
> > should be well.
>
> This is not only about loopback (I think the lockup can happen even
> without loopback) --- the main problem is: why are there over-limit
> dirty pages that no one is writing?

Please do a sysrq-t, and cat /proc/vmstat during the hang. Those will
show us what exactly is happening.

I've seen this type of hang many times, and I agree with Peter that
it's probably about loopback, and is fixed in 2.6.24-rc.

Thanks,
Miklos