Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Björn Steinbrink wrote:

On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:

On Thu, 20 Dec 2007, Björn Steinbrink wrote:

OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers bail out if the page is dirty.

Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed truncate_complete_page, because it called cancel_dirty_page (and thus cleared PG_dirty) after try_to_free_buffers was called via do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

    truncate_complete_page()
      cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
      do_invalidatepage()
        ext3_invalidatepage()
          journal_invalidatepage()
            journal_unmap_buffer()
              __dispose_buffer()
                __journal_unfile_buffer()
                  __journal_temp_unlink_buffer()
                    mark_buffer_dirty();  // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to journal_unmap_buffer(), but was too lazy to actually then bother to follow it all the way down - I decided that I didn't actually really even care what the low-level FS layer did, I had already convinced myself that it obviously must be dirtying the page some way, since that matched the symptoms exactly (ie only the journaling case was impacted, and this was all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem did at that point, the VM accounting shouldn't care, and should be robust in the face of a low-level filesystem doing strange and wonderful things. But thanks for bothering to go through the whole history and figure out what exactly is up.

Oh well, after seeing the move of cancel_dirty_page, I just went backwards from __set_page_dirty using cscope + some smart guessing and quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)

As try_to_free_buffers got its ext3 hack back in ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe 3e67c0987d7567ad41164a153dca9a43b11d should be reverted?
(Except for the accounting fix in cancel_dirty_page, of course).

Yes, I think we have room for cleanups now, and I agree: we ended up reinstating some questionable code in the VM just because we didn't really know or understand what was going on in the ext3 journal code.

Hm, you attributed more to my mail than there was actually in it. I didn't even start to think of cleanups (because I don't know jack about the whole ext3/jbd stuff, so I simply cannot come up with any cleanups (yet?)). What I meant is that we only did a half-revert of that hackery. When try_to_free_buffers started to check for PG_dirty, the cancel_dirty_page call had to be called before do_invalidatepage, to "fix" a _huge_ leak. But that caused the accounting breakage we're now seeing, because we never account for the pages that got redirtied during do_invalidatepage. Then the change to try_to_free_buffers got reverted, so we no longer need to call cancel_dirty_page before do_invalidatepage, but still we do. Thus the accounting bug remains. So what I meant to suggest was simply to actually "finish" the revert we started.

Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate accounting, so I'll just watch you and Jan battle that out until someone comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?
If that "fixes" the problem for now, I'll try to come up with some usable commit message, or if somebody wants to beat me to it, you can already have my

Signed-off-by: Björn Steinbrink <[EMAIL PROTECTED]>

Checked with 2.6.24-rc5 + debug/fixup patch from Linus + above fix. After 3h there have been no warnings about __remove_from_page_cache(). So, it seems that it is OK.

Tested-by: Krzysztof Piotr Oledzki <[EMAIL PROTECTED]>

Best regards,
Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Friday 21 December 2007 06:24, Linus Torvalds wrote:
> On Thu, 20 Dec 2007, Jan Kara wrote:
> > As I wrote in my previous email, this solution works but hides the
> > fact that the page really *has* dirty data in it and *is* pinned in
> > memory until the commit code gets to writing it. So in theory it could
> > disturb the writeout logic by having more dirty data in memory than vm
> > thinks it has. Not that I'd have a better fix now but I wanted to point
> > out this problem.
>
> Well, I worry more about the VM being sane - and by the time we actually
> hit this case, as far as VM sanity is concerned, the page no longer really
> exists. It's been removed from the page cache, and it only really exists
> as any other random kernel allocation.

It does allow the VM to just not worry about this. However I don't really like these kinds of catch-all conditions that are hard to get rid of and can encourage bad behaviour. It would be nice if the "insane" things were made to clean up after themselves.

> The fact that low-level filesystems (in this case ext3 journaling) do
> their own insane things is not something the VM even _should_ care about.
> It's just an internal FS allocation, and the FS can do whatever the hell
> it wants with it, including doing IO etc.
>
> The kernel doesn't consider any other random IO pages to be "dirty" either
> (eg if you do direct-IO writes using low-level SCSI commands, the VM
> doesn't consider that to be any special dirty stuff, it's just random page
> allocations again). This is really no different.
>
> In other words: the Linux "VM" subsystem is really two different parts: the
> low-level page allocator (which obviously knows that the page is still in
> *use*, since it hasn't been free'd), and the higher-level file mapping and
> caching stuff that knows about things like page "dirtiness".
> And once
> you've done a "remove_from_page_cache()", the higher levels are no longer
> involved, and dirty accounting simply doesn't get into the picture.

That's all true... it would simply be nice to ask the filesystems to do this. But anyway I think your patch is pretty reasonable for the moment.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Bug 9182] Critical memory leak (dirty pages)
On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
>
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to follow
> it all the way down - I decided that I didn't actually really even care
> what the low-level FS layer did, I had already convinced myself that it
> obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.
Oh well, after seeing the move of cancel_dirty_page, I just went backwards from __set_page_dirty using cscope + some smart guessing and quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)

> > As try_to_free_buffers got its ext3 hack back in
> > ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> > 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
> > the accounting fix in cancel_dirty_page, of course).
>
> Yes, I think we have room for cleanups now, and I agree: we ended up
> reinstating some questionable code in the VM just because we didn't really
> know or understand what was going on in the ext3 journal code.

Hm, you attributed more to my mail than there was actually in it. I didn't even start to think of cleanups (because I don't know jack about the whole ext3/jbd stuff, so I simply cannot come up with any cleanups (yet?)). What I meant is that we only did a half-revert of that hackery. When try_to_free_buffers started to check for PG_dirty, the cancel_dirty_page call had to be called before do_invalidatepage, to "fix" a _huge_ leak. But that caused the accounting breakage we're now seeing, because we never account for the pages that got redirtied during do_invalidatepage. Then the change to try_to_free_buffers got reverted, so we no longer need to call cancel_dirty_page before do_invalidatepage, but still we do. Thus the accounting bug remains. So what I meant to suggest was simply to actually "finish" the revert we started.
Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate accounting, so I'll just watch you and Jan battle that out until someone comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run? If that "fixes" the problem for now, I'll try to come up with some usable commit message, or if somebody wants to beat me to it, you can already have my

Signed-off-by: Björn Steinbrink <[EMAIL PROTECTED]>

> > On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> > io accounting for cancelled writes always happened if the page
> > was dirty, regardless of page->mapping. This was also already true for
> > the old test_clear_page_dirty code, and the commit log for
> > 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> > change either, so maybe the "if (account_size)" block should be moved
> > out of the if "(mapping && ...)" block?
>
> I think the "if (account_size)" thing was *purely* for me being worried
> about hugetlb entries, and I think that's the only thing that passes in a
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Jan Kara wrote:
>
> As I wrote in my previous email, this solution works but hides the
> fact that the page really *has* dirty data in it and *is* pinned in memory
> until the commit code gets to writing it. So in theory it could disturb
> the writeout logic by having more dirty data in memory than vm thinks it
> has. Not that I'd have a better fix now but I wanted to point out this
> problem.

Well, I worry more about the VM being sane - and by the time we actually hit this case, as far as VM sanity is concerned, the page no longer really exists. It's been removed from the page cache, and it only really exists as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do their own insane things is not something the VM even _should_ care about. It's just an internal FS allocation, and the FS can do whatever the hell it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be "dirty" either (eg if you do direct-IO writes using low-level SCSI commands, the VM doesn't consider that to be any special dirty stuff, it's just random page allocations again). This is really no different.

In other words: the Linux "VM" subsystem is really two different parts: the low-level page allocator (which obviously knows that the page is still in *use*, since it hasn't been free'd), and the higher-level file mapping and caching stuff that knows about things like page "dirtiness". And once you've done a "remove_from_page_cache()", the higher levels are no longer involved, and dirty accounting simply doesn't get into the picture.

Linus
Re: [Bug 9182] Critical memory leak (dirty pages)
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to follow
> it all the way down - I decided that I didn't actually really even care
> what the low-level FS layer did, I had already convinced myself that it
> obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.

As I wrote in my previous email, this solution works but hides the fact that the page really *has* dirty data in it and *is* pinned in memory until the commit code gets to writing it. So in theory it could disturb the writeout logic by having more dirty data in memory than vm thinks it has.
Not that I'd have a better fix now but I wanted to point out this problem.

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Björn Steinbrink wrote:
>
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
>   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
>     ext3_invalidatepage()
>       journal_invalidatepage()
>         journal_unmap_buffer()
>           __dispose_buffer()
>             __journal_unfile_buffer()
>               __journal_temp_unlink_buffer()
>                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to journal_unmap_buffer(), but was too lazy to actually then bother to follow it all the way down - I decided that I didn't actually really even care what the low-level FS layer did, I had already convinced myself that it obviously must be dirtying the page some way, since that matched the symptoms exactly (ie only the journaling case was impacted, and this was all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem did at that point, the VM accounting shouldn't care, and should be robust in the face of a low-level filesystem doing strange and wonderful things. But thanks for bothering to go through the whole history and figure out what exactly is up.

> As try_to_free_buffers got its ext3 hack back in
> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
> the accounting fix in cancel_dirty_page, of course).
Yes, I think we have room for cleanups now, and I agree: we ended up reinstating some questionable code in the VM just because we didn't really know or understand what was going on in the ext3 journal code.

Of course, it may well be that there is something *else* going on too, but I do believe that this whole case is what it was all about, and the hacks end up just (a) making the VM harder to understand (because they cause non-obvious VM code to work around some very specific filesystem behaviour) and (b) the hacks obviously hid the _real_ issue, but I think we've established the real cause, and the hacks clearly weren't enough to really hide it 100% anyway.

However, there's no way I'll play with that right now (I'm planning on an -rc6 today), but it might be worth it to make a test-cleanup patch for -mm which does some VM cleanups:

 - don't touch dirty pages in fs/buffer.c (ie undo the meat of commit
   ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the
   debugging code)

 - remove the calling of "cancel_dirty_page()" entirely from
   "truncate_complete_page()", and let "remove_from_page_cache()" just
   always handle it (and probably just add a "ClearPageDirty()" to match
   the "ClearPageUptodate()").

 - remove "cancel_dirty_page()" from "truncate_huge_page()", which seems
   to be the exact same issue (ie we should just use the logic in
   remove_from_page_cache()).

At that point "cancel_dirty_page()" literally is only used for what its name implies, and the only in-tree use of it seems to be NFS for when the filesystem gets called for ->invalidatepage - which makes tons of conceptual sense, but I suspect we could drop it from there too, since the VM layer _will_ cancel the dirtiness at a VM level when it then later removes it from the page cache.

So we essentially seem to be able to simplify things a bit by getting rid of a hack in try_to_free_buffers(), and potentially getting rid of cancel_dirty_page() entirely.
It would imply that we need to do the task_io_account_cancelled_write() inside "remove_from_page_cache()", but that should be ok (I don't see any locking issues there).

> On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> io accounting for cancelled writes always happened if the page
> was dirty, regardless of page->mapping. This was also already true for
> the old test_clear_page_dirty code, and the commit log for
> 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> change either, so maybe the "if (account_size)" block should be moved
> out of the if "(mapping && ...)" block?

I think the "if (account_size)" thing was *purely* for me being worried about hugetlb entries, and I think that's the only thing that passes in a zero account size. But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we cannot account for those pages *anyway*), so I think we could go further than move the account_size outside of the test, I think we could probably remove that test entirely and drop the whole thing. The thing
Re: [Bug 9182] Critical memory leak (dirty pages)
> > On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> > >
> > > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > >
> > > > I'll confirm this tomorrow but it seems that even switching to
> > > > data=ordered (AFAIK default on ext3) is indeed enough to cure this
> > > > problem.
> > >
> > > Ok, do we actually have any ext3 expert following this? I have no idea
> > > about what the journalling code does, but I have painful memories of ext3
> > > doing really odd buffer-head-based IO and totally bypassing all the normal
> > > page dirty logic.
> > >
> > > Judging by the symptoms (sorry for not following this well, it came up
> > > while I was mostly away travelling), something probably *does* clear the
> > > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > > so the kernel keeps thinking it has dirty pages.
> > >
> > > Now, a simple "grep" shows that ext3 does not actually do any
> > > ClearPageDirty() or similar on its own, although maybe I missed some other
> > > subtle way this can happen. And the *normal* VFS routines that do
> > > ClearPageDirty should all be doing the proper accounting.
> > >
> > > So I see a couple of possible cases:
> > >
> > >  - actually clearing the PG_dirty bit somehow, without doing the
> > >    accounting.
> > >
> > >    This looks very unlikely. PG_dirty is always cleared by some variant of
> > >    "*ClearPageDirty()", and that bit definition isn't used for anything
> > >    else in the whole kernel judging by "grep" (the page allocator tests
> > >    the bit, that's it).
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
> >
> > If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> > called and neither is cancel_dirty_page, so the dirty pages accounting
> > is not decreased again.
>
> Yes, this can happen. The call to mark_buffer_dirty() is a fallout
> from journal_unfile_buffer() trying to synchronise JBD private dirty bit
> (jbddirty) with the standard dirty bit. We could actually clear the
> jbddirty bit before calling journal_unfile_buffer() so that this doesn't
> happen but since Linus changed remove_from_pagecache() to not care about
> redirtying the page I guess it's not needed any more...

Oops, sorry, I spoke too soon. After thinking more about it, I think we cannot clear the dirty bit (at least not jbddirty) in all cases and in fact moving cancel_dirty_page() after do_invalidatepage() call only hides the real problem.

Let's recap what JBD/ext3 code requires in case of truncation. A life-cycle of a journaled buffer looks as follows: When we want to write some data to it, it gets attached to the running transaction. When the transaction is committing, the buffer is written to the journal. Sometime later, the buffer is written to its final place in the filesystem - this is called checkpoint - and can be released. Now suppose a write to the buffer happens in one transaction and you truncate the buffer in the next one. You cannot just free the buffer immediately - it can for example happen, that the transaction in which the write happened hasn't committed yet.
So we just leave the dirty buffer there and it should be cleaned up later when the committing transaction writes the data where it needs... The problem is that when the commit code writes the buffer, it eventually calls try_to_free_buffers() but as Nick pointed out, ->mapping is set to NULL by that time so we don't even call cancel_dirty_page() and so the number of dirty pages is not properly decreased. Of course, we could decrease the number of dirty pages after we return from do_invalidatepage when clearing ->mapping but that would make dirty accounting imprecise - we really still have those dirty data which need writeout. But it's probably the best workaround I can currently think of.

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
> On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> >
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > >
> > > I'll confirm this tomorrow but it seems that even switching to
> > > data=ordered (AFAIK default on ext3) is indeed enough to cure this
> > > problem.
> >
> > Ok, do we actually have any ext3 expert following this? I have no idea
> > about what the journalling code does, but I have painful memories of ext3
> > doing really odd buffer-head-based IO and totally bypassing all the normal
> > page dirty logic.
> >
> > Judging by the symptoms (sorry for not following this well, it came up
> > while I was mostly away travelling), something probably *does* clear the
> > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > so the kernel keeps thinking it has dirty pages.
> >
> > Now, a simple "grep" shows that ext3 does not actually do any
> > ClearPageDirty() or similar on its own, although maybe I missed some other
> > subtle way this can happen. And the *normal* VFS routines that do
> > ClearPageDirty should all be doing the proper accounting.
> >
> > So I see a couple of possible cases:
> >
> >  - actually clearing the PG_dirty bit somehow, without doing the
> >    accounting.
> >
> >    This looks very unlikely. PG_dirty is always cleared by some variant of
> >    "*ClearPageDirty()", and that bit definition isn't used for anything
> >    else in the whole kernel judging by "grep" (the page allocator tests
> >    the bit, that's it).
>
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
>   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
>     ext3_invalidatepage()
>       journal_invalidatepage()
>         journal_unmap_buffer()
>           __dispose_buffer()
>             __journal_unfile_buffer()
>               __journal_temp_unlink_buffer()
>                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
>
> If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> called and neither is cancel_dirty_page, so the dirty pages accounting
> is not decreased again.

Yes, this can happen. The call to mark_buffer_dirty() is a fallout from journal_unfile_buffer() trying to synchronise JBD private dirty bit (jbddirty) with the standard dirty bit. We could actually clear the jbddirty bit before calling journal_unfile_buffer() so that this doesn't happen but since Linus changed remove_from_pagecache() to not care about redirtying the page I guess it's not needed any more...

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > I'll confirm this tomorrow but it seems that even switching to
> > data=ordered (AFAIK the default on ext3) is indeed enough to cure this
> > problem.
>
> Ok, do we actually have any ext3 expert following this? I have no idea
> about what the journalling code does, but I have painful memories of ext3
> doing really odd buffer-head-based IO and totally bypassing all the normal
> page dirty logic.
>
> Judging by the symptoms (sorry for not following this well, it came up
> while I was mostly away travelling), something probably *does* clear the
> dirty bit on the pages, but the dirty *accounting* is not done properly,
> so the kernel keeps thinking it has dirty pages.
>
> Now, a simple "grep" shows that ext3 does not actually do any
> ClearPageDirty() or similar on its own, although maybe I missed some other
> subtle way this can happen. And the *normal* VFS routines that do
> ClearPageDirty should all be doing the proper accounting.
>
> So I see a couple of possible cases:
>
>  - actually clearing the PG_dirty bit somehow, without doing the
>    accounting.
>
>    This looks very unlikely. PG_dirty is always cleared by some variant of
>    "*ClearPageDirty()", and that bit definition isn't used for anything
>    else in the whole kernel judging by "grep" (the page allocator tests
>    the bit, that's it).

OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.

Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

truncate_complete_page()
  cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
    ext3_invalidatepage()
      journal_invalidatepage()
        journal_unmap_buffer()
          __dispose_buffer()
            __journal_unfile_buffer()
              __journal_temp_unlink_buffer()
                mark_buffer_dirty();  // PG_dirty set, incr. dirty pages

If journal_unmap_buffer then returns 0, try_to_free_buffers is not called
and neither is cancel_dirty_page, so the dirty pages accounting is not
decreased again.

As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for the
accounting fix in cancel_dirty_page, of course.)

On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
io accounting for cancelled writes always happened if the page was dirty,
regardless of page->mapping. This was also already true for the old
test_clear_page_dirty code, and the commit log for
8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
change either, so maybe the "if (account_size)" block should be moved out
of the "if (mapping && ...)" block?

Björn
- not sending patches because he needs sleep and wouldn't have a damn
clue about what to write as a commit message anyway
Re: [Bug 9182] Critical memory leak (dirty pages)
> > On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> > > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > > I'll confirm this tomorrow but it seems that even switching to
> > > > data=ordered (AFAIK the default on ext3) is indeed enough to cure
> > > > this problem.
> > >
> > > Ok, do we actually have any ext3 expert following this? I have no idea
> > > about what the journalling code does, but I have painful memories of
> > > ext3 doing really odd buffer-head-based IO and totally bypassing all
> > > the normal page dirty logic.
> > >
> > > Judging by the symptoms (sorry for not following this well, it came up
> > > while I was mostly away travelling), something probably *does* clear
> > > the dirty bit on the pages, but the dirty *accounting* is not done
> > > properly, so the kernel keeps thinking it has dirty pages.
> > >
> > > Now, a simple "grep" shows that ext3 does not actually do any
> > > ClearPageDirty() or similar on its own, although maybe I missed some
> > > other subtle way this can happen. And the *normal* VFS routines that
> > > do ClearPageDirty should all be doing the proper accounting.
> > >
> > > So I see a couple of possible cases:
> > >
> > >  - actually clearing the PG_dirty bit somehow, without doing the
> > >    accounting.
> > >
> > >    This looks very unlikely. PG_dirty is always cleared by some
> > >    variant of "*ClearPageDirty()", and that bit definition isn't used
> > >    for anything else in the whole kernel judging by "grep" (the page
> > >    allocator tests the bit, that's it).
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
> >
> > If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> > called and neither is cancel_dirty_page, so the dirty pages accounting
> > is not decreased again.
>
> Yes, this can happen. The call to mark_buffer_dirty() is a fallout from
> journal_unfile_buffer() trying to synchronise the JBD private dirty bit
> (jbddirty) with the standard dirty bit. We could actually clear the
> jbddirty bit before calling journal_unfile_buffer() so that this doesn't
> happen, but since Linus changed remove_from_pagecache() to not care about
> redirtying the page, I guess it's not needed any more...

Oops, sorry, I spoke too soon. After thinking more about it, I think we
cannot clear the dirty bit (at least not jbddirty) in all cases, and in
fact moving cancel_dirty_page() after the do_invalidatepage() call only
hides the real problem.

Let's recap what the JBD/ext3 code requires in case of truncation. The
life-cycle of a journaled buffer looks as follows: when we want to write
some data to it, it gets attached to the running transaction. When the
transaction is committing, the buffer is written to the journal. Sometime
later, the buffer is written to its final place in the filesystem - this
is called checkpoint - and can be released.

Now suppose a write to the buffer happens in one transaction and you
truncate the buffer in the next one. You cannot just free the buffer
immediately - it can for example happen that the transaction in which the
write happened hasn't committed yet. So we just leave the dirty buffer
there and it should be cleaned up later when the committing transaction
writes the data where it needs...

The problem is that when the commit code writes the buffer, it eventually
calls try_to_free_buffers() but, as Nick pointed out, ->mapping is set to
NULL by that time, so we don't even call cancel_dirty_page() and so the
number of dirty pages is not properly decreased.

Of course, we could decrease the number of dirty pages after we return
from do_invalidatepage() when clearing ->mapping, but that would make
dirty accounting imprecise - we really still have those dirty data which
need writeout. But it's probably the best workaround I can currently
think of.

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
>   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
>     ext3_invalidatepage()
>       journal_invalidatepage()
>         journal_unmap_buffer()
>           __dispose_buffer()
>             __journal_unfile_buffer()
>               __journal_temp_unlink_buffer()
>                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to
journal_unmap_buffer(), but was too lazy to actually then bother to follow
it all the way down - I decided that I didn't actually really even care
what the low-level FS layer did, I had already convinced myself that it
obviously must be dirtying the page some way, since that matched the
symptoms exactly (ie only the journaling case was impacted, and this was
all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem
did at that point, the VM accounting shouldn't care, and should be robust
in the face of a low-level filesystem doing strange and wonderful things.
But thanks for bothering to go through the whole history and figure out
what exactly is up.

> As try_to_free_buffers got its ext3 hack back in
> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for the
> accounting fix in cancel_dirty_page, of course.)

Yes, I think we have room for cleanups now, and I agree: we ended up
reinstating some questionable code in the VM just because we didn't really
know or understand what was going on in the ext3 journal code.

Of course, it may well be that there is something *else* going on too, but
I do believe that this whole case is what it was all about, and the hacks
end up just (a) making the VM harder to understand (because they cause
non-obvious VM code to work around some very specific filesystem
behaviour) and (b) the hacks obviously hid the _real_ issue, but I think
we've established the real cause, and the hacks clearly weren't enough to
really hide it 100% anyway.

However, there's no way I'll play with that right now (I'm planning on an
-rc6 today), but it might be worth it to make a test-cleanup patch for -mm
which does some VM cleanups:

 - don't touch dirty pages in fs/buffer.c (ie undo the meat of commit
   ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the
   debugging code)

 - remove the calling of cancel_dirty_page() entirely from
   truncate_complete_page(), and let remove_from_page_cache() just always
   handle it (and probably just add a ClearPageDirty() to match the
   ClearPageUptodate()).

 - remove cancel_dirty_page() from truncate_huge_page(), which seems to
   be the exact same issue (ie we should just use the logic in
   remove_from_page_cache()).

at that point cancel_dirty_page() literally is only used for what its name
implies, and the only in-tree use of it seems to be NFS for when the
filesystem gets called for ->invalidatepage - which makes tons of
conceptual sense, but I suspect we could drop it from there too, since the
VM layer _will_ cancel the dirtiness at a VM level when it then later
removes it from the page cache.

So we essentially seem to be able to simplify things a bit by getting rid
of a hack in try_to_free_buffers(), and potentially getting rid of
cancel_dirty_page() entirely. It would imply that we need to do the
task_io_account_cancelled_write() inside remove_from_page_cache(), but
that should be ok (I don't see any locking issues there).

> On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> io accounting for cancelled writes always happened if the page was
> dirty, regardless of page->mapping. This was also already true for the
> old test_clear_page_dirty code, and the commit log for
> 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> change either, so maybe the "if (account_size)" block should be moved
> out of the "if (mapping && ...)" block?

I think the "if (account_size)" thing was *purely* for me being worried
about hugetlb entries, and I think that's the only thing that passes in a
zero account size. But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set
(exactly because we cannot account for those pages *anyway*), so I think
we could go further than moving the account_size outside of the test, I
think we could probably remove that test entirely and drop the whole
thing.

The thing is, task_io_account_cancelled_write() doesn't make sense on
Re: [Bug 9182] Critical memory leak (dirty pages)
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to
> follow it all the way down - I decided that I didn't actually really even
> care what the low-level FS layer did, I had already convinced myself that
> it obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.

As I wrote in my previous email, this solution works but hides the fact
that the page really *has* dirty data in it and *is* pinned in memory
until the commit code gets to writing it. So in theory it could disturb
the writeout logic by having more dirty data in memory than the VM thinks
it has. Not that I'd have a better fix now, but I wanted to point out this
problem.

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Jan Kara wrote:
> As I wrote in my previous email, this solution works but hides the fact
> that the page really *has* dirty data in it and *is* pinned in memory
> until the commit code gets to writing it. So in theory it could disturb
> the writeout logic by having more dirty data in memory than the VM
> thinks it has. Not that I'd have a better fix now, but I wanted to point
> out this problem.

Well, I worry more about the VM being sane - and by the time we actually
hit this case, as far as VM sanity is concerned, the page no longer really
exists. It's been removed from the page cache, and it only really exists
as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do
their own insane things is not something the VM even _should_ care about.
It's just an internal FS allocation, and the FS can do whatever the hell
it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be dirty either
(eg if you do direct-IO writes using low-level SCSI commands, the VM
doesn't consider that to be any special dirty stuff, it's just random page
allocations again). This is really no different.

In other words: the Linux VM subsystem is really two different parts: the
low-level page allocator (which obviously knows that the page is still in
*use*, since it hasn't been free'd), and the higher-level file mapping and
caching stuff that knows about things like page dirtiness. And once you've
done a remove_from_page_cache(), the higher levels are no longer involved,
and dirty accounting simply doesn't get into the picture.

Linus
Re: [Bug 9182] Critical memory leak (dirty pages)
On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> >   cancel_dirty_page()                 // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> >     ext3_invalidatepage()
> >       journal_invalidatepage()
> >         journal_unmap_buffer()
> >           __dispose_buffer()
> >             __journal_unfile_buffer()
> >               __journal_temp_unlink_buffer()
> >                 mark_buffer_dirty();  // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to
> follow it all the way down - I decided that I didn't actually really even
> care what the low-level FS layer did, I had already convinced myself that
> it obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.

Oh well, after seeing the move of cancel_dirty_page, I just went backwards
from __set_page_dirty using cscope + some smart guessing and quickly ended
up at ext3_invalidatepage, so it wasn't that hard :-)

> > As try_to_free_buffers got its ext3 hack back in
> > ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> > 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
> > the accounting fix in cancel_dirty_page, of course.)
>
> Yes, I think we have room for cleanups now, and I agree: we ended up
> reinstating some questionable code in the VM just because we didn't
> really know or understand what was going on in the ext3 journal code.

Hm, you attributed more to my mail than there was actually in it. I didn't
even start to think of cleanups (because I don't know jack about the whole
ext3/jbd stuff, so I simply cannot come up with any cleanups (yet?)).

What I meant is that we only did a half-revert of that hackery. When
try_to_free_buffers started to check for PG_dirty, the cancel_dirty_page
call had to be called before do_invalidatepage, to fix a _huge_ leak. But
that caused the accounting breakage we're now seeing, because we never
account for the pages that got redirtied during do_invalidatepage.

Then the change to try_to_free_buffers got reverted, so we no longer need
to call cancel_dirty_page before do_invalidatepage, but still we do. Thus
the accounting bug remains.

So what I meant to suggest was simply to actually finish the revert we
started. Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate
accounting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?
If that fixes the problem for now, I'll try to come up with some usable
commit message, or if someone wants to beat me to it, you can already have
my

Signed-off-by: Björn Steinbrink <[EMAIL PROTECTED]>

> > On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the
> > task io accounting for cancelled writes always happened if the page
> > was dirty, regardless of page->mapping. This was also already true for
> > the old test_clear_page_dirty code, and the commit log for
> > 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> > change either, so maybe the "if (account_size)" block should be moved
> > out of the "if (mapping && ...)" block?
>
> I think the "if (account_size)" thing was *purely* for me being worried
> about hugetlb entries, and I think that's the only thing that passes in
> a zero account size. But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set
> (exactly because we cannot account
Re: [Bug 9182] Critical memory leak (dirty pages)
On Friday 21 December 2007 06:24, Linus Torvalds wrote:
> On Thu, 20 Dec 2007, Jan Kara wrote:
> > As I wrote in my previous email, this solution works but hides the
> > fact that the page really *has* dirty data in it and *is* pinned in
> > memory until the commit code gets to writing it. So in theory it could
> > disturb the writeout logic by having more dirty data in memory than
> > the VM thinks it has. Not that I'd have a better fix now, but I wanted
> > to point out this problem.
>
> Well, I worry more about the VM being sane - and by the time we actually
> hit this case, as far as VM sanity is concerned, the page no longer
> really exists. It's been removed from the page cache, and it only really
> exists as any other random kernel allocation.

It does allow the VM to just not worry about this. However, I don't
really like these kinds of catch-all conditions that are hard to get rid
of and can encourage bad behaviour. It would be nice if the insane things
were made to clean up after themselves.

> The fact that low-level filesystems (in this case ext3 journaling) do
> their own insane things is not something the VM even _should_ care
> about. It's just an internal FS allocation, and the FS can do whatever
> the hell it wants with it, including doing IO etc.
>
> The kernel doesn't consider any other random IO pages to be dirty either
> (eg if you do direct-IO writes using low-level SCSI commands, the VM
> doesn't consider that to be any special dirty stuff, it's just random
> page allocations again). This is really no different.
>
> In other words: the Linux VM subsystem is really two different parts:
> the low-level page allocator (which obviously knows that the page is
> still in *use*, since it hasn't been free'd), and the higher-level file
> mapping and caching stuff that knows about things like page dirtiness.
> And once you've done a remove_from_page_cache(), the higher levels are
> no longer involved, and dirty accounting simply doesn't get into the
> picture.

That's all true... it would simply be nice to ask the filesystems to do
this. But anyway, I think your patch is pretty reasonable for the moment.
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thursday 20 December 2007 12:05, Jan Kara wrote: > > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote: > > > I'll confirm this tomorrow but it seems that even switching to > > > data=ordered (AFAIK default o ext3) is indeed enough to cure this > > > problem. > > > > Ok, do we actually have any ext3 expert following this? I have no idea > > about what the journalling code does, but I have painful memories of ext3 > > doing really odd buffer-head-based IO and totally bypassing all the > > normal page dirty logic. > > > > Judging by the symptoms (sorry for not following this well, it came up > > while I was mostly away travelling), something probably *does* clear the > > dirty bit on the pages, but the dirty *accounting* is not done properly, > > so the kernel keeps thinking it has dirty pages. > > > > Now, a simple "grep" shows that ext3 does not actually do any > > ClearPageDirty() or similar on its own, although maybe I missed some > > other subtle way this can happen. And the *normal* VFS routines that do > > ClearPageDirty should all be doing the proper accounting. > > > > So I see a couple of possible cases: > > > > - actually clearing the PG_dirty bit somehow, without doing the > >accounting. > > > >This looks very unlikely. PG_dirty is always cleared by some variant > > of "*ClearPageDirty()", and that bit definition isn't used for anything > > else in the whole kernel judging by "grep" (the page allocator tests the > > bit, that's it). > > > >And there aren't that many hits for ClearPageDirty, and they all seem > >to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if > > the mapping has dirty state accounting. 
> > > >The exceptions seem to be: > > - the page freeing path, but that path checks that "mapping" is NULL > > (so no accounting), and would complain loudly if it wasn't > > - the swap state stuff ("move_from_swap_cache()"), but that should > > only ever trigger for swap cache pages (we have a BUG_ON() in that > > path), and those don't do dirty accounting anyway. > > - pageout(), but again only for pages that have a NULL mapping. > > > > - ext3 might be clearing (probably indirectly) the "page->mapping" thing > >or similar, which in turn will make the VFS think that even a dirty > >page isn't actually to be accounted for - so when the page *turned* > >dirty, it was accounted as a dirty page, but then, when it was > > cleaned, the accounting wasn't reversed because ->mapping had become > > NULL. > > > >This would be some interaction with the truncation logic, and quite > >frankly, that should be all shared with the non-journal case, so I > > find this all very unlikely. > > > > However, that second case is interesting, because the pageout case > > actually has a comment like this: > > > > /* > > * Some data journaling orphaned pages can have > > * page->mapping == NULL while being dirty with clean buffers. > > */ > > > > which really sounds like the case in question. > > > > I may know the VM, but that special case was added due to insane > > journaling filesystems, and I don't know what insane things they do. > > Which is why I'm wondering if there is any ext3 person who knows the > > journaling code? > > Yes, I'm looking into the problem... I think those orphan pages > without mapping are created because we cannot drop truncated > buffers/pages immediately. There can be a committing transaction that > still needs the data in those buffers and until it commits we have to > keep the pages (and even maybe write them to disk etc.). 
But eventually, > we should write the buffers, call try_to_free_buffers() which calls > cancel_dirty_page() and everything should be happy... in theory ;) If mapping is NULL, then try_to_free_buffers won't call cancel_dirty_page, I think? I don't know whether ext3 can be changed to not require/allow these dirty pages, but I would rather Linus's dirty page accounting fix go into that path (the /* can this still happen? */ in try_to_free_buffers()), if possible. Then you could also have a WARN_ON in __remove_from_page_cache().
Re: [Bug 9182] Critical memory leak (dirty pages)
> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote: > > > > I'll confirm this tomorrow but it seems that even switching to data=ordered > > (AFAIK default o ext3) is indeed enough to cure this problem. > > Ok, do we actually have any ext3 expert following this? I have no idea > about what the journalling code does, but I have painful memories of ext3 > doing really odd buffer-head-based IO and totally bypassing all the normal > page dirty logic. > > Judging by the symptoms (sorry for not following this well, it came up > while I was mostly away travelling), something probably *does* clear the > dirty bit on the pages, but the dirty *accounting* is not done properly, > so the kernel keeps thinking it has dirty pages. > > Now, a simple "grep" shows that ext3 does not actually do any > ClearPageDirty() or similar on its own, although maybe I missed some other > subtle way this can happen. And the *normal* VFS routines that do > ClearPageDirty should all be doing the proper accounting. > > So I see a couple of possible cases: > > - actually clearing the PG_dirty bit somehow, without doing the >accounting. > >This looks very unlikely. PG_dirty is always cleared by some variant of >"*ClearPageDirty()", and that bit definition isn't used for anything >else in the whole kernel judging by "grep" (the page allocator tests >the bit, that's it). > >And there aren't that many hits for ClearPageDirty, and they all seem >to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the >mapping has dirty state accounting. > >The exceptions seem to be: > - the page freeing path, but that path checks that "mapping" is NULL > (so no accounting), and would complain loudly if it wasn't > - the swap state stuff ("move_from_swap_cache()"), but that should > only ever trigger for swap cache pages (we have a BUG_ON() in that > path), and those don't do dirty accounting anyway. > - pageout(), but again only for pages that have a NULL mapping. 
> > - ext3 might be clearing (probably indirectly) the "page->mapping" thing >or similar, which in turn will make the VFS think that even a dirty >page isn't actually to be accounted for - so when the page *turned* >dirty, it was accounted as a dirty page, but then, when it was cleaned, >the accounting wasn't reversed because ->mapping had become NULL. > >This would be some interaction with the truncation logic, and quite >frankly, that should be all shared with the non-journal case, so I find >this all very unlikely. > > However, that second case is interesting, because the pageout case > actually has a comment like this: > > /* >* Some data journaling orphaned pages can have >* page->mapping == NULL while being dirty with clean buffers. >*/ > > which really sounds like the case in question. > > I may know the VM, but that special case was added due to insane > journaling filesystems, and I don't know what insane things they do. Which > is why I'm wondering if there is any ext3 person who knows the journaling > code? Yes, I'm looking into the problem... I think those orphan pages without mapping are created because we cannot drop truncated buffers/pages immediately. There can be a committing transaction that still needs the data in those buffers and until it commits we have to keep the pages (and even maybe write them to disk etc.). But eventually, we should write the buffers, call try_to_free_buffers() which calls cancel_dirty_page() and everything should be happy... in theory ;) In practice, I have not yet narrowed down where the problem is. fsx-linux is able to trigger the problem on my test machine so as suspected it is some bad interaction of writes (plain writes, no mmap), truncates and probably writeback. Small tests don't seem to trigger the problem (fsx needs at least few hundreds operations to trigger the problem) - on the other hand when some sequence of operations causes lost dirty pages, they are lost deterministically in every run. 
Also the file fsx operates on can be fairly small - 2MB was enough - so page reclaim and such stuff probably isn't the thing we interact with. Tomorrow I'll try more... > How/when does it ever "orphan" pages? Because yes, if it ever does that, > and clears the ->mapping field on a mapped page, then that page will have > incremented the dirty counts when it became dirty, but will *not* > decrement the dirty count when it is an orphan. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007, Krzysztof Oledzki wrote: > > I'll confirm this tomorrow but it seems that even switching to data=ordered > (AFAIK default o ext3) is indeed enough to cure this problem. Ok, do we actually have any ext3 expert following this? I have no idea about what the journalling code does, but I have painful memories of ext3 doing really odd buffer-head-based IO and totally bypassing all the normal page dirty logic. Judging by the symptoms (sorry for not following this well, it came up while I was mostly away travelling), something probably *does* clear the dirty bit on the pages, but the dirty *accounting* is not done properly, so the kernel keeps thinking it has dirty pages. Now, a simple "grep" shows that ext3 does not actually do any ClearPageDirty() or similar on its own, although maybe I missed some other subtle way this can happen. And the *normal* VFS routines that do ClearPageDirty should all be doing the proper accounting. So I see a couple of possible cases: - actually clearing the PG_dirty bit somehow, without doing the accounting. This looks very unlikely. PG_dirty is always cleared by some variant of "*ClearPageDirty()", and that bit definition isn't used for anything else in the whole kernel judging by "grep" (the page allocator tests the bit, that's it). And there aren't that many hits for ClearPageDirty, and they all seem to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the mapping has dirty state accounting. The exceptions seem to be: - the page freeing path, but that path checks that "mapping" is NULL (so no accounting), and would complain loudly if it wasn't - the swap state stuff ("move_from_swap_cache()"), but that should only ever trigger for swap cache pages (we have a BUG_ON() in that path), and those don't do dirty accounting anyway. - pageout(), but again only for pages that have a NULL mapping. 
- ext3 might be clearing (probably indirectly) the "page->mapping" thing or similar, which in turn will make the VFS think that even a dirty page isn't actually to be accounted for - so when the page *turned* dirty, it was accounted as a dirty page, but then, when it was cleaned, the accounting wasn't reversed because ->mapping had become NULL. This would be some interaction with the truncation logic, and quite frankly, that should be all shared with the non-journal case, so I find this all very unlikely. However, that second case is interesting, because the pageout case actually has a comment like this: /* * Some data journaling orphaned pages can have * page->mapping == NULL while being dirty with clean buffers. */ which really sounds like the case in question. I may know the VM, but that special case was added due to insane journaling filesystems, and I don't know what insane things they do. Which is why I'm wondering if there is any ext3 person who knows the journaling code? How/when does it ever "orphan" pages? Because yes, if it ever does that, and clears the ->mapping field on a mapped page, then that page will have incremented the dirty counts when it became dirty, but will *not* decrement the dirty count when it is an orphan. > Two questions remain then: why system dies when dirty reaches ~200MB and what > is wrong with ext3+data=journal with >=2.6.20-rc2? Well, that one is probably pretty straightforward: since the kernel thinks that there are too many dirty pages, it will ask everybody who creates more dirty pages to clean out some *old* dirty pages, but since they don't exist, the whole thing will basically wait forever for a writeout to clean things out that will never happen. 200MB is 10% of your 2GB of low-mem RAM, and 10% is the default dirty_ratio that causes synchronous waits for writeback. If you use the normal 3:1 VM split, the hang should happen even earlier (at the ~100MB "dirty" mark). So that part isn't the bug. 
The bug is in the accounting, but I'm pretty damn sure that the core VM itself is pretty ok, since that code has now been stable for people for the last year or so. It seems that ext3 (with data journaling) does something dodgy wrt some page. But how about trying this appended patch. It should warn a few times if some page is ever removed from a mapping while it's dirty (and the mapping is one that should have been accounted). It also tries to "fix up" the case, so *if* this is the cause, it should also fix the bug. I'd love to hear if you get any stack dumps with this, and what the backtrace is (and whether the dirty counts then stay ok). The patch is totally untested. It compiles for me. That's all I can say. (There's a few other places that set ->mapping to NULL, but they're pretty esoteric. Page migration? Stuff like that). Linus

---
 mm/filemap.c | 12
 1 files changed, 12 insertions(+), 0 deletions(-)
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007, Andrew Morton wrote: On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote: Which filesystem, which mount options - ext3 on RAID1 (MD): / - rootflags=data=journal It wouldn't surprise me if this is specific to data=journal: that journalling mode is pretty complex wrt dirty-data handling and isn't well tested. Does switching that to data=writeback change things? I'll confirm this tomorrow but it seems that even switching to data=ordered (AFAIK the default on ext3) is indeed enough to cure this problem. yes, sorry, I meant ordered. OK, I can confirm that the problem is with data=journal. With data=ordered I get: # uname -rns;uptime;sync;sleep 1;sync ;sleep 1; sync;grep Dirty /proc/meminfo Linux cougar 2.6.24-rc5 17:50:34 up 1 day, 20 min, 1 user, load average: 0.99, 0.48, 0.35 Dirty: 0 kB Two questions remain then: why the system dies when dirty reaches ~200MB I think you have ~2G of RAM and you're running with /proc/sys/vm/dirty_ratio=10, yes? If so, when that machine hits 10% * 2G of dirty memory then everyone who wants to dirty pages gets blocked. Oh, right. Thank you for the explanation. and what is wrong with ext3+data=journal with >=2.6.20-rc2? Ah. It has a bug in it ;) As I said, data=journal has exceptional handling of pagecache data and is not well tested. Someone (and I'm not sure who) will need to get in there and fix it. OK, I'm willing to test it. ;) Best regards, Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
> On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL > PROTECTED]> wrote: > > > >>> Which filesystem, which mount options > > >> > > >> - ext3 on RAID1 (MD): / - rootflags=data=journal > > > > > > It wouldn't surprise me if this is specific to data=journal: that > > > journalling mode is pretty complex wrt dirty-data handling and isn't well > > > tested. > > > > > > Does switching that to data=writeback change things? > > > > I'll confirm this tomorrow but it seems that even switching to > > data=ordered (AFAIK the default on ext3) is indeed enough to cure this problem. > > yes, sorry, I meant ordered. > > > Two questions remain then: why the system dies when dirty reaches ~200MB > > I think you have ~2G of RAM and you're running with > /proc/sys/vm/dirty_ratio=10, yes? > > If so, when that machine hits 10% * 2G of dirty memory then everyone who > wants to dirty pages gets blocked. > > > and what is wrong with ext3+data=journal with >=2.6.20-rc2? > > Ah. It has a bug in it ;) > > As I said, data=journal has exceptional handling of pagecache data and is > not well tested. Someone (and I'm not sure who) will need to get in there > and fix it. It seems fsx-linux is able to trigger the leak on my test machine so I'll have a look into it (not sure if I'll get to it today but I should find some time for it this week)... Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote: > >>> Which filesystem, which mount options > >> > >> - ext3 on RAID1 (MD): / - rootflags=data=journal > > > > It wouldn't surprise me if this is specific to data=journal: that > > journalling mode is pretty complex wrt dirty-data handling and isn't well > > tested. > > > > Does switching that to data=writeback change things? > > I'll confirm this tomorrow but it seems that even switching to > data=ordered (AFAIK the default on ext3) is indeed enough to cure this problem. yes, sorry, I meant ordered. > Two questions remain then: why the system dies when dirty reaches ~200MB I think you have ~2G of RAM and you're running with /proc/sys/vm/dirty_ratio=10, yes? If so, when that machine hits 10% * 2G of dirty memory then everyone who wants to dirty pages gets blocked. > and what is wrong with ext3+data=journal with >=2.6.20-rc2? Ah. It has a bug in it ;) As I said, data=journal has exceptional handling of pagecache data and is not well tested. Someone (and I'm not sure who) will need to get in there and fix it.
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007, Andrew Morton wrote:
> On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> > On Sat, 15 Dec 2007, Andrew Morton wrote:
> > > On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> > > > On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> > > > > http://bugzilla.kernel.org/show_bug.cgi?id=9182
> > > > >
> > > > > --- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---
> > > > > Krzysztof, I'd hate to point you to a hard path (at least time consuming), but you've done a lot of digging by now anyway. How about git bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html
> > > >
> > > > As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;)
> > > >
> > > > The guilty patch is the one I proposed just an hour ago:
> > > > http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> > > >
> > > > So:
> > > > - 2.6.20-rc1: OK
> > > > - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> > > > - 2.6.20-rc1-git8: very BAD
> > > > - 2.6.20-rc2: very BAD
> > > > - 2.6.20-rc4: very BAD
> > > > - >= 2.6.20: BAD (but not *very* BAD!)
> > >
> > > well.. We have code which has been used by *everyone* for a year and it's misbehaving for you alone.
> >
> > No, not for me alone. Probably only I and Thomas Osterried have systems where it is so easy to reproduce. Please note that the problem exists on all my systems, but only on one it is critical. It is enough to run "sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.
> >
> > With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only on one it goes to ~200MB in about 2 weeks and then everything dies:
> > http://bugzilla.kernel.org/attachment.cgi?id=13824
> > http://bugzilla.kernel.org/attachment.cgi?id=13825
> > http://bugzilla.kernel.org/attachment.cgi?id=13826
> > http://bugzilla.kernel.org/attachment.cgi?id=13827
> >
> > > I wonder what you're doing that is different/special.
> > Me too. :|
> >
> > > Which filesystem, which mount options
> > - ext3 on RAID1 (MD): / - rootflags=data=journal
>
> It wouldn't surprise me if this is specific to data=journal: that journalling mode is pretty complex wrt dirty-data handling and isn't well tested.
>
> Does switching that to data=writeback change things?

I'll confirm this tomorrow but it seems that even switching to data=ordered (AFAIK default on ext3) is indeed enough to cure this problem.

Two questions remain then: why system dies when dirty reaches ~200MB and what is wrong with ext3+data=journal with >=2.6.20-rc2?

Best regards,
Krzysztof Olędzki
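The settle check used throughout the thread ("sync; sleep 1; sync; ...; grep Dirty /proc/meminfo") can be wrapped in a tiny helper. To keep this sketch self-contained it parses meminfo-formatted text from stdin rather than a live /proc; the helper name is mine:

```shell
# dirty_kb: extract the Dirty figure (in kB) from meminfo-formatted text.
dirty_kb() {
    awk '/^Dirty:/ { print $2 }'
}

# Assumed sample snapshot; after a few syncs on an idle, healthy
# kernel this should read 0.
printf 'MemTotal: 2073240 kB\nDirty: 23820 kB\n' | dirty_kb   # -> 23820
```

On a live system: `sync; sleep 1; sync; dirty_kb < /proc/meminfo` — a value that never settles to 0 on an idle machine is the symptom this thread is chasing.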
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007, [EMAIL PROTECTED] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9182
>
> --- Comment #39 from [EMAIL PROTECTED] 2007-12-16 01:58 ---
> > So:
> > - 2.6.20-rc1: OK
> > - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> > - 2.6.20-rc1-git8: very BAD
> > - 2.6.20-rc2: very BAD
> > - 2.6.20-rc4: very BAD
> > - >= 2.6.20: BAD (but not *very* BAD!)
>
> based on the great info you already acquired, you should be able to bisect this rather effectively, via:
>
> 2.6.20-rc1-git8 == 921320210bd2ec4f17053d283355b73048ac0e56
>
> $ git-bisect start
> $ git-bisect bad 921320210bd2ec4f17053d283355b73048ac0e56
> $ git-bisect good v2.6.20-rc1
> Bisecting: 133 revisions left to test after this
>
> so about 7-8 bootups would pinpoint the breakage.

Except that I have very limited time where I can do my tests on this host. Please also note that it takes about ~2h after a reboot, to be 100% sure. So, 7-8 bootups => 14-16h. :|

> It would likely pinpoint fba2591b, so it would perhaps be best to first attempt a revert of fba2591b on a recent kernel.

I wish I could: :(

[EMAIL PROTECTED]:/usr/src/linux-2.6.23.9$ cat ..p1 |patch -p1 --dry-run -R
patching file fs/hugetlbfs/inode.c
Hunk #1 succeeded at 203 (offset 27 lines).
patching file include/linux/page-flags.h
Hunk #1 succeeded at 262 (offset 9 lines).
patching file mm/page-writeback.c
Hunk #1 succeeded at 903 (offset 58 lines).
patching file mm/truncate.c
Unreversed patch detected!  Ignore -R? [n] y
Hunk #1 succeeded at 52 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 85.
Hunk #3 FAILED at 365.
Hunk #4 FAILED at 400.
3 out of 4 hunks FAILED -- saving rejects to file mm/truncate.c.rej

Best regards,
Krzysztof Olędzki
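The "about 7-8 bootups" estimate quoted above is just the bisection halving: each git-bisect step roughly halves the candidate set, so the step count is about log2 of the revision count. A sketch of that arithmetic (helper name is mine):

```shell
# bisect_steps: how many test/boot cycles a bisection over n candidate
# revisions needs, halving (rounding up) until one candidate remains.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# "Bisecting: 133 revisions left to test after this":
bisect_steps 133   # -> 8
```

At ~2 hours of soak time per boot to be sure a kernel is good, 8 steps is indeed the 14-16 hours the reporter estimates.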
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> On Sat, 15 Dec 2007, Andrew Morton wrote:
> > On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> >> On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> >>> http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >>>
> >>> --- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---
> >>> Krzysztof, I'd hate to point you to a hard path (at least time consuming), but you've done a lot of digging by now anyway. How about git bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html
> >>
> >> As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;)
> >>
> >> The guilty patch is the one I proposed just an hour ago:
> >> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> >>
> >> So:
> >> - 2.6.20-rc1: OK
> >> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> >> - 2.6.20-rc1-git8: very BAD
> >> - 2.6.20-rc2: very BAD
> >> - 2.6.20-rc4: very BAD
> >> - >= 2.6.20: BAD (but not *very* BAD!)
> >
> > well.. We have code which has been used by *everyone* for a year and it's misbehaving for you alone.
>
> No, not for me alone. Probably only I and Thomas Osterried have systems where it is so easy to reproduce. Please note that the problem exists on all my systems, but only on one it is critical. It is enough to run "sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.
>
> With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only on one it goes to ~200MB in about 2 weeks and then everything dies:
> http://bugzilla.kernel.org/attachment.cgi?id=13824
> http://bugzilla.kernel.org/attachment.cgi?id=13825
> http://bugzilla.kernel.org/attachment.cgi?id=13826
> http://bugzilla.kernel.org/attachment.cgi?id=13827
>
> > I wonder what you're doing that is different/special.
> Me too. :|
>
> > Which filesystem, which mount options
>
> - ext3 on RAID1 (MD): / - rootflags=data=journal

It wouldn't surprise me if this is specific to data=journal: that journalling mode is pretty complex wrt dirty-data handling and isn't well tested.

Does switching that to data=writeback change things?

Thomas, do you have ext3 data=journal on any filesystems?

> - ext3 on LVM on RAID5 (MD)
> - nfs
>
> /dev/md0 on / type ext3 (rw)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
> devpts on /dev/pts type devpts (rw,nosuid,noexec)
> /dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
> shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
> usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
> owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)
>
> > what sort of workload?
>
> Different, depending on a host: mail (postfix + amavisd + spamassassin + clamav + sqlgrey), squid, mysql, apache, nfs, rsync. But it seems that the biggest problem is on the host running mentioned mail service.
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sat, 15 Dec 2007, Andrew Morton wrote:
> On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> > On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=9182
> > >
> > > --- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---
> > > Krzysztof, I'd hate to point you to a hard path (at least time consuming), but you've done a lot of digging by now anyway. How about git bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html
> >
> > As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;)
> >
> > The guilty patch is the one I proposed just an hour ago:
> > http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> >
> > So:
> > - 2.6.20-rc1: OK
> > - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> > - 2.6.20-rc1-git8: very BAD
> > - 2.6.20-rc2: very BAD
> > - 2.6.20-rc4: very BAD
> > - >= 2.6.20: BAD (but not *very* BAD!)
>
> well.. We have code which has been used by *everyone* for a year and it's misbehaving for you alone.

No, not for me alone. Probably only I and Thomas Osterried have systems where it is so easy to reproduce. Please note that the problem exists on all my systems, but only on one it is critical. It is enough to run "sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.

With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only on one it goes to ~200MB in about 2 weeks and then everything dies:
http://bugzilla.kernel.org/attachment.cgi?id=13824
http://bugzilla.kernel.org/attachment.cgi?id=13825
http://bugzilla.kernel.org/attachment.cgi?id=13826
http://bugzilla.kernel.org/attachment.cgi?id=13827

> I wonder what you're doing that is different/special.
Me too. :|

> Which filesystem, which mount options
- ext3 on RAID1 (MD): / - rootflags=data=journal
- ext3 on LVM on RAID5 (MD)
- nfs

/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)

> what sort of workload?
Different, depending on a host: mail (postfix + amavisd + spamassassin + clamav + sqlgrey), squid, mysql, apache, nfs, rsync. But it seems that the biggest problem is on the host running mentioned mail service.

> Thanks.

Best regards,
Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:
> On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >
> > --- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---
> > Krzysztof, I'd hate to point you to a hard path (at least time consuming), but you've done a lot of digging by now anyway. How about git bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html
>
> As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;)
>
> The guilty patch is the one I proposed just an hour ago:
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
>
> So:
> - 2.6.20-rc1: OK
> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> - 2.6.20-rc1-git8: very BAD
> - 2.6.20-rc2: very BAD
> - 2.6.20-rc4: very BAD
> - >= 2.6.20: BAD (but not *very* BAD!)

well.. We have code which has been used by *everyone* for a year and it's misbehaving for you alone.

I wonder what you're doing that is different/special. Which filesystem, which mount options, what sort of workload?

Thanks.
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9182
>
> --- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---
> Krzysztof, I'd hate to point you to a hard path (at least time consuming), but you've done a lot of digging by now anyway. How about git bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html

As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;)

The guilty patch is the one I proposed just an hour ago:
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
- 2.6.20-rc1: OK
- 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
- 2.6.20-rc1-git8: very BAD
- 2.6.20-rc2: very BAD
- 2.6.20-rc4: very BAD
- >= 2.6.20: BAD (but not *very* BAD!)

Best regards,
Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
http://bugzilla.kernel.org/show_bug.cgi?id=9182

On Sat, 15 Dec 2007, Krzysztof Oledzki wrote:
> On Thu, 13 Dec 2007, Krzysztof Oledzki wrote:
> > On Thu, 13 Dec 2007, Peter Zijlstra wrote:
> > > On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:
> > > > BTW: Could someone please look at this problem? I feel a little ignored and in my situation this is a critical regression.
> > >
> > > I was hoping to get around to it today, but I guess tomorrow will have to do :-/
> >
> > Thanks.
> >
> > > So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0, right?
> > Not only doesn't fall but continuously grows.
> >
> > > Does it happen with other filesystems as well?
> > Don't know. I generally only use ext3 and I'm afraid I'm not able to switch this system to other filesystem.
> >
> > > What are your ext3 mount options?
> > /dev/root / ext3 rw,data=journal 0 0
> > /dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
> > /dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
> > /dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
> > /dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
> > /dev/VolGrp0/news_spool /var/spool/news ext3 rw,nosuid,nodev,noatime,data=ordered 0 0
> >
> > BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it was introduced but it is hard to do it on a highly critical production system, especially since it takes ~2h after a reboot, to be sure. However, 2h is quite good time; on other systems I have to wait ~2 months to get 20MB of leaked memory:
> >
> > # uptime
> >  13:29:34 up 58 days, 13:04, 9 users, load average: 0.38, 0.27, 0.31
> > # sync;sync;sleep 1;sync;grep Dirt /proc/meminfo
> > Dirty: 23820 kB
>
> More news. I hope this time my problem gets more attention from developers since now I have much more information. So far I found that:
> - 2.6.20-rc4 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14057
> - 2.6.20-rc2 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14058
> - 2.6.20-rc1 - OK (probably, I need to wait little more to be 100% sure).

2.6.20-rc1 with 33m uptime:
~$ grep Dirt /proc/meminfo ;sync ; sleep 1 ; sync ; grep Dirt /proc/meminfo
Dirty: 10504 kB
Dirty: 0 kB

2.6.20-rc2 was released Dec 23/24 2006 (BAD)
2.6.20-rc1 was released Dec 13/14 2006 (GOOD?)

It seems that this bug was introduced exactly one year ago. Surprisingly, dirty memory in 2.6.20-rc2/2.6.20-rc4 leaks _much_ faster than in 2.6.20-final and later kernels, as it took only about 6h to reach 172MB. So, this bug might have been cured afterward, but only a little.

There are three commits that may be somehow related:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=3e67c0987d7567ad41164a153dca9a43b11d
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=5f2a105d5e33a038a717995d2738434f9c25aed2

I'm going to check 2.6.20-rc1-git... releases but it would be *very* nice if someone could finally give me a hand and point some hints helping debugging this problem. Please note that none of my systems with kernels >= 2.6.20-rc1 is able to reach 0 kB of dirty memory, even after many syncs, even when idle.

Best regards,
Krzysztof Olędzki
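The two leak rates reported in this thread differ by roughly three orders of magnitude, which supports the "leaks much faster" observation. A sketch with the figures taken from the messages above (~172 MB in ~6 h on 2.6.20-rc2, vs 23820 kB over ~58 days on a later kernel; the helper name is mine):

```shell
# leak_rate_kb_per_h: leaked kilobytes divided by elapsed hours
# (integer arithmetic, so fractions are truncated).
leak_rate_kb_per_h() {
    echo $(( $1 / $2 ))
}

leak_rate_kb_per_h $(( 172 * 1024 )) 6     # 2.6.20-rc2: ~29000 kB/h
leak_rate_kb_per_h 23820 $(( 58 * 24 ))    # later kernels: ~17 kB/h
```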
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 13 Dec 2007, Krzysztof Oledzki wrote:
On Thu, 13 Dec 2007, Peter Zijlstra wrote:
On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:

> BTW: Could someone please look at this problem? I feel a little ignored and in my situation this is a critical regression.

I was hoping to get around to it today, but I guess tomorrow will have to do :-/

Thanks.

> So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0, right?

Not only doesn't fall but continuously grows.

> Does it happen with other filesystems as well?

Don't know. I generally only use ext3 and I'm afraid I'm not able to switch this system to other filesystem.

> What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 rw,nosuid,nodev,noatime,data=ordered 0 0

BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it was introduced but it is hard to do it on a highly critical production system, especially since it takes ~2h after a reboot, to be sure. However, 2h is quite good time; on other systems I have to wait ~2 months to get 20MB of leaked memory:

# uptime
 13:29:34 up 58 days, 13:04, 9 users, load average: 0.38, 0.27, 0.31
# sync;sync;sleep 1;sync;grep Dirt /proc/meminfo
Dirty: 23820 kB

Best regards,
Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 13 Dec 2007, Krzysztof Oledzki wrote: On Thu, 13 Dec 2007, Peter Zijlstra wrote: On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote: BTW: Could someone please look at this problem? I feel little ignored and in my situation this is a critical regression. I was hoping to get around to it today, but I guess tomorrow will have to do :-/ Thanks. So, its ext3, dirty some pages, sync, and dirty doesn't fall to 0, right? Not only doesn't fall but continuously grows. Does it happen with other filesystems as well? Don't know. I generally only use ext3 and I'm afraid I'm not able to switch this system to other filesystem. What are you ext3 mount options? /dev/root / ext3 rw,data=journal 0 0 /dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0 /dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0 /dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0 /dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0 /dev/VolGrp0/news_spool /var/spool/news ext3 rw,nosuid,nodev,noatime,data=ordered 0 0 BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it was introduced but it is hard to do it on a highly critical production system, especially since it takes ~2h after a reboot, to be sure. However, 2h is quite good time, on other systems I have to wait ~2 months to get 20MB of leaked memory: # uptime 13:29:34 up 58 days, 13:04, 9 users, load average: 0.38, 0.27, 0.31 # sync;sync;sleep 1;sync;grep Dirt /proc/meminfo Dirty: 23820 kB Best regards, Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
http://bugzilla.kernel.org/show_bug.cgi?id=9182

On Sat, 15 Dec 2007, Krzysztof Oledzki wrote: On Thu, 13 Dec 2007, Krzysztof Oledzki wrote: On Thu, 13 Dec 2007, Peter Zijlstra wrote: On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:

BTW: Could someone please look at this problem? I feel a little ignored, and in my situation this is a critical regression.

I was hoping to get around to it today, but I guess tomorrow will have to do :-/

Thanks.

So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0, right?

Not only doesn't it fall, it continuously grows.

Does it happen with other filesystems as well?

Don't know. I generally only use ext3 and I'm afraid I'm not able to switch this system to another filesystem.

What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 rw,nosuid,nodev,noatime,data=ordered 0 0

BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it was introduced, but it is hard to do on a highly critical production system, especially since it takes ~2h after a reboot to be sure. However, 2h is quite a good turnaround; on other systems I have to wait ~2 months to get 20MB of leaked memory:

# uptime
13:29:34 up 58 days, 13:04, 9 users, load average: 0.38, 0.27, 0.31
# sync;sync;sleep 1;sync;grep Dirt /proc/meminfo
Dirty: 23820 kB

More news. I hope this time my problem gets more attention from developers, since now I have much more information. So far I found that:

- 2.6.20-rc4 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14057
- 2.6.20-rc2 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14058
- 2.6.20-rc1 - OK (probably, I need to wait a little longer to be 100% sure).

2.6.20-rc1 with 33m uptime:

~$ grep Dirt /proc/meminfo ;sync ; sleep 1 ; sync ; grep Dirt /proc/meminfo
Dirty: 10504 kB
Dirty: 0 kB

2.6.20-rc2 was released Dec 23/24 2006 (BAD)
2.6.20-rc1 was released Dec 13/14 2006 (GOOD?)

It seems that this bug was introduced exactly one year ago. Surprisingly, dirty memory in 2.6.20-rc2/2.6.20-rc4 leaks _much_ faster than in 2.6.20-final and later kernels, as it took only about 6h to reach 172MB. So, this bug might have been partially cured afterward, but only a little. There are three commits that may be somehow related:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=3e67c0987d7567ad41164a153dca9a43b11d
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=5f2a105d5e33a038a717995d2738434f9c25aed2

I'm going to check the 2.6.20-rc1-git... releases, but it would be *very* nice if someone could finally give me a hand and point out some hints to help debug this problem. Please note that none of my systems with kernels >= 2.6.20-rc1 is able to reach 0 kB of dirty memory, even after many syncs, even when idle.

Best regards, Krzysztof Olędzki
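The leak check used throughout this thread — sync a few times, then read the Dirty: line — can be scripted. A minimal shell sketch follows; the sample meminfo excerpt is fabricated for illustration (on a live system the values would come from /proc/meminfo itself):

```shell
#!/bin/sh
# Print the Dirty: value (in kB) from a meminfo-formatted file.
dirty_kb() { awk '/^Dirty:/ {print $2}' "$1"; }

# Fabricated /proc/meminfo excerpt, matching the format quoted above.
cat > /tmp/meminfo.sample <<'EOF'
Dirty:             23820 kB
Writeback:             0 kB
EOF

dirty_kb /tmp/meminfo.sample   # prints 23820

# On a live Linux box the actual check from the thread would be:
#   sync; sync; sleep 1; sync
#   dirty_kb /proc/meminfo     # a healthy idle system drops to (near) 0
```

A value that never returns to (near) zero after repeated syncs on an idle system is the symptom being bisected here.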
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9182

--- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---

Krzysztof, I'd hate to point you down a hard path (at least a time-consuming one), but you've done a lot of digging by now anyway. How about git-bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html

As I'm smarter than git-bisect, I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2, but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;) The guilty patch is the one I proposed just an hour ago:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
- 2.6.20-rc1: OK
- 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
- 2.6.20-rc1-git8: very BAD
- 2.6.20-rc2: very BAD
- 2.6.20-rc4: very BAD
- >= 2.6.20: BAD (but not *very* BAD!)

Best regards, Krzysztof Olędzki
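For readers following the suggested git-bisect route, the mechanics can be rehearsed on a toy repository before committing to kernel rebuilds. A sketch assuming git is installed; the repository layout and the test predicate (a file named "bug" standing in for the dirty-page leak) are made up for illustration:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name bisect-demo

# Eight commits; pretend commit 5 introduced the regression
# (marked here by the presence of a file named "bug").
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > state
    [ "$i" -ge 5 ] && echo leak > bug || true
    git add -A
    git commit -qm "commit $i"
done

# bad = HEAD (commit 8), good = HEAD~7 (commit 1).
git bisect start HEAD HEAD~7 > /dev/null

# Let bisect drive a test script: exit 0 = good (no "bug" file), nonzero = bad.
# For the real bug this step is "boot the kernel, sync, check Dirty:".
git bisect run sh -c '! test -f bug' | grep 'is the first bad commit'
```

With the kernel itself, each `bisect run` step costs a build and a reboot, which is exactly why Krzysztof's shortcut (testing -rc1-git8 with one suspect patch reverted) was so much cheaper than a full bisection.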
Re: [Bug 9182] Critical memory leak (dirty pages)
On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki [EMAIL PROTECTED] wrote: On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9182

--- Comment #33 from [EMAIL PROTECTED] 2007-12-15 14:19 ---

Krzysztof, I'd hate to point you down a hard path (at least a time-consuming one), but you've done a lot of digging by now anyway. How about git-bisecting between 2.6.20-rc2 and rc1? Here is great info on bisecting: http://www.kernel.org/doc/local/git-quick.html

As I'm smarter than git-bisect, I can tell that 2.6.20-rc1-git8 is as bad as 2.6.20-rc2, but 2.6.20-rc1-git8 with one patch reverted seems to be OK. So it took me only 2 reboots. ;) The guilty patch is the one I proposed just an hour ago:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
- 2.6.20-rc1: OK
- 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
- 2.6.20-rc1-git8: very BAD
- 2.6.20-rc2: very BAD
- 2.6.20-rc4: very BAD
- >= 2.6.20: BAD (but not *very* BAD!)

well.. We have code which has been used by *everyone* for a year and it's misbehaving for you alone. I wonder what you're doing that is different/special. Which filesystem, which mount options, what sort of workload? Thanks.

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 13 Dec 2007, Peter Zijlstra wrote: On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:

BTW: Could someone please look at this problem? I feel a little ignored, and in my situation this is a critical regression.

I was hoping to get around to it today, but I guess tomorrow will have to do :-/

Thanks.

So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0, right?

Not only doesn't it fall, it continuously grows.

Does it happen with other filesystems as well?

Don't know. I generally only use ext3 and I'm afraid I'm not able to switch this system to another filesystem.

What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 rw,nosuid,nodev,noatime,data=ordered 0 0

Best regards, Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:

BTW: Could someone please look at this problem? I feel a little ignored, and in my situation this is a critical regression.

I was hoping to get around to it today, but I guess tomorrow will have to do :-/

So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0, right? Does it happen with other filesystems as well? What are your ext3 mount options?
Re: [Bug 9182] Critical memory leak (dirty pages)
On Mon, 3 Dec 2007, Thomas Osterried wrote:

On the machine which has troubles, the bug occurred within about 10 days. During these days, the amount of dirty pages increased, up to 400MB. I have tested kernels 2.6.19, 2.6.20, 2.6.22.1 and 2.6.22.10 (with our config), and even linux-2.6.20 from ubuntu-server. They have all shown that behaviour. 10 days ago, I installed kernel 2.6.18.5 on this machine (with backported 3ware controller code). I'm quite sure that this kernel now fixes our severe stability problems on this production machine (currently: Dirty: 472 kB, nr_dirty 118). If so, it's the "latest" kernel I found usable, after half a year of pain.

Strange, my tests show that both 2.6.18(.8) and 2.6.19(.7) are OK and the first bad kernel is 2.6.20.

BTW: Could someone please look at this problem? I feel a little ignored, and in my situation this is a critical regression.

Best regards, Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Tue, 11 Dec 2007, Krzysztof Oledzki wrote: On Wed, 5 Dec 2007, Krzysztof Oledzki wrote: On Wed, 5 Dec 2007, [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9182

--- Comment #20 from [EMAIL PROTECTED] 2007-12-05 13:37 ---

Please monitor the "Dirty:" record in /proc/meminfo. Is it slowly rising and never falling?

It is slowly rising, apart from small fluctuations caused by the current load.

Does it then fall if you run /bin/sync?

Only a little, by ~1-2MB like on a normal system. But it is not able to fall below a local minimum. So, after a first sync it does not fall any further with additional syncs.

Compile up usemem.c from http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz and run usemem -m N where N is the number of megabytes which that machine has.

It has 2GB but:

# ./usemem -m 1662 ; echo $?
0
# ./usemem -m 1663 ; echo $?
./usemem: mmap failed: Cannot allocate memory
1

Did this cause /proc/meminfo:Dirty to fall?

No. OK, I booted a kernel without the 2:2 memsplit but instead with a standard 3.1:0.9 split and even without highmem. So, now I have ~900MB and I am able to set -m to the number of megabytes which the machine has. However, usemem still does not cause dirty memory usage to fall. :(

OK, I can confirm that this is a regression from 2.6.18, where it works OK:

[EMAIL PROTECTED]:~$ uname -r
2.6.18.8
[EMAIL PROTECTED]:~$ uptime;grep Dirt /proc/meminfo;sync;sleep 2;sync;sleep 1;sync;grep Dirt /proc/meminfo
14:21:53 up 1:00, 1 user, load average: 0.23, 0.36, 0.35
Dirty: 376 kB
Dirty: 0 kB

It seems that this leak also exists on my other systems, as even after many syncs the number of dirty pages is still >> 0, but this is the only one where it is so critical and at the same time so easy to reproduce.

Best regards, Krzysztof Olędzki
Re: [Bug 9182] Critical memory leak (dirty pages)
On Wed, 5 Dec 2007, [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9182

[EMAIL PROTECTED] changed:

What          |Removed                     |Added
Component     |Other                       |Other
KernelVersion |2.6.22-stable/2.6.23-stable |2.6.20-stable/2.6.22-stable/2.6.23-stable
Product       |IO/Storage                  |Memory Management
Regression    |0                           |1
Summary       |Strange system hangs        |Critical memory leak (dirty pages)

After an additional hint from Thomas Osterried I can confirm that the problem I have been dealing with for half a year comes from a continuous dirty pages increase:

http://bugzilla.kernel.org/attachment.cgi?id=13864&action=view (in 1 KB units)

So, after two days of uptime I have ~140MB of dirty pages, and that explains why my system crashes every 2-3 weeks.

Best regards, Krzysztof Olędzki