Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-21 Thread Krzysztof Oledzki



On Thu, 20 Dec 2007, Björn Steinbrink wrote:


On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:



On Thu, 20 Dec 2007, Björn Steinbrink wrote:


OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.

Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

truncate_complete_page()
  cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
ext3_invalidatepage()
  journal_invalidatepage()
journal_unmap_buffer()
  __dispose_buffer()
__journal_unfile_buffer()
  __journal_temp_unlink_buffer()
mark_buffer_dirty(); // PG_dirty set, incr. dirty pages


Good, this seems to be the exact path that actually triggers it. I got to
journal_unmap_buffer(), but was too lazy to actually then bother to follow
it all the way down - I decided that I didn't actually really even care
what the low-level FS layer did, I had already convinced myself that it
obviously must be dirtying the page some way, since that matched the
symptoms exactly (ie only the journaling case was impacted, and this was
all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem
did at that point, the VM accounting shouldn't care, and should be robust
in the face of a low-level filesystem doing strange and wonderful things.
But thanks for bothering to go through the whole history and figure out
what exactly is up.


Oh well, after seeing the move of cancel_dirty_page, I just went
backwards from __set_page_dirty using cscope + some smart guessing and
quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)


As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
the accounting fix in cancel_dirty_page, of course).


Yes, I think we have room for cleanups now, and I agree: we ended up
reinstating some questionable code in the VM just because we didn't really
know or understand what was going on in the ext3 journal code.


Hm, you attributed more to my mail than there was actually in it. I
didn't even start to think of cleanups (because I don't know jack about
the whole ext3/jbd stuff, so I simply cannot come up with any cleanups
(yet?)). What I meant is that we only did a half-revert of that hackery.

When try_to_free_buffers started to check for PG_dirty, cancel_dirty_page
had to be called before do_invalidatepage, to "fix" a _huge_ leak. But that
caused the accounting breakage we're now seeing, because we never account
for the pages that got redirtied during do_invalidatepage.

Then the change to try_to_free_buffers got reverted, so we no longer
need to call cancel_dirty_page before do_invalidatepage, but still we
do. Thus the accounting bug remains. So what I meant to suggest was
simply to actually "finish" the revert we started.

Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate
accounting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?

If that "fixes" the problem for now, I'll try to come up with some
usable commit message, or if somebody wants to beat me to it, you can
already have my

Signed-off-by: Björn Steinbrink <[EMAIL PROTECTED]>


Checked with 2.6.24-rc5 + debug/fixup patch from Linus + above fix. After 
3h there have been no warnings about __remove_from_page_cache(). So, it 
seems that it is OK.


Tested-by: Krzysztof Piotr Oledzki <[EMAIL PROTECTED]>

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Nick Piggin
On Friday 21 December 2007 06:24, Linus Torvalds wrote:
> On Thu, 20 Dec 2007, Jan Kara wrote:
> >   As I wrote in my previous email, this solution works but hides the
> > fact that the page really *has* dirty data in it and *is* pinned in
> > memory until the commit code gets to writing it. So in theory it could
> > disturb the writeout logic by having more dirty data in memory than vm
> > thinks it has. Not that I'd have a better fix now but I wanted to point
> > out this problem.
>
> Well, I worry more about the VM being sane - and by the time we actually
> hit this case, as far as VM sanity is concerned, the page no longer really
> exists. It's been removed from the page cache, and it only really exists
> as any other random kernel allocation.

It does allow the VM to just not worry about this. However, I don't
really like these kinds of catch-all conditions that are hard to get
rid of and can encourage bad behaviour.

It would be nice if the "insane" things were made to clean up after
themselves.


> The fact that low-level filesystems (in this case ext3 journaling) do
> their own insane things is not something the VM even _should_ care about.
> It's just an internal FS allocation, and the FS can do whatever the hell
> it wants with it, including doing IO etc.
>
> The kernel doesn't consider any other random IO pages to be "dirty" either
> (eg if you do direct-IO writes using low-level SCSI commands, the VM
> doesn't consider that to be any special dirty stuff, it's just random page
> allocations again). This is really no different.
>
> In other words: the Linux "VM" subsystem is really two different parts: the
> low-level page allocator (which obviously knows that the page is still in
> *use*, since it hasn't been free'd), and the higher-level file mapping and
> caching stuff that knows about things like page "dirtiness". And once
> you've done a "remove_from_page_cache()", the higher levels are no longer
> involved, and dirty accounting simply doesn't get into the picture.

That's all true... it would simply be nice to ask the filesystems to do
this. But anyway I think your patch is pretty reasonable for the moment.


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Björn Steinbrink
On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> > 
> > OK, so I looked for PG_dirty anyway.
> > 
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> > 
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> > 
> > Now, if I'm not mistaken, we can end up as follows.
> > 
> > truncate_complete_page()
> >   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> > ext3_invalidatepage()
> >   journal_invalidatepage()
> > journal_unmap_buffer()
> >   __dispose_buffer()
> > __journal_unfile_buffer()
> >   __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
> 
> Good, this seems to be the exact path that actually triggers it. I got to 
> journal_unmap_buffer(), but was too lazy to actually then bother to follow 
> it all the way down - I decided that I didn't actually really even care 
> what the low-level FS layer did, I had already convinced myself that it 
> obviously must be dirtying the page some way, since that matched the 
> symptoms exactly (ie only the journaling case was impacted, and this was 
> all about the journal).
> 
> But perhaps more importantly: regardless of what the low-level filesystem 
> did at that point, the VM accounting shouldn't care, and should be robust 
> in the face of a low-level filesystem doing strange and wonderful things. 
> But thanks for bothering to go through the whole history and figure out 
> what exactly is up.

Oh well, after seeing the move of cancel_dirty_page, I just went
backwards from __set_page_dirty using cscope + some smart guessing and
quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)

> > As try_to_free_buffers got its ext3 hack back in
> > ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> > 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
> > the accounting fix in cancel_dirty_page, of course).
> 
> Yes, I think we have room for cleanups now, and I agree: we ended up 
> reinstating some questionable code in the VM just because we didn't really 
> know or understand what was going on in the ext3 journal code. 

Hm, you attributed more to my mail than there was actually in it. I
didn't even start to think of cleanups (because I don't know jack about
the whole ext3/jbd stuff, so I simply cannot come up with any cleanups
(yet?)). What I meant is that we only did a half-revert of that hackery.

When try_to_free_buffers started to check for PG_dirty, cancel_dirty_page
had to be called before do_invalidatepage, to "fix" a _huge_ leak. But that
caused the accounting breakage we're now seeing, because we never account
for the pages that got redirtied during do_invalidatepage.

Then the change to try_to_free_buffers got reverted, so we no longer
need to call cancel_dirty_page before do_invalidatepage, but still we
do. Thus the accounting bug remains. So what I meant to suggest was
simply to actually "finish" the revert we started.

Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate
accounting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?

If that "fixes" the problem for now, I'll try to come up with some
usable commit message, or if somebody wants to beat me to it, you can
already have my

Signed-off-by: Björn Steinbrink <[EMAIL PROTECTED]>

> > On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> > io accounting for cancelled writes always happened if the page
> > was dirty, regardless of page->mapping. This was also already true for
> > the old test_clear_page_dirty code, and the commit log for
> > 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> > change either, so maybe the "if (account_size)" block should be moved
> > out of the if "(mapping && ...)" block?
> 
> I think the "if (account_size)" thing was *purely* for me being worried 
> about hugetlb entries, and I think that's the only thing that passes in a
> zero account size.

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Linus Torvalds


On Thu, 20 Dec 2007, Jan Kara wrote:
>
>   As I wrote in my previous email, this solution works but hides the
> fact that the page really *has* dirty data in it and *is* pinned in memory
> until the commit code gets to writing it. So in theory it could disturb
> the writeout logic by having more dirty data in memory than vm thinks it
> has. Not that I'd have a better fix now but I wanted to point out this
> problem.

Well, I worry more about the VM being sane - and by the time we actually 
hit this case, as far as VM sanity is concerned, the page no longer really 
exists. It's been removed from the page cache, and it only really exists 
as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do 
their own insane things is not something the VM even _should_ care about. 
It's just an internal FS allocation, and the FS can do whatever the hell 
it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be "dirty" either 
(eg if you do direct-IO writes using low-level SCSI commands, the VM 
doesn't consider that to be any special dirty stuff, it's just random page 
allocations again). This is really no different.

In other words: the Linux "VM" subsystem is really two different parts: the 
low-level page allocator (which obviously knows that the page is still in 
*use*, since it hasn't been free'd), and the higher-level file mapping and 
caching stuff that knows about things like page "dirtiness". And once 
you've done a "remove_from_page_cache()", the higher levels are no longer 
involved, and dirty accounting simply doesn't get into the picture.

Linus


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> > 
> > OK, so I looked for PG_dirty anyway.
> > 
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> > 
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> > 
> > Now, if I'm not mistaken, we can end up as follows.
> > 
> > truncate_complete_page()
> >   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> > ext3_invalidatepage()
> >   journal_invalidatepage()
> > journal_unmap_buffer()
> >   __dispose_buffer()
> > __journal_unfile_buffer()
> >   __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
> 
> Good, this seems to be the exact path that actually triggers it. I got to 
> journal_unmap_buffer(), but was too lazy to actually then bother to follow 
> it all the way down - I decided that I didn't actually really even care 
> what the low-level FS layer did, I had already convinced myself that it 
> obviously must be dirtying the page some way, since that matched the 
> symptoms exactly (ie only the journaling case was impacted, and this was 
> all about the journal).
> 
> But perhaps more importantly: regardless of what the low-level filesystem 
> did at that point, the VM accounting shouldn't care, and should be robust 
> in the face of a low-level filesystem doing strange and wonderful things. 
> But thanks for bothering to go through the whole history and figure out 
> what exactly is up.
  As I wrote in my previous email, this solution works but hides the
fact that the page really *has* dirty data in it and *is* pinned in memory
until the commit code gets to writing it. So in theory it could disturb
the writeout logic by having more dirty data in memory than vm thinks it
has. Not that I'd have a better fix now but I wanted to point out this
problem.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Linus Torvalds


On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> 
> OK, so I looked for PG_dirty anyway.
> 
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
> 
> Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
> 
> Now, if I'm not mistaken, we can end up as follows.
> 
> truncate_complete_page()
>   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
> ext3_invalidatepage()
>   journal_invalidatepage()
> journal_unmap_buffer()
>   __dispose_buffer()
> __journal_unfile_buffer()
>   __journal_temp_unlink_buffer()
> mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to 
journal_unmap_buffer(), but was too lazy to actually then bother to follow 
it all the way down - I decided that I didn't actually really even care 
what the low-level FS layer did, I had already convinced myself that it 
obviously must be dirtying the page some way, since that matched the 
symptoms exactly (ie only the journaling case was impacted, and this was 
all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem 
did at that point, the VM accounting shouldn't care, and should be robust 
in the face of a low-level filesystem doing strange and wonderful things. 
But thanks for bothering to go through the whole history and figure out 
what exactly is up.

> As try_to_free_buffers got its ext3 hack back in
> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
> the accounting fix in cancel_dirty_page, of course).

Yes, I think we have room for cleanups now, and I agree: we ended up 
reinstating some questionable code in the VM just because we didn't really 
know or understand what was going on in the ext3 journal code. 

Of course, it may well be that there is something *else* going on too, but 
I do believe that this whole case is what it was all about, and the hacks 
end up just (a) making the VM harder to understand (because they cause 
non-obvious VM code to work around some very specific filesystem 
behaviour) and (b) the hacks obviously hid the _real_ issue, but I think 
we've established the real cause, and the hacks clearly weren't enough to 
really hide it 100% anyway.

However, there's no way I'll play with that  right now (I'm planning on an 
-rc6 today), but it might be worth it to make a test-cleanup patch for -mm 
which does some VM cleanups:

 - don't touch dirty pages in fs/buffer.c (ie undo the meat of commit 
   ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the 
   debugging code)

 - remove the calling of "cancel_dirty_page()" entirely from 
   "truncate_complete_page()", and let "remove_from_page_cache()" just 
   always handle it (and probably just add a "ClearPageDirty()" to match 
   the "ClearPageUptodate()").

 - remove "cancel_dirty_page()" from "truncate_huge_page()", which seems 
   to be the exact same issue (ie we should just use the logic in 
   remove_from_page_cache()).

at that point "cancel_dirty_page()" literally is only used for what its 
name implies, and the only in-tree use of it seems to be NFS for when 
the filesystem gets called for ->invalidatepage - which makes tons of 
conceptual sense, but I suspect we could drop it from there too, since the 
VM layer _will_ cancel the dirtiness at a VM level when it then later 
removes it from the page cache.

So we essentially seem to be able to simplify things a bit by getting rid 
of a hack in try_to_free_buffers(), and potentially getting rid of 
cancel_dirty_page() entirely.

It would imply that we need to do the task_io_account_cancelled_write() 
inside "remove_from_page_cache()", but that should be ok (I don't see any 
locking issues there).
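
For illustration, a rough sketch of what truncate_complete_page() might then
reduce to (this is not an actual patch, just the shape, reusing only function
names already mentioned in this thread and assuming the rest of the teardown
stays as it is):

static void truncate_complete_page(struct address_space *mapping, struct page *page)
{
	if (page->mapping != mapping)
		return;

	if (PagePrivate(page))
		do_invalidatepage(page, 0);

	/* no cancel_dirty_page() here any more; remove_from_page_cache()
	 * would clear the dirty bit (ClearPageDirty) and fix up the
	 * accounting, mirroring the existing ClearPageUptodate() below */
	remove_from_page_cache(page);
	ClearPageUptodate(page);
	ClearPageMappedToDisk(page);
}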

> On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> io accounting for cancelled writes always happened if the page
> was dirty, regardless of page->mapping. This was also already true for
> the old test_clear_page_dirty code, and the commit log for
> 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> change either, so maybe the "if (account_size)" block should be moved
> out of the if "(mapping && ...)" block?

I think the "if (account_size)" thing was *purely* for me being worried 
about hugetlb entries, and I think that's the only thing that passes in a 
zero account size.

But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we 
cannot account for those pages *anyway*), so I think we could go further 
than move the account_size outside of the test, I think we could probably 
remove that test entirely and drop the whole thing.

The thing is, task_io_account_cancelled_write() doesn't make sense on 

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
> > On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> > > 
> > > 
> > > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > > 
> > > > I'll confirm this tomorrow but it seems that even switching to 
> > > > data=ordered
> > > > (AFAIK default on ext3) is indeed enough to cure this problem.
> > > 
> > > Ok, do we actually have any ext3 expert following this? I have no idea 
> > > about what the journalling code does, but I have painful memories of ext3 
> > > doing really odd buffer-head-based IO and totally bypassing all the 
> > > normal 
> > > page dirty logic.
> > > 
> > > Judging by the symptoms (sorry for not following this well, it came up 
> > > while I was mostly away travelling), something probably *does* clear the 
> > > dirty bit on the pages, but the dirty *accounting* is not done properly, 
> > > so the kernel keeps thinking it has dirty pages.
> > > 
> > > Now, a simple "grep" shows that ext3 does not actually do any 
> > > ClearPageDirty() or similar on its own, although maybe I missed some 
> > > other 
> > > subtle way this can happen. And the *normal* VFS routines that do 
> > > ClearPageDirty should all be doing the proper accounting.
> > > 
> > > So I see a couple of possible cases:
> > > 
> > >  - actually clearing the PG_dirty bit somehow, without doing the 
> > >accounting.
> > > 
> > >This looks very unlikely. PG_dirty is always cleared by some variant 
> > > of 
> > >"*ClearPageDirty()", and that bit definition isn't used for anything 
> > >else in the whole kernel judging by "grep" (the page allocator tests 
> > >the bit, that's it).
> > 
> > OK, so I looked for PG_dirty anyway.
> > 
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> > 
> > Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> > 
> > Now, if I'm not mistaken, we can end up as follows.
> > 
> > truncate_complete_page()
> >   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> >   do_invalidatepage()
> > ext3_invalidatepage()
> >   journal_invalidatepage()
> > journal_unmap_buffer()
> >   __dispose_buffer()
> > __journal_unfile_buffer()
> >   __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
> > 
> > If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> > called and neither is cancel_dirty_page, so the dirty pages accounting
> > is not decreased again.
>   Yes, this can happen. The call to mark_buffer_dirty() is a fallout
> from journal_unfile_buffer() trying to synchronise JBD private dirty bit
> (jbddirty) with the standard dirty bit. We could actually clear the
> jbddirty bit before calling journal_unfile_buffer() so that this doesn't
> happen but since Linus changed remove_from_pagecache() to not care about
> redirtying the page I guess it's not needed any more...
  Oops, sorry, I spoke too soon. After thinking more about it, I think we
cannot clear the dirty bit (at least not jbddirty) in all cases and in
fact moving cancel_dirty_page() after do_invalidatepage() call only
hides the real problem.
  Let's recap what JBD/ext3 code requires in case of truncation.  A
life-cycle of a journaled buffer looks as follows: When we want to write
some data to it, it gets attached to the running transaction. When the
transaction is committing, the buffer is written to the journal.
Sometime later, the buffer is written to its final place in the
filesystem - this is called checkpoint - and can be released.
  Now suppose a write to the buffer happens in one transaction and you
truncate the buffer in the next one. You cannot just free the buffer
immediately - it can, for example, happen that the transaction in which
the write happened hasn't committed yet. So we just leave the dirty
buffer there and it should be cleaned up later when the committing
transaction writes the data where it needs...
  The problem is that when the commit code writes the buffer, it
eventually calls try_to_free_buffers() but as Nick pointed out,
->mapping is set to NULL by that time so we don't even call
cancel_dirty_page() and so the number of dirty pages is not properly
decreased. Of course, we could decrease the number of dirty pages after
we return from do_invalidatepage when clearing ->mapping but that would
make dirty accounting imprecise - we really still have dirty data that
needs writeout. But it's probably the best workaround I can
currently think of.
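
As a toy illustration of that life cycle (this is not JBD code; the states
and the helper below are invented purely for the example), the constraint on
truncate looks like:

/* Toy model of the journaled-buffer life cycle described above.
 * Not JBD code: states and names are invented for illustration only. */
#include <stdio.h>

enum buf_state {
	IN_RUNNING_TRANSACTION,	/* data written: buffer attached to the running transaction */
	IN_JOURNAL,		/* that transaction committed: buffer written to the journal */
	CHECKPOINTED		/* buffer written to its final location; may be released */
};

/* A buffer truncated in a later transaction cannot simply be freed while
 * the transaction that dirtied it may not have committed yet; it stays
 * dirty and is cleaned up once the committing transaction has written it. */
static int may_free_on_truncate(enum buf_state s)
{
	return s == CHECKPOINTED;
}

int main(void)
{
	printf("free while still in running transaction? %d\n",
	       may_free_on_truncate(IN_RUNNING_TRANSACTION));	/* 0 */
	printf("free after checkpoint? %d\n",
	       may_free_on_truncate(CHECKPOINTED));		/* 1 */
	return 0;
}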

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
> On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> > 
> > 
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > 
> > > I'll confirm this tomorrow but it seems that even switching to 
> > > data=ordered
> > > (AFAIK default on ext3) is indeed enough to cure this problem.
> > 
> > Ok, do we actually have any ext3 expert following this? I have no idea 
> > about what the journalling code does, but I have painful memories of ext3 
> > doing really odd buffer-head-based IO and totally bypassing all the normal 
> > page dirty logic.
> > 
> > Judging by the symptoms (sorry for not following this well, it came up 
> > while I was mostly away travelling), something probably *does* clear the 
> > dirty bit on the pages, but the dirty *accounting* is not done properly, 
> > so the kernel keeps thinking it has dirty pages.
> > 
> > Now, a simple "grep" shows that ext3 does not actually do any 
> > ClearPageDirty() or similar on its own, although maybe I missed some other 
> > subtle way this can happen. And the *normal* VFS routines that do 
> > ClearPageDirty should all be doing the proper accounting.
> > 
> > So I see a couple of possible cases:
> > 
> >  - actually clearing the PG_dirty bit somehow, without doing the 
> >accounting.
> > 
> >This looks very unlikely. PG_dirty is always cleared by some variant of 
> >"*ClearPageDirty()", and that bit definition isn't used for anything 
> >else in the whole kernel judging by "grep" (the page allocator tests 
> >the bit, that's it).
> 
> OK, so I looked for PG_dirty anyway.
> 
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
> 
> Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
> 
> Now, if I'm not mistaken, we can end up as follows.
> 
> truncate_complete_page()
>   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
> ext3_invalidatepage()
>   journal_invalidatepage()
> journal_unmap_buffer()
>   __dispose_buffer()
> __journal_unfile_buffer()
>   __journal_temp_unlink_buffer()
> mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
> 
> If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> called and neither is cancel_dirty_page, so the dirty pages accounting
> is not decreased again.
  Yes, this can happen. The call to mark_buffer_dirty() is a fallout
from journal_unfile_buffer() trying to synchronise JBD private dirty bit
(jbddirty) with the standard dirty bit. We could actually clear the
jbddirty bit before calling journal_unfile_buffer() so that this doesn't
happen but since Linus changed remove_from_pagecache() to not care about
redirtying the page I guess it's not needed any more...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Björn Steinbrink
On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > 
> > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > (AFAIK default on ext3) is indeed enough to cure this problem.
> 
> Ok, do we actually have any ext3 expert following this? I have no idea 
> about what the journalling code does, but I have painful memories of ext3 
> doing really odd buffer-head-based IO and totally bypassing all the normal 
> page dirty logic.
> 
> Judging by the symptoms (sorry for not following this well, it came up 
> while I was mostly away travelling), something probably *does* clear the 
> dirty bit on the pages, but the dirty *accounting* is not done properly, 
> so the kernel keeps thinking it has dirty pages.
> 
> Now, a simple "grep" shows that ext3 does not actually do any 
> ClearPageDirty() or similar on its own, although maybe I missed some other 
> subtle way this can happen. And the *normal* VFS routines that do 
> ClearPageDirty should all be doing the proper accounting.
> 
> So I see a couple of possible cases:
> 
>  - actually clearing the PG_dirty bit somehow, without doing the 
>accounting.
> 
>This looks very unlikely. PG_dirty is always cleared by some variant of 
>"*ClearPageDirty()", and that bit definition isn't used for anything 
>else in the whole kernel judging by "grep" (the page allocator tests 
>the bit, that's it).

OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.

Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

truncate_complete_page()
  cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
ext3_invalidatepage()
  journal_invalidatepage()
journal_unmap_buffer()
  __dispose_buffer()
__journal_unfile_buffer()
  __journal_temp_unlink_buffer()
mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

If journal_unmap_buffer then returns 0, try_to_free_buffers is not
called and neither is cancel_dirty_page, so the dirty pages accounting
is not decreased again.
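
To spell out the arithmetic, a toy user-space model of the counter over one
such truncation (not kernel code, just the bookkeeping):

/* Toy model of the dirty-page counter across one truncation of a dirty,
 * journaled page, following the call chain above.  Not kernel code. */
#include <stdio.h>

int main(void)
{
	long nr_dirty = 1;	/* one dirty page before truncate_complete_page() */

	nr_dirty--;		/* cancel_dirty_page(): clears PG_dirty, decrements */
	nr_dirty++;		/* mark_buffer_dirty() via ext3_invalidatepage(): re-dirties */
	/* journal_unmap_buffer() returned 0, so try_to_free_buffers() never
	 * runs and nothing decrements the counter again */
	printf("dirty pages still accounted: %ld\n", nr_dirty);	/* prints 1 */
	return 0;
}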

As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
the accounting fix in cancel_dirty_page, of course).


On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
io accounting for cancelled writes always happened if the page
was dirty, regardless of page->mapping. This was also already true for
the old test_clear_page_dirty code, and the commit log for
8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
change either, so maybe the "if (account_size)" block should be moved
out of the if "(mapping && ...)" block?
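
Schematically, that move would look something like this (a sketch only, not
the real cancel_dirty_page() body; the TestClearPageDirty() test is assumed
and the accounting inside the mapping check is elided):

void cancel_dirty_page(struct page *page, unsigned int account_size)
{
	if (TestClearPageDirty(page)) {
		struct address_space *mapping = page->mapping;

		if (mapping /* && mapping accounts dirty pages */) {
			/* per-mapping dirty accounting stays in here */
		}
		/* moved out of the (mapping && ...) block: */
		if (account_size)
			task_io_account_cancelled_write(account_size);
	}
}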

Björn - not sending patches because he needs sleep and wouldn't have a
damn clue about what to write as a commit message anyway


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Björn Steinbrink
On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
 
 
 On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
  
  I'll confirm this tomorrow but it seems that even switching to data=ordered
  (AFAIK default o ext3) is indeed enough to cure this problem.
 
 Ok, do we actually have any ext3 expert following this? I have no idea 
 about what the journalling code does, but I have painful memories of ext3 
 doing really odd buffer-head-based IO and totally bypassing all the normal 
 page dirty logic.
 
 Judging by the symptoms (sorry for not following this well, it came up 
 while I was mostly away travelling), something probably *does* clear the 
 dirty bit on the pages, but the dirty *accounting* is not done properly, 
 so the kernel keeps thinking it has dirty pages.
 
 Now, a simple grep shows that ext3 does not actually do any 
 ClearPageDirty() or similar on its own, although maybe I missed some other 
 subtle way this can happen. And the *normal* VFS routines that do 
 ClearPageDirty should all be doing the proper accounting.
 
 So I see a couple of possible cases:
 
  - actually clearing the PG_dirty bit somehow, without doing the 
accounting.
 
This looks very unlikely. PG_dirty is always cleared by some variant of 
*ClearPageDirty(), and that bit definition isn't used for anything 
else in the whole kernel judging by grep (the page allocator tests 
the bit, that's it).

OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.

Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

truncate_complete_page()
  cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
ext3_invalidatepage()
  journal_invalidatepage()
journal_unmap_buffer()
  __dispose_buffer()
__journal_unfile_buffer()
  __journal_temp_unlink_buffer()
mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

If journal_unmap_buffer then returns 0, try_to_free_buffers is not
called and neither is cancel_dirty_page, so the dirty pages accounting
is not decreased again.

As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
the accounting fix in cancel_dirty_page, of course).


On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
io accounting for cancelled writes happened always happened if the page
was dirty, regardless of page-mapping. This was also already true for
the old test_clear_page_dirty code, and the commit log for
8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
change either, so maybe the if (account_size) block should be moved
out of the if (mapping  ...) block?

Björn - not sending patches because he needs sleep and wouldn't have a
damn clue about what to write as a commit message anyway
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
 On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
  
  
  On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
   
   I'll confirm this tomorrow but it seems that even switching to 
   data=ordered
   (AFAIK default o ext3) is indeed enough to cure this problem.
  
  Ok, do we actually have any ext3 expert following this? I have no idea 
  about what the journalling code does, but I have painful memories of ext3 
  doing really odd buffer-head-based IO and totally bypassing all the normal 
  page dirty logic.
  
  Judging by the symptoms (sorry for not following this well, it came up 
  while I was mostly away travelling), something probably *does* clear the 
  dirty bit on the pages, but the dirty *accounting* is not done properly, 
  so the kernel keeps thinking it has dirty pages.
  
  Now, a simple grep shows that ext3 does not actually do any 
  ClearPageDirty() or similar on its own, although maybe I missed some other 
  subtle way this can happen. And the *normal* VFS routines that do 
  ClearPageDirty should all be doing the proper accounting.
  
  So I see a couple of possible cases:
  
   - actually clearing the PG_dirty bit somehow, without doing the 
 accounting.
  
 This looks very unlikely. PG_dirty is always cleared by some variant of 
 *ClearPageDirty(), and that bit definition isn't used for anything 
 else in the whole kernel judging by grep (the page allocator tests 
 the bit, that's it).
 
 OK, so I looked for PG_dirty anyway.
 
 In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
 bail out if the page is dirty.
 
 Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
 truncate_complete_page, because it called cancel_dirty_page (and thus
 cleared PG_dirty) after try_to_free_buffers was called via
 do_invalidatepage.
 
 Now, if I'm not mistaken, we can end up as follows.
 
 truncate_complete_page()
   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
   do_invalidatepage()
 ext3_invalidatepage()
   journal_invalidatepage()
 journal_unmap_buffer()
   __dispose_buffer()
 __journal_unfile_buffer()
   __journal_temp_unlink_buffer()
 mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
 
 If journal_unmap_buffer then returns 0, try_to_free_buffers is not
 called and neither is cancel_dirty_page, so the dirty pages accounting
 is not decreased again.
  Yes, this can happen. The call to mark_buffer_dirty() is a fallout
from journal_unfile_buffer() trying to sychronise JBD private dirty bit
(jbddirty) with the standard dirty bit. We could actually clear the
jbddirty bit before calling journal_unfile_buffer() so that this doesn't
happen but since Linus changed remove_from_pagecache() to not care about
redirtying the page I guess it's not needed any more...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
  On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
   
   
   On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:

I'll confirm this tomorrow but it seems that even switching to 
data=ordered
(AFAIK default o ext3) is indeed enough to cure this problem.
   
   Ok, do we actually have any ext3 expert following this? I have no idea 
   about what the journalling code does, but I have painful memories of ext3 
   doing really odd buffer-head-based IO and totally bypassing all the 
   normal 
   page dirty logic.
   
   Judging by the symptoms (sorry for not following this well, it came up 
   while I was mostly away travelling), something probably *does* clear the 
   dirty bit on the pages, but the dirty *accounting* is not done properly, 
   so the kernel keeps thinking it has dirty pages.
   
   Now, a simple grep shows that ext3 does not actually do any 
   ClearPageDirty() or similar on its own, although maybe I missed some 
   other 
   subtle way this can happen. And the *normal* VFS routines that do 
   ClearPageDirty should all be doing the proper accounting.
   
   So I see a couple of possible cases:
   
- actually clearing the PG_dirty bit somehow, without doing the 
  accounting.
   
  This looks very unlikely. PG_dirty is always cleared by some variant 
   of 
  *ClearPageDirty(), and that bit definition isn't used for anything 
  else in the whole kernel judging by grep (the page allocator tests 
  the bit, that's it).
  
  OK, so I looked for PG_dirty anyway.
  
  In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
  bail out if the page is dirty.
  
  Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
  truncate_complete_page, because it called cancel_dirty_page (and thus
  cleared PG_dirty) after try_to_free_buffers was called via
  do_invalidatepage.
  
  Now, if I'm not mistaken, we can end up as follows.
  
  truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
  ext3_invalidatepage()
journal_invalidatepage()
  journal_unmap_buffer()
__dispose_buffer()
  __journal_unfile_buffer()
__journal_temp_unlink_buffer()
  mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
  
  If journal_unmap_buffer then returns 0, try_to_free_buffers is not
  called and neither is cancel_dirty_page, so the dirty pages accounting
  is not decreased again.
   Yes, this can happen. The call to mark_buffer_dirty() is a fallout
 from journal_unfile_buffer() trying to sychronise JBD private dirty bit
 (jbddirty) with the standard dirty bit. We could actually clear the
 jbddirty bit before calling journal_unfile_buffer() so that this doesn't
 happen but since Linus changed remove_from_pagecache() to not care about
 redirtying the page I guess it's not needed any more...
  Oops, sorry, I spoke to soon. After thinking more about it, I think we
cannot clear the dirty bit (at least not jbddirty) in all cases and in
fact moving cancel_dirty_page() after do_invalidatepage() call only
hides the real problem.
  Let's recap what JBD/ext3 code requires in case of truncation.  A
life-cycle of a journaled buffer looks as follows: When we want to write
some data to it, it gets attached to the running transaction. When the
transaction is committing, the buffer is written to the journal.
Sometime later, the buffer is written to it's final place in the
filesystem - this is called checkpoint - and can be released.
  Now suppose a write to the buffer happens in one transaction and you
truncate the buffer in the next one. You cannot just free the buffer
immediately - it can for example happen, that the transaction in which
the write happened hasn't committed yet. So we just leave the dirty
buffer there and it should be cleaned up later when the committing
transaction writes the data where it needs...
  The problem is that when the commit code writes the buffer, it
eventually calls try_to_free_buffers() but as Nick pointed out,
-mapping is set to NULL by that time so we don't even call
cancel_dirty_page() and so the number of dirty pages is not properly
decreased. Of course, we could decrease the number of dirty pages after
we return from do_invalidatepage when clearing -mapping but that would
make dirty accounting imprecise - we really still have those dirty data
which need writeout. But it's probably the best workaround I can
currently think of.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Linus Torvalds


On Thu, 20 Dec 2007, Bj?rn Steinbrink wrote:
 
 OK, so I looked for PG_dirty anyway.
 
 In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
 bail out if the page is dirty.
 
 Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
 truncate_complete_page, because it called cancel_dirty_page (and thus
 cleared PG_dirty) after try_to_free_buffers was called via
 do_invalidatepage.
 
 Now, if I'm not mistaken, we can end up as follows.
 
 truncate_complete_page()
   cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
   do_invalidatepage()
 ext3_invalidatepage()
   journal_invalidatepage()
 journal_unmap_buffer()
   __dispose_buffer()
 __journal_unfile_buffer()
   __journal_temp_unlink_buffer()
 mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to 
journal_unmap_buffer(), but was too lazy to actually then bother to follow 
it all the way down - I decided that I didn't actually really even care 
what the low-level FS layer did, I had already convinced myself that it 
obviously must be dirtying the page some way, since that matched the 
symptoms exactly (ie only the journaling case was impacted, and this was 
all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem 
did at that point, the VM accounting shouldn't care, and should be robust 
in the face of a low-level filesystem doing strange and wonderful things. 
But thanks for bothering to go through the whole history and figure out 
what exactly is up.

 As try_to_free_buffers got its ext3 hack back in
 ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
 3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
 the accounting fix in cancel_dirty_page, of course).

Yes, I think we have room for cleanups now, and I agree: we ended up 
reinstating some questionable code in the VM just because we didn't really 
know or understand what was going on in the ext3 journal code. 

Of course, it may well be that there is something *else* going on too, but 
I do believe that this whole case is what it was all about, and the hacks 
end up just (a) making the VM harder to understand (because they cause 
non-obvious VM code to work around some very specific filesystem 
behaviour) and (b) the hacks obviously hid the _real_ issue, but I think 
we've established the real cause, and the hacks clearly weren't enough to 
really hide it 100% anyway.

However, there's no way I'll play with that  right now (I'm planning on an 
-rc6 today), but it might be worth it to make a test-cleanup patch for -mm 
which does some VM cleanups:

 - don't touch dirty pages in fs/buffer.c (ie undo the meat of commit 
   ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the 
   debugging code)

 - remove the calling of cancel_dirty_page() entirely from 
   truncate_complete_page(), and let remove_from_page_cache() just 
   always handle it (and probably just add a ClearPageDirty() to match 
   the ClearPageUptodate()).

 - remove cancel_dirty_page() from truncate_huge_page(), which seems 
   to be the exact same issue (ie we should just use the logic in 
   remove_from_page_cache()).

at that point cancel_dirty_page() literally is only used for what its 
name implies, and the only in-tree use of it seems to be NFS for when 
the filesystem gets called for -invalidatepage - which makes tons of 
conceptual sense, but I suspect we could drop it from there too, since the 
VM layer _will_ cancel the dirtiness at a VM level when it then later 
removes it from the page cache.

So we essentially seem to be able to simplify things a bit by getting rid 
of a hack in try_to_free_buffers(), and potentially getting rid of 
cancel_dirty_page() entirely.

It would imply that we need to do the task_io_account_cancelled_write() 
inside remove_from_page_cache(), but that should be ok (I don't see any 
locking issues there).

 On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
 io accounting for cancelled writes happened always happened if the page
 was dirty, regardless of page-mapping. This was also already true for
 the old test_clear_page_dirty code, and the commit log for
 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
 change either, so maybe the if (account_size) block should be moved
 out of the if (mapping  ...) block?

I think the if (account_size) thing was *purely* for me being worried 
about hugetlb entries, and I think that's the only thing that passes in a 
zero account size.

But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we 
cannot account for those pages *anyway*), so I think we could go further 
than move the account_size outside of the test, I think we could probably 
remove that test entirely and drop the whole thing.

The thing is, task_io_account_cancelled_write() doesn't make sense on 

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara
 On Thu, 20 Dec 2007, Bj?rn Steinbrink wrote:
  
  OK, so I looked for PG_dirty anyway.
  
  In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
  bail out if the page is dirty.
  
  Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
  truncate_complete_page, because it called cancel_dirty_page (and thus
  cleared PG_dirty) after try_to_free_buffers was called via
  do_invalidatepage.
  
  Now, if I'm not mistaken, we can end up as follows.
  
  truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
  ext3_invalidatepage()
journal_invalidatepage()
  journal_unmap_buffer()
__dispose_buffer()
  __journal_unfile_buffer()
__journal_temp_unlink_buffer()
  mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
 
 Good, this seems to be the exact path that actually triggers it. I got to 
 journal_unmap_buffer(), but was too lazy to actually then bother to follow 
 it all the way down - I decided that I didn't actually really even care 
 what the low-level FS layer did, I had already convinced myself that it 
 obviously must be dirtying the page some way, since that matched the 
 symptoms exactly (ie only the journaling case was impacted, and this was 
 all about the journal).
 
 But perhaps more importantly: regardless of what the low-level filesystem 
 did at that point, the VM accounting shouldn't care, and should be robust 
 in the face of a low-level filesystem doing strange and wonderful things. 
 But thanks for bothering to go through the whole history and figure out 
 what exactly is up.
  As I wrote in my previous email, this solution works but hides the
fact that the page really *has* dirty data in it and *is* pinned in memory
until the commit code gets to writing it. So in theory it could disturb
the writeout logic by having more dirty data in memory than vm thinks it
has. Not that I'd have a better fix now but I wanted to point out this
problem.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Linus Torvalds


On Thu, 20 Dec 2007, Jan Kara wrote:

   As I wrote in my previous email, this solution works but hides the
 fact that the page really *has* dirty data in it and *is* pinned in memory
 until the commit code gets to writing it. So in theory it could disturb
 the writeout logic by having more dirty data in memory than vm thinks it
 has. Not that I'd have a better fix now but I wanted to point out this
 problem.

Well, I worry more about the VM being sane - and by the time we actually 
hit this case, as far as VM sanity is concerned, the page no longer really 
exists. It's been removed from the page cache, and it only really exists 
as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do 
their own insane things is not something the VM even _should_ care about. 
It's just an internal FS allocation, and the FS can do whatever the hell 
it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be dirty either 
(eg if you do direct-IO writes using low-level SCSI commands, the VM 
doesn't consider that to be any special dirty stuff, it's just random page 
allocations again). This is really no different.

In other words: the Linux VM subsystem is really two different parts: the 
low-level page allocator (which obviously knows that the page is still in 
*use*, since it hasn't been free'd), and the higher-level file mapping and 
caching stuff that knows about things like page dirtiness. And once 
you've done a remove_from_page_cache(), the higher levels are no longer 
involved, and dirty accounting simply doesn't get into the picture.

Linus


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Björn Steinbrink
On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
 
 
 On Thu, 20 Dec 2007, Björn Steinbrink wrote:
  
  OK, so I looked for PG_dirty anyway.
  
  In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
  bail out if the page is dirty.
  
  Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
  truncate_complete_page, because it called cancel_dirty_page (and thus
  cleared PG_dirty) after try_to_free_buffers was called via
  do_invalidatepage.
  
  Now, if I'm not mistaken, we can end up as follows.
  
  truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
  ext3_invalidatepage()
journal_invalidatepage()
  journal_unmap_buffer()
__dispose_buffer()
  __journal_unfile_buffer()
__journal_temp_unlink_buffer()
  mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
 
 Good, this seems to be the exact path that actually triggers it. I got to 
 journal_unmap_buffer(), but was too lazy to actually then bother to follow 
 it all the way down - I decided that I didn't actually really even care 
 what the low-level FS layer did, I had already convinced myself that it 
 obviously must be dirtying the page some way, since that matched the 
 symptoms exactly (ie only the journaling case was impacted, and this was 
 all about the journal).
 
 But perhaps more importantly: regardless of what the low-level filesystem 
 did at that point, the VM accounting shouldn't care, and should be robust 
 in the face of a low-level filesystem doing strange and wonderful things. 
 But thanks for bothering to go through the whole history and figure out 
 what exactly is up.

Oh well, after seeing the move of cancel_dirty_page, I just went
backwards from __set_page_dirty using cscope + some smart guessing and
quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)

  As try_to_free_buffers got its ext3 hack back in
  ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
  3e67c0987d7567ad41164a153dca9a43b11d should be reverted? (Except for
  the accounting fix in cancel_dirty_page, of course).
 
 Yes, I think we have room for cleanups now, and I agree: we ended up 
 reinstating some questionable code in the VM just because we didn't really 
 know or understand what was going on in the ext3 journal code. 

Hm, you attributed more to my mail than there was actually in it. I
didn't even start to think of cleanups (because I don't know jack about
the whole ext3/jbd stuff, so I simply cannot come up with any cleanups
(yet?)). What I meant is that we only did a half-revert of that hackery.

When try_to_free_buffers started to check for PG_dirty, the
cancel_dirty_page call had to be called before do_invalidatepage, to
fix a _huge_ leak.  But that caused the accounting breakage we're now
seeing, because we never account for the pages that got redirtied during
do_invalidatepage.

Then the change to try_to_free_buffers got reverted, so we no longer
need to call cancel_dirty_page before do_invalidatepage, but still we
do. Thus the accounting bug remains. So what I meant to suggest was
simply to actually finish the revert we started.

Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
	if (page->mapping != mapping)
		return;
 
-   cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
if (PagePrivate(page))
do_invalidatepage(page, 0);
 
+   cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
remove_from_page_cache(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate
accounting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?

If that fixes the problem for now, I'll try to come up with some
usable commit message, or if someone wants to beat me to it, you can
already have my

Signed-off-by: Björn Steinbrink [EMAIL PROTECTED]

  On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
  io accounting for cancelled writes always happened if the page
  was dirty, regardless of page->mapping. This was also already true for
  the old test_clear_page_dirty code, and the commit log for
  8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
  change either, so maybe the if (account_size) block should be moved
  out of the if (mapping && ...) block?
 
 I think the if (account_size) thing was *purely* for me being worried 
 about hugetlb entries, and I think that's the only thing that passes in a 
 zero account size.
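
(For reference: the "if (account_size)" test being discussed lives in cancel_dirty_page() in mm/truncate.c. A simplified sketch of the ~2.6.24 logic, paraphrased from memory rather than quoted verbatim:)

void cancel_dirty_page(struct page *page, unsigned int account_size)
{
	if (TestClearPageDirty(page)) {
		struct address_space *mapping = page->mapping;

		/* only mappings that do dirty accounting get adjusted */
		if (mapping && mapping_cap_account_dirty(mapping)) {
			dec_zone_page_state(page, NR_FILE_DIRTY);
			/* the account_size test in question */
			if (account_size)
				task_io_account_cancelled_write(account_size);
		}
	}
}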
 
 But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we 
 cannot account 

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Nick Piggin
On Friday 21 December 2007 06:24, Linus Torvalds wrote:
 On Thu, 20 Dec 2007, Jan Kara wrote:
As I wrote in my previous email, this solution works but hides the
  fact that the page really *has* dirty data in it and *is* pinned in
  memory until the commit code gets to writing it. So in theory it could
  disturb the writeout logic by having more dirty data in memory than vm
  thinks it has. Not that I'd have a better fix now but I wanted to point
  out this problem.

 Well, I worry more about the VM being sane - and by the time we actually
 hit this case, as far as VM sanity is concerned, the page no longer really
 exists. It's been removed from the page cache, and it only really exists
 as any other random kernel allocation.

It does allow the VM to just not worry about this. However I don't
really like these kinds of catch-all conditions that are hard to get
rid of and can encourage bad behaviour.

It would be nice if the insane things were made to clean up after
themselves.


 The fact that low-level filesystems (in this case ext3 journaling) do
 their own insane things is not something the VM even _should_ care about.
 It's just an internal FS allocation, and the FS can do whatever the hell
 it wants with it, including doing IO etc.

 The kernel doesn't consider any other random IO pages to be dirty either
 (eg if you do direct-IO writes using low-level SCSI commands, the VM
 doesn't consider that to be any special dirty stuff, it's just random page
 allocations again). This is really no different.

 In other words: the Linux VM subsystem is really two different parts: the
 low-level page allocator (which obviously knows that the page is still in
 *use*, since it hasn't been free'd), and the higher-level file mapping and
 caching stuff that knows about things like page dirtiness. And once
 you've done a remove_from_page_cache(), the higher levels are no longer
 involved, and dirty accounting simply doesn't get into the picture.

That's all true... it would simply be nice to ask the filesystems to do
this. But anyway I think your patch is pretty reasonable for the moment.


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-19 Thread Nick Piggin
On Thursday 20 December 2007 12:05, Jan Kara wrote:
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > I'll confirm this tomorrow but it seems that even switching to
> > > data=ordered (AFAIK default o ext3) is indeed enough to cure this
> > > problem.
> >
> > Ok, do we actually have any ext3 expert following this? I have no idea
> > about what the journalling code does, but I have painful memories of ext3
> > doing really odd buffer-head-based IO and totally bypassing all the
> > normal page dirty logic.
> >
> > Judging by the symptoms (sorry for not following this well, it came up
> > while I was mostly away travelling), something probably *does* clear the
> > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > so the kernel keeps thinking it has dirty pages.
> >
> > Now, a simple "grep" shows that ext3 does not actually do any
> > ClearPageDirty() or similar on its own, although maybe I missed some
> > other subtle way this can happen. And the *normal* VFS routines that do
> > ClearPageDirty should all be doing the proper accounting.
> >
> > So I see a couple of possible cases:
> >
> >  - actually clearing the PG_dirty bit somehow, without doing the
> >accounting.
> >
> >This looks very unlikely. PG_dirty is always cleared by some variant
> > of "*ClearPageDirty()", and that bit definition isn't used for anything
> > else in the whole kernel judging by "grep" (the page allocator tests the
> > bit, that's it).
> >
> >And there aren't that many hits for ClearPageDirty, and they all seem
> >to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if
> > the mapping has dirty state accounting.
> >
> >The exceptions seem to be:
> > - the page freeing path, but that path checks that "mapping" is NULL
> >   (so no accounting), and would complain loudly if it wasn't
> > - the swap state stuff ("move_from_swap_cache()"), but that should
> >   only ever trigger for swap cache pages (we have a BUG_ON() in that
> >   path), and those don't do dirty accounting anyway.
> > - pageout(), but again only for pages that have a NULL mapping.
> >
> >  - ext3 might be clearing (probably indirectly) the "page->mapping" thing
> >or similar, which in turn will make the VFS think that even a dirty
> >page isn't actually to be accounted for - so when the page *turned*
> >dirty, it was accounted as a dirty page, but then, when it was
> > cleaned, the accounting wasn't reversed because ->mapping had become
> > NULL.
> >
> >This would be some interaction with the truncation logic, and quite
> >frankly, that should be all shared with the non-journal case, so I
> > find this all very unlikely.
> >
> > However, that second case is interesting, because the pageout case
> > actually has a comment like this:
> >
> > /*
> >  * Some data journaling orphaned pages can have
> >  * page->mapping == NULL while being dirty with clean buffers.
> >  */
> >
> > which really sounds like the case in question.
> >
> > I may know the VM, but that special case was added due to insane
> > journaling filesystems, and I don't know what insane things they do.
> > Which is why I'm wondering if there is any ext3 person who knows the
> > journaling code?
>
>   Yes, I'm looking into the problem... I think those orphan pages
> without mapping are created because we cannot drop truncated
> buffers/pages immediately.  There can be a committing transaction that
> still needs the data in those buffers and until it commits we have to
> keep the pages (and even maybe write them to disk etc.). But eventually,
> we should write the buffers, call try_to_free_buffers() which calls
> cancel_dirty_page() and everything should be happy... in theory ;)

If mapping is NULL, then try_to_free_buffers won't call cancel_dirty_page,
I think?

I don't know whether ext3 can be changed to not require/allow these dirty
pages, but I would rather Linus's dirty page accounting fix to go into that
path (the /* can this still happen? */ in try_to_free_buffers()), if possible.

Then you could also have a WARN_ON in __remove_from_page_cache().
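
(For context, the relevant part of try_to_free_buffers() in fs/buffer.c around 2.6.24, heavily simplified and paraphrased from memory rather than quoted verbatim:)

int try_to_free_buffers(struct page *page)
{
	struct address_space * const mapping = page->mapping;
	struct buffer_head *buffers_to_free = NULL;
	int ret = 0;

	if (mapping == NULL) {		/* can this still happen? */
		/* orphaned page: the buffers get dropped, but we never
		 * reach cancel_dirty_page(), so PG_dirty stays set */
		ret = drop_buffers(page, &buffers_to_free);
		goto out;
	}

	spin_lock(&mapping->private_lock);
	ret = drop_buffers(page, &buffers_to_free);
	spin_unlock(&mapping->private_lock);

	/* ext3 writes its buffers by hand, so we can end up with clean
	 * buffers against a dirty page; clean the page (and fix the
	 * dirty accounting) here */
	if (ret)
		cancel_dirty_page(page, PAGE_CACHE_SIZE);
out:
	/* ... free the detached buffer_heads ... */
	return ret;
}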


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-19 Thread Jan Kara
> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > 
> > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > (AFAIK default o ext3) is indeed enough to cure this problem.
> 
> Ok, do we actually have any ext3 expert following this? I have no idea 
> about what the journalling code does, but I have painful memories of ext3 
> doing really odd buffer-head-based IO and totally bypassing all the normal 
> page dirty logic.
> 
> Judging by the symptoms (sorry for not following this well, it came up 
> while I was mostly away travelling), something probably *does* clear the 
> dirty bit on the pages, but the dirty *accounting* is not done properly, 
> so the kernel keeps thinking it has dirty pages.
> 
> Now, a simple "grep" shows that ext3 does not actually do any 
> ClearPageDirty() or similar on its own, although maybe I missed some other 
> subtle way this can happen. And the *normal* VFS routines that do 
> ClearPageDirty should all be doing the proper accounting.
> 
> So I see a couple of possible cases:
> 
>  - actually clearing the PG_dirty bit somehow, without doing the 
>accounting.
> 
>This looks very unlikely. PG_dirty is always cleared by some variant of 
>"*ClearPageDirty()", and that bit definition isn't used for anything 
>else in the whole kernel judging by "grep" (the page allocator tests 
>the bit, that's it).
> 
>And there aren't that many hits for ClearPageDirty, and they all seem 
>to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the 
>mapping has dirty state accounting.
> 
>The exceptions seem to be:
> - the page freeing path, but that path checks that "mapping" is NULL 
>   (so no accounting), and would complain loudly if it wasn't
> - the swap state stuff ("move_from_swap_cache()"), but that should 
>   only ever trigger for swap cache pages (we have a BUG_ON() in that 
>   path), and those don't do dirty accounting anyway.
> - pageout(), but again only for pages that have a NULL mapping.
> 
>  - ext3 might be clearing (probably indirectly) the "page->mapping" thing 
>or similar, which in turn will make the VFS think that even a dirty 
>page isn't actually to be accounted for - so when the page *turned* 
>dirty, it was accounted as a dirty page, but then, when it was cleaned, 
>the accounting wasn't reversed because ->mapping had become NULL.
> 
>This would be some interaction with the truncation logic, and quite 
>frankly, that should be all shared with the non-journal case, so I find 
>this all very unlikely. 
> 
> However, that second case is interesting, because the pageout case 
> actually has a comment like this:
> 
>   /*
>* Some data journaling orphaned pages can have
>* page->mapping == NULL while being dirty with clean buffers.
>*/
> 
> which really sounds like the case in question. 
> 
> I may know the VM, but that special case was added due to insane 
> journaling filesystems, and I don't know what insane things they do. Which 
> is why I'm wondering if there is any ext3 person who knows the journaling 
> code?
  Yes, I'm looking into the problem... I think those orphan pages
without mapping are created because we cannot drop truncated
buffers/pages immediately.  There can be a committing transaction that
still needs the data in those buffers and until it commits we have to
keep the pages (and even maybe write them to disk etc.). But eventually,
we should write the buffers, call try_to_free_buffers() which calls
cancel_dirty_page() and everything should be happy... in theory ;)
  In practice, I have not yet narrowed down where the problem is.
fsx-linux is able to trigger the problem on my test machine so as
suspected it is some bad interaction of writes (plain writes, no mmap),
truncates and probably writeback. Small tests don't seem to trigger the
problem (fsx needs at least a few hundred operations to trigger the
problem) - on the other hand when some sequence of operations causes
lost dirty pages, they are lost deterministically in every run. Also the
file fsx operates on can be fairly small - 2MB was enough - so page
reclaim and such stuff probably isn't the thing we interact with.
  Tomorrow I'll try more...

> How/when does it ever "orphan" pages? Because yes, if it ever does that, 
> and clears the ->mapping field on a mapped page, then that page will have 
> incremented the dirty counts when it became dirty, but will *not* 
> decrement the dirty count when it is an orphan.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-19 Thread Linus Torvalds


On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> 
> I'll confirm this tomorrow but it seems that even switching to data=ordered
> (AFAIK default o ext3) is indeed enough to cure this problem.

Ok, do we actually have any ext3 expert following this? I have no idea 
about what the journalling code does, but I have painful memories of ext3 
doing really odd buffer-head-based IO and totally bypassing all the normal 
page dirty logic.

Judging by the symptoms (sorry for not following this well, it came up 
while I was mostly away travelling), something probably *does* clear the 
dirty bit on the pages, but the dirty *accounting* is not done properly, 
so the kernel keeps thinking it has dirty pages.

Now, a simple "grep" shows that ext3 does not actually do any 
ClearPageDirty() or similar on its own, although maybe I missed some other 
subtle way this can happen. And the *normal* VFS routines that do 
ClearPageDirty should all be doing the proper accounting.

So I see a couple of possible cases:

 - actually clearing the PG_dirty bit somehow, without doing the 
   accounting.

   This looks very unlikely. PG_dirty is always cleared by some variant of 
   "*ClearPageDirty()", and that bit definition isn't used for anything 
   else in the whole kernel judging by "grep" (the page allocator tests 
   the bit, that's it).

   And there aren't that many hits for ClearPageDirty, and they all seem 
   to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the 
   mapping has dirty state accounting.

   The exceptions seem to be:
- the page freeing path, but that path checks that "mapping" is NULL 
  (so no accounting), and would complain loudly if it wasn't
- the swap state stuff ("move_from_swap_cache()"), but that should 
  only ever trigger for swap cache pages (we have a BUG_ON() in that 
  path), and those don't do dirty accounting anyway.
- pageout(), but again only for pages that have a NULL mapping.

 - ext3 might be clearing (probably indirectly) the "page->mapping" thing 
   or similar, which in turn will make the VFS think that even a dirty 
   page isn't actually to be accounted for - so when the page *turned* 
   dirty, it was accounted as a dirty page, but then, when it was cleaned, 
   the accounting wasn't reversed because ->mapping had become NULL.

   This would be some interaction with the truncation logic, and quite 
   frankly, that should be all shared with the non-journal case, so I find 
   this all very unlikely. 

However, that second case is interesting, because the pageout case 
actually has a comment like this:

/*
 * Some data journaling orphaned pages can have
 * page->mapping == NULL while being dirty with clean buffers.
 */

which really sounds like the case in question. 
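
(That comment sits in pageout() in mm/vmscan.c; the surrounding special case, roughly, sketched from memory rather than quoted verbatim:)

	if (!mapping) {
		/*
		 * Some data journaling orphaned pages can have
		 * page->mapping == NULL while being dirty with clean buffers.
		 */
		if (PagePrivate(page)) {
			if (try_to_free_buffers(page)) {
				ClearPageDirty(page);
				printk("%s: orphaned page\n", __func__);
				return PAGE_CLEAN;
			}
		}
		return PAGE_KEEP;
	}

Note that the ClearPageDirty() here is the pageout() exception listed above: it only runs when mapping is NULL, so no accounting is involved.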

I may know the VM, but that special case was added due to insane 
journaling filesystems, and I don't know what insane things they do. Which 
is why I'm wondering if there is any ext3 person who knows the journaling 
code?

How/when does it ever "orphan" pages? Because yes, if it ever does that, 
and clears the ->mapping field on a mapped page, then that page will have 
incremented the dirty counts when it became dirty, but will *not* 
decrement the dirty count when it is an orphan.

> Two questions remain then: why system dies when dirty reaches ~200MB and what
> is wrong with ext3+data=journal with >=2.6.20-rc2?

Well, that one is probably pretty straightforward: since the kernel thinks 
that there are too many dirty pages, it will ask everybody who creates 
more dirty pages to clean out some *old* dirty pages, but since they don't 
exist, the whole thing will basically wait forever for a writeout to clean 
things out that will never happen.

200MB is 10% of your 2GB of low-mem RAM, and 10% is the default 
dirty_ratio that causes synchronous waits for writeback. If you use the 
normal 3:1 VM split, the hang should happen even earlier (at the ~100MB 
"dirty" mark).

So that part isn't the bug. The bug is in the accounting, but I'm pretty 
damn sure that the core VM itself is pretty ok, since that code has now 
been stable for people for the last year or so. It seems that ext3 (with 
data journaling) does something dodgy wrt some page.

But how about trying this appended patch. It should warn a few times if 
some page is ever removed from a mapping while it's dirty (and the mapping 
is one that should have been accounted). It also tries to "fix up" the 
case, so *if* this is the cause, it should also fix the bug.

I'd love to hear if you get any stack dumps with this, and what the 
backtrace is (and whether the dirty counts then stay ok).

The patch is totally untested. It compiles for me. That's all I can say.

(There's a few other places that set ->mapping to NULL, but they're pretty 
esoteric. Page migration? Stuff like that).

Linus

---
 mm/filemap.c |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)
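
(The diff itself is cut off in this archive. As a rough illustration only, not the actual patch, the warn-and-fix-up idea described above, applied to __remove_from_page_cache() in mm/filemap.c, might look something like this:)

/* illustration of the idea only, not the (truncated) patch itself */
void __remove_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	radix_tree_delete(&mapping->page_tree, page->index);
	page->mapping = NULL;
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	BUG_ON(page_mapped(page));

	/*
	 * If the page is still dirty here, some filesystem (eg ext3 with
	 * data=journal) redirtied it behind the VM's back.  Warn a few
	 * times so we get a backtrace, and undo the dirty accounting so
	 * the counters stay sane.
	 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
		static int warncount = 5;

		if (warncount > 0) {
			warncount--;
			WARN_ON(1);
		}
		dec_zone_page_state(page, NR_FILE_DIRTY);
	}
}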


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-17 Thread Krzysztof Oledzki



On Sun, 16 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:


Which filesystem, which mount options


  - ext3 on RAID1 (MD): / - rootflags=data=journal


It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?


I'll confirm this tomorrow but it seems that even switching to
data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.


yes, sorry, I meant ordered.


OK, I can confirm that the problem is with data=journal. With data=ordered 
I get:


# uname -rns;uptime;sync;sleep 1;sync ;sleep 1; sync;grep Dirty /proc/meminfo
Linux cougar 2.6.24-rc5
 17:50:34 up 1 day, 20 min,  1 user,  load average: 0.99, 0.48, 0.35
Dirty:   0 kB


Two questions remain then: why system dies when dirty reaches ~200MB


I think you have ~2G of RAM and you're running with
/proc/sys/vm/dirty_ratio=10, yes?

If so, when that machine hits 10% * 2G of dirty memory then everyone who
wants to dirty pages gets blocked.


Oh, right. Thank you for the explanation.


and what is wrong with ext3+data=journal with >=2.6.20-rc2?


Ah.  It has a bug in it ;)

As I said, data=journal has exceptional handling of pagecache data and is
not well tested.  Someone (and I'm not sure who) will need to get in there
and fix it.


OK, I'm willing to test it. ;)

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-17 Thread Jan Kara
> On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL 
> PROTECTED]> wrote:
> 
> > >>> Which filesystem, which mount options
> > >>
> > >>   - ext3 on RAID1 (MD): / - rootflags=data=journal
> > >
> > > It wouldn't surprise me if this is specific to data=journal: that
> > > journalling mode is pretty complex wrt dirty-data handling and isn't well
> > > tested.
> > >
> > > Does switching that to data=writeback change things?
> > 
> > I'll confirm this tomorrow but it seems that even switching to 
> > data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.
> 
> yes, sorry, I meant ordered.
> 
> > Two questions remain then: why system dies when dirty reaches ~200MB 
> 
> I think you have ~2G of RAM and you're running with 
> /proc/sys/vm/dirty_ratio=10, yes?
> 
> If so, when that machine hits 10% * 2G of dirty memory then everyone who
> wants to dirty pages gets blocked.
> 
> > and what is wrong with ext3+data=journal with >=2.6.20-rc2?
> 
> Ah.  It has a bug in it ;)
> 
> As I said, data=journal has exceptional handling of pagecache data and is
> not well tested.  Someone (and I'm not sure who) will need to get in there
> and fix it.
  It seems fsx-linux is able to trigger the leak on my test machine so
I'll have a look into it (not sure if I'll get to it today but I should
find some time for it this week)...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Andrew Morton
On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:

> >>> Which filesystem, which mount options
> >>
> >>   - ext3 on RAID1 (MD): / - rootflags=data=journal
> >
> > It wouldn't surprise me if this is specific to data=journal: that
> > journalling mode is pretty complex wrt dirty-data handling and isn't well
> > tested.
> >
> > Does switching that to data=writeback change things?
> 
> I'll confirm this tomorrow but it seems that even switching to 
> data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.

yes, sorry, I meant ordered.

> Two questions remain then: why system dies when dirty reaches ~200MB 

I think you have ~2G of RAM and you're running with 
/proc/sys/vm/dirty_ratio=10, yes?

If so, when that machine hits 10% * 2G of dirty memory then everyone who
wants to dirty pages gets blocked.

> and what is wrong with ext3+data=journal with >=2.6.20-rc2?

Ah.  It has a bug in it ;)

As I said, data=journal has exceptional handling of pagecache data and is
not well tested.  Someone (and I'm not sure who) will need to get in there
and fix it.


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Krzysztof Oledzki



On Sun, 16 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:




On Sat, 15 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:




On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


--- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
Krzysztof, I'd hate point you to a hard path (at least time consuming), but
you've done a lot of digging by now anyway. How about git bisecting between
2.6.20-rc2 and rc1? Here is great info on bisecting:
http://www.kernel.org/doc/local/git-quick.html


As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
So it took me only 2 reboots. ;)

The guilty patch is the one I proposed just an hour ago:
  
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
  - 2.6.20-rc1: OK
  - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
  - 2.6.20-rc1-git8: very BAD
  - 2.6.20-rc2: very BAD
  - 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)



well..  We have code which has been used by *everyone* for a year and it's
misbehaving for you alone.


No, not for me alone. Probably only I and Thomas Osterried have systems
where it is so easy to reproduce. Please note that the problem exists on
all my systems, but only on one is it critical. It is enough to run
"sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.
With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only
on one it goes to ~200MB in about 2 weeks and then everything dies:
http://bugzilla.kernel.org/attachment.cgi?id=13824
http://bugzilla.kernel.org/attachment.cgi?id=13825
http://bugzilla.kernel.org/attachment.cgi?id=13826
http://bugzilla.kernel.org/attachment.cgi?id=13827


 I wonder what you're doing that is different/special.

Me too. :|


Which filesystem, which mount options


  - ext3 on RAID1 (MD): / - rootflags=data=journal


It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?


I'll confirm this tomorrow but it seems that even switching to 
data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.


Two questions remain then: why system dies when dirty reaches ~200MB 
and what is wrong with ext3+data=journal with >=2.6.20-rc2?


Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Krzysztof Oledzki



On Sun, 16 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182





--- Comment #39 from [EMAIL PROTECTED]  2007-12-16 01:58 ---


So:
  - 2.6.20-rc1: OK
  - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
  - 2.6.20-rc1-git8: very BAD
  - 2.6.20-rc2: very BAD
  - 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)


based on the great info you already acquired, you should be able to
bisect this rather effectively, via:

2.6.20-rc1-git8 == 921320210bd2ec4f17053d283355b73048ac0e56

$ git-bisect start
$ git-bisect bad 921320210bd2ec4f17053d283355b73048ac0e56
$ git-bisect good v2.6.20-rc1
Bisecting: 133 revisions left to test after this

so about 7-8 bootups would pinpoint the breakage.


Except that I have very limited time where I can do my tests on this host. 
Please also note that it takes about ~2h after a reboot, to be 100% sure. 
So, 7-8 bootups => 14-16h. :|



It would likely pinpoint fba2591b, so it would perhaps be best to first
attempt a revert of fba2591b on a recent kernel.


I wish I could: :(

[EMAIL PROTECTED]:/usr/src/linux-2.6.23.9$ cat ..p1 |patch -p1 --dry-run -R
patching file fs/hugetlbfs/inode.c
Hunk #1 succeeded at 203 (offset 27 lines).
patching file include/linux/page-flags.h
Hunk #1 succeeded at 262 (offset 9 lines).
patching file mm/page-writeback.c
Hunk #1 succeeded at 903 (offset 58 lines).
patching file mm/truncate.c
Unreversed patch detected!  Ignore -R? [n] y
Hunk #1 succeeded at 52 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 85.
Hunk #3 FAILED at 365.
Hunk #4 FAILED at 400.
3 out of 4 hunks FAILED -- saving rejects to file mm/truncate.c.rej

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Andrew Morton
On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:

> 
> 
> On Sat, 15 Dec 2007, Andrew Morton wrote:
> 
> > On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL 
> > PROTECTED]> wrote:
> >
> >>
> >>
> >> On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> >>
> >>> http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >>>
> >>>
> >>> --- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
> >>> Krzysztof, I'd hate point you to a hard path (at least time consuming), 
> >>> but
> >>> you've done a lot of digging by now anyway. How about git bisecting 
> >>> between
> >>> 2.6.20-rc2 and rc1? Here is great info on bisecting:
> >>> http://www.kernel.org/doc/local/git-quick.html
> >>
> >> As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
> >> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
> >> So it took me only 2 reboots. ;)
> >>
> >> The guilty patch is the one I proposed just an hour ago:
> >>   
> >> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> >>
> >> So:
> >>   - 2.6.20-rc1: OK
> >>   - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 
> >> reverted: OK
> >>   - 2.6.20-rc1-git8: very BAD
> >>   - 2.6.20-rc2: very BAD
> >>   - 2.6.20-rc4: very BAD
> >>   - >= 2.6.20: BAD (but not *very* BAD!)
> >>
> >
> > well..  We have code which has been used by *everyone* for a year and it's
> > misbehaving for you alone.
> 
> No, not for me alone. Probably only I and Thomas Osterried have systems 
> where it is so easy to reproduce. Please note that the problem exists on 
> my all systems, but only on one it is critical. It is enough to run
> "sync; sleep 1; sunc; sleep 1; sync; grep Drirty /proc/meminfo" to be sure. 
> With =>2.6.20-rc1-git8 it *never* falls to 0 an *all* my hosts but only 
> on one it goes to ~200MB in about 2 weeks and then everything dies:
> http://bugzilla.kernel.org/attachment.cgi?id=13824
> http://bugzilla.kernel.org/attachment.cgi?id=13825
> http://bugzilla.kernel.org/attachment.cgi?id=13826
> http://bugzilla.kernel.org/attachment.cgi?id=13827
> 
> >  I wonder what you're doing that is different/special.
> Me to. :|
> 
> > Which filesystem, which mount options
> 
>   - ext3 on RAID1 (MD): / - rootflags=data=journal

It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?

THomas, do you have ext3 data=journal on any filesytems?

>   - ext3 on LVM on RAID5 (MD)
>   - nfs
> 
> /dev/md0 on / type ext3 (rw)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
> devpts on /dev/pts type devpts (rw,nosuid,noexec)
> /dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 
> (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 
> (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 
> (rw,nosuid,nodev,noatime)
> shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
> usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
> owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs 
> (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)
> 
> 
> > what sort of workload?
> Different, depending on a host: mail (postfix + amavisd + spamassasin + 
> clamav + sqlgray), squid, mysql, apache, nfs, rsync,  But it seems 
> that the biggest problem is on the host running mentioned mail service.
> 



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Krzysztof Oledzki



On Sat, 15 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:




On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


--- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
Krzysztof, I'd hate point you to a hard path (at least time consuming), but
you've done a lot of digging by now anyway. How about git bisecting between
2.6.20-rc2 and rc1? Here is great info on bisecting:
http://www.kernel.org/doc/local/git-quick.html


As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
So it took me only 2 reboots. ;)

The guilty patch is the one I proposed just an hour ago:
  
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
  - 2.6.20-rc1: OK
  - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
  - 2.6.20-rc1-git8: very BAD
  - 2.6.20-rc2: very BAD
  - 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)



well..  We have code which has been used by *everyone* for a year and it's
misbehaving for you alone.


No, not for me alone. Probably only I and Thomas Osterried have systems 
where it is so easy to reproduce. Please note that the problem exists on 
all my systems, but only on one is it critical. It is enough to run
"sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure. 
With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only 
on one it goes to ~200MB in about 2 weeks and then everything dies:

http://bugzilla.kernel.org/attachment.cgi?id=13824
http://bugzilla.kernel.org/attachment.cgi?id=13825
http://bugzilla.kernel.org/attachment.cgi?id=13826
http://bugzilla.kernel.org/attachment.cgi?id=13827


 I wonder what you're doing that is different/special.

Me too. :|


Which filesystem, which mount options


 - ext3 on RAID1 (MD): / - rootflags=data=journal
 - ext3 on LVM on RAID5 (MD)
 - nfs

/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 
(rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 
(rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 
(rw,nosuid,nodev,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs 
(ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)



what sort of workload?
Different, depending on the host: mail (postfix + amavisd + spamassasin + 
clamav + sqlgray), squid, mysql, apache, nfs, rsync, ... But it seems 
that the biggest problem is on the host running the mentioned mail service.


Thanks.

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Andrew Morton
On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki [EMAIL PROTECTED] 
wrote:

  Which filesystem, which mount options
 
- ext3 on RAID1 (MD): / - rootflags=data=journal
 
  It wouldn't surprise me if this is specific to data=journal: that
  journalling mode is pretty complex wrt dairty-data handling and isn't well
  tested.
 
  Does switching that to data=writeback change things?
 
 I'll confirm this tomorrow but it seems that even switching to 
 data=ordered (AFAIK default on ext3) is indeed enough to cure this problem.

yes, sorry, I meant ordered.

 Two questions remain then: why the system dies when dirty reaches ~200MB 

I think you have ~2G of RAM and you're running with 
/proc/sys/vm/dirty_ratio=10, yes?

If so, when that machine hits 10% * 2G of dirty memory then everyone who
wants to dirty pages gets blocked.
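
(Illustration only, not from the thread: the threshold described above can be
estimated as dirty_ratio percent of MemTotal. The straight-percentage model is
an assumption; the kernel's real calculation is based on dirtyable memory, so
this is only an approximation.)

/* Sketch: read MemTotal and vm.dirty_ratio and print the approximate point
 * at which writers start to get throttled. With ~2 GB of RAM and
 * dirty_ratio=10 this is roughly 200 MB. */
#include <stdio.h>

int main(void)
{
	char line[256];
	long total_kb = -1, ratio = -1;
	FILE *f;

	f = fopen("/proc/meminfo", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "MemTotal: %ld kB", &total_kb) == 1)
			break;
	}
	fclose(f);

	f = fopen("/proc/sys/vm/dirty_ratio", "r");
	if (!f || fscanf(f, "%ld", &ratio) != 1)
		return 1;
	fclose(f);

	if (total_kb < 0 || ratio < 0)
		return 1;

	printf("MemTotal=%ld kB, dirty_ratio=%ld%% -> writers throttle near %ld kB dirty\n",
	       total_kb, ratio, total_kb * ratio / 100);
	return 0;
}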

 and what is wrong with ext3+data=journal with >= 2.6.20-rc2?

Ah.  It has a bug in it ;)

As I said, data=journal has exceptional handling of pagecache data and is
not well tested.  Someone (and I'm not sure who) will need to get in there
and fix it.



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Andrew Morton
On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki [EMAIL PROTECTED] 
wrote:

 
 
 On Sat, 15 Dec 2007, Andrew Morton wrote:
 
  On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki [EMAIL 
  PROTECTED] wrote:
 
 
 
  On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
 
  http://bugzilla.kernel.org/show_bug.cgi?id=9182
 
 
  --- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
  Krzysztof, I'd hate to point you to a hard path (at least time-consuming), 
  but
  you've done a lot of digging by now anyway. How about git bisecting 
  between
  2.6.20-rc2 and rc1? Here is great info on bisecting:
  http://www.kernel.org/doc/local/git-quick.html
 
  As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad
  as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
  So it took me only 2 reboots. ;)
 
  The guilty patch is the one I proposed just an hour ago:

  http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
 
  So:
- 2.6.20-rc1: OK
- 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 
  reverted: OK
- 2.6.20-rc1-git8: very BAD
- 2.6.20-rc2: very BAD
- 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)
 
 
  well..  We have code which has been used by *everyone* for a year and it's
  misbehaving for you alone.
 
 No, not for me alone. Probably only I and Thomas Osterried have systems 
 where it is so easy to reproduce. Please note that the problem exists on 
 all my systems, but only on one it is critical. It is enough to run
 "sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure. 
 With >= 2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only 
 on one it goes to ~200MB in about 2 weeks and then everything dies:
 http://bugzilla.kernel.org/attachment.cgi?id=13824
 http://bugzilla.kernel.org/attachment.cgi?id=13825
 http://bugzilla.kernel.org/attachment.cgi?id=13826
 http://bugzilla.kernel.org/attachment.cgi?id=13827
 
   I wonder what you're doing that is different/special.
 Me too. :|
 
  Which filesystem, which mount options
 
   - ext3 on RAID1 (MD): / - rootflags=data=journal

It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?

Thomas, do you have ext3 data=journal on any filesystems?

   - ext3 on LVM on RAID5 (MD)
   - nfs
 
 /dev/md0 on / type ext3 (rw)
 proc on /proc type proc (rw)
 sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
 devpts on /dev/pts type devpts (rw,nosuid,noexec)
 /dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
 /dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
 /dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 
 (rw,nosuid,nodev,noatime,data=writeback)
 /dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 
 (rw,nosuid,nodev,noatime,data=writeback)
 /dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 
 (rw,nosuid,nodev,noatime)
 shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
 usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
 owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs 
 (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)
 
 
  what sort of workload?
 Different, depending on the host: mail (postfix + amavisd + spamassassin + 
 clamav + sqlgray), squid, mysql, apache, nfs, rsync,  But it seems 
 that the biggest problem is on the host running the mentioned mail service.
 



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Krzysztof Oledzki



On Sun, 16 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182





--- Comment #39 from [EMAIL PROTECTED]  2007-12-16 01:58 ---


So:
  - 2.6.20-rc1: OK
  - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
  - 2.6.20-rc1-git8: very BAD
  - 2.6.20-rc2: very BAD
  - 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)


based on the great info you already acquired, you should be able to
bisect this rather effectively, via:

2.6.20-rc1-git8 == 921320210bd2ec4f17053d283355b73048ac0e56

$ git-bisect start
$ git-bisect bad 921320210bd2ec4f17053d283355b73048ac0e56
$ git-bisect good v2.6.20-rc1
Bisecting: 133 revisions left to test after this

so about 7-8 bootups would pinpoint the breakage.


Except that I have very limited time where I can do my tests on this host. 
Please also note that it takes about ~2h after a reboot, to be 100% sure. 
So, 7-8 bootups = 14-16h. :|



It would likely pinpoint fba2591b, so it would perhaps be best to first
attempt a revert of fba2591b on a recent kernel.


I wish I could: :(

[EMAIL PROTECTED]:/usr/src/linux-2.6.23.9$ cat ..p1 |patch -p1 --dry-run -R
patching file fs/hugetlbfs/inode.c
Hunk #1 succeeded at 203 (offset 27 lines).
patching file include/linux/page-flags.h
Hunk #1 succeeded at 262 (offset 9 lines).
patching file mm/page-writeback.c
Hunk #1 succeeded at 903 (offset 58 lines).
patching file mm/truncate.c
Unreversed patch detected!  Ignore -R? [n] y
Hunk #1 succeeded at 52 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 85.
Hunk #3 FAILED at 365.
Hunk #4 FAILED at 400.
3 out of 4 hunks FAILED -- saving rejects to file mm/truncate.c.rej

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-16 Thread Krzysztof Oledzki



On Sun, 16 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki [EMAIL PROTECTED] 
wrote:




On Sat, 15 Dec 2007, Andrew Morton wrote:


On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki [EMAIL PROTECTED] 
wrote:




On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


--- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
Krzysztof, I'd hate to point you to a hard path (at least time-consuming), but
you've done a lot of digging by now anyway. How about git bisecting between
2.6.20-rc2 and rc1? Here is great info on bisecting:
http://www.kernel.org/doc/local/git-quick.html


As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad
as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
So it took me only 2 reboots. ;)

The guilty patch is the one I proposed just an hour ago:
  
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
  - 2.6.20-rc1: OK
  - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
  - 2.6.20-rc1-git8: very BAD
  - 2.6.20-rc2: very BAD
  - 2.6.20-rc4: very BAD
  - >= 2.6.20: BAD (but not *very* BAD!)



well..  We have code which has been used by *everyone* for a year and it's
misbehaving for you alone.


No, not for me alone. Probably only I and Thomas Osterried have systems
where it is so easy to reproduce. Please note that the problem exists on
all my systems, but only on one it is critical. It is enough to run
"sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.
With >= 2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts but only
on one it goes to ~200MB in about 2 weeks and then everything dies:
http://bugzilla.kernel.org/attachment.cgi?id=13824
http://bugzilla.kernel.org/attachment.cgi?id=13825
http://bugzilla.kernel.org/attachment.cgi?id=13826
http://bugzilla.kernel.org/attachment.cgi?id=13827


 I wonder what you're doing that is different/special.

Me too. :|


Which filesystem, which mount options


  - ext3 on RAID1 (MD): / - rootflags=data=journal


It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?


I'll confirm this tomorrow but it seems that even switching to 
data=ordered (AFAIK default on ext3) is indeed enough to cure this problem.


Two questions remain then: why the system dies when dirty reaches ~200MB 
and what is wrong with ext3+data=journal with >= 2.6.20-rc2?


Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-15 Thread Andrew Morton
On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[EMAIL PROTECTED]> 
wrote:

> 
> 
> On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >
> >
> > --- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
> > Krzysztof, I'd hate to point you to a hard path (at least time-consuming), but
> > you've done a lot of digging by now anyway. How about git bisecting between
> > 2.6.20-rc2 and rc1? Here is great info on bisecting:
> > http://www.kernel.org/doc/local/git-quick.html
> 
> As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad 
> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. 
> So it took me only 2 reboots. ;)
> 
> The guilty patch is the one I proposed just an hour ago:
>   
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> 
> So:
>   - 2.6.20-rc1: OK
>   - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
>   - 2.6.20-rc1-git8: very BAD
>   - 2.6.20-rc2: very BAD
>   - 2.6.20-rc4: very BAD
>   - >= 2.6.20: BAD (but not *very* BAD!)
> 

well..  We have code which has been used by *everyone* for a year and it's
misbehaving for you alone.  I wonder what you're doing that is
different/special.

Which filesystem, which mount options, what sort of workload?

Thanks.


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-15 Thread Krzysztof Oledzki



On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


--- Comment #33 from [EMAIL PROTECTED]  2007-12-15 14:19 ---
Krzysztof, I'd hate to point you to a hard path (at least time-consuming), but
you've done a lot of digging by now anyway. How about git bisecting between
2.6.20-rc2 and rc1? Here is great info on bisecting:
http://www.kernel.org/doc/local/git-quick.html


As I'm smarter than git-bisect I can tell that 2.6.20-rc1-git8 is as bad 
as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK. 
So it took me only 2 reboots. ;)


The guilty patch is the one I proposed just an hour ago:
 
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
 - 2.6.20-rc1: OK
 - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
 - 2.6.20-rc1-git8: very BAD
 - 2.6.20-rc2: very BAD
 - 2.6.20-rc4: very BAD
 - >= 2.6.20: BAD (but not *very* BAD!)

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-15 Thread Krzysztof Oledzki


http://bugzilla.kernel.org/show_bug.cgi?id=9182


On Sat, 15 Dec 2007, Krzysztof Oledzki wrote:




On Thu, 13 Dec 2007, Krzysztof Oledzki wrote:




On Thu, 13 Dec 2007, Peter Zijlstra wrote:



On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:





BTW: Could someone please look at this problem? I feel a little ignored and
in my situation this is a critical regression.


I was hoping to get around to it today, but I guess tomorrow will have
to do :-/


Thanks.


So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0,
right?


Not only does it not fall, it continuously grows.


Does it happen with other filesystems as well?


Don't know. I generally only use ext3 and I'm afraid I'm not able to switch 
this system to another filesystem.



What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 
rw,nosuid,nodev,noatime,data=ordered 0 0


BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it was 
introduced but it is hard to do it on a highly critical production system, 
especially since it takes ~2h after a reboot, to be sure.


However, 2h is quite a good time; on other systems I have to wait ~2 months to 
get 20MB of leaked memory:


# uptime
13:29:34 up 58 days, 13:04,  9 users,  load average: 0.38, 0.27, 0.31

# sync;sync;sleep 1;sync;grep Dirt /proc/meminfo
Dirty:   23820 kB


More news, I hope this time my problem gets more attention from developers 
since now I have much more information.


So far I found that:
 - 2.6.20-rc4 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14057
 - 2.6.20-rc2 - bad: http://bugzilla.kernel.org/attachment.cgi?id=14058
 - 2.6.20-rc1 - OK (probably, I need to wait a little more to be 100% sure).

2.6.20-rc1 with 33m uptime:
~$ grep Dirt /proc/meminfo ;sync ; sleep 1 ; sync ; grep Dirt /proc/meminfo
Dirty:   10504 kB
Dirty:   0 kB

2.6.20-rc2 was released Dec 23/24 2006 (BAD)
2.6.20-rc1 was released Dec 13/14 2006 (GOOD?)

It seems that this bug was introduced exactly one year ago. Surprisingly, 
dirty memory in 2.6.20-rc2/2.6.20-rc4 leaks _much_ faster than in 
2.6.20-final and later kernels, as it took only about 6h to reach 172MB. 
So, this bug might have been cured afterward, but only a little.


There are three commits that may be somehow related:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=3e67c0987d7567ad41164a153dca9a43b11d
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commitdiff;h=5f2a105d5e33a038a717995d2738434f9c25aed2

I'm going to check 2.6.20-rc1-git... releases but it would be *very* nice 
if someone could finally give me a hand and offer some hints to help 
debug this problem.


Please note that none of my systems with kernels >= 2.6.20-rc1 is able to 
reach 0 kB of dirty memory, even after many syncs, even when idle.


Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-15 Thread Krzysztof Oledzki



On Thu, 13 Dec 2007, Krzysztof Oledzki wrote:




On Thu, 13 Dec 2007, Peter Zijlstra wrote:



On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:





BTW: Could someone please look at this problem? I feel a little ignored and
in my situation this is a critical regression.


I was hoping to get around to it today, but I guess tomorrow will have
to do :-/


Thanks.


So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0,
right?


Not only does it not fall, it continuously grows.


Does it happen with other filesystems as well?


Don't know. I generally only use ext3 and I'm afraid I'm not able to switch 
this system to another filesystem.



What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 
rw,nosuid,nodev,noatime,data=ordered 0 0


BTW: this regression also exists in 2.6.24-rc5. I'll try to find when it 
was introduced but it is hard to do it on a highly critical production 
system, especially since it takes ~2h after a reboot, to be sure.


However, 2h is quite a good time; on other systems I have to wait ~2 months 
to get 20MB of leaked memory:


# uptime
 13:29:34 up 58 days, 13:04,  9 users,  load average: 0.38, 0.27, 0.31

# sync;sync;sleep 1;sync;grep Dirt /proc/meminfo
Dirty:   23820 kB

Best regards,

Krzysztof Olędzki



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-13 Thread Krzysztof Oledzki



On Thu, 13 Dec 2007, Peter Zijlstra wrote:



On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:





BTW: Could someone please look at this problem? I feel a little ignored and 
in my situation this is a critical regression.


I was hoping to get around to it today, but I guess tomorrow will have
to do :-/


Thanks.


So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0,
right?


Not only does it not fall, it continuously grows.


Does it happen with other filesystems as well?


Don't know. I generally only use ext3 and I'm afraid I'm not able to 
switch this system to another filesystem.



What are your ext3 mount options?

/dev/root / ext3 rw,data=journal 0 0
/dev/VolGrp0/usr /usr ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/var /var ext3 rw,nodev,data=journal 0 0
/dev/VolGrp0/squid_spool /var/cache/squid/cd0 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/squid_spool2 /var/cache/squid/cd1 ext3 
rw,nosuid,nodev,noatime,data=writeback 0 0
/dev/VolGrp0/news_spool /var/spool/news ext3 
rw,nosuid,nodev,noatime,data=ordered 0 0

Best regards,

Krzysztof Olędzki

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-13 Thread Peter Zijlstra

On Thu, 2007-12-13 at 16:17 +0100, Krzysztof Oledzki wrote:
> 

> BTW: Could someone please look at this problem? I feel a little ignored and 
> in my situation this is a critical regression.

I was hoping to get around to it today, but I guess tomorrow will have
to do :-/

So, it's ext3, dirty some pages, sync, and dirty doesn't fall to 0,
right?

Does it happen with other filesystems as well?

What are your ext3 mount options?





Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-13 Thread Krzysztof Oledzki



On Mon, 3 Dec 2007, Thomas Osterried wrote:


On the machine which has troubles, the bug occurred within about 10 days
During these days, the amount of dirty pages increased, up to 400MB.
I have tested kernels 2.6.19, 2.6.20, 2.6.22.1 and 2.6.22.10 (with our config),
and even linux-2.6.20 from ubuntu-server. They have all shown that behaviour.





10 days ago, I installed kernel 2.6.18.5 on this machine (with backported
3ware controller code). I'm quite sure that this kernel will now fix our
severe stability problems on this production machine (currently:
Dirty:  472 kB, nr_dirty 118).
If so, it's the "latest" kernel I found usable, after half a year of pain.


Strange, my tests show that both 2.6.18(.8) and 2.6.19(.7) are OK and the 
first wrong kernel is 2.6.20.


BTW: Could someone please look at this problem? I feel a little ignored and 
in my situation this is a critical regression.


Best regards,

Krzysztof Olędzki


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-12 Thread Krzysztof Oledzki



On Tue, 11 Dec 2007, Krzysztof Oledzki wrote:




On Wed, 5 Dec 2007, Krzysztof Oledzki wrote:




On Wed, 5 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


--- Comment #20 from [EMAIL PROTECTED]  2007-12-05 13:37 ---
Please monitor the "Dirty:" record in /proc/meminfo.  Is it slowly rising
and never falling?


It is slowly rising, apart from small fluctuations caused by the current 
load.



Does it then fall if you run /bin/sync?
Only a little, by ~1-2MB like in a normal system. But it is not able to 
fall below a local minimum. So, after a first sync it does not fall further 
with additional syncs.



Compile up usemem.c from
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz and run

usemem -m N

where N is the number of megabytes which that machine has.


It has 2GB but:

# ./usemem -m 1662 ; echo $?
0

# ./usemem -m 1663 ; echo $?
./usemem: mmap failed: Cannot allocate memory
1


 Did this cause /proc/meminfo:Dirty to fall?

No.


OK, I booted a kernel without 2:2 memsplit but instead with a standard 
3.1:0.9 and even without highmem. So, now I have ~900MB and I am able to set 
-m to the number of megabytes which the machine has. However, usemem still 
does not cause dirty memory usage to fall. :(
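
(For reference, a usemem-like pressure tool is easy to sketch; the following
is an illustration under the assumption that usemem simply maps and touches N
megabytes, it is not the actual usemem.c from the ext3-tools tarball.)

/* Sketch of a usemem-like tool: map N MiB of anonymous memory and touch it
 * all, to apply memory pressure. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t mb, len;
	void *p;

	if (argc != 3 || strcmp(argv[1], "-m") != 0) {
		fprintf(stderr, "usage: %s -m megabytes\n", argv[0]);
		return 1;
	}
	mb = strtoul(argv[2], NULL, 10);
	len = mb << 20;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap failed");
		return 1;
	}
	memset(p, 0xaa, len);	/* fault in every page */
	munmap(p, len);
	return 0;
}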


OK, I can confirm that this is a regression from 2.6.18 where it works OK:

[EMAIL PROTECTED]:~$ uname -r
2.6.18.8

[EMAIL PROTECTED]:~$ uptime;grep Dirt /proc/meminfo;sync;sleep 2;sync;sleep 
1;sync;grep Dirt /proc/meminfo
 14:21:53 up  1:00,  1 user,  load average: 0.23, 0.36, 0.35
Dirty: 376 kB
Dirty:   0 kB

It seems that this leak also exists in my other system as even after many 
syncs the number of dirty pages is still >> 0, but this is the only one where 
it is so critical and, at the same time, so easy to reproduce.


Best regards,


Krzysztof Olędzki


Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-05 Thread Krzysztof Oledzki



On Wed, 5 Dec 2007, [EMAIL PROTECTED] wrote:


http://bugzilla.kernel.org/show_bug.cgi?id=9182


[EMAIL PROTECTED] changed:

           What|Removed                     |Added
----------------------------------------------------------------------------
      Component|Other                       |Other
  KernelVersion|2.6.22-stable/2.6.23-stable |2.6.20-stable/2.6.22-
               |                            |stable/2.6.23-stable
        Product|IO/Storage                  |Memory Management
     Regression|0                           |1
        Summary|Strange system hangs        |Critical memory leak (dirty
               |                            |pages)



After an additional hint from Thomas Osterried I can confirm that the problem 
I have been dealing with for half a year comes from a continuous dirty 
pages increase:


http://bugzilla.kernel.org/attachment.cgi?id=13864&action=view (in 1 KB 
units)


So, after two days of uptime I have ~140MB of dirty pages and that 
explains why my system crashes every 2-3 weeks.


Best regards,


Krzysztof Olędzki
