Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-19 Thread Martin Schwidefsky
On Tue, 9 Oct 2012 16:21:24 -0700 (PDT)
Hugh Dickins  wrote:

> > 
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.  
> 
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.

I tested the waters with the software dirty bit idea. Using kernel compile
as test case I got these numbers:

disk backing, swdirty: 10,023,870 minor-faults 18 major-faults
disk backing, hwdirty: 10,023,829 minor-faults 21 major-faults  


tmpfs backing, swdirty: 10,019,552 minor-faults 49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults 81 major-faults

That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:

disk backing, hwdirty: 30,894 minor-faults 0 major-faults
disk backing, swdirty: 30,894 minor-faults 0 major-faults

tmpfs backing, hwdirty: 22,574 minor-faults 0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults 0 major-faults 

The runtimes of the hwdirty and the swdirty setups are very similar,
encouraging enough for me to ask our performance team to run a larger test.
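
To illustrate the mechanism under discussion, here is a minimal user-space
sketch of dirty tracking via write protection (an illustration using
mprotect()/SIGSEGV, not the s390 kernel code): a writable-but-clean page is
kept write-protected, and the first store takes one extra protection fault,
which records a software dirty bit and re-enables write access.

/*
 * Sketch only: keep a writable-but-clean page write-protected; the first
 * store faults once, records a software dirty bit, and restores write
 * access before the store is retried.
 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size;
static volatile sig_atomic_t soft_dirty;	/* the "software dirty bit" */

static void wp_fault(int sig, siginfo_t *si, void *ctx)
{
	/* Round the faulting address down to its page boundary. */
	uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);

	soft_dirty = 1;				/* note the dirtying in software */
	mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}						/* the faulting store is retried */

int main(void)
{
	struct sigaction sa;
	unsigned char *buf;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = wp_fault;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	page_size = sysconf(_SC_PAGESIZE);
	buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	mprotect(buf, page_size, PROT_READ);	/* "clean": writable data, write-protected mapping */

	buf[0] = 42;				/* costs one extra protection fault */

	printf("soft dirty: %d\n", soft_dirty);
	return 0;
}

The extra minor faults visible in the swdirty lat_mmap numbers above would
correspond to this kind of once-per-page protection fault on first write.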

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-16 Thread Jan Kara
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:

> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > if ((PageSwapCache(page) ||
> >  (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  Just to follow up on this. The new version of the patch runs fine for
several days on our s390 build machines. I was also running fsx-linux on
tmpfs while pushing the machine to swap. fsx ran fine but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
 [<03c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
 [<00213326>] shrink_page_list+0x6ba/0x734
 [<00213924>] shrink_inactive_list+0x230/0x578
 [<00214148>] shrink_list+0x6c/0x120
 [<002143ee>] shrink_zone+0x1f2/0x238
 [<00215482>] balance_pgdat+0x5f6/0x86c
 [<002158b8>] kswapd+0x1c0/0x248
 [<0017642a>] kthread+0xa6/0xb0
 [<004e58be>] kernel_thread_starter+0x6/0xc
 [<004e58b8>] kernel_thread_starter+0x0/0xc

I don't think it is really related but I'll hold off the patch for a while
to investigate what's going on...

Honza
-- 
Jan Kara 
SUSE Labs, CR

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-11 Thread Martin Schwidefsky
On Thu, 11 Oct 2012 08:56:00 +1100
Dave Chinner  wrote:

> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard
> > > > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > > > problems to filesystems. The bug we observed in practice is that buffers
> > > > > from the page get freed, so when the page gets later marked as dirty and
> > > > > writeback writes it, XFS crashes due to an assertion
> > > > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > > > xfs_count_page_state().
> > > > 
> > > > What changed recently?  Was XFS hardly used on s390 until now?
> > >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > > up).
> > 
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
> 
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many people have s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything.

Our internal builds indeed have CONFIG_XFS_DEBUG=n, I'll change that and
watch for the fallout.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-11 Thread Martin Schwidefsky
On Wed, 10 Oct 2012 14:28:32 -0700 (PDT)
Hugh Dickins  wrote:

> But perhaps these machines aren't much into heavy swapping.  Now, 
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)

Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)

> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in other mail why now I think not.

That is good to know, one problem less on the list.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-10 Thread Hugh Dickins
On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins  wrote:
> > 
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new?  And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> > 
> > That is covered by the SetPageUptodate() in migrate_page_copy().
> 
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.

I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.

Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.

But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry.  But this case
is not a problem because PageDirty doesn't actually affect anything
for an anonymous page not in swapcache.  There are various places
where we set it, and its life-history is hard to make sense of, but
in fact it's meaningless in 2.6, where page reclaim adds anon to swap
(and sets PageDirty) whether the page was marked dirty before or not
(which makes sense when we use the ZERO_PAGE for anon read faults).

2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a
page is freed), but there are usually more urgent things to attend
to than rebase and retest that.

Hugh


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-10 Thread Dave Chinner
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard
> > > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > > problems to filesystems. The bug we observed in practice is that buffers
> > > > from the page get freed, so when the page gets later marked as dirty and
> > > > writeback writes it, XFS crashes due to an assertion
> > > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > > xfs_count_page_state().
> > > 
> > > What changed recently?  Was XFS hardly used on s390 until now?
> >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > up).
> 
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many people have s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-10 Thread Hugh Dickins
On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard
> > > > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > > > problems to filesystems. The bug we observed in practice is that buffers
> > > > > from the page get freed, so when the page gets later marked as dirty and
> > > > > writeback writes it, XFS crashes due to an assertion
> > > > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > > > xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then
> > > > > page unmapped.
> > 
> > Similar problem, or is that the whole of the problem?  Where else does
> > the page get written to, after clearing page dirty?  (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
>   I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transferring pte or storage key dirty bits to page dirty bit when page is
> not mapped.

Right, it's "if (page_mkclean) set_page_dirty".

> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key,
as perhaps I was expecting (and we wouldn't want to add that if
everything is working without).

> 
> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).

> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
>   They can have a problem - if they cared that page_remove_rmap() can mark
> as dirty a page which was never written to via mmap. So far we are lucky
> and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.

> > >   Things should be ok (modulo the ugliness of this condition), right?
> > 
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful.  Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> > 
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> > 
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
>   OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me.

I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it.

But perhaps these machines aren't much into heavy swapping.  Now, 
if Martin would send me a nice little zSeries netbook for Xmas,
I could then test that end of it myself ;)

I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in other mail why now I think not.

Hugh


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-10 Thread Jan Kara
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard
> > > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > > problems to filesystems. The bug we observed in practice is that buffers
> > > > from the page get freed, so when the page gets later marked as dirty and
> > > > writeback writes it, XFS crashes due to an assertion
> > > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > > xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then
> > > > page unmapped.
> 
> Similar problem, or is that the whole of the problem?  Where else does
> the page get written to, after clearing page dirty?  (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
  I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization that it does not bother
transferring pte or storage key dirty bits to page dirty bit when page is
not mapped. On s390 that results in storage key dirty bit set once buffered
write modifies the page.

BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).
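
For reference, a simplified sketch of the clear_page_dirty_for_io() shape
being discussed; this is a reconstruction from the thread rather than
verbatim kernel source, and it elides the dirty accounting. The point is
that the pte/storage-key dirty state is only folded back into the page dirty
bit through page_mkclean(), which does nothing for an unmapped page, so a
storage key dirtied by a buffered write from the kernel stays set:

int clear_page_dirty_for_io(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (mapping && mapping_cap_account_dirty(mapping)) {
		/*
		 * page_mkclean() bails out when the page is not mapped,
		 * so the storage-key dirty bit is not transferred here.
		 */
		if (page_mkclean(page))
			set_page_dirty(page);
		return TestClearPageDirty(page);	/* accounting elided */
	}
	return TestClearPageDirty(page);
}
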
> 
> I keep trying to put my finger on the precise bug.  I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug.  Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes,
> and that the storage key ought to be re-cleaned after that?
  I think the precise bug is that we can leave the dirty bit in the storage
key set after writes from the kernel while some parts of the kernel assume
the bit can be set only via a user mapping.

In a perfect world with infinite computation resources, all writes to
pages from the kernel could look like:

	/* assume locked page */
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	/* ... write to page ... */
	page_test_and_clear_dirty(page);	/* clean storage key */

This would be bulletproof ... and ridiculously expensive.

> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
> 
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
  They can have a problem - if they cared that page_remove_rmap() can mark
as dirty a page which was never written to via mmap. So far we are lucky
and all !mapping_cap_account_dirty users don't care.

> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > if ((PageSwapCache(page) ||
> >  (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> 


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Hugh Dickins
On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard
> > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > problems to filesystems. The bug we observed in practice is that buffers
> > > from the page get freed, so when the page gets later marked as dirty and
> > > writeback writes it, XFS crashes due to an assertion
> > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
>   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.

> 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then
> > > page unmapped.

Similar problem, or is that the whole of the problem?  Where else does
the page get written to, after clearing page dirty?  (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)

I keep trying to put my finger on the precise bug.  I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug.  Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes,
and that the storage key ought to be re-cleaned after that?

What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().

I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.

> > But here's where I think the problem is.  You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> > 
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
>   if ((PageSwapCache(page) ||
>(!anon && !mapping_cap_account_dirty(page->mapping))) &&
>   page_test_and_clear_dirty(page_to_pfn(page), 1))
>   set_page_dirty(page);
> 
>   Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful.  Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.
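
To make that concrete, the check could take roughly the following shape with
page_mapping() folded in (a sketch reconstructed from this discussion, not
necessarily the patch as finally merged): page_mapping() returns the swapper
address space for swapcache pages and NULL for other anonymous pages, so the
swapcache and unaccounted-file cases collapse into a single test.

	/*
	 * Sketch only: at final unmap, transfer the s390 storage-key dirty
	 * bit to PageDirty for swapcache pages and for file pages whose
	 * filesystem does not do dirty accounting (tmpfs, ramfs).  Accounted
	 * filesystems are skipped: for them a writable pte implies the page
	 * has already been marked dirty.
	 */
	struct address_space *mapping = page_mapping(page);

	if (mapping && !mapping_cap_account_dirty(mapping) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);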

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)

Hugh

---


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Hugh Dickins
On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard
> > > write, HW dirty bit gets set and when we later map and unmap the page,
> > > page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of
> > > problems to filesystems. The bug we observed in practice is that buffers
> > > from the page get freed, so when the page gets later marked as dirty and
> > > writeback writes it, XFS crashes due to an assertion
> > > BUG_ON(!PagePrivate(page)) in page_buffers() called from
> > > xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
> 
> One thing that changed is that the zero_user_segment for the remaining bytes
> between i_size and the end of the page has been moved to
> block_write_full_page_endio, see git commit eebd2aa355692afa. That changed
> the timing of the race window in regard to map/unmap of the page by user
> space. And yes XFS is in use on s390.

February 2008: I think we have different ideas of "recently" ;)

>  
> > > 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then
> > > page unmapped.
> > > 
> > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > > page_mkclean() and page_remove_rmap(). This is safe because when a page
> > > gets marked as writeable in PTE it is also marked dirty in do_wp_page()
> > > or do_page_fault(). When the dirty bit is cleared by
> > > clear_page_dirty_for_io(), the page gets writeprotected in
> > > page_mkclean(). So pagecache page is writeable if and only if it is dirty.
> > 
> > Very interesting patch...
> 
> Yes, it is an interesting idea. I really like the part that we'll use less
> storage key operations, as these are freaking expensive.

As I said to Mel and will repeat to Jan, though an optimization would
be nice, I don't think we should necessarily mix it with the bugfix.

> 
> > > 
> > > CC: Martin Schwidefsky 
> > 
> > which I'd very much like Martin's opinion on...
> 
> Until you pointed out the short-comings of the patch I really liked it ..
> 
> > > ---
> > >  mm/rmap.c |   16 ++--
> > >  1 files changed, 14 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0f3b7cd..6ce8ddb 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > >   struct address_space *mapping = page_mapping(page);
> > >   if (mapping) {
> > >   ret = page_mkclean_file(mapping, page);
> > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > + /*
> > > +  * We ignore dirty bit for pagecache pages. It is safe
> > > +  * as page is marked dirty iff it is writeable (page is
> > > +  * marked as dirty when it is made writeable and
> > > +  * clear_page_dirty_for_io() writeprotects the page
> > > +  * again).
> > > +  */
> > > + if (PageSwapCache(page) &&
> > > + page_test_and_clear_dirty(page_to_pfn(page), 1))
> > >   ret = 1;
> > 
> > This part you could cut out: page_mkclean() is not used on SwapCache pages.
> > I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
> future? At least we should add a comment there.

I set out to do so, to add a comment there; but honestly, it's a strange
place for such a comment when there's no longer even the code to comment
upon.  And page_mkclean_file(), called in the line above, already says
BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change
that sends PageSwapCache pages this way.  It is possible that one day we
shall want to send tmpfs and swapcache down this route, I'm not ruling
that out; but then we shall have to extend page_mkclean(), yes.

> 
> The patch relies on the software dirty bit tracking for file backed pages,
> if dirty bit tracking is not done for tmpfs and ramfs we are borked.
>  
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in 

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Hugh Dickins
On Tue, 9 Oct 2012, Mel Gorman wrote:
> On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > 
> > So, if I'm understanding right, with this change s390 would be in danger
> > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> > written with the write system call would already be PageDirty and secure.
> > 
> 
> In the case of ramfs, what marks the page clean so it could be discarded? It
> does not participate in dirty accounting so it's not going to clear the
> dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
> handler that would use an end_io handler to clear the page after "IO"
> completes. I am not seeing how a ramfs page can get discarded at the moment.

But we don't have a page clean bit: we have a page dirty bit, and where
is that set in the ramfs read-fault case?  I've not experimented to check,
maybe you're right and ramfs is exempt from the issue.  I thought it was
__do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE.
Ah, you quote almost the very place further down.

> 
> shm and tmpfs are indeed different and I did not take them into account
> (ba dum tisch) when reviewing. For those pages would it be sufficient to
> check the following?
> 
> PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping))

Something like that, yes: I've a possible patch I'll put in reply to Jan.
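
For concreteness, the check Mel suggests would gate the storage-key test in
page_remove_rmap() roughly as in the sketch below.  This is only an
illustration of the condition under discussion, written against the 3.6-era
helpers (mapping_cap_account_dirty() being the mapping-level wrapper for the
bdi capability test Mel names); the helper name is made up here, and it is
not the patch posted later in the thread.

#include <linux/mm.h>		/* PageAnon(), PageSwapCache(), struct page */
#include <linux/backing-dev.h>	/* mapping_cap_account_dirty() */

/* Sketch: should the s390 storage-key dirty bit be propagated for this page? */
static bool page_needs_storage_key_dirty(struct page *page)
{
	if (PageSwapCache(page))
		return true;
	if (PageAnon(page))	/* no file mapping to consult */
		return false;
	return page->mapping && !mapping_cap_account_dirty(page->mapping);
}

The PageAnon() test here plays the role of the "!anon" guard in the variant
Jan proposes elsewhere in this thread.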

> 
> The problem the patch dealt with involved buffers associated with the page
> and that shouldn't be a problem for tmpfs, right?

Right, though I'm now beginning to wonder what the underlying bug is.
It seems to me that we have a bug and an optimization on our hands,
and have rushed into the optimization which would avoid the bug,
without considering what the actual bug is.  More in reply to Jan.

> I recognise that this
> might work just because of co-incidence and set off your "Yuck" detector
> and you'll prefer the proposed solution below.

No, I was mistaken to think that s390 would have dirty pages where
others had clean, Martin has now explained that SetPageUptodate cleans.
I didn't mind continuing an (imagined) inefficiency in s390, but I don't
want to make it more inefficient.

> 
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> > 
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

So that fantasy was all wrong: appealing, but wrong.

> >  
> 
> Do you mean something like this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..c66166f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   } else {
>   inc_mm_counter_fast(mm, MM_FILEPAGES);
>   page_add_file_rmap(page);
> - if (flags & FAULT_FLAG_WRITE) {
> +
> + /*
> +  * s390 depends on the dirty flag from the storage key
> +  * being propagated when the page is unmapped from the
> +  * page tables. For dirty-accounted mapping, we instead
> +  * depend on the page being marked dirty on writes and
> +  * being write-protected on clear_page_dirty_for_io.
> +  * The same protection does not apply for tmpfs pages
> +  * that do not participate in dirty accounting so mark
> +  * them dirty at fault time to avoid the data being
> +  * lost
> +  */
> + if (flags & FAULT_FLAG_WRITE ||
> + !bdi_cap_account_dirty(page->mapping)) {
>   dirty_page = page;
>   get_page(dirty_page);
>   }
> 
> Could something like this result in more writes to swap? Let's say there
> is an unmapped tmpfs file with data on it -- a process maps it, reads the
> entire mapping and exits. The page is now dirty and potentially will have
> to be rewritten to swap. That seems bad. Did I miss your point?

My point was that I mistakenly thought s390 must already be behaving
like that, so wanted it to continue that way, but with cleaner source.

But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page
starts out clean.

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Jan Kara
On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?
  The problem was originally hit on SLE11-SP2 which is 3.0 based after
migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
XFS just started to be more peevish about what pages it gets between these
two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
up).

> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...
  Originally, I even wanted to rip out pte dirty bit handling for shared
file pages but in the end that seemed too bold and unnecessary for my
problem ;)

> > CC: linux-s...@vger.kernel.org
> > Signed-off-by: Jan Kara 
> 
> but I think it's wrong.
  Thanks for having a look.

> > ---
> >  mm/rmap.c |   16 ++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > -   if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +   /*
> > +* We ignore dirty bit for pagecache pages. It is safe
> > +* as page is marked dirty iff it is writeable (page is
> > +* marked as dirty when it is made writeable and
> > +* clear_page_dirty_for_io() writeprotects the page
> > +* again).
> > +*/
> > +   if (PageSwapCache(page) &&
> > +   page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
  OK, will do.

> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  * this if the page is anon, so about to be freed; but perhaps
> >  * not if it's in swapcache - there might be another pte slot
> >  * containing the swap entry, but page not yet written to swap.
> > +* For pagecache pages, we don't care about dirty bit in storage
> > +* key because the page is writeable iff it is dirty (page is marked
> > +* as dirty when it is made writeable and clear_page_dirty_for_io()
> > +* writeprotects the page again).
> >  */
> > -   if ((!anon || PageSwapCache(page)) &&
> > +   if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
  I admit I haven't thought of tmpfs and similar. After some discussion Mel
pointed me to the code in mmap which makes a difference. So if I get it
right, the difference which causes us problems is that on tmpfs we map the
page writeably even during read-only fault. OK, then if I make the above
code in page_remove_rmap():
	if ((PageSwapCache(page) ||
	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);

  Things should be ok (modulo the ugliness of this condition), right?

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Martin Schwidefsky
On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
Hugh Dickins  wrote:

> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?

One thing that changed is that the zero_user_segment for the remaining bytes between
i_size and the end of the page has been moved to block_write_full_page_endio, see
git commit eebd2aa355692afa. That changed the timing of the race window in regard
to map/unmap of the page by user space. And yes XFS is in use on s390.
 
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...

Yes, it is an interesting idea. I really like the part that we'll use less
storage key operations, as these are freaking expensive.

> > 
> > CC: Martin Schwidefsky 
> 
> which I'd very much like Martin's opinion on...

Until you pointed out the short-comings of the patch I really liked it ..

> > ---
> >  mm/rmap.c |   16 ++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > -   if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +   /*
> > +* We ignore dirty bit for pagecache pages. It is safe
> > +* as page is marked dirty iff it is writeable (page is
> > +* marked as dirty when it is made writeable and
> > +* clear_page_dirty_for_io() writeprotects the page
> > +* again).
> > +*/
> > +   if (PageSwapCache(page) &&
> > +   page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.

Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
future? At least we should add a comment there.

> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  * this if the page is anon, so about to be freed; but perhaps
> >  * not if it's in swapcache - there might be another pte slot
> >  * containing the swap entry, but page not yet written to swap.
> > +* For pagecache pages, we don't care about dirty bit in storage
> > +* key because the page is writeable iff it is dirty (page is marked
> > +* as dirty when it is made writeable and clear_page_dirty_for_io()
> > +* writeprotects the page again).
> >  */
> > -   if ((!anon || PageSwapCache(page)) &&
> > +   if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.

The patch relies on the software dirty bit tracking for file backed pages,
if dirty bit tracking is not done for tmpfs and ramfs we are borked.

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Mel Gorman
On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > 
> > CC: Mel Gorman 
> 
> and I'm grateful to Mel's ack for reawakening me to it...
> 
> > CC: linux-s...@vger.kernel.org
> > Signed-off-by: Jan Kara 
> 
> but I think it's wrong.
> 

Dang.

> > ---
> >  mm/rmap.c |   16 ++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > -   if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +   /*
> > +* We ignore dirty bit for pagecache pages. It is safe
> > +* as page is marked dirty iff it is writeable (page is
> > +* marked as dirty when it is made writeable and
> > +* clear_page_dirty_for_io() writeprotects the page
> > +* again).
> > +*/
> > +   if (PageSwapCache(page) &&
> > +   page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  * this if the page is anon, so about to be freed; but perhaps
> >  * not if it's in swapcache - there might be another pte slot
> >  * containing the swap entry, but page not yet written to swap.
> > +* For pagecache pages, we don't care about dirty bit in storage
> > +* key because the page is writeable iff it is dirty (page is marked
> > +* as dirty when it is made writeable and clear_page_dirty_for_io()
> > +* writeprotects the page again).
> >  */
> > -   if ((!anon || PageSwapCache(page)) &&
> > +   if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
> 
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 

In the case of ramfs, what marks the page clean so it could be discarded? It
does not participate in dirty accounting so it's not going to clear the
dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
handler that would use an end_io handler to clear the page after "IO"
completes. I am not seeing how a ramfs page can get discarded at the moment.

shm and tmpfs are indeed different and I did not take them into account
(ba dum tisch) when reviewing. For those pages would it be sufficient to
check the following?

PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping))

The problem the patch dealt with involved buffers associated with the page
and that shouldn't be a problem for tmpfs, right? I recognise that this
might work just because of co-incidence and set off your "Yuck" detector
and you'll prefer the proposed solution below.

> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
> 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
>  

Do you mean something like this?

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..c66166f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-09 Thread Hugh Dickins
On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
>   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.

> 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.

Similar problem, or is that the whole of the problem?  Where else does
the page get written to, after clearing page dirty?  (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)

I keep trying to put my finger on the precise bug.  I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug.  Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
that the storage key ought to be re-cleaned after that?

What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().

I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.

> > But here's where I think the problem is.  You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> > 
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
> 	if ((PageSwapCache(page) ||
> 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> 		set_page_dirty(page);
> 
>   Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful.  Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.
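
For concreteness, the condition described above might look something like the
sketch below inside page_remove_rmap().  This is only an illustration written
against 3.6-era helpers (page_mapping(), mapping_cap_account_dirty(),
page_test_and_clear_dirty()); it is not the actual patch, which is truncated
below.

	/*
	 * Sketch: page_mapping() returns the swap address_space for
	 * PageSwapCache pages and NULL for other anonymous pages, so a
	 * single test covers both swapcache and file pages on
	 * filesystems that do not do dirty accounting (tmpfs, ramfs).
	 * As argued above, the rmap path relies on the page's mapcount
	 * rather than the page lock to keep page->mapping stable
	 * against a racing truncation.
	 */
	struct address_space *mapping = page_mapping(page);

	if (mapping && !mapping_cap_account_dirty(mapping) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);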

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)

Hugh

---

 mm/rmap.c |   20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

--- 3.6.0+/mm/rmap.c

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-08 Thread Hugh Dickins
On Mon, 1 Oct 2012, Jan Kara wrote:

> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().

What changed recently?  Was XFS hardly used on s390 until now?

> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.

Very interesting patch...

> 
> CC: Martin Schwidefsky 

which I'd very much like Martin's opinion on...

> CC: Mel Gorman 

and I'm grateful to Mel's ack for reawakening me to it...

> CC: linux-s...@vger.kernel.org
> Signed-off-by: Jan Kara 

but I think it's wrong.

> ---
>  mm/rmap.c |   16 ++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
>   struct address_space *mapping = page_mapping(page);
>   if (mapping) {
>   ret = page_mkclean_file(mapping, page);
> - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> + /*
> +  * We ignore dirty bit for pagecache pages. It is safe
> +  * as page is marked dirty iff it is writeable (page is
> +  * marked as dirty when it is made writeable and
> +  * clear_page_dirty_for_io() writeprotects the page
> +  * again).
> +  */
> + if (PageSwapCache(page) &&
> + page_test_and_clear_dirty(page_to_pfn(page), 1))
>   ret = 1;

This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.

>   }
>   }
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
>* this if the page is anon, so about to be freed; but perhaps
>* not if it's in swapcache - there might be another pte slot
>* containing the swap entry, but page not yet written to swap.
> +  * For pagecache pages, we don't care about dirty bit in storage
> +  * key because the page is writeable iff it is dirty (page is marked
> +  * as dirty when it is made writeable and clear_page_dirty_for_io()
> +  * writeprotects the page again).
>*/
> - if ((!anon || PageSwapCache(page)) &&
> + if (PageSwapCache(page) &&
>   page_test_and_clear_dirty(page_to_pfn(page), 1))
>   set_page_dirty(page);

But here's where I think the problem is.  You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.
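
For reference, the "maze of three" boils down to helpers along these lines in
the 3.6-era include/linux/backing-dev.h - quoted from memory here, so treat
the exact spellings as approximate.  tmpfs and ramfs use a backing_dev_info
with BDI_CAP_NO_ACCT_DIRTY set (via BDI_CAP_NO_ACCT_AND_WRITEBACK), which is
how they opt out of dirty accounting:

static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
{
	return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
}

static inline bool mapping_cap_account_dirty(struct address_space *mapping)
{
	return bdi_cap_account_dirty(mapping->backing_dev_info);
}

So !mapping_cap_account_dirty(mapping) is the test for "this mapping does not
participate in dirty accounting".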

But filesystems like tmpfs and ramfs (perhaps they're the only two
that matter here) don't participate in that, and wait for an mmap'ed
page to be seen modified by the user (usually via pte_dirty, but that's
a no-op on s390) before page is marked dirty; and page reclaim throws
away undirtied pages.

So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.

You mention above that even the kernel writing to the page would mark
the s390 storage key dirty.  I think that means that these shm and
tmpfs and ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have
been found dirty anyway by the existing code here in page_remove_rmap(),
even though other architectures would regard them as clean and removable.

If that's the case, then maybe we'd do better just to mark them dirty
when faulted in the s390 case.  Then your patch above should (I think)
be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
too (I've not thought through exactly what that patch would be, just
one or two suitably placed SetPageDirtys, I think), and eliminate
page_test_and_clear_dirty() altogether - no tears shed by any of us!

Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-08 Thread Mel Gorman
On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
> 
> CC: Martin Schwidefsky 
> CC: Mel Gorman 
> CC: linux-s...@vger.kernel.org
> Signed-off-by: Jan Kara 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs

