Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012 16:21:24 -0700 (PDT) Hugh Dickins wrote:
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.
>
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults. Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.

I tested the waters with the software dirty bit idea. Using kernel compile
as test case I got these numbers:

disk backing, swdirty:  10,023,870 minor-faults  18 major-faults
disk backing, hwdirty:  10,023,829 minor-faults  21 major-faults
tmpfs backing, swdirty: 10,019,552 minor-faults  49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults  81 major-faults

That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:

disk backing, hwdirty:  30,894 minor-faults  0 major-faults
disk backing, swdirty:  30,894 minor-faults  0 major-faults
tmpfs backing, hwdirty: 22,574 minor-faults  0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults  0 major-faults

The runtime between the hwdirty vs. the swdirty setup is very similar,
encouraging enough for me to ask our performance team to run a larger test.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > > But here's where I think the problem is. You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > >
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> >	if ((PageSwapCache(page) ||
> >	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> >	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >		set_page_dirty(page);
> >
> > Things should be ok (modulo the ugliness of this condition), right?
>
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful. Usually
> we guard against that with page lock, but here we can rely on mapcount.
>
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
>
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  Just to followup on this. The new version of the patch runs fine for
several days on our s390 build machines. I was also running fsx-linux on
tmpfs while pushing the machine to swap. fsx ran fine but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
 [<03c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
 [<00213326>] shrink_page_list+0x6ba/0x734
 [<00213924>] shrink_inactive_list+0x230/0x578
 [<00214148>] shrink_list+0x6c/0x120
 [<002143ee>] shrink_zone+0x1f2/0x238
 [<00215482>] balance_pgdat+0x5f6/0x86c
 [<002158b8>] kswapd+0x1c0/0x248
 [<0017642a>] kthread+0xa6/0xb0
 [<004e58be>] kernel_thread_starter+0x6/0xc
 [<004e58b8>] kernel_thread_starter+0x0/0xc
I don't think it is really related but I'll hold off the patch for a while
to investigate what's going on...

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Thu, 11 Oct 2012 08:56:00 +1100 Dave Chinner wrote:
> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > >
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > >
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> > > >
> > > > What changed recently? Was XFS hardly used on s390 until now?
> > >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things up).
> >
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
>
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many people have s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything.

Our internal builds indeed have CONFIG_XFS_DEBUG=n, I'll change that
and watch for the fallout.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
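[For reference, not part of the thread: the XFS debug checks Dave refers to —
including ASSERT(page_has_buffers(page)) in xfs_vm_writepage() — are compiled
in via the kernel build configuration, so enabling them on a test build means
setting:]

```
CONFIG_XFS_DEBUG=y
```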
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Wed, 10 Oct 2012 14:28:32 -0700 (PDT) Hugh Dickins wrote:
> But perhaps these machines aren't much into heavy swapping. Now,
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)

Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)

> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in other mail why now I think not.

That is good to know, one problem less on the list.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins wrote:
> > >
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new? And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> >
> > That is covered by the SetPageUptodate() in migrate_page_copy().
>
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.

I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.

Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.

But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry. But this case is
not a problem because PageDirty doesn't actually affect anything for
an anonymous page not in swapcache. There are various places where we
set it, and its life-history is hard to make sense of, but in fact
it's meaningless in 2.6, where page reclaim adds anon to swap (and
sets PageDirty) whether the page was marked dirty before or not (which
makes sense when we use the ZERO_PAGE for anon read faults).

2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a page
is freed), but there are usually more urgent things to attend to than
rebase and retest that.

Hugh
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > >
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > >
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
> > >
> > > What changed recently? Was XFS hardly used on s390 until now?
> >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things up).
>
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.

I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many people have s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > >
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > >
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > > unmapped.
> >
> > Similar problem, or is that the whole of the problem? Where else does
> > the page get written to, after clearing page dirty? (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
>   I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transfering pte or storage key dirty bits to page dirty bit when page is
> not mapped.

Right, its "if (page_mkclean) set_page_dirty".

> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key, as perhaps
I was expecting (and we wouldn't want to add that if everything is working
without).

> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).
>
> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
>   They can have a problem - if they cared that page_remove_rmap() can mark
> as dirty a page which was never written to via mmap. So far we are lucky
> and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.

> > > Things should be ok (modulo the ugliness of this condition), right?
> >
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful. Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> >
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> >
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
>   OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me.

I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it. But perhaps these machines aren't much
into heavy swapping. Now, if Martin would send me a nice little zSeries
netbook for Xmas, I could then test that end of it myself ;)

I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in other mail why now I think not.

Hugh
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > >
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > >
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > unmapped.
>
> Similar problem, or is that the whole of the problem? Where else does
> the page get written to, after clearing page dirty? (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
  I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization that it does not bother
transfering pte or storage key dirty bits to page dirty bit when page is
not mapped. On s390 that results in storage key dirty bit set once buffered
write modifies the page.

BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).

> I keep trying to put my finger on the precise bug. I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug. Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
> that the storage key ought to be re-cleaned after that?
  I think the precise bug is that we can leave dirty bit in storage key set
after writes from kernel while some parts of kernel assume the bit can be
set only via user mapping. In a perfect world with infinite computation
resources, all writes to pages from kernel could look like:
	.. assume locked page ..
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	write to page
	page_test_and_clear_dirty(page);	/* Clean storage key */
This would be bulletproof ... and ridiculously expensive.

> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so! And it wouldn't cause a BUG. Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
>
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
  They can have a problem - if they cared that page_remove_rmap() can mark
as dirty a page which was never written to via mmap. So far we are lucky
and all !mapping_cap_account_dirty users don't care.

> > But here's where I think the problem is. You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> >
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
>	if ((PageSwapCache(page) ||
>	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
>	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>		set_page_dirty(page);
>
> Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful. Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue 09-10-12 19:19:09, Hugh Dickins wrote: On Tue, 9 Oct 2012, Jan Kara wrote: On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). ... Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped. Similar problem, or is that the whole of the problem? Where else does the page get written to, after clearing page dirty? (It may not be worth spending time to answer me, I feel I'm wasting too much time on this.) I think the devil is in after clearing page dirty - clear_page_dirty_for_io() has an optimization that it does not bother transfering pte or storage key dirty bits to page dirty bit when page is not mapped. On s390 that results in storage key dirty bit set once buffered write modifies the page. BTW there's no other place I'm aware of (and I was looking for some time before I realized that storage key could remain set from buffered write as described above). I keep trying to put my finger on the precise bug. I said in earlier mails to Mel and to Martin that we're mixing a bugfix and an optimization, but I cannot quite point to the bug. Could one say that it's precisely at the page straddles i_size zero_user_segment(), in XFS or in other FSes? 
that the storage key ought to be re-cleaned after that?

I think the precise bug is that we can leave dirty bit in storage key set after writes from kernel while some parts of kernel assume the bit can be set only via user mapping. In a perfect world with infinite computation resources, all writes to pages from kernel could look like:

	/* .. assume locked page .. */
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	/* write to page */
	page_test_and_clear_dirty(page);	/* Clean storage key */

This would be bulletproof ... and ridiculously expensive.

What if one day I happened to copy that code into shmem_writepage()? I've no intention to do so! And it wouldn't cause a BUG. Ah, and we never write shmem to swap while it's still mapped, so it wouldn't even have a chance to redirty the page in page_remove_rmap(). I guess I'm worrying too much; but it's not crystal clear to me why any !mapping_cap_account_dirty mapping would necessarily not have the problem.

They can have a problem - if they cared that page_remove_rmap() can mark as dirty a page which was never written to via mmap. So far we are lucky and all !mapping_cap_account_dirty users don't care.

But here's where I think the problem is. You're assuming that all filesystems go the same mapping_cap_account_writeback_dirty() (yeah, there's no such function, just a confusing maze of three) route as XFS. But filesystems like tmpfs and ramfs (perhaps they're the only two that matter here) don't participate in that, and wait for an mmap'ed page to be seen modified by the user (usually via pte_dirty, but that's a no-op on s390) before page is marked dirty; and page reclaim throws away undirtied pages.

I admit I haven't thought of tmpfs and similar. After some discussion Mel pointed me to the code in mmap which makes a difference. So if I get it right, the difference which causes us problems is that on tmpfs we map the page writeably even during read-only fault.
OK, then if I make the above code in page_remove_rmap():

	if ((PageSwapCache(page) ||
	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);

Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but I think the issue of a racing truncation (which could reset page->mapping to NULL at any moment) means we have to be a bit more careful. Usually we guard against that with page lock, but here we can rely on mapcount. page_mapping(page), with its built-in PageSwapCache check, actually ends up making the condition look less ugly; and so far as I could tell, the extra code does get optimized out on x86
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Wed, 10 Oct 2012, Jan Kara wrote: On Tue 09-10-12 19:19:09, Hugh Dickins wrote: On Tue, 9 Oct 2012, Jan Kara wrote: On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). ... Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped.

Similar problem, or is that the whole of the problem? Where else does the page get written to, after clearing page dirty? (It may not be worth spending time to answer me, I feel I'm wasting too much time on this.)

I think the devil is in "after clearing page dirty" - clear_page_dirty_for_io() has an optimization that it does not bother transferring pte or storage key dirty bits to page dirty bit when page is not mapped.

Right, it's "if (page_mkclean) set_page_dirty".

On s390 that results in storage key dirty bit set once buffered write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key, as perhaps I was expecting (and we wouldn't want to add that if everything is working without).

BTW there's no other place I'm aware of (and I was looking for some time before I realized that storage key could remain set from buffered write as described above).
I guess I'm worrying too much; but it's not crystal clear to me why any !mapping_cap_account_dirty mapping would necessarily not have the problem. They can have a problem - if they cared that page_remove_rmap() can mark as dirty a page which was never written to via mmap. So far we are lucky and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough future-proof fix; a workaround with a nice optimization bonus for s390.

Things should be ok (modulo the ugliness of this condition), right? (Setting aside my reservations above...) That's almost exactly right, but I think the issue of a racing truncation (which could reset page->mapping to NULL at any moment) means we have to be a bit more careful. Usually we guard against that with page lock, but here we can rely on mapcount. page_mapping(page), with its built-in PageSwapCache check, actually ends up making the condition look less ugly; and so far as I could tell, the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM, when we are left with its VM_BUG_ON(PageSlab(page))). But please look this over very critically and test (and if you like it, please adopt it as your own): I'm not entirely convinced yet myself.

OK, I'll push the kernel with your updated patch to our build machines and let it run there for a few days (it took about a day to reproduce the issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me. I expect you're most interested in testing the XFS end of it; but if you've time to check the swap/tmpfs aspect too, fsx on tmpfs while heavily swapping should do it. But perhaps these machines aren't much into heavy swapping.
Now, if Martin would send me a nice little zSeries netbook for Xmas, I could then test that end of it myself ;) I've just arrived at the conclusion that page migration does _not_ have a problem with transferring the dirty storage key: I had been thinking that your testing might stumble on that issue, and need a further patch, but I'll explain in other mail why now I think not. Hugh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote: On Tue, 9 Oct 2012, Jan Kara wrote: On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state().

What changed recently? Was XFS hardly used on s390 until now?

The problem was originally hit on SLE11-SP2 which is 3.0 based after migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think XFS just started to be more peevish about what pages it gets between these two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case, whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390 with CONFIG_XFS_DEBUG enabled, otherwise this would have been found. I've never tested XFS on s390 before, and I doubt any of the upstream developers have, either, because not many people have s390 machines in their basement. So this is probably just an oversight in the distro QA environment more than anything.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Hugh Dickins wrote: On Tue, 9 Oct 2012, Martin Schwidefsky wrote: On Mon, 8 Oct 2012 21:24:40 -0700 (PDT) Hugh Dickins hu...@google.com wrote: A separate worry came to mind as I thought about your patch: where in page migration is s390's dirty storage key migrated from old page to new? And if there is a problem there, that too should be fixed by what I propose in the previous paragraph. That is covered by the SetPageUptodate() in migrate_page_copy(). I don't think so: that makes sure that the newpage is not marked dirty in storage key just because of the copy_highpage to it; but I see nothing to mark the newpage dirty in storage key when the old page was dirty there. I went to prepare a patch to fix this, and ended up finding no such problem to fix - which fits with how no such problem has been reported. Most of it is handled by page migration's unmap_and_move() having to unmap the old page first: so the old page will pass through the final page_remove_rmap(), which will transfer storage key to page_dirty in those cases which it deals with (with the old code, any file or swap page; with the new code, any unaccounted file or swap page, now that we realize the accounted files don't even need this); and page_dirty is already properly migrated to the new page. But that does leave one case behind: an anonymous page not yet in swapcache, migrated via a swap-like migration entry. But this case is not a problem because PageDirty doesn't actually affect anything for an anonymous page not in swapcache. There are various places where we set it, and its life-history is hard to make sense of, but in fact it's meaningless in 2.6, where page reclaim adds anon to swap (and sets PageDirty) whether the page was marked dirty before or not (which makes sense when we use the ZERO_PAGE for anon read faults). 
2.4 did behave differently: it was liable to free anon pages not marked dirty, and I think most of our anon SetPageDirtys are just a relic of those days - I do have a patch from 18 months ago to remove them (adding PG_dirty to the flags which should not be set when a page is freed), but there are usually more urgent things to attend to than rebase and retest that. Hugh
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Jan Kara wrote: > On Mon 08-10-12 21:24:40, Hugh Dickins wrote: > > On Mon, 1 Oct 2012, Jan Kara wrote: > > > > > On s390 any write to a page (even from kernel itself) sets architecture > > > specific page dirty bit. Thus when a page is written to via standard > > > write, HW > > > dirty bit gets set and when we later map and unmap the page, > > > page_remove_rmap() > > > finds the dirty bit and calls set_page_dirty(). > > > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of > > > problems to > > > filesystems. The bug we observed in practice is that buffers from the > > > page get > > > freed, so when the page gets later marked as dirty and writeback writes > > > it, XFS > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() > > > called > > > from xfs_count_page_state(). > > > > What changed recently? Was XFS hardly used on s390 until now? > The problem was originally hit on SLE11-SP2 which is 3.0 based after > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think > XFS just started to be more peevish about what pages it gets between these > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things > up). Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case, whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the ASSERT usually compiled out, stumbling later in page_buffers() as you say. > > > > Similar problem can also happen when zero_user_segment() call from > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the > > > hardware dirty bit during writeback, later buffers get freed, and then > > > page > > > unmapped. Similar problem, or is that the whole of the problem? Where else does the page get written to, after clearing page dirty? (It may not be worth spending time to answer me, I feel I'm wasting too much time on this.) I keep trying to put my finger on the precise bug. 
I said in earlier mails to Mel and to Martin that we're mixing a bugfix and an optimization, but I cannot quite point to the bug. Could one say that it's precisely at the "page straddles i_size" zero_user_segment(), in XFS or in other FSes? that the storage key ought to be re-cleaned after that? What if one day I happened to copy that code into shmem_writepage()? I've no intention to do so! And it wouldn't cause a BUG. Ah, and we never write shmem to swap while it's still mapped, so it wouldn't even have a chance to redirty the page in page_remove_rmap(). I guess I'm worrying too much; but it's not crystal clear to me why any !mapping_cap_account_dirty mapping would necessarily not have the problem. > > But here's where I think the problem is. You're assuming that all > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah, > > there's no such function, just a confusing maze of three) route as XFS. > > > > But filesystems like tmpfs and ramfs (perhaps they're the only two > > that matter here) don't participate in that, and wait for an mmap'ed > > page to be seen modified by the user (usually via pte_dirty, but that's > > a no-op on s390) before page is marked dirty; and page reclaim throws > > away undirtied pages. > I admit I haven't thought of tmpfs and similar. After some discussion Mel > pointed me to the code in mmap which makes a difference. So if I get it > right, the difference which causes us problems is that on tmpfs we map the > page writeably even during read-only fault. OK, then if I make the above > code in page_remove_rmap(): > if ((PageSwapCache(page) || >(!anon && !mapping_cap_account_dirty(page->mapping))) && > page_test_and_clear_dirty(page_to_pfn(page), 1)) > set_page_dirty(page); > > Things should be ok (modulo the ugliness of this condition), right? (Setting aside my reservations above...) 
That's almost exactly right, but I think the issue of a racing truncation (which could reset page->mapping to NULL at any moment) means we have to be a bit more careful. Usually we guard against that with page lock, but here we can rely on mapcount. page_mapping(page), with its built-in PageSwapCache check, actually ends up making the condition look less ugly; and so far as I could tell, the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM, when we are left with its VM_BUG_ON(PageSlab(page))). But please look this over very critically and test (and if you like it, please adopt it as your own): I'm not entirely convinced yet myself. (One day, I do want to move that block further down page_remove_rmap(), outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think there's an actual problem at present in calling set_page_dirty() there, but I have seen patches which could give it a lock-ordering issue, so better to untangle them. No reason to muddle that in with your fix, but I thought I'd mention it while we're all staring at this.) Hugh ---
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Martin Schwidefsky wrote: > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT) > Hugh Dickins wrote: > > On Mon, 1 Oct 2012, Jan Kara wrote: > > > > > On s390 any write to a page (even from kernel itself) sets architecture > > > specific page dirty bit. Thus when a page is written to via standard > > > write, HW > > > dirty bit gets set and when we later map and unmap the page, > > > page_remove_rmap() > > > finds the dirty bit and calls set_page_dirty(). > > > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of > > > problems to > > > filesystems. The bug we observed in practice is that buffers from the > > > page get > > > freed, so when the page gets later marked as dirty and writeback writes > > > it, XFS > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() > > > called > > > from xfs_count_page_state(). > > > > What changed recently? Was XFS hardly used on s390 until now? > > One thing that changed is that the zero_user_segment for the remaining bytes > between > i_size and the end of the page has been moved to block_write_full_page_endio, > see > git commit eebd2aa355692afa. That changed the timing of the race window in > regard > to map/unmap of the page by user space. And yes XFS is in use on s390. February 2008: I think we have different ideas of "recently" ;) > > > > > > > Similar problem can also happen when zero_user_segment() call from > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the > > > hardware dirty bit during writeback, later buffers get freed, and then > > > page > > > unmapped. > > > > > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in > > > page_mkclean() and page_remove_rmap(). This is safe because when a page > > > gets > > > marked as writeable in PTE it is also marked dirty in do_wp_page() or > > > do_page_fault(). When the dirty bit is cleared by > > > clear_page_dirty_for_io(), > > > the page gets writeprotected in page_mkclean(). 
So pagecache page is > > > writeable > > > if and only if it is dirty. > > > > Very interesting patch... > > Yes, it is an interesting idea. I really like the part that we'll use less > storage > key operations, as these are freaking expensive. As I said to Mel and will repeat to Jan, though an optimization would be nice, I don't think we should necessarily mix it with the bugfix. > > > > > > > CC: Martin Schwidefsky > > > > which I'd very much like Martin's opinion on... > > Until you pointed out the short-comings of the patch I really liked it .. > > > > --- > > > mm/rmap.c | 16 ++-- > > > 1 files changed, 14 insertions(+), 2 deletions(-) > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index 0f3b7cd..6ce8ddb 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) > > > struct address_space *mapping = page_mapping(page); > > > if (mapping) { > > > ret = page_mkclean_file(mapping, page); > > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) > > > + /* > > > + * We ignore dirty bit for pagecache pages. It is safe > > > + * as page is marked dirty iff it is writeable (page is > > > + * marked as dirty when it is made writeable and > > > + * clear_page_dirty_for_io() writeprotects the page > > > + * again). > > > + */ > > > + if (PageSwapCache(page) && > > > + page_test_and_clear_dirty(page_to_pfn(page), 1)) > > > ret = 1; > > > > This part you could cut out: page_mkclean() is not used on SwapCache pages. > > I believe you are safe to remove the page_test_and_clear_dirty() from here. > > Hmm, who guarantees that page_mkclean won't be used for SwapCache in the > future? At least we should add a comment there. I set out to do so, to add a comment there; but honestly, it's a strange place for such a comment when there's no longer even the code to comment upon. 
And page_mkclean_file(), called in the line above, already says BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change that sends PageSwapCache pages this way. It is possible that one day we shall want to send tmpfs and swapcache down this route, I'm not ruling that out; but then we shall have to extend page_mkclean(), yes. > > The patch relies on the software dirty bit tracking for file backed pages, > if dirty bit tracking is not done for tmpfs and ramfs we are borked. > > > You mention above that even the kernel writing to the page would mark > > the s390 storage key dirty. I think that means that these shm and > > tmpfs and ramfs pages would all have dirty storage keys just from the > > clear_highpage() used to prepare them originally, and so would have > > been found dirty anyway by the existing code here in
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Mel Gorman wrote: > On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote: > > > > So, if I'm understanding right, with this change s390 would be in danger > > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages > > written with the write system call would already be PageDirty and secure. > > > > In the case of ramfs, what marks the page clean so it could be discarded? It > does not participate in dirty accounting so it's not going to clear the > dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage > handler that would use an end_io handler to clear the page after "IO" > completes. I am not seeing how a ramfs page can get discarded at the moment. But we don't have a page clean bit: we have a page dirty bit, and where is that set in the ramfs read-fault case? I've not experimented to check, maybe you're right and ramfs is exempt from the issue. I thought it was __do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE. Ah, you quote almost the very place further down. > > shm and tmpfs are indeed different and I did not take them into account > (ba dum tisch) when reviewing. For those pages would it be sufficient to > check the following? > > PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)) > > The problem the patch dealt with involved buffers associated with the page > and that shouldn't be a problem for tmpfs, right? Right, though I'm now beginning to wonder what the underlying bug is. It seems to me that we have a bug and an optimization on our hands, and have rushed into the optimization which would avoid the bug, without considering what the actual bug is. More in reply to Jan. > I recognise that this > might work just because of co-incidence and set off your "Yuck" detector > and you'll prefer the proposed solution below.
No, I was mistaken to think that s390 would have dirty pages where others had clean, Martin has now explained that SetPageUptodate cleans. I didn't mind continuing an (imagined) inefficiency in s390, but I don't want to make it more inefficient. > > > You mention above that even the kernel writing to the page would mark > > the s390 storage key dirty. I think that means that these shm and > > tmpfs and ramfs pages would all have dirty storage keys just from the > > clear_highpage() used to prepare them originally, and so would have > > been found dirty anyway by the existing code here in page_remove_rmap(), > > even though other architectures would regard them as clean and removable. > > > > If that's the case, then maybe we'd do better just to mark them dirty > > when faulted in the s390 case. Then your patch above should (I think) > > be safe. Though I'd then be VERY tempted to adjust the SwapCache case > > too (I've not thought through exactly what that patch would be, just > > one or two suitably placed SetPageDirtys, I think), and eliminate > > page_test_and_clear_dirty() altogether - no tears shed by any of us! So that fantasy was all wrong: appealing, but wrong. > > > > Do you mean something like this? > > diff --git a/mm/memory.c b/mm/memory.c > index 5736170..c66166f 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct > vm_area_struct *vma, > } else { > inc_mm_counter_fast(mm, MM_FILEPAGES); > page_add_file_rmap(page); > - if (flags & FAULT_FLAG_WRITE) { > + > + /* > + * s390 depends on the dirty flag from the storage key > + * being propagated when the page is unmapped from the > + * page tables. For dirty-accounted mapping, we instead > + * depend on the page being marked dirty on writes and > + * being write-protected on clear_page_dirty_for_io. 
> + * The same protection does not apply for tmpfs pages > + * that do not participate in dirty accounting so mark > + * them dirty at fault time to avoid the data being > + * lost > + */ > + if (flags & FAULT_FLAG_WRITE || > + !bdi_cap_account_dirty(page->mapping)) { > dirty_page = page; > get_page(dirty_page); > } > > Could something like this result in more writes to swap? Lets say there > is an unmapped tmpfs file with data on it -- a process maps it, reads the > entire mapping and exits. The page is now dirty and potentially will have > to be rewritten to swap. That seems bad. Did I miss your point? My point was that I mistakenly thought s390 must already be behaving like that, so wanted it to continue that way, but with
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon 08-10-12 21:24:40, Hugh Dickins wrote: > On Mon, 1 Oct 2012, Jan Kara wrote: > > > On s390 any write to a page (even from kernel itself) sets architecture > > specific page dirty bit. Thus when a page is written to via standard write, > > HW > > dirty bit gets set and when we later map and unmap the page, > > page_remove_rmap() > > finds the dirty bit and calls set_page_dirty(). > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems > > to > > filesystems. The bug we observed in practice is that buffers from the page > > get > > freed, so when the page gets later marked as dirty and writeback writes it, > > XFS > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() > > called > > from xfs_count_page_state(). > > What changed recently? Was XFS hardly used on s390 until now? The problem was originally hit on SLE11-SP2 which is 3.0 based after migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think XFS just started to be more peevish about what pages it gets between these two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things up). > > Similar problem can also happen when zero_user_segment() call from > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the > > hardware dirty bit during writeback, later buffers get freed, and then page > > unmapped. > > > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets > > marked as writeable in PTE it is also marked dirty in do_wp_page() or > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(), > > the page gets writeprotected in page_mkclean(). So pagecache page is > > writeable > > if and only if it is dirty. > > Very interesting patch... 
Originally, I even wanted to rip out pte dirty bit handling for shared file pages but in the end that seemed too bold and unnecessary for my problem ;) > > CC: linux-s...@vger.kernel.org > > Signed-off-by: Jan Kara > > but I think it's wrong. Thanks for having a look. > > --- > > mm/rmap.c | 16 ++-- > > 1 files changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 0f3b7cd..6ce8ddb 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) > > struct address_space *mapping = page_mapping(page); > > if (mapping) { > > ret = page_mkclean_file(mapping, page); > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) > > + /* > > +* We ignore dirty bit for pagecache pages. It is safe > > +* as page is marked dirty iff it is writeable (page is > > +* marked as dirty when it is made writeable and > > +* clear_page_dirty_for_io() writeprotects the page > > +* again). > > +*/ > > + if (PageSwapCache(page) && > > + page_test_and_clear_dirty(page_to_pfn(page), 1)) > > ret = 1; > > This part you could cut out: page_mkclean() is not used on SwapCache pages. > I believe you are safe to remove the page_test_and_clear_dirty() from here. OK, will do. > > } > > } > > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page) > > * this if the page is anon, so about to be freed; but perhaps > > * not if it's in swapcache - there might be another pte slot > > * containing the swap entry, but page not yet written to swap. > > +* For pagecache pages, we don't care about dirty bit in storage > > +* key because the page is writeable iff it is dirty (page is marked > > +* as dirty when it is made writeable and clear_page_dirty_for_io() > > +* writeprotects the page again). > > */ > > - if ((!anon || PageSwapCache(page)) && > > + if (PageSwapCache(page) && > > page_test_and_clear_dirty(page_to_pfn(page), 1)) > > set_page_dirty(page); > > But here's where I think the problem is. 
You're assuming that all > filesystems go the same mapping_cap_account_writeback_dirty() (yeah, > there's no such function, just a confusing maze of three) route as XFS. > > But filesystems like tmpfs and ramfs (perhaps they're the only two > that matter here) don't participate in that, and wait for an mmap'ed > page to be seen modified by the user (usually via pte_dirty, but that's > a no-op on s390) before page is marked dirty; and page reclaim throws > away undirtied pages. I admit I haven't thought of tmpfs and similar. After some discussion Mel pointed me to the code in mmap which makes a difference. So if I get it right, the difference which causes us problems is that on tmpfs we map the page writeably even during read-only fault. OK, then if I make the above code
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon, 8 Oct 2012 21:24:40 -0700 (PDT) Hugh Dickins wrote: > On Mon, 1 Oct 2012, Jan Kara wrote: > > > On s390 any write to a page (even from kernel itself) sets architecture > > specific page dirty bit. Thus when a page is written to via standard write, > > HW > > dirty bit gets set and when we later map and unmap the page, > > page_remove_rmap() > > finds the dirty bit and calls set_page_dirty(). > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems > > to > > filesystems. The bug we observed in practice is that buffers from the page > > get > > freed, so when the page gets later marked as dirty and writeback writes it, > > XFS > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() > > called > > from xfs_count_page_state(). > > What changed recently? Was XFS hardly used on s390 until now? One thing that changed is that the zero_user_segment for the remaining bytes between i_size and the end of the page has been moved to block_write_full_page_endio, see git commit eebd2aa355692afa. That changed the timing of the race window in regard to map/unmap of the page by user space. And yes XFS is in use on s390. > > > > Similar problem can also happen when zero_user_segment() call from > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the > > hardware dirty bit during writeback, later buffers get freed, and then page > > unmapped. > > > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets > > marked as writeable in PTE it is also marked dirty in do_wp_page() or > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(), > > the page gets writeprotected in page_mkclean(). So pagecache page is > > writeable > > if and only if it is dirty. > > Very interesting patch... Yes, it is an interesting idea. 
I really like the part that we'll use less storage key operations, as these are freaking expensive. > > > > CC: Martin Schwidefsky > > which I'd very much like Martin's opinion on... Until you pointed out the short-comings of the patch I really liked it .. > > --- > > mm/rmap.c | 16 ++-- > > 1 files changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 0f3b7cd..6ce8ddb 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) > > struct address_space *mapping = page_mapping(page); > > if (mapping) { > > ret = page_mkclean_file(mapping, page); > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) > > + /* > > +* We ignore dirty bit for pagecache pages. It is safe > > +* as page is marked dirty iff it is writeable (page is > > +* marked as dirty when it is made writeable and > > +* clear_page_dirty_for_io() writeprotects the page > > +* again). > > +*/ > > + if (PageSwapCache(page) && > > + page_test_and_clear_dirty(page_to_pfn(page), 1)) > > ret = 1; > > This part you could cut out: page_mkclean() is not used on SwapCache pages. > I believe you are safe to remove the page_test_and_clear_dirty() from here. Hmm, who guarantees that page_mkclean won't be used for SwapCache in the future? At least we should add a comment there. > > } > > } > > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page) > > * this if the page is anon, so about to be freed; but perhaps > > * not if it's in swapcache - there might be another pte slot > > * containing the swap entry, but page not yet written to swap. > > +* For pagecache pages, we don't care about dirty bit in storage > > +* key because the page is writeable iff it is dirty (page is marked > > +* as dirty when it is made writeable and clear_page_dirty_for_io() > > +* writeprotects the page again). 
> > */ > > - if ((!anon || PageSwapCache(page)) && > > + if (PageSwapCache(page) && > > page_test_and_clear_dirty(page_to_pfn(page), 1)) > > set_page_dirty(page); > > But here's where I think the problem is. You're assuming that all > filesystems go the same mapping_cap_account_writeback_dirty() (yeah, > there's no such function, just a confusing maze of three) route as XFS. > > But filesystems like tmpfs and ramfs (perhaps they're the only two > that matter here) don't participate in that, and wait for an mmap'ed > page to be seen modified by the user (usually via pte_dirty, but that's > a no-op on s390) before page is marked dirty; and page reclaim throws > away undirtied pages. > > So, if I'm understanding right, with this change s390 would be in danger > of discarding shm, and mmap'ed tmpfs and ramfs
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote: > > > > CC: Mel Gorman > > and I'm grateful to Mel's ack for reawakening me to it... > > > CC: linux-s...@vger.kernel.org > > Signed-off-by: Jan Kara > > but I think it's wrong. > Dang. > > --- > > mm/rmap.c | 16 ++-- > > 1 files changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 0f3b7cd..6ce8ddb 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) > > struct address_space *mapping = page_mapping(page); > > if (mapping) { > > ret = page_mkclean_file(mapping, page); > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) > > + /* > > +* We ignore dirty bit for pagecache pages. It is safe > > +* as page is marked dirty iff it is writeable (page is > > +* marked as dirty when it is made writeable and > > +* clear_page_dirty_for_io() writeprotects the page > > +* again). > > +*/ > > + if (PageSwapCache(page) && > > + page_test_and_clear_dirty(page_to_pfn(page), 1)) > > ret = 1; > > This part you could cut out: page_mkclean() is not used on SwapCache pages. > I believe you are safe to remove the page_test_and_clear_dirty() from here. > > > } > > } > > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page) > > * this if the page is anon, so about to be freed; but perhaps > > * not if it's in swapcache - there might be another pte slot > > * containing the swap entry, but page not yet written to swap. > > +* For pagecache pages, we don't care about dirty bit in storage > > +* key because the page is writeable iff it is dirty (page is marked > > +* as dirty when it is made writeable and clear_page_dirty_for_io() > > +* writeprotects the page again). > > */ > > - if ((!anon || PageSwapCache(page)) && > > + if (PageSwapCache(page) && > > page_test_and_clear_dirty(page_to_pfn(page), 1)) > > set_page_dirty(page); > > But here's where I think the problem is. 
You're assuming that all > filesystems go the same mapping_cap_account_writeback_dirty() (yeah, > there's no such function, just a confusing maze of three) route as XFS. > > But filesystems like tmpfs and ramfs (perhaps they're the only two > that matter here) don't participate in that, and wait for an mmap'ed > page to be seen modified by the user (usually via pte_dirty, but that's > a no-op on s390) before page is marked dirty; and page reclaim throws > away undirtied pages. > > So, if I'm understanding right, with this change s390 would be in danger > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages > written with the write system call would already be PageDirty and secure. > In the case of ramfs, what marks the page clean so it could be discarded? It does not participate in dirty accounting so it's not going to clear the dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage handler that would use an end_io handler to clear the page after "IO" completes. I am not seeing how a ramfs page can get discarded at the moment. shm and tmpfs are indeed different and I did not take them into account (ba dum tisch) when reviewing. For those pages would it be sufficient to check the following? PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)) The problem the patch dealt with involved buffers associated with the page and that shouldn't be a problem for tmpfs, right? I recognise that this might work just because of co-incidence and set off your "Yuck" detector and you'll prefer the proposed solution below. > You mention above that even the kernel writing to the page would mark > the s390 storage key dirty.
I think that means that these shm and > tmpfs and ramfs pages would all have dirty storage keys just from the > clear_highpage() used to prepare them originally, and so would have > been found dirty anyway by the existing code here in page_remove_rmap(), > even though other architectures would regard them as clean and removable. > > If that's the case, then maybe we'd do better just to mark them dirty > when faulted in the s390 case. Then your patch above should (I think) > be safe. Though I'd then be VERY tempted to adjust the SwapCache case > too (I've not thought through exactly what that patch would be, just > one or two suitably placed SetPageDirtys, I think), and eliminate > page_test_and_clear_dirty() altogether - no tears shed by any of us! > Do you mean something like this? diff --git a/mm/memory.c b/mm/memory.c index 5736170..c66166f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). What changed recently? Was XFS hardly used on s390 until now? The problem was originally hit on SLE11-SP2 which is 3.0 based after migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think XFS just started to be more peevish about what pages it gets between these two releases ;) (e.g. ext3 or ext4 just says oh, well and fixes things up). Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped. Fix the issue by ignoring s390 HW dirty bit for page cache pages in page_mkclean() and page_remove_rmap(). This is safe because when a page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(), the page gets writeprotected in page_mkclean(). So pagecache page is writeable if and only if it is dirty. Very interesting patch... Originally, I even wanted to rip out pte dirty bit handling for shared file pages but in the end that seemed too bold and unnecessary for my problem ;) CC: linux-s...@vger.kernel.org Signed-off-by: Jan Kara j...@suse.cz but I think it's wrong. Thanks for having a look. 
--- mm/rmap.c | 16 ++-- 1 files changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 0f3b7cd..6ce8ddb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping) { ret = page_mkclean_file(mapping, page); - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) + /* +* We ignore dirty bit for pagecache pages. It is safe +* as page is marked dirty iff it is writeable (page is +* marked as dirty when it is made writeable and +* clear_page_dirty_for_io() writeprotects the page +* again). +*/ + if (PageSwapCache(page) && + page_test_and_clear_dirty(page_to_pfn(page), 1)) ret = 1; This part you could cut out: page_mkclean() is not used on SwapCache pages. I believe you are safe to remove the page_test_and_clear_dirty() from here. OK, will do. } } @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page) * this if the page is anon, so about to be freed; but perhaps * not if it's in swapcache - there might be another pte slot * containing the swap entry, but page not yet written to swap. +* For pagecache pages, we don't care about dirty bit in storage +* key because the page is writeable iff it is dirty (page is marked +* as dirty when it is made writeable and clear_page_dirty_for_io() +* writeprotects the page again). */ - if ((!anon || PageSwapCache(page)) && page_test_and_clear_dirty(page_to_pfn(page), 1)) set_page_dirty(page); But here's where I think the problem is. You're assuming that all filesystems go the same mapping_cap_account_writeback_dirty() (yeah, there's no such function, just a confusing maze of three) route as XFS.
But filesystems like tmpfs and ramfs (perhaps they're the only two that matter here) don't participate in that, and wait for an mmap'ed page to be seen modified by the user (usually via pte_dirty, but that's a no-op on s390) before page is marked dirty; and page reclaim throws away undirtied pages. I admit I haven't thought of tmpfs and similar. After some discussion Mel pointed me to the code in mmap which makes a difference. So if I get it right, the difference which causes us problems is that on tmpfs we map the page writeably even during read-only fault. OK, then if I make the above code in page_remove_rmap(): if ((PageSwapCache(page) || (!anon && !mapping_cap_account_dirty(page->mapping))) && page_test_and_clear_dirty(page_to_pfn(page), 1)) set_page_dirty(page); Things should be ok (modulo the ugliness of this condition), right?
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Mel Gorman wrote: On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote: So, if I'm understanding right, with this change s390 would be in danger of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages written with the write system call would already be PageDirty and secure. In the case of ramfs, what marks the page clean so it could be discarded? It does not participate in dirty accounting so it's not going to clear the dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage handler that would use an end_io handler to clear the page after "IO" completes. I am not seeing how a ramfs page can get discarded at the moment. But we don't have a page clean bit: we have a page dirty bit, and where is that set in the ramfs read-fault case? I've not experimented to check, maybe you're right and ramfs is exempt from the issue. I thought it was __do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE. Ah, you quote almost the very place further down. shm and tmpfs are indeed different and I did not take them into account (ba dum tisch) when reviewing. For those pages would it be sufficient to check the following? PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)) Something like that, yes: I've a possible patch I'll put in reply to Jan. The problem the patch dealt with involved buffers associated with the page and that shouldn't be a problem for tmpfs, right? Right, though I'm now beginning to wonder what the underlying bug is. It seems to me that we have a bug and an optimization on our hands, and have rushed into the optimization which would avoid the bug, without considering what the actual bug is. More in reply to Jan. I recognise that this might work just because of co-incidence and set off your "Yuck" detector and you'll prefer the proposed solution below.
No, I was mistaken to think that s390 would have dirty pages where others had clean; Martin has now explained that SetPageUptodate cleans. I didn't mind continuing an (imagined) inefficiency in s390, but I don't want to make it more inefficient. You mention above that even the kernel writing to the page would mark the s390 storage key dirty. I think that means that these shm and tmpfs and ramfs pages would all have dirty storage keys just from the clear_highpage() used to prepare them originally, and so would have been found dirty anyway by the existing code here in page_remove_rmap(), even though other architectures would regard them as clean and removable. If that's the case, then maybe we'd do better just to mark them dirty when faulted in the s390 case. Then your patch above should (I think) be safe. Though I'd then be VERY tempted to adjust the SwapCache case too (I've not thought through exactly what that patch would be, just one or two suitably placed SetPageDirtys, I think), and eliminate page_test_and_clear_dirty() altogether - no tears shed by any of us! So that fantasy was all wrong: appealing, but wrong. Do you mean something like this? diff --git a/mm/memory.c b/mm/memory.c index 5736170..c66166f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else { inc_mm_counter_fast(mm, MM_FILEPAGES); page_add_file_rmap(page); - if (flags & FAULT_FLAG_WRITE) { + + /* + * s390 depends on the dirty flag from the storage key + * being propagated when the page is unmapped from the + * page tables. For dirty-accounted mapping, we instead + * depend on the page being marked dirty on writes and + * being write-protected on clear_page_dirty_for_io.
+ * The same protection does not apply for tmpfs pages + * that do not participate in dirty accounting so mark + * them dirty at fault time to avoid the data being + * lost + */ + if (flags & FAULT_FLAG_WRITE || + !bdi_cap_account_dirty(page->mapping)) { dirty_page = page; get_page(dirty_page); } Could something like this result in more writes to swap? Let's say there is an unmapped tmpfs file with data on it -- a process maps it, reads the entire mapping and exits. The page is now dirty and potentially will have to be rewritten to swap. That seems bad. Did I miss your point? My point was that I mistakenly thought s390 must already be behaving like that, so wanted it to continue that way, but with cleaner source. But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page starts out clean.
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Martin Schwidefsky wrote: On Mon, 8 Oct 2012 21:24:40 -0700 (PDT) Hugh Dickins hu...@google.com wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). What changed recently? Was XFS hardly used on s390 until now? One thing that changed is that the zero_user_segment for the remaining bytes between i_size and the end of the page has been moved to block_write_full_page_endio, see git commit eebd2aa355692afa. That changed the timing of the race window in regard to map/unmap of the page by user space. And yes XFS is in use on s390. February 2008: I think we have different ideas of recently ;) Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped. Fix the issue by ignoring s390 HW dirty bit for page cache pages in page_mkclean() and page_remove_rmap(). This is safe because when a page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(), the page gets writeprotected in page_mkclean(). So pagecache page is writeable if and only if it is dirty. Very interesting patch... Yes, it is an interesting idea. 
I really like the part that we'll use less storage key operations, as these are freaking expensive. As I said to Mel and will repeat to Jan, though an optimization would be nice, I don't think we should necessarily mix it with the bugfix. CC: Martin Schwidefsky schwidef...@de.ibm.com which I'd very much like Martin's opinion on... Until you pointed out the short-comings of the patch I really liked it .. --- mm/rmap.c | 16 ++-- 1 files changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 0f3b7cd..6ce8ddb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -973,7 +973,15 @@ int page_mkclean(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping) { ret = page_mkclean_file(mapping, page); - if (page_test_and_clear_dirty(page_to_pfn(page), 1)) + /* + * We ignore dirty bit for pagecache pages. It is safe + * as page is marked dirty iff it is writeable (page is + * marked as dirty when it is made writeable and + * clear_page_dirty_for_io() writeprotects the page + * again). + */ + if (PageSwapCache(page) && + page_test_and_clear_dirty(page_to_pfn(page), 1)) ret = 1; This part you could cut out: page_mkclean() is not used on SwapCache pages. I believe you are safe to remove the page_test_and_clear_dirty() from here. Hmm, who guarantees that page_mkclean won't be used for SwapCache in the future? At least we should add a comment there. I set out to do so, to add a comment there; but honestly, it's a strange place for such a comment when there's no longer even the code to comment upon. And page_mkclean_file(), called in the line above, already says BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change that sends PageSwapCache pages this way. It is possible that one day we shall want to send tmpfs and swapcache down this route, I'm not ruling that out; but then we shall have to extend page_mkclean(), yes.
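The exchange above leans on the invariant Jan's patch assumes for dirty-accounted mappings: a pagecache page is mapped writable if and only if it is PageDirty, because do_wp_page()/do_page_fault() dirty the page when making it writable and clear_page_dirty_for_io() write-protects it again when cleaning. As a non-authoritative illustration, here is a tiny user-space model of that state machine; the struct and function names are invented for the sketch, not kernel APIs:

```c
/*
 * Toy user-space model of the invariant discussed above: for a
 * dirty-accounted pagecache page, "mapped writable" and "PageDirty"
 * move together.  All names are invented for this sketch.
 */
#include <assert.h>
#include <stdbool.h>

struct model_page {
	bool pte_writable;	/* mapped writable in the page table */
	bool page_dirty;	/* PageDirty */
};

/* do_wp_page()/do_page_fault(): making the page writable also dirties it */
static void model_make_writable(struct model_page *p)
{
	p->pte_writable = true;
	p->page_dirty = true;
}

/* clear_page_dirty_for_io(): cleaning also write-protects via page_mkclean() */
static void model_clear_dirty_for_io(struct model_page *p)
{
	p->page_dirty = false;
	p->pte_writable = false;
}

/* the invariant: writable if and only if dirty */
static bool model_invariant(const struct model_page *p)
{
	return p->pte_writable == p->page_dirty;
}
```

If the invariant holds, the storage-key dirty bit carries no extra information for these pages, which is why the patch can skip the expensive page_test_and_clear_dirty() call for them.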
The patch relies on the software dirty bit tracking for file backed pages; if dirty bit tracking is not done for tmpfs and ramfs we are borked. You mention above that even the kernel writing to the page would mark the s390 storage key dirty. I think that means that these shm and tmpfs and ramfs pages would all have dirty storage keys just from the clear_highpage() used to prepare them originally, and so would have been found dirty anyway by the existing code here in page_remove_rmap(), even though other architectures would regard them as clean and removable. No, the clear_highpage() will set the dirty bit in the storage key but the SetPageUptodate will clear it again.
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, 9 Oct 2012, Jan Kara wrote: On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via standard write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). What changed recently? Was XFS hardly used on s390 until now? The problem was originally hit on SLE11-SP2 which is 3.0 based after migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think XFS just started to be more peevish about what pages it gets between these two releases ;) (e.g. ext3 or ext4 just says oh, well and fixes things up). Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case, whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the ASSERT usually compiled out, stumbling later in page_buffers() as you say. Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped. Similar problem, or is that the whole of the problem? Where else does the page get written to, after clearing page dirty? (It may not be worth spending time to answer me, I feel I'm wasting too much time on this.) I keep trying to put my finger on the precise bug. I said in earlier mails to Mel and to Martin that we're mixing a bugfix and an optimization, but I cannot quite point to the bug. 
Could one say that it's precisely at the "page straddles i_size" zero_user_segment(), in XFS or in other FSes? that the storage key ought to be re-cleaned after that? What if one day I happened to copy that code into shmem_writepage()? I've no intention to do so! And it wouldn't cause a BUG. Ah, and we never write shmem to swap while it's still mapped, so it wouldn't even have a chance to redirty the page in page_remove_rmap(). I guess I'm worrying too much; but it's not crystal clear to me why any !mapping_cap_account_dirty mapping would necessarily not have the problem. But here's where I think the problem is. You're assuming that all filesystems go the same mapping_cap_account_writeback_dirty() (yeah, there's no such function, just a confusing maze of three) route as XFS. But filesystems like tmpfs and ramfs (perhaps they're the only two that matter here) don't participate in that, and wait for an mmap'ed page to be seen modified by the user (usually via pte_dirty, but that's a no-op on s390) before page is marked dirty; and page reclaim throws away undirtied pages. I admit I haven't thought of tmpfs and similar. After some discussion Mel pointed me to the code in mmap which makes a difference. So if I get it right, the difference which causes us problems is that on tmpfs we map the page writeably even during read-only fault. OK, then if I make the above code in page_remove_rmap(): if ((PageSwapCache(page) || (!anon && !mapping_cap_account_dirty(page->mapping))) && page_test_and_clear_dirty(page_to_pfn(page), 1)) set_page_dirty(page); Things should be ok (modulo the ugliness of this condition), right? (Setting aside my reservations above...) That's almost exactly right, but I think the issue of a racing truncation (which could reset page->mapping to NULL at any moment) means we have to be a bit more careful. Usually we guard against that with page lock, but here we can rely on mapcount.
page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell, the
extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM, when we
are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix, but I
thought I'd mention it while we're all staring at this.)

Hugh

---
 mm/rmap.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

--- 3.6.0+/mm/rmap.c
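[Editor's note: a hypothetical userspace model, not kernel code, of the predicate being negotiated above — on which pages should s390's page_remove_rmap() consult and clear the storage-key dirty bit. The struct and its fields are stand-ins that only mirror kernel names; the logic follows Jan's proposed condition.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the page state page_remove_rmap() can see. */
struct page_model {
	bool anon;              /* anonymous page */
	bool swapcache;         /* PageSwapCache() */
	bool has_mapping;       /* page->mapping non-NULL (no racing truncation) */
	bool cap_account_dirty; /* mapping_cap_account_dirty(): XFS, ext4, ... */
};

/*
 * Jan's proposed predicate: check (and clear) the storage key for
 * swapcache pages and for file pages whose mapping does no dirty
 * accounting (tmpfs, ramfs); skip it for anon pages and for
 * dirty-accounting filesystems, where "writeable iff dirty" holds.
 */
static bool should_check_storage_key(const struct page_model *p)
{
	if (p->swapcache)
		return true;
	if (!p->anon && p->has_mapping && !p->cap_account_dirty)
		return true;
	return false;
}
```

Note how a truncated file page (mapping already NULL) falls through to false, which is exactly the race Hugh points at: the real check must not dereference page->mapping without a guard.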
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon, 1 Oct 2012, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard
> write, HW dirty bit gets set and when we later map and unmap the page,
> page_remove_rmap() finds the dirty bit and calls set_page_dirty().
>
> Dirtying of a page which shouldn't be dirty can cause all sorts of
> problems to filesystems. The bug we observed in practice is that buffers
> from the page get freed, so when the page gets later marked as dirty and
> writeback writes it, XFS crashes due to an assertion
> BUG_ON(!PagePrivate(page)) in page_buffers() called from
> xfs_count_page_state().

What changed recently?  Was XFS hardly used on s390 until now?

> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then
> page unmapped.
>
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page
> gets marked as writeable in PTE it is also marked dirty in do_wp_page()
> or do_page_fault(). When the dirty bit is cleared by
> clear_page_dirty_for_io(), the page gets writeprotected in
> page_mkclean(). So pagecache page is writeable if and only if it is
> dirty.

Very interesting patch...

> CC: Martin Schwidefsky

which I'd very much like Martin's opinion on...

> CC: Mel Gorman

and I'm grateful to Mel's ack for reawakening me to it...

> CC: linux-s...@vger.kernel.org
> Signed-off-by: Jan Kara

but I think it's wrong.
> ---
>  mm/rmap.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> +			/*
> +			 * We ignore dirty bit for pagecache pages. It is safe
> +			 * as page is marked dirty iff it is writeable (page is
> +			 * marked as dirty when it is made writeable and
> +			 * clear_page_dirty_for_io() writeprotects the page
> +			 * again).
> +			 */
> +			if (PageSwapCache(page) &&
> +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;

This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.

>  		}
>  	}
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 * For pagecache pages, we don't care about dirty bit in storage
> +	 * key because the page is writeable iff it is dirty (page is marked
> +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> +	 * writeprotects the page again).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (PageSwapCache(page) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);

But here's where I think the problem is.  You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.
But filesystems like tmpfs and ramfs (perhaps they're the only two that
matter here) don't participate in that, and wait for an mmap'ed page to be
seen modified by the user (usually via pte_dirty, but that's a no-op on
s390) before page is marked dirty; and page reclaim throws away undirtied
pages.

So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.

You mention above that even the kernel writing to the page would mark the
s390 storage key dirty.  I think that means that these shm and tmpfs and
ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have been
found dirty anyway by the existing code here in page_remove_rmap(), even
though other architectures would regard them as clean and removable.

If that's the case, then maybe we'd do better just to mark them dirty when
faulted in the s390 case.  Then your patch above should (I think) be safe.
Though I'd then be VERY
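[Editor's note: a hypothetical model, not kernel code, of the reclaim hazard Hugh describes for tmpfs/ramfs on s390 — a userspace store is recorded only in the per-page storage key (pte_dirty is a no-op there), so if the key is ignored at unmap time the page still looks clean and reclaim may discard the data. All names below are illustrative stand-ins.]

```c
#include <assert.h>
#include <stdbool.h>

struct spage {
	bool page_dirty;  /* PageDirty: the software flag reclaim trusts */
	bool storage_key; /* s390 per-page hardware dirty (change) bit */
};

/* A store through a writable mapping: only the storage key records it. */
static void user_store(struct spage *p)
{
	p->storage_key = true;
}

/*
 * Unmap: optionally transfer storage-key state into PageDirty, as
 * page_remove_rmap() does via page_test_and_clear_dirty() +
 * set_page_dirty().
 */
static void unmap_page(struct spage *p, bool consult_key)
{
	if (consult_key && p->storage_key) {
		p->storage_key = false;
		p->page_dirty = true;
	}
}

/* Reclaim throws away undirtied pages: data survives only if PageDirty. */
static bool reclaim_keeps_data(const struct spage *p)
{
	return p->page_dirty;
}
```

In this model, skipping the storage-key check for a tmpfs page whose only modification record is the key makes reclaim_keeps_data() return false, i.e. the modified page would be discarded.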
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard
> write, HW dirty bit gets set and when we later map and unmap the page,
> page_remove_rmap() finds the dirty bit and calls set_page_dirty().
>
> Dirtying of a page which shouldn't be dirty can cause all sorts of
> problems to filesystems. The bug we observed in practice is that buffers
> from the page get freed, so when the page gets later marked as dirty and
> writeback writes it, XFS crashes due to an assertion
> BUG_ON(!PagePrivate(page)) in page_buffers() called from
> xfs_count_page_state().
>
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then
> page unmapped.
>
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page
> gets marked as writeable in PTE it is also marked dirty in do_wp_page()
> or do_page_fault(). When the dirty bit is cleared by
> clear_page_dirty_for_io(), the page gets writeprotected in
> page_mkclean(). So pagecache page is writeable if and only if it is
> dirty.
>
> CC: Martin Schwidefsky
> CC: Mel Gorman
> CC: linux-s...@vger.kernel.org
> Signed-off-by: Jan Kara

Acked-by: Mel Gorman

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/