Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 7 Jan 2007 12:36:18 +1030 "Tom Lanyon" <[EMAIL PROTECTED]> wrote:

> On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > What would also actually be interesting is whether somebody can reproduce
> > this on Reiserfs, for example. I _think_ all the reports I've seen are on
> > ext2 or ext3, and if this is somehow writeback-related, it could be some
> > bug that is just shared between the two by virtue of them still having a
> > lot of stuff in common.
> >
> > Linus
>
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.

reiserfs defaults to data=ordered, so it's quite possibly the same bug.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 1/7/07, Tom Lanyon <[EMAIL PROTECTED]> wrote:
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.

However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far...

--
Tom Lanyon
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> What would also actually be interesting is whether somebody can reproduce
> this on Reiserfs, for example. I _think_ all the reports I've seen are on
> ext2 or ext3, and if this is somehow writeback-related, it could be some
> bug that is just shared between the two by virtue of them still having a
> lot of stuff in common.
>
> Linus

I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.

--
Tom Lanyon
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote:
> > The only -mm stuff I recall being in the Fedora 2.6.18 is
> > the inode-diet stuff which ended up in 2.6.19, though the xmas
> > break has left my head somewhat empty so I may be forgetting something.
> > What patch in particular are you talking about?
>
> it's no longer visible in the FC6 cvs, due to rebase
> but its name was linux-2.6-mm-tracking-dirty-pages.patch
> it is an earlier amalgam of the merged patch series:
> - mm: tracking shared dirty pages
> - mm: balance dirty pages
> - mm: optimize the new mprotect() code a bit
> - mm: small cleanup of install_page()
> - mm: fixup do_wp_page()
> - mm: msync() cleanup (closes: #394392)

Ohh, that. Yes. I had forgotten all about that.
I've been hitting the nog a little too hard :)

Dave

--
http://www.codemonkey.org.uk
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote:
> On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > > (or older)?
> > > >
> > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > > have the page throttling patches in it, those were written this summer. So
> > > > it would either have to be Fedora carrying around another patch that just
> > > > happens to result in the same corruption for _years_, or it's the same
> > > > bug.
> > >
> > > The only notable VM patch in Fedora kernels of that vintage that I recall
> > > was Ingo's 4g/4g thing.
> >
> > no the fedora 2.6.18 kernel is affected.
>
> I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
>
> > it carries the same -mm patches that Debian backported
> > for LSB 3.1 compliance.
>
> The only -mm stuff I recall being in the Fedora 2.6.18 is
> the inode-diet stuff which ended up in 2.6.19, though the xmas
> break has left my head somewhat empty so I may be forgetting something.
> What patch in particular are you talking about?

it's no longer visible in the FC6 cvs, due to rebase
but its name was linux-2.6-mm-tracking-dirty-pages.patch
it is an earlier amalgam of the merged patch series:
- mm: tracking shared dirty pages
- mm: balance dirty pages
- mm: optimize the new mprotect() code a bit
- mm: small cleanup of install_page()
- mm: fixup do_wp_page()
- mm: msync() cleanup (closes: #394392)

--
maks
Re: 2.6.19 file content corruption on ext3
Linus Torvalds wrote:
> going back to Linux-2.6.5 at least, according to one tester).

I apologize for the confusion, but it just occurred to me that I was
actually experiencing a totally different problem: I set a root
filesystem of 3 MiB for qemu, so the test program just didn't have
enough space for its file.

--
Guillaume
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > >
> > >
> > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > > me up), and that seems to show the corruption going way way back (ie going
> > > > > back to Linux-2.6.5 at least, according to one tester).
> > > >
> > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > (or older)?
> > >
> > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > have the page throttling patches in it, those were written this summer. So
> > > it would either have to be Fedora carrying around another patch that just
> > > happens to result in the same corruption for _years_, or it's the same
> > > bug.
> >
> > The only notable VM patch in Fedora kernels of that vintage that I recall
> > was Ingo's 4g/4g thing.
>
> no the fedora 2.6.18 kernel is affected.

I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.

> it carries the same -mm patches that Debian backported
> for LSB 3.1 compliance.

The only -mm stuff I recall being in the Fedora 2.6.18 is
the inode-diet stuff which ended up in 2.6.19, though the xmas
break has left my head somewhat empty so I may be forgetting something.
What patch in particular are you talking about?

Dave

--
http://www.codemonkey.org.uk
Re: 2.6.19 file content corruption on ext3
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> >
> >
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
>
> Dave

no the fedora 2.6.18 kernel is affected.
it carries the same -mm patches that Debian backported
for LSB 3.1 compliance.

--
maks

ps sorry for stripping cc, only downloaded that message raw.
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006 17:38:38 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote:

> in
> the hope that somebody else is working on this corruption issue and is
> interested..

What corruption issue? ;)

I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe it
will happen with increased rarity, but the difference is quite stark.

Removing the

	err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
				NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ... disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers). But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet. Maybe ext3 is just altering timing. But the difference
is really large..

Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect. Disabling bdi_write_congested() has no effect.
Re: 2.6.19 file content corruption on ext3
Btw,
 much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).

The test.c output is a bit hard to read at times, since it will give
offsets in bytes as hex (ie "00a77664" means page frame 0a77, and byte
664h within that page), while the kernel output is obviously the page
indexes (but the page fault _addresses_ can contain information about the
exact byte in a page, so you can match them up when some kernel event is
related to a page fault).

So both forms are necessary/logical, but it means that to match things up,
you often need to ignore the last three hex digits of the address that
"test.c" outputs.

This one also adds traces for the tags and the writeback activity, but
since I'm going out for birthday dinner, I won't have time to try to
actually analyse the trace I have..

Which is why I'm sending it out, in the hope that somebody else is working
on this corruption issue and is interested..

		Linus

diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page)	set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)		test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do {				\
+	if (PageInteresting(page))					\
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n",		\
+			(page)->index, __FILE__, __LINE__ ,##arg );	\
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page)						\
 	do {								\
-		if (!test_and_set_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (!test_and_set_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestSetPageWriteback(page)					\
 	({								\
 		int ret;						\
 		ret = test_and_set_bit(PG_writeback,			\
 				&(page)->flags);			\
-		if (!ret)						\
+		if (!ret) {						\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 		ret;							\
 	})
 #define ClearPageWriteback(page)					\
 	do {								\
-		if (test_and_clear_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (test_and_clear_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "end writeback");		\
 			dec_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestClearPageWriteback(page)					\
 	({
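For matching the two kinds of output described in the message above, here is a minimal sketch of how a byte offset printed by test.c splits into the page index that the kernel trace prints and the byte within that page. It assumes 4 kB pages (three hex digits of in-page offset), and the helper names are hypothetical; they are not part of test.c or of the patch.

```c
#include <stdint.h>

/* Assumption: 4 kB pages, i.e. the low 12 bits are the byte within the page. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Page index as it would appear in the kernel trace (hypothetical helper). */
static inline uint32_t sketch_page_index(uint32_t byte_offset)
{
	return byte_offset >> SKETCH_PAGE_SHIFT;
}

/* Byte position inside that page (hypothetical helper). */
static inline uint32_t sketch_page_byte(uint32_t byte_offset)
{
	return byte_offset & (SKETCH_PAGE_SIZE - 1u);
}
```

With the example from the message, `sketch_page_index(0x00a77664)` gives page 0xa77 and `sketch_page_byte(0x00a77664)` gives byte 0x664, which is the same as dropping or keeping the last three hex digits.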
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this corruption
> in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one
> corruption at random:
>
>     Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
>     Expected 129, got 0
>     Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
> within that page.
>
> There were four chunks written to that page:
>
>     Writing chunk 14463/15800 (15%) (0142344c)    (1)
>     Writing chunk 14462/15800 (30%) (01422e98)    (2) (overflows into 1423)
>     Writing chunk 14464/15800 (32%) (01423a00)    (3)
>     Writing chunk 14465/15800 (60%) (01423fb4)    (4) <--- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
>  - here we write the first chunk to the page:
>     ** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
>     ** Setting page 1423 dirty
>
>  - something flushes it out to disk:
>     ** cpd_for_io: index 1423
>     ** cleaning index 1423 at b7d1f000
>
>  - here we write the second chunk (which was split over the previous page
>    and the interesting one):
>     ** (2) Setting page 1422 dirty
>     ** (2) Setting page 1423 dirty
>
>  - and here we do a cleaning event
>     ** cpd_for_io: index 1423
>     ** cleaning index 1423 at b7d1f000
>
>  - here we write the third chunk:
>     ** (3) Setting page 1423 dirty
>
>  - here we write the fourth chunk:
>     ** (4) NO DIRTY EVENT
>
>  - and a third flush to disk:
>     ** cpd_for_io: index 1423
>     ** cleaning index 1423 at b7d1f000
>
>  - here we unmap and flush:
>     ** Unmapped index 1423 at b7d1f000
>     ** Removing index 1423 from page cache
>
>  - here we remap to check:
>     ** do_no_page: mapping index 1423 at b7d1f000 (read)
>     ** Unmapped index 1423 at b7d1f000
>
>  - and finally, here I remove the file after the run:
>     ** Removing index 1423 from page cache
>
> Now, the important thing to see here is:
>
>  - the missing write did not have a "Setting page 1423 dirty" event
>    associated with it.
>
>  - but I can _see_ where the actual dirty event would be happening in the
>    logs, because I can see the dirty events of the other chunk writes
>    around it, so I know exactly where that fourth write happens. And
>    indeed, it _shouldn't_ get a dirty event, because the page is still
>    dirty from the write of chunk #3 to that page, which _did_ get a dirty
>    event.
>
>    I can see that, because the testing app writes the log of the pages it
>    writes, and this is the log around the fourth and final write:
>
>     ...
>     Writing chunk 5338/15800 (60%) (0076eb48)    PFN: 76e/76f
>     Writing chunk 960/15800 (60%) (00156300)     PFN: 156
>     Writing chunk 14465/15800 (60%) (01423fb4)   <
>     Writing chunk 8594/15800 (60%) (00bf74a8)    PFN: bf7
>     Writing chunk 556/15800 (60%) (000c62f0)     PFN: c6
>     Writing chunk 15190/15800 (60%) (01526678)   PFN: 1526
>     ...
>
>    and I can match this up with the full log from the kernel, which looks
>    like this:
>
>     Setting page 076e dirty
>     Setting page 076f dirty
>     Setting page 0156 dirty
>     Setting page 00c6 dirty
>     Setting page 1526 dirty
>
>    so I know exactly where the missing writes (to our page at pfn 1423,
>    and the pfn-bf7 page) happened.
>
>  - and the thing is, I can see a "cpd_for_io()" happening AFTER that
>    fourth write. Quite a long while after, in fact. So all of this looks
>    very fine indeed. We are not losing any dirty bits.
>
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
>    the SAME dirty bit as write 4 did (which didn't make it out to disk!).
>    The event that clears the dirty bit that write 3 did happens AFTER
>    write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In

But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty
state...

A shot in the dark I know but it could perhaps be that a "COW for
MAP_PRIVATE" like event happens when the page is dirty already thus the
second write never actually makes it to the shared page thus it never gets
written out.

I am almost certainly totally barking up the wrong tree but I thought it
may be worth mentioning just in case there was a slip in the COW logic or
page wr
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
>
> But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a "COW for
> MAP_PRIVATE" like event happens when the page is dirty already thus the
> second write never actually makes it to the shared page thus it never gets
> written out.

There are no private mappings anywhere, and no forks. Just a single mmap
(well, we unmap and remap in order to force the page cache to be
invalidated properly with the posix_fadvise() thing, but that's literally
the only user).

		Linus
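The unmap-and-remap cycle mentioned above (a single shared mmap, with posix_fadvise() used to get the page cache invalidated before checking) might look roughly like the sketch below. This is not taken from test.c: the function name, the fill-pattern check and the fsync() before the fadvise call are all assumptions made for illustration. The fsync() is there because POSIX_FADV_DONTNEED is only expected to discard pages that are already clean.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from test.c): fill `len` bytes of `path` through
 * a MAP_SHARED mapping, unmap, ask the kernel to drop the cached pages, then
 * remap read-only and verify the contents.
 * Returns 0 if every byte reads back, 1 on a mismatch, -1 on a syscall error.
 */
static int sketch_write_drop_verify(const char *path, size_t len, unsigned char fill)
{
	unsigned char *map;
	int result = -1;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0)
		goto out;

	/* Write only through the shared mapping, never through write(). */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	memset(map, fill, len);
	if (munmap(map, len) != 0)
		goto out;

	/* Assumption: flush first so the pages are clean and can be dropped. */
	if (fsync(fd) != 0)
		goto out;
	/* posix_fadvise() returns an error number directly, 0 on success. */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
		goto out;

	/* Remap and check what is visible after the cache was dropped. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	result = 0;
	for (size_t i = 0; i < len; i++) {
		if (map[i] != fill) {
			result = 1;
			break;
		}
	}
	munmap(map, len);
out:
	close(fd);
	return result;
}
```

On a kernel without the bug a call such as `sketch_write_drop_verify("/tmp/sketch_mapfile", 3 * 4096, 0x5a)` should return 0. A single sequential pass like this is unlikely to trigger the corruption discussed in the thread, which needed thousands of pages written in scattered order; the sketch only shows the shape of the check.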
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, David Miller wrote:
>
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does
nothing to them. If the user dirties the page while writeback is in
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
> 	entry = ptep_clear_flush(vma, address, pte);
> 	entry = pte_wrprotect(entry);
> 	entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. "pte_wrprotect(entry)". You even copied it
yourself.

> What does the fault path do in this situation?
>
> 	if (write_access) {
> 		if (!pte_write(entry))
> 			return do_wp_page(mm, vma, address,
> 					pte, pmd, ptl, entry);

So we call "do_wp_page()", and that does everything right.

		Linus
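The argument in this exchange can be followed with a small user-space toy model: cleaning a page for IO removes both the write and dirty bits from the PTE, so the next store has to fault, and the fault is where the page gets re-dirtied. Everything below is illustrative. The bit values, the struct and the `toy_` names are assumptions and do not match the kernel's definitions; the fault branch only stands in for what do_wp_page() is described as doing in the thread.

```c
#include <stdbool.h>

/* Illustrative PTE bits; the real layout is architecture specific. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

struct toy_page {
	bool dirty;          /* stand-in for the page's dirty state */
	unsigned wp_faults;  /* how many write-protect faults were taken */
};

static unsigned toy_pte_wrprotect(unsigned pte) { return pte & ~TOY_PTE_WRITE; }
static unsigned toy_pte_mkclean(unsigned pte)   { return pte & ~TOY_PTE_DIRTY; }

/* Models the cleaning step: write-protect and clean the PTE, clear page dirty. */
static unsigned toy_clean_for_io(struct toy_page *page, unsigned pte)
{
	pte = toy_pte_wrprotect(pte);
	pte = toy_pte_mkclean(pte);
	page->dirty = false;
	return pte;
}

/* Models a user store through the mapping. */
static unsigned toy_user_write(struct toy_page *page, unsigned pte)
{
	if (!(pte & TOY_PTE_WRITE)) {
		/* Not writable: fault, re-dirty the page, make the PTE writable. */
		page->wp_faults++;
		page->dirty = true;
		pte |= TOY_PTE_WRITE;
	}
	return pte | TOY_PTE_DIRTY;
}
```

In this model a write that follows toy_clean_for_io() always takes a fault and leaves the page dirty again, while a second write to the still-dirty page takes no fault. That matches the "NO DIRTY EVENT" for the fourth chunk in the trace being expected rather than suspicious.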
Re: 2.6.19 file content corruption on ext3
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

	entry = ptep_clear_flush(vma, address, pte);
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it. Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.
Re: 2.6.19 file content corruption on ext3
Ok,
 with the ugly trace capture patch, I've actually captured this corruption
in action, I think.

I did a full trace of all pages involved in one run, and picked one
corruption at random:

    Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
    Expected 129, got 0
    Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
within that page.

There were four chunks written to that page:

    Writing chunk 14463/15800 (15%) (0142344c)    (1)
    Writing chunk 14462/15800 (30%) (01422e98)    (2) (overflows into 1423)
    Writing chunk 14464/15800 (32%) (01423a00)    (3)
    Writing chunk 14465/15800 (60%) (01423fb4)    (4) <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
    ** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
    ** Setting page 1423 dirty

 - something flushes it out to disk:
    ** cpd_for_io: index 1423
    ** cleaning index 1423 at b7d1f000

 - here we write the second chunk (which was split over the previous page
   and the interesting one):
    ** (2) Setting page 1422 dirty
    ** (2) Setting page 1423 dirty

 - and here we do a cleaning event
    ** cpd_for_io: index 1423
    ** cleaning index 1423 at b7d1f000

 - here we write the third chunk:
    ** (3) Setting page 1423 dirty

 - here we write the fourth chunk:
    ** (4) NO DIRTY EVENT

 - and a third flush to disk:
    ** cpd_for_io: index 1423
    ** cleaning index 1423 at b7d1f000

 - here we unmap and flush:
    ** Unmapped index 1423 at b7d1f000
    ** Removing index 1423 from page cache

 - here we remap to check:
    ** do_no_page: mapping index 1423 at b7d1f000 (read)
    ** Unmapped index 1423 at b7d1f000

 - and finally, here I remove the file after the run:
    ** Removing index 1423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 1423 dirty" event
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the
   logs, because I can see the dirty events of the other chunk writes
   around it, so I know exactly where that fourth write happens. And
   indeed, it _shouldn't_ get a dirty event, because the page is still
   dirty from the write of chunk #3 to that page, which _did_ get a dirty
   event.

   I can see that, because the testing app writes the log of the pages it
   writes, and this is the log around the fourth and final write:

    ...
    Writing chunk 5338/15800 (60%) (0076eb48)    PFN: 76e/76f
    Writing chunk 960/15800 (60%) (00156300)     PFN: 156
    Writing chunk 14465/15800 (60%) (01423fb4)   <
    Writing chunk 8594/15800 (60%) (00bf74a8)    PFN: bf7
    Writing chunk 556/15800 (60%) (000c62f0)     PFN: c6
    Writing chunk 15190/15800 (60%) (01526678)   PFN: 1526
    ...

   and I can match this up with the full log from the kernel, which looks
   like this:

    Setting page 076e dirty
    Setting page 076f dirty
    Setting page 0156 dirty
    Setting page 00c6 dirty
    Setting page 1526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423,
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that
   fourth write. Quite a long while after, in fact. So all of this looks
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
   the SAME dirty bit as write 4 did (which didn't make it out to disk!).
   The event that clears the dirty bit that write 3 did happens AFTER
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In
particular, I don't think it's the dirty page bits that are broken (I
_see_ that the PageDirty bit was set after write 4 was done to memory in
the kernel traces).

So I think that a real writeback just doesn't happen, because somebody has
marked the buffer heads clean _after_ it started IO on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It
even has a comment about its crapola behaviour:

	/*
	 * Must try to add the page before marking the buffer clean or
	 * the confused fail path above (OOM) will be very confused when
	 * it finds all bh marked clean (i.e. it will not write anything)
	 */

however, I don't think that particular thing explains it, because I don't
think we use that function for the cases I'm looking at
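The "76 bytes" figure in the corruption report above follows directly from the chunk's start offset. A minimal sketch of the arithmetic, assuming 4 kB pages; the helper name and the write length passed in are hypothetical, since the chunk sizes used by test.c are not given in the message.

```c
#include <stdint.h>

#define SKETCH_PAGE_BYTES 4096u  /* assumption: 4 kB pages */

/*
 * Number of bytes of a write starting at byte_offset that land on the page
 * containing that offset, capped at the length of the write.
 */
static inline uint32_t sketch_bytes_on_first_page(uint32_t byte_offset,
						  uint32_t write_len)
{
	uint32_t room = SKETCH_PAGE_BYTES - (byte_offset % SKETCH_PAGE_BYTES);

	return write_len < room ? write_len : room;
}
```

For chunk 14465 at offset 0x01423fb4 the room left on page 0x1423 is 0x1000 - 0xfb4 = 0x4c = 76 bytes, so any chunk longer than that has exactly its first 76 bytes (0-75) on the page, the range reported as lost. The same calculation for chunk 14462 at 0x01422e98 leaves 360 bytes on page 0x1422, which is consistent with that chunk being marked as overflowing into page 1423.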
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Linus Torvalds wrote:
> >
> > What we need now is actually looking at the source code, and people who
> > understand the VM, I'm afraid. I'm gathering traces now that I have a good
> > test-case. I'll post my trace tools once I've tested that they work, in
> > case others want to help.
>
> Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
> enough to want to trawl through them. It's a bit painful, since we're
> talking thousands of pages to trigger this problem.
>
> Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
> ARM, but is used for other things on ia64, powerpc and sparc64. But here's
> the patch in case anybody cares.

PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to
hitting userspace, in the same way that sparc64 uses it. So ARM systems
should not have this patch applied.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> What we need now is actually looking at the source code, and people who
> understand the VM, I'm afraid. I'm gathering traces now that I have a good
> test-case. I'll post my trace tools once I've tested that they work, in
> case others want to help.

Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
enough to want to trawl through them. It's a bit painful, since we're
talking thousands of pages to trigger this problem.

Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
ARM, but is used for other things on ia64, powerpc and sparc64. But here's
the patch in case anybody cares.

It wants a _big_ kernel buffer to capture all the crud into (which is why
I made the thing accept a bigger log buffer), and quite frankly, I'm not
at all sure that all the locking is ok (ie I could imagine that the
dcache-locking thing there in "is_interesting()" could deadlock, what do I
know..)

But I've captured some real data with this, which I'll describe
separately.

		Linus

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page)	set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)		test_bit(PG_arch_1, &(page)->flags)
 
 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+if (PageInteresting(page))
+	printk("Unmapped index %08x at %08x\n", page->index, addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
Re: 2.6.19 file content corruption on ext3
On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote:
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.

which does tlb flushes *all the time* so that even rules out (well almost)
a stale tlb somewhere...
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > me up), and that seems to show the corruption going way way back (ie going
> > > back to Linux-2.6.5 at least, according to one tester).
> >
> > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > (or older)?
>
> Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> have the page throttling patches in it, those were written this summer. So
> it would either have to be Fedora carrying around another patch that just
> happens to result in the same corruption for _years_, or it's the same
> bug.

The only notable VM patch in Fedora kernels of that vintage that I recall
was Ingo's 4g/4g thing.

	Dave

-- 
http://www.codemonkey.org.uk
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going
> > back to Linux-2.6.5 at least, according to one tester).
>
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?

Well, that was a really _old_ fedora kernel. I guarantee you it didn't
have the page throttling patches in it, those were written this summer. So
it would either have to be Fedora carrying around another patch that just
happens to result in the same corruption for _years_, or it's the same
bug.

I bet it's the same bug, and it's been around for ages.

		Linus
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at
> least according to my own testing, and that of several reporters that back
> me up), and that seems to show the corruption going way way back (ie going
> back to Linux-2.6.5 at least, according to one tester).

That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Marc Haber wrote:
>
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is like 26M in swap, is under very moderate load.
>
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?

Not right now.

And I have a test-program that shows the corruption _much_ easier (at
least according to my own testing, and that of several reporters that back
me up), and that seems to show the corruption going way way back (ie going
back to Linux-2.6.5 at least, according to one tester).

So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new bug.

What we need now is actually looking at the source code, and people who
understand the VM, I'm afraid. I'm gathering traces now that I have a good
test-case.

I'll post my trace tools once I've tested that they work, in case others
want to help.

(And hey, you don't have to be a VM expert to help: this could be a
learning experience. However, I'll warn you: this is _the_ most grotty
part of the whole kernel. It's not even ugly, it's just damn hard and
complex).

		Linus
Re: 2.6.19 file content corruption on ext3
On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures. I suspect we're looking in the wrong place.
>
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
>
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.

After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is like 26M in swap, is under very moderate load.

I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mail address in header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 28 Dec 2006, Martin Schwidefsky wrote:
>
> For s390 there are two aspects to consider:
> 1) the pte values are 100% software controlled.

That's fine. In that situation, you shouldn't need any atomic ops at all,
I think all our sw page-table operations are already done under the pte
lock.

The reason x86 needs to be careful is exactly the fact that the hardware
will obviously do a lot on its own, and the hardware is _not_ going to
honor our page table locking ;)

In an all-sw situation, a lot of this should be easier. S390 has _other_
things that are inconvenient (the strange "dirty bit is not in the page
tables" thing that makes it look different from everybody else), but hey,
it's a balance..

So for s390, ptep_exchange() in my example should be able to be a simple
"load old value and store new one", assuming everybody honors the pte lock
(and they _should_).

		Linus
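[A minimal userspace sketch of the "load old value and store new one" idea
described above, assuming every access to the slot goes through the same
lock. This is not kernel code: `model_pte_t`, `struct model_pte_slot` and
both functions are hypothetical names, and a pthread mutex merely stands in
for the pte lock.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a software-controlled pte value. */
typedef uint64_t model_pte_t;

/* One pte slot plus the lock that every software access must take. */
struct model_pte_slot {
	pthread_mutex_t lock;	/* plays the role of the pte lock */
	model_pte_t val;
};

/*
 * Exchange with no atomic instruction at all: a plain load followed by a
 * plain store. This is only safe because the caller holds slot->lock and
 * nothing (no hardware, no other code path) updates val outside the lock.
 */
model_pte_t model_ptep_exchange_locked(struct model_pte_slot *slot,
				       model_pte_t new_val)
{
	model_pte_t old = slot->val;	/* load old value */

	slot->val = new_val;		/* store new one */
	return old;
}

/* Convenience wrapper that takes and drops the lock around the exchange. */
model_pte_t model_ptep_exchange(struct model_pte_slot *slot,
				model_pte_t new_val)
{
	model_pte_t old;

	pthread_mutex_lock(&slot->lock);
	old = model_ptep_exchange_locked(slot, new_val);
	pthread_mutex_unlock(&slot->lock);
	return old;
}
```

[The point of the sketch is the contrast with x86, where the hardware can
set bits in the pte behind the lock's back, so a plain load/store pair
could lose an update there.]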
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 2006-12-21 at 12:01 -0800, Linus Torvalds wrote:
> What do you guys think? Does something like this work out for S/390 too? I
> tried to make that "ptep_flush_dirty()" concept work for architectures
> that hide the dirty bit somewhere else too, but..

For s390 there are two aspects to consider:

1) the pte values are 100% software controlled. They only change because a
cpu stored a value to it or issued one of the specialized instructions
(csp, ipte and idte). The ptep_flush_dirty would be a nop for s390.

2) ptep_exchange is a bit dangerous. For s390 we need a lock that protects
the software controlled updates of the ptes. The reason is the ipte
instruction. It is implemented by the machine microcode in a non-atomic
way in regard to the memory. It reads the byte of the pte that contains
the invalid bit, flushes the tlb entries for it and then writes back the
byte with the invalid bit set. The microcode makes sure that this pte
cannot be used to form a new tlb entry on any cpu while the ipte is in
progress. That means compare-and-swap semantics on ptes won't work
together with the ipte optimization. As long as there is the pte lock that
protects all software accesses to the pte we are fine. But if any code
expects that ptep_exchange does something like an xchg, things break.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I do get this error on reiserfs ( old one, didn't try on reiser4 ).
> Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
> debian bts.

I've had reports of corrupted data on earlier kernel releases with
reiserfs3, which were fixed by upgrading to reiserfs4.

Jari Sundell
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote: > What would also actually be interesting is whether somebody can reproduce > this on Reiserfs, for example. I _think_ all the reports I've seen are on > ext2 or ext3, and if this is somehow writeback-related, it could be some > bug that is just shared between the two by virtue of them still having a > lot of stuff in common. > > Linus I do get this error on reiserfs ( old one, didn't try on reiser4 ). Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the debian bts. flo attenberger --- Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64 GNU/Linux # # Automatically generated make config: don't edit # Linux kernel version: 2.6.19 # Thu Dec 21 10:45:05 2006 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set 
CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y # # Block layer # CONFIG_BLOCK=y # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=m CONFIG_IOSCHED_DEADLINE=m CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=m CONFIG_X86_CPUID=m CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y # CONFIG_SMP is not set # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y # CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_RESOURCES_64BIT=y CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y CONFIG_IOMMU=y # CONFIG_CALGARY_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_X86_MCE=y # CONFIG_X86_MCE_INTEL is not set CONFIG_X86_MCE_AMD=y CONFIG_KEXEC=y # CONFIG_CRASH_DUMP is not set CONFIG_PHYSICAL_START=0x20 CONFIG_SECCOMP=y # 
CONFIG_CC_STACKPROTECTOR is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 CONFIG_REORDER=y CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y # # Power management options # CONFIG_PM=y CONFIG_PM_LEGACY=y # CONFIG_PM_DEBUG is not set CONFIG_PM_SYSFS_DEPRECATED=y # CONFIG_SOFTWARE_SUSPEND is not set # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set CONFIG_ACPI_AC=m # CONFIG_ACPI_BATTERY is not set CONFIG_ACPI_BUTTON=m CONFIG_ACPI_VIDEO=m CONFIG_ACPI_HOTKEY=m CONFIG_ACPI_FAN=m # CONFIG_ACPI_DOCK is not set CONFIG_ACPI_PROCESSOR=m CONFIG_ACPI_THERMAL=m # CONFIG_ACPI_ASUS is not set # CONFIG_ACPI_IBM is not set # CONFIG_ACPI_TOSHIBA is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_P
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>  - It never uses mprotect on the shared mappings, but it _does_ do:
>
>	"mincore()" - but the return values don't much matter (it's used
>	    as a heuristic on which parts to hash, apparently)
>
>	    I double- and triple-checked this one, because I did make
>	    changes to "mincore()", but those didn't go into the affected
>	    kernels anyway (ie they are not in plain 2.6.19, nor in
>	    2.6.18.3 either)

Correct, mincore is only used to check if it should delay the hash checking.

>	"madvise(MADV_WILLNEED)"
>
>	"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
>
>	"munmap()" of course
>
>  - it never seems to mix mmap() and write() - it does _only_ mmap.
>
>  - it seems to mmap/munmap the shared files in nice 64-page chunks, all
>    64-page aligned in the file (ie it does NOT create one big mapping, it
>    has some kind of LRU of these 64-page chunks).
>
>    The only exception being the last chunk, which it maps byte-accurate to
>    the size.

The length of the chunks is only page aligned on single file torrents,
not so on multi-file torrents. I've attached a patch for rtorrent that
will extend the length to the page boundary.

>  - I haven't checked whether it only ever has the same chunk mapped once
>    at a time.

This should be the case, but two mapped chunks may share a page,
sometimes with different r/w permissions.

Jari Sundell

extend_mapping.diff
Description: Binary data
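[The attached extend_mapping.diff is not reproduced in this archive, so the
following is only a sketch of the arithmetic that "extend the length to the
page boundary" implies, not the actual rtorrent patch. The function name
`round_up_to_page` is hypothetical, and the page size is assumed to be a
power of two.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round len up to the next multiple of page_size.
 * Assumption: page_size is a non-zero power of two (true for the value
 * returned by sysconf(_SC_PAGESIZE) on Linux), which lets us use a mask
 * instead of a division.
 */
size_t round_up_to_page(size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (len + page_size - 1) & ~(page_size - 1);
}
```

[With 4096-byte pages, a 5000-byte tail chunk would be mapped with a length
of 8192, so two adjacent chunks no longer split a page at an arbitrary
byte offset within the mapping length.]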
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, 26 Dec 2006, Nick Piggin wrote:
> Linus Torvalds wrote:
> >
> > Ok, so how about this diff.
> >
> > I'm actually feeling good about this one. It really looks like
> > "do_no_page()" was simply buggy, and that this explains everything.
>
> Still trying to catch up here, so I'm not going to reply to any old
> stuff and just start at the tip of the thread... Other than to say
> that I really like cancel_page_dirty ;)

Yeah, I think that part is a bit clearer about what's going on now.

> I think your patch is quite right so that's a good catch.

Actually, since people told me it didn't matter, I went back and looked at
_why_ - the thing is, "vma->vm_page_prot" should always be read-only
anyway, except for mappings that don't do dirty accounting at all, so I
think my patch only found cases that are unimportant (ie pages that get
faulted in on filesystems like ramfs that don't do any dirty page
accounting because they're all dirty anyway).

> But I'm not too surprised that it does not help the problem, because I
> don't think we have started shedding any old pte_dirty tests at
> unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
> as such.

True. We should no longer _need_ those dirty bit reclaims at
unmap/reclaim, but we still do them, so you're right, even if we were
buggy in this area, it should only really matter for the dirty page
counting, not for any lost data.

> I was hoping that you've almost narrowed it down to the filesystem
> writeback code, with the last few mails?

I think so, yes. However, I've checked, and "rtorrent" really does seem to
be fairly well-behaved wrt any filesystem activity. It does

 - no threading. It's 100% single-threaded, and doesn't even appear to use
   signals.

 - exactly _one_ "ftruncate()", and it does it at the beginning, for the
   full final size.

   IOW, it's not anything subtle with truncate and dirty page cancel.

 - It never uses mprotect on the shared mappings, but it _does_ do:

	"mincore()" - but the return values don't much matter (it's used
	    as a heuristic on which parts to hash, apparently)

	    I double- and triple-checked this one, because I did make
	    changes to "mincore()", but those didn't go into the affected
	    kernels anyway (ie they are not in plain 2.6.19, nor in
	    2.6.18.3 either)

	"madvise(MADV_WILLNEED)"

	"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)

	"munmap()" of course

 - it never seems to mix mmap() and write() - it does _only_ mmap.

 - it seems to mmap/munmap the shared files in nice 64-page chunks, all
   64-page aligned in the file (ie it does NOT create one big mapping, it
   has some kind of LRU of these 64-page chunks).

   The only exception being the last chunk, which it maps byte-accurate to
   the size.

 - I haven't checked whether it only ever has the same chunk mapped once
   at a time.

Anyway, the _one_ half-way interesting thing is the fact that it doesn't
allocate any backing store at all for the file, and as such the page
writeback needs to create all the underlying buffers on the filesystem.

I really don't see why that would be a problem either, but I could imagine
that if we have some writeback bug where we can end up writing back the
_same_ page concurrently, we'd actually end up racing in the kernel, and
allocating two different backing stores, and then maybe the other one
would effectively "get lost" (and the earlier writeback would win the
race, explaining why we'd end up with zeroes at the end of a block).

Or something. However, all the codepaths _seem_ to test for PG_writeback,
and not even try to start another writeback while the first one is still
active.

What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.

		Linus
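[The access pattern listed above can be sketched as a small userspace
check. This is an assumption-laden illustration, not the test program
mentioned elsewhere in the thread and not rtorrent's code: the names
`pattern_at` and `chunked_mmap_check` are hypothetical, chunks are written
in order rather than through an LRU, and the last chunk is a full 64 pages
rather than byte-accurate.]

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: the byte we expect to find at a given file offset. */
unsigned char pattern_at(off_t off)
{
	return (unsigned char)((off * 31 + 7) & 0xff);
}

/*
 * Fill a sparse file through 64-page shared mappings, then read it back
 * with pread(). Returns the number of mismatching bytes, or -1 if a
 * system call failed. The file at path is created and removed again.
 */
long chunked_mmap_check(const char *path, size_t nchunks)
{
	const size_t chunk = 64 * (size_t)sysconf(_SC_PAGESIZE);
	const off_t total = (off_t)(chunk * nchunks);
	unsigned char buf[4096];
	long bad = 0;
	int fd;

	assert(nchunks > 0);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Exactly one ftruncate() up front: no backing store is allocated. */
	if (ftruncate(fd, total) != 0)
		goto fail;

	for (size_t i = 0; i < nchunks; i++) {
		off_t base = (off_t)(i * chunk);
		unsigned char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
					MAP_SHARED, fd, base);

		if (p == MAP_FAILED)
			goto fail;
		for (size_t j = 0; j < chunk; j++)
			p[j] = pattern_at(base + (off_t)j);
		/* Only mmap writes, then msync(MS_ASYNC) and munmap(). */
		if (msync(p, chunk, MS_ASYNC) != 0) {
			munmap(p, chunk);
			goto fail;
		}
		if (munmap(p, chunk) != 0)
			goto fail;
	}

	/* Verify through the ordinary read path. */
	for (off_t off = 0; off < total; ) {
		ssize_t n = pread(fd, buf, sizeof(buf), off);

		if (n <= 0)
			goto fail;
		for (ssize_t k = 0; k < n; k++)
			if (buf[k] != pattern_at(off + k))
				bad++;
		off += n;
	}

	close(fd);
	unlink(path);
	return bad;

fail:
	close(fd);
	unlink(path);
	return -1;
}
```

[A single pass like this reads back from the page cache, so it is unlikely
to show anything on its own. Reports in the thread involve memory pressure
and long uptimes, so a real reproducer would presumably need to force
writeback and eviction between the write and the check.]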
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, Dec 26, 2006 at 05:51:55PM +, Al Viro wrote:
> On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> > On Sun, 24 Dec 2006, Andrei Popa wrote:
> > >
> > > Hash check on download completion found bad chunks, consider using
> > > "safe_sync".
> >
> > Dang. Did you get any warning messages from the kernel?
> >
> > 		Linus
>
> BTW, rmap.c patch is broken - needs at least

... but that doesn't affect most of the architectures - only sparc64 and
some of powerpc. So it's definitely not enough.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> >
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
>
> Dang. Did you get any warning messages from the kernel?
>
> 		Linus

BTW, rmap.c patch is broken - needs at least

Signed-off-by: Al Viro <[EMAIL PROTECTED]>
---
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..669acb2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
-		set_pte_at(vma, address, pte, entry);
+		set_pte_at(mm, address, pte, entry);
 		lazy_mmu_prot_update(entry);
 		ret = 1;
 	}
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
Linus Torvalds wrote:
> On Sun, 24 Dec 2006, Linus Torvalds wrote:
> > Peter, tell me I'm crazy, but with the new rules, the following condition
> > is a bug:
> >
> >  - shared mapping
> >  - writable
> >  - not already marked dirty in the PTE
>
> Ok, so how about this diff.
>
> I'm actually feeling good about this one. It really looks like
> "do_no_page()" was simply buggy, and that this explains everything.

Still trying to catch up here, so I'm not going to reply to any old
stuff and just start at the tip of the thread... Other than to say
that I really like cancel_page_dirty ;)

I think your patch is quite right so that's a good catch. But I'm not
too surprised that it does not help the problem, because I don't think
we have started shedding any old pte_dirty tests at unmap/reclaim-time,
have we? So the dirty bit isn't going to get lost, as such.

I was hoping that you've almost narrowed it down to the filesystem
writeback code, with the last few mails?

Nick

> Please please please test. Throw all the other patches away (with the
> possible exception of the "update_mmu_cache()" sanity checker, which is
> still interesting in case some _other_ place does this too).
>
> Don't do the "wait_on_page_writeback()" thing, because it changes timings
> and might hide things for the wrong reasons. Just apply this on top of a
> known failing kernel, and test.
>
> 		Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
>  	if (pte_none(*page_table)) {
>  		flush_icache_page(vma, new_page);
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		if (write_access)
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		set_pte_at(mm, address, page_table, entry);
>  		if (anon) {
>  			inc_mm_counter(mm, anon_rss);
>  			lru_cache_add_active(new_page);
>  			page_add_new_anon_rmap(new_page, vma, address);
> +			if (write_access)
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		} else {
>  			inc_mm_counter(mm, file_rss);
>  			page_add_file_rmap(new_page);
> +			entry = pte_wrprotect(entry);
>  			if (write_access) {
>  				dirty_page = new_page;
>  				get_page(dirty_page);
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  			}
>  		}
> +		set_pte_at(mm, address, page_table, entry);
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
>  		page_cache_release(new_page);

-- 
SUSE Labs, Novell Inc.
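[To make the intent of the quoted do_no_page() change easier to follow,
here is a toy userspace model of how the patched code picks the pte bits.
It only mirrors the branch structure of the diff above. The bit values and
the `model_*` names are hypothetical, real pte layouts are
per-architecture, and the struct page dirtying done via dirty_page is left
out.]

```c
#include <assert.h>

/* Toy pte bits, for illustration only. */
#define MODEL_PTE_WRITE 0x1u
#define MODEL_PTE_DIRTY 0x2u

/* Like maybe_mkwrite(): grant write only if the vma permits writing. */
unsigned model_maybe_mkwrite(unsigned entry, int vma_may_write)
{
	return vma_may_write ? (entry | MODEL_PTE_WRITE) : entry;
}

/*
 * Mirror of the patched branch: prot plays the role of vma->vm_page_prot,
 * anon selects the anonymous vs. file-backed path.
 */
unsigned model_no_page_pte(unsigned prot, int anon, int write_access,
			   int vma_may_write)
{
	unsigned entry = prot;			/* mk_pte() */

	if (anon) {
		if (write_access)
			entry = model_maybe_mkwrite(entry | MODEL_PTE_DIRTY,
						    vma_may_write);
	} else {
		entry &= ~MODEL_PTE_WRITE;	/* pte_wrprotect() */
		if (write_access)
			entry = model_maybe_mkwrite(entry | MODEL_PTE_DIRTY,
						    vma_may_write);
	}
	return entry;				/* then set_pte_at() */
}
```

[In this model a file-backed pte never comes out writable but clean, which
is the "shared, writable, not already dirty" combination called a bug at
the top of the message. As the follow-ups in the thread show, closing that
case did not make the reported corruption go away.]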
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-24 11:35]:
> And if this doesn't fix it, I don't know what will..

Sorry, but it still fails (on top of plain 2.6.19).

-- 
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
> Quoting Linus Torvalds <[EMAIL PROTECTED]>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE
>
> because that combination means that the hardware can mark the PTE dirty
> without us even realizing (and thus not marking the "struct page *"
> dirty).

Er. Sorry about bumping in, and I'm not sure I understand all of the
discussion, but this reminded me of an old issue with COW that created
what looks like a vaguely similar data corruption on infiniband. We
solved this for infiniband with MADV_DONTFORK, but I always wondered why
it does not affect other parts of the kernel.

Small reminder from that discussion:

	down mmap sem
	get user pages
	up mmap sem
	page becomes shared, and COW (e.g. fork)
	process writes to first byte of page <- gets a copy

Now we had a problem: the struct page that we got from get user pages does
not point to the correct page in our process.

For example: if at some point we map this page for DMA, and hardware
writes to the last byte of the page -> the process does not see this data.

So for infiniband, what we do is a combination of
- prevent the page from becoming COW while hardware might DMA to this
  page, and
- ask users not to write to the page if hardware might DMA to the same
  page (even if it's using different bytes).

I just wondered - is there some chance something like this could be
happening in the fs code?

HTH,

-- 
MST
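[For readers unfamiliar with the MADV_DONTFORK approach mentioned above,
here is a minimal Linux-only sketch of marking a buffer so that it is not
inherited across fork(). The function name `alloc_dontfork_buffer` is
hypothetical and the sketch does not touch any infiniband API: it only
shows the madvise() call, on an anonymous private mapping.]

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate len bytes of anonymous memory and mark the range with
 * MADV_DONTFORK, so a child created by fork() does not get the range at
 * all. The parent's pages then never become copy-on-write because of
 * fork(), and a page pinned earlier stays the page the parent writes to.
 * Returns NULL on failure.
 */
void *alloc_dontfork_buffer(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (madvise(p, len, MADV_DONTFORK) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

[Note the trade-off: a child that touches this range after fork() gets a
fault instead of a copy, so this suits buffers that only the parent and
the hardware are meant to use.]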
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> Ok, so how about this diff.
>
> I'm actually feeling good about this one. It really looks like
> "do_no_page()" was simply buggy, and that this explains everything.

I tested with just this patch and 2.6.19 and no change. Sorry Linus,
no early Christmas present :-(

Gordon

-- 
Gordon Farquharson
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote:
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> >
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
>
> Dang. Did you get any warning messages from the kernel?

only these:

ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80

but I don't think that has anything to do with it...

> 		Linus
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Andrei Popa wrote:
>
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

Dang. Did you get any warning messages from the kernel?

		Linus
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote:
> On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> >
> > The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> > this patch and 2.6.19.
>
> Yeah, if my guess about do_no_page() is right, _none_ of the previous
> patches should have ANY effect what-so-ever. In fact, I'd say that even
> the "ext3 works in writeback mode" thing that Andrei reports is probably a
> total fluke brought on by timing changes rather than anything else.
>
> So please try the latest patch instead (on top of anything that shows
> corruption reliably - the patch should be _totally_ independent of all the
> other issues, and I think it will apply cleanly on top of 2.6.18.3 and
> 2.6.19 too, so anything that shows corruption is a fine target - but try
> to choose something that has been the "best" at corrupting things for you,
> to make the testing as good as possible).
>
> Patch included here again (although I think you were cc'd on my previous
> email too, so you should already have it, and our emails just crossed)
>
> And if this doesn't fix it, I don't know what will..

With latest git and patches:

http://lkml.org/lkml/diff/2006/12/24/56/1
http://lkml.org/lkml/diff/2006/12/24/61/1

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 		Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
>  	if (pte_none(*page_table)) {
>  		flush_icache_page(vma, new_page);
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		if (write_access)
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		set_pte_at(mm, address, page_table, entry);
>  		if (anon) {
>  			inc_mm_counter(mm, anon_rss);
>  			lru_cache_add_active(new_page);
>  			page_add_new_anon_rmap(new_page, vma, address);
> +			if (write_access)
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		} else {
>  			inc_mm_counter(mm, file_rss);
>  			page_add_file_rmap(new_page);
> +			entry = pte_wrprotect(entry);
>  			if (write_access) {
>  				dirty_page = new_page;
>  				get_page(dirty_page);
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  			}
>  		}
> +		set_pte_at(mm, address, page_table, entry);
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
>  		page_cache_release(new_page);
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Gordon Farquharson wrote:
>
> The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> this patch and 2.6.19.

Yeah, if my guess about do_no_page() is right, _none_ of the previous
patches should have ANY effect what-so-ever. In fact, I'd say that even
the "ext3 works in writeback mode" thing that Andrei reports is probably a
total fluke brought on by timing changes rather than anything else.

So please try the latest patch instead (on top of anything that shows
corruption reliably - the patch should be _totally_ independent of all the
other issues, and I think it will apply cleanly on top of 2.6.18.3 and
2.6.19 too, so anything that shows corruption is a fine target - but try
to choose something that has been the "best" at corrupting things for you,
to make the testing as good as possible).

Patch included here again (although I think you were cc'd on my previous
email too, so you should already have it, and our emails just crossed)

And if this doesn't fix it, I don't know what will..

		Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
 	if (pte_none(*page_table)) {
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			if (write_access)
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			entry = pte_wrprotect(entry);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			}
 		}
+		set_pte_at(mm, address, page_table, entry);
 	} else {
 		/* One of our sibling threads was faster, back out. */
 		page_cache_release(new_page);
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: How about this particularly stupid diff? (please test with something that _would_ cause corruption normally). It is _entirely_ untested, but what it tries to do is to simply serialize any writeback in progress with any process that tries to re-map a shared page into its address space and dirty it. I haven't tested it, and maybe it misses some case, but it looks likea good way to try to avoid races with marking pages dirty and the writeback phase .. The apt cache files (/var/cache/apt/*.bin) still get corrupted with this patch and 2.6.19. Gordon diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/buffer.c2006-12-21 01:16:31.0 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. 
-*/ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.0 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.0 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c --- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/mm/memory.c2006-12-24 11:04:03.0 -0700 @@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct * if (!pte_same(*page_table, orig_pte)) goto unlock; } + wait_on_page_writeback(old_page); dirty_page = old_page; get_page(dirty_page); reuse = 1; @@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if 
(!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1865,6 +1893,7 @@ do_expand: goto ou
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE

Ok, so how about this diff.

I'm actually feeling good about this one. It really looks like
"do_no_page()" was simply buggy, and that this explains everything.

Please please please test. Throw all the other patches away (with the
possible exception of the "update_mmu_cache()" sanity checker, which is
still interesting in case some _other_ place does this too).

Don't do the "wait_on_page_writeback()" thing, because it changes timings
and might hide things for the wrong reasons. Just apply this on top of a
known failing kernel, and test.

		Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
 	if (pte_none(*page_table)) {
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			if (write_access)
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			entry = pte_wrprotect(entry);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			}
 		}
+		set_pte_at(mm, address, page_table, entry);
 	} else {
 		/* One of our sibling threads was faster, back out. */
 		page_cache_release(new_page);
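[Editorial illustration, not part of the original message.] The reordering in the patch above can be sketched as a small userspace toy model. Everything here is an assumption made for illustration: the bit values, `struct toy_vma` and the `toy_*` helpers are hypothetical, and the model assumes the mapping's page protection already carries the write bit, which is the hypothesis this message is testing rather than an established fact about the kernel. Other messages in the thread report that corruption persisted with this patch, so read this only as a picture of the reasoning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PTE bits for this toy model; not the real x86 layout. */
enum { TOY_PTE_WRITE = 1u << 0, TOY_PTE_DIRTY = 1u << 1 };

struct toy_vma {
    bool shared;     /* stands in for VM_SHARED */
    bool may_write;  /* stands in for VM_WRITE */
    unsigned prot;   /* stands in for vm_page_prot (assumed writable here) */
};

/* Same idea as maybe_mkwrite(): grant write only if the VMA allows it. */
static unsigned toy_maybe_mkwrite(unsigned pte, const struct toy_vma *vma)
{
    return vma->may_write ? (pte | TOY_PTE_WRITE) : pte;
}

/* Pre-patch ordering for a file-backed page: protection bits used as-is. */
static unsigned toy_fault_old(const struct toy_vma *vma, bool write_access)
{
    unsigned entry = vma->prot;

    if (write_access)
        entry = toy_maybe_mkwrite(entry | TOY_PTE_DIRTY, vma);
    return entry;
}

/* Post-patch ordering: write-protect first, re-enable only on a write fault. */
static unsigned toy_fault_new(const struct toy_vma *vma, bool write_access)
{
    unsigned entry = vma->prot & ~TOY_PTE_WRITE;

    if (write_access)
        entry = toy_maybe_mkwrite(entry | TOY_PTE_DIRTY, vma);
    return entry;
}

/* The combination the message calls a bug: shared, writable, not dirty. */
static bool toy_is_bad_shared(const struct toy_vma *vma, unsigned pte)
{
    return vma->shared && (pte & TOY_PTE_WRITE) && !(pte & TOY_PTE_DIRTY);
}
```

In this model a read fault through the old ordering leaves a writable-but-clean entry, while the new ordering only ever produces entries that are either read-only or both writable and dirty.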
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).

Actually, here's an even more stupid diff, which actually to some degree
seems to capture the real problem better.

Peter, tell me I'm crazy, but with the new rules, the following condition
is a bug:

 - shared mapping
 - writable
 - not already marked dirty in the PTE

because that combination means that the hardware can mark the PTE dirty
without us even realizing (and thus not marking the "struct page *"
dirty).

(The above is actually a valid situation for IO mappings, but not for
"real" mappings. And IO mappings should never take page faults, I think).

So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I
used my Mac Mini for testing rather than my main machine - but the x86-64
version should be pretty much identical)..

And you know what, Peter? It triggers for me. I get

	WARNING at mm/memory.c:2274 do_no_page()
	 [] show_trace_log_lvl+0x1a/0x2f
	 [] show_trace+0x12/0x14
	 [] dump_stack+0x16/0x18
	 [] __handle_mm_fault+0x38d/0x919
	 [] do_page_fault+0x1ff/0x507
	 [] error_code+0x7c/0x84

which seems to say that do_no_page() can be used to insert shared and
non-dirty, but still writable, pages.

But maybe my patch is just bogus, and I didn't think it through.

Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates
good boys and girls, and we all want tons of loot. So please be good, and
waste some time looking at this and tell me why I'm either wrong, or
there's a real smoking gun here.. ;)

		Linus

---
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..1389bb7 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -494,7 +494,13 @@ do { \
  * The i386 doesn't have any external MMU info: the kernel page
  * tables contain all the necessary information.
  */
-#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte))
+#define update_mmu_cache(vma,address,pte) do {		\
+	static int __cnt;				\
+	WARN_ON(((vma)->vm_flags & VM_SHARED)		\
+		&& bad_shared_pte(pte)			\
+		&& ++__cnt < 5);			\
+} while (0)
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM
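[Editorial illustration, not part of the original message.] The sanity check in the macro above combines three conditions with `&&`, so the counter is only bumped when a suspicious PTE has actually been seen, and the warning fires for the first four hits only. A minimal userspace sketch of that logic follows; the flag values and the `toy_*` names are hypothetical stand-ins, and the counter is passed by pointer instead of being a per-call-site `static int`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only. */
enum { TOY_PTE_WRITE = 1u << 0, TOY_PTE_DIRTY = 1u << 1 };
enum { TOY_VM_SHARED = 1u << 0 };

/* Same predicate as bad_shared_pte() in the patch, on toy bits. */
static bool toy_bad_shared_pte(unsigned pte)
{
    return (pte & TOY_PTE_WRITE) && !(pte & TOY_PTE_DIRTY);
}

/*
 * Returns true when the macro would have warned. *cnt plays the role of
 * "static int __cnt"; because && short-circuits, it is incremented only
 * when the mapping is shared and the PTE is writable but clean.
 */
static bool toy_check(unsigned vm_flags, unsigned pte, int *cnt)
{
    return (vm_flags & TOY_VM_SHARED)
        && toy_bad_shared_pte(pte)
        && ++*cnt < 5;
}
```

Calling `toy_check()` ten times with a shared, writable, clean entry reports a warning four times and leaves the counter at ten; private mappings and dirty entries never touch the counter.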
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 09:16:06 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Sun, 24 Dec 2006, Andrei Popa wrote:
>
> > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > > Andrei Popa <[EMAIL PROTECTED]> wrote:
> > > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > >
> > > > I don't have corruption. I tested twice.
> > >
> > > This is a surprising result.  Can you please retest ext3
> > > data=writeback,nobh?
> >
> > Yes, no corruption. Also tested only with data=writeback and had no
> > corruption.
>
> Ok, so it would seem to be writeback related _somehow_. However, most of
> the differences (I _thought_) in ext3 actually show up only if you have
> *both* "nobh" and "data=writeback", and as far as I can tell, just a
> simple "data=writeback" should still use the bog-standard
> "block_write_full_page()".
>
> Andrew?
>
> Although as far as I can see, then ext2 should work as-is too (since it
> too also just uses "block_write_full_page()" without anything fancy).

ext2 uses the multipage-bio assembly code for writeback whereas ext3
doesn't.  But ext3 doesn't use that code in data=ordered mode, of course.

Still, this:

--- a/fs/ext2/inode.c~a
+++ a/fs/ext2/inode.c
@@ -693,7 +693,7 @@ const struct address_space_operations ex
 	.commit_write		= generic_commit_write,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
-	.writepages		= ext2_writepages,
+//	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 };
 
@@ -711,7 +711,7 @@ const struct address_space_operations ex
 	.commit_write		= nobh_commit_write,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
-	.writepages		= ext2_writepages,
+//	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 };
 
_

will switch it off for ext2.

> Strange.
>
> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).
>
> It is _entirely_ untested, but what it tries to do is to simply serialize
> any writeback in progress with any process that tries to re-map a shared
> page into its address space and dirty it. I haven't tested it, and maybe
> it misses some case, but it looks like a good way to try to avoid races
> with marking pages dirty and the writeback phase ..
>
> 		Linus
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..64ed10b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			if (!pte_same(*page_table, orig_pte))
>  				goto unlock;
>  		}
> +		wait_on_page_writeback(old_page);
>  		dirty_page = old_page;
>  		get_page(dirty_page);
>  		reuse = 1;
> @@ -2215,6 +2216,7 @@ retry:
>  				page_cache_release(new_page);
>  				return VM_FAULT_SIGBUS;
>  			}
> +			wait_on_page_writeback(new_page);
>  		}
>  	}

yup.  Also, we could perhaps lock the target page during pagefaults..
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Andrei Popa wrote:
> On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > Andrei Popa <[EMAIL PROTECTED]> wrote:
> > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > >
> > > I don't have corruption. I tested twice.
> >
> > This is a surprising result.  Can you please retest ext3 data=writeback,nobh?
>
> Yes, no corruption. Also tested only with data=writeback and had no
> corruption.

Ok, so it would seem to be writeback related _somehow_. However, most of
the differences (I _thought_) in ext3 actually show up only if you have
*both* "nobh" and "data=writeback", and as far as I can tell, just a
simple "data=writeback" should still use the bog-standard
"block_write_full_page()".

Andrew?

Although as far as I can see, then ext2 should work as-is too (since it
too also just uses "block_write_full_page()" without anything fancy).

Strange.

How about this particularly stupid diff? (please test with something that
_would_ cause corruption normally).

It is _entirely_ untested, but what it tries to do is to simply serialize
any writeback in progress with any process that tries to re-map a shared
page into its address space and dirty it. I haven't tested it, and maybe
it misses some case, but it looks like a good way to try to avoid races
with marking pages dirty and the writeback phase ..

		Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..64ed10b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
+		wait_on_page_writeback(old_page);
 		dirty_page = old_page;
 		get_page(dirty_page);
 		reuse = 1;
@@ -2215,6 +2216,7 @@ retry:
 				page_cache_release(new_page);
 				return VM_FAULT_SIGBUS;
 			}
+			wait_on_page_writeback(new_page);
 		}
 	}
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> On Sun, 24 Dec 2006 14:14:38 +0200
> Andrei Popa <[EMAIL PROTECTED]> wrote:
>
> > > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > >
> > >   grub.conf:  rootfstype=ext2 rootflags=nobh
> > >   /etc/fstab: ext2 nobh
> >
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext2 (rw,noatime,nobh)
> >
> > I have corruption.
> >
> > >
> > > - mount the fs with ext3 data=writeback, nobh
> > >
> > >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
> > >   /etc/fstab: ext2 data=writeback,nobh
> >
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> >
> > ierdnac ~ # dmesg|grep EXT3
> > EXT3-fs: mounted filesystem with writeback data mode.
> > EXT3 FS on sda7, internal journal
> >
> > I don't have corruption. I tested twice.
>
> This is a surprising result.  Can you please retest ext3 data=writeback,nobh?

Yes, no corruption. Also tested only with data=writeback and had no
corruption.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Andrew Morton <[EMAIL PROTECTED]> [2006-12-24 00:57]:
>   /etc/fstab: ext2 nobh
>   /etc/fstab: ext3 data=writeback,nobh

It seems that busybox mount ignores the nobh option but both ext2 and
ext3 data=writeback work for me.  This is with plain 2.6.19 which
normally always fails.
-- 
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 14:14:38 +0200
Andrei Popa <[EMAIL PROTECTED]> wrote:

> > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> >
> >   grub.conf:  rootfstype=ext2 rootflags=nobh
> >   /etc/fstab: ext2 nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
>
> I have corruption.
>
> >
> > - mount the fs with ext3 data=writeback, nobh
> >
> >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
> >   /etc/fstab: ext2 data=writeback,nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
>
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
>
> I don't have corruption. I tested twice.

This is a surprising result.  Can you please retest ext3 data=writeback,nobh?
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 14:26:01 +0200
Andrei Popa <[EMAIL PROTECTED]> wrote:

> I also tested with ext3 ordered, nobh and I have file corruption...

ordered+nobh isn't a possible combination.  The filesystem probably ignored
nobh.  nobh mode only makes sense with data=writeback.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote: > On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > > I now _suspect_ that we're talking about something like > > > > > > - we started a writeout. The IO is still pending, and the page was > > >marked clean and is now in the "writeback" phase. > > > - a write happens to the page, and the page gets marked dirty again. > > >Marking the page dirty also marks all the _buffers_ in the page dirty, > > >but they were actually already dirty, because the IO hasn't completed > > >yet. > > > - the IO from the _previous_ write completes, and marks the buffers > > > clean > > >again. > > > > Some things for the testers to try, please: > > > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > /etc/fstab: ext2 nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > I have corruption. > > > > > - mount the fs with ext3 data=writeback, nobh > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > > works) > > /etc/fstab: ext2 data=writeback,nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > ierdnac ~ # dmesg|grep EXT3 > EXT3-fs: mounted filesystem with writeback data mode. > EXT3 FS on sda7, internal journal > > I don't have corruption. I tested twice. > I also tested with ext3 ordered, nobh and I have file corruption... > > > > if that still fails we can rule out buffer_head funnies. > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > I now _suspect_ that we're talking about something like > > > > - we started a writeout. The IO is still pending, and the page was > >marked clean and is now in the "writeback" phase. > > - a write happens to the page, and the page gets marked dirty again. > >Marking the page dirty also marks all the _buffers_ in the page dirty, > >but they were actually already dirty, because the IO hasn't completed > >yet. > > - the IO from the _previous_ write completes, and marks the buffers clean > >again. > > Some things for the testers to try, please: > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > grub.conf: rootfstype=ext2 rootflags=nobh > /etc/fstab: ext2 nobh ierdnac ~ # mount /dev/sda7 on / type ext2 (rw,noatime,nobh) I have corruption. > > - mount the fs with ext3 data=writeback, nobh > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > works) > /etc/fstab: ext2 data=writeback,nobh ierdnac ~ # mount /dev/sda7 on / type ext3 (rw,noatime,nobh) ierdnac ~ # dmesg|grep EXT3 EXT3-fs: mounted filesystem with writeback data mode. EXT3 FS on sda7, internal journal I don't have corruption. I tested twice. > > if that still fails we can rule out buffer_head funnies. > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Andrew Morton wrote:
>
> > I now _suspect_ that we're talking about something like
> >
> >  - we started a writeout. The IO is still pending, and the page was
> >    marked clean and is now in the "writeback" phase.
> >  - a write happens to the page, and the page gets marked dirty again.
> >    Marking the page dirty also marks all the _buffers_ in the page dirty,
> >    but they were actually already dirty, because the IO hasn't completed
> >    yet.
> >  - the IO from the _previous_ write completes, and marks the buffers clean
> >    again.
>
> Some things for the testers to try, please:
>
> - mount the fs with ext2 with the no-buffer-head option.  That means either:

[ snip snip ]

This is definitely worth testing, but the exact scenario I outlined is
probably not the thing that happens. It was really meant to be more of an
example of the _kind_ of situation I think we might have.

That would explain why we didn't see this before: we simply didn't mark
pages clean all that aggressively, and an app like rtorrent would normally
have caused its flushes to happen _synchronously_ by using msync() (even
if the IO itself was done asynchronously, all the dirty bit stuff would be
synchronous wrt any rtorrent behaviour).

And the things that /did/ use to clean pages asynchronously (VM scanning)
would always actually look at the "young" bit (aka "accessed") and not
even touch the dirty bit if an application had accessed the page recently,
so that basically avoided any likely races, because we'd touch the dirty
bit ONLY if the page was "cold".

So this is why I'm saying that it might be an old bug, and it would be
just the new pattern of handling dirty bits that triggers it.

But avoiding buffer heads and testing that part is worth doing. Just to
remove one thing from the equation.

		Linus
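[Editorial illustration, not part of the original message.] As the message above stresses, the quoted three-step sequence is probably not what the kernel really does; it is an example of the _kind_ of lost-update race being suspected. The toy state machine below only makes that kind of race concrete. The struct, the version counters and all `toy_*` functions are hypothetical and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page: two dirty flags plus content "versions". */
struct toy_page {
    bool page_dirty;       /* page-level dirty flag */
    bool buffer_dirty;     /* buffer-level dirty flag */
    int  mem_version;      /* contents currently in memory */
    int  inflight_version; /* contents captured by the pending IO */
    int  disk_version;     /* contents that reached the disk */
};

/* An application store: new contents, everything marked dirty. */
static void toy_app_write(struct toy_page *p)
{
    p->mem_version++;
    p->page_dirty = true;
    p->buffer_dirty = true;
}

/* Writeout starts: page marked clean, IO captures the current contents. */
static void toy_start_writeout(struct toy_page *p)
{
    p->page_dirty = false;
    p->inflight_version = p->mem_version;
}

/* The pending IO completes and (in this hypothesis) cleans the buffers. */
static void toy_end_io(struct toy_page *p)
{
    p->disk_version = p->inflight_version;
    p->buffer_dirty = false;
}

/* A later synchronous writeback pass: only dirty buffers get written. */
static void toy_sync_writeback(struct toy_page *p)
{
    if (!p->page_dirty)
        return;
    p->page_dirty = false;
    if (p->buffer_dirty) {
        p->disk_version = p->mem_version;
        p->buffer_dirty = false;
    }
}

/* True when memory is newer than disk but nothing is left marked dirty. */
static bool toy_update_lost(const struct toy_page *p)
{
    return p->disk_version != p->mem_version
        && !p->page_dirty && !p->buffer_dirty;
}
```

Running write, start-writeout, write, end-IO, writeback in that order leaves the second write stranded in memory with no dirty flag set, while the same operations without the overlapping write end with disk and memory in agreement.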
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> I now _suspect_ that we're talking about something like
>
>  - we started a writeout. The IO is still pending, and the page was
>    marked clean and is now in the "writeback" phase.
>  - a write happens to the page, and the page gets marked dirty again.
>    Marking the page dirty also marks all the _buffers_ in the page dirty,
>    but they were actually already dirty, because the IO hasn't completed
>    yet.
>  - the IO from the _previous_ write completes, and marks the buffers clean
>    again.

Some things for the testers to try, please:

- mount the fs with ext2 with the no-buffer-head option.  That means either:

  grub.conf:  rootfstype=ext2 rootflags=nobh
  /etc/fstab: ext2 nobh

- mount the fs with ext3 data=writeback, nobh

  grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
  /etc/fstab: ext3 data=writeback,nobh

if that still fails we can rule out buffer_head funnies.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Gordon Farquharson wrote:
>
> Is there any way to provide any debugging information that may help
> solve the problem ?

I think we have people working on this. I know I'm trying to even come up
with an idea of what is going on. I don't think we know yet.

> Would it help to know the nature of the corruption e.g. an analysis
> of the corruption in the file ?

I actually think we know that, because Andrei already gave details. The
corruption seems to be basically a few pages that get zeroes at the end
rather than the expected contents. That's consistent with the page being
written out once, but then _not_ getting written out again despite being
dirtied some more.

But if you see any other pattern, please holler, because that would be
interesting.

> BTW, I decided to try Linus's test program [1] on ARM (I don't think
> that anybody had tried it on ARM before).

You get the expected results, and in fact, I'd be very surprised if you
didn't. It's something subtler than that going on.

I now _suspect_ that we're talking about something like

 - we started a writeout. The IO is still pending, and the page was
   marked clean and is now in the "writeback" phase.
 - a write happens to the page, and the page gets marked dirty again.
   Marking the page dirty also marks all the _buffers_ in the page dirty,
   but they were actually already dirty, because the IO hasn't completed
   yet.
 - the IO from the _previous_ write completes, and marks the buffers clean
   again.

And no, that's not actually what is going on. The thing is, we actually
clear the buffer dirty bits when we start the IO, not when we end it, but
I think it is going to be this _kind_ of situation, where we missed
something, and marked it clean too late, and thus cleared a dirty bit.

I don't think it's a page table issue any more, it just doesn't look
likely with the ARM UP corruption. It's also not apparently even on a
cacheline boundary, so it probably is really a dirty bit that got cleared
wrong due to some race with IO.

But right now we're all clueless. I personally suspect it's not even a new
bug: it's probably an old bug that simply didn't matter before.

		Linus
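[Editorial illustration, not part of the original message.] The corruption signature described above ("a few pages that get zeroes at the end rather than the expected contents") can be looked for mechanically when a known-good copy of the file is available. The sketch below works on two in-memory buffers and reports pages whose suspect copy ends in a longer run of zero bytes than the reference. The function names, the page-size parameter and the reporting rule are assumptions chosen for illustration, not something taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the run of zero bytes at the end of buf[0..len). */
static size_t trailing_zeros(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    while (n < len && buf[len - 1 - n] == 0)
        n++;
    return n;
}

/*
 * Walk both buffers page by page. A page is reported when the suspect copy
 * has a longer zero tail than the known-good copy. Up to max_out page
 * indices are stored in out[]; the return value is the total number of
 * reported pages, which may exceed max_out.
 */
static size_t find_zero_tail_pages(const unsigned char *good,
                                   const unsigned char *bad,
                                   size_t len, size_t page_size,
                                   size_t *out, size_t max_out)
{
    size_t found = 0;

    assert(page_size > 0);
    for (size_t off = 0; off < len; off += page_size) {
        /* The last page may be short if len is not page-aligned. */
        size_t n = len - off < page_size ? len - off : page_size;

        if (trailing_zeros(bad + off, n) > trailing_zeros(good + off, n)) {
            if (found < max_out)
                out[found] = off / page_size;
            found++;
        }
    }
    return found;
}
```

To use it on real files one would read or mmap() both copies and pass the system page size (typically obtained from sysconf(_SC_PAGESIZE)); pages that legitimately end in zeroes in the reference are not reported.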
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/22/06, Martin Michlmayr <[EMAIL PROTECTED]> wrote: * Peter Zijlstra <[EMAIL PROTECTED]> [2006-12-22 14:25]: > > and it failed. > Since you are on ARM you might want to try with the page_mkclean_one > cleanup patch too. I've already tried it and it didn't work. I just tried it again together with Linus' patch and the two from Andrew and it still fails. (For reference, the patch is attached.) I can confirm this behaviour with 2.6.19 and the patches mentioned above (cumulative patch for 2.6.19 appended to the end of this email). Is there any way to provide any debugging information that may help solve the problem ? Would it help to know the nature of the corruption e.g. an analysis of the corruption in the file ? I have previously asked apt developers if they wanted to look at the corrupted cache files, but there were no takers then. BTW, I decided to try Linus's test program [1] on ARM (I don't think that anybody had tried it on ARM before). Since we see file corruption with 2.6.18 + [PATCH] mm: tracking shared dirty pages [2], I ran Linus's program on machines with the following setups: 2.6.18 + the following patches mm: tracking shared dirty pages [2] mm: balance dirty pages [3] mm: optimize the new mprotect() code a bit [4] mm: small cleanup of install_page() [5] mm: fixup do_wp_page() [6] mm: msync() cleanup [7] $ ./mm-test | od -x 000 020 040 050 2.6.18 (no mm patches) $ ./mm-test | od -x 000 020 040 050 I don't know if this helps at all. 
Gordon [1] http://lkml.org/lkml/2006/12/19/200 [2] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89 [3] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=edc79b2a46ed854595e40edcf3f8b37f9f14aa3f [4] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1e6098b23bb46e2b488fe9a26f831f867157483 [5] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e88dd6c11c5aef74d8b74a062767add53315533b [6] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee6a6457886a80415db209e87033b63f2b06558c [7] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=204ec841fbea3e5138168edbc3a76d46747cc987 diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/buffer.c2006-12-21 01:16:31.0 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. 
-*/ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.0 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.0 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Fri, 2006-12-22 at 13:32 +0100, Martin Michlmayr wrote:
> * Andrei Popa <[EMAIL PROTECTED]> [2006-12-22 14:24]:
> > With all three patches I have corruption
>
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase? It would be good
> if someone could suggest a torrent which is legal and not too large.

It's a 1.4GB torrent split into 84 rar files and there are many seeders. I download at ~5MB/sec. The torrent is private.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Peter Zijlstra <[EMAIL PROTECTED]> [2006-12-22 14:25]: > > and it failed. > Since you are on ARM you might want to try with the page_mkclean_one > cleanup patch too. I've already tried it and it didn't work. I just tried it again together with Linus' patch and the two from Andrew and it still fails. (For reference, the patch is attached.) -- Martin Michlmayr http://www.cyrius.com/ --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. -* -* Also, during truncate, discard_buffer will have marked all -* the page's buffers clean. We discover that here and clean -* the page also. -*/ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..4f4cd13 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? 
*/0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..350878a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,15 +253,11 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..b3a198c 100644 --- a/mm/page-writeback.c +++ 
b/mm/page-writeback.c @@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (!mapping)
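The check_last_page() debugging hunk in the patch above sums the bytes that lie between the end of the file and the end of its last page, and complains when that sum is non-zero (the tail of the last page is expected to be all zeroes). As a rough userspace sketch of the same arithmetic only: the function name tail_check, the SKETCH_PAGE_SIZE constant and the 4 KiB page size are assumptions for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/*
 * Hypothetical userspace analogue of the check in check_last_page():
 * given the contents of a file's last page and the file size, add up
 * every byte from the in-page offset of EOF to the end of the page.
 * A page-aligned size has no tail, so it returns 0 without looking.
 */
unsigned int tail_check(const unsigned char *page, unsigned long long file_size)
{
	/* Same value as "size & ~PAGE_MASK" in the patch. */
	size_t offset = (size_t)(file_size & (SKETCH_PAGE_SIZE - 1));
	unsigned int check = 0;

	assert(page != NULL);
	if (offset == 0)
		return 0;
	while (offset < SKETCH_PAGE_SIZE)
		check += page[offset++];
	return check;
}
```

A non-zero return corresponds to the case where the kernel patch would print its "BADNESS: truncate check" message; bytes before the EOF offset are ignored.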
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Fri, 22 Dec 2006, Peter Zijlstra wrote:
>
> fix page_mkclean_one()
>
> - add flush_cache_page() for all those virtual indexed cache
>   architectures.

I think the flush_cache_page() should be after we've actually flushed it from the TLB and re-inserted it (this is one reason why I did the "ptep_exchange()" version of this). Otherwise somebody can still write to the page _after_ the cache flush..

> - handle s390.

Yeah, that looks like the proper way to handle that.

That said, it looks like we still see corruption. You may not, but Martin and Andrei still report problems, even with all the patches (including the last one from Andrew that avoids "dirty" going negative under some circumstances, and explains the "slow and/or never completed" case that Gordon and Martin saw).

The good news is that I think the code now is cleaner and more understandable. The bad news is that nothing we've ever tried seems to have fixed the _problem_.

And I don't think it's page_mkclean(). Especially not since the ARM people are seeing this under UP without PREEMPT. In that kind of scenario, the only possible races tend to be from things that actually block: "set_page_dirty()" (which blocks on IO in balancing), memory allocations, and obviously doing actual IO.

And it's not a virtual cache problem, since others see it on x86.

Of course, since it's quite possibly two different issues, maybe the virtual cache flush is required in order to force write-back to memory (which in turn is required for the DMA for the actual write!). So the ARM issue certainly could be due to the flush_cache_page() thing...

Linus
Re: 2.6.19 file content corruption on ext3
On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"? That's definitely out, it was just a test-patch to verify that the pte dirty bits seemed to matter at all (and they do).

Linus
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-22 08:30]:
> Based on the kernel gurus' current knowledge of the problem, would
> you expect the corruption to occur at the same point in a file, or
> is it possible that the corruption could occur at different points
> on successive Debian installer attempts on a UP, non PREEMPT system?

Seems like it can occur anywhere. In fact, some people see apt problems because of filesystem corruption on the NSLU2 after they have already installed Debian. I've only seen this once myself and failed many times to find a reproducible situation.

-- 
Martin Michlmayr
http://www.cyrius.com/
Re: 2.6.19 file content corruption on ext3
On Sat, Dec 16, 2006 at 06:43:10PM +, Martin Michlmayr wrote:
> * Marc Haber <[EMAIL PROTECTED]> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
>
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
>
> You said that you don't see any corruption with 2.6.18. Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?

Since I am no longer seeing the issue after easing the memory load, I doubt that this would make sense.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mail address in header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 22, 2006 at 08:30:06AM -0500, Daniel Drake wrote: > Marc Haber wrote: > >After updating to 2.6.19, Debian's apt control file > >/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under > >six hours. In that situation, "aptitude update" segfaults. When I > >delete the file and have apt recreate it, things are fine again for a > >few hours before the file is broken again and the segfault start over. > >In all cases, umounting the file system and doing an fsck does not > >show issues with the file system. > > Are you using wireless networking of any kind? Since the system in question is a colocated server box, I am pretty sure that there is no wireless networking. > Might be useful if you could post 'dmesg' output so that people can > see the other hardware that you have. I have attached what I could scrape from syslog. Greetings Marc -- - Marc Haber | "I don't trust Computers. They | Mailadresse im Header Mannheim, Germany | lose things."Winona Ryder | Fon: *49 621 72739834 Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835 Dec 18 15:45:01 torres syslogd 1.4.1#17: restart. Dec 18 15:45:01 torres kernel: klogd 1.4.1#17, log source = /proc/kmsg started. Dec 18 15:45:01 torres kernel: Inspecting /boot/System.map-2.6.19.1-zgsrv Dec 18 15:45:01 torres kernel: Loaded 26500 symbols from /boot/System.map-2.6.19.1-zgsrv. Dec 18 15:45:01 torres kernel: Symbols match kernel version 2.6.19. Dec 18 15:45:01 torres kernel: No module symbols loaded - kernel modules not enabled. 
Dec 18 15:45:01 torres kernel: Linux version 2.6.19.1-zgsrv ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 Sun Dec 17 12:44:56 UTC 2006 Dec 18 15:45:01 torres kernel: BIOS-provided physical RAM map: Dec 18 15:45:01 torres kernel: BIOS-e820: - 000a (usable) Dec 18 15:45:01 torres kernel: BIOS-e820: 000f - 0010 (reserved) Dec 18 15:45:01 torres kernel: BIOS-e820: 0010 - 0f7f (usable) Dec 18 15:45:01 torres kernel: BIOS-e820: 0f7f - 0f7f3000 (ACPI NVS) Dec 18 15:45:01 torres kernel: BIOS-e820: 0f7f3000 - 0f80 (ACPI data) Dec 18 15:45:01 torres kernel: BIOS-e820: - 0001 (reserved) Dec 18 15:45:01 torres kernel: 0MB HIGHMEM available. Dec 18 15:45:01 torres kernel: 247MB LOWMEM available. Dec 18 15:45:01 torres kernel: Entering add_active_range(0, 0, 63472) 0 entries of 256 used Dec 18 15:45:01 torres kernel: Zone PFN ranges: Dec 18 15:45:01 torres kernel: DMA 0 -> 4096 Dec 18 15:45:01 torres kernel: Normal 4096 ->63472 Dec 18 15:45:01 torres kernel: HighMem 63472 ->63472 Dec 18 15:45:01 torres kernel: early_node_map[1] active PFN ranges Dec 18 15:45:01 torres kernel: 0:0 ->63472 Dec 18 15:45:01 torres kernel: On node 0 totalpages: 63472 Dec 18 15:45:01 torres kernel: DMA zone: 32 pages used for memmap Dec 18 15:45:01 torres kernel: DMA zone: 0 pages reserved Dec 18 15:45:01 torres kernel: DMA zone: 4064 pages, LIFO batch:0 Dec 18 15:45:01 torres kernel: Normal zone: 463 pages used for memmap Dec 18 15:45:01 torres kernel: Normal zone: 58913 pages, LIFO batch:15 Dec 18 15:45:01 torres kernel: HighMem zone: 0 pages used for memmap Dec 18 15:45:01 torres kernel: DMI 2.2 present. 
Dec 18 15:45:01 torres kernel: ACPI: RSDP (v000 VIA694 ) @ 0x000f8050 Dec 18 15:45:01 torres kernel: ACPI: RSDT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x) @ 0x0f7f3000 Dec 18 15:45:01 torres kernel: ACPI: FADT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x) @ 0x0f7f3040 Dec 18 15:45:01 torres kernel: ACPI: DSDT (v001 VIA694 AWRDACPI 0x1000 MSFT 0x010c) @ 0x Dec 18 15:45:01 torres kernel: ACPI: PM-Timer IO Port: 0x4008 Dec 18 15:45:01 torres kernel: Allocating PCI resources starting at 1000 (gap: 0f80:f07f) Dec 18 15:45:01 torres kernel: Detected 1466.361 MHz processor. Dec 18 15:45:01 torres kernel: Built 1 zonelists. Total pages: 62977 Dec 18 15:45:01 torres kernel: Kernel command line: root=/dev/hda1 ro vga=normal Dec 18 15:45:01 torres kernel: Enabling fast FPU save and restore... done. Dec 18 15:45:01 torres kernel: Enabling unmasked SIMD FPU exception support... done. Dec 18 15:45:01 torres kernel: Initializing CPU#0 Dec 18 15:45:01 torres kernel: PID hash table entries: 1024 (order: 10, 4096 bytes) Dec 18 15:45:01 torres kernel: Console: colour VGA+ 80x25 Dec 18 15:45:01 torres kernel: Dentry cache hash table entries: 32768 (order: 5, 131072 bytes) Dec 18 15:45:01 torres kernel: Inode-cache hash table entries: 16384 (order: 4, 65536 bytes) Dec 18 15:45:01 torres kernel: Memory: 246964k/253888k available (2896k kernel code, 6368k reserved, 859k data, 204k init, 0k highmem) Dec 18 15:45:0
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/22/06, Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:
>
> sh-3.1# ls -l /var/cache/apt/
> total 12524
> drwxr-xr-x 3 root root   12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
> -rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
> sh-3.1# apt-get -f install
> Reading package lists... Done
> Segmentation faulty tree... 50%

I think that we are seeing different manifestations of apt's response to corrupted cache files. There does not appear to be any pattern to which manifestation occurs. Maybe it depends on where in the cache file the corruption is located, i.e. when the corruption occurs.

Based on the kernel gurus' current knowledge of the problem, would you expect the corruption to occur at the same point in a file, or is it possible that the corruption could occur at different points on successive Debian installer attempts on a UP, non PREEMPT system?

Gordon

-- 
Gordon Farquharson
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Fri, Dec 22, 2006 at 01:32:49PM +0100, Martin Michlmayr wrote:
> * Andrei Popa <[EMAIL PROTECTED]> [2006-12-22 14:24]:
> > With all three patches I have corruption
>
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase? It would be good
> if someone could suggest a torrent which is legal and not too large.

Hi everyone,

I have been reading this thread for the last few days, but have been silent. I have 3 torrents here for testing, if you want.

You can easily reproduce with "rtorrent", if you:

- Have a completely downloaded one, no matter what size
- Corrupt the download with
  dd if=/dev/zero of=download.file bs=16k count=1
- Restart 'rtorrent', hash-check fails
- It will download 1 piece that was corrupted.

The important part here is that rtorrent transfers one piece, using its own code sequence to write to the file.

Let me offer to test until Saturday afternoon CET, I have a cloned git repository, downloaded torrent files and "apt".

My systems that are affected are:

Linux oscar 2.6.18 SMP (2x450Mhz Intel P3) (rolled back to 2.6.18 but can boot latest git)
Linux tony 2.6.20-git UP (can be tested using all kinds of "apt" operations)

Both machines are using:

IDE -> MD-RAID1 -> LVM -> EXT3 (data=ordered)
SCSI -> MD-RAID5 -> .

I don't want to disturb your technical discussion, just offering some help in testing.

Regards,
Patrick
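The "corrupt one piece" step in the recipe above can also be sketched in C. One caveat: dd without conv=notrunc truncates the output file to the amount written, so this sketch assumes the intent is to damage only the first 16 KiB of the download in place and leave the rest of the file alone. The helper name zero_prefix is hypothetical and is not taken from rtorrent or from any mail in this thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite the first `len` bytes of an existing
 * file with zeroes. The file is opened without O_TRUNC, so data past
 * `len` is untouched and the size only changes if the file was shorter
 * than `len`. Returns 0 on success, -1 on error.
 */
int zero_prefix(const char *path, size_t len)
{
	char buf[4096];
	size_t done = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY);	/* no O_TRUNC, no O_CREAT */
	if (fd < 0)
		return -1;
	memset(buf, 0, sizeof(buf));
	while (done < len) {
		size_t chunk = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = pwrite(fd, buf, chunk, (off_t)done);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}
	if (fsync(fd) != 0) {	/* push the damage out to disk */
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Calling zero_prefix("download.file", 16 * 1024) would then stand in for the dd line, after which restarting the client should fail the hash check for the affected piece only.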
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/22/06, Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> sh-3.1# ls -l /var/cache/apt/
> total 5252
> drwxr-xr-x 3 root root    12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
> -rw-r--r-- 1 root root     8554 Dec 22 04:45 srcpkgcache.bin

This listing is a little different to what I got. For me, srcpkgcache.bin did not exist when apt eventually finished. Did you notice whether the install took a lot longer than usual?

> Gordon, does it fail for you where it normally does (installing
> initramfs-tools) or much later? For me, the installer was able to
> install initramfs-tools and the kernel, but apt now hangs at "Select
> and install software".

apt didn't hang for me, it just took 20 to 30 minutes to complete building the package database. Usually, it takes less than a minute. The installer stopped because it could not find a kernel to install. I have seen this failure mode before and, as you have previously pointed out, it is probably the same problem (corrupted apt cache files), just a different manifestation.

Gordon

-- 
Gordon Farquharson
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/21/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Andrew located at least one bug: we run cancel_dirty_page() too late in
> "truncate_complete_page()", which means that do_invalidatepage() ends up
> not clearing the page cache.
>
> His patch is appended.

Thanks. I'll try this out later today.

> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.
> Which sounded very similar to your setup (he has a 32M ARM box too, no?)

Yup, we have the same machines (Linksys NSLU2) and are running the same test case (installing Debian). However, I'm not sure what kernel version he had used for his latest test. I presumed 2.6.20-git, whereas I had used 2.6.19.

> Maybe it's a mount option issue? I've got data=ordered on my machine, are
> you perhaps running with something else?

We are also using ordered.

/dev/scsi/host0/bus0/target0/lun0/part1 /target ext3 rw,data=ordered 0 0

Gordon

-- 
Gordon Farquharson
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
A cleanup of try_to_unmap. I have not identified any races that this would solve, but for consistencies sake. Also includes a small s390 optimization by moving page_test_and_clear_dirty() out of the vma iteration. From: Peter Zijlstra <[EMAIL PROTECTED]> We clear the page in the following sequence: ClearPageDirty - lock ptl, clear pte, unlock ptl hence we should dirty in the opposite order: lock ptl, clear pte, unlock ptl - SetPageDirty try_to_unmap_one violates this by doing the SetPageDirty under the ptl. Also move page_test_and_clear_dirty() to try_to_unmap(). Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> --- mm/rmap.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) Index: linux-2.6/mm/rmap.c === --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -590,8 +590,6 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); } @@ -610,6 +608,7 @@ static int try_to_unmap_one(struct page pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; + struct page *dirty_page = NULL; address = vma_address(page, vma); if (address == -EFAULT) @@ -636,7 +635,7 @@ static int try_to_unmap_one(struct page /* Move the dirty bit to the physical page now the pte is gone. 
*/ if (pte_dirty(pteval)) - set_page_dirty(page); + dirty_page = page; /* Update high watermark before we lower rss */ update_hiwater_rss(mm); @@ -687,6 +686,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (dirty_page) + set_page_dirty(dirty_page); out: return ret; } @@ -918,6 +919,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (page_test_and_clear_dirty(page)) + set_page_dirty(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret;
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Fri, 2006-12-22 at 13:59 +0100, Martin Michlmayr wrote: > * Martin Michlmayr <[EMAIL PROTECTED]> [2006-12-22 13:32]: > > I've completed one installation with Linus' patch plus the two from > > Andrew successfully, but I'm currently trying again... > > and it failed. Since you are on ARM you might want to try with the page_mkclean_one cleanup patch too. Arjan agreed that the loop is not needed; we clear the pte, flush on all CPUs and then re-establish the pte. Any race will fault and be serialised on the pte lock. FWIW - with todays -git and Andrews second cancel_dirty_page() patch: http://lkml.org/lkml/2006/12/22/49 I am unable to trigger any corruption - I could again earlier by raising the number of seeds from 3 to 6. (am currently at 10 seeds) From: Peter Zijlstra <[EMAIL PROTECTED]> fix page_mkclean_one() - add flush_cache_page() for all those virtual indexed cache architectures. - handle s390. Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> --- mm/rmap.c | 38 +- 1 file changed, 25 insertions(+), 13 deletions(-) Index: linux-2.6/mm/rmap.c === --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + if (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; 
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; }
Re: 2.6.19 file content corruption on ext3
Marc Haber wrote:
> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours. In that situation, "aptitude update" segfaults. When I
> delete the file and have apt recreate it, things are fine again for a
> few hours before the file is broken again and the segfault start over.
> In all cases, umounting the file system and doing an fsck does not
> show issues with the file system.

Are you using wireless networking of any kind? If so which driver and security key system?

Might be useful if you could post 'dmesg' output so that people can see the other hardware that you have.

Daniel
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Martin Michlmayr <[EMAIL PROTECTED]> [2006-12-22 13:32]:
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again...

... and it failed.

-- 
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Andrei Popa <[EMAIL PROTECTED]> [2006-12-22 14:24]:
> With all three patches I have corruption

I've completed one installation with Linus' patch plus the two from Andrew successfully, but I'm currently trying again... but I really need a better testcase since an installation takes about an hour. Andrei, which torrent do you download as a testcase? It would be good if someone could suggest a torrent which is legal and not too large.

-- 
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
With all three patches I have corruption diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. -* -* Also, during truncate, discard_buffer will have marked all -* the page's buffers clean. We discover that here and clean -* the page also. -*/ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..4f4cd13 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? 
*/0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 9d774d0..8879f1d 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -61,31 +61,6 @@ ({ \ }) #endif -#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY -#define ptep_test_and_clear_dirty(__vma, __address, __ptep)\ -({ \ - pte_t __pte = *__ptep; \ - int r = 1; \ - if (!pte_dirty(__pte)) \ - r = 0; \ - else\ - set_pte_at((__vma)->vm_mm, (__address), (__ptep), \ - pte_mkclean(__pte)); \ - r; \ -}) -#endif - -#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(__vma, __address, __ptep) \ -({ \ - int __dirty;\ - __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \ - if (__dirty)\ - flush_tlb_page(__vma, __address); \ - __dirty;\ -}) -#endif - #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR #define ptep_get_and_clear(__mm, __address, __ptep)\ ({ \ diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index e6a4723..b61d6f9 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -300,18 +300,20 @@ do { \ flush_tlb_page(vma, address); \ } while (0) -#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(vma, address, ptep) \ -({ \ - int __dirty;\ - __dirty = pte_dirty(*(ptep)); \ - if (__dirty) { \ - clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low); \ - pte_update_defer((vma)->vm_mm, (address), (ptep)); \ - flush_tlb_page(vma, address); \ - } \ - __dirty;
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Andrew Morton <[EMAIL PROTECTED]> [2006-12-22 02:17]:
> > This hunk (on top of git from about 2 days ago and your latest patch)
> > results in the installer hanging right at the start.
>
> You'll need this also:

It starts again, thanks.

-- 
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Martin Michlmayr <[EMAIL PROTECTED]> [2006-12-22 11:10]:
> > immediately when I started wget, the hanging apt-get process
> > continued.
> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:

One of my questions was why apt-get worked to install the
initramfs-tools, the kernel and some other packages but later hung
while it was building the cache (which clearly it had built already to
install some packages): before the installer offers to install
additional packages, it changes the apt sources, which leads to apt
rebuilding the cache, and here it hangs.

Remember how I said that downloading a file with wget prompts apt to
work again? Apparently any filesystem access will do (I just ran
find / > /dev/null). Gordon, can you confirm this?
--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Fri, 22 Dec 2006 11:00:04 +0100 Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> > -	if (TestClearPageDirty(page) && account_size)
> > +	if (TestClearPageDirty(page) && account_size) {
> > +		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		task_io_account_cancelled_write(account_size);
> > +	}
>
> This hunk (on top of git from about 2 days ago and your latest patch)
> results in the installer hanging right at the start.

You'll need this also:

From: Andrew Morton <[EMAIL PROTECTED]>

Only (un)account for IO and page-dirtying for devices which have real
backing store (ie: not tmpfs or ramdisks).

Cc: "David S. Miller" <[EMAIL PROTECTED]>
Cc: Linus Torvalds <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 mm/truncate.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c
--- a/mm/truncate.c~truncate-dirty-memory-accounting-fix
+++ a/mm/truncate.c
@@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page
 		WARN_ON(++warncount < 5);
 	}
 
-	if (TestClearPageDirty(page) && account_size) {
+	if (TestClearPageDirty(page) && account_size &&
+			mapping_cap_account_dirty(page->mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		task_io_account_cancelled_write(account_size);
 	}
_
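To make the reason for the extra mapping_cap_account_dirty() test concrete, here is a minimal userspace sketch in C. It is a model only, not kernel code: struct mapping_model, struct page_model, nr_file_dirty and the model_* functions are hypothetical names invented for illustration, and the sketch assumes that the dirtying side bumps the counter only for mappings that account dirty memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct mapping_model {
	bool accounts_dirty;	/* models mapping_cap_account_dirty() */
};

struct page_model {
	bool dirty;
	struct mapping_model *mapping;
};

static long nr_file_dirty;	/* models the NR_FILE_DIRTY counter */

/* Dirty a page; only mappings with real backing store bump the counter. */
static void model_set_page_dirty(struct page_model *page)
{
	assert(page->mapping != NULL);
	if (!page->dirty) {
		page->dirty = true;
		if (page->mapping->accounts_dirty)
			nr_file_dirty++;
	}
}

/*
 * Cancel a page's dirtiness. With check_cap == false this models the
 * hunk that hung the installer; with check_cap == true it models the
 * follow-up fix that skips the decrement for tmpfs/ramdisk mappings.
 */
static void model_cancel_dirty_page(struct page_model *page, bool check_cap)
{
	bool was_dirty = page->dirty;

	assert(page->mapping != NULL);
	page->dirty = false;
	if (was_dirty && (!check_cap || page->mapping->accounts_dirty))
		nr_file_dirty--;
}
```

In this model, dirtying and then cancelling a page of a non-accounting mapping with check_cap == false leaves nr_file_dirty at -1, while check_cap == true keeps it at 0. Whether an underflowing counter is really what wedged the ramdisk-based installer is not confirmed anywhere in the thread.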
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Martin Michlmayr <[EMAIL PROTECTED]> [2006-12-22 11:06]:
> Okay, it's really weird. So apt-get just hangs doing nothing and I
> cannot even kill it. I just tried to download strace via wget and
> immediately when I started wget, the hanging apt-get process
> continued.

... and now that we've completed this step, the apt cache has suddenly
been reduced (see Gordon's mail for an explanation) and it segfaults:

sh-3.1# ls -l /var/cache/apt/
total 12524
drwxr-xr-x 3 root root   12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
-rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
sh-3.1# apt-get -f install
Reading package lists... Done
Segmentation faulty tree... 50%
--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Martin Michlmayr <[EMAIL PROTECTED]> [2006-12-22 11:00]:
> This time, however, I let the installer continue and it seems that
> with your patch apt now works where it failed in the past, but it
> hangs later on. It's pretty weird because I cannot even kill the
> process:

Okay, it's really weird. So apt-get just hangs doing nothing and I
cannot even kill it. I just tried to download strace via wget and
immediately when I started wget, the hanging apt-get process
continued.
--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-21 21:20]:
> generating these files, pkgcache.bin grows to 12582912 bytes, and when
> apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is
> 64254483 bytes. This time, when apt-get exited, it had only created
> pkgcache.bin which was still 12582912 bytes.

Yes, same here:

sh-3.1# ls -l /var/cache/apt/
total 5252
drwxr-xr-x 3 root root    12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
-rw-r--r-- 1 root root     8554 Dec 22 04:45 srcpkgcache.bin

Gordon, does it fail for you where it normally does (installing
initramfs-tools) or much later? For me, the installer was able to
install initramfs-tools and the kernel, but apt now hangs at "Select
and install software".
--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-21 20:54]:
> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.

That's what I thought too but now I can confirm what Gordon sees. But
it's pretty weird. Our testcase is to run Debian installer on the
NSLU2 arm device and apt-get would either segfault or hang at this
particular spot in the installation (when apt is first run).

With your patch, apt works correctly where it normally fails (at least
for me). I stopped the installation at this point and repeated it
several more times to make sure it's really working. And, yes, I can
repeat this result.

This time, however, I let the installer continue and it seems that
with your patch apt now works where it failed in the past, but it
hangs later on. It's pretty weird because I cannot even kill the
process:

sh-3.1# ps aux | grep 31126
root 31126 5.7 20.6 16240 6076 ? R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root 31157 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126
sh-3.1# kill -9 31126
sh-3.1# kill -9 31126
sh-3.1# ps aux | grep 31126
root 31126 5.6 20.6 16240 6076 ? R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root 31159 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126
sh-3.1#

> Which sounded very similar to your setup (he has a 32M ARM box too, no?)

It's the same device, a Linksys NSLU2.

> Author: Andrew Morton <[EMAIL PROTECTED]>

This patch makes it even worse for me.

> -	if (TestClearPageDirty(page) && account_size)
> +	if (TestClearPageDirty(page) && account_size) {
> +		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		task_io_account_cancelled_write(account_size);
> +	}

This hunk (on top of git from about 2 days ago and your latest patch)
results in the installer hanging right at the start. The Linux kernel
boots fine, the debian-installer is loaded into a ramdisk but when
ncurses is being started it just hangs. Reverting this hunk makes it
start again.

Does that help or confuse you even more?
--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 21 Dec 2006, Gordon Farquharson wrote:
>
> I tested 2.6.19 with a version of Linus's patch that applies cleanly
> to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
> failed. It did not segfault this time, but instead got stuck for about
> 20 to 30 minutes and was accessing the hard drive frequently.

Ok, there's definitely something screwy going on.

Andrew located at least one bug: we run cancel_dirty_page() too late in
"truncate_complete_page()", which means that do_invalidatepage() ends up
not clearing the page cache. His patch is appended.

But it sounds like I probably misunderstood something, because I thought
that Martin had acknowledged that this patch actually worked for him.
Which sounded very similar to your setup (he has a 32M ARM box too, no?)

And your failure sounds a lot like one that David Miller is reporting.

At the same time, my own shared file mmap tests on my own machines
obviously work fine (I lower the dirty-writeback thresholds to force
writeback more easily, and then mmap a file and write and rewrite to it
in memory, and truncate it).

Maybe it's a mount option issue? I've got data=ordered on my machine, are
you perhaps running with something else?

		Linus

---
commit 3e67c0987d7567ad41164a153dca9a43b11d
Author: Andrew Morton <[EMAIL PROTECTED]>
Date:   Thu Dec 21 11:00:33 2006 -0800

    [PATCH] truncate: clear page dirtiness before running try_to_free_buffers()

    truncate presently invalidates the dirty page's buffer_heads then shoots
    down the page. But try_to_free_buffers() will now bale out because the
    page is dirty.

    Net effect: the LRU gets filled with dirty pages which have invalidated
    buffer_heads attached. They have no ->mapping and hence cannot be
    cleaned. The machine leaks memory at an enormous rate.

    Fix this by cleaning the page before running try_to_free_buffers(), so
    try_to_free_buffers() can do its work.

    Also, remember to do dirty-page-accounting in cancel_dirty_page() so the
    machine won't wedge up trying to write non-existent dirty pages.

    Probably still wrong, but now less so.

    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

diff --git a/mm/truncate.c b/mm/truncate.c
index bf9e296..89a5c35 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -60,11 +60,12 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 		WARN_ON(++warncount < 5);
 	}
 
-	if (TestClearPageDirty(page) && account_size)
+	if (TestClearPageDirty(page) && account_size) {
+		dec_zone_page_state(page, NR_FILE_DIRTY);
 		task_io_account_cancelled_write(account_size);
+	}
 }
 
-
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -81,11 +82,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	remove_from_page_cache(page);
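The ordering problem that the commit message describes can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: struct page_model and the model_* functions are hypothetical names, and the only behaviour modelled is that freeing buffers is refused while the page is dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two page properties that matter here. */
struct page_model {
	bool dirty;
	bool has_buffers;
};

/* Models try_to_free_buffers() once it refuses to touch dirty pages. */
static bool model_try_to_free_buffers(struct page_model *page)
{
	assert(page != NULL);
	if (page->dirty)
		return false;		/* bail out: page is still dirty */
	page->has_buffers = false;
	return true;
}

/*
 * Models truncate_complete_page(). cancel_first == true is the patched
 * order (clean the page, then invalidate); false is the old order.
 */
static void model_truncate_complete_page(struct page_model *page,
					 bool cancel_first)
{
	assert(page != NULL);
	if (cancel_first)
		page->dirty = false;
	if (page->has_buffers)
		model_try_to_free_buffers(page);
	if (!cancel_first)
		page->dirty = false;	/* too late for the buffers */
}
```

With cancel_first == false a dirty page keeps its buffers after truncation, which corresponds to the leaked pages with invalidated buffer_heads described above; with cancel_first == true the buffers are released.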
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/21/06, Andrew Morton <[EMAIL PROTECTED]> wrote:

> > Can the call to task_io_account_cancelled_write() simply be removed
> > from cancel_dirty_page() for testing the patch with 2.6.19 (since
> > 2.6.19 doesn't seem to have the task I/O accounting) ?
>
> Yes.

I tested 2.6.19 with a version of Linus's patch that applies cleanly
to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
failed. It did not segfault this time, but instead got stuck for about
20 to 30 minutes and was accessing the hard drive frequently.

Here is some background about the problem we see with apt which may
help somebody with knowledge of the apt source code analyse the
problem in the context of the patch.

When apt-get is first run, it generates pkgcache.bin and
srcpkgcache.bin in /var/cache/apt. We have found that these are the
files that get corrupted when we apply the patch "mm: tracking shared
dirty pages" [1] to 2.6.18. The corruption of these files is what
causes apt-get to segfault.

I have observed that the normal operation of apt-get is that while
apt-get is generating these files, pkgcache.bin grows to 12582912
bytes, and when apt-get finishes, pkgcache.bin is 6425533 bytes and
srcpkgcache.bin is 64254483 bytes. This time, when apt-get exited, it
had only created pkgcache.bin which was still 12582912 bytes.

Also, the patch caused apt to slow down a lot. I ran apt-get -f
install after apt had exited, and it took so long that I killed it
before it had finished.

I did not try 2.6.20-git, but I presume that this version is what
Martin tried earlier. Maybe Linus's patch doesn't work with 2.6.19,
because 2.6.19 is missing some other patch.
Gordon

[1] http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c	2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/fs/buffer.c	2006-12-21 01:16:31.0 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page. We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 */
-		clear_page_dirty(page);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c	2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c	2006-12-21 01:15:21.0 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h	2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/include/linux/page-flags.h	2006-12-21 01:15:21.0 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
-{
-	test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c	2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/mm/memory.c	2006-12-21 01:15:21.0 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if
Re: 2.6.19 file content corruption on ext3
On Thu, 21 Dec 2006 14:03:20 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> >
> > Btw,
> > here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
        linux-gate.so.1 => (0xe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
        /lib/ld-linux.so.2 (0x4631d000)

worrisome.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 21 Dec 2006, Peter Zijlstra wrote:
>
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?

Now, the reason I think the loop may be needed is:

	CPU#0				CPU#1
	-----				-----

					load old PTE entry
	clear dirty and WP bits
					write to page using old PTE
					NOT CHECKING that the new one
					is write-protected, and just
					setting the dirty bit blindly
					(but atomically)
	flush_tlb_page()
	TLB flushed, but we now have a
	page that is marked dirty and
	unwritable in the page tables,
	and we will mark it clean in
	"struct page *"

Now, the scary thing is, IF a CPU does this, then the way we do all this,
we may actually have the following sequence:

	CPU#0				CPU#1
	-----				-----

					load old PTE entry
	ptep_clear_flush():
					atomic "set dirty bit" sequence
					PTEP now contains 040 !!!
	flush_tlb_page();
	TLB flushed, but PTEP is still
	"dirty zero"

	write the clear/readonly PTE
	THE DIRTY BIT WAS LOST!

which might actually explain this bug.

I personally _thought_ that Intel CPU's don't actually do an "set dirty
bit atomically" sequence, but more of a "set dirty bit but trap if the TLB
is nonpresent" thing, but I have absolutely no proof for that.

Anyway, IF this is the case, then the following patch may or may not fix
things. It avoids things by never overwriting a PTE entry, not even the
"cleared" one. It always does an atomic "xchg()" with a valid new entry,
and looks at the old bits.

What do you guys think? Does something like this work out for S/390 too? I
tried to make that "ptep_flush_dirty()" concept work for architectures
that hide the dirty bit somewhere else too, but..

It actually simplifies the architecture-specific code (you just need to
implement a trivial "ptep_exchange()" and "ptep_flush_dirty()" macro), but
I only did x86-64 and i386, and while I've booted with this, I haven't
really given the thing a lot of really _deep_ thought.

But I think this might be safer, as per above.. And it _might_ actually
explain the problem. Exactly because the "ptep_clear() + blindly assign to
ptep" might lose a dirty bit that was written by another CPU.

But this really does depend on what a CPU does when it marks a page dirty.
Does it just blindly write the dirty bit? Or does it actually _validate_
that the old page table entry was still present and writable?

This patch makes no assumptions. It should work even if a CPU just writes
the dirty bit blindly, and the only expectation is that the page tables
can be accessed atomically (which had _better_ be true on any SMP
architecture)

Arjan, can you please check within Intel, and ask what the "proper"
sequence for doing something like this is?

		Linus

commit 301d2d53ca0e5d2f61b1c1c259da410c7ee6d6a7
Author: Linus Torvalds <[EMAIL PROTECTED]>
Date:   Thu Dec 21 11:11:05 2006 -0800

    Rewrite the page table "clear dirty and writable" accesses

    This is much simpler for most architectures, and allows us to do the
    dirty and writable clear in a single operation without any races or any
    double flushes.

    It's also much more careful: we never overwrite the old dirty bits at
    any time, and always make sure to do atomic memory ops to exchange and
    see the old value.

    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ do { \
 })
 #endif
 
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \
-({ \
-	pte_t __pte = *__ptep; \
-	int r = 1; \
-	if (!pte_dirty(__pte)) \
-		r = 0
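The difference between "clear, then blindly assign" and "exchange with a valid new entry and look at the old bits" can be sketched with C11 atomics in userspace. Everything here is an illustrative assumption rather than the patch itself: pte_model_t, the PTE_* bit values, racer_set_dirty() and both update functions are hypothetical names, and the second CPU is simulated by a callback that runs at the racy point so the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for the model; dirty is bit 6 as on x86. */
#define PTE_PRESENT 0x001u
#define PTE_RW      0x002u
#define PTE_DIRTY   0x040u

typedef _Atomic uint32_t pte_model_t;
typedef void (*racer_fn)(pte_model_t *ptep);

/* Simulated second CPU: sets the dirty bit blindly, but atomically. */
static void racer_set_dirty(pte_model_t *ptep)
{
	atomic_fetch_or(ptep, PTE_DIRTY);
}

/* Old scheme: clear the entry, then overwrite it with the clean value. */
static bool clear_then_assign(pte_model_t *ptep, racer_fn racer)
{
	assert(ptep != NULL);
	uint32_t old = atomic_exchange(ptep, 0);	/* "ptep_clear" */

	if (racer)
		racer(ptep);		/* other CPU dirties the zeroed entry */
	/* The blind store wipes out whatever the racer wrote. */
	atomic_store(ptep, old & ~(PTE_DIRTY | PTE_RW));
	return (old & PTE_DIRTY) != 0;
}

/* New scheme: never overwrite; exchange and inspect the old value. */
static bool exchange_with_valid(pte_model_t *ptep, racer_fn racer)
{
	assert(ptep != NULL);
	uint32_t snapshot = atomic_load(ptep);

	if (racer)
		racer(ptep);		/* other CPU dirties the live entry */
	uint32_t old = atomic_exchange(ptep,
				       snapshot & ~(PTE_DIRTY | PTE_RW));
	return (old & PTE_DIRTY) != 0;	/* racer's bit shows up in old */
}
```

In this model the first function reports the entry as clean even though the simulated CPU dirtied it, while the second reports it as dirty. A write through a stale TLB entry that lands after the exchange is not modelled; that case is what the loop discussed at the top of the message is about.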
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Wed, 20 Dec 2006, Trond Myklebust wrote:
>
> I can't see that it is the business of invalidate_inode_pages2() to
> resolve races between ->direct_IO() and pages that are redirtied by
> mmap(). All it needs to ensure is that pages that clean are discarded,
> since those are neither consistent with data that the ->directIO() call
> wrote to the disk nor are they scheduled to be written to disk.

Sure, we could happily just remove the -EIO.

Alternatively, we could still do all the invalidates over the whole range,
and return -EIO at the end if any of the pages weren't invalidated because
they had to be written back.

I don't personally care whether we should just return success or something
to indicate that there were busy pages, but somebody who _uses_ direct-IO
might want to know that the thing didn't throw away everything. If you
know such users, can you ask them?

(Maybe "-EAGAIN" is better than "-EIO", since it's not really even a fatal
error).

		Linus
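The "invalidate the whole range, report busy pages at the end" alternative could look roughly like the following userspace sketch. It is an assumption-laden model, not a proposal for invalidate_inode_pages2(): struct page_model and model_invalidate_range() are hypothetical names, and -EAGAIN is used only because the message floats it as a possibly better choice than -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cached page. */
struct page_model {
	bool cached;
	bool dirty;
};

/*
 * Drop every clean cached page in [start, end] (inclusive). Dirty pages
 * are skipped, but the walk continues over the whole range. Returns 0 if
 * nothing was skipped, -EAGAIN if at least one dirty page was left behind.
 */
static int model_invalidate_range(struct page_model *pages,
				  size_t start, size_t end)
{
	int ret = 0;

	assert(pages != NULL);
	for (size_t i = start; i <= end; i++) {
		if (!pages[i].cached)
			continue;
		if (pages[i].dirty) {
			ret = -EAGAIN;	/* remember, but keep going */
			continue;
		}
		pages[i].cached = false;
	}
	return ret;
}
```

A caller that only cares about clean pages being discarded can ignore the return value; a direct-IO user that wants to know whether anything stayed cached can check it.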
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 21 Dec 2006, Andrei Popa wrote:
> On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
> >
> > Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be
> > talking about different bugs, so _both_ of your experiences definitely
> > matter here).
>
> with http://lkml.org/lkml/diff/2006/12/20/204/1
> I have corruption: Hash check on download completion found bad chunks,
> consider using "safe_sync".

Gaah.

Martin Michlmayr reported that it apparently fixes his ARM corruption.

Now, admittedly I already suspected the issues might be different (if only
because of the UP vs SMP/PREEMPT case), but I really had my hopes up after
Martin's report, because if anything, _his_ issue might have been a
superset of your problem (while obviously any subtle SMP races you might
be seeing are definitely not an issue in his case).

Oh well. I think the ARM case is enough of a reason to apply those patches
(if it hadn't made any difference at all, I'd have waited until after
2.6.20), and we'll just have to continue on the SMP PREEMPT angle.

		Linus
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
>
> Btw, I'd really love to hear whether the patch I sent out actually _helps_
> at all, or whether we're just discussing something that in the end is just
> a cleanup..
>
> Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be
> talking about different bugs, so _both_ of your experiences definitely
> matter here).

with http://lkml.org/lkml/diff/2006/12/20/204/1
I have corruption: Hash check on download completion found bad chunks,
consider using "safe_sync".

>
> Linus
Re: mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
> On Wed, 20 Dec 2006, Linus Torvalds wrote:
> Martin, Andrei, does this make any difference for your corruption cases?

Hi!

I've been watching this issue because I've been experiencing rtorrent
corruption since 2.6.19.

Details: i386, UP, no preempt:

kungen:/proc# zgrep PREEMPT config.gz
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
kungen:/proc# uname -a
Linux kungen.fatbob.nu 2.6.19.1 #3 Thu Dec 21 13:18:06 CET 2006 i686 GNU/Linux

Corruption is still present with the patch below (applied against 2.6.19.1
with the task_io_account_cancelled_write call removed).

/Martin

[Not subscribed to the list]

> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page. We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean. We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ed2c223..4f4cd13 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct file *file,
>
>  static void truncate_huge_page(struct page *page)
>  {
> -	clear_page_dirty(page);
> +	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
>  	ClearPageUptodate(page);
>  	remove_from_page_cache(page);
>  	put_page(page);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 4830a3b..350878a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struct page *page)
>
>  struct page;	/* forward declaration */
>
> -int test_clear_page_dirty(struct page *page);
> +extern void cancel_dirty_page(struct page *page, unsigned int account_size);
> +
>  int test_clear_page_writeback(struct page *page);
>  int test_set_page_writeback(struct page *page);
>
> -static inline void clear_page_dirty(struct page *page)
> -{
> -	test_clear_page_dirty(page);
> -}
> -
>  static inline void set_page_writeback(struct page *page)
>  {
>  	test_set_page_writeback(page);
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..79cecab 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> +	pgoff_t index;
> +	unsigned int offset;
> +	struct page *page;
> +
> +	if (!mapping)
> +		return;
> +	offset = size & ~PAGE_MASK;
> +	if (!offset)
> +		return;
> +	index = size >> PAGE_SHIFT;
> +	page = find_lock_page(mapping, index);
> +	if (page) {
> +		unsigned int check = 0;
> +		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> +		do {
> +			check += kaddr[offset++];
> +		} while (offset < PAGE_SIZE);
> +		kunmap_atomic(kaddr,KM_USER0);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		if (check)
> +			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
> +	}
> +}
> +
>  /**
>   * vmtruncate - unmap mappings "freed" by truncate() syscall
>   * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
>  		goto out_sig;
>  	if (offset > inode->i_sb->s_maxbytes)
>  		goto out_big;
> +	check_last_page(mapping, inode->i_size);
>  	i_size_write(inode, offset);
>
>  out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..b3a198c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *page)
>  EXPORT_SYMBOL(set_page
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
>
> Btw,
> here's a totally new tangent on this: it's possible that user code is
> simply BUGGY.

depmod: BADNESS: written outside isize 22183

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
 }
 EXPORT_SYMBOL(nobh_commit_write);
 
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+				current->comm, check);
+}
+
 /*
  * nobh_writepage() - based on block_full_write_page() except
  * that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, Dec 21, 2006 at 12:30:22PM +, Russell King wrote:
> On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> > That's obviously a bug worth fixing on its own. Do you know when it
> > started?
>
> My last merge, just before 2.6.19-rc1.

Obviously 2.6.20-rc1.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> That's obviously a bug worth fixing on its own. Do you know when it
> started?

My last merge, just before 2.6.19-rc1.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-20 11:50]:
> Martin, Andrei, does this make any difference for your corruption
> cases?

Works for me.

--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, Dec 21, 2006 at 09:18:45AM +0100, Martin Michlmayr wrote:
> * Russell King <[EMAIL PROTECTED]> [2006-12-20 22:11]:
> > > This patch doesn't fix my problem (apt segfaults on ARM because its
> > > database is corrupted).
> >
> > Are you using IDE in PIO mode? If so, the bug probably lies there.
>
> I'm using usb-storage. It's used to access an external IDE drive in
> a USB enclosure but I don't think it matters that it's IDE since
> we're using the SCSI layer to talk to it, right?

USB generally uses DMA so you're probably safe.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 2006-12-21 at 10:20 +0100, Peter Zijlstra wrote:
> > Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> > flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> > Not good. Use set_pte_at instead of the ptep_establish.
>
> Yeah, sorry, I already noticed and corrected that :-|
>
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?

The while loop is protected by the pte lock and flush_tlb_page has to
remove the tlbs on all cpus. So yes, I think the while loop is not
necessary.

--
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 21 Dec 2006 02:17:05 -0700 "Gordon Farquharson" <[EMAIL PROTECTED]> wrote:

> Can the call to task_io_account_cancelled_write() simply be removed
> from cancel_dirty_page() for testing the patch with 2.6.19 (since
> 2.6.19 doesn't seem to have the task I/O accounting) ?

Yes.
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 2006-12-21 at 10:16 +0100, Martin Schwidefsky wrote:
> On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> > current version
>
> Nitpicking ..
>
> > @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
> >  	if (!pte)
> >  		goto out;
> >
> > -	if (!pte_dirty(*pte) && !pte_write(*pte))
> > -		goto unlock;
> > +	while (pte_dirty(*pte) || pte_write(*pte)) {
> > +		pte_t entry;
> >
> > -	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > -	entry = pte_wrprotect(entry);
> > -	ptep_establish(vma, address, pte, entry);
> > -	lazy_mmu_prot_update(entry);
> > -	ret = 1;
> > +		flush_cache_page(vma, address, pte_pfn(*pte));
> > +		entry = ptep_clear_flush(vma, address, pte);
> > +		entry = pte_wrprotect(entry);
> > +		entry = pte_mkclean(entry);
> > +		ptep_establish(vma, address, pte, entry);
>
> Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> Not good. Use set_pte_at instead of the ptep_establish.

Yeah, sorry, I already noticed and corrected that :-|

Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
at the beginning of the loop. flush_tlb_page() does IPI the other cpus
to flush their tlb too, so there should not be a SMP race, Arjan?

> > +		lazy_mmu_prot_update(entry);
> > +		ret = 1;
> > +	}
> >
> > -unlock:
> >  	pte_unmap_unlock(pte, ptl);
> >  out:
> >  	return ret;
>
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/21/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> That said, I think the patch I sent out should actually work on top of
> plain 2.6.19 too. I don't think things have changed in this area that
> much. IOW, you don't _need_ latest -git to test it, you just need a
> broken kernel ;)

I created a version of your patch that applied to 2.6.19, but it
doesn't compile:

mm/built-in.o: In function `cancel_dirty_page':
slab.c:(.text+0x8964): undefined reference to `task_io_account_cancelled_write'
make[3]: *** [.tmp_vmlinux1] Error 1

It looks like task_io_account_cancelled_write() was added in

http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7c3ab7381e79dfc7db14a67c6f4f3285664e1ec2

Can the call to task_io_account_cancelled_write() simply be removed
from cancel_dirty_page() for testing the patch with 2.6.19 (since
2.6.19 doesn't seem to have the task I/O accounting) ?

Gordon

--
Gordon Farquharson
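When a backported patch calls a helper that the older tree lacks, the two usual options for a test build are to delete the call or to supply an empty stub with the same name. A rough sketch of the stub approach in plain userspace C follows. All sketch_* names and the SKETCH_HAVE_TASK_IO_ACCOUNTING switch are made up for illustration; the code only models the shape of the call and is not cancel_dirty_page() itself:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: define as 1 on a tree that has the accounting. */
#ifndef SKETCH_HAVE_TASK_IO_ACCOUNTING
#define SKETCH_HAVE_TASK_IO_ACCOUNTING 0
#endif

/* Bytes recorded as "cancelled writes"; stays 0 when the stub is used. */
static size_t sketch_cancelled_bytes;

#if SKETCH_HAVE_TASK_IO_ACCOUNTING
static inline void sketch_account_cancelled_write(size_t bytes)
{
	sketch_cancelled_bytes += bytes;
}
#else
/* Empty stub: lets the caller compile unchanged on the older tree. */
static inline void sketch_account_cancelled_write(size_t bytes)
{
	(void)bytes;
}
#endif

/*
 * Model of a "cancel dirty" step: clear the dirty flag and, if the page
 * really was dirty, report the cancelled write. Returns 1 if the flag
 * was cleared, 0 if the page was already clean.
 */
static int sketch_cancel_dirty(int *page_dirty, size_t account_size)
{
	assert(page_dirty != NULL);
	if (!*page_dirty)
		return 0;
	*page_dirty = 0;
	if (account_size)
		sketch_account_cancelled_write(account_size);
	return 1;
}
```

With the stub in place the dirty-flag handling is unchanged and only the statistics are skipped, which is why dropping the accounting should not affect a corruption test.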
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> current version

Nitpicking ..

> @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
>  	if (!pte)
>  		goto out;
>
> -	if (!pte_dirty(*pte) && !pte_write(*pte))
> -		goto unlock;
> +	while (pte_dirty(*pte) || pte_write(*pte)) {
> +		pte_t entry;
>
> -	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> -	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> -	lazy_mmu_prot_update(entry);
> -	ret = 1;
> +		flush_cache_page(vma, address, pte_pfn(*pte));
> +		entry = ptep_clear_flush(vma, address, pte);
> +		entry = pte_wrprotect(entry);
> +		entry = pte_mkclean(entry);
> +		ptep_establish(vma, address, pte, entry);

Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
Not good. Use set_pte_at instead of the ptep_establish.

> +		lazy_mmu_prot_update(entry);
> +		ret = 1;
> +	}
>
> -unlock:
>  	pte_unmap_unlock(pte, ptl);
>  out:
>  	return ret;

--
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
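The double-flush point can be illustrated with a toy userspace model that simply counts flushes. Everything named toy_* below is hypothetical. The model encodes only the behaviour described in the message (the clear-and-flush step and the establish step each flush the TLB, a plain set does not) and is not kernel code:

```c
#include <assert.h>

/* Toy page-table entry: three flags instead of real hardware bits. */
struct toy_pte { int present, dirty, writable; };

static int toy_tlb_flushes;	/* number of simulated TLB flushes */

static void toy_flush_tlb_page(void) { toy_tlb_flushes++; }

/* Models ptep_clear_flush(): clear the entry, flush, return old value. */
static struct toy_pte toy_clear_flush(struct toy_pte *ptep)
{
	struct toy_pte old = *ptep;

	ptep->present = ptep->dirty = ptep->writable = 0;
	toy_flush_tlb_page();
	return old;
}

/* Models ptep_establish() as described in the thread: set, then flush. */
static void toy_establish(struct toy_pte *ptep, struct toy_pte entry)
{
	*ptep = entry;
	toy_flush_tlb_page();
}

/* Models set_pte_at(): a plain store with no flush. */
static void toy_set_pte_at(struct toy_pte *ptep, struct toy_pte entry)
{
	*ptep = entry;
}

/* Clean and write-protect one entry; returns 1 if anything changed. */
static int toy_mkclean_one(struct toy_pte *ptep, int use_establish)
{
	struct toy_pte entry;

	assert(ptep->present);
	if (!ptep->dirty && !ptep->writable)
		return 0;

	entry = toy_clear_flush(ptep);	/* first (and needed) flush */
	entry.writable = 0;		/* stands in for pte_wrprotect() */
	entry.dirty = 0;		/* stands in for pte_mkclean() */
	if (use_establish)
		toy_establish(ptep, entry);	/* redundant second flush */
	else
		toy_set_pte_at(ptep, entry);	/* no extra flush */
	return 1;
}
```

In this model the establish variant costs two flushes per cleaned entry and the set variant costs one, with the same final entry in both cases. Whether the real primitives behave this way on a given architecture is something to check against that architecture's headers.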
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 21 Dec 2006, Martin Michlmayr wrote:
>
> This is a known issue. The following patch has been proposed
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
> although I just noticed that it has been marked as "discarded".
> Apparently Russell King committed a better patch so this should be
> fixed in git when he sends his next pull request.

Ahh, ok. Then it might even be in the set of merges I did earlier today
(and which should mirror out soon enough, hopefully).

		Linus
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-20 23:53]:
> > Unfortunately, I cannot get the latest git version of the kernel to
> > boot on the ARM machine on which Martin and I are experiencing the apt
> > segfault.
>
> Ouch.
>
> That's obviously a bug worth fixing on its own. Do you know when it
> started?

This is a known issue. The following patch has been proposed
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
although I just noticed that it has been marked as "discarded".
Apparently Russell King committed a better patch so this should be
fixed in git when he sends his next pull request.

--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Russell King <[EMAIL PROTECTED]> [2006-12-20 22:11]:
> > This patch doesn't fix my problem (apt segfaults on ARM because its
> > database is corrupted).
>
> Are you using IDE in PIO mode? If so, the bug probably lies there.

I'm using usb-storage. It's used to access an external IDE drive in
a USB enclosure but I don't think it matters that it's IDE since
we're using the SCSI layer to talk to it, right?

--
Martin Michlmayr
http://www.cyrius.com/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Wed, 2006-12-20 at 21:36 -0500, Trond Myklebust wrote:
> On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> > I think this is also needed:
>
> NAK
>
> invalidate_inode_pages2() should _not_ be pretending that dirty pages
> are clean. This patch is incorrect both for the NFS usage and for the
> directIO usage.
>
> In the latter case, if someone has the page mmapped, resulting in the
> page getting marked as dirty _after_ a directIO write, then it would be
> wrong to discard that data. Only dirty data from _before_ the directIO
> write should need to be discarded (and that is achieved by unmapping,
> then cleaning the page prior to the directIO call)...
>
> For the NFS case, the race is a bit more tricky, since you have the
> "unstable write" case which means that the page is neither marked as
> dirty, nor is entirely clean ('cos we don't know that the server has
> committed the data to permanent storage yet).

Then this patch:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc1/2.6.20-rc1-mm1/broken-out/nfs-fix-nr_file_dirty-underflow.patch

is equally wrong, right?