Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 06:58:23PM -0700, Joel Becker wrote:
> Oh, no, that's not it at all. This is a disaster. I can't see for the
> life of me why we haven't had 100,000 bug reports. You're going

Btw, we've figured out why we don't have 100,000 bug reports. Most normal usage will never encounter this. But it is still a serious problem, and the fix is just as high a priority.

Joel

P.S.: Thanks, LWN, for making me look good on the QotW ;-)

--
"I inject pure kryptonite into my brain. It improves my kung fu, and it eases the pain."

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 07:04:20PM -0700, Joel Becker wrote:
> On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
> > > Regarding XFS, how do you handle catching the tail of an allocation
> > > with an lseek(2)'d write? That is, your current allocation has a few
> > > blocks outside of i_size, then I lseek(2) a gigabyte past EOF and
> > > write there. The code has to recognize to zero around old_i_size
> > > before moving out to new_i_size, right? I think that's where our
> > > old approaches had problems.
> >
> > xfs_file_aio_write() handles both those cases for us via
> > xfs_zero_eof(). What it does is map the region from the old EOF to
> > the start of the new write and zeroes any allocated blocks that are
> > not marked unwritten that lie within the range. It does this via the
> > internal mapping interface because we hide allocated blocks past EOF
> > from the page cache and higher layers.
>
> Makes sense as an approach. We deliberately do this through the page
> cache to take advantage of its I/O patterns and tie in with JBD2.
> Also, we don't feel like maintaining an entire shadow page cache ;-)

Just to clarify any possible misunderstanding here, xfs_zero_eof() also does its IO through the page cache for similar reasons. It's just that the mappings are found via the internal interfaces before the zeroing is done via the anonymous pagecache_write_begin()/pagecache_write_end() functions (in xfs_iozero()) rather than using the generic block functions.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker <joel.bec...@oracle.com> wrote:
> > Well, shit. Something has changed in here, or we're really really
> > (un)lucky. We visited this code a year ago or so when we had serious
> > zeroing problems, and we tested the hell out of it. Now it is broken
> > again. And it sure looks like that block_write_full_page() check has
> > been there since before git.
>
> Hmm. I'm actually starting to worry that we should do the revert after
> all.
>
> Why? Locking.
>
> That page-writeback.c thing decides to limit the end to the inode size
> the same way that block_write_full_page() does, but
> block_write_full_page() holds the page lock, while page-writeback.c
> does not. Which means that as a race against somebody else doing a
> truncate(), the two things really are pretty different.
>
> That said, write_cache_pages() obviously doesn't actually invalidate
> the page (the way block_write_full_page() does), so locking matters a
> whole lot less for it. If somebody is doing a concurrent truncate or a
> concurrent write, then for the data to really show up reliably on disk
> there would obviously have to be a separate sync operation involved,
> so even with the lack of any locking, it should be safe.

Yes, that is the premise on which the "stop @ EOF" code in write_cache_pages() is based. It's simply a snapshot of the EOF when the data integrity sync starts, and as such any subsequent extensions to it that happen after the sync started are not something we have to worry about for this sync operation.

OTOH, if there is a concurrent truncation while the loop is operating, then the existing checks for truncation after locking the page _must_ be sufficient to avoid writeback of such truncated pages, otherwise truncate would already be broken.

> I dunno. Filesystem corruption makes me nervous.

You're not alone in that feeling. :/

FWIW, it's taken us quite a long while (years) to iron out all of the known sync+crash bugs in XFS, and as a result of the process we have a fair number of regression tests that tell us quickly when sync has been broken. This testing hasn't indicated any problems with XFS, so I'm fairly confident the change is safe. That said,

> So I'm certainly totally willing to do the revert if that makes ocfs2
> work again. Even if "work again" happens to be partly by mistake, and
> for some reason that isn't obvious.
>
> Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
> have an -rc4 (in a few days - not today) with just the revert, I'm ok
> with that. Even if it's only an "at least no worse than 2.6.34"
> situation rather than a real fix.

... I agree with this.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 05:54:04PM -0700, Joel Becker wrote:
> On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
> >
> > Hi Joel,
> >
> > I have no problems with it being reverted - it's really just a WAR
> > for the simplest case of the sync holdoff.
>
> I have to insist that we revert it until we find a way to make ocfs2
> work. The rest of the email will discuss the ocfs2 issues therein.
>
> > > This patch causes any filesystem with an allocation unit larger
> > > than the filesystem blocksize to leak unzeroed data. During a file
> > > extend, the entire allocation unit is zeroed.
> >
> > XFS has this same underlying issue - it can have uninitialised,
> > allocated blocks past EOF that have to be zeroed when extending the
> > file.
>
> Does XFS do this in get_blocks()? We deliberately do no allocation in
> get_blocks(), which is where our need for up-front allocation comes
> from.

No, it does it in xfs_file_aio_write() (i.e. ->aio_write()) so it catches both buffered and direct IO. This can't be done in the get_blocks() callback because (IMO) there really isn't the context available to know exactly how we are extending the file in get_blocks().

> > > However, this patch prevents the tail blocks of the allocation
> > > unit from being written back to disk. When the file is next
> > > extended, i_size will now cover these unzeroed blocks, leaking the
> > > old contents of the disk to userspace and creating a corrupt file.
> >
> > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> > between the old EOF and the new EOF on each extending write. Hence
> > these pages get written because they fall inside the new i_size that
> > is set during the write. The i_size on disk doesn't get changed
> > until after the data writes have completed, so even on a crash we
> > don't expose uninitialised blocks.
>
> We do the same, but we zero the entire allocation. This works both
> when filling holes and when extending, though obviously the extending
> is what we're worried about here. We change i_size in write_end, so
> our guarantee is the same as yours for the page containing i_size.

Ok, so you've got cached pages covering the file, and the tail of the last page/block zeroed in memory. I'd guess that ordered mode journalling then ensures the inode size update doesn't hit the disk until after the data does, so this is crash-safe. That would explain (to me, at least) why you are not seeing stale data exposure on crashes.

> > Looking at ocfs2_writepage(), it simply calls
> > block_write_full_page(), which does:
> >
> >	/* Is the page fully outside i_size? (truncate in progress) */
> >	offset = i_size & (PAGE_CACHE_SIZE-1);
> >	if (page->index >= end_index+1 || !offset) {
> >		/*
> >		 * The page may have dirty, unmapped buffers. For example,
> >		 * they may have been added in ext3_writepage(). Make them
> >		 * freeable here, so the page does not leak.
> >		 */
> >		do_invalidatepage(page, 0);
> >		unlock_page(page);
> >		return 0; /* don't care */
> >	}
> >
> > i.e. pages beyond EOF get invalidated. If it somehow gets through
> > that check, __block_write_full_page() will avoid writing dirty
> > bufferheads beyond EOF because the write is racing with truncate.
>
> Your contention is that we've never gotten those tail blocks to disk.
> Instead, our code either handles the future extensions of i_size or
> we've just gotten lucky with our testing. Our current BUG trigger is
> because we have a new check that catches this case. Does that
> summarize your position correctly?

Yes, that summarises it pretty well ;)

> I'm not averse to having a zero-only-till-i_size policy, but I know
> we've visited this problem before and got bit. I have to go reload
> that context.

There's no hurry from my perspective - I just prefer to understand the root cause of a problem before jumping to a fix.

> Regarding XFS, how do you handle catching the tail of an allocation
> with an lseek(2)'d write? That is, your current allocation has a few
> blocks outside of i_size, then I lseek(2) a gigabyte past EOF and
> write there. The code has to recognize to zero around old_i_size
> before moving out to new_i_size, right? I think that's where our old
> approaches had problems.

xfs_file_aio_write() handles both those cases for us via xfs_zero_eof(). What it does is map the region from the old EOF to the start of the new write and zeroes any allocated blocks that are not marked unwritten that lie within the range. It does this via the internal mapping interface because we hide allocated blocks past EOF from the page cache and higher layers.

FWIW, the way XFS does this is safe against crashes because the inode size does not get updated on disk or in the journal until after the data has hit the disk. Ordered journalling should give you the same guarantee.
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Tue, Jun 29, 2010 at 12:27:57PM +1000, Dave Chinner wrote:
> Just to clarify any possible misunderstanding here, xfs_zero_eof()
> also does its IO through the page cache for similar reasons. It's just
> that the mappings are found via the internal interfaces before the
> zeroing is done via the anonymous pagecache_write_begin()/
> pagecache_write_end() functions (in xfs_iozero()) rather than using
> the generic block functions.

Mark and I discussed this some earlier this evening. I think we might be able to get away cheaper than XFS does. In ocfs2_write_begin_nolock(), we call ocfs2_expand_nonsparse_inode() in the case of older filesystems that don't allow sparse files. That's where we handle the zeroing from old_i_size to pos for those files. In the exact same place, we could probably just detect that we're about to cover the unzeroed allocation and zero it there. This needs some more code eval until we're sure, and then the serious testing happens.

Joel

--
"Nothing is wrong with California that a rise in the ocean level wouldn't cure."
- Ross MacDonald

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> I dunno. Filesystem corruption makes me nervous.
>
> So I'm certainly totally willing to do the revert if that makes ocfs2
> work again. Even if "work again" happens to be partly by mistake, and
> for some reason that isn't obvious.

Filesystem corruption makes me more than nervous. I'm quite devastated by this.

> Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
> have an -rc4 (in a few days - not today) with just the revert, I'm ok
> with that. Even if it's only an "at least no worse than 2.6.34"
> situation rather than a real fix.

I've checked both before this patch and with the patch reverted. We corrupt in both cases. The problem is our assumption about zeroing past i_size. The revert will fix our BUG_ON, but not the corruption. Mark and I have ideas on how to fix the actual bug, but they will take some time and especially testing. We also have some shorter-term ideas on how to paper over the issue. We have to have this fixed by .35.

If -rc4 isn't coming for a couple of days, can we hold off on the decision until we get a chance to think about a paper-over solution for it? Then we can avoid the revert.

Joel

--
"You can use a screwdriver to screw in screws or to clean your ears,
however, the latter needs real skill, determination and a lack of fear
of injuring yourself. It is much the same with JavaScript."
- Chris Heilmann

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Tue, Jun 29, 2010 at 01:16:11AM -0700, Joel Becker wrote:
> On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> > Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer
> > to have an -rc4 (in a few days - not today) with just the revert,
> > I'm ok with that. Even if it's only an "at least no worse than
> > 2.6.34" situation rather than a real fix.
>
> If -rc4 isn't coming for a couple of days, can we hold off on the
> decision until we get a chance to think about a paper-over solution
> for it? Then we can avoid the revert.

Linus,

I'm going to withdraw the revert request for now. Our proposed paper-over solution is too big, and we're just going to focus on the actual fix. This will be for .35. Yes, .35-rc will have a BUG_ON with refcount trees until we get the fix in, but I'd rather avoid the churn when the final .35 should have Dave's patch and our fix.

Joel

--
"The first thing we do, let's kill all the lawyers."
- Henry VI, IV:ii

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.

Hi Joel,

I have no problems with it being reverted - it's really just a WAR for the simplest case of the sync holdoff. However, I had no idea that any filesystem relied on being able to write pages beyond EOF, and I'd like to understand the implications of it on the higher level code and, more importantly, understand how the writes are getting to disk through multiple layers of page-beyond-i_size checks in the writeback code.

> This patch causes any filesystem with an allocation unit larger than
> the filesystem blocksize to leak unzeroed data. During a file extend,
> the entire allocation unit is zeroed.

XFS has this same underlying issue - it can have uninitialised, allocated blocks past EOF that have to be zeroed when extending the file.

> However, this patch prevents the tail blocks of the allocation unit
> from being written back to disk. When the file is next extended,
> i_size will now cover these unzeroed blocks, leaking the old contents
> of the disk to userspace and creating a corrupt file.

XFS doesn't zero blocks at allocation. Instead, XFS zeros the range between the old EOF and the new EOF on each extending write. Hence these pages get written because they fall inside the new i_size that is set during the write. The i_size on disk doesn't get changed until after the data writes have completed, so even on a crash we don't expose uninitialised blocks.

> This affects ocfs2 directly. As Tao Ma mentioned in his reporting
> email:
>
> 1. all the place we use filemap_fdatawrite in ocfs2 doesn't flush
>    pages after i_size now.
> 2. sync, fsync, fdatasync and umount don't flush pages after i_size
>    (they are called from writeback_single_inode).

I'm not sure this was ever supposed to work - my understanding is that we should never do anything with pages beyond i_size, as a page being beyond i_size implies we are racing with a truncate and the page is no longer valid. In that case, we should definitely not write it back to disk.

Looking at ocfs2_writepage(), it simply calls block_write_full_page(), which does:

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
		do_invalidatepage(page, 0);
		unlock_page(page);
		return 0; /* don't care */
	}

i.e. pages beyond EOF get invalidated. If it somehow gets through that check, __block_write_full_page() will avoid writing dirty bufferheads beyond EOF because the write is racing with truncate. Hence there are multiple layers of protection against writing past i_size, so I'm wondering how these pages are even getting to disk in the first place.

> 3. reflink have a BUG_ON triggered because we have some dirty pages
>    while during CoW.
>    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1265

I'd suggest that the reason you see the BUG_ON() with this patch is that the pages beyond EOF are not being invalidated because they are not being passed to ->writepage and hence are remaining dirty in the cache. IOWs, I suspect that this commit has uncovered a bug in ocfs2, not that it has caused a regression.

Your thoughts, Joel?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
>
> Hi Joel,
>
> I have no problems with it being reverted - it's really just a WAR for
> the simplest case of the sync holdoff.

I have to insist that we revert it until we find a way to make ocfs2 work. The rest of the email will discuss the ocfs2 issues therein.

> > This patch causes any filesystem with an allocation unit larger than
> > the filesystem blocksize to leak unzeroed data. During a file
> > extend, the entire allocation unit is zeroed.
>
> XFS has this same underlying issue - it can have uninitialised,
> allocated blocks past EOF that have to be zeroed when extending the
> file.

Does XFS do this in get_blocks()? We deliberately do no allocation in get_blocks(), which is where our need for up-front allocation comes from.

> > However, this patch prevents the tail blocks of the allocation unit
> > from being written back to disk. When the file is next extended,
> > i_size will now cover these unzeroed blocks, leaking the old
> > contents of the disk to userspace and creating a corrupt file.
>
> XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> between the old EOF and the new EOF on each extending write. Hence
> these pages get written because they fall inside the new i_size that
> is set during the write. The i_size on disk doesn't get changed until
> after the data writes have completed, so even on a crash we don't
> expose uninitialised blocks.

We do the same, but we zero the entire allocation. This works both when filling holes and when extending, though obviously the extending is what we're worried about here. We change i_size in write_end, so our guarantee is the same as yours for the page containing i_size.

> Looking at ocfs2_writepage(), it simply calls block_write_full_page(),
> which does:
>
>	/* Is the page fully outside i_size? (truncate in progress) */
>	offset = i_size & (PAGE_CACHE_SIZE-1);
>	if (page->index >= end_index+1 || !offset) {
>		/*
>		 * The page may have dirty, unmapped buffers. For example,
>		 * they may have been added in ext3_writepage(). Make them
>		 * freeable here, so the page does not leak.
>		 */
>		do_invalidatepage(page, 0);
>		unlock_page(page);
>		return 0; /* don't care */
>	}
>
> i.e. pages beyond EOF get invalidated. If it somehow gets through that
> check, __block_write_full_page() will avoid writing dirty bufferheads
> beyond EOF because the write is racing with truncate.

Your contention is that we've never gotten those tail blocks to disk. Instead, our code either handles the future extensions of i_size or we've just gotten lucky with our testing. Our current BUG trigger is because we have a new check that catches this case. Does that summarize your position correctly?

I'm not averse to having a zero-only-till-i_size policy, but I know we've visited this problem before and got bit. I have to go reload that context.

Regarding XFS, how do you handle catching the tail of an allocation with an lseek(2)'d write? That is, your current allocation has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF and write there. The code has to recognize to zero around old_i_size before moving out to new_i_size, right? I think that's where our old approaches had problems.

Joel

--
"The real reason GNU ls is 8-bit-clean is so that they can start
using ISO-8859-1 option characters."
- Christopher Davis (c...@loiosh.kei.com)

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 06:12:35PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2010 at 5:54 PM, Joel Becker <joel.bec...@oracle.com> wrote:
> > Your contention is that we've never gotten those tail blocks to
> > disk. Instead, our code either handles the future extensions of
> > i_size or we've just gotten lucky with our testing. Our current BUG
> > trigger is because we have a new check that catches this case. Does
> > that summarize your position correctly?
>
> Maybe Dave has some more exhaustive answer, but his point that
> block_write_full_page() already just drops the page does seem to be
> very valid. Which makes me suspect that it would be better to remove
> the ocfs2 BUG_ON() as a stop-gap measure, rather than reverting the
> commit. It seems to be true that the "don't bother flushing past EOF"
> commit really just uncovered an older bug.

Well, shit. Something has changed in here, or we're really really (un)lucky. We visited this code a year ago or so when we had serious zeroing problems, and we tested the hell out of it. Now it is broken again. And it sure looks like that block_write_full_page() check has been there since before git.

> So maybe ocfs2 should just replace the bug-on with invalidating the
> page (perhaps with a WARN_ONCE() to make sure the problem doesn't get
> forgotten about?)

Oh, no, that's not it at all. This is a disaster. I can't see for the life of me why we haven't had 100,000 bug reports. You're going to have an ocfs2 patch by the end of the week. It will be ugly, I'm sure of it, but it has to be done. For every extend, we're going to have to zero and potentially CoW around old_i_size if the old allocation isn't within the bounds of the current write.

Joel

--
"In a crisis, don't hide behind anything or anybody. They're going to
find you anyway."
- Paul "Bear" Bryant

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
> > Regarding XFS, how do you handle catching the tail of an allocation
> > with an lseek(2)'d write? That is, your current allocation has a few
> > blocks outside of i_size, then I lseek(2) a gigabyte past EOF and
> > write there. The code has to recognize to zero around old_i_size
> > before moving out to new_i_size, right? I think that's where our old
> > approaches had problems.
>
> xfs_file_aio_write() handles both those cases for us via
> xfs_zero_eof(). What it does is map the region from the old EOF to the
> start of the new write and zeroes any allocated blocks that are not
> marked unwritten that lie within the range. It does this via the
> internal mapping interface because we hide allocated blocks past EOF
> from the page cache and higher layers.

Makes sense as an approach. We deliberately do this through the page cache to take advantage of its I/O patterns and tie in with JBD2. Also, we don't feel like maintaining an entire shadow page cache ;-)

Joel

--
Life's Little Instruction Book #356
"Be there when people need you."

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF
On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker <joel.bec...@oracle.com> wrote:
> Well, shit. Something has changed in here, or we're really really
> (un)lucky. We visited this code a year ago or so when we had serious
> zeroing problems, and we tested the hell out of it. Now it is broken
> again. And it sure looks like that block_write_full_page() check has
> been there since before git.

Hmm. I'm actually starting to worry that we should do the revert after all.

Why? Locking.

That page-writeback.c thing decides to limit the end to the inode size the same way that block_write_full_page() does, but block_write_full_page() holds the page lock, while page-writeback.c does not. Which means that as a race against somebody else doing a truncate(), the two things really are pretty different.

That said, write_cache_pages() obviously doesn't actually invalidate the page (the way block_write_full_page() does), so locking matters a whole lot less for it. If somebody is doing a concurrent truncate or a concurrent write, then for the data to really show up reliably on disk there would obviously have to be a separate sync operation involved, so even with the lack of any locking, it should be safe.

I dunno. Filesystem corruption makes me nervous.

So I'm certainly totally willing to do the revert if that makes ocfs2 work again. Even if "work again" happens to be partly by mistake, and for some reason that isn't obvious.

Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to have an -rc4 (in a few days - not today) with just the revert, I'm ok with that. Even if it's only an "at least no worse than 2.6.34" situation rather than a real fix.

Linus