Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-07-06 Thread Joel Becker
On Mon, Jun 28, 2010 at 06:58:23PM -0700, Joel Becker wrote:
   Oh, no, that's not it at all.  This is a disaster.  I can't see
 for the life of me why we haven't had 100,000 bug reports.  You're going

Btw, we've figured out why we don't have 100,000 bug reports.
Most normal usage will never encounter this.  But it is still a serious
problem, and the fix is just as high a priority.

Joel

P.S.: Thanks, LWN, for making me look good on the QotW ;-)

-- 

I inject pure kryptonite into my brain.
 It improves my kung fu, and it eases the pain.


Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Dave Chinner
On Mon, Jun 28, 2010 at 07:04:20PM -0700, Joel Becker wrote:
 On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
 Regarding XFS, how do you handle catching the tail of an
   allocation with an lseek(2)'d write?  That is, your current allocation
   has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
   and write there.  The code has to recognize to zero around old_i_size
   before moving out to new_i_size, right?  I think that's where our old
   approaches had problems.
  
  xfs_file_aio_write() handles both those cases for us via
  xfs_zero_eof().  What it does is map the region from the old EOF to
  the start of the new write and zeroes any allocated blocks that are
  not marked unwritten that lie within the range. It does this via the
  internal mapping interface because we hide allocated blocks past EOF
  from the page cache and higher layers.
 
   Makes sense as an approach.  We deliberately do this through the
 page cache to take advantage of its I/O patterns and tie in with JBD2.
 Also, we don't feel like maintaining an entire shadow page cache ;-)

Just to clarify any possible misunderstanding here, xfs_zero_eof()
 also does its IO through the page cache for similar reasons.  It's
just the mappings are found via the internal interfaces before the
zeroing is done via the anonymous pagecache_write_begin()/
pagecache_write_end() functions (in xfs_iozero()) rather than using
the generic block functions.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Dave Chinner
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
 On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker joel.bec...@oracle.com wrote:
 
         Well, shit.  Something has changed in here, or we're really
  really (un)lucky.  We visited this code a year ago or so when we had
  serious zeroing problems, and we tested the hell out of it.  Now it is
  broken again.  And it sure looks like that block_write_full_page() check
  has been there since before git.
 
 Hmm. I'm actually starting to worry that we should do the revert after all.
 
 Why? Locking. That page-writeback.c thing decides to limit the end to
 the inode size the same way that block_write_full_page() does, but
 block_write_full_page() holds the page lock, while page-writeback.c
 does not. Which means that as a race against somebody else doing a
 truncate(), the two things really are pretty different.
 
 That said, write_cache_pages() obviously doesn't actually invalidate
 the page (the way block_write_full_page() does), so locking matters a
 whole lot less for it. If somebody is doing a concurrent truncate or a
 concurrent write, then for the data to really show up reliably on disk
 there would obviously have to be a separate sync operation involved,
 so even with the lack of any locking, it should be safe.

Yes, that is the premise on which the stop @ EOF code in
write_cache_pages() is based. It's simply a snapshot of the EOF when
the data integrity sync starts; any subsequent extensions that happen
after the sync has started are not something we have to worry about
for this sync operation.
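The snapshot semantics Dave describes can be sketched roughly as follows (a simplified model, not the actual mm/page-writeback.c code; PAGE_SHIFT and the helper name are illustrative):

```c
/* Simplified model of the commit under discussion: for a data-integrity
 * sync, the scan is limited to pages covering the i_size sampled once
 * when the sync starts, so later extensions are ignored by this sync. */
#define PAGE_SHIFT 12

static int page_in_sync_range(unsigned long page_index,
                              unsigned long long i_size_at_start)
{
    unsigned long end_index;

    if (i_size_at_start == 0)
        return 0;  /* empty file: no pages inside EOF */
    /* index of the last page containing bytes inside i_size */
    end_index = (unsigned long)((i_size_at_start - 1) >> PAGE_SHIFT);
    return page_index <= end_index;
}
```

Pages wholly beyond the snapshotted EOF are simply skipped by the integrity scan, which is exactly why ocfs2's dirty-beyond-EOF pages stopped being written.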

OTOH, if there is a concurrent truncation while the loop is
operating, then the existing checks for truncation after locking the
page _must_ be sufficient to avoid writeback of such truncated pages,
otherwise truncate would already be broken.

 I dunno. Filesystem corruption makes me nervous.

You're not alone in that feeling. :/

FWIW, it's taken us quite a long while (years) to iron out all of
the known sync+crash bugs in XFS, and as a result of that process we
have a fair number of regression tests that tell us quickly when
sync has been broken. This testing hasn't indicated any problems
with XFS, so I'm fairly confident the change is safe.

That said, 

 So I'm certainly
 totally willing to do the revert if that makes ocfs2 work again. Even
 if "work again" happens to be partly by mistake, and for some reason
 that isn't obvious.
 
 Your call, I guess.  If any ocfs2 fix looks scary, and you'd prefer to
 have an -rc4 (in a few days - not today) with just the revert, I'm ok
 with that. Even if it's only an "at least no worse than 2.6.34"
 situation rather than a real fix.

... I agree with this.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Dave Chinner
On Mon, Jun 28, 2010 at 05:54:04PM -0700, Joel Becker wrote:
 On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
  On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
   This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
  
  Hi Joel,
  
  I have no problems with it being reverted - it's really just a
  WAR for the simplest case of the sync holdoff.
 
   I have to insist that we revert it until we find a way to make
 ocfs2 work.  The rest of the email will discuss the ocfs2 issues
 therein.
 
   This patch causes any filesystem with an allocation unit larger than the
   filesystem blocksize to leak unzeroed data. During a file extend, the
   entire allocation unit is zeroed.
  
  XFS has this same underlying issue - it can have uninitialised,
  allocated blocks past EOF that have to be zeroed when extending the
  file.
 
   Does XFS do this in get_blocks()?  We deliberately do no
 allocation in get_blocks(), which is where our need for up-front
 allocation comes from.

No, it does it in xfs_file_aio_write() (i.e. ->aio_write()) so it
catches both buffered and direct IO.  This can't be done in the
get_blocks() callback because (IMO) there really isn't the context
available to know exactly how we are extending the file in
get_blocks().


   However, this patch prevents the tail
   blocks of the allocation unit from being written back to disk.  When the
   file is next extended, i_size will now cover these unzeroed blocks,
   leaking the old contents of the disk to userspace and creating a corrupt
   file.
  
  XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
  between the old EOF and the new EOF on each extending write. Hence
  these pages get written because they fall inside the new i_size that
  is set during the write.  The i_size on disk doesn't get changed
  until after the data writes have completed, so even on a crash we
  don't expose uninitialised blocks.
 
   We do the same, but we zero the entire allocation.  This works
 both when filling holes and when extending, though obviously the
 extending is what we're worried about here.  We change i_size in
 write_end, so our guarantee is the same as yours for the page containing
 i_size.

Ok, so you've got cached pages covering the file and the tail of
the last page/block zeroed in memory. I'd guess that ordered mode
journalling then ensures the inode size update doesn't hit the disk
until after the data does, so this is crash-safe. That would explain
(to me, at least) why you are not seeing stale data exposure on
crashes.

  Looking at ocfs2_writepage(), it simply calls
  block_write_full_page(), which does:
  
  /* Is the page fully outside i_size? (truncate in progress) */
  offset = i_size & (PAGE_CACHE_SIZE-1);
  if (page->index >= end_index+1 || !offset) {
          /*
           * The page may have dirty, unmapped buffers.  For example,
           * they may have been added in ext3_writepage().  Make them
           * freeable here, so the page does not leak.
           */
          do_invalidatepage(page, 0);
          unlock_page(page);
          return 0; /* don't care */
  }
  
  i.e. pages beyond EOF get invalidated.  If it somehow gets through
  that check, __block_write_full_page() will avoid writing dirty
  bufferheads beyond EOF because the write is racing with truncate.
 
   Your contention is that we've never gotten those tail blocks to
 disk.  Instead, our code either handles the future extensions of i_size
 or we've just gotten lucky with our testing.  Our current BUG trigger is
 because we have a new check that catches this case.  Does that summarize
 your position correctly?

Yes, that summarises it pretty well ;)

   I'm not averse to having a zero-only-till-i_size policy, but I
 know we've visited this problem before and got bit.  I have to go reload
 that context.

There's no hurry from my perspective - I just prefer to understand
the root cause of a problem before jumping to conclusions.

   Regarding XFS, how do you handle catching the tail of an
 allocation with an lseek(2)'d write?  That is, your current allocation
 has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
 and write there.  The code has to recognize to zero around old_i_size
 before moving out to new_i_size, right?  I think that's where our old
 approaches had problems.

xfs_file_aio_write() handles both those cases for us via
xfs_zero_eof().  What it does is map the region from the old EOF to
the start of the new write and zeroes any allocated blocks that are
not marked unwritten that lie within the range. It does this via the
internal mapping interface because we hide allocated blocks past EOF
from the page cache and higher layers.
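In other words, of the blocks between the old EOF and the start of the new write, only those that are allocated and already marked written need explicit zeroing; holes and unwritten extents read back as zeros anyway. A minimal sketch (the enum and helper are illustrative, not XFS's actual mapping types):

```c
/* Illustrative block states; XFS's real extent/mapping types differ. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/* A block in [old EOF, new write start) needs explicit zeroing only if
 * it is allocated and not marked unwritten: holes and unwritten extents
 * already read back as zeros, so they can be skipped. */
static int block_needs_zeroing(enum blk_state st)
{
    return st == BLK_WRITTEN;
}
```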

FWIW, the way XFS does this is safe against crashes because the
inode size does not get updated on disk or in the journal until
after the data has hit the disk. Ordered journalling should 

Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Joel Becker
On Tue, Jun 29, 2010 at 12:27:57PM +1000, Dave Chinner wrote:
 Just to clarify any possible misunderstanding here, xfs_zero_eof()
 also does its IO through the page cache for similar reasons.  It's
 just the mappings are found via the internal interfaces before the
 zeroing is done via the anonymous pagecache_write_begin()/
 pagecache_write_end() functions (in xfs_iozero()) rather than using
 the generic block functions.

Mark and I discussed this some earlier this evening.  I think we
might be able to get away cheaper than XFS does.  In
ocfs2_write_begin_nolock(), we call ocfs2_expand_nonsparse_inode() in
the case of older filesystems that don't allow sparse files.  That's
where we handle the zeroing from old_i_size to pos for those files.
In the exact same place, we could probably just detect that we're
about to cover the unzeroed allocation and zero it there.  This needs
some more code evaluation until we're sure, and then the serious
testing happens.

Joel

-- 

Nothing is wrong with California that a rise in the ocean level
 wouldn't cure.
- Ross MacDonald

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Joel Becker
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
 I dunno. Filesystem corruption makes me nervous. So I'm certainly
 totally willing to do the revert if that makes ocfs2 work again. Even
 if "work again" happens to be partly by mistake, and for some reason
 that isn't obvious.

Filesystem corruption makes me more than nervous.  I'm quite
devastated by this.

 Your call, I guess.  If any ocfs2 fix looks scary, and you'd prefer to
 have an -rc4 (in a few days - not today) with just the revert, I'm ok
 with that. Even if it's only an "at least no worse than 2.6.34"
 situation rather than a real fix.

I've checked both before this patch and with the patch reverted.
We corrupt in both cases.  The problem is our assumption about zeroing
past i_size.  The revert will fix our BUG_ON, but not the corruption.
Mark and I have ideas on how to fix the actual bug, but they
will take some time and especially testing.  We also have some
shorter-term ideas on how to paper over the issue.  We have to have
this fixed by .35.
If -rc4 isn't coming for a couple of days, can we hold off on
the decision until we get a chance to think about a paper-over solution
for it?  Then we can avoid the revert.

Joel

-- 

You can use a screwdriver to screw in screws or to clean your ears,
however, the latter needs real skill, determination and a lack of fear
of injuring yourself.  It is much the same with JavaScript.
- Chris Heilmann

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-29 Thread Joel Becker
On Tue, Jun 29, 2010 at 01:16:11AM -0700, Joel Becker wrote:
 On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
  Your call, I guess.  If any ocfs2 fix looks scary, and you'd prefer to
  have an -rc4 (in a few days - not today) with just the revert, I'm ok
  with that. Even if it's only an "at least no worse than 2.6.34"
  situation rather than a real fix.
 
   If -rc4 isn't coming for a couple of days, can we hold off on
 the decision until we get a chance to think about a paper-over solution
 for it?  Then we can avoid the revert.

Linus,
I'm going to withdraw the revert request for now.  Our proposed
paper-over solution is too big, and we're just going to focus on the
actual fix.  This will be for .35.  Yes, .35-rc will have a BUG_ON with
refcount trees until we get the fix in, but I'd rather avoid the churn
when the final .35 should have Dave's patch and our fix.

Joel

-- 

The first thing we do, let's kill all the lawyers.
-Henry VI, IV:ii

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-28 Thread Dave Chinner
On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
 This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.

Hi Joel,

I have no problems with it being reverted - it's really just a
WAR for the simplest case of the sync holdoff.

However, I had no idea that any filesystem relied on being able to
write pages beyond EOF, and I'd like to understand the implications
of it on the higher level code and, more importantly, understand how
the writes are getting to disk through multiple layers of
page-beyond-i_size checks in the writeback code

 This patch causes any filesystem with an allocation unit larger than the
 filesystem blocksize to leak unzeroed data. During a file extend, the
 entire allocation unit is zeroed.

XFS has this same underlying issue - it can have uninitialised,
allocated blocks past EOF that have to be zeroed when extending the
file.

 However, this patch prevents the tail
 blocks of the allocation unit from being written back to disk.  When the
 file is next extended, i_size will now cover these unzeroed blocks,
 leaking the old contents of the disk to userspace and creating a corrupt
 file.

XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
between the old EOF and the new EOF on each extending write. Hence
these pages get written because they fall inside the new i_size that
is set during the write.  The i_size on disk doesn't get changed
until after the data writes have completed, so even on a crash we
don't expose uninitialised blocks.
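The range involved in that extend-time zeroing can be sketched as follows (an illustrative helper, not the actual xfs_zero_eof() signature):

```c
/* Rough model: which byte range must be zeroed on an extending write.
 * XFS zeros from the old EOF up to the start of the new write; blocks
 * inside the write itself are covered by the write's own data. */
struct zero_range { unsigned long long start, end; };

static int range_to_zero(unsigned long long old_isize,
                         unsigned long long write_pos,
                         struct zero_range *r)
{
    if (write_pos <= old_isize)
        return 0;              /* not an extending write: nothing to zero */
    r->start = old_isize;      /* zero from the old EOF ...            */
    r->end = write_pos;        /* ... up to the start of the new write */
    return 1;
}
```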

 This affects ocfs2 directly.  As Tao Ma mentioned in his reporting
 email:
 
 1. all the place we use filemap_fdatawrite in ocfs2 doesn't flush pages
 after i_size now.
 2. sync, fsync, fdatasync and umount don't flush pages after i_size(they
 are called from writeback_single_inode).

I'm not sure this was ever supposed to work - my understanding is
that we should never do anything with pages beyond i_size, because a
page being beyond EOF implies we are racing with a truncate and the
page is no longer valid. In that case, we should definitely not
write it back to disk.

Looking at ocfs2_writepage(), it simply calls
block_write_full_page(), which does:

/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (page->index >= end_index+1 || !offset) {
        /*
         * The page may have dirty, unmapped buffers.  For example,
         * they may have been added in ext3_writepage().  Make them
         * freeable here, so the page does not leak.
         */
        do_invalidatepage(page, 0);
        unlock_page(page);
        return 0; /* don't care */
}

i.e. pages beyond EOF get invalidated.  If it somehow gets through
that check, __block_write_full_page() will avoid writing dirty
bufferheads beyond EOF because the write is racing with truncate.

Hence there are multiple layers of protection against writing past
i_size, so I'm wondering how these pages are even getting to disk in
the first place

 3. reflink have a BUG_ON triggered because we have some dirty pages
 while during CoW. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1265

I'd suggest that the reason you see the BUG_ON() with this patch is
that the pages beyond EOF are not being invalidated because they are
not being passed to ->writepage and hence are remaining dirty in the
cache.  IOWs, I suspect that this commit has uncovered a bug in
ocfs2, not that it has caused a regression.

Your thoughts, Joel?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-28 Thread Joel Becker
On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
 On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
  This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
 
 Hi Joel,
 
 I have no problems with it being reverted - it's really just a
 WAR for the simplest case of the sync holdoff.

I have to insist that we revert it until we find a way to make
ocfs2 work.  The rest of the email will discuss the ocfs2 issues
therein.

  This patch causes any filesystem with an allocation unit larger than the
  filesystem blocksize to leak unzeroed data. During a file extend, the
  entire allocation unit is zeroed.
 
 XFS has this same underlying issue - it can have uninitialised,
 allocated blocks past EOF that have to be zeroed when extending the
 file.

Does XFS do this in get_blocks()?  We deliberately do no
allocation in get_blocks(), which is where our need for up-front
allocation comes from.

  However, this patch prevents the tail
  blocks of the allocation unit from being written back to disk.  When the
  file is next extended, i_size will now cover these unzeroed blocks,
  leaking the old contents of the disk to userspace and creating a corrupt
  file.
 
 XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
 between the old EOF and the new EOF on each extending write. Hence
 these pages get written because they fall inside the new i_size that
 is set during the write.  The i_size on disk doesn't get changed
 until after the data writes have completed, so even on a crash we
 don't expose uninitialised blocks.

We do the same, but we zero the entire allocation.  This works
both when filling holes and when extending, though obviously the
extending is what we're worried about here.  We change i_size in
write_end, so our guarantee is the same as yours for the page containing
i_size.
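The "zero the entire allocation" behaviour amounts to rounding the zeroed region out to the cluster (allocation unit) boundary instead of stopping at the end of the write. A rough sketch, assuming a power-of-two cluster size (the helper name is illustrative, not ocfs2's actual code):

```c
/* Round the end of the zeroed region out to the allocation-unit
 * (cluster) boundary, so the whole cluster is zeroed even though the
 * write, and hence i_size, may stop partway into it.
 * Assumes cluster_size is a power of two. */
static unsigned long long zeroed_end(unsigned long long write_end,
                                     unsigned long long cluster_size)
{
    return (write_end + cluster_size - 1) & ~(cluster_size - 1);
}
```

The blocks between the new i_size and this rounded-up end are exactly the "tail blocks" under discussion: zeroed in the page cache but now beyond EOF at writeback time.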

 Looking at ocfs2_writepage(), it simply calls
 block_write_full_page(), which does:
 
   /* Is the page fully outside i_size? (truncate in progress) */
   offset = i_size & (PAGE_CACHE_SIZE-1);
   if (page->index >= end_index+1 || !offset) {
           /*
            * The page may have dirty, unmapped buffers.  For example,
            * they may have been added in ext3_writepage().  Make them
            * freeable here, so the page does not leak.
            */
           do_invalidatepage(page, 0);
           unlock_page(page);
           return 0; /* don't care */
   }
 
 i.e. pages beyond EOF get invalidated.  If it somehow gets through
 that check, __block_write_full_page() will avoid writing dirty
 bufferheads beyond EOF because the write is racing with truncate.

Your contention is that we've never gotten those tail blocks to
disk.  Instead, our code either handles the future extensions of i_size
or we've just gotten lucky with our testing.  Our current BUG trigger is
because we have a new check that catches this case.  Does that summarize
your position correctly?
I'm not averse to having a zero-only-till-i_size policy, but I
know we've visited this problem before and got bit.  I have to go reload
that context.
Regarding XFS, how do you handle catching the tail of an
allocation with an lseek(2)'d write?  That is, your current allocation
has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
and write there.  The code has to recognize to zero around old_i_size
before moving out to new_i_size, right?  I think that's where our old
approaches had problems.

Joel

-- 

The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters.
- Christopher Davis (c...@loiosh.kei.com)

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-28 Thread Joel Becker
On Mon, Jun 28, 2010 at 06:12:35PM -0700, Linus Torvalds wrote:
 On Mon, Jun 28, 2010 at 5:54 PM, Joel Becker joel.bec...@oracle.com wrote:
         Your contention is that we've never gotten those tail blocks to
  disk.  Instead, our code either handles the future extensions of i_size
  or we've just gotten lucky with our testing.  Our current BUG trigger is
  because we have a new check that catches this case.  Does that summarize
  your position correctly?
 
 Maybe Dave has some more exhaustive answer, but his point that
 block_write_full_page() already just drops the page does seem to be
 very valid.  Which makes me suspect that it would be better to remove
 the ocfs2 BUG_ON() as a stop-gap measure, rather than reverting the
 commit. It seems to be true that the "don't bother flushing past EOF"
 commit really just uncovered an older bug.

Well, shit.  Something has changed in here, or we're really
really (un)lucky.  We visited this code a year ago or so when we had
serious zeroing problems, and we tested the hell out of it.  Now it is
broken again.  And it sure looks like that block_write_full_page() check
has been there since before git.

 So maybe ocfs2 should just replace the bug-on with invalidating the
 page (perhaps with a WARN_ONCE() to make sure the problem doesn't get
 forgotten about?)

Oh, no, that's not it at all.  This is a disaster.  I can't see
for the life of me why we haven't had 100,000 bug reports.  You're going
to have an ocfs2 patch by the end of the week.  It will be ugly, I'm
sure of it, but it has to be done.  For every extend, we're going to
have to zero and potentially CoW around old_i_size if the old allocation
isn't within the bounds of the current write.
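The extend-time decision Joel describes can be modelled as: the old tail needs zeroing (and possibly CoW) whenever the file is being extended but the current write does not itself overwrite the region around old_i_size. A hypothetical predicate, purely illustrative and not ocfs2 code:

```c
/* Does this extending write leave the stale tail around old_i_size
 * untouched, so that it must be zeroed (and potentially CoW'd)
 * explicitly before i_size moves past it? Purely illustrative. */
static int old_tail_needs_fixing(unsigned long long old_i_size,
                                 unsigned long long write_start,
                                 unsigned long long write_end)
{
    /* extending the file, but the write begins beyond the old EOF,
     * so nothing in this write overwrites the old tail region */
    return write_end > old_i_size && write_start > old_i_size;
}
```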

Joel

-- 

In a crisis, don't hide behind anything or anybody. They're going
 to find you anyway.
- Paul Bear Bryant

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-28 Thread Joel Becker
On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
  Regarding XFS, how do you handle catching the tail of an
  allocation with an lseek(2)'d write?  That is, your current allocation
  has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
  and write there.  The code has to recognize to zero around old_i_size
  before moving out to new_i_size, right?  I think that's where our old
  approaches had problems.
 
 xfs_file_aio_write() handles both those cases for us via
 xfs_zero_eof().  What it does is map the region from the old EOF to
 the start of the new write and zeroes any allocated blocks that are
 not marked unwritten that lie within the range. It does this via the
 internal mapping interface because we hide allocated blocks past EOF
 from the page cache and higher layers.

Makes sense as an approach.  We deliberately do this through the
page cache to take advantage of its I/O patterns and tie in with JBD2.
Also, we don't feel like maintaining an entire shadow page cache ;-)

Joel

-- 

Life's Little Instruction Book #356

Be there when people need you.

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Ocfs2-devel] [PATCH] Revert writeback: limit write_cache_pages integrity scanning to current EOF

2010-06-28 Thread Linus Torvalds
On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker joel.bec...@oracle.com wrote:

        Well, shit.  Something has changed in here, or we're really
 really (un)lucky.  We visited this code a year ago or so when we had
 serious zeroing problems, and we tested the hell out of it.  Now it is
 broken again.  And it sure looks like that block_write_full_page() check
 has been there since before git.

Hmm. I'm actually starting to worry that we should do the revert after all.

Why? Locking. That page-writeback.c thing decides to limit the end to
the inode size the same way that block_write_full_page() does, but
block_write_full_page() holds the page lock, while page-writeback.c
does not. Which means that as a race against somebody else doing a
truncate(), the two things really are pretty different.

That said, write_cache_pages() obviously doesn't actually invalidate
the page (the way block_write_full_page() does), so locking matters a
whole lot less for it. If somebody is doing a concurrent truncate or a
concurrent write, then for the data to really show up reliably on disk
there would obviously have to be a separate sync operation involved,
so even with the lack of any locking, it should be safe.

I dunno. Filesystem corruption makes me nervous. So I'm certainly
totally willing to do the revert if that makes ocfs2 work again. Even
if "work again" happens to be partly by mistake, and for some reason
that isn't obvious.

Your call, I guess.  If any ocfs2 fix looks scary, and you'd prefer to
have an -rc4 (in a few days - not today) with just the revert, I'm ok
with that. Even if it's only an "at least no worse than 2.6.34"
situation rather than a real fix.

 Linus
