Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Sun, Feb 04, 2007 at 05:40:35PM +, Anton Altaparmakov wrote:
 On Sun, 4 Feb 2007, Andrew Morton wrote:
  truncate's OK: we're holding i_mutex.
 
 How about excluding readpage() (in addition to truncate if Nick is right  
 and some cases of truncate do not hold i_mutex) with an extra page flag as
 I proposed for truncate exclusion?  Then it would not matter that
 prepare_write might have allocated blocks and might expose stale data.
 It would go to sleep and wait on the bit to be cleared instead of trying  
 to bring the page uptodate.  It can then lock the page and either find it 
 uptodate (because commit_write did it) or not and then bring it uptodate.
 
 Then we could safely fault in the page, copy from it into a temporary 
 page, then lock the destination page again and copy into it.
 
 This is getting more involved as a patch again...  )-:  But at least it   
 does not affect the common case except for having to check the new page 
 flag in every readpage() and truncate() call.  But at least the checks 
 could be with an if (unlikely(newpageflag())) so should not be too bad.
 
 Have I missed anything this time?

Yes. If you have a flag to exclude readpage(), then you must also
exclude filemap_nopage, in which case it is still deadlocky.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
 On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
  On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
  
   They're not likely to hit the deadlocks, either. The deadlocks become more
   likely after my patch to lock the page in the fault path. But practically,
   we could live without that too, because the data corruption it fixes is
   very rare as well. Which is exactly what we've been doing quite happily
   for most of 2.6, including all distro kernels (I think).
  
  Thing is, an application which is relying on the contents of that page is
  already unreliable (or really peculiar), because it can get indeterminate
  results anyway.
 
 Not necessarily -- they could read from one part of a page and write to
 another. I see this as the biggest data corruption problem.

And in fact, it is not just transient errors either. This problem can
add permanent corruption into the pagecache and onto disk, and it doesn't
even require two processes to race.

After zeroing out the uncopied part of the page, and attempting to loop
again, we might bail out of the loop for any reason before completing the
rest of the copy, leaving the pagecache corrupted, which will soon go out
to disk.

Nick


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Andrew Morton
On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
  On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
   On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
   
They're not likely to hit the deadlocks, either. The deadlocks become more
likely after my patch to lock the page in the fault path. But practically,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).
   
   Thing is, an application which is relying on the contents of that page is
   already unreliable (or really peculiar), because it can get indeterminate
   results anyway.
  
  Not necessarily -- they could read from one part of a page and write to
  another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

 And in fact, it is not just transient errors either. This problem can
 add permanent corruption into the pagecache and onto disk, and it doesn't
 even require two processes to race.
 
 After zeroing out the uncopied part of the page, and attempting to loop
 again, we might bail out of the loop for any reason before completing the
 rest of the copy, leaving the pagecache corrupted, which will soon go out
 to disk.
 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

lock_page()
prepare_write()
if (copy_from_user_atomic()) {
	readpage()
	wait_on_page()
	lock_page()
}
commit_write()
unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.
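
As a rough sketch (all names here are illustrative, not from any posted
patch), idea #42 amounts to something like:

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

/*
 * Probe the source with an atomic get_user() while holding the pagecache
 * page lock, before prepare_write() and the real copy.  If the probe
 * faults, drop the lock so the caller can fault the source in and retry.
 */
static int probe_source_under_page_lock(struct page *page,
					const char __user *buf)
{
	char c;
	int ret;

	lock_page(page);

	pagefault_disable();
	ret = __get_user(c, buf);	/* cannot sleep or recurse into the fs */
	pagefault_enable();

	if (ret) {
		/*
		 * Source is not resident -- possibly because it is this very
		 * page and someone unmapped it.  Back off instead of
		 * deadlocking in the fault handler.
		 */
		unlock_page(page);
		return -EAGAIN;	/* caller faults the source in and retries */
	}

	/*
	 * Source is resident, and if it is this same page it cannot be
	 * unmapped while we hold the page lock.  prepare_write(), the
	 * usercopy and commit_write() would follow here.
	 */
	return 0;
}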


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implementations which you've been working on?  I forget...


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Mon, Feb 05, 2007 at 09:30:06PM -0800, Andrew Morton wrote:
 On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
   
   Not necessarily -- they could read from one part of a page and write to
   another. I see this as the biggest data corruption problem.
 
 The kernel gets that sort of thing wrong anyway, and always has, because
 it uses memcpy()-style copying and not memmove()-style.
 
 I can't imagine what sort of application you're envisaging here.  The
 problem was only ever observed from userspace by an artificial stress-test
 thing.

No, I'm not talking about writing into a page with memory from the same
page. I'm talking about one process writing to part of a file, and another
reading from that same file (different offset).

If they happen to be covered by the same page, then the reader can see
zeroes.

I'm not envisaging any sort of application, all I know is that there are
several (related) data corruption bugs and I'm trying to fix them (and
fix these deadlock problems without introducing more).

  And in fact, it is not just transient errors either. This problem can
  add permanent corruption into the pagecache and onto disk, and it doesn't
  even require two processes to race.
  
  After zeroing out the uncopied part of the page, and attempting to loop
  again, we might bail out of the loop for any reason before completing the
  rest of the copy, leaving the pagecache corrupted, which will soon go out
  to disk.
  
 
 Only because ->commit_write() went and incorrectly marked parts of the page
 as up-to-date.
 
 Zeroing out the fag end of the copy_from_user() on fault is actually 
 incorrect. 

Yes, I know.

 What we _should_ do is to bring those uncopyable, non-uptodate parts of the
 page uptodate rather than zeroing them.  ->readpage() does that.
 
 So...  what happens if we do
 
   lock_page()
   prepare_write()
   if (copy_from_user_atomic()) {
   	readpage()
   	wait_on_page()
   	lock_page()
   }
   commit_write()
   unlock_page()
 
 - If the page has no buffers then it is either fully uptodate or fully
   not uptodate.  In the former case, don't call readpage at all.  In the
   latter case, readpage() is the correct thing to do.
 
 - If the page had buffers, then readpage() won't touch the uptodate ones
   and will bring the non-uptodate ones up to date from disk.
 
   Some of the data which we copied from userspace may get overwritten
   from backing store, but that's OK.
 
 seems crazy, but it's close.  We do have the minor problem that readpage
 went and unlocked the page so we need to relock it.  I bet there are holes
 in there.

Yes, I tried doing this as well and there are holes in it. Even supposing
that we add a readpage_dontunlock, there is still the issue of breaking
the filesystem API from nesting readpage inside prepare_write. You also
do need to zero newly allocated blocks, for example.

 Idea #42: after we've locked the pagecache page, do an atomic get_user()
 against the source page(s) before attempting the copy_from_user().  If that
 faults, don't run prepare_write or anything else: drop the page lock and
 try again.
 
 Because
 
 - If the get_user() faults, it might be because the page we're copying
   from and to is the same page, and someone went and unmapped it: deadlock.
 
 - If the get_user() doesn't fault, and if we're copying from and to the
   same page, we know that we've locked it, so nobody will be able to unmap
   it while we're copying from it.
 
 Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
 scenario.

Yes I considered this too. Hard isn't it?

 btw, to fix the writev() performance problem we may need to go off and run
 get_user() against up to 1024 separate user pages before locking the
 pagecache page, which sounds fairly idiotic.  Are you doing that in the
 implementations which you've been working on?  I forget...

No, because in my fix it can do non-atomic usercopies for !uptodate pages.

For uptodate pages, yes there is a possibility that it may do a short copy
and have to retry, but it is probably safe to bet that the source data is
fairly recently accessed in most cases, so a short copy will be unlikely.
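
To make that retry concrete, the uptodate-page path looks roughly like this
(a simplified sketch with illustrative names, assuming fault_in_pages_readable()
reports access errors as required by point 1a of the patch description):

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

/*
 * Atomic usercopy into an already-uptodate pagecache page, retrying after
 * faulting the source back in when the copy comes up short.
 */
static size_t copy_into_uptodate_page(struct page *page, unsigned long offset,
				      const char __user *buf, size_t bytes)
{
	char *kaddr;
	size_t copied;

	do {
		/* Best effort fault-in, done without the page lock held. */
		if (fault_in_pages_readable(buf, bytes))
			return 0;	/* the source really is unreadable */

		lock_page(page);
		/* ->prepare_write() would run here. */

		pagefault_disable();
		kaddr = kmap_atomic(page, KM_USER0);
		copied = bytes - __copy_from_user_inatomic(kaddr + offset,
							   buf, bytes);
		kunmap_atomic(kaddr, KM_USER0);
		pagefault_enable();
		flush_dcache_page(page);

		/*
		 * The page is already uptodate, so ->commit_write() for the
		 * full range is safe even if the copy was short; no stale
		 * data can leak.
		 */
		unlock_page(page);
	} while (unlikely(copied != bytes));

	return copied;
}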


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Tue, Feb 06, 2007 at 06:49:05AM +0100, Nick Piggin wrote:
  - If the get_user() doesn't fault, and if we're copying from and to the
same page, we know that we've locked it, so nobody will be able to unmap
it while we're copying from it.
  
  Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
  scenario.
 
 Yes I considered this too. Hard isn't it?

BTW, there are two different abba deadlocks. It's all documented in the
patch 9/9 description.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
 On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
 wrote:
 
  2.  If we find the destination page is non uptodate, unlock it (this could be
  made slightly more optimal), then find and pin the source page with
  get_user_pages. Relock the destination page and continue with the copy.
  However, instead of a usercopy (which might take a fault), copy the data
  via the kernel address space.
 
 argh.  We just can't go adding all this gunk into the write() path. 
 
 mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
 Even single-process write() will suffer, let alone multithreaded stuff,
 where mmap_sem contention may be the bigger problem.

The write path is broken. I prefer my kernels slow rather than buggy.

As I said, I'm working on a replacement API so that the filesystems
that care, can be correct *and* fast.

 I was going to do some quick measurements of this, but the code oopses
 on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

Cool, a kernel thread is calling sys_write. Fun.

I guess I should be able to reproduce this if using initramfs. Thanks.

 There's a build error in filemap_xip.c btw.
 
 
 
 We need to think different.
 
 What happened to the idea of doing an atomic copy into the non-uptodate
 page and handling it somehow?

That was my second idea. I didn't get any feedback on that patchset
except to try this method, so I assume everyone hated it.

I actually liked it, because it didn't have to do the writev
segment-at-a-time for !uptodate pages like this one does. Considering
this code gets called from mm-less contexts, maybe I'll have to go back
to this approach.

 Another option might be to effectively pin the whole mm during the copy:
 
   down_read(&current->mm->unpaging_lock);
   get_user(addr); /* Fault the page in */
   ...
   copy_from_user()
   up_read(&current->mm->unpaging_lock);
 
 then, anyone who wants to unmap pages from this mm requires
 write_lock(unpaging_lock).  So we know the results of that get_user()
 cannot be undone.

Fugly. Don't know whether there are any lock order problems making it
hard to implement, but you introduce the theoretical memory deadlock
where a task cannot reclaim its own memory.

 Or perhaps something like this can be done on a per-vma basis.  Just
 something to tell the VM hey, you're not allowed to unmap this page right
 now?

Same memory deadlock problem.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Christoph Hellwig
On Sun, Feb 04, 2007 at 11:15:29AM +0100, Nick Piggin wrote:
 Cool, a kernel thread is calling sys_write. Fun.

There are tons of places where we possibly call into ->write from
either kernel threads or at least with a kernel pointer and set_fs/set_ds
magic.  Anything in the buffer write path that tries to touch page tables
can't work.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 11:46:09AM +0100, Nick Piggin wrote:
 
  It's better than taking mmap_sem and walking pagetables...
 
 I'm not convinced.

Though I am more convinced that looking at mm *at all* (either to
take the mmap_sem and find the vma, or to take the mmap_sem and
run get_user_pages) is going to hurt.

We'd have to special case kernel threads, which don't even have an
mm, let alone the vmas... too ugly.

I'll revert to my temporary-page approach: at least that will
fix the problem.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
  On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
  
   The write path is broken. I prefer my kernels slow rather than buggy.
  
  That won't fly.
 
 What won't fly?

I suspect the performance cost of this approach would force us to redo it
all.

What happened to the idea of doing an atomic copy into the non-uptodate
page and handling it somehow?
   
   That was my second idea.
  
  Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
  practical because of the games we needed to play with ->commit_write.
 
 Maybe I misunderstood what you meant, above.

The original set of half-written patches I sent you.  Do an atomic
copy_from_user() inside the page lock and if that fails, zero out the
remainder of the page, run commit_write() and then redo the whole thing.

 I have an alternative fix
 where a temporary page is allocated if the write encounters a non
 uptodate page. The usercopy then goes into that page, and from there
 into the target page after we have opened the prepare_write().

Remember that a non-uptodate page is the common case.

 My *first* idea to fix this was to do the atomic copy into a non-uptodate
 page and then calling a zero-length commit_write if it failed. I pretty
 carefully constructed all these good arguments as to why each case works
 properly, but in the end it just didn't fly because it broke lots of
 filesystems.

I forget the details now.  I think we did have a workable-looking solution
based on the atomic copy_from_user() but it would have re-exposed the old
problem wherein a page would fleetingly have a bunch of zeroes in the
middle of it, if someone looked at it during the write.

If that recollection is right, I think we could afford to reintroduce that
problem, frankly.  Especially as it only happens in the incredibly rare
case of that get_user()ed page getting unmapped under our feet.

   but you introduce the theoretical memory deadlock
   where a task cannot reclaim its own memory.
  
  Nah, that'll never happen - both pages are already allocated.
 
 Both pages? I don't get it.
 
 You set the don't-reclaim vma flag, then run get_user, which takes a
 page fault and potentially has to allocate N pages for pagetables,
 pagecache readahead, buffers and fs private data and pagecache radix
 tree nodes for all of the pages read in.

Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
rwsem.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 10:59:58 + (GMT) Anton Altaparmakov [EMAIL PROTECTED] 
wrote:

 On Sun, 4 Feb 2007, Andrew Morton wrote:
  On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
  wrote:
   2.  If we find the destination page is non uptodate, unlock it (this could
   be made slightly more optimal), then find and pin the source page with
   get_user_pages. Relock the destination page and continue with the copy.
   However, instead of a usercopy (which might take a fault), copy the data
   via the kernel address space.
  
  argh.  We just can't go adding all this gunk into the write() path. 
  
  mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
   Even single-process write() will suffer, let alone multithreaded stuff,
  where mmap_sem contention may be the bigger problem.
  
  I was going to do some quick measurements of this, but the code oopses
  on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
  
  We need to think different.
 
 How about leaving the existing code with the following minor 
 modifications:
 
 Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
 bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
 would do (could of course move this into a helper function):
 
 pagefault_disable()
 kaddr = kmap_atomic(page, KM_USER0);
 left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 kunmap_atomic(kaddr, KM_USER0);
 pagefault_enable()
 
 if (unlikely(left)) {
   /* The user space page got unmapped before we got to copy it. */
   ...
 }
 
 Thus the 99.999% (or more!) of the time the code would just work as it 
 always has and there is no bug and no speed impact.  Only in the very rare 
 and hard to trigger race condition that the user space page after being 
 faulted in got thrown out again before we did the atomic memory copy do we 
 run into the above ... code path.

Right.  And what I wanted to do here is to zero out the uncopied part of
the page (if it wasn't uptodate), then run commit_write(), then retry the
whole thing.

iirc, we ruled that out because those temporary zeroes are exposed to
userspace.  But the kernel used to do that anyway for a long time (years)
until someone noticed, and we'll only do it in your 0.0001% case anyway.

(Actually, perhaps we can prevent it by not marking the page uptodate in
this case.  But that'll cause a read()er to try to bring it uptodate...)
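
In code, the zero-and-retry idea is roughly this (sketch only, with
illustrative names; the temporary zeroes are exactly what makes it ugly):

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/*
 * Atomic copy under the page lock; if it comes up short, zero the uncopied
 * tail so the page contains no uninitialised data, and return a short count
 * so the caller commits, faults the source in and retries.
 */
static size_t atomic_copy_or_zero(struct page *page, unsigned long offset,
				  const char __user *buf, size_t bytes)
{
	char *kaddr;
	size_t left;

	kaddr = kmap_atomic(page, KM_USER0);
	pagefault_disable();
	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
	pagefault_enable();
	if (left) {
		/* Temporary zeroes, briefly visible to concurrent readers. */
		memset(kaddr + offset + (bytes - left), 0, left);
	}
	kunmap_atomic(kaddr, KM_USER0);
	flush_dcache_page(page);

	return bytes - left;	/* caller runs ->commit_write(), then retries */
}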

 I would propose to call out a different function altogether which could do 
 a multitude of things including drop the lock on the destination page 
 (maintaining a reference on the page!), allocate a temporary page, copy 
 from the user space page into the temporary page, then lock the 
 destination page again, and copy from the temporary page into the 
 destination page.

The problem with all these things is that as soon as we unlock the page,
it's visible to read().  And in fact, as soon as we mark it uptodate it's
visible to mmap, even if it's still locked.

 This would be slow but who cares given it would only happen incredibly 
 rarely and on majority of machines it would never happen at all.
 
 The only potential problem I can see is that the destination page could be 
 truncated whilst it is unlocked.  I can see two possible solutions to 
 this:

truncate's OK: we're holding i_mutex.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
 On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
 
  On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
   On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
   
   If that recollection is right, I think we could afford to reintroduce that
   problem, frankly.  Especially as it only happens in the incredibly rare
   case of that get_user()ed page getting unmapped under our feet.
  
  Dang. I was hoping to fix it without introducing data corruption.
 
 Well.  It's a compromise.  Being practical about it, I really doubt that
 anyone will hit this combination of circumstances.

They're not likely to hit the deadlocks, either. The deadlocks become more
likely after my patch to lock the page in the fault path. But practically,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).

But (sadly) for me, there is no compromise. I may be known as the clown
who tries outlandish things to shave a few atomic ops and interrupt flag
changes in the page allocator, or make the pagecache lockless. However
I can't be happy even making something faster if correctness is < 100.0%,
even if less likely than hardware failure.

  but you introduce the theoretical memory deadlock
  where a task cannot reclaim its own memory.
 
 Nah, that'll never happen - both pages are already allocated.

Both pages? I don't get it.

You set the don't-reclaim vma flag, then run get_user, which takes a
page fault and potentially has to allocate N pages for pagetables,
pagecache readahead, buffers and fs private data and pagecache radix
tree nodes for all of the pages read in.
   
   Oh, OK.  Need to do the get_user() twice then.  Once before taking that
   new rwsem.
  
  Race condition remains.
 
 No, not in a million years.

There is a huge window. Think about what other activity will be holding
that very rwsem for write, that you'll have to wait for in the race window.
But you could say that's also a question of practicality, because it is
pretty unlikely to do anything bad even if you do hit the race.

Well if fixing this is just going to be flat-out vetoed on performance
reasons then I can't argue, because it does impact performance.

Thinking about the numbers, if your kernel's reliability is already the
same order of magnitude as reliability of commodity hardware, then
trading a bit of performance for a bit of reliability is a BAD tradeoff
if you are at all interested in performance on commodity hardware. That
is especially true if you have a massive redundant cluster of commodity
systems, which I understand is a fairly big market for Linux. The 
X-nines guys who would disagree are probably a tiny niche for Linux.

So I do understand your argument for practicality, even if I don't agree.

For the record, here is the temporary page fix that should fix it
properly. And some numbers.

Nick

--

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.  prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6.  mmap_sem(r)
7.   handle_mm_fault
8.  lock_page (filemap_nopage)
9.  commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.  get_user_pages
e. handle_mm_fault
f.  lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish between
non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
from the pinned temporary page via the kernel address space.
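
As a rough illustration of step 2 (a sketch only, with illustrative names,
assuming the copy does not cross a page boundary), the non-uptodate path
becomes:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

/*
 * Copy from userspace into a throwaway lowmem page with no locks held,
 * then copy into the locked pagecache page through kernel addresses so
 * that no pagefault can happen under the page lock.
 */
static int copy_via_temporary_page(struct page *page, unsigned long offset,
				   const char __user *buf, size_t bytes)
{
	struct page *tmp;
	char *src, *dst;

	tmp = alloc_page(GFP_KERNEL);		/* pagecache page not locked yet */
	if (!tmp)
		return -ENOMEM;

	/* This usercopy may fault and sleep; that is fine here. */
	if (copy_from_user(page_address(tmp), buf, bytes)) {
		__free_page(tmp);
		return -EFAULT;
	}

	lock_page(page);
	/* ... ->prepare_write() would run here ... */

	src = page_address(tmp);		/* lowmem, no kmap needed */
	dst = kmap_atomic(page, KM_USER0);
	memcpy(dst + offset, src, bytes);	/* cannot fault: kernel pages */
	kunmap_atomic(dst, KM_USER0);
	flush_dcache_page(page);

	/* ... ->commit_write() would run here ... */
	unlock_page(page);

	__free_page(tmp);
	return 0;
}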

(also, rename 

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Anton Altaparmakov
On Sun, 4 Feb 2007, Andrew Morton wrote:
 On Sun, 4 Feb 2007 10:59:58 + (GMT) Anton Altaparmakov [EMAIL 
 PROTECTED] wrote:
  On Sun, 4 Feb 2007, Andrew Morton wrote:
   On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
   wrote:
2.  If we find the destination page is non uptodate, unlock it (this could
be made slightly more optimal), then find and pin the source page with
get_user_pages. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
via the kernel address space.
   
   argh.  We just can't go adding all this gunk into the write() path. 
   
   mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every 
   page. 
    Even single-process write() will suffer, let alone multithreaded stuff,
   where mmap_sem contention may be the bigger problem.
   
   I was going to do some quick measurements of this, but the code oopses
   on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
   
   We need to think different.
  
  How about leaving the existing code with the following minor 
  modifications:
  
  Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
  bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
  would do (could of course move this into a helper function):
  
  pagefault_disable()
  kaddr = kmap_atomic(page, KM_USER0);
  left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
  kunmap_atomic(kaddr, KM_USER0);
  pagefault_enable()
  
  if (unlikely(left)) {
  /* The user space page got unmapped before we got to copy it. */
  ...
  }
  
  Thus the 99.999% (or more!) of the time the code would just work as it 
  always has and there is no bug and no speed impact.  Only in the very rare 
  and hard to trigger race condition that the user space page after being 
  faulted in got thrown out again before we did the atomic memory copy do we 
  run into the above ... code path.
 
 Right.  And what I wanted to do here is to zero out the uncopied part of
 the page (if it wasn't uptodate), then run commit_write(), then retry the
 whole thing.
 
 iirc, we ruled that out because those temporary zeroes are exposed to
 userspace.  But the kernel used to do that anyway for a long time (years)
 until someone noticed, and we'll only do it in your 0.0001% case anyway.
 
 (Actually, perhaps we can prevent it by not marking the page uptodate in
 this case.  But that'll cause a read()er to try to bring it uptodate...)

My thinking was not marking the page uptodate.  But yes that causes the 
problem of a concurrent readpage reading uninitialized disk blocks that 
prepare_write allocated.

  I would propose to call out a different function altogether which could do 
  a multitude of things including drop the lock on the destination page 
  (maintaining a reference on the page!), allocate a temporary page, copy 
  from the user space page into the temporary page, then lock the 
  destination page again, and copy from the temporary page into the 
  destination page.
 
 The problem with all these things is that as soon as we unlock the page,
 it's visible to read().  And in fact, as soon as we mark it uptodate it's
 visible to mmap, even if it's still locked.

Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...

  This would be slow but who cares given it would only happen incredibly 
  rarely and on majority of machines it would never happen at all.
  
  The only potential problem I can see is that the destination page could be 
  truncated whilst it is unlocked.  I can see two possible solutions to 
  this:
 
 truncate's OK: we're holding i_mutex.

How about excluding readpage() (in addition to truncate if Nick is right  
and some cases of truncate do not hold i_mutex) with an extra page flag as
I proposed for truncate exclusion?  Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.
It would go to sleep and wait on the bit to be cleared instead of trying  
to bring the page uptodate.  It can then lock the page and either find it 
uptodate (because commit_write did it) or not and then bring it uptodate.

Then we could safely fault in the page, copy from it into a temporary 
page, then lock the destination page again and copy into it.

This is getting more involved as a patch again...  )-:  But at least it   
does not affect the common case except for having to check the new page 
flag in every readpage() and truncate() call.  But at least the checks 
could be with an if (unlikely(newpageflag())) so should not be too bad.

Have I missed anything this time?

Best regards,

Anton
-- 
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-02 Thread Andrew Morton
On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin [EMAIL PROTECTED] wrote:

 Modify the core write() code so that it won't take a pagefault while holding a
 lock on the pagecache page. There are a number of different deadlocks possible
 if we try to do such a thing:
 
 1.  generic_buffered_write
 2.   lock_page
 3.  prepare_write
 4. unlock_page+vmtruncate
 5. copy_from_user
 6.  mmap_sem(r)
 7.   handle_mm_fault
 8.  lock_page (filemap_nopage)
 9.  commit_write
 10.  unlock_page
 
 a. sys_munmap / sys_mlock / others
 b.  mmap_sem(w)
 c.   make_pages_present
 d.  get_user_pages
 e. handle_mm_fault
 f.  lock_page (filemap_nopage)
 
 2,8   - recursive deadlock if page is same
 2,8;2,8   - ABBA deadlock if page is different
 2,6;b,f   - ABBA deadlock if page is same
 
 The solution is as follows:
 1.  If we find the destination page is uptodate, continue as normal, but use
 atomic usercopies which do not take pagefaults and do not zero the uncopied
 tail of the destination. The destination is already uptodate, so we can
 commit_write the full length even if there was a partial copy: it does not
 matter that the tail was not modified, because if it is dirtied and written
 back to disk it will not cause any problems (uptodate *means* that the
 destination page is as new or newer than the copy on disk).
 
 1a. The above requires that fault_in_pages_readable correctly returns access
 information, because atomic usercopies cannot distinguish between
 non-present pages in a readable mapping, from lack of a readable mapping.
 
 2.  If we find the destination page is non uptodate, unlock it (this could be
 made slightly more optimal), then find and pin the source page with
 get_user_pages. Relock the destination page and continue with the copy.
 However, instead of a usercopy (which might take a fault), copy the data
 via the kernel address space.
 

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case.  We've always tried to avoid doing a pte-walk in the write() path to
fix this bug.  Careful performance testing is needed here so we can assess
the impact.  For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases.  For example, we don't yet have a page to run page_mapped()
against.

   break;
   }
  
 +	/*
 +	 * non-uptodate pages cannot cope with short copies, and we
 +	 * cannot take a pagefault with the destination page locked.
 +	 * So pin the source page to copy it.
 +	 */
 +	if (!PageUptodate(page)) {
 +		unlock_page(page);
 +
 +		bytes = min(bytes, PAGE_CACHE_SIZE -
 +			 ((unsigned long)buf & ~PAGE_CACHE_MASK));
 +
 +		/*
 +		 * Cannot get_user_pages with a page locked for the
 +		 * same reason as we can't take a page fault with a
 +		 * page locked (as explained below).
 +		 */
 +		down_read(&current->mm->mmap_sem);
 +		status = get_user_pages(current, current->mm,
 +				(unsigned long)buf & PAGE_CACHE_MASK, 1,
 +				0, 0, &src_page, NULL);
 +		up_read(&current->mm->mmap_sem);
 +		if (status != 1) {
 +			page_cache_release(page);
 +			break;
 +		}
 +
 +		lock_page(page);
 +		if (!page->mapping) {

Hopefully this can't happen?  If it can, who went and took our page off the
mapping?  Reclaim?  The elevated page_count will prevent that?

 +			unlock_page(page);
 +			page_cache_release(page);
 +			page_cache_release(src_page);
 +			continue;
 +		}
 +
 +	}
 +
 	status = a_ops->prepare_write(file, page, offset, offset+bytes);
 	if (unlikely(status))
 		goto fs_write_aop_error;
  
 -	copied = filemap_copy_from_user(page, offset,
 +	if (!src_page) {
 +		/*
 +		 * Must not enter the pagefault handler here, because
 +		 * we hold the page lock, so we might recursively
 +		 * deadlock on the same lock, or get an ABBA deadlock
 +		 * against a different lock, or against the mmap_sem
 +		 * (which nests outside the page lock).  So increment
 +		 * preempt count, and use _atomic usercopies.
 +		 *
 +

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-02 Thread Nick Piggin
On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
 On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
 Nick Piggin [EMAIL PROTECTED] wrote:
 
  Modify the core write() code so that it won't take a pagefault while holding
  a lock on the pagecache page. There are a number of different deadlocks
  possible if we try to do such a thing:
  
  1.  generic_buffered_write
  2.   lock_page
  3.  prepare_write
  4. unlock_page+vmtruncate
  5. copy_from_user
  6.  mmap_sem(r)
  7.   handle_mm_fault
  8.  lock_page (filemap_nopage)
  9.  commit_write
  10.  unlock_page
  
  a. sys_munmap / sys_mlock / others
  b.  mmap_sem(w)
  c.   make_pages_present
  d.  get_user_pages
  e. handle_mm_fault
  f.  lock_page (filemap_nopage)
  
  2,8 - recursive deadlock if page is same
  2,8;2,8 - ABBA deadlock if page is different
  2,6;b,f - ABBA deadlock if page is same
  
  The solution is as follows:
  1.  If we find the destination page is uptodate, continue as normal, but use
  atomic usercopies which do not take pagefaults and do not zero the uncopied
  tail of the destination. The destination is already uptodate, so we can
  commit_write the full length even if there was a partial copy: it does not
  matter that the tail was not modified, because if it is dirtied and written
  back to disk it will not cause any problems (uptodate *means* that the
  destination page is as new or newer than the copy on disk).
  
  1a. The above requires that fault_in_pages_readable correctly returns access
  information, because atomic usercopies cannot distinguish between
  non-present pages in a readable mapping, from lack of a readable mapping.
  
  2.  If we find the destination page is non uptodate, unlock it (this could be
  made slightly more optimal), then find and pin the source page with
  get_user_pages. Relock the destination page and continue with the copy.
  However, instead of a usercopy (which might take a fault), copy the data
  via the kernel address space.
  
 
 Oh what a mess we're making :(
 
 Unfortunately, write() into a non-uptodate page is very much the common
 case.  We've always tried to avoid doing a pte-walk in the write() path to
 fix this bug.  Careful performance testing is needed here so we can assess
 the impact.  For threaded applications, simply the taking of mmap_sem might
 be the biggest problem.
 
 And I can't think of any tricks we can play to avoid doing the pte-walk in
 most cases.  For example, we don't yet have a page to run page_mapped()
 against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems, so if we do it this way instead,
then filesystem developers who actually care enough can improve their
code, and those that don't won't hold them back (or prevent this bug from
being fixed).

  break;
  }
   
  +	/*
  +	 * non-uptodate pages cannot cope with short copies, and we
  +	 * cannot take a pagefault with the destination page locked.
  +	 * So pin the source page to copy it.
  +	 */
  +	if (!PageUptodate(page)) {
  +		unlock_page(page);
  +
  +		bytes = min(bytes, PAGE_CACHE_SIZE -
  +			 ((unsigned long)buf & ~PAGE_CACHE_MASK));
  +
  +		/*
  +		 * Cannot get_user_pages with a page locked for the
  +		 * same reason as we can't take a page fault with a
  +		 * page locked (as explained below).
  +		 */
  +		down_read(&current->mm->mmap_sem);
  +		status = get_user_pages(current, current->mm,
  +				(unsigned long)buf & PAGE_CACHE_MASK, 1,
  +				0, 0, &src_page, NULL);
  +		up_read(&current->mm->mmap_sem);
  +		if (status != 1) {
  +			page_cache_release(page);
  +			break;
  +		}
  +
  +		lock_page(page);
  +		if (!page->mapping) {
 
 Hopefully this can't happen?  If it can, who went and took our page off the
 mapping?  Reclaim?  The elevated page_count will prevent that?

Truncate/invalidate?

  +			unlock_page(page);
  +			page_cache_release(page);
  +			page_cache_release(src_page);
  +			continue;
  +		}
  +
  +	}
  +
  	status = a_ops->prepare_write(file, page, offset, offset+bytes);
  	if (unlikely(status))
  		goto fs_write_aop_error;
   
  

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-01-29 Thread Nick Piggin
On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
 +	} else {
 +		char *src, *dst;
 +		src = kmap(src_page);
 +		dst = kmap(page);
 +		memcpy(dst + offset,
 +			src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
 +			bytes);
 +		kunmap(page);
 +		kunmap(src_page);
 +		copied = bytes;
 +	}
   flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.
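
With atomic kmaps the same hunk would look roughly like this (sketch only;
the memcpy cannot fault, so atomic mappings are safe here, unmapped in
reverse order):

 +	} else {
 +		char *src, *dst;
 +		src = kmap_atomic(src_page, KM_USER0);
 +		dst = kmap_atomic(page, KM_USER1);
 +		memcpy(dst + offset,
 +			src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
 +			bytes);
 +		kunmap_atomic(dst, KM_USER1);
 +		kunmap_atomic(src, KM_USER0);
 +		copied = bytes;
 +	}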