Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

Al Viro:
> Different ->i_mutex; you are holding one on the parent directory already.

Let me make sure. In your scenario,
- processA writes something into the union, and the unioning fs operates
  the writable layer. After sb_start_write() succeeds, processA should
  not block by the reason of fsfreeze.
- processC causes the copyup.
  The current aufs implementation holds parent->i_mutex on the writable
  layer during the copyup. The parent->i_mutex can make processA
  blocking.

Now I am considering the copyup approach you suggested in another mail,
and I am going to replace your
"unlink the target, re-link later, no dir lock during copyup"
by
"make it hidden instead of unlinking, rename the correct name
later, no dir lock during copyup"
since I am not sure all FSs can operate "->link with the unlinked
one". I guess most FS can handle it, but I don't want to make sure
everything particulary remote fs, journals.

To make a file "hidden", I guess I can use the aufs "doubley whiteouted"
approach. As you might know, aufs prepends the ".wh." prefix to the
filename as whiteout. With one more prefix, the name loses the role of
whiteout. Aufs simply ignores such doubly whiteouted name.
The demerit is that aufs has to limit the length of the name.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Sat, Mar 23, 2013 at 11:49:11AM +0900, J. R. Okajima wrote:
> 
> Al Viro:
> > The scenario, BTW, looks so:
> > process A does sb_start_write() (on your upper layer)
> > process B tries to freeze said upper layer and blocks, waiting for A to 
> > finish
> > process C grabs ->i_mutex in your upper layer
> > process C does vfs_write(), which blocks, since there's a pending attempt to
> > freeze
> > process A tries to grab ->i_mutex held by C and blocks
> 
> According to latest mm/filemap.c:generic_file_aio_write(),
>   sb_start_write(inode->i_sb);
>   mutex_lock(>i_mutex);
>   ret = __generic_file_aio_write(iocb, iov, nr_segs, >ki_pos);
>   mutex_unlock(>i_mutex);
>   :::
>   sb_end_write(inode->i_sb);
> 
> Process C would block *BEFORE* i_mutex by sb_start_write()? No?

Different ->i_mutex; you are holding one on the parent directory already.

That's the problem - you have ->i_mutex nested both inside that sucker (as
it ought to) and outside.  Which tends to do bad things, obviously, in
particular because something like mkdir(2) will do sb_start_write() (from
mnt_want_write(), called by kern_path_create()) before grabbing directory
->i_mutex.

Thus the activity with lifting the bastard out of ->aio_write(), etc. in
vfs.git#experimental - *any* union-like variant will need the ability to
pull sb_start_write() outside of locking the parent directory on copyup.
And yes, it's a common prerequisite to anything doing copyups - aufs is
in the same boat as overlayfs and unionmount.  Same deadlock for all three
of them.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

Al Viro:
> The scenario, BTW, looks so:
> process A does sb_start_write() (on your upper layer)
> process B tries to freeze said upper layer and blocks, waiting for A to finish
> process C grabs ->i_mutex in your upper layer
> process C does vfs_write(), which blocks, since there's a pending attempt to
> freeze
> process A tries to grab ->i_mutex held by C and blocks

According to latest mm/filemap.c:generic_file_aio_write(),
sb_start_write(inode->i_sb);
mutex_lock(>i_mutex);
ret = __generic_file_aio_write(iocb, iov, nr_segs, >ki_pos);
mutex_unlock(>i_mutex);
:::
sb_end_write(inode->i_sb);

Process C would block *BEFORE* i_mutex by sb_start_write()? No?
Honestly speaking I didn't pay attention about the freeze feature since
I don't use it. I am making aufs to support it now. But I don't know how
to test it...


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Fri, Mar 22, 2013 at 06:11:11PM +, Al Viro wrote:
> On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
> > 
> > David Howells:
> > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> > > might operate, so it's possible that this is a false alarm.  Maybe Jan 
> > > Kara can
> > > illuminate further, so I've added him to the cc list.
> > 
> > It is related to the design of UnionMount, isn't it?
> > UnionMount is not a filesystem and doen't have its own superblock.
> > If it was a fs, then
> > - vfs_truncate() acquires sb_writers for the unioning-fs.
> > - the unioning-fs may call vfs_truncate() again for the underlying fs.
> > - this time, sb_writers is for the underlying fs which is a different
> >   sb_writers object from the already acquired one.
> > So there would be no deadlock.
> 
> Doesn't help the situation with copyup - witness overlayfs stepping into the
> same deadlock on copyup.  It wants ->i_mutex held on directory in upper layer
> and it tries to write to file it has created in there.  The problem is
> with the upper layer superblock getting frozen; having a separate one for
> union is irrelevant.  Let me check how aufs does...  Aha.  Your
> au_do_copy_file() ends up calling vfs_write() on the file opened in
> upper layer.  And AFAICS it's called with ->i_mutex held on the directory
> in upper layer, so you've got the same deadlock, sorry.

The scenario, BTW, looks so:
process A does sb_start_write() (on your upper layer)
process B tries to freeze said upper layer and blocks, waiting for A to finish
process C grabs ->i_mutex in your upper layer
process C does vfs_write(), which blocks, since there's a pending attempt to
freeze
process A tries to grab ->i_mutex held by C and blocks
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
> 
> David Howells:
> > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> > might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
> > can
> > illuminate further, so I've added him to the cc list.
> 
> It is related to the design of UnionMount, isn't it?
> UnionMount is not a filesystem and doen't have its own superblock.
> If it was a fs, then
> - vfs_truncate() acquires sb_writers for the unioning-fs.
> - the unioning-fs may call vfs_truncate() again for the underlying fs.
> - this time, sb_writers is for the underlying fs which is a different
>   sb_writers object from the already acquired one.
> So there would be no deadlock.

Doesn't help the situation with copyup - witness overlayfs stepping into the
same deadlock on copyup.  It wants ->i_mutex held on directory in upper layer
and it tries to write to file it has created in there.  The problem is
with the upper layer superblock getting frozen; having a separate one for
union is irrelevant.  Let me check how aufs does...  Aha.  Your
au_do_copy_file() ends up calling vfs_write() on the file opened in
upper layer.  And AFAICS it's called with ->i_mutex held on the directory
in upper layer, so you've got the same deadlock, sorry.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

David Howells:
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
> can
> illuminate further, so I've added him to the cc list.

It is related to the design of UnionMount, isn't it?
UnionMount is not a filesystem and doen't have its own superblock.
If it was a fs, then
- vfs_truncate() acquires sb_writers for the unioning-fs.
- the unioning-fs may call vfs_truncate() again for the underlying fs.
- this time, sb_writers is for the underlying fs which is a different
  sb_writers object from the already acquired one.
So there would be no deadlock.

Still lockdep will produce the message since sb_writers doesn't have the
lockdep class. Of course, we can introduce the lock class for it, or call
lockdep_off()/on() simply in order to stop the message. But, as long as
the unioning feature is not implemented as a fs, the solution will not
be so easy. I am afraid UnionMount will need to introduce a new counter
(or a new flag) to indicate the task entered the union, and adjust the
lock class or decide to call lockdep_off() for sb_writers. I don't think
it is a good idea.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

David Howells:
 Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
 might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
 can
 illuminate further, so I've added him to the cc list.

It is related to the design of UnionMount, isn't it?
UnionMount is not a filesystem and doen't have its own superblock.
If it was a fs, then
- vfs_truncate() acquires sb_writers for the unioning-fs.
- the unioning-fs may call vfs_truncate() again for the underlying fs.
- this time, sb_writers is for the underlying fs which is a different
  sb_writers object from the already acquired one.
So there would be no deadlock.

Still lockdep will produce the message since sb_writers doesn't have the
lockdep class. Of course, we can introduce the lock class for it, or call
lockdep_off()/on() simply in order to stop the message. But, as long as
the unioning feature is not implemented as a fs, the solution will not
be so easy. I am afraid UnionMount will need to introduce a new counter
(or a new flag) to indicate the task entered the union, and adjust the
lock class or decide to call lockdep_off() for sb_writers. I don't think
it is a good idea.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
 
 David Howells:
  Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
  might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
  can
  illuminate further, so I've added him to the cc list.
 
 It is related to the design of UnionMount, isn't it?
 UnionMount is not a filesystem and doen't have its own superblock.
 If it was a fs, then
 - vfs_truncate() acquires sb_writers for the unioning-fs.
 - the unioning-fs may call vfs_truncate() again for the underlying fs.
 - this time, sb_writers is for the underlying fs which is a different
   sb_writers object from the already acquired one.
 So there would be no deadlock.

Doesn't help the situation with copyup - witness overlayfs stepping into the
same deadlock on copyup.  It wants -i_mutex held on directory in upper layer
and it tries to write to file it has created in there.  The problem is
with the upper layer superblock getting frozen; having a separate one for
union is irrelevant.  Let me check how aufs does...  Aha.  Your
au_do_copy_file() ends up calling vfs_write() on the file opened in
upper layer.  And AFAICS it's called with -i_mutex held on the directory
in upper layer, so you've got the same deadlock, sorry.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Fri, Mar 22, 2013 at 06:11:11PM +, Al Viro wrote:
 On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
  
  David Howells:
   Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
   might operate, so it's possible that this is a false alarm.  Maybe Jan 
   Kara can
   illuminate further, so I've added him to the cc list.
  
  It is related to the design of UnionMount, isn't it?
  UnionMount is not a filesystem and doen't have its own superblock.
  If it was a fs, then
  - vfs_truncate() acquires sb_writers for the unioning-fs.
  - the unioning-fs may call vfs_truncate() again for the underlying fs.
  - this time, sb_writers is for the underlying fs which is a different
sb_writers object from the already acquired one.
  So there would be no deadlock.
 
 Doesn't help the situation with copyup - witness overlayfs stepping into the
 same deadlock on copyup.  It wants -i_mutex held on directory in upper layer
 and it tries to write to file it has created in there.  The problem is
 with the upper layer superblock getting frozen; having a separate one for
 union is irrelevant.  Let me check how aufs does...  Aha.  Your
 au_do_copy_file() ends up calling vfs_write() on the file opened in
 upper layer.  And AFAICS it's called with -i_mutex held on the directory
 in upper layer, so you've got the same deadlock, sorry.

The scenario, BTW, looks so:
process A does sb_start_write() (on your upper layer)
process B tries to freeze said upper layer and blocks, waiting for A to finish
process C grabs -i_mutex in your upper layer
process C does vfs_write(), which blocks, since there's a pending attempt to
freeze
process A tries to grab -i_mutex held by C and blocks
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

Al Viro:
 The scenario, BTW, looks so:
 process A does sb_start_write() (on your upper layer)
 process B tries to freeze said upper layer and blocks, waiting for A to finish
 process C grabs -i_mutex in your upper layer
 process C does vfs_write(), which blocks, since there's a pending attempt to
 freeze
 process A tries to grab -i_mutex held by C and blocks

According to latest mm/filemap.c:generic_file_aio_write(),
sb_start_write(inode-i_sb);
mutex_lock(inode-i_mutex);
ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb-ki_pos);
mutex_unlock(inode-i_mutex);
:::
sb_end_write(inode-i_sb);

Process C would block *BEFORE* i_mutex by sb_start_write()? No?
Honestly speaking I didn't pay attention about the freeze feature since
I don't use it. I am making aufs to support it now. But I don't know how
to test it...


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread Al Viro
On Sat, Mar 23, 2013 at 11:49:11AM +0900, J. R. Okajima wrote:
 
 Al Viro:
  The scenario, BTW, looks so:
  process A does sb_start_write() (on your upper layer)
  process B tries to freeze said upper layer and blocks, waiting for A to 
  finish
  process C grabs -i_mutex in your upper layer
  process C does vfs_write(), which blocks, since there's a pending attempt to
  freeze
  process A tries to grab -i_mutex held by C and blocks
 
 According to latest mm/filemap.c:generic_file_aio_write(),
   sb_start_write(inode-i_sb);
   mutex_lock(inode-i_mutex);
   ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb-ki_pos);
   mutex_unlock(inode-i_mutex);
   :::
   sb_end_write(inode-i_sb);
 
 Process C would block *BEFORE* i_mutex by sb_start_write()? No?

Different -i_mutex; you are holding one on the parent directory already.

That's the problem - you have -i_mutex nested both inside that sucker (as
it ought to) and outside.  Which tends to do bad things, obviously, in
particular because something like mkdir(2) will do sb_start_write() (from
mnt_want_write(), called by kern_path_create()) before grabbing directory
-i_mutex.

Thus the activity with lifting the bastard out of -aio_write(), etc. in
vfs.git#experimental - *any* union-like variant will need the ability to
pull sb_start_write() outside of locking the parent directory on copyup.
And yes, it's a common prerequisite to anything doing copyups - aufs is
in the same boat as overlayfs and unionmount.  Same deadlock for all three
of them.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-22 Thread J. R. Okajima

Al Viro:
 Different -i_mutex; you are holding one on the parent directory already.

Let me make sure. In your scenario,
- processA writes something into the union, and the unioning fs operates
  the writable layer. After sb_start_write() succeeds, processA should
  not block by the reason of fsfreeze.
- processC causes the copyup.
  The current aufs implementation holds parent-i_mutex on the writable
  layer during the copyup. The parent-i_mutex can make processA
  blocking.

Now I am considering the copyup approach you suggested in another mail,
and I am going to replace your
unlink the target, re-link later, no dir lock during copyup
by
make it hidden instead of unlinking, rename the correct name
later, no dir lock during copyup
since I am not sure all FSs can operate -link with the unlinked
one. I guess most FS can handle it, but I don't want to make sure
everything particulary remote fs, journals.

To make a file hidden, I guess I can use the aufs doubley whiteouted
approach. As you might know, aufs prepends the .wh. prefix to the
filename as whiteout. With one more prefix, the name loses the role of
whiteout. Aufs simply ignores such doubly whiteouted name.
The demerit is that aufs has to limit the length of the name.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Jan Kara
On Wed 20-03-13 21:48:13, Al Viro wrote:
> On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
> > > do_bio_filebacked(), with some ugliness between that and callsite.  Note,
> > > BTW, that we have a pair of possible vfs_fsync() calls in there; how do 
> > > those
> > > interact with freeze?
> >   Freezing code takes care that all dirty data is synced before fs is
> > frozen and no new dirty data can be created before fs is thawed. So
> > vfs_fsync() should just return without doing anything on frozen filesystem.
> 
> Um...  How does it interact with vfs_fsync() already in progress when you
> ask to freeze it?
  So the exact sequence of freezing is:
sb->s_writers.frozen = SB_FREEZE_WRITE;
smp_wmb();
sb_wait_write(sb, SB_FREEZE_WRITE);
Now there are no processes in sb_start_write() - sb_end_write() section.
Then we do the same for SB_FREEZE_PAGEFAULT. After this noone should be
able to dirty a page or inode. Writeback or vfs_fsync() may be still
running (so fs can be creating new transactions in the journal for
writeback etc.).
sync_filesystem(sb);
After this there should be no dirty data so although we can still be
somewhere inside vfs_fsync() it should have nothing to do.

Now we freeze to state SB_FREEZE_FS (nop for ext4, but for XFS it may
interact e.g. with inode reclaim trimming preallocated blocks) and we are
done.
 
> Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write;
> on that pathway it's in the do_splice_from() (see vfs.git#experimental).
> 
> ... and now, for something *really* nasty: where do mandatory file locks
> belong in the locking hierarchy?  Relative to fsfreeze one, for starters,
> but both for unionmount and overlayfs we need to decide where they live
> relative to ->i_mutex on directories.
  Hum, interesting question :). Relative to fsfreeze, it doesn't seem to
matter much, does it? We don't actually lock / unlock these from write
paths needing freeze protection, we only wait for them. But maybe I miss
some ugly case you have in mind.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Al Viro
On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
> > do_bio_filebacked(), with some ugliness between that and callsite.  Note,
> > BTW, that we have a pair of possible vfs_fsync() calls in there; how do 
> > those
> > interact with freeze?
>   Freezing code takes care that all dirty data is synced before fs is
> frozen and no new dirty data can be created before fs is thawed. So
> vfs_fsync() should just return without doing anything on frozen filesystem.

Um...  How does it interact with vfs_fsync() already in progress when you
ask to freeze it?

Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write;
on that pathway it's in the do_splice_from() (see vfs.git#experimental).

... and now, for something *really* nasty: where do mandatory file locks
belong in the locking hierarchy?  Relative to fsfreeze one, for starters,
but both for unionmount and overlayfs we need to decide where they live
relative to ->i_mutex on directories.

And that, BTW, may be the strongest argument so far in favour of the scheme
I'd suggested for copyup-via-opened-but-unlinked...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Jan Kara
On Wed 20-03-13 02:33:08, Al Viro wrote:
> On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote:
> 
> > OK, it's going to be an interesting series - aforementioned tentative patch
> > was badly incomplete ;-/
> 
> The interesting question is how far do we want to lift that.  ->aio_write()
> part is trivial - see vfs.git#experimental; the trouble begins with
> ->splice_write().  For *everything* except default_file_splice_write(),
> lifting into the caller (do_splice_from()) is the right thing to do.
> 
> default_file_splice_write(), however, it trickier; there we end up calling
> vfs_write() (via an ugly callchain).  And _that_ is a real bitch.  Granted,
> vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
> and access_ok() is pointless due to set_fs() we do around vfs_write()
> there) and we'd already lifted it up to do_sync_write().  But if we lift
> it any further, we'll need to deal with ->write() callers in the tree.
> Current situation:
> 
> fs/coredump.c:662:  return access_ok(VERIFY_READ, addr, nr) && 
> file->f_op->write(file, addr, nr, >f_pos) == nr;
> arch/powerpc/platforms/cell/spufs/coredump.c:63:written = 
> file->f_op->write(file, addr, nr, >f_pos);
> 
> for these guys we might actually want to lift all way up to do_coredump()
> 
> drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, 
> buf, count, >f_pos);
> fs/autofs4/waitq.c:73: (wr = 
> file->f_op->write(file,data,bytes,>f_pos)) > 0) {
> 
> not regular files, unless I'm seriously misreading the code.
> 
> kernel/acct.c:553:  file->f_op->write(file, (char *),
> 
> BTW, this is probably where we want to deal with your acct deadlock.
> 
> fs/compat.c:1103:   fn = (io_fn_t)file->f_op->write;
> fs/read_write.c:435:ret = file->f_op->write(file, buf, 
> count, pos);
> fs/read_write.c:732:fn = (io_fn_t)file->f_op->write;
> 
> syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
> compat_writev, or actually take it further.
> 
> fs/cachefiles/rdwr.c:967:   ret = file->f_op->write(
> 
> cachefiles_write_page(); no fucking idea what locks might be held by caller
> and potentially that's a rather nasty source of PITA
> 
> fs/coda/file.c:84:  ret = host_file->f_op->write(host_file, buf, count, 
> ppos);
> 
> coda writing to file in cache on local fs.  Potentially a nasty bugger, since
> it's hard to lift any further - the caller has no idea that the thing is
> on CODA, let alone what happens to hold the local cache.
> 
> drivers/block/loop.c:234:   bw = file->f_op->write(file, buf, len, );
> 
> do_bio_filebacked(), with some ugliness between that and callsite.  Note,
> BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
> interact with freeze?
  Freezing code takes care that all dirty data is synced before fs is
frozen and no new dirty data can be created before fs is thawed. So
vfs_fsync() should just return without doing anything on frozen filesystem.

> This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
> those called while holding ->i_mutex on a directory (or mnt_want_write(), for
> that matter) is a deadlock right now.
> 
> And we'd better start thinking about how we'll backport that crap - deadlock
> in e.g. xfs ->splice_write() had been there since last summer ;-/
  Yeah but noone really noticed because in practice the code isn't
stressed much. Much bigger problems had been there for years before they
were fixed last summer without anybody complaining... So I'm not sure how
hard do we want to try to backport this.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread David Howells
Al Viro  wrote:

> fs/cachefiles/rdwr.c:967:   ret = file->f_op->write(
> 
> cachefiles_write_page(); no fucking idea what locks might be held by caller
> and potentially that's a rather nasty source of PITA

The caller of cachefiles_write_page() (ie. fscache_write_op()) holds no locks
over the call.

__fscache_write_page() queues the pages for writing to the cache from the
netfs's read-side routines and returns immediately.

fscache_write_op() then picks them up in the background and passes them over
to the backend (eg. cachefiles_write_page()) which writes them out to the
cache.

So I think it should be safe to call file_start/end_write() or whatever around
the file->f_op->write() calls.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 10:24 PM, Al Viro  wrote:
> it still might make sense to implement
> something as a layer, but some parts of that sucker may be better off as
> fs primitives.  Hell, we could, in theory, implement xattrs as a layer;
> just look at how reiserfs had done them.  We could do the same for hardlinks
> (look at qnx4, if you wish to see that hack).  Of for symlinks (sysvfs).
> Or for opened-and-unlinked files (sillyrename could be done as a generic
> layer).  Or for permissions/ownership/arbitrary names (umsdos, and that
> _was_ very similar to layering).  It's just that often an underlying fs
> has a better way of doing that.  IMO whiteouts are in that class.

My gut feeling is that whiteouts (and all that's related) are just too
specialized.  All the examples you listed are more general purpose.

I know one that could be useful for a variety of things, an "exchange"
operation.   While similar to rename, the semantics are much cleaner
and so the implementation becomes easier as well.

But that one doesn't quite solve the rename-with-whiteout thing.   For
that a three way permutation would be needed where one of the three is
an unlinked "floating" object.   That one is a really generic op but
we'd need to introduce linking unlinked objects into the tree which
comes with its own problems.

But I think it may be worth trying to come up with something more
generally useful before adding whiteouts and various overlay-specific
flags to filesystem code.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 10:24 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 it still might make sense to implement
 something as a layer, but some parts of that sucker may be better off as
 fs primitives.  Hell, we could, in theory, implement xattrs as a layer;
 just look at how reiserfs had done them.  We could do the same for hardlinks
 (look at qnx4, if you wish to see that hack).  Of for symlinks (sysvfs).
 Or for opened-and-unlinked files (sillyrename could be done as a generic
 layer).  Or for permissions/ownership/arbitrary names (umsdos, and that
 _was_ very similar to layering).  It's just that often an underlying fs
 has a better way of doing that.  IMO whiteouts are in that class.

My gut feeling is that whiteouts (and all that's related) are just too
specialized.  All the examples you listed are more general purpose.

I know one that could be useful for a variety of things, an exchange
operation.   While similar to rename, the semantics are much cleaner
and so the implementation becomes easier as well.

But that one doesn't quite solve the rename-with-whiteout thing.   For
that a three way permutation would be needed where one of the three is
an unlinked floating object.   That one is a really generic op but
we'd need to introduce linking unlinked objects into the tree which
comes with its own problems.

But I think it may be worth trying to come up with something more
generally useful before adding whiteouts and various overlay-specific
flags to filesystem code.

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread David Howells
Al Viro v...@zeniv.linux.org.uk wrote:

 fs/cachefiles/rdwr.c:967:   ret = file-f_op-write(
 
 cachefiles_write_page(); no fucking idea what locks might be held by caller
 and potentially that's a rather nasty source of PITA

The caller of cachefiles_write_page() (ie. fscache_write_op()) holds no locks
over the call.

__fscache_write_page() queues the pages for writing to the cache from the
netfs's read-side routines and returns immediately.

fscache_write_op() then picks them up in the background and passes them over
to the backend (eg. cachefiles_write_page()) which writes them out to the
cache.

So I think it should be safe to call file_start/end_write() or whatever around
the file-f_op-write() calls.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Jan Kara
On Wed 20-03-13 02:33:08, Al Viro wrote:
 On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote:
 
  OK, it's going to be an interesting series - aforementioned tentative patch
  was badly incomplete ;-/
 
 The interesting question is how far do we want to lift that.  -aio_write()
 part is trivial - see vfs.git#experimental; the trouble begins with
 -splice_write().  For *everything* except default_file_splice_write(),
 lifting into the caller (do_splice_from()) is the right thing to do.
 
 default_file_splice_write(), however, it trickier; there we end up calling
 vfs_write() (via an ugly callchain).  And _that_ is a real bitch.  Granted,
 vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
 and access_ok() is pointless due to set_fs() we do around vfs_write()
 there) and we'd already lifted it up to do_sync_write().  But if we lift
 it any further, we'll need to deal with -write() callers in the tree.
 Current situation:
 
 fs/coredump.c:662:  return access_ok(VERIFY_READ, addr, nr)  
 file-f_op-write(file, addr, nr, file-f_pos) == nr;
 arch/powerpc/platforms/cell/spufs/coredump.c:63:written = 
 file-f_op-write(file, addr, nr, file-f_pos);
 
 for these guys we might actually want to lift all way up to do_coredump()
 
 drivers/staging/comedi/drivers/serial2002.c:91: result = f-f_op-write(f, 
 buf, count, f-f_pos);
 fs/autofs4/waitq.c:73: (wr = 
 file-f_op-write(file,data,bytes,file-f_pos))  0) {
 
 not regular files, unless I'm seriously misreading the code.
 
 kernel/acct.c:553:  file-f_op-write(file, (char *)ac,
 
 BTW, this is probably where we want to deal with your acct deadlock.
 
 fs/compat.c:1103:   fn = (io_fn_t)file-f_op-write;
 fs/read_write.c:435:ret = file-f_op-write(file, buf, 
 count, pos);
 fs/read_write.c:732:fn = (io_fn_t)file-f_op-write;
 
 syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
 compat_writev, or actually take it further.
 
 fs/cachefiles/rdwr.c:967:   ret = file-f_op-write(
 
 cachefiles_write_page(); no fucking idea what locks might be held by caller
 and potentially that's a rather nasty source of PITA
 
 fs/coda/file.c:84:  ret = host_file-f_op-write(host_file, buf, count, 
 ppos);
 
 coda writing to file in cache on local fs.  Potentially a nasty bugger, since
 it's hard to lift any further - the caller has no idea that the thing is
 on CODA, let alone what happens to hold the local cache.
 
 drivers/block/loop.c:234:   bw = file-f_op-write(file, buf, len, pos);
 
 do_bio_filebacked(), with some ugliness between that and callsite.  Note,
 BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
 interact with freeze?
  Freezing code takes care that all dirty data is synced before fs is
frozen and no new dirty data can be created before fs is thawed. So
vfs_fsync() should just return without doing anything on frozen filesystem.

 This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
 those called while holding -i_mutex on a directory (or mnt_want_write(), for
 that matter) is a deadlock right now.
 
 And we'd better start thinking about how we'll backport that crap - deadlock
 in e.g. xfs -splice_write() had been there since last summer ;-/
  Yeah but noone really noticed because in practice the code isn't
stressed much. Much bigger problems had been there for years before they
were fixed last summer without anybody complaining... So I'm not sure how
hard do we want to try to backport this.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Al Viro
On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
  do_bio_filebacked(), with some ugliness between that and callsite.  Note,
  BTW, that we have a pair of possible vfs_fsync() calls in there; how do 
  those
  interact with freeze?
   Freezing code takes care that all dirty data is synced before fs is
 frozen and no new dirty data can be created before fs is thawed. So
 vfs_fsync() should just return without doing anything on frozen filesystem.

Um...  How does it interact with vfs_fsync() already in progress when you
ask to freeze it?

Anyway, I've pulled the fscker out of -aio_write, -write and -splice_write;
on that pathway it's in the do_splice_from() (see vfs.git#experimental).

... and now, for something *really* nasty: where do mandatory file locks
belong in the locking hierarchy?  Relative to fsfreeze one, for starters,
but both for unionmount and overlayfs we need to decide where they live
relative to -i_mutex on directories.

And that, BTW, may be the strongest argument so far in favour of the scheme
I'd suggested for copyup-via-opened-but-unlinked...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-20 Thread Jan Kara
On Wed 20-03-13 21:48:13, Al Viro wrote:
 On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
   do_bio_filebacked(), with some ugliness between that and callsite.  Note,
   BTW, that we have a pair of possible vfs_fsync() calls in there; how do 
   those
   interact with freeze?
Freezing code takes care that all dirty data is synced before fs is
  frozen and no new dirty data can be created before fs is thawed. So
  vfs_fsync() should just return without doing anything on frozen filesystem.
 
 Um...  How does it interact with vfs_fsync() already in progress when you
 ask to freeze it?
  So the exact sequence of freezing is:
sb-s_writers.frozen = SB_FREEZE_WRITE;
smp_wmb();
sb_wait_write(sb, SB_FREEZE_WRITE);
Now there are no processes in sb_start_write() - sb_end_write() section.
Then we do the same for SB_FREEZE_PAGEFAULT. After this noone should be
able to dirty a page or inode. Writeback or vfs_fsync() may be still
running (so fs can be creating new transactions in the journal for
writeback etc.).
sync_filesystem(sb);
After this there should be no dirty data so although we can still be
somewhere inside vfs_fsync() it should have nothing to do.

Now we freeze to state SB_FREEZE_FS (nop for ext4, but for XFS it may
interact e.g. with inode reclaim trimming preallocated blocks) and we are
done.
 
 Anyway, I've pulled the fscker out of -aio_write, -write and -splice_write;
 on that pathway it's in the do_splice_from() (see vfs.git#experimental).
 
 ... and now, for something *really* nasty: where do mandatory file locks
 belong in the locking hierarchy?  Relative to fsfreeze one, for starters,
 but both for unionmount and overlayfs we need to decide where they live
 relative to -i_mutex on directories.
  Hum, interesting question :). Relative to fsfreeze, it doesn't seem to
matter much, does it? We don't actually lock / unlock these from write
paths needing freeze protection, we only wait for them. But maybe I miss
some ugly case you have in mind.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote:

> OK, it's going to be an interesting series - aforementioned tentative patch
> was badly incomplete ;-/

The interesting question is how far do we want to lift that.  ->aio_write()
part is trivial - see vfs.git#experimental; the trouble begins with
->splice_write().  For *everything* except default_file_splice_write(),
lifting into the caller (do_splice_from()) is the right thing to do.

default_file_splice_write(), however, it trickier; there we end up calling
vfs_write() (via an ugly callchain).  And _that_ is a real bitch.  Granted,
vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
and access_ok() is pointless due to set_fs() we do around vfs_write()
there) and we'd already lifted it up to do_sync_write().  But if we lift
it any further, we'll need to deal with ->write() callers in the tree.
Current situation:

fs/coredump.c:662:  return access_ok(VERIFY_READ, addr, nr) && 
file->f_op->write(file, addr, nr, >f_pos) == nr;
arch/powerpc/platforms/cell/spufs/coredump.c:63:written = 
file->f_op->write(file, addr, nr, >f_pos);

for these guys we might actually want to lift all way up to do_coredump()

drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, buf, 
count, >f_pos);
fs/autofs4/waitq.c:73: (wr = 
file->f_op->write(file,data,bytes,>f_pos)) > 0) {

not regular files, unless I'm seriously misreading the code.

kernel/acct.c:553:  file->f_op->write(file, (char *),

BTW, this is probably where we want to deal with your acct deadlock.

fs/compat.c:1103:   fn = (io_fn_t)file->f_op->write;
fs/read_write.c:435:ret = file->f_op->write(file, buf, 
count, pos);
fs/read_write.c:732:fn = (io_fn_t)file->f_op->write;

syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
compat_writev, or actually take it further.

fs/cachefiles/rdwr.c:967:   ret = file->f_op->write(

cachefiles_write_page(); no fucking idea what locks might be held by caller
and potentially that's a rather nasty source of PITA

fs/coda/file.c:84:  ret = host_file->f_op->write(host_file, buf, count, 
ppos);

coda writing to file in cache on local fs.  Potentially a nasty bugger, since
it's hard to lift any further - the caller has no idea that the thing is
on CODA, let alone what happens to hold the local cache.

drivers/block/loop.c:234:   bw = file->f_op->write(file, buf, len, );

do_bio_filebacked(), with some ugliness between that and callsite.  Note,
BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
interact with freeze?

This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
those called while holding ->i_mutex on a directory (or mnt_want_write(), for
that matter) is a deadlock right now.

And we'd better start thinking about how we'll backport that crap - deadlock
in e.g. xfs ->splice_write() had been there since last summer ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 09:38:31PM +, Al Viro wrote:
> On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
> > > BTW, having sb_start_write() buried in individual ->splice_write() is
> > > asking for trouble; could you describe the rules for that?  E.g. where
> > > does it nest wrt filesystem-private locks?  XFS iolock, for example...
> >   Generally, the freeze protection should be the outermost lock taken (so
> > that we mitigate possibility of blocking readers when waiting for fs to
> > unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.
> 
> Welcome to deadlock, then:
> xfs_file_splice_write()
> ...
> xfs_ilock(ip, XFS_IOLOCK_EXCL);
> ...
> ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
> 
> > It seems that I screwed this up for ->splice_write() :-| If we are going to
> > move out sb_start_write() out of filesystems' hands into do_splice_from()
> > then we should likely do the same with ->aio_write(). Hmm?

... and then there's ext4_file_dio_write(), which cheerfully ignores freeze.
... and udf expanding from inline files to separately allocated before it
gets to sb_start_write()
... and a bunch of guys doing generic_write_sync() after 
generic_aio_file_write()
... and a lot of other fun stuff.  Ouch...

OK, it's going to be an interesting series - aforementioned tentative patch
was badly incomplete ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
> > BTW, having sb_start_write() buried in individual ->splice_write() is
> > asking for trouble; could you describe the rules for that?  E.g. where
> > does it nest wrt filesystem-private locks?  XFS iolock, for example...
>   Generally, the freeze protection should be the outermost lock taken (so
> that we mitigate possibility of blocking readers when waiting for fs to
> unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

Welcome to deadlock, then:
xfs_file_splice_write()
...
xfs_ilock(ip, XFS_IOLOCK_EXCL);
...
ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);

> It seems that I screwed this up for ->splice_write() :-| If we are going to
> move out sb_start_write() out of filesystems' hands into do_splice_from()
> then we should likely do the same with ->aio_write(). Hmm?

Yes, I've a tentative patch doing just that; will push tonight.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 07:32:42PM +0100, Miklos Szeredi wrote:

> > * victim is negative => create a whiteout
> > * victim is a directory, parent opaque => rmdir
> > * victim is a non-directory, parent opaque => unlink
> > * victim is positive, parent _not_ opaque => replace with whiteout
> > * old_dir in case of ->rename() is opaque => normal rename
> > * old_dir in case of ->rename() is not opaque => leave whiteout 
> > behind
> > Non-unioned => opaque, of course (nothing showing through it).
> >
> 
> I dunnow.  Overloading common paths with overlay/union specific things
> doesn't look very clean to me.

We certainly can add a couple of new methods; the question is, would that
buy us anything?  This ->rename_and_whiteout would be practically identical
to ->rename on most of filesystems; the only difference for something like
ext* is that instead of deleting the old directory entry in the end we
would change its inumber to 0 and type to 14.  Everything else would be
the same.  Moreover, turning a positive entry into a whiteout differs from
unlink/rmdir resp. in exactly the same way.  FWIW, creating a whiteout from
scratch might be better off as a separate method - it's closer to ->mknod()
and friends.  But whiteout from positive and whiteout-on-rename probably
isn't worth separating from the normal methods.  All we end up "overloading"
is actual removal of directory entry, which is a rather minor part of the
logics in those.

And merging ->unlink/->rmdir is definitely a good idea, regardless of anything
union-related; just look at the existing instances of both.

> > Getting good behaviour on rename interrupted by crash is going to be _very_
> > tricky with any strategy other than whiteouts-in-fs, AFAICS.
> >
> 
> One idea is to add a journal to the overlay itself (yeah, namespace issues).

... and extra complexity from hell.  If the underlying fs has a journal of
its own, we are getting all kinds of interesting interplay between those;
if not, replaying the journal in overlay will buy you nothing, since the
damage to underlying objects makes it all moot.

Layering is a very nice tool for quick prototyping; no arguments about that.
It allows to avoid modifying $BIGNUM filesystems and for something that
lives out of tree it's a very big benefit.  Once the thing is merged, though,
that benefit becomes much smaller; it still might make sense to implement
something as a layer, but some parts of that sucker may be better off as
fs primitives.  Hell, we could, in theory, implement xattrs as a layer;
just look at how reiserfs had done them.  We could do the same for hardlinks
(look at qnx4, if you wish to see that hack).  Of for symlinks (sysvfs).
Or for opened-and-unlinked files (sillyrename could be done as a generic
layer).  Or for permissions/ownership/arbitrary names (umsdos, and that
_was_ very similar to layering).  It's just that often an underlying fs
has a better way of doing that.  IMO whiteouts are in that class.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Jan Kara
On Mon 18-03-13 23:01:03, Al Viro wrote:
> On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote:
> > On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> > >   IMO the deadlock is real. In freeze_super() we wait for all writers to
> > > the filesystem to finish while blocking beginning of any further writes. 
> > > So
> > > we have a deadlock scenario like:
> > > 
> > >   THREAD1 THREAD2 THREAD3
> > > mnt_want_write()  mutex_lock(>i_mutex);
> > > ...   freeze_super()
> > > block on mutex_lock(>i_mutex)
> > > sb_wait_write(sb, 
> > > SB_FREEZE_WRITE);
> > >   block in sb_start_write()
> > 
> > The bug is on fsfreeze side and this is not the only problem related to it.
> > I've missed the implications when I applied "fs: Add freezing handling
> > to mnt_want_write() / mnt_drop_write()" last June ;-/
> > 
> > The thing is, until then mnt_want_write() used to be a counter; it could be
> > nested.  Now any such nesting is a deadlock you've just described.  This
> > is seriously wrong, IMO.
> > 
> > BTW, having sb_start_write() buried in individual ->splice_write() is
> > asking for trouble; could you describe the rules for that?  E.g. where
> > does it nest wrt filesystem-private locks?  XFS iolock, for example...
> 
> I'm looking at the existing callers and I really wonder if we ought to
> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
> callers.
  Yeah, that should be OK.

> Something like file_start_write()/file_end_write(), with check for file
> being regular one might be a good starting point.  As it is, copyup is
> really fucked both in unionmount and overlayfs...
  Makes sense. I can do the patch. BTW, for months I'm trying to push to you
a patch which creates a function like file_start_write() which returns
EAGAIN if the file is open with O_NONBLOCK and fs is frozen (this allows me
to solve a deadlock with bsd process accounting to frozen fs). After this
change the patch will become trivial so I'll add it to the series and
hopefully it won't get ignored.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Jan Kara
On Mon 18-03-13 21:53:34, Al Viro wrote:
> On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> >   IMO the deadlock is real. In freeze_super() we wait for all writers to
> > the filesystem to finish while blocking beginning of any further writes. So
> > we have a deadlock scenario like:
> > 
> >   THREAD1   THREAD2 THREAD3
> > mnt_want_write()mutex_lock(>i_mutex);
> > ... freeze_super()
> > block on mutex_lock(>i_mutex)
> >   sb_wait_write(sb, 
> > SB_FREEZE_WRITE);
> > block in sb_start_write()
> 
> The bug is on fsfreeze side and this is not the only problem related to it.
> I've missed the implications when I applied "fs: Add freezing handling
> to mnt_want_write() / mnt_drop_write()" last June ;-/
> 
> The thing is, until then mnt_want_write() used to be a counter; it could be
> nested.  Now any such nesting is a deadlock you've just described.  This
> is seriously wrong, IMO.
  Well, but sb_start_write() has to be blocking (blocks when fs is frozen)
and you have to get it somewhere. It seems only natural to get the counter
from original mnt_want_write() at the same place and use one function for
that. Whether I should have changed the name from mnt_want_write() to
something else is questionable...

> BTW, having sb_start_write() buried in individual ->splice_write() is
> asking for trouble; could you describe the rules for that?  E.g. where
> does it nest wrt filesystem-private locks?  XFS iolock, for example...
  Generally, the freeze protection should be the outermost lock taken (so
that we mitigate possibility of blocking readers when waiting for fs to
unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

It seems that I screwed this up for ->splice_write() :-| If we are going to
move out sb_start_write() out of filesystems' hands into do_splice_from()
then we should likely do the same with ->aio_write(). Hmm?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 6:03 PM, Al Viro  wrote:
> On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:
>
>> Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
>> way down in the list of things to do.  I'd drop splice in a jiffy if
>> it's in the way.
>
> What makes you think that write is any better?  Same deadlock there - check
> generic_file_aio_write(), it calls the same sb_start_write()...  IOW,
> switching from splice to write won't help at all.

Okay, I missed that. Yeah, that needs fixing...

>> Much more interesting question:  what happens if we crash during a
>> rename?  Whiteout implemented in the filesystem won't save us.  And
>> the results are interesting: old versions of files become visible and
>> similar fun.  Far from likely to happen, but ...
>>
>> Add a rename-with-whiteout primitive on filesystems?  That one is not
>> going to be as simple as plain whiteout.  Or?
>
> Umm...  If/when we start caring about that kind of atomicity (and I agree
> that we ought to) overlayfs approach to whiteouts will actually have much
> harder time - it doesn't take much to teach a journalling fs how to do that
> kind of ->rename() in a single transaction; the only question is how to tell
> it that we want to leave a whiteout behind us.  Hell knows; one variant is
> to add a flag, of course.  Another might be more interesting - we want some
> kind of "directory is opaque" flag, so if we start reshuffling the methods,
> we might try to merge unlink/rmdir/whiteout.  Rules:
> * victim is negative => create a whiteout
> * victim is a directory, parent opaque => rmdir
> * victim is a non-directory, parent opaque => unlink
> * victim is positive, parent _not_ opaque => replace with whiteout
> * old_dir in case of ->rename() is opaque => normal rename
> * old_dir in case of ->rename() is not opaque => leave whiteout behind
> Non-unioned => opaque, of course (nothing showing through it).
>

I dunnow.  Overloading common paths with overlay/union specific things
doesn't look very clean to me.

I have a similar problem with union-mounts: it's hooking into lots of
common paths in the VFS for the sake of a very specialized feature.

> Getting good behaviour on rename interrupted by crash is going to be _very_
> tricky with any strategy other than whiteouts-in-fs, AFAICS.
>

One idea is to add a journal to the overlay itself (yeah, namespace issues).

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:

> Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
> way down in the list of things to do.  I'd drop splice in a jiffy if
> it's in the way.

What makes you think that write is any better?  Same deadlock there - check
generic_file_aio_write(), it calls the same sb_start_write()...  IOW,
switching from splice to write won't help at all.

> Much more interesting question:  what happens if we crash during a
> rename?  Whiteout implemented in the filesystem won't save us.  And
> the results are interesting: old versions of files become visible and
> similar fun.  Far from likely to happen, but ...
> 
> Add a rename-with-whiteout primitive on filesystems?  That one is not
> going to be as simple as plain whiteout.  Or?

Umm...  If/when we start caring about that kind of atomicity (and I agree
that we ought to) overlayfs approach to whiteouts will actually have much
harder time - it doesn't take much to teach a journalling fs how to do that
kind of ->rename() in a single transaction; the only question is how to tell
it that we want to leave a whiteout behind us.  Hell knows; one variant is
to add a flag, of course.  Another might be more interesting - we want some
kind of "directory is opaque" flag, so if we start reshuffling the methods,
we might try to merge unlink/rmdir/whiteout.  Rules:
* victim is negative => create a whiteout
* victim is a directory, parent opaque => rmdir
* victim is a non-directory, parent opaque => unlink
* victim is positive, parent _not_ opaque => replace with whiteout
* old_dir in case of ->rename() is opaque => normal rename
* old_dir in case of ->rename() is not opaque => leave whiteout behind
Non-unioned => opaque, of course (nothing showing through it).

Getting good behaviour on rename interrupted by crash is going to be _very_
tricky with any strategy other than whiteouts-in-fs, AFAICS.

Again, I have no problem whatsoever with changing the set of directory
methods, as long as the replacement is sane.  We'd done that kind of thing
before and it's not a problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 12:04 PM, David Howells  wrote:
> Miklos Szeredi  wrote:
>
>> > BTW, I wonder what's the right locking for that sucker; overlayfs is
>> > probably too heavy - we are talking about copying a file from one fs to
>> > another, which can obviously take quite a while, so holding ->i_mutex on
>> > _parent_ all along is asking for very serious contention.
>>
>> Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
>> way down in the list of things to do.  I'd drop splice in a jiffy if
>> it's in the way.
>
> Yes, but it could block the parent directory for a long time.  I suspect it's
> fine if you can RCU walk through the parent, but if you have to grab a lock on
> it...

Right.

Lets look at it this way:  users of an overlay accept that an
operation X can take T time, where T is much longer than would be on a
normal filesystem.  Then why would they complain that operation Y
(which happens to bump into the parent lock of X) also takes T?

If copy up of huge files happens more then very very occasionally,
then the overlay will be basically unusable anyway.  It's just not
what it is designed for, so why try to optimize this case?

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread David Howells
Miklos Szeredi  wrote:

> > BTW, I wonder what's the right locking for that sucker; overlayfs is
> > probably too heavy - we are talking about copying a file from one fs to
> > another, which can obviously take quite a while, so holding ->i_mutex on
> > _parent_ all along is asking for very serious contention.
> 
> Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
> way down in the list of things to do.  I'd drop splice in a jiffy if
> it's in the way.

Yes, but it could block the parent directory for a long time.  I suspect it's
fine if you can RCU walk through the parent, but if you have to grab a lock on
it...

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 2:38 AM, Al Viro  wrote:
> On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote:
>
>> I'm looking at the existing callers and I really wonder if we ought to
>> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
>> callers.
>>
>> Something like file_start_write()/file_end_write(), with check for file
>> being regular one might be a good starting point.  As it is, copyup is
>> really fucked both in unionmount and overlayfs...
>
> BTW, I wonder what's the right locking for that sucker; overlayfs is probably
> too heavy - we are talking about copying a file from one fs to another, which
> can obviously take quite a while, so holding ->i_mutex on _parent_ all along
> is asking for very serious contention.

Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
way down in the list of things to do.  I'd drop splice in a jiffy if
it's in the way.

Much more interesting question:  what happens if we crash during a
rename?  Whiteout implemented in the filesystem won't save us.  And
the results are interesting: old versions of files become visible and
similar fun.  Far from likely to happen, but ...

Add a rename-with-whiteout primitive on filesystems?  That one is not
going to be as simple as plain whiteout.  Or?

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread J. R. Okajima

Al Viro:
> BTW, I wonder what's the right locking for that sucker; overlayfs is probably
> too heavy - we are talking about copying a file from one fs to another, which
> can obviously take quite a while, so holding ->i_mutex on _parent_ all along
> is asking for very serious contention.  OTOH, there's a pile of unpleasant

Yes, holding parent->i_mutex can be longer.
Using splice function for copy-up is simple and (probably) fastest. But
it doesn't support a "hole" in the file (sparse file). All holes are
filled with NUL byte and consumes a disk block on the upper layer. It is
a problem, especially for users who have smaller tmpfs as his upper
layer.
The copy-up with considering a hole may cost more, but it can save the
storage consumtion.


> Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
> attempt in progress to complete, rather than start another one in parallel.
> And whoever comes the second must check if copyup has succeeded, obviously -
> it's possible to have user run into 5% limit and fail copyup, followed by
> root doing it successfully.

"5% limit" means the reserved are for a superuser on a filesystem,
right? As far as I know, overlayfs (UnionMount too?) solves this problem
as changing the task credential and the capability. But I am not sure
whether it solves the similar problem around the resource limit like
RLIMIT_CPU, RLIMIT_FSIZE or something.


> Another one: overwriting rename() vs. copyup.  Similar to unlink() vs. 
> copyup().

Hmm, do you mean that, just after the parent dir lock on the underlying
fs, the copyup routine should confirm whether the target is still alive?
If so, I agree.


> Another one: read-only open() vs. copyup().  Hell knows - we obviously don't
> want to open a partially copied file; might want to wait or pretend that this
> open() has come first and give it the underlying file.  The same goes for
> stat() vs. copyup().

Exactly.
Moreover users don't want to refer to the lower file which is obsoleted
by copyup.


> FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent,
> copy data and metadata, lock parent, allocate a new dentry in covering layer
> and do ->lookup() on it, verify that it is negative and not a whiteout, lock
> child, use ->link() to put it into directory, unlock everything" would
> probably DTRT for unlink/copyup and rename/copyup.  The rest... hell knows;

Please let me make sure.
You are saying,
- create the file on the upper layer
- get the "struct file" object
- hide it from users
- before copying the file, unlock the parent in order to stop the long
  period locking
- copy the file without the parent lock. it doesn't matter since the
  file is invisible to users.
- confirm that the target name is still available for copyup
- make the completed file visible by ->link()

It is ineteresting to ->link() with the unlinked file. While
vfs_unlink() rejects such case, it may not matter for the underlying fs.
Need to verify FS including jounals.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread J. R. Okajima

Al Viro:
 BTW, I wonder what's the right locking for that sucker; overlayfs is probably
 too heavy - we are talking about copying a file from one fs to another, which
 can obviously take quite a while, so holding -i_mutex on _parent_ all along
 is asking for very serious contention.  OTOH, there's a pile of unpleasant

Yes, holding parent-i_mutex can be longer.
Using splice function for copy-up is simple and (probably) fastest. But
it doesn't support a hole in the file (sparse file). All holes are
filled with NUL byte and consumes a disk block on the upper layer. It is
a problem, especially for users who have smaller tmpfs as his upper
layer.
The copy-up with considering a hole may cost more, but it can save the
storage consumtion.


 Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
 attempt in progress to complete, rather than start another one in parallel.
 And whoever comes the second must check if copyup has succeeded, obviously -
 it's possible to have user run into 5% limit and fail copyup, followed by
 root doing it successfully.

5% limit means the reserved are for a superuser on a filesystem,
right? As far as I know, overlayfs (UnionMount too?) solves this problem
as changing the task credential and the capability. But I am not sure
whether it solves the similar problem around the resource limit like
RLIMIT_CPU, RLIMIT_FSIZE or something.


 Another one: overwriting rename() vs. copyup.  Similar to unlink() vs. 
 copyup().

Hmm, do you mean that, just after the parent dir lock on the underlying
fs, the copyup routine should confirm whether the target is still alive?
If so, I agree.


 Another one: read-only open() vs. copyup().  Hell knows - we obviously don't
 want to open a partially copied file; might want to wait or pretend that this
 open() has come first and give it the underlying file.  The same goes for
 stat() vs. copyup().

Exactly.
Moreover users don't want to refer to the lower file which is obsoleted
by copyup.


 FWIW, something like lock parent, -create(), -unlink(), unlock parent,
 copy data and metadata, lock parent, allocate a new dentry in covering layer
 and do -lookup() on it, verify that it is negative and not a whiteout, lock
 child, use -link() to put it into directory, unlock everything would
 probably DTRT for unlink/copyup and rename/copyup.  The rest... hell knows;

Please let me make sure.
You are saying,
- create the file on the upper layer
- get the struct file object
- hide it from users
- before copying the file, unlock the parent in order to stop the long
  period locking
- copy the file without the parent lock. it doesn't matter since the
  file is invisible to users.
- confirm that the target name is still available for copyup
- make the completed file visible by -link()

It is ineteresting to -link() with the unlinked file. While
vfs_unlink() rejects such case, it may not matter for the underlying fs.
Need to verify FS including jounals.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 2:38 AM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote:

 I'm looking at the existing callers and I really wonder if we ought to
 push sb_start_write() from -splice_write()/-aio_write()/etc. into the
 callers.

 Something like file_start_write()/file_end_write(), with check for file
 being regular one might be a good starting point.  As it is, copyup is
 really fucked both in unionmount and overlayfs...

 BTW, I wonder what's the right locking for that sucker; overlayfs is probably
 too heavy - we are talking about copying a file from one fs to another, which
 can obviously take quite a while, so holding -i_mutex on _parent_ all along
 is asking for very serious contention.

Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
way down in the list of things to do.  I'd drop splice in a jiffy if
it's in the way.

Much more interesting question:  what happens if we crash during a
rename?  Whiteout implemented in the filesystem won't save us.  And
the results are interesting: old versions of files become visible and
similar fun.  Far from likely to happen, but ...

Add a rename-with-whiteout primitive on filesystems?  That one is not
going to be as simple as plain whiteout.  Or?

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread David Howells
Miklos Szeredi mik...@szeredi.hu wrote:

  BTW, I wonder what's the right locking for that sucker; overlayfs is
  probably too heavy - we are talking about copying a file from one fs to
  another, which can obviously take quite a while, so holding -i_mutex on
  _parent_ all along is asking for very serious contention.
 
 Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
 way down in the list of things to do.  I'd drop splice in a jiffy if
 it's in the way.

Yes, but it could block the parent directory for a long time.  I suspect it's
fine if you can RCU walk through the parent, but if you have to grab a lock on
it...

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 12:04 PM, David Howells dhowe...@redhat.com wrote:
 Miklos Szeredi mik...@szeredi.hu wrote:

  BTW, I wonder what's the right locking for that sucker; overlayfs is
  probably too heavy - we are talking about copying a file from one fs to
  another, which can obviously take quite a while, so holding -i_mutex on
  _parent_ all along is asking for very serious contention.

 Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
 way down in the list of things to do.  I'd drop splice in a jiffy if
 it's in the way.

 Yes, but it could block the parent directory for a long time.  I suspect it's
 fine if you can RCU walk through the parent, but if you have to grab a lock on
 it...

Right.

Lets look at it this way:  users of an overlay accept that an
operation X can take T time, where T is much longer than would be on a
normal filesystem.  Then why would they complain that operation Y
(which happens to bump into the parent lock of X) also takes T?

If copy up of huge files happens more then very very occasionally,
then the overlay will be basically unusable anyway.  It's just not
what it is designed for, so why try to optimize this case?

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:

 Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
 way down in the list of things to do.  I'd drop splice in a jiffy if
 it's in the way.

What makes you think that write is any better?  Same deadlock there - check
generic_file_aio_write(), it calls the same sb_start_write()...  IOW,
switching from splice to write won't help at all.

 Much more interesting question:  what happens if we crash during a
 rename?  Whiteout implemented in the filesystem won't save us.  And
 the results are interesting: old versions of files become visible and
 similar fun.  Far from likely to happen, but ...
 
 Add a rename-with-whiteout primitive on filesystems?  That one is not
 going to be as simple as plain whiteout.  Or?

Umm...  If/when we start caring about that kind of atomicity (and I agree
that we ought to) overlayfs approach to whiteouts will actually have much
harder time - it doesn't take much to teach a journalling fs how to do that
kind of -rename() in a single transaction; the only question is how to tell
it that we want to leave a whiteout behind us.  Hell knows; one variant is
to add a flag, of course.  Another might be more interesting - we want some
kind of directory is opaque flag, so if we start reshuffling the methods,
we might try to merge unlink/rmdir/whiteout.  Rules:
* victim is negative = create a whiteout
* victim is a directory, parent opaque = rmdir
* victim is a non-directory, parent opaque = unlink
* victim is positive, parent _not_ opaque = replace with whiteout
* old_dir in case of -rename() is opaque = normal rename
* old_dir in case of -rename() is not opaque = leave whiteout behind
Non-unioned = opaque, of course (nothing showing through it).

Getting good behaviour on rename interrupted by crash is going to be _very_
tricky with any strategy other than whiteouts-in-fs, AFAICS.

Again, I have no problem whatsoever with changing the set of directory
methods, as long as the replacement is sane.  We'd done that kind of thing
before and it's not a problem.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Miklos Szeredi
On Tue, Mar 19, 2013 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:

 Copy up is a once-in-a-lifetime event for an object.  Optimizing it is
 way down in the list of things to do.  I'd drop splice in a jiffy if
 it's in the way.

 What makes you think that write is any better?  Same deadlock there - check
 generic_file_aio_write(), it calls the same sb_start_write()...  IOW,
 switching from splice to write won't help at all.

Okay, I missed that. Yeah, that needs fixing...

 Much more interesting question:  what happens if we crash during a
 rename?  Whiteout implemented in the filesystem won't save us.  And
 the results are interesting: old versions of files become visible and
 similar fun.  Far from likely to happen, but ...

 Add a rename-with-whiteout primitive on filesystems?  That one is not
 going to be as simple as plain whiteout.  Or?

 Umm...  If/when we start caring about that kind of atomicity (and I agree
 that we ought to) overlayfs approach to whiteouts will actually have much
 harder time - it doesn't take much to teach a journalling fs how to do that
 kind of -rename() in a single transaction; the only question is how to tell
 it that we want to leave a whiteout behind us.  Hell knows; one variant is
 to add a flag, of course.  Another might be more interesting - we want some
 kind of directory is opaque flag, so if we start reshuffling the methods,
 we might try to merge unlink/rmdir/whiteout.  Rules:
 * victim is negative = create a whiteout
 * victim is a directory, parent opaque = rmdir
 * victim is a non-directory, parent opaque = unlink
 * victim is positive, parent _not_ opaque = replace with whiteout
 * old_dir in case of -rename() is opaque = normal rename
 * old_dir in case of -rename() is not opaque = leave whiteout behind
 Non-unioned = opaque, of course (nothing showing through it).


I dunnow.  Overloading common paths with overlay/union specific things
doesn't look very clean to me.

I have a similar problem with union-mounts: it's hooking into lots of
common paths in the VFS for the sake of a very specialized feature.

 Getting good behaviour on rename interrupted by crash is going to be _very_
 tricky with any strategy other than whiteouts-in-fs, AFAICS.


One idea is to add a journal to the overlay itself (yeah, namespace issues).

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Jan Kara
On Mon 18-03-13 21:53:34, Al Viro wrote:
 On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
IMO the deadlock is real. In freeze_super() we wait for all writers to
  the filesystem to finish while blocking beginning of any further writes. So
  we have a deadlock scenario like:
  
THREAD1   THREAD2 THREAD3
  mnt_want_write()mutex_lock(inode-i_mutex);
  ... freeze_super()
  block on mutex_lock(inode-i_mutex)
sb_wait_write(sb, 
  SB_FREEZE_WRITE);
  block in sb_start_write()
 
 The bug is on fsfreeze side and this is not the only problem related to it.
 I've missed the implications when I applied fs: Add freezing handling
 to mnt_want_write() / mnt_drop_write() last June ;-/
 
 The thing is, until then mnt_want_write() used to be a counter; it could be
 nested.  Now any such nesting is a deadlock you've just described.  This
 is seriously wrong, IMO.
  Well, but sb_start_write() has to be blocking (blocks when fs is frozen)
and you have to get it somewhere. It seems only natural to get the counter
from original mnt_want_write() at the same place and use one function for
that. Whether I should have changed the name from mnt_want_write() to
something else is questionable...

 BTW, having sb_start_write() buried in individual -splice_write() is
 asking for trouble; could you describe the rules for that?  E.g. where
 does it nest wrt filesystem-private locks?  XFS iolock, for example...
  Generally, the freeze protection should be the outermost lock taken (so
that we mitigate possibility of blocking readers when waiting for fs to
unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

It seems that I screwed this up for -splice_write() :-| If we are going to
move out sb_start_write() out of filesystems' hands into do_splice_from()
then we should likely do the same with -aio_write(). Hmm?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Jan Kara
On Mon 18-03-13 23:01:03, Al Viro wrote:
 On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote:
  On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
 IMO the deadlock is real. In freeze_super() we wait for all writers to
   the filesystem to finish while blocking beginning of any further writes. 
   So
   we have a deadlock scenario like:
   
 THREAD1 THREAD2 THREAD3
   mnt_want_write()  mutex_lock(inode-i_mutex);
   ...   freeze_super()
   block on mutex_lock(inode-i_mutex)
   sb_wait_write(sb, 
   SB_FREEZE_WRITE);
 block in sb_start_write()
  
  The bug is on fsfreeze side and this is not the only problem related to it.
  I've missed the implications when I applied fs: Add freezing handling
  to mnt_want_write() / mnt_drop_write() last June ;-/
  
  The thing is, until then mnt_want_write() used to be a counter; it could be
  nested.  Now any such nesting is a deadlock you've just described.  This
  is seriously wrong, IMO.
  
  BTW, having sb_start_write() buried in individual -splice_write() is
  asking for trouble; could you describe the rules for that?  E.g. where
  does it nest wrt filesystem-private locks?  XFS iolock, for example...
 
 I'm looking at the existing callers and I really wonder if we ought to
 push sb_start_write() from -splice_write()/-aio_write()/etc. into the
 callers.
  Yeah, that should be OK.

 Something like file_start_write()/file_end_write(), with check for file
 being regular one might be a good starting point.  As it is, copyup is
 really fucked both in unionmount and overlayfs...
  Makes sense. I can do the patch. BTW, for months I'm trying to push to you
a patch which creates a function like file_start_write() which returns
EAGAIN if the file is open with O_NONBLOCK and fs is frozen (this allows me
to solve a deadlock with bsd process accounting to frozen fs). After this
change the patch will become trivial so I'll add it to the series and
hopefully it won't get ignored.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 07:32:42PM +0100, Miklos Szeredi wrote:

  * victim is negative = create a whiteout
  * victim is a directory, parent opaque = rmdir
  * victim is a non-directory, parent opaque = unlink
  * victim is positive, parent _not_ opaque = replace with whiteout
  * old_dir in case of -rename() is opaque = normal rename
  * old_dir in case of -rename() is not opaque = leave whiteout 
  behind
  Non-unioned = opaque, of course (nothing showing through it).
 
 
 I dunnow.  Overloading common paths with overlay/union specific things
 doesn't look very clean to me.

We certainly can add a couple of new methods; the question is, would that
buy us anything?  This -rename_and_whiteout would be practically identical
to -rename on most of filesystems; the only difference for something like
ext* is that instead of deleting the old directory entry in the end we
would change its inumber to 0 and type to 14.  Everything else would be
the same.  Moreover, turning a positive entry into a whiteout differs from
unlink/rmdir resp. in exactly the same way.  FWIW, creating a whiteout from
scratch might be better off as a separate method - it's closer to -mknod()
and friends.  But whiteout from positive and whiteout-on-rename probably
isn't worth separating from the normal methods.  All we end up overloading
is actual removal of directory entry, which is a rather minor part of the
logics in those.

And merging -unlink/-rmdir is definitely a good idea, regardless of anything
union-related; just look at the existing instances of both.

  Getting good behaviour on rename interrupted by crash is going to be _very_
  tricky with any strategy other than whiteouts-in-fs, AFAICS.
 
 
 One idea is to add a journal to the overlay itself (yeah, namespace issues).

... and extra complexity from hell.  If the underlying fs has a journal of
its own, we are getting all kinds of interesting interplay between those;
if not, replaying the journal in overlay will buy you nothing, since the
damage to underlying objects makes it all moot.

Layering is a very nice tool for quick prototyping; no arguments about that.
It allows to avoid modifying $BIGNUM filesystems and for something that
lives out of tree it's a very big benefit.  Once the thing is merged, though,
that benefit becomes much smaller; it still might make sense to implement
something as a layer, but some parts of that sucker may be better off as
fs primitives.  Hell, we could, in theory, implement xattrs as a layer;
just look at how reiserfs had done them.  We could do the same for hardlinks
(look at qnx4, if you wish to see that hack).  Of for symlinks (sysvfs).
Or for opened-and-unlinked files (sillyrename could be done as a generic
layer).  Or for permissions/ownership/arbitrary names (umsdos, and that
_was_ very similar to layering).  It's just that often an underlying fs
has a better way of doing that.  IMO whiteouts are in that class.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
  BTW, having sb_start_write() buried in individual -splice_write() is
  asking for trouble; could you describe the rules for that?  E.g. where
  does it nest wrt filesystem-private locks?  XFS iolock, for example...
   Generally, the freeze protection should be the outermost lock taken (so
 that we mitigate possibility of blocking readers when waiting for fs to
 unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

Welcome to deadlock, then:
xfs_file_splice_write()
...
xfs_ilock(ip, XFS_IOLOCK_EXCL);
...
ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);

 It seems that I screwed this up for -splice_write() :-| If we are going to
 move out sb_start_write() out of filesystems' hands into do_splice_from()
 then we should likely do the same with -aio_write(). Hmm?

Yes, I've a tentative patch doing just that; will push tonight.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 09:38:31PM +, Al Viro wrote:
 On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
   BTW, having sb_start_write() buried in individual -splice_write() is
   asking for trouble; could you describe the rules for that?  E.g. where
   does it nest wrt filesystem-private locks?  XFS iolock, for example...
Generally, the freeze protection should be the outermost lock taken (so
  that we mitigate possibility of blocking readers when waiting for fs to
  unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.
 
 Welcome to deadlock, then:
 xfs_file_splice_write()
 ...
 xfs_ilock(ip, XFS_IOLOCK_EXCL);
 ...
 ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
 
  It seems that I screwed this up for -splice_write() :-| If we are going to
  move out sb_start_write() out of filesystems' hands into do_splice_from()
  then we should likely do the same with -aio_write(). Hmm?

... and then there's ext4_file_dio_write(), which cheerfully ignores freeze.
... and udf expanding from inline files to separately allocated before it
gets to sb_start_write()
... and a bunch of guys doing generic_write_sync() after 
generic_aio_file_write()
... and a lot of other fun stuff.  Ouch...

OK, it's going to be an interesting series - aforementioned tentative patch
was badly incomplete ;-/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-19 Thread Al Viro
On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote:

 OK, it's going to be an interesting series - aforementioned tentative patch
 was badly incomplete ;-/

The interesting question is how far do we want to lift that.  -aio_write()
part is trivial - see vfs.git#experimental; the trouble begins with
-splice_write().  For *everything* except default_file_splice_write(),
lifting into the caller (do_splice_from()) is the right thing to do.

default_file_splice_write(), however, it trickier; there we end up calling
vfs_write() (via an ugly callchain).  And _that_ is a real bitch.  Granted,
vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
and access_ok() is pointless due to set_fs() we do around vfs_write()
there) and we'd already lifted it up to do_sync_write().  But if we lift
it any further, we'll need to deal with -write() callers in the tree.
Current situation:

fs/coredump.c:662:  return access_ok(VERIFY_READ, addr, nr)  
file-f_op-write(file, addr, nr, file-f_pos) == nr;
arch/powerpc/platforms/cell/spufs/coredump.c:63:written = 
file-f_op-write(file, addr, nr, file-f_pos);

for these guys we might actually want to lift all way up to do_coredump()

drivers/staging/comedi/drivers/serial2002.c:91: result = f-f_op-write(f, buf, 
count, f-f_pos);
fs/autofs4/waitq.c:73: (wr = 
file-f_op-write(file,data,bytes,file-f_pos))  0) {

not regular files, unless I'm seriously misreading the code.

kernel/acct.c:553:  file-f_op-write(file, (char *)ac,

BTW, this is probably where we want to deal with your acct deadlock.

fs/compat.c:1103:   fn = (io_fn_t)file-f_op-write;
fs/read_write.c:435:ret = file-f_op-write(file, buf, 
count, pos);
fs/read_write.c:732:fn = (io_fn_t)file-f_op-write;

syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
compat_writev, or actually take it further.

fs/cachefiles/rdwr.c:967:   ret = file-f_op-write(

cachefiles_write_page(); no fucking idea what locks might be held by caller
and potentially that's a rather nasty source of PITA

fs/coda/file.c:84:  ret = host_file-f_op-write(host_file, buf, count, 
ppos);

coda writing to file in cache on local fs.  Potentially a nasty bugger, since
it's hard to lift any further - the caller has no idea that the thing is
on CODA, let alone what happens to hold the local cache.

drivers/block/loop.c:234:   bw = file-f_op-write(file, buf, len, pos);

do_bio_filebacked(), with some ugliness between that and callsite.  Note,
BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
interact with freeze?

This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
those called while holding -i_mutex on a directory (or mnt_want_write(), for
that matter) is a deadlock right now.

And we'd better start thinking about how we'll backport that crap - deadlock
in e.g. xfs -splice_write() had been there since last summer ;-/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote:

> I'm looking at the existing callers and I really wonder if we ought to
> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
> callers.
> 
> Something like file_start_write()/file_end_write(), with check for file
> being regular one might be a good starting point.  As it is, copyup is
> really fucked both in unionmount and overlayfs...

BTW, I wonder what's the right locking for that sucker; overlayfs is probably
too heavy - we are talking about copying a file from one fs to another, which
can obviously take quite a while, so holding ->i_mutex on _parent_ all along
is asking for very serious contention.  OTOH, there's a pile of unpleasant
races that need to be dealt with; consider e.g. chmod("lower_file", 0644) vs.
unlink("lower_file").  The former means creating a copy of lower_file in the
covering layer (and chmod of that copy once it's finished).  The latter
means creating a whiteout in the covering layer.  No matter which comes first,
the result *must* be whiteout in directory + no stray copies left in covering
layer.  chmod() may legitimately return -ENOENT or 0, but resulting state of
fs is unambiguous.  Holding ->i_mutex on parent would, of course, suffice to
serialize them, but it's not particulary nice thing to do wrt contention.

Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
attempt in progress to complete, rather than start another one in parallel.
And whoever comes the second must check if copyup has succeeded, obviously -
it's possible to have user run into 5% limit and fail copyup, followed by
root doing it successfully.

Another one: overwriting rename() vs. copyup.  Similar to unlink() vs. copyup().

Another one: read-only open() vs. copyup().  Hell knows - we obviously don't
want to open a partially copied file; might want to wait or pretend that this
open() has come first and give it the underlying file.  The same goes for
stat() vs. copyup().

FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent,
copy data and metadata, lock parent, allocate a new dentry in covering layer
and do ->lookup() on it, verify that it is negative and not a whiteout, lock
child, use ->link() to put it into directory, unlock everything" would
probably DTRT for unlink/copyup and rename/copyup.  The rest... hell knows;
->i_mutex on child is a non-starter, since we couldn't use the normal ->write()
or ->splice_write() under it.  One possibility is to allocate a structure for
copyup in progress, make it easily located (hash by lower dentry/upper sb
pair) and serialize on that.  Hell knows...

One potential unpleasantness here is dcache lookup coming between ->create()
and ->unlink(); OTOH, there had been fairly reasonable requests for something
like atomic combination of open and unlink and it's not like _that_ would be
hard to implement as a new method - pretty much everything local could just
take part of ->create() and that would be it.  Which might make sense on its
own - open() flag creating a new temporary file, unlinked from the very
beginning and thus killed when we close the damn thing.  Objections?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote:
> On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> >   IMO the deadlock is real. In freeze_super() we wait for all writers to
> > the filesystem to finish while blocking beginning of any further writes. So
> > we have a deadlock scenario like:
> > 
> >   THREAD1   THREAD2 THREAD3
> > mnt_want_write()mutex_lock(>i_mutex);
> > ... freeze_super()
> > block on mutex_lock(>i_mutex)
> >   sb_wait_write(sb, 
> > SB_FREEZE_WRITE);
> > block in sb_start_write()
> 
> The bug is on fsfreeze side and this is not the only problem related to it.
> I've missed the implications when I applied "fs: Add freezing handling
> to mnt_want_write() / mnt_drop_write()" last June ;-/
> 
> The thing is, until then mnt_want_write() used to be a counter; it could be
> nested.  Now any such nesting is a deadlock you've just described.  This
> is seriously wrong, IMO.
> 
> BTW, having sb_start_write() buried in individual ->splice_write() is
> asking for trouble; could you describe the rules for that?  E.g. where
> does it nest wrt filesystem-private locks?  XFS iolock, for example...

I'm looking at the existing callers and I really wonder if we ought to
push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
callers.

Something like file_start_write()/file_end_write(), with check for file
being regular one might be a good starting point.  As it is, copyup is
really fucked both in unionmount and overlayfs...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
>   IMO the deadlock is real. In freeze_super() we wait for all writers to
> the filesystem to finish while blocking beginning of any further writes. So
> we have a deadlock scenario like:
> 
>   THREAD1 THREAD2 THREAD3
> mnt_want_write()  mutex_lock(>i_mutex);
> ...   freeze_super()
> block on mutex_lock(>i_mutex)
> sb_wait_write(sb, 
> SB_FREEZE_WRITE);
>   block in sb_start_write()

The bug is on fsfreeze side and this is not the only problem related to it.
I've missed the implications when I applied "fs: Add freezing handling
to mnt_want_write() / mnt_drop_write()" last June ;-/

The thing is, until then mnt_want_write() used to be a counter; it could be
nested.  Now any such nesting is a deadlock you've just described.  This
is seriously wrong, IMO.

BTW, having sb_start_write() buried in individual ->splice_write() is
asking for trouble; could you describe the rules for that?  E.g. where
does it nest wrt filesystem-private locks?  XFS iolock, for example...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Jan Kara
On Sun 17-03-13 13:06:59, David Howells wrote:
> Miklos Szeredi  wrote:
> 
> > Export do_splice_direct() to modules.  Needed by overlay filesystem.
> 
> Apparently you cannot call this from any function that is holding an i_mutex
> if the target of the splice uses generic_file_splice_write().
> 
> The problem is a potential deadlock situation:
> 
> We have places already that do:
> 
>   mnt_want_write()
>   mutex_lock()
> 
> This can be found in do_last() for example.
> 
> However, mnt_want_write() calls sb_start_write() as does
> generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
> 
>   mutex_lock()
>   sb_start_write()
> 
> which lockdep reports as a potential ABBA deadlock.
> 
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
> can
> illuminate further, so I've added him to the cc list.
  IMO the deadlock is real. In freeze_super() we wait for all writers to
the filesystem to finish while blocking beginning of any further writes. So
we have a deadlock scenario like:

  THREAD1   THREAD2 THREAD3
mnt_want_write()mutex_lock(>i_mutex);
... freeze_super()
block on mutex_lock(>i_mutex)
  sb_wait_write(sb, 
SB_FREEZE_WRITE);
block in sb_start_write()

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Jan Kara
On Sun 17-03-13 13:06:59, David Howells wrote:
 Miklos Szeredi mik...@szeredi.hu wrote:
 
  Export do_splice_direct() to modules.  Needed by overlay filesystem.
 
 Apparently you cannot call this from any function that is holding an i_mutex
 if the target of the splice uses generic_file_splice_write().
 
 The problem is a potential deadlock situation:
 
 We have places already that do:
 
   mnt_want_write()
   mutex_lock()
 
 This can be found in do_last() for example.
 
 However, mnt_want_write() calls sb_start_write() as does
 generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
 
   mutex_lock()
   sb_start_write()
 
 which lockdep reports as a potential ABBA deadlock.
 
 Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
 might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
 can
 illuminate further, so I've added him to the cc list.
  IMO the deadlock is real. In freeze_super() we wait for all writers to
the filesystem to finish while blocking beginning of any further writes. So
we have a deadlock scenario like:

  THREAD1   THREAD2 THREAD3
mnt_want_write()mutex_lock(inode-i_mutex);
... freeze_super()
block on mutex_lock(inode-i_mutex)
  sb_wait_write(sb, 
SB_FREEZE_WRITE);
block in sb_start_write()

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
   IMO the deadlock is real. In freeze_super() we wait for all writers to
 the filesystem to finish while blocking beginning of any further writes. So
 we have a deadlock scenario like:
 
   THREAD1 THREAD2 THREAD3
 mnt_want_write()  mutex_lock(inode-i_mutex);
 ...   freeze_super()
 block on mutex_lock(inode-i_mutex)
 sb_wait_write(sb, 
 SB_FREEZE_WRITE);
   block in sb_start_write()

The bug is on fsfreeze side and this is not the only problem related to it.
I've missed the implications when I applied fs: Add freezing handling
to mnt_want_write() / mnt_drop_write() last June ;-/

The thing is, until then mnt_want_write() used to be a counter; it could be
nested.  Now any such nesting is a deadlock you've just described.  This
is seriously wrong, IMO.

BTW, having sb_start_write() buried in individual -splice_write() is
asking for trouble; could you describe the rules for that?  E.g. where
does it nest wrt filesystem-private locks?  XFS iolock, for example...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote:
 On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
IMO the deadlock is real. In freeze_super() we wait for all writers to
  the filesystem to finish while blocking beginning of any further writes. So
  we have a deadlock scenario like:
  
THREAD1   THREAD2 THREAD3
  mnt_want_write()mutex_lock(inode-i_mutex);
  ... freeze_super()
  block on mutex_lock(inode-i_mutex)
sb_wait_write(sb, 
  SB_FREEZE_WRITE);
  block in sb_start_write()
 
 The bug is on fsfreeze side and this is not the only problem related to it.
 I've missed the implications when I applied fs: Add freezing handling
 to mnt_want_write() / mnt_drop_write() last June ;-/
 
 The thing is, until then mnt_want_write() used to be a counter; it could be
 nested.  Now any such nesting is a deadlock you've just described.  This
 is seriously wrong, IMO.
 
 BTW, having sb_start_write() buried in individual -splice_write() is
 asking for trouble; could you describe the rules for that?  E.g. where
 does it nest wrt filesystem-private locks?  XFS iolock, for example...

I'm looking at the existing callers and I really wonder if we ought to
push sb_start_write() from -splice_write()/-aio_write()/etc. into the
callers.

Something like file_start_write()/file_end_write(), with check for file
being regular one might be a good starting point.  As it is, copyup is
really fucked both in unionmount and overlayfs...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-18 Thread Al Viro
On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote:

 I'm looking at the existing callers and I really wonder if we ought to
 push sb_start_write() from -splice_write()/-aio_write()/etc. into the
 callers.
 
 Something like file_start_write()/file_end_write(), with check for file
 being regular one might be a good starting point.  As it is, copyup is
 really fucked both in unionmount and overlayfs...

BTW, I wonder what's the right locking for that sucker; overlayfs is probably
too heavy - we are talking about copying a file from one fs to another, which
can obviously take quite a while, so holding -i_mutex on _parent_ all along
is asking for very serious contention.  OTOH, there's a pile of unpleasant
races that need to be dealt with; consider e.g. chmod(lower_file, 0644) vs.
unlink(lower_file).  The former means creating a copy of lower_file in the
covering layer (and chmod of that copy once it's finished).  The latter
means creating a whiteout in the covering layer.  No matter which comes first,
the result *must* be whiteout in directory + no stray copies left in covering
layer.  chmod() may legitimately return -ENOENT or 0, but resulting state of
fs is unambiguous.  Holding -i_mutex on parent would, of course, suffice to
serialize them, but it's not particulary nice thing to do wrt contention.

Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
attempt in progress to complete, rather than start another one in parallel.
And whoever comes the second must check if copyup has succeeded, obviously -
it's possible to have user run into 5% limit and fail copyup, followed by
root doing it successfully.

Another one: overwriting rename() vs. copyup.  Similar to unlink() vs. copyup().

Another one: read-only open() vs. copyup().  Hell knows - we obviously don't
want to open a partially copied file; might want to wait or pretend that this
open() has come first and give it the underlying file.  The same goes for
stat() vs. copyup().

FWIW, something like lock parent, -create(), -unlink(), unlock parent,
copy data and metadata, lock parent, allocate a new dentry in covering layer
and do -lookup() on it, verify that it is negative and not a whiteout, lock
child, use -link() to put it into directory, unlock everything would
probably DTRT for unlink/copyup and rename/copyup.  The rest... hell knows;
-i_mutex on child is a non-starter, since we couldn't use the normal -write()
or -splice_write() under it.  One possibility is to allocate a structure for
copyup in progress, make it easily located (hash by lower dentry/upper sb
pair) and serialize on that.  Hell knows...

One potential unpleasantness here is dcache lookup coming between -create()
and -unlink(); OTOH, there had been fairly reasonable requests for something
like atomic combination of open and unlink and it's not like _that_ would be
hard to implement as a new method - pretty much everything local could just
take part of -create() and that would be it.  Which might make sense on its
own - open() flag creating a new temporary file, unlinked from the very
beginning and thus killed when we close the damn thing.  Objections?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread Dave Chinner
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote:
> Miklos Szeredi  wrote:
> 
> > Export do_splice_direct() to modules.  Needed by overlay filesystem.
> 
> Apparently you cannot call this from any function that is holding an i_mutex
> if the target of the splice uses generic_file_splice_write().
> 
> The problem is a potential deadlock situation:
> 
> We have places already that do:
> 
>   mnt_want_write()
>   mutex_lock()
> 
> This can be found in do_last() for example.
> 
> However, mnt_want_write() calls sb_start_write() as does
> generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
> 
>   mutex_lock()
>   sb_start_write()
> 
> which lockdep reports as a potential ABBA deadlock.
> 
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
> can
> illuminate further, so I've added him to the cc list.
> 
> I've attached the report I got with unionmount.

There's plenty of problems with splice locking that can lead to
deadlocks. Here's another that's been known for ages:

http://oss.sgi.com/archives/xfs/2011-08/msg00168.html

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread David Howells
Miklos Szeredi  wrote:

> Export do_splice_direct() to modules.  Needed by overlay filesystem.

Apparently you cannot call this from any function that is holding an i_mutex
if the target of the splice uses generic_file_splice_write().

The problem is a potential deadlock situation:

We have places already that do:

mnt_want_write()
mutex_lock()

This can be found in do_last() for example.

However, mnt_want_write() calls sb_start_write() as does
generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:

mutex_lock()
sb_start_write()

which lockdep reports as a potential ABBA deadlock.

Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
might operate, so it's possible that this is a false alarm.  Maybe Jan Kara can
illuminate further, so I've added him to the cc list.

I've attached the report I got with unionmount.

David
---
[ INFO: possible recursive locking detected ]
3.9.0-rc1-fsdevel+ #934 Not tainted
-
fs-op/4476 is trying to acquire lock:
 (sb_writers#4){.+.+.+}, at: [] 
generic_file_splice_write+0x5d/0x14b
but task is already holding lock:
 (sb_writers#4){.+.+.+}, at: [] mnt_want_write+0x1f/0x46
other info that might help us debug this:
 Possible unsafe locking scenario:
   CPU0
   
  lock(sb_writers#4);
  lock(sb_writers#4);
 *** DEADLOCK ***
 May be due to missing lock nesting notation
2 locks held by fs-op/4476:
 #0:  (sb_writers#4){.+.+.+}, at: [] mnt_want_write+0x1f/0x46
 #1:  (>i_mutex_dir_key[1]){+.+.+.}, at: [] 
__union_copy_up+0x9a/0x132
stack backtrace:
Pid: 4476, comm: fs-op Not tainted 3.9.0-rc1-fsdevel+ #934
Call Trace:
 [] __lock_acquire+0x86a/0x16cf
 [] ? page_cache_pipe_buf_release+0x1b/0x1b
 [] lock_acquire+0x57/0x6d
 [] ? generic_file_splice_write+0x5d/0x14b
 [] __sb_start_write+0x10d/0x15d
 [] ? generic_file_splice_write+0x5d/0x14b
 [] generic_file_splice_write+0x5d/0x14b
 [] do_splice_from+0x74/0x91
 [] direct_splice_actor+0x1e/0x20
 [] splice_direct_to_actor+0xc2/0x17e
 [] ? do_splice_from+0x91/0x91
 [] do_splice_direct+0x47/0x5a
 [] __union_copy_up_locked+0x171/0x2b2
 [] __union_copy_up+0xea/0x132
 [] vfs_truncate+0x15e/0x289
 [] do_sys_truncate+0x46/0x83
 [] sys_truncate+0x9/0xb
 [] system_call_fastpath+0x16/0x1b
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread David Howells
Miklos Szeredi mik...@szeredi.hu wrote:

 Export do_splice_direct() to modules.  Needed by overlay filesystem.

Apparently you cannot call this from any function that is holding an i_mutex
if the target of the splice uses generic_file_splice_write().

The problem is a potential deadlock situation:

We have places already that do:

mnt_want_write()
mutex_lock()

This can be found in do_last() for example.

However, mnt_want_write() calls sb_start_write() as does
generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:

mutex_lock()
sb_start_write()

which lockdep reports as a potential ABBA deadlock.

Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
might operate, so it's possible that this is a false alarm.  Maybe Jan Kara can
illuminate further, so I've added him to the cc list.

I've attached the report I got with unionmount.

David
---
[ INFO: possible recursive locking detected ]
3.9.0-rc1-fsdevel+ #934 Not tainted
-
fs-op/4476 is trying to acquire lock:
 (sb_writers#4){.+.+.+}, at: [811087a4] 
generic_file_splice_write+0x5d/0x14b
but task is already holding lock:
 (sb_writers#4){.+.+.+}, at: [810ff97c] mnt_want_write+0x1f/0x46
other info that might help us debug this:
 Possible unsafe locking scenario:
   CPU0
   
  lock(sb_writers#4);
  lock(sb_writers#4);
 *** DEADLOCK ***
 May be due to missing lock nesting notation
2 locks held by fs-op/4476:
 #0:  (sb_writers#4){.+.+.+}, at: [810ff97c] mnt_want_write+0x1f/0x46
 #1:  (type-i_mutex_dir_key[1]){+.+.+.}, at: [81131c74] 
__union_copy_up+0x9a/0x132
stack backtrace:
Pid: 4476, comm: fs-op Not tainted 3.9.0-rc1-fsdevel+ #934
Call Trace:
 [81070398] __lock_acquire+0x86a/0x16cf
 [811081cc] ? page_cache_pipe_buf_release+0x1b/0x1b
 [810715e2] lock_acquire+0x57/0x6d
 [811087a4] ? generic_file_splice_write+0x5d/0x14b
 [810e3314] __sb_start_write+0x10d/0x15d
 [811087a4] ? generic_file_splice_write+0x5d/0x14b
 [811087a4] generic_file_splice_write+0x5d/0x14b
 [811083d5] do_splice_from+0x74/0x91
 [81108410] direct_splice_actor+0x1e/0x20
 [8110868b] splice_direct_to_actor+0xc2/0x17e
 [811083f2] ? do_splice_from+0x91/0x91
 [8110999d] do_splice_direct+0x47/0x5a
 [81131a99] __union_copy_up_locked+0x171/0x2b2
 [81131cc4] __union_copy_up+0xea/0x132
 [810e02ca] vfs_truncate+0x15e/0x289
 [810e043b] do_sys_truncate+0x46/0x83
 [810e05cf] sys_truncate+0x9/0xb
 [81456f92] system_call_fastpath+0x16/0x1b
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread Dave Chinner
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote:
 Miklos Szeredi mik...@szeredi.hu wrote:
 
  Export do_splice_direct() to modules.  Needed by overlay filesystem.
 
 Apparently you cannot call this from any function that is holding an i_mutex
 if the target of the splice uses generic_file_splice_write().
 
 The problem is a potential deadlock situation:
 
 We have places already that do:
 
   mnt_want_write()
   mutex_lock()
 
 This can be found in do_last() for example.
 
 However, mnt_want_write() calls sb_start_write() as does
 generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
 
   mutex_lock()
   sb_start_write()
 
 which lockdep reports as a potential ABBA deadlock.
 
 Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
 might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
 can
 illuminate further, so I've added him to the cc list.
 
 I've attached the report I got with unionmount.

There's plenty of problems with splice locking that can lead to
deadlocks. Here's another that's been known for ages:

http://oss.sgi.com/archives/xfs/2011-08/msg00168.html

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-13 Thread Andrew Morton
On Wed, 13 Mar 2013 15:16:26 +0100 Miklos Szeredi  wrote:

> From: Miklos Szeredi 
> 
> Export do_splice_direct() to modules.  Needed by overlay filesystem.
> 
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/splice.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 718bd00..0e8f44a 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, 
> struct file *out,
>  
>   return ret;
>  }
> +EXPORT_SYMBOL(do_splice_direct);

_GPL?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-13 Thread Andrew Morton
On Wed, 13 Mar 2013 15:16:26 +0100 Miklos Szeredi mik...@szeredi.hu wrote:

 From: Miklos Szeredi mszer...@suse.cz
 
 Export do_splice_direct() to modules.  Needed by overlay filesystem.
 
 Signed-off-by: Miklos Szeredi mszer...@suse.cz
 ---
  fs/splice.c |1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/fs/splice.c b/fs/splice.c
 index 718bd00..0e8f44a 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, 
 struct file *out,
  
   return ret;
  }
 +EXPORT_SYMBOL(do_splice_direct);

_GPL?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/