Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-20 Thread Herbert Xu
Al Viro  wrote:
>
>* shoved into scatterlist, which gets fed into crypto/*.c machinery.
> No way for a pipe_buffer stuff to get there, fortunately, because I would
> be very surprised if it works correctly with compound pages and large
> ranges in those.

FWIW the crypto API has always been supposed to handle SG entries
that cross page boundaries.  There were a couple of bugs in this
area but AFAIK they've all been fixed.

Of course I cannot guarantee that every crypto driver also handles
it correctly, but at least we have a few test vectors which test
the page-crossing case specifically.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Al Viro
On Sun, Sep 18, 2016 at 11:31:17PM +0100, Al Viro wrote:

> At the moment there are 11 callers (10 in mainline; one more added in
> conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
> anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
> right now, hopefully will come up with something sane...

FWIW, I wonder how many of those users are ready to cope with compound
pages in the first place; they end up passed to
* skb_fill_page_desc().  Probably OK (as in all of them, modulo
calculating the number of pages and ranges for them).
* shoved into scatterlist, which gets passed to virtqueue_add_sgs().
Need to check virtio to see what happens there.
* shoved into nfs ->wb_page and fed into nfs_pageio_add_request() and
machinery behind it.  These, BTW, are reachable by pipe_buffer-derived ones
at the moment (splice to O_DIRECT nfs file).  The code looks like it's
playing fast and loose with ->wb_page - in some cases it's an NFS pagecache
one, in some - anything from userland, and there are places like
inode = page_file_mapping(req->wb_page)->host;
which will do nasty things if they are ever reached by the second kind.
nfs_pgio_rpcsetup() looks like it won't be happy with compound pages, but
again, I'm not familiar enough with that code to tell if it's reachable
from nfs_pageio_add_request().
* shoved into scatterlist, which gets fed into crypto/*.c machinery.
No way for a pipe_buffer stuff to get there, fortunately, because I would
be very surprised if it works correctly with compound pages and large
ranges in those.
* shoved into lustre ->ldp_pages; almost certainly not ready for
compound pages.
* fed to ceph_osd_data_pages_init(); again, practically certain not
to be ready.
* put into dio_submit ->pages[], eventually fed to bio_add_page();
that might be fixable, but it would take some massage in fs/direct-io.c
* fuse - probably OK, but that's only on a fairly cursory look.

It certainly won't be easy to verify in details ;-/


Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Linus Torvalds
On Sun, Sep 18, 2016 at 3:31 PM, Al Viro  wrote:
>
> What worries me is iov_iter_get_pages() and friends.

So honestly, if it worries you, I'm not going to complain at all if
you decide that you'd rather translate the pipe_buffer[] array into a
kvec by always splitting at page boundaries.

Even with large packets in networking, it's not going t be a huge
deal. And maybe we *should* make it a rule that a "kvec" is always
composed of individual entries that fit entirely within a page.

In this code, being safe rather than clever would be a welcome and
surprising change, I guess.

 Linus


Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Al Viro
On Sun, Sep 18, 2016 at 01:12:21PM -0700, Linus Torvalds wrote:

> So if the splice code ends up being confused by "this is not just
> inside a single page", then the splice code is buggy, I think.
> 
> Why would splice_write() cases be confused anyway? A filesystem needs
> to be able to handle the case of "this needs to be split" regardless,
> since even if the source buffer were to fit in a page, the offset
> might obviously mean that the target won't fit in a page.

What worries me is iov_iter_get_pages() and friends.  The calling conventions
are
size = iov_iter_get_pages(iter, pages, maxlen, maxpages, );

They are convenient enough for most of the callers - we fill an array of
pages, the first (and only in bvec case) one having start bytes skipped.

The thing is, the calculation of the number of pages returned is broken
in this case; normally it's ROUND_DIV_UP(start + n, PAGE_SIZE).  That,
of course, gets broken even by the offset being large enough.  We don't
have that many users of that thing (and iov_iter_get_pages_alloc()), but
it'll need careful review.  What's more, looking at those shows other
fun issues:
sg_init_table(sgl->sg, npages + 1);

for (i = 0, len = n; i < npages; i++) {
int plen = min_t(int, len, PAGE_SIZE - off);

sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);

and that'll instantly blow up, due to PAGE_SIZE - off possibly becoming
negative.  That's af_alg_make_sg(), and it shouldn't see anything
coming from pipe buffers (right now the only way for that to happen is
iter_file_splice_write()), but the things like e.g. dio_refill_pages()
might, and they also get seriously confused by that.  Worse, some of those
callers have calling conventions that have similar problems of their own.

At the moment there are 11 callers (10 in mainline; one more added in
conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
right now, hopefully will come up with something sane...


Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Linus Torvalds
On Sun, Sep 18, 2016 at 12:31 PM, Al Viro  wrote:
> FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
> might come from compound pages and I'm not entirely convinced that we won't
> end up with coalesced fragments putting more than PAGE_SIZE into a single
> pipe_buffer.  And that could badly confuse a bunch of code.

The pipe buffer code is actually *supposed* to handle any size
allocations at all. They should *not* be limited by pages, exactly
because the data can come from huge-pages or just multi-page
allocations. It's definitely possible with networking, and networking
is one of the *primary* targets of splice in many ways.

So if the splice code ends up being confused by "this is not just
inside a single page", then the splice code is buggy, I think.

Why would splice_write() cases be confused anyway? A filesystem needs
to be able to handle the case of "this needs to be split" regardless,
since even if the source buffer were to fit in a page, the offset
might obviously mean that the target won't fit in a page.

Now, if you decide that you want to make the iterator always split
those possibly big cases and never have big iovec entries, I guess
that would potentially be ok. But my initial reaction is that they are
perfectly normal and should be handled normally, and any code that
depends on a splice buffer fitting in one page is just buggy and
should be fixed.

 Linus


skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Al Viro
FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
might come from compound pages and I'm not entirely convinced that we won't
end up with coalesced fragments putting more than PAGE_SIZE into a single
pipe_buffer.  And that could badly confuse a bunch of code.

Can that legitimately happen?  If so, we'll need to audit quite a few
->splice_write()-related codepaths; FUSE, in particular, is very likely
to be unhappy with that kind of stuff, and it's not the only place where
we might count upon never seeing e.g. longer than PAGE_SIZE chunks in
bio_vec.  It shouldn't be all that hard to fix, but if the whole thing
is simply impossible, I would rather avoid that round of RTFS at the moment...

Comments?