Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
Al Virowrote: > >* shoved into scatterlist, which gets fed into crypto/*.c machinery. > No way for a pipe_buffer stuff to get there, fortunately, because I would > be very surprised if it works correctly with compound pages and large > ranges in those. FWIW the crypto API has always been supposed to handle SG entries that cross page boundaries. There were a couple of bugs in this area but AFAIK they've all been fixed. Of course I cannot guarantee that every crypto driver also handles it correctly, but at least we have a few test vectors which test the page-crossing case specifically. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
On Sun, Sep 18, 2016 at 11:31:17PM +0100, Al Viro wrote: > At the moment there are 11 callers (10 in mainline; one more added in > conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant > anyway - it gets fed an iovec-backed iov_iter). I'm looking through those > right now, hopefully will come up with something sane... FWIW, I wonder how many of those users are ready to cope with compound pages in the first place; they end up passed to * skb_fill_page_desc(). Probably OK (as in all of them, modulo calculating the number of pages and ranges for them). * shoved into scatterlist, which gets passed to virtqueue_add_sgs(). Need to check virtio to see what happens there. * shoved into nfs ->wb_page and fed into nfs_pageio_add_request() and machinery behind it. These, BTW, are reachable by pipe_buffer-derived ones at the moment (splice to O_DIRECT nfs file). The code looks like it's playing fast and loose with ->wb_page - in some cases it's an NFS pagecache one, in some - anything from userland, and there are places like inode = page_file_mapping(req->wb_page)->host; which will do nasty things if they are ever reached by the second kind. nfs_pgio_rpcsetup() looks like it won't be happy with compound pages, but again, I'm not familiar enough with that code to tell if it's reachable from nfs_pageio_add_request(). * shoved into scatterlist, which gets fed into crypto/*.c machinery. No way for a pipe_buffer stuff to get there, fortunately, because I would be very surprised if it works correctly with compound pages and large ranges in those. * shoved into lustre ->ldp_pages; almost certainly not ready for compound pages. * fed to ceph_osd_data_pages_init(); again, practically certain not to be ready. * put into dio_submit ->pages[], eventually fed to bio_add_page(); that might be fixable, but it would take some massage in fs/direct-io.c * fuse - probably OK, but that's only on a fairly cursory look. It certainly won't be easy to verify in details ;-/
Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
On Sun, Sep 18, 2016 at 3:31 PM, Al Virowrote: > > What worries me is iov_iter_get_pages() and friends. So honestly, if it worries you, I'm not going to complain at all if you decide that you'd rather translate the pipe_buffer[] array into a kvec by always splitting at page boundaries. Even with large packets in networking, it's not going t be a huge deal. And maybe we *should* make it a rule that a "kvec" is always composed of individual entries that fit entirely within a page. In this code, being safe rather than clever would be a welcome and surprising change, I guess. Linus
Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
On Sun, Sep 18, 2016 at 01:12:21PM -0700, Linus Torvalds wrote: > So if the splice code ends up being confused by "this is not just > inside a single page", then the splice code is buggy, I think. > > Why would splice_write() cases be confused anyway? A filesystem needs > to be able to handle the case of "this needs to be split" regardless, > since even if the source buffer were to fit in a page, the offset > might obviously mean that the target won't fit in a page. What worries me is iov_iter_get_pages() and friends. The calling conventions are size = iov_iter_get_pages(iter, pages, maxlen, maxpages, ); They are convenient enough for most of the callers - we fill an array of pages, the first (and only in bvec case) one having start bytes skipped. The thing is, the calculation of the number of pages returned is broken in this case; normally it's ROUND_DIV_UP(start + n, PAGE_SIZE). That, of course, gets broken even by the offset being large enough. We don't have that many users of that thing (and iov_iter_get_pages_alloc()), but it'll need careful review. What's more, looking at those shows other fun issues: sg_init_table(sgl->sg, npages + 1); for (i = 0, len = n; i < npages; i++) { int plen = min_t(int, len, PAGE_SIZE - off); sg_set_page(sgl->sg + i, sgl->pages[i], plen, off); and that'll instantly blow up, due to PAGE_SIZE - off possibly becoming negative. That's af_alg_make_sg(), and it shouldn't see anything coming from pipe buffers (right now the only way for that to happen is iter_file_splice_write()), but the things like e.g. dio_refill_pages() might, and they also get seriously confused by that. Worse, some of those callers have calling conventions that have similar problems of their own. At the moment there are 11 callers (10 in mainline; one more added in conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant anyway - it gets fed an iovec-backed iov_iter). I'm looking through those right now, hopefully will come up with something sane...
Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
On Sun, Sep 18, 2016 at 12:31 PM, Al Virowrote: > FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments > might come from compound pages and I'm not entirely convinced that we won't > end up with coalesced fragments putting more than PAGE_SIZE into a single > pipe_buffer. And that could badly confuse a bunch of code. The pipe buffer code is actually *supposed* to handle any size allocations at all. They should *not* be limited by pages, exactly because the data can come from huge-pages or just multi-page allocations. It's definitely possible with networking, and networking is one of the *primary* targets of splice in many ways. So if the splice code ends up being confused by "this is not just inside a single page", then the splice code is buggy, I think. Why would splice_write() cases be confused anyway? A filesystem needs to be able to handle the case of "this needs to be split" regardless, since even if the source buffer were to fit in a page, the offset might obviously mean that the target won't fit in a page. Now, if you decide that you want to make the iterator always split those possibly big cases and never have big iovec entries, I guess that would potentially be ok. But my initial reaction is that they are perfectly normal and should be handled normally, and any code that depends on a splice buffer fitting in one page is just buggy and should be fixed. Linus
skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments might come from compound pages and I'm not entirely convinced that we won't end up with coalesced fragments putting more than PAGE_SIZE into a single pipe_buffer. And that could badly confuse a bunch of code. Can that legitimately happen? If so, we'll need to audit quite a few ->splice_write()-related codepaths; FUSE, in particular, is very likely to be unhappy with that kind of stuff, and it's not the only place where we might count upon never seeing e.g. longer than PAGE_SIZE chunks in bio_vec. It shouldn't be all that hard to fix, but if the whole thing is simply impossible, I would rather avoid that round of RTFS at the moment... Comments?