Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote: > On Thu 19-07-18 14:23:53, Martin Wilck wrote: > > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > > > Secondly, I don't think it is good to discard error from > > > bio_iov_iter_get_pages() here and just submit partial IO. It will > > > again > > > lead to part of IO being done as direct and part attempted to be > > > done > > > as > > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > > > behaves > > > differently - it aborts and returns error if > > > bio_iov_iter_get_pages() > > > ever > > > returned error. IMO we should do the same here. > > > > Well, it aborts the loop, but then (in the sync case) it still > > waits > > for the already submitted IOs to finish. Here, too, I'd find it > > more > > logical to return the number of successfully transmitted bytes > > rather > > than an error code. In the async case, the submitted bios are left > > in > > place, and will probably sooner or later finish, changing iocb->ki_pos. > > Well, both these behaviors make sense, just traditionally (defined by > our > implementation) DIO returns error even if part of IO has actually > been > successfully submitted. Making a userspace-visible change like you > suggest > thus has to be very carefully analyzed and frankly I don't think it's > worth > the bother. OK, maybe not. I'll rework the treatment of the error case. Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > > Well, there has never been a promise that it will grab *all* > > > pages in the > > > iter AFAIK. Practically, I think that it was just too hairy to > > > implement in > > > the macro magic that iter processing is... Al might know more > > > (added to > > > CC). > > > > Not really - it's more that VM has every right to refuse letting > > you pin > > an arbitrary amount of pages anyway. > > In which case the code after this patch isn't going to help either, > because > it still tries to pin it all, just in multiple calls to > get_user_pages(). That was not the point of the patch. It's not about a situation in which MM refuses to pin more pages. The patch is about the "iterator::next()" nature of bio_iov_iter_get_pages(). If it can't pin the pages, bio_iov_iter_get_pages() returns an error code (elsewhere in this thread we discussed how to treat that correctly). Otherwise, it always adds _some_ data to the bio, but the amount added depends on the segment structure of the input iov_iter. If the input iovec has just a single segment, it fills the bio in a single call. With multiple segments, it just returns the page(s) of the first segment. The point of my patch is to eliminate the difference between single-segment and multi-segment IOs. Regards Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
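The "iterator::next()" behaviour Martin describes — one call consumes one iovec segment, so a caller that wants the whole request must loop — can be sketched with a small userspace toy model. All names here are invented for illustration; this is not the kernel code, just its shape:

```c
/* Toy model of the per-segment semantics of bio_iov_iter_get_pages():
 * each call consumes only the FIRST remaining segment, so the caller
 * has to loop to gather the whole request into one "bio". */
#include <assert.h>
#include <stddef.h>

struct toy_iter {
	const size_t *seg_len;	/* byte length of each iovec segment */
	int nr_segs;
	int cur;		/* next segment to consume */
};

/* Like one bio_iov_iter_get_pages() call: adds data from one segment
 * only; returns the bytes added, 0 when the iter is exhausted. */
static size_t toy_get_pages(struct toy_iter *it, size_t *bio_bytes)
{
	if (it->cur >= it->nr_segs)
		return 0;
	*bio_bytes += it->seg_len[it->cur];
	return it->seg_len[it->cur++];
}

/* The fix discussed in this thread: keep calling until the whole
 * request has been gathered into the single "bio". */
static size_t fill_bio(struct toy_iter *it)
{
	size_t bio_bytes = 0;

	while (toy_get_pages(it, &bio_bytes) > 0)
		;
	return bio_bytes;
}
```

With a single 1 MiB segment one call already gathers everything; with four 4 KiB segments one call gets only 4 KiB and the loop is what collects the rest.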
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote: > > Yeah. Actually previous version of the fix (not posted publicly) > > submitted > > partial bio and then reused the bio to submit more. This is also > > the way > > __blkdev_direct_IO operates. Martin optimized this to fill the > > bio > > completely (as we know we have enough bvecs) before submitting > > which has > > chances to perform better. I'm fine with either approach, just we > > have to > > decide which way to go. > > I think this first version is going to be less fragile, so I think we > should > aim for that. Wait a second. We changed the approach for a reason. By submitting partial bios synchronously and sequentially, we lose a lot of performance, so much that this "fast path" quickly falls behind the regular __blkdev_direct_IO() path as the number of input IO segments increases. The patch I submitted avoids that. The whole point of this fast path is to submit a single bio, and return as quickly as possible. It's not an error condition if bio_iov_iter_get_pages() returns less data than possible. The function just takes one iteration step at a time; iterating over it until the desired number of pages is obtained, and collecting the resulting pages in a single bio, isn't "fragile", it's just the natural thing to do. Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 17:21 +0200, Jan Kara wrote: > On Thu 19-07-18 20:20:51, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > > > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > > > bio_iov_iter_get_pages() returns only pages for a single non- > > > > > empty > > > > > segment of the input iov_iter's iovec. This may be much less > > > > > than the number > > > > > of pages __blkdev_direct_IO_simple() is supposed to process. > > > > > Call > > > > > bio_iov_iter_get_pages() repeatedly until either the > > > > > requested number > > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this > > > > > is not done, > > > > > short writes or reads may occur for direct synchronous IOs > > > > > with multiple > > > > > iovec slots (such as generated by writev()). In that case, > > > > > __generic_file_write_iter() falls back to buffered writes, > > > > > which > > > > > has been observed to cause data corruption in certain > > > > > workloads. > > > > > > > > > > Note: if segments aren't page-aligned in the input iovec, > > > > > this patch may > > > > > result in multiple adjacent slots of the bi_io_vec array to > > > > > reference the same > > > > > page (the byte ranges are guaranteed to be disjunct if the > > > > > preceding patch is > > > > > applied). We haven't seen problems with that in our and the > > > > > customer's > > > > > tests. It'd be possible to detect this situation and merge > > > > > bi_io_vec slots > > > > > that refer to the same page, but I prefer to keep it simple > > > > > for now. 
> > > > > > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO > > > > > for simplified bdev direct-io") > > > > > Signed-off-by: Martin Wilck > > > > > --- > > > > > fs/block_dev.c | 8 +++- > > > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > > index 0dd87aa..41643c4 100644 > > > > > --- a/fs/block_dev.c > > > > > +++ b/fs/block_dev.c > > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb > > > > > *iocb, struct iov_iter *iter, > > > > > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > > > if (unlikely(ret)) > > > > > - return ret; > > > > > + goto out; > > > > > + > > > > > + while (ret == 0 && > > > > > + bio.bi_vcnt < bio.bi_max_vecs && > > > > > iov_iter_count(iter) > 0) > > > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > > > + > > > > > ret = bio.bi_iter.bi_size; > > > > > > > > > > if (iov_iter_rw(iter) == READ) { > > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb > > > > > *iocb, struct iov_iter *iter, > > > > > put_page(bvec->bv_page); > > > > > } > > > > > > > > > > +out: > > > > > if (vecs != inline_vecs) > > > > > kfree(vecs); > > > > > > > > > > > > > You might put the 'vecs' leak fix into another patch, and reuse > > > > the > > > > current code block for that. > > > > > > > > Looks all users of bio_iov_iter_get_pages() need this kind of > > > > fix, so > > > > what do you think about the following way? > > > > > > No. AFAICT all the other users of bio_iov_iter_get_pages() are > > > perfectly > > > fine with it returning less pages and they loop appropriately. > > > > OK, but this way may still make one bio hold more data, > > especially > > the comment of bio_iov_iter_get_pages() says that 'Pins as many > > pages from > > *iter', so it looks like the correct way to do it. > > Well, but as Al pointed out MM may decide that user has already > pinned too > many pages and refuse to pin more. 
So pinning full iter worth of > pages may > just be impossible. Currently, there are no checks like this in MM > but > eventually, I'd like to account pinned pages in mlock ulimit (or a > limit of > similar kind) and then what Al speaks about would become very real. > So I'd > prefer to not develop new locations that must be able to pin > arbitrary > amount of pages. Doesn't this wrongfully penalize scatter-gather IO? Consider two 1 MiB IO requests, one with a single 1 MiB slot, and one with 256 single-page slots. Calling bio_iov_iter_get_pages() once will result in a 1 MiB bio in the first case, and a 4k bio in the second. You need to submit 256 individual bios to finish the IO. By using the approach which my patch takes (repeatedly calling bio_iov_iter_get_pages() before submitting the bio), you'd get a 1 MiB bio in both cases. So Ming's idea to generalize this may not be so bad, after all. I don't see that it would increase the pressure on MM. If we are concerned about MM, we should limit the total number of pages that can be pinned, rather than the number of iovec segments that are processed. Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
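The two request shapes Martin compares can be built from userspace with writev(). The sketch below is illustrative only: it writes to a plain temporary file rather than an O_DIRECT block device (the bio packing under discussion happens inside the kernel and is not observable here), and the helper name is made up. It shows that the same 1 MiB payload can arrive as one slot or as 256 page-sized slots:

```c
/* Build an nseg-slot iovec over a 1 MiB buffer and submit it with a
 * single writev() call: nseg == 1 gives one 1 MiB slot, nseg == 256
 * gives 256 single-page (4 KiB) slots.  Plain buffered file IO here,
 * just to demonstrate the two iovec layouts. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define PAGE_SZ 4096
#define TOTAL   (1 << 20)	/* 1 MiB */

static ssize_t write_sg(FILE *f, char *buf, int nseg)
{
	struct iovec *iov = calloc(nseg, sizeof(*iov));
	size_t seg = TOTAL / nseg;
	ssize_t ret;

	for (int i = 0; i < nseg; i++) {
		iov[i].iov_base = buf + i * seg;
		iov[i].iov_len = seg;
	}
	ret = writev(fileno(f), iov, nseg);	/* one syscall either way */
	free(iov);
	return ret;
}
```

Both layouts transfer the full megabyte in one writev() call; the thread's argument is about how many bios the kernel builds underneath for each shape.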
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 20:20:51, Ming Lei wrote: > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > > segment of the input iov_iter's iovec. This may be much less than the > > > > number > > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > > bio_iov_iter_get_pages() repeatedly until either the requested number > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > > > short writes or reads may occur for direct synchronous IOs with multiple > > > > iovec slots (such as generated by writev()). In that case, > > > > __generic_file_write_iter() falls back to buffered writes, which > > > > has been observed to cause data corruption in certain workloads. > > > > > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > > > result in multiple adjacent slots of the bi_io_vec array to reference > > > > the same > > > > page (the byte ranges are guaranteed to be disjunct if the preceding > > > > patch is > > > > applied). We haven't seen problems with that in our and the customer's > > > > tests. It'd be possible to detect this situation and merge bi_io_vec > > > > slots > > > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > > > simplified bdev direct-io") > > > > Signed-off-by: Martin Wilck > > > > --- > > > > fs/block_dev.c | 8 +++- > > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > index 0dd87aa..41643c4 100644 > > > > --- a/fs/block_dev.c > > > > +++ b/fs/block_dev.c > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > > > struct iov_iter *iter, > > > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > > if (unlikely(ret)) > > > > - return ret; > > > > + goto out; > > > > + > > > > + while (ret == 0 && > > > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > > > > > 0) > > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > > + > > > > ret = bio.bi_iter.bi_size; > > > > > > > > if (iov_iter_rw(iter) == READ) { > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > > > struct iov_iter *iter, > > > > put_page(bvec->bv_page); > > > > } > > > > > > > > +out: > > > > if (vecs != inline_vecs) > > > > kfree(vecs); > > > > > > > > > > You might put the 'vecs' leak fix into another patch, and reuse the > > > current code block for that. > > > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > > > what do you think about the following way? > > > > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly > > fine with it returning less pages and they loop appropriately. > > OK, but this way may still make one bio hold more data, especially > the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from > *iter', so it looks like the correct way to do it. Well, but as Al pointed out MM may decide that user has already pinned too many pages and refuse to pin more. So pinning full iter worth of pages may just be impossible. 
Currently, there are no checks like this in MM but eventually, I'd like to account pinned pages in mlock ulimit (or a limit of similar kind) and then what Al speaks about would become very real. So I'd prefer to not develop new locations that must be able to pin arbitrary amount of pages. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 14:23:53, Martin Wilck wrote: > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > > Secondly, I don't think it is good to discard error from > > bio_iov_iter_get_pages() here and just submit partial IO. It will > > again > > lead to part of IO being done as direct and part attempted to be done > > as > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > > behaves > > differently - it aborts and returns error if bio_iov_iter_get_pages() > > ever > > returned error. IMO we should do the same here. > > Well, it aborts the loop, but then (in the sync case) it still waits > for the already submitted IOs to finish. Here, too, I'd find it more > logical to return the number of successfully transmitted bytes rather > than an error code. In the async case, the submitted bios are left in > place, and will probably sooner or later finish, changing iocb->ki_pos. Well, both these behaviors make sense, just traditionally (defined by our implementation) DIO returns error even if part of IO has actually been successfully submitted. Making a userspace-visible change like you suggest thus has to be very carefully analyzed and frankly I don't think it's worth the bother. > I'm actually not quite certain if that's correct. In the sync case, it > causes the already-performed IO to be done again, buffered. In the > async case, it may even cause two IOs for the same range to be in > flight at the same time ... ? It doesn't cause IO to be done again. Look at __generic_file_write_iter(). If generic_file_direct_write() returned error, we immediately return error as well without retrying buffered IO. We only retry buffered IO for partial (or 0) return value. Honza -- Jan Kara SUSE Labs, CR
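The fallback rule Jan describes can be summarized in a toy decision function (a simplified sketch with invented names, not the kernel source): an error from the direct path is passed through untouched, while a short or zero direct write is completed with buffered IO for the remaining bytes only, so no range is written twice.

```c
/* Toy model of the __generic_file_write_iter() decision Jan describes.
 * direct_ret: result of the direct-IO attempt (bytes or -errno);
 * count: total bytes requested.  Returns total bytes "written" or the
 * negative error code. */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

static ssize_t toy_write_iter(ssize_t direct_ret, size_t count)
{
	if (direct_ret < 0)
		return direct_ret;	/* error: no buffered retry at all */
	if ((size_t)direct_ret >= count)
		return direct_ret;	/* everything went direct */
	/* Short (or zero) direct write: retry buffered, but only for the
	 * tail the direct path did not cover -- nothing is done twice. */
	return direct_ret + (ssize_t)(count - (size_t)direct_ret);
}
```

This is why returning an error versus returning a short byte count from the direct path leads to such different userspace-visible behaviour.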
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote: > Yeah. Actually previous version of the fix (not posted publicly) submitted > partial bio and then reused the bio to submit more. This is also the way > __blkdev_direct_IO operates. Martin optimized this to fill the bio > completely (as we know we have enough bvecs) before submitting which has > chances to perform better. I'm fine with either approach, just we have to > decide which way to go. I think this first version is going to be less fragile, so I think we should aim for that.
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 16:53:51, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > > Well, there has never been a promise that it will grab *all* pages in the > > > iter AFAIK. Practically, I think that it was just too hairy to implement > > > in > > > the macro magic that iter processing is... Al might know more (added to > > > CC). > > > > Not really - it's more that VM has every right to refuse letting you pin > > an arbitrary amount of pages anyway. > > In which case the code after this patch isn't going to help either, because > it still tries to pin it all, just in multiple calls to get_user_pages(). Yeah. Actually previous version of the fix (not posted publicly) submitted partial bio and then reused the bio to submit more. This is also the way __blkdev_direct_IO operates. Martin optimized this to fill the bio completely (as we know we have enough bvecs) before submitting which has chances to perform better. I'm fine with either approach, just we have to decide which way to go. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > Well, there has never been a promise that it will grab *all* pages in the > > iter AFAIK. Practically, I think that it was just too hairy to implement in > > the macro magic that iter processing is... Al might know more (added to > > CC). > > Not really - it's more that VM has every right to refuse letting you pin > an arbitrary amount of pages anyway. In which case the code after this patch isn't going to help either, because it still tries to pin it all, just in multiple calls to get_user_pages().
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 13:56 +0200, Jan Kara wrote: > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non- > > > empty > > > segment of the input iov_iter's iovec. This may be much less than > > > the number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > bio_iov_iter_get_pages() repeatedly until either the requested > > > number > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is > > > not done, > > > short writes or reads may occur for direct synchronous IOs with > > > multiple > > > iovec slots (such as generated by writev()). In that case, > > > __generic_file_write_iter() falls back to buffered writes, which > > > has been observed to cause data corruption in certain workloads. > > > > > > Note: if segments aren't page-aligned in the input iovec, this > > > patch may > > > result in multiple adjacent slots of the bi_io_vec array to > > > reference the same > > > page (the byte ranges are guaranteed to be disjunct if the > > > preceding patch is > > > applied). We haven't seen problems with that in our and the > > > customer's > > > tests. It'd be possible to detect this situation and merge > > > bi_io_vec slots > > > that refer to the same page, but I prefer to keep it simple for > > > now. 
> > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > > simplified bdev direct-io") > > > Signed-off-by: Martin Wilck > > > --- > > > fs/block_dev.c | 8 +++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 0dd87aa..41643c4 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb > > > *iocb, struct iov_iter *iter, > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > if (unlikely(ret)) > > > - return ret; > > > + goto out; > > > + > > > + while (ret == 0 && > > > + bio.bi_vcnt < bio.bi_max_vecs && > > > iov_iter_count(iter) > 0) > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > + > > > ret = bio.bi_iter.bi_size; > > > > > > if (iov_iter_rw(iter) == READ) { > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > > struct iov_iter *iter, > > > put_page(bvec->bv_page); > > > } > > > > > > +out: > > > if (vecs != inline_vecs) > > > kfree(vecs); > > > > > > > You might put the 'vecs' leak fix into another patch, and reuse the > > current code block for that. > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, > > so > > what do you think about the following way? > > No. AFAICT all the other users of bio_iov_iter_get_pages() are > perfectly > fine with it returning less pages and they loop appropriately. We might consider filling the bios more fully in __blkdev_direct_IO(), too; it might be a performance gain. Regards Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > On Thu 19-07-18 11:39:18, Martin Wilck wrote: > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > segment of the input iov_iter's iovec. This may be much less than > > the number > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > bio_iov_iter_get_pages() repeatedly until either the requested > > number > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not > > done, > > short writes or reads may occur for direct synchronous IOs with > > multiple > > iovec slots (such as generated by writev()). In that case, > > __generic_file_write_iter() falls back to buffered writes, which > > has been observed to cause data corruption in certain workloads. > > > > Note: if segments aren't page-aligned in the input iovec, this > > patch may > > result in multiple adjacent slots of the bi_io_vec array to > > reference the same > > page (the byte ranges are guaranteed to be disjunct if the > > preceding patch is > > applied). We haven't seen problems with that in our and the > > customer's > > tests. It'd be possible to detect this situation and merge > > bi_io_vec slots > > that refer to the same page, but I prefer to keep it simple for > > now. 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > simplified bdev direct-io") > > Signed-off-by: Martin Wilck > > --- > > fs/block_dev.c | 8 +++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 0dd87aa..41643c4 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > struct iov_iter *iter, > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > if (unlikely(ret)) > > - return ret; > > + goto out; > > + > > + while (ret == 0 && > > + bio.bi_vcnt < bio.bi_max_vecs && > > iov_iter_count(iter) > 0) > > + ret = bio_iov_iter_get_pages(&bio, iter); > > + > > I have two suggestions here (posting them now in public): > > Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we > made > sure we have enough vecs for pages in iter. So I'd WARN if this isn't > true. Yeah. I wanted to add that to the patch. Slipped through, somehow. Sorry about that. > Secondly, I don't think it is good to discard error from > bio_iov_iter_get_pages() here and just submit partial IO. It will > again > lead to part of IO being done as direct and part attempted to be done > as > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > behaves > differently - it aborts and returns error if bio_iov_iter_get_pages() > ever > returned error. IMO we should do the same here. Well, it aborts the loop, but then (in the sync case) it still waits for the already submitted IOs to finish. Here, too, I'd find it more logical to return the number of successfully transmitted bytes rather than an error code. In the async case, the submitted bios are left in place, and will probably sooner or later finish, changing iocb->ki_pos. I'm actually not quite certain if that's correct. In the sync case, it causes the already-performed IO to be done again, buffered. 
In the async case, it may even cause two IOs for the same range to be in flight at the same time ... ? Martin -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > segment of the input iov_iter's iovec. This may be much less than the > > > number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > bio_iov_iter_get_pages() repeatedly until either the requested number > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > > short writes or reads may occur for direct synchronous IOs with multiple > > > iovec slots (such as generated by writev()). In that case, > > > __generic_file_write_iter() falls back to buffered writes, which > > > has been observed to cause data corruption in certain workloads. > > > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > > result in multiple adjacent slots of the bi_io_vec array to reference the > > > same > > > page (the byte ranges are guaranteed to be disjunct if the preceding > > > patch is > > > applied). We haven't seen problems with that in our and the customer's > > > tests. It'd be possible to detect this situation and merge bi_io_vec slots > > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > > simplified bdev direct-io") > > > Signed-off-by: Martin Wilck > > > --- > > > fs/block_dev.c | 8 +++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 0dd87aa..41643c4 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct > > > iov_iter *iter, > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > if (unlikely(ret)) > > > - return ret; > > > + goto out; > > > + > > > + while (ret == 0 && > > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > + > > > ret = bio.bi_iter.bi_size; > > > > > > if (iov_iter_rw(iter) == READ) { > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct > > > iov_iter *iter, > > > put_page(bvec->bv_page); > > > } > > > > > > +out: > > > if (vecs != inline_vecs) > > > kfree(vecs); > > > > > > > You might put the 'vecs' leak fix into another patch, and reuse the > > current code block for that. > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > > what do you think about the following way? > > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly > fine with it returning less pages and they loop appropriately. OK, but this way may still make one bio hold more data, especially the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from *iter', so it looks like the correct way to do it. thanks, Ming
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 19:04:46, Ming Lei wrote: > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > segment of the input iov_iter's iovec. This may be much less than the number > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > bio_iov_iter_get_pages() repeatedly until either the requested number > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > short writes or reads may occur for direct synchronous IOs with multiple > > iovec slots (such as generated by writev()). In that case, > > __generic_file_write_iter() falls back to buffered writes, which > > has been observed to cause data corruption in certain workloads. > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > result in multiple adjacent slots of the bi_io_vec array to reference the > > same > > page (the byte ranges are guaranteed to be disjunct if the preceding patch > > is > > applied). We haven't seen problems with that in our and the customer's > > tests. It'd be possible to detect this situation and merge bi_io_vec slots > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified > > bdev direct-io") > > Signed-off-by: Martin Wilck > > --- > > fs/block_dev.c | 8 +++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 0dd87aa..41643c4 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct > > iov_iter *iter, > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > if (unlikely(ret)) > > - return ret; > > + goto out; > > + > > + while (ret == 0 && > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > > + ret = bio_iov_iter_get_pages(&bio, iter); > > + > > ret = bio.bi_iter.bi_size; > > > > if (iov_iter_rw(iter) == READ) { > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct > > iov_iter *iter, > > put_page(bvec->bv_page); > > } > > > > +out: > > if (vecs != inline_vecs) > > kfree(vecs); > > > > You might put the 'vecs' leak fix into another patch, and reuse the > current code block for that. > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > what do you think about the following way? No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly fine with it returning less pages and they loop appropriately. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote: > On Thu 19-07-18 18:21:23, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > segment of the input iov_iter's iovec. This may be much less than the > > > number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve > > as many pages as possible since both 'maxsize' and 'maxpages' are provided > > to cover all. Huh? Why does having a way to say "I (the caller) don't want more than " imply anything of that sort? It never had been promised, explicitly or not... > > So the question is why iov_iter_get_pages() doesn't work as expected? > > Well, there has never been a promise that it will grab *all* pages in the > iter AFAIK. Practically, I think that it was just too hairy to implement in > the macro magic that iter processing is... Al might know more (added to > CC). Not really - it's more that VM has every right to refuse letting you pin an arbitrary amount of pages anyway.
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the
> number of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the
> same page (the byte ranges are guaranteed to be disjunct if the preceding
> patch is applied). We haven't seen problems with that in our and the
> customer's tests. It'd be possible to detect this situation and merge
> bi_io_vec slots that refer to the same page, but I prefer to keep it
> simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified
> bdev direct-io")
> Signed-off-by: Martin Wilck
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +
>  	ret = bio.bi_iter.bi_size;
>
>  	if (iov_iter_rw(iter) == READ) {
> @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  		put_page(bvec->bv_page);
>  	}
>
> +out:
>  	if (vecs != inline_vecs)
>  		kfree(vecs);
>

You might put the 'vecs' leak fix into another patch, and reuse the current
code block for that.

It looks like all users of bio_iov_iter_get_pages() need this kind of fix,
so what do you think about the following way?

diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..23dd4c163dfc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -904,15 +904,7 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
-/**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
- * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
- *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -951,6 +943,28 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_advance(iter, size);
 	return 0;
 }
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	int ret;
+	unsigned int size;
+
+	do {
+		size = bio->bi_iter.bi_size;
+		ret = __bio_iov_iter_get_pages(bio, iter);
+	} while (!bio_full(bio) && iov_iter_count(iter) > 0 &&
+		 bio->bi_iter.bi_size > size);
+
+	return bio->bi_iter.bi_size > 0 ? 0 : ret;
+}
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)

Thanks,
Ming
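Ming's wrapper keeps calling a step function while it makes forward progress, and reports success if any data was added at all. A rough userspace model of that control flow, for readers without the kernel sources at hand; all toy_* names are illustrative, not kernel API:

```c
/*
 * Toy model of the suggested wrapper: a single "step" (standing in for
 * __bio_iov_iter_get_pages()) may consume only one segment at a time,
 * and the wrapper loops while the "bio" keeps growing.
 */
#include <stddef.h>

struct toy_iter { const size_t *segs; int nsegs; int cur; };
struct toy_bio  { size_t size; size_t max; };

/* One step: consume at most one segment, or fail if nothing fits. */
static int toy_step(struct toy_bio *bio, struct toy_iter *it)
{
	if (it->cur >= it->nsegs || bio->size + it->segs[it->cur] > bio->max)
		return -1;
	bio->size += it->segs[it->cur++];
	return 0;
}

/*
 * Wrapper: keep stepping while forward progress is being made, and
 * report success if any data was added, mirroring the
 * "bi_size > 0 ? 0 : ret" return above.
 */
static int toy_get_pages(struct toy_bio *bio, struct toy_iter *it)
{
	int ret;
	size_t size;

	do {
		size = bio->size;
		ret = toy_step(bio, it);
	} while (bio->size < bio->max && it->cur < it->nsegs &&
		 bio->size > size);

	return bio->size > 0 ? 0 : ret;
}
```

Note the design choice this models: an error from a later step is swallowed if earlier steps already added data, which is exactly the partial-IO behavior Jan questions elsewhere in the thread.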
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the
> > > number of pages __blkdev_direct_IO_simple() is supposed to process. Call
> >
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> > as many pages as possible, since both 'maxsize' and 'maxpages' are
> > provided to cover all of them.
> >
> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

OK, I got it: given that get_user_pages_fast() may fail, it is reasonable to
see fewer pages retrieved.

Thanks,
Ming
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the
> number of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the
> same page (the byte ranges are guaranteed to be disjunct if the preceding
> patch is applied). We haven't seen problems with that in our and the
> customer's tests. It'd be possible to detect this situation and merge
> bi_io_vec slots that refer to the same page, but I prefer to keep it
> simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified
> bdev direct-io")
> Signed-off-by: Martin Wilck
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

First, the condition bio.bi_vcnt < bio.bi_max_vecs should always be true -
we made sure we have enough vecs for the pages in the iter. So I'd WARN if
this isn't true.

Secondly, I don't think it is good to discard an error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of the IO being done as direct and part attempted to be done
as buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns an error if bio_iov_iter_get_pages()
ever returned an error. IMO we should do the same here.

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the
> > number of pages __blkdev_direct_IO_simple() is supposed to process. Call
>
> In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> as many pages as possible, since both 'maxsize' and 'maxpages' are
> provided to cover all of them.
>
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to
CC).

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the
> number of pages __blkdev_direct_IO_simple() is supposed to process. Call

In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
as many pages as possible, since both 'maxsize' and 'maxpages' are provided
to cover all of them.

So the question is why iov_iter_get_pages() doesn't work as expected?

Thanks,
Ming
Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the
> number of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the
> same page (the byte ranges are guaranteed to be disjunct if the preceding
> patch is applied). We haven't seen problems with that in our and the
> customer's tests. It'd be possible to detect this situation and merge
> bi_io_vec slots that refer to the same page, but I prefer to keep it
> simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified
> bdev direct-io")
> Signed-off-by: Martin Wilck
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
[PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
bio_iov_iter_get_pages() returns only pages for a single non-empty segment
of the input iov_iter's iovec. This may be much less than the number of
pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number of
bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, short
writes or reads may occur for direct synchronous IOs with multiple iovec
slots (such as generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which has been
observed to cause data corruption in certain workloads.

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array referencing the
same page (the byte ranges are guaranteed to be disjunct if the preceding
patch is applied). We haven't seen problems with that in our and the
customer's tests. It'd be possible to detect this situation and merge
bi_io_vec slots that refer to the same page, but I prefer to keep it simple
for now.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
+
+	while (ret == 0 &&
+	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+		ret = bio_iov_iter_get_pages(&bio, iter);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
+out:
 	if (vecs != inline_vecs)
 		kfree(vecs);
-- 
2.17.1
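The failing case described in the commit message is a single vectored write with several iovec segments. A minimal userspace sketch of that I/O shape follows; O_DIRECT, which is required to actually reach __blkdev_direct_IO_simple() on a block device, is deliberately left out so the sketch runs against any regular file, and the path name is an example:

```c
/*
 * Sketch of the userspace pattern behind the bug: one writev() call
 * with several page-sized iovec segments.  The short-write symptom
 * only appears with O_DIRECT on an unpatched kernel; this runnable
 * version writes to a plain file instead.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write three separately allocated, page-aligned segments in one call
 * and return the byte count the kernel reports. */
static ssize_t multi_segment_write(const char *path)
{
	enum { NSEGS = 3, SEGSZ = 4096 };
	struct iovec iov[NSEGS];
	ssize_t ret;
	int fd, i;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	for (i = 0; i < NSEGS; i++) {
		/* page-aligned buffers, as O_DIRECT would require */
		if (posix_memalign(&iov[i].iov_base, 4096, SEGSZ)) {
			close(fd);
			return -1;
		}
		memset(iov[i].iov_base, 'A' + i, SEGSZ);
		iov[i].iov_len = SEGSZ;
	}

	ret = writev(fd, iov, NSEGS);	/* full count expected: 3 * 4096 */

	for (i = 0; i < NSEGS; i++)
		free(iov[i].iov_base);
	close(fd);
	return ret;
}
```

With O_DIRECT on a block device and an unpatched kernel, the single call to bio_iov_iter_get_pages() would pick up only the first 4096-byte segment here, and the resulting short direct write would fall back to the buffered path for the remainder.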