Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote:
> On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > > Secondly, I don't think it is good to discard error from
> > > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > > again lead to part of IO being done as direct and part attempted
> > > to be done as buffered. Also the "slow" direct IO path in
> > > __blkdev_direct_IO() behaves differently - it aborts and returns
> > > error if bio_iov_iter_get_pages() ever returned error. IMO we
> > > should do the same here.
> > 
> > Well, it aborts the loop, but then (in the sync case) it still
> > waits for the already submitted IOs to finish. Here, too, I'd find
> > it more logical to return the number of successfully transmitted
> > bytes rather than an error code. In the async case, the submitted
> > bios are left in place, and will probably sooner or later finish,
> > changing iocb->ki_pos.
> 
> Well, both these behaviors make sense, just traditionally (defined by
> our implementation) DIO returns error even if part of IO has actually
> been successfully submitted. Making a userspace visible change like
> you suggest thus has to be very carefully analyzed and frankly I
> don't think it's worth the bother.

OK, maybe not.
I'll rework the treatment of the error case.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all*
> > > pages in the iter AFAIK. Practically, I think that it was just
> > > too hairy to implement in the macro magic that iter processing
> > > is... Al might know more (added to CC).
> > 
> > Not really - it's more that VM has every right to refuse letting
> > you pin an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either,
> because it still tries to pin it all, just in multiple calls to
> get_user_pages().

That was not the point of the patch. It's not about a situation in
which MM refuses to pin more pages.

The patch is about the "iterator::next()" nature of
bio_iov_iter_get_pages().

If it can't pin the pages, bio_iov_iter_get_pages() returns an error
code (elsewhere in this thread we discussed how to treat that right).
Otherwise, it always adds _some_ data to the bio, but the amount added
depends on the segment structure of the input iov_iter. If the input
iovec has just a single segment, it fills the bio in a single call.
With multiple segments, it just returns the page(s) of the first
segment. The point of my patch is to eliminate this difference between
single-segment and multi-segment IOs.

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> Yeah. Actually previous version of the fix (not posted publicly)
> submitted partial bio and then reused the bio to submit more. This is
> also the way __blkdev_direct_IO operates. Martin optimized this to
> fill the bio completely (as we know we have enough bvecs) before
> submitting which has chances to perform better. I'm fine with either
> approach, just we have to decide which way to go.
> 
> I think this first version is going to be less fragile, so I think we
> should aim for that.

Wait a second. We changed the approach for a reason. By submitting
partial bios synchronously and sequentially, we lose a lot of
performance, so much that this "fast path" quickly falls behind the
regular __blkdev_direct_IO() path as the number of input IO segments
increases. The patch I submitted avoids that. The whole point of this
fast path is to submit a single bio, and return as quickly as possible.

It's not an error condition if bio_iov_iter_get_pages() returns less
data than requested. The function just takes one iteration step at a
time, so iterating over it until the desired number of pages is
obtained, and collecting the resulting pages in a single bio, isn't
"fragile" - it's just the natural thing to do.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 17:21 +0200, Jan Kara wrote:
> On Thu 19-07-18 20:20:51, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > > bio_iov_iter_get_pages() returns only pages for a single non-
> > > > > empty
> > > > > segment of the input iov_iter's iovec. This may be much less
> > > > > than the number
> > > > > of pages __blkdev_direct_IO_simple() is supposed to process.
> > > > > Call
> > > > > bio_iov_iter_get_pages() repeatedly until either the
> > > > > requested number
> > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this
> > > > > is not done,
> > > > > short writes or reads may occur for direct synchronous IOs
> > > > > with multiple
> > > > > iovec slots (such as generated by writev()). In that case,
> > > > > __generic_file_write_iter() falls back to buffered writes,
> > > > > which
> > > > > has been observed to cause data corruption in certain
> > > > > workloads.
> > > > > 
> > > > > Note: if segments aren't page-aligned in the input iovec,
> > > > > this patch may
> > > > > result in multiple adjacent slots of the bi_io_vec array to
> > > > > reference the same
> > > > > page (the byte ranges are guaranteed to be disjunct if the
> > > > > preceding patch is
> > > > > applied). We haven't seen problems with that in our and the
> > > > > customer's
> > > > > tests. It'd be possible to detect this situation and merge
> > > > > bi_io_vec slots
> > > > > that refer to the same page, but I prefer to keep it simple
> > > > > for now.
> > > > > 
> > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for simplified bdev direct-io")
> > > > > Signed-off-by: Martin Wilck 
> > > > > ---
> > > > >  fs/block_dev.c | 8 +++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > > index 0dd87aa..41643c4 100644
> > > > > --- a/fs/block_dev.c
> > > > > +++ b/fs/block_dev.c
> > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb
> > > > > *iocb, struct iov_iter *iter,
> > > > >  
> > > > >   ret = bio_iov_iter_get_pages(&bio, iter);
> > > > >   if (unlikely(ret))
> > > > > - return ret;
> > > > > + goto out;
> > > > > +
> > > > > + while (ret == 0 &&
> > > > > +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > > > + ret = bio_iov_iter_get_pages(&bio, iter);
> > > > > +
> > > > >   ret = bio.bi_iter.bi_size;
> > > > >  
> > > > >   if (iov_iter_rw(iter) == READ) {
> > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb
> > > > > *iocb, struct iov_iter *iter,
> > > > >   put_page(bvec->bv_page);
> > > > >   }
> > > > >  
> > > > > +out:
> > > > >   if (vecs != inline_vecs)
> > > > >   kfree(vecs);
> > > > > 
> > > > 
> > > > You might put the 'vecs' leak fix into another patch, and reuse
> > > > the current code block for that.
> > > > 
> > > > Looks all users of bio_iov_iter_get_pages() need this kind of
> > > > fix, so
> > > > what do you think about the following way?
> > > 
> > > No. AFAICT all the other users of bio_iov_iter_get_pages() are
> > > perfectly
> > > fine with it returning less pages and they loop appropriately.
> > 
> > OK, but this way still may make one bio hold more data; especially
> > since the comment of bio_iov_iter_get_pages() says that it 'Pins as
> > many pages from *iter', this looks like the correct way to do it.
> 
> Well, but as Al pointed out, MM may decide that the user has already
> pinned too many pages and refuse to pin more. So pinning a full
> iter's worth of pages may just be impossible. Currently, there are no
> checks like this in MM but eventually, I'd like to account pinned
> pages in the mlock ulimit (or a limit of similar kind) and then what
> Al speaks about would become very real. So I'd prefer not to develop
> new locations that must be able to pin an arbitrary amount of pages.

Doesn't this wrongfully penalize scatter-gather IO? Consider two 1 MiB
IO requests, one with a single 1 MiB slot, and one with 256 single-page
slots. Calling bio_iov_iter_get_pages() once will result in a 1 MiB bio
in the first case, and a 4k bio in the second. You need to submit 256
individual bios to finish the IO.

By using the approach my patch takes (repeatedly calling
bio_iov_iter_get_pages() before submitting the bio), you'd get a 1 MiB
bio in both cases. So Ming's idea to generalize this may not be so bad,
after all. I don't see that it would increase the pressure on MM. If we
are concerned about MM, we should limit the total number of pages that
can be pinned, rather than the number of iovec segments that are
processed.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 20:20:51, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > > segment of the input iov_iter's iovec. This may be much less than the 
> > > > number
> > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > > short writes or reads may occur for direct synchronous IOs with multiple
> > > > iovec slots (such as generated by writev()). In that case,
> > > > __generic_file_write_iter() falls back to buffered writes, which
> > > > has been observed to cause data corruption in certain workloads.
> > > > 
> > > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > > result in multiple adjacent slots of the bi_io_vec array to reference 
> > > > the same
> > > > page (the byte ranges are guaranteed to be disjunct if the preceding 
> > > > patch is
> > > > applied). We haven't seen problems with that in our and the customer's
> > > > tests. It'd be possible to detect this situation and merge bi_io_vec 
> > > > slots
> > > > that refer to the same page, but I prefer to keep it simple for now.
> > > > 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for 
> > > > simplified bdev direct-io")
> > > > Signed-off-by: Martin Wilck 
> > > > ---
> > > >  fs/block_dev.c | 8 +++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aa..41643c4 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, 
> > > > struct iov_iter *iter,
> > > >  
> > > > ret = bio_iov_iter_get_pages(&bio, iter);
> > > > if (unlikely(ret))
> > > > -   return ret;
> > > > +   goto out;
> > > > +
> > > > +   while (ret == 0 &&
> > > > +  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > > +   ret = bio_iov_iter_get_pages(&bio, iter);
> > > > +
> > > > ret = bio.bi_iter.bi_size;
> > > >  
> > > > if (iov_iter_rw(iter) == READ) {
> > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, 
> > > > struct iov_iter *iter,
> > > > put_page(bvec->bv_page);
> > > > }
> > > >  
> > > > +out:
> > > > if (vecs != inline_vecs)
> > > > kfree(vecs);
> > > >
> > > 
> > > You might put the 'vecs' leak fix into another patch, and reuse the
> > > current code block for that.
> > > 
> > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> > > what do you think about the following way?
> > 
> > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> > fine with it returning less pages and they loop appropriately.
> 
> OK, but this way still may make one bio hold more data; especially
> since the comment of bio_iov_iter_get_pages() says that it 'Pins as
> many pages from *iter', this looks like the correct way to do it.

Well, but as Al pointed out, MM may decide that the user has already
pinned too many pages and refuse to pin more. So pinning a full iter's
worth of pages may just be impossible. Currently, there are no checks
like this in MM but eventually, I'd like to account pinned pages in the
mlock ulimit (or a limit of similar kind) and then what Al speaks about
would become very real. So I'd prefer not to develop new locations that
must be able to pin an arbitrary amount of pages.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > Secondly, I don't think it is good to discard error from
> > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > again
> > lead to part of IO being done as direct and part attempted to be done
> > as
> > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > behaves
> > differently - it aborts and returns error if bio_iov_iter_get_pages()
> > ever
> > returned error. IMO we should do the same here.
> 
> Well, it aborts the loop, but then (in the sync case) it still waits
> for the already submitted IOs to finish. Here, too, I'd find it more
> logical to return the number of successfully transmitted bytes rather
> than an error code. In the async case, the submitted bios are left in
> place, and will probably sooner or later finish, changing iocb->ki_pos.

Well, both these behaviors make sense, just traditionally (defined by our
implementation) DIO returns error even if part of IO has actually been
successfully submitted. Making a userspace visible change like you suggest
thus has to be very carefully analyzed and frankly I don't think it's worth
the bother.

> I'm actually not quite certain if that's correct. In the sync case, it
> causes the already-performed IO to be done again, buffered. In the
> async case, it may even cause two IOs for the same range to be in
> flight at the same time ... ?

It doesn't cause IO to be done again. Look at __generic_file_write_iter().
If generic_file_direct_write() returned error, we immediately return error
as well without retrying buffered IO. We only retry buffered IO for partial
(or 0) return value.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Christoph Hellwig
On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> Yeah. Actually previous version of the fix (not posted publicly) submitted
> partial bio and then reused the bio to submit more. This is also the way
> __blkdev_direct_IO operates. Martin optimized this to fill the bio
> completely (as we know we have enough bvecs) before submitting which has
> chances to perform better. I'm fine with either approach, just we have to
> decide which way to go.

I think this first version is going to be less fragile, so I think we should
aim for that.


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 16:53:51, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to implement 
> > > in
> > > the macro magic that iter processing is... Al might know more (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

Yeah. Actually previous version of the fix (not posted publicly) submitted
partial bio and then reused the bio to submit more. This is also the way
__blkdev_direct_IO operates. Martin optimized this to fill the bio
completely (as we know we have enough bvecs) before submitting which has
chances to perform better. I'm fine with either approach, just we have to
decide which way to go.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Christoph Hellwig
On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > Well, there has never been a promise that it will grab *all* pages in the
> > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > the macro magic that iter processing is... Al might know more (added to
> > CC).
> 
> Not really - it's more that VM has every right to refuse letting you pin
> an arbitrary amount of pages anyway.

In which case the code after this patch isn't going to help either, because
it still tries to pin it all, just in multiple calls to get_user_pages().


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 13:56 +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-
> > > empty
> > > segment of the input iov_iter's iovec. This may be much less than
> > > the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > bio_iov_iter_get_pages() repeatedly until either the requested
> > > number
> > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is
> > > not done,
> > > short writes or reads may occur for direct synchronous IOs with
> > > multiple
> > > iovec slots (such as generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > > 
> > > Note: if segments aren't page-aligned in the input iovec, this
> > > patch may
> > > result in multiple adjacent slots of the bi_io_vec array to
> > > reference the same
> > > page (the byte ranges are guaranteed to be disjunct if the
> > > preceding patch is
> > > applied). We haven't seen problems with that in our and the
> > > customer's
> > > tests. It'd be possible to detect this situation and merge
> > > bi_io_vec slots
> > > that refer to the same page, but I prefer to keep it simple for
> > > now.
> > > 
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > > simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck 
> > > ---
> > >  fs/block_dev.c | 8 +++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb
> > > *iocb, struct iov_iter *iter,
> > >  
> > >   ret = bio_iov_iter_get_pages(&bio, iter);
> > >   if (unlikely(ret))
> > > - return ret;
> > > + goto out;
> > > +
> > > + while (ret == 0 &&
> > > +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > + ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > >   ret = bio.bi_iter.bi_size;
> > >  
> > >   if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > > struct iov_iter *iter,
> > >   put_page(bvec->bv_page);
> > >   }
> > >  
> > > +out:
> > >   if (vecs != inline_vecs)
> > >   kfree(vecs);
> > > 
> > 
> > You might put the 'vecs' leak fix into another patch, and reuse the
> > current code block for that.
> > 
> > Looks all users of bio_iov_iter_get_pages() need this kind of fix,
> > so
> > what do you think about the following way?
> 
> No. AFAICT all the other users of bio_iov_iter_get_pages() are
> perfectly
> fine with it returning less pages and they loop appropriately.

We might consider filling the bios better in __blkdev_direct_IO(), too;
it might be a performance gain.

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than
> > the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested
> > number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not
> > done,
> > short writes or reads may occur for direct synchronous IOs with
> > multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this
> > patch may
> > result in multiple adjacent slots of the bi_io_vec array to
> > reference the same
> > page (the byte ranges are guaranteed to be disjunct if the
> > preceding patch is
> > applied). We haven't seen problems with that in our and the
> > customer's
> > tests. It'd be possible to detect this situation and merge
> > bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for
> > now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io")
> > Signed-off-by: Martin Wilck 
> > ---
> >  fs/block_dev.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > struct iov_iter *iter,
> >  
> > ret = bio_iov_iter_get_pages(&bio, iter);
> > if (unlikely(ret))
> > -   return ret;
> > +   goto out;
> > +
> > +   while (ret == 0 &&
> > +  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +   ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> 
> I have two suggestions here (posting them now in public):
> 
> Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we
> made
> sure we have enough vecs for pages in iter. So I'd WARN if this isn't
> true.

Yeah. I wanted to add that to the patch. Slipped through, somehow.
Sorry about that.

> Secondly, I don't think it is good to discard error from
> bio_iov_iter_get_pages() here and just submit partial IO. It will
> again lead to part of IO being done as direct and part attempted to
> be done as buffered. Also the "slow" direct IO path in
> __blkdev_direct_IO() behaves differently - it aborts and returns
> error if bio_iov_iter_get_pages() ever returned error. IMO we should
> do the same here.

Well, it aborts the loop, but then (in the sync case) it still waits
for the already submitted IOs to finish. Here, too, I'd find it more
logical to return the number of successfully transmitted bytes rather
than an error code. In the async case, the submitted bios are left in
place, and will probably sooner or later finish, changing iocb->ki_pos.

I'm actually not quite certain if that's correct. In the sync case, it
causes the already-performed IO to be done again, buffered. In the
async case, it may even cause two IOs for the same range to be in
flight at the same time ... ?

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Ming Lei
On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the 
> > > number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > short writes or reads may occur for direct synchronous IOs with multiple
> > > iovec slots (such as generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > > 
> > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > result in multiple adjacent slots of the bi_io_vec array to reference the 
> > > same
> > > page (the byte ranges are guaranteed to be disjunct if the preceding 
> > > patch is
> > > applied). We haven't seen problems with that in our and the customer's
> > > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > > that refer to the same page, but I prefer to keep it simple for now.
> > > 
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for 
> > > simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck 
> > > ---
> > >  fs/block_dev.c | 8 +++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > > iov_iter *iter,
> > >  
> > >   ret = bio_iov_iter_get_pages(&bio, iter);
> > >   if (unlikely(ret))
> > > - return ret;
> > > + goto out;
> > > +
> > > + while (ret == 0 &&
> > > +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > + ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > >   ret = bio.bi_iter.bi_size;
> > >  
> > >   if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > > iov_iter *iter,
> > >   put_page(bvec->bv_page);
> > >   }
> > >  
> > > +out:
> > >   if (vecs != inline_vecs)
> > >   kfree(vecs);
> > >
> > 
> > You might put the 'vecs' leak fix into another patch, and reuse the
> > current code block for that.
> > 
> > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> > what do you think about the following way?
> 
> No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> fine with it returning less pages and they loop appropriately.

OK, but this way still may make one bio hold more data; especially
since the comment of bio_iov_iter_get_pages() says that it 'Pins as
many pages from *iter', this looks like the correct way to do it.

thanks,
Ming


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 19:04:46, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > short writes or reads may occur for direct synchronous IOs with multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array to reference the 
> > same
> > page (the byte ranges are guaranteed to be disjunct if the preceding patch 
> > is
> > applied). We haven't seen problems with that in our and the customer's
> > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified 
> > bdev direct-io")
> > Signed-off-by: Martin Wilck 
> > ---
> >  fs/block_dev.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > iov_iter *iter,
> >  
> > ret = bio_iov_iter_get_pages(&bio, iter);
> > if (unlikely(ret))
> > -   return ret;
> > +   goto out;
> > +
> > +   while (ret == 0 &&
> > +  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +   ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> > ret = bio.bi_iter.bi_size;
> >  
> > if (iov_iter_rw(iter) == READ) {
> > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > iov_iter *iter,
> > put_page(bvec->bv_page);
> > }
> >  
> > +out:
> > if (vecs != inline_vecs)
> > kfree(vecs);
> >
> 
> You might put the 'vecs' leak fix into another patch, and reuse the
> current code block for that.
> 
> Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> what do you think about the following way?

No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
fine with it returning less pages and they loop appropriately.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Al Viro
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the 
> > > number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> > as many as possible pages since both 'maxsize' and 'maxpages' are provided
> > to cover all.

Huh?  Why does having a way to say "I (the caller) don't want more than
" imply anything of that sort?  It never had been promised, explicitly
or not...

> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

Not really - it's more that VM has every right to refuse letting you pin
an arbitrary amount of pages anyway.


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Ming Lei
On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array referencing the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
> >   ret = bio_iov_iter_get_pages(&bio, iter);
>   if (unlikely(ret))
> - return ret;
> + goto out;
> +
> + while (ret == 0 &&
> +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > + ret = bio_iov_iter_get_pages(&bio, iter);
> +
>   ret = bio.bi_iter.bi_size;
>  
>   if (iov_iter_rw(iter) == READ) {
> > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>   put_page(bvec->bv_page);
>   }
>  
> +out:
>   if (vecs != inline_vecs)
>   kfree(vecs);
>

You might put the 'vecs' leak fix into another patch, and reuse the
current code block for that.

Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
what do you think about the following way?

diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..23dd4c163dfc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -904,15 +904,7 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
-/**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
- * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
- *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -951,6 +943,28 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_advance(iter, size);
return 0;
 }
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+   int ret;
+   unsigned int size;
+
+   do {
+   size = bio->bi_iter.bi_size;
+   ret = __bio_iov_iter_get_pages(bio, iter);
+   } while (!bio_full(bio) && iov_iter_count(iter) > 0 &&
+   bio->bi_iter.bi_size > size);
+
+   return bio->bi_iter.bi_size > 0 ? 0 : ret;
+}
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)

Thanks, 
Ming


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Ming Lei
On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the 
> > > number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> > as many as possible pages since both 'maxsize' and 'maxpages' are provided
> > to cover all.
> > 
> > So the question is why iov_iter_get_pages() doesn't work as expected?
> 
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

OK, I got it, given get_user_pages_fast() may fail and it is reasonable
to see less pages retrieved.

Thanks,
Ming


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>   ret = bio_iov_iter_get_pages(&bio, iter);
>   if (unlikely(ret))
> - return ret;
> + goto out;
> +
> + while (ret == 0 &&
> +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> + ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
sure we have enough vecs for pages in iter. So I'd WARN if this isn't true.

Secondly, I don't think it is good to discard error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of IO being done as direct and part attempted to be done as
buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns error if bio_iov_iter_get_pages() ever
returned error. IMO we should do the same here.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> 
> In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> as many as possible pages since both 'maxsize' and 'maxpages' are provided
> to cover all.
> 
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to
CC).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Ming Lei
On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call

In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
as many as possible pages since both 'maxsize' and 'maxpages' are provided
to cover all.

So the question is why iov_iter_get_pages() doesn't work as expected?

Thanks,
Ming


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Hannes Reinecke
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array referencing the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
h...@suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Martin Wilck
bio_iov_iter_get_pages() returns only pages for a single non-empty
segment of the input iov_iter's iovec. This may be much less than the number
of pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number
of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
short writes or reads may occur for direct synchronous IOs with multiple
iovec slots (such as generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which
has been observed to cause data corruption in certain workloads.

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array referencing the same
page (the byte ranges are guaranteed to be disjunct if the preceding patch is
applied). We haven't seen problems with that in our and the customer's
tests. It'd be possible to detect this situation and merge bi_io_vec slots
that refer to the same page, but I prefer to keep it simple for now.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck 
---
 fs/block_dev.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
	ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
-   return ret;
+   goto out;
+
+   while (ret == 0 &&
+  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+	ret = bio_iov_iter_get_pages(&bio, iter);
+
ret = bio.bi_iter.bi_size;
 
if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
put_page(bvec->bv_page);
}
 
+out:
if (vecs != inline_vecs)
kfree(vecs);
 
-- 
2.17.1