Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread NeilBrown
On Wed, 6 Nov 2013 15:22:36 -0500 Chris Mason 
wrote:


> > Yup - that should actually be safe for all the existing bio_clone() users
> > actually, I audited all of them - because normally you're not going to 
> > complete
> > the original bio until the clone finishes.
> 
> I'd say we need an ack from Neil before we can switch to _fast.  The
> completions look non-trivial and _fast adds new ordering requirements on
> free.

I think it should be fairly straight forward to use _fast in md.
As Kent says, the original bio that is cloned is (almost) never going to have
endio called until all the clones have done their thing and finished up.

The only exception is the "behind_pages" stuff in raid1.c.  We clearly need
to be more careful there.
If the  "ordering requirements" you mention are simply that the owner of the
original may free the biovec, so the clone must have finished using it before
that can happen, then that would most easily be fix with
   mbio->bi_io_vec = r1_bio->behind_bvecs;
rather than the current loop which copies all the bv_page pointers.

NeilBrown



signature.asc
Description: PGP signature


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 04:25:45PM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-06 15:57:34)
> > On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> > > Quoting Kent Overstreet (2013-11-06 15:02:22)
> 
> [ ... nods, thanks! ... ]
> 
> > OTOH - with regards to just the ordering requirements, the more I look at
> > various code the less accidental the fact that that works seems to be: the 
> > best
> > explanation I've come up with so far is that you already needed to ensure 
> > that
> > the _pages_ the clone points to stick around until the clone completes, and 
> > if
> > you don't own the original bio the only way to do that is to not complete 
> > the
> > original bio until after the clone completes.
> > 
> > So if you're a driver cloning bios that were submitted to you, 
> > bio_clone_fast()
> > introduces no new ordering requirements.
> > 
> > On the third hand - if you're cloning (i.e. splitting) your own bios, in 
> > that
> > case it would be possible to screw up the ordering - I don't know of any 
> > code in
> > the kernel that does this today (except for, sort of, bcache) but my dio 
> > rewrite
> > takes this approach - but if you do the obvious and sane thing with 
> > bio_chain()
> > it's a non issue, it seems to me you'd have to come up with something pretty
> > contrived and dumb for this to actually be an issue in practice.
> > 
> > Anyways, I haven't come to any definite conclusions, those are just my
> > observations so far.
> 
> I do think you're right.  We all seem to have clones doing work on
> behalf of the original, and when everyone is done we complete the
> original.
> 
> But, btrfs does have silly things like this:
> 
> dio_end_io(dio_bio, err); // end and free the original
>   bio_put(bio); // free the clone
> 
> It's not a bug yet, but given enough time the space between those two
> frees will grow new code that kills us all.

Hopefully the DIO situation will get better in the near future, Josef was
telling me btrfs was probably going to end up with its own DIO code (replacing a
good chunk of direct-io.c) and that should get a lot easier and saner for you
guys with my dio rewrite. Need to finish getting that to pass xfstests...

> Really though, the new stuff is better, thanks.

Thanks! :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:57:34)
> On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-06 15:02:22)

[ ... nods, thanks! ... ]

> OTOH - with regards to just the ordering requirements, the more I look at
> various code the less accidental the fact that that works seems to be: the 
> best
> explanation I've come up with so far is that you already needed to ensure that
> the _pages_ the clone points to stick around until the clone completes, and if
> you don't own the original bio the only way to do that is to not complete the
> original bio until after the clone completes.
> 
> So if you're a driver cloning bios that were submitted to you, 
> bio_clone_fast()
> introduces no new ordering requirements.
> 
> On the third hand - if you're cloning (i.e. splitting) your own bios, in that
> case it would be possible to screw up the ordering - I don't know of any code 
> in
> the kernel that does this today (except for, sort of, bcache) but my dio 
> rewrite
> takes this approach - but if you do the obvious and sane thing with 
> bio_chain()
> it's a non issue, it seems to me you'd have to come up with something pretty
> contrived and dumb for this to actually be an issue in practice.
> 
> Anyways, I haven't come to any definite conclusions, those are just my
> observations so far.

I do think you're right.  We all seem to have clones doing work on
behalf of the original, and when everyone is done we complete the
original.

But, btrfs does have silly things like this:

dio_end_io(dio_bio, err); // end and free the original
bio_put(bio); // free the clone

It's not a bug yet, but given enough time the space between those two
frees will grow new code that kills us all.

Really though, the new stuff is better, thanks.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-06 15:02:22)
> > Yup - that should actually be safe for all the existing bio_clone() users
> > actually, I audited all of them - because normally you're not going to 
> > complete
> > the original bio until the clone finishes.
> 
> I'd say we need an ack from Neil before we can switch to _fast.  The
> completions look non-trivial and _fast adds new ordering requirements on
> free.

Agreed, I've already broken md enough lately :P

(I'm started to work on a real test suite for the block layer, that'll help with
this stuff...)

> > > If you look at dm's crypt_free_buffer_pages(), it had similar problems.
> > 
> > Those are fine actually - in both cases, they're bios they allocated, not 
> > the
> > bios that were submitted to them.
> 
> Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
> the case before the bi_vcnt patch.  With your current patch (adding
> _fast) both should now be correct.

Yeah, but bi_vcnt being zero was just a symptom - the underlying reason the old
code was wrong was that with the clone sharing the parent's biovec, md was
modifying a biovec (including bv_page!) that it didn't own - _that_ was what was
horribly broken about it.

> > What's unclear about it? The rule is just - if you didn't allocate the 
> > biovec,
> > don't modify it or use bio_for_each_segment_all() (probably I didn't quite 
> > state
> > it clearly enough before though)
> 
> That part makes sense.  The new rule that scares me is that we can't
> free the src of the clone until all the clones are freed.  If it works
> with today's existing users it feels like it is more by accident than
> design.  I'm not saying we can't do it, we just need some bigger
> flashing warning lights.

Yeah, Jens was saying the same thing yesterday, hence this patch.

OTOH - with regards to just the ordering requirements, the more I look at
various code the less accidental the fact that that works seems to be: the best
explanation I've come up with so far is that you already needed to ensure that
the _pages_ the clone points to stick around until the clone completes, and if
you don't own the original bio the only way to do that is to not complete the
original bio until after the clone completes.

So if you're a driver cloning bios that were submitted to you, bio_clone_fast()
introduces no new ordering requirements.

On the third hand - if you're cloning (i.e. splitting) your own bios, in that
case it would be possible to screw up the ordering - I don't know of any code in
the kernel that does this today (except for, sort of, bcache) but my dio rewrite
takes this approach - but if you do the obvious and sane thing with bio_chain()
it's a non issue, it seems to me you'd have to come up with something pretty
contrived and dumb for this to actually be an issue in practice.

Anyways, I haven't come to any definite conclusions, those are just my
observations so far.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:02:22)
> On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-05 22:48:41)
> > > This patch reverts the default behaviour introduced by
> > > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > > shares the source bio's biovec, cloning the biovec is once again the
> > > default.
> > > 
> > > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > > that shares the source's biovec. This patch changes bcache and md to use
> >^
> >  dm?
> > 
> > > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > > to other refactoring; most of the other uses of bio_clone() should be
> > > same to convert to the _fast() variant but that will be done more
> > > incrementally in other patches (bio_split() in particular).
> > 
> > Hi Kent,
> > 
> > I noticed yesterday the _fast variants of bio clone introduce sharing
> > between the src and the clone, but without any reference counts:
> > 
> > bio->bi_io_vec = bio_src->bi_io_vec;
> > 
> > Have you audited all of the _fast users to make sure they are not
> > freeing the src before the clone?  Sorry if this came up already in past
> > reviews.
> 
> Yup - that should actually be safe for all the existing bio_clone() users
> actually, I audited all of them - because normally you're not going to 
> complete
> the original bio until the clone finishes.

I'd say we need an ack from Neil before we can switch to _fast.  The
completions look non-trivial and _fast adds new ordering requirements on
free.

> 
> > > Note that __bio_clone() isn't being readded - the reason being that with
> > > immutable biovecs allocating the right number of biovecs for the new
> > > clone is no longer trivial so we don't want drivers trying to do that
> > > themselves.
> > > 
> > > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > > short,
> > 
> > I think I see what you mean with tying bi_vcnt to ownership of the bio,
> > but we're not consistent.  Looking at bio_for_eaach_segment_all:
> > 
> > *
> >  * drivers should _never_ use the all version - the bio may have been split
> >  * before it got to the driver and the driver won't own all of it
> >  */
> > #define bio_for_each_segment_all(bvl, bio, i)   \
> > for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> > 
> > bio_for_each_segment_all still trusts bi_vcnt, so any
> > bio_for_each_segment_all operation on a clone will basically be a noop.
> > 
> > Just looking at MD raid1 make_request()
> > 
> > mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> >   ...
> >   alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
> >   ...
> >   if (r1_bio->behind_bvecs) {
> > bio_for_each_segment_all(bvec, mbio, j)
> >   ...
> > 
> > I didn't test MD without the vcnt fix, but I think any operations in MD
> > that duplicate data for raid1 turn into noops.  I think we'll end up
> > writing garbage (or nothing) to the second mirror.
> > 
> > If you look at dm's crypt_free_buffer_pages(), it had similar problems.
> 
> Those are fine actually - in both cases, they're bios they allocated, not the
> bios that were submitted to them.

Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
the case before the bi_vcnt patch.  With your current patch (adding
_fast) both should now be correct.

>
> Though md _definitely_ shouldn't have been
> sharing the original bio's biovec, so looks like this patch will fix a bug
> there...
> 
> (I remember seeing that code before and I thought I added a bio_clone_biovec()
> call to that md code, but apparently that never got commited. Argh.)
> 
> > 
> > > not setting it might cause bugs in the short term but long term
> > > it's likely to hide nastier more subtle bugs, we don't want code looking
> > > at bi_vcnt at all for bios it does not own.
> > 
> > I think the concept of bio ownership is still much too weak, at least
> > for established users like MD and DM.  I don't know how to verify the
> > sharing of bi_io_vec without some kind of reference counting on the
> > iovec.
> 
> What's unclear about it? The rule is just - if you didn't allocate the biovec,
> don't modify it or use bio_for_each_segment_all() (probably I didn't quite 
> state
> it clearly enough before though)

That part makes sense.  The new rule that scares me is that we can't
free the src of the clone until all the clones are freed.  If it works
with today's existing users it feels like it is more by accident than
design.  I'm not saying we 

Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-05 22:48:41)
> > This patch reverts the default behaviour introduced by
> > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > shares the source bio's biovec, cloning the biovec is once again the
> > default.
> > 
> > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > that shares the source's biovec. This patch changes bcache and md to use
>^
>  dm?
> 
> > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > to other refactoring; most of the other uses of bio_clone() should be
> > same to convert to the _fast() variant but that will be done more
> > incrementally in other patches (bio_split() in particular).
> 
> Hi Kent,
> 
> I noticed yesterday the _fast variants of bio clone introduce sharing
> between the src and the clone, but without any reference counts:
> 
> bio->bi_io_vec = bio_src->bi_io_vec;
> 
> Have you audited all of the _fast users to make sure they are not
> freeing the src before the clone?  Sorry if this came up already in past
> reviews.

Yup - that should actually be safe for all the existing bio_clone() users
actually, I audited all of them - because normally you're not going to complete
the original bio until the clone finishes.

> > Note that __bio_clone() isn't being readded - the reason being that with
> > immutable biovecs allocating the right number of biovecs for the new
> > clone is no longer trivial so we don't want drivers trying to do that
> > themselves.
> > 
> > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > short,
> 
> I think I see what you mean with tying bi_vcnt to ownership of the bio,
> but we're not consistent.  Looking at bio_for_eaach_segment_all:
> 
> *
>  * drivers should _never_ use the all version - the bio may have been split
>  * before it got to the driver and the driver won't own all of it
>  */
> #define bio_for_each_segment_all(bvl, bio, i)   \
> for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> 
> bio_for_each_segment_all still trusts bi_vcnt, so any
> bio_for_each_segment_all operation on a clone will basically be a noop.
> 
> Just looking at MD raid1 make_request()
> 
> mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>   ...
>   alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
>   ...
>   if (r1_bio->behind_bvecs) {
> bio_for_each_segment_all(bvec, mbio, j)
>   ...
> 
> I didn't test MD without the vcnt fix, but I think any operations in MD
> that duplicate data for raid1 turn into noops.  I think we'll end up
> writing garbage (or nothing) to the second mirror.
> 
> If you look at dm's crypt_free_buffer_pages(), it had similar problems.

Those are fine actually - in both cases, they're bios they allocated, not the
bios that were submitted to them. Though md _definitely_ shouldn't have been
sharing the original bio's biovec, so looks like this patch will fix a bug
there...

(I remember seeing that code before and I thought I added a bio_clone_biovec()
call to that md code, but apparently that never got commited. Argh.)

> 
> > not setting it might cause bugs in the short term but long term
> > it's likely to hide nastier more subtle bugs, we don't want code looking
> > at bi_vcnt at all for bios it does not own.
> 
> I think the concept of bio ownership is still much too weak, at least
> for established users like MD and DM.  I don't know how to verify the
> sharing of bi_io_vec without some kind of reference counting on the
> iovec.

What's unclear about it? The rule is just - if you didn't allocate the biovec,
don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
it clearly enough before though)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-05 22:48:41)
> This patch reverts the default behaviour introduced by
> 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> shares the source bio's biovec, cloning the biovec is once again the
> default.
> 
> Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> that shares the source's biovec. This patch changes bcache and md to use
   ^
   dm?

> __bio_clone_biovec_fast() since they're expecting the new behaviour due
> to other refactoring; most of the other uses of bio_clone() should be
> same to convert to the _fast() variant but that will be done more
> incrementally in other patches (bio_split() in particular).

Hi Kent,

I noticed yesterday the _fast variants of bio clone introduce sharing
between the src and the clone, but without any reference counts:

bio->bi_io_vec = bio_src->bi_io_vec;

Have you audited all of the _fast users to make sure they are not
freeing the src before the clone?  Sorry if this came up already in past
reviews.

> 
> Note that __bio_clone() isn't being readded - the reason being that with
> immutable biovecs allocating the right number of biovecs for the new
> clone is no longer trivial so we don't want drivers trying to do that
> themselves.
> 
> This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> short,

I think I see what you mean with tying bi_vcnt to ownership of the bio,
but we're not consistent.  Looking at bio_for_eaach_segment_all:

*
 * drivers should _never_ use the all version - the bio may have been split
 * before it got to the driver and the driver won't own all of it
 */
#define bio_for_each_segment_all(bvl, bio, i)   \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)

bio_for_each_segment_all still trusts bi_vcnt, so any
bio_for_each_segment_all operation on a clone will basically be a noop.

Just looking at MD raid1 make_request()

mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
...
alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
...
if (r1_bio->behind_bvecs) {
bio_for_each_segment_all(bvec, mbio, j)
...

I didn't test MD without the vcnt fix, but I think any operations in MD
that duplicate data for raid1 turn into noops.  I think we'll end up
writing garbage (or nothing) to the second mirror.

If you look at dm's crypt_free_buffer_pages(), it had similar problems.

> not setting it might cause bugs in the short term but long term
> it's likely to hide nastier more subtle bugs, we don't want code looking
> at bi_vcnt at all for bios it does not own.

I think the concept of bio ownership is still much too weak, at least
for established users like MD and DM.  I don't know how to verify the
sharing of bi_io_vec without some kind of reference counting on the
iovec.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Olof Johansson
On Tue, Nov 5, 2013 at 9:07 PM, Kent Overstreet  wrote:
> On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
>> Hi,
>>
>> On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet  wrote:
>>
>> > Chris, Olaf, can you two in particular test this? I have tested the bounce
>> > buffer code (and bcache), but Jens told me today there was an md bug that I
>> > _still_ can't find any emails about so I'm not sure what to test for that.
>>
>> (I'm guessing you mean me and not some German person here. :-)
>>
>> What is this expected to apply on top of? It doesn't apply cleanly on
>> the 20131105 -next tree which is where I had been seeing the issues.
>
> Sorry - it's on top of Jens' for-3.13/core branch,
> git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git

Tested-by: Olof Johansson 

Note that last night's -next seems to have held up as well.


-Olof
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Olof Johansson
On Tue, Nov 5, 2013 at 9:07 PM, Kent Overstreet k...@daterainc.com wrote:
 On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
 Hi,

 On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet k...@daterainc.com wrote:

  Chris, Olaf, can you two in particular test this? I have tested the bounce
  buffer code (and bcache), but Jens told me today there was an md bug that I
  _still_ can't find any emails about so I'm not sure what to test for that.

 (I'm guessing you mean me and not some German person here. :-)

 What is this expected to apply on top of? It doesn't apply cleanly on
 the 20131105 -next tree which is where I had been seeing the issues.

 Sorry - it's on top of Jens' for-3.13/core branch,
 git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git

Tested-by: Olof Johansson o...@lixom.net

Note that last night's -next seems to have held up as well.


-Olof
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-05 22:48:41)
 This patch reverts the default behaviour introduced by
 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
 shares the source bio's biovec, cloning the biovec is once again the
 default.
 
 Instead, we add a new bio_clone_biovec_fast(), which creates a clone
 that shares the source's biovec. This patch changes bcache and md to use
   ^
   dm?

 __bio_clone_biovec_fast() since they're expecting the new behaviour due
 to other refactoring; most of the other uses of bio_clone() should be
 same to convert to the _fast() variant but that will be done more
 incrementally in other patches (bio_split() in particular).

Hi Kent,

I noticed yesterday the _fast variants of bio clone introduce sharing
between the src and the clone, but without any reference counts:

bio-bi_io_vec = bio_src-bi_io_vec;

Have you audited all of the _fast users to make sure they are not
freeing the src before the clone?  Sorry if this came up already in past
reviews.

 
 Note that __bio_clone() isn't being readded - the reason being that with
 immutable biovecs allocating the right number of biovecs for the new
 clone is no longer trivial so we don't want drivers trying to do that
 themselves.
 
 This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
 __bio_clone_fast() should not be setting bi_vcnt for bios that do not
 own the biovec (see Documentation/block/biovecs.txt for rationale) - in
 short,

I think I see what you mean with tying bi_vcnt to ownership of the bio,
but we're not consistent.  Looking at bio_for_eaach_segment_all:

*
 * drivers should _never_ use the all version - the bio may have been split
 * before it got to the driver and the driver won't own all of it
 */
#define bio_for_each_segment_all(bvl, bio, i)   \
for (i = 0, bvl = (bio)-bi_io_vec; i  (bio)-bi_vcnt; i++, bvl++)

bio_for_each_segment_all still trusts bi_vcnt, so any
bio_for_each_segment_all operation on a clone will basically be a noop.

Just looking at MD raid1 make_request()

mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
...
alloc_behind_pages(mbio, r1_bio); - bio_for_each_segment_all
...
if (r1_bio-behind_bvecs) {
bio_for_each_segment_all(bvec, mbio, j)
...

I didn't test MD without the vcnt fix, but I think any operations in MD
that duplicate data for raid1 turn into noops.  I think we'll end up
writing garbage (or nothing) to the second mirror.

If you look at dm's crypt_free_buffer_pages(), it had similar problems.

 not setting it might cause bugs in the short term but long term
 it's likely to hide nastier more subtle bugs, we don't want code looking
 at bi_vcnt at all for bios it does not own.

I think the concept of bio ownership is still much too weak, at least
for established users like MD and DM.  I don't know how to verify the
sharing of bi_io_vec without some kind of reference counting on the
iovec.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
 Quoting Kent Overstreet (2013-11-05 22:48:41)
  This patch reverts the default behaviour introduced by
  9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
  shares the source bio's biovec, cloning the biovec is once again the
  default.
  
  Instead, we add a new bio_clone_biovec_fast(), which creates a clone
  that shares the source's biovec. This patch changes bcache and md to use
^
  dm?
 
  __bio_clone_biovec_fast() since they're expecting the new behaviour due
  to other refactoring; most of the other uses of bio_clone() should be
  same to convert to the _fast() variant but that will be done more
  incrementally in other patches (bio_split() in particular).
 
 Hi Kent,
 
 I noticed yesterday the _fast variants of bio clone introduce sharing
 between the src and the clone, but without any reference counts:
 
 bio-bi_io_vec = bio_src-bi_io_vec;
 
 Have you audited all of the _fast users to make sure they are not
 freeing the src before the clone?  Sorry if this came up already in past
 reviews.

Yup - that should actually be safe for all the existing bio_clone() users
actually, I audited all of them - because normally you're not going to complete
the original bio until the clone finishes.

  Note that __bio_clone() isn't being readded - the reason being that with
  immutable biovecs allocating the right number of biovecs for the new
  clone is no longer trivial so we don't want drivers trying to do that
  themselves.
  
  This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
  __bio_clone_fast() should not be setting bi_vcnt for bios that do not
  own the biovec (see Documentation/block/biovecs.txt for rationale) - in
  short,
 
 I think I see what you mean with tying bi_vcnt to ownership of the bio,
 but we're not consistent.  Looking at bio_for_eaach_segment_all:
 
 *
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
 #define bio_for_each_segment_all(bvl, bio, i)   \
 for (i = 0, bvl = (bio)-bi_io_vec; i  (bio)-bi_vcnt; i++, bvl++)
 
 bio_for_each_segment_all still trusts bi_vcnt, so any
 bio_for_each_segment_all operation on a clone will basically be a noop.
 
 Just looking at MD raid1 make_request()
 
 mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
   ...
   alloc_behind_pages(mbio, r1_bio); - bio_for_each_segment_all
   ...
   if (r1_bio-behind_bvecs) {
 bio_for_each_segment_all(bvec, mbio, j)
   ...
 
 I didn't test MD without the vcnt fix, but I think any operations in MD
 that duplicate data for raid1 turn into noops.  I think we'll end up
 writing garbage (or nothing) to the second mirror.
 
 If you look at dm's crypt_free_buffer_pages(), it had similar problems.

Those are fine actually - in both cases, they're bios they allocated, not the
bios that were submitted to them. Though md _definitely_ shouldn't have been
sharing the original bio's biovec, so looks like this patch will fix a bug
there...

(I remember seeing that code before and I thought I added a bio_clone_biovec()
call to that md code, but apparently that never got commited. Argh.)

 
  not setting it might cause bugs in the short term but long term
  it's likely to hide nastier more subtle bugs, we don't want code looking
  at bi_vcnt at all for bios it does not own.
 
 I think the concept of bio ownership is still much too weak, at least
 for established users like MD and DM.  I don't know how to verify the
 sharing of bi_io_vec without some kind of reference counting on the
 iovec.

What's unclear about it? The rule is just - if you didn't allocate the biovec,
don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
it clearly enough before though)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:02:22)
 On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
  Quoting Kent Overstreet (2013-11-05 22:48:41)
   This patch reverts the default behaviour introduced by
   9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
   shares the source bio's biovec, cloning the biovec is once again the
   default.
   
   Instead, we add a new bio_clone_biovec_fast(), which creates a clone
   that shares the source's biovec. This patch changes bcache and md to use
 ^
   dm?
  
   __bio_clone_biovec_fast() since they're expecting the new behaviour due
   to other refactoring; most of the other uses of bio_clone() should be
   same to convert to the _fast() variant but that will be done more
   incrementally in other patches (bio_split() in particular).
  
  Hi Kent,
  
  I noticed yesterday the _fast variants of bio clone introduce sharing
  between the src and the clone, but without any reference counts:
  
  bio-bi_io_vec = bio_src-bi_io_vec;
  
  Have you audited all of the _fast users to make sure they are not
  freeing the src before the clone?  Sorry if this came up already in past
  reviews.
 
 Yup - that should actually be safe for all the existing bio_clone() users
 actually, I audited all of them - because normally you're not going to 
 complete
 the original bio until the clone finishes.

I'd say we need an ack from Neil before we can switch to _fast.  The
completions look non-trivial and _fast adds new ordering requirements on
free.

 
   Note that __bio_clone() isn't being readded - the reason being that with
   immutable biovecs allocating the right number of biovecs for the new
   clone is no longer trivial so we don't want drivers trying to do that
   themselves.
   
   This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
   __bio_clone_fast() should not be setting bi_vcnt for bios that do not
   own the biovec (see Documentation/block/biovecs.txt for rationale) - in
   short,
  
  I think I see what you mean with tying bi_vcnt to ownership of the bio,
  but we're not consistent.  Looking at bio_for_eaach_segment_all:
  
  *
   * drivers should _never_ use the all version - the bio may have been split
   * before it got to the driver and the driver won't own all of it
   */
  #define bio_for_each_segment_all(bvl, bio, i)   \
  for (i = 0, bvl = (bio)-bi_io_vec; i  (bio)-bi_vcnt; i++, bvl++)
  
  bio_for_each_segment_all still trusts bi_vcnt, so any
  bio_for_each_segment_all operation on a clone will basically be a noop.
  
  Just looking at MD raid1 make_request()
  
  mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
...
alloc_behind_pages(mbio, r1_bio); - bio_for_each_segment_all
...
if (r1_bio-behind_bvecs) {
  bio_for_each_segment_all(bvec, mbio, j)
...
  
  I didn't test MD without the vcnt fix, but I think any operations in MD
  that duplicate data for raid1 turn into noops.  I think we'll end up
  writing garbage (or nothing) to the second mirror.
  
  If you look at dm's crypt_free_buffer_pages(), it had similar problems.
 
 Those are fine actually - in both cases, they're bios they allocated, not the
 bios that were submitted to them.

Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
the case before the bi_vcnt patch.  With your current patch (adding
_fast) both should now be correct.


 Though md _definitely_ shouldn't have been
 sharing the original bio's biovec, so looks like this patch will fix a bug
 there...
 
 (I remember seeing that code before and I thought I added a bio_clone_biovec()
 call to that md code, but apparently that never got commited. Argh.)
 
  
   not setting it might cause bugs in the short term but long term
   it's likely to hide nastier more subtle bugs, we don't want code looking
   at bi_vcnt at all for bios it does not own.
  
  I think the concept of bio ownership is still much too weak, at least
  for established users like MD and DM.  I don't know how to verify the
  sharing of bi_io_vec without some kind of reference counting on the
  iovec.
 
 What's unclear about it? The rule is just - if you didn't allocate the biovec,
 don't modify it or use bio_for_each_segment_all() (probably I didn't quite 
 state
 it clearly enough before though)

That part makes sense.  The new rule that scares me is that we can't
free the src of the clone until all the clones are freed.  If it works
with today's existing users it feels like it is more by accident than
design.  I'm not saying we can't do it, we just need some bigger
flashing warning lights.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org

Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
 Quoting Kent Overstreet (2013-11-06 15:02:22)
  Yup - that should actually be safe for all the existing bio_clone() users
  actually, I audited all of them - because normally you're not going to 
  complete
  the original bio until the clone finishes.
 
 I'd say we need an ack from Neil before we can switch to _fast.  The
 completions look non-trivial and _fast adds new ordering requirements on
 free.

Agreed, I've already broken md enough lately :P

(I'm started to work on a real test suite for the block layer, that'll help with
this stuff...)

   If you look at dm's crypt_free_buffer_pages(), it had similar problems.
  
  Those are fine actually - in both cases, they're bios they allocated, not 
  the
  bios that were submitted to them.
 
 Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
 the case before the bi_vcnt patch.  With your current patch (adding
 _fast) both should now be correct.

Yeah, but bi_vcnt being zero was just a symptom - the underlying reason the old
code was wrong was that with the clone sharing the parent's biovec, md was
modifying a biovec (including bv_page!) that it didn't own - _that_ was what was
horribly broken about it.

  What's unclear about it? The rule is just - if you didn't allocate the 
  biovec,
  don't modify it or use bio_for_each_segment_all() (probably I didn't quite 
  state
  it clearly enough before though)
 
 That part makes sense.  The new rule that scares me is that we can't
 free the src of the clone until all the clones are freed.  If it works
 with today's existing users it feels like it is more by accident than
 design.  I'm not saying we can't do it, we just need some bigger
 flashing warning lights.

Yeah, Jens was saying the same thing yesterday, hence this patch.

OTOH - with regards to just the ordering requirements, the more I look at
various code the less accidental the fact that that works seems to be: the best
explanation I've come up with so far is that you already needed to ensure that
the _pages_ the clone points to stick around until the clone completes, and if
you don't own the original bio the only way to do that is to not complete the
original bio until after the clone completes.

So if you're a driver cloning bios that were submitted to you, bio_clone_fast()
introduces no new ordering requirements.

On the third hand - if you're cloning (i.e. splitting) your own bios, in that
case it would be possible to screw up the ordering - I don't know of any code in
the kernel that does this today (except for, sort of, bcache) but my dio rewrite
takes this approach - but if you do the obvious and sane thing with bio_chain()
it's a non issue, it seems to me you'd have to come up with something pretty
contrived and dumb for this to actually be an issue in practice.

Anyways, I haven't come to any definite conclusions, those are just my
observations so far.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:57:34)
 On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
  Quoting Kent Overstreet (2013-11-06 15:02:22)

[ ... nods, thanks! ... ]

 OTOH - with regards to just the ordering requirements, the more I look at
 various code the less accidental the fact that that works seems to be: the 
 best
 explanation I've come up with so far is that you already needed to ensure that
 the _pages_ the clone points to stick around until the clone completes, and if
 you don't own the original bio the only way to do that is to not complete the
 original bio until after the clone completes.
 
 So if you're a driver cloning bios that were submitted to you, 
 bio_clone_fast()
 introduces no new ordering requirements.
 
 On the third hand - if you're cloning (i.e. splitting) your own bios, in that
 case it would be possible to screw up the ordering - I don't know of any code 
 in
 the kernel that does this today (except for, sort of, bcache) but my dio 
 rewrite
 takes this approach - but if you do the obvious and sane thing with 
 bio_chain()
 it's a non issue, it seems to me you'd have to come up with something pretty
 contrived and dumb for this to actually be an issue in practice.
 
 Anyways, I haven't come to any definite conclusions, those are just my
 observations so far.

I do think you're right.  We all seem to have clones doing work on
behalf of the original, and when everyone is done we complete the
original.

But, btrfs does have silly things like this:

dio_end_io(dio_bio, err); // end and free the original
bio_put(bio); // free the clone

It's not a bug yet, but given enough time the space between those two
frees will grow new code that kills us all.

Really though, the new stuff is better, thanks.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Kent Overstreet
On Wed, Nov 06, 2013 at 04:25:45PM -0500, Chris Mason wrote:
 Quoting Kent Overstreet (2013-11-06 15:57:34)
  On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
   Quoting Kent Overstreet (2013-11-06 15:02:22)
 
 [ ... nods, thanks! ... ]
 
  OTOH - with regards to just the ordering requirements, the more I look at
  various code the less accidental the fact that that works seems to be: the 
  best
  explanation I've come up with so far is that you already needed to ensure 
  that
  the _pages_ the clone points to stick around until the clone completes, and 
  if
  you don't own the original bio the only way to do that is to not complete 
  the
  original bio until after the clone completes.
  
  So if you're a driver cloning bios that were submitted to you, 
  bio_clone_fast()
  introduces no new ordering requirements.
  
  On the third hand - if you're cloning (i.e. splitting) your own bios, in 
  that
  case it would be possible to screw up the ordering - I don't know of any 
  code in
  the kernel that does this today (except for, sort of, bcache) but my dio 
  rewrite
  takes this approach - but if you do the obvious and sane thing with 
  bio_chain()
  it's a non issue, it seems to me you'd have to come up with something pretty
  contrived and dumb for this to actually be an issue in practice.
  
  Anyways, I haven't come to any definite conclusions, those are just my
  observations so far.
 
 I do think you're right.  We all seem to have clones doing work on
 behalf of the original, and when everyone is done we complete the
 original.
 
 But, btrfs does have silly things like this:
 
 dio_end_io(dio_bio, err); // end and free the original
   bio_put(bio); // free the clone
 
 It's not a bug yet, but given enough time the space between those two
 frees will grow new code that kills us all.

Hopefully the DIO situation will get better in the near future, Josef was
telling me btrfs was probably going to end up with its own DIO code (replacing a
good chunk of direct-io.c) and that should get a lot easier and saner for you
guys with my dio rewrite. Need to finish getting that to pass xfstests...

 Really though, the new stuff is better, thanks.

Thanks! :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread NeilBrown
On Wed, 6 Nov 2013 15:22:36 -0500 Chris Mason chris.ma...@fusionio.com
wrote:


  Yup - that should actually be safe for all the existing bio_clone() users
  actually, I audited all of them - because normally you're not going to 
  complete
  the original bio until the clone finishes.
 
 I'd say we need an ack from Neil before we can switch to _fast.  The
 completions look non-trivial and _fast adds new ordering requirements on
 free.

I think it should be fairly straight forward to use _fast in md.
As Kent says, the original bio that is cloned is (almost) never going to have
endio called until all the clones have done their thing and finished up.

The only exception is the behind_pages stuff in raid1.c.  We clearly need
to be more careful there.
If the  ordering requirements you mention are simply that the owner of the
original may free the biovec, so the clone must have finished using it before
that can happen, then that would most easily be fix with
   mbio-bi_io_vec = r1_bio-behind_bvecs;
rather than the current loop which copies all the bv_page pointers.

NeilBrown



signature.asc
Description: PGP signature


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Kent Overstreet
On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
> Hi,
> 
> On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet  wrote:
> 
> > Chris, Olaf, can you two in particular test this? I have tested the bounce
> > buffer code (and bcache), but Jens told me today there was an md bug that I
> > _still_ can't find any emails about so I'm not sure what to test for that.
> 
> (I'm guessing you mean me and not some German person here. :-)
> 
> What is this expected to apply on top of? It doesn't apply cleanly on
> the 20131105 -next tree which is where I had been seeing the issues.

Sorry - it's on top of Jens' for-3.13/core branch,
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Olof Johansson
Hi,

On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet  wrote:

> Chris, Olaf, can you two in particular test this? I have tested the bounce
> buffer code (and bcache), but Jens told me today there was an md bug that I
> _still_ can't find any emails about so I'm not sure what to test for that.

(I'm guessing you mean me and not some German person here. :-)

What is this expected to apply on top of? It doesn't apply cleanly on
the 20131105 -next tree which is where I had been seeing the issues.


-Olof
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Kent Overstreet
This patch reverts the default behaviour introduced by
9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
shares the source bio's biovec, cloning the biovec is once again the
default.

Instead, we add a new bio_clone_biovec_fast(), which creates a clone
that shares the source's biovec. This patch changes bcache and md to use
__bio_clone_biovec_fast() since they're expecting the new behaviour due
to other refactoring; most of the other uses of bio_clone() should be
same to convert to the _fast() variant but that will be done more
incrementally in other patches (bio_split() in particular).

Note that __bio_clone() isn't being readded - the reason being that with
immutable biovecs allocating the right number of biovecs for the new
clone is no longer trivial so we don't want drivers trying to do that
themselves.

This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
__bio_clone_fast() should not be setting bi_vcnt for bios that do not
own the biovec (see Documentation/block/biovecs.txt for rationale) - in
short, not setting it might cause bugs in the short term but long term
it's likely to hide nastier more subtle bugs, we don't want code looking
at bi_vcnt at all for bios it does not own. However, this patch
_shouldn't_ cause any regressions because of this since we're reverting
back to the old bio_clone() behaviour.

Signed-off-by: Kent Overstreet 
Cc: Jens Axboe 
Cc: Chris Mason 
Cc: Mike Snitzer 
Cc: NeilBrown 
Cc: Olof Johansson 
---
Chris, Olaf, can you two in particular test this? I have tested the bounce
buffer code (and bcache), but Jens told me today there was an md bug that I
_still_ can't find any emails about so I'm not sure what to test for that.

 drivers/md/bcache/request.c |   6 +--
 drivers/md/dm.c |   4 +-
 fs/bio.c| 104 +++-
 include/linux/bio.h |   6 +--
 mm/bounce.c |   1 -
 5 files changed, 72 insertions(+), 49 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 52a1fef..ef44198 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -681,7 +681,7 @@ static void do_bio_hook(struct search *s)
struct bio *bio = >bio.bio;
 
bio_init(bio);
-   __bio_clone(bio, s->orig_bio);
+   __bio_clone_fast(bio, s->orig_bio);
bio->bi_end_io  = request_endio;
bio->bi_private = >cl;
 
@@ -969,8 +969,8 @@ static void request_write(struct cached_dev *dc, struct 
search *s)
trace_bcache_write(s->orig_bio, s->writeback, s->op.skip);
 
if (!s->writeback) {
-   s->op.cache_bio = bio_clone_bioset(bio, GFP_NOIO,
-  dc->disk.bio_split);
+   s->op.cache_bio = bio_clone_bioset_fast(bio, GFP_NOIO,
+   dc->disk.bio_split);
 
closure_bio_submit(bio, cl, s->d);
} else {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8e6174c..bafe7ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1135,7 +1135,7 @@ static void clone_bio(struct dm_target_io *tio, struct 
bio *bio,
 {
struct bio *clone = >clone;
 
-   __bio_clone(clone, bio);
+   __bio_clone_fast(clone, bio);
 
if (bio_integrity(bio))
bio_integrity_clone(clone, bio, GFP_NOIO);
@@ -1177,7 +1177,7 @@ static void __clone_and_map_simple_bio(struct clone_info 
*ci,
 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
 * and discard, so no need for concern about wasted bvec allocations.
 */
-__bio_clone(clone, ci->bio);
+__bio_clone_fast(clone, ci->bio);
if (len)
bio_setup_sector(clone, ci->sector, len);
 
diff --git a/fs/bio.c b/fs/bio.c
index 6046c91..99ff176 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -539,15 +539,17 @@ inline int bio_phys_segments(struct request_queue *q, 
struct bio *bio)
 EXPORT_SYMBOL(bio_phys_segments);
 
 /**
- * __bio_clone -   clone a bio
+ * __bio_clone_fast - clone a bio that shares the original bio's biovec
  * @bio: destination bio
  * @bio_src: bio to clone
  *
  * Clone a  Caller will own the returned bio, but not
  * the actual data it points to. Reference count of returned
  * bio will be one.
+ *
+ * Caller must ensure that @bio_src is not freed before @bio.
  */
-void __bio_clone(struct bio *bio, struct bio *bio_src)
+void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 {
BUG_ON(bio->bi_pool && BIO_POOL_IDX(bio) != BIO_POOL_NONE);
 
@@ -560,20 +562,19 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
bio->bi_rw = bio_src->bi_rw;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
-   bio->bi_vcnt = bio_src->bi_vcnt;
 }
-EXPORT_SYMBOL(__bio_clone);
+EXPORT_SYMBOL(__bio_clone_fast);
 
 /**
- * 

[PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Kent Overstreet
This patch reverts the default behaviour introduced by
9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
shares the source bio's biovec, cloning the biovec is once again the
default.

Instead, we add a new bio_clone_biovec_fast(), which creates a clone
that shares the source's biovec. This patch changes bcache and md to use
__bio_clone_biovec_fast() since they're expecting the new behaviour due
to other refactoring; most of the other uses of bio_clone() should be
same to convert to the _fast() variant but that will be done more
incrementally in other patches (bio_split() in particular).

Note that __bio_clone() isn't being readded - the reason being that with
immutable biovecs allocating the right number of biovecs for the new
clone is no longer trivial so we don't want drivers trying to do that
themselves.

This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
__bio_clone_fast() should not be setting bi_vcnt for bios that do not
own the biovec (see Documentation/block/biovecs.txt for rationale) - in
short, not setting it might cause bugs in the short term but long term
it's likely to hide nastier more subtle bugs, we don't want code looking
at bi_vcnt at all for bios it does not own. However, this patch
_shouldn't_ cause any regressions because of this since we're reverting
back to the old bio_clone() behaviour.

Signed-off-by: Kent Overstreet k...@daterainc.com
Cc: Jens Axboe ax...@kernel.dk
Cc: Chris Mason chris.ma...@fusionio.com
Cc: Mike Snitzer snit...@redhat.com
Cc: NeilBrown ne...@suse.de
Cc: Olof Johansson o...@lixom.net
---
Chris, Olaf, can you two in particular test this? I have tested the bounce
buffer code (and bcache), but Jens told me today there was an md bug that I
_still_ can't find any emails about so I'm not sure what to test for that.

 drivers/md/bcache/request.c |   6 +--
 drivers/md/dm.c |   4 +-
 fs/bio.c| 104 +++-
 include/linux/bio.h |   6 +--
 mm/bounce.c |   1 -
 5 files changed, 72 insertions(+), 49 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 52a1fef..ef44198 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -681,7 +681,7 @@ static void do_bio_hook(struct search *s)
struct bio *bio = s-bio.bio;
 
bio_init(bio);
-   __bio_clone(bio, s-orig_bio);
+   __bio_clone_fast(bio, s-orig_bio);
bio-bi_end_io  = request_endio;
bio-bi_private = s-cl;
 
@@ -969,8 +969,8 @@ static void request_write(struct cached_dev *dc, struct 
search *s)
trace_bcache_write(s-orig_bio, s-writeback, s-op.skip);
 
if (!s-writeback) {
-   s-op.cache_bio = bio_clone_bioset(bio, GFP_NOIO,
-  dc-disk.bio_split);
+   s-op.cache_bio = bio_clone_bioset_fast(bio, GFP_NOIO,
+   dc-disk.bio_split);
 
closure_bio_submit(bio, cl, s-d);
} else {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8e6174c..bafe7ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1135,7 +1135,7 @@ static void clone_bio(struct dm_target_io *tio, struct 
bio *bio,
 {
struct bio *clone = tio-clone;
 
-   __bio_clone(clone, bio);
+   __bio_clone_fast(clone, bio);
 
if (bio_integrity(bio))
bio_integrity_clone(clone, bio, GFP_NOIO);
@@ -1177,7 +1177,7 @@ static void __clone_and_map_simple_bio(struct clone_info 
*ci,
 * ci-bio-bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
 * and discard, so no need for concern about wasted bvec allocations.
 */
-__bio_clone(clone, ci-bio);
+__bio_clone_fast(clone, ci-bio);
if (len)
bio_setup_sector(clone, ci-sector, len);
 
diff --git a/fs/bio.c b/fs/bio.c
index 6046c91..99ff176 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -539,15 +539,17 @@ inline int bio_phys_segments(struct request_queue *q, 
struct bio *bio)
 EXPORT_SYMBOL(bio_phys_segments);
 
 /**
- * __bio_clone -   clone a bio
+ * __bio_clone_fast - clone a bio that shares the original bio's biovec
  * @bio: destination bio
  * @bio_src: bio to clone
  *
  * Clone a bio. Caller will own the returned bio, but not
  * the actual data it points to. Reference count of returned
  * bio will be one.
+ *
+ * Caller must ensure that @bio_src is not freed before @bio.
  */
-void __bio_clone(struct bio *bio, struct bio *bio_src)
+void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 {
BUG_ON(bio-bi_pool  BIO_POOL_IDX(bio) != BIO_POOL_NONE);
 
@@ -560,20 +562,19 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
bio-bi_rw = bio_src-bi_rw;
bio-bi_iter = bio_src-bi_iter;
bio-bi_io_vec = bio_src-bi_io_vec;
-   bio-bi_vcnt = bio_src-bi_vcnt;
 }

Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Olof Johansson
Hi,

On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet k...@daterainc.com wrote:

 Chris, Olaf, can you two in particular test this? I have tested the bounce
 buffer code (and bcache), but Jens told me today there was an md bug that I
 _still_ can't find any emails about so I'm not sure what to test for that.

(I'm guessing you mean me and not some German person here. :-)

What is this expected to apply on top of? It doesn't apply cleanly on
the 20131105 -next tree which is where I had been seeing the issues.


-Olof
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-05 Thread Kent Overstreet
On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
 Hi,
 
 On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet k...@daterainc.com wrote:
 
  Chris, Olaf, can you two in particular test this? I have tested the bounce
  buffer code (and bcache), but Jens told me today there was an md bug that I
  _still_ can't find any emails about so I'm not sure what to test for that.
 
 (I'm guessing you mean me and not some German person here. :-)
 
 What is this expected to apply on top of? It doesn't apply cleanly on
 the 20131105 -next tree which is where I had been seeing the issues.

Sorry - it's on top of Jens' for-3.13/core branch,
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/