Re: [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())

2012-08-29 Thread Mikulas Patocka


On Wed, 15 Aug 2012, Kent Overstreet wrote:

> > Both md and dm use __GFP_WAIT allocations from mempools in 
> > generic_make_request.
> > 
> > I think you found an interesting bug here. Suppose that we have three 
> > stacked devices: d1 depends on d2 and d2 depends on d3.
> > 
> > Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and 
> > sends them to the device d2 - these bios end up in current->bio_list. The 
> > driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, 
> > current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off 
> > the bio list and the driver for d2 is called with b2.2 - suppose that for 
> > some reason mempool in d2 is exhausted and the driver needs to wait until 
> > b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 
> > is still in current->bio_list. So it deadlocks.
> > 
> > Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible 
> > deadlock) into another bug (a possible bio failure with -ENOMEM).
> > 
> > Increasing mempool sizes doesn't fix it either, the bio would just have to 
> > be split to more pieces in the above example to make it deadlock.
> > 
> > I think the above possible deadlock scenario could be fixed by reversing 
> > current->bio_list processing - i.e. when some device's make_request_fn 
> > adds some bios to current->bio_list, these bios are processed before other 
> > bios that were on the list before. This way, bio list would contain "b3.1, 
> > b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would 
> > not happen.
> 
> Your patch isn't sufficient in the case where a bio may be split
> multiple times (I'm not sure if it's sufficient in the case where bios
> are only split once; trying to work it all out makes my head hurt).
> 
> You don't need multiple stacked drivers to see this; just the case where
> a single driver is running that splits multiple times is sufficient, if
> you have enough threads submitting at the same time.

That is true. dm splits one bio to multiple bios, so it could still 
deadlock.

Mikulas

> Bcache works around this with the trick I mentioned previously, where it
> masks out __GFP_WAIT if current->bio_list != NULL, and punts to workqueue
> if the allocation fails.
> 
> This works but it'd have to be done in each stacking driver... it's not
> a generic solution, and it's a pain in the ass.
> 
> I came up with another idea the other day. Conceptually, it inverts my
> previous workaround - the punting to workqueue is done in the allocation
> code when necessary, for the bios that would be blocked.
> 
> It's lightly tested, gonna rig up some kind of fault injection and test
> it more thoroughly later.
> 
> commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8
> Author: Kent Overstreet 
> Date:   Mon Aug 13 18:11:01 2012 -0700
> 
> block: Avoid deadlocks with bio allocation by stacking drivers
> 
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request(), we risk deadlock.
> 
> This would happen if e.g. a bio ever needed to be split more than once,
> and it's difficult to handle correctly in the drivers - so in practice
> it's not.
> 
> This patch fixes this issue by allocating a rescuer workqueue for each
> bio_set, and punting queued bios to said rescuer when necessary:
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index bc4265a..7b4f655 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -281,6 +281,23 @@ void bio_reset(struct bio *bio)
>  }
>  EXPORT_SYMBOL(bio_reset);
>  
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> + struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> + struct bio *bio;
> +
> + while (1) {
> + spin_lock(&bs->rescue_lock);
> + bio = bio_list_pop(&bs->rescue_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + if (!bio)
> + break;
> +
> + generic_make_request(bio);
> + }
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset);
>   **/
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> + gfp_t saved_gfp = gfp_mask;
>   unsigned front_pad;
>   unsigned inline_vecs;
>   unsigned long idx = BIO_POOL_NONE;
> @@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>   p = kmalloc(sizeof(struct bio) +
>   nr_iovecs * sizeof(struct bio_vec),
>   gfp_mask);
> +
>   front_pad = 0;
>   inline_vecs = nr_iovecs;
>   } else {
> - p = mempool_alloc(bs->bio_pool, gfp_mask);
> + /*
> +  * If we're running under generic_make_request()
> +  * (current->bio_list != NULL), we risk deadlock if we sleep on
> +  * allocation and there's already bios on current->bio_list that
> +  * were allocated from the same bio_set; they won't be submitted
> +  * (and thus freed) as long as we're blocked here.
> +  *
> +  * To deal with this, we first try 

Re: [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())

2012-08-15 Thread Kent Overstreet
> Both md and dm use __GFP_WAIT allocations from mempools in 
> generic_make_request.
> 
> I think you found an interesting bug here. Suppose that we have three 
> stacked devices: d1 depends on d2 and d2 depends on d3.
> 
> Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and 
> sends them to the device d2 - these bios end up in current->bio_list. The 
> driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, 
> current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off 
> the bio list and the driver for d2 is called with b2.2 - suppose that for 
> some reason mempool in d2 is exhausted and the driver needs to wait until 
> b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 
> is still in current->bio_list. So it deadlocks.
> 
> Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible 
> deadlock) into another bug (a possible bio failure with -ENOMEM).
> 
> Increasing mempool sizes doesn't fix it either, the bio would just have to 
> be split to more pieces in the above example to make it deadlock.
> 
> I think the above possible deadlock scenario could be fixed by reversing 
> current->bio_list processing - i.e. when some device's make_request_fn 
> adds some bios to current->bio_list, these bios are processed before other 
> bios that were on the list before. This way, bio list would contain "b3.1, 
> b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would 
> not happen.

Your patch isn't sufficient in the case where a bio may be split
multiple times (I'm not sure if it's sufficient in the case where bios
are only split once; trying to work it all out makes my head hurt).

You don't need multiple stacked drivers to see this; just the case where
a single driver is running that splits multiple times is sufficient, if
you have enough threads submitting at the same time.

Bcache works around this with the trick I mentioned previously, where it
masks out __GFP_WAIT if current->bio_list != NULL, and punts to workqueue
if the allocation fails.

This works but it'd have to be done in each stacking driver... it's not
a generic solution, and it's a pain in the ass.

I came up with another idea the other day. Conceptually, it inverts my
previous workaround - the punting to workqueue is done in the allocation
code when necessary, for the bios that would be blocked.

It's lightly tested, gonna rig up some kind of fault injection and test
it more thoroughly later.

commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8
Author: Kent Overstreet 
Date:   Mon Aug 13 18:11:01 2012 -0700

block: Avoid deadlocks with bio allocation by stacking drivers

Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request(), we risk deadlock.

This would happen if e.g. a bio ever needed to be split more than once,
and it's difficult to handle correctly in the drivers - so in practice
it's not.

This patch fixes this issue by allocating a rescuer workqueue for each
bio_set, and punting queued bios to said rescuer when necessary:

diff --git a/fs/bio.c b/fs/bio.c
index bc4265a..7b4f655 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -281,6 +281,23 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+   struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+   struct bio *bio;
+
+   while (1) {
+   spin_lock(&bs->rescue_lock);
+   bio = bio_list_pop(&bs->rescue_list);
+   spin_unlock(&bs->rescue_lock);
+
+   if (!bio)
+   break;
+
+   generic_make_request(bio);
+   }
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset);
  **/
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+   gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
p = kmalloc(sizeof(struct bio) +
nr_iovecs * sizeof(struct bio_vec),
gfp_mask);
+
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
-   p = mempool_alloc(bs->bio_pool, gfp_mask);
+   /*
+* If we're running under generic_make_request()
+* (current->bio_list != NULL), we risk deadlock if we sleep on
+* allocation and there's already bios on current->bio_list that
+* were allocated from the same bio_set; they won't be submitted
+* (and thus freed) as long as we're blocked here.
+*
+* To deal with this, we first try 

[PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())

2012-07-26 Thread Mikulas Patocka


On Wed, 25 Jul 2012, Kent Overstreet wrote:

> On Wed, Jul 25, 2012 at 02:55:40PM +0300, Boaz Harrosh wrote:
> > On 07/24/2012 11:11 PM, Kent Overstreet wrote:
> > 
> > > The new bio_split() can split arbitrary bios - it's not restricted to
> > > single page bios, like the old bio_split() (previously renamed to
> > > bio_pair_split()). It also has different semantics - it doesn't allocate
> > > a struct bio_pair, leaving it up to the caller to handle completions.
> > > 
> > > Signed-off-by: Kent Overstreet 
> > > ---
> > >  fs/bio.c |   99 ++
> > >  1 files changed, 99 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/fs/bio.c b/fs/bio.c
> > > index 5d02aa5..a15e121 100644
> > > --- a/fs/bio.c
> > > +++ b/fs/bio.c
> > > @@ -1539,6 +1539,105 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
> > >  EXPORT_SYMBOL(bio_pair_split);
> > >  
> > >  /**
> > > + * bio_split - split a bio
> > > + * @bio: bio to split
> > > + * @sectors: number of sectors to split from the front of @bio
> > > + * @gfp: gfp mask
> > > + * @bs:  bio set to allocate from
> > > + *
> > > + * Allocates and returns a new bio which represents @sectors from the start of
> > > + * @bio, and updates @bio to represent the remaining sectors.
> > > + *
> > > + * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
> > > + * unchanged.
> > > + *
> > > + * The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
> > > + * bvec boundary; it is the caller's responsibility to ensure that @bio is not
> > > + * freed before the split.
> > > + *
> > > + * If bio_split() is running under generic_make_request(), it's not safe to
> > > + * allocate more than one bio from the same bio set. Therefore, if it is running
> > > + * under generic_make_request() it masks out __GFP_WAIT when doing the
> > > + * allocation. The caller must check for failure if there's any possibility of
> > > + * it being called from under generic_make_request(); it is then the caller's
> > > + * responsibility to retry from a safe context (by e.g. punting to workqueue).
> > > + */
> > > +struct bio *bio_split(struct bio *bio, int sectors,
> > > +   gfp_t gfp, struct bio_set *bs)
> > > +{
> > > + unsigned idx, vcnt = 0, nbytes = sectors << 9;
> > > + struct bio_vec *bv;
> > > + struct bio *ret = NULL;
> > > +
> > > + BUG_ON(sectors <= 0);
> > > +
> > > + /*
> > > +  * If we're being called from underneath generic_make_request() and we
> > > +  * already allocated any bios from this bio set, we risk deadlock if we
> > > +  * use the mempool. So instead, we possibly fail and let the caller punt
> > > +  * to workqueue or somesuch and retry in a safe context.
> > > +  */
> > > + if (current->bio_list)
> > > + gfp &= ~__GFP_WAIT;
> > 
> > 
> > NACK!
> > 
> > If as you said above in the comment:
> > if there's any possibility of it being called from under 
> > generic_make_request();
> > it is then the caller's responsibility to ...
> > 
> > So all the comment needs to say is: 
> > ... caller's responsibility to not set __GFP_WAIT at gfp.
> > 
> > And drop this here. It is up to the caller to decide. If the caller wants 
> > he can do
> > "if (current->bio_list)" by his own.
> > 
> > This is a general purpose utility; you might not know its context.
> > for example with osdblk above will break.
> 
> Well I'm highly highly skeptical that using __GFP_WAIT under
> generic_make_request() is ever a sane thing to do - it could certainly
> be safe in specific circumstances, but it's just such a fragile thing to
> rely on, you have to _never_ use the same bio pool more than once. And
> even then I bet there's other subtle ways it could break.
> 
> But you're not the first to complain about it, and your point about
> existing code is compelling.

Both md and dm use __GFP_WAIT allocations from mempools in 
generic_make_request.

I think you found an interesting bug here. Suppose that we have three 
stacked devices: d1 depends on d2 and d2 depends on d3.

Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and 
sends them to the device d2 - these bios end up in current->bio_list. The 
driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, 
current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off 
the bio list and the driver for d2 is called with b2.2 - suppose that for 
some reason mempool in d2 is exhausted and the driver needs to wait until 
b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 
is still in current->bio_list. So it deadlocks.

Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible 
deadlock) into another bug (a possible bio failure with -ENOMEM).

Increasing mempool sizes doesn't fix it either, the bio would just have to 
be split to more pieces in the above example to make it deadlock.

I think the above possible deadlock scenario could be fixed by reversing 
current->bio_list processing - i.e. when some device's make_request_fn 
adds some bios to current->bio_list, these bios are processed before other 
bios that were on the list before. This way, bio list would contain "b3.1, 
b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would 
not happen.

Re: [PATCH v4 08/12] block: Introduce new bio_split()

2012-07-25 Thread Kent Overstreet
On Wed, Jul 25, 2012 at 02:55:40PM +0300, Boaz Harrosh wrote:
> On 07/24/2012 11:11 PM, Kent Overstreet wrote:
> 
> > The new bio_split() can split arbitrary bios - it's not restricted to
> > single page bios, like the old bio_split() (previously renamed to
> > bio_pair_split()). It also has different semantics - it doesn't allocate
> > a struct bio_pair, leaving it up to the caller to handle completions.
> > 
> > Signed-off-by: Kent Overstreet 
> > ---
> >  fs/bio.c |   99 ++
> >  1 files changed, 99 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/bio.c b/fs/bio.c
> > index 5d02aa5..a15e121 100644
> > --- a/fs/bio.c
> > +++ b/fs/bio.c
> > @@ -1539,6 +1539,105 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
> >  EXPORT_SYMBOL(bio_pair_split);
> >  
> >  /**
> > + * bio_split - split a bio
> > + * @bio:   bio to split
> > + * @sectors:   number of sectors to split from the front of @bio
> > + * @gfp:   gfp mask
> > + * @bs:bio set to allocate from
> > + *
> > + * Allocates and returns a new bio which represents @sectors from the start of
> > + * @bio, and updates @bio to represent the remaining sectors.
> > + *
> > + * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
> > + * unchanged.
> > + *
> > + * The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
> > + * bvec boundary; it is the caller's responsibility to ensure that @bio is not
> > + * freed before the split.
> > + *
> > + * If bio_split() is running under generic_make_request(), it's not safe to
> > + * allocate more than one bio from the same bio set. Therefore, if it is running
> > + * under generic_make_request() it masks out __GFP_WAIT when doing the
> > + * allocation. The caller must check for failure if there's any possibility of
> > + * it being called from under generic_make_request(); it is then the caller's
> > + * responsibility to retry from a safe context (by e.g. punting to workqueue).
> > + */
> > +struct bio *bio_split(struct bio *bio, int sectors,
> > + gfp_t gfp, struct bio_set *bs)
> > +{
> > +   unsigned idx, vcnt = 0, nbytes = sectors << 9;
> > +   struct bio_vec *bv;
> > +   struct bio *ret = NULL;
> > +
> > +   BUG_ON(sectors <= 0);
> > +
> > +   /*
> > +* If we're being called from underneath generic_make_request() and we
> > +* already allocated any bios from this bio set, we risk deadlock if we
> > +* use the mempool. So instead, we possibly fail and let the caller punt
> > +* to workqueue or somesuch and retry in a safe context.
> > +*/
> > +   if (current->bio_list)
> > +   gfp &= ~__GFP_WAIT;
> 
> 
> NACK!
> 
> If, as you said above in the comment:
>   if there's any possibility of it being called from under
>   generic_make_request(); it is then the caller's responsibility to ...
> 
> So all the comment needs to say is:
>   ... caller's responsibility to not set __GFP_WAIT in @gfp.
> 
> And drop this here. It is up to the caller to decide. If the caller wants,
> he can do "if (current->bio_list)" on his own.
> 
> This is a general-purpose utility; you might not know its context.
> For example, osdblk above will break.

Well, I'm highly skeptical that using __GFP_WAIT under
generic_make_request() is ever a sane thing to do - it could certainly
be safe in specific circumstances, but it's such a fragile thing to
rely on: you have to _never_ use the same bio pool more than once. And
even then I bet there are other subtle ways it could break.

But you're not the first to complain about it, and your point about
existing code is compelling.

commit ea124f899af29887e24d07497442066572012e5b
Author: Kent Overstreet 
Date:   Wed Jul 25 16:25:10 2012 -0700

block: Introduce new bio_split()

The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

diff --git a/fs/bio.c b/fs/bio.c
index 0470376..312e5de 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1537,6 +1537,102 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
 EXPORT_SYMBOL(bio_pair_split);
 
 /**
+ * bio_split - split a bio
+ * @bio:   bio to split
+ * @sectors:   number of sectors to split from the front of @bio
+ * @gfp:   gfp mask
+ * @bs:bio set to allocate from
+ *
+ * Allocates and returns a new bio which represents @sectors from the start of
+ * @bio, and updates @bio to represent the remaining sectors.
+ *
+ * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
+ * unchanged.
+ *
 + * The newly allocated bio will point to @bio's bi_io_vec, if the split was
 + * on a bvec boundary; it is the caller's responsibility to ensure that @bio
 + * is not freed before the split.

Re: [PATCH v4 08/12] block: Introduce new bio_split()

2012-07-25 Thread Boaz Harrosh
On 07/24/2012 11:11 PM, Kent Overstreet wrote:

> The new bio_split() can split arbitrary bios - it's not restricted to
> single page bios, like the old bio_split() (previously renamed to
> bio_pair_split()). It also has different semantics - it doesn't allocate
> a struct bio_pair, leaving it up to the caller to handle completions.
> 
> Signed-off-by: Kent Overstreet 
> ---
>  fs/bio.c |   99 
> ++
>  1 files changed, 99 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index 5d02aa5..a15e121 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1539,6 +1539,105 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
>  EXPORT_SYMBOL(bio_pair_split);
>  
>  /**
> + * bio_split - split a bio
> + * @bio: bio to split
> + * @sectors: number of sectors to split from the front of @bio
> + * @gfp: gfp mask
> + * @bs:  bio set to allocate from
> + *
> + * Allocates and returns a new bio which represents @sectors from the
> + * start of @bio, and updates @bio to represent the remaining sectors.
> + *
> + * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
> + * unchanged.
> + *
> + * The newly allocated bio will point to @bio's bi_io_vec, if the split was
> + * on a bvec boundary; it is the caller's responsibility to ensure that @bio
> + * is not freed before the split.
> + *
> + * If bio_split() is running under generic_make_request(), it's not safe to
> + * allocate more than one bio from the same bio set. Therefore, if it is
> + * running under generic_make_request() it masks out __GFP_WAIT when doing
> + * the allocation. The caller must check for failure if there's any
> + * possibility of it being called from under generic_make_request(); it is
> + * then the caller's responsibility to retry from a safe context (by e.g.
> + * punting to workqueue).
> + */
> +struct bio *bio_split(struct bio *bio, int sectors,
> +   gfp_t gfp, struct bio_set *bs)
> +{
> + unsigned idx, vcnt = 0, nbytes = sectors << 9;
> + struct bio_vec *bv;
> + struct bio *ret = NULL;
> +
> + BUG_ON(sectors <= 0);
> +
> + /*
> +  * If we're being called from underneath generic_make_request() and we
> +  * already allocated any bios from this bio set, we risk deadlock if we
> +  * use the mempool. So instead, we possibly fail and let the caller punt
> +  * to workqueue or somesuch and retry in a safe context.
> +  */
> + if (current->bio_list)
> + gfp &= ~__GFP_WAIT;


NACK!

If, as you said above in the comment:
  if there's any possibility of it being called from under
  generic_make_request(); it is then the caller's responsibility to ...

So all the comment needs to say is:
  ... caller's responsibility to not set __GFP_WAIT in @gfp.

And drop this here. It is up to the caller to decide. If the caller wants,
he can do "if (current->bio_list)" on his own.

This is a general-purpose utility; you might not know its context.
For example, osdblk above will break.

Thanks
Boaz

> +
> + if (sectors >= bio_sectors(bio))
> + return bio;
> +
> + trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
> +   bio->bi_sector + sectors);
> +
> + bio_for_each_segment(bv, bio, idx) {
> + vcnt = idx - bio->bi_idx;
> +
> + if (!nbytes) {
> + ret = bio_alloc_bioset(gfp, 0, bs);
> + if (!ret)
> + return NULL;
> +
> + ret->bi_io_vec = bio_iovec(bio);
> + ret->bi_flags |= 1 << BIO_CLONED;
> + break;
> + } else if (nbytes < bv->bv_len) {
> + ret = bio_alloc_bioset(gfp, ++vcnt, bs);
> + if (!ret)
> + return NULL;
> +
> + memcpy(ret->bi_io_vec, bio_iovec(bio),
> +sizeof(struct bio_vec) * vcnt);
> +
> + ret->bi_io_vec[vcnt - 1].bv_len = nbytes;
> + bv->bv_offset   += nbytes;
> + bv->bv_len  -= nbytes;
> + break;
> + }
> +
> + nbytes -= bv->bv_len;
> + }
> +
> + ret->bi_bdev= bio->bi_bdev;
> + ret->bi_sector  = bio->bi_sector;
> + ret->bi_size= sectors << 9;
> + ret->bi_rw  = bio->bi_rw;
> + ret->bi_vcnt= vcnt;
> + ret->bi_max_vecs = vcnt;
> + ret->bi_end_io  = bio->bi_end_io;
> + ret->bi_private = bio->bi_private;
> +
> + bio->bi_sector  += sectors;
> + bio->bi_size-= sectors << 9;
> + bio->bi_idx  = idx;
> +
> + if (bio_integrity(bio)) {
> + bio_integrity_clone(ret, bio, gfp, bs);
> + bio_integrity_trim(ret, 0, bio_sectors(ret));
> + bio_integrity_trim(bio, bio_sectors(ret), bio_sectors(bio));
> + }

[PATCH v4 08/12] block: Introduce new bio_split()

2012-07-24 Thread Kent Overstreet
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

Signed-off-by: Kent Overstreet 
---
 fs/bio.c |   99 ++
 1 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 5d02aa5..a15e121 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1539,6 +1539,105 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
 EXPORT_SYMBOL(bio_pair_split);
 
 /**
+ * bio_split - split a bio
+ * @bio:   bio to split
+ * @sectors:   number of sectors to split from the front of @bio
+ * @gfp:   gfp mask
+ * @bs:bio set to allocate from
+ *
+ * Allocates and returns a new bio which represents @sectors from the start of
+ * @bio, and updates @bio to represent the remaining sectors.
+ *
+ * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
+ * unchanged.
+ *
 + * The newly allocated bio will point to @bio's bi_io_vec, if the split was
 + * on a bvec boundary; it is the caller's responsibility to ensure that @bio
 + * is not freed before the split.
 + *
 + * If bio_split() is running under generic_make_request(), it's not safe to
 + * allocate more than one bio from the same bio set. Therefore, if it is
 + * running under generic_make_request() it masks out __GFP_WAIT when doing
 + * the allocation. The caller must check for failure if there's any
 + * possibility of it being called from under generic_make_request(); it is
 + * then the caller's responsibility to retry from a safe context (by e.g.
 + * punting to workqueue).
 + */
+struct bio *bio_split(struct bio *bio, int sectors,
+ gfp_t gfp, struct bio_set *bs)
+{
+   unsigned idx, vcnt = 0, nbytes = sectors << 9;
+   struct bio_vec *bv;
+   struct bio *ret = NULL;
+
+   BUG_ON(sectors <= 0);
+
+   /*
+* If we're being called from underneath generic_make_request() and we
+* already allocated any bios from this bio set, we risk deadlock if we
+* use the mempool. So instead, we possibly fail and let the caller punt
+* to workqueue or somesuch and retry in a safe context.
+*/
+   if (current->bio_list)
+   gfp &= ~__GFP_WAIT;
+
+   if (sectors >= bio_sectors(bio))
+   return bio;
+
+   trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
+ bio->bi_sector + sectors);
+
+   bio_for_each_segment(bv, bio, idx) {
+   vcnt = idx - bio->bi_idx;
+
+   if (!nbytes) {
+   ret = bio_alloc_bioset(gfp, 0, bs);
+   if (!ret)
+   return NULL;
+
+   ret->bi_io_vec = bio_iovec(bio);
+   ret->bi_flags |= 1 << BIO_CLONED;
+   break;
+   } else if (nbytes < bv->bv_len) {
+   ret = bio_alloc_bioset(gfp, ++vcnt, bs);
+   if (!ret)
+   return NULL;
+
+   memcpy(ret->bi_io_vec, bio_iovec(bio),
+  sizeof(struct bio_vec) * vcnt);
+
+   ret->bi_io_vec[vcnt - 1].bv_len = nbytes;
+   bv->bv_offset   += nbytes;
+   bv->bv_len  -= nbytes;
+   break;
+   }
+
+   nbytes -= bv->bv_len;
+   }
+
+   ret->bi_bdev= bio->bi_bdev;
+   ret->bi_sector  = bio->bi_sector;
+   ret->bi_size= sectors << 9;
+   ret->bi_rw  = bio->bi_rw;
+   ret->bi_vcnt= vcnt;
+   ret->bi_max_vecs = vcnt;
+   ret->bi_end_io  = bio->bi_end_io;
+   ret->bi_private = bio->bi_private;
+
+   bio->bi_sector  += sectors;
+   bio->bi_size-= sectors << 9;
+   bio->bi_idx  = idx;
+
+   if (bio_integrity(bio)) {
+   bio_integrity_clone(ret, bio, gfp, bs);
+   bio_integrity_trim(ret, 0, bio_sectors(ret));
+   bio_integrity_trim(bio, bio_sectors(ret), bio_sectors(bio));
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(bio_split);
+
+/**
  *  bio_sector_offset - Find hardware sector offset in bio
  *  @bio:   bio to inspect
  *  @index: bio_vec index
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

