Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread NeilBrown
On Wed, Nov 22 2017, Mikulas Patocka wrote:

> On Wed, 22 Nov 2017, NeilBrown wrote:
>
>> On Tue, Nov 21 2017, Mikulas Patocka wrote:
>> 
>> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
>> >
>> >> On Tue, Nov 21 2017 at  4:23pm -0500,
>> >> Mikulas Patocka  wrote:
>> >> 
>> >> > This is not correct:
>> >> > 
>> >> > 2206 static void dm_wq_work(struct work_struct *work)
>> >> > 2207 {
>> >> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >> > 2209         struct bio *bio;
>> >> > 2210         int srcu_idx;
>> >> > 2211         struct dm_table *map;
>> >> > 2212
>> >> > 2213         if (!bio_list_empty(&md->rescued)) {
>> >> > 2214                 struct bio_list list;
>> >> > 2215                 spin_lock_irq(&md->deferred_lock);
>> >> > 2216                 list = md->rescued;
>> >> > 2217                 bio_list_init(&md->rescued);
>> >> > 2218                 spin_unlock_irq(&md->deferred_lock);
>> >> > 2219                 while ((bio = bio_list_pop(&list)))
>> >> > 2220                         generic_make_request(bio);
>> >> > 2221         }
>> >> > 2222
>> >> > 2223         map = dm_get_live_table(md, &srcu_idx);
>> >> > 2224
>> >> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >> > 2226                 spin_lock_irq(&md->deferred_lock);
>> >> > 2227                 bio = bio_list_pop(&md->deferred);
>> >> > 2228                 spin_unlock_irq(&md->deferred_lock);
>> >> > 2229
>> >> > 2230                 if (!bio)
>> >> > 2231                         break;
>> >> > 2232
>> >> > 2233                 if (dm_request_based(md))
>> >> > 2234                         generic_make_request(bio);
>> >> > 2235                 else
>> >> > 2236                         __split_and_process_bio(md, map, bio);
>> >> > 2237         }
>> >> > 2238
>> >> > 2239         dm_put_live_table(md, srcu_idx);
>> >> > 2240 }
>> >> > 
>> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> >> > will not process md->rescued list.
>> >> 
>> >> Can you elaborate further?  We cannot be "in dm_wq_work in
>> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> >> of scheduling away from __split_and_process_bio?
>> >> 
>> >> The more detail you can share the better.
>> >
>> > Suppose this scenario:
>> >
>> > * dm_wq_work calls __split_and_process_bio
>> > * __split_and_process_bio eventually reaches the function snapshot_map
>> > * snapshot_map attempts to take the snapshot lock
>> >
>> > * the snapshot lock could be released only if some bios submitted by the 
>> > snapshot driver to the underlying device complete
>> > * the bios submitted to the underlying device were already offloaded by 
>> > some other task and they are waiting on the list md->rescued
>> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
>> > blocked in snapshot_map (called from __split_and_process_bio)
>> 
>> Yes, I think you are right. 
>> 
>> I think the solution is to get rid of the dm_offload() infrastructure
>> and make it not necessary.
>> i.e. discard my patches
>> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
>> and
>> dm: revise 'rescue' strategy for bio-based bioset allocations
>> 
>> And build on "dm: ensure bio submission follows a depth-first tree walk"
>> which was written after those and already makes dm_offload() less
>> important.
>> 
>> Since that "depth-first" patch, every request to the dm device, after
>> the initial splitting, allocates just one dm_target_io structure, and
>> makes just one __map_bio() call, and so will behave exactly the way
>> generic_make_request() expects and copes with - thus avoiding awkward
>> dependencies and deadlocks.  Except
>> 
>> a/ If any target defines ->num_write_bios() to return >1,
>>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>>and __map_bio(), which might need rescuing.
>>But no target defines num_write_bios, and none have since it was
>>removed from dm-cache 4.5 years ago.
>>Can we discard num_write_bios??
>> 
>> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>>to a value > 1, then __send_duplicate_bios() will also make multiple
>>calls to alloc_tio() and __map_bio().
>>Some do.
>>  dm-cache-target:  flush=2
>>  dm-snap: flush=2
>>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
>> 
>> These will only be a problem if the second (or subsequent) alloc_tio()
>> blocks waiting for an earlier allocation to complete.  This will only
>> be a problem if multiple threads are each trying to allocate multiple
>> dm_target_io from the same bioset at the same time.
>> This is rare and should be easier to address than the current
>> dm_offload() approach. 
>> One possibility would be to copy the approach taken by
>> 

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Tue, Nov 21 2017 at 11:28pm -0500,
Mike Snitzer  wrote:

> 
> I'll work through your patches tomorrow.

Please see the top 3 patches on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

This rebased dm-4.16 branch seems to be working well so far.

Mike


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Wed, Nov 22 2017 at  1:24pm -0500,
Mikulas Patocka  wrote:
 
> Another problem is this:
> 
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
> 
> What if it blocks because the bioset is exhausted?
> 
> The code basically builds a chain of bios of unlimited length (suppose, for
> example, a case where we are splitting on every sector boundary, so there
> will be one bio for every sector in the original bio); it could exhaust
> the bioset easily.
> 
> It would be better to use the mechanism from md-raid, which chains all the
> sub-bios to the same master bio and doesn't create long chains of bios:
> 
> if (max_sectors < bio_sectors(bio)) {
> struct bio *split = bio_split(bio, max_sectors,
>   gfp, conf->bio_split);
> bio_chain(split, bio);
> generic_make_request(bio);
> bio = split;
> r1_bio->master_bio = bio;
> r1_bio->sectors = max_sectors;
> }
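
To make that suggestion concrete, here is a rough sketch (illustrative only:
the helper name and the clone_info fields used are assumptions, not code from
any posted patch) of the split-and-chain pattern transplanted into DM's
splitting path:

/* Hypothetical sketch, in the style of drivers/md/dm.c (~4.15 APIs). */
static void dm_split_and_resubmit(struct clone_info *ci, struct mapped_device *md,
                                  struct bio *bio, unsigned max_sectors)
{
        if (max_sectors < bio_sectors(bio)) {
                /* Carve off only the front part that will be mapped now. */
                struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
                                              md->queue->bio_split);

                /* The original bio becomes the parent of the split piece. */
                bio_chain(split, bio);

                /* Hand the untouched remainder straight back to the block
                 * layer rather than growing a private chain of clones. */
                generic_make_request(bio);

                /* Continue processing only the front piece. */
                ci->bio = split;
                ci->sector_count = max_sectors;
        }
}

The key property, as in raid1, is that each pass through the splitter submits
at most one remainder bio, so the depth of chained bios stays bounded and a
single oversized request cannot drain the bioset.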

I'd be happy to take an incremental patch that improves on this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=b46d6a08f1ae7bf53e4cde28e0ccdf91567d432e

But short of that I'll have to come back to this.

Thanks,
Mike


Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mikulas Patocka


On Wed, 22 Nov 2017, NeilBrown wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka  wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> > 2206 static void dm_wq_work(struct work_struct *work)
> >> > 2207 {
> >> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> > 2209         struct bio *bio;
> >> > 2210         int srcu_idx;
> >> > 2211         struct dm_table *map;
> >> > 2212
> >> > 2213         if (!bio_list_empty(&md->rescued)) {
> >> > 2214                 struct bio_list list;
> >> > 2215                 spin_lock_irq(&md->deferred_lock);
> >> > 2216                 list = md->rescued;
> >> > 2217                 bio_list_init(&md->rescued);
> >> > 2218                 spin_unlock_irq(&md->deferred_lock);
> >> > 2219                 while ((bio = bio_list_pop(&list)))
> >> > 2220                         generic_make_request(bio);
> >> > 2221         }
> >> > 2222
> >> > 2223         map = dm_get_live_table(md, &srcu_idx);
> >> > 2224
> >> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> > 2226                 spin_lock_irq(&md->deferred_lock);
> >> > 2227                 bio = bio_list_pop(&md->deferred);
> >> > 2228                 spin_unlock_irq(&md->deferred_lock);
> >> > 2229
> >> > 2230                 if (!bio)
> >> > 2231                         break;
> >> > 2232
> >> > 2233                 if (dm_request_based(md))
> >> > 2234                         generic_make_request(bio);
> >> > 2235                 else
> >> > 2236                         __split_and_process_bio(md, map, bio);
> >> > 2237         }
> >> > 2238
> >> > 2239         dm_put_live_table(md, srcu_idx);
> >> > 2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
> dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except
> 
> a/ If any target defines ->num_write_bios() to return >1,
>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>and __map_bio(), which might need rescuing.
>But no target defines num_write_bios, and none have since it was
>removed from dm-cache 4.5 years ago.
>Can we discard num_write_bios??
> 
> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>to a value > 1, then __send_duplicate_bios() will also make multiple
>calls to alloc_tio() and __map_bio().
>Some do.
>  dm-cache-target:  flush=2
>  dm-snap: flush=2
>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate multiple entries from a
> mempool.
> It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
> tries with GFP_NOIO.  

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at 11:00pm -0500,
NeilBrown  wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka  wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> > 2206 static void dm_wq_work(struct work_struct *work)
> >> > 2207 {
> >> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> > 2209         struct bio *bio;
> >> > 2210         int srcu_idx;
> >> > 2211         struct dm_table *map;
> >> > 2212
> >> > 2213         if (!bio_list_empty(&md->rescued)) {
> >> > 2214                 struct bio_list list;
> >> > 2215                 spin_lock_irq(&md->deferred_lock);
> >> > 2216                 list = md->rescued;
> >> > 2217                 bio_list_init(&md->rescued);
> >> > 2218                 spin_unlock_irq(&md->deferred_lock);
> >> > 2219                 while ((bio = bio_list_pop(&list)))
> >> > 2220                         generic_make_request(bio);
> >> > 2221         }
> >> > 2222
> >> > 2223         map = dm_get_live_table(md, &srcu_idx);
> >> > 2224
> >> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> > 2226                 spin_lock_irq(&md->deferred_lock);
> >> > 2227                 bio = bio_list_pop(&md->deferred);
> >> > 2228                 spin_unlock_irq(&md->deferred_lock);
> >> > 2229
> >> > 2230                 if (!bio)
> >> > 2231                         break;
> >> > 2232
> >> > 2233                 if (dm_request_based(md))
> >> > 2234                         generic_make_request(bio);
> >> > 2235                 else
> >> > 2236                         __split_and_process_bio(md, map, bio);
> >> > 2237         }
> >> > 2238
> >> > 2239         dm_put_live_table(md, srcu_idx);
> >> > 2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
> dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except

Yes, FYI I've also verified that even with just the "depth-first" patch
(and dm_offload disabled) the snapshot deadlock is fixed.
 
> a/ If any target defines ->num_write_bios() to return >1,
>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>and __map_bio(), which might need rescuing.
>But no target defines num_write_bios, and none have since it was
>removed from dm-cache 4.5 years ago.
>Can we discard num_write_bios??

Yes.

> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>to a value > 1, then __send_duplicate_bios() will also make multiple
>calls to alloc_tio() and __map_bio().
>Some do.
>  dm-cache-target:  flush=2
>  dm-snap: flush=2
>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> 

Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread NeilBrown
On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  4:23pm -0500,
>> Mikulas Patocka  wrote:
>> 
>> > This is not correct:
>> > 
>> > 2206 static void dm_wq_work(struct work_struct *work)
>> > 2207 {
>> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> > 2209         struct bio *bio;
>> > 2210         int srcu_idx;
>> > 2211         struct dm_table *map;
>> > 2212
>> > 2213         if (!bio_list_empty(&md->rescued)) {
>> > 2214                 struct bio_list list;
>> > 2215                 spin_lock_irq(&md->deferred_lock);
>> > 2216                 list = md->rescued;
>> > 2217                 bio_list_init(&md->rescued);
>> > 2218                 spin_unlock_irq(&md->deferred_lock);
>> > 2219                 while ((bio = bio_list_pop(&list)))
>> > 2220                         generic_make_request(bio);
>> > 2221         }
>> > 2222
>> > 2223         map = dm_get_live_table(md, &srcu_idx);
>> > 2224
>> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> > 2226                 spin_lock_irq(&md->deferred_lock);
>> > 2227                 bio = bio_list_pop(&md->deferred);
>> > 2228                 spin_unlock_irq(&md->deferred_lock);
>> > 2229
>> > 2230                 if (!bio)
>> > 2231                         break;
>> > 2232
>> > 2233                 if (dm_request_based(md))
>> > 2234                         generic_make_request(bio);
>> > 2235                 else
>> > 2236                         __split_and_process_bio(md, map, bio);
>> > 2237         }
>> > 2238
>> > 2239         dm_put_live_table(md, srcu_idx);
>> > 2240 }
>> > 
>> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> > will not process md->rescued list.
>> 
>> Can you elaborate further?  We cannot be "in dm_wq_work in
>> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> of scheduling away from __split_and_process_bio?
>> 
>> The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

Yes, I think you are right. 

I think the solution is to get rid of the dm_offload() infrastructure
and make it not necessary.
i.e. discard my patches
dm: prepare to discontinue use of BIOSET_NEED_RESCUER
and
dm: revise 'rescue' strategy for bio-based bioset allocations

And build on "dm: ensure bio submission follows a depth-first tree walk"
which was written after those and already makes dm_offload() less
important.

Since that "depth-first" patch, every request to the dm device, after
the initial splitting, allocates just one dm_target_io structure, and
makes just one __map_bio() call, and so will behave exactly the way
generic_make_request() expects and copes with - thus avoiding awkward
dependencies and deadlocks.  Except

a/ If any target defines ->num_write_bios() to return >1,
   __clone_and_map_data_bio() will make multiple calls to alloc_tio()
   and __map_bio(), which might need rescuing.
   But no target defines num_write_bios, and none have since it was
   removed from dm-cache 4.5 years ago.
   Can we discard num_write_bios??

b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
   to a value > 1, then __send_duplicate_bios() will also make multiple
   calls to alloc_tio() and __map_bio().
   Some do.
 dm-cache-target:  flush=2
 dm-snap: flush=2
 dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.

These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete.  This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach. 
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
tries with GFP_NOIO.  This means only one thread will try to allocate
multiple bios at once, so there can be no deadlock.
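
As a concrete illustration of that discipline, a minimal sketch follows
(the helper and lock names are assumptions for illustration; this shows the
general pattern, not the actual crypt_alloc_buffer() code):

#include <linux/bio.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(multi_alloc_lock);

/*
 * Allocate 'count' bios from one bioset without risking the
 * multi-allocation deadlock: an opportunistic GFP_NOWAIT pass first,
 * then a serialized GFP_NOIO pass.  Assumes the bioset's reserve is at
 * least 'count' entries deep.
 */
static void alloc_clones(struct bio_set *bs, struct bio **clones, unsigned count)
{
        unsigned i;

        /* Opportunistic pass: never sleep waiting on the bioset. */
        for (i = 0; i < count; i++) {
                clones[i] = bio_alloc_bioset(GFP_NOWAIT, 0, bs);
                if (!clones[i])
                        break;
        }
        if (i == count)
                return;

        /* Give back the partial allocation, then retry under a mutex with
         * GFP_NOIO, so at most one thread can block on this bioset at a
         * time and the deadlock cannot form. */
        while (i)
                bio_put(clones[--i]);

        mutex_lock(&multi_alloc_lock);
        for (i = 0; i < count; i++)
                clones[i] = bio_alloc_bioset(GFP_NOIO, 0, bs);
        mutex_unlock(&multi_alloc_lock);
}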

Below are two RFC patches.  The first removes num_write_bios.
The second is incomplete and makes a stab at allocating multiple bios
at once safely.
A third would be 

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at  8:21pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  4:23pm -0500,
> > Mikulas Patocka  wrote:
> > 
> > > This is not correct:
> > > 
> > > 2206 static void dm_wq_work(struct work_struct *work)
> > > 2207 {
> > > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > > 2209         struct bio *bio;
> > > 2210         int srcu_idx;
> > > 2211         struct dm_table *map;
> > > 2212
> > > 2213         if (!bio_list_empty(&md->rescued)) {
> > > 2214                 struct bio_list list;
> > > 2215                 spin_lock_irq(&md->deferred_lock);
> > > 2216                 list = md->rescued;
> > > 2217                 bio_list_init(&md->rescued);
> > > 2218                 spin_unlock_irq(&md->deferred_lock);
> > > 2219                 while ((bio = bio_list_pop(&list)))
> > > 2220                         generic_make_request(bio);
> > > 2221         }
> > > 2222
> > > 2223         map = dm_get_live_table(md, &srcu_idx);
> > > 2224
> > > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > > 2226                 spin_lock_irq(&md->deferred_lock);
> > > 2227                 bio = bio_list_pop(&md->deferred);
> > > 2228                 spin_unlock_irq(&md->deferred_lock);
> > > 2229
> > > 2230                 if (!bio)
> > > 2231                         break;
> > > 2232
> > > 2233                 if (dm_request_based(md))
> > > 2234                         generic_make_request(bio);
> > > 2235                 else
> > > 2236                         __split_and_process_bio(md, map, bio);
> > > 2237         }
> > > 2238
> > > 2239         dm_put_live_table(md, srcu_idx);
> > > 2240 }
> > > 
> > > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > > will not process md->rescued list.
> > 
> > Can you elaborate further?  We cannot be "in dm_wq_work in
> > __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> > of scheduling away from __split_and_process_bio?
> > 
> > The more detail you can share the better.
> 
> Suppose this scenario:
> 
> * dm_wq_work calls __split_and_process_bio

Right, I later realized this was the call chain you were referring to.
Not sure how I missed it the first time around.

> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
> 
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

So you're saying the case that Neil doesn't think should happen:
https://lkml.org/lkml/2017/11/21/658
...can happen.

> > > The processing of md->rescued is also wrong - bios for different devices 
> > > must be offloaded to different helper threads, so that processing a bio 
> > > for a lower device doesn't depend on processing a bio for a higher 
> > > device. 
> > > If you offload all the bios on current->bio_list to the same thread, the 
> > > bios still depend on each other and the deadlock will still happen.
> > 
> > Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> > allocations") speaks to this with:
> > 
> > "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> > contains bios that were scheduled *before* the current one started, so
> > they must have been submitted from higher up the stack, and we cannot be
> > waiting for them here (thanks to the "dm: ensure bio submission follows
> > a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> > list as there is nothing to be gained by being more selective."
> 
> I think you are right - if we only offload current->bio_list[0], then 
> mixing of dependent bios on the offloaded list won't happen.
> 
> > And again: this patchset passes your dm-snapshot deadlock test.  Is
> > that test somehow lacking?
> 
> With your patchset, the deadlock would happen only if bios are queued on 
> md->deferred - and that happens only in case of resume or if we are 
> processing REQ_PREFLUSH with non-zero data size.
> 
> So, the simple test that I wrote doesn't trigger it, but a more complex 
> test involving REQ_PREFLUSH could.

Makes sense.  But I need to think further about _why_ bios submitted to
the snapshot driver's underlying device would end up on md->rescued
(like you suggested above).  Again, Neil thinks it not possible.  Neil
said:
"they will not be recursive calls, so nothing will be added to
current->bio_list[0] and nothing will be moved to md->rescued.  Each

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  4:23pm -0500,
> Mikulas Patocka  wrote:
> 
> > This is not correct:
> > 
> > 2206 static void dm_wq_work(struct work_struct *work)
> > 2207 {
> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > 2209         struct bio *bio;
> > 2210         int srcu_idx;
> > 2211         struct dm_table *map;
> > 2212
> > 2213         if (!bio_list_empty(&md->rescued)) {
> > 2214                 struct bio_list list;
> > 2215                 spin_lock_irq(&md->deferred_lock);
> > 2216                 list = md->rescued;
> > 2217                 bio_list_init(&md->rescued);
> > 2218                 spin_unlock_irq(&md->deferred_lock);
> > 2219                 while ((bio = bio_list_pop(&list)))
> > 2220                         generic_make_request(bio);
> > 2221         }
> > 2222
> > 2223         map = dm_get_live_table(md, &srcu_idx);
> > 2224
> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > 2226                 spin_lock_irq(&md->deferred_lock);
> > 2227                 bio = bio_list_pop(&md->deferred);
> > 2228                 spin_unlock_irq(&md->deferred_lock);
> > 2229
> > 2230                 if (!bio)
> > 2231                         break;
> > 2232
> > 2233                 if (dm_request_based(md))
> > 2234                         generic_make_request(bio);
> > 2235                 else
> > 2236                         __split_and_process_bio(md, map, bio);
> > 2237         }
> > 2238
> > 2239         dm_put_live_table(md, srcu_idx);
> > 2240 }
> > 
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > will not process md->rescued list.
> 
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
> 
> The more detail you can share the better.

Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock

* the snapshot lock could be released only if some bios submitted by the 
snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by 
some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work is 
blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices 
> > must be offloaded to different helper threads, so that processing a bio 
> > for a lower device doesn't depend on processing a bio for a higher device. 
> > If you offload all the bios on current->bio_list to the same thread, the 
> > bios still depend on each other and the deadlock will still happen.
> 
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
> 
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then 
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on 
md->deferred - and that happens only in case of resume or if we are 
processing REQ_PREFLUSH with non-zero data size.

So, the simple test that I wrote doesn't trigger it, but a more complex 
test involving REQ_PREFLUSH could.

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
> 
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
> 
> Thanks,
> Mike

Mikulas


Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]

2017-11-21 Thread NeilBrown
On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  7:43am -0500,
>> Mike Snitzer  wrote:
>>  
>> > Decided it a better use of my time to review and then hopefully use the
>> > block-core's bio splitting infrastructure in DM.  Been meaning to do
>> > that for quite a while anyway.  This mail from you just made it all the
>> > more clear that needs doing:
>> > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
>> > 
>> > So I will start here on this patch you proposed:
>> > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
>> > (of note, this patch slipped through the cracks because I was recovering
>> > from injury when it originally came through).
>> > 
>> > Once DM is using q->bio_split I'll come back to this patch (aka
>> > "[1]") as a starting point for the follow-on work to remove DM's use of
>> > BIOSET_NEED_RESCUER:
>> > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
>> 
>> Hey Neil,
>> 
>> Good news!  All your code works ;)
>> 
>> (well after 1 fixup due to a cut-n-paste bug.. the code you added to
>> dm_wq_work() to process the md->rescued bio_list was operating on
>> the md->deferred bio_list due to cut-n-paste from code you copied from
>> just below it)
>> 
>> I split your code out some to make it more reviewable.  I also tweaked
>> headers accordingly.
>> 
>> Please see this branch (which _will_ get rebased between now and the
>> 4.16 merge window):
>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
>> 
>> I successfully tested these changes using Mikulas' test program that
>> reproduces the snapshot deadlock:
>> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>> 
>> I'll throw various other DM testsuites at it to verify they all look
>> good (e.g. thinp, cache, multipath).
>> 
>> I'm open to all suggestions about changes you'd like to see (either to
>> these patches or anything you'd like to layer ontop of them).
>> 
>> Thanks for all your work, much appreciated!
>> Mike
>
> This is not correct:

Thanks for your review!

>
> 2206 static void dm_wq_work(struct work_struct *work)
> 2207 {
> 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> 2209         struct bio *bio;
> 2210         int srcu_idx;
> 2211         struct dm_table *map;
> 2212
> 2213         if (!bio_list_empty(&md->rescued)) {
> 2214                 struct bio_list list;
> 2215                 spin_lock_irq(&md->deferred_lock);
> 2216                 list = md->rescued;
> 2217                 bio_list_init(&md->rescued);
> 2218                 spin_unlock_irq(&md->deferred_lock);
> 2219                 while ((bio = bio_list_pop(&list)))
> 2220                         generic_make_request(bio);
> 2221         }
> 2222
> 2223         map = dm_get_live_table(md, &srcu_idx);
> 2224
> 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> 2226                 spin_lock_irq(&md->deferred_lock);
> 2227                 bio = bio_list_pop(&md->deferred);
> 2228                 spin_unlock_irq(&md->deferred_lock);
> 2229
> 2230                 if (!bio)
> 2231                         break;
> 2232
> 2233                 if (dm_request_based(md))
> 2234                         generic_make_request(bio);
> 2235                 else
> 2236                         __split_and_process_bio(md, map, bio);
> 2237         }
> 2238
> 2239         dm_put_live_table(md, srcu_idx);
> 2240 }
>
> You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> will not process md->rescued list.

Correct, but md->rescued will be empty, or irrelevant.
The first section of dm_wq_work ensures ->rescued is empty.
When __split_and_process_bio() calls generic_make_request() (indirectly
through one or more targets) they will not be recursive calls,
so nothing will be added to current->bio_list[0] and nothing
will be moved to md->rescued.  Each generic_make_request() will
completely submit the request to the lower-level device.

Some other thread could call generic_make_request on this dm device and
result in bios appearing on md->rescued.  These bios could only be a
problem if something that __split_and_process_bio calls might wait for
them.  I don't think that happens (at least I don't think it should...).


>
> The processing of md->rescued is also wrong - bios for different devices 
> must be offloaded to different helper threads, so that processing a bio 
> for a lower device doesn't depend on processing a bio for a higher device. 
> If you offload all the bios on current->bio_list to the same thread, the 
> bios still depend on each other and the deadlock will still happen.

bios on current->bio_list[0] are not allowed to depend on each other
except that later bios can depend on earlier bios.  They are all for a
lower-level 

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at  4:23pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  7:43am -0500,
> > Mike Snitzer  wrote:
> >  
> > > Decided it a better use of my time to review and then hopefully use the
> > > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > > that for quite a while anyway.  This mail from you just made it all the
> > > more clear that needs doing:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > > 
> > > So I will start here on this patch you proposed:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > > (of note, this patch slipped through the cracks because I was recovering
> > > from injury when it originally came through).
> > > 
> > > Once DM is using q->bio_split I'll come back to this patch (aka
> > > "[1]") as a starting point for the follow-on work to remove DM's use of
> > > BIOSET_NEED_RESCUER:
> > > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> > 
> > Hey Neil,
> > 
> > Good news!  All your code works ;)
> > 
> > (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> > dm_wq_work() to process the md->rescued bio_list was operating on
> > the md->deferred bio_list due to cut-n-paste from code you copied from
> > just below it)
> > 
> > I split your code out some to make it more reviewable.  I also tweaked
> > headers accordingly.
> > 
> > Please see this branch (which _will_ get rebased between now and the
> > 4.16 merge window):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> > 
> > I successfully tested these changes using Mikulas' test program that
> > reproduces the snapshot deadlock:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > I'll throw various other DM testsuites at it to verify they all look
> > good (e.g. thinp, cache, multipath).
> > 
> > I'm open to all suggestions about changes you'd like to see (either to
> > these patches or anything you'd like to layer ontop of them).
> > 
> > Thanks for all your work, much appreciated!
> > Mike
> 
> This is not correct:
> 
> 2206 static void dm_wq_work(struct work_struct *work)
> 2207 {
> 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> 2209         struct bio *bio;
> 2210         int srcu_idx;
> 2211         struct dm_table *map;
> 2212
> 2213         if (!bio_list_empty(&md->rescued)) {
> 2214                 struct bio_list list;
> 2215                 spin_lock_irq(&md->deferred_lock);
> 2216                 list = md->rescued;
> 2217                 bio_list_init(&md->rescued);
> 2218                 spin_unlock_irq(&md->deferred_lock);
> 2219                 while ((bio = bio_list_pop(&list)))
> 2220                         generic_make_request(bio);
> 2221         }
> 2222
> 2223         map = dm_get_live_table(md, &srcu_idx);
> 2224
> 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> 2226                 spin_lock_irq(&md->deferred_lock);
> 2227                 bio = bio_list_pop(&md->deferred);
> 2228                 spin_unlock_irq(&md->deferred_lock);
> 2229
> 2230                 if (!bio)
> 2231                         break;
> 2232
> 2233                 if (dm_request_based(md))
> 2234                         generic_make_request(bio);
> 2235                 else
> 2236                         __split_and_process_bio(md, map, bio);
> 2237         }
> 2238
> 2239         dm_put_live_table(md, srcu_idx);
> 2240 }
> 
> You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> will not process md->rescued list.

Can you elaborate further?  We cannot be "in dm_wq_work in
__split_and_process_bio" simultaneously.  Do you mean as a side-effect
of scheduling away from __split_and_process_bio?

The more detail you can share the better.

> The processing of md->rescued is also wrong - bios for different devices 
> must be offloaded to different helper threads, so that processing a bio 
> for a lower device doesn't depend on processing a bio for a higher device. 
> If you offload all the bios on current->bio_list to the same thread, the 
> bios still depend on each other and the deadlock will still happen.

Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
allocations") speaks to this with:

"Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
contains bios that were scheduled *before* the current one started, so
they must have been submitted from higher up the stack, and we cannot be
waiting for them here (thanks to the "dm: ensure bio submission follows
a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
list as there is nothing to be gained by being more selective."
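
For readers following along, here is a rough sketch of that offload step
(the helper name and its exact placement are assumptions for illustration;
see the dm-4.16 branch above for the real code):

/* Hypothetical sketch, dm.c-style: park this submission's recursion list
 * on md->rescued and let the md->wq worker resubmit it. */
static void dm_offload_current_bio_list(struct mapped_device *md)
{
        struct bio_list rescued;
        unsigned long flags;

        if (!current->bio_list || bio_list_empty(&current->bio_list[0]))
                return;

        /* Steal everything queued by the current make_request call ... */
        rescued = current->bio_list[0];
        bio_list_init(&current->bio_list[0]);

        /* ... and hand it to dm_wq_work(), which resubmits it from a
         * context that is not blocked inside a target's map function. */
        spin_lock_irqsave(&md->deferred_lock, flags);
        bio_list_merge(&md->rescued, &rescued);
        spin_unlock_irqrestore(&md->deferred_lock, flags);

        queue_work(md->wq, &md->work);
}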

And again: this patchset passes your 

Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  7:43am -0500,
> Mike Snitzer  wrote:
>  
> > Decided it a better use of my time to review and then hopefully use the
> > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > that for quite a while anyway.  This mail from you just made it all the
> > more clear that needs doing:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > 
> > So I will start here on this patch you proposed:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > (of note, this patch slipped through the cracks because I was recovering
> > from injury when it originally came through).
> > 
> > Once DM is using q->bio_split I'll come back to this patch (aka
> > "[1]") as a starting point for the follow-on work to remove DM's use of
> > BIOSET_NEED_RESCUER:
> > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> 
> Hey Neil,
> 
> Good news!  All your code works ;)
> 
> (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> dm_wq_work() to process the md->rescued bio_list was operating on
> the md->deferred bio_list due to cut-n-paste from code you copied from
> just below it)
> 
> I split your code out some to make it more reviewable.  I also tweaked
> headers accordingly.
> 
> Please see this branch (which _will_ get rebased between now and the
> 4.16 merge window):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> 
> I successfully tested these changes using Mikulas' test program that
> reproduces the snapshot deadlock:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> I'll throw various other DM testsuites at it to verify they all look
> good (e.g. thinp, cache, multipath).
> 
> I'm open to all suggestions about changes you'd like to see (either to
> these patches or anything you'd like to layer ontop of them).
> 
> Thanks for all your work, much appreciated!
> Mike

This is not correct:

   2206 static void dm_wq_work(struct work_struct *work)
   2207 {
   2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
   2209         struct bio *bio;
   2210         int srcu_idx;
   2211         struct dm_table *map;
   2212
   2213         if (!bio_list_empty(&md->rescued)) {
   2214                 struct bio_list list;
   2215                 spin_lock_irq(&md->deferred_lock);
   2216                 list = md->rescued;
   2217                 bio_list_init(&md->rescued);
   2218                 spin_unlock_irq(&md->deferred_lock);
   2219                 while ((bio = bio_list_pop(&list)))
   2220                         generic_make_request(bio);
   2221         }
   2222
   2223         map = dm_get_live_table(md, &srcu_idx);
   2224
   2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
   2226                 spin_lock_irq(&md->deferred_lock);
   2227                 bio = bio_list_pop(&md->deferred);
   2228                 spin_unlock_irq(&md->deferred_lock);
   2229
   2230                 if (!bio)
   2231                         break;
   2232
   2233                 if (dm_request_based(md))
   2234                         generic_make_request(bio);
   2235                 else
   2236                         __split_and_process_bio(md, map, bio);
   2237         }
   2238
   2239         dm_put_live_table(md, srcu_idx);
   2240 }

You can see that if we are in dm_wq_work in __split_and_process_bio, we 
will not process md->rescued list.

The processing of md->rescued is also wrong - bios for different devices 
must be offloaded to different helper threads, so that processing a bio 
for a lower device doesn't depend on processing a bio for a higher device. 
If you offload all the bios on current->bio_list to the same thread, the 
bios still depend on each other and the deadlock will still happen.

Mikulas


new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at  7:43am -0500,
Mike Snitzer  wrote:
 
> Decided it a better use of my time to review and then hopefully use the
> block-core's bio splitting infrastructure in DM.  Been meaning to do
> that for quite a while anyway.  This mail from you just made it all the
> more clear that needs doing:
> https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> 
> So I will start here on this patch you proposed:
> https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> (of note, this patch slipped through the cracks because I was recovering
> from injury when it originally came through).
> 
> Once DM is using q->bio_split I'll come back to this patch (aka
> "[1]") as a starting point for the follow-on work to remove DM's use of
> BIOSET_NEED_RESCUER:
> https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html

Hey Neil,

Good news!  All your code works ;)

(well after 1 fixup due to a cut-n-paste bug.. the code you added to
dm_wq_work() to process the md->rescued bio_list was operating on
the md->deferred bio_list due to cut-n-paste from code you copied from
just below it)

I split your code out some to make it more reviewable.  I also tweaked
headers accordingly.

Please see this branch (which _will_ get rebased between now and the
4.16 merge window):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

I successfully tested these changes using Mikulas' test program that
reproduces the snapshot deadlock:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

I'll throw various other DM testsuites at it to verify they all look
good (e.g. thinp, cache, multipath).

I'm open to all suggestions about changes you'd like to see (either to
these patches or anything you'd like to layer ontop of them).

Thanks for all your work, much appreciated!
Mike