Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Tue, Nov 21 2017 at 11:28pm -0500,
Mike Snitzer  wrote:

> 
> I'll work through your patches tomorrow.

Please see the top 3 patches on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

This rebased dm-4.16 branch seems to be working well so far.

Mike


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Wed, Nov 22 2017 at  1:24pm -0500,
Mikulas Patocka  wrote:
 
> Another problem is this:
> 
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
> 
> What if it blocks because the bioset is exhausted?
> 
> The code basically builds a chain of bios of unlimited length (suppose for 
> example a case when we are splitting on every sector boundary, so there 
> will be one bio for every sector in the original bio), it could exhaust 
> the bioset easily.
> 
> It would be better to use mechanism from md-raid that chains all the 
> sub-bios to the same master bio and doesn't create long chains of bios:
> 
> if (max_sectors < bio_sectors(bio)) {
> struct bio *split = bio_split(bio, max_sectors,
>   gfp, conf->bio_split);
> bio_chain(split, bio);
> generic_make_request(bio);
> bio = split;
> r1_bio->master_bio = bio;
> r1_bio->sectors = max_sectors;
> }

I'd be happy to take an incremental patch that improves on this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=b46d6a08f1ae7bf53e4cde28e0ccdf91567d432e

But short of that I'll have to come back to this.

Thanks,
Mike


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at 11:00pm -0500,
NeilBrown  wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka  wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> > static void dm_wq_work(struct work_struct *work)
> >> > {
> >> >         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> >         struct bio *bio;
> >> >         int srcu_idx;
> >> >         struct dm_table *map;
> >> >
> >> >         if (!bio_list_empty(&md->rescued)) {
> >> >                 struct bio_list list;
> >> >                 spin_lock_irq(&md->deferred_lock);
> >> >                 list = md->rescued;
> >> >                 bio_list_init(&md->rescued);
> >> >                 spin_unlock_irq(&md->deferred_lock);
> >> >                 while ((bio = bio_list_pop(&list)))
> >> >                         generic_make_request(bio);
> >> >         }
> >> >
> >> >         map = dm_get_live_table(md, &srcu_idx);
> >> >
> >> >         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> >                 spin_lock_irq(&md->deferred_lock);
> >> >                 bio = bio_list_pop(&md->deferred);
> >> >                 spin_unlock_irq(&md->deferred_lock);
> >> >
> >> >                 if (!bio)
> >> >                         break;
> >> >
> >> >                 if (dm_request_based(md))
> >> >                         generic_make_request(bio);
> >> >                 else
> >> >                         __split_and_process_bio(md, map, bio);
> >> >         }
> >> >
> >> >         dm_put_live_table(md, srcu_idx);
> >> > }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
> dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except

Yes, FYI I've also verified that even with just the "depth-first" patch
(and dm_offload disabled) the snapshot deadlock is fixed.
 
> a/ If any target defines ->num_write_bios() to return >1,
>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>and __map_bio(), which might need rescuing.
>But no target defines num_write_bios, and none have since it was
>removed from dm-cache 4.5 years ago.
>Can we discard num_write_bios??

Yes.

> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>to a value > 1, then __send_duplicate_bios() will also make multiple
>calls to alloc_tio() and __map_bio().
>Some do.
>  dm-cache-target:  flush=2
>  dm-snap: flush=2
>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate 

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at  8:21pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  4:23pm -0500,
> > Mikulas Patocka  wrote:
> > 
> > > This is not correct:
> > > 
> > > static void dm_wq_work(struct work_struct *work)
> > > {
> > >         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > >         struct bio *bio;
> > >         int srcu_idx;
> > >         struct dm_table *map;
> > >
> > >         if (!bio_list_empty(&md->rescued)) {
> > >                 struct bio_list list;
> > >                 spin_lock_irq(&md->deferred_lock);
> > >                 list = md->rescued;
> > >                 bio_list_init(&md->rescued);
> > >                 spin_unlock_irq(&md->deferred_lock);
> > >                 while ((bio = bio_list_pop(&list)))
> > >                         generic_make_request(bio);
> > >         }
> > >
> > >         map = dm_get_live_table(md, &srcu_idx);
> > >
> > >         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > >                 spin_lock_irq(&md->deferred_lock);
> > >                 bio = bio_list_pop(&md->deferred);
> > >                 spin_unlock_irq(&md->deferred_lock);
> > >
> > >                 if (!bio)
> > >                         break;
> > >
> > >                 if (dm_request_based(md))
> > >                         generic_make_request(bio);
> > >                 else
> > >                         __split_and_process_bio(md, map, bio);
> > >         }
> > >
> > >         dm_put_live_table(md, srcu_idx);
> > > }
> > > 
> > > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > > will not process md->rescued list.
> > 
> > Can you elaborate further?  We cannot be "in dm_wq_work in
> > __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> > of scheduling away from __split_and_process_bio?
> > 
> > The more detail you can share the better.
> 
> Suppose this scenario:
> 
> * dm_wq_work calls __split_and_process_bio

Right, I later realized this was the call chain you were referring to.
Not sure how I missed it the first time around.

> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
> 
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

So you're saying the case that Neil doesn't think should happen:
https://lkml.org/lkml/2017/11/21/658
...can happen.

> > > The processing of md->rescued is also wrong - bios for different devices 
> > > must be offloaded to different helper threads, so that processing a bio 
> > > for a lower device doesn't depend on processing a bio for a higher 
> > > device. 
> > > If you offload all the bios on current->bio_list to the same thread, the 
> > > bios still depend on each other and the deadlock will still happen.
> > 
> > Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> > allocations") speaks to this with:
> > 
> > "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> > contains bios that were scheduled *before* the current one started, so
> > they must have been submitted from higher up the stack, and we cannot be
> > waiting for them here (thanks to the "dm: ensure bio submission follows
> > a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> > list as there is nothing to be gained by being more selective."
> 
> I think you are right - if we only offload current->bio_list[0], then 
> mixing of dependent bios on the offloaded list won't happen.
> 
> > And again: this patchset passes your dm-snapshot deadlock test.  Is
> > that test somehow lacking?
> 
> With your patchset, the deadlock would happen only if bios are queued on 
> md->deferred - and that happens only in case of resume or if we are 
> processing REQ_PREFLUSH with non-zero data size.
> 
> So, the simple test that I wrote doesn't trigger it, but a more complex 
> test involving REQ_PREFLUSH could.

Makes sense.  But I need to think further about _why_ bios submitted to
the snapshot driver's underlying device would end up on md->rescued
(like you suggested above).  Again, Neil thinks it not possible.  Neil
said:
"they will not be recursive calls, so nothing will be added to
current->bio_list[0] and nothing will be moved to md->rescued.  Each
generic_make_request() will completely submit the request in 

Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  4:23pm -0500,
> Mikulas Patocka  wrote:
> 
> > This is not correct:
> > 
> > static void dm_wq_work(struct work_struct *work)
> > {
> >         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >         struct bio *bio;
> >         int srcu_idx;
> >         struct dm_table *map;
> >
> >         if (!bio_list_empty(&md->rescued)) {
> >                 struct bio_list list;
> >                 spin_lock_irq(&md->deferred_lock);
> >                 list = md->rescued;
> >                 bio_list_init(&md->rescued);
> >                 spin_unlock_irq(&md->deferred_lock);
> >                 while ((bio = bio_list_pop(&list)))
> >                         generic_make_request(bio);
> >         }
> >
> >         map = dm_get_live_table(md, &srcu_idx);
> >
> >         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >                 spin_lock_irq(&md->deferred_lock);
> >                 bio = bio_list_pop(&md->deferred);
> >                 spin_unlock_irq(&md->deferred_lock);
> >
> >                 if (!bio)
> >                         break;
> >
> >                 if (dm_request_based(md))
> >                         generic_make_request(bio);
> >                 else
> >                         __split_and_process_bio(md, map, bio);
> >         }
> >
> >         dm_put_live_table(md, srcu_idx);
> > }
> > 
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > will not process md->rescued list.
> 
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
> 
> The more detail you can share the better.

Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock

* the snapshot lock could be released only if some bios submitted by the 
snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by 
some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work is 
blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices 
> > must be offloaded to different helper threads, so that processing a bio 
> > for a lower device doesn't depend on processing a bio for a higher device. 
> > If you offload all the bios on current->bio_list to the same thread, the 
> > bios still depend on each other and the deadlock will still happen.
> 
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
> 
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then 
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on 
md->deferred - and that happens only in case of resume or if we are 
processing REQ_PREFLUSH with non-zero data size.

So, the simple test that I wrote doesn't trigger it, but a more complex 
test involving REQ_PREFLUSH could.

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
> 
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
> 
> Thanks,
> Mike

Mikulas


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mike Snitzer
On Tue, Nov 21 2017 at  4:23pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  7:43am -0500,
> > Mike Snitzer  wrote:
> >  
> > > Decided it a better use of my time to review and then hopefully use the
> > > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > > that for quite a while anyway.  This mail from you just made it all the
> > > more clear that needs doing:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > > 
> > > So I will start here on this patch you proposed:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > > (of note, this patch slipped through the cracks because I was recovering
> > > from injury when it originally came through).
> > > 
> > > Once DM is using q->bio_split I'll come back to this patch (aka
> > > "[1]") as a starting point for the follow-on work to remove DM's use of
> > > BIOSET_NEED_RESCUER:
> > > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> > 
> > Hey Neil,
> > 
> > Good news!  All your code works ;)
> > 
> > (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> > dm_wq_work() to process the md->rescued bio_list was operating on
> > the md->deferred bio_list due to cut-n-paste from code you copied from
> > just below it)
> > 
> > I split your code out some to make it more reviewable.  I also tweaked
> > headers accordingly.
> > 
> > Please see this branch (which _will_ get rebased between now and the
> > 4.16 merge window):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> > 
> > I successfully tested these changes using Mikulas' test program that
> > reproduces the snapshot deadlock:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > I'll throw various other DM testsuites at it to verify they all look
> > good (e.g. thinp, cache, multipath).
> > 
> > I'm open to all suggestions about changes you'd like to see (either to
> > these patches or anything you'd like to layer ontop of them).
> > 
> > Thanks for all your work, much appreciated!
> > Mike
> 
> This is not correct:
> 
> static void dm_wq_work(struct work_struct *work)
> {
>         struct mapped_device *md = container_of(work, struct mapped_device, work);
>         struct bio *bio;
>         int srcu_idx;
>         struct dm_table *map;
> 
>         if (!bio_list_empty(&md->rescued)) {
>                 struct bio_list list;
>                 spin_lock_irq(&md->deferred_lock);
>                 list = md->rescued;
>                 bio_list_init(&md->rescued);
>                 spin_unlock_irq(&md->deferred_lock);
>                 while ((bio = bio_list_pop(&list)))
>                         generic_make_request(bio);
>         }
> 
>         map = dm_get_live_table(md, &srcu_idx);
> 
>         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>                 spin_lock_irq(&md->deferred_lock);
>                 bio = bio_list_pop(&md->deferred);
>                 spin_unlock_irq(&md->deferred_lock);
> 
>                 if (!bio)
>                         break;
> 
>                 if (dm_request_based(md))
>                         generic_make_request(bio);
>                 else
>                         __split_and_process_bio(md, map, bio);
>         }
> 
>         dm_put_live_table(md, srcu_idx);
> }
> 
> You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> will not process md->rescued list.

Can you elaborate further?  We cannot be "in dm_wq_work in
__split_and_process_bio" simultaneously.  Do you mean as a side-effect
of scheduling away from __split_and_process_bio?

The more detail you can share the better.

> The processing of md->rescued is also wrong - bios for different devices 
> must be offloaded to different helper threads, so that processing a bio 
> for a lower device doesn't depend on processing a bio for a higher device. 
> If you offload all the bios on current->bio_list to the same thread, the 
> bios still depend on each other and the deadlock will still happen.

Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
allocations") speaks to this with:

"Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
contains bios that were scheduled *before* the current one started, so
they must have been submitted from higher up the stack, and we cannot be
waiting for them here (thanks to the "dm: ensure bio submission follows
a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
list as there is nothing to be gained by being more selective."

And again: this patchset passes your dm-snapshot deadlock test.  Is
that test somehow