Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Tue, Nov 21 2017 at 11:28pm -0500,
Mike Snitzer wrote:

> > I'll work through your patches tomorrow.
>
> Please see the top 3 patches on this branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

This rebased dm-4.16 branch seems to be working well so far.

Mike
Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Wed, Nov 22 2017 at 1:24pm -0500,
Mikulas Patocka wrote:

> Another problem is this:
>
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
>
> What if it blocks because the bioset is exhausted?
>
> The code basically builds a chain of bios of unlimited length (suppose for
> example a case when we are splitting on every sector boundary, so there
> will be one bio for every sector in the original bio), it could exhaust
> the bioset easily.
>
> It would be better to use the mechanism from md-raid that chains all the
> sub-bios to the same master bio and doesn't create long chains of bios:
>
>	if (max_sectors < bio_sectors(bio)) {
>		struct bio *split = bio_split(bio, max_sectors,
>					      gfp, conf->bio_split);
>		bio_chain(split, bio);
>		generic_make_request(bio);
>		bio = split;
>		r1_bio->master_bio = bio;
>		r1_bio->sectors = max_sectors;
>	}

I'd be happy to take an incremental patch that improves on this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=b46d6a08f1ae7bf53e4cde28e0ccdf91567d432e

But short of that I'll have to come back to this.

Thanks,
Mike
Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Tue, Nov 21 2017 at 11:00pm -0500,
NeilBrown wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
>
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at 4:23pm -0500,
> >> Mikulas Patocka wrote:
> >>
> >> > This is not correct:
> >> >
> >> > 2206 static void dm_wq_work(struct work_struct *work)
> >> > 2207 {
> >> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> > 2209         struct bio *bio;
> >> > 2210         int srcu_idx;
> >> > 2211         struct dm_table *map;
> >> > 2212
> >> > 2213         if (!bio_list_empty(&md->rescued)) {
> >> > 2214                 struct bio_list list;
> >> > 2215                 spin_lock_irq(&md->deferred_lock);
> >> > 2216                 list = md->rescued;
> >> > 2217                 bio_list_init(&md->rescued);
> >> > 2218                 spin_unlock_irq(&md->deferred_lock);
> >> > 2219                 while ((bio = bio_list_pop(&list)))
> >> > 2220                         generic_make_request(bio);
> >> > 2221         }
> >> >
> >> > 2223         map = dm_get_live_table(md, &srcu_idx);
> >> > 2224
> >> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> > 2226                 spin_lock_irq(&md->deferred_lock);
> >> > 2227                 bio = bio_list_pop(&md->deferred);
> >> > 2228                 spin_unlock_irq(&md->deferred_lock);
> >> > 2229
> >> > 2230                 if (!bio)
> >> > 2231                         break;
> >> > 2232
> >> > 2233                 if (dm_request_based(md))
> >> > 2234                         generic_make_request(bio);
> >> > 2235                 else
> >> > 2236                         __split_and_process_bio(md, map, bio);
> >> > 2237         }
> >> > 2238
> >> > 2239         dm_put_live_table(md, srcu_idx);
> >> > 2240 }
> >> >
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we
> >> > will not process md->rescued list.
> >>
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >>
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the
> >   snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by
> >   some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is
> >   blocked in snapshot_map (called from __split_and_process_bio)
>
> Yes, I think you are right.
>
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
>      dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
>      dm: revise 'rescue' strategy for bio-based bioset allocations
>
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
>
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except

Yes, FYI I've also verified that even with just the "depth-first" patch
(and dm_offload disabled) the snapshot deadlock is fixed.

> a/ If any target defines ->num_write_bios() to return > 1,
>    __clone_and_map_data_bio() will make multiple calls to alloc_tio()
>    and __map_bio(), which might need rescuing.
>    But no target defines num_write_bios, and none have since it was
>    removed from dm-cache 4.5 years ago.
>    Can we discard num_write_bios??

Yes.
> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>    to a value > 1, then __send_duplicate_bios() will also make multiple
>    calls to alloc_tio() and __map_bio().
>    Some do:
>      dm-cache-target: flush=2
>      dm-snap: flush=2
>      dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
>
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach.
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate
Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Tue, Nov 21 2017 at 8:21pm -0500,
Mikulas Patocka wrote:

>
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
> > On Tue, Nov 21 2017 at 4:23pm -0500,
> > Mikulas Patocka wrote:
> >
> > > This is not correct:
> > >
> > > 2206 static void dm_wq_work(struct work_struct *work)
> > > 2207 {
> > > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > > 2209         struct bio *bio;
> > > 2210         int srcu_idx;
> > > 2211         struct dm_table *map;
> > > 2212
> > > 2213         if (!bio_list_empty(&md->rescued)) {
> > > 2214                 struct bio_list list;
> > > 2215                 spin_lock_irq(&md->deferred_lock);
> > > 2216                 list = md->rescued;
> > > 2217                 bio_list_init(&md->rescued);
> > > 2218                 spin_unlock_irq(&md->deferred_lock);
> > > 2219                 while ((bio = bio_list_pop(&list)))
> > > 2220                         generic_make_request(bio);
> > > 2221         }
> > >
> > > 2223         map = dm_get_live_table(md, &srcu_idx);
> > > 2224
> > > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > > 2226                 spin_lock_irq(&md->deferred_lock);
> > > 2227                 bio = bio_list_pop(&md->deferred);
> > > 2228                 spin_unlock_irq(&md->deferred_lock);
> > > 2229
> > > 2230                 if (!bio)
> > > 2231                         break;
> > > 2232
> > > 2233                 if (dm_request_based(md))
> > > 2234                         generic_make_request(bio);
> > > 2235                 else
> > > 2236                         __split_and_process_bio(md, map, bio);
> > > 2237         }
> > > 2238
> > > 2239         dm_put_live_table(md, srcu_idx);
> > > 2240 }
> > >
> > > You can see that if we are in dm_wq_work in __split_and_process_bio, we
> > > will not process md->rescued list.
> >
> > Can you elaborate further?  We cannot be "in dm_wq_work in
> > __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> > of scheduling away from __split_and_process_bio?
> >
> > The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio

Right, I later realized this was the call chain you were referring to.
Not sure how I missed it the first time around.
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the
>   snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by
>   some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is
>   blocked in snapshot_map (called from __split_and_process_bio)

So you're saying the case that Neil doesn't think should happen:
https://lkml.org/lkml/2017/11/21/658
...can happen.

> > > The processing of md->rescued is also wrong - bios for different devices
> > > must be offloaded to different helper threads, so that processing a bio
> > > for a lower device doesn't depend on processing a bio for a higher
> > > device.
> > > If you offload all the bios on current->bio_list to the same thread, the
> > > bios still depend on each other and the deadlock will still happen.
> >
> > Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> > allocations") speaks to this with:
> >
> > "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> > contains bios that were scheduled *before* the current one started, so
> > they must have been submitted from higher up the stack, and we cannot be
> > waiting for them here (thanks to the "dm: ensure bio submission follows
> > a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> > list as there is nothing to be gained by being more selective."
>
> I think you are right - if we only offload current->bio_list[0], then
> mixing of dependent bios on the offloaded list won't happen.
>
> > And again: this patchset passes your dm-snapshot deadlock test.  Is
> > that test somehow lacking?
> With your patchset, the deadlock would happen only if bios are queued on
> md->deferred - and that happens only in case of resume or if we are
> processing REQ_PREFLUSH with non-zero data size.
>
> So, the simple test that I wrote doesn't trigger it, but a more complex
> test involving REQ_PREFLUSH could.

Makes sense.  But I need to think further about _why_ bios submitted to
the snapshot driver's underlying device would end up on md->rescued
(like you suggested above).  Again, Neil thinks it not possible.

Neil said: "they will not be recursive calls, so nothing will be added
to current->bio_list[0] and nothing will be moved to md->rescued.  Each
generic_make_request() will completely submit the request in
Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at 4:23pm -0500,
> Mikulas Patocka wrote:
>
> > This is not correct:
> >
> > 2206 static void dm_wq_work(struct work_struct *work)
> > 2207 {
> > 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > 2209         struct bio *bio;
> > 2210         int srcu_idx;
> > 2211         struct dm_table *map;
> > 2212
> > 2213         if (!bio_list_empty(&md->rescued)) {
> > 2214                 struct bio_list list;
> > 2215                 spin_lock_irq(&md->deferred_lock);
> > 2216                 list = md->rescued;
> > 2217                 bio_list_init(&md->rescued);
> > 2218                 spin_unlock_irq(&md->deferred_lock);
> > 2219                 while ((bio = bio_list_pop(&list)))
> > 2220                         generic_make_request(bio);
> > 2221         }
> >
> > 2223         map = dm_get_live_table(md, &srcu_idx);
> > 2224
> > 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > 2226                 spin_lock_irq(&md->deferred_lock);
> > 2227                 bio = bio_list_pop(&md->deferred);
> > 2228                 spin_unlock_irq(&md->deferred_lock);
> > 2229
> > 2230                 if (!bio)
> > 2231                         break;
> > 2232
> > 2233                 if (dm_request_based(md))
> > 2234                         generic_make_request(bio);
> > 2235                 else
> > 2236                         __split_and_process_bio(md, map, bio);
> > 2237         }
> > 2238
> > 2239         dm_put_live_table(md, srcu_idx);
> > 2240 }
> >
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we
> > will not process md->rescued list.
>
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
>
> The more detail you can share the better.
Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock

* the snapshot lock could be released only if some bios submitted by the
  snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by
  some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work is
  blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices
> > must be offloaded to different helper threads, so that processing a bio
> > for a lower device doesn't depend on processing a bio for a higher device.
> > If you offload all the bios on current->bio_list to the same thread, the
> > bios still depend on each other and the deadlock will still happen.
>
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
>
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on
md->deferred - and that happens only in case of resume or if we are
processing REQ_PREFLUSH with non-zero data size.
So, the simple test that I wrote doesn't trigger it, but a more complex
test involving REQ_PREFLUSH could.

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
>
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
>
> Thanks,
> Mike

Mikulas
Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
On Tue, Nov 21 2017 at 4:23pm -0500,
Mikulas Patocka wrote:

>
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
> > On Tue, Nov 21 2017 at 7:43am -0500,
> > Mike Snitzer wrote:
> >
> > > Decided it a better use of my time to review and then hopefully use the
> > > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > > that for quite a while anyway.  This mail from you just made it all the
> > > more clear that needs doing:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > >
> > > So I will start here on this patch you proposed:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > > (of note, this patch slipped through the cracks because I was recovering
> > > from injury when it originally came through).
> > >
> > > Once DM is using q->bio_split I'll come back to this patch (aka
> > > "[1]") as a starting point for the follow-on work to remove DM's use of
> > > BIOSET_NEED_RESCUER:
> > > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> >
> > Hey Neil,
> >
> > Good news!  All your code works ;)
> >
> > (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> > dm_wq_work() to process the md->rescued bio_list was operating on
> > the md->deferred bio_list due to cut-n-paste from code you copied from
> > just below it)
> >
> > I split your code out some to make it more reviewable.  I also tweaked
> > headers accordingly.
> >
> > Please see this branch (which _will_ get rebased between now and the
> > 4.16 merge window):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> >
> > I successfully tested these changes using Mikulas' test program that
> > reproduces the snapshot deadlock:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >
> > I'll throw various other DM testsuites at it to verify they all look
> > good (e.g. thinp, cache, multipath).
> >
> > I'm open to all suggestions about changes you'd like to see (either to
> > these patches or anything you'd like to layer ontop of them).
> >
> > Thanks for all your work, much appreciated!
> > Mike
>
> This is not correct:
>
> 2206 static void dm_wq_work(struct work_struct *work)
> 2207 {
> 2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> 2209         struct bio *bio;
> 2210         int srcu_idx;
> 2211         struct dm_table *map;
> 2212
> 2213         if (!bio_list_empty(&md->rescued)) {
> 2214                 struct bio_list list;
> 2215                 spin_lock_irq(&md->deferred_lock);
> 2216                 list = md->rescued;
> 2217                 bio_list_init(&md->rescued);
> 2218                 spin_unlock_irq(&md->deferred_lock);
> 2219                 while ((bio = bio_list_pop(&list)))
> 2220                         generic_make_request(bio);
> 2221         }
>
> 2223         map = dm_get_live_table(md, &srcu_idx);
> 2224
> 2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> 2226                 spin_lock_irq(&md->deferred_lock);
> 2227                 bio = bio_list_pop(&md->deferred);
> 2228                 spin_unlock_irq(&md->deferred_lock);
> 2229
> 2230                 if (!bio)
> 2231                         break;
> 2232
> 2233                 if (dm_request_based(md))
> 2234                         generic_make_request(bio);
> 2235                 else
> 2236                         __split_and_process_bio(md, map, bio);
> 2237         }
> 2238
> 2239         dm_put_live_table(md, srcu_idx);
> 2240 }
>
> You can see that if we are in dm_wq_work in __split_and_process_bio, we
> will not process md->rescued list.

Can you elaborate further?  We cannot be "in dm_wq_work in
__split_and_process_bio" simultaneously.  Do you mean as a side-effect
of scheduling away from __split_and_process_bio?

The more detail you can share the better.

> The processing of md->rescued is also wrong - bios for different devices
> must be offloaded to different helper threads, so that processing a bio
> for a lower device doesn't depend on processing a bio for a higher device.
> If you offload all the bios on current->bio_list to the same thread, the
> bios still depend on each other and the deadlock will still happen.
Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
allocations") speaks to this with:

"Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
contains bios that were scheduled *before* the current one started, so they
must have been submitted from higher up the stack, and we cannot be waiting
for them here (thanks to the "dm: ensure bio submission follows a
depth-first tree walk" commit).  Also, we now rescue *all* bios on the list
as there is nothing to be gained by being more selective."

And again: this patchset passes your dm-snapshot deadlock test.  Is that
test somehow