[PATCH] block: Add nr_bios to block_rq_remap tracepoint
Adding the number of bios in a remapped request to the 'block_rq_remap'
tracepoint. The request remapper clones bios in a request to track the
completion status of each bio, so the number of bios can be useful
information for investigation.

Related discussions:
  http://www.redhat.com/archives/dm-devel/2013-August/msg00084.html
  http://www.redhat.com/archives/dm-devel/2013-September/msg00024.html

Signed-off-by: Jun'ichi Nomura
Acked-by: Mike Snitzer
Cc: Jens Axboe

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2fdb4a4..0e6f765 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,17 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq)
 	return blk_queue_get_max_sectors(q, rq->cmd_flags);
 }
 
+static inline unsigned int blk_rq_count_bios(struct request *rq)
+{
+	unsigned int nr_bios = 0;
+	struct bio *bio;
+
+	__rq_for_each_bio(bio, rq)
+		nr_bios++;
+
+	return nr_bios;
+}
+
 /*
  * Request issue related functions.
  */
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 60ae7c3..4c2301d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -618,6 +618,7 @@ TRACE_EVENT(block_rq_remap,
 		__field( unsigned int,	nr_sector	)
 		__field( dev_t,		old_dev		)
 		__field( sector_t,	old_sector	)
+		__field( unsigned int,	nr_bios		)
 		__array( char,		rwbs,	RWBS_LEN)
 	),
 
@@ -627,15 +628,16 @@ TRACE_EVENT(block_rq_remap,
 		__entry->nr_sector	= blk_rq_sectors(rq);
 		__entry->old_dev	= dev;
 		__entry->old_sector	= from;
+		__entry->nr_bios	= blk_rq_count_bios(rq);
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags, blk_rq_bytes(rq));
 	),
 
-	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu",
+	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
 		  (unsigned long long)__entry->sector, __entry->nr_sector,
 		  MAJOR(__entry->old_dev), MINOR(__entry->old_dev),
-		  (unsigned long long)__entry->old_sector)
+		  (unsigned long long)__entry->old_sector, __entry->nr_bios)
 );

 #endif /* _TRACE_BLOCK_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: allow error target to replace either bio-based or request-based targets
Hello Mike,

On 08/23/13 09:17, Mike Snitzer wrote:
>> I do like the idea of a single error target that is hybrid (supports
>> both bio-based and request-based) but the DM core would need to be
>> updated to support this.
>>
>> Specifically, we'd need to check if the device (and active table) is
>> already bio-based or request-based and select the appropriate type. If
>> it is a new device, default to selecting bio-based.
>>
>> There are some wrappers and other logic throughout DM core that will
>> need auditing too.
>
> Here is a patch that should work for your needs (I tested it to work
> with 'dmsetup wipe_table' on both request-based and bio-based devices):

How about moving the default handling in dm_table_set_type()
outside of the for-each-target loop, like the modified patch below?

For example, if a table has 2 targets, hybrid and request_based,
and live_md_type is DM_TYPE_NONE, the table should be considered
request_based, not inconsistent.
Though the end result is the same, as such a table is rejected by
another constraint anyway, I think it's good to keep the semantics
clean and the error messages consistent. I.e. for the above case, the
error message should be "Request-based dm doesn't support multiple
targets yet", not "Inconsistent table: different target types can't be
mixed up".
---
Jun'ichi Nomura, NEC Corporation

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f221812..6e683c8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -860,14 +860,16 @@ EXPORT_SYMBOL(dm_consume_args);
 static int dm_table_set_type(struct dm_table *t)
 {
 	unsigned i;
-	unsigned bio_based = 0, request_based = 0;
+	unsigned bio_based = 0, request_based = 0, hybrid = 0;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
 	struct list_head *devices;
 
 	for (i = 0; i < t->num_targets; i++) {
 		tgt = t->targets + i;
-		if (dm_target_request_based(tgt))
+		if (dm_target_hybrid(tgt))
+			hybrid = 1;
+		else if (dm_target_request_based(tgt))
 			request_based = 1;
 		else
 			bio_based = 1;
@@ -879,6 +881,25 @@ static int dm_table_set_type(struct dm_table *t)
 		}
 	}
 
+	if (hybrid && !bio_based && !request_based) {
+		/*
+		 * The targets can work either way.
+		 * Determine the type from the live device.
+		 */
+		unsigned live_md_type;
+
+		dm_lock_md_type(t->md);
+		live_md_type = dm_get_md_type(t->md);
+		dm_unlock_md_type(t->md);
+		switch (live_md_type) {
+		case DM_TYPE_REQUEST_BASED:
+			request_based = 1;
+			break;
+		default:
+			bio_based = 1;
+			break;
+		}
+	}
+
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 37ba5db..242e3ce 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -131,12 +131,19 @@ static int io_err_map(struct dm_target *tt, struct bio *bio)
 	return -EIO;
 }
 
+static int io_err_map_rq(struct dm_target *ti, struct request *clone,
+			 union map_info *map_context)
+{
+	return -EIO;
+}
+
 static struct target_type error_target = {
 	.name = "error",
-	.version = {1, 1, 0},
+	.version = {1, 2, 0},
 	.ctr  = io_err_ctr,
 	.dtr  = io_err_dtr,
 	.map  = io_err_map,
+	.map_rq = io_err_map_rq,
 };
 
 int __init dm_target_init(void)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 45b97da..8b4c075 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -89,10 +89,21 @@ int dm_setup_md_queue(struct mapped_device *md);
 #define dm_target_is_valid(t) ((t)->table)
 
 /*
+ * To check whether the target type is bio-based or not (request-based).
+ */
+#define dm_target_bio_based(t) ((t)->type->map != NULL)
+
+/*
  * To check whether the target type is request-based or not (bio-based).
  */
 #define dm_target_request_based(t) ((t)->type->map_rq != NULL)
 
+/*
+ * To check whether the target type is a hybrid (capable of being
+ * either request-based or bio-based).
+ */
+#define dm_target_hybrid(t) (dm_target_bio_based(t) && dm_target_request_based(t))
+
 /*-----------------------------------------------------------------
  * A registry of target types.
  *---------------------------------------------------------------*/
[PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Hello Jens,

please consider picking up this patch. Without this patch, the warning
below splats every time a multipath device is created. I got acks from
Vivek and Tejun when I posted this for v3.7 and this same patch is
still applicable to v3.8.

Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Acked-by: Vivek Goyal
Acked-by: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
[PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
With 749fefe677 in v3.7 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Acked-by: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
[PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>> 	struct radix_tree_preload *rtp;
>> 	struct radix_tree_node *node;
>> 	int ret = -ENOMEM;
>>
>> 	preempt_disable();
>>
>> Seems obvious why it explodes..
>
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it
> a try.

Thank you! It just works. This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/30/12 04:07, Andi Kleen wrote:
> Theodore Ts'o writes:
>> Note that the problem that we're dealing with is buffered writes; so
>> it's quite possible that the process which wrote the file, thus
>> dirtying the page cache, has already exited; so there's no way we can
>> guarantee we can inform the process which wrote the file via a signal
>> or an error code return.
>
> Is that any different from other IO errors? It doesn't need to
> be better.

IMO, it is different in that the next read from disk will likely succeed
(and read corrupted data).
For IO errors that come from disk failure, the next read will likely
fail again, so we don't have to remember it somewhere.

>> Also, if you're going to keep this state in memory, what happens if
>> the inode gets pushed out of memory?
>
> You lose the error, just like you do today with any other IO error.

--
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
On 10/27/12 05:21, Vivek Goyal wrote:
> On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
>> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>>
>> With 749fefe677 ("block: lift the initial queue bypass mode on
>> blk_register_queue() instead of blk_init_allocated_queue()"),
>> add_disk() eventually calls blk_queue_bypass_end().
>> This change invokes the following warning when multipath is used.
>>
>> BUG: scheduling while atomic: multipath/2460/0x0002
>> 1 lock held by multipath/2460:
>>  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
>> Modules linked in: ...
>> Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
>> Call Trace:
>>  [] __schedule_bug+0x6a/0x78
>>  [] __schedule+0xb4/0x5e0
>>  [] schedule+0x64/0x66
>>  [] schedule_timeout+0x39/0xf8
>>  [] ? put_lock_stats+0xe/0x29
>>  [] ? lock_release_holdtime+0xb6/0xbb
>>  [] wait_for_common+0x9d/0xee
>>  [] ? try_to_wake_up+0x206/0x206
>>  [] ? kfree_call_rcu+0x1c/0x1c
>>  [] wait_for_completion+0x1d/0x1f
>>  [] wait_rcu_gp+0x5d/0x7a
>>  [] ? wait_rcu_gp+0x7a/0x7a
>>  [] ? complete+0x21/0x53
>>  [] synchronize_rcu+0x1e/0x20
>>  [] blk_queue_bypass_start+0x5d/0x62
>>  [] blkcg_activate_policy+0x73/0x270
>>  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
>>  [] cfq_init_queue+0x80/0x28e
>>  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>>  [] elevator_init+0xe1/0x115
>>  [] ? blk_queue_make_request+0x54/0x59
>>  [] blk_init_allocated_queue+0x8c/0x9e
>>  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
>>  [] table_load+0x1bd/0x2c8 [dm_mod]
>>  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
>>  [] ? table_clear+0xaa/0xaa [dm_mod]
>>  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>>  [] do_vfs_ioctl+0x3fb/0x441
>>  [] ? file_has_perm+0x8a/0x99
>>  [] sys_ioctl+0x5e/0x82
>>  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>  [] system_call_fastpath+0x16/0x1b
>>
>> The warning means during queue initialization blk_queue_bypass_start()
>> calls a sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> md->type_lock is a mutex, isn't it? I thought we are allowed to block
> and schedule out under mutex?

Hm, you are right. It's a mutex.
The warning occurs only if I turned on CONFIG_PREEMPT=y.

> add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> So we already have code which can block/wait under md->type_lock. I am
> not sure why should we get this warning under a mutex.

add_disk() is called without md->type_lock.
Call flow is like this:

  dm_create
    alloc_dev
      blk_alloc_queue
      alloc_disk
      add_disk
        blk_queue_bypass_end      [with 3.7-rc2]

  table_load
    dm_lock_md_type               [takes md->type_lock]
    dm_setup_md_queue
      blk_init_allocated_queue    [when DM_TYPE_REQUEST_BASED]
        elevator_init
          blkcg_activate_policy
            blk_queue_bypass_start  <-- THIS triggers the warning
            blk_queue_bypass_end
        blk_queue_bypass_end      [with 3.6]
    dm_unlock_md_type

blk_queue_bypass_start() in blkcg_activate_policy was a nested call,
which did nothing, with 3.6. With 3.7-rc2, it becomes the initial call
and does the actual draining work.

--
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/29/12 19:37, Andi Kleen wrote:
> Theodore Ts'o writes:
>> On Mon, Oct 29, 2012 at 12:16:32PM +1100, Dave Chinner wrote:
>>> Except that there are filesystems that cannot implement such flags,
>>> or require on-disk format changes to add more of those flags. This
>>> is most definitely not a filesystem specific behaviour, so any sort
>>> of VFS level per-file state needs to be kept in xattrs, not special
>>> flags. Filesystems are welcome to optimise the storage of such
>>> special xattrs (e.g. down to a single boolean flag in an inode), but
>>> using a flag for something that could, in fact, store the exact
>>> offset and length of the corruption is far better than just storing
>>> a "something is corrupted in this file" bit
>>
>> Agreed, if we're going to add an xattr, then we might as well store
>
> I don't think an xattr makes sense for this. It's sufficient to keep
> this state in memory.
>
> In general these error paths are hard to test and it's important
> to keep them as simple as possible. Doing IO and other complexities
> just doesn't make sense. Just have the simplest possible path
> that can do the job.

And since it's difficult to prove, I think it's nice to have an option
to panic if the memory error was on dirty page cache.

It's theoretically the same as a disk I/O error; the dirty cache is
marked invalid and the next read will go to disk.
Though in practice, the next read will likely fail if the disk was broken.
(Given that transient errors are usually recovered by retries and
fail-overs in the storage stack and are not visible to applications,
which don't care.)
So it's "consistent" in some sense.

OTOH, in the case of a memory error, the next read will likely succeed,
reading old data from disk. I'm afraid the read-after-write inconsistency
could cause silent data corruption.
-- 
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/30/12 04:07, Andi Kleen wrote:
> Theodore Ts'o <ty...@mit.edu> writes:
>> Note that the problem that we're dealing with is buffered writes; so
>> it's quite possible that the process which wrote the file, thus
>> dirtying the page cache, has already exited; so there's no way we can
>> guarantee we can inform the process which wrote the file via a signal
>> or an error code return.
>
> Is that any different from other IO errors? It doesn't need to be better.

IMO, it is different in that the next read from disk will likely succeed
(and read corrupted data). For IO errors that come from disk failure, the
next read will likely fail again, so we don't have to remember it somewhere.

>> Also, if you're going to keep this state in memory, what happens if the
>> inode gets pushed out of memory?
>
> You lose the error, just like you do today with any other IO error.

-- 
Jun'ichi Nomura, NEC Corporation
[PATCH] blkcg: fix scheduling while atomic in blk_queue_bypass_start
On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>> 	struct radix_tree_preload *rtp;
>> 	struct radix_tree_node *node;
>> 	int ret = -ENOMEM;
>>
>> 	preempt_disable();
>>
>> Seems obvious why it explodes..
>
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it a try.

Thank you! It just works. This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

BUG: scheduling while atomic: multipath/2460/0x0002
1 lock held by multipath/2460:
 #0:  (&md->type_lock){..}, at: [a019fb05] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
 [810723ae] __schedule_bug+0x6a/0x78
 [81428ba2] __schedule+0xb4/0x5e0
 [814291e6] schedule+0x64/0x66
 [8142773a] schedule_timeout+0x39/0xf8
 [8108ad5f] ? put_lock_stats+0xe/0x29
 [8108ae30] ? lock_release_holdtime+0xb6/0xbb
 [814289e3] wait_for_common+0x9d/0xee
 [8107526c] ? try_to_wake_up+0x206/0x206
 [810c0eb8] ? kfree_call_rcu+0x1c/0x1c
 [81428aec] wait_for_completion+0x1d/0x1f
 [810611f9] wait_rcu_gp+0x5d/0x7a
 [81061216] ? wait_rcu_gp+0x7a/0x7a
 [8106fb18] ? complete+0x21/0x53
 [810c0556] synchronize_rcu+0x1e/0x20
 [811dd903] blk_queue_bypass_start+0x5d/0x62
 [811ee109] blkcg_activate_policy+0x73/0x270
 [81130521] ? kmem_cache_alloc_node_trace+0xc7/0x108
 [811f04b3] cfq_init_queue+0x80/0x28e
 [a01a1600] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
 [811d8c41] elevator_init+0xe1/0x115
 [811e229f] ? blk_queue_make_request+0x54/0x59
 [811dd743] blk_init_allocated_queue+0x8c/0x9e
 [a019ffcd] dm_setup_md_queue+0x36/0xaa [dm_mod]
 [a01a60e6] table_load+0x1bd/0x2c8 [dm_mod]
 [a01a7026] ctl_ioctl+0x1d6/0x236 [dm_mod]
 [a01a5f29] ? table_clear+0xaa/0xaa [dm_mod]
 [a01a7099] dm_ctl_ioctl+0x13/0x17 [dm_mod]
 [811479fc] do_vfs_ioctl+0x3fb/0x441
 [811b643c] ? file_has_perm+0x8a/0x99
 [81147aa0] sys_ioctl+0x5e/0x82
 [812010be] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [814310d9] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <j-nom...@ce.jp.nec.com>
Cc: Vivek Goyal <vgo...@redhat.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Alasdair G Kergon <a...@redhat.com>
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;

-	preloaded = !radix_tree_preload(GFP_KERNEL);
-	blk_queue_bypass_start(q);
+	blk_queue_bypass_start(q);
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
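As a sanity check of the reordering above, here is a small model of the "no sleeping while preemption is disabled" rule. All names are hypothetical (this is not kernel code): radix_tree_preload() bumps the preempt count, and blk_queue_bypass_start() may sleep in synchronize_rcu(), so the sleeping call has to come first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the preempt counter; not the kernel's. */
static int preempt_count_model;

static void preempt_disable_model(void) { preempt_count_model++; }  /* radix_tree_preload() */
static void preempt_enable_model(void)  { preempt_count_model--; }  /* radix_tree_preload_end() */

/* A sleeping call is legal only while preemption is enabled. */
static bool may_sleep(void)
{
	return preempt_count_model == 0;
}

/* Buggy order: preload first, then the (sleeping) bypass start. */
static bool preload_then_bypass(void)
{
	preempt_disable_model();
	bool ok = may_sleep();		/* blk_queue_bypass_start() -> synchronize_rcu() */
	preempt_enable_model();
	return ok;
}

/* Fixed order, as in the patch: sleep before disabling preemption. */
static bool bypass_then_preload(void)
{
	bool ok = may_sleep();		/* the sleep happens here, preemption still on */
	preempt_disable_model();
	preempt_enable_model();
	return ok;
}
```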
Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
On 10/25/12 18:41, Jun'ichi Nomura wrote:
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
...
> The warning means during queue initialization blk_queue_bypass_start()
> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> dm device initialization basically includes the following 3 steps:
>  1. create ioctl, allocates queue and call add_disk()
>  2. table load ioctl, determines device type and initialize queue
>     if request-based
>  3. resume ioctl, device becomes functional
>
> So it is better to have dm's queue stay in bypass mode until
> the initialization completes in table load ioctl.
>
> The effect of additional blk_queue_bypass_start():
>
> 3.7-rc2 (plain)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>   dmsetup remove a; done
>
> real	0m15.434s
> user	0m0.423s
> sys	0m7.052s
>
> 3.7-rc2 (with this patch)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>   dmsetup remove a; done
>
> real	0m19.766s
> user	0m0.442s
> sys	0m6.861s
>
> If this additional cost is not negligible, we need a variant of add_disk()
> that does not end bypassing.

Or call blk_queue_bypass_start() before add_disk():

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ad02761..d14639b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1868,9 +1868,9 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->queue = md->queue;
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
-	add_disk(md->disk);
 	/* Until md type is determined, put the queue in bypass mode */
 	blk_queue_bypass_start(md->queue);
+	add_disk(md->disk);
 	format_dev_t(md->name, MKDEV(_major, minor));

 	md->wq = alloc_workqueue("kdmflush",
---

If the patch is modified like above, we could fix the issue without
incurring additional cost on dm device creation.

# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m15.684s
user	0m0.404s
sys	0m7.181s

-- 
Jun'ichi Nomura, NEC Corporation
[PATCH 1/2] block: move blk_queue_bypass_{start,end} to include/linux/blkdev.h
[PATCH] block: move blk_queue_bypass_{start,end} to include/linux/blkdev.h

dm wants to use those functions to control the bypass status of a
half-initialized device.

This patch is a preparation for:
  [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk.h            |    2 --
 include/linux/blkdev.h |    2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index ca51543..48bac5b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -26,8 +26,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 		struct bio *bio);
 int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		struct bio *bio);
-void blk_queue_bypass_start(struct request_queue *q);
-void blk_queue_bypass_end(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1756001..86ba153 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -949,6 +949,8 @@ bool __must_check blk_get_queue(struct request_queue *);
 struct request_queue *blk_alloc_queue(gfp_t);
 struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
+void blk_queue_bypass_start(struct request_queue *q);
+void blk_queue_bypass_end(struct request_queue *q);

 /*
  * blk_plug permits building a queue of related requests by holding the I/O
[PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
[PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
add_disk() eventually calls blk_queue_bypass_end().
This change invokes the following warning when multipath is used.

BUG: scheduling while atomic: multipath/2460/0x0002
1 lock held by multipath/2460:
 #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
 [] __schedule_bug+0x6a/0x78
 [] __schedule+0xb4/0x5e0
 [] schedule+0x64/0x66
 [] schedule_timeout+0x39/0xf8
 [] ? put_lock_stats+0xe/0x29
 [] ? lock_release_holdtime+0xb6/0xbb
 [] wait_for_common+0x9d/0xee
 [] ? try_to_wake_up+0x206/0x206
 [] ? kfree_call_rcu+0x1c/0x1c
 [] wait_for_completion+0x1d/0x1f
 [] wait_rcu_gp+0x5d/0x7a
 [] ? wait_rcu_gp+0x7a/0x7a
 [] ? complete+0x21/0x53
 [] synchronize_rcu+0x1e/0x20
 [] blk_queue_bypass_start+0x5d/0x62
 [] blkcg_activate_policy+0x73/0x270
 [] ? kmem_cache_alloc_node_trace+0xc7/0x108
 [] cfq_init_queue+0x80/0x28e
 [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
 [] elevator_init+0xe1/0x115
 [] ? blk_queue_make_request+0x54/0x59
 [] blk_init_allocated_queue+0x8c/0x9e
 [] dm_setup_md_queue+0x36/0xaa [dm_mod]
 [] table_load+0x1bd/0x2c8 [dm_mod]
 [] ctl_ioctl+0x1d6/0x236 [dm_mod]
 [] ? table_clear+0xaa/0xaa [dm_mod]
 [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
 [] do_vfs_ioctl+0x3fb/0x441
 [] ? file_has_perm+0x8a/0x99
 [] sys_ioctl+0x5e/0x82
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x16/0x1b

The warning means during queue initialization blk_queue_bypass_start()
calls sleeping function (synchronize_rcu) while dm holds md->type_lock.

dm device initialization basically includes the following 3 steps:
 1. create ioctl, allocates queue and call add_disk()
 2. table load ioctl, determines device type and initialize queue
    if request-based
 3. resume ioctl, device becomes functional

So it is better to have dm's queue stay in bypass mode until
the initialization completes in table load ioctl.

The effect of additional blk_queue_bypass_start():

3.7-rc2 (plain)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m15.434s
user	0m0.423s
sys	0m7.052s

3.7-rc2 (with this patch)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m19.766s
user	0m0.442s
sys	0m6.861s

If this additional cost is not negligible, we need a variant of add_disk()
that does not end bypassing.

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 drivers/md/dm.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 02db918..ad02761 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1869,6 +1869,8 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
 	add_disk(md->disk);
+	/* Until md type is determined, put the queue in bypass mode */
+	blk_queue_bypass_start(md->queue);
 	format_dev_t(md->name, MKDEV(_major, minor));

 	md->wq = alloc_workqueue("kdmflush",
@@ -2172,6 +2174,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 		return 1;

 	/* Fully initialize the queue */
+	WARN_ON(!blk_queue_bypass(md->queue));
 	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
 	if (!q)
 		return 0;
@@ -2198,6 +2201,7 @@ int dm_setup_md_queue(struct mapped_device *md)
 		return -EINVAL;
 	}

+	blk_queue_bypass_end(md->queue);
 	return 0;
 }
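The lifecycle this patch enforces can be sketched with a toy counter (all names hypothetical, not the kernel API): the extra blk_queue_bypass_start() taken in alloc_dev() keeps the queue bypassed after add_disk() ends the initial bypass, until dm_setup_md_queue() lifts it at table-load time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the queue's bypass reference count (made-up names). */
struct q_model { int bypass_depth; };

static void bypass_start(struct q_model *q) { q->bypass_depth++; }
static void bypass_end(struct q_model *q)   { assert(q->bypass_depth > 0); q->bypass_depth--; }
static bool queue_bypassed(const struct q_model *q) { return q->bypass_depth > 0; }

/* Step 1, create ioctl: the queue is allocated in bypass; the patched
 * alloc_dev() takes an extra bypass reference around add_disk(). */
static void alloc_dev_model(struct q_model *q)
{
	q->bypass_depth = 1;	/* blk_alloc_queue(): initial bypass */
	bypass_start(q);	/* patch: hold bypass until md type is known */
	bypass_end(q);		/* add_disk() ends the initial bypass */
}

/* Step 2, table load ioctl: queue init runs while still bypassed,
 * then dm_setup_md_queue() lifts the bypass. */
static void dm_setup_md_queue_model(struct q_model *q)
{
	assert(queue_bypassed(q));	/* the WARN_ON in the patch */
	bypass_end(q);
}
```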
[PATCH] blkcg: stop iteration early if root_rl is the only request list
__blk_queue_next_rl() finds the next request list based on blkg_list
while skipping root_blkg in the list.

OTOH, root_rl is special as it may exist even without root_blkg.
Though the later part of the function handles such a case correctly,
exiting early is good for readability of the code.

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
---
 block/blk-cgroup.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 54f35d1..a31e678 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -333,6 +333,9 @@ struct request_list *__blk_queue_next_rl(struct request_list *rl,
 	 */
 	if (rl == &q->root_rl) {
 		ent = &q->blkg_list;
+		/* There are no more block groups, hence no request lists */
+		if (list_empty(ent))
+			return NULL;
 	} else {
 		blkg = container_of(rl, struct blkcg_gq, rl);
 		ent = &blkg->q_node;
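The early exit mirrors how list_head works: a circular list is empty exactly when head->next points back at the head, so with no block groups there is nothing to walk. A minimal sketch with toy types (not the kernel's list API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy circular singly-linked list standing in for list_head. */
struct list_node { struct list_node *next; };

static int list_empty_model(const struct list_node *head)
{
	return head->next == head;	/* empty iff next points to itself */
}

/* Mimics the added check: with no entries on the list there is no
 * next request list, so bail out before walking. */
static struct list_node *first_entry_or_null(struct list_node *head)
{
	if (list_empty_model(head))
		return NULL;
	return head->next;
}
```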
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/19/12 23:53, Vivek Goyal wrote:
> On Thu, Oct 18, 2012 at 02:20:53PM -0700, Tejun Heo wrote:
>> Hey, Vivek.
>>
>> On Thu, Oct 18, 2012 at 09:31:49AM -0400, Vivek Goyal wrote:
>>> Tejun, for the sake of readability, are you fine with keeping the original
>>> check and original patch which I had acked.
>>
>> Can you please send another patch to change that? It really isn't a
>> related change and I don't wanna mix the two.
>
> Sure. Jun'ichi, would you like to send that cleanup line in a separate patch?

OK. I will send that patch.

-- 
Jun'ichi Nomura, NEC Corporation
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/17/12 22:47, Vivek Goyal wrote: > On Wed, Oct 17, 2012 at 09:02:22AM +0900, Jun'ichi Nomura wrote: >> On 10/17/12 08:20, Tejun Heo wrote: >>>>>> -if (ent == &q->root_blkg->q_node) >>>>>> +if (q->root_blkg && ent == &q->root_blkg->q_node) >>>>> >>>>> Can we fix it little differently. Little earlier in the code, we check for >>>>> if q->blkg_list is empty, then all the groups are gone, and there are >>>>> no more request lists hence and return NULL. >>>>> >>>>> Current code: >>>>> if (rl == &q->root_rl) { >>>>> ent = &q->blkg_list; >>>>> >>>>> Modified code: >>>>> if (rl == &q->root_rl) { >>>>> ent = &q->blkg_list; >>>>> /* There are no more block groups, hence no request lists */ >>>>> if (list_empty(ent)) >>>>> return NULL; >>>>> } >>> >>> Do we need this at all? q->root_blkg being NULL is completely fine >>> there and the comparison would work as expected, no? >> >> Hmm? >> >> If list_empty(ent) and q->root_blkg == NULL, >> >>> /* walk to the next list_head, skip root blkcg */ >>> ent = ent->next; >> >> ent is &q->blkg_list again. >> >>> if (ent == &q->root_blkg->q_node) >> >> So ent is not &q->root_blkg->q_node. > > If q->root_blkg is NULL, will it not lead to NULL pointer dereference. > (q->root_blkg->q_node). It's not dereferenced. >>> ent = ent->next; >>> if (ent == &q->blkg_list) >>> return NULL; >> >> And we return NULL here. >> >> Ah, yes. You are correct. >> We can do without the above hunk. > > I would rather prefer to check for this boundary condition early and > return instead of letting it fall through all these conditions and > then figure out yes we have no next rl. IMO, code becomes easier to > understand if nothing else. Otherwise one needs a step by step > explanation as above to show that case of q->root_blkg is covered. I have the same opinion as yours that it's good for readability. 
-- Jun'ichi Nomura, NEC Corporation
[PATCH] blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg
blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 This patch clears q->root_blkg and q->root_rl.blkg when root blkg is destroyed. Signed-off-by: Jun'ichi Nomura Acked-by: Vivek Goyal Cc: Tejun Heo Cc: Jens Axboe --- v3: Removed a hunk for NULL-check of q->root_blkg in __blk_queue_next_rl(). 
Current code can handle the case without the change. v2: Added comments in code based on Vivek's suggestion. block/blk-cgroup.c | 7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index f3b44a6..54f35d1 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -285,6 +285,13 @@ static void blkg_destroy_all(struct request_queue *q) blkg_destroy(blkg); spin_unlock(&blkcg->lock); } + + /* + * root blkg is destroyed. Just clear the pointer since + * root_rl does not take reference on root blkg. + */ + q->root_blkg = NULL; + q->root_rl.blkg = NULL; } static void blkg_rcu_free(struct rcu_head *rcu_head)
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/17/12 08:20, Tejun Heo wrote: >>>> - if (ent == &q->root_blkg->q_node) >>>> + if (q->root_blkg && ent == &q->root_blkg->q_node) >>> >>> Can we fix it little differently. Little earlier in the code, we check for >>> if q->blkg_list is empty, then all the groups are gone, and there are >>> no more request lists hence and return NULL. >>> >>> Current code: >>> if (rl == &q->root_rl) { >>> ent = &q->blkg_list; >>> >>> Modified code: >>> if (rl == &q->root_rl) { >>> ent = &q->blkg_list; >>> /* There are no more block groups, hence no request lists */ >>> if (list_empty(ent)) >>> return NULL; >>> } > > Do we need this at all? q->root_blkg being NULL is completely fine > there and the comparison would work as expected, no? Hmm? If list_empty(ent) and q->root_blkg == NULL, > /* walk to the next list_head, skip root blkcg */ > ent = ent->next; ent is &q->blkg_list again. > if (ent == &q->root_blkg->q_node) So ent is not &q->root_blkg->q_node. > ent = ent->next; > if (ent == &q->blkg_list) > return NULL; And we return NULL here. Ah, yes. You are correct. We can do without the above hunk. -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
Hi Vivek, thank you for the comments. On 10/11/12 00:59, Vivek Goyal wrote: > I think patch looks reasonable to me. Just that some more description > would be nice. In fact, I will prefer some code comments too as I > had to scratch my head for a while to figure out how did we reach here. > > So looks like we deactivated cfq policy (most likely changed IO > scheduler). That will destroy all the block groups (disconnect blkg > from list and drop policy reference on group). If there are any pending > IOs, then group will not be destroyed till IO is completed. (Because > of cfqq reference on blkg and because of request list reference on > blkg). > > Now, all request lists take a reference on associated blkg except > q->root_rl. This means when last IO finished, it must have dropped > the reference on cfqq which will drop reference on associated cfqg/blkg > and immediately root blkg will be destroyed. And now we will call > blk_put_rl() and that will try to access root_rl->blkg which has > been just freed as last IO completed. Yes, and for completion of any new IOs, blk_put_rl() is misled. I'll try to extend the description according to your comments. > > So problem here is that we don't take request list reference on > root blkg and that creates all these corner cases. > > So clearing q->root_blkg and q->root_rl.blkg during policy activation > makes sense. That means that from queue and request list point of view > root blkg is gone and you can't get to it. (It might still be around for > some more time due to pending IOs though). > > Some minor comments below. 
> >> >> Signed-off-by: Jun'ichi Nomura >> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c >> index f3b44a6..5015764 100644 >> --- a/block/blk-cgroup.c >> +++ b/block/blk-cgroup.c >> @@ -285,6 +285,9 @@ static void blkg_destroy_all(struct request_queue *q) >> blkg_destroy(blkg); >> spin_unlock(&blkcg->lock); >> } >> + >> +q->root_blkg = NULL; >> +q->root_rl.blkg = NULL; > > I think some of the above description about we not taking root_rl > reference on root group can go here so that next time I don't have > to scratch my head for a long time. I put the following comment: /* * root blkg is destroyed. Just clear the pointer since * root_rl does not take reference on root blkg. */ > >> } >> >> static void blkg_rcu_free(struct rcu_head *rcu_head) >> @@ -333,7 +336,7 @@ struct request_list *__blk_queue_next_rl(struct >> request_list *rl, >> >> /* walk to the next list_head, skip root blkcg */ >> ent = ent->next; >> -if (ent == &q->root_blkg->q_node) >> +if (q->root_blkg && ent == &q->root_blkg->q_node) > Can we fix it little differently. Little earlier in the code, we check for > if q->blkg_list is empty, then all the groups are gone, and there are > no more request lists hence and return NULL. > > Current code: > if (rl == &q->root_rl) { > ent = &q->blkg_list; > > Modified code: > if (rl == &q->root_rl) { > ent = &q->blkg_list; > /* There are no more block groups, hence no request lists */ > if (list_empty(ent)) > return NULL; > } OK. I changed that. Below is the updated version of the patch. == blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. 
For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 >
[PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
I got system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 blk_put_rl() does this: if (rl->blkg && rl->blkg->blkcg != &blkcg_root) blkg_put(rl->blkg); but if rl is q->root_rl, rl->blkg might be a bogus pointer because blkcg_deactivate_policy() does not clear q->root_rl.blkg after blkg_destroy_all(). Attached patch works for me. 
Signed-off-by: Jun'ichi Nomura diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index f3b44a6..5015764 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -285,6 +285,9 @@ static void blkg_destroy_all(struct request_queue *q) blkg_destroy(blkg); spin_unlock(&blkcg->lock); } + + q->root_blkg = NULL; + q->root_rl.blkg = NULL; } static void blkg_rcu_free(struct rcu_head *rcu_head) @@ -333,7 +336,7 @@ struct request_list *__blk_queue_next_rl(struct request_list *rl, /* walk to the next list_head, skip root blkcg */ ent = ent->next; - if (ent == &q->root_blkg->q_node) + if (q->root_blkg && ent == &q->root_blkg->q_node) ent = ent->next; if (ent == &q->blkg_list) return NULL;
Re: [dm-devel] [PATCH v8 3/8] dm: Use bioset's front_pad for dm_rq_clone_bio_info
On 09/06/12 05:27, Kent Overstreet wrote: > @@ -2718,7 +2705,8 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned > type, unsigned integrity) > if (!pools->tio_pool) > goto free_io_pool_and_out; > > - pools->bs = bioset_create(pool_size, 0); > + pools->bs = bioset_create(pool_size, > + offsetof(struct dm_rq_clone_bio_info, clone)); > if (!pools->bs) > goto free_tio_pool_and_out; front_pad is not necessary if type is DM_TYPE_BIO_BASED. Other pool creations in that function do something like: pools->bs = (type == DM_TYPE_BIO_BASED) ? bioset_create(pool_size, 0) : bioset_create(pool_size, offsetof(struct dm_rq_clone_bio_info, clone)); -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On 08/29/12 11:59, Dave Chinner wrote: > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: >> And yes, I understand it's ideal, but many applications choose not to >> do that for performance reason. >> So I think it's helpful if we can surely report to such applications. I suspect "performance vs. integrity" is not a correct description of the problem. > If performance is chosen over data integrity, we are under no > obligation to keep the error around indefinitely. Fundamentally, > ensuring a write completes successfully is the responsibility of the > application, not the kernel. There are so many different filesystem > and storage errors that can be lost right now because data is not > fsync()d, adding another one to them really doesn't change anything. > IOWs, a memory error is no different to a disk failing or the system > crashing when it comes to data integrity. If you care, you use > fsync(). I agree that applications should use fsync() or O_SYNC when they want to make sure the written data is on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone does: $ patch -p1 < ../a.patch $ tar cf . > ../a.tar and a disk failure occurred between "patch" and "tar", "tar" will either see up-to-date data or an I/O error. OTOH, if the failure was detected on a dirty pagecache, the current memory failure handler invalidates the dirty page and the "tar" command will re-read old contents from disk without error. (Well, the failures above are permanent failures. IOW, the current memory failure handler turns a permanent failure into a transient error, which is often more difficult to handle, I think.) Naoya's patch will keep the failure information and allows the reader to get an I/O error when it reads from the broken pagecache. 
-- Jun'ichi Nomura, NEC Corporation
Re: [PATCH v5 02/12] dm: Use bioset's front_pad for dm_rq_clone_bio_info
On 08/07/12 07:08, Kent Overstreet wrote: > struct dm_rq_clone_bio_info { > struct bio *orig; > struct dm_rq_target_io *tio; > + struct bio clone; > }; ... > - pools->bs = bioset_create(pool_size, 0); > + pools->bs = bioset_create(pool_size, > + offsetof(struct dm_rq_clone_bio_info, orig)); > if (!pools->bs) > goto free_tio_pool_and_out; Shouldn't this be offsetof(struct dm_rq_clone_bio_info, clone)? -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache
On 08/11/12 08:09, Andi Kleen wrote: > Naoya Horiguchi writes: > >> Current memory error handling on dirty pagecache has a bug that user >> processes who use corrupted pages via read() or write() can't be aware >> of the memory error and result in discarding dirty data silently. >> >> The following patch is to improve handling/reporting memory errors on >> this case, but as a short term solution I suggest that we should undo >> the present error handling code and just leave errors for such cases >> (which expect the 2nd MCE to panic the system) to ensure data consistency. > > Not sure that's the right approach. It's not worse than any other IO > errors isn't it? IMO, it's worse in certain cases. For example, a producer-consumer type program which uses a file as temporary storage. Current memory-failure.c drops produced data from dirty pagecache and allows the reader to consume old or empty data from disk (silently!), which is what I think HWPOISON should prevent. A similar thing could happen theoretically with disk I/O errors, though practically those errors are often persistent and the reader will likely get errors again instead of bad data. Also, ext3/ext4 has an option to panic when an error is detected, for people who want to avoid corruption on intermittent errors. -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH] PCI: Remove default PCI expansion ROM memory allocation
Gary Hade wrote: > Remove default PCI expansion ROM memory allocation Thank you Gary for the new patch. By not allocating resources for expansion ROM, this patch will eliminate the problem that my patch masked, too: http://lkml.org/lkml/2007/12/4/284 Regards, -- Jun'ichi Nomura, NEC Corporation of America
Re: [PATCH] pci: Omit error message for benign allocation failure
Gary Hade wrote: > On Tue, Dec 04, 2007 at 06:23:32PM -0500, Jun'ichi Nomura wrote: >> Kernel always tries to. But it's best effort basis, IMO. >> (Maybe your patch is going to fix that?) > > If you are seeing the allocation failures both with and > without my original patch I doubt that an improved version > would work. In my case, the BIOS had allowed sufficient > resource for the expansion ROMs but was expecting the kernel > to get it from the non-prefetch instead of prefetch window. OK. Then it's different from mine. Thanks for the info. Regards, -- Jun'ichi Nomura, NEC Corporation of America
Re: [PATCH] pci: Omit error message for benign allocation failure
Hi Gary, Gary Hade wrote: > On Tue, Dec 04, 2007 at 02:35:48PM -0500, Jun'ichi Nomura wrote: >> On a system with PCI-to-PCI bridges, following errors are observed: >> >> PCI: Failed to allocate mem resource #8:[EMAIL PROTECTED] for :02:00.0 >> PCI: Failed to allocate mem resource #6:[EMAIL PROTECTED] for :03:01.0 >> >> '#6' is for expansion ROM and '#8' for the bridge where the device >> with the expansion ROM is connected. > > I believe there is a good chance that may be another instance > of the regression caused by my "Avoid creating P2P prefetch > window for expansion ROMs" patch that was recently reported by > Jan Beulich. > http://marc.info/?l=linux-kernel&m=11981103023&w=2 Sorry, I was not clear about that. I've seen the above errors even with 2.6.22 and RHEL5, which is based on 2.6.18. So it's not a regression. > I am working on a better fix for the problem that the patch > was attempting to address but this is turning out to be much > more difficult than I expected. If I don't have a solution > very soon I plan to publish a revert patch. > >> But I think the failure is benign because the allocation is >> not necessary for these resources. > > This is an interesting idea. Could you elaborate? As far > as I can tell, the kernel always tries to allocate memory > for expansion ROMs which it also exports to user level. The kernel always tries to, but on a best-effort basis, IMO. (Maybe your patch is going to fix that?) In the 1st stage (pcibios_allocate_resources, etc.), it allocates resources based on the PCI configuration provided by BIOS, except for expansion ROMs. In the 2nd stage (pci_assign_unassigned_resources, etc.), it allocates resources for unallocated ones including expansion ROMs. The 2nd stage doesn't reprogram the bridge settings. So if the expansion ROM is under a bridge where other resources are already allocated, the allocation failure occurs. 
There are comments in drivers/pci/setup-bus.c: /* Helper function for sizing routines: find first available bus resource of a given type. Note: we intentionally skip the bus resources which have already been assigned (that is, have non-NULL parent resource). */ Fixing the above might be a better solution. But I don't know if there are any users who need it. > I have assumed that some drivers or user level apps may > need to access this space. Is this not true? I haven't heard of applications which depend on the kernel resource allocation for expansion ROMs. X is a big user of expansion ROMs but I heard it solves the resource conflict itself. If there is a counter example, I would like to know it. Thanks, -- Jun'ichi Nomura, NEC Corporation of America
[PATCH] pci: Omit error message for benign allocation failure
Hi, On a system with PCI-to-PCI bridges, following errors are observed: PCI: Failed to allocate mem resource #8:[EMAIL PROTECTED] for :02:00.0 PCI: Failed to allocate mem resource #6:[EMAIL PROTECTED] for :03:01.0 '#6' is for expansion ROM and '#8' for the bridge where the device with the expansion ROM is connected. But I think the failure is benign because the allocation is not necessary for these resources. The allocation failure message is informative when the failure is really a problem and needs a diagnosis. However, if the resource is expansion ROMs, the message is just confusing. This patch omits the error message if the resource is an expansion ROM or a bridge. Thanks, -- Jun'ichi Nomura, NEC Corporation of America Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]> --- linux-2.6.24-rc3/drivers/pci/setup-res.c.orig 2007-12-04 00:24:11.0 -0500 +++ linux-2.6.24-rc3/drivers/pci/setup-res.c 2007-12-04 00:42:14.0 -0500 @@ -158,11 +158,12 @@ int pci_assign_resource(struct pci_dev * } if (ret) { - printk(KERN_ERR "PCI: Failed to allocate %s resource " - "#%d:[EMAIL PROTECTED] for %s\n", - res->flags & IORESOURCE_IO ? "I/O" : "mem", - resno, (unsigned long long)size, - (unsigned long long)res->start, pci_name(dev)); + if (resno < PCI_ROM_RESOURCE) + printk(KERN_ERR "PCI: Failed to allocate %s resource " + "#%d:[EMAIL PROTECTED] for %s\n", + res->flags & IORESOURCE_IO ? "I/O" : "mem", + resno, (unsigned long long)size, + (unsigned long long)res->start, pci_name(dev)); } else if (resno < PCI_BRIDGE_RESOURCES) { pci_update_resource(dev, res, resno); }
[PATCH] dm: Fix panic on shrinking device size
Hi, This patch fixes a panic on shrinking DM device, due to reference outside of the DM table. The problem occurs only if there is outstanding I/O to the removed part of the device. The bug is in that __clone_and_map() assumes dm_table_find_target() always returns a valid pointer. It may fail if a BIO is sent from the block layer but its target sector is already not in the DM table. This patch tidies up dm_table_find_target() to return NULL instead of bogus pointer for failure case. Then __clone_and_map() can check the return value of it and make the BIO return with -EIO. target_message() in dm-ioctl.c, the only other user of dm_table_find_target(), is modified to use the return value of it, too. Sample test script to trigger oops: -- #!/bin/bash FILE=$(mktemp) LODEV=$(losetup -f) MAP=$(basename ${FILE}) SIZE=4M dd if=/dev/zero of=${FILE} bs=${SIZE} count=1 losetup ${LODEV} ${FILE} echo "0 $(blockdev --getsz ${LODEV}) linear ${LODEV} 0" |dmsetup create ${MAP} dmsetup suspend ${MAP} echo "0 1 linear ${LODEV} 0" |dmsetup load ${MAP} dd if=/dev/zero of=/dev/mapper/${MAP} bs=${SIZE} count=1 & echo "Wait til dd push some I/O" sleep 5 dmsetup resume ${MAP} ------ Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]> --- dm-ioctl.c | 10 +++--- dm-table.c |3 +++ dm.c | 24 ++-- 3 files changed, 24 insertions(+), 13 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -663,13 +663,19 @@ static struct bio *clone_bio(struct bio return clone; } -static void __clone_and_map(struct clone_info *ci) +static int __clone_and_map(struct clone_info *ci) { struct bio *clone, *bio = ci->bio; - struct dm_target *ti = dm_table_find_target(ci->map, ci->sector); - sector_t len = 0, max = max_io_len(ci->md, ci->sector, ti); + struct dm_target *ti; + sector_t len = 0, max; struct dm_target_io *tio; + ti = dm_table_find_target(ci->map, ci->sector); + if (!ti) + return -EIO; + + max = max_io_len(ci->md, 
ci->sector, ti); + /* * Allocate a target io object. */ @@ -727,6 +733,9 @@ static void __clone_and_map(struct clone do { if (offset) { ti = dm_table_find_target(ci->map, ci->sector); + if (!ti) + return -EIO; + max = max_io_len(ci->md, ci->sector, ti); tio = alloc_tio(ci->md); @@ -750,6 +759,8 @@ static void __clone_and_map(struct clone ci->idx++; } + + return 0; } /* @@ -758,6 +769,7 @@ static void __clone_and_map(struct clone static void __split_bio(struct mapped_device *md, struct bio *bio) { struct clone_info ci; + int error = 0; ci.map = dm_get_table(md); if (!ci.map) { @@ -777,11 +789,11 @@ static void __split_bio(struct mapped_de ci.idx = bio->bi_idx; start_io_acct(ci.io); - while (ci.sector_count) - __clone_and_map(&ci); + while (ci.sector_count && !error) + error = __clone_and_map(&ci); /* drop the extra reference count */ - dec_pending(ci.io, 0); + dec_pending(ci.io, error); dm_table_put(ci.map); } /*- Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -1250,21 +1250,17 @@ static int target_message(struct dm_ioct if (!table) goto out_argv; - if (tmsg->sector >= dm_table_get_size(table)) { + ti = dm_table_find_target(table, tmsg->sector); + if (!ti) { DMWARN("Target message sector outside device."); r = -EINVAL; - goto out_table; - } - - ti = dm_table_find_target(table, tmsg->sector); - if (ti->type->message) + } else if (ti->type->message) r = ti->type->message(ti, argc, argv); else { DMWARN("Target type does not support messages"); r = -EINVAL; } - out_table: dm_table_put(table); out_argv: kfree(argv); Index: linux-2.6.23.work/drivers/md/dm-table.c === --- linux-2.6.23.work.orig/drivers/md/dm-tab
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair, Jun'ichi Nomura wrote: > So as far as I understand, the point is: > 1. it's preferable to resize inode after the resume, if possible Attached is a modified version of the (2/3) patch. - The logic is basically the same as before. - Moved the set-size function outside of the table swapping. - Moved the uevent to the end of resize from the end of resume. - Lightly tested on LVM2 mirror resizing. - (1/3) and (3/3) are unchanged. What do you think about this? --- drivers/md/dm-ioctl.c |4 ++ drivers/md/dm.c | 61 ++ include/linux/device-mapper.h |5 +++ 3 files changed, 46 insertions(+), 24 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -1099,31 +1099,11 @@ static void event_callback(void *context wake_up(&md->eventq); } -static void __set_size(struct mapped_device *md, sector_t size) -{ - set_capacity(md->disk, size); - - mutex_lock(&md->suspended_bdev->bd_inode->i_mutex); - i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); - mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex); -} - static int __bind(struct mapped_device *md, struct dm_table *t) { struct request_queue *q = md->queue; - sector_t size; - - size = dm_table_get_size(t); - - /* -* Wipe any geometry if the size of the table changed. 
-*/ - if (size != get_capacity(md->disk)) - memset(&md->geometry, 0, sizeof(md->geometry)); - if (md->suspended_bdev) - __set_size(md, size); - if (size == 0) + if (!dm_table_get_size(t)) return 0; dm_table_get(t); @@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md) dm_table_unplug_all(map); - kobject_uevent(&md->disk->kobj, KOBJ_CHANGE); - r = 0; out: @@ -1508,6 +1486,43 @@ out: return r; } +void dm_set_size(struct mapped_device *md) +{ + struct dm_table *map; + struct block_device *bdev; + sector_t size; + + down(&md->suspend_lock); + + map = dm_get_table(md); + if (!map) + goto out; + + size = dm_table_get_size(map); + + /* +* Wipe any geometry if the size of the table changed. +*/ + if (size != get_capacity(md->disk)) + memset(&md->geometry, 0, sizeof(md->geometry)); + + set_capacity(md->disk, size); + + bdev = bdlookup_disk(md->disk, 0); + if (bdev) { + mutex_lock(&bdev->bd_mutex); + i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); + mutex_unlock(&bdev->bd_mutex); + bdput(bdev); + } + + kobject_uevent(&md->disk->kobj, KOBJ_CHANGE); + +out: + dm_table_put(map); + up(&md->suspend_lock); +} + /*- * Event notification. *---*/ Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa if (dm_suspended(md)) r = dm_resume(md); - if (!r) + if (!r) { + dm_set_size(md); r = __dev_status(md, param); + } dm_put(md); return r; Index: linux-2.6.23.work/include/linux/device-mapper.h === --- linux-2.6.23.work.orig/include/linux/device-mapper.h +++ linux-2.6.23.work/include/linux/device-mapper.h @@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo); int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo); +/* + * Update device size + */ +void dm_set_size(struct mapped_device *md); + /*- * Functions for manipulating device-mapper tables. 
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair,

Thanks for the immediate reply. So as far as I understand, the point is:
  1. it's preferable to resize the inode after the resume, if possible
  2. it's necessary to have a nowait version of bdget() to avoid a stall

For the 1st item, I guess the reason is that we want to make the table-swapping code deterministic. Is that correct? Even if we don't have to wait for bdget(), would you like to have the resizing code after the resume?

For the 2nd item, the patch included bdlookup() for that purpose. What do you think about it?

I added some additional comments and questions below.

Alasdair G Kergon wrote:
> On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote:
>> There is no guarantee that I/O will flow through the device again.
>> The table might need to be replaced again, but to do that, the resume
>> should have been completed to let userspace know it.
>
> Then the first attempt to set the size could be made to fail
> (because it could not get the lock immediately) and the
> size could be set after the second resume instead.
>
> - Setting the size would lag behind the actual size the dm table was
>   supporting, but (given the usage cases discussed) this would not matter.
>
>> bdget() in noflush suspend has a possibility of stall.
>
> So we cannot avoid fixing that: we require immediate return
> with failure instead of waiting.

bdlookup() is an almost-nowait version of bdget() ("almost" means it may wait for I_NEW to be turned off if there is a bdget() doing its latter half, but it never stalls). Do you think we need something else?

>> OTOH, calling bdget() and i_size_write() outside of the lock
>> can cause a race with other table swapping and may result in setting
>> the wrong device size.
>
> If the size setting is removed from the lock, then it becomes
> "set the inode size to match the current size of the table" and
> races would not matter - each "set size" attempt would set it
> to the instantaneous live table size, not a cached value that
> could be out-of-date.

Even so, I think we still want set_capacity() during the table swapping, so we need to call dm_table_get_size() twice. Is that OK?

Also, do you want to move the resizing code for normal suspend, too? Otherwise the code will be duplicated, and I don't think you would like that.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair,

Alasdair G Kergon wrote:
> Before reviewing the details of the proposed workaround, I'd like to see
> a deeper analysis of the problem to see that there isn't a cleaner way
> to resolve this.

OK, let me try.

> For example:
>
> Question) What are the realistic situations we must support that lead to
> a resize during table reload with I/O outstanding?

'noflush' is currently the only option for userspace to suspend without risking otherwise-avoidable I/O loss, because failure of the underlying device might occur *after* userspace started a normal suspend but *before* the flushing completes. (normal suspend = flushing suspend)

E.g. if userspace wants to resize a dm-mpath device with queue_if_no_path, it has to use 'noflush' suspend for the table swapping. Otherwise, path failures during the suspend will cause I/O errors even though queue_if_no_path is specified.

Another possible solution would be allowing the suspend to fail if it can't flush all I/O, and letting userspace retry after fixing the device failure.

> - The resize is the purpose of the reload; noflush is only set to avoid losing
>   I/O if a path should fail. So any outstanding I/O may be expected to be
>   consistent with both the old and new sizes of the device. E.g. If it's
>   beyond the end of a shrinking device and userspace cared about not
>   losing that I/O, it would have waited for that I/O to be flushed
>   *before* issuing the resize.

If userspace cares about not losing I/O, it should wait for the I/O before trying to shrink the device. After the shrinking has started, any I/O beyond the new end of the device could be lost. Reducing the size of a device while actively running I/O on it always carries the possibility of losing some of that I/O.

>   If the I/O is beyond the end of the
>   existing device but within the new size, userspace would have waited for
>   the resize operation to complete before allowing the new I/O to be
>   issued.

Issuing I/O beyond the end of the existing device would get an error. If the issuer knows the device will be extended, it should wait for the completion of the extension. If it doesn't know, such I/O won't be issued.

> => Is it OK for device-mapper to handle the device size check
>    internally, rejecting any I/O that falls beyond the end of the table (it
>    already must do this lookup anyway),

I think this check is needed in the current dm, regardless of noflush.

>    and to update the size recorded in
>    the inode later, after I/O is flowing through the device again, but (of
>    course) before reporting that the resize operation is complete?
>    I.e. does it eliminate deadlocks if the bdget() and i_size_write()
>    happen after the 'resume'?

There is no guarantee that I/O will flow through the device again. The table might need to be replaced again, but to do that, the resume should have been completed to let userspace know it.

bdget() in noflush suspend has a possibility of stall. Once it stalls, the remedy is for userspace to replace the table with a working one. However, since bdget() occurs inside suspend_lock, it's not possible to run another instance of suspend/resume. OTOH, calling bdget() and i_size_write() outside of the lock can cause a race with other table swapping and may result in setting the wrong device size.

Also, bdget() allocates a new bdev inode if there isn't one, but dm wants to access the bdev inode only when it already exists in memory. So something like bdlookup() will fit the purpose, IMO.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair,

Jun'ichi Nomura wrote:
> So as far as I understand, the point is:
> 1. it's preferable to resize the inode after the resume, if possible

Attached is a modified version of the (2/3) patch.
- The logic is basically the same as before.
- Moved the set-size function outside of the table swapping.
- Moved the uevent from the end of resume to the end of resize.
- Lightly tested on LVM2 mirror resizing.
- (1/3) and (3/3) are unchanged.

What do you think about this?

---
 drivers/md/dm-ioctl.c         |    4 ++
 drivers/md/dm.c               |   61 ++
 include/linux/device-mapper.h |    5 +++
 3 files changed, 46 insertions(+), 24 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1099,31 +1099,11 @@ static void event_callback(void *context
 	wake_up(&md->eventq);
 }
 
-static void __set_size(struct mapped_device *md, sector_t size)
-{
-	set_capacity(md->disk, size);
-
-	mutex_lock(&md->suspended_bdev->bd_inode->i_mutex);
-	i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
-	mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex);
-}
-
 static int __bind(struct mapped_device *md, struct dm_table *t)
 {
 	struct request_queue *q = md->queue;
-	sector_t size;
-
-	size = dm_table_get_size(t);
-
-	/*
-	 * Wipe any geometry if the size of the table changed.
-	 */
-	if (size != get_capacity(md->disk))
-		memset(&md->geometry, 0, sizeof(md->geometry));
-
-	if (md->suspended_bdev)
-		__set_size(md, size);
-	if (size == 0)
+	if (!dm_table_get_size(t))
 		return 0;
 
 	dm_table_get(t);
@@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md)
 
 	dm_table_unplug_all(map);
 
-	kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
-
 	r = 0;
 
 out:
@@ -1508,6 +1486,43 @@ out:
 	return r;
 }
 
+void dm_set_size(struct mapped_device *md)
+{
+	struct dm_table *map;
+	struct block_device *bdev;
+	sector_t size;
+
+	down(&md->suspend_lock);
+
+	map = dm_get_table(md);
+	if (!map)
+		goto out;
+
+	size = dm_table_get_size(map);
+
+	/*
+	 * Wipe any geometry if the size of the table changed.
+	 */
+	if (size != get_capacity(md->disk))
+		memset(&md->geometry, 0, sizeof(md->geometry));
+
+	set_capacity(md->disk, size);
+
+	bdev = bdlookup_disk(md->disk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_mutex);
+		i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
+		mutex_unlock(&bdev->bd_mutex);
+		bdput(bdev);
+	}
+
+	kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
+
+out:
+	dm_table_put(map);
+	up(&md->suspend_lock);
+}
+
 /*-----------------------------------------------------------------
  * Event notification.
  *---------------------------------------------------------------*/

Index: linux-2.6.23.work/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.23.work/drivers/md/dm-ioctl.c
@@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa
 	if (dm_suspended(md))
 		r = dm_resume(md);
 
-	if (!r)
+	if (!r) {
+		dm_set_size(md);
 		r = __dev_status(md, param);
+	}
 
 	dm_put(md);
 	return r;

Index: linux-2.6.23.work/include/linux/device-mapper.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/device-mapper.h
+++ linux-2.6.23.work/include/linux/device-mapper.h
@@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ
 int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo);
 int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo);
 
+/*
+ * Update device size
+ */
+void dm_set_size(struct mapped_device *md);
+
 /*-----------------------------------------------------------------
  * Functions for manipulating device-mapper tables.
[PATCH] dm: noflush resizing (3/3) Allow table swapping with different size
This patch removes the check for whether the loaded table is going to change the device size and allows resizing of a dm device suspended with 'noflush'.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 dm.c |    5 -
 1 file changed, 5 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1288,11 +1288,6 @@ int dm_swap_table(struct mapped_device *
 	if (!dm_suspended(md))
 		goto out;
 
-	/* without bdev, the device size cannot be changed */
-	if (!md->suspended_bdev)
-		if (get_capacity(md->disk) != dm_table_get_size(table))
-			goto out;
-
 	__unbind(md);
 	r = __bind(md, table);
[PATCH] dm: noflush resizing (1/3) Add bdlookup()
This patch adds bdlookup() and bdlookup_disk(). They are lookup-only variants of bdget()/bdget_disk(). The functions can be used when:
- the caller wants to get the bdev only if somebody already got it (i.e. there is already a bd_inode allocated in memory), and
- the race between bdlookup() and bdget() is acceptable.

The patch is a preparation for the 2nd part of this patchset.

Background: an inode is initialized with I_LOCK and I_NEW turned on. After initialization, both flags are turned off. bdget() uses iget5_locked(), which waits until I_LOCK is removed. That could take very long, or forever.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 fs/block_dev.c        |   27 +++
 include/linux/fs.h    |    4 +++-
 include/linux/genhd.h |    6 ++
 3 files changed, 36 insertions(+), 1 deletion(-)

Index: linux-2.6.23.work/fs/block_dev.c
===================================================================
--- linux-2.6.23.work.orig/fs/block_dev.c
+++ linux-2.6.23.work/fs/block_dev.c
@@ -586,6 +586,33 @@ struct block_device *bdget(dev_t dev)
 
 EXPORT_SYMBOL(bdget);
 
+/*
+ * Variant of bdget(), which returns bdev only when it is already gotten.
+ *
+ * The function can be used when:
+ *   - the caller wants to get bdev only if somebody already got it
+ *     (i.e. there is already bd_inode allocated in memory)
+ *   and
+ *   - the race between bdlookup() and bdget() is acceptable
+ */
+struct block_device *bdlookup(dev_t dev)
+{
+	struct inode *inode;
+
+	might_sleep();
+
+	/* Use _nowait because we don't want to wait for I_LOCK */
+	inode = ilookup5_nowait(bd_mnt->mnt_sb, hash(dev), bdev_test, &dev);
+
+	if (!inode)
+		return NULL;
+
+	wait_on_bit(&inode->i_state, __I_NEW, inode_wait, TASK_UNINTERRUPTIBLE);
+
+	return &BDEV_I(inode)->bdev;
+}
+EXPORT_SYMBOL(bdlookup);
+
 long nr_blockdev_pages(void)
 {
 	struct block_device *bdev;

Index: linux-2.6.23.work/include/linux/fs.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/fs.h
+++ linux-2.6.23.work/include/linux/fs.h
@@ -1211,10 +1211,11 @@ struct super_operations {
 #define I_DIRTY_DATASYNC	2 /* Data-related inode changes pending */
 #define I_DIRTY_PAGES		4 /* Data-related inode changes pending */
 #define __I_LOCK		3
+#define __I_NEW			6
 #define I_LOCK			(1 << __I_LOCK)
 #define I_FREEING		16
 #define I_CLEAR			32
-#define I_NEW			64
+#define I_NEW			(1 << __I_NEW)
 #define I_WILL_FREE		128
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
@@ -1431,6 +1432,7 @@ extern void putname(const char *name);
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
 extern struct block_device *bdget(dev_t);
+extern struct block_device *bdlookup(dev_t);
 extern void bd_set_size(struct block_device *, loff_t size);
 extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);

Index: linux-2.6.23.work/include/linux/genhd.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/genhd.h
+++ linux-2.6.23.work/include/linux/genhd.h
@@ -435,6 +435,12 @@ static inline struct block_device *bdget
 	return bdget(MKDEV(disk->major, disk->first_minor) + index);
 }
 
+static inline struct block_device *bdlookup_disk(struct gendisk *disk,
+						 int index)
+{
+	return bdlookup(MKDEV(disk->major, disk->first_minor) + index);
+}
+
 #endif
 
 #else /* CONFIG_BLOCK */
[PATCH] dm: noflush resizing (2/3) Use bdlookup() on noflush suspend
This patch makes device-mapper use bdlookup() to allow resizing of a noflush-suspended device. By using bdlookup() instead of bdget(), resizing will no longer wait for I_LOCK.

There is a race between bdlookup() and bdget(). However, it's benign, because the only use of the bdev obtained by bdlookup() is to modify i_size, and bdget() will see the updated disk size anyway. (See below.)

Case 1:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
    set_capacity()
      disk size is updated
    bdlookup() -> NULL
                                  bdget()
                                    reads the updated disk size and
                                    sets the bdev inode's i_size to it

Case 2:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
    set_capacity()
      disk size is updated
                                  bdget()
                                    reads the updated disk size and
                                    sets the bdev inode's i_size to it
    bdlookup()
      waits for bdget() completion
      (I_NEW) and gets the bdev
      inode with the updated size

Case 3:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
                                  bdget()
                                    reads the old disk size
    set_capacity()
      disk size is updated
    bdlookup()
      gets the existing bdev inode
      and updates its i_size

To change the i_size of the bdev inode, the new code uses bd_mutex instead of inode->i_mutex. According to a comment in include/linux/fs.h, a mutex is necessary here only to serialize i_size_write(). bd_inode->i_size is set via bd_set_size(), and bd_set_size() is protected by bd_mutex. So I think bd_mutex is the lock we should use.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 dm.c |   27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1101,11 +1101,29 @@ static void event_callback(void *context
 
 static void __set_size(struct mapped_device *md, sector_t size)
 {
+	struct block_device *bdev;
+
 	set_capacity(md->disk, size);
 
-	mutex_lock(&md->suspended_bdev->bd_inode->i_mutex);
-	i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
-	mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex);
+	/*
+	 * Updated disk size should be visible to bdget() possibly
+	 * called in parallel to bdlookup()
+	 */
+	smp_mb();
+
+	if (md->suspended_bdev)
+		bdev = md->suspended_bdev;
+	else
+		bdev = bdlookup_disk(md->disk, 0);
+
+	if (bdev) {
+		mutex_lock(&bdev->bd_mutex);
+		i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
+		mutex_unlock(&bdev->bd_mutex);
+	}
+
+	if (!md->suspended_bdev)
+		bdput(bdev);
 }
 
 static int __bind(struct mapped_device *md, struct dm_table *t)
@@ -1121,8 +1139,7 @@ static int __bind(struct mapped_device *
 	if (size != get_capacity(md->disk))
 		memset(&md->geometry, 0, sizeof(md->geometry));
 
-	if (md->suspended_bdev)
-		__set_size(md, size);
+	__set_size(md, size);
 
 	if (size == 0)
 		return 0;
[PATCH] dm: noflush resizing (0/3)
Hi,

Currently there is a restriction on resizing a device-mapper device: it cannot be resized if the 'noflush' option is given for suspend. Since both multipath-tools (dm-multipath) and LVM2 mirror use 'noflush' suspend, the restriction limits the capability of those tools. The following patches remove the restriction.

  [PATCH] (1/3) Add bdlookup()
  [PATCH] (2/3) Use bdlookup() on noflush suspend
  [PATCH] (3/3) Allow resizing table load on noflush suspend

The patches were tested on i686 and ia64 with LVM2, for both the 'noflush' case and the normal case.

Background:
- For some device-mapper targets (multipath and mirror), the mapping table sometimes has to be replaced to cope with device failure. OTOH, device-mapper flushes all pending I/Os upon table replacement, which may result in I/O errors if there are device failures. 'noflush' suspend is used to let dm queue the pending I/Os instead of flushing them. Since it's not possible for a user-space program to tell whether the suspend could cause I/O errors, those tools always use 'noflush' to suspend mirror/multipath targets.
- Currently, resizing is disabled for 'noflush' suspend. Resizing occurs in the course of table replacement. To resize a device under use, device-mapper needs to get its bdev inode. However, using bdget() in this case could cause a deadlock: it waits for I_LOCK while an I/O process holding I_LOCK is waiting for completion of the table replacement.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
[PATCH] device-mapper: fix bd_mount_sem corruption
Hi,

This patch fixes a bd_mount_sem counter corruption bug in device-mapper.

thaw_bdev() should be called only when freeze_bdev() was called for the device. Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter. A struct block_device with the corrupted semaphore may remain in the slab cache and be reused later.

The attached patch fixes it by calling unlock_fs() instead. unlock_fs() determines whether it should call thaw_bdev() by checking whether the device is frozen.

An easy reproducer is:

#!/bin/sh
while [ 1 ]; do
	dmsetup --notable create a
	dmsetup --nolockfs suspend a
	dmsetup remove a
done

It's not easy to see the effect of the corrupted semaphore. So I have tested with the printk below added to bdev_alloc_inode():

	if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1)
		printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n",
		       atomic_read(&ei->bdev.bd_mount_sem.count), &ei->bdev);

Without the patch, I saw something like:

	Incorrect semaphore count = 17 (f2ab91c0)

With the patch, the message didn't appear.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2120155..998d450 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1064,12 +1064,14 @@ static struct mapped_device *alloc_dev(int minor)
 	return NULL;
 }
 
+static void unlock_fs(struct mapped_device *md);
+
 static void free_dev(struct mapped_device *md)
 {
 	int minor = md->disk->first_minor;
 
 	if (md->suspended_bdev) {
-		thaw_bdev(md->suspended_bdev, NULL);
+		unlock_fs(md);
 		bdput(md->suspended_bdev);
 	}
 	mempool_destroy(md->tio_pool);
Re: [2.6.23 PATCH 07/18] dm io: fix panic on large request
Patrick McHardy wrote:
> Jun'ichi Nomura wrote:
>> Are you using other dm modules such as dm-multipath, dm-mirror
>> or dm-snapshot?
>> If so, can you take the output of 'dmsetup table' and 'dmsetup ls'?
>
> No other modules.
>
>> Do you have a reliable way to reproduce the oops which I can try?
>
> "/etc/init.d/cryptdisk start" (debian) on a luks partition triggered
> it for me.

With today's git HEAD (commit 49c13b51a15f1ba9f6d47e26e4a3886c4f3931e2), I tried the following but could not reproduce the oops here:

	# cryptsetup luksFormat /dev/sdb1
	# cryptsetup luksOpen /dev/sdb1 c
	# mkfs.ext3 /dev/mapper/c
	(mount it and do some I/O)

Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
Re: [2.6.23 PATCH 07/18] dm io: fix panic on large request
Hi,

Patrick McHardy wrote:
> Alasdair G Kergon wrote:
>> From: "Jun'ichi Nomura" <[EMAIL PROTECTED]>
>>
>> bio_alloc_bioset() will return NULL if 'num_vecs' is too large.
>> Use bio_get_nr_vecs() to get an estimation of the maximum number.
>>
>> Signed-off-by: "Jun'ichi Nomura" <[EMAIL PROTECTED]>
>> Signed-off-by: Alasdair G Kergon <[EMAIL PROTECTED]>
>>
>> ---
>>  drivers/md/dm-io.c | 5 ++++-
>>  1 files changed, 4 insertions(+), 1 deletion(-)
>
> This patch reproducibly oopses my box:

Thanks for the report. But I'm not sure how the patch is related to the oops. The stack trace shows the oops occurred in dm-crypt, which doesn't use the part of the code modified by the patch (dm-io).

Are you using other dm modules such as dm-multipath, dm-mirror or dm-snapshot? If so, can you take the output of 'dmsetup table' and 'dmsetup ls'?

Do you have a reliable way to reproduce the oops which I can try?

> [  126.754204] BUG: unable to handle kernel NULL pointer dereference at virtual address
> [  126.754326] printing eip:
> [  126.754369] c0141a67
> [  126.754420] *pde =
> [  126.754465] Oops: [#1]
> [  126.754507] PREEMPT
> [  126.754585] Modules linked in: [...]
> [  126.758372] CPU:    0
> [  126.758373] EIP:    0060:[<c0141a67>]    Not tainted VLI
> [  126.758374] EFLAGS: 00010282   (2.6.22 #1)
> [  126.758511] EIP is at mempool_free+0xe/0xc0
> [  126.758558] eax: d39e65d0   ebx:   ecx: df2b9898   edx:
> [  126.758605] esi:   edi: d39e65d0   ebp: d487d6d0   esp: df79fec0
> [  126.758652] ds: 007b   es: 007b   fs:   gs:   ss: 0068
> [  126.758699] Process kcryptd/0 (pid: 3218, ti=df79f000 task=df2b9640 task.ti=df79f000)
> [  126.758747] Stack: d3835f80 e08b0923 e08a5f69 0200 e0ad1080
> [  126.759093]        dfb5ab40 d3835f80 e08b08c0 e08a5fb7 c01804d8 0200
> [  126.759439]        c520bc00 0c00 d0b77438 d5754b00 df79ff5c e08a515e d0b77444 d5754b00
> [  126.759858] Call Trace:
> [  126.759965]  [<e08b0923>] clone_endio+0x63/0xc0 [dm_mod]
> [  126.760066]  [<e08a5f69>] crypt_convert+0x131/0x17f [dm_crypt]
> [  126.760168]  [<e08b08c0>] clone_endio+0x0/0xc0 [dm_mod]
> [  126.760264]  [<e08a5fb7>] kcryptd_do_work+0x0/0x30f [dm_crypt]
> [  126.760349]  [<c01804d8>] bio_endio+0x33/0x5d
> [  126.760462]  [<e08a515e>] dec_pending+0x28/0x39 [dm_crypt]
> [  126.760558]  [<e08a61e6>] kcryptd_do_work+0x22f/0x30f [dm_crypt]
> [  126.760669]  [<c0112182>] update_stats_wait_end+0x7f/0xb2
> [  126.760801]  [<e08a5fb7>] kcryptd_do_work+0x0/0x30f [dm_crypt]
> [  126.760888]  [<c012700e>] run_workqueue+0x84/0x179
> [  126.760990]  [<c0127292>] worker_thread+0x0/0xf0
> [  126.761074]  [<c012732f>] worker_thread+0x9d/0xf0
> [  126.761160]  [<c012a360>] autoremove_wake_function+0x0/0x37
> [  126.761256]  [<c0127292>] worker_thread+0x0/0xf0
> [  126.761334]  [<c012a15c>] kthread+0x52/0x58
> [  126.761411]  [<c012a10a>] kthread+0x0/0x58
> [  126.761496]  [<c0104983>] kernel_thread_helper+0x7/0x14
> [  126.761598]  ===
> [  126.761717] Code: 1c 00 89 f6 eb a9 b8 88 13 00 00 e8 b4 56 1c 00 8d 74 26 00 eb d5 31 db e9 11 ff ff ff 57 56 53 83 ec 04 89 c7 89 d6 85 c0 74 55 <8b> 02 39 42 04 7d 46 9c 58 90 8d b4 26 00 00 00 00 89 c3 fa 90
> [  126.763964] EIP: [<c0141a67>] mempool_free+0xe/0xc0 SS:ESP 0068:df79fec0

Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
[PATCH 2.6.20-rc5] dm-multipath: fix stall on noflush suspend/resume
Allow noflush suspend/resume of a device-mapper device only for the case where the device size is unchanged.

Otherwise, dm-multipath devices can stall when resumed if noflush was used when suspending them, all paths have failed and queue_if_no_path is set.

Explanation:
1. Something is doing fsync() on the block dev, holding inode->i_sem
2. The fsync write is blocked by all-paths-down and queue_if_no_path
3. Someone requests to suspend the dm device with noflush. Pending writes are left in the queue.
4. In the middle of dm_resume(), __bind() tries to get inode->i_sem to do __set_size() and waits forever.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
'noflush suspend' is a new device-mapper feature introduced in early 2.6.20. So I hope the fix can be included before 2.6.20 is released.

Example of reproducer:
1. Create a multipath device by dmsetup
2. Fail all paths during mkfs
3. Do dmsetup suspend --noflush and load a new map with healthy paths
4. Do dmsetup resume

 drivers/md/dm.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Index: linux-2.6.19/drivers/md/dm.c
===================================================================
--- linux-2.6.19.orig/drivers/md/dm.c	2006-12-12 22:02:29.0 +
+++ linux-2.6.19/drivers/md/dm.c	2007-01-05 15:15:53.0 +
@@ -1116,7 +1116,8 @@ static int __bind(struct mapped_device *
 	if (size != get_capacity(md->disk))
 		memset(&md->geometry, 0, sizeof(md->geometry));
 
-	__set_size(md, size);
+	if (md->suspended_bdev)
+		__set_size(md, size);
 
 	if (size == 0)
 		return 0;
@@ -1265,6 +1266,11 @@ int dm_swap_table(struct mapped_device *
 	if (!dm_suspended(md))
 		goto out;
 
+	/* without bdev, the device size cannot be changed */
+	if (!md->suspended_bdev)
+		if (get_capacity(md->disk) != dm_table_get_size(table))
+			goto out;
+
 	__unbind(md);
 	r = __bind(md, table);
 
@@ -1342,11 +1348,14 @@ int dm_suspend(struct mapped_device *md,
 	/* This does not get reverted if there's an error later. */
 	dm_table_presuspend_targets(map);
 
-	md->suspended_bdev = bdget_disk(md->disk, 0);
-	if (!md->suspended_bdev) {
-		DMWARN("bdget failed in dm_suspend");
-		r = -ENOMEM;
-		goto flush_and_out;
+	/* bdget() can stall if the pending I/Os are not flushed */
+	if (!noflush) {
+		md->suspended_bdev = bdget_disk(md->disk, 0);
+		if (!md->suspended_bdev) {
+			DMWARN("bdget failed in dm_suspend");
+			r = -ENOMEM;
+			goto flush_and_out;
+		}
 	}
 
 	/*
@@ -1474,8 +1483,10 @@ int dm_resume(struct mapped_device *md)
 
 	unlock_fs(md);
 
-	bdput(md->suspended_bdev);
-	md->suspended_bdev = NULL;
+	if (md->suspended_bdev) {
+		bdput(md->suspended_bdev);
+		md->suspended_bdev = NULL;
+	}
 
 	clear_bit(DMF_SUSPENDED, &md->flags);