[PATCH] block: Add nr_bios to block_rq_remap tracepoint
Adding the number of bios in a remapped request to the 'block_rq_remap'
tracepoint. The request remapper clones bios in a request to track the
completion status of each bio, so the number of bios can be useful
information for investigation.

Related discussions:
  http://www.redhat.com/archives/dm-devel/2013-August/msg00084.html
  http://www.redhat.com/archives/dm-devel/2013-September/msg00024.html

Signed-off-by: Jun'ichi Nomura
Acked-by: Mike Snitzer
Cc: Jens Axboe

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2fdb4a4..0e6f765 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,17 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq)
 	return blk_queue_get_max_sectors(q, rq->cmd_flags);
 }
 
+static inline unsigned int blk_rq_count_bios(struct request *rq)
+{
+	unsigned int nr_bios = 0;
+	struct bio *bio;
+
+	__rq_for_each_bio(bio, rq)
+		nr_bios++;
+
+	return nr_bios;
+}
+
 /*
  * Request issue related functions.
  */
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 60ae7c3..4c2301d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -618,6 +618,7 @@ TRACE_EVENT(block_rq_remap,
 		__field( unsigned int,	nr_sector	)
 		__field( dev_t,		old_dev		)
 		__field( sector_t,	old_sector	)
+		__field( unsigned int,	nr_bios		)
 		__array( char,		rwbs,	RWBS_LEN)
 	),
 
@@ -627,15 +628,16 @@ TRACE_EVENT(block_rq_remap,
 		__entry->nr_sector	= blk_rq_sectors(rq);
 		__entry->old_dev	= dev;
 		__entry->old_sector	= from;
+		__entry->nr_bios	= blk_rq_count_bios(rq);
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags, blk_rq_bytes(rq));
 	),
 
-	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu",
+	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
 		  (unsigned long long)__entry->sector, __entry->nr_sector,
 		  MAJOR(__entry->old_dev), MINOR(__entry->old_dev),
-		  (unsigned long long)__entry->old_sector)
+		  (unsigned long long)__entry->old_sector, __entry->nr_bios)
 );

 #endif /* _TRACE_BLOCK_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: allow error target to replace either bio-based or request-based targets
Hello Mike,

On 08/23/13 09:17, Mike Snitzer wrote:
>> I do like the idea of a single error target that is hybrid (supports
>> both bio-based and request-based) but the DM core would need to be
>> updated to support this.
>>
>> Specifically, we'd need to check if the device (and active table) is
>> already bio-based or request-based and select the appropriate type. If
>> it is a new device, default to selecting bio-based.
>>
>> There are some wrappers and other logic throughout DM core that will
>> need auditing too.
>
> Here is a patch that should work for your needs (I tested it to work
> with 'dmsetup wipe_table' on both request-based and bio-based devices):

How about moving the default handling in dm_table_set_type()
outside of the for-each-target loop, like the modified patch below?

For example, if a table has 2 targets, hybrid and request_based,
and live_md_type is DM_TYPE_NONE, the table should be considered
request_based, not inconsistent.
Though the end result is the same, as such a table is rejected by
another constraint anyway, I think it's good to keep the semantics
clean and the error messages consistent. I.e. for the above case, the
error message should be "Request-based dm doesn't support multiple
targets yet", not "Inconsistent table: different target types can't be
mixed up".
---
Jun'ichi Nomura, NEC Corporation

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f221812..6e683c8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -860,14 +860,16 @@ EXPORT_SYMBOL(dm_consume_args);
 static int dm_table_set_type(struct dm_table *t)
 {
 	unsigned i;
-	unsigned bio_based = 0, request_based = 0;
+	unsigned bio_based = 0, request_based = 0, hybrid = 0;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
 	struct list_head *devices;
 
 	for (i = 0; i < t->num_targets; i++) {
 		tgt = t->targets + i;
-		if (dm_target_request_based(tgt))
+		if (dm_target_hybrid(tgt))
+			hybrid = 1;
+		else if (dm_target_request_based(tgt))
 			request_based = 1;
 		else
 			bio_based = 1;
@@ -879,6 +881,25 @@ static int dm_table_set_type(struct dm_table *t)
 		}
 	}
 
+	if (hybrid && !bio_based && !request_based) {
+		/*
+		 * The targets can work either way.
+		 * Determine the type from the live device.
+		 */
+		unsigned live_md_type;
+
+		dm_lock_md_type(t->md);
+		live_md_type = dm_get_md_type(t->md);
+		dm_unlock_md_type(t->md);
+		switch (live_md_type) {
+		case DM_TYPE_REQUEST_BASED:
+			request_based = 1;
+			break;
+		default:
+			bio_based = 1;
+			break;
+		}
+	}
+
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 37ba5db..242e3ce 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -131,12 +131,19 @@ static int io_err_map(struct dm_target *tt, struct bio *bio)
 	return -EIO;
 }
 
+static int io_err_map_rq(struct dm_target *ti, struct request *clone,
+			 union map_info *map_context)
+{
+	return -EIO;
+}
+
 static struct target_type error_target = {
 	.name = "error",
-	.version = {1, 1, 0},
+	.version = {1, 2, 0},
 	.ctr  = io_err_ctr,
 	.dtr  = io_err_dtr,
 	.map  = io_err_map,
+	.map_rq = io_err_map_rq,
 };
 
 int __init dm_target_init(void)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 45b97da..8b4c075 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -89,10 +89,21 @@ int dm_setup_md_queue(struct mapped_device *md);
 #define dm_target_is_valid(t) ((t)->table)
 
 /*
+ * To check whether the target type is bio-based or not (request-based).
+ */
+#define dm_target_bio_based(t) ((t)->type->map != NULL)
+
+/*
  * To check whether the target type is request-based or not (bio-based).
  */
 #define dm_target_request_based(t) ((t)->type->map_rq != NULL)
 
+/*
+ * To check whether the target type is a hybrid (capable of being
+ * either request-based or bio-based).
+ */
+#define dm_target_hybrid(t) (dm_target_bio_based(t) && dm_target_request_based(t))
+
 /*-----------------------------------------------------------------
  * A registry of target types.
  *---------------------------------------------------------------*/
[PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Hello Jens,

please consider picking up this patch. Without this patch, the warning
below splats every time a multipath device is created. I got acks from
Vivek and Tejun when I posted this for v3.7 and this same patch is
still applicable to v3.8.

Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Acked-by: Vivek Goyal
Acked-by: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
[PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
With 749fefe677 in v3.7 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Acked-by: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
[PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>> 	struct radix_tree_preload *rtp;
>> 	struct radix_tree_node *node;
>> 	int ret = -ENOMEM;
>>
>> 	preempt_disable();
>>
>> Seems obvious why it explodes..
>
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it
> a try.

Thank you! It just works. This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

 BUG: scheduling while atomic: multipath/2460/0x0002
 1 lock held by multipath/2460:
  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
 Modules linked in: ...
 Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
 Call Trace:
  [] __schedule_bug+0x6a/0x78
  [] __schedule+0xb4/0x5e0
  [] schedule+0x64/0x66
  [] schedule_timeout+0x39/0xf8
  [] ? put_lock_stats+0xe/0x29
  [] ? lock_release_holdtime+0xb6/0xbb
  [] wait_for_common+0x9d/0xee
  [] ? try_to_wake_up+0x206/0x206
  [] ? kfree_call_rcu+0x1c/0x1c
  [] wait_for_completion+0x1d/0x1f
  [] wait_rcu_gp+0x5d/0x7a
  [] ? wait_rcu_gp+0x7a/0x7a
  [] ? complete+0x21/0x53
  [] synchronize_rcu+0x1e/0x20
  [] blk_queue_bypass_start+0x5d/0x62
  [] blkcg_activate_policy+0x73/0x270
  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
  [] cfq_init_queue+0x80/0x28e
  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
  [] elevator_init+0xe1/0x115
  [] ? blk_queue_make_request+0x54/0x59
  [] blk_init_allocated_queue+0x8c/0x9e
  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
  [] table_load+0x1bd/0x2c8 [dm_mod]
  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
  [] ? table_clear+0xaa/0xaa [dm_mod]
  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
  [] do_vfs_ioctl+0x3fb/0x441
  [] ? file_has_perm+0x8a/0x99
  [] sys_ioctl+0x5e/0x82
  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/30/12 04:07, Andi Kleen wrote:
> Theodore Ts'o writes:
>> Note that the problem that we're dealing with is buffered writes; so
>> it's quite possible that the process which wrote the file, thus
>> dirtying the page cache, has already exited; so there's no way we can
>> guarantee we can inform the process which wrote the file via a signal
>> or an error code return.
>
> Is that any different from other IO errors? It doesn't need to
> be better.

IMO, it is different in that the next read from disk will likely succeed
(and read corrupted data).
For IO errors that come from disk failure, the next read will likely
fail again, so we don't have to remember it somewhere.

>> Also, if you're going to keep this state in memory, what happens if
>> the inode gets pushed out of memory?
>
> You lose the error, just like you do today with any other IO error.

--
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
On 10/27/12 05:21, Vivek Goyal wrote:
> On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
>> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>>
>> With 749fefe677 ("block: lift the initial queue bypass mode on
>> blk_register_queue() instead of blk_init_allocated_queue()"),
>> add_disk() eventually calls blk_queue_bypass_end().
>> This change invokes the following warning when multipath is used.
>>
>> BUG: scheduling while atomic: multipath/2460/0x0002
>> 1 lock held by multipath/2460:
>>  #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
>> Modules linked in: ...
>> Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
>> Call Trace:
>>  [] __schedule_bug+0x6a/0x78
>>  [] __schedule+0xb4/0x5e0
>>  [] schedule+0x64/0x66
>>  [] schedule_timeout+0x39/0xf8
>>  [] ? put_lock_stats+0xe/0x29
>>  [] ? lock_release_holdtime+0xb6/0xbb
>>  [] wait_for_common+0x9d/0xee
>>  [] ? try_to_wake_up+0x206/0x206
>>  [] ? kfree_call_rcu+0x1c/0x1c
>>  [] wait_for_completion+0x1d/0x1f
>>  [] wait_rcu_gp+0x5d/0x7a
>>  [] ? wait_rcu_gp+0x7a/0x7a
>>  [] ? complete+0x21/0x53
>>  [] synchronize_rcu+0x1e/0x20
>>  [] blk_queue_bypass_start+0x5d/0x62
>>  [] blkcg_activate_policy+0x73/0x270
>>  [] ? kmem_cache_alloc_node_trace+0xc7/0x108
>>  [] cfq_init_queue+0x80/0x28e
>>  [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>>  [] elevator_init+0xe1/0x115
>>  [] ? blk_queue_make_request+0x54/0x59
>>  [] blk_init_allocated_queue+0x8c/0x9e
>>  [] dm_setup_md_queue+0x36/0xaa [dm_mod]
>>  [] table_load+0x1bd/0x2c8 [dm_mod]
>>  [] ctl_ioctl+0x1d6/0x236 [dm_mod]
>>  [] ? table_clear+0xaa/0xaa [dm_mod]
>>  [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>>  [] do_vfs_ioctl+0x3fb/0x441
>>  [] ? file_has_perm+0x8a/0x99
>>  [] sys_ioctl+0x5e/0x82
>>  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>  [] system_call_fastpath+0x16/0x1b
>>
>> The warning means during queue initialization blk_queue_bypass_start()
>> calls a sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> md->type_lock is a mutex, isn't it? I thought we are allowed to block
> and schedule out under mutex?

Hm, you are right. It's a mutex.
The warning occurs only if I turned on CONFIG_PREEMPT=y.

> add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> So we already have code which can block/wait under md->type_lock. I am
> not sure why should we get this warning under a mutex.

add_disk() is called without md->type_lock.
Call flow is like this:

  dm_create
    alloc_dev
      blk_alloc_queue
      alloc_disk
      add_disk
        blk_queue_bypass_end      [with 3.7-rc2]

  table_load
    dm_lock_md_type               [takes md->type_lock]
    dm_setup_md_queue
      blk_init_allocated_queue    [when DM_TYPE_REQUEST_BASED]
        elevator_init
          blkcg_activate_policy
            blk_queue_bypass_start  <-- THIS triggers the warning
            blk_queue_bypass_end
        blk_queue_bypass_end      [with 3.6]
    dm_unlock_md_type

blk_queue_bypass_start() in blkcg_activate_policy was a nested call,
which did nothing, with 3.6. With 3.7-rc2, it becomes the initial call
and does the actual draining work.

--
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/29/12 19:37, Andi Kleen wrote:
> Theodore Ts'o writes:
>> On Mon, Oct 29, 2012 at 12:16:32PM +1100, Dave Chinner wrote:
>>> Except that there are filesystems that cannot implement such flags,
>>> or require on-disk format changes to add more of those flags. This
>>> is most definitely not a filesystem specific behaviour, so any sort
>>> of VFS level per-file state needs to be kept in xattrs, not special
>>> flags. Filesystems are welcome to optimise the storage of such
>>> special xattrs (e.g. down to a single boolean flag in an inode), but
>>> using a flag for something that could, in fact, store the exact
>>> offset and length of the corruption is far better than just storing
>>> a "something is corrupted in this file" bit
>>
>> Agreed, if we're going to add an xattr, then we might as well store
>
> I don't think an xattr makes sense for this. It's sufficient to keep
> this state in memory.
>
> In general these error paths are hard to test and it's important
> to keep them as simple as possible. Doing IO and other complexities
> just doesn't make sense. Just have the simplest possible path
> that can do the job.

And since it's difficult to prove, I think it's nice to have an option
to panic if the memory error was on dirty page cache.

It's theoretically the same as a disk I/O error; the dirty cache is
marked invalid and the next read will go to disk.
Though in practice, the next read will likely fail if the disk was broken.
(Given that transient errors are usually recovered by retries and
fail-overs in the storage stack and are not visible to applications,
which don't care.)
So it's "consistent" in some sense.

OTOH, in the case of a memory error, the next read will likely succeed,
reading old data from disk. I'm afraid the read-after-write inconsistency
could cause silent data corruption.
-- 
Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On 10/30/12 04:07, Andi Kleen wrote:
> Theodore Ts'o <ty...@mit.edu> writes:
>> Note that the problem that we're dealing with is buffered writes; so
>> it's quite possible that the process which wrote the file, thus
>> dirtying the page cache, has already exited; so there's no way we can
>> guarantee we can inform the process which wrote the file via a signal
>> or an error code return.
>
> Is that any different from other IO errors? It doesn't need to be better.

IMO, it is different in that the next read from disk will likely succeed
(and read corrupted data). For IO errors that come from disk failure, the
next read will likely fail again, so we don't have to remember it somewhere.

>> Also, if you're going to keep this state in memory, what happens if the
>> inode gets pushed out of memory?
>
> You lose the error, just like you do today with any other IO error.

-- 
Jun'ichi Nomura, NEC Corporation
[PATCH] blkcg: fix scheduling while atomic in blk_queue_bypass_start
On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>> 	struct radix_tree_preload *rtp;
>> 	struct radix_tree_node *node;
>> 	int ret = -ENOMEM;
>>
>> 	preempt_disable();
>>
>> Seems obvious why it explodes..
>
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it a try.

Thank you! It just works. This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"), the
following warning appears when multipath is used with CONFIG_PREEMPT=y.
This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

BUG: scheduling while atomic: multipath/2460/0x0002
1 lock held by multipath/2460:
 #0:  (&md->type_lock){..}, at: [a019fb05] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
 [810723ae] __schedule_bug+0x6a/0x78
 [81428ba2] __schedule+0xb4/0x5e0
 [814291e6] schedule+0x64/0x66
 [8142773a] schedule_timeout+0x39/0xf8
 [8108ad5f] ? put_lock_stats+0xe/0x29
 [8108ae30] ? lock_release_holdtime+0xb6/0xbb
 [814289e3] wait_for_common+0x9d/0xee
 [8107526c] ? try_to_wake_up+0x206/0x206
 [810c0eb8] ? kfree_call_rcu+0x1c/0x1c
 [81428aec] wait_for_completion+0x1d/0x1f
 [810611f9] wait_rcu_gp+0x5d/0x7a
 [81061216] ? wait_rcu_gp+0x7a/0x7a
 [8106fb18] ? complete+0x21/0x53
 [810c0556] synchronize_rcu+0x1e/0x20
 [811dd903] blk_queue_bypass_start+0x5d/0x62
 [811ee109] blkcg_activate_policy+0x73/0x270
 [81130521] ? kmem_cache_alloc_node_trace+0xc7/0x108
 [811f04b3] cfq_init_queue+0x80/0x28e
 [a01a1600] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
 [811d8c41] elevator_init+0xe1/0x115
 [811e229f] ? blk_queue_make_request+0x54/0x59
 [811dd743] blk_init_allocated_queue+0x8c/0x9e
 [a019ffcd] dm_setup_md_queue+0x36/0xaa [dm_mod]
 [a01a60e6] table_load+0x1bd/0x2c8 [dm_mod]
 [a01a7026] ctl_ioctl+0x1d6/0x236 [dm_mod]
 [a01a5f29] ? table_clear+0xaa/0xaa [dm_mod]
 [a01a7099] dm_ctl_ioctl+0x13/0x17 [dm_mod]
 [811479fc] do_vfs_ioctl+0x3fb/0x441
 [811b643c] ? file_has_perm+0x8a/0x99
 [81147aa0] sys_ioctl+0x5e/0x82
 [812010be] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [814310d9] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <j-nom...@ce.jp.nec.com>
Cc: Vivek Goyal <vgo...@redhat.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Alasdair G Kergon <a...@redhat.com>
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;

-	preloaded = !radix_tree_preload(GFP_KERNEL);
-	blk_queue_bypass_start(q);
+	blk_queue_bypass_start(q);
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
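As a sanity check of the reordering above, here is a small model of the "no sleeping while preemption is disabled" rule. All names are hypothetical (this is not kernel code): radix_tree_preload() bumps the preempt count, and blk_queue_bypass_start() may sleep in synchronize_rcu(), so the sleeping call has to come first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the preempt counter; not the kernel's. */
static int preempt_count_model;

static void preempt_disable_model(void) { preempt_count_model++; }  /* radix_tree_preload() */
static void preempt_enable_model(void)  { preempt_count_model--; }  /* radix_tree_preload_end() */

/* A sleeping call is legal only while preemption is enabled. */
static bool may_sleep(void)
{
	return preempt_count_model == 0;
}

/* Buggy order: preload first, then the (sleeping) bypass start. */
static bool preload_then_bypass(void)
{
	preempt_disable_model();
	bool ok = may_sleep();		/* blk_queue_bypass_start() -> synchronize_rcu() */
	preempt_enable_model();
	return ok;
}

/* Fixed order, as in the patch: sleep before disabling preemption. */
static bool bypass_then_preload(void)
{
	bool ok = may_sleep();		/* the sleep happens here, preemption still on */
	preempt_disable_model();
	preempt_enable_model();
	return ok;
}
```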
Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
On 10/25/12 18:41, Jun'ichi Nomura wrote:
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
...
> The warning means during queue initialization blk_queue_bypass_start()
> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> dm device initialization basically includes the following 3 steps:
>  1. create ioctl, allocates queue and call add_disk()
>  2. table load ioctl, determines device type and initialize queue
>     if request-based
>  3. resume ioctl, device becomes functional
>
> So it is better to have dm's queue stay in bypass mode until
> the initialization completes in table load ioctl.
>
> The effect of additional blk_queue_bypass_start():
>
> 3.7-rc2 (plain)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>   dmsetup remove a; done
>
> real	0m15.434s
> user	0m0.423s
> sys	0m7.052s
>
> 3.7-rc2 (with this patch)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>   dmsetup remove a; done
>
> real	0m19.766s
> user	0m0.442s
> sys	0m6.861s
>
> If this additional cost is not negligible, we need a variant of add_disk()
> that does not end bypassing.

Or call blk_queue_bypass_start() before add_disk():

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ad02761..d14639b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1868,9 +1868,9 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->queue = md->queue;
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
-	add_disk(md->disk);
 	/* Until md type is determined, put the queue in bypass mode */
 	blk_queue_bypass_start(md->queue);
+	add_disk(md->disk);
 	format_dev_t(md->name, MKDEV(_major, minor));

 	md->wq = alloc_workqueue("kdmflush",
---

If the patch is modified like above, we could fix the issue without
incurring additional cost on dm device creation.

# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m15.684s
user	0m0.404s
sys	0m7.181s

-- 
Jun'ichi Nomura, NEC Corporation
[PATCH 1/2] block: move blk_queue_bypass_{start,end} to include/linux/blkdev.h
[PATCH] block: move blk_queue_bypass_{start,end} to include/linux/blkdev.h

dm wants to use those functions to control the bypass status of a
half-initialized device.

This patch is a preparation for:
  [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 block/blk.h            |    2 --
 include/linux/blkdev.h |    2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index ca51543..48bac5b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -26,8 +26,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 		struct bio *bio);
 int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		struct bio *bio);
-void blk_queue_bypass_start(struct request_queue *q);
-void blk_queue_bypass_end(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1756001..86ba153 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -949,6 +949,8 @@ bool __must_check blk_get_queue(struct request_queue *);
 struct request_queue *blk_alloc_queue(gfp_t);
 struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
+void blk_queue_bypass_start(struct request_queue *q);
+void blk_queue_bypass_end(struct request_queue *q);

 /*
  * blk_plug permits building a queue of related requests by holding the I/O
[PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
[PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
add_disk() eventually calls blk_queue_bypass_end().
This change invokes the following warning when multipath is used.

BUG: scheduling while atomic: multipath/2460/0x0002
1 lock held by multipath/2460:
 #0:  (&md->type_lock){..}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
 [] __schedule_bug+0x6a/0x78
 [] __schedule+0xb4/0x5e0
 [] schedule+0x64/0x66
 [] schedule_timeout+0x39/0xf8
 [] ? put_lock_stats+0xe/0x29
 [] ? lock_release_holdtime+0xb6/0xbb
 [] wait_for_common+0x9d/0xee
 [] ? try_to_wake_up+0x206/0x206
 [] ? kfree_call_rcu+0x1c/0x1c
 [] wait_for_completion+0x1d/0x1f
 [] wait_rcu_gp+0x5d/0x7a
 [] ? wait_rcu_gp+0x7a/0x7a
 [] ? complete+0x21/0x53
 [] synchronize_rcu+0x1e/0x20
 [] blk_queue_bypass_start+0x5d/0x62
 [] blkcg_activate_policy+0x73/0x270
 [] ? kmem_cache_alloc_node_trace+0xc7/0x108
 [] cfq_init_queue+0x80/0x28e
 [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
 [] elevator_init+0xe1/0x115
 [] ? blk_queue_make_request+0x54/0x59
 [] blk_init_allocated_queue+0x8c/0x9e
 [] dm_setup_md_queue+0x36/0xaa [dm_mod]
 [] table_load+0x1bd/0x2c8 [dm_mod]
 [] ctl_ioctl+0x1d6/0x236 [dm_mod]
 [] ? table_clear+0xaa/0xaa [dm_mod]
 [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
 [] do_vfs_ioctl+0x3fb/0x441
 [] ? file_has_perm+0x8a/0x99
 [] sys_ioctl+0x5e/0x82
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x16/0x1b

The warning means during queue initialization blk_queue_bypass_start()
calls sleeping function (synchronize_rcu) while dm holds md->type_lock.

dm device initialization basically includes the following 3 steps:
 1. create ioctl, allocates queue and call add_disk()
 2. table load ioctl, determines device type and initialize queue
    if request-based
 3. resume ioctl, device becomes functional

So it is better to have dm's queue stay in bypass mode until
the initialization completes in table load ioctl.

The effect of additional blk_queue_bypass_start():

3.7-rc2 (plain)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m15.434s
user	0m0.423s
sys	0m7.052s

3.7-rc2 (with this patch)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
  dmsetup remove a; done

real	0m19.766s
user	0m0.442s
sys	0m6.861s

If this additional cost is not negligible, we need a variant of add_disk()
that does not end bypassing.

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Alasdair G Kergon
---
 drivers/md/dm.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 02db918..ad02761 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1869,6 +1869,8 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
 	add_disk(md->disk);
+	/* Until md type is determined, put the queue in bypass mode */
+	blk_queue_bypass_start(md->queue);
 	format_dev_t(md->name, MKDEV(_major, minor));

 	md->wq = alloc_workqueue("kdmflush",
@@ -2172,6 +2174,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 		return 1;

 	/* Fully initialize the queue */
+	WARN_ON(!blk_queue_bypass(md->queue));
 	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
 	if (!q)
 		return 0;
@@ -2198,6 +2201,7 @@ int dm_setup_md_queue(struct mapped_device *md)
 		return -EINVAL;
 	}

+	blk_queue_bypass_end(md->queue);
 	return 0;
 }
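The lifecycle this patch enforces can be sketched with a toy counter (all names hypothetical, not the kernel API): the extra blk_queue_bypass_start() taken in alloc_dev() keeps the queue bypassed after add_disk() ends the initial bypass, until dm_setup_md_queue() lifts it at table-load time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the queue's bypass reference count (made-up names). */
struct q_model { int bypass_depth; };

static void bypass_start(struct q_model *q) { q->bypass_depth++; }
static void bypass_end(struct q_model *q)   { assert(q->bypass_depth > 0); q->bypass_depth--; }
static bool queue_bypassed(const struct q_model *q) { return q->bypass_depth > 0; }

/* Step 1, create ioctl: the queue is allocated in bypass; the patched
 * alloc_dev() takes an extra bypass reference around add_disk(). */
static void alloc_dev_model(struct q_model *q)
{
	q->bypass_depth = 1;	/* blk_alloc_queue(): initial bypass */
	bypass_start(q);	/* patch: hold bypass until md type is known */
	bypass_end(q);		/* add_disk() ends the initial bypass */
}

/* Step 2, table load ioctl: queue init runs while still bypassed,
 * then dm_setup_md_queue() lifts the bypass. */
static void dm_setup_md_queue_model(struct q_model *q)
{
	assert(queue_bypassed(q));	/* the WARN_ON in the patch */
	bypass_end(q);
}
```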
[PATCH] blkcg: stop iteration early if root_rl is the only request list
__blk_queue_next_rl() finds the next request list based on blkg_list
while skipping root_blkg in the list.

OTOH, root_rl is special as it may exist even without root_blkg.
Though the later part of the function handles such a case correctly,
exiting early is good for readability of the code.

Signed-off-by: Jun'ichi Nomura
Cc: Vivek Goyal
Cc: Tejun Heo
Cc: Jens Axboe
---
 block/blk-cgroup.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 54f35d1..a31e678 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -333,6 +333,9 @@ struct request_list *__blk_queue_next_rl(struct request_list *rl,
 	 */
 	if (rl == &q->root_rl) {
 		ent = &q->blkg_list;
+		/* There are no more block groups, hence no request lists */
+		if (list_empty(ent))
+			return NULL;
 	} else {
 		blkg = container_of(rl, struct blkcg_gq, rl);
 		ent = &blkg->q_node;
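The early exit mirrors how list_head works: a circular list is empty exactly when head->next points back at the head, so with no block groups there is nothing to walk. A minimal sketch with toy types (not the kernel's list API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy circular singly-linked list standing in for list_head. */
struct list_node { struct list_node *next; };

static int list_empty_model(const struct list_node *head)
{
	return head->next == head;	/* empty iff next points to itself */
}

/* Mimics the added check: with no entries on the list there is no
 * next request list, so bail out before walking. */
static struct list_node *first_entry_or_null(struct list_node *head)
{
	if (list_empty_model(head))
		return NULL;
	return head->next;
}
```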
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/19/12 23:53, Vivek Goyal wrote:
> On Thu, Oct 18, 2012 at 02:20:53PM -0700, Tejun Heo wrote:
>> Hey, Vivek.
>>
>> On Thu, Oct 18, 2012 at 09:31:49AM -0400, Vivek Goyal wrote:
>>> Tejun, for the sake of readability, are you fine with keeping the original
>>> check and original patch which I had acked.
>>
>> Can you please send another patch to change that? It really isn't a
>> related change and I don't wanna mix the two.
>
> Sure. Jun'ichi, would you like to send that cleanup line in a separate patch?

OK. I will send that patch.

-- 
Jun'ichi Nomura, NEC Corporation
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/17/12 22:47, Vivek Goyal wrote: > On Wed, Oct 17, 2012 at 09:02:22AM +0900, Jun'ichi Nomura wrote: >> On 10/17/12 08:20, Tejun Heo wrote: >>>>>> -if (ent == &q->root_blkg->q_node) >>>>>> +if (q->root_blkg && ent == &q->root_blkg->q_node) >>>>> >>>>> Can we fix it little differently. Little earlier in the code, we check for >>>>> if q->blkg_list is empty, then all the groups are gone, and there are >>>>> no more request lists hence and return NULL. >>>>> >>>>> Current code: >>>>> if (rl == &q->root_rl) { >>>>> ent = &q->blkg_list; >>>>> >>>>> Modified code: >>>>> if (rl == &q->root_rl) { >>>>> ent = &q->blkg_list; >>>>> /* There are no more block groups, hence no request lists */ >>>>> if (list_empty(ent)) >>>>> return NULL; >>>>> } >>> >>> Do we need this at all? q->root_blkg being NULL is completely fine >>> there and the comparison would work as expected, no? >> >> Hmm? >> >> If list_empty(ent) and q->root_blkg == NULL, >> >>> /* walk to the next list_head, skip root blkcg */ >>> ent = ent->next; >> >> ent is &q->blkg_list again. >> >>> if (ent == &q->root_blkg->q_node) >> >> So ent is not &q->root_blkg->q_node. > > If q->root_blkg is NULL, will it not lead to NULL pointer dereference. > (q->root_blkg->q_node). It's not dereferenced. >>> ent = ent->next; >>> if (ent == &q->blkg_list) >>> return NULL; >> >> And we return NULL here. >> >> Ah, yes. You are correct. >> We can do without the above hunk. > > I would rather prefer to check for this boundary condition early and > return instead of letting it fall through all these conditions and > then figure out yes we have no next rl. IMO, code becomes easier to > understand if nothing else. Otherwise one needs a step by step > explanation as above to show that case of q->root_blkg is covered. I have the same opinion as yours that it's good for readability. 
-- Jun'ichi Nomura, NEC Corporation
[PATCH] blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg
blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 This patch clears q->root_blkg and q->root_rl.blkg when root blkg is destroyed. Signed-off-by: Jun'ichi Nomura Acked-by: Vivek Goyal Cc: Tejun Heo Cc: Jens Axboe --- v3: Removed a hunk for NULL-check of q->root_blkg in __blk_queue_next_rl(). 
Current code can handle the case without the change. v2: Added comments in code based on Vivek's suggestion. block/blk-cgroup.c | 7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index f3b44a6..54f35d1 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -285,6 +285,13 @@ static void blkg_destroy_all(struct request_queue *q) blkg_destroy(blkg); spin_unlock(&blkcg->lock); } + + /* + * root blkg is destroyed. Just clear the pointer since + * root_rl does not take reference on root blkg. + */ + q->root_blkg = NULL; + q->root_rl.blkg = NULL; } static void blkg_rcu_free(struct rcu_head *rcu_head)
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
On 10/17/12 08:20, Tejun Heo wrote: >>>> - if (ent == &q->root_blkg->q_node) >>>> + if (q->root_blkg && ent == &q->root_blkg->q_node) >>> >>> Can we fix it little differently. Little earlier in the code, we check for >>> if q->blkg_list is empty, then all the groups are gone, and there are >>> no more request lists hence and return NULL. >>> >>> Current code: >>> if (rl == &q->root_rl) { >>> ent = &q->blkg_list; >>> >>> Modified code: >>> if (rl == &q->root_rl) { >>> ent = &q->blkg_list; >>> /* There are no more block groups, hence no request lists */ >>> if (list_empty(ent)) >>> return NULL; >>> } > > Do we need this at all? q->root_blkg being NULL is completely fine > there and the comparison would work as expected, no? Hmm? If list_empty(ent) and q->root_blkg == NULL, > /* walk to the next list_head, skip root blkcg */ > ent = ent->next; ent is &q->blkg_list again. > if (ent == &q->root_blkg->q_node) So ent is not &q->root_blkg->q_node. > ent = ent->next; > if (ent == &q->blkg_list) > return NULL; And we return NULL here. Ah, yes. You are correct. We can do without the above hunk. -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
Hi Vivek, thank you for the comments. On 10/11/12 00:59, Vivek Goyal wrote: > I think patch looks reasonable to me. Just that some more description > would be nice. In fact, I will prefer some code comments too as I > had to scratch my head for a while to figure out how did we reach here. > > So looks like we deactivated cfq policy (most likely changed IO > scheduler). That will destroy all the block groups (disconnect blkg > from list and drop policy reference on group). If there are any pending > IOs, then group will not be destroyed till IO is completed. (Because > of cfqq reference on blkg and because of request list reference on > blkg). > > Now, all request lists take a reference on associated blkg except > q->root_rl. This means when last IO finished, it must have dropped > the reference on cfqq which will drop reference on associated cfqg/blkg > and immediately root blkg will be destroyed. And now we will call > blk_put_rl() and that will try to access root_rl->blkg which has > been just freed as last IO completed. Yes, and for completion of any new IOs, blk_put_rl() is misled. I'll try to extend the description according to your comments. > > So problem here is that we don't take request list reference on > root blkg and that creates all these corner cases. > > So clearing q->root_blkg and q->root_rl.blkg during policy activation > makes sense. That means that from queue and request list point of view > root blkg is gone and you can't get to it. (It might still be around for > some more time due to pending IOs though). > > Some minor comments below. 
> >> >> Signed-off-by: Jun'ichi Nomura >> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c >> index f3b44a6..5015764 100644 >> --- a/block/blk-cgroup.c >> +++ b/block/blk-cgroup.c >> @@ -285,6 +285,9 @@ static void blkg_destroy_all(struct request_queue *q) >> blkg_destroy(blkg); >> spin_unlock(&blkcg->lock); >> } >> + >> +q->root_blkg = NULL; >> +q->root_rl.blkg = NULL; > > I think some of the above description about we not taking root_rl > reference on root group can go here so that next time I don't have > to scratch my head for a long time. I put the following comment: /* * root blkg is destroyed. Just clear the pointer since * root_rl does not take reference on root blkg. */ > >> } >> >> static void blkg_rcu_free(struct rcu_head *rcu_head) >> @@ -333,7 +336,7 @@ struct request_list *__blk_queue_next_rl(struct >> request_list *rl, >> >> /* walk to the next list_head, skip root blkcg */ >> ent = ent->next; >> -if (ent == &q->root_blkg->q_node) >> +if (q->root_blkg && ent == &q->root_blkg->q_node) > Can we fix it little differently. Little earlier in the code, we check for > if q->blkg_list is empty, then all the groups are gone, and there are > no more request lists hence and return NULL. > > Current code: > if (rl == &q->root_rl) { > ent = &q->blkg_list; > > Modified code: > if (rl == &q->root_rl) { > ent = &q->blkg_list; > /* There are no more block groups, hence no request lists */ > if (list_empty(ent)) > return NULL; > } OK. I changed that. Below is the updated version of the patch. == blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. 
For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 >
[PATCH] Fix use-after-free of q->root_blkg and q->root_rl.blkg
I got system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 blk_put_rl() does this: if (rl->blkg && rl->blkg->blkcg != &blkcg_root) blkg_put(rl->blkg); but if rl is q->root_rl, rl->blkg might be a bogus pointer because blkcg_deactivate_policy() does not clear q->root_rl.blkg after blkg_destroy_all(). Attached patch works for me. 
Signed-off-by: Jun'ichi Nomura diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index f3b44a6..5015764 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -285,6 +285,9 @@ static void blkg_destroy_all(struct request_queue *q) blkg_destroy(blkg); spin_unlock(&blkcg->lock); } + + q->root_blkg = NULL; + q->root_rl.blkg = NULL; } static void blkg_rcu_free(struct rcu_head *rcu_head) @@ -333,7 +336,7 @@ struct request_list *__blk_queue_next_rl(struct request_list *rl, /* walk to the next list_head, skip root blkcg */ ent = ent->next; - if (ent == &q->root_blkg->q_node) + if (q->root_blkg && ent == &q->root_blkg->q_node) ent = ent->next; if (ent == &q->blkg_list) return NULL;
Re: [dm-devel] [PATCH v8 3/8] dm: Use bioset's front_pad for dm_rq_clone_bio_info
On 09/06/12 05:27, Kent Overstreet wrote: > @@ -2718,7 +2705,8 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned > type, unsigned integrity) > if (!pools->tio_pool) > goto free_io_pool_and_out; > > - pools->bs = bioset_create(pool_size, 0); > + pools->bs = bioset_create(pool_size, > + offsetof(struct dm_rq_clone_bio_info, clone)); > if (!pools->bs) > goto free_tio_pool_and_out; front_pad is not necessary if type is DM_TYPE_BIO_BASED. Other pool creations in that function do something like: pools->bs = (type == DM_TYPE_BIO_BASED) ? bioset_create(pool_size, 0) : bioset_create(pool_size, offsetof(struct dm_rq_clone_bio_info, clone)); -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On 08/29/12 11:59, Dave Chinner wrote: > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: >> And yes, I understand it's ideal, but many applications choose not to >> do that for performance reason. >> So I think it's helpful if we can surely report to such applications. I suspect "performance vs. integrity" is not a correct description of the problem. > If performance is chosen over data integrity, we are under no > obligation to keep the error around indefinitely. Fundamentally, > ensuring a write completes successfully is the responsibility of the > application, not the kernel. There are so many different filesystem > and storage errors that can be lost right now because data is not > fsync()d, adding another one to them really doesn't change anything. > IOWs, a memory error is no different to a disk failing or the system > crashing when it comes to data integrity. If you care, you use > fsync(). I agree that applications should use fsync() or O_SYNC when they want to make sure the written data is on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone does: $ patch -p1 < ../a.patch $ tar cf . > ../a.tar and a disk failure occurred between "patch" and "tar", "tar" will either see up-to-date data or an I/O error. OTOH, if the failure was detected on a dirty pagecache, the current memory failure handler invalidates the dirty page and the "tar" command will re-read old contents from disk without error. (Well, the failures above are permanent failures. IOW, the current memory failure handler turns a permanent failure into a transient error, which is often more difficult to handle, I think.) Naoya's patch will keep the failure information and allows the reader to get an I/O error when it reads from the broken pagecache. 
-- Jun'ichi Nomura, NEC Corporation
Re: [PATCH v5 02/12] dm: Use bioset's front_pad for dm_rq_clone_bio_info
On 08/07/12 07:08, Kent Overstreet wrote: > struct dm_rq_clone_bio_info { > struct bio *orig; > struct dm_rq_target_io *tio; > + struct bio clone; > }; ... > - pools->bs = bioset_create(pool_size, 0); > + pools->bs = bioset_create(pool_size, > + offsetof(struct dm_rq_clone_bio_info, orig)); > if (!pools->bs) > goto free_tio_pool_and_out; Shouldn't this be offsetof(struct dm_rq_clone_bio_info, clone)? -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache
On 08/11/12 08:09, Andi Kleen wrote: > Naoya Horiguchi writes: > >> Current memory error handling on dirty pagecache has a bug that user >> processes who use corrupted pages via read() or write() can't be aware >> of the memory error and result in discarding dirty data silently. >> >> The following patch is to improve handling/reporting memory errors on >> this case, but as a short term solution I suggest that we should undo >> the present error handling code and just leave errors for such cases >> (which expect the 2nd MCE to panic the system) to ensure data consistency. > > Not sure that's the right approach. It's not worse than any other IO > errors isn't it? IMO, it's worse in certain cases. For example, a producer-consumer type program which uses a file as temporary storage. Current memory-failure.c drops produced data from dirty pagecache and allows the reader to consume old or empty data from disk (silently!), which is what I think HWPOISON should prevent. A similar thing could happen theoretically with disk I/O errors, though practically those errors are often persistent and the reader will likely get errors again instead of bad data. Also, ext3/ext4 has an option to panic when an error is detected, for people who want to avoid corruption on intermittent errors. -- Jun'ichi Nomura, NEC Corporation
Re: [PATCH] PCI: Remove default PCI expansion ROM memory allocation
Gary Hade wrote: > Remove default PCI expansion ROM memory allocation Thank you Gary for the new patch. By not allocating resources for expansion ROM, this patch will eliminate the problem that my patch masked, too: http://lkml.org/lkml/2007/12/4/284 Regards, -- Jun'ichi Nomura, NEC Corporation of America
Re: [PATCH] pci: Omit error message for benign allocation failure
Gary Hade wrote: > On Tue, Dec 04, 2007 at 06:23:32PM -0500, Jun'ichi Nomura wrote: >> Kernel always tries to. But it's best effort basis, IMO. >> (Maybe your patch is going to fix that?) > > If you are seeing the allocation failures both with and > without my original patch I doubt that an improved version > would work. In my case, the BIOS had allowed sufficient > resource for the expansion ROMs but was expecting the kernel > to get it from the non-prefetch instead of prefetch window. OK. Then it's different from mine. Thanks for the info. Regards, -- Jun'ichi Nomura, NEC Corporation of America
Re: [PATCH] pci: Omit error message for benign allocation failure
Hi Gary, Gary Hade wrote: > On Tue, Dec 04, 2007 at 02:35:48PM -0500, Jun'ichi Nomura wrote: >> On a system with PCI-to-PCI bridges, following errors are observed: >> >> PCI: Failed to allocate mem resource #8:[EMAIL PROTECTED] for :02:00.0 >> PCI: Failed to allocate mem resource #6:[EMAIL PROTECTED] for :03:01.0 >> >> '#6' is for expansion ROM and '#8' for the bridge where the device >> with the expansion ROM is connected. > > I believe there is a good chance that may be another instance > of the regression caused by my "Avoid creating P2P prefetch > window for expansion ROMs" patch that was recently reported by > Jan Beulich. > http://marc.info/?l=linux-kernel&m=11981103023&w=2 Sorry, I was not clear about that. I've seen the above errors even with 2.6.22 and RHEL5, which is based on 2.6.18. So it's not a regression. > I am working on a better fix for the problem that the patch > was attempting to address but this is turning out to be much > more difficult than I expected. If I don't have a solution > very soon I plan to publish a revert patch. > >> But I think the failure is benign because the allocation is >> not necessary for these resources. > > This is an interesting idea. Could you elaborate? As far > as I can tell, the kernel always tries to allocate memory > for expansion ROMs which it also exports to user level. The kernel always tries to, but on a best-effort basis, IMO. (Maybe your patch is going to fix that?) In the 1st stage (pcibios_allocate_resources, etc.), it allocates resources based on the PCI configuration provided by BIOS, except for expansion ROMs. In the 2nd stage (pci_assign_unassigned_resources, etc.), it allocates resources for unallocated ones including expansion ROMs. The 2nd stage doesn't reprogram the bridge settings. So if the expansion ROM is under a bridge where other resources are already allocated, the allocation failure occurs. 
There are comments in drivers/pci/setup-bus.c: /* Helper function for sizing routines: find first available bus resource of a given type. Note: we intentionally skip the bus resources which have already been assigned (that is, have non-NULL parent resource). */ Fixing the above might be a better solution. But I don't know if there are any users who need it. > I have assumed that some drivers or user level apps may > need to access this space. Is this not true? I haven't heard of applications which depend on the kernel resource allocation for expansion ROMs. X is a big user of expansion ROMs but I heard it solves the resource conflict itself. If there is a counter example, I would like to know it. Thanks, -- Jun'ichi Nomura, NEC Corporation of America
[PATCH] pci: Omit error message for benign allocation failure
Hi, On a system with PCI-to-PCI bridges, following errors are observed: PCI: Failed to allocate mem resource #8:[EMAIL PROTECTED] for :02:00.0 PCI: Failed to allocate mem resource #6:[EMAIL PROTECTED] for :03:01.0 '#6' is for expansion ROM and '#8' for the bridge where the device with the expansion ROM is connected. But I think the failure is benign because the allocation is not necessary for these resources. The allocation failure message is informative when the failure is really a problem and needs a diagnosis. However, if the resource is expansion ROMs, the message is just confusing. This patch omits the error message if the resource is an expansion ROM or a bridge. Thanks, -- Jun'ichi Nomura, NEC Corporation of America Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]> --- linux-2.6.24-rc3/drivers/pci/setup-res.c.orig 2007-12-04 00:24:11.0 -0500 +++ linux-2.6.24-rc3/drivers/pci/setup-res.c 2007-12-04 00:42:14.0 -0500 @@ -158,11 +158,12 @@ int pci_assign_resource(struct pci_dev * } if (ret) { - printk(KERN_ERR "PCI: Failed to allocate %s resource " - "#%d:[EMAIL PROTECTED] for %s\n", - res->flags & IORESOURCE_IO ? "I/O" : "mem", - resno, (unsigned long long)size, - (unsigned long long)res->start, pci_name(dev)); + if (resno < PCI_ROM_RESOURCE) + printk(KERN_ERR "PCI: Failed to allocate %s resource " + "#%d:[EMAIL PROTECTED] for %s\n", + res->flags & IORESOURCE_IO ? "I/O" : "mem", + resno, (unsigned long long)size, + (unsigned long long)res->start, pci_name(dev)); } else if (resno < PCI_BRIDGE_RESOURCES) { pci_update_resource(dev, res, resno); }
[PATCH] dm: Fix panic on shrinking device size
Hi, This patch fixes a panic on shrinking DM device, due to reference outside of the DM table. The problem occurs only if there is outstanding I/O to the removed part of the device. The bug is in that __clone_and_map() assumes dm_table_find_target() always returns a valid pointer. It may fail if a BIO is sent from the block layer but its target sector is already not in the DM table. This patch tidies up dm_table_find_target() to return NULL instead of bogus pointer for failure case. Then __clone_and_map() can check the return value of it and make the BIO return with -EIO. target_message() in dm-ioctl.c, the only other user of dm_table_find_target(), is modified to use the return value of it, too. Sample test script to trigger oops: -- #!/bin/bash FILE=$(mktemp) LODEV=$(losetup -f) MAP=$(basename ${FILE}) SIZE=4M dd if=/dev/zero of=${FILE} bs=${SIZE} count=1 losetup ${LODEV} ${FILE} echo "0 $(blockdev --getsz ${LODEV}) linear ${LODEV} 0" |dmsetup create ${MAP} dmsetup suspend ${MAP} echo "0 1 linear ${LODEV} 0" |dmsetup load ${MAP} dd if=/dev/zero of=/dev/mapper/${MAP} bs=${SIZE} count=1 & echo "Wait til dd push some I/O" sleep 5 dmsetup resume ${MAP} ------ Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]> --- dm-ioctl.c | 10 +++--- dm-table.c |3 +++ dm.c | 24 ++-- 3 files changed, 24 insertions(+), 13 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -663,13 +663,19 @@ static struct bio *clone_bio(struct bio return clone; } -static void __clone_and_map(struct clone_info *ci) +static int __clone_and_map(struct clone_info *ci) { struct bio *clone, *bio = ci->bio; - struct dm_target *ti = dm_table_find_target(ci->map, ci->sector); - sector_t len = 0, max = max_io_len(ci->md, ci->sector, ti); + struct dm_target *ti; + sector_t len = 0, max; struct dm_target_io *tio; + ti = dm_table_find_target(ci->map, ci->sector); + if (!ti) + return -EIO; + + max = max_io_len(ci->md, 
ci->sector, ti); + /* * Allocate a target io object. */ @@ -727,6 +733,9 @@ static void __clone_and_map(struct clone do { if (offset) { ti = dm_table_find_target(ci->map, ci->sector); + if (!ti) + return -EIO; + max = max_io_len(ci->md, ci->sector, ti); tio = alloc_tio(ci->md); @@ -750,6 +759,8 @@ static void __clone_and_map(struct clone ci->idx++; } + + return 0; } /* @@ -758,6 +769,7 @@ static void __clone_and_map(struct clone static void __split_bio(struct mapped_device *md, struct bio *bio) { struct clone_info ci; + int error = 0; ci.map = dm_get_table(md); if (!ci.map) { @@ -777,11 +789,11 @@ static void __split_bio(struct mapped_de ci.idx = bio->bi_idx; start_io_acct(ci.io); - while (ci.sector_count) - __clone_and_map(&ci); + while (ci.sector_count && !error) + error = __clone_and_map(&ci); /* drop the extra reference count */ - dec_pending(ci.io, 0); + dec_pending(ci.io, error); dm_table_put(ci.map); } /*- Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -1250,21 +1250,17 @@ static int target_message(struct dm_ioct if (!table) goto out_argv; - if (tmsg->sector >= dm_table_get_size(table)) { + ti = dm_table_find_target(table, tmsg->sector); + if (!ti) { DMWARN("Target message sector outside device."); r = -EINVAL; - goto out_table; - } - - ti = dm_table_find_target(table, tmsg->sector); - if (ti->type->message) + } else if (ti->type->message) r = ti->type->message(ti, argc, argv); else { DMWARN("Target type does not support messages"); r = -EINVAL; } - out_table: dm_table_put(table); out_argv: kfree(argv); Index: linux-2.6.23.work/drivers/md/dm-table.c === --- linux-2.6.23.work.orig/drivers/md/dm-tab
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair, Jun'ichi Nomura wrote: > So as far as I understand, the point is: > 1. it's preferable to resize inode after the resume, if possible Attached is a modified version of the (2/3) patch. - The logic is basically the same as before. - Moved the set-size function outside of the table swapping. - Moved the uevent to the end of resize from the end of resume. - Lightly tested on LVM2 mirror resizing. - (1/3) and (3/3) are unchanged. What do you think about this? --- drivers/md/dm-ioctl.c |4 ++ drivers/md/dm.c | 61 ++ include/linux/device-mapper.h |5 +++ 3 files changed, 46 insertions(+), 24 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -1099,31 +1099,11 @@ static void event_callback(void *context wake_up(&md->eventq); } -static void __set_size(struct mapped_device *md, sector_t size) -{ - set_capacity(md->disk, size); - - mutex_lock(&md->suspended_bdev->bd_inode->i_mutex); - i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); - mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex); -} - static int __bind(struct mapped_device *md, struct dm_table *t) { struct request_queue *q = md->queue; - sector_t size; - - size = dm_table_get_size(t); - - /* -* Wipe any geometry if the size of the table changed. 
-*/ - if (size != get_capacity(md->disk)) - memset(&md->geometry, 0, sizeof(md->geometry)); - if (md->suspended_bdev) - __set_size(md, size); - if (size == 0) + if (!dm_table_get_size(t)) return 0; dm_table_get(t); @@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md) dm_table_unplug_all(map); - kobject_uevent(&md->disk->kobj, KOBJ_CHANGE); - r = 0; out: @@ -1508,6 +1486,43 @@ out: return r; } +void dm_set_size(struct mapped_device *md) +{ + struct dm_table *map; + struct block_device *bdev; + sector_t size; + + down(&md->suspend_lock); + + map = dm_get_table(md); + if (!map) + goto out; + + size = dm_table_get_size(map); + + /* +* Wipe any geometry if the size of the table changed. +*/ + if (size != get_capacity(md->disk)) + memset(&md->geometry, 0, sizeof(md->geometry)); + + set_capacity(md->disk, size); + + bdev = bdlookup_disk(md->disk, 0); + if (bdev) { + mutex_lock(&bdev->bd_mutex); + i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); + mutex_unlock(&bdev->bd_mutex); + bdput(bdev); + } + + kobject_uevent(&md->disk->kobj, KOBJ_CHANGE); + +out: + dm_table_put(map); + up(&md->suspend_lock); +} + /*- * Event notification. *---*/ Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa if (dm_suspended(md)) r = dm_resume(md); - if (!r) + if (!r) { + dm_set_size(md); r = __dev_status(md, param); + } dm_put(md); return r; Index: linux-2.6.23.work/include/linux/device-mapper.h === --- linux-2.6.23.work.orig/include/linux/device-mapper.h +++ linux-2.6.23.work/include/linux/device-mapper.h @@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo); int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo); +/* + * Update device size + */ +void dm_set_size(struct mapped_device *md); + /*- * Functions for manipulating device-mapper tables. 
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair,

Thanks for the immediate reply. So as far as I understand, the point is:
  1. it's preferable to resize the inode after the resume, if possible
  2. it's necessary to have a nowait version of bdget() to avoid a stall

For the 1st item, I guess the reason is that we want to make the table-swapping code deterministic. Is that correct? Even if we don't have to wait for bdget(), would you like to have the resizing code after the resume?

For the 2nd item, the patch included bdlookup() for that purpose. What do you think about it?

I added some additional comments and questions below.

Alasdair G Kergon wrote:
> On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote:
>> There is no guarantee that I/O will flow through the device again.
>> The table might need to be replaced again, but to do that, the resume
>> should have been completed to let userspace know it.
>
> Then the first attempt to set the size could be made to fail
> (because it could not get the lock immediately) and the
> size could be set after the second resume instead.
>
> - Setting the size would lag behind the actual size the dm table was
>   supporting, but (given the usage cases discussed) this would not matter.
>
>> bdget() in noflush suspend has a possibility of stall.
>
> So we cannot avoid fixing that: we require immediate return
> with failure instead of waiting.

bdlookup() is an almost-nowait version of bdget() ("almost" means it may wait for I_NEW to be turned off if there is a bdget() doing its latter half, but it never stalls). Do you think we need something else?

>> OTOH, calling bdget() and i_size_write() outside of the lock
>> can cause a race with other table swapping and may result in setting
>> the wrong device size.
>
> If the size setting is removed from the lock, then it becomes
> "set the inode size to match the current size of the table" and
> races would not matter - each "set size" attempt would set it
> to the instantaneous live table size, not a cached value that
> could be out-of-date.

Even so, I think we still want set_capacity() during the table swapping, so we need to call dm_table_get_size() twice. Is that OK?

Also, do you want to move the resizing code for normal suspend, too? Otherwise the code will be duplicated, and I don't think you would like that.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair,

Alasdair G Kergon wrote:
> Before reviewing the details of the proposed workaround, I'd like to see
> a deeper analysis of the problem to see that there isn't a cleaner way
> to resolve this.

OK, let me try.

> For example:
>
> Question) What are the realistic situations we must support that lead to
> a resize during table reload with I/O outstanding?

'noflush' is currently the only option for userspace to suspend without risking otherwise-avoidable I/O loss, because failure of the underlying device might occur *after* userspace started a normal suspend but *before* the flushing completes. (normal suspend = flushing suspend)

E.g. if userspace wants to resize a dm-mpath device with queue_if_no_path, it has to use 'noflush' suspend for the table swapping. Otherwise, path failures during the suspend will cause I/O errors even though queue_if_no_path is specified.

Another possible solution would be allowing the suspend to fail if it can't flush all I/O, and letting userspace retry after fixing the device failure.

> - The resize is the purpose of the reload; noflush is only set to avoid losing
>   I/O if a path should fail. So any outstanding I/O may be expected to be
>   consistent with both the old and new sizes of the device. E.g. If it's
>   beyond the end of a shrinking device and userspace cared about not
>   losing that I/O, it would have waited for that I/O to be flushed
>   *before* issuing the resize.

If userspace cares about not losing I/O, it should wait for the I/O before trying to shrink the device. After the shrinking has started, any I/O beyond the new end of the device could be lost. Reducing the size of a device while actively running I/O on it always carries the possibility of losing some of that I/O.

>   If the I/O is beyond the end of the
>   existing device but within the new size, userspace would have waited for
>   the resize operation to complete before allowing the new I/O to be
>   issued.

Issuing I/O beyond the end of the existing device would get an error. If the issuer knows the device will be extended, it should wait for the completion of the extension. If it doesn't know, such I/O won't be issued.

> => Is it OK for device-mapper to handle the device size check
>    internally, rejecting any I/O that falls beyond the end of the table (it
>    already must do this lookup anyway),

I think this check is needed in the current dm, regardless of noflush.

>    and to update the size recorded in
>    the inode later, after I/O is flowing through the device again, but (of
>    course) before reporting that the resize operation is complete?
>    I.e. does it eliminate deadlocks if the bdget() and i_size_write()
>    happen after the 'resume'?

There is no guarantee that I/O will flow through the device again. The table might need to be replaced again, but to do that, the resume should have been completed to let userspace know it.

bdget() in noflush suspend has a possibility of stall. Once it stalls, the remedy is for userspace to replace the table with a working one. However, since bdget() occurs inside suspend_lock, it's not possible to run another instance of suspend/resume. OTOH, calling bdget() and i_size_write() outside of the lock can cause a race with other table swapping and may result in setting the wrong device size.

Also, bdget() allocates a new bdev inode if there isn't one, but dm wants to access the bdev inode only when it already exists in memory. So something like bdlookup() will fit the purpose, IMO.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair,

Jun'ichi Nomura wrote:
> So as far as I understand, the point is:
> 1. it's preferable to resize the inode after the resume, if possible

Attached is a modified version of the (2/3) patch.
- The logic is basically the same as before.
- Moved the set-size function outside of the table swapping.
- Moved the uevent from the end of resume to the end of resize.
- Lightly tested on LVM2 mirror resizing.
- (1/3) and (3/3) are unchanged.

What do you think about this?

---
 drivers/md/dm-ioctl.c         |    4 ++
 drivers/md/dm.c               |   61 ++
 include/linux/device-mapper.h |    5 +++
 3 files changed, 46 insertions(+), 24 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1099,31 +1099,11 @@ static void event_callback(void *context
 	wake_up(&md->eventq);
 }
 
-static void __set_size(struct mapped_device *md, sector_t size)
-{
-	set_capacity(md->disk, size);
-
-	mutex_lock(&md->suspended_bdev->bd_inode->i_mutex);
-	i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
-	mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex);
-}
-
 static int __bind(struct mapped_device *md, struct dm_table *t)
 {
 	struct request_queue *q = md->queue;
-	sector_t size;
-
-	size = dm_table_get_size(t);
-
-	/*
-	 * Wipe any geometry if the size of the table changed.
-	 */
-	if (size != get_capacity(md->disk))
-		memset(&md->geometry, 0, sizeof(md->geometry));
-
-	if (md->suspended_bdev)
-		__set_size(md, size);
-	if (size == 0)
+	if (!dm_table_get_size(t))
 		return 0;
 
 	dm_table_get(t);
@@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md)
 
 	dm_table_unplug_all(map);
 
-	kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
-
 	r = 0;
 
 out:
@@ -1508,6 +1486,43 @@ out:
 	return r;
 }
 
+void dm_set_size(struct mapped_device *md)
+{
+	struct dm_table *map;
+	struct block_device *bdev;
+	sector_t size;
+
+	down(&md->suspend_lock);
+
+	map = dm_get_table(md);
+	if (!map)
+		goto out;
+
+	size = dm_table_get_size(map);
+
+	/*
+	 * Wipe any geometry if the size of the table changed.
+	 */
+	if (size != get_capacity(md->disk))
+		memset(&md->geometry, 0, sizeof(md->geometry));
+
+	set_capacity(md->disk, size);
+
+	bdev = bdlookup_disk(md->disk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_mutex);
+		i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
+		mutex_unlock(&bdev->bd_mutex);
+		bdput(bdev);
+	}
+
+	kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
+
+out:
+	dm_table_put(map);
+	up(&md->suspend_lock);
+}
+
 /*-----------------------------------------------------------------
  * Event notification.
  *---------------------------------------------------------------*/

Index: linux-2.6.23.work/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.23.work/drivers/md/dm-ioctl.c
@@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa
 	if (dm_suspended(md))
 		r = dm_resume(md);
 
-	if (!r)
+	if (!r) {
+		dm_set_size(md);
 		r = __dev_status(md, param);
+	}
 
 	dm_put(md);
 	return r;

Index: linux-2.6.23.work/include/linux/device-mapper.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/device-mapper.h
+++ linux-2.6.23.work/include/linux/device-mapper.h
@@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ
 int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo);
 int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo);
 
+/*
+ * Update device size
+ */
+void dm_set_size(struct mapped_device *md);
+
 /*-----------------------------------------------------------------
  * Functions for manipulating device-mapper tables.
[PATCH] dm: noflush resizing (3/3) Allow table swapping with different size
This patch removes the check for whether the loaded table is going to change the device size and allows resizing of a dm device suspended with 'noflush'.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 dm.c |    5 -
 1 file changed, 5 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1288,11 +1288,6 @@ int dm_swap_table(struct mapped_device *
 	if (!dm_suspended(md))
 		goto out;
 
-	/* without bdev, the device size cannot be changed */
-	if (!md->suspended_bdev)
-		if (get_capacity(md->disk) != dm_table_get_size(table))
-			goto out;
-
 	__unbind(md);
 	r = __bind(md, table);
[PATCH] dm: noflush resizing (1/3) Add bdlookup()
This patch adds bdlookup() and bdlookup_disk(). They are lookup-only variants of bdget()/bdget_disk(). The functions can be used when:
- the caller wants to get the bdev only if somebody already got it (i.e. there is already a bd_inode allocated in memory), and
- the race between bdlookup() and bdget() is acceptable.

The patch is a preparation for the 2nd part of this patchset.

Background: an inode is initialized with I_LOCK and I_NEW turned on. After initialization, both flags are turned off. bdget() uses iget5_locked(), which waits until I_LOCK is removed. That could take very long, or forever.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 fs/block_dev.c        |   27 +++
 include/linux/fs.h    |    4 +++-
 include/linux/genhd.h |    6 ++
 3 files changed, 36 insertions(+), 1 deletion(-)

Index: linux-2.6.23.work/fs/block_dev.c
===================================================================
--- linux-2.6.23.work.orig/fs/block_dev.c
+++ linux-2.6.23.work/fs/block_dev.c
@@ -586,6 +586,33 @@ struct block_device *bdget(dev_t dev)
 
 EXPORT_SYMBOL(bdget);
 
+/*
+ * Variant of bdget(), which returns bdev only when it is already gotten.
+ *
+ * The function can be used when:
+ *   - the caller wants to get bdev only if somebody already got it
+ *     (i.e. there is already bd_inode allocated in memory)
+ *   and
+ *   - the race between bdlookup() and bdget() is acceptable
+ */
+struct block_device *bdlookup(dev_t dev)
+{
+	struct inode *inode;
+
+	might_sleep();
+
+	/* Use _nowait because we don't want to wait for I_LOCK */
+	inode = ilookup5_nowait(bd_mnt->mnt_sb, hash(dev), bdev_test, &dev);
+
+	if (!inode)
+		return NULL;
+
+	wait_on_bit(&inode->i_state, __I_NEW, inode_wait, TASK_UNINTERRUPTIBLE);
+
+	return &BDEV_I(inode)->bdev;
+}
+EXPORT_SYMBOL(bdlookup);
+
 long nr_blockdev_pages(void)
 {
 	struct block_device *bdev;

Index: linux-2.6.23.work/include/linux/fs.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/fs.h
+++ linux-2.6.23.work/include/linux/fs.h
@@ -1211,10 +1211,11 @@ struct super_operations {
 #define I_DIRTY_DATASYNC	2 /* Data-related inode changes pending */
 #define I_DIRTY_PAGES		4 /* Data-related inode changes pending */
 #define __I_LOCK		3
+#define __I_NEW			6
 #define I_LOCK			(1 << __I_LOCK)
 #define I_FREEING		16
 #define I_CLEAR			32
-#define I_NEW			64
+#define I_NEW			(1 << __I_NEW)
 #define I_WILL_FREE		128
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
@@ -1431,6 +1432,7 @@ extern void putname(const char *name);
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
 extern struct block_device *bdget(dev_t);
+extern struct block_device *bdlookup(dev_t);
 extern void bd_set_size(struct block_device *, loff_t size);
 extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);

Index: linux-2.6.23.work/include/linux/genhd.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/genhd.h
+++ linux-2.6.23.work/include/linux/genhd.h
@@ -435,6 +435,12 @@ static inline struct block_device *bdget
 	return bdget(MKDEV(disk->major, disk->first_minor) + index);
 }
 
+static inline struct block_device *bdlookup_disk(struct gendisk *disk,
+						 int index)
+{
+	return bdlookup(MKDEV(disk->major, disk->first_minor) + index);
+}
+
 #endif
 
 #else /* CONFIG_BLOCK */
[PATCH] dm: noflush resizing (2/3) Use bdlookup() on noflush suspend
This patch makes device-mapper use bdlookup() to allow resizing of a noflush-suspended device. By using bdlookup() instead of bdget(), resizing will no longer wait for I_LOCK.

There is a race between bdlookup() and bdget(). However, it's benign, because the only use of the bdev obtained by bdlookup() is to modify i_size, and bdget() will see the updated disk size anyway. (See below.)

Case 1:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
    set_capacity()
      disk size is updated
    bdlookup() -> NULL
                                  bdget()
                                    reads the updated disk size and
                                    sets the bdev inode's i_size to it

Case 2:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
    set_capacity()
      disk size is updated
                                  bdget()
                                    reads the updated disk size and
                                    sets the bdev inode's i_size to it
    bdlookup()
      waits for bdget() completion
      (I_NEW) and gets the bdev
      inode with the updated size

Case 3:
    CPU 0                         CPU 1
    ---------------------------------------------------------------
                                  bdget()
                                    reads the old disk size
    set_capacity()
      disk size is updated
    bdlookup()
      gets the existing bdev inode
      and updates its i_size

To change the i_size of the bdev inode, the new code uses bd_mutex instead of inode->i_mutex. According to a comment in include/linux/fs.h, a mutex is necessary here only to serialize i_size_write(). bd_inode->i_size is set via bd_set_size(), and bd_set_size() is protected by bd_mutex. So I think bd_mutex is the lock we should use.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

---
 dm.c |   27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1101,11 +1101,29 @@ static void event_callback(void *context
 
 static void __set_size(struct mapped_device *md, sector_t size)
 {
+	struct block_device *bdev;
+
 	set_capacity(md->disk, size);
 
-	mutex_lock(&md->suspended_bdev->bd_inode->i_mutex);
-	i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
-	mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex);
+	/*
+	 * Updated disk size should be visible to bdget() possibly
+	 * called in parallel to bdlookup()
+	 */
+	smp_mb();
+
+	if (md->suspended_bdev)
+		bdev = md->suspended_bdev;
+	else
+		bdev = bdlookup_disk(md->disk, 0);
+
+	if (bdev) {
+		mutex_lock(&bdev->bd_mutex);
+		i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
+		mutex_unlock(&bdev->bd_mutex);
+	}
+
+	if (!md->suspended_bdev)
+		bdput(bdev);
 }
 
 static int __bind(struct mapped_device *md, struct dm_table *t)
@@ -1121,8 +1139,7 @@ static int __bind(struct mapped_device *
 	if (size != get_capacity(md->disk))
 		memset(&md->geometry, 0, sizeof(md->geometry));
 
-	if (md->suspended_bdev)
-		__set_size(md, size);
+	__set_size(md, size);
 
 	if (size == 0)
 		return 0;
[PATCH] dm: noflush resizing (0/3)
Hi,

Currently there is a restriction on resizing a device-mapper device: it cannot be resized if the 'noflush' option is given for suspend. Since both multipath-tools (dm-multipath) and LVM2 mirror use 'noflush' suspend, the restriction limits the capability of those tools. The following patches remove the restriction.

  [PATCH] (1/3) Add bdlookup()
  [PATCH] (2/3) Use bdlookup() on noflush suspend
  [PATCH] (3/3) Allow resizing table load on noflush suspend

The patches were tested on i686 and ia64 with LVM2, for both the 'noflush' case and the normal case.

Background:
- For some device-mapper targets (multipath and mirror), the mapping table sometimes has to be replaced to cope with device failure. OTOH, device-mapper flushes all pending I/Os upon table replacement, which may result in I/O errors if there are device failures. 'noflush' suspend is used to let dm queue the pending I/Os instead of flushing them. Since it's not possible for a user-space program to tell whether the suspend could cause I/O errors, those tools always use 'noflush' to suspend mirror/multipath targets.
- Currently, resizing is disabled for 'noflush' suspend. Resizing occurs in the course of table replacement. To resize a device under use, device-mapper needs to get its bdev inode. However, using bdget() in this case could cause a deadlock: it waits for I_LOCK while an I/O process holding I_LOCK is waiting for completion of the table replacement.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation of America
[PATCH] device-mapper: fix bd_mount_sem corruption
Hi,

This patch fixes a bd_mount_sem counter corruption bug in device-mapper.

thaw_bdev() should be called only when freeze_bdev() was called for the device. Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter. A struct block_device with the corrupted semaphore may remain in the slab cache and be reused later.

The attached patch fixes it by calling unlock_fs() instead. unlock_fs() determines whether it should call thaw_bdev() by checking whether the device is frozen.

An easy reproducer is:

#!/bin/sh
while [ 1 ]; do
	dmsetup --notable create a
	dmsetup --nolockfs suspend a
	dmsetup remove a
done

It's not easy to see the effect of the corrupted semaphore. So I have tested with the printk below added to bdev_alloc_inode():

	if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1)
		printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n",
		       atomic_read(&ei->bdev.bd_mount_sem.count), &ei->bdev);

Without the patch, I saw something like:

	Incorrect semaphore count = 17 (f2ab91c0)

With the patch, the message didn't appear.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2120155..998d450 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1064,12 +1064,14 @@ static struct mapped_device *alloc_dev(int minor)
 	return NULL;
 }
 
+static void unlock_fs(struct mapped_device *md);
+
 static void free_dev(struct mapped_device *md)
 {
 	int minor = md->disk->first_minor;
 
 	if (md->suspended_bdev) {
-		thaw_bdev(md->suspended_bdev, NULL);
+		unlock_fs(md);
 		bdput(md->suspended_bdev);
 	}
 	mempool_destroy(md->tio_pool);
Re: [2.6.23 PATCH 07/18] dm io: fix panic on large request
Patrick McHardy wrote:
> Jun'ichi Nomura wrote:
>> Are you using other dm modules such as dm-multipath, dm-mirror
>> or dm-snapshot?
>> If so, can you take the output of 'dmsetup table' and 'dmsetup ls'?
>
> No other modules.
>
>> Do you have a reliable way to reproduce the oops which I can try?
>
> "/etc/init.d/cryptdisk start" (debian) on a luks partition triggered
> it for me.

With today's git HEAD (commit 49c13b51a15f1ba9f6d47e26e4a3886c4f3931e2), I tried the following but could not reproduce the oops here:

	# cryptsetup luksFormat /dev/sdb1
	# cryptsetup luksOpen /dev/sdb1 c
	# mkfs.ext3 /dev/mapper/c
	(mount it and do some I/O)

Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
Re: [2.6.23 PATCH 07/18] dm io: fix panic on large request
Hi,

Patrick McHardy wrote:
> Alasdair G Kergon wrote:
>> From: "Jun'ichi Nomura" <[EMAIL PROTECTED]>
>>
>> bio_alloc_bioset() will return NULL if 'num_vecs' is too large.
>> Use bio_get_nr_vecs() to get an estimation of the maximum number.
>>
>> Signed-off-by: "Jun'ichi Nomura" <[EMAIL PROTECTED]>
>> Signed-off-by: Alasdair G Kergon <[EMAIL PROTECTED]>
>>
>> ---
>>  drivers/md/dm-io.c | 5 ++++-
>>  1 files changed, 4 insertions(+), 1 deletion(-)
>
> This patch reproducibly oopses my box:

Thanks for the report. But I'm not sure how the patch is related to the oops. The stack trace shows the oops occurred in dm-crypt, which doesn't use the part of the code modified by the patch (dm-io).

Are you using other dm modules such as dm-multipath, dm-mirror or dm-snapshot? If so, can you take the output of 'dmsetup table' and 'dmsetup ls'?

Do you have a reliable way to reproduce the oops which I can try?

> [  126.754204] BUG: unable to handle kernel NULL pointer dereference at virtual address
> [  126.754326] printing eip:
> [  126.754369] c0141a67
> [  126.754420] *pde =
> [  126.754465] Oops: [#1]
> [  126.754507] PREEMPT
> [  126.754585] Modules linked in: [...]
> [  126.758372] CPU:    0
> [  126.758373] EIP:    0060:[<c0141a67>]    Not tainted VLI
> [  126.758374] EFLAGS: 00010282   (2.6.22 #1)
> [  126.758511] EIP is at mempool_free+0xe/0xc0
> [  126.758558] eax: d39e65d0   ebx:   ecx: df2b9898   edx:
> [  126.758605] esi:   edi: d39e65d0   ebp: d487d6d0   esp: df79fec0
> [  126.758652] ds: 007b   es: 007b   fs:   gs:   ss: 0068
> [  126.758699] Process kcryptd/0 (pid: 3218, ti=df79f000 task=df2b9640 task.ti=df79f000)
> [  126.758747] Stack: d3835f80 e08b0923 e08a5f69 0200 e0ad1080
> [  126.759093]        dfb5ab40 d3835f80 e08b08c0 e08a5fb7 c01804d8 0200
> [  126.759439]        c520bc00 0c00 d0b77438 d5754b00 df79ff5c e08a515e d0b77444 d5754b00
> [  126.759858] Call Trace:
> [  126.759965]  [<e08b0923>] clone_endio+0x63/0xc0 [dm_mod]
> [  126.760066]  [<e08a5f69>] crypt_convert+0x131/0x17f [dm_crypt]
> [  126.760168]  [<e08b08c0>] clone_endio+0x0/0xc0 [dm_mod]
> [  126.760264]  [<e08a5fb7>] kcryptd_do_work+0x0/0x30f [dm_crypt]
> [  126.760349]  [<c01804d8>] bio_endio+0x33/0x5d
> [  126.760462]  [<e08a515e>] dec_pending+0x28/0x39 [dm_crypt]
> [  126.760558]  [<e08a61e6>] kcryptd_do_work+0x22f/0x30f [dm_crypt]
> [  126.760669]  [<c0112182>] update_stats_wait_end+0x7f/0xb2
> [  126.760801]  [<e08a5fb7>] kcryptd_do_work+0x0/0x30f [dm_crypt]
> [  126.760888]  [<c012700e>] run_workqueue+0x84/0x179
> [  126.760990]  [<c0127292>] worker_thread+0x0/0xf0
> [  126.761074]  [<c012732f>] worker_thread+0x9d/0xf0
> [  126.761160]  [<c012a360>] autoremove_wake_function+0x0/0x37
> [  126.761256]  [<c0127292>] worker_thread+0x0/0xf0
> [  126.761334]  [<c012a15c>] kthread+0x52/0x58
> [  126.761411]  [<c012a10a>] kthread+0x0/0x58
> [  126.761496]  [<c0104983>] kernel_thread_helper+0x7/0x14
> [  126.761598]  ===
> [  126.761717] Code: 1c 00 89 f6 eb a9 b8 88 13 00 00 e8 b4 56 1c 00 8d 74 26 00 eb d5 31 db e9 11 ff ff ff 57 56 53 83 ec 04 89 c7 89 d6 85 c0 74 55 <8b> 02 39 42 04 7d 46 9c 58 90 8d b4 26 00 00 00 00 89 c3 fa 90
> [  126.763964] EIP: [<c0141a67>] mempool_free+0xe/0xc0 SS:ESP 0068:df79fec0

Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
[PATCH 2.6.20-rc5] dm-multipath: fix stall on noflush suspend/resume
Allow noflush suspend/resume of a device-mapper device only for the case where the device size is unchanged.

Otherwise, dm-multipath devices can stall when resumed if noflush was used when suspending them, all paths have failed and queue_if_no_path is set.

Explanation:
1. Something is doing fsync() on the block dev, holding inode->i_sem
2. The fsync write is blocked by all-paths-down and queue_if_no_path
3. Someone requests to suspend the dm device with noflush. Pending writes are left in the queue.
4. In the middle of dm_resume(), __bind() tries to get inode->i_sem to do __set_size() and waits forever.

Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
'noflush suspend' is a new device-mapper feature introduced in early 2.6.20. So I hope the fix can be included before 2.6.20 is released.

Example of reproducer:
1. Create a multipath device by dmsetup
2. Fail all paths during mkfs
3. Do dmsetup suspend --noflush and load a new map with healthy paths
4. Do dmsetup resume

 drivers/md/dm.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Index: linux-2.6.19/drivers/md/dm.c
===================================================================
--- linux-2.6.19.orig/drivers/md/dm.c	2006-12-12 22:02:29.0 +
+++ linux-2.6.19/drivers/md/dm.c	2007-01-05 15:15:53.0 +
@@ -1116,7 +1116,8 @@ static int __bind(struct mapped_device *
 	if (size != get_capacity(md->disk))
 		memset(&md->geometry, 0, sizeof(md->geometry));
 
-	__set_size(md, size);
+	if (md->suspended_bdev)
+		__set_size(md, size);
 
 	if (size == 0)
 		return 0;
@@ -1265,6 +1266,11 @@ int dm_swap_table(struct mapped_device *
 	if (!dm_suspended(md))
 		goto out;
 
+	/* without bdev, the device size cannot be changed */
+	if (!md->suspended_bdev)
+		if (get_capacity(md->disk) != dm_table_get_size(table))
+			goto out;
+
 	__unbind(md);
 	r = __bind(md, table);
 
@@ -1342,11 +1348,14 @@ int dm_suspend(struct mapped_device *md,
 	/* This does not get reverted if there's an error later. */
 	dm_table_presuspend_targets(map);
 
-	md->suspended_bdev = bdget_disk(md->disk, 0);
-	if (!md->suspended_bdev) {
-		DMWARN("bdget failed in dm_suspend");
-		r = -ENOMEM;
-		goto flush_and_out;
+	/* bdget() can stall if the pending I/Os are not flushed */
+	if (!noflush) {
+		md->suspended_bdev = bdget_disk(md->disk, 0);
+		if (!md->suspended_bdev) {
+			DMWARN("bdget failed in dm_suspend");
+			r = -ENOMEM;
+			goto flush_and_out;
+		}
 	}
 
 	/*
@@ -1474,8 +1483,10 @@ int dm_resume(struct mapped_device *md)
 
 	unlock_fs(md);
 
-	bdput(md->suspended_bdev);
-	md->suspended_bdev = NULL;
+	if (md->suspended_bdev) {
+		bdput(md->suspended_bdev);
+		md->suspended_bdev = NULL;
+	}
 
 	clear_bit(DMF_SUSPENDED, &md->flags);