Re: [PATCH] lightnvm: fix memory leak in pblk_luns_init

2018-02-22 Thread Matias Bjørling

On 02/23/2018 12:03 AM, Huaicheng Li wrote:

Please ignore my previous email, as I found the memory is freed in
pblk_init()'s error handling logic.
Sorry for the interruption.

On Thu, Feb 22, 2018 at 3:01 PM, Huaicheng Li 
wrote:


Signed-off-by: Huaicheng Li 
---
drivers/lightnvm/pblk-init.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 93d671ca518e..330665d91d8d 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -510,6 +510,7 @@ static int pblk_luns_init(struct pblk *pblk, struct
ppa_addr *luns)
 if (ret) {
 while (--i >= 0)
 kfree(pblk->luns[i].bb_list);
+   kfree(pblk->luns);
 return ret;
 }
 }
--
2.13.6






No worries. The initialization part is pretty complex due to all the 
data structures being set up. I would love to have a clean 
initialization function that cleans up after itself. But being able 
to share the "fail" bringup and tear-down code is very helpful and makes 
other parts easier.


Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-22 Thread Chengguang Xu
Hi Joseph,

Thanks for the quick reply; I didn't notice that patch before.


Thanks,
Chengguang.


> On Feb 23, 2018, at 11:55 AM, Joseph Qi wrote:
> 
> 
> 
> On 18/2/23 11:33, Chengguang Xu wrote:
>> Hi Tejun,
>> 
>> Sorry for the delayed reply; I was on vacation last week.
>> 
>> The problem still exists in the current code of 4.16.0-rc2; detailed
>> test information is below. If further info is needed, please let me
>> know.
>> 
>> Thanks.
>> 
> That's true; the issue Shaohua fixed is the double charge, but the double
> stat issue still exists.
> 
> Jiufei has posted a fix, which has already been tested by Bo Liu:
> [PATCH RESEND] blk-throttle: avoid double counted
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg18516.html
> 
> Thanks,
> Joseph
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-22 Thread Joseph Qi


On 18/2/23 11:33, Chengguang Xu wrote:
> Hi Tejun,
> 
> Sorry for the delayed reply; I was on vacation last week.
> 
> The problem still exists in the current code of 4.16.0-rc2; detailed
> test information is below. If further info is needed, please let me
> know.
> 
> Thanks.
> 
That's true; the issue Shaohua fixed is the double charge, but the double
stat issue still exists.

Jiufei has posted a fix, which has already been tested by Bo Liu:
[PATCH RESEND] blk-throttle: avoid double counted
https://www.mail-archive.com/linux-block@vger.kernel.org/msg18516.html

Thanks,
Joseph


Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-22 Thread Chengguang Xu
Hi Tejun,

Sorry for the delayed reply; I was on vacation last week.

The problem still exists in the current code of 4.16.0-rc2; detailed test
information is below. If further info is needed, please let me know.

Thanks.

———
Both read/write bps are limited to 10MB/s in blkio cgroup v1 & v2

$ uname -r
4.16.0-rc2+


[Without this patch]

CGROUP V1 (direct write):

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.402 s, 10.5 MB/s

8:16 Read 16384
8:16 Write 2684354560
8:16 Sync 2684370944
8:16 Async 0
8:16 Total 2684370944

CGROUP V1 (read):

$ dd if=/mnt/sdb1/20/test of=/dev/zero bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.412 s, 10.5 MB/s

8:16 Read 4831838208
8:16 Write 0
8:16 Sync 4831838208
8:16 Async 0
8:16 Total 4831838208


CGROUP V2 (direct write):

$ cat io.max
8:16 rbps=max wbps=10485760 riops=max wiops=max

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.408 s, 10.5 MB/s

8:16 rbytes=24576 wbytes=2684354560 rios=5 wios=4096


CGROUP V2 (buffered write):

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.637822 s, 1.7 GB/s

8:16 rbytes=0 wbytes=4831838208 rios=0 wios=4096

CGROUP V2 (read):

$ cat io.max
8:16 rbps=10485760 wbps=max riops=max wiops=max

$ dd if=/mnt/sdb1/20/test of=/dev/zero bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.409 s, 10.5 MB/s

8:16 rbytes=4831846400 wbytes=0 rios=4097 wios=0

[With this patch]

CGROUP V1 (direct write):

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.402 s, 10.5 MB/s

8:16 Read 24576
8:16 Write 1073741824
8:16 Sync 1073766400
8:16 Async 0
8:16 Total 1073766400

CGROUP V1 (read):

$ dd if=/mnt/sdb1/20/test of=/dev/zero bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.406 s, 10.5 MB/s

8:16 Read 1073741824
8:16 Write 0
8:16 Sync 1073741824
8:16 Async 0
8:16 Total 1073741824

CGROUP V2 (direct write):

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.407 s, 10.5 MB/s

8:16 rbytes=16384 wbytes=1073741824 rios=4 wios=1024


CGROUP V2 (buffered write):

$ dd if=/dev/zero of=/mnt/sdb1/20/test bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.650783 s, 1.6 GB/s

8:16 rbytes=0 wbytes=1073741824 rios=0 wios=512

CGROUP V2 (read):

$ dd if=/mnt/sdb1/20/test of=/dev/zero bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 102.411 s, 10.5 MB/s

8:16 rbytes=1077314048 wbytes=0 rios=572 wios=0


———


> On Feb 13, 2018, at 10:43 PM, Tejun Heo wrote:
> 
> On Tue, Feb 13, 2018 at 02:45:50PM +0800, Chengguang Xu wrote:
>> In current throttling/upper limit policy of blkio cgroup
>> blkio.throttle.io_service_bytes does not exactly represent
>> the number of bytes issued to the disk by the group, sometimes
>> this number could be counted multiple times of real bytes.
>> This fix introduces BIO_COUNTED flag to avoid multiple counting
>> for same bio.
>> 
>> Signed-off-by: Chengguang Xu 
> 
> We had a series of fixes / changes for this problem during the last
> cycle.  Can you please see whether the current Linus master has the
> same problem?
> 
> Thanks.
> 
> -- 
> tejun



Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-22 Thread xuejiufei
Hi Tejun,

On 2018/2/22 11:18 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>> I still don't get how css_tryget can work here.
>>
>> The race happens when:
>> 1) writeback kworker has found the blkg with rcu;
>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>> Then, writeback kworker will take queue lock and access the blkg with
>> refcount 0.
> 
> Yeah, then tryget would fail and it should go through the root.
> 
In this race, the refcount of the blkg drops to zero and the blkg is
destroyed. However, the css may still hold a refcount, so css_tryget can
return success before the other holders put their references.
So I don't see how css_tryget can fix this race. Or could we
add another function, blkg_tryget?

Thanks,
Jiufei

> Thanks.
> 




[PATCH v4 1/6] block/loop: Delete gendisk before cleaning up the request queue

2018-02-22 Thread Bart Van Assche
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.

Signed-off-by: Bart Van Assche 
Cc: Josef Bacik 
Cc: Shaohua Li 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Ming Lei 
---
 drivers/block/loop.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..8b693fdd08dc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1864,8 +1864,8 @@ static int loop_add(struct loop_device **l, int i)
 
 static void loop_remove(struct loop_device *lo)
 {
-   blk_cleanup_queue(lo->lo_queue);
del_gendisk(lo->lo_disk);
+   blk_cleanup_queue(lo->lo_queue);
blk_mq_free_tag_set(&lo->tag_set);
put_disk(lo->lo_disk);
kfree(lo);
-- 
2.16.2



[PATCH v4 2/6] md: Delete gendisk before cleaning up the request queue

2018-02-22 Thread Bart Van Assche
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.

Signed-off-by: Bart Van Assche 
Cc: Shaohua Li 
---
 drivers/md/md.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bc67ab6844f0..eba7fa2f0abb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5203,12 +5203,12 @@ static void md_free(struct kobject *ko)
if (mddev->sysfs_state)
sysfs_put(mddev->sysfs_state);
 
+   if (mddev->gendisk)
+   del_gendisk(mddev->gendisk);
if (mddev->queue)
blk_cleanup_queue(mddev->queue);
-   if (mddev->gendisk) {
-   del_gendisk(mddev->gendisk);
+   if (mddev->gendisk)
put_disk(mddev->gendisk);
-   }
percpu_ref_exit(&mddev->writes_pending);
 
kfree(mddev);
-- 
2.16.2



[PATCH v4 0/6] Fix races between blkcg code and request queue initialization and cleanup

2018-02-22 Thread Bart Van Assche
Hello Jens,

Recently Joseph Qi identified races between the block cgroup code and request
queue initialization and cleanup. This patch series addresses these races.
Please consider these patches for kernel v4.17.

Thanks,

Bart.

Changes between v3 and v4:
- Added three patches that fix the calling order of del_gendisk() and
  blk_cleanup_queue() in multiple block drivers.

Changes between v2 and v3:
- Added a third patch that fixes a race between the blkcg code and queue
  cleanup.

Changes between v1 and v2:
- Split a single patch into two patches.
- Dropped blk_alloc_queue_node2() and modified all block drivers that call
  blk_alloc_queue_node().

Bart Van Assche (6):
  block/loop: Delete gendisk before cleaning up the request queue
  md: Delete gendisk before cleaning up the request queue
  zram: Delete gendisk before cleaning up the request queue
  block: Add a third argument to blk_alloc_queue_node()
  block: Fix a race between the cgroup code and request queue
initialization
  block: Fix a race between request queue removal and the block cgroup
controller

 block/blk-core.c   | 60 +++---
 block/blk-mq.c |  2 +-
 block/blk-sysfs.c  |  7 -
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/loop.c   |  2 +-
 drivers/block/null_blk.c   |  3 ++-
 drivers/block/umem.c   |  7 +++--
 drivers/block/zram/zram_drv.c  |  2 +-
 drivers/ide/ide-probe.c|  2 +-
 drivers/lightnvm/core.c|  2 +-
 drivers/md/dm.c|  2 +-
 drivers/md/md.c|  6 ++---
 drivers/nvdimm/pmem.c  |  2 +-
 drivers/nvme/host/multipath.c  |  2 +-
 drivers/scsi/scsi_lib.c|  2 +-
 include/linux/blkdev.h |  3 ++-
 16 files changed, 70 insertions(+), 37 deletions(-)

-- 
2.16.2



[PATCH v4 6/6] block: Fix a race between request queue removal and the block cgroup controller

2018-02-22 Thread Bart Van Assche
Avoid that the following race can occur:

blk_cleanup_queue()   blkcg_print_blkgs()
  spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
q->queue_lock = &q->__queue_lock (3)
  spin_unlock_irq(lock) (4)
spin_unlock_irq(blkg->q->queue_lock) (6)

(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock;
(5) can take driver lock now;
(6) but unlock internal lock.

This change is safe because only the SCSI core and the NVMe core keep
a reference on a request queue after having called blk_cleanup_queue().
Neither driver accesses any of the removed data structures between its
blk_cleanup_queue() and blk_put_queue() calls.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Jan Kara 
---
 block/blk-core.c  | 31 +++
 block/blk-sysfs.c |  7 ---
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 41c74b37be85..6febc69a58aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,6 +719,37 @@ void blk_cleanup_queue(struct request_queue *q)
del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
blk_sync_queue(q);
 
+   /*
+* I/O scheduler exit is only safe after the sysfs scheduler attribute
+* has been removed.
+*/
+   WARN_ON_ONCE(q->kobj.state_in_sysfs);
+
+   /*
+* Since the I/O scheduler exit code may access cgroup information,
+* perform I/O scheduler exit before disassociating from the block
+* cgroup controller.
+*/
+   if (q->elevator) {
+   ioc_clear_queue(q);
+   elevator_exit(q, q->elevator);
+   q->elevator = NULL;
+   }
+
+   /*
+* Remove all references to @q from the block cgroup controller before
+* restoring @q->queue_lock to avoid that restoring this pointer causes
+* e.g. blkcg_print_blkgs() to crash.
+*/
+   blkcg_exit_queue(q);
+
+   /*
+* Since the cgroup code may dereference the @q->backing_dev_info
+* pointer, only decrease its reference count after having removed the
+* association with the block cgroup controller.
+*/
+   bdi_put(q->backing_dev_info);
+
if (q->mq_ops)
blk_mq_free_queue(q);
percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cbea895a5547..fd71a00c9462 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -798,13 +798,6 @@ static void __blk_release_queue(struct work_struct *work)
if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
-   bdi_put(q->backing_dev_info);
-   blkcg_exit_queue(q);
-
-   if (q->elevator) {
-   ioc_clear_queue(q);
-   elevator_exit(q, q->elevator);
-   }
 
blk_free_queue_stats(q->stats);
 
-- 
2.16.2



[PATCH v4 5/6] block: Fix a race between the cgroup code and request queue initialization

2018-02-22 Thread Bart Van Assche
Initialize the request queue lock earlier such that the following
race can no longer occur:

blk_init_queue_node() blkcg_print_blkgs()
  blk_alloc_queue_node (1)
q->queue_lock = &q->__queue_lock (2)
blkcg_init_queue(q) (3)
spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.

The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into
  blk_alloc_queue_node().
- Only override the .queue_lock pointer for legacy queues because it
  is not useful for blk-mq queues to override this pointer.
- For all block drivers that initialize .queue_lock explicitly,
  change the blk_alloc_queue() call in the driver into a
  blk_alloc_queue_node() call and remove the explicit .queue_lock
  initialization. Additionally, initialize the spin lock that will
  be used as queue lock earlier if necessary.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
---
 block/blk-core.c   | 24 
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/umem.c   |  7 +++
 3 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e873a24bf82d..41c74b37be85 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -888,6 +888,19 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
+/**
+ * blk_alloc_queue_node - allocate a request queue
+ * @gfp_mask: memory allocation flags
+ * @node_id: NUMA node to allocate memory from
+ * @lock: For legacy queues, pointer to a spinlock that will be used to e.g.
+ *serialize calls to the legacy .request_fn() callback. Ignored for
+ *   blk-mq request queues.
+ *
+ * Note: pass the queue lock as the third argument to this function instead of
+ * setting the queue lock pointer explicitly to avoid triggering a sporadic
+ * crash in the blkcg code. This function namely calls blkcg_init_queue() and
+ * the queue lock pointer must be set before blkcg_init_queue() is called.
+ */
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
   spinlock_t *lock)
 {
@@ -940,11 +953,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id,
mutex_init(&q->sysfs_lock);
spin_lock_init(&q->__queue_lock);
 
-   /*
-* By default initialize queue_lock to internal lock and driver can
-* override it later if need be.
-*/
-   q->queue_lock = &q->__queue_lock;
+   if (!q->mq_ops)
q->queue_lock = lock ? : &q->__queue_lock;
 
/*
 * A queue starts its life with bypass turned on to avoid
@@ -1031,13 +1041,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t 
*lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);
if (!q)
return NULL;
 
q->request_fn = rfn;
-   if (lock)
-   q->queue_lock = lock;
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return NULL;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 0a0394aa1b9c..185f1ef00a7c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2816,7 +2816,7 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
 
drbd_init_set_defaults(device);
 
-   q = blk_alloc_queue(GFP_KERNEL);
+   q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, &resource->req_lock);
if (!q)
goto out_no_q;
device->rq_queue = q;
@@ -2848,7 +2848,6 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
/* Setting the max_hw_sectors to an odd value of 8kibyte here
   This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
-   q->queue_lock = &resource->req_lock;
 
device->md_io.page = 

[PATCH v4 4/6] block: Add a third argument to blk_alloc_queue_node()

2018-02-22 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
---
 block/blk-core.c  | 7 ---
 block/blk-mq.c| 2 +-
 drivers/block/null_blk.c  | 3 ++-
 drivers/ide/ide-probe.c   | 2 +-
 drivers/lightnvm/core.c   | 2 +-
 drivers/md/dm.c   | 2 +-
 drivers/nvdimm/pmem.c | 2 +-
 drivers/nvme/host/multipath.c | 2 +-
 drivers/scsi/scsi_lib.c   | 2 +-
 include/linux/blkdev.h| 3 ++-
 10 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2d1a7bbe0634..e873a24bf82d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -810,7 +810,7 @@ void blk_exit_rl(struct request_queue *q, struct 
request_list *rl)
 
 struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 {
-   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE);
+   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE, NULL);
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
@@ -888,7 +888,8 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
-struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
+  spinlock_t *lock)
 {
struct request_queue *q;
 
@@ -1030,7 +1031,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t 
*lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
if (!q)
return NULL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102e2149..ae4e5096f425 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2554,7 +2554,7 @@ struct request_queue *blk_mq_init_queue(struct 
blk_mq_tag_set *set)
 {
struct request_queue *uninit_q, *q;
 
-   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);
+   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);
if (!uninit_q)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 7f5fc9f730b8..c5ac1b334787 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1715,7 +1715,8 @@ static int null_add_dev(struct nullb_device *dev)
}
null_init_queues(nullb);
} else if (dev->queue_mode == NULL_Q_BIO) {
-   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node);
+   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node,
+   NULL);
if (!nullb->q) {
rv = -ENOMEM;
goto out_cleanup_queues;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 17fd55af4d92..2e80a866073c 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -766,7 +766,7 @@ static int ide_init_queue(ide_drive_t *drive)
 *  limits and LBA48 we could raise it but as yet
 *  do not.
 */
-   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif));
+   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif), NULL);
if (!q)
return 1;
 
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..5f1988df1593 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -384,7 +384,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
goto err_dev;
}
 
-   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
+   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node, NULL);
if (!tqueue) {
ret = -ENOMEM;
goto err_disk;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d6de00f367ef..3c55564f6367 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1840,7 +1840,7 @@ static struct mapped_device *alloc_dev(int minor)
INIT_LIST_HEAD(>table_devices);
spin_lock_init(>uevent_lock);
 
-   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
+   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, NULL);
if (!md->queue)
goto bad;
md->queue->queuedata = md;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 10041ac4032c..cfb15ac50925 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -344,7 +344,7 @@ static int pmem_attach_disk(struct device *dev,
return -EBUSY;
}
 
-   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev), NULL);
if (!q)
   

[PATCH v4 3/6] zram: Delete gendisk before cleaning up the request queue

2018-02-22 Thread Bart Van Assche
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.

Signed-off-by: Bart Van Assche 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0afa6c8c3857..85110e7931e5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1620,8 +1620,8 @@ static int zram_remove(struct zram *zram)
 
pr_info("Removed device: %s\n", zram->disk->disk_name);
 
-   blk_cleanup_queue(zram->disk->queue);
del_gendisk(zram->disk);
+   blk_cleanup_queue(zram->disk->queue);
put_disk(zram->disk);
kfree(zram);
return 0;
-- 
2.16.2



Re: v4.16-rc2: I/O hang with dm-rq + Kyber

2018-02-22 Thread Bart Van Assche
On Thu, 2018-02-22 at 14:42 -0800, Omar Sandoval wrote:
> On Thu, Feb 22, 2018 at 09:10:23PM +, Bart Van Assche wrote:
> > I/O hangs if I run the following command on top of kernel v4.16-rc2 + the
> > ib_srpt patch that adds RDMA/CM support:
> > 
> > srp-test/run_tests -c -d -r 10 -t 02-mq -e kyber
> > 
> > This does not happen with the deadline scheduler nor without a scheduler.
> > This test passed a few months ago. I have attached the output of the 
> > following
> > command to this e-mail:
> > 
> > (cd /sys/kernel/debug/block && grep -r .) | grep -v /poll_stat
> > 
> > Can you have a look?
> 
> Hey, Bart, thanks for the report. Can you clarify what the device
> topology is w.r.t. the dm devices?

Hello Omar,

The topology was nothing fancy: dm-rq in blk-mq mode was configured to use
one scsi-mq SRP-over-RoCE path. The multipath output is as follows:

# multipath -ll
mpathb (3600140572616d6469736b310) dm-0 LIO-ORG,IBLOCK
size=32M features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 4:0:0:0 sdc 8:32 active ready running

# for d in /sys/block/{dm-0,sdc}; do echo $(basename $d): 
$(<$d/queue/scheduler); done
dm-0: bfq [kyber] mq-deadline none
sdc: bfq [kyber] mq-deadline none

Bart.

Re: v4.16-rc2: I/O hang with dm-rq + Kyber

2018-02-22 Thread Omar Sandoval
On Thu, Feb 22, 2018 at 09:10:23PM +, Bart Van Assche wrote:
> Hello Omar,
> 
> I/O hangs if I run the following command on top of kernel v4.16-rc2 + the
> ib_srpt patch that adds RDMA/CM support:
> 
> srp-test/run_tests -c -d -r 10 -t 02-mq -e kyber
> 
> This does not happen with the deadline scheduler nor without a scheduler.
> This test passed a few months ago. I have attached the output of the following
> command to this e-mail:
> 
> (cd /sys/kernel/debug/block && grep -r .) | grep -v /poll_stat
> 
> Can you have a look?
> 
> Thanks,
> 
> Bart.

Hey, Bart, thanks for the report. Can you clarify what the device
topology is w.r.t. the dm devices?


v4.16-rc2: I/O hang with dm-rq + Kyber

2018-02-22 Thread Bart Van Assche
Hello Omar,

I/O hangs if I run the following command on top of kernel v4.16-rc2 + the
ib_srpt patch that adds RDMA/CM support:

srp-test/run_tests -c -d -r 10 -t 02-mq -e kyber

This does not happen with the deadline scheduler nor without a scheduler.
This test passed a few months ago. I have attached the output of the following
command to this e-mail:

(cd /sys/kernel/debug/block && grep -r .) | grep -v /poll_stat

Can you have a look?

Thanks,

Bart.

dm-1/sched/async_depth:48
dm-1/sched/other_tokens:depth=64
dm-1/sched/other_tokens:busy=0
dm-1/sched/other_tokens:bits_per_word=64
dm-1/sched/other_tokens:map_nr=1
dm-1/sched/other_tokens:alloc_hint={33, 1415, 1816, 307}
dm-1/sched/other_tokens:wake_batch=8
dm-1/sched/other_tokens:wake_index=0
dm-1/sched/other_tokens:ws={
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:{.wait_cnt=8, .wait=inactive},
dm-1/sched/other_tokens:}
dm-1/sched/other_tokens:round_robin=0
dm-1/sched/sync_write_tokens:depth=128
dm-1/sched/sync_write_tokens:busy=128
dm-1/sched/sync_write_tokens:bits_per_word=64
dm-1/sched/sync_write_tokens:map_nr=2
dm-1/sched/sync_write_tokens:alloc_hint={85, 0, 0, 90}
dm-1/sched/sync_write_tokens:wake_batch=8
dm-1/sched/sync_write_tokens:wake_index=2
dm-1/sched/sync_write_tokens:ws={
dm-1/sched/sync_write_tokens:   {.wait_cnt=1, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=active},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:   {.wait_cnt=8, .wait=inactive},
dm-1/sched/sync_write_tokens:}
dm-1/sched/sync_write_tokens:round_robin=0
dm-1/sched/read_tokens:depth=256
dm-1/sched/read_tokens:busy=256
dm-1/sched/read_tokens:bits_per_word=64
dm-1/sched/read_tokens:map_nr=4
dm-1/sched/read_tokens:alloc_hint={89, 0, 90, 64}
dm-1/sched/read_tokens:wake_batch=8
dm-1/sched/read_tokens:wake_index=0
dm-1/sched/read_tokens:ws={
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=active},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens: {.wait_cnt=8, .wait=inactive},
dm-1/sched/read_tokens:}
dm-1/sched/read_tokens:round_robin=0
dm-1/hctx0/sched/batching:0
dm-1/hctx0/sched/cur_domain:READ
dm-1/hctx0/sched/other_waiting:0
dm-1/hctx0/sched/sync_write_waiting:1
dm-1/hctx0/sched/sync_write_rqs:8b7da875 {.op=WRITE, 
.cmd_flags=SYNC|META|PRIO, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x260/0x0, .tag=-1, .internal_tag=48}
dm-1/hctx0/sched/sync_write_rqs:4b242673 {.op=WRITE, 
.cmd_flags=SYNC|META|PRIO, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x1b4/0x0, .tag=-1, .internal_tag=46}
dm-1/hctx0/sched/sync_write_rqs:83108835 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x248/0x0, .tag=-1, .internal_tag=183}
dm-1/hctx0/sched/sync_write_rqs:b01459f7 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x2a4/0x0, .tag=-1, .internal_tag=54}
dm-1/hctx0/sched/sync_write_rqs:f56c232f {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x23c/0x0, .tag=-1, .internal_tag=62}
dm-1/hctx0/sched/sync_write_rqs:8ccce380 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x290/0x0, .tag=-1, .internal_tag=28}
dm-1/hctx0/sched/sync_write_rqs:89435afc {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x15c/0x0, .tag=-1, .internal_tag=122}
dm-1/hctx0/sched/sync_write_rqs:e683a4e2 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x268/0x0, .tag=-1, .internal_tag=127}
dm-1/hctx0/sched/sync_write_rqs:c4ada57b {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x2e4/0x0, .tag=-1, .internal_tag=64}
dm-1/hctx0/sched/sync_write_rqs:ab365972 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=ELVPRIV|IO_STAT, .state=idle, 
.gstate=0x230/0x0, .tag=-1, .internal_tag=101}
dm-1/hctx0/sched/sync_write_rqs:c9e14c7b {.op=WRITE, 
.cmd_flags=SYNC|IDLE, 

[PATCH] lightnvm: fix memory leak in pblk_luns_init

2018-02-22 Thread Huaicheng Li

Signed-off-by: Huaicheng Li 
---
drivers/lightnvm/pblk-init.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 93d671ca518e..330665d91d8d 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -510,6 +510,7 @@ static int pblk_luns_init(struct pblk *pblk, struct 
ppa_addr *luns)
if (ret) {
while (--i >= 0)
kfree(pblk->luns[i].bb_list);
+   kfree(pblk->luns);
return ret;
}
}
--
2.13.6



Re: [PATCH V15 06/22] mmc: block: Add blk-mq support

2018-02-22 Thread Dmitry Osipenko
On 22.02.2018 10:42, Adrian Hunter wrote:
> On 21/02/18 22:50, Dmitry Osipenko wrote:
>> On 29.11.2017 16:41, Adrian Hunter wrote:
>>> Define and use a blk-mq queue. Discards and flushes are processed
>>> synchronously, but reads and writes asynchronously. In order to support
>>> slow DMA unmapping, DMA unmapping is not done until after the next request
>>> is started. That means the request is not completed until then. If there is
>>> no next request then the completion is done by queued work.
>>
>> Hello,
>>
>> I'm using (running linux-next and doing some upstream development for) an
>> old NVIDIA Tegra tablet that has built-in (internal) and external MMCs, and
>> with blk-mq enabled I'm observing a soft lockup. The lockup is reproducible
>> quite reliably by running fsck on any MMC partition; sometimes the kernel
>> locks up on boot while probing the partition table (weirdly, only when both
>> SDHCIs are present, i.e. internal storage enabled in DT and an external SD
>> inserted/enabled), and it also locks up pretty quickly under just general
>> use. Reverting mmc/ commits up to 1bec43a3b18 ("Remove option not to use
>> blk-mq") and disabling CONFIG_MMC_MQ_DEFAULT makes everything work fine
>> again. There is also a third SDHCI populated with built-in WiFi/Bluetooth
>> SDIO, and I'm observing odd MMC timeouts with blk-mq enabled; disabling
>> CONFIG_MMC_MQ_DEFAULT fixes these timeouts as well.
>>
>> Any thoughts?
> 
> SDIO (unless it is a combo card) should be unaffected by changes to the
> block driver.
> 
> I don't have any ideas.  Adding more NVIDIA people.
> 
>>
>> WiFi issue
>> 
>>
>> [   38.247006] mmc2: Timeout waiting for hardware interrupt.
>> [   38.247027] brcmfmac: brcmf_escan_timeout: timer expired
>> [   38.247036] mmc2: sdhci:  SDHCI REGISTER DUMP ===
>> [   38.247047] mmc2: sdhci: Sys addr:  0x | Version:  0x0001
>> [   38.247055] mmc2: sdhci: Blk size:  0x7008 | Blk cnt:  0x
>> [   38.247062] mmc2: sdhci: Argument:  0x2108 | Trn mode: 0x0013
>> [   38.247070] mmc2: sdhci: Present:   0x01d7 | Host ctl: 0x0013
>> [   38.247077] mmc2: sdhci: Power: 0x0001 | Blk gap:  0x
>> [   38.247084] mmc2: sdhci: Wake-up:   0x | Clock:0x0007
>> [   38.247091] mmc2: sdhci: Timeout:   0x000e | Int stat: 0x
>> [   38.247098] mmc2: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02fc000b
>> [   38.247105] mmc2: sdhci: AC12 err:  0x | Slot int: 0x
>> [   38.247112] mmc2: sdhci: Caps:  0x61ff30b0 | Caps_1:   0x
>> [   38.247119] mmc2: sdhci: Cmd:   0x353a | Max curr: 0x0001
>> [   38.247126] mmc2: sdhci: Resp[0]:   0x1800 | Resp[1]:  0x08002db5
>> [   38.247133] mmc2: sdhci: Resp[2]:   0x16da8000 | Resp[3]:  0x0400
>> [   38.247139] mmc2: sdhci: Host ctl2: 0x
>> [   38.247146] mmc2: sdhci: ADMA Err:  0x | ADMA Ptr: 0x17c47200
>> [   38.247152] mmc2: sdhci: 
>> [   38.247250] brcmfmac: brcmf_sdio_readframes: read 520 bytes from channel 1
>> failed: -84
>> [   38.247274] brcmfmac: brcmf_sdio_rxfail: abort command, terminate frame, 
>> send NAK
>> [   40.807019] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
>> [   40.807042] brcmfmac: brcmf_notify_escan_complete: Scan abort failed
>> [   48.487007] mmc2: Timeout waiting for hardware interrupt.
>> [   48.487057] mmc2: sdhci:  SDHCI REGISTER DUMP ===
>> [   48.487096] mmc2: sdhci: Sys addr:  0x | Version:  0x0001
>> [   48.487128] mmc2: sdhci: Blk size:  0x7040 | Blk cnt:  0x0001
>> [   48.487160] mmc2: sdhci: Argument:  0x2140 | Trn mode: 0x0013
>> [   48.487191] mmc2: sdhci: Present:   0x01d7 | Host ctl: 0x0013
>> [   48.487221] mmc2: sdhci: Power: 0x0001 | Blk gap:  0x
>> [   48.487251] mmc2: sdhci: Wake-up:   0x | Clock:0x0007
>> [   48.487281] mmc2: sdhci: Timeout:   0x000e | Int stat: 0x
>> [   48.487313] mmc2: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02fc000b
>> [   48.487343] mmc2: sdhci: AC12 err:  0x | Slot int: 0x
>> [   48.487374] mmc2: sdhci: Caps:  0x61ff30b0 | Caps_1:   0x
>> [   48.487404] mmc2: sdhci: Cmd:   0x353a | Max curr: 0x0001
>> [   48.487435] mmc2: sdhci: Resp[0]:   0x1000 | Resp[1]:  0x08002db5
>> [   48.487466] mmc2: sdhci: Resp[2]:   0x16da8000 | Resp[3]:  0x0400
>> [   48.487493] mmc2: sdhci: Host ctl2: 0x
>> [   48.487525] mmc2: sdhci: ADMA Err:  0x | ADMA Ptr: 0x17c47200
>> [   48.487552] mmc2: sdhci: 
>> [   48.487749] brcmfmac: brcmf_sdio_readframes: read 480 bytes from channel 1
>> failed: -84
>> [   48.487822] brcmfmac: brcmf_sdio_rxfail: abort command, terminate frame, 
>> send NAK
>>
>>
>> Soft lockup issue
>> 
>>
>> # 

Re: [PATCH v3 3/3] block: Fix a race between request queue removal and the block cgroup controller

2018-02-22 Thread Bart Van Assche
On Thu, 2018-02-22 at 10:25 +0800, Joseph Qi wrote:
> I notice that several devices such as loop and zram will call
> blk_cleanup_queue before del_gendisk, so it will hit this warning. Is
> this normal?

Hello Joseph,

Since the disk object has a reference to the queue I agree with Ming that it's
wrong to call blk_cleanup_queue() before del_gendisk(). Anyway, I will verify
whether there are any block drivers in which these calls have to be swapped.

Bart.

Re: disk-io lockup in 4.14.13 kernel

2018-02-22 Thread Bart Van Assche

On 02/22/18 02:58, Jaco Kroon wrote:

We've been seeing sporadic IO lockups on recent kernels.


Are you using the legacy I/O stack or blk-mq? If you are not yet using 
blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is 
reproducible with blk-mq, can you share the output of the following command:


(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)

Thanks,

Bart.


Re: [PATCH 01/20] lightnvm: simplify geometry structure.

2018-02-22 Thread Javier Gonzalez
> On 22 Feb 2018, at 13.22, Matias Bjørling  wrote:
> 
> On 02/22/2018 08:44 AM, Javier Gonzalez wrote:
>>> On 22 Feb 2018, at 08.25, Matias Bjørling  wrote:
>>> 
>>> On 02/21/2018 10:26 AM, Javier González wrote:
 Currently, the device geometry is stored redundantly in the nvm_id and
 nvm_geo structures at a device level. Moreover, when instantiating
 targets on a specific number of LUNs, these structures are replicated
 and manually modified to fit the instance channel and LUN partitioning.
 Instead, create a generic geometry around two base structures:
 nvm_dev_geo, which describes the geometry of the whole device and
 nvm_geo, which describes the geometry of the instance. Since these share
 a big part of the geometry, create a nvm_common_geo structure that keeps
 the static geometry values that are shared across instances.
 As we introduce support for 2.0, these structures allow to abstract
 spec. specific values and present a common geometry to targets.
 Signed-off-by: Javier González 
 ---
  drivers/lightnvm/core.c  | 137 +++-
  drivers/lightnvm/pblk-core.c |  16 +-
  drivers/lightnvm/pblk-gc.c   |   2 +-
  drivers/lightnvm/pblk-init.c | 123 +++---
  drivers/lightnvm/pblk-read.c |   2 +-
  drivers/lightnvm/pblk-recovery.c |  14 +-
  drivers/lightnvm/pblk-rl.c   |   2 +-
  drivers/lightnvm/pblk-sysfs.c|  39 +++--
  drivers/lightnvm/pblk-write.c|   2 +-
  drivers/lightnvm/pblk.h  |  93 +--
  drivers/nvme/host/lightnvm.c | 339 
 +++
  include/linux/lightnvm.h | 204 ---
  12 files changed, 514 insertions(+), 459 deletions(-)
 diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
 index 689c97b97775..42596afdf64c 100644
 --- a/drivers/lightnvm/core.c
 +++ b/drivers/lightnvm/core.c
 @@ -111,6 +111,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, 
 int lun_begin,
  static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
  {
struct nvm_dev *dev = tgt_dev->parent;
 +  struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_dev_map *dev_map = tgt_dev->map;
int i, j;
  @@ -122,7 +123,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev 
 *tgt_dev, int clear)
if (clear) {
for (j = 0; j < ch_map->nr_luns; j++) {
int lun = j + lun_offs[j];
 -  int lunid = (ch * dev->geo.nr_luns) + lun;
 +  int lunid = (ch * dev_geo->nr_luns) + lun;
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
 @@ -143,19 +144,20 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
 nvm_dev *dev,
  u16 lun_begin, u16 lun_end,
  u16 op)
  {
 +  struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_tgt_dev *tgt_dev = NULL;
struct nvm_dev_map *dev_rmap = dev->rmap;
struct nvm_dev_map *dev_map;
struct ppa_addr *luns;
int nr_luns = lun_end - lun_begin + 1;
int luns_left = nr_luns;
 -  int nr_chnls = nr_luns / dev->geo.nr_luns;
 -  int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
 -  int bch = lun_begin / dev->geo.nr_luns;
 -  int blun = lun_begin % dev->geo.nr_luns;
 +  int nr_chnls = nr_luns / dev_geo->nr_luns;
 +  int nr_chnls_mod = nr_luns % dev_geo->nr_luns;
 +  int bch = lun_begin / dev_geo->nr_luns;
 +  int blun = lun_begin % dev_geo->nr_luns;
int lunid = 0;
int lun_balanced = 1;
 -  int prev_nr_luns;
 +  int sec_per_lun, prev_nr_luns;
int i, j;
nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
 @@ -173,15 +175,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
 nvm_dev *dev,
if (!luns)
goto err_luns;
  - prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
 -  dev->geo.nr_luns : luns_left;
 +  prev_nr_luns = (luns_left > dev_geo->nr_luns) ?
 +  dev_geo->nr_luns : luns_left;
for (i = 0; i < nr_chnls; i++) {
struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs;
 -  int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
 -  dev->geo.nr_luns : luns_left;
 +  int luns_in_chnl = (luns_left > dev_geo->nr_luns) ?
 +  

Re: how can one drain MQ request queue ?

2018-02-22 Thread Ming Lei
On Thu, Feb 22, 2018 at 09:10:26PM +0800, Ming Lei wrote:
> On Thu, Feb 22, 2018 at 12:56:05PM +0200, Max Gurtovoy wrote:
> > 
> > 
> > On 2/22/2018 4:59 AM, Ming Lei wrote:
> > > Hi Max,
> > 
> > Hi Ming,
> > 
> > > 
> > > On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> > > > hi all,
> > > > is there a way to drain a blk-mq based request queue (similar to
> > > > blk_drain_queue for non MQ) ?
> > > 
> > > Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
> > > based request queue, but it may not work well when the hardware is broken.
> > 
> > I tried that, but the path failover takes ~cmd_timeout seconds and this is
> > not good enough...
> 
> Yeah, I agree it isn't good for handling timeout.
> 
> > 
> > > 
> > > > 
> > > > I try to fix the following situation:
> > > > Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
> > > > ports during traffic using fio and making sure the traffic never fails.
> > > > 
> > > > when the switch port goes down the initiator driver start an error 
> > > > recovery
> > > 
> > > What is the code you are referring to?
> > 
> > from nvme_rdma driver:
> > 
> > static void nvme_rdma_error_recovery_work(struct work_struct *work)
> > {
> > struct nvme_rdma_ctrl *ctrl = container_of(work,
> > struct nvme_rdma_ctrl, err_work);
> > 
> > nvme_stop_keep_alive(&ctrl->ctrl);
> > 
> > if (ctrl->ctrl.queue_count > 1) {
> > nvme_stop_queues(&ctrl->ctrl);
> > blk_mq_tagset_busy_iter(&ctrl->tag_set,
> > nvme_cancel_request, &ctrl->ctrl);
> > nvme_rdma_destroy_io_queues(ctrl, false);
> > }
> > 
> > blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> > blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> > nvme_cancel_request, &ctrl->ctrl);
> > nvme_rdma_destroy_admin_queue(ctrl, false);
> 
> I am not sure if it is good to destroy admin queue here since
> nvme_rdma_configure_admin_queue() need to use admin queue, and I saw
> there is report of 'nvme nvme0: Identify namespace failed' in Red Hat
> BZ.
> 
> > 
> > /*
> >  * queues are not a live anymore, so restart the queues to fail fast
> >  * new IO
> >  */
> > blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> > nvme_start_queues(&ctrl->ctrl);
> > 
> > if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> > /* state change failure should never happen */
> > WARN_ON_ONCE(1);
> > return;
> > }
> > 
> > nvme_rdma_reconnect_or_remove(ctrl);
> > }
> > 
> > 
> > > 
> > > > process
> > > > - blk_mq_quiesce_queue for each namespace request queue
> > > 
> > > blk_mq_quiesce_queue() only guarantees that no requests can be dispatched 
> > > to
> > > low level driver, and new requests still can be allocated, but can't be
> > > dispatched until the queue becomes unquiesced.
> > > 
> > > > - cancel all requests of the tagset using blk_mq_tagset_busy_iter
> > > 
> > > Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
> > > requests, and it depends on implementation of the busy_tag_iter_fn, and
> > > timed-out request can't be covered by blk_mq_tagset_busy_iter().
> > 
> > How can we deal with timed-out commands ?
> 
> For PCI NVMe, they are handled by requeuing, just like all canceled
> in-flight commands, and all these commands will be dispatched to driver
> again after reset is done successfully.
> 
> > 
> > 
> > > 
> > > So blk_mq_tagset_busy_iter() is often used in error recovery path, such
> > > as nvme_dev_disable(), which is usually used in resetting PCIe NVMe 
> > > controller.
> > > 
> > > > - destroy the QPs/RDMA connections and MR pools
> > > > - blk_mq_unquiesce_queue for each namespace request queue
> > > > - reconnect to the target (after creating RDMA resources again)
> > > > 
> > > > During the QP destruction, I see a warning that not all the memory 
> > > > regions
> > > > were back to the mr_pool. For every request we get from the block layer
> > > > (well, almost every request) we get a MR from the MR pool.
> > > > So what I see is that, depends on the timing, some requests are
> > > > dispatched/completed after we blk_mq_unquiesce_queue and after we 
> > > > destroy
> > > > the QP and the MR pool. Probably these request were inserted during
> > > > quiescing,
> > > 
> > > Yes.
> > 
> > maybe we need to update the nvmf_check_init_req to check that the ctrl is in
> > NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think about it
> > and test it.
> > 
> > > 
> > > > and I want to flush/drain them before I destroy the QP.
> > > 
> > > As mentioned above, you can't do that by blk_mq_quiesce_queue() &
> > > blk_mq_tagset_busy_iter().
> > > 
> > > The PCIe NVMe driver takes two steps for the error recovery: 
> > > nvme_dev_disable() &
> > > nvme_reset_work(), and you may consider the 

Re: how can one drain MQ request queue ?

2018-02-22 Thread Ming Lei
On Thu, Feb 22, 2018 at 12:56:05PM +0200, Max Gurtovoy wrote:
> 
> 
> On 2/22/2018 4:59 AM, Ming Lei wrote:
> > Hi Max,
> 
> Hi Ming,
> 
> > 
> > On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> > > hi all,
> > > is there a way to drain a blk-mq based request queue (similar to
> > > blk_drain_queue for non MQ) ?
> > 
> > Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
> > based request queue, but it may not work well when the hardware is broken.
> 
> I tried that, but the path failover takes ~cmd_timeout seconds and this is
> not good enough...

Yeah, I agree it isn't good for handling timeout.

> 
> > 
> > > 
> > > I try to fix the following situation:
> > > Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
> > > ports during traffic using fio and making sure the traffic never fails.
> > > 
> > > when the switch port goes down the initiator driver start an error 
> > > recovery
> > 
> > What is the code you are referring to?
> 
> from nvme_rdma driver:
> 
> static void nvme_rdma_error_recovery_work(struct work_struct *work)
> {
> struct nvme_rdma_ctrl *ctrl = container_of(work,
> struct nvme_rdma_ctrl, err_work);
> 
> nvme_stop_keep_alive(&ctrl->ctrl);
> 
> if (ctrl->ctrl.queue_count > 1) {
> nvme_stop_queues(&ctrl->ctrl);
> blk_mq_tagset_busy_iter(&ctrl->tag_set,
> nvme_cancel_request, &ctrl->ctrl);
> nvme_rdma_destroy_io_queues(ctrl, false);
> }
> 
> blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> nvme_cancel_request, &ctrl->ctrl);
> nvme_rdma_destroy_admin_queue(ctrl, false);

I am not sure if it is good to destroy admin queue here since
nvme_rdma_configure_admin_queue() need to use admin queue, and I saw
there is report of 'nvme nvme0: Identify namespace failed' in Red Hat
BZ.

> 
> /*
>  * queues are not a live anymore, so restart the queues to fail fast
>  * new IO
>  */
> blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> nvme_start_queues(&ctrl->ctrl);
> 
> if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> /* state change failure should never happen */
> WARN_ON_ONCE(1);
> return;
> }
> 
> nvme_rdma_reconnect_or_remove(ctrl);
> }
> 
> 
> > 
> > > process
> > > - blk_mq_quiesce_queue for each namespace request queue
> > 
> > blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
> > low level driver, and new requests still can be allocated, but can't be
> > dispatched until the queue becomes unquiesced.
> > 
> > > - cancel all requests of the tagset using blk_mq_tagset_busy_iter
> > 
> > Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
> > requests, and it depends on implementation of the busy_tag_iter_fn, and
> > timed-out request can't be covered by blk_mq_tagset_busy_iter().
> 
> How can we deal with timed-out commands ?

For PCI NVMe, they are handled by requeuing, just like all canceled
in-flight commands, and all these commands will be dispatched to driver
again after reset is done successfully.

> 
> 
> > 
> > So blk_mq_tagset_busy_iter() is often used in error recovery path, such
> > as nvme_dev_disable(), which is usually used in resetting PCIe NVMe 
> > controller.
> > 
> > > - destroy the QPs/RDMA connections and MR pools
> > > - blk_mq_unquiesce_queue for each namespace request queue
> > > - reconnect to the target (after creating RDMA resources again)
> > > 
> > > During the QP destruction, I see a warning that not all the memory regions
> > > were back to the mr_pool. For every request we get from the block layer
> > > (well, almost every request) we get a MR from the MR pool.
> > > So what I see is that, depends on the timing, some requests are
> > > dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
> > > the QP and the MR pool. Probably these request were inserted during
> > > quiescing,
> > 
> > Yes.
> 
> maybe we need to update the nvmf_check_init_req to check that the ctrl is in
> NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think about it
> and test it.
> 
> > 
> > > and I want to flush/drain them before I destroy the QP.
> > 
> > As mentioned above, you can't do that by blk_mq_quiesce_queue() &
> > blk_mq_tagset_busy_iter().
> > 
> > The PCIe NVMe driver takes two steps for the error recovery: 
> > nvme_dev_disable() &
> > nvme_reset_work(), and you may consider the similar approach, but the 
> > in-flight
> > requests won't be drained in this case because they can be requeued.
> > 
> > Could you explain a bit what your exact problem is?
> 
> The problem is that I assign an MR from QP mr_pool for each call to
> nvme_rdma_queue_rq. During the error recovery I destroy the QP and the
> mr_pool *but* some MR's are 

Re: [PATCH 01/20] lightnvm: simplify geometry structure.

2018-02-22 Thread Matias Bjørling

On 02/22/2018 08:44 AM, Javier Gonzalez wrote:



On 22 Feb 2018, at 08.25, Matias Bjørling  wrote:

On 02/21/2018 10:26 AM, Javier González wrote:

Currently, the device geometry is stored redundantly in the nvm_id and
nvm_geo structures at a device level. Moreover, when instantiating
targets on a specific number of LUNs, these structures are replicated
and manually modified to fit the instance channel and LUN partitioning.
Instead, create a generic geometry around two base structures:
nvm_dev_geo, which describes the geometry of the whole device and
nvm_geo, which describes the geometry of the instance. Since these share
a big part of the geometry, create a nvm_common_geo structure that keeps
the static geometry values that are shared across instances.
As we introduce support for 2.0, these structures allow to abstract
spec. specific values and present a common geometry to targets.
Signed-off-by: Javier González 
---
  drivers/lightnvm/core.c  | 137 +++-
  drivers/lightnvm/pblk-core.c |  16 +-
  drivers/lightnvm/pblk-gc.c   |   2 +-
  drivers/lightnvm/pblk-init.c | 123 +++---
  drivers/lightnvm/pblk-read.c |   2 +-
  drivers/lightnvm/pblk-recovery.c |  14 +-
  drivers/lightnvm/pblk-rl.c   |   2 +-
  drivers/lightnvm/pblk-sysfs.c|  39 +++--
  drivers/lightnvm/pblk-write.c|   2 +-
  drivers/lightnvm/pblk.h  |  93 +--
  drivers/nvme/host/lightnvm.c | 339 +++
  include/linux/lightnvm.h | 204 ---
  12 files changed, 514 insertions(+), 459 deletions(-)
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 689c97b97775..42596afdf64c 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -111,6 +111,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int 
lun_begin,
  static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
  {
struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_dev_map *dev_map = tgt_dev->map;
int i, j;
  @@ -122,7 +123,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev 
*tgt_dev, int clear)
if (clear) {
for (j = 0; j < ch_map->nr_luns; j++) {
int lun = j + lun_offs[j];
-   int lunid = (ch * dev->geo.nr_luns) + lun;
+   int lunid = (ch * dev_geo->nr_luns) + lun;
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
@@ -143,19 +144,20 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
  u16 lun_begin, u16 lun_end,
  u16 op)
  {
+   struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_tgt_dev *tgt_dev = NULL;
struct nvm_dev_map *dev_rmap = dev->rmap;
struct nvm_dev_map *dev_map;
struct ppa_addr *luns;
int nr_luns = lun_end - lun_begin + 1;
int luns_left = nr_luns;
-   int nr_chnls = nr_luns / dev->geo.nr_luns;
-   int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
-   int bch = lun_begin / dev->geo.nr_luns;
-   int blun = lun_begin % dev->geo.nr_luns;
+   int nr_chnls = nr_luns / dev_geo->nr_luns;
+   int nr_chnls_mod = nr_luns % dev_geo->nr_luns;
+   int bch = lun_begin / dev_geo->nr_luns;
+   int blun = lun_begin % dev_geo->nr_luns;
int lunid = 0;
int lun_balanced = 1;
-   int prev_nr_luns;
+   int sec_per_lun, prev_nr_luns;
int i, j;
nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
@@ -173,15 +175,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!luns)
goto err_luns;
  - prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   prev_nr_luns = (luns_left > dev_geo->nr_luns) ?
+   dev_geo->nr_luns : luns_left;
for (i = 0; i < nr_chnls; i++) {
struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs;
-   int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   int luns_in_chnl = (luns_left > dev_geo->nr_luns) ?
+   dev_geo->nr_luns : luns_left;
if (lun_balanced && prev_nr_luns != luns_in_chnl)
lun_balanced = 0;
@@ -215,18 +217,22 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!tgt_dev)
goto err_ch;
  - memcpy(&tgt_dev->geo, 

Re: [PATCH 03/20] lightnvm: fix capabilities for 2.0 sysfs

2018-02-22 Thread Matias Bjørling

On 02/22/2018 11:25 AM, Javier Gonzalez wrote:




On 22 Feb 2018, at 10.39, Matias Bjørling  wrote:

On 02/22/2018 08:47 AM, Javier Gonzalez wrote:

On 22 Feb 2018, at 08.28, Matias Bjørling  wrote:


On 02/21/2018 10:26 AM, Javier González wrote:
Both 1.2 and 2.0 specs define a field for media and controller
capabilities. Also, 1.2 defines a separate field dedicated to device
capabilities.
In 2.0 sysfs, these values have been mixed. Revert them to the right
value.
Signed-off-by: Javier González 
---
  drivers/nvme/host/lightnvm.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 969bb874850c..598abba66f52 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -914,8 +914,8 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
if (strcmp(attr->name, "version") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->ver_id);
-} else if (strcmp(attr->name, "capabilities") == 0) {
-return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
+} else if (strcmp(attr->name, "media_capabilities") == 0) {
+return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mccap);
  } else if (strcmp(attr->name, "read_typ") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.trdt);
  } else if (strcmp(attr->name, "read_max") == 0) {
@@ -993,8 +993,8 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbem);
  } else if (strcmp(attr->name, "multiplane_modes") == 0) {
  return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mpos);
-} else if (strcmp(attr->name, "media_capabilities") == 0) {
-return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mccap);
+} else if (strcmp(attr->name, "capabilities") == 0) {
+return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.cap);
  } else if (strcmp(attr->name, "max_phys_secs") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", NVM_MAX_VLBA);
  } else {
@@ -1055,7 +1055,7 @@ static ssize_t nvm_dev_attr_show_20(struct device *dev,
/* general attributes */
  static NVM_DEV_ATTR_RO(version);
-static NVM_DEV_ATTR_RO(capabilities);
+static NVM_DEV_ATTR_RO(media_capabilities);
static NVM_DEV_ATTR_RO(read_typ);
  static NVM_DEV_ATTR_RO(read_max);
@@ -1080,12 +1080,12 @@ static NVM_DEV_ATTR_12_RO(prog_max);
  static NVM_DEV_ATTR_12_RO(erase_typ);
  static NVM_DEV_ATTR_12_RO(erase_max);
  static NVM_DEV_ATTR_12_RO(multiplane_modes);
-static NVM_DEV_ATTR_12_RO(media_capabilities);
+static NVM_DEV_ATTR_12_RO(capabilities);
  static NVM_DEV_ATTR_12_RO(max_phys_secs);
static struct attribute *nvm_dev_attrs_12[] = {
  &dev_attr_version.attr,
-&dev_attr_capabilities.attr,
+&dev_attr_media_capabilities.attr,
&dev_attr_vendor_opcode.attr,
  &dev_attr_device_mode.attr,
@@ -1108,7 +1108,7 @@ static struct attribute *nvm_dev_attrs_12[] = {
  &dev_attr_erase_typ.attr,
  &dev_attr_erase_max.attr,
  &dev_attr_multiplane_modes.attr,
-&dev_attr_media_capabilities.attr,
+&dev_attr_capabilities.attr,
  &dev_attr_max_phys_secs.attr,
NULL,
@@ -1134,7 +1134,7 @@ static NVM_DEV_ATTR_20_RO(reset_max);
static struct attribute *nvm_dev_attrs_20[] = {
  &dev_attr_version.attr,
-&dev_attr_capabilities.attr,
+&dev_attr_media_capabilities.attr,
&dev_attr_groups.attr,
  &dev_attr_punits.attr,


With the mccap changes, it should make sense to keep the capabilities
as is.

The change adds mccap, but sysfs points to cap, which is wrong. This
patch is needed. Otherwise, we change the name of mccap to cap, which
is _very_ confusing to people familiar to both specs. We can change
the name of mccap to cap in a future spec revision.
Javier


Think of the sysfs capabilities as an abstract value that defines generic 
capabilities. It is not directly tied to either 1.2 or 2.0.


I’m thinking about the user looking at sysfs and at the spec at the same time - 
I myself get confused when names don’t match.

Anyway, I’ll keep it the way it was and add a comment for clarification. Would 
that work for you?



That works. One may also wait until it is going to be used, e.g., when 
vector copy is implemented. Then it's natural to revisit.


Re: how can one drain MQ request queue ?

2018-02-22 Thread Max Gurtovoy



On 2/22/2018 4:59 AM, Ming Lei wrote:

Hi Max,


Hi Ming,



On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:

hi all,
is there a way to drain a blk-mq based request queue (similar to
blk_drain_queue for non MQ) ?


Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
based request queue, but it may not work well when the hardware is broken.


I tried that, but the path failover takes ~cmd_timeout seconds and this 
is not good enough...






I try to fix the following situation:
Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
ports during traffic using fio and making sure the traffic never fails.

when the switch port goes down the initiator driver start an error recovery


What is the code you are referring to?


from nvme_rdma driver:

static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, err_work);

nvme_stop_keep_alive(&ctrl->ctrl);

if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
blk_mq_tagset_busy_iter(&ctrl->tag_set,
nvme_cancel_request, &ctrl->ctrl);
nvme_rdma_destroy_io_queues(ctrl, false);
}

blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
nvme_cancel_request, &ctrl->ctrl);
nvme_rdma_destroy_admin_queue(ctrl, false);

/*
 * queues are not a live anymore, so restart the queues to fail fast
 * new IO
 */
blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
nvme_start_queues(&ctrl->ctrl);

if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
/* state change failure should never happen */
WARN_ON_ONCE(1);
return;
}

nvme_rdma_reconnect_or_remove(ctrl);
}





process
- blk_mq_quiesce_queue for each namespace request queue


blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
low level driver, and new requests still can be allocated, but can't be
dispatched until the queue becomes unquiesced.


- cancel all requests of the tagset using blk_mq_tagset_busy_iter


Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
requests, and it depends on implementation of the busy_tag_iter_fn, and
timed-out request can't be covered by blk_mq_tagset_busy_iter().


How can we deal with timed-out commands ?




So blk_mq_tagset_busy_iter() is often used in error recovery path, such
as nvme_dev_disable(), which is usually used in resetting PCIe NVMe controller.


- destroy the QPs/RDMA connections and MR pools
- blk_mq_unquiesce_queue for each namespace request queue
- reconnect to the target (after creating RDMA resources again)

During the QP destruction, I see a warning that not all the memory regions
were back to the mr_pool. For every request we get from the block layer
(well, almost every request) we get a MR from the MR pool.
So what I see is that, depends on the timing, some requests are
dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
the QP and the MR pool. Probably these request were inserted during
quiescing,


Yes.


maybe we need to update the nvmf_check_init_req to check that the ctrl 
is in NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think 
about it and test it.





and I want to flush/drain them before I destroy the QP.


As mentioned above, you can't do that by blk_mq_quiesce_queue() &
blk_mq_tagset_busy_iter().

The PCIe NVMe driver takes two steps for the error recovery: nvme_dev_disable() 
&
nvme_reset_work(), and you may consider the similar approach, but the in-flight
requests won't be drained in this case because they can be requeued.

Could you explain a bit what your exact problem is?


The problem is that I assign an MR from QP mr_pool for each call to 
nvme_rdma_queue_rq. During the error recovery I destroy the QP and the 
mr_pool *but* some MR's are missing and not returned to the pool.





Thanks,
Ming



Thanks,
Max.



Re: [PATCH 03/20] lightnvm: fix capabilities for 2.0 sysfs

2018-02-22 Thread Javier Gonzalez


> On 22 Feb 2018, at 10.39, Matias Bjørling  wrote:
> 
> On 02/22/2018 08:47 AM, Javier Gonzalez wrote:
>>> On 22 Feb 2018, at 08.28, Matias Bjørling  wrote:
>>> 
 On 02/21/2018 10:26 AM, Javier González wrote:
 Both 1.2 and 2.0 specs define a field for media and controller
 capabilities. Also, 1.2 defines a separate field dedicated to device
 capabilities.
 In 2.0 sysfs, these values have been mixed. Revert them to the right
 value.
 Signed-off-by: Javier González 
 ---
  drivers/nvme/host/lightnvm.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)
 diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
 index 969bb874850c..598abba66f52 100644
 --- a/drivers/nvme/host/lightnvm.c
 +++ b/drivers/nvme/host/lightnvm.c
 @@ -914,8 +914,8 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
if (strcmp(attr->name, "version") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->ver_id);
 -} else if (strcmp(attr->name, "capabilities") == 0) {
 -return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
 +} else if (strcmp(attr->name, "media_capabilities") == 0) {
 +return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mccap);
  } else if (strcmp(attr->name, "read_typ") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.trdt);
  } else if (strcmp(attr->name, "read_max") == 0) {
 @@ -993,8 +993,8 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
  return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbem);
  } else if (strcmp(attr->name, "multiplane_modes") == 0) {
  return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mpos);
 -} else if (strcmp(attr->name, "media_capabilities") == 0) {
 -return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mccap);
 +} else if (strcmp(attr->name, "capabilities") == 0) {
 +return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.cap);
  } else if (strcmp(attr->name, "max_phys_secs") == 0) {
  return scnprintf(page, PAGE_SIZE, "%u\n", NVM_MAX_VLBA);
  } else {
 @@ -1055,7 +1055,7 @@ static ssize_t nvm_dev_attr_show_20(struct device 
 *dev,
/* general attributes */
  static NVM_DEV_ATTR_RO(version);
 -static NVM_DEV_ATTR_RO(capabilities);
 +static NVM_DEV_ATTR_RO(media_capabilities);
static NVM_DEV_ATTR_RO(read_typ);
  static NVM_DEV_ATTR_RO(read_max);
 @@ -1080,12 +1080,12 @@ static NVM_DEV_ATTR_12_RO(prog_max);
  static NVM_DEV_ATTR_12_RO(erase_typ);
  static NVM_DEV_ATTR_12_RO(erase_max);
  static NVM_DEV_ATTR_12_RO(multiplane_modes);
 -static NVM_DEV_ATTR_12_RO(media_capabilities);
 +static NVM_DEV_ATTR_12_RO(capabilities);
  static NVM_DEV_ATTR_12_RO(max_phys_secs);
static struct attribute *nvm_dev_attrs_12[] = {
  &dev_attr_version.attr,
 -&dev_attr_capabilities.attr,
 +&dev_attr_media_capabilities.attr,
  &dev_attr_vendor_opcode.attr,
  &dev_attr_device_mode.attr,
 @@ -1108,7 +1108,7 @@ static struct attribute *nvm_dev_attrs_12[] = {
  &dev_attr_erase_typ.attr,
  &dev_attr_erase_max.attr,
  &dev_attr_multiplane_modes.attr,
 -&dev_attr_media_capabilities.attr,
 +&dev_attr_capabilities.attr,
  &dev_attr_max_phys_secs.attr,
NULL,
 @@ -1134,7 +1134,7 @@ static NVM_DEV_ATTR_20_RO(reset_max);
static struct attribute *nvm_dev_attrs_20[] = {
  &dev_attr_version.attr,
 -&dev_attr_capabilities.attr,
 +&dev_attr_media_capabilities.attr,
  &dev_attr_groups.attr,
  &dev_attr_punits.attr,
>>> 
>>> With the mccap changes, it should make sense to keep the capabilities
>>> as is.
>> The change adds mccap, but sysfs points to cap, which is wrong. This
>> patch is needed. Otherwise, we change the name of mccap to cap, which
>> is _very_ confusing to people familiar with both specs. We can change
>> the name of mccap to cap in a future spec revision.
>> Javier
> 
> Think of the sysfs capabilities as an abstract value that defines generic 
> capabilities. It is not directly tied to either 1.2 or 2.0.

I’m thinking about the user looking at sysfs and at the spec at the same time - 
I myself get confused when names don’t match. 

Anyway, I’ll keep it the way it was and add a comment for clarification. Would 
that work for you?

Javier 

Re: [PATCH 08/20] lightnvm: complete geo structure with maxoc*

2018-02-22 Thread Javier Gonzalez

Javier

> On 22 Feb 2018, at 11.00, Matias Bjørling  wrote:
> 
> On 02/22/2018 10:52 AM, Javier Gonzalez wrote:
>>> On 22 Feb 2018, at 10.45, Matias Bjørling  wrote:
>>> 
>>> On 02/22/2018 08:55 AM, Javier Gonzalez wrote:
> On 22 Feb 2018, at 08.45, Matias Bjørling  wrote:
> 
> On 02/21/2018 10:26 AM, Javier González wrote:
>> Complete the generic geometry structure with the maxoc and maxocpu
>> fields, present in the 2.0 spec.
>> Signed-off-by: Javier González 
>> ---
>>  drivers/nvme/host/lightnvm.c | 4 
>>  include/linux/lightnvm.h | 2 ++
>>  2 files changed, 6 insertions(+)
>> diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
>> index cca32da05316..9c1f8225c4e1 100644
>> --- a/drivers/nvme/host/lightnvm.c
>> +++ b/drivers/nvme/host/lightnvm.c
>> @@ -318,6 +318,8 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 
>> *id,
>>  dev_geo->c.ws_min = sec_per_pg;
>>  dev_geo->c.ws_opt = sec_per_pg;
>>  dev_geo->c.mw_cunits = 8;   /* default to MLC safe 
>> values */
>> +dev_geo->c.maxoc = dev_geo->all_luns;   /* default to 1 chunk 
>> per LUN */
>> +dev_geo->c.maxocpu = 1; /* default to 1 chunk 
>> per LUN */
> 
> One can't assume that it is 1 open chunk per lun. If you need this for 
> specific hardware, make a quirk for it.
 Which default do you want for 1.2 if not specified, then? I use 1 because it
 has been the implicit default until now.
>>> 
>>> INT_MAX, since it then allows the maximum of open chunks. It cannot be 
>>> assumed that other 1.2 devices are limited to a single open chunk.
>> So you want the default to be that all blocks on the device can be
>> opened at the same time. Interesting... I guess that such an SSD will
>> have an AA battery attached to it, but fine by me if that's how you want
>> it.
> 
> I feel you're a bit sarcastic here. One may think of SLC and other memories 
> that do one-shot programming. In that case no caching is needed, and 
> therefore power-caps can be limited on the hardware.

Sure. Hope people move to 2.0 then for all >SLC memories out there,
otherwise we'll see a lot of quirks coming in. Thought we wanted to be
generic.

> 
>> Assuming this, can we instead set it to the reported number of chunks,
>> since this is the hard limit anyway.
> 
> Works for me.

Cool. I'll add this to the next version.

Javier
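
For reference, the agreed 1.2 fallback could look roughly like the sketch below. The field and parameter names (maxoc, maxocpu, all_luns, chks_per_lun) mirror the discussion and are not necessarily what the final patch uses:

```c
/* Sketch of the 1.2 fallback agreed above: the 1.2 spec reports no
 * open-chunk limits, so default to the hard bound -- every chunk on the
 * device (and within a LUN) may be open -- rather than assuming one
 * open chunk per LUN. */
struct geo_caps {
	int maxoc;    /* max open chunks, device wide */
	int maxocpu;  /* max open chunks per parallel unit (LUN) */
};

static void nvm_12_default_maxoc(struct geo_caps *c, int all_luns,
				 int chks_per_lun)
{
	c->maxoc = all_luns * chks_per_lun; /* reported number of chunks */
	c->maxocpu = chks_per_lun;
}
```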




Re: [PATCH 08/20] lightnvm: complete geo structure with maxoc*

2018-02-22 Thread Matias Bjørling

On 02/22/2018 10:52 AM, Javier Gonzalez wrote:

On 22 Feb 2018, at 10.45, Matias Bjørling  wrote:

On 02/22/2018 08:55 AM, Javier Gonzalez wrote:

On 22 Feb 2018, at 08.45, Matias Bjørling  wrote:

On 02/21/2018 10:26 AM, Javier González wrote:

Complete the generic geometry structure with the maxoc and maxocpu
fields, present in the 2.0 spec.
Signed-off-by: Javier González 
---
  drivers/nvme/host/lightnvm.c | 4 
  include/linux/lightnvm.h | 2 ++
  2 files changed, 6 insertions(+)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index cca32da05316..9c1f8225c4e1 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -318,6 +318,8 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->c.ws_min = sec_per_pg;
dev_geo->c.ws_opt = sec_per_pg;
dev_geo->c.mw_cunits = 8;/* default to MLC safe values */
+   dev_geo->c.maxoc = dev_geo->all_luns; /* default to 1 chunk per LUN 
*/
+   dev_geo->c.maxocpu = 1;  /* default to 1 chunk per 
LUN */


One can't assume that it is 1 open chunk per lun. If you need this for specific 
hardware, make a quirk for it.

Which default do you want for 1.2 if not specified, then? I use 1 because it
has been the implicit default until now.


INT_MAX, since it then allows the maximum of open chunks. It cannot be assumed 
that other 1.2 devices are limited to a single open chunk.


So you want the default to be that all blocks on the device can be
opened at the same time. Interesting... I guess that such an SSD will
have an AA battery attached to it, but fine by me if that's how you want
it.


I feel you're a bit sarcastic here. One may think of SLC and other 
memories that do one-shot programming. In that case no caching is 
needed, and therefore power-caps can be limited on the hardware.




Assuming this, can we instead set it to the reported number of chunks,
since this is the hard limit anyway.



Works for me.


Re: [PATCH 08/20] lightnvm: complete geo structure with maxoc*

2018-02-22 Thread Javier Gonzalez
> On 22 Feb 2018, at 10.45, Matias Bjørling  wrote:
> 
> On 02/22/2018 08:55 AM, Javier Gonzalez wrote:
>>> On 22 Feb 2018, at 08.45, Matias Bjørling  wrote:
>>> 
>>> On 02/21/2018 10:26 AM, Javier González wrote:
 Complete the generic geometry structure with the maxoc and maxocpu
fields, present in the 2.0 spec.
 Signed-off-by: Javier González 
 ---
  drivers/nvme/host/lightnvm.c | 4 
  include/linux/lightnvm.h | 2 ++
  2 files changed, 6 insertions(+)
 diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
 index cca32da05316..9c1f8225c4e1 100644
 --- a/drivers/nvme/host/lightnvm.c
 +++ b/drivers/nvme/host/lightnvm.c
 @@ -318,6 +318,8 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->c.ws_min = sec_per_pg;
dev_geo->c.ws_opt = sec_per_pg;
dev_geo->c.mw_cunits = 8;   /* default to MLC safe values */
 +  dev_geo->c.maxoc = dev_geo->all_luns;   /* default to 1 chunk per LUN */
 +  dev_geo->c.maxocpu = 1; /* default to 1 chunk per LUN */
>>> 
>>> One can't assume that it is 1 open chunk per lun. If you need this for 
>>> specific hardware, make a quirk for it.
>> Which default do you want for 1.2 if not specified, then? I use 1 because it
>> has been the implicit default until now.
> 
> INT_MAX, since it then allows the maximum of open chunks. It cannot be 
> assumed that other 1.2 devices are limited to a single open chunk.

So you want the default to be that all blocks on the device can be
opened at the same time. Interesting... I guess that such an SSD will
have an AA battery attached to it, but fine by me if that's how you want
it.

Assuming this, can we instead set it to the reported number of chunks,
since this is the hard limit anyway.

Javier




Re: [PATCH 08/20] lightnvm: complete geo structure with maxoc*

2018-02-22 Thread Matias Bjørling

On 02/22/2018 08:55 AM, Javier Gonzalez wrote:

On 22 Feb 2018, at 08.45, Matias Bjørling  wrote:

On 02/21/2018 10:26 AM, Javier González wrote:

Complete the generic geometry structure with the maxoc and maxocpu
fields, present in the 2.0 spec.
Signed-off-by: Javier González 
---
  drivers/nvme/host/lightnvm.c | 4 
  include/linux/lightnvm.h | 2 ++
  2 files changed, 6 insertions(+)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index cca32da05316..9c1f8225c4e1 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -318,6 +318,8 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->c.ws_min = sec_per_pg;
dev_geo->c.ws_opt = sec_per_pg;
dev_geo->c.mw_cunits = 8;/* default to MLC safe values */
+   dev_geo->c.maxoc = dev_geo->all_luns; /* default to 1 chunk per LUN 
*/
+   dev_geo->c.maxocpu = 1;  /* default to 1 chunk per 
LUN */


One can't assume that it is 1 open chunk per lun. If you need this for specific 
hardware, make a quirk for it.



Which default do you want for 1.2 if not specified, then? I use 1 because it
has been the implicit default until now.


INT_MAX, since it then allows the maximum of open chunks. It cannot be 
assumed that other 1.2 devices are limited to a single open chunk.




Javier





Re: [PATCH 09/20] lightnvm: use generic identify structure

2018-02-22 Thread Matias Bjørling

On 02/22/2018 08:49 AM, Javier González wrote:

On 22 Feb 2018, at 08.47, Matias Bjørling  wrote:

On 02/21/2018 10:26 AM, Javier González wrote:

Create a generic identify structure to collect the identify information
before knowing the spec. version. This forces different version paths to
cast the structure to their spec structure, thus making the code less
error prone and more maintainable.
Signed-off-by: Javier González 
---
  drivers/nvme/host/lightnvm.c | 32 
  1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 9c1f8225c4e1..70dc4740f0d3 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -170,6 +170,12 @@ struct nvme_nvm_id12 {
__u8resv2[2880];
  } __packed;
  +/* Generic identification structure */
+struct nvme_nvm_id {
+   __u8ver_id;
+   __u8resv[4095];
+} __packed;
+
  struct nvme_nvm_bb_tbl {
__u8tblid[4];
__le16  verid;
@@ -279,9 +285,10 @@ static void nvme_nvm_set_addr_12(struct nvm_addr_format_12 
*dst,
dst->sec_mask = ((1ULL << dst->sec_len) - 1) << dst->sec_offset;
  }
  -static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
+static int nvme_nvm_setup_12(struct nvme_nvm_id *gen_id,
 struct nvm_dev_geo *dev_geo)
  {
+   struct nvme_nvm_id12 *id = (struct nvme_nvm_id12 *)gen_id;
struct nvme_nvm_id12_grp *src;
int sec_per_pg, sec_per_pl, pg_per_blk;
  @@ -380,9 +387,11 @@ static void nvme_nvm_set_addr_20(struct nvm_addr_format 
*dst,
dst->sec_mask = ((1ULL << dst->sec_len) - 1) << dst->sec_offset;
  }
  -static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
+static int nvme_nvm_setup_20(struct nvme_nvm_id *gen_id,
 struct nvm_dev_geo *dev_geo)
  {
+   struct nvme_nvm_id20 *id = (struct nvme_nvm_id20 *)gen_id;
+
dev_geo->major_ver_id = id->mjr;
dev_geo->minor_ver_id = id->mnr;
  @@ -427,19 +436,19 @@ static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
  static int nvme_nvm_identity(struct nvm_dev *nvmdev)
  {
struct nvme_ns *ns = nvmdev->q->queuedata;
-   struct nvme_nvm_id12 *id;
+   struct nvme_nvm_id *nvme_nvm_id;
struct nvme_nvm_command c = {};
int ret;
c.identity.opcode = nvme_nvm_admin_identity;
c.identity.nsid = cpu_to_le32(ns->head->ns_id);
  - id = kmalloc(sizeof(struct nvme_nvm_id12), GFP_KERNEL);
-   if (!id)
+   nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
+   if (!nvme_nvm_id)
return -ENOMEM;
	ret = nvme_submit_sync_cmd(ns->ctrl->admin_q, (struct nvme_command *)&c,
-   id, sizeof(struct nvme_nvm_id12));
+   nvme_nvm_id, sizeof(struct nvme_nvm_id));
if (ret) {
ret = -EIO;
goto out;
@@ -449,22 +458,21 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev)
 * The 1.2 and 2.0 specifications share the first byte in their geometry
 * command to make it possible to know what version a device implements.
 */
-   switch (id->ver_id) {
+   switch (nvme_nvm_id->ver_id) {
case 1:
-   ret = nvme_nvm_setup_12(id, &nvmdev->dev_geo);
+   ret = nvme_nvm_setup_12(nvme_nvm_id, &nvmdev->dev_geo);
break;
case 2:
-   ret = nvme_nvm_setup_20((struct nvme_nvm_id20 *)id,
-   &nvmdev->dev_geo);
+   ret = nvme_nvm_setup_20(nvme_nvm_id, &nvmdev->dev_geo);
break;
default:
dev_err(ns->ctrl->device, "OCSSD revision not supported (%d)\n",
-   id->ver_id);
+   nvme_nvm_id->ver_id);
ret = -EINVAL;
}
out:
-   kfree(id);
+   kfree(nvme_nvm_id);
return ret;
  }



Thanks for another way to represent it. I want to keep the original
path. If we are going down that road, then one should maybe look into
unifying the "three" data structures, and have the version as the base
property and the others in each their sub data structure.


That's what this structure aims to do, since the version is shared. But
your call if you want to cast unrelated structures back and forth. I'll
revert it...

Javier



In that case, the patch could look like

struct nvm_id {
__u8 ver_id;

union {
struct v1 {};
struct v2 {};
};
};

I'm good with just having the single cast instead of putting each under 
a v1/v2 variable or using nvm_id first, and then goto either v1 or v2.
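
Fleshed out slightly, the proposed layout could look like the following. This is a compile-only sketch: the v1/v2 bodies are placeholders for the real nvme_nvm_id12/nvme_nvm_id20 fields, and -1 stands in for the kernel's -EINVAL:

```c
#include <stdint.h>

/* Sketch of the unified identify structure discussed above: the shared
 * first byte selects the spec revision, and an anonymous union overlays
 * the two version-specific layouts (fields here are placeholders). */
struct nvm_id {
	uint8_t ver_id;              /* 1 => OCSSD 1.2, 2 => OCSSD 2.0 */
	union {
		struct {
			uint8_t vmnt;        /* ...1.2 geometry fields... */
		} v1;
		struct {
			uint8_t mjr, mnr;    /* ...2.0 geometry fields... */
		} v2;
	};
};

/* Dispatch on ver_id, mirroring the switch in nvme_nvm_identity(). */
static int nvm_parse_identity(const struct nvm_id *id)
{
	switch (id->ver_id) {
	case 1: return 0;  /* parse id->v1 */
	case 2: return 0;  /* parse id->v2 */
	default: return -1;
	}
}
```

The benefit over the cast-based version is that the compiler tracks which revision's fields are being touched, at the cost of keeping the shared byte out of the per-spec structures.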


Re: [PATCH 03/20] lightnvm: fix capabilities for 2.0 sysfs

2018-02-22 Thread Matias Bjørling

On 02/22/2018 08:47 AM, Javier Gonzalez wrote:

On 22 Feb 2018, at 08.28, Matias Bjørling  wrote:

On 02/21/2018 10:26 AM, Javier González wrote:

Both 1.2 and 2.0 specs define a field for media and controller
capabilities. Also, 1.2 defines a separate field dedicated to device
capabilities.
In 2.0 sysfs, these values have been mixed up. Revert them to the right
values.
Signed-off-by: Javier González 
---
  drivers/nvme/host/lightnvm.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 969bb874850c..598abba66f52 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -914,8 +914,8 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
if (strcmp(attr->name, "version") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->ver_id);
-   } else if (strcmp(attr->name, "capabilities") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
+   } else if (strcmp(attr->name, "media_capabilities") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mccap);
} else if (strcmp(attr->name, "read_typ") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.trdt);
} else if (strcmp(attr->name, "read_max") == 0) {
@@ -993,8 +993,8 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbem);
} else if (strcmp(attr->name, "multiplane_modes") == 0) {
return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mpos);
-   } else if (strcmp(attr->name, "media_capabilities") == 0) {
-   return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.mccap);
+   } else if (strcmp(attr->name, "capabilities") == 0) {
+   return scnprintf(page, PAGE_SIZE, "0x%08x\n", dev_geo->c.cap);
} else if (strcmp(attr->name, "max_phys_secs") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", NVM_MAX_VLBA);
} else {
@@ -1055,7 +1055,7 @@ static ssize_t nvm_dev_attr_show_20(struct device *dev,
/* general attributes */
  static NVM_DEV_ATTR_RO(version);
-static NVM_DEV_ATTR_RO(capabilities);
+static NVM_DEV_ATTR_RO(media_capabilities);
static NVM_DEV_ATTR_RO(read_typ);
  static NVM_DEV_ATTR_RO(read_max);
@@ -1080,12 +1080,12 @@ static NVM_DEV_ATTR_12_RO(prog_max);
  static NVM_DEV_ATTR_12_RO(erase_typ);
  static NVM_DEV_ATTR_12_RO(erase_max);
  static NVM_DEV_ATTR_12_RO(multiplane_modes);
-static NVM_DEV_ATTR_12_RO(media_capabilities);
+static NVM_DEV_ATTR_12_RO(capabilities);
  static NVM_DEV_ATTR_12_RO(max_phys_secs);
static struct attribute *nvm_dev_attrs_12[] = {
	&dev_attr_version.attr,
-	&dev_attr_capabilities.attr,
+	&dev_attr_media_capabilities.attr,
	&dev_attr_vendor_opcode.attr,
	&dev_attr_device_mode.attr,
@@ -1108,7 +1108,7 @@ static struct attribute *nvm_dev_attrs_12[] = {
	&dev_attr_erase_typ.attr,
	&dev_attr_erase_max.attr,
	&dev_attr_multiplane_modes.attr,
-	&dev_attr_media_capabilities.attr,
+	&dev_attr_capabilities.attr,
	&dev_attr_max_phys_secs.attr,
NULL,
@@ -1134,7 +1134,7 @@ static NVM_DEV_ATTR_20_RO(reset_max);
static struct attribute *nvm_dev_attrs_20[] = {
	&dev_attr_version.attr,
-	&dev_attr_capabilities.attr,
+	&dev_attr_media_capabilities.attr,
	&dev_attr_groups.attr,
	&dev_attr_punits.attr,


With the mccap changes, it should make sense to keep the capabilities
as is.


The change adds mccap, but sysfs points to cap, which is wrong. This
patch is needed. Otherwise, we change the name of mccap to cap, which
is _very_ confusing to people familiar with both specs. We can change
the name of mccap to cap in a future spec revision.

Javier



Think of the sysfs capabilities as an abstract value that defines 
generic capabilities. It is not directly tied to either 1.2 or 2.0.


Re: [PATCH 13/20] lightnvm: add support for 2.0 address format

2018-02-22 Thread Matias Bjørling

On 02/21/2018 10:26 AM, Javier González wrote:

Add support for 2.0 address format. Also, align address bits for 1.2 and
2.0 to be able to operate on channels and LUNs without requiring a format
conversion. Use a generic address format for this purpose.

Signed-off-by: Javier González 
---
  drivers/lightnvm/core.c  |  20 -
  include/linux/lightnvm.h | 105 ++-
  2 files changed, 86 insertions(+), 39 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index f70a907223e2..baec963b2c96 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -196,8 +196,8 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
  
  		for (j = 0; j < luns_in_chnl; j++) {

luns[lunid].ppa = 0;
-   luns[lunid].g.ch = i;
-   luns[lunid++].g.lun = j;
+   luns[lunid].a.ch = i;
+   luns[lunid++].a.lun = j;
  
  			lun_offs[j] = blun;

lun_roffs[j + blun] = blun;
@@ -563,22 +563,22 @@ static void nvm_unregister_map(struct nvm_dev *dev)
  static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
  {
struct nvm_dev_map *dev_map = tgt_dev->map;
-   struct nvm_ch_map *ch_map = &dev_map->chnls[p->g.ch];
-   int lun_off = ch_map->lun_offs[p->g.lun];
+   struct nvm_ch_map *ch_map = &dev_map->chnls[p->a.ch];
+   int lun_off = ch_map->lun_offs[p->a.lun];
  
-   p->g.ch += ch_map->ch_off;
-   p->g.lun += lun_off;
+   p->a.ch += ch_map->ch_off;
+   p->a.lun += lun_off;
  }
  
  static void nvm_map_to_tgt(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)

  {
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_dev_map *dev_rmap = dev->rmap;
-   struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[p->g.ch];
-   int lun_roff = ch_rmap->lun_offs[p->g.lun];
+   struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[p->a.ch];
+   int lun_roff = ch_rmap->lun_offs[p->a.lun];
  
-   p->g.ch -= ch_rmap->ch_off;
-   p->g.lun -= lun_roff;
+   p->a.ch -= ch_rmap->ch_off;
+   p->a.lun -= lun_roff;
  }
  
  static void nvm_ppa_tgt_to_dev(struct nvm_tgt_dev *tgt_dev,

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e1c4292ea33d..cd310bf06051 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -16,12 +16,22 @@ enum {
NVM_IOTYPE_GC = 1,
  };
  
-#define NVM_BLK_BITS (16)

-#define NVM_PG_BITS  (16)
-#define NVM_SEC_BITS (8)
-#define NVM_PL_BITS  (8)
-#define NVM_LUN_BITS (8)
-#define NVM_CH_BITS  (7)
+/* common format */
+#define NVM_GEN_CH_BITS  (8)
+#define NVM_GEN_LUN_BITS (8)
+#define NVM_GEN_RESERVED (48)
+
+/* 1.2 format */
+#define NVM_12_BLK_BITS (16)
+#define NVM_12_PG_BITS  (16)
+#define NVM_12_PL_BITS  (4)
+#define NVM_12_SEC_BITS (4)
+#define NVM_12_RESERVED (8)
+
+/* 2.0 format */
+#define NVM_20_CHK_BITS (16)
+#define NVM_20_SEC_BITS (24)
+#define NVM_20_RESERVED (8)
  
  enum {

NVM_OCSSD_SPEC_12 = 12,
@@ -31,16 +41,33 @@ enum {
  struct ppa_addr {
/* Generic structure for all addresses */
union {
+   /* generic device format */
struct {
-   u64 blk : NVM_BLK_BITS;
-   u64 pg  : NVM_PG_BITS;
-   u64 sec : NVM_SEC_BITS;
-   u64 pl  : NVM_PL_BITS;
-   u64 lun : NVM_LUN_BITS;
-   u64 ch  : NVM_CH_BITS;
-   u64 reserved: 1;
+   u64 ch  : NVM_GEN_CH_BITS;
+   u64 lun : NVM_GEN_LUN_BITS;
+   u64 reserved: NVM_GEN_RESERVED;
+   } a;
+
+   /* 1.2 device format */
+   struct {
+   u64 ch  : NVM_GEN_CH_BITS;
+   u64 lun : NVM_GEN_LUN_BITS;
+   u64 blk : NVM_12_BLK_BITS;
+   u64 pg  : NVM_12_PG_BITS;
+   u64 pl  : NVM_12_PL_BITS;
+   u64 sec : NVM_12_SEC_BITS;
+   u64 reserved: NVM_12_RESERVED;
} g;
  
+		/* 2.0 device format */

+   struct {
+   u64 ch  : NVM_GEN_CH_BITS;
+   u64 lun : NVM_GEN_LUN_BITS;
+   u64 chk : NVM_20_CHK_BITS;
+   u64 sec : NVM_20_SEC_BITS;
+   u64 reserved: NVM_20_RESERVED;
+   } m;
+
struct {
u64 line: 63;
u64 is_cached   : 1;
@@ -392,16 +419,26 @@ static inline struct ppa_addr generic_to_dev_addr(struct 
nvm_tgt_dev *tgt_dev,
  struct 
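
To illustrate why aligning the ch/lun bits matters, here is a standalone sketch of the union above, with bit widths copied from the defines in the diff. Note that C bit-field packing is implementation-defined; this assumes the usual low-bits-first layout, so treat it as a sketch rather than a portable guarantee:

```c
#include <stdint.h>

/* Sketch of the patch's ppa_addr: because both the 1.2 ('g') and 2.0
 * ('m') layouts place ch/lun in the same low bits, the generic 'a' view
 * can remap channels/LUNs without knowing which spec the device talks. */
struct ppa_addr {
	union {
		struct {                 /* generic device format */
			uint64_t ch  : 8;
			uint64_t lun : 8;
			uint64_t reserved : 48;
		} a;
		struct {                 /* 1.2 device format */
			uint64_t ch  : 8;
			uint64_t lun : 8;
			uint64_t blk : 16;
			uint64_t pg  : 16;
			uint64_t pl  : 4;
			uint64_t sec : 4;
			uint64_t reserved : 8;
		} g;
		struct {                 /* 2.0 device format */
			uint64_t ch  : 8;
			uint64_t lun : 8;
			uint64_t chk : 16;
			uint64_t sec : 24;
			uint64_t reserved : 8;
		} m;
		uint64_t ppa;
	};
};

/* nvm_map_to_dev()-style remap: it touches only the shared ch/lun bits,
 * so the same code serves 1.2 and 2.0 addresses alike. */
static void ppa_remap(struct ppa_addr *p, int ch_off, int lun_off)
{
	p->a.ch += ch_off;
	p->a.lun += lun_off;
}
```

This is exactly what lets the core.c hunks above switch `p->g.ch`/`p->g.lun` to `p->a.ch`/`p->a.lun` and drop the per-format conversion.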

Re: [PATCH v2] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-22 Thread Johannes Thumshirn
Thanks Bart,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH] lightnvm: fix bad block initialization

2018-02-22 Thread Heiner Litz
Fix reading of bad block device information so that the per-line 
blk_bitmap is set up correctly during lightnvm initialization.

Signed-off-by: Heiner Litz 
---
 drivers/lightnvm/pblk-init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 86a94a7faa96..e0eba150ac7e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -461,10 +461,11 @@ static int pblk_bb_line(struct pblk *pblk, struct 
pblk_line *line,
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = >geo;
int i, bb_cnt = 0;
+   int blk_per_lun = geo->nr_chks * geo->plane_mode;
 
for (i = 0; i < blk_per_line; i++) {
struct pblk_lun *rlun = >luns[i];
-   u8 *lun_bb_log = bb_log + i * blk_per_line;
+   u8 *lun_bb_log = bb_log + i * blk_per_lun;
 
if (lun_bb_log[line->id] == NVM_BLK_T_FREE)
continue;
-- 
2.14.1
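
A standalone model of the indexing this patch corrects, simplified from pblk_bb_line() with the geometry values passed in rather than read from nvm_geo:

```c
#include <stdint.h>

#define NVM_BLK_T_FREE 0  /* bad-block-table value meaning "block is good" */

/* The device's bad-block log is laid out as one contiguous slice per
 * LUN, each slice holding (nr_chks * plane_mode) one-byte entries, and
 * line 'line_id' owns entry line_id of every slice.  The bug was using
 * blk_per_line (one block per LUN in a line) as the stride instead of
 * blk_per_lun, so every LUN after the first read the wrong entry. */
static int count_bad_blocks_in_line(const uint8_t *bb_log, int line_id,
				    int nr_luns, int nr_chks, int plane_mode)
{
	int blk_per_lun = nr_chks * plane_mode; /* stride fixed by the patch */
	int i, bb_cnt = 0;

	for (i = 0; i < nr_luns; i++) {
		const uint8_t *lun_bb_log = bb_log + i * blk_per_lun;

		if (lun_bb_log[line_id] != NVM_BLK_T_FREE)
			bb_cnt++;
	}
	return bb_cnt;
}
```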



Re: [PATCH 08/20] lightnvm: complete geo structure with maxoc*

2018-02-22 Thread Javier Gonzalez
> On 22 Feb 2018, at 08.45, Matias Bjørling  wrote:
> 
> On 02/21/2018 10:26 AM, Javier González wrote:
>> Complete the generic geometry structure with the maxoc and maxocpu
>> fields, present in the 2.0 spec.
>> Signed-off-by: Javier González 
>> ---
>>  drivers/nvme/host/lightnvm.c | 4 
>>  include/linux/lightnvm.h | 2 ++
>>  2 files changed, 6 insertions(+)
>> diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
>> index cca32da05316..9c1f8225c4e1 100644
>> --- a/drivers/nvme/host/lightnvm.c
>> +++ b/drivers/nvme/host/lightnvm.c
>> @@ -318,6 +318,8 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
>>  dev_geo->c.ws_min = sec_per_pg;
>>  dev_geo->c.ws_opt = sec_per_pg;
>>  dev_geo->c.mw_cunits = 8;   /* default to MLC safe values */
>> +dev_geo->c.maxoc = dev_geo->all_luns;   /* default to 1 chunk per LUN */
>> +dev_geo->c.maxocpu = 1; /* default to 1 chunk per LUN */
> 
> One can't assume that it is 1 open chunk per lun. If you need this for 
> specific hardware, make a quirk for it.
> 

Which default do you want for 1.2 if not specified, then? I use 1 because it
has been the implicit default until now.

Javier

