Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-09 Thread Ming Lei
Hi Kashyap,

On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Friday, February 9, 2018 11:01 AM
> >
> > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > -Original Message-
> > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > >
> > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > >> -Original Message-
> > > > > >> From: Ming Lei [mailto:ming@redhat.com]
> > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > >>
> > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> [ .. ]
> > > > > >
> > > > > > Could you share your patch for enabling global_tags/MQ on
> > > > > > megaraid_sas so that I can reproduce your test?
> > > > > >
> > > > > >> See the perf top data below. "bt_iter" is consuming 4 times
> > > > > >> more CPU.
> > > > > >
> > > > > > Could you share the IOPS/CPU utilization effect after applying
> > > > > > the V2 patch? And your test script?
> > > > >  Regarding CPU utilization, I need to test one more time.
> > > > >  Currently the system is in use.
> > > > > 
> > > > >  I ran the fio test below on a total of 24 expander-attached SSDs.
> > > > > 
> > > > >  numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > >  --ioengine=libaio --rw=randread
> > > > > 
> > > > >  Performance dropped from 1.6M IOPS to 770K IOPS.
> > > > > 
> > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > >>
> > > > > >> Hi Hannes,
> > > > > >>
> > > > > >> As I mentioned in another mail [1], Kashyap's patch has a big
> > > > > >> issue which causes only reply queue 0 to be used.
> > > > > >>
> > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > >>
> > > > > >> So could you guys run your performance test again after fixing
> > > > > >> the patch?
> > > > > >
> > > > > > Ming -
> > > > > >
> > > > > > I tried after making the change you requested.  The performance
> > > > > > drop is still unresolved: from 1.6M IOPS to 770K IOPS.
> > > > > >
> > > > > > See the data below. All 24 reply queues are in use correctly.
> > > > > >
> > > > > > IRQs / 1 second(s)
> > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > >  360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
> > > > > >  364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
> > > > > >  362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
> > > > > >  345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
> > > > > >  341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
> > > > > >  369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
> > > > > >  359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
> > > > > >  358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
> > > > > >  350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
> > > > > >  342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
> > > > > >  344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
> > > > > >  346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
> > > > > >  361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
> > > > > >  348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
> > > > > >  368  15463  0 

[PATCH V2] block: pass inclusive 'lend' parameter to truncate_inode_pages_range

2018-02-09 Thread Ming Lei
The 'lend' parameter of truncate_inode_pages_range is required to be
inclusive, so follow the rule.

This patch fixes one memory corruption triggered by discard.

Cc: 
Cc: Dmitry Monakhov 
Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Ming Lei 
---
V2:
- Cc stable list and Dmitry as suggested by Bart

 block/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..3884d810efd2 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -225,7 +225,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 
if (start + len > i_size_read(bdev->bd_inode))
return -EINVAL;
-   truncate_inode_pages_range(mapping, start, start + len);
+   truncate_inode_pages_range(mapping, start, start + len - 1);
return blkdev_issue_discard(bdev, start >> 9, len >> 9,
GFP_KERNEL, flags);
 }
-- 
2.9.5
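
As an illustration, here is a minimal user-space sketch (not kernel code;
PAGE_SHIFT and the page_index() helper are assumptions for this example)
of why the inclusive 'lend' convention matters: with 4 KiB pages, an
exclusive end of start + len lands on the first byte of the page after
the discarded range, so that extra page would be invalidated as well,
which is the corruption this patch fixes.

#include <stdio.h>

#define PAGE_SHIFT 12   /* assume 4 KiB pages */

/* index of the page containing a given byte offset */
static unsigned long page_index(unsigned long byte)
{
        return byte >> PAGE_SHIFT;
}

int main(void)
{
        unsigned long start = 0, len = 4096;   /* discard exactly one page */

        /* buggy: byte 4096 is in page 1, one page past the range */
        printf("exclusive end %lu -> page %lu\n",
               start + len, page_index(start + len));
        /* fixed: byte 4095 is still in page 0 */
        printf("inclusive end %lu -> page %lu\n",
               start + len - 1, page_index(start + len - 1));
        return 0;
}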



Re: [LSF/MM ATTEND] State of blk-mq I/O scheduling and next steps

2018-02-09 Thread Ming Lei
On Fri, Feb 9, 2018 at 5:47 AM, Omar Sandoval  wrote:
> Hi,
>
> I'd like to attend LSF/MM to talk about the state of I/O scheduling in
> blk-mq. I can present some results about mq-deadline and Kyber on our
> production workloads. I'd also like to talk about what's next, in
> particular, improvements and features for Kyber.
Hi Omar,

I am very interested in these topics, and I am especially looking
forward to your update on what's next. I am also thinking about
improving SSD/NVMe performance via Kyber, and I am interested in any
topics on SSD/NVMe performance and on interacting with SSD/NVMe devices
in a friendly way.


Thanks,
Ming Lei


Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Jens Axboe
On 2/9/18 12:14 PM, Bart Van Assche wrote:
> On 02/09/18 10:58, Jens Axboe wrote:
>> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>>> Hello Paolo,
>>>
>>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>>> appears (see also below). This happens systematically with Linus' tree from
>>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch 
>>> (commit
>>> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
>>> (commit 88455ad7f928). Is this a known issue?
>>
>> Does it happen on Linus -git as well, or just with my for-linus merged in?
>> What I'm getting at is if a78773906147 caused this or not.
> 
> Hello Jens,
> 
> Thanks for chiming in. After having reverted commit a78773906147, after 
> having rebuilt the BFQ scheduler, after having rebooted and after having 
> repeated the test I see the same kernel oops being reported. I think 
> that means that this regression is not caused by commit a78773906147. In 
> case it would be useful, here is how gdb translates the crash address:
> 
> $ gdb block/bfq*ko
> (gdb) list *(bfq_remove_request+0x8d)
> 0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
> 1755    list_del_init(&rq->queuelist);
> 1756    bfqq->queued[sync]--;
> 1757    bfqd->queued--;
> 1758    elv_rb_del(&bfqq->sort_list, rq);
> 1759
> 1760    elv_rqhash_del(q, rq);
> 1761    if (q->last_merge == rq)
> 1762            q->last_merge = NULL;
> 1763
> 1764    if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Looks very odd. So clearly RQF_HASHED is set, but we're blowing up on
the hash list pointers. I'll let Paolo take a look at this one. Thanks
for testing without that commit, I want to push out my pending fixes
today and this would have thrown a wrench in the works.

-- 
Jens Axboe



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Bart Van Assche

On 02/09/18 10:58, Jens Axboe wrote:
> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>> Hello Paolo,
>>
>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>> appears (see also below). This happens systematically with Linus' tree from
>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
>> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
>> (commit 88455ad7f928). Is this a known issue?
>
> Does it happen on Linus -git as well, or just with my for-linus merged in?
> What I'm getting at is if a78773906147 caused this or not.


Hello Jens,

Thanks for chiming in. After having reverted commit a78773906147, after 
having rebuilt the BFQ scheduler, after having rebooted and after having 
repeated the test I see the same kernel oops being reported. I think 
that means that this regression is not caused by commit a78773906147. In 
case it would be useful, here is how gdb translates the crash address:


$ gdb block/bfq*ko
(gdb) list *(bfq_remove_request+0x8d)
0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
1755    list_del_init(&rq->queuelist);
1756    bfqq->queued[sync]--;
1757    bfqd->queued--;
1758    elv_rb_del(&bfqq->sort_list, rq);
1759
1760    elv_rqhash_del(q, rq);
1761    if (q->last_merge == rq)
1762            q->last_merge = NULL;
1763
1764    if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Bart.


Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Jens Axboe
On 2/9/18 11:54 AM, Bart Van Assche wrote:
> Hello Paolo,
> 
> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
> appears (see also below). This happens systematically with Linus' tree from
> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
> (commit 88455ad7f928). Is this a known issue?

Does it happen on Linus -git as well, or just with my for-linus merged in?
What I'm getting at is if a78773906147 caused this or not.



-- 
Jens Axboe



v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Bart Van Assche
Hello Paolo,

If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
appears (see also below). This happens systematically with Linus' tree from
this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
(commit 88455ad7f928). Is this a known issue?

Thanks,

Bart.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000200
IP: rb_erase+0x284/0x380
PGD 0 P4D 0 
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.15.0-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:rb_erase+0x284/0x380
Call Trace:
 <IRQ>
 elv_rb_del+0x20/0x30
 bfq_remove_request+0x8d/0x2e0 [bfq]
 bfq_finish_requeue_request+0x2bb/0x390 [bfq]
 blk_mq_free_request+0x51/0x170
 dm_softirq_done+0xd5/0x240 [dm_mod]
 flush_smp_call_function_queue+0x92/0x140
 smp_call_function_single_interrupt+0x47/0x2b0
 call_function_single_interrupt+0xa2/0xb0
 </IRQ>

[PATCH v3 2/3] block: Fix a race between the cgroup code and request queue initialization

2018-02-09 Thread Bart Van Assche
Initialize the request queue lock earlier such that the following
race can no longer occur:

blk_init_queue_node()               blkcg_print_blkgs()
  blk_alloc_queue_node (1)
    q->queue_lock = &q->__queue_lock (2)
    blkcg_init_queue(q) (3)
                                    spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
                                    spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.
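
As an illustration, a minimal user-space analogy (pthreads; not the
kernel code, and single-threaded only to force the racy interleaving,
with hypothetical names) of the unlock imbalance in steps (2)-(6):

#include <pthread.h>
#include <stdio.h>

/* a queue whose lock pointer can be redirected, like request_queue */
struct queue {
        pthread_mutex_t __internal_lock;
        pthread_mutex_t *queue_lock;
};

int main(void)
{
        pthread_mutex_t driver_lock;
        pthread_mutexattr_t attr;
        struct queue q;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&driver_lock, &attr);
        pthread_mutex_init(&q.__internal_lock, &attr);

        q.queue_lock = &q.__internal_lock;       /* step (2) */
        pthread_mutex_lock(q.queue_lock);        /* step (4): internal lock */
        q.queue_lock = &driver_lock;             /* step (5): race window */
        if (pthread_mutex_unlock(q.queue_lock))  /* step (6): wrong lock */
                printf("unlock imbalance: driver lock was never taken\n");
        return 0;
}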

The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into
  blk_alloc_queue_node().
- Only override the .queue_lock pointer for legacy queues because it
  is not useful for blk-mq queues to override this pointer.
- For all all block drivers that initialize .queue_lock explicitly,
  change the blk_alloc_queue() call in the driver into a
  blk_alloc_queue_node() call and remove the explicit .queue_lock
  initialization. Additionally, initialize the spin lock that will
  be used as queue lock earlier if necessary.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
---
 block/blk-core.c   | 24 
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/umem.c   |  7 +++
 3 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e873a24bf82d..41c74b37be85 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -888,6 +888,19 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
+/**
+ * blk_alloc_queue_node - allocate a request queue
+ * @gfp_mask: memory allocation flags
+ * @node_id: NUMA node to allocate memory from
+ * @lock: For legacy queues, pointer to a spinlock that will be used to e.g.
+ *        serialize calls to the legacy .request_fn() callback. Ignored for
+ *        blk-mq request queues.
+ *
+ * Note: pass the queue lock as the third argument to this function instead of
+ * setting the queue lock pointer explicitly to avoid triggering a sporadic
+ * crash in the blkcg code. This function namely calls blkcg_init_queue() and
+ * the queue lock pointer must be set before blkcg_init_queue() is called.
+ */
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
   spinlock_t *lock)
 {
@@ -940,11 +953,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
mutex_init(&q->sysfs_lock);
spin_lock_init(&q->__queue_lock);
 
-   /*
-    * By default initialize queue_lock to internal lock and driver can
-    * override it later if need be.
-    */
-   q->queue_lock = &q->__queue_lock;
+   if (!q->mq_ops)
+       q->queue_lock = lock ? : &q->__queue_lock;
 
/*
 * A queue starts its life with bypass turned on to avoid
@@ -1031,13 +1041,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);
if (!q)
return NULL;
 
q->request_fn = rfn;
-   if (lock)
-   q->queue_lock = lock;
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return NULL;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 0a0394aa1b9c..185f1ef00a7c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2816,7 +2816,7 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 
drbd_init_set_defaults(device);
 
-   q = blk_alloc_queue(GFP_KERNEL);
+   q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, &resource->req_lock);
if (!q)
goto out_no_q;
device->rq_queue = q;
@@ -2848,7 +2848,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
/* Setting the max_hw_sectors to an odd value of 8kibyte here
   This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
-   q->queue_lock = &resource->req_lock;
 
device->md_io.page = 

[PATCH v3 1/3] block: Add a third argument to blk_alloc_queue_node()

2018-02-09 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
---
 block/blk-core.c  | 7 ---
 block/blk-mq.c| 2 +-
 drivers/block/null_blk.c  | 3 ++-
 drivers/ide/ide-probe.c   | 2 +-
 drivers/lightnvm/core.c   | 2 +-
 drivers/md/dm.c   | 2 +-
 drivers/nvdimm/pmem.c | 2 +-
 drivers/nvme/host/multipath.c | 2 +-
 drivers/scsi/scsi_lib.c   | 2 +-
 include/linux/blkdev.h| 3 ++-
 10 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2d1a7bbe0634..e873a24bf82d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -810,7 +810,7 @@ void blk_exit_rl(struct request_queue *q, struct request_list *rl)
 
 struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 {
-   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE);
+   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE, NULL);
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
@@ -888,7 +888,8 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
-struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
+  spinlock_t *lock)
 {
struct request_queue *q;
 
@@ -1030,7 +1031,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
if (!q)
return NULL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102e2149..ae4e5096f425 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2554,7 +2554,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 {
struct request_queue *uninit_q, *q;
 
-   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);
+   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);
if (!uninit_q)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..f3d17ea88965 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1717,7 +1717,8 @@ static int null_add_dev(struct nullb_device *dev)
}
null_init_queues(nullb);
} else if (dev->queue_mode == NULL_Q_BIO) {
-   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node);
+   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node,
+   NULL);
if (!nullb->q) {
rv = -ENOMEM;
goto out_cleanup_queues;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 17fd55af4d92..2e80a866073c 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -766,7 +766,7 @@ static int ide_init_queue(ide_drive_t *drive)
 *  limits and LBA48 we could raise it but as yet
 *  do not.
 */
-   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif));
+   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif), NULL);
if (!q)
return 1;
 
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..5f1988df1593 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -384,7 +384,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
goto err_dev;
}
 
-   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
+   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node, NULL);
if (!tqueue) {
ret = -ENOMEM;
goto err_disk;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d6de00f367ef..3c55564f6367 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1840,7 +1840,7 @@ static struct mapped_device *alloc_dev(int minor)
INIT_LIST_HEAD(&md->table_devices);
spin_lock_init(&md->uevent_lock);
 
-   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
+   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, NULL);
if (!md->queue)
goto bad;
md->queue->queuedata = md;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 10041ac4032c..cfb15ac50925 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -344,7 +344,7 @@ static int pmem_attach_disk(struct device *dev,
return -EBUSY;
}
 
-   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev), NULL);
if (!q)
   

[PATCH v3 0/3] Fix races between blkcg code and request queue initialization and cleanup

2018-02-09 Thread Bart Van Assche
Hello Jens,

Recently Joseph Qi identified races between the block cgroup code and request
queue initialization and cleanup. This patch series addresses these races.
Please consider these patches for kernel v4.17.

Thanks,

Bart.

Changes between v2 and v3:
- Added a third patch that fixes a race between the blkcg code and queue
  cleanup.

Changes between v1 and v2:
- Split a single patch into two patches.
- Dropped blk_alloc_queue_node2() and modified all block drivers that call
  blk_alloc_queue_node().

Bart Van Assche (3):
  block: Add a third argument to blk_alloc_queue_node()
  block: Fix a race between the cgroup code and request queue
initialization
  block: Fix a race between request queue removal and the block cgroup
controller

 block/blk-core.c   | 60 +++---
 block/blk-mq.c |  2 +-
 block/blk-sysfs.c  |  7 -
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/null_blk.c   |  3 ++-
 drivers/block/umem.c   |  7 +++--
 drivers/ide/ide-probe.c|  2 +-
 drivers/lightnvm/core.c|  2 +-
 drivers/md/dm.c|  2 +-
 drivers/nvdimm/pmem.c  |  2 +-
 drivers/nvme/host/multipath.c  |  2 +-
 drivers/scsi/scsi_lib.c|  2 +-
 include/linux/blkdev.h |  3 ++-
 13 files changed, 65 insertions(+), 32 deletions(-)

-- 
2.16.1



[PATCH v3 3/3] block: Fix a race between request queue removal and the block cgroup controller

2018-02-09 Thread Bart Van Assche
Avoid that the following race can occur:

blk_cleanup_queue()                 blkcg_print_blkgs()
  spin_lock_irq(lock) (1)           spin_lock_irq(blkg->q->queue_lock) (2,5)
  q->queue_lock = &q->__queue_lock (3)
  spin_unlock_irq(lock) (4)
                                    spin_unlock_irq(blkg->q->queue_lock) (6)

(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock;
(5) can take driver lock now;
(6) but unlock internal lock.

This change is safe because only the SCSI core and the NVME core keep
a reference on a request queue after having called blk_cleanup_queue().
Neither driver accesses any of the removed data structures between its
blk_cleanup_queue() and blk_put_queue() calls.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Jan Kara 
---
 block/blk-core.c  | 31 +++
 block/blk-sysfs.c |  7 ---
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 41c74b37be85..6febc69a58aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,6 +719,37 @@ void blk_cleanup_queue(struct request_queue *q)
del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
blk_sync_queue(q);
 
+   /*
+    * I/O scheduler exit is only safe after the sysfs scheduler attribute
+    * has been removed.
+    */
+   WARN_ON_ONCE(q->kobj.state_in_sysfs);
+
+   /*
+    * Since the I/O scheduler exit code may access cgroup information,
+    * perform I/O scheduler exit before disassociating from the block
+    * cgroup controller.
+    */
+   if (q->elevator) {
+   ioc_clear_queue(q);
+   elevator_exit(q, q->elevator);
+   q->elevator = NULL;
+   }
+
+   /*
+    * Remove all references to @q from the block cgroup controller before
+    * restoring @q->queue_lock to avoid that restoring this pointer causes
+    * e.g. blkcg_print_blkgs() to crash.
+    */
+   blkcg_exit_queue(q);
+
+   /*
+    * Since the cgroup code may dereference the @q->backing_dev_info
+    * pointer, only decrease its reference count after having removed the
+    * association with the block cgroup controller.
+    */
+   bdi_put(q->backing_dev_info);
+
if (q->mq_ops)
blk_mq_free_queue(q);
percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cbea895a5547..fd71a00c9462 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -798,13 +798,6 @@ static void __blk_release_queue(struct work_struct *work)
if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
-   bdi_put(q->backing_dev_info);
-   blkcg_exit_queue(q);
-
-   if (q->elevator) {
-   ioc_clear_queue(q);
-   elevator_exit(q, q->elevator);
-   }
 
blk_free_queue_stats(q->stats);
 
-- 
2.16.1



Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-09 Thread Mike Galbraith
On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
> 
> In addition to this I think it should be worth considering CC'ing Greg 
> to pull this fix into 4.15 stable tree.

This isn't one he can cherry-pick; some munging is required, in which
case he usually wants a properly tested backport.

-Mike


Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-09 Thread Jens Axboe
On 2/9/18 6:21 AM, Oleksandr Natalenko wrote:
> Hi.
> 
> 08.02.2018 08:16, Paolo Valente wrote:
>>> On 07 Feb 2018, at 23:18, Jens Axboe wrote:
>>>
>>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>>> be re-inserted into the active I/O scheduler for that device. As a
>>>> consequence, I/O schedulers may get the same request inserted again,
>>>> even several times, without a finish_request invoked on that request
>>>> before each re-insertion.
>>>>
>>>> This fact is the cause of the failure reported in [1]. For an I/O
>>>> scheduler, every re-insertion of the same re-prepared request is
>>>> equivalent to the insertion of a new request. For schedulers like
>>>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>>>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>>>> request, until the finish_request hook is invoked on the request. In
>>>> particular, BFQ may get stuck, waiting forever for the number of
>>>> request dispatches, of the same request, to be balanced by an equal
>>>> number of request completions (while there will be one completion for
>>>> that request). In this state, BFQ may refuse to serve I/O requests
>>>> from other bfq_queues. The hang reported in [1] then follows.
>>>>
>>>> However, the above re-prepared requests undergo a requeue, thus the
>>>> requeue_request hook of the active elevator is invoked for these
>>>> requests, if set. This commit then addresses the above issue by
>>>> properly implementing the hook requeue_request in BFQ.
>>>
>>> Thanks, applied.
>>>
>>
>> Hi Jens,
>> I forgot to add
>> Tested-by: Oleksandr Natalenko 
>> in the patch.
>>
>> Is it still possible to add it?
>>
> 
> In addition to this I think it should be worth considering CC'ing Greg 
> to pull this fix into 4.15 stable tree.

I can't add the tested-by anymore, but it's easy enough to target for
stable after-the-fact.


-- 
Jens Axboe



Re: [PATCH] block: pass inclusive 'lend' parameter to truncate_inode_pages_range

2018-02-09 Thread Bart Van Assche
On Fri, 2018-02-09 at 22:15 +0800, Ming Lei wrote:
> The 'lend' parameter of truncate_inode_pages_range is required to be
> inclusive, so follow the rule.
> 
> This patch fixes one memory corruption triggered by discard.
> 
> Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")

Since this bug got introduced in kernel v4.15 please add a "Cc: stable" tag.
Please also Cc: the author of the patch.

Thanks,

Bart.



[PATCH] block: pass inclusive 'lend' parameter to truncate_inode_pages_range

2018-02-09 Thread Ming Lei
The 'lend' parameter of truncate_inode_pages_range is required to be
inclusive, so follow the rule.

This patch fixes one memory corruption triggered by discard.

Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Ming Lei 
---
 block/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..3884d810efd2 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -225,7 +225,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 
if (start + len > i_size_read(bdev->bd_inode))
return -EINVAL;
-   truncate_inode_pages_range(mapping, start, start + len);
+   truncate_inode_pages_range(mapping, start, start + len - 1);
return blkdev_issue_discard(bdev, start >> 9, len >> 9,
GFP_KERNEL, flags);
 }
-- 
2.9.5



Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-09 Thread Oleksandr Natalenko

Hi.

08.02.2018 08:16, Paolo Valente wrote:
>> On 07 Feb 2018, at 23:18, Jens Axboe wrote:
>>
>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>> be re-inserted into the active I/O scheduler for that device. As a
>>> consequence, I/O schedulers may get the same request inserted again,
>>> even several times, without a finish_request invoked on that request
>>> before each re-insertion.
>>>
>>> This fact is the cause of the failure reported in [1]. For an I/O
>>> scheduler, every re-insertion of the same re-prepared request is
>>> equivalent to the insertion of a new request. For schedulers like
>>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>>> request, until the finish_request hook is invoked on the request. In
>>> particular, BFQ may get stuck, waiting forever for the number of
>>> request dispatches, of the same request, to be balanced by an equal
>>> number of request completions (while there will be one completion for
>>> that request). In this state, BFQ may refuse to serve I/O requests
>>> from other bfq_queues. The hang reported in [1] then follows.
>>>
>>> However, the above re-prepared requests undergo a requeue, thus the
>>> requeue_request hook of the active elevator is invoked for these
>>> requests, if set. This commit then addresses the above issue by
>>> properly implementing the hook requeue_request in BFQ.
>>
>> Thanks, applied.
>
> Hi Jens,
> I forgot to add
> Tested-by: Oleksandr Natalenko
> in the patch.
>
> Is it still possible to add it?

In addition to this I think it should be worth considering CC'ing Greg
to pull this fix into 4.15 stable tree.


Oleksandr


[PATCH v2] blk: optimization for classic polling

2018-02-09 Thread Nitesh Shetty
From: Nitesh Shetty 

This removes the dependency on interrupts to wake up the task: set the
task state to TASK_RUNNING if need_resched() returns true while polling
for IO completion.
Earlier, the polling task used to sleep, relying on the interrupt to
wake it up. This made some IO take very long when interrupt coalescing
is enabled in NVMe.

Changes v1->v2:
- set the task state once in blk_poll(), instead of in multiple callers.

Signed-off-by: Nitesh Shetty 
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..40285fe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
cpu_relax();
}
 
+   set_current_state(TASK_RUNNING);
return false;
 }
 
-- 
2.7.4
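
For context, a minimal user-space sketch (hypothetical stand-ins, not
the kernel implementation) of the polling-loop shape this patch changes:
the point is to leave the loop in TASK_RUNNING when need_resched()
fires, instead of relying on the completion interrupt to wake the task.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

/* stand-ins for kernel primitives, for illustration only */
static enum task_state current_state;
static void set_current_state(enum task_state s) { current_state = s; }
static bool need_resched(void) { return true; /* pretend */ }
static void cpu_relax(void) { }

static bool poll_for_completion(atomic_bool *done)
{
        set_current_state(TASK_INTERRUPTIBLE);
        while (!atomic_load(done)) {
                if (need_resched()) {
                        /* the fix: become runnable before bailing out,
                         * so the scheduler resumes us promptly */
                        set_current_state(TASK_RUNNING);
                        return false;
                }
                cpu_relax();
        }
        set_current_state(TASK_RUNNING);
        return true;
}

int main(void)
{
        atomic_bool done = false;
        bool completed = poll_for_completion(&done);

        printf("completed=%d state=%d\n", completed, current_state);
        return 0;
}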



[PATCH V2 1/4] lightnvm: make 1.2 data structures explicit

2018-02-09 Thread Matias Bjørling
Make the 1.2 data structures explicit, so it will be easy to identify
the 2.0 data structures. Also fix the order in which the nvme_nvm_*
structures are declared, so that they follow the nvme_nvm_command order.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 82 ++--
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index dc0b1335c7c6..60db3f1b59da 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -51,6 +51,21 @@ struct nvme_nvm_ph_rw {
__le64  resv;
 };
 
+struct nvme_nvm_erase_blk {
+   __u8    opcode;
+   __u8    flags;
+   __u16   command_id;
+   __le32  nsid;
+   __u64   rsvd[2];
+   __le64  prp1;
+   __le64  prp2;
+   __le64  spba;
+   __le16  length;
+   __le16  control;
+   __le32  dsmgmt;
+   __le64  resv;
+};
+
 struct nvme_nvm_identity {
__u8    opcode;
__u8    flags;
@@ -89,33 +104,18 @@ struct nvme_nvm_setbbtbl {
__u32   rsvd4[3];
 };
 
-struct nvme_nvm_erase_blk {
-   __u8    opcode;
-   __u8    flags;
-   __u16   command_id;
-   __le32  nsid;
-   __u64   rsvd[2];
-   __le64  prp1;
-   __le64  prp2;
-   __le64  spba;
-   __le16  length;
-   __le16  control;
-   __le32  dsmgmt;
-   __le64  resv;
-};
-
 struct nvme_nvm_command {
union {
struct nvme_common_command common;
-   struct nvme_nvm_identity identity;
struct nvme_nvm_ph_rw ph_rw;
+   struct nvme_nvm_erase_blk erase;
+   struct nvme_nvm_identity identity;
struct nvme_nvm_getbbtbl get_bb;
struct nvme_nvm_setbbtbl set_bb;
-   struct nvme_nvm_erase_blk erase;
};
 };
 
-struct nvme_nvm_id_group {
+struct nvme_nvm_id12_grp {
__u8    mtype;
__u8    fmtype;
__le16  res16;
@@ -141,7 +141,7 @@ struct nvme_nvm_id_group {
__u8    reserved[906];
 } __packed;
 
-struct nvme_nvm_addr_format {
+struct nvme_nvm_id12_addrf {
__u8    ch_offset;
__u8    ch_len;
__u8    lun_offset;
@@ -157,16 +157,16 @@ struct nvme_nvm_addr_format {
__u8    res[4];
 } __packed;
 
-struct nvme_nvm_id {
+struct nvme_nvm_id12 {
__u8    ver_id;
__u8    vmnt;
__u8    cgrps;
__u8    res;
__le32  cap;
__le32  dom;
-   struct nvme_nvm_addr_format ppaf;
+   struct nvme_nvm_id12_addrf ppaf;
__u8    resv[228];
-   struct nvme_nvm_id_group group;
+   struct nvme_nvm_id12_grp grp;
__u8    resv2[2880];
 } __packed;
 
@@ -191,25 +191,25 @@ static inline void _nvme_nvm_check_size(void)
 {
BUILD_BUG_ON(sizeof(struct nvme_nvm_identity) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_ph_rw) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_getbbtbl) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_setbbtbl) != 64);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_id_group) != 960);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 16);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_id) != NVME_IDENTIFY_DATA_SIZE);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_grp) != 960);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
 }
 
-static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
+static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
-   struct nvme_nvm_id_group *src;
+   struct nvme_nvm_id12_grp *src;
struct nvm_id_group *grp;
int sec_per_pg, sec_per_pl, pg_per_blk;
 
-   if (nvme_nvm_id->cgrps != 1)
+   if (id12->cgrps != 1)
return -EINVAL;
 
-   src = &nvme_nvm_id->group;
+   src = &id12->grp;
grp = &nvm_id->grp;
 
grp->mtype = src->mtype;
@@ -261,34 +261,34 @@ static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
 

[PATCH V2 2/4] lightnvm: flatten nvm_id_group into nvm_id

2018-02-09 Thread Matias Bjørling
There are no groups in the 2.0 specification, so make sure that the
nvm_id structure is flattened before 2.0 data structures are added.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  |  25 ++-
 drivers/nvme/host/lightnvm.c | 100 +--
 include/linux/lightnvm.h |  53 +++
 3 files changed, 86 insertions(+), 92 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..c72863b36439 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -851,33 +851,32 @@ EXPORT_SYMBOL(nvm_get_tgt_bb_tbl);
 static int nvm_core_init(struct nvm_dev *dev)
 {
struct nvm_id *id = &dev->identity;
-   struct nvm_id_group *grp = &id->grp;
struct nvm_geo *geo = &dev->geo;
int ret;

memcpy(&geo->ppaf, &id->ppaf, sizeof(struct nvm_addr_format));
 
-   if (grp->mtype != 0) {
+   if (id->mtype != 0) {
pr_err("nvm: memory type not supported\n");
return -EINVAL;
}
 
/* Whole device values */
-   geo->nr_chnls = grp->num_ch;
-   geo->nr_luns = grp->num_lun;
+   geo->nr_chnls = id->num_ch;
+   geo->nr_luns = id->num_lun;
 
/* Generic device geometry values */
-   geo->ws_min = grp->ws_min;
-   geo->ws_opt = grp->ws_opt;
-   geo->ws_seq = grp->ws_seq;
-   geo->ws_per_chk = grp->ws_per_chk;
-   geo->nr_chks = grp->num_chk;
-   geo->sec_size = grp->csecs;
-   geo->oob_size = grp->sos;
-   geo->mccap = grp->mccap;
+   geo->ws_min = id->ws_min;
+   geo->ws_opt = id->ws_opt;
+   geo->ws_seq = id->ws_seq;
+   geo->ws_per_chk = id->ws_per_chk;
+   geo->nr_chks = id->num_chk;
+   geo->sec_size = id->csecs;
+   geo->oob_size = id->sos;
+   geo->mccap = id->mccap;
geo->max_rq_size = dev->ops->max_phys_sect * geo->sec_size;
 
-   geo->sec_per_chk = grp->clba;
+   geo->sec_per_chk = id->clba;
geo->sec_per_lun = geo->sec_per_chk * geo->nr_chks;
geo->all_luns = geo->nr_luns * geo->nr_chnls;
 
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 60db3f1b59da..6412551ecc65 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -203,57 +203,55 @@ static inline void _nvme_nvm_check_size(void)
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
struct nvme_nvm_id12_grp *src;
-   struct nvm_id_group *grp;
int sec_per_pg, sec_per_pl, pg_per_blk;
 
if (id12->cgrps != 1)
return -EINVAL;
 
src = &id12->grp;
-   grp = &nvm_id->grp;
 
-   grp->mtype = src->mtype;
-   grp->fmtype = src->fmtype;
+   nvm_id->mtype = src->mtype;
+   nvm_id->fmtype = src->fmtype;
 
-   grp->num_ch = src->num_ch;
-   grp->num_lun = src->num_lun;
+   nvm_id->num_ch = src->num_ch;
+   nvm_id->num_lun = src->num_lun;
 
-   grp->num_chk = le16_to_cpu(src->num_chk);
-   grp->csecs = le16_to_cpu(src->csecs);
-   grp->sos = le16_to_cpu(src->sos);
+   nvm_id->num_chk = le16_to_cpu(src->num_chk);
+   nvm_id->csecs = le16_to_cpu(src->csecs);
+   nvm_id->sos = le16_to_cpu(src->sos);
 
pg_per_blk = le16_to_cpu(src->num_pg);
-   sec_per_pg = le16_to_cpu(src->fpg_sz) / grp->csecs;
+   sec_per_pg = le16_to_cpu(src->fpg_sz) / nvm_id->csecs;
sec_per_pl = sec_per_pg * src->num_pln;
-   grp->clba = sec_per_pl * pg_per_blk;
-   grp->ws_per_chk = pg_per_blk;
+   nvm_id->clba = sec_per_pl * pg_per_blk;
+   nvm_id->ws_per_chk = pg_per_blk;
 
-   grp->mpos = le32_to_cpu(src->mpos);
-   grp->cpar = le16_to_cpu(src->cpar);
-   grp->mccap = le32_to_cpu(src->mccap);
+   nvm_id->mpos = le32_to_cpu(src->mpos);
+   nvm_id->cpar = le16_to_cpu(src->cpar);
+   nvm_id->mccap = le32_to_cpu(src->mccap);
 
-   grp->ws_opt = grp->ws_min = sec_per_pg;
-   grp->ws_seq = NVM_IO_SNGL_ACCESS;
+   nvm_id->ws_opt = nvm_id->ws_min = sec_per_pg;
+   nvm_id->ws_seq = NVM_IO_SNGL_ACCESS;
 
-   if (grp->mpos & 0x020202) {
-   grp->ws_seq = NVM_IO_DUAL_ACCESS;
-   grp->ws_opt <<= 1;
-   } else if (grp->mpos & 0x040404) {
-   grp->ws_seq = NVM_IO_QUAD_ACCESS;
-   grp->ws_opt <<= 2;
+   if (nvm_id->mpos & 0x020202) {
+   nvm_id->ws_seq = NVM_IO_DUAL_ACCESS;
+   nvm_id->ws_opt <<= 1;
+   } else if (nvm_id->mpos & 0x040404) {
+   nvm_id->ws_seq = NVM_IO_QUAD_ACCESS;
+   nvm_id->ws_opt <<= 2;
}
 
-   grp->trdt = le32_to_cpu(src->trdt);
-   grp->trdm = le32_to_cpu(src->trdm);
-   grp->tprt = le32_to_cpu(src->tprt);
-   grp->tprm = le32_to_cpu(src->tprm);
-   grp->tbet = le32_to_cpu(src->tbet);
-   grp->tbem = le32_to_cpu(src->tbem);
+   nvm_id->trdt = le32_to_cpu(src->trdt);
+   nvm_id->trdm = 

[PATCH V2 0/4] lightnvm: base 2.0 implementation

2018-02-09 Thread Matias Bjørling
A couple of patches for 2.0 support for the lightnvm subsystem. They
form the basis for integrating 2.0 support.

For the rest of the support, Javier has code that implements report
chunk and sets up the LBA format data structure. He also has a bunch
of patches that bring pblk up to speed.

The first two patches are preparation for the 2.0 work. The third patch
implements the 2.0 data structures, the geometry command, and exposes
the sysfs attributes that come with the 2.0 specification. Note that
the attributes differ between 1.2 and 2.0, and user-space is expected
to use the version sysfs attribute to know which attributes are
available.

The last patch implements support for using the nvme namespace logical
block and metadata fields and syncs them with the internal lightnvm
identify structures.

Changes since v1:

 - pr_err fix from Randy.
 - Address type fix from Javier.
 - Also CC the nvme mailing list.

Matias Bjørling (4):
  lightnvm: make 1.2 data structures explicit
  lightnvm: flatten nvm_id_group into nvm_id
  lightnvm: add 2.0 geometry identification
  nvme: lightnvm: add late setup of block size and metadata

 drivers/lightnvm/core.c  |  33 ++-
 drivers/nvme/host/core.c |   2 +
 drivers/nvme/host/lightnvm.c | 508 ---
 drivers/nvme/host/nvme.h |   2 +
 include/linux/lightnvm.h |  64 +++---
 5 files changed, 428 insertions(+), 181 deletions(-)

-- 
2.11.0



[PATCH V2 4/4] nvme: lightnvm: add late setup of block size and metadata

2018-02-09 Thread Matias Bjørling
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then it sets the correct logical block and metadata
size. Because the OCSSD 2.0 specification relies on the namespace
exposing these sizes for correct initialization, update them
appropriately on the LightNVM side as well.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/core.c | 2 ++
 drivers/nvme/host/lightnvm.c | 8 
 drivers/nvme/host/nvme.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..740ceb28067c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1379,6 +1379,8 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
if (ns->noiob)
nvme_set_chunk_size(ns);
nvme_update_disk_info(disk, ns, id);
+   if (ns->ndev)
+   nvme_nvm_update_nvm_info(ns);
 #ifdef CONFIG_NVME_MULTIPATH
if (ns->head->disk)
nvme_update_disk_info(ns->head->disk, ns, id);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 8b243af8a949..a19e85f0cbae 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -814,6 +814,14 @@ int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg)
}
 }
 
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns)
+{
+   struct nvm_dev *ndev = ns->ndev;
+
+   ndev->identity.csecs = ndev->geo.sec_size = 1 << ns->lba_shift;
+   ndev->identity.sos = ndev->geo.oob_size = ns->ms;
+}
+
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 {
struct request_queue *q = ns->queue;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..1ca08f4993ba 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -451,12 +451,14 @@ static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
 #endif /* CONFIG_NVME_MULTIPATH */
 
 #ifdef CONFIG_NVM
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns);
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
 void nvme_nvm_unregister(struct nvme_ns *ns);
 int nvme_nvm_register_sysfs(struct nvme_ns *ns);
 void nvme_nvm_unregister_sysfs(struct nvme_ns *ns);
 int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg);
 #else
+static inline void nvme_nvm_update_nvm_info(struct nvme_ns *ns) {};
 static inline int nvme_nvm_register(struct nvme_ns *ns, char *disk_name,
int node)
 {
-- 
2.11.0



[PATCH V2 3/4] lightnvm: add 2.0 geometry identification

2018-02-09 Thread Matias Bjørling
Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  |   8 +-
 drivers/nvme/host/lightnvm.c | 334 +--
 include/linux/lightnvm.h |  11 +-
 3 files changed, 297 insertions(+), 56 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c72863b36439..9b1255b3e05e 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -931,11 +931,9 @@ static int nvm_init(struct nvm_dev *dev)
goto err;
}
 
-   pr_debug("nvm: ver:%x nvm_vendor:%x\n",
-   dev->identity.ver_id, dev->identity.vmnt);
-
-   if (dev->identity.ver_id != 1) {
-   pr_err("nvm: device not supported by kernel.");
+   if (dev->identity.ver_id != 1 && dev->identity.ver_id != 2) {
+   pr_err("nvm: device ver_id %d not supported by kernel.\n",
+   dev->identity.ver_id);
goto err;
}
 
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 6412551ecc65..8b243af8a949 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -184,6 +184,58 @@ struct nvme_nvm_bb_tbl {
__u8    blk[0];
 };
 
+struct nvme_nvm_id20_addrf {
+   __u8    grp_len;
+   __u8    pu_len;
+   __u8    chk_len;
+   __u8    lba_len;
+   __u8    resv[4];
+};
+
+struct nvme_nvm_id20 {
+   __u8    mjr;
+   __u8    mnr;
+   __u8    resv[6];
+
+   struct nvme_nvm_id20_addrf lbaf;
+
+   __le32  mccap;
+   __u8    resv2[12];
+
+   __u8    wit;
+   __u8    resv3[31];
+
+   /* Geometry */
+   __le16  num_grp;
+   __le16  num_pu;
+   __le32  num_chk;
+   __le32  clba;
+   __u8    resv4[52];
+
+   /* Write data requirements */
+   __le32  ws_min;
+   __le32  ws_opt;
+   __le32  mw_cunits;
+   __le32  maxoc;
+   __le32  maxocpu;
+   __u8    resv5[44];
+
+   /* Performance related metrics */
+   __le32  trdt;
+   __le32  trdm;
+   __le32  twrt;
+   __le32  twrm;
+   __le32  tcrst;
+   __le32  tcrsm;
+   __u8    resv6[40];
+
+   /* Reserved area */
+   __u8    resv7[2816];
+
+   /* Vendor specific */
+   __u8    vs[1024];
+};
+
 /*
  * Check we didn't inadvertently grow the command struct
  */
@@ -198,6 +250,8 @@ static inline void _nvme_nvm_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id20_addrf) != 8);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id20) != NVME_IDENTIFY_DATA_SIZE);
 }
 
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
@@ -256,6 +310,49 @@ static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
return 0;
 }
 
+static int nvme_nvm_setup_12(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+   struct nvme_nvm_id12 *id)
+{
+   nvm_id->ver_id = id->ver_id;
+   nvm_id->vmnt = id->vmnt;
+   nvm_id->cap = le32_to_cpu(id->cap);
+   nvm_id->dom = le32_to_cpu(id->dom);
+   memcpy(&nvm_id->ppaf, &id->ppaf,
+   sizeof(struct nvm_addr_format));
+
+   return init_grp(nvm_id, id);
+}
+
+static int nvme_nvm_setup_20(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+   struct nvme_nvm_id20 *id)
+{
+   nvm_id->ver_id = id->mjr;
+
+   nvm_id->num_ch = le16_to_cpu(id->num_grp);
+   nvm_id->num_lun = le16_to_cpu(id->num_pu);
+   nvm_id->num_chk = le32_to_cpu(id->num_chk);
+   nvm_id->clba = le32_to_cpu(id->clba);
+
+   nvm_id->ws_min = le32_to_cpu(id->ws_min);
+   nvm_id->ws_opt = le32_to_cpu(id->ws_opt);
+   nvm_id->mw_cunits = le32_to_cpu(id->mw_cunits);
+
+   nvm_id->trdt = le32_to_cpu(id->trdt);
+   nvm_id->trdm = le32_to_cpu(id->trdm);
+   nvm_id->tprt = le32_to_cpu(id->twrt);
+   nvm_id->tprm = le32_to_cpu(id->twrm);
+   nvm_id->tbet = le32_to_cpu(id->tcrst);
+   nvm_id->tbem = le32_to_cpu(id->tcrsm);
+
+   /* calculated values */
+   nvm_id->ws_per_chk = nvm_id->clba / nvm_id->ws_min;
+
+   /* 1.2 compatibility */
+