Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
Hi Kashyap, On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > -Original Message- > > From: Ming Lei [mailto:ming@redhat.com] > > Sent: Friday, February 9, 2018 11:01 AM > > To: Kashyap Desai > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > -Original Message- > > > > From: Ming Lei [mailto:ming@redhat.com] > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > To: Hannes Reinecke > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun > > > > Easi; Omar > > > Sandoval; > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > Peter > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > introduce force_blk_mq > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > >> -Original Message- > > > > > >> From: Ming Lei [mailto:ming@redhat.com] > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > >> To: Hannes Reinecke > > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > >> Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; > > > > > >> Arun Easi; Omar > > > > > > Sandoval; > > > > > >> Martin K . 
Petersen; James Bottomley; Christoph Hellwig; Don > > > > > >> Brace; > > > > > > Peter > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > >> introduce force_blk_mq > > > > > >> > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke > wrote: > > > > > >>> Hi all, > > > > > >>> > > > > > >>> [ .. ] > > > > > > > > > > > > Could you share us your patch for enabling global_tags/MQ on > > > > > megaraid_sas > > > > > > so that I can reproduce your test? > > > > > > > > > > > >> See below perf top data. "bt_iter" is consuming 4 times > > > > > >> more > > > CPU. > > > > > > > > > > > > Could you share us what the IOPS/CPU utilization effect is > > > > > > after > > > > > applying the > > > > > > patch V2? And your test script? > > > > > Regarding CPU utilization, I need to test one more time. > > > > > Currently system is in used. > > > > > > > > > > I run below fio test on total 24 SSDs expander attached. > > > > > > > > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > > > --ioengine=libaio --rw=randread > > > > > > > > > > Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > > > > > >>> This is basically what we've seen with earlier iterations. > > > > > >> > > > > > >> Hi Hannes, > > > > > >> > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big > > > > > >> issue, > > > > > > which > > > > > >> causes only reply queue 0 used. > > > > > >> > > > > > >> [1] https://marc.info/?l=linux-scsi=151793204014631=2 > > > > > >> > > > > > >> So could you guys run your performance test again after fixing > > > > > >> the > > > > > > patch? > > > > > > > > > > > > Ming - > > > > > > > > > > > > I tried after change you requested. Performance drop is still > > > unresolved. > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. 
> > > > > >
> > > > > > IRQs / 1 second(s)
> > > > > > IRQ#   TOTAL  NODE0  NODE1  NAME
> > > > > >  360   16422      0  16422  IR-PCI-MSI 70254653-edge megasas
> > > > > >  364   15980      0  15980  IR-PCI-MSI 70254657-edge megasas
> > > > > >  362   15979      0  15979  IR-PCI-MSI 70254655-edge megasas
> > > > > >  345   15696      0  15696  IR-PCI-MSI 70254638-edge megasas
> > > > > >  341   15659      0  15659  IR-PCI-MSI 70254634-edge megasas
> > > > > >  369   15656      0  15656  IR-PCI-MSI 70254662-edge megasas
> > > > > >  359   15650      0  15650  IR-PCI-MSI 70254652-edge megasas
> > > > > >  358   15596      0  15596  IR-PCI-MSI 70254651-edge megasas
> > > > > >  350   15574      0  15574  IR-PCI-MSI 70254643-edge megasas
> > > > > >  342   15532      0  15532  IR-PCI-MSI 70254635-edge megasas
> > > > > >  344   15527      0  15527  IR-PCI-MSI 70254637-edge megasas
> > > > > >  346   15485      0  15485  IR-PCI-MSI 70254639-edge megasas
> > > > > >  361   15482      0  15482  IR-PCI-MSI 70254654-edge megasas
> > > > > >  348   15467      0  15467  IR-PCI-MSI 70254641-edge megasas
> > > > > >  368   15463      0
[PATCH V2] block: pass inclusive 'lend' parameter to truncate_inode_pages_range
The 'lend' parameter of truncate_inode_pages_range is required to be
inclusive, so follow the rule.

This patch fixes one memory corruption triggered by discard.

Cc: Dmitry Monakhov
Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Ming Lei
---
V2:
	- Cc stable list and Dmitry as suggested by Bart

 block/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..3884d810efd2 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -225,7 +225,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	if (start + len > i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	truncate_inode_pages_range(mapping, start, start + len);
+	truncate_inode_pages_range(mapping, start, start + len - 1);
 	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
 				    flags);
 }
-- 
2.9.5
Re: [LSF/MM ATTEND] State of blk-mq I/O scheduling and next steps
On Fri, Feb 9, 2018 at 5:47 AM, Omar Sandoval wrote:
> Hi,
>
> I'd like to attend LSF/MM to talk about the state of I/O scheduling in
> blk-mq. I can present some results about mq-deadline and Kyber on our
> production workloads. I'd also like to talk about what's next, in
> particular, improvements and features for Kyber.

Hi Omar,

I am very interested in these topics, and am especially looking forward to your update on what's next. I am also thinking about improving SSD/NVMe performance via Kyber, and would be interested in any topics on SSD/NVMe performance and on making I/O patterns SSD/NVMe friendly.

Thanks,
Ming Lei
Re: v4.16-rc1 + dm-mpath + BFQ
On 2/9/18 12:14 PM, Bart Van Assche wrote:
> On 02/09/18 10:58, Jens Axboe wrote:
>> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>>> Hello Paolo,
>>>
>>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>>> appears (see also below). This happens systematically with Linus' tree from
>>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch
>>> (commit a78773906147 ("block, bfq: add requeue-request hook")) and for-next
>>> branch (commit 88455ad7f928). Is this a known issue?
>>
>> Does it happen on Linus -git as well, or just with my for-linus merged in?
>> What I'm getting at is if a78773906147 caused this or not.
>
> Hello Jens,
>
> Thanks for chiming in. After having reverted commit a78773906147, after
> having rebuilt the BFQ scheduler, after having rebooted and after having
> repeated the test I see the same kernel oops being reported. I think
> that means that this regression is not caused by commit a78773906147. In
> case it would be useful, here is how gdb translates the crash address:
>
> $ gdb block/bfq*ko
> (gdb) list *(bfq_remove_request+0x8d)
> 0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
> 1755		list_del_init(&rq->queuelist);
> 1756		bfqq->queued[sync]--;
> 1757		bfqd->queued--;
> 1758		elv_rb_del(&bfqq->sort_list, rq);
> 1759
> 1760		elv_rqhash_del(q, rq);
> 1761		if (q->last_merge == rq)
> 1762			q->last_merge = NULL;
> 1763
> 1764		if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Looks very odd. So clearly RQF_HASHED is set, but we're blowing up on the hash list pointers. I'll let Paolo take a look at this one.

Thanks for testing without that commit, I want to push out my pending fixes today and this would have thrown a wrench in the works.

-- 
Jens Axboe
Re: v4.16-rc1 + dm-mpath + BFQ
On 02/09/18 10:58, Jens Axboe wrote:
> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>> Hello Paolo,
>>
>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>> appears (see also below). This happens systematically with Linus' tree from
>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch
>> (commit a78773906147 ("block, bfq: add requeue-request hook")) and for-next
>> branch (commit 88455ad7f928). Is this a known issue?
>
> Does it happen on Linus -git as well, or just with my for-linus merged in?
> What I'm getting at is if a78773906147 caused this or not.

Hello Jens,

Thanks for chiming in. After having reverted commit a78773906147, after having rebuilt the BFQ scheduler, after having rebooted and after having repeated the test I see the same kernel oops being reported. I think that means that this regression is not caused by commit a78773906147. In case it would be useful, here is how gdb translates the crash address:

$ gdb block/bfq*ko
(gdb) list *(bfq_remove_request+0x8d)
0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
1755		list_del_init(&rq->queuelist);
1756		bfqq->queued[sync]--;
1757		bfqd->queued--;
1758		elv_rb_del(&bfqq->sort_list, rq);
1759
1760		elv_rqhash_del(q, rq);
1761		if (q->last_merge == rq)
1762			q->last_merge = NULL;
1763
1764		if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Bart.
Re: v4.16-rc1 + dm-mpath + BFQ
On 2/9/18 11:54 AM, Bart Van Assche wrote: > Hello Paolo, > > If I enable the BFQ scheduler for a dm-mpath device then a kernel oops > appears (see also below). This happens systematically with Linus' tree from > this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit > a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch > (commit 88455ad7f928). Is this a known issue? Does it happen on Linus -git as well, or just with my for-linus merged in? What I'm getting at is if a78773906147 caused this or not. -- Jens Axboe
v4.16-rc1 + dm-mpath + BFQ
Hello Paolo,

If I enable the BFQ scheduler for a dm-mpath device then a kernel oops appears (see also below). This happens systematically with Linus' tree from this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch (commit 88455ad7f928). Is this a known issue?

Thanks,

Bart.

BUG: unable to handle kernel NULL pointer dereference at 0200
IP: rb_erase+0x284/0x380
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W  4.15.0-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:rb_erase+0x284/0x380
Call Trace:
 elv_rb_del+0x20/0x30
 bfq_remove_request+0x8d/0x2e0 [bfq]
 bfq_finish_requeue_request+0x2bb/0x390 [bfq]
 blk_mq_free_request+0x51/0x170
 dm_softirq_done+0xd5/0x240 [dm_mod]
 flush_smp_call_function_queue+0x92/0x140
 smp_call_function_single_interrupt+0x47/0x2b0
 call_function_single_interrupt+0xa2/0xb0
[PATCH v3 2/3] block: Fix a race between the cgroup code and request queue initialization
Initialize the request queue lock earlier such that the following race can no longer occur:

blk_init_queue_node()                blkcg_print_blkgs()
  blk_alloc_queue_node (1)
    q->queue_lock = &q->__queue_lock (2)
    blkcg_init_queue(q) (3)
                                       spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
                                       spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize the blkcg part of the request queue, which will create blkg and then insert it into blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its queue lock; here it is the default *internal lock*;
(5) *race window*: now queue_lock is overridden with the *driver-specified lock*;
(6) now unlock the *driver-specified lock*, not the locked *internal lock*; the unlock balance breaks.

The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into blk_alloc_queue_node().
- Only override the .queue_lock pointer for legacy queues because it is not useful for blk-mq queues to override this pointer.
- For all block drivers that initialize .queue_lock explicitly, change the blk_alloc_queue() call in the driver into a blk_alloc_queue_node() call and remove the explicit .queue_lock initialization. Additionally, initialize the spin lock that will be used as queue lock earlier if necessary.
Reported-by: Joseph QiSigned-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Joseph Qi Cc: Philipp Reisner Cc: Ulf Hansson Cc: Kees Cook --- block/blk-core.c | 24 drivers/block/drbd/drbd_main.c | 3 +-- drivers/block/umem.c | 7 +++ 3 files changed, 20 insertions(+), 14 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index e873a24bf82d..41c74b37be85 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -888,6 +888,19 @@ static void blk_rq_timed_out_timer(struct timer_list *t) kblockd_schedule_work(>timeout_work); } +/** + * blk_alloc_queue_node - allocate a request queue + * @gfp_mask: memory allocation flags + * @node_id: NUMA node to allocate memory from + * @lock: For legacy queues, pointer to a spinlock that will be used to e.g. + *serialize calls to the legacy .request_fn() callback. Ignored for + * blk-mq request queues. + * + * Note: pass the queue lock as the third argument to this function instead of + * setting the queue lock pointer explicitly to avoid triggering a sporadic + * crash in the blkcg code. This function namely calls blkcg_init_queue() and + * the queue lock pointer must be set before blkcg_init_queue() is called. + */ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, spinlock_t *lock) { @@ -940,11 +953,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, mutex_init(>sysfs_lock); spin_lock_init(>__queue_lock); - /* -* By default initialize queue_lock to internal lock and driver can -* override it later if need be. -*/ - q->queue_lock = >__queue_lock; + if (!q->mq_ops) + q->queue_lock = lock ? 
: >__queue_lock; /* * A queue starts its life with bypass turned on to avoid @@ -1031,13 +1041,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id) { struct request_queue *q; - q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL); + q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock); if (!q) return NULL; q->request_fn = rfn; - if (lock) - q->queue_lock = lock; if (blk_init_allocated_queue(q) < 0) { blk_cleanup_queue(q); return NULL; diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 0a0394aa1b9c..185f1ef00a7c 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -2816,7 +2816,7 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig drbd_init_set_defaults(device); - q = blk_alloc_queue(GFP_KERNEL); + q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, >req_lock); if (!q) goto out_no_q; device->rq_queue = q; @@ -2848,7 +2848,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig /* Setting the max_hw_sectors to an odd value of 8kibyte here This triggers a max_bio_size message upon first attach or connect */ blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8); - q->queue_lock = >req_lock; device->md_io.page =
[PATCH v3 1/3] block: Add a third argument to blk_alloc_queue_node()
This patch does not change any functionality. Signed-off-by: Bart Van AsscheCc: Christoph Hellwig Cc: Joseph Qi Cc: Philipp Reisner Cc: Ulf Hansson Cc: Kees Cook --- block/blk-core.c | 7 --- block/blk-mq.c| 2 +- drivers/block/null_blk.c | 3 ++- drivers/ide/ide-probe.c | 2 +- drivers/lightnvm/core.c | 2 +- drivers/md/dm.c | 2 +- drivers/nvdimm/pmem.c | 2 +- drivers/nvme/host/multipath.c | 2 +- drivers/scsi/scsi_lib.c | 2 +- include/linux/blkdev.h| 3 ++- 10 files changed, 15 insertions(+), 12 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2d1a7bbe0634..e873a24bf82d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -810,7 +810,7 @@ void blk_exit_rl(struct request_queue *q, struct request_list *rl) struct request_queue *blk_alloc_queue(gfp_t gfp_mask) { - return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE); + return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE, NULL); } EXPORT_SYMBOL(blk_alloc_queue); @@ -888,7 +888,8 @@ static void blk_rq_timed_out_timer(struct timer_list *t) kblockd_schedule_work(>timeout_work); } -struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) +struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, + spinlock_t *lock) { struct request_queue *q; @@ -1030,7 +1031,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id) { struct request_queue *q; - q = blk_alloc_queue_node(GFP_KERNEL, node_id); + q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL); if (!q) return NULL; diff --git a/block/blk-mq.c b/block/blk-mq.c index df93102e2149..ae4e5096f425 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2554,7 +2554,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) { struct request_queue *uninit_q, *q; - uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node); + uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL); if (!uninit_q) return ERR_PTR(-ENOMEM); diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index 
287a09611c0f..f3d17ea88965 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -1717,7 +1717,8 @@ static int null_add_dev(struct nullb_device *dev) } null_init_queues(nullb); } else if (dev->queue_mode == NULL_Q_BIO) { - nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node); + nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node, + NULL); if (!nullb->q) { rv = -ENOMEM; goto out_cleanup_queues; diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c index 17fd55af4d92..2e80a866073c 100644 --- a/drivers/ide/ide-probe.c +++ b/drivers/ide/ide-probe.c @@ -766,7 +766,7 @@ static int ide_init_queue(ide_drive_t *drive) * limits and LBA48 we could raise it but as yet * do not. */ - q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif)); + q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif), NULL); if (!q) return 1; diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c index dcc9e621e651..5f1988df1593 100644 --- a/drivers/lightnvm/core.c +++ b/drivers/lightnvm/core.c @@ -384,7 +384,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create) goto err_dev; } - tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node); + tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node, NULL); if (!tqueue) { ret = -ENOMEM; goto err_disk; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index d6de00f367ef..3c55564f6367 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1840,7 +1840,7 @@ static struct mapped_device *alloc_dev(int minor) INIT_LIST_HEAD(>table_devices); spin_lock_init(>uevent_lock); - md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); + md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, NULL); if (!md->queue) goto bad; md->queue->queuedata = md; diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 10041ac4032c..cfb15ac50925 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -344,7 +344,7 @@ static int pmem_attach_disk(struct device *dev, return -EBUSY; } 
- q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev)); + q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev), NULL); if (!q)
[PATCH v3 0/3] Fix races between blkcg code and request queue initialization and cleanup
Hello Jens,

Recently Joseph Qi identified races between the block cgroup code and request queue initialization and cleanup. This patch series addresses these races. Please consider these patches for kernel v4.17.

Thanks,

Bart.

Changes between v2 and v3:
- Added a third patch that fixes a race between the blkcg code and queue cleanup.

Changes between v1 and v2:
- Split a single patch into two patches.
- Dropped blk_alloc_queue_node2() and modified all block drivers that call blk_alloc_queue_node().

Bart Van Assche (3):
  block: Add a third argument to blk_alloc_queue_node()
  block: Fix a race between the cgroup code and request queue initialization
  block: Fix a race between request queue removal and the block cgroup controller

 block/blk-core.c               | 60 +++---
 block/blk-mq.c                 |  2 +-
 block/blk-sysfs.c              |  7 -
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/null_blk.c       |  3 ++-
 drivers/block/umem.c           |  7 +++--
 drivers/ide/ide-probe.c        |  2 +-
 drivers/lightnvm/core.c        |  2 +-
 drivers/md/dm.c                |  2 +-
 drivers/nvdimm/pmem.c          |  2 +-
 drivers/nvme/host/multipath.c  |  2 +-
 drivers/scsi/scsi_lib.c        |  2 +-
 include/linux/blkdev.h         |  3 ++-
 13 files changed, 65 insertions(+), 32 deletions(-)

-- 
2.16.1
[PATCH v3 3/3] block: Fix a race between request queue removal and the block cgroup controller
Avoid that the following race can occur:

blk_cleanup_queue()               blkcg_print_blkgs()
  spin_lock_irq(lock) (1)
                                    spin_lock_irq(blkg->q->queue_lock) (2,5)
  q->queue_lock = &q->__queue_lock (3)
  spin_unlock_irq(lock) (4)
                                    spin_unlock_irq(blkg->q->queue_lock) (6)

(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock;
(5) can take driver lock now;
(6) but unlock internal lock.

This change is safe because only the SCSI core and the NVMe core keep a reference on a request queue after having called blk_cleanup_queue(). Neither driver accesses any of the removed data structures between its blk_cleanup_queue() and blk_put_queue() calls.

Reported-by: Joseph Qi
Signed-off-by: Bart Van Assche
Cc: Jan Kara
---
 block/blk-core.c  | 31 +++
 block/blk-sysfs.c |  7 ---
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 41c74b37be85..6febc69a58aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,6 +719,37 @@ void blk_cleanup_queue(struct request_queue *q)
 	del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
 	blk_sync_queue(q);
 
+	/*
+	 * I/O scheduler exit is only safe after the sysfs scheduler attribute
+	 * has been removed.
+	 */
+	WARN_ON_ONCE(q->kobj.state_in_sysfs);
+
+	/*
+	 * Since the I/O scheduler exit code may access cgroup information,
+	 * perform I/O scheduler exit before disassociating from the block
+	 * cgroup controller.
+	 */
+	if (q->elevator) {
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+		q->elevator = NULL;
+	}
+
+	/*
+	 * Remove all references to @q from the block cgroup controller before
+	 * restoring @q->queue_lock to avoid that restoring this pointer causes
+	 * e.g. blkcg_print_blkgs() to crash.
+	 */
+	blkcg_exit_queue(q);
+
+	/*
+	 * Since the cgroup code may dereference the @q->backing_dev_info
+	 * pointer, only decrease its reference count after having removed the
+	 * association with the block cgroup controller.
+	 */
+	bdi_put(q->backing_dev_info);
+
 	if (q->mq_ops)
 		blk_mq_free_queue(q);
 	percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cbea895a5547..fd71a00c9462 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -798,13 +798,6 @@ static void __blk_release_queue(struct work_struct *work)
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
-	bdi_put(q->backing_dev_info);
-	blkcg_exit_queue(q);
-
-	if (q->elevator) {
-		ioc_clear_queue(q);
-		elevator_exit(q, q->elevator);
-	}
 
 	blk_free_queue_stats(q->stats);
-- 
2.16.1
Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook
On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
>
> In addition to this I think it should be worth considering CC'ing Greg
> to pull this fix into 4.15 stable tree.

This isn't one he can cherry-pick; some munging is required, in which case he usually wants a properly tested backport.

-Mike
Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook
On 2/9/18 6:21 AM, Oleksandr Natalenko wrote:
> Hi.
>
> 08.02.2018 08:16, Paolo Valente wrote:
>> On 7 Feb 2018, at 23:18, Jens Axboe wrote:
>>>
>>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>>> be re-inserted into the active I/O scheduler for that device. As a
>>>> consequence, I/O schedulers may get the same request inserted again,
>>>> even several times, without a finish_request invoked on that request
>>>> before each re-insertion.
>>>>
>>>> This fact is the cause of the failure reported in [1]. For an I/O
>>>> scheduler, every re-insertion of the same re-prepared request is
>>>> equivalent to the insertion of a new request. For schedulers like
>>>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>>>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>>>> request, until the finish_request hook is invoked on the request. In
>>>> particular, BFQ may get stuck, waiting forever for the number of
>>>> request dispatches, of the same request, to be balanced by an equal
>>>> number of request completions (while there will be one completion for
>>>> that request). In this state, BFQ may refuse to serve I/O requests
>>>> from other bfq_queues. The hang reported in [1] then follows.
>>>>
>>>> However, the above re-prepared requests undergo a requeue, thus the
>>>> requeue_request hook of the active elevator is invoked for these
>>>> requests, if set. This commit then addresses the above issue by
>>>> properly implementing the hook requeue_request in BFQ.
>>>
>>> Thanks, applied.
>>>
>> Hi Jens,
>> I forgot to add
>> Tested-by: Oleksandr Natalenko
>> in the patch.
>>
>> Is it still possible to add it?
>>
> In addition to this I think it should be worth considering CC'ing Greg
> to pull this fix into 4.15 stable tree.

I can't add the tested-by anymore, but it's easy enough to target for stable after-the-fact.

-- 
Jens Axboe
Re: [PATCH] block: pass inclusive 'lend' parameter to truncate_inode_pages_range
On Fri, 2018-02-09 at 22:15 +0800, Ming Lei wrote: > The 'lend' parameter of truncate_inode_pages_range is required to be > inclusive, so follow the rule. > > This patch fixes one memory corruption triggered by discard. > > Fixes: 351499a172c0 ("block: Invalidate cache on discard v2") Since this bug got introduced in kernel v4.15 please add a "Cc: stable" tag. Please also Cc: the author of the patch. Thanks, Bart.
[PATCH] block: pass inclusive 'lend' parameter to truncate_inode_pages_range
The 'lend' parameter of truncate_inode_pages_range is required to be
inclusive, so follow the rule.

This patch fixes one memory corruption triggered by discard.

Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Ming Lei
---
 block/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..3884d810efd2 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -225,7 +225,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	if (start + len > i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	truncate_inode_pages_range(mapping, start, start + len);
+	truncate_inode_pages_range(mapping, start, start + len - 1);
 	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
 				    flags);
 }
-- 
2.9.5
Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook
Hi.

08.02.2018 08:16, Paolo Valente wrote:
> On 7 Feb 2018, at 23:18, Jens Axboe wrote:
>>
>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>> be re-inserted into the active I/O scheduler for that device. As a
>>> consequence, I/O schedulers may get the same request inserted again,
>>> even several times, without a finish_request invoked on that request
>>> before each re-insertion.
>>>
>>> This fact is the cause of the failure reported in [1]. For an I/O
>>> scheduler, every re-insertion of the same re-prepared request is
>>> equivalent to the insertion of a new request. For schedulers like
>>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>>> request, until the finish_request hook is invoked on the request. In
>>> particular, BFQ may get stuck, waiting forever for the number of
>>> request dispatches, of the same request, to be balanced by an equal
>>> number of request completions (while there will be one completion for
>>> that request). In this state, BFQ may refuse to serve I/O requests
>>> from other bfq_queues. The hang reported in [1] then follows.
>>>
>>> However, the above re-prepared requests undergo a requeue, thus the
>>> requeue_request hook of the active elevator is invoked for these
>>> requests, if set. This commit then addresses the above issue by
>>> properly implementing the hook requeue_request in BFQ.
>>
>> Thanks, applied.
>>
> Hi Jens,
> I forgot to add
> Tested-by: Oleksandr Natalenko
> in the patch.
>
> Is it still possible to add it?

In addition to this I think it should be worth considering CC'ing Greg to pull this fix into the 4.15 stable tree.

Oleksandr
[PATCH v2] blk: optimization for classic polling
From: Nitesh Shetty

This removes the dependency on interrupts to wake up the task. Set the task state to TASK_RUNNING if need_resched() returns true while polling for IO completion. Earlier, the polling task used to sleep, relying on an interrupt to wake it up. This made some IO take very long when interrupt coalescing is enabled in NVMe.

Changes v1->v2:
- set the task state once in blk_poll, instead of in multiple callers.

Signed-off-by: Nitesh Shetty
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..40285fe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 		cpu_relax();
 	}
 
+	set_current_state(TASK_RUNNING);
 	return false;
 }
-- 
2.7.4
[PATCH V2 1/4] lightnvm: make 1.2 data structures explicit
[PATCH V2 1/4] lightnvm: make 1.2 data structures explicit
Make the 1.2 data structures explicit, so it will be easy to identify
the 2.0 data structures. Also fix the order in which the nvme_nvm_*
data structures are declared, such that they follow the
nvme_nvm_command order.

Signed-off-by: Matias Bjørling
---
 drivers/nvme/host/lightnvm.c | 82 ++--
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index dc0b1335c7c6..60db3f1b59da 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -51,6 +51,21 @@ struct nvme_nvm_ph_rw {
 	__le64			resv;
 };
 
+struct nvme_nvm_erase_blk {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			spba;
+	__le16			length;
+	__le16			control;
+	__le32			dsmgmt;
+	__le64			resv;
+};
+
 struct nvme_nvm_identity {
 	__u8			opcode;
 	__u8			flags;
@@ -89,33 +104,18 @@ struct nvme_nvm_setbbtbl {
 	__u32			rsvd4[3];
 };
 
-struct nvme_nvm_erase_blk {
-	__u8			opcode;
-	__u8			flags;
-	__u16			command_id;
-	__le32			nsid;
-	__u64			rsvd[2];
-	__le64			prp1;
-	__le64			prp2;
-	__le64			spba;
-	__le16			length;
-	__le16			control;
-	__le32			dsmgmt;
-	__le64			resv;
-};
-
 struct nvme_nvm_command {
 	union {
 		struct nvme_common_command common;
-		struct nvme_nvm_identity identity;
 		struct nvme_nvm_ph_rw ph_rw;
+		struct nvme_nvm_erase_blk erase;
+		struct nvme_nvm_identity identity;
 		struct nvme_nvm_getbbtbl get_bb;
 		struct nvme_nvm_setbbtbl set_bb;
-		struct nvme_nvm_erase_blk erase;
 	};
 };
 
-struct nvme_nvm_id_group {
+struct nvme_nvm_id12_grp {
 	__u8			mtype;
 	__u8			fmtype;
 	__le16			res16;
@@ -141,7 +141,7 @@ struct nvme_nvm_id_group {
 	__u8			reserved[906];
 } __packed;
 
-struct nvme_nvm_addr_format {
+struct nvme_nvm_id12_addrf {
 	__u8			ch_offset;
 	__u8			ch_len;
 	__u8			lun_offset;
@@ -157,16 +157,16 @@ struct nvme_nvm_addr_format {
 	__u8			res[4];
 } __packed;
 
-struct nvme_nvm_id {
+struct nvme_nvm_id12 {
 	__u8			ver_id;
 	__u8			vmnt;
 	__u8			cgrps;
 	__u8			res;
 	__le32			cap;
 	__le32			dom;
-	struct nvme_nvm_addr_format ppaf;
+	struct nvme_nvm_id12_addrf ppaf;
 	__u8			resv[228];
-	struct nvme_nvm_id_group group;
+	struct nvme_nvm_id12_grp grp;
 	__u8			resv2[2880];
 } __packed;
 
@@ -191,25 +191,25 @@ static inline void _nvme_nvm_check_size(void)
 {
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_identity) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_ph_rw) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_getbbtbl) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_setbbtbl) != 64);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_id_group) != 960);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 16);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_id) != NVME_IDENTIFY_DATA_SIZE);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_grp) != 960);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
 }
 
-static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
+static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
-	struct nvme_nvm_id_group *src;
+	struct nvme_nvm_id12_grp *src;
 	struct nvm_id_group *grp;
 	int sec_per_pg, sec_per_pl, pg_per_blk;
 
-	if (nvme_nvm_id->cgrps != 1)
+	if (id12->cgrps != 1)
 		return -EINVAL;
 
-	src = &nvme_nvm_id->group;
+	src = &id12->grp;
 	grp = &nvm_id->grp;
 
 	grp->mtype = src->mtype;
@@ -261,34 +261,34 @@ static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
[PATCH V2 2/4] lightnvm: flatten nvm_id_group into nvm_id
There are no groups in the 2.0 specification, so make sure that the
nvm_id structure is flattened before 2.0 data structures are added.

Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c      |  25 ++-
 drivers/nvme/host/lightnvm.c | 100 +--
 include/linux/lightnvm.h     |  53 +++
 3 files changed, 86 insertions(+), 92 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..c72863b36439 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -851,33 +851,32 @@ EXPORT_SYMBOL(nvm_get_tgt_bb_tbl);
 static int nvm_core_init(struct nvm_dev *dev)
 {
 	struct nvm_id *id = &dev->identity;
-	struct nvm_id_group *grp = &id->grp;
 	struct nvm_geo *geo = &dev->geo;
 	int ret;
 
 	memcpy(&geo->ppaf, &id->ppaf, sizeof(struct nvm_addr_format));
 
-	if (grp->mtype != 0) {
+	if (id->mtype != 0) {
 		pr_err("nvm: memory type not supported\n");
 		return -EINVAL;
 	}
 
 	/* Whole device values */
-	geo->nr_chnls = grp->num_ch;
-	geo->nr_luns = grp->num_lun;
+	geo->nr_chnls = id->num_ch;
+	geo->nr_luns = id->num_lun;
 
 	/* Generic device geometry values */
-	geo->ws_min = grp->ws_min;
-	geo->ws_opt = grp->ws_opt;
-	geo->ws_seq = grp->ws_seq;
-	geo->ws_per_chk = grp->ws_per_chk;
-	geo->nr_chks = grp->num_chk;
-	geo->sec_size = grp->csecs;
-	geo->oob_size = grp->sos;
-	geo->mccap = grp->mccap;
+	geo->ws_min = id->ws_min;
+	geo->ws_opt = id->ws_opt;
+	geo->ws_seq = id->ws_seq;
+	geo->ws_per_chk = id->ws_per_chk;
+	geo->nr_chks = id->num_chk;
+	geo->sec_size = id->csecs;
+	geo->oob_size = id->sos;
+	geo->mccap = id->mccap;
 	geo->max_rq_size = dev->ops->max_phys_sect * geo->sec_size;
 
-	geo->sec_per_chk = grp->clba;
+	geo->sec_per_chk = id->clba;
 	geo->sec_per_lun = geo->sec_per_chk * geo->nr_chks;
 	geo->all_luns = geo->nr_luns * geo->nr_chnls;
 
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 60db3f1b59da..6412551ecc65 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -203,57 +203,55 @@ static inline void _nvme_nvm_check_size(void)
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
 	struct nvme_nvm_id12_grp *src;
-	struct nvm_id_group *grp;
 	int sec_per_pg, sec_per_pl, pg_per_blk;
 
 	if (id12->cgrps != 1)
 		return -EINVAL;
 
 	src = &id12->grp;
-	grp = &nvm_id->grp;
 
-	grp->mtype = src->mtype;
-	grp->fmtype = src->fmtype;
+	nvm_id->mtype = src->mtype;
+	nvm_id->fmtype = src->fmtype;
 
-	grp->num_ch = src->num_ch;
-	grp->num_lun = src->num_lun;
+	nvm_id->num_ch = src->num_ch;
+	nvm_id->num_lun = src->num_lun;
 
-	grp->num_chk = le16_to_cpu(src->num_chk);
-	grp->csecs = le16_to_cpu(src->csecs);
-	grp->sos = le16_to_cpu(src->sos);
+	nvm_id->num_chk = le16_to_cpu(src->num_chk);
+	nvm_id->csecs = le16_to_cpu(src->csecs);
+	nvm_id->sos = le16_to_cpu(src->sos);
 
 	pg_per_blk = le16_to_cpu(src->num_pg);
-	sec_per_pg = le16_to_cpu(src->fpg_sz) / grp->csecs;
+	sec_per_pg = le16_to_cpu(src->fpg_sz) / nvm_id->csecs;
 	sec_per_pl = sec_per_pg * src->num_pln;
-	grp->clba = sec_per_pl * pg_per_blk;
-	grp->ws_per_chk = pg_per_blk;
+	nvm_id->clba = sec_per_pl * pg_per_blk;
+	nvm_id->ws_per_chk = pg_per_blk;
 
-	grp->mpos = le32_to_cpu(src->mpos);
-	grp->cpar = le16_to_cpu(src->cpar);
-	grp->mccap = le32_to_cpu(src->mccap);
+	nvm_id->mpos = le32_to_cpu(src->mpos);
+	nvm_id->cpar = le16_to_cpu(src->cpar);
+	nvm_id->mccap = le32_to_cpu(src->mccap);
 
-	grp->ws_opt = grp->ws_min = sec_per_pg;
-	grp->ws_seq = NVM_IO_SNGL_ACCESS;
+	nvm_id->ws_opt = nvm_id->ws_min = sec_per_pg;
+	nvm_id->ws_seq = NVM_IO_SNGL_ACCESS;
 
-	if (grp->mpos & 0x020202) {
-		grp->ws_seq = NVM_IO_DUAL_ACCESS;
-		grp->ws_opt <<= 1;
-	} else if (grp->mpos & 0x040404) {
-		grp->ws_seq = NVM_IO_QUAD_ACCESS;
-		grp->ws_opt <<= 2;
+	if (nvm_id->mpos & 0x020202) {
+		nvm_id->ws_seq = NVM_IO_DUAL_ACCESS;
+		nvm_id->ws_opt <<= 1;
+	} else if (nvm_id->mpos & 0x040404) {
+		nvm_id->ws_seq = NVM_IO_QUAD_ACCESS;
+		nvm_id->ws_opt <<= 2;
 	}
 
-	grp->trdt = le32_to_cpu(src->trdt);
-	grp->trdm = le32_to_cpu(src->trdm);
-	grp->tprt = le32_to_cpu(src->tprt);
-	grp->tprm = le32_to_cpu(src->tprm);
-	grp->tbet = le32_to_cpu(src->tbet);
-	grp->tbem = le32_to_cpu(src->tbem);
+	nvm_id->trdt = le32_to_cpu(src->trdt);
+	nvm_id->trdm =
[PATCH V2 0/4] lightnvm: base 2.0 implementation
A couple of patches for 2.0 support for the lightnvm subsystem. They
form the basis for integrating 2.0 support.

For the rest of the support, Javier has code that implements report
chunk and sets up the LBA format data structure. He also has a bunch
of patches that bring pblk up to speed.

The first two patches are preparation for the 2.0 work. The third
patch implements the 2.0 data structures, the geometry command, and
exposes the sysfs attributes that come with the 2.0 specification.
Note that the attributes between 1.2 and 2.0 are different, and it is
expected that user-space shall use the version sysfs attribute to know
which attributes will be available.

The last patch implements support for using the nvme namespace logical
block and metadata fields and syncs them with the internal lightnvm
identify structures.

Changes since v1:

 - pr_err fix from Randy.
 - Address type fix from Javier.
 - Also CC the nvme mailing list.

Matias Bjørling (4):
  lightnvm: make 1.2 data structures explicit
  lightnvm: flatten nvm_id_group into nvm_id
  lightnvm: add 2.0 geometry identification
  nvme: lightnvm: add late setup of block size and metadata

 drivers/lightnvm/core.c      |  33 ++-
 drivers/nvme/host/core.c     |   2 +
 drivers/nvme/host/lightnvm.c | 508 ---
 drivers/nvme/host/nvme.h     |   2 +
 include/linux/lightnvm.h     |  64 +++---
 5 files changed, 428 insertions(+), 181 deletions(-)

-- 
2.11.0
[PATCH V2 4/4] nvme: lightnvm: add late setup of block size and metadata
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Since the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let them be updated
appropriately on the LightNVM side as well.

Signed-off-by: Matias Bjørling
---
 drivers/nvme/host/core.c     | 2 ++
 drivers/nvme/host/lightnvm.c | 8 ++++++++
 drivers/nvme/host/nvme.h     | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..740ceb28067c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1379,6 +1379,8 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	nvme_update_disk_info(disk, ns, id);
+	if (ns->ndev)
+		nvme_nvm_update_nvm_info(ns);
 #ifdef CONFIG_NVME_MULTIPATH
 	if (ns->head->disk)
 		nvme_update_disk_info(ns->head->disk, ns, id);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 8b243af8a949..a19e85f0cbae 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -814,6 +814,14 @@ int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg)
 	}
 }
 
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns)
+{
+	struct nvm_dev *ndev = ns->ndev;
+
+	ndev->identity.csecs = ndev->geo.sec_size = 1 << ns->lba_shift;
+	ndev->identity.sos = ndev->geo.oob_size = ns->ms;
+}
+
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 {
 	struct request_queue *q = ns->queue;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..1ca08f4993ba 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -451,12 +451,14 @@ static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
 #endif /* CONFIG_NVME_MULTIPATH */
 
 #ifdef CONFIG_NVM
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns);
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
 void nvme_nvm_unregister(struct nvme_ns *ns);
 int nvme_nvm_register_sysfs(struct nvme_ns *ns);
 void nvme_nvm_unregister_sysfs(struct nvme_ns *ns);
 int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg);
 #else
+static inline void nvme_nvm_update_nvm_info(struct nvme_ns *ns) {};
 static inline int nvme_nvm_register(struct nvme_ns *ns, char *disk_name,
 				    int node)
 {
-- 
2.11.0
[PATCH V2 3/4] lightnvm: add 2.0 geometry identification
Implement the geometry data structures for 2.0 and enable a drive to be
identified as one, including exposing the appropriate 2.0 sysfs
entries.

Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c      |   8 +-
 drivers/nvme/host/lightnvm.c | 334 +--
 include/linux/lightnvm.h     |  11 +-
 3 files changed, 297 insertions(+), 56 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c72863b36439..9b1255b3e05e 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -931,11 +931,9 @@ static int nvm_init(struct nvm_dev *dev)
 		goto err;
 	}
 
-	pr_debug("nvm: ver:%x nvm_vendor:%x\n",
-			dev->identity.ver_id, dev->identity.vmnt);
-
-	if (dev->identity.ver_id != 1) {
-		pr_err("nvm: device not supported by kernel.");
+	if (dev->identity.ver_id != 1 && dev->identity.ver_id != 2) {
+		pr_err("nvm: device ver_id %d not supported by kernel.\n",
+				dev->identity.ver_id);
 		goto err;
 	}
 
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 6412551ecc65..8b243af8a949 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -184,6 +184,58 @@ struct nvme_nvm_bb_tbl {
 	__u8	blk[0];
 };
 
+struct nvme_nvm_id20_addrf {
+	__u8			grp_len;
+	__u8			pu_len;
+	__u8			chk_len;
+	__u8			lba_len;
+	__u8			resv[4];
+};
+
+struct nvme_nvm_id20 {
+	__u8			mjr;
+	__u8			mnr;
+	__u8			resv[6];
+
+	struct nvme_nvm_id20_addrf lbaf;
+
+	__le32			mccap;
+	__u8			resv2[12];
+
+	__u8			wit;
+	__u8			resv3[31];
+
+	/* Geometry */
+	__le16			num_grp;
+	__le16			num_pu;
+	__le32			num_chk;
+	__le32			clba;
+	__u8			resv4[52];
+
+	/* Write data requirements */
+	__le32			ws_min;
+	__le32			ws_opt;
+	__le32			mw_cunits;
+	__le32			maxoc;
+	__le32			maxocpu;
+	__u8			resv5[44];
+
+	/* Performance related metrics */
+	__le32			trdt;
+	__le32			trdm;
+	__le32			twrt;
+	__le32			twrm;
+	__le32			tcrst;
+	__le32			tcrsm;
+	__u8			resv6[40];
+
+	/* Reserved area */
+	__u8			resv7[2816];
+
+	/* Vendor specific */
+	__u8			vs[1024];
+};
+
 /*
  * Check we didn't inadvertently grow the command struct
  */
@@ -198,6 +250,8 @@ static inline void _nvme_nvm_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_id20_addrf) != 8);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_id20) != NVME_IDENTIFY_DATA_SIZE);
 }
 
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
@@ -256,6 +310,49 @@ static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 	return 0;
 }
 
+static int nvme_nvm_setup_12(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+		struct nvme_nvm_id12 *id)
+{
+	nvm_id->ver_id = id->ver_id;
+	nvm_id->vmnt = id->vmnt;
+	nvm_id->cap = le32_to_cpu(id->cap);
+	nvm_id->dom = le32_to_cpu(id->dom);
+	memcpy(&nvm_id->ppaf, &id->ppaf,
+			sizeof(struct nvm_addr_format));
+
+	return init_grp(nvm_id, id);
+}
+
+static int nvme_nvm_setup_20(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+		struct nvme_nvm_id20 *id)
+{
+	nvm_id->ver_id = id->mjr;
+
+	nvm_id->num_ch = le16_to_cpu(id->num_grp);
+	nvm_id->num_lun = le16_to_cpu(id->num_pu);
+	nvm_id->num_chk = le32_to_cpu(id->num_chk);
+	nvm_id->clba = le32_to_cpu(id->clba);
+
+	nvm_id->ws_min = le32_to_cpu(id->ws_min);
+	nvm_id->ws_opt = le32_to_cpu(id->ws_opt);
+	nvm_id->mw_cunits = le32_to_cpu(id->mw_cunits);
+
+	nvm_id->trdt = le32_to_cpu(id->trdt);
+	nvm_id->trdm = le32_to_cpu(id->trdm);
+	nvm_id->tprt = le32_to_cpu(id->twrt);
+	nvm_id->tprm = le32_to_cpu(id->twrm);
+	nvm_id->tbet = le32_to_cpu(id->tcrst);
+	nvm_id->tbem = le32_to_cpu(id->tcrsm);
+
+	/* calculated values */
+	nvm_id->ws_per_chk = nvm_id->clba / nvm_id->ws_min;
+
+	/* 1.2 compatibility */
+