Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
Reproducer (needs SCSI disk):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define NR_IOS 1
#define NR_IOVECS 8
#define SG_IO 0x2285

int main(int argc, char *argv[])
{
	int fd, i, j;
	unsigned char *buf, *ptr, cdb[10];
	sg_io_hdr_t io_hdr;
	sg_iovec_t iovec[NR_IOVECS];

	if (argc < 2) {
		printf("Run: %s <device>\n", argv[0]);
		exit(1);
	}

	buf = ptr = memalign(4096, NR_IOS * NR_IOVECS * 512);
	if (!buf) {
		printf("can't alloc memory\n");
		exit(1);
	}

	fd = open(argv[1], 0);
	if (fd < 0) {
		printf("open %s failed: %d (%s)\n", argv[1], errno, strerror(errno));
		exit(1);
	}

	io_hdr.interface_id = 'S';
	io_hdr.cmd_len = sizeof(cdb);
	io_hdr.cmdp = cdb;
	io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	io_hdr.dxfer_len = 512 * NR_IOVECS;
	io_hdr.dxferp = iovec;
	io_hdr.iovec_count = NR_IOVECS;

	cdb[0] = 0x28;		// READ10
	cdb[8] = NR_IOVECS;	// sectors

	for (j = 0; j < NR_IOS; j++, ptr += 512) {
		for (i = 0; i < NR_IOVECS; i++) {
			iovec[i].iov_base = ptr;
			iovec[i].iov_len = 512;
		}

		if (ioctl(fd, SG_IO, &io_hdr)) {
			printf("IOCTL failed: %d (%s)\n", errno, strerror(errno));
			exit(1);
		}
	}

	free(buf);
	close(fd);
	return 0;
}

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          46        3601           0         178        3568
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3562           0         178        3529
Swap:             0           0           0
[root@node-A ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3628           0         113        3561
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827         124        3589           0         113        3523
Swap:             0           0           0

-- 
wbr, Vitaly
[PATCH] fix unbalanced page refcounting in bio_map_user_iov
bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if the
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Signed-off-by: Vitaly Mayatskikh

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..10cd3b6bed27 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 		offset = offset_in_page(uaddr);
 		for (j = cur_page; j < page_limit; j++) {
 			unsigned int bytes = PAGE_SIZE - offset;
+			unsigned short prev_bi_vcnt = bio->bi_vcnt;
 
 			if (len <= 0)
 				break;
@@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 					    bytes)
 				break;
 
+			/*
+			 * check if vector was merged with previous
+			 * drop page reference if needed
+			 */
+			if (bio->bi_vcnt == prev_bi_vcnt)
+				put_page(pages[j]);
+
 			len -= bytes;
 			offset = 0;
 		}

-- 
wbr, Vitaly
Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
On Wed, Sep 20, 2017 at 08:20:58PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> > On 09/02/2017 09:17 AM, Ming Lei wrote:
> > > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	if (!list_empty(&rq_list)) {
> > >  		blk_mq_sched_mark_restart_hctx(hctx);
> > >  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > > -	} else if (!has_sched_dispatch) {
> > > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> > > +		/*
> > > +		 * If there is no per-request_queue depth, we
> > > +		 * flush all requests in this hw queue, otherwise
> > > +		 * pick up request one by one from sw queue for
> > > +		 * avoiding to mess up I/O merge when dispatch
> > > +		 * is busy, which can be triggered easily by
> > > +		 * per-request_queue queue depth
> > > +		 */
> > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > >  	}
> >
> > I don't like this part at all. It's a heuristic, and on top of that,
> > it's a heuristic based on a weak assumption that if q->queue_depth == 0,
> > then we never run into BUSY constraints. I think that would be stronger
> > if it was based on "is this using shared tags", but even that is not
> > great at all. It's perfectly possible to have a shared tags setup and
> > never run into resource constraints. The opposite condition is also true
> > - you can run without shared tags, and yet hit resource constraints
> > constantly. Hence this is NOT a good test for deciding whether to flush
> > everything at once or not. In short, I think a much better test case
> > would be "has this device ever returned BLK_STS_RESOURCE". As it so
> > happens, that's easy enough to keep track of and base this test on.
>
> Hi Jens,
>
> The attached patch follows your suggestion, and uses EWMA to
> compute the average length of hctx->dispatch, then only flushes
> all requests from ctxs if the average length is zero. What do
> you think about this approach?
Or other suggestions?

Hi Jens,

I am going to prepare V5, as suggested by Omar. Could you suggest which
way you prefer: keeping the check of q->queue_depth, the approach of
using the average length of hctx->dispatch, or something else?

-- 
Ming
[PATCH] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
blk-mq will rerun the queue via RESTART after one request is completed, so
it is not necessary to wait a random time before requeuing; we should
trust blk-mq to do it. More importantly, we need to return BLK_STS_RESOURCE
to blk-mq so that dequeuing from the I/O scheduler can be stopped, which
improves I/O merging.

Signed-off-by: Ming Lei
---
 drivers/md/dm-mpath.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 11f273d2f018..0902f7762306 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -504,8 +504,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 		if (queue_dying) {
 			atomic_inc(&m->pg_init_in_progress);
 			activate_or_offline_path(pgpath);
+			return DM_MAPIO_DELAY_REQUEUE;
 		}
-		return DM_MAPIO_DELAY_REQUEUE;
+
+		/*
+		 * blk-mq's SCHED_RESTART can cover this requeue, so
+		 * we needn't to deal with it by DELAY_REQUEUE. More
+		 * importantly, we have to return DM_MAPIO_REQUEUE
+		 * so that blk-mq can get the queue busy feedback,
+		 * otherwise I/O merge can be hurt.
+		 */
+		if (q->mq_ops)
+			return DM_MAPIO_REQUEUE;
+		else
+			return DM_MAPIO_DELAY_REQUEUE;
 	}
 
 	clone->bio = clone->biotail = NULL;
 	clone->rq_disk = bdev->bd_disk;
-- 
2.9.5
Re: nvme multipath support V2
Excuse me for a junior question: how do you obtain the nvme directory from
the nvme-mpath tree? I used git clone, but after it completed I found no
nvme directory.

[root@scst1 soft]# git clone git://git.infradead.org/users/hch/block.git nvme-mpath
Cloning into 'nvme-mpath'...
remote: Counting objects: 5623436, done.
remote: Compressing objects: 100% (922468/922468), done.
remote: Total 5623436 (delta 4757577), reused 5524606 (delta 4658819)
Receiving objects: 100% (5623436/5623436), 1.12 GiB | 3.17 MiB/s, done.
Resolving deltas: 100% (4757577/4757577), done.
Checking out files: 100% (50795/50795), done.
[root@scst1 drivers]# pwd
/u01/soft/nvme-mpath/drivers
[root@scst1 drivers]# ls nvme
ls: cannot access nvme: No such file or directory

Thanks

2017-09-21 22:50 GMT+08:00 Christoph Hellwig:
> On Thu, Sep 21, 2017 at 07:23:45AM +0200, Johannes Thumshirn wrote:
>> Ah OK, we maybe should update nvme-cli to recognize it as well then. I just
>> looked at the output of nvme list and obviously didn't find it.
>
> Overloading the new per-subsystem nodes into nvme list would be
> very confusing I think. We could add a new command to list subsystems
> instead of controllers, if just doing a ls in /dev or sysfs is too hard.
>
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Thu, Sep 21, 2017 at 11:56:33PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 07:53 +0800, Ming Lei wrote:
> > Then that is simply not enough since the issue is that request pool can
> > be used up easily when SCSI device is quiesced, then no request is left
> > for RQF_PREEMPT(PM), and I/O hang is triggered.
> >
> > It has been discussed for long time, I am sorry you still don't
> > understand it.
>
> I'm really sorry that you don't understand the intention of this patch
> series. Anyway, let's close the discussion here and continue once the next
> version of this patch series has been posted.

So you mean you don't want to fix Cathy's issue, which was reported from a
customer's production system and definitely belongs to the same kind of
SCSI quiesce issue?

-- 
Ming
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Fri, 2017-09-22 at 07:53 +0800, Ming Lei wrote:
> Then that is simply not enough since the issue is that request pool can
> be used up easily when SCSI device is quiesced, then no request is left
> for RQF_PREEMPT(PM), and I/O hang is triggered.
>
> It has been discussed for long time, I am sorry you still don't
> understand it.

I'm really sorry that you don't understand the intention of this patch
series. Anyway, let's close the discussion here and continue once the next
version of this patch series has been posted.

Bart.
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Thu, Sep 21, 2017 at 11:32:33PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 07:25 +0800, Ming Lei wrote:
> > On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> > > On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > > > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > > > +	} else {
> > > > >  		scsi_run_queue(q);
> > > > > +		while (atomic_read(&sdev->device_busy)) {
> > > > > +			msleep_interruptible(200);
> > > > > +			scsi_run_queue(q);
> > > > > +		}
> > > >
> > > > Are you sure only blk-mq need to drain queue? We need
> > > > to do that for block legacy too.
> > >
> > > The code above your comment drains the queue for the legacy block layer.
> >
> > That is just draining the requests dispatched to SCSI layer, and there
> > might be lots of requests in block I/O scheduler queue or requeue or
> > whatever.
>
> There is no requeue list for the legacy block layer. There is only a requeue
> list for blk-mq.
>
> Waiting for I/O requests that are in scheduler queues is not the purpose of
> scsi_quiesce_device(). The purpose of that function is to wait for requests
> that have already been started. The sdev->device_busy counter represents the
> number of started requests so waiting until that counter has reached zero is
> sufficient.

Then that is simply not enough, since the issue is that the request pool
can be used up easily while the SCSI device is quiesced; then no request is
left for RQF_PREEMPT (PM), and an I/O hang is triggered.

It has been discussed for a long time; I am sorry you still don't
understand it.

-- 
Ming
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Fri, 2017-09-22 at 07:25 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > > +	} else {
> > > >  		scsi_run_queue(q);
> > > > +		while (atomic_read(&sdev->device_busy)) {
> > > > +			msleep_interruptible(200);
> > > > +			scsi_run_queue(q);
> > > > +		}
> > >
> > > Are you sure only blk-mq need to drain queue? We need
> > > to do that for block legacy too.
> >
> > The code above your comment drains the queue for the legacy block layer.
>
> That is just draining the requests dispatched to SCSI layer, and there
> might be lots of requests in block I/O scheduler queue or requeue or
> whatever.

There is no requeue list for the legacy block layer. There is only a requeue
list for blk-mq.

Waiting for I/O requests that are in scheduler queues is not the purpose of
scsi_quiesce_device(). The purpose of that function is to wait for requests
that have already been started. The sdev->device_busy counter represents the
number of started requests so waiting until that counter has reached zero is
sufficient.

Bart.
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > +	} else {
> > >  		scsi_run_queue(q);
> > > +		while (atomic_read(&sdev->device_busy)) {
> > > +			msleep_interruptible(200);
> > > +			scsi_run_queue(q);
> > > +		}
> >
> > Are you sure only blk-mq need to drain queue? We need
> > to do that for block legacy too.
>
> The code above your comment drains the queue for the legacy block layer.

That is just draining the requests dispatched to the SCSI layer; there
might still be lots of requests in the block I/O scheduler queue, on a
requeue list, or elsewhere.

-- 
Ming
Re: [PATCH 6/9] nvme: track subsystems
On Mon, Sep 18, 2017 at 04:14:50PM -0700, Christoph Hellwig wrote:
> +static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> +{
> +	struct nvme_subsystem *subsys, *found;
> +
> +	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
> +	if (!subsys)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&subsys->ctrls);
> +	kref_init(&subsys->ref);
> +	nvme_init_subnqn(subsys, ctrl, id);
> +	mutex_init(&subsys->lock);
> +
> +	mutex_lock(&nvme_subsystems_lock);
> +	found = __nvme_find_get_subsystem(subsys->subnqn);
> +	if (found) {
> +		/*
> +		 * Verify that the subsystem actually supports multiple
> +		 * controllers, else bail out.
> +		 */
> +		kfree(subsys);
> +		if (!(id->cmic & (1 << 1))) {
> +			dev_err(ctrl->device,
> +				"ignoring ctrl due to duplicate subnqn (%s).\n",
> +				found->subnqn);
> +			mutex_unlock(&nvme_subsystems_lock);
> +			return -EINVAL;
> +		}
> +
> +		subsys = found;
> +	} else {
> +		list_add_tail(&subsys->entry, &nvme_subsystems);
> +	}
> +
> +	ctrl->subsys = subsys;
> +	mutex_unlock(&nvme_subsystems_lock);
> +
> +	mutex_lock(&subsys->lock);
> +	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> +	mutex_unlock(&subsys->lock);

This function is called every time nvme_init_identify is called, which
happens on every controller reset. The controller reset does not remove
the controller from the subsystem's list of controllers, so its entry is
added a second time after a controller reset.
Re: [PATCH v2 3/4] block, scsi: Make SCSI device suspend and resume work reliably
On Fri, 2017-09-22 at 06:04 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 02:22:54PM -0700, Bart Van Assche wrote:
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1429,11 +1429,18 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
> >  				gfp_t gfp_mask)
> >  {
> >  	struct request *req;
> > +	const bool may_sleep = gfp_mask & __GFP_DIRECT_RECLAIM;
> > +
> > +	if (unlikely(blk_queue_preempt_only(q) && !(op & REQ_PREEMPT))) {
>
> The flag is set with queue_lock, but checked without any lock; do you
> think it is safe in this way?
>
> Also this flag isn't checked in the normal I/O path, but you unfreeze
> the queue during scsi_device_quiesce(), so any normal I/O can come in
> from that time.

I will address both comments in the next version of this patch series.

Bart.
Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency
On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > +	} else {
> >  		scsi_run_queue(q);
> > +		while (atomic_read(&sdev->device_busy)) {
> > +			msleep_interruptible(200);
> > +			scsi_run_queue(q);
> > +		}
>
> Are you sure only blk-mq need to drain queue? We need
> to do that for block legacy too.

The code above your comment drains the queue for the legacy block layer.

Bart.
[PATCH v2 0/4] Make SCSI device suspend and resume work reliably
Hello Jens,

It is known that during the resume following a hibernate, sometimes the
system hangs instead of coming up properly. This patch series fixes that
problem. It is an alternative for Ming Lei's "[PATCH V5 0/10] block/scsi:
safe SCSI quiescing" patch series. The advantages of this patch series are:
- No new freeze states and hence no new freeze-state variables.
- Easier to review because no new race conditions are introduced between
  queue freezing and blk_cleanup_queue(). As the discussion that followed
  Ming's patch series shows, the correctness of the new code is hard to
  verify.

These patches have been tested on top of a merge of the block layer
for-next branch and Linus' master tree. Linus' master tree includes patch
"KVM: x86: Fix the NULL pointer parameter in check_cr_write()" but the
block layer for-next branch does not yet. Please consider these changes
for kernel v4.15.

Thanks,

Bart.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.

Bart Van Assche (4):
  block: Convert RQF_PREEMPT into REQ_PREEMPT
  block: Add the QUEUE_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI device suspend and resume work reliably
  scsi-mq: Reduce suspend latency

 block/blk-core.c          | 26 +++---
 block/blk-mq-debugfs.c    |  2 +-
 drivers/ide/ide-atapi.c   |  3 +--
 drivers/ide/ide-io.c      |  2 +-
 drivers/ide/ide-pm.c      |  4 ++--
 drivers/scsi/scsi_lib.c   | 31 +--
 include/linux/blk_types.h |  6 ++
 include/linux/blkdev.h    |  8 +---
 8 files changed, 60 insertions(+), 22 deletions(-)

-- 
2.14.1
[PATCH v2 1/4] block: Convert RQF_PREEMPT into REQ_PREEMPT
This patch does not change any functionality but makes the REQ_PREEMPT
flag available to blk_get_request(). A later patch will add code to
blk_get_request() that checks the REQ_PREEMPT flag.

Note: the IDE sense_rq request is allocated statically, so there is no
blk_get_request() call that corresponds to this request.

Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-mq-debugfs.c    |  2 +-
 drivers/ide/ide-atapi.c   |  3 +--
 drivers/ide/ide-io.c      |  2 +-
 drivers/ide/ide-pm.c      |  4 ++--
 drivers/scsi/scsi_lib.c   | 11 ++-
 include/linux/blk_types.h |  6 ++
 include/linux/blkdev.h    |  3 ---
 7 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..62ac248a4984 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -266,6 +266,7 @@ static const char *const cmd_flag_name[] = {
 	CMD_FLAG_NAME(BACKGROUND),
 	CMD_FLAG_NAME(NOUNMAP),
 	CMD_FLAG_NAME(NOWAIT),
+	CMD_FLAG_NAME(PREEMPT),
 };
 #undef CMD_FLAG_NAME
@@ -279,7 +280,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(MIXED_MERGE),
 	RQF_NAME(MQ_INFLIGHT),
 	RQF_NAME(DONTPREP),
-	RQF_NAME(PREEMPT),
 	RQF_NAME(COPY_USER),
 	RQF_NAME(FAILED),
 	RQF_NAME(QUIET),
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 14d1e7d9a1d6..1258739d5fa1 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -211,9 +211,8 @@ void ide_prep_sense(ide_drive_t *drive, struct request *rq)
 	}
 	sense_rq->rq_disk = rq->rq_disk;
-	sense_rq->cmd_flags = REQ_OP_DRV_IN;
+	sense_rq->cmd_flags = REQ_OP_DRV_IN | REQ_PREEMPT;
 	ide_req(sense_rq)->type = ATA_PRIV_SENSE;
-	sense_rq->rq_flags |= RQF_PREEMPT;
 
 	req->cmd[0] = GPCMD_REQUEST_SENSE;
 	req->cmd[4] = cmd_len;
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 3a234701d92c..06ffccd0fb10 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -539,7 +539,7 @@ void do_ide_request(struct request_queue *q)
 	 */
 	if ((drive->dev_flags & IDE_DFLAG_BLOCKED) &&
 	    ata_pm_request(rq) == 0 &&
-	    (rq->rq_flags & RQF_PREEMPT) == 0) {
+	    (rq->cmd_flags & REQ_PREEMPT) == 0) {
 		/* there should be no pending command at this point */
 		ide_unlock_port(hwif);
 		goto plug_device;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f8d2709dcd56 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN | REQ_PREEMPT,
+			     __GFP_RECLAIM);
 	ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-	rq->rq_flags |= RQF_PREEMPT;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_RESUME;
 	rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6db8247577a0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -245,8 +245,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	int ret = DRIVER_ERROR << 24;
 
 	req = blk_get_request(sdev->request_queue,
-			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			(data_direction == DMA_TO_DEVICE ?
+			 REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN) | REQ_PREEMPT,
+			__GFP_RECLAIM);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -260,7 +261,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->retries = retries;
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
-	req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+	req->rq_flags |= rq_flags | RQF_QUIET;
 
 	/*
 	 * head injection *required* here otherwise quiesce won't work
@@ -1271,7 +1272,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			/*
 			 * If the devices is blocked we defer normal commands.
 			 */
-			if (!(req->cmd_flags & RQF_PREEMPT))
+			if (!(req->cmd_flags & REQ_PREEMPT))
 				ret = BLKPREP_DEFER;
 			break;
 		default:
@@ -1280,7 +1281,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
[PATCH v2 2/4] block: Add the QUEUE_PREEMPT_ONLY request queue flag
This flag will be used in the next patch to let the block layer core know
whether or not a SCSI request queue has been quiesced. A quiesced SCSI
queue namely only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 13 +
 include/linux/blkdev.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index d709c0e3a2ac..1ac337712bbd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,19 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (preempt_only)
+		queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL(blk_set_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cd26901a6e25..5bd87599eed0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -627,6 +627,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY 29	/* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -731,6 +732,10 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 			     REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)				\
+	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.1
[PATCH v2 4/4] scsi-mq: Reduce suspend latency
Avoid that it can take up to 200 ms too long to wait for ongoing requests
to finish. Note: blk_mq_freeze_queue() uses a wait queue to wait for
ongoing requests to finish.

Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/scsi/scsi_lib.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e76fd6e89a81..34e5f0f95d01 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2901,10 +2901,15 @@ scsi_device_quiesce(struct scsi_device *sdev)
 	if (err)
 		return err;
 
-	scsi_run_queue(q);
-	while (atomic_read(&sdev->device_busy)) {
-		msleep_interruptible(200);
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_unfreeze_queue(q);
+	} else {
 		scsi_run_queue(q);
+		while (atomic_read(&sdev->device_busy)) {
+			msleep_interruptible(200);
+			scsi_run_queue(q);
+		}
 	}
 	return 0;
 }
-- 
2.14.1
[PATCH v2 3/4] block, scsi: Make SCSI device suspend and resume work reliably
It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why, while the
hibernation image is being written or restored, SCSI devices are quiesced.
The SCSI core quiesces devices through scsi_device_quiesce() and
scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt
requests is deferred. This is realized by returning BLKPREP_DEFER from
inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a
full queue prevents power management requests from being submitted by
slowing down allocation of non-preempt requests for devices in the
quiesced state.

This patch has been tested by running the following commands and by
verifying that after resume the fio job is still running:

for d in /sys/class/block/sd*[a-z]; do
	hcil=$(readlink "$d/device")
	hcil=${hcil#../../../}
	echo 4 > "$d"
	echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
    --ioengine=psync --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
    --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko
References: "I/O hangs after resuming from suspend-to-ram"
(https://marc.info/?l=linux-block=150340235201348).
Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c        | 13 ++---
 drivers/scsi/scsi_lib.c | 11 ---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1ac337712bbd..6a190dd998aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1429,11 +1429,18 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
 				gfp_t gfp_mask)
 {
 	struct request *req;
+	const bool may_sleep = gfp_mask & __GFP_DIRECT_RECLAIM;
+
+	if (unlikely(blk_queue_preempt_only(q) && !(op & REQ_PREEMPT))) {
+		if (may_sleep)
+			msleep(100);
+		else
+			return ERR_PTR(-EBUSY);
+	}
 
 	if (q->mq_ops) {
-		req = blk_mq_alloc_request(q, op,
-			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-			0 : BLK_MQ_REQ_NOWAIT);
+		req = blk_mq_alloc_request(q, op, may_sleep ?
+			0 : BLK_MQ_REQ_NOWAIT);
 		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
 			q->mq_ops->initialize_rq_fn(req);
 	} else {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6db8247577a0..e76fd6e89a81 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2889,19 +2889,22 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
 int
 scsi_device_quiesce(struct scsi_device *sdev)
 {
+	struct request_queue *q = sdev->request_queue;
 	int err;
 
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
+	if (err == 0)
+		blk_set_preempt_only(q, true);
 	mutex_unlock(&sdev->state_mutex);
 
 	if (err)
 		return err;
 
-	scsi_run_queue(sdev->request_queue);
+	scsi_run_queue(q);
 	while (atomic_read(&sdev->device_busy)) {
 		msleep_interruptible(200);
-		scsi_run_queue(sdev->request_queue);
+		scsi_run_queue(q);
 	}
 	return 0;
 }
@@ -2924,8 +2927,10 @@ void scsi_device_resume(struct scsi_device *sdev)
 	 */
 	mutex_lock(&sdev->state_mutex);
 	if (sdev->sdev_state == SDEV_QUIESCE &&
-	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
+	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
+		blk_set_preempt_only(sdev->request_queue, false);
 		scsi_run_queue(sdev->request_queue);
+	}
 	mutex_unlock(&sdev->state_mutex);
 }
 EXPORT_SYMBOL(scsi_device_resume);
-- 
2.14.1
Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems
On Thu, Sep 21, 2017 at 01:52:50AM +0200, Christoph Hellwig wrote:
> I noticed the odd renaming in sysfs and thought about getting rid
> of the /dev/nvme/ directory. I just need to come up with a good
> name for the device nodes - the name can't contain /dev/nvme* as
> nvme-cli would break if it sees files that start with nvme in /dev/.

Ah, didn't know that. I'll fix nvme-cli to handle this. We can still use a
different naming convention for the multipath handles.

> /dev/nvmN ?

Looks safe. BTW, have you considered persistent naming rules to symlink
these from /dev/disk/by-id/? We may need to add an attribute to the
multipath object to assist with that.
[PATCH] bcache: fix a comments typo in bch_alloc_sectors()
Code comments in alloc.c:bch_alloc_sectors() mention a function name
find_data_bucket(); the correct function name is pick_data_bucket().
bch_alloc_sectors() is a quite important function in the bcache allocation
code, so fixing the typo may help other people avoid confusion.

Signed-off-by: Coly Li
---
 drivers/md/bcache/alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index cacbe2dbd5c3..071ff28be912 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -600,7 +600,7 @@ bool bch_alloc_sectors(struct cache_set *c, struct bkey *k, unsigned sectors,
 
 	/*
 	 * If we had to allocate, we might race and not need to allocate the
-	 * second time we call find_data_bucket(). If we allocated a bucket but
+	 * second time we call pick_data_bucket(). If we allocated a bucket but
 	 * didn't use it, drop the refcount bch_bucket_alloc_set() took:
 	 */
 	if (KEY_PTRS())
-- 
2.13.5
[PATCHv3] bcache: only permit to recovery read error when cache device is clean
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on the cache device fails, bcache will try to recover
the request by reading from the cached device. If the data on the cached
device is not synced with the cache device, the requester will get stale
data.

For a critical storage system like a database, serving stale data from
recovery may result in application-level data corruption, which is
unacceptable.

With this patch, for a failed read request in writeback or writethrough
mode, recovery of a recoverable read request only happens when the cache
device is clean, that is to say, when all data on the cached device is up
to date. For other cache modes in bcache, read requests never hit
cached_dev_read_error(), so they don't need this patch.

Please note, because the cache mode can be switched arbitrarily at run
time, a writethrough mode might have been switched from writeback mode.
Therefore checking dc->has_dirty in writethrough mode still makes sense.

Changelog:
v3: In response to Kent Overstreet, who considers recovering stale data a
    bug to fix and an option to permit it unnecessary, this version removes
    the sysfs file.
v2: renamed the sysfs entry to allow_stale_data_on_failure, and fixed the
    confusing commit log.
v1: initial patch posted.
Signed-off-by: Coly Li
Reported-by: Arne Wolf
Cc: Kent Overstreet
Cc: Michael Lyle
Cc: Nix
Cc: Kai Krakow
Cc: Eric Wheeler
Cc: Junhui Tang
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/request.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 681b4f12b05a..e7f769ff7234 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -697,8 +697,16 @@ static void cached_dev_read_error(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-	if (s->recoverable) {
+	/*
+	 * If cache device is dirty (dc->has_dirty is non-zero), then
+	 * recovery a failed read request from cached device may get a
+	 * stale data back. So read failure recovery is only permitted
+	 * when cache device is clean.
+	 */
+	if (s->recoverable &&
+	    (dc && !atomic_read(&dc->has_dirty))) {
 		/* Retry from the backing device: */
 		trace_bcache_read_retry(s->orig_bio);
-- 
2.13.5
Re: [PATCH] block: fix a crash caused by wrong API
On 09/21/2017 01:17 PM, Shaohua Li wrote: > part_stat_show takes a part device not a disk, so we should use > part_to_disk. Oops, thanks Shaohua! Applied. -- Jens Axboe
[PATCH] block: fix a crash caused by wrong API
part_stat_show takes a part device, not a disk, so we should use part_to_disk. Fixes: d62e26b3ffd2 ("block: pass in queue to inflight accounting") Cc: Bart Van Assche Cc: Omar Sandoval Signed-off-by: Shaohua Li --- block/partition-generic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/partition-generic.c b/block/partition-generic.c index 86e8fe1..88c555d 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -112,7 +112,7 @@ ssize_t part_stat_show(struct device *dev, struct device_attribute *attr, char *buf) { struct hd_struct *p = dev_to_part(dev); - struct request_queue *q = dev_to_disk(dev)->queue; + struct request_queue *q = part_to_disk(p)->queue; unsigned int inflight[2]; int cpu; -- 2.9.5
Re: Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?
On Tue, Sep 19, 2017 at 02:16:26PM -0400, Douglas Gilbert wrote: > On 2017-09-19 10:56 AM, Benjamin Block wrote: > > Hello linux-block, > > > > I wrote some tests recently to test patches against bsg.c and bsg-lib.c, > > and while writing those I noticed something strange: > > > > When you use the write() and read() call on multiple file-descriptors > > for a single bsg-device (FC or SCSI), it is possible that you get > > cross-talk between those different file-descriptors. > > > > E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0 > > and you send commands via write() in both processes, when they both do > > read() later on - to read the response for the commands they sent before > > -, it is possible that process A gets the response (Sense-Data, > > FC-Protocol-Data, ...) for a command that process B wrote and vice versa. > > > > I noticed this because in my tests I 'tag' each command I send with > > write() via a value I write into the usr_ptr field of sg_io_v4. When I > > later use read() to receive responses I check for this tag in a > > hash-table and so can look up the original command. When I used this and > > spawned multiple processes with the same target bsg-device, I got > > responses for commands I don't have tags for in my hash-table, so they > > were written by another process. This never happened when I only spawned > > one test-process. > > > > This seems awfully counterproductive - so much that I don't see any > > point in allowing users to open more than one FD per bsg-device, because > > that would inevitably lead to very hard to debug 'bugs'. And I wonder if > > this is really by design or some regression that happened over time. > > > > I looked at the code in bsg.c and yes, it seems this is what is coded > > right now. We have a hash-table which manages all 'struct bsg_device's, > > which are indexed by device-minors, so one hash-table entry per > > device-node. > > > > So eh, long talk short question, is that intended? 
> > Hi, > About once a year I point out that major misfeature in the bsg driver > and no-one seems to care. Most recently I pointed it out in a > discussion about SCSI SG CVE-2017-0794 6 days ago: > " ... Last time I looked at the bsg driver, a SCSI command > could be issued on one file descriptor and its data-in block > and SCSI status delivered to another file descriptor to the > same block device (potentially in another process). IOW chaos" > > It is hard to imagine any sane application relying on this bizarre > behaviour so fixing it should not break any applications. My guess > is that it would require a non-trivial patch set to fix. Would you > like to volunteer? > Interesting. So this is not a regression then. Well, personally I am intrigued to try to fix this, but professionally it is not really up to me to choose to work on this. Especially because, as you say, this doesn't look trivial - at least if you want to let multiple applications open a FD for a single BSG device node. Let's see. But thank you for elaborating on this! Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[PATCH 1/1] bsg-lib: fix use-after-free under memory-pressure
When under memory-pressure it is possible that the mempool which backs the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count emergency buffers - in case it can't get a regular allocation. These buffers are preallocated, and once they are also used, they are re-supplied with old finished requests from the same request_queue (see mempool_free()). The bug is that when re-supplying the emergency pool, the old requests are not run through the callback mempool_t->alloc() again, and thus also not through the callback bsg_init_rq(). Thus we skip initialization, and while the sense-buffer should still be good, scsi_request->cmd might have become an invalid pointer in the meantime. When the request is initialized in bsg.c and the user's CDB is larger than BLK_MAX_CDB, bsg will replace it with a custom allocated buffer, which is freed when the user's command finishes - thus it dangles afterwards. When the user next sends a command with a CDB smaller than or equal to BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by scsi_request->__cmd, will not make a custom allocation, and write into undefined memory. Fix this by splitting bsg_init_rq() into two functions: - bsg_init_job() directly replaces bsg_init_rq() and only does the allocation of the sense-buffer, which is used to back the bsg job's reply buffer. This pointer should never change during the lifetime of a scsi_request, so it doesn't need re-initialization. - bsg_init_rq() is a new function that makes use of 'struct request_queue's initialize_rq_fn callback (which was introduced in v4.12). This is always called before the request is given out via blk_get_request(). This function does the remaining initialization that was previously done in bsg_init_rq(), and will also do it when the request is taken from the emergency pool of the backing mempool. Also rename bsg_exit_rq() to bsg_exit_job(), to make it fit the naming scheme. 
Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer") Cc:# 4.11+ Signed-off-by: Benjamin Block --- Notes: I did test this on zFCP with FC CT commands sent via the ioctl() and write() system calls. That did work fine. But I would very much appreciate it if anyone could run this against another HBA or even another implementer of bsg-lib, such as SAS, because I have no access to such hardware here. This should make no difference in the normal cases - where each request is allocated via slab - with or without this patch, if I didn't miss anything. Only the order is a bit mixed up - the memset is done after the sense-allocation, so I have to buffer the sense-pointer for that. But otherwise there is no difference I am aware of, so it should behave the same (does for me). I could not reproduce the memory-pressure case here in the lab. I don't see any reason why it shouldn't work now, but I am open to suggestions :) Beste Grüße / Best regards, - Benjamin Block block/bsg-lib.c | 41 - 1 file changed, 28 insertions(+), 13 deletions(-) diff --git a/block/bsg-lib.c b/block/bsg-lib.c index c82408c7cc3c..634d1557da38 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -203,28 +203,42 @@ static void bsg_request_fn(struct request_queue *q) spin_lock_irq(q->queue_lock); } -static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp) +static int bsg_init_job(struct request_queue *q, struct request *req, gfp_t gfp) { struct bsg_job *job = blk_mq_rq_to_pdu(req); struct scsi_request *sreq = &job->sreq; - memset(job, 0, sizeof(*job)); + /* called right after the request is allocated for the request_queue */ - scsi_req_init(sreq); - sreq->sense_len = SCSI_SENSE_BUFFERSIZE; - sreq->sense = kzalloc(sreq->sense_len, gfp); + sreq->sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp); if (!sreq->sense) return -ENOMEM; - job->req = req; - job->reply = sreq->sense; - job->reply_len = sreq->sense_len; - job->dd_data = job + 1; - return 0; } 
-static void bsg_exit_rq(struct request_queue *q, struct request *req) +static void bsg_init_rq(struct request *req) +{ + struct bsg_job *job = blk_mq_rq_to_pdu(req); + struct scsi_request *sreq = &job->sreq; + void *sense = sreq->sense; + + /* called right before the request is given to the request_queue user */ + + memset(job, 0, sizeof(*job)); + + scsi_req_init(sreq); + + sreq->sense = sense; + sreq->sense_len = SCSI_SENSE_BUFFERSIZE; + + job->req = req; + job->reply = sense; + job->reply_len = sreq->sense_len; + job->dd_data = job + 1; +} + +static void
Re: [PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs
On Thu, Sep 21, 2017 at 10:37:48AM -0600, Jens Axboe wrote: > On 09/21/2017 09:17 AM, weiping zhang wrote: > > if blk-mq use "none" io scheduler, nr_request get a wrong value when > > input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get > > the smaller one min(nr, set->queue_depth), and then q->nr_request get a > > wrong value. > > > > Reproduce: > > > > echo none > /sys/block/nvme0n1/queue/ioscheduler > > echo 100 > /sys/block/nvme0n1/queue/nr_requests > > cat /sys/block/nvme0n1/queue/nr_requests > > 100 > > > > Signed-off-by: weiping zhang > > --- > > block/blk-mq.c | 8 ++-- > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 98a1860..479c35a 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue > > *q, unsigned int nr) > > * queue depth. This is similar to what the old code would do. > > */ > > if (!hctx->sched_tags) { > > - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, > > - min(nr, > > set->queue_depth), > > + if (nr > set->queue_depth) { > > + ret = -EINVAL; > > + break; > > + } > > + > > + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, > > false); > > What am I missing here? blk_mq_tag_update_depth() should already return > -EINVAL for the case where we can't grow the tags. Looks like this patch > should simply remove the min(nr, set->queue_depth) and just pass in 'nr'. > Should not need the duplicated check for depth. > Ya, you are right, I will send V3 later. Thanks
[PATCH v3] blk-mq: fix nr_requests wrong value when modify it from sysfs
If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value when a number > tag_set->queue_depth is written: blk_mq_tag_update_depth will take the smaller min(nr, set->queue_depth), and then q->nr_requests gets a wrong value. Reproduce: echo none > /sys/block/nvme0n1/queue/ioscheduler echo 100 > /sys/block/nvme0n1/queue/nr_requests cat /sys/block/nvme0n1/queue/nr_requests 100 Signed-off-by: weiping zhang --- block/blk-mq.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 98a1860..491e336 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2642,8 +2642,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) * queue depth. This is similar to what the old code would do. */ if (!hctx->sched_tags) { - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, - min(nr, set->queue_depth), + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, false); } else { ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags, -- 2.9.4
[PATCH 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For the BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same, thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig Cc: "Martin K. Petersen" Cc: Hannes Reinecke Signed-off-by: Ilya Dryomov --- block/blk-lib.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 6b97feb71065..1cb402beb983 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -316,12 +316,6 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should call - * retry the call to blkdev_issue_zeroout() and the fallback path will be used. 
- * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -374,6 +368,27 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, &bio, flags); if (ret == 0 && bio) { ret = submit_bio_wait(bio); + /* +* Fall back to a manual zeroout on any error, if allowed. +* +* Particularly, WRITE ZEROES may fail with -EREMOTEIO if the +* driver signals zeroing offload support, but the device +* fails to process the command (for some devices there is no +* non-destructive way to verify whether this operation is +* actually supported). +*/ + if (ret && bio_op(bio) == REQ_OP_WRITE_ZEROES) { + if (flags & BLKDEV_ZERO_NOFALLBACK) { + if (!bdev_write_zeroes_sectors(bdev)) + ret = -EOPNOTSUPP; + } else { + bio_put(bio); + bio = NULL; + __blkdev_issue_zero_pages(bdev, sector, + nr_sects, gfp_mask, &bio); + ret = submit_bio_wait(bio); + } + } bio_put(bio); } blk_finish_plug(&plug); -- 2.4.3
[PATCH 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag; Jens can add it back if needed. Previous patch and discussion at https://marc.info/?l=linux-block&m=150471953327942&w=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 85 +++-- 1 file changed, 53 insertions(+), 32 deletions(-) -- 2.4.3
[PATCH 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov--- block/blk-lib.c | 58 +++-- 1 file changed, 32 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..6b97feb71065 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,35 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static void __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +333,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +342,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, 
nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, biop); + return 0; } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
Re: [PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
On Thu, Sep 21, 2017 at 10:38:22AM -0600, Jens Axboe wrote: > On 09/21/2017 09:17 AM, weiping zhang wrote: > > Signed-off-by: weiping zhang > > --- > > block/blk-sysfs.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > index b8362c0..03a6e19 100644 > > --- a/block/blk-sysfs.c > > +++ b/block/blk-sysfs.c > > @@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char > > *page, size_t count) > > return ret; > > > > if (nr < BLKDEV_MIN_RQ) > > - nr = BLKDEV_MIN_RQ; > > + return -EINVAL; > > This is potentially breaking existing scripts. > I just wanted to keep the same behavior as in the too-large case. If this may cause other side effects, I'll drop this patch. Thanks weiping
Re: [PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
On 09/21/2017 09:17 AM, weiping zhang wrote: > Signed-off-by: weiping zhang> --- > block/blk-sysfs.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index b8362c0..03a6e19 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char > *page, size_t count) > return ret; > > if (nr < BLKDEV_MIN_RQ) > - nr = BLKDEV_MIN_RQ; > + return -EINVAL; This is potentially breaking existing scripts. -- Jens Axboe
Re: [PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs
On 09/21/2017 09:17 AM, weiping zhang wrote: > if blk-mq use "none" io scheduler, nr_request get a wrong value when > input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get > the smaller one min(nr, set->queue_depth), and then q->nr_request get a > wrong value. > > Reproduce: > > echo none > /sys/block/nvme0n1/queue/ioscheduler > echo 100 > /sys/block/nvme0n1/queue/nr_requests > cat /sys/block/nvme0n1/queue/nr_requests > 100 > > Signed-off-by: weiping zhang > --- > block/blk-mq.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 98a1860..479c35a 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, > unsigned int nr) >* queue depth. This is similar to what the old code would do. >*/ > if (!hctx->sched_tags) { > - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, > - min(nr, > set->queue_depth), > + if (nr > set->queue_depth) { > + ret = -EINVAL; > + break; > + } > + > + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, > false); What am I missing here? blk_mq_tag_update_depth() should already return -EINVAL for the case where we can't grow the tags. Looks like this patch should simply remove the min(nr, set->queue_depth) and just pass in 'nr'. Should not need the duplicated check for depth. -- Jens Axboe
[PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
Signed-off-by: weiping zhang --- block/blk-sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index b8362c0..03a6e19 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count) return ret; if (nr < BLKDEV_MIN_RQ) - nr = BLKDEV_MIN_RQ; + return -EINVAL; if (q->request_fn) err = blk_update_nr_requests(q, nr); -- 2.9.4
[PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs
If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value when a number > tag_set->queue_depth is written: blk_mq_tag_update_depth will take the smaller min(nr, set->queue_depth), and then q->nr_requests gets a wrong value. Reproduce: echo none > /sys/block/nvme0n1/queue/ioscheduler echo 100 > /sys/block/nvme0n1/queue/nr_requests cat /sys/block/nvme0n1/queue/nr_requests 100 Signed-off-by: weiping zhang --- block/blk-mq.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 98a1860..479c35a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) * queue depth. This is similar to what the old code would do. */ if (!hctx->sched_tags) { - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, - min(nr, set->queue_depth), + if (nr > set->queue_depth) { + ret = -EINVAL; + break; + } + + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, false); } else { ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags, -- 2.9.4
[PATCH v2 0/2] fix wrong value when user modify nr_request by sysfs
Hi Jens, This is v2 of fixing nr_requests. v1 -> v2: blk-mq: fix nr_requests wrong value when modify it from sysfs this patch returns -EINVAL when the user writes a value that's too large. blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ To keep the same behavior as the former patch, also return -EINVAL when the user writes a value less than BLKDEV_MIN_RQ weiping zhang (2): blk-mq: fix nr_requests wrong value when modify it from sysfs blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ block/blk-mq.c | 8 ++-- block/blk-sysfs.c | 2 +- 2 files changed, 7 insertions(+), 3 deletions(-) -- 2.9.4
Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs
On Thu, Sep 21, 2017 at 08:09:47AM -0600, Jens Axboe wrote: > On 09/21/2017 07:03 AM, weiping zhang wrote: > > On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote: > >> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote: > >>> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote: > On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote: > > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote: > >> if blk-mq use "none" io scheduler, nr_request get a wrong value when > >> input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get > >> the smaller one min(nr, set->queue_depth), and then q->nr_request get a > >> wrong value. > >> > >> Reproduce: > >> > >> echo none > /sys/block/nvme0n1/queue/ioscheduler > >> echo 100 > /sys/block/nvme0n1/queue/nr_requests > >> cat /sys/block/nvme0n1/queue/nr_requests > >> 100 > >> > >> Signed-off-by: weiping zhang > >> --- > >> block/blk-mq.c | 7 +-- > >> 1 file changed, 5 insertions(+), 2 deletions(-) > >> > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index f84d145..8303e5e 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct > >> request_queue *q, unsigned int nr) > >> * queue depth. This is similar to what the old code > >> would do. > >> */ > >>if (!hctx->sched_tags) { > >> - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, > >> - min(nr, > >> set->queue_depth), > >> + if (nr > set->queue_depth) { > >> + nr = set->queue_depth; > >> + pr_warn("reduce nr_request to %u\n", > >> nr); > >> + } > >> + ret = blk_mq_tag_update_depth(hctx, > >> &hctx->tags, nr, > >>false); > >>} else { > >>ret = blk_mq_tag_update_depth(hctx, > >> &hctx->sched_tags, > > > > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That > > will help to > > keep user space code simple that updates the queue depth. 
> > Hi Bart, > > The reason why not return -EINVAL is keeping aligned with the minimum checking > in queue_requests_store, > if you insist on returning -EINVAL/-ERANGE, the minimum checking should also keep > the same behavior. Both returning an error message and quietly changing are okay > for me. Which way do you prefer ? > > static ssize_t > queue_requests_store(struct request_queue *q, const char *page, size_t > count) > { > unsigned long nr; > int ret, err; > > if (!q->request_fn && !q->mq_ops) > return -EINVAL; > > ret = queue_var_store(&nr, page, count); > if (ret < 0) > return ret; > > if (nr < BLKDEV_MIN_RQ) > nr = BLKDEV_MIN_RQ; > >>> > >>> Hello Jens, > >>> > >>> Do you perhaps have a preference for one of the approaches that have been > >>> discussed > >>> in this e-mail thread? > >>> > >>> Thanks, > >>> > >>> Bart. > >> > > Hello Jens, > > > > Would you please give some comments about this patch, > > If someone writes a value that's too large, return -EINVAL and > don't set it. Don't add weird debug printks. > > OK, I send patch V2 soon.
Re: [PATCH 8/9] nvme: track shared namespaces
On Thu, Sep 21, 2017 at 04:37:48PM +0200, Christoph Hellwig wrote: > On Thu, Sep 21, 2017 at 07:22:17AM +0200, Johannes Thumshirn wrote: > > > But head also has connotations in the SAN world. Maybe nvme_ns_chain? > > > > I know that's why I didn't really like it all too much in the first place as > > well. For nvme_ns_chain, it's not a chain really (the list itself is a > > chain, > > the structure really is the list head...), but I suck at naming things so. > > Well, it _is_ the structure for the namespace, and that's the fundamental > problem here given that we use that name for something else at the > moment. > > We could have nvme_namespace and nvme_ns, but I'm not sure that this > helps clarity. If there weren't resistance to renaming structs, it would be more aligned to how the specification calls these if we rename nvme_ns to nvme_ns_path, and what you're calling nvme_ns_head should just be the nvme_ns.
Re: [PATCH] block: drop "sending ioctl to a partition" message
On 21/09/2017 16:53, Christoph Hellwig wrote: > This looks ok to me, but do we even need to keep the special > cases above? Is there anything relying on the safe but not very > useful ioctls? No idea, I stuck to the usual "don't break userspace" rule. Honestly I doubt anything is using most of those ioctls _in general_, not just on a partition. Paolo > Condensing the thing down to: > > int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd) > { > if (bd && bd == bd->bd_contains) > return 0; > if (capable(CAP_SYS_RAWIO)) > return 0; > return -ENOIOCTLCMD; > } > > would certainly be nice.
Re: [PATCH] block: drop "sending ioctl to a partition" message
This looks ok to me, but do we even need to keep the special cases above? Is there anything relying on the safe but not very useful ioctls? Condensing the thing down to: int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd) { if (bd && bd == bd->bd_contains) return 0; if (capable(CAP_SYS_RAWIO)) return 0; return -ENOIOCTLCMD; } would certainly be nice.
Re: nvme multipath support V2
On Thu, Sep 21, 2017 at 07:23:45AM +0200, Johannes Thumshirn wrote: > Ah OK, we maybe should update nvme-cli to recognize it as well then. I just > looked at the output of nvme list and obviously didn't find it. Overloading the new per-subsystem nodes into nvme list would be very confusing I think. We could add a new command to list subsystems instead of controllers, if just doing a ls in /dev or sysfs is too hard.
[PATCH] block: drop "sending ioctl to a partition" message
After the first few months, the message has not led to many bug reports. It's been almost five years now, and in practice the main source of it seems to be MTIOCGET that someone is using to detect tape devices. While we could whitelist it just like CDROM_GET_CAPABILITY, this patch just removes the message altogether. Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 18 ++ 1 file changed, 2 insertions(+), 16 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 7440de44dd85..eafcd67e2480 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -707,24 +707,10 @@ int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd) case SG_SET_RESERVED_SIZE: case SG_EMULATED_HOST: return 0; - case CDROM_GET_CAPABILITY: - /* Keep this until we remove the printk below. udev sends it -* and we do not want to spam dmesg about it. CD-ROMs do -* not have partitions, so we get here only for disks. -*/ - return -ENOIOCTLCMD; default: - break; + /* In particular, rule out all resets and host-specific ioctls. */ + return capable(CAP_SYS_RAWIO) ? 0 : -ENOIOCTLCMD; } - - if (capable(CAP_SYS_RAWIO)) - return 0; - - /* In particular, rule out all resets and host-specific ioctls. */ - printk_ratelimited(KERN_WARNING - "%s: sending ioctl %x to a partition!\n", current->comm, cmd); - - return -ENOIOCTLCMD; } EXPORT_SYMBOL(scsi_verify_blk_ioctl); -- 1.8.3.1
Re: [PATCH 8/9] nvme: track shared namespaces
On Thu, Sep 21, 2017 at 07:22:17AM +0200, Johannes Thumshirn wrote: > > But head also has connotations in the SAN world. Maybe nvme_ns_chain? > > I know that's why I didn't really like it all too much in the first place as > well. For nvme_ns_chain, it's not a chain really (the list itself is a chain, > the structure really is the list head...), but I suck at naming things so. Well, it _is_ the structure for the namespace, and that's the fundamental problem here given that we use that name for something else at the moment. We could have nvme_namespace and nvme_ns, but I'm not sure that this helps clarity.
Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs
On 09/21/2017 07:03 AM, weiping zhang wrote: > On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote: >> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote: >>> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote: On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote: > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote: >> if blk-mq use "none" io scheduler, nr_request get a wrong value when >> input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get >> the smaller one min(nr, set->queue_depth), and then q->nr_request get a >> wrong value. >> >> Reproduce: >> >> echo none > /sys/block/nvme0n1/queue/ioscheduler >> echo 100 > /sys/block/nvme0n1/queue/nr_requests >> cat /sys/block/nvme0n1/queue/nr_requests >> 100 >> >> Signed-off-by: weiping zhang >> --- >> block/blk-mq.c | 7 +-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index f84d145..8303e5e 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct >> request_queue *q, unsigned int nr) >> * queue depth. This is similar to what the old code >> would do. >> */ >> if (!hctx->sched_tags) { >> -ret = blk_mq_tag_update_depth(hctx, &hctx->tags, >> -min(nr, >> set->queue_depth), >> +if (nr > set->queue_depth) { >> +nr = set->queue_depth; >> +pr_warn("reduce nr_request to %u\n", >> nr); >> +} >> +ret = blk_mq_tag_update_depth(hctx, >> &hctx->tags, nr, >> false); >> } else { >> ret = blk_mq_tag_update_depth(hctx, >> &hctx->sched_tags, > > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That > will help to > keep user space code simple that updates the queue depth. Hi Bart, The reason why not return -EINVAL is keeping aligned with the minimum checking in queue_requests_store, if you insist on returning -EINVAL/-ERANGE, the minimum checking should also keep the same behavior. Both returning an error message and quietly changing are okay for me. Which way do you prefer ? 
static ssize_t queue_requests_store(struct request_queue *q, const char *page, size_t count) { unsigned long nr; int ret, err; if (!q->request_fn && !q->mq_ops) return -EINVAL; ret = queue_var_store(, page, count); if (ret < 0) return ret; if (nr < BLKDEV_MIN_RQ) nr = BLKDEV_MIN_RQ; >>> >>> Hello Jens, >>> >>> Do you perhaps have a preference for one of the approaches that have been >>> discussed >>> in this e-mail thread? >>> >>> Thanks, >>> >>> Bart. >> > Hello Jens, > > Would you please give some comments about this patch, If someone writes a value that's too large, return -EINVAL and don't set it. Don't add weird debug printks. -- Jens Axboe
Re: [PATCH V8 00/14] mmc: Add Command Queue support
On Wed, Sep 13, 2017 at 02:40:00PM +0300, Adrian Hunter wrote:
> Non-CQE blk-mq showed a 3% decrease in sequential read performance. This
> seemed to be coming from the inferior latency of running work items
> compared with a dedicated thread. Hacking the blk-mq workqueue to be
> unbound reduced the performance degradation from 3% to 1%.

Can you start a discussion on just this on the linux-block lists?
Re: [PATCH V2] lightnvm: include NVM Express driver if OCSSD is selected for build
On 09/21/2017 03:41 PM, Rakesh Pandit wrote:
> Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
> in the config file before building the kernel. Also append PCI to
> depends, as select doesn't automatically add dependencies.
>
> Signed-off-by: Rakesh Pandit
> ---
> V2: appends 'depends' with PCI
>
>  drivers/lightnvm/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
> index ead61a9..2a953ef 100644
> --- a/drivers/lightnvm/Kconfig
> +++ b/drivers/lightnvm/Kconfig
> @@ -4,7 +4,8 @@
>  menuconfig NVM
>  	bool "Open-Channel SSD target support"
> -	depends on BLOCK && HAS_DMA
> +	depends on BLOCK && HAS_DMA && PCI
> +	select BLK_DEV_NVME
>  	help
>  	  Say Y here to get to enable Open-channel SSDs.

Thanks Rakesh.
[PATCH V2] lightnvm: include NVM Express driver if OCSSD is selected for build
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in the config file before building the kernel. Also append PCI to
depends, as select doesn't automatically add dependencies.

Signed-off-by: Rakesh Pandit
---
V2: appends 'depends' with PCI

 drivers/lightnvm/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..2a953ef 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -4,7 +4,8 @@
 menuconfig NVM
 	bool "Open-Channel SSD target support"
-	depends on BLOCK && HAS_DMA
+	depends on BLOCK && HAS_DMA && PCI
+	select BLK_DEV_NVME
 	help
 	  Say Y here to get to enable Open-channel SSDs.
-- 
2.5.0
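The reason the V2 patch has to touch both lines can be shown with a generic Kconfig sketch (symbol names `FOO`/`BAR` are illustrative, not from the tree): `select` force-enables the selected symbol without checking that symbol's own `depends on`, so the selecting symbol has to repeat any dependency it would otherwise skip past.

```kconfig
config BAR
	bool "bar driver"
	depends on PCI

config FOO
	bool "foo feature"
	# 'select BAR' sets BAR=y unconditionally; it does NOT check or
	# enable BAR's own 'depends on PCI'. To avoid building BAR with
	# an unmet dependency, FOO must repeat PCI in its own depends.
	depends on BLOCK && PCI
	select BAR
```

This is exactly the shape of the lightnvm change: `select BLK_DEV_NVME` plus `PCI` added to `depends on`, because `BLK_DEV_NVME` itself depends on PCI.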
Re: [PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build
On Thu, Sep 21, 2017 at 01:32:40PM +0200, Matias Bjørling wrote:
> On 09/21/2017 01:28 PM, Rakesh Pandit wrote:
>> Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
>> in the config file before building the kernel.
>>
>> Signed-off-by: Rakesh Pandit
>> ---
>>  drivers/lightnvm/Kconfig | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
>> index ead61a9..b3c00cb 100644
>> --- a/drivers/lightnvm/Kconfig
>> +++ b/drivers/lightnvm/Kconfig
>> @@ -5,6 +5,7 @@
>>  menuconfig NVM
>>  	bool "Open-Channel SSD target support"
>>  	depends on BLOCK && HAS_DMA
>> +	select BLK_DEV_NVME
>>  	help
>>  	  Say Y here to get to enable Open-channel SSDs.
>
> Thanks Rakesh. I've picked it up for 4.15.

As discussed (IRC) I wasn't very careful the first time and missed
appending PCI to depends, as select doesn't visit dependencies
automatically. Will post a version 2 soon.

Will only post version 2 of this very patch. Doesn't make sense to push
anything else from the set.

Thank you in advance,
Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs
On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote:
> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote:
>> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
>>> On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
>>>> On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
>>>>> if blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
>>>>> when a number > tag_set->queue_depth is input. blk_mq_tag_update_depth
>>>>> will get the smaller one, min(nr, set->queue_depth), and then
>>>>> q->nr_requests gets a wrong value.
>>>>>
>>>>> Reproduce:
>>>>>
>>>>> echo none > /sys/block/nvme0n1/queue/ioscheduler
>>>>> echo 100 > /sys/block/nvme0n1/queue/nr_requests
>>>>> cat /sys/block/nvme0n1/queue/nr_requests
>>>>> 100
>>>>>
>>>>> Signed-off-by: weiping zhang
>>>>> ---
>>>>>  block/blk-mq.c | 7 +++++--
>>>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index f84d145..8303e5e 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>>>>  		 * queue depth. This is similar to what the old code would do.
>>>>>  		 */
>>>>>  		if (!hctx->sched_tags) {
>>>>> -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
>>>>> -						      min(nr, set->queue_depth),
>>>>> +			if (nr > set->queue_depth) {
>>>>> +				nr = set->queue_depth;
>>>>> +				pr_warn("reduce nr_request to %u\n", nr);
>>>>> +			}
>>>>> +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
>>>>> 						      false);
>>>>>  		} else {
>>>>>  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>>>>
>>>> Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large?
>>>> That will help to keep user space code that updates the queue depth
>>>> simple.
>>>
>>> Hi Bart,
>>>
>>> The reason for not returning -EINVAL is to keep in line with the minimum
>>> checking in queue_requests_store; if you insist on returning
>>> -EINVAL/-ERANGE, the minimum checking should keep the same behavior.
>>> Both returning an error and quietly clamping are okay for me. Which way
>>> do you prefer?
>>>
>>> static ssize_t
>>> queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>> {
>>> 	unsigned long nr;
>>> 	int ret, err;
>>>
>>> 	if (!q->request_fn && !q->mq_ops)
>>> 		return -EINVAL;
>>>
>>> 	ret = queue_var_store(&nr, page, count);
>>> 	if (ret < 0)
>>> 		return ret;
>>>
>>> 	if (nr < BLKDEV_MIN_RQ)
>>> 		nr = BLKDEV_MIN_RQ;
>>
>> Hello Jens,
>>
>> Do you perhaps have a preference for one of the approaches that have
>> been discussed in this e-mail thread?
>>
>> Thanks,
>>
>> Bart.

Hello Jens,

Would you please give some comments about this patch?

Thanks Weiping.
Re: [PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build
On 09/21/2017 01:28 PM, Rakesh Pandit wrote:
> Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
> in the config file before building the kernel.
>
> Signed-off-by: Rakesh Pandit
> ---
>  drivers/lightnvm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
> index ead61a9..b3c00cb 100644
> --- a/drivers/lightnvm/Kconfig
> +++ b/drivers/lightnvm/Kconfig
> @@ -5,6 +5,7 @@
>  menuconfig NVM
>  	bool "Open-Channel SSD target support"
>  	depends on BLOCK && HAS_DMA
> +	select BLK_DEV_NVME
>  	help
>  	  Say Y here to get to enable Open-channel SSDs.

Thanks Rakesh. I've picked it up for 4.15.
[PATCH 5/6] lightnvm: pblk: print incompatible line version correctly
Correct it by converting little endian to cpu endian and also define a
macro for the line version so that maintenance is easy.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/pblk-core.c     | 2 +-
 drivers/lightnvm/pblk-recovery.c | 4 ++--
 drivers/lightnvm/pblk.h          | 1 +
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 74ddb30..57583a1 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -991,7 +991,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line,
 	memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
 	smeta_buf->header.id = cpu_to_le32(line->id);
 	smeta_buf->header.type = cpu_to_le16(line->type);
-	smeta_buf->header.version = cpu_to_le16(1);
+	smeta_buf->header.version = SMETA_VERSION;

 	/* Start metadata */
 	smeta_buf->seq_nr = cpu_to_le64(line->seq_nr);
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 1869eef..686bc17 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -876,9 +876,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 		if (le32_to_cpu(smeta_buf->header.identifier) != PBLK_MAGIC)
 			continue;

-		if (le16_to_cpu(smeta_buf->header.version) != 1) {
+		if (smeta_buf->header.version != SMETA_VERSION) {
 			pr_err("pblk: found incompatible line version %u\n",
-					smeta_buf->header.version);
+					le16_to_cpu(smeta_buf->header.version));
 			return ERR_PTR(-EINVAL);
 		}

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index eaf5397..87b1d7f 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -318,6 +318,7 @@ enum {
 };

 #define PBLK_MAGIC 0x70626c6b /* pblk */
+#define SMETA_VERSION cpu_to_le16(1)

 struct line_header {
 	__le32 crc;
-- 
2.5.0
[PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in the config file before building the kernel.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..b3c00cb 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -5,6 +5,7 @@
 menuconfig NVM
 	bool "Open-Channel SSD target support"
 	depends on BLOCK && HAS_DMA
+	select BLK_DEV_NVME
 	help
 	  Say Y here to get to enable Open-channel SSDs.
-- 
2.5.0
[PATCH 4/6] lightnvm: pblk: improve error message if down_timeout fails
The two pr_err messages are useless as they don't even differentiate the
error code.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/pblk-core.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index b92eabc..74ddb30 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1734,16 +1734,8 @@ static void __pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list,
 #endif

 	ret = down_timeout(&pblk->wr_sem, msecs_to_jiffies(30000));
-	if (ret) {
-		switch (ret) {
-		case -ETIME:
-			pr_err("pblk: lun semaphore timed out\n");
-			break;
-		case -EINTR:
-			pr_err("pblk: lun semaphore timed out\n");
-			break;
-		}
-	}
+	if (ret == -ETIME || ret == -EINTR)
+		pr_err("pblk: taking lun semaphore timed out: err %d\n", -ret);
 }

 void pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas)
-- 
2.5.0
[PATCH 3/6] lightnvm: pblk: fix message if L2P MAP is in device
This usually happens if we are developing with qemu and ll2pmode has the
default value. Even in that case the message seems wrong.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/pblk-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 470ef04..c5c1337 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -913,7 +913,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
 	int ret;

 	if (dev->identity.dom & NVM_RSP_L2P) {
-		pr_err("pblk: device-side L2P table not supported. (%x)\n",
+		pr_err("pblk: device-side L2P table supported. (%x)\n",
 					dev->identity.dom);
 		return ERR_PTR(-EINVAL);
 	}
-- 
2.5.0
[PATCH 2/6] lightnvm: pblk: protect line bitmap while submitting meta io
It seems pblk_dealloc_page would race against pblk_alloc_pages for the
line bitmap for sector allocation. The chances are very low but we might
as well protect the bitmap properly. It's not even in the fast path.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/pblk-core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index a230125..b92eabc 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -502,12 +502,14 @@ void pblk_dealloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
 	u64 addr;
 	int i;

+	spin_lock(&line->lock);
 	addr = find_next_zero_bit(line->map_bitmap,
 					pblk->lm.sec_per_line, line->cur_sec);
 	line->cur_sec = addr - nr_secs;

 	for (i = 0; i < nr_secs; i++, line->cur_sec--)
 		WARN_ON(!test_and_clear_bit(line->cur_sec, line->map_bitmap));
+	spin_unlock(&line->lock);
 }

 u64 __pblk_alloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
-- 
2.5.0
[PATCH 0/6] misc patches mostly for pblk
These are trivial changes up for review. Most of them I made while
skimming through the code base. They are mostly cleanups and they are at
random places.

Rakesh Pandit (6):
  lightnvm: pblk: reuse pblk_gc_should_kick
  lightnvm: pblk: protect line bitmap while submitting meta io
  lightnvm: pblk: fix message if L2P MAP is in device
  lightnvm: pblk: improve error message if down_timeout fails
  lightnvm: pblk: print incompatible line version correctly
  lightnvm: include NVM Express driver if OCSSD is selected for build

 drivers/lightnvm/Kconfig         |  1 +
 drivers/lightnvm/pblk-core.c     | 17 ++---------------
 drivers/lightnvm/pblk-init.c     |  2 +-
 drivers/lightnvm/pblk-recovery.c |  4 ++--
 drivers/lightnvm/pblk-rl.c       |  9 ---------
 drivers/lightnvm/pblk.h          |  1 +
 6 files changed, 11 insertions(+), 23 deletions(-)

-- 
2.5.0
[PATCH 1/6] lightnvm: pblk: reuse pblk_gc_should_kick
This is a trivial change which reuses pblk_gc_should_kick instead of
repeating it again in pblk_rl_free_lines_inc.

Signed-off-by: Rakesh Pandit
---
 drivers/lightnvm/pblk-core.c | 1 +
 drivers/lightnvm/pblk-rl.c   | 9 ---------
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 64a6a25..a230125 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1478,6 +1478,7 @@ static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
 	spin_unlock(&l_mg->free_lock);

 	pblk_rl_free_lines_inc(&pblk->rl, line);
+	pblk_gc_should_kick(pblk);
 }

 static void pblk_line_put_ws(struct work_struct *work)
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 596bdec..e7c162a 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -129,18 +129,9 @@ static int pblk_rl_update_rates(struct pblk_rl *rl, unsigned long max)

 void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line)
 {
-	struct pblk *pblk = container_of(rl, struct pblk, rl);
 	int blk_in_line = atomic_read(&line->blk_in_line);
-	int ret;

 	atomic_add(blk_in_line, &rl->free_blocks);
-
-	/* Rates will not change that often - no need to lock update */
-	ret = pblk_rl_update_rates(rl, rl->rb_budget);
-
-	if (ret == (PBLK_RL_MID | PBLK_RL_LOW))
-		pblk_gc_should_start(pblk);
-	else
-		pblk_gc_should_stop(pblk);
 }

 void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line)
-- 
2.5.0
Re: [PATCH V8 12/14] mmc: block: Add CQE and blk-mq support
On 21/09/17 12:59, Ulf Hansson wrote:
> On 13 September 2017 at 13:40, Adrian Hunter wrote:
>> Add CQE support to the block driver, including:
>>   - optionally using DCMD for flush requests
>>   - "manually" issuing discard requests
>>   - issuing read / write requests to the CQE
>>   - supporting block-layer timeouts
>>   - handling recovery
>>   - supporting re-tuning
>>
>> CQE offers 25% - 50% better random multi-threaded I/O. There is a
>> slight (e.g. 2%) drop in sequential read speed but no observable change
>> to sequential write.
>>
>> CQE automatically sends the commands to complete requests. However it
>> only supports reads / writes and so-called "direct commands" (DCMD).
>> Furthermore DCMD is limited to one command at a time, but discards
>> require 3 commands. That makes issuing discards through CQE very
>> awkward, but some CQEs don't support DCMD anyway. So for discards, the
>> existing non-CQE approach is taken, where the mmc core code issues the
>> 3 commands one at a time, i.e. mmc_erase(). Where DCMD is used is for
>> issuing flushes.
>>
>> For host controllers without CQE support, blk-mq support is extended to
>> synchronous reads/writes or, if the host supports CAP_WAIT_WHILE_BUSY,
>> asynchronous reads/writes. The advantage of asynchronous reads/writes
>> is that it allows the preparation of the next request while the current
>> request is in progress.
>>
>> Signed-off-by: Adrian Hunter
>> ---
>>  drivers/mmc/core/block.c | 734 ++++++++++++++++++++++++++++++++++++---
>>  drivers/mmc/core/block.h |   8 +
>>  drivers/mmc/core/queue.c | 428 ++++++++++++++++++++++---
>>  drivers/mmc/core/queue.h |  54 +++-
>>  4 files changed, 1192 insertions(+), 32 deletions(-)
>
> Adrian, this is just too hard for me to review. Please, can you split
> this up?
>
> At least enabling mq support should be done in a separate and first
> step, then you could add the CQE bits on top.
>
> I think that is important, because we want to make sure the mq
> deployment is done correctly first. Otherwise it becomes impossible to
> narrow down problems, because those could then be either CQE related or
> mq related.

I looked at splitting it up, but it makes more sense together. Things
are the way they are to support all the different issue types. It reads
much better together. The CQE functions are named appropriately and
practically all the code is new, so it is not as though there are any
intermediate steps.
Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata
On 09/21/2017 12:15 PM, Rakesh Pandit wrote:
> On Thu, Sep 21, 2017 at 11:56:46AM +0200, Javier González wrote:
>>> On 20 Sep 2017, at 21.50, Rakesh Pandit wrote:
>>>
>>> Use appropriate memory free calls based on the allocation type used
>>> and also fix the number of times free is called if kmalloc fails.
>>>
>>> Signed-off-by: Rakesh Pandit
>>> ---
>>>  drivers/lightnvm/pblk-init.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>>> index 7cf4b53..470ef04 100644
>>> --- a/drivers/lightnvm/pblk-init.c
>>> +++ b/drivers/lightnvm/pblk-init.c
>>> @@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
>>>
>>>  fail_free_emeta:
>>>  	while (--i >= 0) {
>>> -		vfree(l_mg->eline_meta[i]->buf);
>>> +		if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
>>> +			vfree(l_mg->eline_meta[i]->buf);
>>> +		else
>>> +			kfree(l_mg->eline_meta[i]->buf);
>>>  		kfree(l_mg->eline_meta[i]);
>>>  	}
>>>
>>> +	i = PBLK_DATA_LINES;
>>>  fail_free_smeta:
>>> -	for (i = 0; i < PBLK_DATA_LINES; i++)
>>> +	while (--i >= 0)
>>>  		kfree(l_mg->sline_meta[i]);
>>
>> It is safe to use kfree on NULL pointers. No need to do this. You can
>> either send a new patch, or we can change it when picking it up.
>
> Yes, that would be great if this is adjusted while picking up.
>
>>>  	return -ENOMEM;
>>> -- 
>>> 2.5.0
>>
>> Rest looks good.
>>
>> Reviewed-by: Javier González

Thanks Rakesh. I queued it up.
Re: [PATCH] lightnvm: remove already calculated nr_chnls
On 09/18/2017 12:56 PM, Matias Bjørling wrote:
> Den 18. sep. 2017 09.56 skrev "Javier González":
>> On 17 Sep 2017, at 23.04, Rakesh Pandit wrote:
>>>
>>> Remove the repeated calculation of the number of channels while
>>> creating a target device.
>>>
>>> Signed-off-by: Rakesh Pandit
>>> ---
>>>
>>> This is also a trivial change I found while investigating/working on
>>> another issue.
>>>
>>>  drivers/lightnvm/core.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>>
>>> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
>>> index 1b8338d..01536b8 100644
>>> --- a/drivers/lightnvm/core.c
>>> +++ b/drivers/lightnvm/core.c
>>> @@ -139,7 +139,6 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
>>>  	int prev_nr_luns;
>>>  	int i, j;
>>>
>>> -	nr_chnls = nr_luns / dev->geo.luns_per_chnl;
>>>  	nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
>>>
>>>  	dev_map = kmalloc(sizeof(struct nvm_dev_map), GFP_KERNEL);
>>> -- 
>>> 2.7.4
>>
>> We wanted to make sure that nr_chnls was really, really set :)
>>
>> Reviewed-by: Javier González
>
> What the hell... must have been a patch or merge that went wrong.

Thanks Rakesh. I pulled it in for 4.15.
Re: [PATCH V2] lightnvm: protect target type list with correct locks
On 09/18/2017 09:53 AM, Javier González wrote:
>> On 16 Sep 2017, at 20.39, Rakesh Pandit wrote:
>>
>> The nvm_tgt_types list was protected by the wrong lock for the NVM_INFO
>> ioctl call and can race with addition or removal of target types. Also
>> unregistering a target type was not protected correctly.
>>
>> Fixes: 5cd907853 ("lightnvm: remove nested lock conflict with mm")
>> Signed-off-by: Rakesh Pandit
>> ---
>> V2: also add correct lock while unregistering and fix "Fixes" tag at
>> end.
>>
>> Note I found these while investigating another issue and skimming the
>> core code but worth fixing.
>>
>>  drivers/lightnvm/core.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
>> index 9f9a137..1b8338d 100644
>> --- a/drivers/lightnvm/core.c
>> +++ b/drivers/lightnvm/core.c
>> @@ -589,9 +589,9 @@ void nvm_unregister_tgt_type(struct nvm_tgt_type *tt)
>>  	if (!tt)
>>  		return;
>>
>> -	down_write(&nvm_lock);
>> +	down_write(&nvm_tgtt_lock);
>>  	list_del(&tt->list);
>> -	up_write(&nvm_lock);
>> +	up_write(&nvm_tgtt_lock);
>>  }
>>  EXPORT_SYMBOL(nvm_unregister_tgt_type);
>>
>> @@ -1190,7 +1190,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
>>  	info->version[1] = NVM_VERSION_MINOR;
>>  	info->version[2] = NVM_VERSION_PATCH;
>>
>> -	down_write(&nvm_lock);
>> +	down_write(&nvm_tgtt_lock);
>>  	list_for_each_entry(tt, &nvm_tgt_types, list) {
>>  		struct nvm_ioctl_info_tgt *tgt = &info->tgts[tgt_iter];
>>
>> @@ -1203,7 +1203,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
>>  	}
>>
>>  	info->tgtsize = tgt_iter;
>> -	up_write(&nvm_lock);
>> +	up_write(&nvm_tgtt_lock);
>>
>>  	if (copy_to_user(arg, info, sizeof(struct nvm_ioctl_info))) {
>>  		kfree(info);
>> -- 
>> 2.7.4
>
> LGTM.
>
> Reviewed-by: Javier González

Thanks Rakesh.
Re: [PATCH V2] lightnvm: prevent bd removal if busy
On 09/12/2017 03:22 PM, Javier González wrote:
>> On 10 Sep 2017, at 21.07, Rakesh Pandit wrote:
>>
>> When a virtual block device is formatted and mounted after creating it
>> with "nvme lnvm create... -t pblk", a removal with "nvm lnvm remove"
>> would result in this:
>>
>> [446416.309757] bdi-block not registered
>> [446416.309773] ------------[ cut here ]------------
>> [446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 __mark_inode_dirty+0x268/0x340
>>
>> Ideally removal should return -EBUSY as the block device is mounted
>> after formatting. This patch tries to address this by checking whether
>> the whole device or any partition of it is already mounted before
>> removal.
>>
>> The whole device is checked using the "bd_super" member of the block
>> device. This member is always set once the block device has been
>> mounted using a filesystem. Another member, "bd_part_count", takes care
>> of checking if any partitions are in use. "bd_part_count" is only
>> updated under locks when partitions are opened or closed (first open
>> and last release). This at least takes care of sending -EBUSY if
>> removal is being attempted while the whole block device or any
>> partition is mounted.
>>
>> Signed-off-by: Rakesh Pandit
>> ---
>> V2: Take a different approach. Instead of checking bd_openers use
>> bd_super and bd_part_count. This should protect bdevs which are
>> mounted from removal.
>>
>>  drivers/lightnvm/core.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
>> index c39f87d..9f9a137 100644
>> --- a/drivers/lightnvm/core.c
>> +++ b/drivers/lightnvm/core.c
>> @@ -373,6 +373,7 @@ static void __nvm_remove_target(struct nvm_target *t)
>>  static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
>>  {
>>  	struct nvm_target *t;
>> +	struct block_device *bdev;
>>
>>  	mutex_lock(&dev->mlock);
>>  	t = nvm_find_target(dev, remove->tgtname);
>> @@ -380,6 +381,19 @@ static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
>>  		mutex_unlock(&dev->mlock);
>>  		return 1;
>>  	}
>> +	bdev = bdget_disk(t->disk, 0);
>> +	if (!bdev) {
>> +		pr_err("nvm: removal failed, allocating bd failed\n");
>> +		mutex_unlock(&dev->mlock);
>> +		return -ENOMEM;
>> +	}
>> +	if (bdev->bd_super || bdev->bd_part_count) {
>> +		pr_err("nvm: removal failed, block device busy\n");
>> +		bdput(bdev);
>> +		mutex_unlock(&dev->mlock);
>> +		return -EBUSY;
>> +	}
>> +	bdput(bdev);
>>  	__nvm_remove_target(t);
>>  	mutex_unlock(&dev->mlock);
>> -- 
>> 2.7.4
>
> Looks good.
>
> Reviewed-by: Javier González

Thanks Rakesh. I pulled it in for 4.15.
Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata
On Thu, Sep 21, 2017 at 11:56:46AM +0200, Javier González wrote:
>> On 20 Sep 2017, at 21.50, Rakesh Pandit wrote:
>>
>> Use appropriate memory free calls based on the allocation type used
>> and also fix the number of times free is called if kmalloc fails.
>>
>> Signed-off-by: Rakesh Pandit
>> ---
>>  drivers/lightnvm/pblk-init.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>> index 7cf4b53..470ef04 100644
>> --- a/drivers/lightnvm/pblk-init.c
>> +++ b/drivers/lightnvm/pblk-init.c
>> @@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
>>
>>  fail_free_emeta:
>>  	while (--i >= 0) {
>> -		vfree(l_mg->eline_meta[i]->buf);
>> +		if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
>> +			vfree(l_mg->eline_meta[i]->buf);
>> +		else
>> +			kfree(l_mg->eline_meta[i]->buf);
>>  		kfree(l_mg->eline_meta[i]);
>>  	}
>>
>> +	i = PBLK_DATA_LINES;
>>  fail_free_smeta:
>> -	for (i = 0; i < PBLK_DATA_LINES; i++)
>> +	while (--i >= 0)
>>  		kfree(l_mg->sline_meta[i]);
>
> It is safe to use kfree on NULL pointers. No need to do this. You can
> either send a new patch, or we can change it when picking it up.

Yes, that would be great if this is adjusted while picking up.

>>
>>  	return -ENOMEM;
>> -- 
>> 2.5.0
>
> Rest looks good.
>
> Reviewed-by: Javier González

Thanks,
Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata
> On 20 Sep 2017, at 21.50, Rakesh Pandit wrote:
>
> Use appropriate memory free calls based on the allocation type used and
> also fix the number of times free is called if kmalloc fails.
>
> Signed-off-by: Rakesh Pandit
> ---
>  drivers/lightnvm/pblk-init.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index 7cf4b53..470ef04 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
>
>  fail_free_emeta:
>  	while (--i >= 0) {
> -		vfree(l_mg->eline_meta[i]->buf);
> +		if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
> +			vfree(l_mg->eline_meta[i]->buf);
> +		else
> +			kfree(l_mg->eline_meta[i]->buf);
>  		kfree(l_mg->eline_meta[i]);
>  	}
>
> +	i = PBLK_DATA_LINES;
>  fail_free_smeta:
> -	for (i = 0; i < PBLK_DATA_LINES; i++)
> +	while (--i >= 0)
>  		kfree(l_mg->sline_meta[i]);

It is safe to use kfree on NULL pointers. No need to do this. You can
either send a new patch, or we can change it when picking it up.

>
>  	return -ENOMEM;
> -- 
> 2.5.0

Rest looks good.

Reviewed-by: Javier González
Re: [PATCH V8 00/14] mmc: Add Command Queue support
On 21/09/17 12:01, Ulf Hansson wrote:
> On 13 September 2017 at 13:40, Adrian Hunter wrote:
>> Hi
>>
>> Here is V8 of the hardware command queue patches without the software
>> command queue patches, now using blk-mq and now with blk-mq support for
>> non-CQE I/O.
>>
>> After the unacceptable debacle of the last release cycle, I expect an
>> immediate response to these patches.
>>
>> HW CMDQ offers 25% - 50% better random multi-threaded I/O. I see a
>> slight 2% drop in sequential read speed but no change to sequential
>> write.
>>
>> Non-CQE blk-mq showed a 3% decrease in sequential read performance.
>> This seemed to be coming from the inferior latency of running work
>> items compared with a dedicated thread. Hacking the blk-mq workqueue to
>> be unbound reduced the performance degradation from 3% to 1%.
>>
>> While we should look at changing blk-mq to give better workqueue
>> performance, a bigger gain is likely to be made by adding a new host
>> API to enable the next already-prepared request to be issued directly
>> from within the ->done() callback of the current request.
>
> Adrian, I am reviewing this series, however let me comment on each
> change individually.
>
> I have also run some tests on my ux500 board, enabling the blkmq path
> via the new MMC Kconfig option. My idea was to run some iozone
> comparisons between the legacy path and the new blkmq path, but I just
> couldn't get to that point because of the following errors.
>
> I am using a Kingston 4GB SDHC card, which is detected and mounted
> nicely. However, when I decide to do some writes to the card I get the
> following errors.
>
> root@ME:/mnt/sdcard dd if=/dev/zero of=testfile bs=8192 count=5000 conv=fsync
> [  463.714294] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  464.722656] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  466.081481] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  467.111236] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  468.669647] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  469.685699] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  471.043334] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  472.052337] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  473.342651] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  474.323760] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  475.544769] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  476.539031] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  477.748474] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  478.724182] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
>
> I haven't yet got to the point of investigating this any further, and
> unfortunately I have a busy schedule with traveling next week. I will
> do my best to look into this as soon as I can.
>
> Perhaps you have some ideas?

The behaviour depends on whether you have MMC_CAP_WAIT_WHILE_BUSY. Try
changing that and see if it makes a difference.
Re: [PATCH V8 08/14] mmc: core: Add parameter use_blk_mq
On 13 September 2017 at 13:40, Adrian Hunterwrote: > Until mmc has blk-mq support fully implemented and tested, add a > parameter use_blk_mq, default to false unless config option MMC_MQ_DEFAULT > is selected. > > Signed-off-by: Adrian Hunter > --- > drivers/mmc/Kconfig | 11 +++ > drivers/mmc/core/core.c | 7 +++ > drivers/mmc/core/core.h | 2 ++ > drivers/mmc/core/host.c | 2 ++ > drivers/mmc/core/host.h | 4 > include/linux/mmc/host.h | 1 + > 6 files changed, 27 insertions(+) > > diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig > index ec21388311db..98202934bd29 100644 > --- a/drivers/mmc/Kconfig > +++ b/drivers/mmc/Kconfig > @@ -12,6 +12,17 @@ menuconfig MMC > If you want MMC/SD/SDIO support, you should say Y here and > also to your specific host controller driver. > > +config MMC_MQ_DEFAULT > + bool "MMC: use blk-mq I/O path by default" > + depends on MMC && BLOCK > + ---help--- > + This option enables the new blk-mq based I/O path for MMC block > + devices by default. With the option the mmc_core.use_blk_mq > + module/boot option defaults to Y, without it to N, but it can > + still be overridden either way. > + > + If unsure say N. > + > if MMC I asume the goal of adding this option is to enable us to move slowly forward. In general that might be a good idea, however for this particular case I am not sure. The main reason is simply that I find it unlikely that people and distributions will actually go in and change the default value, so in the end we will just be adding new code, which isn't really going to be much tested. That's what happened in scsi case. As I also stated earlier, I do worry about the maintenance of the mmc block device code, and this approach make it worse, at least short term. To me, the scsi case is also different, because the mq support was added long time ago and at that point one could worry a bit of maturity of the blkmq in general, that I assume have been sorted out by know. 
> > source "drivers/mmc/core/Kconfig" > diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c > index ef2d8aa1e7d2..3638ed4f0f9e 100644 > --- a/drivers/mmc/core/core.c > +++ b/drivers/mmc/core/core.c > @@ -66,6 +66,13 @@ > bool use_spi_crc = 1; > module_param(use_spi_crc, bool, 0); > > +#ifdef CONFIG_MMC_MQ_DEFAULT > +bool mmc_use_blk_mq = true; > +#else > +bool mmc_use_blk_mq = false; > +#endif > +module_param_named(use_blk_mq, mmc_use_blk_mq, bool, S_IWUSR | S_IRUGO); > + > static int mmc_schedule_delayed_work(struct delayed_work *work, > unsigned long delay) > { > diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h > index e941342ed450..535539a9e7eb 100644 > --- a/drivers/mmc/core/core.h > +++ b/drivers/mmc/core/core.h > @@ -35,6 +35,8 @@ struct mmc_bus_ops { > int (*reset)(struct mmc_host *); > }; > > +extern bool mmc_use_blk_mq; > + > void mmc_attach_bus(struct mmc_host *host, const struct mmc_bus_ops *ops); > void mmc_detach_bus(struct mmc_host *host); > > diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c > index ad88deb2e8f3..b624dbb6cd15 100644 > --- a/drivers/mmc/core/host.c > +++ b/drivers/mmc/core/host.c > @@ -398,6 +398,8 @@ struct mmc_host *mmc_alloc_host(int extra, struct device > *dev) > host->max_blk_size = 512; > host->max_blk_count = PAGE_SIZE / 512; > > + host->use_blk_mq = mmc_use_blk_mq; > + > return host; > } > > diff --git a/drivers/mmc/core/host.h b/drivers/mmc/core/host.h > index 77d6f60d1bf9..170fe5947087 100644 > --- a/drivers/mmc/core/host.h > +++ b/drivers/mmc/core/host.h > @@ -69,6 +69,10 @@ static inline bool mmc_card_hs400es(struct mmc_card *card) > return card->host->ios.enhanced_strobe; > } > > +static inline bool mmc_host_use_blk_mq(struct mmc_host *host) > +{ > + return host->use_blk_mq; > +} > > #endif > > diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h > index 54b0463443bd..5d1bd10991f7 100644 > --- a/include/linux/mmc/host.h > +++ b/include/linux/mmc/host.h > @@ -378,6 +378,7 
@@ struct mmc_host { > unsigned int doing_retune:1; /* re-tuning in progress */ > unsigned int retune_now:1; /* do re-tuning at next req */ > unsigned int retune_paused:1; /* re-tuning is temporarily > disabled */ > + unsigned int use_blk_mq:1; /* use blk-mq */ > > int rescan_disable; /* disable card detection */ > int rescan_entered; /* used with nonremovable > devices */ > -- > 1.9.1 > Kind regards Uffe
[PATCH BUGFIX/IMPROVEMENT 4/4] block, bfq: decrease burst size when queues in burst exit
If many queues belonging to the same group happen to be created shortly after each other, then the concurrent processes associated with these queues typically have a common goal, and they get it done as soon as possible if not hampered by device idling. Examples are processes spawned by git grep, or by systemd during boot. As for device idling, this mechanism is currently necessary for weight raising to succeed in its goal: privileging I/O. In view of these facts, BFQ does not provide the above queues with either weight raising or device idling. On the other hand, a burst of queue creations may be caused also by the start-up of a complex application. In this case, these queues usually need to be served one after the other, and as quickly as possible, to maximise responsiveness. Therefore, in this case the best strategy is to weight-raise all the queues created during the burst, i.e., the exact opposite of the strategy for the above case. To distinguish between the two cases, BFQ uses an empirical burst-size threshold, found through extensive tests and monitoring of daily usage. Only large bursts, i.e., bursts with a size above this threshold, are considered as generated by a high number of parallel processes. In this respect, upstart-based boot proved to be rather hard to detect as generating a large burst of queue creations, because with upstart most of the queues created in a burst exit *before* the next queues in the same burst are created. To address this issue, I changed the burst-detection mechanism so as to not decrease the size of the current burst even if one of the queues in the burst is eliminated. Unfortunately, this missing decrease causes false positives on very fast systems: on the start-up of a complex application, such as libreoffice writer, so many queues are created, served and exited shortly after each other, that a large burst of queue creations is wrongly detected as occurring.
These false positives just disappear if the size of a burst is decreased when one of the queues in the burst exits. This commit restores the missing burst-size decrease, relying on the fact that upstart is apparently unlikely to be used on systems running this and future versions of the kernel. Signed-off-by: Paolo Valente Signed-off-by: Mauro Andreolini Signed-off-by: Angelo Ruocco Tested-by: Mirko Montanari Tested-by: Oleksandr Natalenko --- block/bfq-iosched.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 115747f..70f9177 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -3725,16 +3725,10 @@ void bfq_put_queue(struct bfq_queue *bfqq) if (bfqq->ref) return; - if (bfq_bfqq_sync(bfqq)) - /* -* The fact that this queue is being destroyed does not -* invalidate the fact that this queue may have been -* activated during the current burst. As a consequence, -* although the queue does not exist anymore, and hence -* needs to be removed from the burst list if there, -* the burst size has not to be decremented. -*/ + if (bfq_bfqq_sync(bfqq) && !hlist_unhashed(&bfqq->burst_list_node)) { hlist_del_init(&bfqq->burst_list_node); + bfqq->bfqd->burst_size--; + } kmem_cache_free(bfq_pool, bfqq); #ifdef CONFIG_BFQ_GROUP_IOSCHED -- 2.10.0
[PATCH BUGFIX/IMPROVEMENT 3/4] block, bfq: let early-merged queues be weight-raised on split too
A just-created bfq_queue, say Q, may happen to be merged with another bfq_queue on the very first invocation of the function __bfq_insert_request. In such a case, even if Q would clearly deserve interactive weight raising (as it has just been created), the function bfq_add_request does not make it to be invoked for Q, and thus to activate weight raising for Q. As a consequence, when the state of Q is saved for a possible future restore, after a split of Q from the other bfq_queue(s), such a state happens to be (unjustly) non-weight-raised. Then the bfq_queue will not enjoy any weight raising on the split, even if it should still be in an interactive weight-raising period when the split occurs. This commit solves this problem as follows, for a just-created bfq_queue that is being early-merged: it stores directly, in the saved state of the bfq_queue, the weight-raising state that would have been assigned to the bfq_queue if not early-merged. Signed-off-by: Paolo Valente Tested-by: Angelo Ruocco Tested-by: Mirko Montanari Tested-by: Oleksandr Natalenko --- block/bfq-iosched.c | 28 +++- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 33b63bc..115747f 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -2061,10 +2061,27 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq) bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq); bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq); bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node); - bic->saved_wr_coeff = bfqq->wr_coeff; - bic->saved_wr_start_at_switch_to_srt = bfqq->wr_start_at_switch_to_srt; - bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish; - bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time; + if (unlikely(bfq_bfqq_just_created(bfqq) && +!bfq_bfqq_in_large_burst(bfqq))) { + /* +* bfqq being merged right after being created: bfqq +* would have deserved interactive weight raising, but +* did not make it to be set in a 
weight-raised state, +* because of this early merge. Store directly the +* weight-raising state that would have been assigned +* to bfqq, so that to avoid that bfqq unjustly fails +* to enjoy weight raising if split soon. +*/ + bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff; + bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd); + bic->saved_last_wr_start_finish = jiffies; + } else { + bic->saved_wr_coeff = bfqq->wr_coeff; + bic->saved_wr_start_at_switch_to_srt = + bfqq->wr_start_at_switch_to_srt; + bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish; + bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time; + } } static void @@ -4150,7 +4167,6 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq) new_bfqq->allocated++; bfqq->allocated--; new_bfqq->ref++; - bfq_clear_bfqq_just_created(bfqq); /* * If the bic associated with the process * issuing this request still points to bfqq @@ -4162,6 +4178,8 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq) if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq) bfq_merge_bfqqs(bfqd, RQ_BIC(rq), bfqq, new_bfqq); + + bfq_clear_bfqq_just_created(bfqq); /* * rq is about to be enqueued into new_bfqq, * release rq reference on bfqq -- 2.10.0
[PATCH BUGFIX/IMPROVEMENT 1/4] block, bfq: fix wrong init of saved start time for weight raising
This commit fixes a bug that causes bfq to fail to guarantee a high responsiveness on some drives, if there is heavy random read+write I/O in the background. More precisely, such a failure allowed this bug to be found [1], but the bug may well cause other yet unreported anomalies. BFQ raises the weight of the bfq_queues associated with soft real-time applications, to privilege the I/O, and thus reduce latency, for these applications. This mechanism is named soft-real-time weight raising in BFQ. A soft real-time period may happen to be nested into an interactive weight raising period, i.e., it may happen that, when a bfq_queue switches to a soft real-time weight-raised state, the bfq_queue is already being weight-raised because deemed interactive too. In this case, BFQ saves, in a special variable wr_start_at_switch_to_srt, the time instant when the interactive weight-raising period started for the bfq_queue, i.e., the time instant when BFQ started to deem the bfq_queue interactive. This value is then used to check whether the interactive weight-raising period would still be in progress when the soft real-time weight-raising period ends. If so, interactive weight raising is restored for the bfq_queue. This restore is useful, in particular, because it prevents bfq_queues from losing their interactive weight raising prematurely, as a consequence of spurious, short-lived soft real-time weight-raising periods caused by wrong detections as soft real-time. If, instead, a bfq_queue switches to soft-real-time weight raising while it *is not* already in an interactive weight-raising period, then the variable wr_start_at_switch_to_srt has no meaning during the following soft real-time weight-raising period. Unfortunately the handling of this case is wrong in BFQ: not only is the variable not flagged somehow as meaningless, but it is also set to the time when the switch to soft real-time weight-raising occurs. 
This may cause an interactive weight-raising period to be considered mistakenly as still in progress, and thus a spurious interactive weight-raising period to start for the bfq_queue, at the end of the soft-real-time weight-raising period. In particular the spurious interactive weight-raising period will be considered as still in progress, if the soft-real-time weight-raising period does not last very long. The bfq_queue will then be wrongly privileged and, if I/O bound, will unjustly steal bandwidth from truly interactive or soft real-time bfq_queues, harming responsiveness and low latency. This commit fixes this issue by just setting wr_start_at_switch_to_srt to minus infinity (farthest past time instant according to jiffies macros): when the soft-real-time weight-raising period ends, certainly no interactive weight-raising period will be considered as still in progress. [1] Background I/O Type: Random - Background I/O mix: Reads and writes - Application to start: LibreOffice Writer in http://www.phoronix.com/scan.php?page=news_item=Linux-4.13-IO-Laptop Signed-off-by: Paolo Valente Signed-off-by: Angelo Ruocco Tested-by: Oleksandr Natalenko Tested-by: Lee Tibbert Tested-by: Mirko Montanari --- block/bfq-iosched.c | 50 +++--- 1 file changed, 31 insertions(+), 19 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index a4783da..c25955c 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1202,6 +1202,24 @@ static unsigned int bfq_wr_duration(struct bfq_data *bfqd) return dur; } +/* + * Return the farthest future time instant according to jiffies + * macros. + */ +static unsigned long bfq_greatest_from_now(void) +{ + return jiffies + MAX_JIFFY_OFFSET; +} + +/* + * Return the farthest past time instant according to jiffies + * macros. 
+ */ +static unsigned long bfq_smallest_from_now(void) +{ + return jiffies - MAX_JIFFY_OFFSET; +} + static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd, struct bfq_queue *bfqq, unsigned int old_wr_coeff, @@ -1216,7 +1234,19 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd, bfqq->wr_coeff = bfqd->bfq_wr_coeff; bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); } else { - bfqq->wr_start_at_switch_to_srt = jiffies; + /* +* No interactive weight raising in progress +* here: assign minus infinity to +* wr_start_at_switch_to_srt, to make sure +* that, at the end of the soft-real-time +* weight raising periods that is starting +* now, no interactive weight-raising period +
[PATCH BUGFIX/IMPROVEMENT 2/4] block, bfq: check and switch back to interactive wr also on queue split
As already explained in the message of commit "block, bfq: fix wrong init of saved start time for weight raising", if a soft real-time weight-raising period happens to be nested in a larger interactive weight-raising period, then BFQ restores the interactive weight raising at the end of the soft real-time weight raising. In particular, BFQ checks whether the latter has ended only on request dispatches. Unfortunately, the above scheme fails to restore interactive weight raising in the following corner case: if a bfq_queue, say Q, 1) Is merged with another bfq_queue while it is in a nested soft real-time weight-raising period. The weight-raising state of Q is then saved, and not considered any longer until a split occurs. 2) Is split from the other bfq_queue(s) at a time instant when its soft real-time weight raising is already finished. On the split, while resuming the previous, soft real-time weight-raised state of the bfq_queue Q, BFQ checks whether the current soft real-time weight-raising period is actually over. If so, BFQ switches weight raising off for Q, *without* checking whether the soft real-time period was actually nested in a non-yet-finished interactive weight-raising period. This commit addresses this issue by adding the above missing check in bfq_queue splits, and restoring interactive weight raising if needed. 
Signed-off-by: Paolo Valente Tested-by: Angelo Ruocco Tested-by: Mirko Montanari Tested-by: Oleksandr Natalenko --- block/bfq-iosched.c | 87 ++--- 1 file changed, 49 insertions(+), 38 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index c25955c..33b63bc 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -724,6 +724,44 @@ static void bfq_updated_next_req(struct bfq_data *bfqd, } } +static unsigned int bfq_wr_duration(struct bfq_data *bfqd) +{ + u64 dur; + + if (bfqd->bfq_wr_max_time > 0) + return bfqd->bfq_wr_max_time; + + dur = bfqd->RT_prod; + do_div(dur, bfqd->peak_rate); + + /* +* Limit duration between 3 and 13 seconds. Tests show that +* higher values than 13 seconds often yield the opposite of +* the desired result, i.e., worsen responsiveness by letting +* non-interactive and non-soft-real-time applications +* preserve weight raising for a too long time interval. +* +* On the other end, lower values than 3 seconds make it +* difficult for most interactive tasks to complete their jobs +* before weight-raising finishes. 
+*/ + if (dur > msecs_to_jiffies(13000)) + dur = msecs_to_jiffies(13000); + else if (dur < msecs_to_jiffies(3000)) + dur = msecs_to_jiffies(3000); + + return dur; +} + +/* switch back from soft real-time to interactive weight raising */ +static void switch_back_to_interactive_wr(struct bfq_queue *bfqq, + struct bfq_data *bfqd) +{ + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt; +} + static void bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd, struct bfq_io_cq *bic, bool bfq_already_existing) @@ -750,10 +788,16 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd, if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) || time_is_before_jiffies(bfqq->last_wr_start_finish + bfqq->wr_cur_max_time))) { - bfq_log_bfqq(bfqq->bfqd, bfqq, - "resume state: switching off wr"); - - bfqq->wr_coeff = 1; + if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time && + !bfq_bfqq_in_large_burst(bfqq) && + time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt + +bfq_wr_duration(bfqd))) { + switch_back_to_interactive_wr(bfqq, bfqd); + } else { + bfqq->wr_coeff = 1; + bfq_log_bfqq(bfqq->bfqd, bfqq, +"resume state: switching off wr"); + } } /* make sure weight will be updated, however we got here */ @@ -1173,35 +1217,6 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd, return wr_or_deserves_wr; } -static unsigned int bfq_wr_duration(struct bfq_data *bfqd) -{ - u64 dur; - - if (bfqd->bfq_wr_max_time > 0) - return bfqd->bfq_wr_max_time; - - dur = bfqd->RT_prod; - do_div(dur, bfqd->peak_rate); - - /* -* Limit duration between 3 and 13 seconds. Tests show that -*
[PATCH BUGFIX/IMPROVEMENT 0/4] block, bfq: series of fixes of bugs affecting service guarantees
Hi, the first patch in this series fixes a bug that causes bfq to fail to guarantee a high responsiveness on some drives, if there is heavy random read+write I/O in the background. More precisely, such a failure allowed this bug to be found [1], but the bug may well cause other yet unreported anomalies. This fix uncovered other bugs that were concealed by the fixed bug, for rather subtle reasons. These further bugs caused similar responsiveness failures, but with sequential read+write workloads in the background. The remaining three patches fix these further bugs. The sum of these fixes makes responsiveness much stabler with BFQ. In the presence of write hogs, it is however still impossible for an I/O scheduler to guarantee perfect responsiveness in any circumstance, because of throttling issues in the virtual-memory management subsystem, and in other higher-level components. Thanks, Paolo [1] Background I/O Type: Random - Background I/O mix: Reads and writes - Application to start: LibreOffice Writer in http://www.phoronix.com/scan.php?page=news_item=Linux-4.13-IO-Laptop Paolo Valente (4): block, bfq: fix wrong init of saved start time for weight raising block, bfq: check and switch back to interactive wr also on queue split block, bfq: let early-merged queues be weight-raised on split too block, bfq: decrease burst size when queues in burst exit block/bfq-iosched.c | 169 +++- 1 file changed, 102 insertions(+), 67 deletions(-) -- 2.10.0
Re: [PATCH V8 00/14] mmc: Add Command Queue support
On 13 September 2017 at 13:40, Adrian Hunter wrote: > Hi > > Here is V8 of the hardware command queue patches without the software > command queue patches, now using blk-mq and now with blk-mq support for > non-CQE I/O. > > After the unacceptable debacle of the last release cycle, I expect an > immediate response to these patches. > > HW CMDQ offers 25% - 50% better random multi-threaded I/O. I see a slight > 2% drop in sequential read speed but no change to sequential write. > > Non-CQE blk-mq showed a 3% decrease in sequential read performance. This > seemed to be coming from the inferior latency of running work items compared > with a dedicated thread. Hacking blk-mq workqueue to be unbound reduced the > performance degradation from 3% to 1%. > > While we should look at changing blk-mq to give better workqueue performance, > a bigger gain is likely to be made by adding a new host API to enable the > next already-prepared request to be issued directly from within ->done() > callback of the current request. Adrian, I am reviewing this series, however let me comment on each change individually. I have also run some tests on my ux500 board, enabling the blkmq path via the new MMC Kconfig option. My idea was to run some iozone comparisons between the legacy path and the new blkmq path, but I just couldn't get to that point because of the following errors. I am using a Kingston 4GB SDHC card, which is detected and mounted nicely. However, when I decide to do some writes to the card I get the following errors. root@ME:/mnt/sdcard dd if=/dev/zero of=testfile bs=8192 count=5000 conv=fsync [ 463.714294] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 464.722656] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 466.081481] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 467.111236] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 468.669647] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! 
[ 469.685699] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 471.043334] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 472.052337] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 473.342651] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 474.323760] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 475.544769] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 476.539031] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 477.748474] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! [ 478.724182] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer! I haven't yet got to the point of investigating this any further, and unfortunately I have a busy schedule with traveling next week. I will do my best to look into this as soon as I can. Perhaps you have some ideas? Kind regards Uffe > > > > Changes since V7: > Re-based > mmc: core: Introduce host claiming by context > Slightly simplified > mmc: core: Add parameter use_blk_mq > New patch. > mmc: core: Remove unnecessary host claim > New patch. > mmc: core: Export mmc_start_bkops() > New patch. > mmc: core: Export mmc_start_request() > New patch. > mmc: block: Add CQE and blk-mq support > Add blk-mq support for non_CQE requests > > Changes since V6: > mmc: core: Introduce host claiming by context > New patch. > mmc: core: Move mmc_start_areq() declaration > Dropped because it has been applied > mmc: block: Fix block status codes > Dropped because it has been applied > mmc: host: Add CQE interface > Dropped because it has been applied > mmc: core: Turn off CQE before sending commands > Dropped because it has been applied > mmc: block: Factor out mmc_setup_queue() > New patch. 
> mmc: block: Add CQE support > Drop legacy support and add blk-mq support > > Changes since V5: > Re-based > mmc: core: Add mmc_retune_hold_now() > Dropped because it has been applied > mmc: core: Add members to mmc_request and mmc_data for CQE's > Dropped because it has been applied > mmc: core: Move mmc_start_areq() declaration > New patch at Ulf's request > mmc: block: Fix block status codes > Another un-related patch > mmc: host: Add CQE interface > Move recovery_notifier() callback to struct mmc_request > mmc: core: Add support for handling CQE requests > Roll __mmc_cqe_request_done() into mmc_cqe_request_done() > Move function declarations requested by Ulf > mmc: core: Remove unused MMC_CAP2_PACKED_CMD > Dropped because it has been applied > mmc: block: Add CQE support > Add explanation to commit message > Adjustment for changed recovery_notifier() callback > mmc: cqhci: support for command
Re: I/O hangs after resuming from suspend-to-ram
Martin Steigerwald - 21.09.17, 09:30: > Ming Lei - 21.09.17, 06:20: > > On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote: > > > Ming Lei - 28.08.17, 20:58: > > > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote: > > > > > Hi. > > > > > > > > > > Here is disk setup for QEMU VM: > > > > > > > > > > === > > > > > [root@archmq ~]# smartctl -i /dev/sda > > > > > … > > > > > Device Model: QEMU HARDDISK > > > > > Serial Number:QM1 > > > > > Firmware Version: 2.5+ > > > > > User Capacity:4,294,967,296 bytes [4.29 GB] > > > > > Sector Size: 512 bytes logical/physical > > > > > Device is:Not in smartctl database [for details use: -P > > > > > showall] > > > > > ATA Version is: ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS > > > > > 340-2000 > > > > > Local Time is:Sun Aug 27 09:31:54 2017 CEST > > > > > SMART support is: Available - device has SMART capability. > > > > > SMART support is: Enabled > > > > > > > > > > [root@archmq ~]# lsblk > > > > > NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > > > > sda 8:004G 0 disk > > > > > `-sda18:104G 0 part > > > > > > > > > > `-md0 9:004G 0 raid10 > > > > > > > > > > `-system253:004G 0 crypt > > > > > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > > > > > sdb 8:16 04G 0 disk > > > > > `-sdb18:17 04G 0 part > > > > > > > > > > `-md0 9:004G 0 raid10 > > > > > > > > > > `-system253:004G 0 crypt > > > > > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > > > > > sr0 11:01 1024M 0 rom > > > > > > > > > > [root@archmq ~]# mdadm --misc --detail /dev/md0 > > > > > > > > > > /dev/md0: > > > > > Version : 1.2 > > > > > > > > > > Creation Time : Sat Jul 29 16:37:05 2017 > > > > > > > > > > Raid Level : raid10 > > > > > Array Size : 4191232 (4.00 GiB 4.29 GB) > > > > > > > > > > Used Dev Size : 
4191232 (4.00 GiB 4.29 GB) > > > > > > > > > >Raid Devices : 2 > > > > > > > > > > Total Devices : 2 > > > > > > > > > > Persistence : Superblock is persistent > > > > > > > > > > Update Time : Sun Aug 27 09:30:33 2017 > > > > > > > > > > State : clean > > > > > > > > > > Active Devices : 2 > > > > > > > > > > Working Devices : 2 > > > > > > > > > > Failed Devices : 0 > > > > > > > > > > Spare Devices : 0 > > > > > > > > > > Layout : far=2 > > > > > > > > > > Chunk Size : 512K > > > > > > > > > >Name : archiso:0 > > > > >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e > > > > > > > > > > Events : 485 > > > > > > > > > > Number Major Minor RaidDevice State > > > > > > > > > >0 810 active sync /dev/sda1 > > > > >1 8 171 active sync /dev/sdb1 > > > > > > > > > > === > > > > > > > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on > > > > > it, > > > > > then > > > > > LVM, then ext4 for boot, swap and btrfs for /. > > > > > > > > > > I couldn't reproduce the issue with single disk without RAID. > > > > > > > > Could you verify if the following patch fixes your issue? > > > > Yes, the patch should address this kind of issue, not related > > with RAID specially, and the latest version can be found in the > > > > following link: > > https://marc.info/?l=linux-block=150579298505484=2 > > Thank you. > > So if I understand already I can just add > > https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test > > as an remote and go from there. https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test and checkout my_v4.13-safe-scsi-quiesce_V5_for_test branch of course. -- Martin
Re: I/O hangs after resuming from suspend-to-ram
Ming Lei - 21.09.17, 06:20: > On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote: > > Ming Lei - 28.08.17, 20:58: > > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote: > > > > Hi. > > > > > > > > Here is disk setup for QEMU VM: > > > > > > > > === > > > > [root@archmq ~]# smartctl -i /dev/sda > > > > … > > > > Device Model: QEMU HARDDISK > > > > Serial Number:QM1 > > > > Firmware Version: 2.5+ > > > > User Capacity:4,294,967,296 bytes [4.29 GB] > > > > Sector Size: 512 bytes logical/physical > > > > Device is:Not in smartctl database [for details use: -P > > > > showall] > > > > ATA Version is: ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS > > > > 340-2000 > > > > Local Time is:Sun Aug 27 09:31:54 2017 CEST > > > > SMART support is: Available - device has SMART capability. > > > > SMART support is: Enabled > > > > > > > > [root@archmq ~]# lsblk > > > > NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > > > sda 8:004G 0 disk > > > > `-sda18:104G 0 part > > > > > > > > `-md0 9:004G 0 raid10 > > > > > > > > `-system253:004G 0 crypt > > > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > > > sdb 8:16 04G 0 disk > > > > `-sdb18:17 04G 0 part > > > > > > > > `-md0 9:004G 0 raid10 > > > > > > > > `-system253:004G 0 crypt > > > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > > > sr0 11:01 1024M 0 rom > > > > > > > > [root@archmq ~]# mdadm --misc --detail /dev/md0 > > > > > > > > /dev/md0: > > > > Version : 1.2 > > > > > > > > Creation Time : Sat Jul 29 16:37:05 2017 > > > > > > > > Raid Level : raid10 > > > > Array Size : 4191232 (4.00 GiB 4.29 GB) > > > > > > > > Used Dev Size : 4191232 (4.00 GiB 4.29 GB) > > > > > > > >Raid Devices : 2 > > > > > > > > Total Devices : 2 > > > > > > > > Persistence : Superblock is persistent > > > > > > > > 
Update Time : Sun Aug 27 09:30:33 2017 > > > > > > > > State : clean > > > > > > > > Active Devices : 2 > > > > > > > > Working Devices : 2 > > > > > > > > Failed Devices : 0 > > > > > > > > Spare Devices : 0 > > > > > > > > Layout : far=2 > > > > > > > > Chunk Size : 512K > > > >Name : archiso:0 > > > >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e > > > > > > > > Events : 485 > > > > > > > > Number Major Minor RaidDevice State > > > > > > > >0 810 active sync /dev/sda1 > > > >1 8 171 active sync /dev/sdb1 > > > > > > > > === > > > > > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it, > > > > then > > > > LVM, then ext4 for boot, swap and btrfs for /. > > > > > > > > I couldn't reproduce the issue with single disk without RAID. > > > > > > Could you verify if the following patch fixes your issue? > > Yes, the patch should address this kind of issue, not related > with RAID specially, and the latest version can be found in the > following link: > > https://marc.info/?l=linux-block=150579298505484=2 Thank you. So if I understand already I can just add https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test as a remote and go from there. Thanks, -- Martin