Re: [PATCH 0/2] block/SCSI MQ: two RESTART related patches
On Tue, Oct 17, 2017 at 01:04:16PM +0800, Ming Lei wrote:
> Hi Jens,
>
> The 1st patch runs an idle hctx after a delay in scsi_mq_get_budget(),
> so that we keep the same behaviour as before, and it can be thought of
> as a fix.
>
> The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED
> from the current blk-mq RESTART mechanism because SCSI_MQ can cover its
> restart by itself, so there is no need to handle TAG_SHARED in blk-mq
> RESTART. An IOPS boost of over 20% is observed in my random read test
> over scsi_debug.
>
> John, please test these two patches and see if they improve your SAS
> IO performance; you can find the two patches in the following branch:
>
> https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1

Forgot to mention: you need to either pull the above branch directly or
apply the two patches against the for-next branch of Jens' block tree:

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git #for-next

-- 
Ming
[PATCH 2/2] blk-mq: don't handle TAG_SHARED in restart
Now restart is used in the following cases, and TAG_SHARED is for SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if a resource at the target/host level isn't satisfied, this SCSI
  device will be added to shost->starved_list, and the whole queue will
  be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any
  request initiated from this host/target is completed. Note that
  host-level resources are always respected by blk-mq before running
  .queue_rq().
- the same is true if a resource at the queue level isn't satisfied.
- if there isn't any outstanding request on this queue, then SCSI's
  RESTART can't work (nor can blk-mq's), so the queue will be run after
  SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by
  SCSI's RESTART when this request is finished.

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't any in-progress request on this queue, the queue will
  be run after SCSI_QUEUE_DELAY
- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag
  allocation.

In short, SCSI's built-in RESTART is enough to cover itself, so we don't
need to pay special attention to TAG_SHARED wrt. restart.
Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 78 ++--------------------------------------------
 1 file changed, 4 insertions(+), 74 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index df8581bb0a37..daab27feb653 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -68,25 +68,17 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
-static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 {
 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-		return false;
-
-	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		return;
 
-		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
-	} else
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
 	if (blk_mq_hctx_has_pending(hctx)) {
 		blk_mq_run_hw_queue(hctx, true);
-		return true;
+		return;
 	}
-
-	return false;
 }
 
 /* return true if hctx need to run again */
@@ -385,68 +377,6 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	return true;
 }
 
-/**
- * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list
- * @pos:	loop cursor.
- * @skip:	the list element that will not be examined. Iteration starts at
- *		@skip->next.
- * @head:	head of the list to examine. This list must have at least one
- *		element, namely @skip.
- * @member:	name of the list_head structure within typeof(*pos).
- */
-#define list_for_each_entry_rcu_rr(pos, skip, head, member)		\
-	for ((pos) = (skip);						\
-	     (pos = (pos)->member.next != (head) ? list_entry_rcu(	\
-			(pos)->member.next, typeof(*pos), member) :	\
-		   list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \
-	     (pos) != (skip); )
-
-/*
- * Called after a driver tag has been freed to check whether a hctx needs to
- * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
- * queues in a round-robin fashion if the tag set of @hctx is shared with other
- * hardware queues.
- */
-void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
-{
-	struct blk_mq_tags *const tags = hctx->tags;
-	struct blk_mq_tag_set *const set = hctx->queue->tag_set;
-	struct request_queue *const queue = hctx->queue, *q;
-	struct blk_mq_hw_ctx *hctx2;
-	unsigned int i, j;
-
-	if (set->flags & BLK_MQ_F_TAG_SHARED) {
-		/*
-		 * If this is 0, then we know that no hardware queues
-		 * have RESTART marked. We're done.
-		 */
-		if (!atomic_read(&queue->shared_hctx_restart))
-			return;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
-					   tag_set_list) {
-			queue_for_each_hw_ctx(q, hctx2, i)
-				if (hctx2->tags == tags &&
-				    blk_mq_sched_restart_hctx(hctx2))
-					goto done;
-		}
-		j = hctx->queue_num + 1;
-		for (i = 0; i < queue->nr_hw_queues; i++, j++) {
-			if (j == queue->nr_hw_queues)
-				j = 0;
-			hctx2 =
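The case analysis in the commit message above can be sketched as a small decision table. This is an illustrative userspace model, not kernel code; the enum names and the who_reruns() helper are hypothetical, introduced only to make the coverage argument explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: for each way a request can fail to dispatch,
 * which mechanism is responsible for rerunning the hardware queue? */
enum dispatch_failure {
	GET_BUDGET_FAILED,	/* .get_budget() returned BLK_STS_RESOURCE */
	QUEUE_RQ_FAILED,	/* scsi_dispatch_cmd() returned BLK_STS_RESOURCE */
	DRIVER_TAG_FAILED,	/* blk_mq_get_driver_tag() failed */
};

enum rerun_mechanism {
	SCSI_RESTART,		/* SCSI's built-in restart in scsi_end_request() */
	DELAYED_RUN,		/* blk_mq_delay_run_hw_queue(SCSI_QUEUE_DELAY) */
	TAG_WAITING,		/* BLK_MQ_S_TAG_WAITING wakeup */
};

static enum rerun_mechanism who_reruns(enum dispatch_failure why,
				       bool has_outstanding_reqs)
{
	switch (why) {
	case GET_BUDGET_FAILED:
	case QUEUE_RQ_FAILED:
		/* With no request in flight there is no completion to
		 * trigger SCSI's restart, so a delayed run is needed. */
		return has_outstanding_reqs ? SCSI_RESTART : DELAYED_RUN;
	case DRIVER_TAG_FAILED:
		return TAG_WAITING;
	}
	return DELAYED_RUN;	/* not reached */
}
```

The point of the table is that none of the rows needs blk-mq's TAG_SHARED restart handling, which is why the patch can remove it.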
[PATCH 1/2] SCSI: run idle hctx after delay in scsi_mq_get_budget()
If there isn't any outstanding request in this queue, both blk-mq's
RESTART and SCSI's built-in RESTART can't work, so we have to deal with
this case by running this queue after a delay.

Fixes: d04b6d97d0a1 ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6f10afaca25b..08c495364b28 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1959,6 +1959,14 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	put_device(&sdev->sdev_gendev);
 }
 
+static void scsi_mq_run_idle_hctx(struct scsi_device *sdev,
+				  struct blk_mq_hw_ctx *hctx)
+{
+	if (atomic_read(&sdev->device_busy) == 0 &&
+	    !scsi_device_blocked(sdev))
+		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+}
+
 static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -1989,6 +1997,7 @@ static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 out_put_device:
 	put_device(&sdev->sdev_gendev);
 out:
+	scsi_mq_run_idle_hctx(sdev, hctx);
 	return BLK_STS_RESOURCE;
 }
 
@@ -2039,9 +2048,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		scsi_mq_run_idle_hctx(sdev, hctx);
 		break;
 	default:
 		/*
-- 
2.9.5
[PATCH 0/2] block/SCSI MQ: two RESTART related patches
Hi Jens,

The 1st patch runs an idle hctx after a delay in scsi_mq_get_budget(),
so that we keep the same behaviour as before, and it can be thought of
as a fix.

The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED
from the current blk-mq RESTART mechanism because SCSI_MQ can cover its
restart by itself, so there is no need to handle TAG_SHARED in blk-mq
RESTART. An IOPS boost of over 20% is observed in my random read test
over scsi_debug.

John, please test these two patches and see if they improve your SAS IO
performance; you can find the two patches in the following branch:

https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1

Ming Lei (2):
  SCSI: run idle hctx after delay in scsi_mq_get_budget()
  blk-mq: don't handle TAG_SHARED in restart

 block/blk-mq-sched.c    | 78 ++++---------------------------------------------
 drivers/scsi/scsi_lib.c | 13 +++++++++---
 2 files changed, 14 insertions(+), 77 deletions(-)

-- 
2.9.5
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
Bart,

> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.

The SCSI bits look fine to me.

Acked-by: Martin K. Petersen

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
Bart,

> Convert blk_get_request(q, op, __GFP_RECLAIM) into
> blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not
> change any functionality.

Acked-by: Martin K. Petersen

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [RFC] bsg-lib interface cleanup
Christoph,

> this series cleans up various abuses of the bsg interfaces, and then
> splits bsg for SCSI passthrough from bsg for arbitrary transport
> passthrough. This removes the scsi_request abuse in bsg-lib that is
> very confusing, and also makes sure we can sanity check the requests
> we get. The current code will happily execute scsi commands against
> bsg-lib queues, and transport pass through against scsi nodes, without
> any indication to the driver that we are doing the wrong thing.

I applied patches 2-5 to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
[bug report] A race between blk_cleanup_queue and blk_timeout_work
Hi Jens, Christoph,

There is a scenario: when we unplug disks while IO is running on them,
we find that IO is blocked forever, as follows:

..
Jobs: 3 (f=3): [M_MM__] [89.7% done] [0K/0K /s] [0 /0 iops] [eta 00m:36s]
..

I find there is a race between blk_cleanup_queue and blk_timeout_work
(kernel is 4.14.0-rc1):

(1) Disk-removal process

When a disk is unplugged, scsi_remove_target is called to delete it:

scsi_remove_target -> __scsi_remove_target -> scsi_remove_device ->
__scsi_remove_device -> blk_cleanup_queue -> blk_freeze_queue ->
... -> __blk_drain_queue

scsi_remove_target calls blk_cleanup_queue, which calls blk_freeze_queue
and __blk_drain_queue. In blk_freeze_queue, for !blk_mq (our driver
satisfies this) it kills q->q_usage_counter. __blk_drain_queue is a loop
with condition=true; the function only exits when drain == 0. If all IOs
have ended it exits; otherwise it waits and re-checks the number of
unfinished IOs every 10ms.

(2) Timeout process

For every IO from the block layer, on timeout blk_timeout_work is called.
blk_timeout_work first checks blk_queue_enter, which tries to get
q->q_usage_counter; if that fails, it returns directly and never enters
the timeout handling.

So on disk unplug, the removal process kills q->q_usage_counter in
blk_cleanup_queue. If there are IOs which have not finished, they wait
for the timeout; when the timeout fires, blk_timeout_work tries to get
q->q_usage_counter, but since it was already killed in blk_freeze_queue,
this fails, the timeout handling is skipped, and those IOs are never
processed. Meanwhile __blk_drain_queue loops forever because those IOs
never end.

I added a printk in blk_timeout_work as follows; when this issue occurs,
I can see the printk happen:

void blk_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);
	unsigned long flags, next = 0;
	struct request *rq, *tmp;
	int next_set = 0;

	if (blk_queue_enter(q, true)) {
		pr_err("%s %d\n", __func__, __LINE__); /* <- I add printk here */
		return;
	}

	spin_lock_irqsave(q->queue_lock, flags);

	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
		blk_rq_check_expired(rq, &next, &next_set);

	if (next_set)
		mod_timer(&q->timeout, round_jiffies_up(next));

	spin_unlock_irqrestore(q->queue_lock, flags);
	blk_queue_exit(q);
}

regards,
shawn
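The deadlock described above can be condensed into a userspace model. This is a hypothetical sketch, not kernel code: the struct and helpers only mimic the interaction between the killed q_usage_counter, the bailing timeout worker, and the drain loop's exit condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the race: once the queue's usage counter has
 * been "killed" by blk_freeze_queue(), blk_queue_enter() fails, so the
 * timeout worker returns without completing expired requests, and
 * __blk_drain_queue() can never see the outstanding count reach zero. */
struct queue_model {
	bool usage_counter_killed;	/* percpu_ref killed by blk_freeze_queue() */
	int outstanding_requests;	/* requests __blk_drain_queue() waits on */
};

static bool queue_enter(const struct queue_model *q)
{
	/* percpu_ref_tryget_live() fails after the ref has been killed */
	return !q->usage_counter_killed;
}

/* Returns true if expired requests were actually processed. */
static bool timeout_work(struct queue_model *q)
{
	if (!queue_enter(q))
		return false;		/* bails out: timed-out IOs never end */
	q->outstanding_requests = 0;	/* requests completed via timeout path */
	return true;
}

static bool drain_would_finish(const struct queue_model *q)
{
	return q->outstanding_requests == 0;
}
```

Running the model with the counter killed first shows timeout_work() giving up while outstanding requests remain, which is exactly why __blk_drain_queue spins forever.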
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> On 10/13/2017 07:29 PM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue
> >>>>> depth is 3,
> >>>>
> >>>> Sorry but I doubt whether that is correct. More in general, I don't
> >>>> know any modern storage HBA for which the default queue depth is so
> >>>> low.
> >>>
> >>> You can grep:
> >>>
> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>
> >> Such a low queue depth will result in suboptimal performance for
> >> adapters that communicate over a storage network. I think that's a bug
> >> and that both adapters support much higher cmd_per_lun values.
> >>
> >> (+James Smart)
> >>
> >> James, can you explain us why commit 445cf4f4d2aa decreased
> >> LPFC_CMD_PER_LUN from 30 to 3? Was that perhaps a workaround for a bug
> >> in a specific target implementation?
> >>
> >> (+Himanshu Madhani)
> >>
> >> Himanshu, do you perhaps know whether it is safe to increase
> >> cmd_per_lun for the qla2xxx initiator driver to the
> >> scsi_host->can_queue value?
> >
> > ->can_queue is the size of the whole tag space shared by all LUNs, so
> > it doesn't look reasonable to increase cmd_per_lun to .can_queue.
> >
> '3' is just a starting point; later on it'll be adjusted via
> scsi_change_depth(). Looks like it's not working correctly with blk-mq,
> though.

By default, in scsi_alloc_sdev(), q->queue_depth is set to
host->cmd_per_lun. You are right that q->queue_depth can be adjusted
later too; q->queue_depth is respected in scsi_dev_queue_ready().

.cmd_per_lun defines the max outstanding cmds for each LUN; I guess it
is respected by some hardware internally. For example, I remember that
on lpfc q->queue_depth is 30 because the default 'lpfc_lun_queue_depth'
is 30, while its .cmd_per_lun is 3. Per my observation, this
.cmd_per_lun limit is still workable.

-- 
Ming
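The relationship discussed above, where cmd_per_lun only seeds the per-LUN depth that is later adjusted and enforced at dispatch time, can be sketched as a toy model. All names here are hypothetical stand-ins; the real logic lives in scsi_alloc_sdev(), scsi_change_queue_depth(), and scsi_dev_queue_ready().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: cmd_per_lun is only the initial per-LUN queue
 * depth; the depth may be raised later (e.g. lpfc's default
 * lpfc_lun_queue_depth of 30 vs. its .cmd_per_lun of 3), and dispatch
 * is gated on the current depth, not on cmd_per_lun. */
struct sdev_model {
	int queue_depth;	/* current per-LUN limit */
	int device_busy;	/* outstanding commands on this LUN */
};

static void sdev_alloc(struct sdev_model *sdev, int cmd_per_lun)
{
	sdev->queue_depth = cmd_per_lun;	/* starting point only */
	sdev->device_busy = 0;
}

static void change_queue_depth(struct sdev_model *sdev, int depth)
{
	sdev->queue_depth = depth;		/* adjusted after probe */
}

static bool dev_queue_ready(const struct sdev_model *sdev)
{
	return sdev->device_busy < sdev->queue_depth;
}
```

With the lpfc numbers from the thread, a LUN seeded at depth 3 stalls with 3 commands in flight, but dispatch resumes once the depth is raised to 30.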
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On Mon, Oct 16, 2017 at 04:29:04PM -0700, Bart Van Assche wrote:
> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.
>
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why while
> the hibernation image is being written or restored that SCSI devices
> are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid that a full queue prevents power management requests
> to be submitted by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
>
> for d in /sys/class/block/sd*[a-z]; do
>     hcil=$(readlink "$d/device")
>     hcil=${hcil#../../../}
>     echo 4 > "$d/queue/nr_requests"
>     echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
>     --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>     --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
>
> Reported-by: Oleksandr Natalenko
> References: "I/O hangs after resuming from suspend-to-ram"
> (https://marc.info/?l=linux-block&m=150340235201348)
> Signed-off-by: Bart Van Assche
> Tested-by: Martin Steigerwald
> Cc: Martin K. Petersen
> Cc: Ming Lei
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Johannes Thumshirn
> ---
>  block/blk-core.c           | 42 +++++++++++++++++++++++++++++++-------
>  block/blk-mq.c             |  4 ++--
>  block/blk-timeout.c        |  2 +-
>  drivers/scsi/scsi_lib.c    | 40 +++++++++++++++++++++++++++---------
>  fs/block_dev.c             |  4 ++--
>  include/linux/blkdev.h     |  2 +-
>  include/scsi/scsi_device.h |  1 +
>  7 files changed, 71 insertions(+), 24 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b73203286bf5..abb5095bcef6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> +	wake_up_all(&q->mq_freeze_wq);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
> @@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + */
> +int blk_queue_enter(struct request_queue *q, unsigned int flags)
>  {
> +	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> +
>  	while (true) {
> +		bool success = false;
>  		int ret;
>  
> -		if (percpu_ref_tryget_live(&q->q_usage_counter))
> +		rcu_read_lock_sched();
> +		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> +			/*
> +			 * The code that sets the PREEMPT_ONLY flag is
> +			 * responsible for ensuring that that flag is globally
> +			 * visible before the queue is unfrozen.
> +			 */
> +			if (preempt || !blk_queue_preempt_only(q)) {
> +				success = true;
> +			} else {
> +				percpu_ref_put(&q->q_usage_counter);
> +				WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
> +					  kobject_name(q->kobj.parent));
> +			}
> +		}
> +		rcu_read_unlock_sched();
> +
> +		if (success)
>  			return 0;
>  
> -		if (nowait)
> +		if (flags & BLK_MQ_REQ_NOWAIT)
>  			return -EBUSY;
>  
>  		/*
> @@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>  		smp_rmb();
>  
>  		ret = wait_event_interruptible(q->mq_freeze_wq, >
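The gate that this patch adds to blk_queue_enter() can be modeled with a few lines of userspace C. This is a hypothetical sketch of the control flow only (struct and function names are invented); the real code additionally takes an RCU read lock and drops the percpu ref when it refuses a non-preempt request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the blk_queue_enter() gate: once the usage
 * counter ref is obtained, a request proceeds only if it is a preempt
 * request or the queue is not in preempt-only mode. */
struct queue_gate {
	bool ref_live;		/* q_usage_counter not frozen */
	bool preempt_only;	/* QUEUE_FLAG_PREEMPT_ONLY set */
};

static bool queue_enter_ok(const struct queue_gate *q, bool preempt)
{
	if (!q->ref_live)
		return false;			/* percpu_ref_tryget_live() failed */
	return preempt || !q->preempt_only;	/* non-preempt requests deferred */
}
```

This is what keeps a full queue of normal IO from starving power-management (RQF_PREEMPT) requests: while quiesced, only preempt requests pass the gate.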
Re: [PATCH v2] blk-mq: Make blk_mq_get_request() error path less confusing
On Mon, Oct 16, 2017 at 04:32:26PM -0700, Bart Van Assche wrote:
> blk_mq_get_tag() can modify data->ctx. This means that in the
> error path of blk_mq_get_request() data->ctx should be passed to
> blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
> ignores its argument, this patch does not change any functionality.
>
> References: commit 1ad43c0078b7 ("blk-mq: don't leak preempt
> counter/q_usage_counter when allocating rq failed")
> Signed-off-by: Bart Van Assche
> Cc: Ming Lei
> ---
> Changes compared to v1:
> - Removed "Cc: "
> - Removed "bug fix" from the description.
>
>  block/blk-mq.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 097ca3ece716..d9b3b5dc66ab 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -336,12 +336,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  	struct elevator_queue *e = q->elevator;
>  	struct request *rq;
>  	unsigned int tag;
> -	struct blk_mq_ctx *local_ctx = NULL;
> +	bool put_ctx_on_error = false;
>  
>  	blk_queue_enter_live(q);
>  	data->q = q;
> -	if (likely(!data->ctx))
> -		data->ctx = local_ctx = blk_mq_get_ctx(q);
> +	if (likely(!data->ctx)) {
> +		data->ctx = blk_mq_get_ctx(q);
> +		put_ctx_on_error = true;
> +	}
>  	if (likely(!data->hctx))
>  		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  	if (op & REQ_NOWAIT)
> @@ -360,8 +362,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  
>  	tag = blk_mq_get_tag(data);
>  	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (local_ctx) {
> -			blk_mq_put_ctx(local_ctx);
> +		if (put_ctx_on_error) {
> +			blk_mq_put_ctx(data->ctx);
>  			data->ctx = NULL;
>  		}
>  		blk_queue_exit(q);
> -- 
> 2.14.2

Reviewed-by: Ming Lei

-- 
Ming
Re: [PATCH v2] block/aoe: Convert timers to use timer_setup()
On Fri, Oct 6, 2017 at 7:19 AM, Jens Axboe wrote:
> On 10/05/2017 05:13 PM, Kees Cook wrote:
>> In preparation for unconditionally passing the struct timer_list pointer to
>> all timer callbacks, switch to using the new timer_setup() and from_timer()
>> to pass the timer pointer explicitly.
>
> Applied to for-4.15/timer

Hi, I just wanted to check what your timer plans were for merging this
into -next. (I'm doing rebasing to find out which maintainers I need to
resend patches to, and I noticed block hasn't appeared in -next, but I
know you've pulled patches...)

Thanks!

-Kees

-- 
Kees Cook
Pixel Security
[PATCH v2] blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the error path
of blk_mq_get_request() data->ctx should be passed to blk_mq_put_ctx()
instead of local_ctx. Note: since blk_mq_put_ctx() ignores its argument,
this patch does not change any functionality.

References: commit 1ad43c0078b7 ("blk-mq: don't leak preempt
counter/q_usage_counter when allocating rq failed")
Signed-off-by: Bart Van Assche
Cc: Ming Lei
---
Changes compared to v1:
- Removed "Cc: "
- Removed "bug fix" from the description.

 block/blk-mq.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 097ca3ece716..d9b3b5dc66ab 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -336,12 +336,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
 	unsigned int tag;
-	struct blk_mq_ctx *local_ctx = NULL;
+	bool put_ctx_on_error = false;
 
 	blk_queue_enter_live(q);
 	data->q = q;
-	if (likely(!data->ctx))
-		data->ctx = local_ctx = blk_mq_get_ctx(q);
+	if (likely(!data->ctx)) {
+		data->ctx = blk_mq_get_ctx(q);
+		put_ctx_on_error = true;
+	}
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 	if (op & REQ_NOWAIT)
@@ -360,8 +362,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
 	tag = blk_mq_get_tag(data);
 	if (tag == BLK_MQ_TAG_FAIL) {
-		if (local_ctx) {
-			blk_mq_put_ctx(local_ctx);
+		if (put_ctx_on_error) {
+			blk_mq_put_ctx(data->ctx);
 			data->ctx = NULL;
 		}
 		blk_queue_exit(q);
-- 
2.14.2
[PATCH v9 01/10] md: Rename md_notifier into md_reboot_notifier
This avoids confusion with the pm notifier that will be added through a
later patch.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
---
 drivers/md/md.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5d61049e7417..8933cafc212d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this,
 	return NOTIFY_DONE;
 }
 
-static struct notifier_block md_notifier = {
+static struct notifier_block md_reboot_notifier = {
 	.notifier_call	= md_notify_reboot,
 	.next		= NULL,
 	.priority	= INT_MAX, /* before any real devices */
@@ -9003,7 +9003,7 @@ static int __init md_init(void)
 	blk_register_region(MKDEV(mdp_major, 0), 1UL<
[PATCH v9 08/10] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
This flag will be used in the next patch to let the block layer core
know whether or not a SCSI request queue has been quiesced. A quiesced
SCSI queue namely only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 30 ++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.c |  1 +
 include/linux/blkdev.h |  6 ++++++
 3 files changed, 37 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ab835d3c2f39..b73203286bf5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,36 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+/**
+ * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * @q: request queue pointer
+ *
+ * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
+ * set and 1 if the flag was already set.
+ */
+int blk_set_preempt_only(struct request_queue *q)
+{
+	unsigned long flags;
+	int res;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return res;
+}
+EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+
+void blk_clear_preempt_only(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q:	The queue to run
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7f4a1ba532af..75f31535f280 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -74,6 +74,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(REGISTERED),
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
+	QUEUE_FLAG_NAME(PREEMPT_ONLY),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 05203175eb9c..864ad2e4a58c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -630,6 +630,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED	26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED	28	/* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -730,6 +731,11 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
			     REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)				\
+	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern int blk_set_preempt_only(struct request_queue *q);
+extern void blk_clear_preempt_only(struct request_queue *q);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.2
[PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
Convert blk_get_request(q, op, __GFP_RECLAIM) into
blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not
change any functionality.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Acked-by: David S. Miller [ for IDE ]
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/ide/ide-pm.c    | 4 ++--
 drivers/scsi/scsi_lib.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f56d742908df 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+	rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
+				   BLK_MQ_REQ_PREEMPT);
 	ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-	rq->rq_flags |= RQF_PREEMPT;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_RESUME;
 	rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6f10afaca25b..7c119696402c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
+	req = blk_get_request_flags(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->retries = retries;
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
-	req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+	req->rq_flags |= rq_flags | RQF_QUIET;
 
 	/*
 	 * head injection *required* here otherwise quiesce won't work
-- 
2.14.2
[PATCH v9 06/10] block: Introduce BLK_MQ_REQ_PREEMPT
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 4 +++-
 block/blk-mq.c         | 2 ++
 include/linux/blk-mq.h | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 559818be8e02..ab835d3c2f39 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1260,6 +1260,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	blk_rq_set_rl(rq, rl);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;
+	if (flags & BLK_MQ_REQ_PREEMPT)
+		rq->rq_flags |= RQF_PREEMPT;
 
 	/* init elvpriv */
 	if (rq_flags & RQF_ELVPRIV) {
@@ -1441,7 +1443,7 @@ struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
 	struct request *req;
 
 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 59b7de6b616b..6a025b17caac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -290,6 +290,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->q = data->q;
 	rq->mq_ctx = data->ctx;
 	rq->cmd_flags = op;
+	if (data->flags & BLK_MQ_REQ_PREEMPT)
+		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
 		rq->rq_flags |= RQF_IO_STAT;
 	/* do not touch atomic flags, it needs atomic ops against the timer */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e5e6becd57d3..22c7f36745fc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -213,6 +213,7 @@ enum {
 	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
 	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
 	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
+	BLK_MQ_REQ_PREEMPT	= (1 << 3), /* set RQF_PREEMPT */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.14.2
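The translation the patch performs, from the allocation-time flag BLK_MQ_REQ_PREEMPT to the per-request flag RQF_PREEMPT, is a simple bit mapping. Below is a hedged standalone sketch; the M_-prefixed macros are invented stand-ins (the RQF_PREEMPT bit position in particular is illustrative, not the kernel's actual value).

```c
#include <assert.h>

/* Hypothetical model of the mapping added in __get_request() and
 * blk_mq_rq_ctx_init(): an allocation flag becomes a request flag. */
#define M_BLK_MQ_REQ_NOWAIT	(1u << 0)
#define M_BLK_MQ_REQ_PREEMPT	(1u << 3)
#define M_RQF_PREEMPT		(1u << 8)	/* illustrative bit position */

static unsigned int rq_flags_from_alloc_flags(unsigned int flags)
{
	unsigned int rq_flags = 0;

	if (flags & M_BLK_MQ_REQ_PREEMPT)
		rq_flags |= M_RQF_PREEMPT;	/* only PREEMPT is propagated */
	return rq_flags;
}
```

Keeping the translation at allocation time is what lets the next patch in the series gate blk_queue_enter() on the same flag before a request even exists.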
[PATCH v9 02/10] md: Introduce md_stop_all_writes()
Introduce md_stop_all_writes() because the next patch will add a second
caller for this function. This patch does not change any functionality.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
---
 drivers/md/md.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8933cafc212d..b99584e5d6b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
-static int md_notify_reboot(struct notifier_block *this,
-			    unsigned long code, void *x)
+static void md_stop_all_writes(void)
 {
 	struct list_head *tmp;
 	struct mddev *mddev;
@@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block *this,
 	 */
 	if (need_delay)
 		mdelay(1000*1);
+}
+
+static int md_notify_reboot(struct notifier_block *this,
+			    unsigned long code, void *x)
+{
+	md_stop_all_writes();
 
 	return NOTIFY_DONE;
 }
-- 
2.14.2
[PATCH v9 05/10] block: Introduce blk_get_request_flags()
A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed from
GFP_KERNEL into __GFP_DIRECT_RECLAIM.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 50 +++---
 include/linux/blkdev.h |  3 +++
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 23a0110fc62b..559818be8e02 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1157,7 +1157,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
  * @rl: request list to allocate from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags
  *
  * Get a free request from @q. This function may fail under memory
  * pressure or if @q is dead.
@@ -1167,7 +1167,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-		struct bio *bio, gfp_t gfp_mask)
+		struct bio *bio, unsigned int flags)
 {
 	struct request_queue *q = rl->q;
 	struct request *rq;
@@ -1176,6 +1176,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	struct io_cq *icq = NULL;
 	const bool is_sync = op_is_sync(op);
 	int may_queue;
+	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+		__GFP_DIRECT_RECLAIM;
 	req_flags_t rq_flags = RQF_ALLOCED;

 	lockdep_assert_held(q->queue_lock);
@@ -1336,7 +1338,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * @q: request_queue to allocate request from
 * @op: operation and flags
 * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags.
 *
 * Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
 * this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1346,7 +1348,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-		struct bio *bio, gfp_t gfp_mask)
+		struct bio *bio, unsigned int flags)
 {
 	const bool is_sync = op_is_sync(op);
 	DEFINE_WAIT(wait);
@@ -1358,7 +1360,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
-	rq = __get_request(rl, op, bio, gfp_mask);
+	rq = __get_request(rl, op, bio, flags);
 	if (!IS_ERR(rq))
 		return rq;
@@ -1367,7 +1369,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 		return ERR_PTR(-EAGAIN);
 	}
-	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
+	if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
 	}
@@ -1394,10 +1396,13 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	goto retry;
 }

+/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-		unsigned int op, gfp_t gfp_mask)
+		unsigned int op, unsigned int flags)
 {
 	struct request *rq;
+	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+		__GFP_DIRECT_RECLAIM;
 	int ret = 0;

 	WARN_ON_ONCE(q->mq_ops);
@@ -1410,7 +1415,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	if (ret)
 		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
-	rq = get_request(q, op, NULL, gfp_mask);
+	rq = get_request(q, op, NULL, flags);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
 		blk_queue_exit(q);
@@ -1424,25 +1429,40 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }

-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-		gfp_t gfp_mask)
+/**
+ * blk_get_request_flags - allocate a request
+ * @q: request queue to allocate a request for
+ * @op: operation (REQ_OP_*) and REQ_* flags, e.g.
[PATCH v9 04/10] block: Make q_usage_counter also track legacy requests
From: Ming Lei

This patch makes it possible to pause request allocation for the legacy
block layer by calling blk_mq_freeze_queue() and blk_mq_unfreeze_queue().

Signed-off-by: Ming Lei
[ bvanassche: Combined two patches into one, edited a comment and made
  sure REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Tested-by: Martin Steigerwald
Cc: Ming Lei
Cc: Hannes Reinecke
---
 block/blk-core.c | 12
 block/blk-mq.c   | 10 ++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e8e149ccc86b..23a0110fc62b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,9 @@ void blk_set_queue_dying(struct request_queue *q)
 		}
 		spin_unlock_irq(q->queue_lock);
 	}
+
+	/* Make blk_queue_enter() reexamine the DYING flag. */
+	wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);

@@ -1395,16 +1398,22 @@ static struct request *blk_old_get_request(struct request_queue *q,
 		unsigned int op, gfp_t gfp_mask)
 {
 	struct request *rq;
+	int ret = 0;

 	WARN_ON_ONCE(q->mq_ops);

 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);

+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+			      (op & REQ_NOWAIT));
+	if (ret)
+		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
+		blk_queue_exit(q);
 		return rq;
 	}

@@ -1576,6 +1585,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_free_request(rl, req);
 		freed_request(rl, sync, rq_flags);
 		blk_put_rl(rl);
+		blk_queue_exit(q);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);

@@ -1857,8 +1867,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
+	blk_queue_enter_live(q);
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		blk_queue_exit(q);
 		__wbt_done(q->rq_wb, wb_acct);
 		if (PTR_ERR(req) == -ENOMEM)
 			bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 097ca3ece716..59b7de6b616b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);

@@ -255,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	queue_for_each_hw_ctx(q, hctx, i)
 		if (blk_mq_hw_queue_mapped(hctx))
 			blk_mq_tag_wakeup_all(hctx->tags, true);
-
-	/*
-	 * If we are called because the queue has now been marked as
-	 * dying, we need to ensure that processes currently waiting on
-	 * the queue are notified as well.
-	 */
-	wake_up_all(&q->mq_freeze_wq);
 }

 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
--
2.14.2
[PATCH v9 10/10] block, nvme: Introduce blk_mq_req_flags_t
Several block layer and NVMe core functions accept a combination of
BLK_MQ_REQ_* flags through the 'flags' argument but there is no
verification at compile time whether the right type of block layer flags
is passed. Make it possible for sparse to verify this. This patch does
not change any functionality.

Signed-off-by: Bart Van Assche
Cc: linux-n...@lists.infradead.org
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Ming Lei
---
 block/blk-core.c          | 12 ++--
 block/blk-mq.c            |  4 ++--
 block/blk-mq.h            |  2 +-
 drivers/nvme/host/core.c  |  5 +++--
 drivers/nvme/host/nvme.h  |  5 +++--
 include/linux/blk-mq.h    | 17 +++--
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h    |  4 ++--
 8 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abb5095bcef6..d1f1694f2e1f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -799,7 +799,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
 * @q: request queue pointer
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
 */
-int blk_queue_enter(struct request_queue *q, unsigned int flags)
+int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
 	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

@@ -1224,7 +1224,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-				     struct bio *bio, unsigned int flags)
+				     struct bio *bio, blk_mq_req_flags_t flags)
 {
 	struct request_queue *q = rl->q;
 	struct request *rq;
@@ -1407,7 +1407,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-		struct bio *bio, unsigned int flags)
+		struct bio *bio, blk_mq_req_flags_t flags)
 {
 	const bool is_sync = op_is_sync(op);
 	DEFINE_WAIT(wait);
@@ -1457,7 +1457,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-		unsigned int op, unsigned int flags)
+		unsigned int op, blk_mq_req_flags_t flags)
 {
 	struct request *rq;
 	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
@@ -1494,7 +1494,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 * @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
 */
 struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
-				      unsigned int flags)
+				      blk_mq_req_flags_t flags)
 {
 	struct request *req;
@@ -2290,7 +2290,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bio->bi_disk->queue;
-		unsigned int flags = bio->bi_opf & REQ_NOWAIT ?
+		blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
 			BLK_MQ_REQ_NOWAIT : 0;

 		if (likely(blk_queue_enter(q, flags) == 0)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c6bff60e6b8b..c037b1ad64a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -380,7 +380,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 }

 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-		unsigned int flags)
+		blk_mq_req_flags_t flags)
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
@@ -406,7 +406,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 EXPORT_SYMBOL(blk_mq_alloc_request);

 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
-	unsigned int op, unsigned int flags, unsigned int hctx_idx)
+	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 522b420dedc0..5dcfe4fa5e0d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -110,7 +110,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
-	unsigned int flags;
+	blk_mq_req_flags_t flags;
 	unsigned int shallow_depth;
[PATCH v9 03/10] md: Neither resync nor reshape while the system is frozen
Some people use the md driver on laptops and use the suspend and resume
functionality. Since it is essential that submitting of new I/O requests
stops before a hibernation image is created, interrupt the md resync and
reshape actions if the system is being frozen. Note: the resync and
reshape will restart after the system is resumed and a message similar
to the following will appear in the system log:

md: md0: data-check interrupted.

Signed-off-by: Bart Van Assche
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/md/md.c | 50 +-
 drivers/md/md.h |  8
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b99584e5d6b1..8b2fc750f97f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include "md.h"
@@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
 	 */
 	allow_signal(SIGKILL);
+	set_freezable();
 	while (!kthread_should_stop()) {

 		/* We need to wait INTERRUPTIBLE so that
@@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
 		if (signal_pending(current))
 			flush_signals(current);
-		wait_event_interruptible_timeout
+		wait_event_freezable_timeout
 		(thread->wqueue,
 		 test_bit(THREAD_WAKEUP, &thread->flags)
 		 || kthread_should_stop() || kthread_should_park(),
@@ -8944,6 +8947,8 @@ static void md_stop_all_writes(void)
 	int need_delay = 0;

 	for_each_mddev(mddev, tmp) {
+		mddev->frozen_before_suspend =
+			test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (mddev_trylock(mddev)) {
 			if (mddev->pers)
 				__md_stop_writes(mddev);
@@ -8963,6 +8968,47 @@ static void md_stop_all_writes(void)
 		mdelay(1000*1);
 }

+static void md_restore_frozen_flag(void)
+{
+	struct list_head *tmp;
+	struct mddev *mddev;
+
+	for_each_mddev(mddev, tmp) {
+		if (mddev->frozen_before_suspend)
+			set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		else
+			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+	}
+}
+
+/*
+ * Ensure that neither resyncing nor reshaping occurs while the system is
+ * frozen.
+ */
+static int md_notify_pm(struct notifier_block *bl, unsigned long state,
+			void *unused)
+{
+	pr_debug("%s: state = %ld\n", __func__, state);
+
+	switch (state) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+	case PM_RESTORE_PREPARE:
+		md_stop_all_writes();
+		break;
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+	case PM_POST_RESTORE:
+		md_restore_frozen_flag();
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block md_pm_notifier = {
+	.notifier_call = md_notify_pm,
+};
+
 static int md_notify_reboot(struct notifier_block *this,
 			    unsigned long code, void *x)
 {
@@ -9009,6 +9055,7 @@ static int __init md_init(void)
 			    md_probe, NULL, NULL);

 	register_reboot_notifier(&md_reboot_notifier);
+	register_pm_notifier(&md_pm_notifier);
 	raid_table_header = register_sysctl_table(raid_root_table);

 	md_geninit();
@@ -9248,6 +9295,7 @@ static __exit void md_exit(void)
 	unregister_blkdev(MD_MAJOR,"md");
 	unregister_blkdev(mdp_major, "mdp");
+	unregister_pm_notifier(&md_pm_notifier);
 	unregister_reboot_notifier(&md_reboot_notifier);
 	unregister_sysctl_table(raid_table_header);

diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8287d3cd1bf..727b3aef4d26 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -356,6 +356,14 @@ struct mddev {
 	 */
 	int recovery_disabled;

+	/* Writes are stopped before hibernation, suspend and restore by
+	 * calling md_stop_all_writes(). That function sets the
+	 * MD_RECOVERY_FROZEN flag. The value of that flag is saved before
+	 * writes are stopped such that it can be restored after hibernation,
+	 * suspend or restore have finished.
+	 */
+	bool	frozen_before_suspend;
+
 	int in_sync;	/* know to not need resync */
 	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
	 * that we are never stopping an array while it is open.
--
2.14.2
[PATCH v9 00/10] block, scsi, md: Improve suspend and resume
Hello Jens,

It is known that during the resume following a hibernate, especially
when using an md RAID1 array created on top of SCSI devices, the system
sometimes hangs instead of coming up properly. This patch series fixes
that problem. These patches have been tested on top of the block layer
for-next branch. Please consider these changes for kernel v4.15.

Thanks,

Bart.

Changes between v8 and v9:
- Modified the third patch such that the MD_RECOVERY_FROZEN flag is
  restored properly after a resume.
- Modified the ninth patch such that a kernel warning is reported if
  scsi_device_quiesce() is called from multiple contexts concurrently.
- Added Reviewed-by / Tested-by tags as appropriate.

Changes between v7 and v8:
- Fixed a (theoretical?) race identified by Ming Lei.
- Added a tenth patch that checks whether the proper type of flags has
  been passed to a range of block layer functions.

Changes between v6 and v7:
- Added support for the PREEMPT_ONLY flag in blk-mq-debugfs.c.
- Fixed the kerneldoc header of blk_queue_enter().
- Added a rcu_read_lock_sched() / rcu_read_unlock_sched() pair in
  blk_set_preempt_only().
- Removed a synchronize_rcu() call from scsi_device_quiesce().
- Modified the description of patch 9/9 in this series.
- Removed the scsi_run_queue() call from scsi_device_resume().

Changes between v5 and v6:
- Split an md patch into two patches to make it easier to review the
  changes.
- For the md patch that suspends I/O while the system is frozen,
  switched back to the freezer mechanism because this makes the code
  shorter and easier to review.
- Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into
  EXPORT_SYMBOL_GPL().
- Made blk_set_preempt_only() behave as a test-and-set operation.
- Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as
  requested by Christoph and reduced the number of arguments of
  blk_queue_enter() back from three to two.
- In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of
  a critical section. Made the explanation of why the synchronize_rcu()
  call is necessary more detailed.

Changes between v4 and v5:
- Split blk_set_preempt_only() into two functions as requested by
  Christoph.
- Made blk_get_request() trigger WARN_ONCE() if an attempt is made to
  allocate a request while the system is frozen. This is a big help to
  identify drivers that submit I/O while the system is frozen.
- Since kernel thread freezing is on its way out, reworked the approach
  for avoiding that the md driver submits I/O while the system is
  frozen such that the approach no longer depends on the kernel thread
  freeze mechanism.
- Fixed the (theoretical) deadlock in scsi_device_quiesce() that was
  identified by Ming.

Changes between v3 and v4:
- Made sure that this patch series not only works for scsi-mq but also
  for the legacy SCSI stack.
- Removed an smp_rmb()/smp_wmb() pair from the hot path and added a
  comment that explains why that is safe.
- Reordered the patches in this patch series.

Changes between v2 and v3:
- Made md kernel threads freezable.
- Changed the approach for quiescing SCSI devices again.
- Addressed Ming's review comments.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.

Bart Van Assche (9):
  md: Rename md_notifier into md_reboot_notifier
  md: Introduce md_stop_all_writes()
  md: Neither resync nor reshape while the system is frozen
  block: Introduce blk_get_request_flags()
  block: Introduce BLK_MQ_REQ_PREEMPT
  ide, scsi: Tell the block layer at request allocation time about
    preempt requests
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI quiesce and resume work reliably
  block, nvme: Introduce blk_mq_req_flags_t

Ming Lei (1):
  block: Make q_usage_counter also track legacy requests

 block/blk-core.c           | 132 ++---
 block/blk-mq-debugfs.c     |   1 +
 block/blk-mq.c             |  20 +++
 block/blk-mq.h             |   2 +-
 block/blk-timeout.c        |   2 +-
 drivers/ide/ide-pm.c       |   4 +-
 drivers/md/md.c            |  65 +++---
 drivers/md/md.h            |   8 +++
 drivers/nvme/host/core.c   |   5 +-
 drivers/nvme/host/nvme.h   |   5 +-
 drivers/scsi/scsi_lib.c    |  46 +++-
 fs/block_dev.c             |   4 +-
 include/linux/blk-mq.h     |  16 --
 include/linux/blk_types.h  |   2 +
 include/linux/blkdev.h     |  11 +++-
 include/scsi/scsi_device.h |   1 +
 16 files changed, 256 insertions(+), 68 deletions(-)

--
2.14.2
[PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
from being submitted by deferring allocation of non-preempt requests
for devices in the quiesced state.

This patch has been tested by running the following commands and by
verifying that after resume the fio job is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d/queue/nr_requests"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko
References: "I/O hangs after resuming from suspend-to-ram"
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c           | 42 +++---
 block/blk-mq.c             |  4 ++--
 block/blk-timeout.c        |  2 +-
 drivers/scsi/scsi_lib.c    | 40 +---
 fs/block_dev.c             |  4 ++--
 include/linux/blkdev.h     |  2 +-
 include/scsi/scsi_device.h |  1 +
 7 files changed, 71 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b73203286bf5..abb5095bcef6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 	spin_lock_irqsave(q->queue_lock, flags);
 	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	wake_up_all(&q->mq_freeze_wq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);

@@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);

-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ */
+int blk_queue_enter(struct request_queue *q, unsigned int flags)
 {
+	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+
 	while (true) {
+		bool success = false;
 		int ret;

-		if (percpu_ref_tryget_live(&q->q_usage_counter))
+		rcu_read_lock_sched();
+		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+			/*
+			 * The code that sets the PREEMPT_ONLY flag is
+			 * responsible for ensuring that that flag is globally
+			 * visible before the queue is unfrozen.
+			 */
+			if (preempt || !blk_queue_preempt_only(q)) {
+				success = true;
+			} else {
+				percpu_ref_put(&q->q_usage_counter);
+				WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
+					  kobject_name(q->kobj.parent));
+			}
+		}
+		rcu_read_unlock_sched();
+
+		if (success)
 			return 0;
-		if (nowait)
+		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;

 		/*
@@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		smp_rmb();

 		ret = wait_event_interruptible(q->mq_freeze_wq,
-			       !atomic_read(&q->mq_freeze_depth) ||
+			       (atomic_read(&q->mq_freeze_depth) == 0 &&
+				(preempt || !blk_queue_preempt_only(q))) ||
[PATCH v2 3/8] nvmet/fc: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: James Smart
Cc: Sagi Grimberg
---
 drivers/nvme/target/Kconfig |  1 +
 drivers/nvme/target/fc.c    | 36 ++--
 2 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 03e4ab65fe77..4d9715630e21 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -39,6 +39,7 @@ config NVME_TARGET_FC
 	tristate "NVMe over Fabrics FC target driver"
 	depends on NVME_TARGET
 	depends on HAS_DMA
+	select SGL_ALLOC
 	help
 	  This enables the NVMe FC target support, which allows exporting NVMe
 	  devices over FC.
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 58e010bdda3e..20f96038c272 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -1683,31 +1683,12 @@ static int
 nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
 	struct scatterlist *sg;
-	struct page *page;
 	unsigned int nent;
-	u32 page_len, length;
-	int i = 0;

-	length = fod->total_length;
-	nent = DIV_ROUND_UP(length, PAGE_SIZE);
-	sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
+	sg = sgl_alloc(fod->total_length, GFP_KERNEL, &nent);
 	if (!sg)
 		goto out;

-	sg_init_table(sg, nent);
-
-	while (length) {
-		page_len = min_t(u32, length, PAGE_SIZE);
-
-		page = alloc_page(GFP_KERNEL);
-		if (!page)
-			goto out_free_pages;
-
-		sg_set_page(&sg[i], page, page_len, 0);
-		length -= page_len;
-		i++;
-	}
-
 	fod->data_sg = sg;
 	fod->data_sg_cnt = nent;
 	fod->data_sg_cnt = fc_dma_map_sg(fod->tgtport->dev, sg, nent,
@@ -1717,14 +1698,6 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)

 	return 0;

-out_free_pages:
-	while (i > 0) {
-		i--;
-		__free_page(sg_page(&sg[i]));
-	}
-	kfree(sg);
-	fod->data_sg = NULL;
-	fod->data_sg_cnt = 0;
 out:
 	return NVME_SC_INTERNAL;
 }
@@ -1732,18 +1705,13 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 static void
 nvmet_fc_free_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
-	struct scatterlist *sg;
-	int count;
-
 	if (!fod->data_sg || !fod->data_sg_cnt)
 		return;

 	fc_dma_unmap_sg(fod->tgtport->dev, fod->data_sg, fod->data_sg_cnt,
 			((fod->io_dir == NVMET_FCP_WRITE) ?
 				DMA_FROM_DEVICE : DMA_TO_DEVICE));
-	for_each_sg(fod->data_sg, sg, fod->data_sg_cnt, count)
-		__free_page(sg_page(sg));
-	kfree(fod->data_sg);
+	sgl_free(fod->data_sg);
 	fod->data_sg = NULL;
 	fod->data_sg_cnt = 0;
 }
--
2.14.2
[PATCH v2 7/8] scsi/pmcraid: Remove an unused structure member
Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen
Cc: Anil Ravindranath
---
 drivers/scsi/pmcraid.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 8bfac72a242b..44da91712115 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,6 @@ struct pmcraid_sglist {
 	u32 order;
 	u32 num_sg;
 	u32 num_dma_sg;
-	u32 buffer_len;
 	struct scatterlist scatterlist[1];
 };
--
2.14.2
[PATCH v2 1/8] lib/scatterlist: Introduce sgl_alloc() and sgl_free()
Many kernel drivers contain code that allocates and frees both a
scatterlist and the pages that populate that scatterlist. Introduce
functions in lib/scatterlist.c that perform these tasks instead of
duplicating this functionality in multiple drivers. Only include these
functions in the build if CONFIG_SGL_ALLOC=y to avoid that the kernel
size increases if this functionality is not used.

Signed-off-by: Bart Van Assche
---
 include/linux/scatterlist.h |  10 +
 lib/Kconfig                 |   4 ++
 lib/scatterlist.c           | 105
 3 files changed, 119 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4b3286ac60c8..3642511918d5 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -266,6 +266,16 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	unsigned long offset, unsigned long size,
 	gfp_t gfp_mask);

+#ifdef CONFIG_SGL_ALLOC
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+				    unsigned int order, bool chainable,
+				    gfp_t gfp, unsigned int *nent_p);
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+			      unsigned int *nent_p);
+void sgl_free_order(struct scatterlist *sgl, int order);
+void sgl_free(struct scatterlist *sgl);
+#endif /* CONFIG_SGL_ALLOC */
+
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 		      size_t buflen, off_t skip, bool to_buffer);

diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b22a6de..8396c4cfa1ab 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -413,6 +413,10 @@ config HAS_DMA
 	depends on !NO_DMA
 	default y

+config SGL_ALLOC
+	bool
+	default n
+
 config DMA_NOOP_OPS
 	bool
 	depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index be7b4dd6b68d..e2b5a78ceb44 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -433,6 +433,111 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 }
 EXPORT_SYMBOL(sg_alloc_table_from_pages);

+#ifdef CONFIG_SGL_ALLOC
+
+/**
+ * sgl_alloc_order - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist. Must be at least one
+ * @order: Second argument for alloc_pages()
+ * @chainable: Whether or not to allocate an extra element in the scatterlist
+ *	for scatterlist chaining purposes
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist that have pages
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+				    unsigned int order, bool chainable,
+				    gfp_t gfp, unsigned int *nent_p)
+{
+	struct scatterlist *sgl, *sg;
+	struct page *page;
+	unsigned int nent, nalloc;
+	u32 elem_len;
+
+	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
+	/* Check for integer overflow */
+	if (length > (nent << (PAGE_SHIFT + order)))
+		return NULL;
+	nalloc = nent;
+	if (chainable) {
+		/* Check for integer overflow */
+		if (nalloc + 1 < nalloc)
+			return NULL;
+		nalloc++;
+	}
+	sgl = kmalloc_array(nalloc, sizeof(struct scatterlist),
+			    (gfp & ~GFP_DMA) | __GFP_ZERO);
+	if (!sgl)
+		return NULL;
+
+	sg_init_table(sgl, nent);
+	sg = sgl;
+	while (length) {
+		elem_len = min_t(u64, length, PAGE_SIZE << order);
+		page = alloc_pages(gfp, order);
+		if (!page) {
+			sgl_free(sgl);
+			return NULL;
+		}
+
+		sg_set_page(sg, page, elem_len, 0);
+		length -= elem_len;
+		sg = sg_next(sg);
+	}
+	WARN_ON_ONCE(sg);
+	if (nent_p)
+		*nent_p = nent;
+	return sgl;
+}
+EXPORT_SYMBOL(sgl_alloc_order);
+
+/**
+ * sgl_alloc - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+			      unsigned int *nent_p)
+{
+	return sgl_alloc_order(length, 0, false, gfp, nent_p);
+}
+EXPORT_SYMBOL(sgl_alloc);
+
+/**
+ * sgl_free_order - free a scatterlist and its pages
+ * @sgl: Scatterlist with one or more elements
+ * @order: Second argument for __free_pages()
+ */
+void sgl_free_order(struct scatterlist *sgl, int order)
+{
+	struct scatterlist *sg;
+
[PATCH v2 8/8] scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order()
Use the sgl_alloc_order() and sgl_free_order() functions instead of
open coding these functions.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen
Cc: Anil Ravindranath
---
 drivers/scsi/Kconfig   |  1 +
 drivers/scsi/pmcraid.c | 43 ---
 drivers/scsi/pmcraid.h |  2 +-
 3 files changed, 6 insertions(+), 40 deletions(-)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index d11e75e76195..632200ec36a6 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1576,6 +1576,7 @@ config ZFCP
 config SCSI_PMCRAID
 	tristate "PMC SIERRA Linux MaxRAID adapter support"
 	depends on PCI && SCSI && NET
+	select SGL_ALLOC
 	---help---
 	  This driver supports the PMC SIERRA MaxRAID adapters.

diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index b4d6cd8cd1ad..95fc0352f9bb 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -3232,12 +3232,7 @@ static int pmcraid_build_ioadl(
 */
 static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
 {
-	int i;
-
-	for (i = 0; i < sglist->num_sg; i++)
-		__free_pages(sg_page(&(sglist->scatterlist[i])),
-			     sglist->order);
-
+	sgl_free_order(sglist->scatterlist, sglist->order);
 	kfree(sglist);
 }

@@ -3254,50 +3249,20 @@ static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
 static struct pmcraid_sglist *pmcraid_alloc_sglist(int buflen)
 {
 	struct pmcraid_sglist *sglist;
-	struct scatterlist *scatterlist;
-	struct page *page;
-	int num_elem, i, j;
 	int sg_size;
 	int order;
-	int bsize_elem;

 	sg_size = buflen / (PMCRAID_MAX_IOADLS - 1);
 	order = (sg_size > 0) ? get_order(sg_size) : 0;
-	bsize_elem = PAGE_SIZE * (1 << order);
-
-	/* Determine the actual number of sg entries needed */
-	if (buflen % bsize_elem)
-		num_elem = (buflen / bsize_elem) + 1;
-	else
-		num_elem = buflen / bsize_elem;

 	/* Allocate a scatter/gather list for the DMA */
-	sglist = kzalloc(sizeof(struct pmcraid_sglist) +
-			 (sizeof(struct scatterlist) * (num_elem - 1)),
-			 GFP_KERNEL);
-
+	sglist = kzalloc(sizeof(struct pmcraid_sglist), GFP_KERNEL);
 	if (sglist == NULL)
 		return NULL;

-	scatterlist = sglist->scatterlist;
-	sg_init_table(scatterlist, num_elem);
 	sglist->order = order;
-	sglist->num_sg = num_elem;
-	sg_size = buflen;
-
-	for (i = 0; i < num_elem; i++) {
-		page = alloc_pages(GFP_KERNEL|GFP_DMA|__GFP_ZERO, order);
-		if (!page) {
-			for (j = i - 1; j >= 0; j--)
-				__free_pages(sg_page(&scatterlist[j]), order);
-			kfree(sglist);
-			return NULL;
-		}
-
-		sg_set_page(&scatterlist[i], page,
-			    sg_size < bsize_elem ? sg_size : bsize_elem, 0);
-		sg_size -= bsize_elem;
-	}
+	sglist->scatterlist = sgl_alloc_order(buflen, order, false,
+			GFP_KERNEL | GFP_DMA | __GFP_ZERO, &sglist->num_sg);

 	return sglist;
 }
diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 44da91712115..754ef30927e2 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,7 @@ struct pmcraid_sglist {
 	u32 order;
 	u32 num_sg;
 	u32 num_dma_sg;
-	struct scatterlist scatterlist[1];
+	struct scatterlist *scatterlist;
 };
--
2.14.2
[PATCH v2 2/8] crypto: scompress - use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Cc: Ard Biesheuvel Cc: Herbert Xu --- crypto/Kconfig | 1 + crypto/scompress.c | 51 ++- 2 files changed, 3 insertions(+), 49 deletions(-) diff --git a/crypto/Kconfig b/crypto/Kconfig index 0a121f9ddf8e..a0667dd284ff 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -105,6 +105,7 @@ config CRYPTO_KPP config CRYPTO_ACOMP2 tristate select CRYPTO_ALGAPI2 + select SGL_ALLOC config CRYPTO_ACOMP tristate diff --git a/crypto/scompress.c b/crypto/scompress.c index 2075e2c4e7df..968bbcf65c94 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -140,53 +140,6 @@ static int crypto_scomp_init_tfm(struct crypto_tfm *tfm) return ret; } -static void crypto_scomp_sg_free(struct scatterlist *sgl) -{ - int i, n; - struct page *page; - - if (!sgl) - return; - - n = sg_nents(sgl); - for_each_sg(sgl, sgl, n, i) { - page = sg_page(sgl); - if (page) - __free_page(page); - } - - kfree(sgl); -} - -static struct scatterlist *crypto_scomp_sg_alloc(size_t size, gfp_t gfp) -{ - struct scatterlist *sgl; - struct page *page; - int i, n; - - n = ((size - 1) >> PAGE_SHIFT) + 1; - - sgl = kmalloc_array(n, sizeof(struct scatterlist), gfp); - if (!sgl) - return NULL; - - sg_init_table(sgl, n); - - for (i = 0; i < n; i++) { - page = alloc_page(gfp); - if (!page) - goto err; - sg_set_page(sgl + i, page, PAGE_SIZE, 0); - } - - return sgl; - -err: - sg_mark_end(sgl + i); - crypto_scomp_sg_free(sgl); - return NULL; -} - static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) { struct crypto_acomp *tfm = crypto_acomp_reqtfm(req); @@ -220,7 +173,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) scratch_dst, &req->dlen, *ctx); if (!ret) { if (!req->dst) { - req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC); + req->dst = sgl_alloc(req->dlen, GFP_ATOMIC, NULL); if (!req->dst) goto out; } @@ -274,7 +227,7 @@ int crypto_init_scomp_ops_async(struct crypto_tfm
*tfm) crt->compress = scomp_acomp_compress; crt->decompress = scomp_acomp_decompress; - crt->dst_free = crypto_scomp_sg_free; + crt->dst_free = sgl_free; crt->reqsize = sizeof(void *); return 0; -- 2.14.2
[PATCH v2 4/8] nvmet/rdma: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: Keith Busch Cc: Christoph Hellwig Cc: Sagi Grimberg --- drivers/nvme/target/Kconfig | 1 + drivers/nvme/target/rdma.c | 63 +++-- 2 files changed, 5 insertions(+), 59 deletions(-) diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 4d9715630e21..5f4f8b16685f 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -29,6 +29,7 @@ config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" depends on INFINIBAND depends on NVME_TARGET + select SGL_ALLOC help This enables the NVMe RDMA target support, which allows exporting NVMe devices over RDMA. diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index 76d2bb793afe..363d44c10b68 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -185,59 +185,6 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp) spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags); } -static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents) -{ - struct scatterlist *sg; - int count; - - if (!sgl || !nents) - return; - - for_each_sg(sgl, sg, nents, count) - __free_page(sg_page(sg)); - kfree(sgl); -} - -static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, - u32 length) -{ - struct scatterlist *sg; - struct page *page; - unsigned int nent; - int i = 0; - - nent = DIV_ROUND_UP(length, PAGE_SIZE); - sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL); - if (!sg) - goto out; - - sg_init_table(sg, nent); - - while (length) { - u32 page_len = min_t(u32, length, PAGE_SIZE); - - page = alloc_page(GFP_KERNEL); - if (!page) - goto out_free_pages; - - sg_set_page(&sg[i], page, page_len, 0); - length -= page_len; - i++; - } - *sgl = sg; - *nents = nent; - return 0; - -out_free_pages: - while (i > 0) { - i--; - __free_page(sg_page(&sg[i])); - } - kfree(sg); -out: - return
NVME_SC_INTERNAL; -} - static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev, struct nvmet_rdma_cmd *c, bool admin) { @@ -484,7 +431,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp) } if (rsp->req.sg != &rsp->cmd->inline_sg) - nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt); + sgl_free(rsp->req.sg); if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list))) nvmet_rdma_process_wr_wait_list(queue); @@ -620,16 +567,14 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp, u32 len = get_unaligned_le24(sgl->length); u32 key = get_unaligned_le32(sgl->key); int ret; - u16 status; /* no data command? */ if (!len) return 0; - status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt, - len); - if (status) - return status; + rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt); + if (!rsp->req.sg) + return NVME_SC_INTERNAL; ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num, rsp->req.sg, rsp->req.sg_cnt, 0, addr, key, -- 2.14.2
[PATCH v2 6/8] scsi/ipr: Use sgl_alloc_order() and sgl_free_order()
Use the sgl_alloc_order() and sgl_free_order() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: linux-s...@vger.kernel.org Cc: Martin K. Petersen Cc: Brian King --- drivers/scsi/Kconfig | 1 + drivers/scsi/ipr.c | 49 - drivers/scsi/ipr.h | 2 +- 3 files changed, 10 insertions(+), 42 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 41366339b950..d11e75e76195 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -1058,6 +1058,7 @@ config SCSI_IPR depends on PCI && SCSI && ATA select FW_LOADER select IRQ_POLL + select SGL_ALLOC ---help--- This driver supports the IBM Power Linux family RAID adapters. This includes IBM pSeries 5712, 5703, 5709, and 570A, as well diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index f838bd73befa..6fed5db6255e 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -3815,10 +3815,8 @@ static struct device_attribute ipr_iopoll_weight_attr = { **/ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) { - int sg_size, order, bsize_elem, num_elem, i, j; + int sg_size, order; struct ipr_sglist *sglist; - struct scatterlist *scatterlist; - struct page *page; /* Get the minimum size per scatter/gather element */ sg_size = buf_len / (IPR_MAX_SGLIST - 1); @@ -3826,45 +3824,18 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) /* Get the actual size per element */ order = get_order(sg_size); - /* Determine the actual number of bytes per element */ - bsize_elem = PAGE_SIZE * (1 << order); - - /* Determine the actual number of sg entries needed */ - if (buf_len % bsize_elem) - num_elem = (buf_len / bsize_elem) + 1; - else - num_elem = buf_len / bsize_elem; - /* Allocate a scatter/gather list for the DMA */ - sglist = kzalloc(sizeof(struct ipr_sglist) + -(sizeof(struct scatterlist) * (num_elem - 1)), -GFP_KERNEL); - + sglist = kzalloc(sizeof(struct ipr_sglist), GFP_KERNEL); if (sglist == NULL) { ipr_trace; return NULL;
} - - scatterlist = sglist->scatterlist; - sg_init_table(scatterlist, num_elem); - sglist->order = order; - sglist->num_sg = num_elem; - - /* Allocate a bunch of sg elements */ - for (i = 0; i < num_elem; i++) { - page = alloc_pages(GFP_KERNEL, order); - if (!page) { - ipr_trace; - - /* Free up what we already allocated */ - for (j = i - 1; j >= 0; j--) - __free_pages(sg_page(&scatterlist[j]), order); - kfree(sglist); - return NULL; - } - - sg_set_page(&scatterlist[i], page, 0, 0); + sglist->scatterlist = sgl_alloc_order(buf_len, order, false, GFP_KERNEL, + &sglist->num_sg); + if (!sglist->scatterlist) { + kfree(sglist); + return NULL; } return sglist; @@ -3882,11 +3853,7 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) **/ static void ipr_free_ucode_buffer(struct ipr_sglist *sglist) { - int i; - - for (i = 0; i < sglist->num_sg; i++) - __free_pages(sg_page(&sglist->scatterlist[i]), sglist->order); - + sgl_free_order(sglist->scatterlist, sglist->order); kfree(sglist); } diff --git a/drivers/scsi/ipr.h b/drivers/scsi/ipr.h index c7f0e9e3cd7d..93570734cbfb 100644 --- a/drivers/scsi/ipr.h +++ b/drivers/scsi/ipr.h @@ -1454,7 +1454,7 @@ struct ipr_sglist { u32 num_sg; u32 num_dma_sg; u32 buffer_len; - struct scatterlist scatterlist[1]; + struct scatterlist *scatterlist; }; enum ipr_sdt_state { -- 2.14.2
[PATCH v2 0/8] Introduce sgl_alloc() and sgl_free()
Hello Jens, As you know there are multiple drivers that both allocate a scatter/gather list and populate that list with pages. This patch series moves the code for allocating and freeing such scatterlists from these drivers into lib/scatterlist.c. Please consider this patch series for kernel v4.15. Notes: - Only the ib_srpt driver changes have been tested but none of the other drivers have been retested. - The next step is to introduce a caching mechanism for these scatterlists and make the nvmet/rdma and SCSI target drivers use that caching mechanism since for these drivers sgl allocation occurs in the hot path. Changes compared to v1: - Moved the sgl_alloc*() and sgl_free() functions from a new source file into lib/scatterlist.c. - Changed the argument order for the sgl_alloc*() functions such that the (pointer to) the output argument comes last. Thanks, Bart. Bart Van Assche (8): lib/scatterlist: Introduce sgl_alloc() and sgl_free() crypto: scompress - use sgl_alloc() and sgl_free() nvmet/fc: Use sgl_alloc() and sgl_free() nvmet/rdma: Use sgl_alloc() and sgl_free() target: Use sgl_alloc_order() and sgl_free() scsi/ipr: Use sgl_alloc_order() and sgl_free_order() scsi/pmcraid: Remove an unused structure member scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order() crypto/Kconfig | 1 + crypto/scompress.c | 51 +--- drivers/nvme/target/Kconfig| 2 + drivers/nvme/target/fc.c | 36 +-- drivers/nvme/target/rdma.c | 63 ++-- drivers/scsi/Kconfig | 2 + drivers/scsi/ipr.c | 49 +++ drivers/scsi/ipr.h | 2 +- drivers/scsi/pmcraid.c | 43 ++ drivers/scsi/pmcraid.h | 3 +- drivers/target/Kconfig | 1 + drivers/target/target_core_transport.c | 46 ++- include/linux/scatterlist.h| 10 lib/Kconfig| 4 ++ lib/scatterlist.c | 105 + 15 files changed, 151 insertions(+), 267 deletions(-) -- 2.14.2
[PATCH v2 5/8] target: Use sgl_alloc_order() and sgl_free()
Use the sgl_alloc_order() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Cc: Nicholas A. Bellinger Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Sagi Grimberg --- drivers/target/Kconfig | 1 + drivers/target/target_core_transport.c | 46 +++--- 2 files changed, 5 insertions(+), 42 deletions(-) diff --git a/drivers/target/Kconfig b/drivers/target/Kconfig index e2bc99980f75..4c44d7bed01a 100644 --- a/drivers/target/Kconfig +++ b/drivers/target/Kconfig @@ -5,6 +5,7 @@ menuconfig TARGET_CORE select CONFIGFS_FS select CRC_T10DIF select BLK_SCSI_REQUEST # only for scsi_command_size_tbl.. + select SGL_ALLOC default n help Say Y or M here to enable the TCM Storage Engine and ConfigFS enabled diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 836d552b0385..9bbd08be9d60 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2293,13 +2293,7 @@ static void target_complete_ok_work(struct work_struct *work) void target_free_sgl(struct scatterlist *sgl, int nents) { - struct scatterlist *sg; - int count; - - for_each_sg(sgl, sg, nents, count) - __free_page(sg_page(sg)); - - kfree(sgl); + sgl_free(sgl); } EXPORT_SYMBOL(target_free_sgl); @@ -2423,42 +2417,10 @@ int target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length, bool zero_page, bool chainable) { - struct scatterlist *sg; - struct page *page; - gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0; - unsigned int nalloc, nent; - int i = 0; - - nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE); - if (chainable) - nalloc++; - sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL); - if (!sg) - return -ENOMEM; + gfp_t gfp = GFP_KERNEL | (zero_page ?
__GFP_ZERO : 0); - sg_init_table(sg, nalloc); - - while (length) { - u32 page_len = min_t(u32, length, PAGE_SIZE); - page = alloc_page(GFP_KERNEL | zero_flag); - if (!page) - goto out; - - sg_set_page(&sg[i], page, page_len, 0); - length -= page_len; - i++; - } - *sgl = sg; - *nents = nent; - return 0; - -out: - while (i > 0) { - i--; - __free_page(sg_page(&sg[i])); - } - kfree(sg); - return -ENOMEM; + *sgl = sgl_alloc_order(length, 0, chainable, gfp, nents); + return *sgl ? 0 : -ENOMEM; } EXPORT_SYMBOL(target_alloc_sgl); -- 2.14.2
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 10/16/2017 01:07 PM, Michael Lyle wrote: > Jens-- > > Thanks for your patience. > > On Mon, Oct 16, 2017 at 12:01 PM, Jens Axboe wrote: >> On 10/16/2017 11:34 AM, Michael Lyle wrote: >>> Sorry this got through to linux-block, was detected by the kbuilds test >>> robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a >>> signed long constant. >> >> Applied, but you should remember to add Fixes lines when a patch >> explicitly fixes a problem introduced by a previous patch. Ala: >> >> Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") >> >> so that backports have an easier time finding dependent fixes. > > Sorry about that too. I considered the Fixes: tag but didn't know if > the hashes would "hold true" into mainline. Now I see Linus merges > your tree so the hash should be durable. They are durable once applied, so yeah. -- Jens Axboe
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
Jens-- Thanks for your patience. On Mon, Oct 16, 2017 at 12:01 PM, Jens Axboe wrote: > On 10/16/2017 11:34 AM, Michael Lyle wrote: >> Sorry this got through to linux-block, was detected by the kbuilds test >> robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a >> signed long constant. > > Applied, but you should remember to add Fixes lines when a patch > explicitly fixes a problem introduced by a previous patch. Ala: > > Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") > > so that backports have an easier time finding dependent fixes. Sorry about that too. I considered the Fixes: tag but didn't know if the hashes would "hold true" into mainline. Now I see Linus merges your tree so the hash should be durable. > > -- > Jens Axboe > Mike
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 10/16/2017 11:34 AM, Michael Lyle wrote: > Sorry this got through to linux-block, was detected by the kbuilds test > robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a > signed long constant. Applied, but you should remember to add Fixes lines when a patch explicitly fixes a problem introduced by a previous patch. Ala: Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") so that backports have an easier time finding dependent fixes. -- Jens Axboe
Re: [PATCH] block: fix Sphinx kernel-doc warning
On 10/16/2017 12:01 PM, Randy Dunlap wrote: > From: Randy Dunlap > > Sphinx treats symbols that end with '_' as a kind of special > documentation indicator, so fix that by adding an ending '*' > to it. Added for 4.15, thanks Randy. -- Jens Axboe
Re: [PATCHv3] bcache: only permit to recovery read error when cache device is clean
On 2017/10/17 1:39 AM, Michael Lyle wrote: > Hey Coly-- > > On Thu, Sep 21, 2017 at 12:54 PM, Coly Li wrote: >> When bcache does read I/Os, for example in writeback or writethrough mode, >> if a read request on cache device is failed, bcache will try to recovery >> the request by reading from cached device. If the data on cached device is >> not synced with cache device, then requester will get a stale data. > [...] >> + if (s->recoverable && >> + (dc && !atomic_read(&dc->has_dirty)) { > > Looks like this is missing a parens. Hi Mike, Oops, I am blind ... Thanks for figure out this. V4 patch is out, and I think you may change Acked-by to Reviewed-by, because you reviewed the code and pointed out problem :-) Thanks. -- Coly Li
[PATCHv4] bcache: only permit to recovery read error when cache device is clean
When bcache does read I/Os, for example in writeback or writethrough mode, if a read request on the cache device fails, bcache will try to recover the request by reading from the cached device. If the data on the cached device is not synced with the cache device, then the requester will get stale data. For a critical storage system like a database, providing stale data from recovery may result in application-level data corruption, which is unacceptable. With this patch, for a failed read request in writeback or writethrough mode, recovering a recoverable read request only happens when the cache device is clean. That is to say, all data on the cached device is up to date. For other cache modes in bcache, read requests will never hit cached_dev_read_error(), so they don't need this patch. Please note, because the cache mode can be switched arbitrarily at run time, a writethrough mode might have been switched from a writeback mode. Therefore checking dc->has_dirty in writethrough mode still makes sense. Changelog: V4: Fix parens error pointed out by Michael Lyle. v3: By response from Kent Overstreet, he thinks recovering stale data is a bug to fix, and an option to permit it is unnecessary. So in this version the sysfs file is removed. v2: rename sysfs entry from allow_stale_data_on_failure to allow_stale_data_on_failure, and fix the confusing commit log. v1: initial patch posted.
Signed-off-by: Coly Li Reported-by: Arne Wolf Acked-by: Michael Lyle Cc: Kent Overstreet Cc: Nix Cc: Kai Krakow Cc: Eric Wheeler Cc: Junhui Tang Cc: sta...@vger.kernel.org --- drivers/md/bcache/request.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 681b4f12b05a..e7f769ff7234 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -697,8 +697,16 @@ static void cached_dev_read_error(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); struct bio *bio = &s->bio.bio; + struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); - if (s->recoverable) { + /* +* If cache device is dirty (dc->has_dirty is non-zero), then +* recovery a failed read request from cached device may get a +* stale data back. So read failure recovery is only permitted +* when cache device is clean. +*/ + if (s->recoverable && + (dc && !atomic_read(&dc->has_dirty))) { /* Retry from the backing device: */ trace_bcache_read_retry(s->orig_bio); -- 2.13.6
[PATCH] block: fix Sphinx kernel-doc warning
From: Randy Dunlap Sphinx treats symbols that end with '_' as a kind of special documentation indicator, so fix that by adding an ending '*' to it. ../block/bio.c:404: ERROR: Unknown target name: "gfp". Signed-off-by: Randy Dunlap --- block/bio.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- lnx-414-rc5.orig/block/bio.c +++ lnx-414-rc5/block/bio.c @@ -400,7 +400,7 @@ static void punt_bios_to_rescuer(struct /** * bio_alloc_bioset - allocate a bio for I/O - * @gfp_mask: the GFP_ mask given to the slab allocator + * @gfp_mask: the GFP_* mask given to the slab allocator * @nr_iovecs: number of iovecs to pre-allocate * @bs: the bio_set to allocate from. *
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 2017/10/17 1:34 AM, Michael Lyle wrote: > Sorry this got through to linux-block, was detected by the kbuilds test > robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a > signed long constant. > > Signed-off-by: Michael Lyle I forgot i386 config... sorry too... Reviewed-by: Coly Li Thanks to Michael for the fix. Coly Li > --- > drivers/md/bcache/util.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > index 4dbe37e82877..e548b8b51322 100644 > --- a/drivers/md/bcache/util.c > +++ b/drivers/md/bcache/util.c > @@ -238,8 +238,8 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t > done) >* don't let us sleep more than 2.5 seconds (so we can notice/respond >* if the control system tells us to speed up!). >*/ > - if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next)) > - d->next = now + NSEC_PER_SEC * 5 / 2; > + if (time_before64(now + NSEC_PER_SEC * 5LLU / 2LLU, d->next)) > + d->next = now + NSEC_PER_SEC * 5LLU / 2LLU; > > if (time_after64(now - NSEC_PER_SEC * 2, d->next)) > d->next = now - NSEC_PER_SEC * 2;
Re: [PATCHv3] bcache: only permit to recovery read error when cache device is clean
Hey Coly-- On Thu, Sep 21, 2017 at 12:54 PM, Coly Li wrote: > When bcache does read I/Os, for example in writeback or writethrough mode, > if a read request on cache device is failed, bcache will try to recovery > the request by reading from cached device. If the data on cached device is > not synced with cache device, then requester will get a stale data. [...] > + if (s->recoverable && > + (dc && !atomic_read(&dc->has_dirty)) { Looks like this is missing a parens. Thx, Mike
[PATCH] bcache: writeback rate clamping: make 32 bit safe
Sorry this got through to linux-block, was detected by the kbuilds test robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a signed long constant. Signed-off-by: Michael Lyle --- drivers/md/bcache/util.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index 4dbe37e82877..e548b8b51322 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -238,8 +238,8 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) * don't let us sleep more than 2.5 seconds (so we can notice/respond * if the control system tells us to speed up!). */ - if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next)) - d->next = now + NSEC_PER_SEC * 5 / 2; + if (time_before64(now + NSEC_PER_SEC * 5LLU / 2LLU, d->next)) + d->next = now + NSEC_PER_SEC * 5LLU / 2LLU; if (time_after64(now - NSEC_PER_SEC * 2, d->next)) d->next = now - NSEC_PER_SEC * 2; -- 2.11.0
Re: [PATCH 8/9] block: pass full fmode_t to blk_verify_command
On Tue, Oct 03, 2017 at 12:48:44PM +0200, Christoph Hellwig wrote: > Use the obvious calling convention. > > Signed-off-by: Christoph Hellwig> --- > block/bsg.c| 18 -- > block/scsi_ioctl.c | 8 > drivers/scsi/sg.c | 2 +- > include/linux/blkdev.h | 2 +- > 4 files changed, 14 insertions(+), 16 deletions(-) > > diff --git a/block/bsg.c b/block/bsg.c > index ee1335c68de7..452f94f1c5d4 100644 > --- a/block/bsg.c > +++ b/block/bsg.c > @@ -137,7 +137,7 @@ static inline struct hlist_head *bsg_dev_idx_hash(int > index) > > static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq, > struct sg_io_v4 *hdr, struct bsg_device *bd, > - fmode_t has_write_perm) > + fmode_t mode) > { > struct scsi_request *req = scsi_req(rq); > > @@ -152,7 +152,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, > struct request *rq, > return -EFAULT; > > if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) { > - if (blk_verify_command(req->cmd, has_write_perm)) > + if (blk_verify_command(req->cmd, mode)) > return -EPERM; > } else if (!capable(CAP_SYS_RAWIO)) > return -EPERM; > @@ -206,7 +206,7 @@ bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *op) > * map sg_io_v4 to a request. 
> */ > static struct request * > -bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t > has_write_perm) > +bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t mode) > { > struct request_queue *q = bd->queue; > struct request *rq, *next_rq = NULL; > @@ -237,7 +237,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, > fmode_t has_write_perm) > if (IS_ERR(rq)) > return rq; > > - ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, has_write_perm); > + ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, mode); > if (ret) > goto out; > > @@ -587,8 +587,7 @@ bsg_read(struct file *file, char __user *buf, size_t > count, loff_t *ppos) > } > > static int __bsg_write(struct bsg_device *bd, const char __user *buf, > -size_t count, ssize_t *bytes_written, > -fmode_t has_write_perm) > +size_t count, ssize_t *bytes_written, fmode_t mode) > { > struct bsg_command *bc; > struct request *rq; > @@ -619,7 +618,7 @@ static int __bsg_write(struct bsg_device *bd, const char > __user *buf, > /* >* get a request, fill in the blanks, and add to request queue >*/ > - rq = bsg_map_hdr(bd, &bc->hdr, has_write_perm); > + rq = bsg_map_hdr(bd, &bc->hdr, mode); > if (IS_ERR(rq)) { > ret = PTR_ERR(rq); > rq = NULL; > @@ -655,8 +654,7 @@ bsg_write(struct file *file, const char __user *buf, > size_t count, loff_t *ppos) > bsg_set_block(bd, file); > > bytes_written = 0; > - ret = __bsg_write(bd, buf, count, &bytes_written, > - file->f_mode & FMODE_WRITE); > + ret = __bsg_write(bd, buf, count, &bytes_written, file->f_mode); > > *ppos = bytes_written; > > @@ -915,7 +913,7 @@ static long bsg_ioctl(struct file *file, unsigned int > cmd, unsigned long arg) > if (copy_from_user(&hdr, uarg, sizeof(hdr))) > return -EFAULT; > > - rq = bsg_map_hdr(bd, &hdr, file->f_mode & FMODE_WRITE); > + rq = bsg_map_hdr(bd, &hdr, file->f_mode); > if (IS_ERR(rq)) > return PTR_ERR(rq); > > diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c > index 7440de44dd85..edcfff974527 100644 > --- a/block/scsi_ioctl.c > +++ b/block/scsi_ioctl.c > @@
-207,7 +207,7 @@ static void blk_set_cmd_filter_defaults(struct > blk_cmd_filter *filter) > __set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok); > } > > -int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm) > +int blk_verify_command(unsigned char *cmd, fmode_t mode) > { > struct blk_cmd_filter *filter = &blk_default_cmd_filter; > > @@ -220,7 +220,7 @@ int blk_verify_command(unsigned char *cmd, fmode_t > has_write_perm) > return 0; > > /* Write-safe commands require a writable open */ > - if (test_bit(cmd[0], filter->write_ok) && has_write_perm) > + if (test_bit(cmd[0], filter->write_ok) && (mode & FMODE_WRITE)) > return 0; > > return -EPERM; > @@ -234,7 +234,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, > struct request *rq, > > if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len)) > return -EFAULT; > - if (blk_verify_command(req->cmd, mode & FMODE_WRITE)) > + if (blk_verify_command(req->cmd, mode)) > return -EPERM; > > /* > @@ -469,7 +469,7 @@ int sg_scsi_ioctl(struct
Re: [PATCH 6/9] bsg-lib: introduce a timeout field in struct bsg_job
On Tue, Oct 03, 2017 at 12:48:42PM +0200, Christoph Hellwig wrote: > The zfcp driver wants to know the timeout for a bsg job, so add a field > to struct bsg_job for it in preparation of not exposing the request > to the bsg-lib users. > > Signed-off-by: Christoph Hellwig> --- > block/bsg-lib.c | 1 + > drivers/s390/scsi/zfcp_fc.c | 4 ++-- > include/linux/bsg-lib.h | 2 ++ > 3 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/block/bsg-lib.c b/block/bsg-lib.c > index 15d25ccd51a5..0d5bbf6a2ddd 100644 > --- a/block/bsg-lib.c > +++ b/block/bsg-lib.c > @@ -132,6 +132,7 @@ static int bsg_prepare_job(struct device *dev, struct > request *req) > struct bsg_job *job = blk_mq_rq_to_pdu(req); > int ret; > > + job->timeout = req->timeout; > job->request = rq->cmd; > job->request_len = rq->cmd_len; > > diff --git a/drivers/s390/scsi/zfcp_fc.c b/drivers/s390/scsi/zfcp_fc.c > index 8210645c2111..9d6f69b92c81 100644 > --- a/drivers/s390/scsi/zfcp_fc.c > +++ b/drivers/s390/scsi/zfcp_fc.c > @@ -960,7 +960,7 @@ static int zfcp_fc_exec_els_job(struct bsg_job *job, > d_id = ntoh24(bsg_request->rqst_data.h_els.port_id); > > els->handler = zfcp_fc_ct_els_job_handler; > - return zfcp_fsf_send_els(adapter, d_id, els, job->req->timeout / HZ); > + return zfcp_fsf_send_els(adapter, d_id, els, job->timeout / HZ); > } > > static int zfcp_fc_exec_ct_job(struct bsg_job *job, > @@ -979,7 +979,7 @@ static int zfcp_fc_exec_ct_job(struct bsg_job *job, > return ret; > > ct->handler = zfcp_fc_ct_job_handler; > - ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->req->timeout / HZ); > + ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->timeout / HZ); > if (ret) > zfcp_fc_wka_port_put(wka_port); > > diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h > index b1be0233ce35..402223c95ce1 100644 > --- a/include/linux/bsg-lib.h > +++ b/include/linux/bsg-lib.h > @@ -44,6 +44,8 @@ struct bsg_job { > > struct kref kref; > > + unsigned int timeout; > + > /* Transport/driver specific 
request/reply structs */ > void *request; > void *reply; > -- > 2.14.1 > Reviewed-by: Benjamin Block Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On Mon, 2017-10-16 at 13:30 +0200, Hannes Reinecke wrote: > On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: > > (+Himanshu Madhani) > > > > Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for > > the qla2xxx initiator driver to the scsi_host->can_queue value? > > '3' is just a starting point; later on it'll be adjusted via > scsi_change_depth(). > Looks like it's not working correctly with blk-mq, though. Hello Hannes, Isn't the purpose of scsi_change_queue_depth() to change the sdev queue_depth only? As far as I know there is only one assignment of shost->cmd_per_lun in the SCSI core and that's the following assignment in scsi_host_alloc(): shost->cmd_per_lun = sht->cmd_per_lun; Bart.
Re: [GIT PULL 02/58] lightnvm: prevent bd removal if busy
On Fri, Oct 13, 2017 at 5:35 PM, Rakesh Pandit wrote: > On Fri, Oct 13, 2017 at 07:58:09AM -0700, Christoph Hellwig wrote: >> On Fri, Oct 13, 2017 at 02:45:51PM +0200, Matias Bjørling wrote: >> > From: Rakesh Pandit >> > >> > When a virtual block device is formatted and mounted after creating >> > with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove" >> > would result in this: >> > >> > 446416.309757] bdi-block not registered >> > [446416.309773] [ cut here ] >> > [446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 >> > __mark_inode_dirty+0x268/0x340 >> > >> > Ideally removal should return -EBUSY as block device is mounted after >> > formatting. This patch tries to address this checking if whole device >> > or any partition of it already mounted or not before removal. >> >> How is this different from any other block device that can be >> removed even if a file system is mounted? > > One can create many virtual block devices on top of physical using: > nvme lnvm create ... -t pblk > > And remove them using: > nvme lnvm remove > > Because the block devices are virtual in nature created by a program I was > expecting removal to tell me they are busy instead of bdi-block not registered > following by a WARNING (above). My use case was writing automatic test case > but I assumed this is useful in general. > >> >> > >> > Whole device is checked using "bd_super" member of block device. This >> > member is always set once block device has been mounted using a >> > filesystem. Another member "bd_part_count" takes care of checking any >> > if any partitions are under use. "bd_part_count" is only updated >> > under locks when partitions are opened or closed (first open and last >> > release). This at least does take care sending -EBUSY if removal is >> > being attempted while whole block device or any partition is mounted. >> > >> >> That's a massive layering violation, and a driver has no business >> looking at these fields.
> > Okay, I didn't consider this earlier. I would suggest a revert for this. I see you've already done it. Thanks Jens.
Re: [PATCH 00/15] bcache: series of patches for 4.15
On Fri, Oct 13 2017, Michael Lyle wrote: > Jens, and everyone: > > Here is the current series of work for inclusion in next and for 4.15's > merge window. We may get some additional fixes, but this is most of the > work we expect. All have been previously sent to these lists except the > very last one. > > There's a lot of small cleanup patches and fixes for race conditions. > > Also included is my new rate controller which we've been measuring for > some time, and other related changes to writeback rate management. Applied for 4.15, thanks. -- Jens Axboe
Re: [PATCH v2 0/3] Move tagset reinit to its only current consumer, nvme-core
On 10/16/2017 06:48 AM, Christoph Hellwig wrote: > Jens, are you fine with picking this up through the nvme tree? > I think Sagi's later changes rely on it, so that would make life > a little easier. Yeah that's fine, you can add my reviewed-by to the patches as well. -- Jens Axboe
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Mon, Oct 16, 2017 at 1:44 PM, Christoph Hellwig wrote: > On Fri, Oct 06, 2017 at 02:31:20PM +0200, Ilya Dryomov wrote: >> This would unconditionally overwrite any WRITE ZEROES error. If we get >> e.g. -EIO, and manual zeroing is not allowed, I don't think we want to >> return -EOPNOTSUPP? >> >> Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't >> make sense to me, because manual zeroing is always supported, just not >> always allowed. > > Then throw the bdev_write_zeroes_sectors check back in: > > if (ret && try_write_zeroes) { > if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { > try_write_zeroes = false; > goto retry; > } > if (!bdev_write_zeroes_sectors(bdev)) > ret = -EOPNOTSUPP; > } > > The important bit is that the structure in the current patch, where the > bdev_write_zeroes_sectors check is on the same level as the method > selection, is extremely confusing. I see. An updated version should be in your inbox. Thanks, Ilya
[PATCH v3 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For the BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same, thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig Cc: "Martin K. Petersen" Cc: Hannes Reinecke Signed-off-by: Ilya Dryomov --- block/blk-lib.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 9d2ab8bba52a..23411ec7f3cc 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -321,12 +321,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should retry - * the call to blkdev_issue_zeroout() and the fallback path will be used.
- * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -370,18 +364,49 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { - int ret; - struct bio *bio = NULL; + int ret = 0; + sector_t bs_mask; + struct bio *bio; struct blk_plug plug; + bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; + +retry: + bio = NULL; blk_start_plug(&plug); - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - &bio, flags); + if (try_write_zeroes) { + ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, + gfp_mask, &bio, flags); + } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, + gfp_mask, &bio); + } else { + /* No zeroing offload support */ + ret = -EOPNOTSUPP; + } if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); } blk_finish_plug(&plug); + if (ret && try_write_zeroes) { + if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + try_write_zeroes = false; + goto retry; + } + if (!bdev_write_zeroes_sectors(bdev)) { + /* +* Zeroing offload support was indicated, but the +* device reported ILLEGAL REQUEST (for some devices +* there is no non-destructive way to verify whether +* WRITE ZEROES is actually supported). +*/ + ret = -EOPNOTSUPP; + } + } return ret; } -- 2.4.3
[PATCH v3 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in the !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov --- block/blk-lib.c | 63 + 1 file changed, 37 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..9d2ab8bba52a 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,40 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static int __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + if (!q) + return -ENXIO; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; + return 0; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +338,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +347,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - 
bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, +biop); } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
[PATCH v3 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag; Jens can add it back if needed. v2 -> v3: - another code flow change in blkdev_issue_zeroout() suggested by Christoph -- no functional changes v1 -> v2: - changed code flow in blkdev_issue_zeroout() according to Christoph's suggestion - this required adding additional checks to blkdev_issue_zeroout() and __blkdev_issue_zero_pages(), the latter is no longer void Previous version at https://marc.info/?l=linux-block&m=150712940922089&w=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 108 +--- 1 file changed, 72 insertions(+), 36 deletions(-) -- 2.4.3
Re: [PATCH v2 0/3] Move tagset reinit to its only current consumer, nvme-core
Jens, are you fine with picking this up through the nvme tree? I think Sagi's later changes rely on it, so that would make life a little easier.
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 06, 2017 at 02:31:20PM +0200, Ilya Dryomov wrote: > This would unconditionally overwrite any WRITE ZEROES error. If we get > e.g. -EIO, and manual zeroing is not allowed, I don't think we want to > return -EOPNOTSUPP? > > Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't > make sense to me, because manual zeroing is always supported, just not > always allowed. Then throw the bdev_write_zeroes_sectors check back in: if (ret && try_write_zeroes) { if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { try_write_zeroes = false; goto retry; } if (!bdev_write_zeroes_sectors(bdev)) ret = -EOPNOTSUPP; } The important bit is that the structure in the current patch, where the bdev_write_zeroes_sectors check is on the same level as the method selection, is extremely confusing.
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On 10/13/2017 07:29 PM, Ming Lei wrote: > On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote: >>> On Fri, Oct 13, 2017 at 04:31:04PM +, Bart Van Assche wrote: On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote: > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is > 3, Sorry but I doubt whether that is correct. More in general, I don't know any modern storage HBA for which the default queue depth is so low. >>> >>> You can grep: >>> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E >>> "qla2xxx|lpfc" >> >> Such a low queue depth will result in suboptimal performance for adapters >> that communicate over a storage network. I think that's a bug and that both >> adapters support much higher cmd_per_lun values. >> >> (+James Smart) >> >> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN >> from 30 to 3? Was that perhaps a workaround for a bug in a specific target >> implementation? >> >> (+Himanshu Madhani) >> >> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for >> the qla2xxx initiator driver to the scsi_host->can_queue value? > > ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't > reasonable to increase cmd_per_lun to .can_queue. > '3' is just a starting point; later on it'll be adjusted via scsi_change_depth(). Looks like it's not working correctly with blk-mq, though. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 6, 2017 at 2:31 PM, Ilya Dryomov wrote: > On Fri, Oct 6, 2017 at 2:05 PM, Christoph Hellwig wrote: >> On Thu, Oct 05, 2017 at 09:32:33PM +0200, Ilya Dryomov wrote: >>> This is to avoid returning -EREMOTEIO in the following case: device >>> doesn't support WRITE SAME but scsi_disk::max_ws_blocks != 0, zeroout >>> is called with BLKDEV_ZERO_NOFALLBACK. Enter blkdev_issue_zeroout(), >>> bdev_write_zeroes_sectors() != 0, so we issue WRITE ZEROES. The >>> request fails with ILLEGAL REQUEST, sd_done() sets ->no_write_same and >>> updates queue_limits, ILLEGAL REQUEST is translated into -EREMOTEIO, >>> which is returned from submit_bio_wait(). Manual zeroing is not >>> allowed, so we must return an error, but it shouldn't be -EREMOTEIO if >>> queue_limits just got updated because of ILLEGAL REQUEST. Without this >>> conditional, we'd get >> >> Hmm. I think we'd be better off just doing this before the retry loop: >> >> if (ret && try_write_zeroes) { >> if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { >> try_write_zeroes = false; >> goto retry; >> } >> ret = -EOPNOTSUPP; >> } > > This would unconditionally overwrite any WRITE ZEROES error. If we get > e.g. -EIO, and manual zeroing is not allowed, I don't think we want to > return -EOPNOTSUPP? > > Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't > make sense to me, because manual zeroing is always supported, just not > always allowed. Ping... Thanks, Ilya