Re: [PATCH 0/2] block/SCSI MQ: two RESTART related patches
On Tue, Oct 17, 2017 at 01:04:16PM +0800, Ming Lei wrote:
> Hi Jens,
>
> The 1st patch runs an idle hctx after a delay in scsi_mq_get_budget(),
> so that we keep the same behaviour as before, and it can be thought of
> as a fix.
>
> The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED
> from the current blk-mq RESTART mechanism because SCSI_MQ can cover its
> restart by itself, so there is no need to handle TAG_SHARED in blk-mq
> RESTART. An IOPS boost of over 20% is observed in my random read test
> over scsi_debug.
>
> John, please test these two patches and see if they improve your SAS
> IO performance; you can find the two patches in the following branch:
>
> https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1

Forgot to mention: you need to either pull the above branch directly or
apply the two patches against the for-next branch of Jens' block tree:

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git #for-next

-- 
Ming
[PATCH 2/2] blk-mq: don't handle TAG_SHARED in restart
Now restart is used in the following cases, and TAG_SHARED is for SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if a resource at the target/host level isn't satisfied, this SCSI
  device will be added to shost->starved_list, and the whole queue will
  be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any
  request initiated from this host/target is completed. Note that
  host-level resources are always respected by blk-mq before running
  .queue_rq().
- the same is true if a resource at the queue level isn't satisfied.
- if there isn't any outstanding request on this queue, then SCSI's
  RESTART can't work (nor can blk-mq's), so the queue will be run after
  SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by
  SCSI's RESTART when this request is finished.

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't any in-progress request on this queue, the queue will
  be run after SCSI_QUEUE_DELAY
- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag
  allocation.

In short, SCSI's built-in RESTART is enough to cover itself, so we don't
need to pay special attention to TAG_SHARED wrt. restart.
Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 78 ++--------------------------------------------
 1 file changed, 4 insertions(+), 74 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index df8581bb0a37..daab27feb653 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -68,25 +68,17 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
-static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 {
 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-		return false;
-
-	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		return;
 
-		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
-	} else
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
 	if (blk_mq_hctx_has_pending(hctx)) {
 		blk_mq_run_hw_queue(hctx, true);
-		return true;
+		return;
 	}
-
-	return false;
 }
 
 /* return true if hctx need to run again */
@@ -385,68 +377,6 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	return true;
 }
 
-/**
- * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list
- * @pos:	loop cursor.
- * @skip:	the list element that will not be examined. Iteration starts at
- *		@skip->next.
- * @head:	head of the list to examine. This list must have at least one
- *		element, namely @skip.
- * @member:	name of the list_head structure within typeof(*pos).
- */
-#define list_for_each_entry_rcu_rr(pos, skip, head, member)		\
-	for ((pos) = (skip);						\
-	     (pos = (pos)->member.next != (head) ? list_entry_rcu(	\
-			(pos)->member.next, typeof(*pos), member) :	\
-		   list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \
-	     (pos) != (skip); )
-
-/*
- * Called after a driver tag has been freed to check whether a hctx needs to
- * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
- * queues in a round-robin fashion if the tag set of @hctx is shared with other
- * hardware queues.
- */
-void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
-{
-	struct blk_mq_tags *const tags = hctx->tags;
-	struct blk_mq_tag_set *const set = hctx->queue->tag_set;
-	struct request_queue *const queue = hctx->queue, *q;
-	struct blk_mq_hw_ctx *hctx2;
-	unsigned int i, j;
-
-	if (set->flags & BLK_MQ_F_TAG_SHARED) {
-		/*
-		 * If this is 0, then we know that no hardware queues
-		 * have RESTART marked. We're done.
-		 */
-		if (!atomic_read(&queue->shared_hctx_restart))
-			return;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
-					   tag_set_list) {
-			queue_for_each_hw_ctx(q, hctx2, i)
-				if (hctx2->tags == tags &&
-				    blk_mq_sched_restart_hctx(hctx2))
-					goto done;
-		}
-		j = hctx->queue_num + 1;
-		for (i = 0; i < queue->nr_hw_queues; i++, j++) {
-			if (j == queue->nr_hw_queues)
-				j = 0;
-			hctx2 =
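The case analysis in the commit message above can be sketched as a small decision table. This is an illustrative userspace model, not kernel code; the enum names and the who_reruns() helper are hypothetical, introduced only to make the coverage argument explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: for each way a request can fail to dispatch,
 * which mechanism is responsible for rerunning the hardware queue? */
enum dispatch_failure {
	GET_BUDGET_FAILED,	/* .get_budget() returned BLK_STS_RESOURCE */
	QUEUE_RQ_FAILED,	/* scsi_dispatch_cmd() returned BLK_STS_RESOURCE */
	DRIVER_TAG_FAILED,	/* blk_mq_get_driver_tag() failed */
};

enum rerun_mechanism {
	SCSI_RESTART,		/* SCSI's built-in restart in scsi_end_request() */
	DELAYED_RUN,		/* blk_mq_delay_run_hw_queue(SCSI_QUEUE_DELAY) */
	TAG_WAITING,		/* BLK_MQ_S_TAG_WAITING wakeup */
};

static enum rerun_mechanism who_reruns(enum dispatch_failure why,
				       bool has_outstanding_reqs)
{
	switch (why) {
	case GET_BUDGET_FAILED:
	case QUEUE_RQ_FAILED:
		/* With no request in flight there is no completion to
		 * trigger SCSI's restart, so a delayed run is needed. */
		return has_outstanding_reqs ? SCSI_RESTART : DELAYED_RUN;
	case DRIVER_TAG_FAILED:
		return TAG_WAITING;
	}
	return DELAYED_RUN;	/* not reached */
}
```

The point of the table is that none of the rows needs blk-mq's TAG_SHARED restart handling, which is why the patch can remove it.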
[PATCH 1/2] SCSI: run idle hctx after delay in scsi_mq_get_budget()
If there isn't any outstanding request in this queue, both blk-mq's
RESTART and SCSI's built-in RESTART can't work, so we have to deal with
this case by running this queue after a delay.

Fixes: d04b6d97d0a1 ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6f10afaca25b..08c495364b28 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1959,6 +1959,14 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	put_device(&sdev->sdev_gendev);
 }
 
+static void scsi_mq_run_idle_hctx(struct scsi_device *sdev,
+				  struct blk_mq_hw_ctx *hctx)
+{
+	if (atomic_read(&sdev->device_busy) == 0 &&
+	    !scsi_device_blocked(sdev))
+		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+}
+
 static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -1989,6 +1997,7 @@ static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 out_put_device:
 	put_device(&sdev->sdev_gendev);
 out:
+	scsi_mq_run_idle_hctx(sdev, hctx);
 	return BLK_STS_RESOURCE;
 }
 
@@ -2039,9 +2048,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		scsi_mq_run_idle_hctx(sdev, hctx);
 		break;
 	default:
 		/*
-- 
2.9.5
[PATCH 0/2] block/SCSI MQ: two RESTART related patches
Hi Jens,

The 1st patch runs an idle hctx after a delay in scsi_mq_get_budget(),
so that we keep the same behaviour as before, and it can be thought of
as a fix.

The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED
from the current blk-mq RESTART mechanism because SCSI_MQ can cover its
restart by itself, so there is no need to handle TAG_SHARED in blk-mq
RESTART. An IOPS boost of over 20% is observed in my random read test
over scsi_debug.

John, please test these two patches and see if they improve your SAS IO
performance; you can find the two patches in the following branch:

https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1

Ming Lei (2):
  SCSI: run idle hctx after delay in scsi_mq_get_budget()
  blk-mq: don't handle TAG_SHARED in restart

 block/blk-mq-sched.c    | 78 ++++---------------------------------------------
 drivers/scsi/scsi_lib.c | 13 +++++++++---
 2 files changed, 14 insertions(+), 77 deletions(-)

-- 
2.9.5
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
Bart,

> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.

The SCSI bits look fine to me.

Acked-by: Martin K. Petersen

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
Bart,

> Convert blk_get_request(q, op, __GFP_RECLAIM) into
> blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not
> change any functionality.

Acked-by: Martin K. Petersen

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [RFC] bsg-lib interface cleanup
Christoph,

> this series cleans up various abuses of the bsg interfaces, and then
> splits bsg for SCSI passthrough from bsg for arbitrary transport
> passthrough. This removes the scsi_request abuse in bsg-lib that is
> very confusing, and also makes sure we can sanity check the requests
> we get. The current code will happily execute scsi commands against
> bsg-lib queues, and transport pass through against scsi nodes, without
> any indication to the driver that we are doing the wrong thing.

I applied patches 2-5 to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
[bug report] A race between blk_cleanup_queue and blk_timeout_work
Hi Jens, Christoph,

There is a scenario: when we unplug disks while IO is running on them,
we find that IO is blocked forever, as follows:

..
Jobs: 3 (f=3): [M_MM__] [89.7% done] [0K/0K /s] [0 /0 iops] [eta 00m:36s]
..

I find there is a race between blk_cleanup_queue and blk_timeout_work
(kernel is 4.14.0-rc1):

(1) Disk-removal process

When a disk is unplugged, scsi_remove_target is called to delete it:

scsi_remove_target -> __scsi_remove_target -> scsi_remove_device ->
__scsi_remove_device -> blk_cleanup_queue -> blk_freeze_queue ->
... -> __blk_drain_queue

scsi_remove_target calls blk_cleanup_queue, which calls blk_freeze_queue
and __blk_drain_queue. In blk_freeze_queue, for !blk_mq (our driver
satisfies this) it kills q->q_usage_counter. __blk_drain_queue is a loop
with condition=true; the function only exits when drain == 0. If all IOs
have ended it exits; otherwise it waits and re-checks the number of
unfinished IOs every 10ms.

(2) Timeout process

For every IO from the block layer, on timeout blk_timeout_work is called.
blk_timeout_work first checks blk_queue_enter, which tries to get
q->q_usage_counter; if that fails, it returns directly and never enters
the timeout handling.

So on disk unplug, the removal process kills q->q_usage_counter in
blk_cleanup_queue. If there are IOs which have not finished, they wait
for the timeout; when the timeout fires, blk_timeout_work tries to get
q->q_usage_counter, but since it was already killed in blk_freeze_queue,
this fails, the timeout handling is skipped, and those IOs are never
processed. Meanwhile __blk_drain_queue loops forever because those IOs
never end.

I added a printk in blk_timeout_work as follows; when this issue occurs,
I can see the printk happen:

void blk_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);
	unsigned long flags, next = 0;
	struct request *rq, *tmp;
	int next_set = 0;

	if (blk_queue_enter(q, true)) {
		pr_err("%s %d\n", __func__, __LINE__); /* <- I add printk here */
		return;
	}

	spin_lock_irqsave(q->queue_lock, flags);

	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
		blk_rq_check_expired(rq, &next, &next_set);

	if (next_set)
		mod_timer(&q->timeout, round_jiffies_up(next));

	spin_unlock_irqrestore(q->queue_lock, flags);
	blk_queue_exit(q);
}

regards,
shawn
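The deadlock described above can be condensed into a userspace model. This is a hypothetical sketch, not kernel code: the struct and helpers only mimic the interaction between the killed q_usage_counter, the bailing timeout worker, and the drain loop's exit condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the race: once the queue's usage counter has
 * been "killed" by blk_freeze_queue(), blk_queue_enter() fails, so the
 * timeout worker returns without completing expired requests, and
 * __blk_drain_queue() can never see the outstanding count reach zero. */
struct queue_model {
	bool usage_counter_killed;	/* percpu_ref killed by blk_freeze_queue() */
	int outstanding_requests;	/* requests __blk_drain_queue() waits on */
};

static bool queue_enter(const struct queue_model *q)
{
	/* percpu_ref_tryget_live() fails after the ref has been killed */
	return !q->usage_counter_killed;
}

/* Returns true if expired requests were actually processed. */
static bool timeout_work(struct queue_model *q)
{
	if (!queue_enter(q))
		return false;		/* bails out: timed-out IOs never end */
	q->outstanding_requests = 0;	/* requests completed via timeout path */
	return true;
}

static bool drain_would_finish(const struct queue_model *q)
{
	return q->outstanding_requests == 0;
}
```

Running the model with the counter killed first shows timeout_work() giving up while outstanding requests remain, which is exactly why __blk_drain_queue spins forever.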
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> On 10/13/2017 07:29 PM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue
> >>>>> depth is 3,
> >>>>
> >>>> Sorry but I doubt whether that is correct. More in general, I don't
> >>>> know any modern storage HBA for which the default queue depth is so
> >>>> low.
> >>>
> >>> You can grep:
> >>>
> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>
> >> Such a low queue depth will result in suboptimal performance for
> >> adapters that communicate over a storage network. I think that's a bug
> >> and that both adapters support much higher cmd_per_lun values.
> >>
> >> (+James Smart)
> >>
> >> James, can you explain us why commit 445cf4f4d2aa decreased
> >> LPFC_CMD_PER_LUN from 30 to 3? Was that perhaps a workaround for a bug
> >> in a specific target implementation?
> >>
> >> (+Himanshu Madhani)
> >>
> >> Himanshu, do you perhaps know whether it is safe to increase
> >> cmd_per_lun for the qla2xxx initiator driver to the
> >> scsi_host->can_queue value?
> >
> > ->can_queue is the size of the whole tag space shared by all LUNs, so
> > it doesn't look reasonable to increase cmd_per_lun to .can_queue.
> >
> '3' is just a starting point; later on it'll be adjusted via
> scsi_change_depth(). Looks like it's not working correctly with blk-mq,
> though.

By default, in scsi_alloc_sdev(), q->queue_depth is set to
host->cmd_per_lun. You are right that q->queue_depth can be adjusted
later too; q->queue_depth is respected in scsi_dev_queue_ready().

.cmd_per_lun defines the max outstanding cmds for each LUN; I guess it
is respected by some hardware internally. For example, I remember that
on lpfc q->queue_depth is 30 because the default 'lpfc_lun_queue_depth'
is 30, while its .cmd_per_lun is 3. Per my observation, this
.cmd_per_lun limit is still workable.

-- 
Ming
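The relationship discussed above, where cmd_per_lun only seeds the per-LUN depth that is later adjusted and enforced at dispatch time, can be sketched as a toy model. All names here are hypothetical stand-ins; the real logic lives in scsi_alloc_sdev(), scsi_change_queue_depth(), and scsi_dev_queue_ready().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: cmd_per_lun is only the initial per-LUN queue
 * depth; the depth may be raised later (e.g. lpfc's default
 * lpfc_lun_queue_depth of 30 vs. its .cmd_per_lun of 3), and dispatch
 * is gated on the current depth, not on cmd_per_lun. */
struct sdev_model {
	int queue_depth;	/* current per-LUN limit */
	int device_busy;	/* outstanding commands on this LUN */
};

static void sdev_alloc(struct sdev_model *sdev, int cmd_per_lun)
{
	sdev->queue_depth = cmd_per_lun;	/* starting point only */
	sdev->device_busy = 0;
}

static void change_queue_depth(struct sdev_model *sdev, int depth)
{
	sdev->queue_depth = depth;		/* adjusted after probe */
}

static bool dev_queue_ready(const struct sdev_model *sdev)
{
	return sdev->device_busy < sdev->queue_depth;
}
```

With the lpfc numbers from the thread, a LUN seeded at depth 3 stalls with 3 commands in flight, but dispatch resumes once the depth is raised to 30.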
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On Mon, Oct 16, 2017 at 04:29:04PM -0700, Bart Van Assche wrote:
> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.
>
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why while
> the hibernation image is being written or restored that SCSI devices
> are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid that a full queue prevents power management requests
> to be submitted by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
>
> for d in /sys/class/block/sd*[a-z]; do
>     hcil=$(readlink "$d/device")
>     hcil=${hcil#../../../}
>     echo 4 > "$d/queue/nr_requests"
>     echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
>     --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>     --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
>
> Reported-by: Oleksandr Natalenko
> References: "I/O hangs after resuming from suspend-to-ram"
> (https://marc.info/?l=linux-block&m=150340235201348)
> Signed-off-by: Bart Van Assche
> Tested-by: Martin Steigerwald
> Cc: Martin K. Petersen
> Cc: Ming Lei
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Johannes Thumshirn
> ---
>  block/blk-core.c           | 42 +++++++++++++++++++++++++++++++-------
>  block/blk-mq.c             |  4 ++--
>  block/blk-timeout.c        |  2 +-
>  drivers/scsi/scsi_lib.c    | 40 +++++++++++++++++++++++++++---------
>  fs/block_dev.c             |  4 ++--
>  include/linux/blkdev.h     |  2 +-
>  include/scsi/scsi_device.h |  1 +
>  7 files changed, 71 insertions(+), 24 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b73203286bf5..abb5095bcef6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> +	wake_up_all(&q->mq_freeze_wq);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
> @@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + */
> +int blk_queue_enter(struct request_queue *q, unsigned int flags)
>  {
> +	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> +
>  	while (true) {
> +		bool success = false;
>  		int ret;
>  
> -		if (percpu_ref_tryget_live(&q->q_usage_counter))
> +		rcu_read_lock_sched();
> +		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> +			/*
> +			 * The code that sets the PREEMPT_ONLY flag is
> +			 * responsible for ensuring that that flag is globally
> +			 * visible before the queue is unfrozen.
> +			 */
> +			if (preempt || !blk_queue_preempt_only(q)) {
> +				success = true;
> +			} else {
> +				percpu_ref_put(&q->q_usage_counter);
> +				WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
> +					  kobject_name(q->kobj.parent));
> +			}
> +		}
> +		rcu_read_unlock_sched();
> +
> +		if (success)
>  			return 0;
>  
> -		if (nowait)
> +		if (flags & BLK_MQ_REQ_NOWAIT)
>  			return -EBUSY;
>  
>  		/*
> @@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>  		smp_rmb();
>  
>  		ret = wait_event_interruptible(q->mq_freeze_wq, >
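The gate that this patch adds to blk_queue_enter() can be modeled with a few lines of userspace C. This is a hypothetical sketch of the control flow only (struct and function names are invented); the real code additionally takes an RCU read lock and drops the percpu ref when it refuses a non-preempt request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the blk_queue_enter() gate: once the usage
 * counter ref is obtained, a request proceeds only if it is a preempt
 * request or the queue is not in preempt-only mode. */
struct queue_gate {
	bool ref_live;		/* q_usage_counter not frozen */
	bool preempt_only;	/* QUEUE_FLAG_PREEMPT_ONLY set */
};

static bool queue_enter_ok(const struct queue_gate *q, bool preempt)
{
	if (!q->ref_live)
		return false;			/* percpu_ref_tryget_live() failed */
	return preempt || !q->preempt_only;	/* non-preempt requests deferred */
}
```

This is what keeps a full queue of normal IO from starving power-management (RQF_PREEMPT) requests: while quiesced, only preempt requests pass the gate.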
Re: [PATCH v2] blk-mq: Make blk_mq_get_request() error path less confusing
On Mon, Oct 16, 2017 at 04:32:26PM -0700, Bart Van Assche wrote:
> blk_mq_get_tag() can modify data->ctx. This means that in the
> error path of blk_mq_get_request() data->ctx should be passed to
> blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
> ignores its argument, this patch does not change any functionality.
>
> References: commit 1ad43c0078b7 ("blk-mq: don't leak preempt
> counter/q_usage_counter when allocating rq failed")
> Signed-off-by: Bart Van Assche
> Cc: Ming Lei
> ---
> Changes compared to v1:
> - Removed "Cc: "
> - Removed "bug fix" from the description.
>
>  block/blk-mq.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 097ca3ece716..d9b3b5dc66ab 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -336,12 +336,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  	struct elevator_queue *e = q->elevator;
>  	struct request *rq;
>  	unsigned int tag;
> -	struct blk_mq_ctx *local_ctx = NULL;
> +	bool put_ctx_on_error = false;
>  
>  	blk_queue_enter_live(q);
>  	data->q = q;
> -	if (likely(!data->ctx))
> -		data->ctx = local_ctx = blk_mq_get_ctx(q);
> +	if (likely(!data->ctx)) {
> +		data->ctx = blk_mq_get_ctx(q);
> +		put_ctx_on_error = true;
> +	}
>  	if (likely(!data->hctx))
>  		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  	if (op & REQ_NOWAIT)
> @@ -360,8 +362,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  
>  	tag = blk_mq_get_tag(data);
>  	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (local_ctx) {
> -			blk_mq_put_ctx(local_ctx);
> +		if (put_ctx_on_error) {
> +			blk_mq_put_ctx(data->ctx);
>  			data->ctx = NULL;
>  		}
>  		blk_queue_exit(q);
> -- 
> 2.14.2

Reviewed-by: Ming Lei

-- 
Ming
Re: [PATCH v2] block/aoe: Convert timers to use timer_setup()
On Fri, Oct 6, 2017 at 7:19 AM, Jens Axboe wrote:
> On 10/05/2017 05:13 PM, Kees Cook wrote:
>> In preparation for unconditionally passing the struct timer_list pointer to
>> all timer callbacks, switch to using the new timer_setup() and from_timer()
>> to pass the timer pointer explicitly.
>
> Applied to for-4.15/timer

Hi, I just wanted to check what your timer plans were for merging this
into -next. (I'm doing rebasing to find out which maintainers I need to
resend patches to, and I noticed block hasn't appeared in -next, but I
know you've pulled patches...)

Thanks!

-Kees

-- 
Kees Cook
Pixel Security
[PATCH v2] blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the error path
of blk_mq_get_request() data->ctx should be passed to blk_mq_put_ctx()
instead of local_ctx. Note: since blk_mq_put_ctx() ignores its argument,
this patch does not change any functionality.

References: commit 1ad43c0078b7 ("blk-mq: don't leak preempt
counter/q_usage_counter when allocating rq failed")
Signed-off-by: Bart Van Assche
Cc: Ming Lei
---
Changes compared to v1:
- Removed "Cc: "
- Removed "bug fix" from the description.

 block/blk-mq.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 097ca3ece716..d9b3b5dc66ab 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -336,12 +336,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
 	unsigned int tag;
-	struct blk_mq_ctx *local_ctx = NULL;
+	bool put_ctx_on_error = false;
 
 	blk_queue_enter_live(q);
 	data->q = q;
-	if (likely(!data->ctx))
-		data->ctx = local_ctx = blk_mq_get_ctx(q);
+	if (likely(!data->ctx)) {
+		data->ctx = blk_mq_get_ctx(q);
+		put_ctx_on_error = true;
+	}
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 	if (op & REQ_NOWAIT)
@@ -360,8 +362,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
 	tag = blk_mq_get_tag(data);
 	if (tag == BLK_MQ_TAG_FAIL) {
-		if (local_ctx) {
-			blk_mq_put_ctx(local_ctx);
+		if (put_ctx_on_error) {
+			blk_mq_put_ctx(data->ctx);
 			data->ctx = NULL;
 		}
 		blk_queue_exit(q);
-- 
2.14.2
[PATCH v9 01/10] md: Rename md_notifier into md_reboot_notifier
This avoids confusion with the pm notifier that will be added through a
later patch.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
---
 drivers/md/md.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5d61049e7417..8933cafc212d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this,
 	return NOTIFY_DONE;
 }
 
-static struct notifier_block md_notifier = {
+static struct notifier_block md_reboot_notifier = {
 	.notifier_call	= md_notify_reboot,
 	.next		= NULL,
 	.priority	= INT_MAX, /* before any real devices */
@@ -9003,7 +9003,7 @@ static int __init md_init(void)
 	blk_register_region(MKDEV(mdp_major, 0), 1UL<
[PATCH v9 08/10] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
This flag will be used in the next patch to let the block layer core
know whether or not a SCSI request queue has been quiesced. A quiesced
SCSI queue namely only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 30 ++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.c |  1 +
 include/linux/blkdev.h |  6 ++++++
 3 files changed, 37 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ab835d3c2f39..b73203286bf5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,36 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+/**
+ * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * @q: request queue pointer
+ *
+ * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
+ * set and 1 if the flag was already set.
+ */
+int blk_set_preempt_only(struct request_queue *q)
+{
+	unsigned long flags;
+	int res;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return res;
+}
+EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+
+void blk_clear_preempt_only(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q:	The queue to run
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7f4a1ba532af..75f31535f280 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -74,6 +74,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(REGISTERED),
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
+	QUEUE_FLAG_NAME(PREEMPT_ONLY),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 05203175eb9c..864ad2e4a58c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -630,6 +630,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED	26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED	28	/* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -730,6 +731,11 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
			     REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)				\
+	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern int blk_set_preempt_only(struct request_queue *q);
+extern void blk_clear_preempt_only(struct request_queue *q);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.2
[PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
Convert blk_get_request(q, op, __GFP_RECLAIM) into
blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not
change any functionality.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Acked-by: David S. Miller [ for IDE ]
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/ide/ide-pm.c    | 4 ++--
 drivers/scsi/scsi_lib.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f56d742908df 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+	rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
+				   BLK_MQ_REQ_PREEMPT);
 	ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-	rq->rq_flags |= RQF_PREEMPT;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_RESUME;
 	rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6f10afaca25b..7c119696402c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
+	req = blk_get_request_flags(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->retries = retries;
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
-	req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+	req->rq_flags |= rq_flags | RQF_QUIET;
 
 	/*
 	 * head injection *required* here otherwise quiesce won't work
-- 
2.14.2
[PATCH v9 06/10] block: Introduce BLK_MQ_REQ_PREEMPT
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 4 +++-
 block/blk-mq.c         | 2 ++
 include/linux/blk-mq.h | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 559818be8e02..ab835d3c2f39 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1260,6 +1260,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	blk_rq_set_rl(rq, rl);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;
+	if (flags & BLK_MQ_REQ_PREEMPT)
+		rq->rq_flags |= RQF_PREEMPT;
 
 	/* init elvpriv */
 	if (rq_flags & RQF_ELVPRIV) {
@@ -1441,7 +1443,7 @@ struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
 	struct request *req;
 
 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 59b7de6b616b..6a025b17caac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -290,6 +290,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->q = data->q;
 	rq->mq_ctx = data->ctx;
 	rq->cmd_flags = op;
+	if (data->flags & BLK_MQ_REQ_PREEMPT)
+		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
 		rq->rq_flags |= RQF_IO_STAT;
 	/* do not touch atomic flags, it needs atomic ops against the timer */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e5e6becd57d3..22c7f36745fc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -213,6 +213,7 @@ enum {
 	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
 	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
 	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
+	BLK_MQ_REQ_PREEMPT	= (1 << 3), /* set RQF_PREEMPT */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.14.2
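The translation the patch performs, from the allocation-time flag BLK_MQ_REQ_PREEMPT to the per-request flag RQF_PREEMPT, is a simple bit mapping. Below is a hedged standalone sketch; the M_-prefixed macros are invented stand-ins (the RQF_PREEMPT bit position in particular is illustrative, not the kernel's actual value).

```c
#include <assert.h>

/* Hypothetical model of the mapping added in __get_request() and
 * blk_mq_rq_ctx_init(): an allocation flag becomes a request flag. */
#define M_BLK_MQ_REQ_NOWAIT	(1u << 0)
#define M_BLK_MQ_REQ_PREEMPT	(1u << 3)
#define M_RQF_PREEMPT		(1u << 8)	/* illustrative bit position */

static unsigned int rq_flags_from_alloc_flags(unsigned int flags)
{
	unsigned int rq_flags = 0;

	if (flags & M_BLK_MQ_REQ_PREEMPT)
		rq_flags |= M_RQF_PREEMPT;	/* only PREEMPT is propagated */
	return rq_flags;
}
```

Keeping the translation at allocation time is what lets the next patch in the series gate blk_queue_enter() on the same flag before a request even exists.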
[PATCH v9 02/10] md: Introduce md_stop_all_writes()
Introduce md_stop_all_writes() because the next patch will add a second
caller for this function. This patch does not change any functionality.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
---
 drivers/md/md.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8933cafc212d..b99584e5d6b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
-static int md_notify_reboot(struct notifier_block *this,
-			    unsigned long code, void *x)
+static void md_stop_all_writes(void)
 {
 	struct list_head *tmp;
 	struct mddev *mddev;
@@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block *this,
 	 */
 	if (need_delay)
 		mdelay(1000*1);
+}
+
+static int md_notify_reboot(struct notifier_block *this,
+			    unsigned long code, void *x)
+{
+	md_stop_all_writes();
 
 	return NOTIFY_DONE;
 }
-- 
2.14.2
[PATCH v9 05/10] block: Introduce blk_get_request_flags()
A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed from
GFP_KERNEL into __GFP_DIRECT_RECLAIM.

Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c       | 50 +++---
 include/linux/blkdev.h |  3 +++
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 23a0110fc62b..559818be8e02 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1157,7 +1157,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
  * @rl: request list to allocate from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags
  *
  * Get a free request from @q. This function may fail under memory
  * pressure or if @q is dead.
@@ -1167,7 +1167,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-		struct bio *bio, gfp_t gfp_mask)
+		struct bio *bio, unsigned int flags)
 {
 	struct request_queue *q = rl->q;
 	struct request *rq;
@@ -1176,6 +1176,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	struct io_cq *icq = NULL;
 	const bool is_sync = op_is_sync(op);
 	int may_queue;
+	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+		__GFP_DIRECT_RECLAIM;
 	req_flags_t rq_flags = RQF_ALLOCED;

 	lockdep_assert_held(q->queue_lock);
@@ -1336,7 +1338,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * @q: request_queue to allocate request from
 * @op: operation and flags
 * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags.
 *
 * Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
 * this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1346,7 +1348,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-		struct bio *bio, gfp_t gfp_mask)
+		struct bio *bio, unsigned int flags)
 {
 	const bool is_sync = op_is_sync(op);
 	DEFINE_WAIT(wait);
@@ -1358,7 +1360,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
-	rq = __get_request(rl, op, bio, gfp_mask);
+	rq = __get_request(rl, op, bio, flags);
 	if (!IS_ERR(rq))
 		return rq;
@@ -1367,7 +1369,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 		return ERR_PTR(-EAGAIN);
 	}
-	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
+	if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
 	}
@@ -1394,10 +1396,13 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	goto retry;
 }

+/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-		unsigned int op, gfp_t gfp_mask)
+		unsigned int op, unsigned int flags)
 {
 	struct request *rq;
+	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+		__GFP_DIRECT_RECLAIM;
 	int ret = 0;

 	WARN_ON_ONCE(q->mq_ops);
@@ -1410,7 +1415,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	if (ret)
 		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
-	rq = get_request(q, op, NULL, gfp_mask);
+	rq = get_request(q, op, NULL, flags);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
 		blk_queue_exit(q);
@@ -1424,25 +1429,40 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }

-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-		gfp_t gfp_mask)
+/**
+ * blk_get_request_flags - allocate a request
+ * @q: request queue to allocate a request for
+ * @op: operation (REQ_OP_*) and REQ_* flags, e.g.
[PATCH v9 04/10] block: Make q_usage_counter also track legacy requests
From: Ming Lei

This patch makes it possible to pause request allocation for the legacy
block layer by calling blk_mq_freeze_queue() and blk_mq_unfreeze_queue().

Signed-off-by: Ming Lei
[ bvanassche: Combined two patches into one, edited a comment and made
  sure REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Tested-by: Martin Steigerwald
Cc: Ming Lei
Cc: Hannes Reinecke
---
 block/blk-core.c | 12
 block/blk-mq.c   | 10 ++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e8e149ccc86b..23a0110fc62b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,9 @@ void blk_set_queue_dying(struct request_queue *q)
 		}
 		spin_unlock_irq(q->queue_lock);
 	}
+
+	/* Make blk_queue_enter() reexamine the DYING flag. */
+	wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);

@@ -1395,16 +1398,22 @@ static struct request *blk_old_get_request(struct request_queue *q,
 		unsigned int op, gfp_t gfp_mask)
 {
 	struct request *rq;
+	int ret = 0;

 	WARN_ON_ONCE(q->mq_ops);

 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);

+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+			      (op & REQ_NOWAIT));
+	if (ret)
+		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
+		blk_queue_exit(q);
 		return rq;
 	}

@@ -1576,6 +1585,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_free_request(rl, req);
 		freed_request(rl, sync, rq_flags);
 		blk_put_rl(rl);
+		blk_queue_exit(q);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);

@@ -1857,8 +1867,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
+	blk_queue_enter_live(q);
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		blk_queue_exit(q);
 		__wbt_done(q->rq_wb, wb_acct);
 		if (PTR_ERR(req) == -ENOMEM)
 			bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 097ca3ece716..59b7de6b616b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);

@@ -255,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	queue_for_each_hw_ctx(q, hctx, i)
 		if (blk_mq_hw_queue_mapped(hctx))
 			blk_mq_tag_wakeup_all(hctx->tags, true);
-
-	/*
-	 * If we are called because the queue has now been marked as
-	 * dying, we need to ensure that processes currently waiting on
-	 * the queue are notified as well.
-	 */
-	wake_up_all(&q->mq_freeze_wq);
 }

 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
--
2.14.2
[PATCH v9 10/10] block, nvme: Introduce blk_mq_req_flags_t
Several block layer and NVMe core functions accept a combination of
BLK_MQ_REQ_* flags through the 'flags' argument but there is no
verification at compile time whether the right type of block layer flags
is passed. Make it possible for sparse to verify this. This patch does
not change any functionality.

Signed-off-by: Bart Van Assche
Cc: linux-n...@lists.infradead.org
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Ming Lei
---
 block/blk-core.c          | 12 ++--
 block/blk-mq.c            |  4 ++--
 block/blk-mq.h            |  2 +-
 drivers/nvme/host/core.c  |  5 +++--
 drivers/nvme/host/nvme.h  |  5 +++--
 include/linux/blk-mq.h    | 17 +++--
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h    |  4 ++--
 8 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abb5095bcef6..d1f1694f2e1f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -799,7 +799,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
 * @q: request queue pointer
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
 */
-int blk_queue_enter(struct request_queue *q, unsigned int flags)
+int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
 	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

@@ -1224,7 +1224,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-				     struct bio *bio, unsigned int flags)
+				     struct bio *bio, blk_mq_req_flags_t flags)
 {
 	struct request_queue *q = rl->q;
 	struct request *rq;
@@ -1407,7 +1407,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * Returns request pointer on success, with @q->queue_lock *not held*.
 */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-		struct bio *bio, unsigned int flags)
+		struct bio *bio, blk_mq_req_flags_t flags)
 {
 	const bool is_sync = op_is_sync(op);
 	DEFINE_WAIT(wait);
@@ -1457,7 +1457,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-		unsigned int op, unsigned int flags)
+		unsigned int op, blk_mq_req_flags_t flags)
 {
 	struct request *rq;
 	gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
@@ -1494,7 +1494,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 * @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
 */
 struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
-				      unsigned int flags)
+				      blk_mq_req_flags_t flags)
 {
 	struct request *req;
@@ -2290,7 +2290,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bio->bi_disk->queue;
-		unsigned int flags = bio->bi_opf & REQ_NOWAIT ?
+		blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
 			BLK_MQ_REQ_NOWAIT : 0;

 		if (likely(blk_queue_enter(q, flags) == 0)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c6bff60e6b8b..c037b1ad64a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -380,7 +380,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 }

 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-		unsigned int flags)
+		blk_mq_req_flags_t flags)
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
@@ -406,7 +406,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 EXPORT_SYMBOL(blk_mq_alloc_request);

 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
-	unsigned int op, unsigned int flags, unsigned int hctx_idx)
+	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 522b420dedc0..5dcfe4fa5e0d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -110,7 +110,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
-	unsigned int flags;
+	blk_mq_req_flags_t flags;
 	unsigned int shallow_depth;
[PATCH v9 03/10] md: Neither resync nor reshape while the system is frozen
Some people use the md driver on laptops and use the suspend and resume
functionality. Since it is essential that submitting of new I/O requests
stops before a hibernation image is created, interrupt the md resync and
reshape actions if the system is being frozen. Note: the resync and
reshape will restart after the system is resumed and a message similar
to the following will appear in the system log:

md: md0: data-check interrupted.

Signed-off-by: Bart Van Assche
Reviewed-by: Shaohua Li
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/md/md.c | 50 +-
 drivers/md/md.h |  8
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b99584e5d6b1..8b2fc750f97f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include "md.h"
@@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
 	 */
 	allow_signal(SIGKILL);
+	set_freezable();
 	while (!kthread_should_stop()) {

 		/* We need to wait INTERRUPTIBLE so that
@@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
 		if (signal_pending(current))
 			flush_signals(current);
-		wait_event_interruptible_timeout
+		wait_event_freezable_timeout
 		(thread->wqueue,
 		 test_bit(THREAD_WAKEUP, &thread->flags)
 		 || kthread_should_stop() || kthread_should_park(),
@@ -8944,6 +8947,8 @@ static void md_stop_all_writes(void)
 	int need_delay = 0;

 	for_each_mddev(mddev, tmp) {
+		mddev->frozen_before_suspend =
+			test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (mddev_trylock(mddev)) {
 			if (mddev->pers)
 				__md_stop_writes(mddev);
@@ -8963,6 +8968,47 @@ static void md_stop_all_writes(void)
 		mdelay(1000*1);
 }

+static void md_restore_frozen_flag(void)
+{
+	struct list_head *tmp;
+	struct mddev *mddev;
+
+	for_each_mddev(mddev, tmp) {
+		if (mddev->frozen_before_suspend)
+			set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		else
+			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+	}
+}
+
+/*
+ * Ensure that neither resyncing nor reshaping occurs while the system is
+ * frozen.
+ */
+static int md_notify_pm(struct notifier_block *bl, unsigned long state,
+			void *unused)
+{
+	pr_debug("%s: state = %ld\n", __func__, state);
+
+	switch (state) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+	case PM_RESTORE_PREPARE:
+		md_stop_all_writes();
+		break;
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+	case PM_POST_RESTORE:
+		md_restore_frozen_flag();
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block md_pm_notifier = {
+	.notifier_call = md_notify_pm,
+};
+
 static int md_notify_reboot(struct notifier_block *this,
 			    unsigned long code, void *x)
 {
@@ -9009,6 +9055,7 @@ static int __init md_init(void)
 			    md_probe, NULL, NULL);

 	register_reboot_notifier(&md_reboot_notifier);
+	register_pm_notifier(&md_pm_notifier);
 	raid_table_header = register_sysctl_table(raid_root_table);

 	md_geninit();
@@ -9248,6 +9295,7 @@ static __exit void md_exit(void)
 	unregister_blkdev(MD_MAJOR,"md");
 	unregister_blkdev(mdp_major, "mdp");
+	unregister_pm_notifier(&md_pm_notifier);
 	unregister_reboot_notifier(&md_reboot_notifier);
 	unregister_sysctl_table(raid_table_header);

diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8287d3cd1bf..727b3aef4d26 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -356,6 +356,14 @@ struct mddev {
 	 */
 	int recovery_disabled;

+	/* Writes are stopped before hibernation, suspend and restore by
+	 * calling md_stop_all_writes(). That function sets the
+	 * MD_RECOVERY_FROZEN flag. The value of that flag is saved before
+	 * writes are stopped such that it can be restored after hibernation,
+	 * suspend or restore have finished.
+	 */
+	bool	frozen_before_suspend;
+
 	int in_sync;	/* know to not need resync */
 	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
	 * that we are never stopping an array while it is open.
--
2.14.2
[PATCH v9 00/10] block, scsi, md: Improve suspend and resume
Hello Jens,

It is known that during the resume following a hibernate, especially
when using an md RAID1 array created on top of SCSI devices, the system
sometimes hangs instead of coming up properly. This patch series fixes
that problem. These patches have been tested on top of the block layer
for-next branch. Please consider these changes for kernel v4.15.

Thanks,

Bart.

Changes between v8 and v9:
- Modified the third patch such that the MD_RECOVERY_FROZEN flag is
  restored properly after a resume.
- Modified the ninth patch such that a kernel warning is reported if
  scsi_device_quiesce() is called from multiple contexts concurrently.
- Added Reviewed-by / Tested-by tags as appropriate.

Changes between v7 and v8:
- Fixed a (theoretical?) race identified by Ming Lei.
- Added a tenth patch that checks whether the proper type of flags has
  been passed to a range of block layer functions.

Changes between v6 and v7:
- Added support for the PREEMPT_ONLY flag in blk-mq-debugfs.c.
- Fixed the kerneldoc header of blk_queue_enter().
- Added a rcu_read_lock_sched() / rcu_read_unlock_sched() pair in
  blk_set_preempt_only().
- Removed a synchronize_rcu() call from scsi_device_quiesce().
- Modified the description of patch 9/9 in this series.
- Removed the scsi_run_queue() call from scsi_device_resume().

Changes between v5 and v6:
- Split an md patch into two patches to make it easier to review the
  changes.
- For the md patch that suspends I/O while the system is frozen,
  switched back to the freezer mechanism because this makes the code
  shorter and easier to review.
- Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into
  EXPORT_SYMBOL_GPL().
- Made blk_set_preempt_only() behave as a test-and-set operation.
- Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as
  requested by Christoph and reduced the number of arguments of
  blk_queue_enter() back from three to two.
- In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of
  a critical section. Made the explanation of why the synchronize_rcu()
  call is necessary more detailed.

Changes between v4 and v5:
- Split blk_set_preempt_only() into two functions as requested by
  Christoph.
- Made blk_get_request() trigger WARN_ONCE() if an attempt is made to
  allocate a request while the system is frozen. This is a big help to
  identify drivers that submit I/O while the system is frozen.
- Since kernel thread freezing is on its way out, reworked the approach
  for avoiding that the md driver submits I/O while the system is
  frozen such that the approach no longer depends on the kernel thread
  freeze mechanism.
- Fixed the (theoretical) deadlock in scsi_device_quiesce() that was
  identified by Ming.

Changes between v3 and v4:
- Made sure that this patch series not only works for scsi-mq but also
  for the legacy SCSI stack.
- Removed an smp_rmb()/smp_wmb() pair from the hot path and added a
  comment that explains why that is safe.
- Reordered the patches in this patch series.

Changes between v2 and v3:
- Made md kernel threads freezable.
- Changed the approach for quiescing SCSI devices again.
- Addressed Ming's review comments.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.

Bart Van Assche (9):
  md: Rename md_notifier into md_reboot_notifier
  md: Introduce md_stop_all_writes()
  md: Neither resync nor reshape while the system is frozen
  block: Introduce blk_get_request_flags()
  block: Introduce BLK_MQ_REQ_PREEMPT
  ide, scsi: Tell the block layer at request allocation time about
    preempt requests
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI quiesce and resume work reliably
  block, nvme: Introduce blk_mq_req_flags_t

Ming Lei (1):
  block: Make q_usage_counter also track legacy requests

 block/blk-core.c           | 132 ++---
 block/blk-mq-debugfs.c     |   1 +
 block/blk-mq.c             |  20 +++
 block/blk-mq.h             |   2 +-
 block/blk-timeout.c        |   2 +-
 drivers/ide/ide-pm.c       |   4 +-
 drivers/md/md.c            |  65 +++---
 drivers/md/md.h            |   8 +++
 drivers/nvme/host/core.c   |   5 +-
 drivers/nvme/host/nvme.h   |   5 +-
 drivers/scsi/scsi_lib.c    |  46 +++-
 fs/block_dev.c             |   4 +-
 include/linux/blk-mq.h     |  16 --
 include/linux/blk_types.h  |   2 +
 include/linux/blkdev.h     |  11 +++-
 include/scsi/scsi_device.h |   1 +
 16 files changed, 256 insertions(+), 68 deletions(-)

--
2.14.2
[PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
from being submitted by deferring allocation of non-preempt requests
for devices in the quiesced state.

This patch has been tested by running the following commands and by
verifying that after resume the fio job is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d/queue/nr_requests"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko
References: "I/O hangs after resuming from suspend-to-ram"
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche
Tested-by: Martin Steigerwald
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-core.c           | 42 +++---
 block/blk-mq.c             |  4 ++--
 block/blk-timeout.c        |  2 +-
 drivers/scsi/scsi_lib.c    | 40 +---
 fs/block_dev.c             |  4 ++--
 include/linux/blkdev.h     |  2 +-
 include/scsi/scsi_device.h |  1 +
 7 files changed, 71 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b73203286bf5..abb5095bcef6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 	spin_lock_irqsave(q->queue_lock, flags);
 	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	wake_up_all(&q->mq_freeze_wq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);

@@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);

-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ */
+int blk_queue_enter(struct request_queue *q, unsigned int flags)
 {
+	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+
 	while (true) {
+		bool success = false;
 		int ret;

-		if (percpu_ref_tryget_live(&q->q_usage_counter))
+		rcu_read_lock_sched();
+		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+			/*
+			 * The code that sets the PREEMPT_ONLY flag is
+			 * responsible for ensuring that that flag is globally
+			 * visible before the queue is unfrozen.
+			 */
+			if (preempt || !blk_queue_preempt_only(q)) {
+				success = true;
+			} else {
+				percpu_ref_put(&q->q_usage_counter);
+				WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
+					  kobject_name(q->kobj.parent));
+			}
+		}
+		rcu_read_unlock_sched();
+
+		if (success)
 			return 0;
-		if (nowait)
+		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;

 		/*
@@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		smp_rmb();

 		ret = wait_event_interruptible(q->mq_freeze_wq,
-			       !atomic_read(&q->mq_freeze_depth) ||
+			       (atomic_read(&q->mq_freeze_depth) == 0 &&
+				(preempt || !blk_queue_preempt_only(q))) ||
[PATCH v2 3/8] nvmet/fc: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: James Smart
Cc: Sagi Grimberg
---
 drivers/nvme/target/Kconfig |  1 +
 drivers/nvme/target/fc.c    | 36 ++--
 2 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 03e4ab65fe77..4d9715630e21 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -39,6 +39,7 @@ config NVME_TARGET_FC
 	tristate "NVMe over Fabrics FC target driver"
 	depends on NVME_TARGET
 	depends on HAS_DMA
+	select SGL_ALLOC
 	help
 	  This enables the NVMe FC target support, which allows exporting NVMe
 	  devices over FC.
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 58e010bdda3e..20f96038c272 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -1683,31 +1683,12 @@ static int
 nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
 	struct scatterlist *sg;
-	struct page *page;
 	unsigned int nent;
-	u32 page_len, length;
-	int i = 0;

-	length = fod->total_length;
-	nent = DIV_ROUND_UP(length, PAGE_SIZE);
-	sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
+	sg = sgl_alloc(fod->total_length, GFP_KERNEL, &nent);
 	if (!sg)
 		goto out;

-	sg_init_table(sg, nent);
-
-	while (length) {
-		page_len = min_t(u32, length, PAGE_SIZE);
-
-		page = alloc_page(GFP_KERNEL);
-		if (!page)
-			goto out_free_pages;
-
-		sg_set_page(&sg[i], page, page_len, 0);
-		length -= page_len;
-		i++;
-	}
-
 	fod->data_sg = sg;
 	fod->data_sg_cnt = nent;
 	fod->data_sg_cnt = fc_dma_map_sg(fod->tgtport->dev, sg, nent,
@@ -1717,14 +1698,6 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)

 	return 0;

-out_free_pages:
-	while (i > 0) {
-		i--;
-		__free_page(sg_page(&sg[i]));
-	}
-	kfree(sg);
-	fod->data_sg = NULL;
-	fod->data_sg_cnt = 0;
 out:
 	return NVME_SC_INTERNAL;
 }
@@ -1732,18 +1705,13 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 static void
 nvmet_fc_free_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
-	struct scatterlist *sg;
-	int count;
-
 	if (!fod->data_sg || !fod->data_sg_cnt)
 		return;

 	fc_dma_unmap_sg(fod->tgtport->dev, fod->data_sg, fod->data_sg_cnt,
 			((fod->io_dir == NVMET_FCP_WRITE) ?
 				DMA_FROM_DEVICE : DMA_TO_DEVICE));
-	for_each_sg(fod->data_sg, sg, fod->data_sg_cnt, count)
-		__free_page(sg_page(sg));
-	kfree(fod->data_sg);
+	sgl_free(fod->data_sg);
 	fod->data_sg = NULL;
 	fod->data_sg_cnt = 0;
 }
--
2.14.2
[PATCH v2 7/8] scsi/pmcraid: Remove an unused structure member
Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen
Cc: Anil Ravindranath
---
 drivers/scsi/pmcraid.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 8bfac72a242b..44da91712115 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,6 @@ struct pmcraid_sglist {
 	u32 order;
 	u32 num_sg;
 	u32 num_dma_sg;
-	u32 buffer_len;
 	struct scatterlist scatterlist[1];
 };
--
2.14.2
[PATCH v2 1/8] lib/scatterlist: Introduce sgl_alloc() and sgl_free()
Many kernel drivers contain code that allocates and frees both a
scatterlist and the pages that populate that scatterlist. Introduce
functions in lib/scatterlist.c that perform these tasks instead of
duplicating this functionality in multiple drivers. Only include these
functions in the build if CONFIG_SGL_ALLOC=y to avoid that the kernel
size increases if this functionality is not used.

Signed-off-by: Bart Van Assche
---
 include/linux/scatterlist.h |  10 +
 lib/Kconfig                 |   4 ++
 lib/scatterlist.c           | 105
 3 files changed, 119 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4b3286ac60c8..3642511918d5 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -266,6 +266,16 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	unsigned long offset, unsigned long size,
 	gfp_t gfp_mask);

+#ifdef CONFIG_SGL_ALLOC
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+				    unsigned int order, bool chainable,
+				    gfp_t gfp, unsigned int *nent_p);
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+			      unsigned int *nent_p);
+void sgl_free_order(struct scatterlist *sgl, int order);
+void sgl_free(struct scatterlist *sgl);
+#endif /* CONFIG_SGL_ALLOC */
+
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 		      size_t buflen, off_t skip, bool to_buffer);

diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b22a6de..8396c4cfa1ab 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -413,6 +413,10 @@ config HAS_DMA
 	depends on !NO_DMA
 	default y

+config SGL_ALLOC
+	bool
+	default n
+
 config DMA_NOOP_OPS
 	bool
 	depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index be7b4dd6b68d..e2b5a78ceb44 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -433,6 +433,111 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 }
 EXPORT_SYMBOL(sg_alloc_table_from_pages);

+#ifdef CONFIG_SGL_ALLOC
+
+/**
+ * sgl_alloc_order - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist. Must be at least one
+ * @order: Second argument for alloc_pages()
+ * @chainable: Whether or not to allocate an extra element in the scatterlist
+ *	for scatterlist chaining purposes
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist that have pages
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+				    unsigned int order, bool chainable,
+				    gfp_t gfp, unsigned int *nent_p)
+{
+	struct scatterlist *sgl, *sg;
+	struct page *page;
+	unsigned int nent, nalloc;
+	u32 elem_len;
+
+	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
+	/* Check for integer overflow */
+	if (length > (nent << (PAGE_SHIFT + order)))
+		return NULL;
+	nalloc = nent;
+	if (chainable) {
+		/* Check for integer overflow */
+		if (nalloc + 1 < nalloc)
+			return NULL;
+		nalloc++;
+	}
+	sgl = kmalloc_array(nalloc, sizeof(struct scatterlist),
+			    (gfp & ~GFP_DMA) | __GFP_ZERO);
+	if (!sgl)
+		return NULL;
+
+	sg_init_table(sgl, nent);
+	sg = sgl;
+	while (length) {
+		elem_len = min_t(u64, length, PAGE_SIZE << order);
+		page = alloc_pages(gfp, order);
+		if (!page) {
+			sgl_free(sgl);
+			return NULL;
+		}
+
+		sg_set_page(sg, page, elem_len, 0);
+		length -= elem_len;
+		sg = sg_next(sg);
+	}
+	WARN_ON_ONCE(sg);
+	if (nent_p)
+		*nent_p = nent;
+	return sgl;
+}
+EXPORT_SYMBOL(sgl_alloc_order);
+
+/**
+ * sgl_alloc - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+			      unsigned int *nent_p)
+{
+	return sgl_alloc_order(length, 0, false, gfp, nent_p);
+}
+EXPORT_SYMBOL(sgl_alloc);
+
+/**
+ * sgl_free_order - free a scatterlist and its pages
+ * @sgl: Scatterlist with one or more elements
+ * @order: Second argument for __free_pages()
+ */
+void sgl_free_order(struct scatterlist *sgl, int order)
+{
+	struct scatterlist *sg;
+
[PATCH v2 8/8] scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order()
Use the sgl_alloc_order() and sgl_free_order() functions instead of
open coding these functions.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen
Cc: Anil Ravindranath
---
 drivers/scsi/Kconfig   |  1 +
 drivers/scsi/pmcraid.c | 43 ---
 drivers/scsi/pmcraid.h |  2 +-
 3 files changed, 6 insertions(+), 40 deletions(-)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index d11e75e76195..632200ec36a6 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1576,6 +1576,7 @@ config ZFCP
 config SCSI_PMCRAID
 	tristate "PMC SIERRA Linux MaxRAID adapter support"
 	depends on PCI && SCSI && NET
+	select SGL_ALLOC
 	---help---
 	  This driver supports the PMC SIERRA MaxRAID adapters.

diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index b4d6cd8cd1ad..95fc0352f9bb 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -3232,12 +3232,7 @@ static int pmcraid_build_ioadl(
 */
 static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
 {
-	int i;
-
-	for (i = 0; i < sglist->num_sg; i++)
-		__free_pages(sg_page(&(sglist->scatterlist[i])),
-			     sglist->order);
-
+	sgl_free_order(sglist->scatterlist, sglist->order);
 	kfree(sglist);
 }

@@ -3254,50 +3249,20 @@ static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
 static struct pmcraid_sglist *pmcraid_alloc_sglist(int buflen)
 {
 	struct pmcraid_sglist *sglist;
-	struct scatterlist *scatterlist;
-	struct page *page;
-	int num_elem, i, j;
 	int sg_size;
 	int order;
-	int bsize_elem;

 	sg_size = buflen / (PMCRAID_MAX_IOADLS - 1);
 	order = (sg_size > 0) ? get_order(sg_size) : 0;
-	bsize_elem = PAGE_SIZE * (1 << order);
-
-	/* Determine the actual number of sg entries needed */
-	if (buflen % bsize_elem)
-		num_elem = (buflen / bsize_elem) + 1;
-	else
-		num_elem = buflen / bsize_elem;

 	/* Allocate a scatter/gather list for the DMA */
-	sglist = kzalloc(sizeof(struct pmcraid_sglist) +
-			 (sizeof(struct scatterlist) * (num_elem - 1)),
-			 GFP_KERNEL);
-
+	sglist = kzalloc(sizeof(struct pmcraid_sglist), GFP_KERNEL);
 	if (sglist == NULL)
 		return NULL;

-	scatterlist = sglist->scatterlist;
-	sg_init_table(scatterlist, num_elem);
 	sglist->order = order;
-	sglist->num_sg = num_elem;
-	sg_size = buflen;
-
-	for (i = 0; i < num_elem; i++) {
-		page = alloc_pages(GFP_KERNEL|GFP_DMA|__GFP_ZERO, order);
-		if (!page) {
-			for (j = i - 1; j >= 0; j--)
-				__free_pages(sg_page(&scatterlist[j]), order);
-			kfree(sglist);
-			return NULL;
-		}
-
-		sg_set_page(&scatterlist[i], page,
-			    sg_size < bsize_elem ? sg_size : bsize_elem, 0);
-		sg_size -= bsize_elem;
-	}
+	sglist->scatterlist = sgl_alloc_order(buflen, order, false,
+			GFP_KERNEL | GFP_DMA | __GFP_ZERO, &sglist->num_sg);

 	return sglist;
 }
diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 44da91712115..754ef30927e2 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,7 @@ struct pmcraid_sglist {
 	u32 order;
 	u32 num_sg;
 	u32 num_dma_sg;
-	struct scatterlist scatterlist[1];
+	struct scatterlist *scatterlist;
 };
--
2.14.2
[PATCH v2 2/8] crypto: scompress - use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Cc: Ard Biesheuvel Cc: Herbert Xu --- crypto/Kconfig | 1 + crypto/scompress.c | 51 ++- 2 files changed, 3 insertions(+), 49 deletions(-) diff --git a/crypto/Kconfig b/crypto/Kconfig index 0a121f9ddf8e..a0667dd284ff 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -105,6 +105,7 @@ config CRYPTO_KPP config CRYPTO_ACOMP2 tristate select CRYPTO_ALGAPI2 + select SGL_ALLOC config CRYPTO_ACOMP tristate diff --git a/crypto/scompress.c b/crypto/scompress.c index 2075e2c4e7df..968bbcf65c94 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -140,53 +140,6 @@ static int crypto_scomp_init_tfm(struct crypto_tfm *tfm) return ret; } -static void crypto_scomp_sg_free(struct scatterlist *sgl) -{ - int i, n; - struct page *page; - - if (!sgl) - return; - - n = sg_nents(sgl); - for_each_sg(sgl, sgl, n, i) { - page = sg_page(sgl); - if (page) - __free_page(page); - } - - kfree(sgl); -} - -static struct scatterlist *crypto_scomp_sg_alloc(size_t size, gfp_t gfp) -{ - struct scatterlist *sgl; - struct page *page; - int i, n; - - n = ((size - 1) >> PAGE_SHIFT) + 1; - - sgl = kmalloc_array(n, sizeof(struct scatterlist), gfp); - if (!sgl) - return NULL; - - sg_init_table(sgl, n); - - for (i = 0; i < n; i++) { - page = alloc_page(gfp); - if (!page) - goto err; - sg_set_page(sgl + i, page, PAGE_SIZE, 0); - } - - return sgl; - -err: - sg_mark_end(sgl + i); - crypto_scomp_sg_free(sgl); - return NULL; -} - static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) { struct crypto_acomp *tfm = crypto_acomp_reqtfm(req); @@ -220,7 +173,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) scratch_dst, &req->dlen, *ctx); if (!ret) { if (!req->dst) { - req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC); + req->dst = sgl_alloc(req->dlen, GFP_ATOMIC, NULL); if (!req->dst) goto out; } @@ -274,7 +227,7 @@ int crypto_init_scomp_ops_async(struct crypto_tfm
*tfm) crt->compress = scomp_acomp_compress; crt->decompress = scomp_acomp_decompress; - crt->dst_free = crypto_scomp_sg_free; + crt->dst_free = sgl_free; crt->reqsize = sizeof(void *); return 0; -- 2.14.2
[PATCH v2 4/8] nvmet/rdma: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: Keith Busch Cc: Christoph Hellwig Cc: Sagi Grimberg --- drivers/nvme/target/Kconfig | 1 + drivers/nvme/target/rdma.c | 63 +++-- 2 files changed, 5 insertions(+), 59 deletions(-) diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 4d9715630e21..5f4f8b16685f 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -29,6 +29,7 @@ config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" depends on INFINIBAND depends on NVME_TARGET + select SGL_ALLOC help This enables the NVMe RDMA target support, which allows exporting NVMe devices over RDMA. diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index 76d2bb793afe..363d44c10b68 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -185,59 +185,6 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp) spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags); } -static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents) -{ - struct scatterlist *sg; - int count; - - if (!sgl || !nents) - return; - - for_each_sg(sgl, sg, nents, count) - __free_page(sg_page(sg)); - kfree(sgl); -} - -static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, - u32 length) -{ - struct scatterlist *sg; - struct page *page; - unsigned int nent; - int i = 0; - - nent = DIV_ROUND_UP(length, PAGE_SIZE); - sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL); - if (!sg) - goto out; - - sg_init_table(sg, nent); - - while (length) { - u32 page_len = min_t(u32, length, PAGE_SIZE); - - page = alloc_page(GFP_KERNEL); - if (!page) - goto out_free_pages; - - sg_set_page(&sg[i], page, page_len, 0); - length -= page_len; - i++; - } - *sgl = sg; - *nents = nent; - return 0; - -out_free_pages: - while (i > 0) { - i--; - __free_page(sg_page(&sg[i])); - } - kfree(sg); -out: - return
NVME_SC_INTERNAL; -} - static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev, struct nvmet_rdma_cmd *c, bool admin) { @@ -484,7 +431,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp) } if (rsp->req.sg != &rsp->cmd->inline_sg) - nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt); + sgl_free(rsp->req.sg); if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list))) nvmet_rdma_process_wr_wait_list(queue); @@ -620,16 +567,14 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp, u32 len = get_unaligned_le24(sgl->length); u32 key = get_unaligned_le32(sgl->key); int ret; - u16 status; /* no data command? */ if (!len) return 0; - status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt, - len); - if (status) - return status; + rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt); + if (!rsp->req.sg) + return NVME_SC_INTERNAL; ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num, rsp->req.sg, rsp->req.sg_cnt, 0, addr, key, -- 2.14.2
[PATCH v2 6/8] scsi/ipr: Use sgl_alloc_order() and sgl_free_order()
Use the sgl_alloc_order() and sgl_free_order() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: linux-s...@vger.kernel.org Cc: Martin K. Petersen Cc: Brian King --- drivers/scsi/Kconfig | 1 + drivers/scsi/ipr.c | 49 - drivers/scsi/ipr.h | 2 +- 3 files changed, 10 insertions(+), 42 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 41366339b950..d11e75e76195 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -1058,6 +1058,7 @@ config SCSI_IPR depends on PCI && SCSI && ATA select FW_LOADER select IRQ_POLL + select SGL_ALLOC ---help--- This driver supports the IBM Power Linux family RAID adapters. This includes IBM pSeries 5712, 5703, 5709, and 570A, as well diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index f838bd73befa..6fed5db6255e 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -3815,10 +3815,8 @@ static struct device_attribute ipr_iopoll_weight_attr = { **/ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) { - int sg_size, order, bsize_elem, num_elem, i, j; + int sg_size, order; struct ipr_sglist *sglist; - struct scatterlist *scatterlist; - struct page *page; /* Get the minimum size per scatter/gather element */ sg_size = buf_len / (IPR_MAX_SGLIST - 1); @@ -3826,45 +3824,18 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) /* Get the actual size per element */ order = get_order(sg_size); - /* Determine the actual number of bytes per element */ - bsize_elem = PAGE_SIZE * (1 << order); - - /* Determine the actual number of sg entries needed */ - if (buf_len % bsize_elem) - num_elem = (buf_len / bsize_elem) + 1; - else - num_elem = buf_len / bsize_elem; - /* Allocate a scatter/gather list for the DMA */ - sglist = kzalloc(sizeof(struct ipr_sglist) + -(sizeof(struct scatterlist) * (num_elem - 1)), -GFP_KERNEL); - + sglist = kzalloc(sizeof(struct ipr_sglist), GFP_KERNEL); if (sglist == NULL) { ipr_trace; return NULL;
} - - scatterlist = sglist->scatterlist; - sg_init_table(scatterlist, num_elem); - sglist->order = order; - sglist->num_sg = num_elem; - - /* Allocate a bunch of sg elements */ - for (i = 0; i < num_elem; i++) { - page = alloc_pages(GFP_KERNEL, order); - if (!page) { - ipr_trace; - - /* Free up what we already allocated */ - for (j = i - 1; j >= 0; j--) - __free_pages(sg_page(&scatterlist[j]), order); - kfree(sglist); - return NULL; - } - - sg_set_page(&scatterlist[i], page, 0, 0); + sglist->scatterlist = sgl_alloc_order(buf_len, order, false, GFP_KERNEL, + &sglist->num_sg); + if (!sglist->scatterlist) { + kfree(sglist); + return NULL; } return sglist; @@ -3882,11 +3853,7 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len) **/ static void ipr_free_ucode_buffer(struct ipr_sglist *sglist) { - int i; - - for (i = 0; i < sglist->num_sg; i++) - __free_pages(sg_page(&sglist->scatterlist[i]), sglist->order); - + sgl_free_order(sglist->scatterlist, sglist->order); kfree(sglist); } diff --git a/drivers/scsi/ipr.h b/drivers/scsi/ipr.h index c7f0e9e3cd7d..93570734cbfb 100644 --- a/drivers/scsi/ipr.h +++ b/drivers/scsi/ipr.h @@ -1454,7 +1454,7 @@ struct ipr_sglist { u32 num_sg; u32 num_dma_sg; u32 buffer_len; - struct scatterlist scatterlist[1]; + struct scatterlist *scatterlist; }; enum ipr_sdt_state { -- 2.14.2
[PATCH v2 0/8] Introduce sgl_alloc() and sgl_free()
Hello Jens, As you know there are multiple drivers that both allocate a scatter/gather list and populate that list with pages. This patch series moves the code for allocating and freeing such scatterlists from these drivers into lib/scatterlist.c. Please consider this patch series for kernel v4.15. Notes: - Only the ib_srpt driver changes have been tested but none of the other drivers have been retested. - The next step is to introduce a caching mechanism for these scatterlists and make the nvmet/rdma and SCSI target drivers use that caching mechanism since for these drivers sgl allocation occurs in the hot path. Changes compared to v1: - Moved the sgl_alloc*() and sgl_free() functions from a new source file into lib/scatterlist.c. - Changed the argument order for the sgl_alloc*() functions such that the (pointer to) the output argument comes last. Thanks, Bart. Bart Van Assche (8): lib/scatterlist: Introduce sgl_alloc() and sgl_free() crypto: scompress - use sgl_alloc() and sgl_free() nvmet/fc: Use sgl_alloc() and sgl_free() nvmet/rdma: Use sgl_alloc() and sgl_free() target: Use sgl_alloc_order() and sgl_free() scsi/ipr: Use sgl_alloc_order() and sgl_free_order() scsi/pmcraid: Remove an unused structure member scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order() crypto/Kconfig | 1 + crypto/scompress.c | 51 +--- drivers/nvme/target/Kconfig| 2 + drivers/nvme/target/fc.c | 36 +-- drivers/nvme/target/rdma.c | 63 ++-- drivers/scsi/Kconfig | 2 + drivers/scsi/ipr.c | 49 +++ drivers/scsi/ipr.h | 2 +- drivers/scsi/pmcraid.c | 43 ++ drivers/scsi/pmcraid.h | 3 +- drivers/target/Kconfig | 1 + drivers/target/target_core_transport.c | 46 ++- include/linux/scatterlist.h| 10 lib/Kconfig| 4 ++ lib/scatterlist.c | 105 + 15 files changed, 151 insertions(+), 267 deletions(-) -- 2.14.2
[PATCH v2 5/8] target: Use sgl_alloc_order() and sgl_free()
Use the sgl_alloc_order() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche Cc: Nicholas A. Bellinger Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Sagi Grimberg --- drivers/target/Kconfig | 1 + drivers/target/target_core_transport.c | 46 +++--- 2 files changed, 5 insertions(+), 42 deletions(-) diff --git a/drivers/target/Kconfig b/drivers/target/Kconfig index e2bc99980f75..4c44d7bed01a 100644 --- a/drivers/target/Kconfig +++ b/drivers/target/Kconfig @@ -5,6 +5,7 @@ menuconfig TARGET_CORE select CONFIGFS_FS select CRC_T10DIF select BLK_SCSI_REQUEST # only for scsi_command_size_tbl.. + select SGL_ALLOC default n help Say Y or M here to enable the TCM Storage Engine and ConfigFS enabled diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 836d552b0385..9bbd08be9d60 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2293,13 +2293,7 @@ static void target_complete_ok_work(struct work_struct *work) void target_free_sgl(struct scatterlist *sgl, int nents) { - struct scatterlist *sg; - int count; - - for_each_sg(sgl, sg, nents, count) - __free_page(sg_page(sg)); - - kfree(sgl); + sgl_free(sgl); } EXPORT_SYMBOL(target_free_sgl); @@ -2423,42 +2417,10 @@ int target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length, bool zero_page, bool chainable) { - struct scatterlist *sg; - struct page *page; - gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0; - unsigned int nalloc, nent; - int i = 0; - - nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE); - if (chainable) - nalloc++; - sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL); - if (!sg) - return -ENOMEM; + gfp_t gfp = GFP_KERNEL | (zero_page ?
__GFP_ZERO : 0); - sg_init_table(sg, nalloc); - - while (length) { - u32 page_len = min_t(u32, length, PAGE_SIZE); - page = alloc_page(GFP_KERNEL | zero_flag); - if (!page) - goto out; - - sg_set_page(&sg[i], page, page_len, 0); - length -= page_len; - i++; - } - *sgl = sg; - *nents = nent; - return 0; - -out: - while (i > 0) { - i--; - __free_page(sg_page(&sg[i])); - } - kfree(sg); - return -ENOMEM; + *sgl = sgl_alloc_order(length, 0, chainable, gfp, nents); + return *sgl ? 0 : -ENOMEM; } EXPORT_SYMBOL(target_alloc_sgl); -- 2.14.2
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 10/16/2017 01:07 PM, Michael Lyle wrote: > Jens-- > > Thanks for your patience. > > On Mon, Oct 16, 2017 at 12:01 PM, Jens Axboe wrote: >> On 10/16/2017 11:34 AM, Michael Lyle wrote: >>> Sorry this got through to linux-block, was detected by the kbuilds test >>> robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a >>> signed long constant. >> >> Applied, but you should remember to add Fixes lines when a patch >> explicitly fixes a problem introduced by a previous patch. Ala: >> >> Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") >> >> so that backports have an easier time finding dependent fixes. > > Sorry about that too. I considered the Fixes: tag but didn't know if > the hashes would "hold true" into mainline. Now I see Linus merges > your tree so the hash should be durable. They are durable once applied, so yeah. -- Jens Axboe
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
Jens-- Thanks for your patience. On Mon, Oct 16, 2017 at 12:01 PM, Jens Axboe wrote: > On 10/16/2017 11:34 AM, Michael Lyle wrote: >> Sorry this got through to linux-block, was detected by the kbuilds test >> robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a >> signed long constant. > > Applied, but you should remember to add Fixes lines when a patch > explicitly fixes a problem introduced by a previous patch. Ala: > > Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") > > so that backports have an easier time finding dependent fixes. Sorry about that too. I considered the Fixes: tag but didn't know if the hashes would "hold true" into mainline. Now I see Linus merges your tree so the hash should be durable. > > -- > Jens Axboe > Mike
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 10/16/2017 11:34 AM, Michael Lyle wrote: > Sorry this got through to linux-block, was detected by the kbuilds test > robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a > signed long constant. Applied, but you should remember to add Fixes lines when a patch explicitly fixes a problem introduced by a previous patch. Ala: Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") so that backports have an easier time finding dependent fixes. -- Jens Axboe
Re: [PATCH] block: fix Sphinx kernel-doc warning
On 10/16/2017 12:01 PM, Randy Dunlap wrote: > From: Randy Dunlap > > Sphinx treats symbols that end with '_' as a kind of special > documentation indicator, so fix that by adding an ending '*' > to it. Added for 4.15, thanks Randy. -- Jens Axboe
Re: [PATCHv3] bcache: only permit to recovery read error when cache device is clean
On 2017/10/17 1:39 AM, Michael Lyle wrote: > Hey Coly-- > > On Thu, Sep 21, 2017 at 12:54 PM, Coly Li wrote: >> When bcache does read I/Os, for example in writeback or writethrough mode, >> if a read request on cache device is failed, bcache will try to recovery >> the request by reading from cached device. If the data on cached device is >> not synced with cache device, then requester will get a stale data. > [...] >> + if (s->recoverable && >> + (dc && !atomic_read(&dc->has_dirty)) { > > Looks like this is missing a parens. Hi Mike, Oops, I am blind ... Thanks for figure out this. V4 patch is out, and I think you may change Acked-by to Reviewed-by, because you reviewed the code and pointed out problem :-) Thanks. -- Coly Li
[PATCHv4] bcache: only permit to recovery read error when cache device is clean
When bcache does read I/Os, for example in writeback or writethrough mode, if a read request on the cache device fails, bcache will try to recover the request by reading from the cached device. If the data on the cached device is not synced with the cache device, then the requester will get stale data. For a critical storage system like a database, providing stale data from recovery may result in application-level data corruption, which is unacceptable. With this patch, for a failed read request in writeback or writethrough mode, recovering a recoverable read request only happens when the cache device is clean. That is to say, all data on the cached device is up to date. For other cache modes in bcache, read requests will never hit cached_dev_read_error(), so they don't need this patch. Please note, because the cache mode can be switched arbitrarily at run time, a writethrough mode might have been switched from a writeback mode. Therefore checking dc->has_dirty in writethrough mode still makes sense. Changelog: V4: Fix parens error pointed out by Michael Lyle. v3: By response from Kent Overstreet, he thinks recovering stale data is a bug to fix, and an option to permit it is unnecessary. So in this version the sysfs file is removed. v2: rename sysfs entry from allow_stale_data_on_failure to allow_stale_data_on_failure, and fix the confusing commit log. v1: initial patch posted.
Signed-off-by: Coly Li Reported-by: Arne Wolf Acked-by: Michael Lyle Cc: Kent Overstreet Cc: Nix Cc: Kai Krakow Cc: Eric Wheeler Cc: Junhui Tang Cc: sta...@vger.kernel.org --- drivers/md/bcache/request.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 681b4f12b05a..e7f769ff7234 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -697,8 +697,16 @@ static void cached_dev_read_error(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); struct bio *bio = &s->bio.bio; + struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); - if (s->recoverable) { + /* +* If cache device is dirty (dc->has_dirty is non-zero), then +* recovery a failed read request from cached device may get a +* stale data back. So read failure recovery is only permitted +* when cache device is clean. +*/ + if (s->recoverable && + (dc && !atomic_read(&dc->has_dirty))) { /* Retry from the backing device: */ trace_bcache_read_retry(s->orig_bio); -- 2.13.6
[PATCH] block: fix Sphinx kernel-doc warning
From: Randy Dunlap Sphinx treats symbols that end with '_' as a kind of special documentation indicator, so fix that by adding an ending '*' to it. ../block/bio.c:404: ERROR: Unknown target name: "gfp". Signed-off-by: Randy Dunlap --- block/bio.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- lnx-414-rc5.orig/block/bio.c +++ lnx-414-rc5/block/bio.c @@ -400,7 +400,7 @@ static void punt_bios_to_rescuer(struct /** * bio_alloc_bioset - allocate a bio for I/O - * @gfp_mask: the GFP_ mask given to the slab allocator + * @gfp_mask: the GFP_* mask given to the slab allocator * @nr_iovecs: number of iovecs to pre-allocate * @bs: the bio_set to allocate from. *
Re: [PATCH] bcache: writeback rate clamping: make 32 bit safe
On 2017/10/17 1:34 AM, Michael Lyle wrote: > Sorry this got through to linux-block, was detected by the kbuilds test > robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a > signed long constant. > > Signed-off-by: Michael Lyle I forgot i386 config... sorry too... Reviewed-by: Coly Li Thanks to Michael for the fix. Coly Li > --- > drivers/md/bcache/util.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > index 4dbe37e82877..e548b8b51322 100644 > --- a/drivers/md/bcache/util.c > +++ b/drivers/md/bcache/util.c > @@ -238,8 +238,8 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t > done) >* don't let us sleep more than 2.5 seconds (so we can notice/respond >* if the control system tells us to speed up!). >*/ > - if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next)) > - d->next = now + NSEC_PER_SEC * 5 / 2; > + if (time_before64(now + NSEC_PER_SEC * 5LLU / 2LLU, d->next)) > + d->next = now + NSEC_PER_SEC * 5LLU / 2LLU; > > if (time_after64(now - NSEC_PER_SEC * 2, d->next)) > d->next = now - NSEC_PER_SEC * 2;
Re: [PATCHv3] bcache: only permit to recovery read error when cache device is clean
Hey Coly-- On Thu, Sep 21, 2017 at 12:54 PM, Coly Li wrote: > When bcache does read I/Os, for example in writeback or writethrough mode, > if a read request on cache device is failed, bcache will try to recovery > the request by reading from cached device. If the data on cached device is > not synced with cache device, then requester will get a stale data. [...] > + if (s->recoverable && > + (dc && !atomic_read(&dc->has_dirty)) { Looks like this is missing a parens. Thx, Mike
[PATCH] bcache: writeback rate clamping: make 32 bit safe
Sorry this got through to linux-block, was detected by the kbuilds test robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a signed long constant. Signed-off-by: Michael Lyle --- drivers/md/bcache/util.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index 4dbe37e82877..e548b8b51322 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -238,8 +238,8 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) * don't let us sleep more than 2.5 seconds (so we can notice/respond * if the control system tells us to speed up!). */ - if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next)) - d->next = now + NSEC_PER_SEC * 5 / 2; + if (time_before64(now + NSEC_PER_SEC * 5LLU / 2LLU, d->next)) + d->next = now + NSEC_PER_SEC * 5LLU / 2LLU; if (time_after64(now - NSEC_PER_SEC * 2, d->next)) d->next = now - NSEC_PER_SEC * 2; -- 2.11.0
Re: [PATCH 8/9] block: pass full fmode_t to blk_verify_command
On Tue, Oct 03, 2017 at 12:48:44PM +0200, Christoph Hellwig wrote: > Use the obvious calling convention. > > Signed-off-by: Christoph Hellwig> --- > block/bsg.c| 18 -- > block/scsi_ioctl.c | 8 > drivers/scsi/sg.c | 2 +- > include/linux/blkdev.h | 2 +- > 4 files changed, 14 insertions(+), 16 deletions(-) > > diff --git a/block/bsg.c b/block/bsg.c > index ee1335c68de7..452f94f1c5d4 100644 > --- a/block/bsg.c > +++ b/block/bsg.c > @@ -137,7 +137,7 @@ static inline struct hlist_head *bsg_dev_idx_hash(int > index) > > static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq, > struct sg_io_v4 *hdr, struct bsg_device *bd, > - fmode_t has_write_perm) > + fmode_t mode) > { > struct scsi_request *req = scsi_req(rq); > > @@ -152,7 +152,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, > struct request *rq, > return -EFAULT; > > if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) { > - if (blk_verify_command(req->cmd, has_write_perm)) > + if (blk_verify_command(req->cmd, mode)) > return -EPERM; > } else if (!capable(CAP_SYS_RAWIO)) > return -EPERM; > @@ -206,7 +206,7 @@ bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *op) > * map sg_io_v4 to a request. 
> */ > static struct request * > -bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t > has_write_perm) > +bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t mode) > { > struct request_queue *q = bd->queue; > struct request *rq, *next_rq = NULL; > @@ -237,7 +237,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, > fmode_t has_write_perm) > if (IS_ERR(rq)) > return rq; > > - ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, has_write_perm); > + ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, mode); > if (ret) > goto out; > > @@ -587,8 +587,7 @@ bsg_read(struct file *file, char __user *buf, size_t > count, loff_t *ppos) > } > > static int __bsg_write(struct bsg_device *bd, const char __user *buf, > -size_t count, ssize_t *bytes_written, > -fmode_t has_write_perm) > +size_t count, ssize_t *bytes_written, fmode_t mode) > { > struct bsg_command *bc; > struct request *rq; > @@ -619,7 +618,7 @@ static int __bsg_write(struct bsg_device *bd, const char > __user *buf, > /* >* get a request, fill in the blanks, and add to request queue >*/ > - rq = bsg_map_hdr(bd, &bc->hdr, has_write_perm); > + rq = bsg_map_hdr(bd, &bc->hdr, mode); > if (IS_ERR(rq)) { > ret = PTR_ERR(rq); > rq = NULL; > @@ -655,8 +654,7 @@ bsg_write(struct file *file, const char __user *buf, > size_t count, loff_t *ppos) > bsg_set_block(bd, file); > > bytes_written = 0; > - ret = __bsg_write(bd, buf, count, &bytes_written, > - file->f_mode & FMODE_WRITE); > + ret = __bsg_write(bd, buf, count, &bytes_written, file->f_mode); > > *ppos = bytes_written; > > @@ -915,7 +913,7 @@ static long bsg_ioctl(struct file *file, unsigned int > cmd, unsigned long arg) > if (copy_from_user(&hdr, uarg, sizeof(hdr))) > return -EFAULT; > > - rq = bsg_map_hdr(bd, &hdr, file->f_mode & FMODE_WRITE); > + rq = bsg_map_hdr(bd, &hdr, file->f_mode); > if (IS_ERR(rq)) > return PTR_ERR(rq); > > diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c > index 7440de44dd85..edcfff974527 100644 > --- a/block/scsi_ioctl.c > +++ b/block/scsi_ioctl.c > @@
-207,7 +207,7 @@ static void blk_set_cmd_filter_defaults(struct > blk_cmd_filter *filter) > __set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok); > } > > -int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm) > +int blk_verify_command(unsigned char *cmd, fmode_t mode) > { > struct blk_cmd_filter *filter = &blk_default_cmd_filter; > > @@ -220,7 +220,7 @@ int blk_verify_command(unsigned char *cmd, fmode_t > has_write_perm) > return 0; > > /* Write-safe commands require a writable open */ > - if (test_bit(cmd[0], filter->write_ok) && has_write_perm) > + if (test_bit(cmd[0], filter->write_ok) && (mode & FMODE_WRITE)) > return 0; > > return -EPERM; > @@ -234,7 +234,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, > struct request *rq, > > if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len)) > return -EFAULT; > - if (blk_verify_command(req->cmd, mode & FMODE_WRITE)) > + if (blk_verify_command(req->cmd, mode)) > return -EPERM; > > /* > @@ -469,7 +469,7 @@ int sg_scsi_ioctl(struct
Re: [PATCH 6/9] bsg-lib: introduce a timeout field in struct bsg_job
On Tue, Oct 03, 2017 at 12:48:42PM +0200, Christoph Hellwig wrote: > The zfcp driver wants to know the timeout for a bsg job, so add a field > to struct bsg_job for it in preparation of not exposing the request > to the bsg-lib users. > > Signed-off-by: Christoph Hellwig> --- > block/bsg-lib.c | 1 + > drivers/s390/scsi/zfcp_fc.c | 4 ++-- > include/linux/bsg-lib.h | 2 ++ > 3 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/block/bsg-lib.c b/block/bsg-lib.c > index 15d25ccd51a5..0d5bbf6a2ddd 100644 > --- a/block/bsg-lib.c > +++ b/block/bsg-lib.c > @@ -132,6 +132,7 @@ static int bsg_prepare_job(struct device *dev, struct > request *req) > struct bsg_job *job = blk_mq_rq_to_pdu(req); > int ret; > > + job->timeout = req->timeout; > job->request = rq->cmd; > job->request_len = rq->cmd_len; > > diff --git a/drivers/s390/scsi/zfcp_fc.c b/drivers/s390/scsi/zfcp_fc.c > index 8210645c2111..9d6f69b92c81 100644 > --- a/drivers/s390/scsi/zfcp_fc.c > +++ b/drivers/s390/scsi/zfcp_fc.c > @@ -960,7 +960,7 @@ static int zfcp_fc_exec_els_job(struct bsg_job *job, > d_id = ntoh24(bsg_request->rqst_data.h_els.port_id); > > els->handler = zfcp_fc_ct_els_job_handler; > - return zfcp_fsf_send_els(adapter, d_id, els, job->req->timeout / HZ); > + return zfcp_fsf_send_els(adapter, d_id, els, job->timeout / HZ); > } > > static int zfcp_fc_exec_ct_job(struct bsg_job *job, > @@ -979,7 +979,7 @@ static int zfcp_fc_exec_ct_job(struct bsg_job *job, > return ret; > > ct->handler = zfcp_fc_ct_job_handler; > - ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->req->timeout / HZ); > + ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->timeout / HZ); > if (ret) > zfcp_fc_wka_port_put(wka_port); > > diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h > index b1be0233ce35..402223c95ce1 100644 > --- a/include/linux/bsg-lib.h > +++ b/include/linux/bsg-lib.h > @@ -44,6 +44,8 @@ struct bsg_job { > > struct kref kref; > > + unsigned int timeout; > + > /* Transport/driver specific 
request/reply structs */ > void *request; > void *reply; > -- > 2.14.1 > Reviewed-by: Benjamin Block Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On Mon, 2017-10-16 at 13:30 +0200, Hannes Reinecke wrote: > On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: > > (+Himanshu Madhani) > > > > Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for > > the qla2xxx initiator driver to the scsi_host->can_queue value? > > '3' is just a starting point; later on it'll be adjusted via > scsi_change_depth(). > Looks like it's not working correctly with blk-mq, though. Hello Hannes, Isn't the purpose of scsi_change_queue_depth() to change the sdev queue_depth only? As far as I know there is only one assignment of shost->cmd_per_lun in the SCSI core and that's the following assignment in scsi_host_alloc(): shost->cmd_per_lun = sht->cmd_per_lun; Bart.
Re: [GIT PULL 02/58] lightnvm: prevent bd removal if busy
On Fri, Oct 13, 2017 at 5:35 PM, Rakesh Pandit wrote: > On Fri, Oct 13, 2017 at 07:58:09AM -0700, Christoph Hellwig wrote: >> On Fri, Oct 13, 2017 at 02:45:51PM +0200, Matias Bjørling wrote: >> > From: Rakesh Pandit >> > >> > When a virtual block device is formatted and mounted after creating >> > with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove" >> > would result in this: >> > >> > 446416.309757] bdi-block not registered >> > [446416.309773] [ cut here ] >> > [446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 >> > __mark_inode_dirty+0x268/0x340 >> > >> > Ideally removal should return -EBUSY as block device is mounted after >> > formatting. This patch tries to address this checking if whole device >> > or any partition of it already mounted or not before removal. >> >> How is this different from any other block device that can be >> removed even if a file system is mounted? > > One can create many virtual block devices on top of physical using: > nvme lnvm create ... -t pblk > > And remove them using: > nvme lnvm remove > > Because the block devices are virtual in nature created by a program I was > expecting removal to tell me they are busy instead of bdi-block not registered > following by a WARNING (above). My use case was writing automatic test case > but I assumed this is useful in general. > >> >> > >> > Whole device is checked using "bd_super" member of block device. This >> > member is always set once block device has been mounted using a >> > filesystem. Another member "bd_part_count" takes care of checking any >> > if any partitions are under use. "bd_part_count" is only updated >> > under locks when partitions are opened or closed (first open and last >> > release). This at least does take care sending -EBUSY if removal is >> > being attempted while whole block device or any partition is mounted. >> > >> >> That's a massive layering violation, and a driver has no business >> looking at these fields.
> > Okay, I didn't consider this earlier. I would suggest a revert for this. I see you've already done it. Thanks Jens.
Re: [PATCH 00/15] bcache: series of patches for 4.15
On Fri, Oct 13 2017, Michael Lyle wrote: > Jens, and everyone: > > Here is the current series of work for inclusion in next and for 4.15's > merge window. We may get some additional fixes, but this is most of the > work we expect. All have been previously sent to these lists except the > very last one. > > There's a lot of small cleanup patches and fixes for race conditions. > > Also included is my new rate controller which we've been measuring for > some time, and other related changes to writeback rate management. Applied for 4.15, thanks. -- Jens Axboe
Re: [PATCH v2 0/3] Move tagset reinit to its only current consumer, nvme-core
On 10/16/2017 06:48 AM, Christoph Hellwig wrote: > Jens, are you fine with picking this up through the nvme tree? > I think Sagi's later changes rely on it, so that would make life > a little easier. Yeah that's fine, you can add my reviewed-by to the patches as well. -- Jens Axboe
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Mon, Oct 16, 2017 at 1:44 PM, Christoph Hellwig wrote: > On Fri, Oct 06, 2017 at 02:31:20PM +0200, Ilya Dryomov wrote: >> This would unconditionally overwrite any WRITE ZEROES error. If we get >> e.g. -EIO, and manual zeroing is not allowed, I don't think we want to >> return -EOPNOTSUPP? >> >> Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't >> make sense to me, because manual zeroing is always supported, just not >> always allowed. > > Then throw the bdev_write_zeroes_sectors check back in: > > if (ret && try_write_zeroes) { > if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { > try_write_zeroes = false; > goto retry; > } > if (!bdev_write_zeroes_sectors(bdev)) > ret = -EOPNOTSUPP; > } > > The important bit is that the structure in the current patch, where the > bdev_write_zeroes_sectors check is on the same level as the method > selection, is extremely confusing. I see. An updated version should be in your inbox. Thanks, Ilya
[PATCH v3 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For the BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same, thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig Cc: "Martin K. Petersen" Cc: Hannes Reinecke Signed-off-by: Ilya Dryomov --- block/blk-lib.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 9d2ab8bba52a..23411ec7f3cc 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -321,12 +321,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should retry - * the call to blkdev_issue_zeroout() and the fallback path will be used.
- * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -370,18 +364,49 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { - int ret; - struct bio *bio = NULL; + int ret = 0; + sector_t bs_mask; + struct bio *bio; struct blk_plug plug; + bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; + +retry: + bio = NULL; blk_start_plug(&plug); - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - &bio, flags); + if (try_write_zeroes) { + ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, + gfp_mask, &bio, flags); + } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, + gfp_mask, &bio); + } else { + /* No zeroing offload support */ + ret = -EOPNOTSUPP; + } if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); } blk_finish_plug(&plug); + if (ret && try_write_zeroes) { + if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + try_write_zeroes = false; + goto retry; + } + if (!bdev_write_zeroes_sectors(bdev)) { + /* +* Zeroing offload support was indicated, but the +* device reported ILLEGAL REQUEST (for some devices +* there is no non-destructive way to verify whether +* WRITE ZEROES is actually supported). +*/ + ret = -EOPNOTSUPP; + } + } return ret; } -- 2.4.3
[PATCH v3 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in the !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov --- block/blk-lib.c | 63 + 1 file changed, 37 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..9d2ab8bba52a 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,40 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static int __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + if (!q) + return -ENXIO; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; + return 0; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +338,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +347,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - 
bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, +biop); } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
[PATCH v3 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag; Jens can add it back if needed. v2 -> v3: - another code flow change in blkdev_issue_zeroout() suggested by Christoph -- no functional changes v1 -> v2: - changed code flow in blkdev_issue_zeroout() according to Christoph's suggestion - this required adding additional checks to blkdev_issue_zeroout() and __blkdev_issue_zero_pages(), the latter is no longer void Previous version at https://marc.info/?l=linux-block&m=150712940922089&w=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 108 +--- 1 file changed, 72 insertions(+), 36 deletions(-) -- 2.4.3
Re: [PATCH v2 0/3] Move tagset reinit to its only current consumer, nvme-core
Jens, are you fine with picking this up through the nvme tree? I think Sagi's later changes rely on it, so that would make life a little easier.
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 06, 2017 at 02:31:20PM +0200, Ilya Dryomov wrote: > This would unconditionally overwrite any WRITE ZEROES error. If we get > e.g. -EIO, and manual zeroing is not allowed, I don't think we want to > return -EOPNOTSUPP? > > Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't > make sense to me, because manual zeroing is always supported, just not > always allowed. Then throw the bdev_write_zeroes_sectors check back in: if (ret && try_write_zeroes) { if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { try_write_zeroes = false; goto retry; } if (!bdev_write_zeroes_sectors(bdev)) ret = -EOPNOTSUPP; } The important bit is that the structure in the current patch, where the bdev_write_zeroes_sectors check is on the same level as the method selection, is extremely confusing.
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On 10/13/2017 07:29 PM, Ming Lei wrote: > On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote: >>> On Fri, Oct 13, 2017 at 04:31:04PM +, Bart Van Assche wrote: On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote: > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is > 3, Sorry but I doubt whether that is correct. More in general, I don't know any modern storage HBA for which the default queue depth is so low. >>> >>> You can grep: >>> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E >>> "qla2xxx|lpfc" >> >> Such a low queue depth will result in suboptimal performance for adapters >> that communicate over a storage network. I think that's a bug and that both >> adapters support much higher cmd_per_lun values. >> >> (+James Smart) >> >> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN >> from 30 to 3? Was that perhaps a workaround for a bug in a specific target >> implementation? >> >> (+Himanshu Madhani) >> >> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for >> the qla2xxx initiator driver to the scsi_host->can_queue value? > > ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't > reasonable to increase cmd_per_lun to .can_queue. > '3' is just a starting point; later on it'll be adjusted via scsi_change_depth(). Looks like it's not working correctly with blk-mq, though. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 6, 2017 at 2:31 PM, Ilya Dryomov wrote: > On Fri, Oct 6, 2017 at 2:05 PM, Christoph Hellwig wrote: >> On Thu, Oct 05, 2017 at 09:32:33PM +0200, Ilya Dryomov wrote: >>> This is to avoid returning -EREMOTEIO in the following case: device >>> doesn't support WRITE SAME but scsi_disk::max_ws_blocks != 0, zeroout >>> is called with BLKDEV_ZERO_NOFALLBACK. Enter blkdev_issue_zeroout(), >>> bdev_write_zeroes_sectors() != 0, so we issue WRITE ZEROES. The >>> request fails with ILLEGAL REQUEST, sd_done() sets ->no_write_same and >>> updates queue_limits, ILLEGAL REQUEST is translated into -EREMOTEIO, >>> which is returned from submit_bio_wait(). Manual zeroing is not >>> allowed, so we must return an error, but it shouldn't be -EREMOTEIO if >>> queue_limits just got updated because of ILLEGAL REQUEST. Without this >>> conditional, we'd get >> >> Hmm. I think we'd be better off just doing this before the retry loop: >> >> if (ret && try_write_zeroes) { >> if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { >> try_write_zeroes = false; >> goto retry; >> } >> ret = -EOPNOTSUPP; >> } > > This would unconditionally overwrite any WRITE ZEROES error. If we get > e.g. -EIO, and manual zeroing is not allowed, I don't think we want to > return -EOPNOTSUPP? > > Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't > make sense to me, because manual zeroing is always supported, just not > always allowed. Ping... Thanks, Ilya