Re: [PATCH 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctxs()

2017-08-01 Thread Ming Lei
On Mon, Jul 31, 2017 at 11:09:38PM +, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> > @@ -810,7 +810,11 @@ static void blk_mq_timeout_work(struct work_struct 
> > *work)
> >  
> >  struct ctx_iter_data {
> > struct blk_mq_hw_ctx *hctx;
> > -   struct list_head *list;
> > +
> > +   union {
> > +   struct list_head *list;
> > +   struct request *rq;
> > +   };
> >  };
> 
> Hello Ming,
> 
> Please introduce a new data structure for dispatch_rq_from_ctx() /
> blk_mq_dispatch_rq_from_ctxs() instead of introducing a union in struct
> ctx_iter_data. That will avoid that .list can be used in a context where
> a struct request * pointer has been stored in the structure and vice versa.

It looks like there isn't such a usage now; or we can simply keep both 'list' and
'rq' in this data structure if there is.
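Something like the following would work if both members end up being needed at
the same time (just a sketch, not part of the posted patch):

	struct ctx_iter_data {
		struct blk_mq_hw_ctx *hctx;
		struct list_head *list;	/* used by flush_busy_ctx() */
		struct request *rq;	/* used by dispatch_rq_from_ctx() */
	};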

>  
> >  static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void 
> > *data)
> > @@ -826,6 +830,26 @@ static bool flush_busy_ctx(struct sbitmap *sb, 
> > unsigned int bitnr, void *data)
> > return true;
> >  }
> >  
> > +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, 
> > void *data)
> > +{
> > +   struct ctx_iter_data *dispatch_data = data;
> > +   struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
> > +   struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
> > +   bool empty = true;
> > +
> > +   spin_lock(&ctx->lock);
> > +   if (unlikely(!list_empty(&ctx->rq_list))) {
> > +   dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
> > +   list_del_init(&dispatch_data->rq->queuelist);
> > +   empty = list_empty(&ctx->rq_list);
> > +   }
> > +   spin_unlock(&ctx->lock);
> > +   if (empty)
> > +   sbitmap_clear_bit(sb, bitnr);
> 
> This sbitmap_clear_bit() occurs without holding blk_mq_ctx.lock. Sorry but
> I don't think this is safe. Please either remove this sbitmap_clear_bit() call
> or make sure that it happens with blk_mq_ctx.lock held.

Good catch. sbitmap_clear_bit() should be done while holding ctx->lock,
otherwise a newly set pending bit may be cleared.
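A minimal sketch of the helper with the bit cleared while ctx->lock is still
held (untested; it assumes the caller initialized dispatch_data->rq to NULL):

	static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
	{
		struct ctx_iter_data *dispatch_data = data;
		struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
		struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];

		spin_lock(&ctx->lock);
		if (!list_empty(&ctx->rq_list)) {
			dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
			list_del_init(&dispatch_data->rq->queuelist);
			/* clear the bit before unlock so a concurrent insert isn't lost */
			if (list_empty(&ctx->rq_list))
				sbitmap_clear_bit(sb, bitnr);
		}
		spin_unlock(&ctx->lock);

		/* returning false stops the sbitmap iteration */
		return !dispatch_data->rq;
	}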

-- 
Ming


[PATCH 1/2] block: Zoned block device single-threaded submission

2017-08-01 Thread Damien Le Moal
From: Hannes Reinecke 

The scsi_request_fn() dispatch function internally unlocks the request
queue before submitting a request to the underlying LLD. This can
potentially lead to write request reordering if the context executing
scsi_request_fn() is preempted before the request is submitted to the
LLD and another context starts executing the same function.

This is not a problem for regular disks but leads to write I/O errors
on host managed zoned block devices and reduces the effectiveness of
sequential write optimizations for host aware disks.
(Note: the zone write lock in place in the scsi command init code
prevents multiple writes from being issued simultaneously to the same
zone to avoid HBA level reordering issues, but this locking mechanism
cannot prevent reordering at this level.)

Prevent this from happening by limiting the number of contexts that can
simultaneously execute the queue request_fn() function to a single
thread.

A similar patch was originally proposed by Hannes Reinecke in a first
set of patches implementing ZBC support but ultimately not included in
the final support implementation. See commit 92f5e2a295
"block: add flag for single-threaded submission" in the tree
https://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git/log/?h=zac.v3

Authorship thus goes to Hannes.

Signed-off-by: Hannes Reinecke 
Signed-off-by: Damien Le Moal 
---
 block/blk-core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index dbecbf4a64e0..cf590cbddcfd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -371,7 +371,14 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
 * running such a request function concurrently. Keep track of the
 * number of active request_fn invocations such that blk_drain_queue()
 * can wait until all these request_fn calls have finished.
+*
+* For zoned block devices, do not allow multiple threads to
+* dequeue requests as this can lead to write request reordering
+* during the time the queue is unlocked.
 */
+   if (blk_queue_is_zoned(q) && q->request_fn_active)
+   return;
+
q->request_fn_active++;
q->request_fn(q);
q->request_fn_active--;
-- 
2.13.3



[PATCH 2/2] sd_zbc: Disable zone locking with scsi-mq enabled

2017-08-01 Thread Damien Le Moal
Unlike the legacy blk-sq infrastructure, blk-mq cannot provide any sort
of guarantee to well behaved users regarding write command ordering for
zoned block devices. This is due to the lockless manipulation of the dispatch
queue as well as the use of different work queues for requeuing requests and
running queues. Requests in flight may exist, and a combination of context
preemption and request requeue events can easily reorder a sequential
write sequence in the dispatch queue.

This problem is not solved by the zone write lock mechanism in place in
sd_zbc.c (called from sd_init_cmnd() and sd_done()). On the contrary, this
locking makes the problem even worse due to the high number of requeue events
it causes under heavy write workloads. Furthermore, this zone locking can
generate dispatch queue deadlocks if a write command that acquired a
zone lock is requeued and ends up behind a write command which has not yet
been prepared but is directed to the same locked zone. In such a case, trying to
dispatch this write command will systematically return BLKPREP_DEFER and force
a requeue at the head of the dispatch queue. Subsequent dispatch executions will
again fail to issue any request in the same manner, stalling request dispatch
to the device endlessly.

This patch does not solve the blk-mq reordering problem, but prevents this
deadlock problem from happening by disabling zone locking when scsi-mq is
enabled.

Take away from this: ZBC/ZAC host managed drives cannot be used safely
with scsi-mq enabled for now, unless the drive user limits the number of
writes per sequential zone to 1 at all times. Both f2fs and dm-zoned do not
do this and so cannot be used safely with scsi-mq enabled.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 96855df9f49d..78fb51a37e86 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -528,15 +528,20 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
+   struct request_queue *q = sdkp->disk->queue;
 
/* chunk_sectors indicates the zone size */
-   blk_queue_chunk_sectors(sdkp->disk->queue,
+   blk_queue_chunk_sectors(q,
logical_to_sectors(sdkp->device, sdkp->zone_blocks));
sdkp->zone_shift = ilog2(sdkp->zone_blocks);
sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
if (sdkp->capacity & (sdkp->zone_blocks - 1))
sdkp->nr_zones++;
 
+   /* Do not use zone locking in mq case */
+   if (q->mq_ops)
+   return 0;
+
if (!sdkp->zones_wlock) {
sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
sizeof(unsigned long),
-- 
2.13.3



[PATCH 0/2] Zoned block device support fixes

2017-08-01 Thread Damien Le Moal
This small series addresses a couple of problems detected with 4.13-rc.

The first patch ensures that a well behaved host managed zoned block device
user (an application doing direct disk accesses, f2fs or dm-zoned) will not
see unaligned write errors due to reordering of write commands at dispatch
time.

The second patch addresses a request dispatch deadlock that can very easily
trigger with f2fs or dm-zoned when scsi-mq is enabled. The root cause of this
problem is the high probability of unintended reordering of sequential writes
in the dispatch queue due to concurrent requeue and insert events. This
patch only fixes the deadlock problem and is not a fix for the root cause
itself, which will most likely require a different approach, as no easy fix
seems to be possible at the scsi-mq level.

This means that host managed zoned block devices cannot be reliably used under
a regular asynchronous (queued) write BIO issuing pattern with scsi-mq enabled
for now.

Damien Le Moal (1):
  sd_zbc: Disable zone locking with scsi-mq enabled

Hannes Reinecke (1):
  block: Zoned block device single-threaded submission

 block/blk-core.c  | 7 +++
 drivers/scsi/sd_zbc.c | 7 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

-- 
2.13.3



Re: [PATCH] tcmu: Oops in unmap_thread_fn()

2017-08-01 Thread Xiubo Li


On 2017-08-02 04:09, Dan Carpenter wrote:

Calling list_del() on the iterator pointer in list_for_each_entry() will
cause an oops.  We need to use the _safe() version for that.

Fixes: c73d02f63c16 ("tcmu: Add fifo type waiter list support to avoid 
starvation")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index 9258b7dd2c30..fd9fcea68d23 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1985,7 +1985,7 @@ static struct target_backend_ops tcmu_ops = {
  
  static int unmap_thread_fn(void *data)

  {
-   struct tcmu_dev *udev;
+   struct tcmu_dev *udev, *tmp;
loff_t off;
uint32_t start, end, block;
static uint32_t free_blocks;
@@ -2056,7 +2056,7 @@ static int unmap_thread_fn(void *data)
 * for the global data pool blocks.
 */
mutex_lock(&root_udev_waiter_mutex);
-   list_for_each_entry(udev, &root_udev_waiter, waiter) {
+   list_for_each_entry_safe(udev, tmp, &root_udev_waiter, waiter) {
mutex_lock(&udev->cmdr_lock);
if (udev->waiting_blocks < free_blocks) {
mutex_unlock(&udev->cmdr_lock);

Good catch, and it looks good to me.

Thanks,

BRs
Xiubo




[Bug 196543] Adaptec 6405H can not understand Queue full message from Disks.

2017-08-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=196543

--- Comment #2 from Kazufumi Watanabe (kiyomi.kakitsub...@gmail.com) ---
Dear James

Thank you for your message.
I would like to give you the result of a SAS bus trace,
but my company does not grant me permission to take the data outside.

I am sorry that I cannot give you the bus trace data.

Well, I can tell you what happens.

Environment:
SAS HBA: 
Microsemi Adaptec 6405H/7085H
(This does not happen with other SAS HBAs, e.g. Broadcom 9300-8E, Adaptec 1000-8E)
OS:
Fedora26/CentOS7.3
openSUSE Leap 42.3

During OS installation, when formatting partitions, the HBA sends some
commands to the SSD device (I don't know which commands are sent to the
SSD device).

When the device's queue is full, it returns a "queue full" message to the HBA,
but the HBA keeps sending other commands to the device.
I think the situation stays the same, so the OS cannot be installed on the SSD device.

The problem is that the SAS HBA continues sending commands after receiving the
"queue full" message, and the kernel cannot detect it.

Cheers
Kazufumi Watanabe

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [PATCH 05/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-01 Thread Ming Lei
On Tue, Aug 01, 2017 at 04:14:07PM +, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 18:44 +0800, Ming Lei wrote:
> > On Mon, Jul 31, 2017 at 11:42:21PM +, Bart Van Assche wrote:
> > > Since setting, clearing and testing of BLK_MQ_S_BUSY can happen 
> > > concurrently
> > > and since clearing and testing happens without any locks held I'm afraid 
> > > this
> > > patch introduces the following race conditions:
> > > [ ... ]
> > > * Checking BLK_MQ_S_BUSY after requests have been removed from the 
> > > dispatch list
> > >   but before that bit is cleared, resulting in test_bit(BLK_MQ_S_BUSY, 
> > > &hctx->state)
> > >   reporting that the BLK_MQ_S_BUSY
> > > has been set although there are no requests
> > >   on the dispatch list.
> > 
> > That won't be a problem, because dispatch will be started in the
> > context in which dispatch list is flushed, since the BUSY bit
> > is cleared after blk_mq_dispatch_rq_list() returns. So no I/O
> > hang.
> 
> Hello Ming,
> 
> Please consider changing the name of the BLK_MQ_S_BUSY constant. That bit
> is used to serialize dispatching requests from the hctx dispatch list but
> that's not clear from the name of that constant.

Actually what we want to do is to stop taking requests from the sw/scheduler
queue while ->dispatch isn't flushed completely. I think BUSY isn't
a bad name for this case, but how about DISPATCH_BUSY? Or
FLUSHING_DISPATCH?

After thinking further about the handling, we can set the
BUSY bit just when adding requests to ->dispatch, and clear the
bit after returning from blk_mq_dispatch_rq_list() when the
current local list (taken from ->dispatch) is flushed completely and
->dispatch is empty. This minimizes the race window and is
still safe, because we always move on to dispatch either when a new
request is added to ->dispatch or when ->dispatch is flushed completely.

In any case, a comment should be added to clarify this.
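Roughly like this (just to illustrate the intended ordering, not a tested
patch):

	/* in blk_mq_sched_dispatch_requests(), when draining ->dispatch */
	spin_lock(&hctx->lock);
	if (!list_empty(&hctx->dispatch)) {
		list_splice_init(&hctx->dispatch, &rq_list);
		set_bit(BLK_MQ_S_BUSY, &hctx->state);
	}
	spin_unlock(&hctx->lock);

	/* later, once the local list has been dispatched */
	can_go = blk_mq_dispatch_rq_list(q, &rq_list);
	if (can_go && list_empty_careful(&hctx->dispatch))
		clear_bit(BLK_MQ_S_BUSY, &hctx->state);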

-- 
Ming


Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-08-01 Thread Ming Lei
On Tue, Aug 01, 2017 at 03:11:42PM +, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 18:50 +0800, Ming Lei wrote:
> > On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> > > How can we get the accurate 'number of requests in progress' efficiently?
> 
> Hello Ming,
> 
> How about counting the number of bits that have been set in the tag set?
> I am aware that these bits can be set and/or cleared concurrently with the
> dispatch code but that count is probably a good starting point.

It has to be an atomic_t, which is too heavy for us, please see the report:

http://marc.info/?t=14986844843&r=1&w=2

Both Jens and I want to kill hd_struct.in_flight, but it looks like there is
still no good way.

> 
> > > From my test data of mq-deadline on lpfc, the performance is good,
> > > please see it in cover letter.
> > 
> > Forget to mention, ctx->list is one per-cpu list and the lock is percpu
> > lock, so changing to this way shouldn't be a performance issue.
> 
> Sorry but I don't consider this reply as sufficient. The latency of IB HCA's
> is significantly lower than that of any FC hardware I ran performance
> measurements on myself. It's not because this patch series improves 
> performance
> for lpfc that that guarantees that there won't be a performance regression for
> ib_srp, ib_iser or any other low-latency initiator driver for which q->depth
> != 0.

If IB HCAs have lower latency than FC, there should be fewer chances
to trigger BUSY. Otherwise IB should have the same sequential I/O
performance issue, I guess.

The ctx list is a per-cpu list, so in theory there shouldn't be issues with this
change, and the only change for IB is the following:

V4.13-rc3:
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);

v4.13-rc3 patched:
do {
struct request *rq;

/* pick up one request from one ctx list */
rq = blk_mq_dispatch_rq_from_ctxs(hctx); 
if (!rq)
break;
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list));

I doubt that the change will be observable in an actual test.

In any case, we will provide test data on an IB HCA with V2, and Laurence
will help me run the test on IB.

> 
> Additionally, patch 03/14 most likely introduces a fairness problem. Shouldn't
> blk_mq_dispatch_rq_from_ctxs() dequeue requests from the per-CPU queues in a
> round-robin fashion instead of always starting at the first per-CPU queue in
> hctx->ctx_map?

That is a good question. Round-robin looks more fair; I will
change to it in V2, and the implementation should be simple.
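Something along these lines, for example (dispatch_from is a hypothetical new
per-hctx cursor, and the ctx_map bitmap is ignored here for brevity):

	static struct request *blk_mq_dispatch_rq_from_ctxs_rr(struct blk_mq_hw_ctx *hctx)
	{
		unsigned int i, off = hctx->dispatch_from;	/* hypothetical cursor */
		struct request *rq = NULL;

		for (i = 0; i < hctx->nr_ctx && !rq; i++) {
			struct blk_mq_ctx *ctx = hctx->ctxs[(off + i) % hctx->nr_ctx];

			spin_lock(&ctx->lock);
			if (!list_empty(&ctx->rq_list)) {
				rq = list_entry_rq(ctx->rq_list.next);
				list_del_init(&rq->queuelist);
			}
			spin_unlock(&ctx->lock);
		}

		/* the next call resumes after the ctx serviced this time */
		if (hctx->nr_ctx)
			hctx->dispatch_from = (off + i) % hctx->nr_ctx;
		return rq;
	}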

-- 
Ming


[no subject]

2017-08-01 Thread системы администратор
Attention:

Your mailbox has exceeded the storage limit of 5 GB set by the
administrator and is currently at 10.9 GB. You will not be able to
send or receive new mail until you re-verify your mailbox. To restore
your mailbox, send the following information:

Name:
Username:
Password:
Confirm password:
E-mail address:
Phone:

If you are unable to re-verify your messages, your mailbox will be
disabled!

We apologize for the inconvenience.
Verification code: EN: Ru...776774990..2017
Technical support mail ©2017

Thank you,
system administrator


[PATCH] tcmu: Oops in unmap_thread_fn()

2017-08-01 Thread Dan Carpenter
Calling list_del() on the iterator pointer in list_for_each_entry() will
cause an oops.  We need to use the _safe() version for that.

Fixes: c73d02f63c16 ("tcmu: Add fifo type waiter list support to avoid 
starvation")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index 9258b7dd2c30..fd9fcea68d23 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1985,7 +1985,7 @@ static struct target_backend_ops tcmu_ops = {
 
 static int unmap_thread_fn(void *data)
 {
-   struct tcmu_dev *udev;
+   struct tcmu_dev *udev, *tmp;
loff_t off;
uint32_t start, end, block;
static uint32_t free_blocks;
@@ -2056,7 +2056,7 @@ static int unmap_thread_fn(void *data)
 * for the global data pool blocks.
 */
mutex_lock(&root_udev_waiter_mutex);
-   list_for_each_entry(udev, &root_udev_waiter, waiter) {
+   list_for_each_entry_safe(udev, tmp, &root_udev_waiter, waiter) {
mutex_lock(&udev->cmdr_lock);
if (udev->waiting_blocks < free_blocks) {
mutex_unlock(&udev->cmdr_lock);


Re: [PATCH] st: fix blk_get_queue usage

2017-08-01 Thread Kai Mäkisara (Kolumbus)

> On 1 Aug 2017, at 15.42, Hannes Reinecke  wrote:
> 
> From: Bodo Stroesser 
> 
> If blk_queue_get() in st_probe fails, disk->queue must not
> be set to SDp->request_queue, as that would result in
> put_disk() dropping a not taken reference.
> 
> Thus, disk->queue should be set only after a successful
> blk_queue_get().
> 
> Signed-off-by: Bodo Stroesser 
> Acked-by: Shirish Pargaonkar 
> Signed-off-by: Hannes Reinecke 
Acked-by: Kai Mäkisara 

Thanks,
Kai



[PATCH 0/2] nvmet_fc: work around overalloc of queue elements

2017-08-01 Thread James Smart
At queue creation, the transport allocates a local job struct
(struct nvmet_fc_fcp_iod) for each possible element of the queue.
When a new CMD is received from the wire, a jobs struct is allocated
from the queue and then used for the duration of the command.
The job struct contains buffer space for the wire command iu. Thus,
upon allocation of the job struct, the cmd iu buffer is copied to
the job struct and the LLDD may immediately free/reuse the CMD IU
buffer passed in the call.

However, in some circumstances, due to the packetized nature of FC
and the api of the FC LLDD, the LLDD may issue a hw command to send the
wire response but not receive the hw completion for that command, and
thus not upcall the nvmet_fc layer, before a new command is
asynchronously received on the wire. In other words, it's possible
for the initiator to get the response from the wire, thus believe a
command slot is free, and send a new command iu. The new command iu
may be received by the LLDD and passed to the transport before the
LLDD has serviced the hw completion and made the teardown calls for
the original job struct. As such, there is no job struct
available for the new io. E.g. it appears as if the host sent more
queue elements than the queue size. It didn't, based on its
understanding.

Rather than treat this temporary overflow of the queue resources
as a hard connection failure, change the api slightly to temporarily
queue the new command until the job resource is freed up. The impact
to the LLDD is that, in the cases where release can't be immediate, the
LLDD must wait until a new callback is called notifying it of the release
of the cmd iu buffer. Support for this api change is indicated by a
non-null callback function, and the new buffer-held semantic is
indicated by an -EOVERFLOW status code returned from
nvmet_fc_rcv_fcp_req().
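From the LLDD point of view, handling the new return code would look roughly
like this (the buffer bookkeeping helpers are made-up placeholders):

	ret = nvmet_fc_rcv_fcp_req(tgtport, fcpreq, cmdiubuf, cmdiubuf_len);
	if (ret == -EOVERFLOW) {
		/*
		 * The transport still references the cmd iu buffer; do not
		 * repost it to the hardware until the new .defer_rcv
		 * callback has been invoked for this request.
		 */
		lldd_mark_cmd_buffer_held(buf);		/* placeholder helper */
	} else if (ret) {
		lldd_drop_cmd(buf);			/* placeholder: normal error path */
	} else {
		lldd_repost_cmd_buffer(buf);		/* buffer already copied by the transport */
	}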


James Smart (2):
  nvmet_fc: add defer_req callback for deferment of cmd buffer return
  lpfc: support nvmet_fc defer_rcv callback

 drivers/nvme/target/fc.c | 212 +--
 drivers/scsi/lpfc/lpfc_attr.c|   4 +-
 drivers/scsi/lpfc/lpfc_debugfs.c |   5 +-
 drivers/scsi/lpfc/lpfc_nvmet.c   |  30 ++
 drivers/scsi/lpfc/lpfc_nvmet.h   |   1 +
 include/linux/nvme-fc-driver.h   |   7 ++
 6 files changed, 229 insertions(+), 30 deletions(-)

-- 
2.13.1



[PATCH 1/2] nvmet_fc: add defer_req callback for deferment of cmd buffer return

2017-08-01 Thread James Smart
At queue creation, the transport allocates a local job struct
(struct nvmet_fc_fcp_iod) for each possible element of the queue.
When a new CMD is received from the wire, a jobs struct is allocated
from the queue and then used for the duration of the command.
The job struct contains buffer space for the wire command iu. Thus,
upon allocation of the job struct, the cmd iu buffer is copied to
the job struct and the LLDD may immediately free/reuse the CMD IU
buffer passed in the call.

However, in some circumstances, due to the packetized nature of FC
and the api of the FC LLDD, the LLDD may issue a hw command to send the
wire response but not receive the hw completion for that command, and
thus not upcall the nvmet_fc layer, before a new command is
asynchronously received on the wire. In other words, it's possible
for the initiator to get the response from the wire, thus believe a
command slot is free, and send a new command iu. The new command iu
may be received by the LLDD and passed to the transport before the
LLDD has serviced the hw completion and made the teardown calls for
the original job struct. As such, there is no job struct
available for the new io. E.g. it appears as if the host sent more
queue elements than the queue size. It didn't, based on its
understanding.

Rather than treat this as a hard connection failure, queue the new
request until a job struct does free up. As the buffer isn't
copied (since there's no job struct), a special return value must be
returned to the LLDD to signify that it should hold off on recycling
the cmd iu buffer.  Later, when a job struct is allocated and the
buffer copied, a new LLDD callback is invoked to notify the
LLDD and allow it to recycle its command iu buffer.

Signed-off-by: James Smart 
---
 drivers/nvme/target/fc.c   | 212 +++--
 include/linux/nvme-fc-driver.h |   7 ++
 2 files changed, 191 insertions(+), 28 deletions(-)

diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 49948a412dcc..3d16870367d0 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -114,6 +114,11 @@ struct nvmet_fc_tgtport {
u32 max_sg_cnt;
 };
 
+struct nvmet_fc_defer_fcp_req {
+   struct list_headreq_list;
+   struct nvmefc_tgt_fcp_req   *fcp_req;
+};
+
 struct nvmet_fc_tgt_queue {
boolninetypercent;
u16 qid;
@@ -132,6 +137,8 @@ struct nvmet_fc_tgt_queue {
struct nvmet_fc_tgt_assoc   *assoc;
struct nvmet_fc_fcp_iod *fod;   /* array of fcp_iods */
struct list_headfod_list;
+   struct list_headpending_cmd_list;
+   struct list_headavail_defer_list;
struct workqueue_struct *work_q;
struct kref ref;
 } __aligned(sizeof(unsigned long long));
@@ -223,6 +230,8 @@ static void nvmet_fc_tgt_q_put(struct nvmet_fc_tgt_queue 
*queue);
 static int nvmet_fc_tgt_q_get(struct nvmet_fc_tgt_queue *queue);
 static void nvmet_fc_tgtport_put(struct nvmet_fc_tgtport *tgtport);
 static int nvmet_fc_tgtport_get(struct nvmet_fc_tgtport *tgtport);
+static void nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
+   struct nvmet_fc_fcp_iod *fod);
 
 
 /* *** FC-NVME DMA Handling  */
@@ -463,9 +472,9 @@ static struct nvmet_fc_fcp_iod *
 nvmet_fc_alloc_fcp_iod(struct nvmet_fc_tgt_queue *queue)
 {
static struct nvmet_fc_fcp_iod *fod;
-   unsigned long flags;
 
-   spin_lock_irqsave(&queue->qlock, flags);
+   /* Caller must hold queue->qlock */
+
fod = list_first_entry_or_null(&queue->fod_list,
struct nvmet_fc_fcp_iod, fcp_list);
if (fod) {
@@ -477,17 +486,37 @@ nvmet_fc_alloc_fcp_iod(struct nvmet_fc_tgt_queue *queue)
 * will "inherit" that reference.
 */
}
-   spin_unlock_irqrestore(&queue->qlock, flags);
return fod;
 }
 
 
 static void
+nvmet_fc_queue_fcp_req(struct nvmet_fc_tgtport *tgtport,
+  struct nvmet_fc_tgt_queue *queue,
+  struct nvmefc_tgt_fcp_req *fcpreq)
+{
+   struct nvmet_fc_fcp_iod *fod = fcpreq->nvmet_fc_private;
+
+   /*
+* put all admin cmds on hw queue id 0. All io commands go to
+* the respective hw queue based on a modulo basis
+*/
+   fcpreq->hwqid = queue->qid ?
+   ((queue->qid - 1) % tgtport->ops->max_hw_queues) : 0;
+
+   if (tgtport->ops->target_features & NVMET_FCTGTFEAT_CMD_IN_ISR)
queue_work_on(queue->cpu, queue->work_q, &fod->work);
+   else
+   nvmet_fc_handle_fcp_rqst(tgtport, fod);
+}
+
+static void
 nvmet_fc_free_fcp_iod(struct nvmet_fc_tgt_queue *queue,
struct 

[PATCH 2/2] lpfc: support nvmet_fc defer_rcv callback

2017-08-01 Thread James Smart
Currently, calls to nvmet_fc_rcv_fcp_req() always copied the
FC-NVME cmd iu to a temporary buffer before returning, allowing
the driver to immediately repost the buffer to the hardware.

To address timing conditions on queue element structures vs async
command reception, the nvmet_fc transport occasionally may need to
hold on to the command iu buffer for a short period. In these cases,
the nvmet_fc_rcv_fcp_req() will return a special return code
(-EOVERFLOW). In these cases, the LLDD must delay until the new
defer_rcv lldd callback is called before recycling the buffer back
to the hw.

This patch adds support for the new nvmet_fc transport defer_rcv
callback and recognition of the new error code when passing commands
to the transport.

This patch is intended to enter the kernel through the nvme block
tree which pulls in the nvmet_fc api change at the same time. It is
not to be merged via the scsi trees without the latest nvme support
in it.

Signed-off-by: Dick Kennedy 
Signed-off-by: James Smart 
---
 drivers/scsi/lpfc/lpfc_attr.c|  4 +++-
 drivers/scsi/lpfc/lpfc_debugfs.c |  5 -
 drivers/scsi/lpfc/lpfc_nvmet.c   | 30 ++
 drivers/scsi/lpfc/lpfc_nvmet.h   |  1 +
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
index 4ed48ed38e79..7ee1a94c0b33 100644
--- a/drivers/scsi/lpfc/lpfc_attr.c
+++ b/drivers/scsi/lpfc/lpfc_attr.c
@@ -205,8 +205,10 @@ lpfc_nvme_info_show(struct device *dev, struct 
device_attribute *attr,
atomic_read(&tgtp->xmt_ls_rsp_error));
 
len += snprintf(buf+len, PAGE_SIZE-len,
-   "FCP: Rcv %08x Release %08x Drop %08x\n",
+   "FCP: Rcv %08x Defer %08x Release %08x "
+   "Drop %08x\n",
atomic_read(&tgtp->rcv_fcp_cmd_in),
+   atomic_read(&tgtp->rcv_fcp_cmd_defer),
atomic_read(&tgtp->xmt_fcp_release),
atomic_read(&tgtp->rcv_fcp_cmd_drop));
 
diff --git a/drivers/scsi/lpfc/lpfc_debugfs.c b/drivers/scsi/lpfc/lpfc_debugfs.c
index 5cc8b0f7d885..744f3f395b64 100644
--- a/drivers/scsi/lpfc/lpfc_debugfs.c
+++ b/drivers/scsi/lpfc/lpfc_debugfs.c
@@ -782,8 +782,11 @@ lpfc_debugfs_nvmestat_data(struct lpfc_vport *vport, char 
*buf, int size)
atomic_read(&tgtp->xmt_ls_rsp_error));
 
len += snprintf(buf + len, size - len,
-   "FCP: Rcv %08x Drop %08x\n",
+   "FCP: Rcv %08x Defer %08x Release %08x "
+   "Drop %08x\n",
atomic_read(&tgtp->rcv_fcp_cmd_in),
+   atomic_read(&tgtp->rcv_fcp_cmd_defer),
+   atomic_read(&tgtp->xmt_fcp_release),
atomic_read(&tgtp->rcv_fcp_cmd_drop));
 
if (atomic_read(&tgtp->rcv_fcp_cmd_in) !=
diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
index fbeec344c6cc..a5dc1b83064a 100644
--- a/drivers/scsi/lpfc/lpfc_nvmet.c
+++ b/drivers/scsi/lpfc/lpfc_nvmet.c
@@ -841,12 +841,31 @@ lpfc_nvmet_xmt_fcp_release(struct nvmet_fc_target_port 
*tgtport,
lpfc_nvmet_ctxbuf_post(phba, ctxp->ctxbuf);
 }
 
+static void
+lpfc_nvmet_defer_rcv(struct nvmet_fc_target_port *tgtport,
+struct nvmefc_tgt_fcp_req *rsp)
+{
+   struct lpfc_nvmet_tgtport *tgtp;
+   struct lpfc_nvmet_rcv_ctx *ctxp =
+   container_of(rsp, struct lpfc_nvmet_rcv_ctx, ctx.fcp_req);
+   struct rqb_dmabuf *nvmebuf = ctxp->rqb_buffer;
+   struct lpfc_hba *phba = ctxp->phba;
+
+   lpfc_nvmeio_data(phba, "NVMET DEFERRCV: xri x%x sz %d CPU %02x\n",
+ctxp->oxid, ctxp->size, smp_processor_id());
+
+   tgtp = (struct lpfc_nvmet_tgtport *)phba->targetport->private;
+   atomic_inc(>rcv_fcp_cmd_defer);
lpfc_rq_buf_free(phba, &nvmebuf->hbuf); /* repost */
+}
+
 static struct nvmet_fc_target_template lpfc_tgttemplate = {
.targetport_delete = lpfc_nvmet_targetport_delete,
.xmt_ls_rsp = lpfc_nvmet_xmt_ls_rsp,
.fcp_op = lpfc_nvmet_xmt_fcp_op,
.fcp_abort  = lpfc_nvmet_xmt_fcp_abort,
.fcp_req_release = lpfc_nvmet_xmt_fcp_release,
+   .defer_rcv  = lpfc_nvmet_defer_rcv,
 
.max_hw_queues  = 1,
.max_sgl_segments = LPFC_NVMET_DEFAULT_SEGS,
@@ -1504,6 +1523,17 @@ lpfc_nvmet_unsol_fcp_buffer(struct lpfc_hba *phba,
return;
}
 
+   /* Processing of FCP command is deferred */
+   if (rc == -EOVERFLOW) {
+   lpfc_nvmeio_data(phba,
+"NVMET RCV BUSY: xri x%x sz %d from %06x\n",
+oxid, size, sid);
+   /* defer reposting rcv buffer till 

Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-08-01 Thread Ming Lei
On Mon, Jul 31, 2017 at 11:34:35PM +, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared
> > driver tag space is often quite big. Meantime
> > there is also queue depth for each lun(.cmd_per_lun),
> > which is often small.
> > 
> > So lots of requests may stay in sw queue, and we
> > always flush all belonging to same hw queue and
> > dispatch them all to driver, unfortunately it is
> > easy to cause queue busy because of the small
> > per-lun queue depth. Once these requests are flushed
> > out, they have to stay in hctx->dispatch, and no bio
> > merge can participate into these requests, and
> > sequential IO performance is hurt.
> > 
> > This patch improves dispatching from sw queue when
> > there is per-request-queue queue depth by taking
> > request one by one from sw queue, just like the way
> > of IO scheduler.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq-sched.c | 25 +++--
> >  1 file changed, 15 insertions(+), 10 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 47a25333a136..3510c01cb17b 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -96,6 +96,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> > *hctx)
> > const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > bool can_go = true;
> > LIST_HEAD(rq_list);
> > +   struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
> > +   has_sched_dispatch ? e->type->ops.mq.dispatch_request :
> > +   blk_mq_dispatch_rq_from_ctxs;
> >  
> > /* RCU or SRCU read lock is needed before checking quiesced flag */
> > if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
> > @@ -126,26 +129,28 @@ void blk_mq_sched_dispatch_requests(struct 
> > blk_mq_hw_ctx *hctx)
> > if (!list_empty(&rq_list)) {
> > blk_mq_sched_mark_restart_hctx(hctx);
> > can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> > -   } else if (!has_sched_dispatch) {
> > +   } else if (!has_sched_dispatch && !q->queue_depth) {
> > blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > blk_mq_dispatch_rq_list(q, &rq_list);
> > +   can_go = false;
> > }
> >  
> > +   if (!can_go)
> > +   return;
> > +
> > /*
> >  * We want to dispatch from the scheduler if we had no work left
> >  * on the dispatch list, OR if we did have work but weren't able
> >  * to make progress.
> >  */
> > -   if (can_go && has_sched_dispatch) {
> > -   do {
> > -   struct request *rq;
> > +   do {
> > +   struct request *rq;
> >  
> > -   rq = e->type->ops.mq.dispatch_request(hctx);
> > -   if (!rq)
> > -   break;
> > -   list_add(&rq->queuelist, &rq_list);
> > -   } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > -   }
> > +   rq = dispatch_fn(hctx);
> > +   if (!rq)
> > +   break;
> > +   list_add(&rq->queuelist, &rq_list);
> > +   } while (blk_mq_dispatch_rq_list(q, &rq_list));
> >  }
> 
> Hello Ming,
> 
> Although I like the idea behind this patch, I'm afraid that this patch will
> cause a performance regression for high-performance SCSI LLD drivers, e.g.
> ib_srp. Have you considered to rework this patch as follows:
> * Remove the code under "else if (!has_sched_dispatch && !q->queue_depth) {".

This will affect devices such as NVMe, in which busy is basically never
triggered, so it's better not to do this.

> * Modify all blk_mq_dispatch_rq_list() functions such that these dispatch up
>   to cmd_per_lun - (number of requests in progress) at once.

How can we get the accurate 'number of requests in progress' efficiently?

And we have already done it this way for the blk-mq scheduler, so it
shouldn't be a problem.

From my test data of mq-deadline on lpfc, the performance is good,
please see it in cover letter.


Thanks,
Ming


Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-08-01 Thread Bart Van Assche
On Tue, 2017-08-01 at 18:50 +0800, Ming Lei wrote:
> On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> > How can we get the accurate 'number of requests in progress' efficiently?

Hello Ming,

How about counting the number of bits that have been set in the tag set?
I am aware that these bits can be set and/or cleared concurrently with the
dispatch code but that count is probably a good starting point.

> > From my test data of mq-deadline on lpfc, the performance is good,
> > please see it in cover letter.
> 
> Forget to mention, ctx->list is one per-cpu list and the lock is percpu
> lock, so changing to this way shouldn't be a performance issue.

Sorry but I don't consider this reply as sufficient. The latency of IB HCAs
is significantly lower than that of any FC hardware I ran performance
measurements on myself. The fact that this patch series improves performance
for lpfc does not guarantee that there won't be a performance regression for
ib_srp, ib_iser or any other low-latency initiator driver for which q->depth
!= 0.

Additionally, patch 03/14 most likely introduces a fairness problem. Shouldn't
blk_mq_dispatch_rq_from_ctxs() dequeue requests from the per-CPU queues in a
round-robin fashion instead of always starting at the first per-CPU queue in
hctx->ctx_map?

Thanks,

Bart.


[PATCH] ipr: Fix scsi-mq lockdep issue

2017-08-01 Thread Brian King
Fixes the following lockdep warning that can occur when scsi-mq is enabled
with ipr due to ipr calling scsi_unblock_requests from irq context. The fix
is to move the call to scsi_unblock_requests to ipr's existing workqueue.

stack backtrace:
CPU: 28 PID: 0 Comm: swapper/28 Not tainted 4.13.0-rc2-gcc6x-gf74c89b #1
Call Trace:
[c01fffe97550] [c0b50818] dump_stack+0xe8/0x160 (unreliable)
[c01fffe97590] [c01586d0] print_usage_bug+0x2d0/0x390
[c01fffe97640] [c0158f34] mark_lock+0x7a4/0x8e0
[c01fffe976f0] [c015a000] __lock_acquire+0x6a0/0x1a70
[c01fffe97860] [c015befc] lock_acquire+0xec/0x2e0
[c01fffe97930] [c0b71514] _raw_spin_lock+0x44/0x70
[c01fffe97960] [c05b60f4] blk_mq_sched_dispatch_requests+0xa4/0x2a0
[c01fffe979c0] [c05acac0] __blk_mq_run_hw_queue+0x100/0x2c0
[c01fffe97a00] [c05ad478] __blk_mq_delay_run_hw_queue+0x118/0x130
[c01fffe97a40] [c05ad61c] blk_mq_start_hw_queues+0x6c/0xa0
[c01fffe97a80] [c0797aac] scsi_kick_queue+0x2c/0x60
[c01fffe97aa0] [c0797cf0] scsi_run_queue+0x210/0x360
[c01fffe97b10] [c079b888] scsi_run_host_queues+0x48/0x80
[c01fffe97b40] [c07b6090] ipr_ioa_bringdown_done+0x70/0x1e0
[c01fffe97bc0] [c07bc860] ipr_reset_ioa_job+0x80/0xf0
[c01fffe97bf0] [c07b4d50] ipr_reset_timer_done+0xd0/0x100
[c01fffe97c30] [c01937bc] call_timer_fn+0xdc/0x4b0
[c01fffe97cf0] [c0193d08] expire_timers+0x178/0x330
[c01fffe97d60] [c01940c8] run_timer_softirq+0xb8/0x120
[c01fffe97de0] [c0b726a8] __do_softirq+0x168/0x6d8
[c01fffe97ef0] [c00df2c8] irq_exit+0x108/0x150
[c01fffe97f10] [c0017bf4] __do_irq+0x2a4/0x4a0
[c01fffe97f90] [c002da50] call_do_irq+0x14/0x24
[c007fad93aa0] [c0017e8c] do_IRQ+0x9c/0x140
[c007fad93af0] [c0008b98] hardware_interrupt_common+0x138/0x140

Reported-by: Michael Ellerman 
Signed-off-by: Brian King 
---

Index: linux-2.6.git/drivers/scsi/ipr.c
===
--- linux-2.6.git.orig/drivers/scsi/ipr.c
+++ linux-2.6.git/drivers/scsi/ipr.c
@@ -3351,6 +3351,16 @@ static void ipr_worker_thread(struct wor
return;
}
 
+   if (ioa_cfg->scsi_unblock) {
+   ioa_cfg->scsi_unblock = 0;
+   ioa_cfg->scsi_blocked = 0;
+   spin_unlock_irqrestore(ioa_cfg->host->host_lock, lock_flags);
+   scsi_unblock_requests(ioa_cfg->host);
+   spin_lock_irqsave(ioa_cfg->host->host_lock, lock_flags);
+   if (ioa_cfg->scsi_blocked)
+   scsi_block_requests(ioa_cfg->host);
+   }
+
if (!ioa_cfg->scan_enabled) {
spin_unlock_irqrestore(ioa_cfg->host->host_lock, lock_flags);
return;
@@ -7211,9 +7221,8 @@ static int ipr_ioa_bringdown_done(struct
ENTER;
if (!ioa_cfg->hrrq[IPR_INIT_HRRQ].removing_ioa) {
ipr_trace;
-   spin_unlock_irq(ioa_cfg->host->host_lock);
-   scsi_unblock_requests(ioa_cfg->host);
-   spin_lock_irq(ioa_cfg->host->host_lock);
+   ioa_cfg->scsi_unblock = 1;
schedule_work(&ioa_cfg->work_q);
}
 
ioa_cfg->in_reset_reload = 0;
@@ -7287,13 +7296,7 @@ static int ipr_ioa_reset_done(struct ipr
list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
wake_up_all(&ioa_cfg->reset_wait_q);
 
-   spin_unlock(ioa_cfg->host->host_lock);
-   scsi_unblock_requests(ioa_cfg->host);
-   spin_lock(ioa_cfg->host->host_lock);
-
-   if (!ioa_cfg->hrrq[IPR_INIT_HRRQ].allow_cmds)
-   scsi_block_requests(ioa_cfg->host);
-
+   ioa_cfg->scsi_unblock = 1;
schedule_work(&ioa_cfg->work_q);
LEAVE;
return IPR_RC_JOB_RETURN;
@@ -9249,8 +9252,11 @@ static void _ipr_initiate_ioa_reset(stru
spin_unlock(&ioa_cfg->hrrq[i]._lock);
}
wmb();
-   if (!ioa_cfg->hrrq[IPR_INIT_HRRQ].removing_ioa)
+   if (!ioa_cfg->hrrq[IPR_INIT_HRRQ].removing_ioa) {
+   ioa_cfg->scsi_unblock = 0;
+   ioa_cfg->scsi_blocked = 1;
scsi_block_requests(ioa_cfg->host);
+   }
 
ipr_cmd = ipr_get_free_ipr_cmnd(ioa_cfg);
ioa_cfg->reset_cmd = ipr_cmd;
@@ -9306,9 +9312,8 @@ static void ipr_initiate_ioa_reset(struc
wake_up_all(&ioa_cfg->reset_wait_q);
 
if (!ioa_cfg->hrrq[IPR_INIT_HRRQ].removing_ioa) {
-   spin_unlock_irq(ioa_cfg->host->host_lock);
-   scsi_unblock_requests(ioa_cfg->host);
-   spin_lock_irq(ioa_cfg->host->host_lock);
+   ioa_cfg->scsi_unblock = 1;
+   

RE: [PATCH v2 11/15] megaraid_sas: Set device queue_depth same as HBA can_queue value in scsi-mq mode

2017-08-01 Thread Shivasharan Srikanteshwara
> -Original Message-
> From: Shivasharan Srikanteshwara
> [mailto:shivasharan.srikanteshw...@broadcom.com]
> Sent: Monday, July 24, 2017 4:59 PM
> To: 'Christoph Hellwig'
> Cc: 'linux-scsi@vger.kernel.org'; 'martin.peter...@oracle.com';
> 'the...@redhat.com'; 'j...@linux.vnet.ibm.com'; Sumit Saxena;
> 'h...@suse.com'; Kashyap Desai
> Subject: RE: [PATCH v2 11/15] megaraid_sas: Set device queue_depth same
as
> HBA can_queue value in scsi-mq mode
>
> > -Original Message-
> > From: Christoph Hellwig [mailto:h...@lst.de]
> > Sent: Thursday, July 20, 2017 1:18 PM
> > To: Shivasharan Srikanteshwara
> > Cc: Christoph Hellwig; linux-scsi@vger.kernel.org;
> > martin.peter...@oracle.com; the...@redhat.com;
> > j...@linux.vnet.ibm.com; Sumit Saxena; h...@suse.com; Kashyap Desai
> > Subject: Re: [PATCH v2 11/15] megaraid_sas: Set device queue_depth
> > same as HBA can_queue value in scsi-mq mode
> >
> > I still don't understand why you don't want to do the same for the
non-mq
> path.
>
> Hi Christoph,
>
> Sorry for delay in response.
>
> MQ case -
> If any block layer requeue happens, we see a performance drop. So we
> avoid requeues by increasing Device QD = HBA QD. The performance drop due
> to block layer requeues is larger for HDDs (sequential IO converted into
> random IO).
>
> Non-MQ case -
> If we increase Device QD = HBA QD for the non-mq case, we see a performance
> drop for certain profiles.
> For example SATA SSD: the previous driver in non-mq set Device QD=32. In
> this case, if we have more outstanding IO per device (more than 32), the
> block layer attempts soft merges and the end user eventually experiences
> higher performance because of those merges.  The same is not true in the
> MQ case, as the IO scheduler in MQ adds overhead whenever there is any
> throttling or staging due to the device QD.
>
> Below is an example of a single SATA SSD, Sequential Read, BS=4K,
> IO depth = 256:
>
> MQ enabled, Device QD = 32 achieves 137K IOPS
> MQ enabled, Device QD = 916 (HBA QD) achieves 145K IOPS
>
> MQ disabled, Device QD = 32 achieves 237K IOPS
> MQ disabled, Device QD = 916 (HBA QD) achieves 145K IOPS
>
> Ideally we want to keep the same QD settings in non-MQ mode, but we are
> trying to avoid that as we may face some regression reports from end users,
> as explained.
>
> Thanks,
> Shivasharan

Hi Christoph,
Can you please let us know your thoughts on this?
We understand that the settings should ideally be uniform across the non-mq
and mq cases.
But based on the test results above, for the non-mq case we are seeing some
difference in performance for certain IO profiles, compared to earlier
releases, after increasing the queue depth. That is not the case when mq is
enabled.

Based on these results, we would like to keep this patch as is for this
phase.
We will run further tests on this and will update for the next phase.
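For reference, the change under discussion boils down to something like the
following in the driver's slave_configure path (simplified sketch, not the
actual patch; MEGASAS_LEGACY_DEVICE_QD is a made-up name for the existing
per-device default):

	static int megasas_slave_configure(struct scsi_device *sdev)
	{
		if (shost_use_blk_mq(sdev->host))
			/* scsi-mq: let the device queue depth follow the HBA can_queue */
			scsi_change_queue_depth(sdev, sdev->host->can_queue);
		else
			/* legacy path keeps the smaller per-device default */
			scsi_change_queue_depth(sdev, MEGASAS_LEGACY_DEVICE_QD);

		return 0;
	}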

Thanks,
Shivasharan


[PATCH] arcmsr: add const to bin_attribute structures

2017-08-01 Thread Bhumika Goyal
Add const to the bin_attribute structures as they are only passed to the
functions sysfs_{create/remove}_bin_file. The corresponding arguments are
of const type, so the structures can be declared const.

Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 drivers/scsi/arcmsr/arcmsr_attr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/arcmsr/arcmsr_attr.c 
b/drivers/scsi/arcmsr/arcmsr_attr.c
index 9c86481..259d9c2 100644
--- a/drivers/scsi/arcmsr/arcmsr_attr.c
+++ b/drivers/scsi/arcmsr/arcmsr_attr.c
@@ -190,7 +190,7 @@ static ssize_t arcmsr_sysfs_iop_message_clear(struct file 
*filp,
return 1;
 }
 
-static struct bin_attribute arcmsr_sysfs_message_read_attr = {
+static const struct bin_attribute arcmsr_sysfs_message_read_attr = {
.attr = {
.name = "mu_read",
.mode = S_IRUSR ,
@@ -199,7 +199,7 @@ static ssize_t arcmsr_sysfs_iop_message_clear(struct file 
*filp,
.read = arcmsr_sysfs_iop_message_read,
 };
 
-static struct bin_attribute arcmsr_sysfs_message_write_attr = {
+static const struct bin_attribute arcmsr_sysfs_message_write_attr = {
.attr = {
.name = "mu_write",
.mode = S_IWUSR,
@@ -208,7 +208,7 @@ static ssize_t arcmsr_sysfs_iop_message_clear(struct file 
*filp,
.write = arcmsr_sysfs_iop_message_write,
 };
 
-static struct bin_attribute arcmsr_sysfs_message_clear_attr = {
+static const struct bin_attribute arcmsr_sysfs_message_clear_attr = {
.attr = {
.name = "mu_clear",
.mode = S_IWUSR,
-- 
1.9.1



Re: [PATCH 2/2] sd_zbc: Disable zone locking with scsi-mq enabled

2017-08-01 Thread Bart Van Assche
On Tue, 2017-08-01 at 18:39 +0900, Damien Le Moal wrote:
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 96855df9f49d..78fb51a37e86 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -528,15 +528,20 @@ static int sd_zbc_check_zone_size(struct scsi_disk 
> *sdkp)
>  
>  static int sd_zbc_setup(struct scsi_disk *sdkp)
>  {
> + struct request_queue *q = sdkp->disk->queue;
>  
>   /* chunk_sectors indicates the zone size */
> - blk_queue_chunk_sectors(sdkp->disk->queue,
> + blk_queue_chunk_sectors(q,
>   logical_to_sectors(sdkp->device, sdkp->zone_blocks));
>   sdkp->zone_shift = ilog2(sdkp->zone_blocks);
>   sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
>   if (sdkp->capacity & (sdkp->zone_blocks - 1))
>   sdkp->nr_zones++;
>  
> + /* Do not use zone locking in mq case */
> + if (q->mq_ops)
> + return 0;
> +
>   if (!sdkp->zones_wlock) {
>   sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
>   sizeof(unsigned long),

Hello Damien,

Are you aware that the blk-sq / scsi-sq code will be removed once all block
drivers have been converted? Are you aware that we don't want any differences
in behavior like the above between the single queue and multiqueue paths? Can
you check whether the patch series Ming Lei posted earlier this week solves
the frequent requeueing? See also "[PATCH 00/14] blk-mq-sched: fix SCSI-MQ
performance regression" (http://marc.info/?l=linux-block&m=150151989915776).

Thanks,

Bart.

Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-08-01 Thread Ming Lei
On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> On Mon, Jul 31, 2017 at 11:34:35PM +, Bart Van Assche wrote:
> > On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> > > SCSI devices use host-wide tagset, and the shared
> > > driver tag space is often quite big. Meantime
> > > there is also queue depth for each lun(.cmd_per_lun),
> > > which is often small.
> > > 
> > > So lots of requests may stay in sw queue, and we
> > > always flush all belonging to same hw queue and
> > > dispatch them all to driver, unfortunately it is
> > > easy to cause queue busy because of the small
> > > per-lun queue depth. Once these requests are flushed
> > > out, they have to stay in hctx->dispatch, and no bio
> > > merge can participate into these requests, and
> > > sequential IO performance is hurt.
> > > 
> > > This patch improves dispatching from sw queue when
> > > there is per-request-queue queue depth by taking
> > > request one by one from sw queue, just like the way
> > > of IO scheduler.
> > > 
> > > Signed-off-by: Ming Lei 
> > > ---
> > >  block/blk-mq-sched.c | 25 +++--
> > >  1 file changed, 15 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 47a25333a136..3510c01cb17b 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -96,6 +96,9 @@ void blk_mq_sched_dispatch_requests(struct 
> > > blk_mq_hw_ctx *hctx)
> > >   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > >   bool can_go = true;
> > >   LIST_HEAD(rq_list);
> > > + struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
> > > + has_sched_dispatch ? e->type->ops.mq.dispatch_request :
> > > + blk_mq_dispatch_rq_from_ctxs;
> > >  
> > >   /* RCU or SRCU read lock is needed before checking quiesced flag */
> > >   if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
> > > @@ -126,26 +129,28 @@ void blk_mq_sched_dispatch_requests(struct 
> > > blk_mq_hw_ctx *hctx)
> > >   if (!list_empty(&rq_list)) {
> > >   blk_mq_sched_mark_restart_hctx(hctx);
> > >   can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> > > - } else if (!has_sched_dispatch) {
> > > + } else if (!has_sched_dispatch && !q->queue_depth) {
> > >   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >   blk_mq_dispatch_rq_list(q, &rq_list);
> > > + can_go = false;
> > >   }
> > >  
> > > + if (!can_go)
> > > + return;
> > > +
> > >   /*
> > >* We want to dispatch from the scheduler if we had no work left
> > >* on the dispatch list, OR if we did have work but weren't able
> > >* to make progress.
> > >*/
> > > - if (can_go && has_sched_dispatch) {
> > > - do {
> > > - struct request *rq;
> > > + do {
> > > + struct request *rq;
> > >  
> > > - rq = e->type->ops.mq.dispatch_request(hctx);
> > > - if (!rq)
> > > - break;
> > > - list_add(&rq->queuelist, &rq_list);
> > > - } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > > - }
> > > + rq = dispatch_fn(hctx);
> > > + if (!rq)
> > > + break;
> > > + list_add(&rq->queuelist, &rq_list);
> > > + } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > >  }
> > 
> > Hello Ming,
> > 
> > Although I like the idea behind this patch, I'm afraid that this patch will
> > cause a performance regression for high-performance SCSI LLD drivers, e.g.
> > ib_srp. Have you considered to rework this patch as follows:
> > * Remove the code under "else if (!has_sched_dispatch && !q->queue_depth) 
> > {".
> 
> This will affect devices such as NVMe in which busy isn't triggered
> basically, so better to not do this.
> 
> > * Modify all blk_mq_dispatch_rq_list() functions such that these dispatch up
> >   to cmd_per_lun - (number of requests in progress) at once.
> 
> How can we get the accurate 'number of requests in progress' efficiently?
> 
> And we have done it in this way for blk-mq scheduler already, so it
> shouldn't be a problem.
> 
> From my test data of mq-deadline on lpfc, the performance is good,
> please see it in cover letter.

Forget to mention, ctx->list is one per-cpu list and the lock is percpu
lock, so changing to this way shouldn't be a performance issue.

-- 
Ming


Re: [PATCH 05/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-01 Thread Ming Lei
On Mon, Jul 31, 2017 at 11:42:21PM +, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> > During dispatch, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately duirng this period, run queue from other contexts
> > may think the queue is idle and start to dequeue from sw/scheduler
> > queue and try to dispatch because ->dispatch is empty.
> > 
> > This way will hurt sequential I/O performance because requests are
> > dequeued when queue is busy.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq-sched.c   | 24 ++--
> >  include/linux/blk-mq.h |  1 +
> >  2 files changed, 19 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 3510c01cb17b..eb638063673f 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -112,8 +112,15 @@ void blk_mq_sched_dispatch_requests(struct 
> > blk_mq_hw_ctx *hctx)
> >  */
> > if (!list_empty_careful(&hctx->dispatch)) {
> > spin_lock(&hctx->lock);
> > -   if (!list_empty(&hctx->dispatch))
> > +   if (!list_empty(&hctx->dispatch)) {
> > list_splice_init(&hctx->dispatch, &rq_list);
> > +
> > +   /*
> > +* BUSY won't be cleared until all requests
> > +* in hctx->dispatch are dispatched successfully
> > +*/
> > +   set_bit(BLK_MQ_S_BUSY, &hctx->state);
> > +   }
> > spin_unlock(&hctx->lock);
> > }
> >  
> > @@ -129,15 +136,20 @@ void blk_mq_sched_dispatch_requests(struct 
> > blk_mq_hw_ctx *hctx)
> > if (!list_empty(&rq_list)) {
> > blk_mq_sched_mark_restart_hctx(hctx);
> > can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> > -   } else if (!has_sched_dispatch && !q->queue_depth) {
> > -   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > -   blk_mq_dispatch_rq_list(q, &rq_list);
> > -   can_go = false;
> > +   if (can_go)
> > +   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
> > }
> >  
> > -   if (!can_go)
> > +   /* can't go until ->dispatch is flushed */
> > +   if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
> > return;
> >  
> > +   if (!has_sched_dispatch && !q->queue_depth) {
> > +   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > +   blk_mq_dispatch_rq_list(q, &rq_list);
> > +   return;
> > +   }
> 
> Hello Ming,
> 
> Since setting, clearing and testing of BLK_MQ_S_BUSY can happen concurrently
> and since clearing and testing happens without any locks held I'm afraid this

Yes, I really want to avoid locking.

> patch introduces the following race conditions:
> * Clearing of BLK_MQ_S_BUSY immediately after this bit has been set, resulting
>   in this bit not being set although there are requests on the dispatch list.

The window is small enough.

And in the context of setting the BUSY bit, dispatch still can't move on
because 'can_go' will stop that.

Even if it happens, it's no big deal; it just means one request is dequeued
a bit early. What we really need to avoid is an I/O hang.


> * Checking BLK_MQ_S_BUSY after requests have been added to the dispatch list
>   but before that bit is set, resulting in test_bit(BLK_MQ_S_BUSY, 
> &hctx->state)
>   reporting that the BLK_MQ_S_BUSY has not been set although there are 
> requests
>   on the dispatch list.

Same as above, no big deal, we can survive that.


> * Checking BLK_MQ_S_BUSY after requests have been removed from the dispatch 
> list
>   but before that bit is cleared, resulting in test_bit(BLK_MQ_S_BUSY, 
> &hctx->state)
>   reporting that the BLK_MQ_S_BUSY
> has been set although there are no requests
>   on the dispatch list.

That won't be a problem, because dispatch will be started in the
context in which dispatch list is flushed, since the BUSY bit
is cleared after blk_mq_dispatch_rq_list() returns. So no I/O
hang.


-- 
Ming


[PATCH] st: fix blk_get_queue usage

2017-08-01 Thread Hannes Reinecke
From: Bodo Stroesser 

If blk_get_queue() in st_probe fails, disk->queue must not
be set to SDp->request_queue, as that would result in
put_disk() dropping a reference that was never taken.

Thus, disk->queue should be set only after a successful
blk_get_queue().

Signed-off-by: Bodo Stroesser 
Acked-by: Shirish Pargaonkar 
Signed-off-by: Hannes Reinecke 
---
 drivers/scsi/st.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 1ea34d6..f1bcaf6 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4300,11 +4300,11 @@ static int st_probe(struct device *dev)
kref_init(&tpnt->kref);
tpnt->disk = disk;
disk->private_data = &tpnt->driver;
-   disk->queue = SDp->request_queue;
/* SCSI tape doesn't register this gendisk via add_disk().  Manually
 * take queue reference that release_disk() expects. */
-   if (!blk_get_queue(disk->queue))
+   if (!blk_get_queue(SDp->request_queue))
goto out_put_disk;
+   disk->queue = SDp->request_queue;
tpnt->driver = &st_template;
 
tpnt->device = SDp;
-- 
1.8.5.6



Re: [PATCH] st: fix blk_get_queue usage

2017-08-01 Thread Ewan D. Milne
On Tue, 2017-08-01 at 14:42 +0200, Hannes Reinecke wrote:
> From: Bodo Stroesser 
> 
> If blk_get_queue() in st_probe fails, disk->queue must not
> be set to SDp->request_queue, as that would result in
> put_disk() dropping a reference that was never taken.
> 
> Thus, disk->queue should be set only after a successful
> blk_get_queue().
> 
> Signed-off-by: Bodo Stroesser 
> Acked-by: Shirish Pargaonkar 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/scsi/st.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index 1ea34d6..f1bcaf6 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -4300,11 +4300,11 @@ static int st_probe(struct device *dev)
>   kref_init(&tpnt->kref);
>   tpnt->disk = disk;
>   disk->private_data = &tpnt->driver;
> - disk->queue = SDp->request_queue;
>   /* SCSI tape doesn't register this gendisk via add_disk().  Manually
>* take queue reference that release_disk() expects. */
> - if (!blk_get_queue(disk->queue))
> + if (!blk_get_queue(SDp->request_queue))
>   goto out_put_disk;
> + disk->queue = SDp->request_queue;
>   tpnt->driver = &st_template;
>  
>   tpnt->device = SDp;

Fixes: 2b5bebccd282 ("st: Take additional queue ref in st_probe")
Reviewed-by: Ewan D. Milne 




Re: [RFC 2/9] zfcp: decouple TMF response handler from scsi_cmnd

2017-08-01 Thread Steffen Maier

Just for the records, in case anyone wants to resurrect this later on:
This patch is buggy.

On 07/25/2017 04:14 PM, Steffen Maier wrote:

Do not get scsi_device via req->data any more, but pass an optional(!)
scsi_device to zfcp_fsf_fcp_handler_common(). The latter must now guard
any access to scsi_device as it can be NULL.

Since we always have at least a zfcp port as scope, pass this as mandatory
argument to zfcp_fsf_fcp_handler_common() because we cannot get it through
scsi_device => zfcp_scsi_dev => port any more.

Hence, the callers of zfcp_fsf_fcp_handler_common() must resolve req->data.

TMF handling now has different context data in fsf_req->data
depending on the TMF scope in fcp_cmnd->fc_tm_flags:
* scsi_device if FCP_TMF_LUN_RESET,
* zfcp_port if FCP_TMF_TGT_RESET.

Signed-off-by: Steffen Maier 
---
  drivers/s390/scsi/zfcp_fsf.c | 72 
  1 file changed, 46 insertions(+), 26 deletions(-)

diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index 69d1dc3ec79d..8b2b2ea552d6 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c



@@ -2296,10 +2304,21 @@ int zfcp_fsf_fcp_cmnd(struct scsi_cmnd *scsi_cmnd)

  static void zfcp_fsf_fcp_task_mgmt_handler(struct zfcp_fsf_req *req)
  {
+   struct fcp_cmnd *fcp_cmnd;
+   struct zfcp_port *port;
+   struct scsi_device *sdev;
struct fcp_resp_with_ext *fcp_rsp;
struct fcp_resp_rsp_info *rsp_info;

-   zfcp_fsf_fcp_handler_common(req);
+   fcp_cmnd = &req->qtcb->bottom.io.fcp_cmnd.iu;
+   if (fcp_cmnd->fc_tm_flags & FCP_TMF_LUN_RESET) {
+   sdev = req->data;
+   port = sdev_to_zfcp(sdev)->port;


The bug described below causes, in the case of a LUN reset, a wrong type
interpretation: we interpret req->data as a scsi_device although the
request function had assigned a zfcp_port to it. Dereferencing the port
field leads to a kernel page fault in (Soft)IRQ context, ending up in a
panic.



+   } else {
+   sdev = NULL;
+   port = req->data;
+   }
+   zfcp_fsf_fcp_handler_common(req, port, sdev);

fcp_rsp = &req->qtcb->bottom.io.fcp_rsp.iu;
rsp_info = (struct fcp_resp_rsp_info *) &fcp_rsp[1];
@@ -2340,7 +2359,9 @@ struct zfcp_fsf_req *zfcp_fsf_fcp_task_mgmt(struct 
scsi_cmnd *scmnd,
goto out;
}

-   req->data = scmnd;
+   fcp_cmnd = &req->qtcb->bottom.io.fcp_cmnd.iu;


While I moved the pointer assignment here,
the memory it points to is only filled in below with
zfcp_fc_scsi_to_fcp(fcp_cmnd, scmnd, tm_flags).
The freshly allocated QTCB is pre-initialized with zeroes.
Hence, the subsequent boolean expression always evaluates to false
since no flag is set yet.
Thus, a LUN reset erroneously gets:
req->data = (void *)sdev_to_zfcp(scmnd->device)->port.

A fix would be to base the boolean expression on the function argument
tm_flags rather than the QTCB content:
(tm_flags & FCP_TMF_LUN_RESET).
To avoid confusing people, I would also undo the move of the fcp_cmnd
pointer assignment.
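I.e. the assignment would become something like (sketch only):

	req->data = (tm_flags & FCP_TMF_LUN_RESET) ?
			(void *)scmnd->device :
			(void *)sdev_to_zfcp(scmnd->device)->port;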


I won't send a new version with this fix,
because it turned out the FCP channel always requires a valid LUN handle
(even for a target reset), so I'm settling on scsi_device as the common
context for any TMF, similar to what Hannes did.

Once I've successfully completed function test of v2 of my patch set,
I'm going to re-submit the full refactored set.


+   req->data = (fcp_cmnd->fc_tm_flags & FCP_TMF_LUN_RESET) ?
+   scmnd->device : (void *)sdev_to_zfcp(scmnd->device)->port;
req->handler = zfcp_fsf_fcp_task_mgmt_handler;
req->qtcb->header.lun_handle = zfcp_sdev->lun_handle;
req->qtcb->header.port_handle = zfcp_sdev->port->handle;
@@ -2350,7 +2371,6 @@ struct zfcp_fsf_req *zfcp_fsf_fcp_task_mgmt(struct 
scsi_cmnd *scmnd,

zfcp_qdio_set_sbale_last(qdio, >qdio_req);

-   fcp_cmnd = &req->qtcb->bottom.io.fcp_cmnd.iu;
zfcp_fc_scsi_to_fcp(fcp_cmnd, scmnd, tm_flags);

zfcp_fsf_start_timer(req, ZFCP_SCSI_ER_TIMEOUT);



--
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



Re: [PATCH 05/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-01 Thread Bart Van Assche
On Tue, 2017-08-01 at 18:44 +0800, Ming Lei wrote:
> On Mon, Jul 31, 2017 at 11:42:21PM +, Bart Van Assche wrote:
> > Since setting, clearing and testing of BLK_MQ_S_BUSY can happen concurrently
> > and since clearing and testing happens without any locks held I'm afraid 
> > this
> > patch introduces the following race conditions:
> > [ ... ]
> > * Checking BLK_MQ_S_BUSY after requests have been removed from the dispatch 
> > list
> >   but before that bit is cleared, resulting in test_bit(BLK_MQ_S_BUSY, 
> > &hctx->state)
> >   reporting that the BLK_MQ_S_BUSY
> > has been set although there are no requests
> >   on the dispatch list.
> 
> That won't be a problem, because dispatch will be started in the
> context in which dispatch list is flushed, since the BUSY bit
> is cleared after blk_mq_dispatch_rq_list() returns. So no I/O
> hang.

Hello Ming,

Please consider changing the name of the BLK_MQ_S_BUSY constant. That bit
is used to serialize dispatching requests from the hctx dispatch list but
that's not clear from the name of that constant.

Thanks,

Bart.

[PATCH] ses: Fix wrong page error

2017-08-01 Thread Brian King
If a SES device returns an error on a requested diagnostic page,
we are currently printing an error indicating the wrong page
was received. Fix this up to simply return a failure and only
check the returned page when the diagnostic page buffer was
populated by the device.

Signed-off-by: Brian King 
---
 drivers/scsi/ses.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index f1cdf32..8927f9f5 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -99,7 +99,7 @@ static int ses_recv_diag(struct scsi_device *sdev, int 
page_code,
 
ret =  scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen,
NULL, SES_TIMEOUT, SES_RETRIES, NULL);
-   if (unlikely(!ret))
+   if (unlikely(ret))
return ret;
 
recv_page_code = ((unsigned char *)buf)[0];
-- 
1.8.3.1