[PATCH] aacraid: fix potential double-fetch issue

2017-09-19 Thread Meng Xu
While examining the kernel source code, I found a dangerous operation
that could turn into a double-fetch situation (a race condition bug),
where the same userspace memory region is fetched twice into the kernel,
with sanity checks after the first fetch but no checks after the
second fetch.

1. The first userspace fetch happens in
copy_from_user(&fibsize, &user_srb->count, sizeof(u32))

2. Subsequently fibsize undergoes a few sanity checks and is then
used to allocate user_srbcmd = kmalloc(fibsize, GFP_KERNEL).

3. The second userspace fetch happens in
copy_from_user(user_srbcmd, user_srb, fibsize)

4. Given that the memory region pointed to by user_srb can be fully
controlled from userspace, an attacker can race to overwrite
user_srb->count with an arbitrary value after the first
fetch but before the second. The changed value will then be copied to
user_srbcmd.

The patch explicitly overrides user_srbcmd->count after the second
userspace fetch with the value fibsize obtained from the first fetch.
This ensures that the invariant that user_srbcmd->count stores
the size of the user_srbcmd buffer still holds after the second fetch.
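For illustration, here is a minimal sketch of the double-fetch pattern and
the fix. The struct layout, MAX_FIBSIZE and the function name are made up
for the example and are not the actual aacraid definitions:

    /* illustrative only -- not the real aacraid types */
    struct user_srb {
        u32 count;              /* total size, filled in by userspace */
        u8  data[];
    };

    static int fetch_srb(struct user_srb __user *user_srb)
    {
        u32 fibsize;
        struct user_srb *user_srbcmd;

        /* 1st fetch: size field only */
        if (copy_from_user(&fibsize, &user_srb->count, sizeof(u32)))
            return -EFAULT;
        if (fibsize < sizeof(*user_srbcmd) || fibsize > MAX_FIBSIZE)
            return -EINVAL;     /* sanity checks apply to the 1st value */

        user_srbcmd = kmalloc(fibsize, GFP_KERNEL);
        if (!user_srbcmd)
            return -ENOMEM;

        /* 2nd fetch: userspace may have changed user_srb->count meanwhile */
        if (copy_from_user(user_srbcmd, user_srb, fibsize)) {
            kfree(user_srbcmd);
            return -EFAULT;
        }

        /* the fix: force count back to the size that was checked and allocated */
        user_srbcmd->count = fibsize;

        /* ... process user_srbcmd ... */
        kfree(user_srbcmd);
        return 0;
    }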

Signed-off-by: Meng Xu 
---
 drivers/scsi/aacraid/commctrl.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/aacraid/commctrl.c b/drivers/scsi/aacraid/commctrl.c
index 9ab0fa9..734bf11 100644
--- a/drivers/scsi/aacraid/commctrl.c
+++ b/drivers/scsi/aacraid/commctrl.c
@@ -540,6 +540,12 @@ static int aac_send_raw_srb(struct aac_dev* dev, void 
__user * arg)
goto cleanup;
}
 
+   /* 
+* re-establish the relation that user_srbcmd->count holds the 
+* size of user_srbcmd 
+*/
+   user_srbcmd->count = fibsize;
+
flags = user_srbcmd->flags; /* from user in cpu order */
switch (flags & (SRB_DataIn | SRB_DataOut)) {
case SRB_DataOut:
-- 
2.7.4



[PATCH] mpt3sas: downgrade full copy_from_user to access_ok check

2017-09-19 Thread Meng Xu
Since right after the user copy we are going to
memset(&karg, 0, sizeof(karg)), I guess an access_ok() check is enough?
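For context, the resulting flow would look roughly like this (a sketch of
the post-patch logic, not a verbatim excerpt from mpt3sas_ctl.c):

    struct mpt3_ioctl_iocinfo karg;

    /* only verify that arg points into a plausible user-space range;
     * the bytes themselves are never read because karg is zeroed next */
    if (!access_ok(VERIFY_READ, arg, sizeof(karg)))
        return -EFAULT;

    memset(&karg, 0, sizeof(karg));
    /* ... fill karg from ioc state and copy_to_user() it back ... */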

Signed-off-by: Meng Xu 
---
 drivers/scsi/mpt3sas/mpt3sas_ctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_ctl.c 
b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
index bdffb69..b363d2d 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_ctl.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
@@ -1065,7 +1065,7 @@ _ctl_getiocinfo(struct MPT3SAS_ADAPTER *ioc, void __user 
*arg)
 {
struct mpt3_ioctl_iocinfo karg;
 
-   if (copy_from_user(&karg, arg, sizeof(karg))) {
+   if (!access_ok(VERIFY_READ, arg, sizeof(karg))) {
pr_err("failure at %s:%d/%s()!\n",
__FILE__, __LINE__, __func__);
return -EFAULT;
-- 
2.7.4



Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 04:49:15PM +, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> > to do that already.
> 
> Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
> reexamine the software queues and the hctx dispatch list but not the requeue
> list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
> requeue list. Hence the following code in scsi_end_request():
> 
>   if (scsi_target(sdev)->single_lun || 
> !list_empty(&sdev->host->starved_list))
>   kblockd_schedule_work(&sdev->requeue_work);
>   else
>   blk_mq_run_hw_queues(q, true);

Let's focus on dm-rq.

I have explained before that dm-rq is different from SCSI here: we don't
need to requeue the request any more in dm-rq if blk_get_request() returns
NULL in multipath_clone_and_map(), since SCHED_RESTART definitely covers
that.

Not to mention that dm_mq_delay_requeue_request() will run the queue
explicitly if that has to be done. That doesn't need SCHED_RESTART to
cover it, right?

-- 
Ming


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
Hi Mike,

On Tue, Sep 19, 2017 at 07:50:06PM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at  7:25pm -0400,
> Bart Van Assche  wrote:
> 
> > On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> > > For this issue, it isn't same between SCSI and dm-rq.
> > > 
> > > We don't need to run queue in .end_io of dm, and the theory is
> > > simple, otherwise it isn't performance issue, and should be I/O hang.
> > > 
> > > 1) every dm-rq's request is 1:1 mapped to SCSI's request
> > > 
> > > 2) if there is any mapped SCSI request not finished, either
> > > in-flight or in requeue list or whatever, there will be one
> > > corresponding dm-rq's request in-flight
> > > 
> > > 3) once the mapped SCSI request is completed, dm-rq's completion
> > > path will be triggered and dm-rq's queue will be rerun because of
> > > SCHED_RESTART in dm-rq
> > > 
> > > So the hw queue of dm-rq has been run in dm-rq's completion path
> > > already, right? Why do we need to do it again in the hot path?
> > 
> > The measurement data in the description of patch 5/5 shows a significant
> > performance regression for an important workload, namely random I/O.
> > Additionally, the performance improvement for sequential I/O was achieved
> > for an unrealistically low queue depth.
> 
> So you've ignored Ming's question entirely and instead decided to focus
> on previous questions you raised to Ming that he ignored.  This is
> getting tedious.

Sorry for not making it clear, I mentioned I will post a new version
to address the random I/O regression.

> 
> Especially given that Ming said the first patch that all this fighting
> has been over isn't even required to attain the improvements.
> 
> Ming, please retest both your baseline and patched setup with a
> queue_depth of >= 32.  Also, please do 3 - 5 runs to get a avg and std
> dev across the runs.

Taking a bigger queue_depth won't help with this issue, and it can make
the situation worse, because .cmd_per_lun won't be changed, and the
queue often becomes busy when the number of in-flight requests is
bigger than .cmd_per_lun.

I will post a new version, which will use another simple way to
figure out if the underlying queue is busy, so that random I/O
performance won't be affected; but this new version needs to depend
on the following patchset:

https://marc.info/?t=15043655572=1=2

So it may take a while since that patchset is still under review.

I will post them all together in 'blk-mq-sched: improve SCSI-MQ 
performance(V5)'.

The approach taken in patch 5 depends on q->queue_depth, but some SCSI
hosts' .cmd_per_lun is different from q->queue_depth, which causes
the random I/O regression.

-- 
Ming


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Mike Snitzer
On Tue, Sep 19 2017 at  7:25pm -0400,
Bart Van Assche  wrote:

> On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> > For this issue, it isn't same between SCSI and dm-rq.
> > 
> > We don't need to run queue in .end_io of dm, and the theory is
> > simple, otherwise it isn't performance issue, and should be I/O hang.
> > 
> > 1) every dm-rq's request is 1:1 mapped to SCSI's request
> > 
> > 2) if there is any mapped SCSI request not finished, either
> > in-flight or in requeue list or whatever, there will be one
> > corresponding dm-rq's request in-flight
> > 
> > 3) once the mapped SCSI request is completed, dm-rq's completion
> > path will be triggered and dm-rq's queue will be rerun because of
> > SCHED_RESTART in dm-rq
> > 
> > So the hw queue of dm-rq has been run in dm-rq's completion path
> > already, right? Why do we need to do it again in the hot path?
> 
> The measurement data in the description of patch 5/5 shows a significant
> performance regression for an important workload, namely random I/O.
> Additionally, the performance improvement for sequential I/O was achieved
> for an unrealistically low queue depth.

So you've ignored Ming's question entirely and instead decided to focus
on previous questions you raised to Ming that he ignored.  This is
getting tedious.

Especially given that Ming said the first patch that all this fighting
has been over isn't even required to attain the improvements.

Ming, please retest both your baseline and patched setup with a
queue_depth of >= 32.  Also, please do 3 - 5 runs to get an avg and std
dev across the runs.

> Sorry but given these measurement results I don't see why I should
> spend more time on this patch series. 

Bart, I've historically genuinely always appreciated your review and
insight.  But if your future "review" on this patchset would take the
form shown in this thread then please don't spend more time on it.

Mike


[GIT PULL] SCSI fixes for 4.14-rc1

2017-09-19 Thread James Bottomley
This is a set of five small fixes: one is a NULL deref fix which is
pretty critical for the fc transport class, and one fixes a potential
security issue where sg leaks kernel information.

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes

The short changelog is:

Arnd Bergmann (1):
  scsi: acornscsi: fix build error

Christoph Hellwig (1):
  scsi: scsi_transport_fc: fix NULL pointer dereference in 
fc_bsg_job_timeout

Hannes Reinecke (2):
  scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLE
  scsi: sg: factor out sg_fill_request_table()

Lukas Czerner (1):
  scsi: sd: Remove unnecessary condition in sd_read_block_limits()

and the diffstat:

 drivers/scsi/arm/acornscsi.c |  6 ++--
 drivers/scsi/scsi_transport_fc.c |  2 +-
 drivers/scsi/sd.c|  2 --
 drivers/scsi/sg.c| 64 ++--
 4 files changed, 40 insertions(+), 34 deletions(-)

With full diff below

James

---

diff --git a/drivers/scsi/arm/acornscsi.c b/drivers/scsi/arm/acornscsi.c
index 690816f3c6af..421fe869a11e 100644
--- a/drivers/scsi/arm/acornscsi.c
+++ b/drivers/scsi/arm/acornscsi.c
@@ -2725,9 +2725,9 @@ int acornscsi_abort(struct scsi_cmnd *SCpnt)
  * Params   : SCpnt  - command causing reset
  * Returns  : one of SCSI_RESET_ macros
  */
-int acornscsi_host_reset(struct Scsi_Host *shpnt)
+int acornscsi_host_reset(struct scsi_cmnd *SCpnt)
 {
-   AS_Host *host = (AS_Host *)shpnt->hostdata;
+   AS_Host *host = (AS_Host *)SCpnt->device->host->hostdata;
struct scsi_cmnd *SCptr;
 
 host->stats.resets += 1;
@@ -2741,7 +2741,7 @@ int acornscsi_host_reset(struct Scsi_Host *shpnt)
 
printk(KERN_WARNING "acornscsi_reset: ");
print_sbic_status(asr, ssr, host->scsi.phase);
-   for (devidx = 0; devidx < 9; devidx ++) {
+   for (devidx = 0; devidx < 9; devidx++)
acornscsi_dumplog(host, devidx);
 }
 #endif
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 3c6bc0081fcb..ba9d70f8a6a1 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3571,7 +3571,7 @@ fc_vport_sched_delete(struct work_struct *work)
 static enum blk_eh_timer_return
 fc_bsg_job_timeout(struct request *req)
 {
-   struct bsg_job *job = (void *) req->special;
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct Scsi_Host *shost = fc_bsg_to_shost(job);
struct fc_rport *rport = fc_bsg_to_rport(job);
struct fc_internal *i = to_fc_internal(shost->transportt);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 11c1738c2100..fb9f8b5f4673 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2915,8 +2915,6 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
sd_config_discard(sdkp, SD_LBP_WS16);
else if (sdkp->lbpws10)
sd_config_discard(sdkp, SD_LBP_WS10);
-   else if (sdkp->lbpu && sdkp->max_unmap_blocks)
-   sd_config_discard(sdkp, SD_LBP_UNMAP);
else
sd_config_discard(sdkp, SD_LBP_DISABLE);
}
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index cf0e71db9e51..0419c2298eab 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -828,6 +828,39 @@ static int max_sectors_bytes(struct request_queue *q)
return max_sectors << 9;
 }
 
+static void
+sg_fill_request_table(Sg_fd *sfp, sg_req_info_t *rinfo)
+{
+   Sg_request *srp;
+   int val;
+   unsigned int ms;
+
+   val = 0;
+   list_for_each_entry(srp, &sfp->rq_list, entry) {
+   if (val > SG_MAX_QUEUE)
+   break;
+   rinfo[val].req_state = srp->done + 1;
+   rinfo[val].problem =
+   srp->header.masked_status &
+   srp->header.host_status &
+   srp->header.driver_status;
+   if (srp->done)
+   rinfo[val].duration =
+   srp->header.duration;
+   else {
+   ms = jiffies_to_msecs(jiffies);
+   rinfo[val].duration =
+   (ms > srp->header.duration) ?
+   (ms - srp->header.duration) : 0;
+   }
+   rinfo[val].orphan = srp->orphan;
+   rinfo[val].sg_io_owned = srp->sg_io_owned;
+   rinfo[val].pack_id = srp->header.pack_id;
+   rinfo[val].usr_ptr = srp->header.usr_ptr;
+   val++;
+   }
+}
+
 static long
 sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg)
 {
@@ -1012,38 +1045,13 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, 
unsigned long arg)
return -EFAULT;
else {

Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Bart Van Assche
On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> For this issue, it isn't same between SCSI and dm-rq.
> 
> We don't need to run queue in .end_io of dm, and the theory is
> simple, otherwise it isn't performance issue, and should be I/O hang.
> 
> 1) every dm-rq's request is 1:1 mapped to SCSI's request
> 
> 2) if there is any mapped SCSI request not finished, either
> in-flight or in requeue list or whatever, there will be one
> corresponding dm-rq's request in-flight
> 
> 3) once the mapped SCSI request is completed, dm-rq's completion
> path will be triggered and dm-rq's queue will be rerun because of
> SCHED_RESTART in dm-rq
> 
> So the hw queue of dm-rq has been run in dm-rq's completion path
> already, right? Why do we need to do it again in the hot path?

The measurement data in the description of patch 5/5 shows a significant
performance regression for an important workload, namely random I/O.
Additionally, the performance improvement for sequential I/O was achieved
for an unrealistically low queue depth. Sorry but given these measurement
results I don't see why I should spend more time on this patch series.

Bart.

Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 06:42:30PM +, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 00:55 +0800, Ming Lei wrote:
> > On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche
> >  wrote:
> > > On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > > > Run queue at end_io is definitely wrong, because blk-mq has 
> > > > SCHED_RESTART
> > > > to do that already.
> > > 
> > > Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core 
> > > to
> > > reexamine the software queues and the hctx dispatch list but not the 
> > > requeue
> > > list. If a block driver returns BLK_STS_RESOURCE then requests end up on 
> > > the
> > > requeue list. Hence the following code in scsi_end_request():
> > 
> > That doesn't need SCHED_RESTART, because it is requeue's
> > responsibility to do that,
> > see blk_mq_requeue_work(), which will run hw queue at the end of this func.
> 
> That's not what I was trying to explain. What I was trying to explain is that
> every block driver that can cause a request to end up on the requeue list is
> responsible for kicking the requeue list at a later time. Hence the
> kblockd_schedule_work(&sdev->requeue_work) call in the SCSI core and the
> blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() calls in the
> dm code. What I would like to see is measurement results for dm-mpath without
> this patch series and a call to kick the requeue list added to the dm-mpath
> end_io code.

For this issue, it isn't the same between SCSI and dm-rq.

We don't need to run the queue in .end_io of dm, and the theory is
simple: otherwise it wouldn't be a performance issue, it would be an I/O hang.

1) every dm-rq's request is 1:1 mapped to SCSI's request

2) if there is any mapped SCSI request not finished, either
in-flight or in requeue list or whatever, there will be one
corresponding dm-rq's request in-flight

3) once the mapped SCSI request is completed, dm-rq's completion
path will be triggered and dm-rq's queue will be rerun because of
SCHED_RESTART in dm-rq

So the hw queue of dm-rq has been run in dm-rq's completion path
already, right? Why do we need to do it again in the hot path?

-- 
Ming


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 02:05 PM, James Bottomley wrote:
> Actually, the whole problem sounds like a posted write.  Likely the
> write that causes the reset doesn't get flushed until the read checking
> if the reset has succeeded, which might explain the 100% initial
> failure.  Why not throw away that first value if it's a failure and
> then do your polled wait and timeout on the reset success.  We should
> anyway be waiting some time for a reset to be issued, so even on non-
> posted write systems we could see this problem intermittently.
> 
> James
> 

Thanks for this suggestion James.

I tried to remove the sleep and did a dummy read of the register using
readl() - the issue reproduced.

I did expect that, since in aac_is_ctrl_up_and_running() we indeed read
a register, and if it shows us the reset is not complete, we wait and read
it again. So we can think of this 1st read as a dummy one.

My theory here is that we're observing a failure similar to one we
already saw in some specific NVMe adapters - the readl before some delay
(in nvme it was 2s) corrupts the adapter FW procedure. It's as if the
adapter doesn't like to deal with this read while the reset procedure is
ongoing. So we wait a bit before issuing the readl and everything goes fine.
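For reference, the pattern under discussion looks roughly like this (a
sketch only; the register offsets, the ctrl_up_and_running() helper and
the retry count are placeholders, not the actual aacraid code):

    /* issue the IOP reset; writel() is a posted MMIO write */
    writel(IOP_SRC_RESET_MASK, base + IDR_OFFSET);

    /* a read from the same device flushes the posted write ... */
    (void)readl(base + IDR_OFFSET);

    /* ... and the adapter still wants some quiet time before being
     * polled, hence the additional delay before the polling loop */
    msleep(5000);
    while (!ctrl_up_and_running(dev) && --retries)
        msleep(100);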

Cheers,


Guilherme



Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Bart Van Assche
On Wed, 2017-09-20 at 00:55 +0800, Ming Lei wrote:
> On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche
>  wrote:
> > On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > > Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> > > to do that already.
> > 
> > Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
> > reexamine the software queues and the hctx dispatch list but not the requeue
> > list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
> > requeue list. Hence the following code in scsi_end_request():
> 
> That doesn't need SCHED_RESTART, because it is requeue's
> responsibility to do that,
> see blk_mq_requeue_work(), which will run hw queue at the end of this func.

That's not what I was trying to explain. What I was trying to explain is that
every block driver that can cause a request to end up on the requeue list is
responsible for kicking the requeue list at a later time. Hence the
kblockd_schedule_work(&sdev->requeue_work) call in the SCSI core and the
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() calls in the
dm code. What I would like to see is measurement results for dm-mpath without
this patch series and a call to kick the requeue list added to the dm-mpath
end_io code.

Bart.
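A minimal sketch of the requeue-kick pattern discussed in this thread
(hypothetical driver completion code, not taken from dm-rq or the SCSI
core; blk_mq_requeue_request(), blk_mq_delay_kick_requeue_list() and
blk_mq_end_request() are the existing blk-mq helpers):

    static void example_complete(struct request *rq, blk_status_t status)
    {
        if (status == BLK_STS_RESOURCE) {
            /* put the request back on the requeue list ... */
            blk_mq_requeue_request(rq, false);
            /* ... and make sure that list is processed again later;
             * SCHED_RESTART alone does not look at the requeue list */
            blk_mq_delay_kick_requeue_list(rq->q, 100 /* ms */);
            return;
        }
        blk_mq_end_request(rq, status);
    }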

Re: Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?

2017-09-19 Thread Douglas Gilbert

On 2017-09-19 10:56 AM, Benjamin Block wrote:

Hello linux-block,

I wrote some tests recently to test patches against bsg.c and bsg-lib.c,
and while writing those I noticed something strange:

When you use the write() and read() call on multiple file-descriptors
for a single bsg-device (FC or SCSI), it is possible that you get
cross-talk between those different file-descriptors.

E.g.: You have two independent processes open an FD on /dev/bsg/fc_host0
and you send commands via write() in both processes; when they both do
read() later on - to read the responses for the commands they sent before
- it is possible that process A gets the response (Sense-Data,
FC-Protocol-Data, ...) for a command that process B wrote, and vice versa.

I noticed this because in my tests I 'tag' each command I send with
write() via a value I write into the usr_ptr field of sg_io_v4. When I
later use read() to receive responses I check for this tag in a
hash-table and so can look up the original command. When I used this and
spawned multiple processes with the same target bsg-device, I got
responses for commands I don't have tags for in my hash-table, so they
were written by another process. This never happened when I only spawned
one test-process.
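A minimal userspace sketch of that tagging scheme (illustrative only; it
assumes <linux/bsg.h>, <stdint.h>, <string.h> and <unistd.h>, fd is an
already-open descriptor on the bsg device, and the tag value is arbitrary):

    struct sg_io_v4 hdr;
    unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };   /* INQUIRY */
    unsigned char resp[96], sense[32];

    memset(&hdr, 0, sizeof(hdr));
    hdr.guard = 'Q';
    hdr.protocol = BSG_PROTOCOL_SCSI;
    hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
    hdr.request_len = sizeof(cdb);
    hdr.request = (uintptr_t)cdb;
    hdr.din_xfer_len = sizeof(resp);
    hdr.din_xferp = (uintptr_t)resp;
    hdr.max_response_len = sizeof(sense);
    hdr.response = (uintptr_t)sense;
    hdr.usr_ptr = 0x1234;                  /* the tag for this command */

    write(fd, &hdr, sizeof(hdr));          /* queue the command */
    /* ... */
    read(fd, &hdr, sizeof(hdr));           /* fetch *some* completion */
    /* hdr.usr_ptr may now carry a tag written on a different fd that is
     * open on the same bsg device -- the cross-talk described above */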

This seems awfully counter-productive... so much so that I don't see any
point in allowing users to open more than one FD per bsg-device, because
that would inevitably lead to very hard-to-debug 'bugs'. And I wonder if
this is really by design or some regression that happened over time.

I looked at the code in bsg.c and yes, it seems this is what is coded
right now. We have a hash-table which manages all 'struct bsg_device's,
which are indexed by device-minors, so one hash-table entry per
device-node.

So eh, long talk short question, is that intended?


Hi,
About once a year I point out that major misfeature in the bsg driver
and no-one seems to care. Most recently I pointed it out in a
discussion about SCSI SG CVE-2017-0794 6 days ago:
  " ... Last time I looked at the bsg driver, a SCSI command
could be issued on one file descriptor and its data-in block
and SCSI status delivered to another file descriptor to the
same block device (potentially in another process). IOW chaos"

It is hard to imagine any sane application relying on this bizarre
behaviour so fixing it should not break any applications. My guess
is that it would require a non-trivial patch set to fix. Would you
like to volunteer?


In the same post I asked (thinking it was cc-ed to this list):
"BTW Is there a NVMe pass-through?"

And the answer is: yes there is in include/uapi/linux/nvme_ioctl.h
which has struct nvme_user_io and struct nvme_passthru_cmd plus
several ioctl constants. See the smartmontool repository for
examples of how to use them. I also discovered there is a SCSI
to NVMe translation layer (SNTL) hiding in some kernels. Hence
commands like this:
   sg_inq /dev/nvme0## that's a 'char' device, and ...
   sg_vpd /dev/nvme0n1  ## that's a block device "namespace 1"

work as expected. Ah, I spoke too soon ... in another post I have
been informed that the SNTL has now been removed from the kernel.
Just goes to show that human stupidity is still infinite :-)

Doug Gilbert





RE: [PATCH] sd: Limit WRITE SAME / WRITE SAME(16) w/UNMAP length for certain devices

2017-09-19 Thread Kuzeja, William
Ewan,

I like it, more generic than my patch. I never saw the other cases, so I 
limited my 
patch to WS16.

Acked-by: Bill Kuzeja 

On Tue-09-19 at 12:14 Ewan D. Milne wrote: 
> Some devices do not support a WRITE SAME / WRITE SAME(16) with the
> UNMAP
> bit set up to the length specified in the MAXIMUM WRITE SAME LENGTH
> field
> in the block limits VPD page (or, the field is zero, indicating there is
> no limit).  Limit the length by the MAXIMUM UNMAP LBA COUNT value.
> Otherwise the command might be rejected.
> 
> Signed-off-by: Ewan D. Milne 
> ---
>  drivers/scsi/sd.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 6549e5c..fa07ac2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -693,6 +693,7 @@ static void sd_config_discard(struct scsi_disk *sdkp,
> unsigned int mode)
>   struct request_queue *q = sdkp->disk->queue;
>   unsigned int logical_block_size = sdkp->device->sector_size;
>   unsigned int max_blocks = 0;
> + u32 max_ws_blocks = sdkp->max_ws_blocks;
> 
>   q->limits.discard_alignment =
>   sdkp->unmap_alignment * logical_block_size;
> @@ -701,6 +702,13 @@ static void sd_config_discard(struct scsi_disk *sdkp,
> unsigned int mode)
>   sdkp->unmap_granularity * logical_block_size);
>   sdkp->provisioning_mode = mode;
> 
> + /*
> +  * Some devices limit WRITE SAME / WRITE SAME(16) w/UNMAP
> +  * by MAXIMUM UNMAP LBA COUNT instead of MAXIMUM WRITE SAME LENGTH.
> +  * Use the smaller of the two values to avoid ILLEGAL REQUEST errors.
> +  */
> + max_ws_blocks = min_not_zero(max_ws_blocks, sdkp->max_unmap_blocks);
> +
>   switch (mode) {
> 
>   case SD_LBP_FULL:
> @@ -715,17 +723,17 @@ static void sd_config_discard(struct scsi_disk *sdkp,
> unsigned int mode)
>   break;
> 
>   case SD_LBP_WS16:
> - max_blocks = min_not_zero(sdkp->max_ws_blocks,
> + max_blocks = min_not_zero(max_ws_blocks,
> (u32)SD_MAX_WS16_BLOCKS);
>   break;
> 
>   case SD_LBP_WS10:
> - max_blocks = min_not_zero(sdkp->max_ws_blocks,
> + max_blocks = min_not_zero(max_ws_blocks,
> (u32)SD_MAX_WS10_BLOCKS);
>   break;
> 
>   case SD_LBP_ZERO:
> - max_blocks = min_not_zero(sdkp->max_ws_blocks,
> + max_blocks = min_not_zero(max_ws_blocks,
> (u32)SD_MAX_WS10_BLOCKS);
>   break;
>   }
> --
> 2.1.0



[PATCH V3 9/9] pm80xx : corrected linkrate value.

2017-09-19 Thread Viswas G
Corrected the value defined for LINKRATE_60 (6 Gig).

Signed-off-by: Raj Dinesh 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm80xx_hwi.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h
index e36c5176f9a9..889e69ce3689 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.h
+++ b/drivers/scsi/pm8001/pm80xx_hwi.h
@@ -167,7 +167,7 @@
 #define LINKMODE_AUTO  (0x03 << 12)
 #define LINKRATE_15(0x01 << 8)
 #define LINKRATE_30(0x02 << 8)
-#define LINKRATE_60(0x06 << 8)
+#define LINKRATE_60(0x04 << 8)
 #define LINKRATE_120   (0x08 << 8)
 
 /* phy_profile */
-- 
2.12.3



[PATCH V3 8/9] pm80xx : panic on ncq error cleaning up the read log.

2017-09-19 Thread Viswas G
When there's an error in NCQ mode, the host has to read the NCQ
error log (10h) to clear the error state. However, the ccb that
is set up for doing this doesn't clear the previous state: if the
ccb was previously used for an IO, n_elem is still set and
pm8001_ccb_task_free() treats this as the signal
to free a scatter-gather list (that's already been freed).

Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm80xx_hwi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index 92d2045dea68..f2c0839afbe3 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -1489,6 +1489,7 @@ static void pm80xx_send_read_log(struct pm8001_hba_info 
*pm8001_ha,
ccb->device = pm8001_ha_dev;
ccb->ccb_tag = ccb_tag;
ccb->task = task;
+   ccb->n_elem = 0;
pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
 
-- 
2.12.3



[PATCH V3 6/9] pm80xx : modified port reset timer value for PM8006 card

2017-09-19 Thread Viswas G
Added a port reset timer value of 2000ms for the PM8006 SATA controller.

Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm80xx_hwi.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index baab8a19c78e..8f1f5dc77d71 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -597,6 +597,12 @@ static void update_main_config_table(struct 
pm8001_hba_info *pm8001_ha)
pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer &= 0x;
pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer |=
PORT_RECOVERY_TIMEOUT;
+   if (pm8001_ha->chip_id == chip_8006) {
+   pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer &=
+   0x;
+   pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer |=
+   0x14;
+   }
pm8001_mw32(address, MAIN_PORT_RECOVERY_TIMER,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer);
 }
-- 
2.12.3



[PATCH V3 7/9] pm80xx : corrected SATA abort handling sequence.

2017-09-19 Thread Viswas G
Modified SATA abort handling with the following steps:
1) Set device state as recovery.
2) Send phy reset.
3) Wait for reset completion.
4) After successful reset, abort all IO's to the device.
5) After aborting all IO's to device, set device state as operational.

Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm8001_hwi.c |  8 -
 drivers/scsi/pm8001/pm8001_sas.c | 71 ++--
 drivers/scsi/pm8001/pm8001_sas.h |  8 +
 drivers/scsi/pm8001/pm80xx_hwi.c | 36 
 4 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index bc4a6f649ec9..db88a8e7ee0e 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -3209,10 +3209,16 @@ int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info 
*pm8001_ha, void *piomb)
PM8001_MSG_DBG(pm8001_ha,
pm8001_printk("%x phy execute %x phy op failed!\n",
phy_id, phy_op));
-   } else
+   } else {
PM8001_MSG_DBG(pm8001_ha,
pm8001_printk("%x phy execute %x phy op success!\n",
phy_id, phy_op));
+   pm8001_ha->phy[phy_id].reset_success = true;
+   }
+   if (pm8001_ha->phy[phy_id].enable_completion) {
+   complete(pm8001_ha->phy[phy_id].enable_completion);
+   pm8001_ha->phy[phy_id].enable_completion = NULL;
+   }
pm8001_tag_free(pm8001_ha, tag);
return 0;
 }
diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index e80b0542a67f..60d5bec5c45e 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -1167,13 +1167,16 @@ int pm8001_abort_task(struct sas_task *task)
struct scsi_lun lun;
struct pm8001_device *pm8001_dev;
struct pm8001_tmf_task tmf_task;
-   int rc = TMF_RESP_FUNC_FAILED;
+   int rc = TMF_RESP_FUNC_FAILED, ret;
+   u32 phy_id;
+   struct sas_task_slow slow_task;
if (unlikely(!task || !task->lldd_task || !task->dev))
return TMF_RESP_FUNC_FAILED;
dev = task->dev;
pm8001_dev = dev->lldd_dev;
pm8001_ha = pm8001_find_ha_by_dev(dev);
device_id = pm8001_dev->device_id;
+   phy_id = pm8001_dev->attached_phy;
rc = pm8001_find_tag(task, &tag);
if (rc == 0) {
pm8001_printk("no tag for task:%p\n", task);
@@ -1184,6 +1187,11 @@ int pm8001_abort_task(struct sas_task *task)
spin_unlock_irqrestore(&task->task_state_lock, flags);
return TMF_RESP_FUNC_COMPLETE;
}
+   task->task_state_flags |= SAS_TASK_STATE_ABORTED;
+   if (task->slow_task == NULL) {
+   init_completion(&slow_task.completion);
+   task->slow_task = &slow_task;
+   }
spin_unlock_irqrestore(&task->task_state_lock, flags);
if (task->task_proto & SAS_PROTOCOL_SSP) {
struct scsi_cmnd *cmnd = task->uldd_task;
@@ -1195,8 +1203,61 @@ int pm8001_abort_task(struct sas_task *task)
pm8001_dev->sas_device, 0, tag);
} else if (task->task_proto & SAS_PROTOCOL_SATA ||
task->task_proto & SAS_PROTOCOL_STP) {
-   rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev,
-   pm8001_dev->sas_device, 0, tag);
+   if (pm8001_ha->chip_id == chip_8006) {
+   DECLARE_COMPLETION_ONSTACK(completion_reset);
+   DECLARE_COMPLETION_ONSTACK(completion);
+   struct pm8001_phy *phy = pm8001_ha->phy + phy_id;
+   /* 1. Set Device state as Recovery*/
+   pm8001_dev->setds_completion = &completion;
+   PM8001_CHIP_DISP->set_dev_state_req(pm8001_ha,
+   pm8001_dev, 0x03);
+   wait_for_completion(&completion);
+   /* 2. Send Phy Control Hard Reset */
+   reinit_completion(&completion);
+   phy->reset_success = false;
+   phy->enable_completion = &completion;
+   phy->reset_completion = &completion_reset;
+   ret = PM8001_CHIP_DISP->phy_ctl_req(pm8001_ha, phy_id,
+   PHY_HARD_RESET);
+   if (ret)
+   goto out;
+   PM8001_MSG_DBG(pm8001_ha,
+   pm8001_printk("Waiting for local phy ctl\n"));
+   wait_for_completion(&completion);
+   if (!phy->reset_success)
+   goto out;
+   /* 3. Wait for Port Reset complete / Port reset TMO*/
+   PM8001_MSG_DBG(pm8001_ha,
+ 

[PATCH V3 3/9] pm80xx : Different SAS addresses for phys.

2017-09-19 Thread Viswas G
Different SAS addresses are assigned to each set of phys.
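For illustration, the per-phy assignment in the hunk below works out
roughly like this (the base address is made up):

    /* sas_add starts as a copy of the controller's SAS address */
    u8 sas_add[8] = { 0x50, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x00 };
    int i;

    for (i = 0; i < pm8001_ha->chip->n_phy; i++) {
        if (i && (i % 4) == 0)
            sas_add[7] += 4;
        /* phys 0-3 end in ...06:00, phys 4-7 in ...06:04,
         * phys 8-11 in ...06:08, and so on */
        memcpy(&pm8001_ha->phy[i].dev_sas_addr, sas_add, SAS_ADDR_SIZE);
    }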

Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm8001_init.c | 13 +
 drivers/scsi/pm8001/pm80xx_hwi.c  |  3 +--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_init.c 
b/drivers/scsi/pm8001/pm8001_init.c
index 0e013f76b582..7a697ca68501 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -132,7 +132,7 @@ static void pm8001_phy_init(struct pm8001_hba_info 
*pm8001_ha, int phy_id)
sas_phy->oob_mode = OOB_NOT_CONNECTED;
sas_phy->linkrate = SAS_LINK_RATE_UNKNOWN;
sas_phy->id = phy_id;
-   sas_phy->sas_addr = &pm8001_ha->sas_addr[0];
+   sas_phy->sas_addr = (u8 *)&phy->dev_sas_addr;
sas_phy->frame_rcvd = >frame_rcvd[0];
sas_phy->ha = (struct sas_ha_struct *)pm8001_ha->shost->hostdata;
sas_phy->lldd_phy = phy;
@@ -591,10 +591,12 @@ static void  pm8001_post_sas_ha_init(struct Scsi_Host 
*shost,
for (i = 0; i < chip_info->n_phy; i++) {
sha->sas_phy[i] = &pm8001_ha->phy[i].sas_phy;
sha->sas_port[i] = &pm8001_ha->port[i].sas_port;
+   sha->sas_phy[i]->sas_addr =
+   (u8 *)&pm8001_ha->phy[i].dev_sas_addr;
}
sha->sas_ha_name = DRV_NAME;
sha->dev = pm8001_ha->dev;
-
+   sha->strict_wide_ports = 1;
sha->lldd_module = THIS_MODULE;
sha->sas_addr = &pm8001_ha->sas_addr[0];
sha->num_phys = chip_info->n_phy;
@@ -611,6 +613,7 @@ static void  pm8001_post_sas_ha_init(struct Scsi_Host 
*shost,
 static void pm8001_init_sas_add(struct pm8001_hba_info *pm8001_ha)
 {
u8 i, j;
+   u8 sas_add[8];
 #ifdef PM8001_READ_VPD
/* For new SPC controllers WWN is stored in flash vpd
*  For SPC/SPCve controllers WWN is stored in EEPROM
@@ -672,10 +675,12 @@ static void pm8001_init_sas_add(struct pm8001_hba_info 
*pm8001_ha)
pm8001_ha->sas_addr[j] =
payload.func_specific[0x804 + i];
}
-
+   memcpy(sas_add, pm8001_ha->sas_addr, SAS_ADDR_SIZE);
for (i = 0; i < pm8001_ha->chip->n_phy; i++) {
+   if (i && ((i % 4) == 0))
+   sas_add[7] = sas_add[7] + 4;
memcpy(&pm8001_ha->phy[i].dev_sas_addr,
-   pm8001_ha->sas_addr, SAS_ADDR_SIZE);
+   sas_add, SAS_ADDR_SIZE);
PM8001_INIT_DBG(pm8001_ha,
pm8001_printk("phy %d sas_addr = %016llx\n", i,
pm8001_ha->phy[i].dev_sas_addr));
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index 8fb5ddf08cc4..2b26445d1b97 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -3041,7 +3041,6 @@ hw_event_phy_down(struct pm8001_hba_info *pm8001_ha, void 
*piomb)
port->port_state = portstate;
phy->identify.device_type = 0;
phy->phy_attached = 0;
-   memset(&phy->dev_sas_addr, 0, SAS_ADDR_SIZE);
switch (portstate) {
case PORT_VALID:
break;
@@ -4394,7 +4393,7 @@ pm80xx_chip_phy_start_req(struct pm8001_hba_info 
*pm8001_ha, u8 phy_id)
payload.sas_identify.dev_type = SAS_END_DEVICE;
payload.sas_identify.initiator_bits = SAS_PROTOCOL_ALL;
memcpy(payload.sas_identify.sas_addr,
-   pm8001_ha->sas_addr, SAS_ADDR_SIZE);
+ &pm8001_ha->phy[phy_id].dev_sas_addr, SAS_ADDR_SIZE);
payload.sas_identify.phy_id = phy_id;
ret = pm8001_mpi_build_cmd(pm8001_ha, circularQ, opcode, &payload, 0);
return ret;
-- 
2.12.3



[PATCH V3 0/9] pm80xx updates

2017-09-19 Thread Viswas G
This patch set include some bug fixes and enhancement for pm80xx driver.

Changes from V2:
- Corrected date.

Changes from V1:
- sas_identify_frame_local structure moved to pm80xx_hwi.h
- sata abort handling patch split to four patches.
- tag allocation for phy control request.
- cleanup in pm8001_abort_task function.
- modified port reset timer value for PM8006 card
- corrected SATA abort handling sequence.


Viswas G (9):
  pm80xx : redefine sas_identify_frame structure
  pm80xx : ILA and inactive firmware version through sysfs
  pm80xx : Different SAS addresses for phys.
  pm80xx : tag allocation for phy control request.
  pm80xx : cleanup in pm8001_abort_task function.
  pm80xx : modified port reset timer value for PM8006 card
  pm80xx : corrected SATA abort handling sequence.
  pm80xx : panic on ncq error cleaning up the read log.
  pm80xx : corrected linkrate value.

 drivers/scsi/pm8001/pm8001_ctl.c  |  54 +
 drivers/scsi/pm8001/pm8001_hwi.c  |  11 +++-
 drivers/scsi/pm8001/pm8001_init.c |  13 +++--
 drivers/scsi/pm8001/pm8001_sas.c  | 118 ++
 drivers/scsi/pm8001/pm8001_sas.h  |  10 
 drivers/scsi/pm8001/pm80xx_hwi.c  |  61 
 drivers/scsi/pm8001/pm80xx_hwi.h  | 102 +++-
 7 files changed, 313 insertions(+), 56 deletions(-)

-- 
2.12.3



[PATCH V3 4/9] pm80xx : tag allocation for phy control request.

2017-09-19 Thread Viswas G
tag is taken from the tag pool instead of using the hardcoded
tag value(1).

Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm8001_hwi.c |  3 +++
 drivers/scsi/pm8001/pm80xx_hwi.c | 10 +++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index 10546faac58c..bc4a6f649ec9 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -3198,11 +3198,13 @@ pm8001_mpi_get_nvmd_resp(struct pm8001_hba_info 
*pm8001_ha, void *piomb)
 
 int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info *pm8001_ha, void *piomb)
 {
+   u32 tag;
struct local_phy_ctl_resp *pPayload =
(struct local_phy_ctl_resp *)(piomb + 4);
u32 status = le32_to_cpu(pPayload->status);
u32 phy_id = le32_to_cpu(pPayload->phyop_phyid) & ID_BITS;
u32 phy_op = le32_to_cpu(pPayload->phyop_phyid) & OP_BITS;
+   tag = le32_to_cpu(pPayload->tag);
if (status != 0) {
PM8001_MSG_DBG(pm8001_ha,
pm8001_printk("%x phy execute %x phy op failed!\n",
@@ -3211,6 +3213,7 @@ int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info 
*pm8001_ha, void *piomb)
PM8001_MSG_DBG(pm8001_ha,
pm8001_printk("%x phy execute %x phy op success!\n",
phy_id, phy_op));
+   pm8001_tag_free(pm8001_ha, tag);
return 0;
 }
 
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index 2b26445d1b97..baab8a19c78e 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -4500,17 +4500,21 @@ static int pm80xx_chip_reg_dev_req(struct 
pm8001_hba_info *pm8001_ha,
 static int pm80xx_chip_phy_ctl_req(struct pm8001_hba_info *pm8001_ha,
u32 phyId, u32 phy_op)
 {
+   u32 tag;
+   int rc;
struct local_phy_ctl_req payload;
struct inbound_queue_table *circularQ;
int ret;
u32 opc = OPC_INB_LOCAL_PHY_CONTROL;
memset(&payload, 0, sizeof(payload));
+   rc = pm8001_tag_alloc(pm8001_ha, &tag);
+   if (rc)
+   return rc;
circularQ = &pm8001_ha->inbnd_q_tbl[0];
-   payload.tag = cpu_to_le32(1);
+   payload.tag = cpu_to_le32(tag);
payload.phyop_phyid =
cpu_to_le32(((phy_op & 0xFF) << 8) | (phyId & 0xFF));
-   ret = pm8001_mpi_build_cmd(pm8001_ha, circularQ, opc, &payload, 0);
-   return ret;
+   return pm8001_mpi_build_cmd(pm8001_ha, circularQ, opc, &payload, 0);
 }
 
 static u32 pm80xx_chip_is_our_interupt(struct pm8001_hba_info *pm8001_ha)
-- 
2.12.3



[PATCH V3 1/9] pm80xx : redefine sas_identify_frame structure

2017-09-19 Thread Viswas G
The sas_identify structure expected by the pm80xx hardware doesn't have a
CRC field, so a new local sas_identify_frame structure without the CRC was added.

v2:
- Since the structure changes is applicable for only pm80xx,
  sas_identify_frame_local structure moved to pm80xx_hwi.h.

Signed-off-by: Raj Dinesh 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm80xx_hwi.h | 98 +++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h
index 7a443bad6163..82b8cf581da9 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.h
+++ b/drivers/scsi/pm8001/pm80xx_hwi.h
@@ -229,6 +229,102 @@
 #define IT_NEXUS_TIMEOUT   0x7D0
 #define PORT_RECOVERY_TIMEOUT  ((IT_NEXUS_TIMEOUT/100) + 30)
 
+#ifdef __LITTLE_ENDIAN_BITFIELD
+struct sas_identify_frame_local {
+   /* Byte 0 */
+   u8  frame_type:4;
+   u8  dev_type:3;
+   u8  _un0:1;
+
+   /* Byte 1 */
+   u8  _un1;
+
+   /* Byte 2 */
+   union {
+   struct {
+   u8  _un20:1;
+   u8  smp_iport:1;
+   u8  stp_iport:1;
+   u8  ssp_iport:1;
+   u8  _un247:4;
+   };
+   u8 initiator_bits;
+   };
+
+   /* Byte 3 */
+   union {
+   struct {
+   u8  _un30:1;
+   u8 smp_tport:1;
+   u8 stp_tport:1;
+   u8 ssp_tport:1;
+   u8 _un347:4;
+   };
+   u8 target_bits;
+   };
+
+   /* Byte 4 - 11 */
+   u8 _un4_11[8];
+
+   /* Byte 12 - 19 */
+   u8 sas_addr[SAS_ADDR_SIZE];
+
+   /* Byte 20 */
+   u8 phy_id;
+
+   u8 _un21_27[7];
+
+} __packed;
+
+#elif defined(__BIG_ENDIAN_BITFIELD)
+struct sas_identify_frame_local {
+   /* Byte 0 */
+   u8  _un0:1;
+   u8  dev_type:3;
+   u8  frame_type:4;
+
+   /* Byte 1 */
+   u8  _un1;
+
+   /* Byte 2 */
+   union {
+   struct {
+   u8  _un247:4;
+   u8  ssp_iport:1;
+   u8  stp_iport:1;
+   u8  smp_iport:1;
+   u8  _un20:1;
+   };
+   u8 initiator_bits;
+   };
+
+   /* Byte 3 */
+   union {
+   struct {
+   u8 _un347:4;
+   u8 ssp_tport:1;
+   u8 stp_tport:1;
+   u8 smp_tport:1;
+   u8 _un30:1;
+   };
+   u8 target_bits;
+   };
+
+   /* Byte 4 - 11 */
+   u8 _un4_11[8];
+
+   /* Byte 12 - 19 */
+   u8 sas_addr[SAS_ADDR_SIZE];
+
+   /* Byte 20 */
+   u8 phy_id;
+
+   u8 _un21_27[7];
+} __packed;
+#else
+#error "Bitfield order not defined!"
+#endif
+
 struct mpi_msg_hdr {
__le32  header; /* Bits [11:0] - Message operation code */
/* Bits [15:12] - Message Category */
@@ -248,7 +344,7 @@ struct mpi_msg_hdr {
 struct phy_start_req {
__le32  tag;
__le32  ase_sh_lm_slr_phyid;
-   struct sas_identify_frame sas_identify; /* 28 Bytes */
+   struct sas_identify_frame_local sas_identify; /* 28 Bytes */
__le32 spasti;
u32 reserved[21];
 } __attribute__((packed, aligned(4)));
-- 
2.12.3



[PATCH V3 5/9] pm80xx : cleanup in pm8001_abort_task function.

2017-09-19 Thread Viswas G
Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm8001_sas.c | 49 +++-
 1 file changed, 13 insertions(+), 36 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index ce584c31d36e..e80b0542a67f 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -1159,40 +1159,35 @@ int pm8001_query_task(struct sas_task *task)
 int pm8001_abort_task(struct sas_task *task)
 {
unsigned long flags;
-   u32 tag = 0xdeadbeef;
+   u32 tag;
u32 device_id;
struct domain_device *dev ;
-   struct pm8001_hba_info *pm8001_ha = NULL;
+   struct pm8001_hba_info *pm8001_ha;
struct pm8001_ccb_info *ccb;
struct scsi_lun lun;
struct pm8001_device *pm8001_dev;
struct pm8001_tmf_task tmf_task;
int rc = TMF_RESP_FUNC_FAILED;
if (unlikely(!task || !task->lldd_task || !task->dev))
-   return rc;
+   return TMF_RESP_FUNC_FAILED;
+   dev = task->dev;
+   pm8001_dev = dev->lldd_dev;
+   pm8001_ha = pm8001_find_ha_by_dev(dev);
+   device_id = pm8001_dev->device_id;
+   rc = pm8001_find_tag(task, &tag);
+   if (rc == 0) {
+   pm8001_printk("no tag for task:%p\n", task);
+   return TMF_RESP_FUNC_FAILED;
+   }
spin_lock_irqsave(&task->task_state_lock, flags);
if (task->task_state_flags & SAS_TASK_STATE_DONE) {
spin_unlock_irqrestore(&task->task_state_lock, flags);
-   rc = TMF_RESP_FUNC_COMPLETE;
-   goto out;
+   return TMF_RESP_FUNC_COMPLETE;
}
spin_unlock_irqrestore(&task->task_state_lock, flags);
if (task->task_proto & SAS_PROTOCOL_SSP) {
struct scsi_cmnd *cmnd = task->uldd_task;
-   dev = task->dev;
-   ccb = task->lldd_task;
-   pm8001_dev = dev->lldd_dev;
-   pm8001_ha = pm8001_find_ha_by_dev(dev);
int_to_scsilun(cmnd->device->lun, &lun);
-   rc = pm8001_find_tag(task, &tag);
-   if (rc == 0) {
-   printk(KERN_INFO "No such tag in %s\n", __func__);
-   rc = TMF_RESP_FUNC_FAILED;
-   return rc;
-   }
-   device_id = pm8001_dev->device_id;
-   PM8001_EH_DBG(pm8001_ha,
-   pm8001_printk("abort io to deviceid= %d\n", device_id));
tmf_task.tmf = TMF_ABORT_TASK;
tmf_task.tag_of_task_to_be_managed = tag;
rc = pm8001_issue_ssp_tmf(dev, lun.scsi_lun, &tmf_task);
@@ -1200,28 +1195,10 @@ int pm8001_abort_task(struct sas_task *task)
pm8001_dev->sas_device, 0, tag);
} else if (task->task_proto & SAS_PROTOCOL_SATA ||
task->task_proto & SAS_PROTOCOL_STP) {
-   dev = task->dev;
-   pm8001_dev = dev->lldd_dev;
-   pm8001_ha = pm8001_find_ha_by_dev(dev);
-   rc = pm8001_find_tag(task, &tag);
-   if (rc == 0) {
-   printk(KERN_INFO "No such tag in %s\n", __func__);
-   rc = TMF_RESP_FUNC_FAILED;
-   return rc;
-   }
rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev,
pm8001_dev->sas_device, 0, tag);
} else if (task->task_proto & SAS_PROTOCOL_SMP) {
/* SMP */
-   dev = task->dev;
-   pm8001_dev = dev->lldd_dev;
-   pm8001_ha = pm8001_find_ha_by_dev(dev);
-   rc = pm8001_find_tag(task, &tag);
-   if (rc == 0) {
-   printk(KERN_INFO "No such tag in %s\n", __func__);
-   rc = TMF_RESP_FUNC_FAILED;
-   return rc;
-   }
rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev,
pm8001_dev->sas_device, 0, tag);
 
-- 
2.12.3



[PATCH V3 2/9] pm80xx : ILA and inactive firmware version through sysfs

2017-09-19 Thread Viswas G
Added support to read the ILA version and inactive firmware version
from the MPI configuration table and export them through sysfs.

Signed-off-by: Deepak Ukey 
Signed-off-by: Viswas G 

Acked-by: Jack Wang 
---
 drivers/scsi/pm8001/pm8001_ctl.c | 54 
 drivers/scsi/pm8001/pm8001_sas.h |  2 ++
 drivers/scsi/pm8001/pm80xx_hwi.c |  5 
 drivers/scsi/pm8001/pm80xx_hwi.h |  2 ++
 4 files changed, 63 insertions(+)

diff --git a/drivers/scsi/pm8001/pm8001_ctl.c b/drivers/scsi/pm8001/pm8001_ctl.c
index be8269c8d127..596f3ff965f5 100644
--- a/drivers/scsi/pm8001/pm8001_ctl.c
+++ b/drivers/scsi/pm8001/pm8001_ctl.c
@@ -98,6 +98,58 @@ static ssize_t pm8001_ctl_fw_version_show(struct device 
*cdev,
}
 }
 static DEVICE_ATTR(fw_version, S_IRUGO, pm8001_ctl_fw_version_show, NULL);
+
+/**
+ * pm8001_ctl_ila_version_show - ila version
+ * @cdev: pointer to embedded class device
+ * @buf: the buffer returned
+ *
+ * A sysfs 'read-only' shost attribute.
+ */
+static ssize_t pm8001_ctl_ila_version_show(struct device *cdev,
+   struct device_attribute *attr, char *buf)
+{
+   struct Scsi_Host *shost = class_to_shost(cdev);
+   struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+   struct pm8001_hba_info *pm8001_ha = sha->lldd_ha;
+
+   if (pm8001_ha->chip_id != chip_8001) {
+   return snprintf(buf, PAGE_SIZE, "%02x.%02x.%02x.%02x\n",
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 24),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 16),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 8),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version));
+   }
+   return 0;
+}
+static DEVICE_ATTR(ila_version, 0444, pm8001_ctl_ila_version_show, NULL);
+
+/**
+ * pm8001_ctl_inactive_fw_version_show - Inacative firmware version number
+ * @cdev: pointer to embedded class device
+ * @buf: the buffer returned
+ *
+ * A sysfs 'read-only' shost attribute.
+ */
+static ssize_t pm8001_ctl_inactive_fw_version_show(struct device *cdev,
+   struct device_attribute *attr, char *buf)
+{
+   struct Scsi_Host *shost = class_to_shost(cdev);
+   struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+   struct pm8001_hba_info *pm8001_ha = sha->lldd_ha;
+
+   if (pm8001_ha->chip_id != chip_8001) {
+   return snprintf(buf, PAGE_SIZE, "%02x.%02x.%02x.%02x\n",
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 24),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 16),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 8),
+   (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version));
+   }
+   return 0;
+}
+static
+DEVICE_ATTR(inc_fw_ver, 0444, pm8001_ctl_inactive_fw_version_show, NULL);
+
 /**
  * pm8001_ctl_max_out_io_show - max outstanding io supported
  * @cdev: pointer to embedded class device
@@ -748,6 +800,8 @@ struct device_attribute *pm8001_host_attrs[] = {
&dev_attr_bios_version,
&dev_attr_ib_log,
&dev_attr_ob_log,
+   &dev_attr_ila_version,
+   &dev_attr_inc_fw_ver,
NULL,
 };
 
diff --git a/drivers/scsi/pm8001/pm8001_sas.h b/drivers/scsi/pm8001/pm8001_sas.h
index e81a8fa7ef1a..c75de413e062 100644
--- a/drivers/scsi/pm8001/pm8001_sas.h
+++ b/drivers/scsi/pm8001/pm8001_sas.h
@@ -404,6 +404,8 @@ union main_cfg_table {
u32 port_recovery_timer;
u32 interrupt_reassertion_delay;
u32 fatal_n_non_fatal_dump; /* 0x28 */
+   u32 ila_version;
+   u32 inc_fw_version;
} pm80xx_tbl;
 };
 
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index eb4fee61df72..8fb5ddf08cc4 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -312,6 +312,11 @@ static void read_main_config_table(struct pm8001_hba_info 
*pm8001_ha)
/* read port recover and reset timeout */
pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer =
pm8001_mr32(address, MAIN_PORT_RECOVERY_TIMER);
+   /* read ILA and inactive firmware version */
+   pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version =
+   pm8001_mr32(address, MAIN_MPI_ILA_RELEASE_TYPE);
+   pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version =
+   pm8001_mr32(address, MAIN_MPI_INACTIVE_FW_VERSION);
 }
 
 /**
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h
index 82b8cf581da9..e36c5176f9a9 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.h
+++ b/drivers/scsi/pm8001/pm80xx_hwi.h
@@ -1445,6 +1445,8 @@ typedef struct SASProtocolTimerConfig 
SASProtocolTimerConfig_t;
 #define MAIN_SAS_PHY_ATTR_TABLE_OFFSET 0x90 /* DWORD 0x24 */
 #define MAIN_PORT_RECOVERY_TIMER   0x94 

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread James Bottomley
On Tue, 2017-09-19 at 08:52 -0700, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
> > 
> > On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> > > 
> > > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli
> > > wrote:
> > > > 
> > > >     src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> > > > +
> > > > +   msleep(5000);
> > > 
> > > src_writel is a writel, and thus a posted MMIO write.  You'll
> > > need
> > > to have to a read first to make it a reliable timing base.
> > > 
> > 
> > Just for my full understanding - you're saying a readl BEFORE
> > src_writel() or AFTER src_writel() ?
> 
> AFTER.

Actually, the whole problem sounds like a posted write.  Likely the
write that causes the reset doesn't get flushed until the read checking
if the reset has succeeded, which might explain the 100% initial
failure.  Why not throw away that first value if it's a failure and
then do your polled wait and timeout on the reset success.  We should
anyway be waiting some time for a reset to be issued, so even on non-
posted write systems we could see this problem intermittently.

James



Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche
 wrote:
> On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
>> Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
>> to do that already.
>
> Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
> reexamine the software queues and the hctx dispatch list but not the requeue
> list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
> requeue list. Hence the following code in scsi_end_request():

That doesn't need SCHED_RESTART, because it is the requeue path's
responsibility to do that; see blk_mq_requeue_work(), which runs the
hw queues at the end of that function.


-- 
Ming Lei


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Bart Van Assche
On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> to do that already.

Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
reexamine the software queues and the hctx dispatch list but not the requeue
list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
requeue list. Hence the following code in scsi_end_request():

if (scsi_target(sdev)->single_lun || 
!list_empty(&sdev->host->starved_list))
kblockd_schedule_work(&sdev->requeue_work);
else
blk_mq_run_hw_queues(q, true);

Bart.

[PATCH] sd: Limit WRITE SAME / WRITE SAME(16) w/UNMAP length for certain devices

2017-09-19 Thread Ewan D. Milne
From: "Ewan D. Milne" 

Some devices do not support a WRITE SAME / WRITE SAME(16) with the UNMAP
bit set up to the length specified in the MAXIMUM WRITE SAME LENGTH field
in the block limits VPD page (or, the field is zero, indicating there is
no limit).  Limit the length by the MAXIMUM UNMAP LBA COUNT value.
Otherwise the command might be rejected.
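To make the effect concrete, a small sketch of the clamping with made-up
numbers (min_not_zero() is the existing kernel helper that ignores zero
operands):

    u32 max_ws_blocks    = 0;        /* MAXIMUM WRITE SAME LENGTH == 0: "no limit" */
    u32 max_unmap_blocks = 0x80000;  /* MAXIMUM UNMAP LBA COUNT from the same VPD page */

    max_ws_blocks = min_not_zero(max_ws_blocks, max_unmap_blocks);
    /* -> 0x80000, so the later min_not_zero(max_ws_blocks, SD_MAX_WS16_BLOCKS)
     * clamp yields 0x80000 blocks instead of the 0x7fffff WS16 default */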

Signed-off-by: Ewan D. Milne 
---
 drivers/scsi/sd.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 6549e5c..fa07ac2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -693,6 +693,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
struct request_queue *q = sdkp->disk->queue;
unsigned int logical_block_size = sdkp->device->sector_size;
unsigned int max_blocks = 0;
+   u32 max_ws_blocks = sdkp->max_ws_blocks;
 
q->limits.discard_alignment =
sdkp->unmap_alignment * logical_block_size;
@@ -701,6 +702,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
sdkp->unmap_granularity * logical_block_size);
sdkp->provisioning_mode = mode;
 
+   /*
+* Some devices limit WRITE SAME / WRITE SAME(16) w/UNMAP
+* by MAXIMUM UNMAP LBA COUNT instead of MAXIMUM WRITE SAME LENGTH.
+* Use the smaller of the two values to avoid ILLEGAL REQUEST errors.
+*/
+   max_ws_blocks = min_not_zero(max_ws_blocks, sdkp->max_unmap_blocks);
+
switch (mode) {
 
case SD_LBP_FULL:
@@ -715,17 +723,17 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
break;
 
case SD_LBP_WS16:
-   max_blocks = min_not_zero(sdkp->max_ws_blocks,
+   max_blocks = min_not_zero(max_ws_blocks,
  (u32)SD_MAX_WS16_BLOCKS);
break;
 
case SD_LBP_WS10:
-   max_blocks = min_not_zero(sdkp->max_ws_blocks,
+   max_blocks = min_not_zero(max_ws_blocks,
  (u32)SD_MAX_WS10_BLOCKS);
break;
 
case SD_LBP_ZERO:
-   max_blocks = min_not_zero(sdkp->max_ws_blocks,
+   max_blocks = min_not_zero(max_ws_blocks,
  (u32)SD_MAX_WS10_BLOCKS);
break;
}
-- 
2.1.0



Re: [PATCH] scsi: SSDs can timeout during FS init because of too many unmaps

2017-09-19 Thread Ewan D. Milne
On Tue, 2017-09-19 at 09:02 -0400, Bill Kuzeja wrote:
> I encountered this issue putting XFS on several brands of SSDs on my
> system. During initialization, I would see a bunch of timeouts on
> WRITE_SAME_16 commands, which would get aborted, reissued, and complete.
> The logs look like this:
> 
> kernel: sd 2:0:1:0: attempting task abort! scmd(88086ca0c8c0)
> kernel: sd 1:0:1:0: attempting task abort! scmd(88086c7f2bc0)
> kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04
> 07 b8 00 7f ff ff 00 00
> kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04
> 07 b8 00 7f ff ff 00 00
> 
> And so on (there are many, many more of these)...
> 
> The interesting thing to note is that these are WS16 commands with the
> unmap bit set (this drive's version of UNMAP) with length 0x7fffff.
> This is over 8.3 million blocks to be unmapped at once. Since there are
> many concurrent "unmaps", the drive can get overwhelmed and time out.

There is another problem as well.  There are some enterprise storage
arrays that are rejecting large WRITE SAME(16) commands w/UNMAP set
with ILLEGAL REQUEST / INVALID FIELD IN CDB.  As far as I can tell,
T10 SBC says that the MAXIMUM WRITE SAME LENGTH field in the block
limits VPD page should describe the limit for these commands, but
the arrays appear to reject anything larger than MAXIMUM UNMAP LBA COUNT.
i.e. they are treating WRITE SAME(16) w/UNMAP the same as an UNMAP CDB.

I had come up with something similar, see my comment on your patch below.

> 
> Why does this happen? Initializing the device with a filesystem (in my
> experience XFS) creates one huge discard for the SSD. This gets
> broken apart into smaller unmap segments, which get sent down to the
> drive. For the SSDs that I've been working with (lbpws is always set),
> UNMAPs always get translated to WS16 with the unmap bit set.
> 
> The size of these segments is determined when the drive is set up
> initially. Early on, in the routine sd_read_block_limits, we read the
> Block Limits VPD page (page 0xb0). Among other things, this page gives
> us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME LENGTH.
> In my experience, every SSD returns zero for MAX WRITE SAME length but
> does have a real value for MAX_UNMAP LBA count.
> 
> The way the code is structured, because we read in zero for
> MAX WRITE SAME, we assume there is no limit for write same commands.
> This *may* be true, but unmap/discard commands translate into write
> same commands with the unmap bit set. Technically, this makes them
> no longer write same commands.
> 
> Currently, the max discard size is actually based off of max_ws_blocks.
> When configuring max discard size later, we default to
> SD_MAX_WS16_BLOCKS (0x7fffff) because max_ws_blocks is currently
> always zero:
> 
>  max_blocks = min_not_zero(sdkp->max_ws_blocks,
>   (u32)SD_MAX_WS16_BLOCKS);
> 
> A reasonable fix for this would be to use the MAX UNMAP LBA count
> (stored as max_unmap_blocks) instead of max_ws_blocks in the case where
> we're defaulting to WS16 for unmaps.
> 
> After discussing this issue with an SSD vendor's firmware team, they
> confirmed that this was a good way to proceed. That is, it made sense to
> use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS
> value because 1) max_ws_blocks is always zero, 2) these are really
> unmap commands we're issuing, and 3) the SSD can handle a few unmaps
> the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them.
> 
> The largest max unmap size I've seen returned from a drive (from
> the Block Limits VPD page) is 0x27 or about 30% of SD_MAX_WS16_BLOCKS. 
> Other sizes are much smaller, typically 0x8 or about 6% of the 
> previous max value.
> 
> I've also done performance testing for this change. The only impact I've
> seen on SSDs is during filesystem initialization time, which would be 
> expected since that's most likely the only time we'd be doing really large 
> unmaps. Even so, the impact on FS init is fairly minimal, 10% for some 
> models of SSDs, others no noticeable difference at all.
> 
> 
> 
> Signed-off-by: Bill Kuzeja 
> ---
>  drivers/scsi/sd.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 11c1738..f24c4f2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
> unsigned int mode)
>   break;
>  
>   case SD_LBP_WS16:
> - max_blocks = min_not_zero(sdkp->max_ws_blocks,
> -   (u32)SD_MAX_WS16_BLOCKS);
> + /* If no value given, use unmap limit - WS16 default too large 
> */
> + if (!sdkp->max_ws_blocks)
> + max_blocks = min_not_zero(sdkp->max_unmap_blocks,
> +

Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 11:48:23AM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at  1:43am -0400,
> Ming Lei  wrote:
> 
> > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > > On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote:
> > > > "if no request has completed before the delay has expired" can't be a
> > > > reason to rerun the queue, because the queue can still be busy.
> > > 
> > > That statement of you shows that there are important aspects of the SCSI
> > > core and dm-mpath driver that you don't understand.
> > 
> > Then can you tell me why blk-mq's SCHED_RESTART can't cover
> > the rerun when there are in-flight requests? What is the case
> > in which dm-rq can return BUSY and there aren't any in-flight
> > requests meantime?
> > 
> > Also you are the author of adding 'blk_mq_delay_run_hw_queue(
> > hctx, 100/*ms*/)' in dm-rq, you never explain in commit
> > 6077c2d706097c0(dm rq: Avoid that request processing stalls
> > sporadically) what the root cause is for your request stall
> > and why this patch fixes your issue. You don't even explain
> > why the delay is 100ms.
> > 
> > So it is a workaround, isn't it?
> > 
> > My concern is that it isn't good to add blk_mq_delay_run_hw_queue(hctx, 
> > 100/*ms*/)
> > in the hot path since it should have been covered by SCHED_RESTART
> > if there are in-flight requests.
> 
> This thread proves that it is definitely brittle to be relying on fixed
> delays like this:
> https://patchwork.kernel.org/patch/9703249/

I can't agree more. Since no one has mentioned the root cause, maybe the
request stall has been fixed recently. Keeping the workaround in the hot
path is a bit annoying.

-- 
Ming


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 11:56:03AM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at 11:36am -0400,
> Bart Van Assche  wrote:
> 
> > On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote:
> > > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > > > If you are still looking at removing the blk_mq_delay_run_hw_queue() 
> > > > calls
> > > > then I think you are looking in the wrong direction. What kind of 
> > > > problem
> > > > are you trying to solve? Is it perhaps that there can be a delay between
> > > 
> > > Actually the improvement on dm-rq IO schedule(the patch 2 ~ 5) doesn't
> > > need this patch.
> > 
> > The approach of this patch series looks wrong to me and patch 5/5 is an ugly
> > hack in my opinion. That's why I asked you to drop the entire patch series 
> > and
> > to test whether inserting a queue run call into the dm-mpath end_io callback
> > yields a similar performance improvement to the patches you posted. Please 
> > do
> > not expect me to spend more time on this patch series if you do not come up
> > with measurement results for the proposed alternative.
> 
> Bart, asserting that Ming's work is a hack doesn't help your apparent
> goal of delegitimizing Ming's work.
> 
> Nor does it take away from the fact that your indecision on appropriate
> timeouts (let alone ability to defend and/or explain them) validates
> Ming calling them into question (which you are now dodging regularly).
> 
> But please don't take this feedback and shut-down.  Instead please work
> together more constructively.  This doesn't need to be adversarial!  I
> am at a loss for why there is such animosity from your end Bart.
> 
> Please dial it back.  It is just a distraction that fosters needless
> in-fighting.
> 
> Believe it or not: Ming is just trying to improve the code because he
> has a testbed that is showing fairly abysmal performance with dm-mq
> multipath (on lpfc with scsi-mq).
> 
> Ming, if you can: please see if what Bart has proposed (instead: run
> queue at end_io) helps.  Though I don't yet see why that should be
> needed.

Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
to do that already.

The dm-mpath performance issue has nothing to do with that; actually
the issue is very similar to my improvement on SCSI-MQ, because
dm_dispatch_clone_request() doesn't know whether the underlying
queue is busy or not, and dm-rq/mpath just keeps dispatching
more requests to the underlying queue even though it is busy, so IO
merging can't be done easily on dm-rq/mpath.

I am thinking of another way to improve this, since some SCSI
devices' queue_depth is different from their .cmd_per_lun; I will
post a patchset soon.


-- 
Ming


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Mike Snitzer
On Tue, Sep 19 2017 at 11:52am -0400,
Bart Van Assche  wrote:

> On Tue, 2017-09-19 at 11:48 -0400, Mike Snitzer wrote:
> > This thread proves that it is definitely brittle to be relying on fixed
> > delays like this:
> > https://patchwork.kernel.org/patch/9703249/
> 
> Hello Mike,
> 
> Sorry but I think that's a misinterpretation of my patch. I came up with that
> patch before I had found and fixed the root cause of the high system load.
> These fixes are upstream (kernel v4.13) which means that that patch is now
> obsolete.

OK, fair point.

Though fixed magic values for delay aren't going to stand the test of
time.


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 12:52 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
>> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
>>> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
 +
 +  msleep(5000);
>>>
>>> src_writel is a writel, and thus a posted MMIO write.  You'll need
>>> to do a read first to make it a reliable timing base.
>>>
>>
>> Just for my full understanding - you're saying a readl BEFORE
>> src_writel() or AFTER src_writel() ?
> 
> AFTER.

Thanks!

> 
>> I could add a read to some dummy register, but notice it was a sequence
>> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
>> reset...
> 
> Oh, ouch.  I guess in that case we'll need to do the writel and pray,
> but that would need a big comment explaining what's going on there.
> 

Heheh you're right!

But do you remember that quirk added to nvme? Basically, it was a
similar scenario in which some adapters weren't happy having a register
polled in nvme_wait_ready() right after we wrote to the Controller Config
register when disabling a controller.

Seems like the same thing here: the controller is not able to handle a
read right after some internal procedure (the reset) has been initiated.

If you have suggestion to improve the commit msg, let me know :)
Thanks!



Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Mike Snitzer
On Tue, Sep 19 2017 at 11:36am -0400,
Bart Van Assche  wrote:

> On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote:
> > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > > If you are still looking at removing the blk_mq_delay_run_hw_queue() calls
> > > then I think you are looking in the wrong direction. What kind of problem
> > > are you trying to solve? Is it perhaps that there can be a delay between
> > 
> > Actually the improvement on dm-rq IO schedule(the patch 2 ~ 5) doesn't
> > need this patch.
> 
> The approach of this patch series looks wrong to me and patch 5/5 is an ugly
> hack in my opinion. That's why I asked you to drop the entire patch series and
> to test whether inserting a queue run call into the dm-mpath end_io callback
> yields a similar performance improvement to the patches you posted. Please do
> not expect me to spend more time on this patch series if you do not come up
> with measurement results for the proposed alternative.

Bart, asserting that Ming's work is a hack doesn't help your apparent
goal of delegitimizing Ming's work.

Nor does it take away from the fact that your indecision on appropriate
timeouts (let alone ability to defend and/or explain them) validates
Ming calling them into question (which you are now dodging regularly).

But please don't take this feedback and shut-down.  Instead please work
together more constructively.  This doesn't need to be adversarial!  I
am at a loss for why there is such animosity from your end Bart.

Please dial it back.  It is just a distraction that fosters needless
in-fighting.

Believe it or not: Ming is just trying to improve the code because he
has a testbed that is showing fairly abysmal performance with dm-mq
multipath (on lpfc with scsi-mq).

Ming, if you can: please see if what Bart has proposed (instead: run
queue at end_io) helps.  Though I don't yet see why that should be
needed.

Mike


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Bart Van Assche
On Tue, 2017-09-19 at 11:48 -0400, Mike Snitzer wrote:
> This thread proves that it is definitely brittle to be relying on fixed
> delays like this:
> https://patchwork.kernel.org/patch/9703249/

Hello Mike,

Sorry but I think that's a misinterpretation of my patch. I came up with that
patch before I had found and fixed the root cause of the high system load.
These fixes are upstream (kernel v4.13) which means that that patch is now
obsolete.

Bart.

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Christoph Hellwig
On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
> >>src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> >> +
> >> +  msleep(5000);
> > 
> > src_writel is a writel, and thus a posted MMIO write.  You'll need
> > to do a read first to make it a reliable timing base.
> > 
> 
> Just for my full understanding - you're saying a readl BEFORE
> src_writel() or AFTER src_writel() ?

AFTER.

> I could add a read to some dummy register, but notice it was a sequence
> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
> reset...

Oh, ouch.  I guess in that case we'll need to do the writel and pray,
but that would need a big comment explaining what's going on there.


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>>  src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
>> +
>> +msleep(5000);
> 
>> src_writel is a writel, and thus a posted MMIO write.  You'll need
>> to do a read first to make it a reliable timing base.
> 

Just for my full understanding - you're saying a readl BEFORE
src_writel() or AFTER src_writel() ?

I could add a read to some dummy register, but notice it was a sequence
of readl's on aac_is_ctrl_up_and_running() that caused the failure of
reset...

Thanks



Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Mike Snitzer
On Tue, Sep 19 2017 at  1:43am -0400,
Ming Lei  wrote:

> On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote:
> > > "if no request has completed before the delay has expired" can't be a
> > > reason to rerun the queue, because the queue can still be busy.
> > 
> > That statement of you shows that there are important aspects of the SCSI
> > core and dm-mpath driver that you don't understand.
> 
> Then can you tell me why blk-mq's SCHED_RESTART can't cover
> the rerun when there are in-flight requests? What is the case
> in which dm-rq can return BUSY and there aren't any in-flight
> requests meantime?
> 
> Also you are the author of adding 'blk_mq_delay_run_hw_queue(
> hctx, 100/*ms*/)' in dm-rq, you never explain in commit
> 6077c2d706097c0(dm rq: Avoid that request processing stalls
> sporadically) what the root cause is for your request stall
> and why this patch fixes your issue. You don't even explain
> why the delay is 100ms.
> 
> So it is a workaround, isn't it?
> 
> My concern is that it isn't good to add blk_mq_delay_run_hw_queue(hctx, 
> 100/*ms*/)
> in the hot path since it should have been covered by SCHED_RESTART
> if there are in-flight requests.

This thread proves that it is definitely brittle to be relying on fixed
delays like this:
https://patchwork.kernel.org/patch/9703249/

Mike


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Christoph Hellwig
On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>   src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> +
> + msleep(5000);

src_writel is a writel, and thus a posted MMIO write.  You'll need
to do a read first to make it a reliable timing base.
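
For readers following along, the generic pattern looks roughly like the sketch
below; the register offset and mask are made-up placeholders, not the actual
aacraid definitions.

#include <linux/delay.h>
#include <linux/io.h>

/* Placeholder register offset/mask, not the real aacraid definitions. */
#define EXAMPLE_IDR_OFFSET	0x20
#define EXAMPLE_RESET_MASK	0x00000040

static void example_issue_reset(void __iomem *base)
{
	writel(EXAMPLE_RESET_MASK, base + EXAMPLE_IDR_OFFSET);
	(void)readl(base + EXAMPLE_IDR_OFFSET);	/* flush the posted write */
	msleep(5000);	/* delay only starts once the write reached the device */
}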


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-19 Thread Bart Van Assche
On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote:
> On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > If you are still looking at removing the blk_mq_delay_run_hw_queue() calls
> > then I think you are looking in the wrong direction. What kind of problem
> > are you trying to solve? Is it perhaps that there can be a delay between
> 
> Actually the improvement on dm-rq IO schedule(the patch 2 ~ 5) doesn't
> need this patch.

The approach of this patch series looks wrong to me and patch 5/5 is an ugly
hack in my opinion. That's why I asked you to drop the entire patch series and
to test whether inserting a queue run call into the dm-mpath end_io callback
yields a similar performance improvement to the patches you posted. Please do
not expect me to spend more time on this patch series if you do not come up
with measurement results for the proposed alternative.

Bart.

[PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
Commit 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset
status") changed the way driver checks if a reset succeeded. Now, after an
IOP reset, aacraid immediately start polling a register to verify the reset
is complete.

This behavior causes regressions on the reset path on PowerPC (at least).
Since the delay after the IOP reset was removed by the aforementioned patch,
the driver now starts reading a register immediately after the reset is
issued (by writing to another register), which "corrupts" the reset
procedure and makes it fail all the time.

The issue highly impacted kdump on PowerPC, since on the kdump path we
proactively issue a reset of the adapter (through the reset_devices kernel
parameter).

This patch (re-)adds a delay right after IOP reset is issued. Empirically
we measured that 3 seconds is enough, but for safety reasons we delay
for 5s (and since it was 30s before, 5s is still a small amount).

For reference, without this patch we observe the following messages
on kdump kernel boot process:

  [ 76.294] aacraid 0003:01:00.0: IOP reset failed
  [ 76.294] aacraid 0003:01:00.0: ARC Reset attempt failed
  [ 86.524] aacraid 0003:01:00.0: adapter kernel panic'd ff.
  [ 86.524] aacraid 0003:01:00.0: Controller reset type is 3
  [ 86.524] aacraid 0003:01:00.0: Issuing IOP reset
  [146.534] aacraid 0003:01:00.0: IOP reset failed
  [146.534] aacraid 0003:01:00.0: ARC Reset attempt failed

Fixes: 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset 
status")
Cc: sta...@vger.kernel.org # v4.13+
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/scsi/aacraid/src.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/aacraid/src.c b/drivers/scsi/aacraid/src.c
index 48c2b2b34b72..0c9361c87ec8 100644
--- a/drivers/scsi/aacraid/src.c
+++ b/drivers/scsi/aacraid/src.c
@@ -740,6 +740,8 @@ static void aac_send_iop_reset(struct aac_dev *dev)
aac_set_intx_mode(dev);
 
src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
+
+   msleep(5000);
 }
 
 static void aac_send_hardware_soft_reset(struct aac_dev *dev)
-- 
2.14.1



Re: [PATCH V6 2/3] dma-mapping: Rework dma_get_cache_alignment() function

2017-09-19 Thread Christoph Hellwig
>   mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * 
> mdev->limits.mtt_seg_size,
> -dma_get_cache_alignment()) / 
> mdev->limits.mtt_seg_size;
> +dma_get_cache_alignment(NULL)) / 
> mdev->limits.mtt_seg_size;
>  
>   mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, 
> init_hca->mtt_base,

Please pass the actually relevant struct device for each call.


Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?

2017-09-19 Thread Benjamin Block
Hello linux-block,

I wrote some tests recently to test patches against bsg.c and bsg-lib.c,
and while writing those I noticed something strange:

When you use the write() and read() call on multiple file-descriptors
for a single bsg-device (FC or SCSI), it is possible that you get
cross-talk between those different file-descriptors.

E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0
and both send commands via write(). When they both do read() later on -
to read the responses for the commands they sent before - it is possible
that process A gets the response (Sense-Data, FC-Protocol-Data, ...) for
a command that process B wrote, and vice versa.

I noticed this because in my tests I 'tag' each command I send with
write() via a value I write into the usr_ptr field of sg_io_v4. When I
later use read() to receive responses, I check for this tag in a
hash-table and so can look up the original command. When I used this and
spawned multiple processes with the same target bsg-device, I got
responses for commands I don't have tags for in my hash-table, so they
were written by another process. This never happened when I only spawned
one test-process.
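
For reference, the tagging boils down to something like the trimmed-down
sketch below; the struct and helper names are mine and not from bsg.c, only
sg_io_v4 and its usr_ptr field are the real interface.

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/bsg.h>

struct my_cmd {
	uint8_t  cdb[16];
	uint8_t  sense[96];
	uint64_t tag;			/* key into my hash-table */
};

static int send_tagged(int fd, struct my_cmd *cmd)
{
	struct sg_io_v4 hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.guard = 'Q';
	hdr.protocol = BSG_PROTOCOL_SCSI;
	hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
	hdr.request = (uintptr_t)cmd->cdb;
	hdr.request_len = sizeof(cmd->cdb);
	hdr.response = (uintptr_t)cmd->sense;
	hdr.max_response_len = sizeof(cmd->sense);
	hdr.usr_ptr = cmd->tag;		/* the tag travels with the command */

	return write(fd, &hdr, sizeof(hdr)) == sizeof(hdr) ? 0 : -1;
}

static int recv_one(int fd, uint64_t *tag_out)
{
	struct sg_io_v4 hdr;

	if (read(fd, &hdr, sizeof(hdr)) != sizeof(hdr))
		return -1;
	*tag_out = hdr.usr_ptr;	/* with two FDs open, this can be a tag
				 * written by the other process: cross-talk */
	return 0;
}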

This seems awfully counter-productive.. so much that I don't see any
point in allowing users to open more than one FD per bsg-device, because
that would inevitably lead to very hard-to-debug 'bugs'. And I wonder if
this is really by design or some regression that happened over time.

I looked at the code in bsg.c and yes, it seems this is what is coded
right now. We have a hash-table which manages all 'struct bsg_device's
which are indexed by device-minors, so one hash-table entry per
device-node.

So eh, long talk short question, is that intended?



Beste Grüße / Best regards,
  - Benjamin Block
-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



Re: qla2xxx: MSI-X: Unsupported ISP 2432 SSVID/SSDID (0x103C,0x7041)

2017-09-19 Thread Meelis Roos
> On 08/19/2017 10:41 PM, Meelis Roos wrote:
> > Hello, I just tried Linux with the latest kernel (4.13-rc5+git) on a HP 
> > DL360 G6 with HP branded ISP2432 HBA. The driver mentions unsupported 
> > model of the card:
> > 
> > [3.868589] scsi host1: qla2xxx
> > [3.871696] qla2xxx [:07:00.0]-00fb:1: QLogic HPAE312A - PCI-Express 
> > Dual Port 4Gb Fibre Channel HBA.
> > [3.871782] qla2xxx [:07:00.0]-00fc:1: ISP2432: PCIe (2.5GT/s x4) @ 
> > :07:00.0 hdma+ host#=1 fw=8.03.00 (9496).
> > [3.871973] qla2xxx [:07:00.1]-001d: : Found an ISP2432 irq 49 
> > iobase 0xc900062c5000.
> > [3.872568] qla2xxx [:07:00.1]-0034:4: MSI-X: Unsupported ISP 2432 
> > SSVID/SSDID (0x103C,0x7041).
> > 
> > Is there some information I can provide to include this card in fully 
> > supported list?
> > 
> Weelll ... that statement indicates that MSI-X is unsupported, not the
> card itself.

Yes, something seems to be bad about MSI-X. Other hardware is using MSI, 
QLA is using IOAPIC.

cut -c 1-10,185- < /proc/interrupts

   0: IO-APIC2-edge  timer
   1: IO-APIC1-edge  i8042
   8: IO-APIC8-edge  rtc0
   9: IO-APIC9-fasteoi   acpi
  12: IO-APIC   12-edge  i8042
  17: IO-APIC   17-fasteoi   ata_piix
  20: IO-APIC   20-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
  21: IO-APIC   21-fasteoi   ipmi_si
  22: IO-APIC   22-fasteoi   hpilo, uhci_hcd:usb4, uhci_hcd:usb6
  23: IO-APIC   23-fasteoi   uhci_hcd:usb3, uhci_hcd:usb5, radeon
  24: DMAR-MSI0-edge  dmar0
  26: PCI-MSI 1572864-edge  hpsa0-msix0
  27: PCI-MSI 1572865-edge  hpsa0-msix1
  28: PCI-MSI 1572866-edge  hpsa0-msix2
  29: PCI-MSI 1572867-edge  hpsa0-msix3
  30: PCI-MSI 1572868-edge  hpsa0-msix4
  31: PCI-MSI 1572869-edge  hpsa0-msix5
  32: PCI-MSI 1572870-edge  hpsa0-msix6
  33: PCI-MSI 1572871-edge  hpsa0-msix7
  34: PCI-MSI 1572872-edge  hpsa0-msix8
  35: PCI-MSI 1572873-edge  hpsa0-msix9
  36: PCI-MSI 1572874-edge  hpsa0-msix10
  37: PCI-MSI 1572875-edge  hpsa0-msix11
  38: PCI-MSI 1572876-edge  hpsa0-msix12
  39: PCI-MSI 1572877-edge  hpsa0-msix13
  40: PCI-MSI 1572878-edge  hpsa0-msix14
  41: PCI-MSI 1572879-edge  hpsa0-msix15
  45: PCI-MSI 2097152-edge  ens1f0
  46: IO-APIC0-fasteoi   qla2xxx
  48: PCI-MSI 2099200-edge  ens1f1
  49: IO-APIC   10-fasteoi   qla2xxx
  50: PCI-MSI 1048576-edge  enp2s0f0-0
  51: PCI-MSI 1048577-edge  enp2s0f0-1
  52: PCI-MSI 1048578-edge  enp2s0f0-2
  53: PCI-MSI 1048579-edge  enp2s0f0-3
  54: PCI-MSI 1048580-edge  enp2s0f0-4
  55: PCI-MSI 1048581-edge  enp2s0f0-5
  56: PCI-MSI 1048582-edge  enp2s0f0-6
  57: PCI-MSI 1048583-edge  enp2s0f0-7
  59: PCI-MSI 1050624-edge  enp2s0f1-0
  60: PCI-MSI 1050625-edge  enp2s0f1-1
  61: PCI-MSI 1050626-edge  enp2s0f1-2
  62: PCI-MSI 1050627-edge  enp2s0f1-3
  63: PCI-MSI 1050628-edge  enp2s0f1-4
  64: PCI-MSI 1050629-edge  enp2s0f1-5
  65: PCI-MSI 1050630-edge  enp2s0f1-6
  66: PCI-MSI 1050631-edge  enp2s0f1-7
 NMI: Non-maskable interrupts
 LOC: Local timer interrupts
 SPU: Spurious interrupts
 PMI: Performance monitoring interrupts
 IWI: IRQ work interrupts
 RTR: APIC ICR read retries
 RES: Rescheduling interrupts
 CAL: Function call interrupts
 TLB: TLB shootdowns
 TRM: Thermal event interrupts
 THR: Threshold APIC interrupts
 MCE: Machine check exceptions
 MCP: Machine check polls
 ERR: 
 MIS: 
 PIN: Posted-interrupt notification event
 NPI: Nested posted-interrupt event
 PIW: Posted-interrupt wakeup event

Full dmesg:

[0.00] Linux version 4.14.0-rc1-00013-gebb2c2437d80 (mroos@dl360g6) 
(gcc version 7.2.0 (Debian 7.2.0-1)) #17 SMP Tue Sep 19 15:01:51 EEST 2017
[0.00] Command line: 
BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc1-00013-gebb2c2437d80 root=/dev/sda1 ro
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009f3ff] usable
[0.00] BIOS-e820: [mem 0x0009f400-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xd761efff] usable
[0.00] BIOS-e820: [mem 0xd761f000-0xd762bfff] ACPI data
[0.00] BIOS-e820: [mem 0xd762c000-0xd762cfff] usable
[0.00] BIOS-e820: [mem 0xd762d000-0xd7ff] reserved
[0.00] BIOS-e820: [mem 0xe000-0xe3ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfee0] reserved
[0.00] BIOS-e820: [mem 0xff80-0x] 

[PATCH] scsi: SSDs can timeout during FS init because of too many unmaps

2017-09-19 Thread Bill Kuzeja
I encountered this issue putting XFS on several brands of SSDs on my
system. During initialization, I would see a bunch of timeouts on
WRITE_SAME_16 commands, which would get aborted, reissued, and complete.
The logs look like this:

kernel: sd 2:0:1:0: attempting task abort! scmd(88086ca0c8c0)
kernel: sd 1:0:1:0: attempting task abort! scmd(88086c7f2bc0)
kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04
07 b8 00 7f ff ff 00 00
kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04
07 b8 00 7f ff ff 00 00

And so on (there are many, many more of these)...

The interesting thing to note is that these are WS16 commands with the
unmap bit set (this drive's version of UNMAP) with length 0x7fffff.
This is over 8.3 million blocks to be unmapped at once. Since there are
many concurrent "unmaps", the drive can get overwhelmed and time out.



Why does this happen? Initializing the device with a filesystem (in my
experience XFS) creates one huge discard for the SSD. This gets
broken apart into smaller unmap segments, which get sent down to the
drive. For the SSDs that I've been working with (lbpws is always set),
UNMAPs always get translated to WS16 with the unmap bit set.

The size of these segments is determined when the drive is set up
initially. Early on, in the routine sd_read_block_limits, we read the
Block Limits VPD page (page 0xb0). Among other things, this page gives
us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME LENGTH.
In my experience, every SSD returns zero for MAX WRITE SAME length but
does have a real value for MAX_UNMAP LBA count.

The way the code is structured, because we read in zero for
MAX WRITE SAME, we assume there is no limit for write same commands.
This *may* be true, but unmap/discard commands translate into write
same commands with the unmap bit set. Technically, this makes them
no longer write same commands.

Currently, the max discard size is actually based off of max_ws_blocks.
When configuring max discard size later, we default to
SD_MAX_WS16_BLOCKS (0x7fffff) because max_ws_blocks is currently
always zero:

 max_blocks = min_not_zero(sdkp->max_ws_blocks,
  (u32)SD_MAX_WS16_BLOCKS);
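
(As a side note, min_not_zero() ignores a zero argument, which is why a zero
max_ws_blocks silently becomes the full WS16 maximum; the 0x2800 below is
just an example value, not taken from a real drive:)

/*
 * min_not_zero() (include/linux/kernel.h) picks the smaller of two values
 * but ignores a zero argument, so with max_ws_blocks == 0 the current code
 * degenerates to the full WS16 maximum:
 */
u32 no_limit   = min_not_zero(0U,      0x7fffffU);  /* == 0x7fffff, today's default      */
u32 real_limit = min_not_zero(0x2800U, 0x7fffffU);  /* == 0x2800, the smaller real limit */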

A reasonable fix for this would be to use the MAX UNMAP LBA count
(stored as max_unmap_blocks) instead of max_ws_blocks in the case where
we're defaulting to WS16 for unmaps.

After discussing this issue with an SSD vendor's firmware team, they
confirmed that this was a good way to proceed. That is, it made sense to
use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS
value because 1) max_ws_blocks is always zero, 2) these are really
unmap commands we're issuing, and 3) the SSD can handle a few unmaps
the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them.

The largest max unmap size I've seen returned from a drive (from
the Block Limits VPD page) is 0x27 or about 30% of SD_MAX_WS16_BLOCKS. 
Other sizes are much smaller, typically 0x8 or about 6% of the 
previous max value.

I've also done performance testing for this change. The only impact I've
seen on SSDs is during filesystem initialization time, which would be 
expected since that's most likely the only time we'd be doing really large 
unmaps. Even so, the impact on FS init is fairly minimal, 10% for some 
models of SSDs, others no noticeable difference at all.



Signed-off-by: Bill Kuzeja 
---
 drivers/scsi/sd.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 11c1738..f24c4f2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
break;
 
case SD_LBP_WS16:
-   max_blocks = min_not_zero(sdkp->max_ws_blocks,
- (u32)SD_MAX_WS16_BLOCKS);
+   /* If no value given, use unmap limit - WS16 default too large 
*/
+   if (!sdkp->max_ws_blocks)
+   max_blocks = min_not_zero(sdkp->max_unmap_blocks,
+ (u32)SD_MAX_WS16_BLOCKS);
+   else
+   max_blocks = min_not_zero(sdkp->max_ws_blocks,
+ (u32)SD_MAX_WS16_BLOCKS);
break;
 
case SD_LBP_WS10:
-- 
1.8.3.1



Re: [PATCH] fcoe-utils: Fix get_ctlr_num() for large ctlr_* indices

2017-09-19 Thread Hannes Reinecke
On 09/18/2017 04:35 PM, Andrey Grafin wrote:
> Each creation of a FCoE device increases counter which is used as a suffix
> in a FCoE device name in sysfs (i.e. /sys/bus/fcoe/devices/ctlr_1).
> Once this counter reaches two digits (10 and larger),
> get_ctlr_num() stops working properly and a command like `fcoeadm -i`,
> which depends on the get_ctlr_num() call, doesn't show anything.
> This patch solves this problem.
> 
> Signed-off-by: Andrey Grafin 
> ---
>  lib/sysfs_hba.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/sysfs_hba.c b/lib/sysfs_hba.c
> index 5cb7fd3fa5d..786215440ba 100644
> --- a/lib/sysfs_hba.c
> +++ b/lib/sysfs_hba.c
> @@ -606,7 +606,7 @@ static int get_ctlr_num(const char *netdev)
>   if (!ctlr)
>   continue;
>  
> - ctlr_num = atoi(&ctlr[strlen(ctlr) - 1]);
> + ctlr_num = atoi(&ctlr[sizeof("ctlr_") - 1]);
>   break;
>   }
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH V6 3/3] scsi: Align block queue to dma_get_cache_alignment()

2017-09-19 Thread Huacai Chen
In non-coherent DMA mode, the kernel uses cache flushing operations to
maintain I/O coherency, so SCSI's block queue should be aligned to
ARCH_DMA_MINALIGN. Otherwise, it will cause data corruption, at least
on MIPS:

Step 1, dma_map_single
Step 2, cache_invalidate (no writeback)
Step 3, dma_from_device
Step 4, dma_unmap_single

If a DMA buffer and a kernel structure share the same cache line, and if
the kernel structure has dirty data, cache_invalidate (no writeback)
will cause data loss.
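
A contrived example of the layout that triggers this (illustration only, not
code from this patch):

#include <linux/dma-mapping.h>

/*
 * With a 32-byte cache line, 'flags' and the start of 'rx_buf' share a
 * line.  While rx_buf is mapped DMA_FROM_DEVICE, the invalidate in step 2
 * above discards a dirty CPU update to 'flags' made after step 1.
 * Aligning DMA buffers to ARCH_DMA_MINALIGN (what this patch does for the
 * block queue) avoids the sharing.
 */
struct foo {
	unsigned long	flags;		/* written by the CPU */
	char		rx_buf[64];	/* handed to dma_map_single() */
};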

Cc: sta...@vger.kernel.org
Signed-off-by: Huacai Chen 
---
 drivers/scsi/scsi_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80..19abc2e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2132,11 +2132,11 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct 
request_queue *q)
q->limits.cluster = 0;
 
/*
-* set a reasonable default alignment on word boundaries: the
-* host and device may alter it using
+* set a reasonable default alignment on word/cacheline boundaries:
+* the host and device may alter it using
 * blk_queue_update_dma_alignment() later.
 */
-   blk_queue_dma_alignment(q, 0x03);
+   blk_queue_dma_alignment(q, max(4, dma_get_cache_alignment(dev)) - 1);
 }
 EXPORT_SYMBOL_GPL(__scsi_init_queue);
 
-- 
2.7.0





[PATCH V6 2/3] dma-mapping: Rework dma_get_cache_alignment() function

2017-09-19 Thread Huacai Chen
Make dma_get_cache_alignment() accept a 'dev' argument. As a result,
it can return different alignments due to different devices' I/O cache
coherency. For compatibility, make all existing callers pass a NULL dev
argument.
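
Roughly, the reworked helper ends up looking like the sketch below; this is a
reconstruction based on the description and on device_is_coherent() from
patch 1/3, not necessarily the literal dma-mapping.h hunk:

static inline int dma_get_cache_alignment(struct device *dev)
{
#ifdef ARCH_DMA_MINALIGN
	/* NULL (the compatibility case) and non-coherent devices keep the
	 * conservative alignment; coherent devices can use 1. */
	if (!dev || !device_is_coherent(dev))
		return ARCH_DMA_MINALIGN;
#endif
	return 1;
}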

Cc: sta...@vger.kernel.org
Signed-off-by: Huacai Chen 
---
 drivers/infiniband/hw/mthca/mthca_main.c   |  2 +-
 drivers/media/v4l2-core/videobuf2-dma-contig.c |  2 +-
 drivers/net/ethernet/broadcom/b44.c|  2 +-
 drivers/net/ethernet/ibm/emac/core.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  |  2 +-
 drivers/spi/spi-qup.c  |  4 ++--
 drivers/tty/serial/mpsc.c  | 16 
 drivers/tty/serial/samsung.c   | 14 +++---
 include/linux/dma-mapping.h| 14 +-
 9 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c 
b/drivers/infiniband/hw/mthca/mthca_main.c
index e36a9bc..cac5fac 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -416,7 +416,7 @@ static int mthca_init_icm(struct mthca_dev *mdev,
 
/* CPU writes to non-reserved MTTs, while HCA might DMA to reserved 
mtts */
mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * 
mdev->limits.mtt_seg_size,
-  dma_get_cache_alignment()) / 
mdev->limits.mtt_seg_size;
+  dma_get_cache_alignment(NULL)) / 
mdev->limits.mtt_seg_size;
 
mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, 
init_hca->mtt_base,
 
mdev->limits.mtt_seg_size,
diff --git a/drivers/media/v4l2-core/videobuf2-dma-contig.c 
b/drivers/media/v4l2-core/videobuf2-dma-contig.c
index 9f389f3..7f54739 100644
--- a/drivers/media/v4l2-core/videobuf2-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf2-dma-contig.c
@@ -484,7 +484,7 @@ static void *vb2_dc_get_userptr(struct device *dev, 
unsigned long vaddr,
int ret = 0;
struct sg_table *sgt;
unsigned long contig_size;
-   unsigned long dma_align = dma_get_cache_alignment();
+   unsigned long dma_align = dma_get_cache_alignment(NULL);
 
/* Only cache aligned DMA transfers are reliable */
if (!IS_ALIGNED(vaddr | size, dma_align)) {
diff --git a/drivers/net/ethernet/broadcom/b44.c 
b/drivers/net/ethernet/broadcom/b44.c
index a1125d1..291d6af 100644
--- a/drivers/net/ethernet/broadcom/b44.c
+++ b/drivers/net/ethernet/broadcom/b44.c
@@ -2587,7 +2587,7 @@ static inline void b44_pci_exit(void)
 
 static int __init b44_init(void)
 {
-   unsigned int dma_desc_align_size = dma_get_cache_alignment();
+   unsigned int dma_desc_align_size = dma_get_cache_alignment(NULL);
int err;
 
/* Setup paramaters for syncing RX/TX DMA descriptors */
diff --git a/drivers/net/ethernet/ibm/emac/core.h 
b/drivers/net/ethernet/ibm/emac/core.h
index 369de2c..236bf37 100644
--- a/drivers/net/ethernet/ibm/emac/core.h
+++ b/drivers/net/ethernet/ibm/emac/core.h
@@ -68,7 +68,7 @@ static inline int emac_rx_size(int mtu)
return mal_rx_size(ETH_DATA_LEN + EMAC_MTU_OVERHEAD);
 }
 
-#define EMAC_DMA_ALIGN(x)  ALIGN((x), dma_get_cache_alignment())
+#define EMAC_DMA_ALIGN(x)  ALIGN((x), 
dma_get_cache_alignment(NULL))
 
 #define EMAC_RX_SKB_HEADROOM   \
EMAC_DMA_ALIGN(CONFIG_IBM_EMAC_RX_SKB_HEADROOM)
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index e61c99e..56b1449 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1660,7 +1660,7 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap,
 */
dev->caps.reserved_mtts =
ALIGN(dev->caps.reserved_mtts * dev->caps.mtt_entry_sz,
- dma_get_cache_alignment()) / dev->caps.mtt_entry_sz;
+ dma_get_cache_alignment(NULL)) / dev->caps.mtt_entry_sz;
 
err = mlx4_init_icm_table(dev, &priv->mr_table.mtt_table,
  init_hca->mtt_base,
diff --git a/drivers/spi/spi-qup.c b/drivers/spi/spi-qup.c
index 974a8ce..0c698c3 100644
--- a/drivers/spi/spi-qup.c
+++ b/drivers/spi/spi-qup.c
@@ -862,7 +862,7 @@ static bool spi_qup_can_dma(struct spi_master *master, 
struct spi_device *spi,
struct spi_transfer *xfer)
 {
struct spi_qup *qup = spi_master_get_devdata(master);
-   size_t dma_align = dma_get_cache_alignment();
+   size_t dma_align = dma_get_cache_alignment(NULL);
int n_words;
 
if (xfer->rx_buf) {
@@ -1038,7 +1038,7 @@ static int spi_qup_probe(struct platform_device *pdev)
master->transfer_one = spi_qup_transfer_one;
master->dev.of_node = pdev->dev.of_node;

[PATCH V6 1/3] dma-mapping: Introduce device_is_coherent() as a helper

2017-09-19 Thread Huacai Chen
We will use device_is_coherent() as a helper function, which will be
used in the next patch.

There is a MIPS-specific plat_device_is_coherent(), but we need a more
generic solution, so add and use a new function pointer in dma_map_ops.

Cc: sta...@vger.kernel.org
Signed-off-by: Huacai Chen 
---
 arch/mips/cavium-octeon/dma-octeon.c   |  3 ++-
 arch/mips/include/asm/mach-generic/dma-coherence.h |  6 +++---
 arch/mips/loongson64/common/dma-swiotlb.c  |  1 +
 arch/mips/mm/dma-default.c |  3 ++-
 arch/mips/netlogic/common/nlm-dma.c|  3 ++-
 include/linux/dma-mapping.h| 10 ++
 6 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
b/arch/mips/cavium-octeon/dma-octeon.c
index c64bd87..cd1a133 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -324,7 +324,8 @@ static struct octeon_dma_map_ops _octeon_pci_dma_map_ops = {
.sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
.sync_sg_for_device = octeon_dma_sync_sg_for_device,
.mapping_error = swiotlb_dma_mapping_error,
-   .dma_supported = swiotlb_dma_supported
+   .dma_supported = swiotlb_dma_supported,
+   .device_is_coherent = plat_device_is_coherent
},
 };
 
diff --git a/arch/mips/loongson64/common/dma-swiotlb.c 
b/arch/mips/loongson64/common/dma-swiotlb.c
index 34486c1..c758d9b 100644
--- a/arch/mips/loongson64/common/dma-swiotlb.c
+++ b/arch/mips/loongson64/common/dma-swiotlb.c
@@ -119,6 +119,7 @@ static const struct dma_map_ops loongson_dma_map_ops = {
.sync_sg_for_device = loongson_dma_sync_sg_for_device,
.mapping_error = swiotlb_dma_mapping_error,
.dma_supported = loongson_dma_supported,
+   .device_is_coherent = plat_device_is_coherent
 };
 
 void __init plat_swiotlb_setup(void)
diff --git a/arch/mips/mm/dma-default.c b/arch/mips/mm/dma-default.c
index c01bd20..6e18301 100644
--- a/arch/mips/mm/dma-default.c
+++ b/arch/mips/mm/dma-default.c
@@ -407,7 +407,8 @@ static const struct dma_map_ops mips_default_dma_map_ops = {
.sync_sg_for_cpu = mips_dma_sync_sg_for_cpu,
.sync_sg_for_device = mips_dma_sync_sg_for_device,
.mapping_error = mips_dma_mapping_error,
-   .dma_supported = mips_dma_supported
+   .dma_supported = mips_dma_supported,
+   .device_is_coherent = plat_device_is_coherent
 };
 
const struct dma_map_ops *mips_dma_map_ops = &mips_default_dma_map_ops;
diff --git a/arch/mips/netlogic/common/nlm-dma.c 
b/arch/mips/netlogic/common/nlm-dma.c
index 0ec9d9d..aa11b27 100644
--- a/arch/mips/netlogic/common/nlm-dma.c
+++ b/arch/mips/netlogic/common/nlm-dma.c
@@ -79,7 +79,8 @@ const struct dma_map_ops nlm_swiotlb_dma_ops = {
.sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
.sync_sg_for_device = swiotlb_sync_sg_for_device,
.mapping_error = swiotlb_dma_mapping_error,
-   .dma_supported = swiotlb_dma_supported
+   .dma_supported = swiotlb_dma_supported,
+   .device_is_coherent = plat_device_is_coherent
 };
 
 void __init plat_swiotlb_setup(void)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce981..08da227 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -131,6 +131,7 @@ struct dma_map_ops {
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
u64 (*get_required_mask)(struct device *dev);
 #endif
+   int (*device_is_coherent)(struct device *dev);
int is_phys;
 };
 
@@ -697,6 +698,15 @@ static inline void *dma_zalloc_coherent(struct device 
*dev, size_t size,
 }
 
 #ifdef CONFIG_HAS_DMA
+static inline int device_is_coherent(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+   if (ops && ops->device_is_coherent)
+   return ops->device_is_coherent(dev);
+   else
+   return 1;/* compatible behavior */
+}
+
 static inline int dma_get_cache_alignment(void)
 {
 #ifdef ARCH_DMA_MINALIGN
-- 
2.7.0





[PATCH] scsi: ufs: fix a pclint warning

2017-09-19 Thread Zang Leigang
Signed-off-by: Zang Leigang 

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 794a4600e952..deb77535b8c9 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -3586,7 +3586,7 @@ static int ufshcd_uic_pwr_ctrl(struct ufs_hba *hba, 
struct uic_command *cmd)
status = ufshcd_get_upmcrs(hba);
if (status != PWR_LOCAL) {
dev_err(hba->dev,
-   "pwr ctrl cmd 0x%0x failed, host upmcrs:0x%x\n",
+   "pwr ctrl cmd 0x%x failed, host upmcrs:0x%x\n",
cmd->command, status);
ret = (status != PWR_OK) ? status : -1;
}
-- 
2.14.1



[RESEND PATCH v4 4/6] libsas: Use new workqueue to run sas event and disco event

2017-09-19 Thread Jason Yan
Now all libsas work items are queued to the scsi host workqueue,
including the sas event work posted by the LLDD and the sas discovery
work, and a sas hotplug flow may be divided into several works.
E.g. when libsas receives a PORTE_BYTES_DMAED event, we currently
process it with the following steps:
sas_form_port  --- run in work in shost workq
sas_discover_domain  --- run in another work in shost workq
...
sas_probe_devices  --- run in new work in shost workq
We found that during hot-add of a device, libsas may need to run several
works in the same workqueue to add the device to the system; the process
is not atomic, so it may be interrupted by other sas event works, like
PHYE_LOSS_OF_SIGNAL.

This patch is preparation for executing libsas sas events synchronously.
We need to use different workqueues to run sas events and disco events;
otherwise the work would block waiting for another chained work in the
same workqueue (see the sketch below).
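
To make the problem concrete, a minimal made-up example of the deadlock when
the chained work and the flush share one single-threaded workqueue:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *shared_wq;	/* single-threaded */
static struct work_struct disco_work;

static void disco_fn(struct work_struct *work)
{
	/* the discovery step the event handler wants to wait for */
}

static void event_fn(struct work_struct *work)
{
	queue_work(shared_wq, &disco_work);
	/*
	 * Deadlock: event_fn occupies the only worker of shared_wq, so
	 * disco_fn cannot run until event_fn returns, yet flush_workqueue()
	 * waits for disco_fn to finish.  Hence the separate event_q/disco_q.
	 */
	flush_workqueue(shared_wq);
}

static int __init example_init(void)
{
	shared_wq = create_singlethread_workqueue("example");
	if (!shared_wq)
		return -ENOMEM;
	INIT_WORK(&disco_work, disco_fn);
	return 0;
}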

Signed-off-by: Yijing Wang 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
Signed-off-by: Jason Yan 
---
 drivers/scsi/libsas/sas_discover.c |  2 +-
 drivers/scsi/libsas/sas_event.c|  4 ++--
 drivers/scsi/libsas/sas_init.c | 18 ++
 include/scsi/libsas.h  |  3 +++
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/libsas/sas_discover.c 
b/drivers/scsi/libsas/sas_discover.c
index 60de662..14f714d 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -534,7 +534,7 @@ static void sas_chain_work(struct sas_ha_struct *ha, struct 
sas_work *sw)
 * workqueue, or known to be submitted from a context that is
 * not racing against draining
 */
-   scsi_queue_work(ha->core.shost, &sw->work);
+   queue_work(ha->disco_q, &sw->work);
 }
 
 static void sas_chain_event(int event, unsigned long *pending,
diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 6db941d..b124198 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -40,7 +40,7 @@ int sas_queue_work(struct sas_ha_struct *ha, struct sas_work 
*sw)
if (list_empty(&sw->drain_node))
list_add_tail(&sw->drain_node, &ha->defer_q);
} else
-   rc = scsi_queue_work(ha->core.shost, &sw->work);
+   rc = queue_work(ha->event_q, &sw->work);
 
return rc;
 }
@@ -61,7 +61,7 @@ static int sas_queue_event(int event, struct sas_work *work,
 
 void __sas_drain_work(struct sas_ha_struct *ha)
 {
-   struct workqueue_struct *wq = ha->core.shost->work_q;
+   struct workqueue_struct *wq = ha->event_q;
struct sas_work *sw, *_sw;
int ret;
 
diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index e2d98a8..b49c46f 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -109,6 +109,7 @@ void sas_hash_addr(u8 *hashed, const u8 *sas_addr)
 
 int sas_register_ha(struct sas_ha_struct *sas_ha)
 {
+   char name[64];
int error = 0;
 
mutex_init(&sas_ha->disco_mutex);
@@ -142,10 +143,24 @@ int sas_register_ha(struct sas_ha_struct *sas_ha)
goto Undo_ports;
}
 
+   error = -ENOMEM;
+   snprintf(name, sizeof(name), "%s_event_q", dev_name(sas_ha->dev));
+   sas_ha->event_q = create_singlethread_workqueue(name);
+   if (!sas_ha->event_q)
+   goto Undo_ports;
+
+   snprintf(name, sizeof(name), "%s_disco_q", dev_name(sas_ha->dev));
+   sas_ha->disco_q = create_singlethread_workqueue(name);
+   if (!sas_ha->disco_q)
+   goto Undo_event_q;
+
INIT_LIST_HEAD(&sas_ha->eh_done_q);
INIT_LIST_HEAD(&sas_ha->eh_ata_q);
 
return 0;
+
+Undo_event_q:
+   destroy_workqueue(sas_ha->event_q);
 Undo_ports:
sas_unregister_ports(sas_ha);
 Undo_phys:
@@ -176,6 +191,9 @@ int sas_unregister_ha(struct sas_ha_struct *sas_ha)
__sas_drain_work(sas_ha);
mutex_unlock(_ha->drain_mutex);
 
+   destroy_workqueue(sas_ha->disco_q);
+   destroy_workqueue(sas_ha->event_q);
+
return 0;
 }
 
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 08c1c32..d1ab157 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -388,6 +388,9 @@ struct sas_ha_struct {
struct device *dev;   /* should be set */
struct module *lldd_module; /* should be set */
 
+   struct workqueue_struct *event_q;
+   struct workqueue_struct *disco_q;
+
u8 *sas_addr; /* must be set */
u8 hashed_sas_addr[HASHED_SAS_ADDR_SIZE];
 
-- 
2.5.0



[RESEND PATCH v4 0/6] Enhance libsas hotplug feature

2017-09-19 Thread Jason Yan
Thanks to Martin K. Petersen for applying some of the tidy-up patches, so I do
not have to maintain those patches out of tree. I will only send the rest
of them in the next days if needed.

Now the libsas hotplug code has some issues; Dan Williams reported
a similar bug here before:
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg39187.html

The issues we have found
1. if the LLDD burst-reports lots of phy-up/phy-down sas events, some events
   may be lost because the same sas event is already pending; finally the
   libsas topology may differ from the hardware.
2. when it receives a phy down sas event, libsas calls sas_deform_port to
   remove devices; it first deletes the sas port, then puts a destruction
   discovery event in a new work and queues it at the tail of the workqueue.
   Once the sas port is deleted, its child devices are deleted too, so when
   the destruction work starts it finds the target device has already been
   removed, and reports a sysfs warning.
3. since a hotplug process is divided into several works, a phy up
   sas event can be inserted between the phy down works, like
   destruction work ---> PORTE_BYTES_DMAED (sas_form_port) ---> PHYE_LOSS_OF_SIGNAL
   so the hot remove flow would be broken by the PORTE_BYTES_DMAED event,
   which is not what we expected, and issues would occur.

v3->v4: -use dynamic alloced work and support shutting down the phy if active 
event reached the threshold
-use flush_workqueue instead of wait-completion to process discover 
events synchronously
-direct call probe and destruct function
v2->v3: some code improvements suggested by Johannes and John,
split v2 patch 2 into several small pathes.
v1->v2: some code improvements suggested by John Garry

Jason Yan (6):
  libsas: Use dynamic alloced work to avoid sas event lost
  libsas: shut down the PHY if events reached the threshold
  libsas: make the event threshold configurable
  libsas: Use new workqueue to run sas event and disco event
  libsas: libsas: use flush_workqueue to process disco events
synchronously
  libsas: direct call probe and destruct

 drivers/scsi/hisi_sas/hisi_sas_main.c |  6 +++
 drivers/scsi/libsas/sas_ata.c |  1 -
 drivers/scsi/libsas/sas_discover.c| 36 --
 drivers/scsi/libsas/sas_event.c   | 79 ++
 drivers/scsi/libsas/sas_expander.c|  2 +-
 drivers/scsi/libsas/sas_init.c| 91 +--
 drivers/scsi/libsas/sas_internal.h|  7 +++
 drivers/scsi/libsas/sas_phy.c | 73 ++--
 drivers/scsi/libsas/sas_port.c| 25 ++
 include/scsi/libsas.h | 29 ---
 include/scsi/scsi_transport_sas.h |  1 +
 11 files changed, 258 insertions(+), 92 deletions(-)

-- 
2.5.0



[RESEND PATCH v4 3/6] libsas: make the event threshold configurable

2017-09-19 Thread Jason Yan
Add a sysfs attr that LLDD can configure it for every host. We made
a example in hisi_sas. Other LLDDs using libsas can implement it if
they want.

Suggested-by: Hannes Reinecke 
Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/hisi_sas/hisi_sas_main.c |  6 ++
 drivers/scsi/libsas/sas_init.c| 27 +++
 include/scsi/libsas.h |  1 +
 3 files changed, 34 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c 
b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 5e47abb..9eceed1 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1470,6 +1470,11 @@ EXPORT_SYMBOL_GPL(hisi_sas_rescan_topology);
 struct scsi_transport_template *hisi_sas_stt;
 EXPORT_SYMBOL_GPL(hisi_sas_stt);
 
+struct device_attribute *host_attrs[] = {
+&dev_attr_phy_event_threshold,
+NULL,
+};
+
 static struct scsi_host_template _hisi_sas_sht = {
.module = THIS_MODULE,
.name   = DRV_NAME,
@@ -1489,6 +1494,7 @@ static struct scsi_host_template _hisi_sas_sht = {
.eh_bus_reset_handler   = sas_eh_bus_reset_handler,
.target_destroy = sas_target_destroy,
.ioctl  = sas_ioctl,
+   .shost_attrs= host_attrs,
 };
 struct scsi_host_template *hisi_sas_sht = &_hisi_sas_sht;
 EXPORT_SYMBOL_GPL(hisi_sas_sht);
diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index b1e03d5..e2d98a8 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -537,6 +537,33 @@ static struct sas_function_template sft = {
.smp_handler = sas_smp_handler,
 };
 
+static inline ssize_t phy_event_threshold_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct Scsi_Host *shost = class_to_shost(dev);
+   struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+
+   return scnprintf(buf, PAGE_SIZE, "%u\n", sha->event_thres);
+}
+
+static inline ssize_t phy_event_threshold_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+   struct Scsi_Host *shost = class_to_shost(dev);
+   struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+
+   sha->event_thres = simple_strtol(buf, NULL, 10);
+
+   return count;
+}
+
+DEVICE_ATTR(phy_event_threshold,
+ S_IRUGO|S_IWUSR,
+ phy_event_threshold_show,
+ phy_event_threshold_store);
+EXPORT_SYMBOL_GPL(dev_attr_phy_event_threshold);
+
 struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *dft)
 {
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 2fa0b13..08c1c32 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -679,6 +679,7 @@ extern int sas_bios_param(struct scsi_device *,
  sector_t capacity, int *hsc);
 extern struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *);
+extern struct device_attribute dev_attr_phy_event_threshold;
 
 int  sas_discover_root_expander(struct domain_device *);
 
-- 
2.5.0



[RESEND PATCH v4 6/6] libsas: direct call probe and destruct

2017-09-19 Thread Jason Yan
Commit 87c8331f ([SCSI] libsas: prevent domain rediscovery competing
with ata error handling) introduced the disco mutex to prevent rediscovery
from competing with ata error handling and put the whole revalidation under
the mutex. But the rphy add/remove needs to wait for the error handling,
which also grabs the disco mutex. This may lead to a deadlock. So the
probe and destruct events were introduced to do the rphy add/remove
asynchronously and outside of the lock.

The asynchronously processed works make the whole discovery process
non-atomic, so other events may interrupt the process. For example,
if a loss of signal event is inserted before the probe event,
sas_deform_port() is called and the port will be deleted.

And sas_port_delete() may run before the destruct event, but the
port-x:x is the top parent of the end device or expander. This leads to
a kernel WARNING such as:

[   82.042979] sysfs group 'power' not found for kobject 'phy-1:0:22'
[   82.042983] [ cut here ]
[   82.042986] WARNING: CPU: 54 PID: 1714 at fs/sysfs/group.c:237
sysfs_remove_group+0x94/0xa0
[   82.043059] Call trace:
[   82.043082] [] sysfs_remove_group+0x94/0xa0
[   82.043085] [] dpm_sysfs_remove+0x60/0x70
[   82.043086] [] device_del+0x138/0x308
[   82.043089] [] sas_phy_delete+0x38/0x60
[   82.043091] [] do_sas_phy_delete+0x6c/0x80
[   82.043093] [] device_for_each_child+0x58/0xa0
[   82.043095] [] sas_remove_children+0x40/0x50
[   82.043100] [] sas_destruct_devices+0x64/0xa0
[   82.043102] [] process_one_work+0x1fc/0x4b0
[   82.043104] [] worker_thread+0x50/0x490
[   82.043105] [] kthread+0xfc/0x128
[   82.043107] [] ret_from_fork+0x10/0x50

Make probe and destruct direct calls in the disco and revalidate functions,
but put them outside the lock. The whole discovery or revalidation won't
be interrupted by other events, and the DISCE_PROBE and DISCE_DESTRUCT
events are deleted as a result of the direct call.

Introduce a new list to destruct the sas_port and put the port deletion after
the destruct. This ensures the right order of destroying the sysfs
kobjects and fixes the warning above.

Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/libsas/sas_ata.c  |  1 -
 drivers/scsi/libsas/sas_discover.c | 34 --
 drivers/scsi/libsas/sas_expander.c |  2 +-
 drivers/scsi/libsas/sas_internal.h |  1 +
 drivers/scsi/libsas/sas_port.c |  3 +++
 include/scsi/libsas.h  |  3 +--
 include/scsi/scsi_transport_sas.h  |  1 +
 7 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 87f5e694..dbe8c5e 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -729,7 +729,6 @@ int sas_discover_sata(struct domain_device *dev)
if (res)
return res;
 
-   sas_discover_event(dev->port, DISCE_PROBE);
return 0;
 }
 
diff --git a/drivers/scsi/libsas/sas_discover.c 
b/drivers/scsi/libsas/sas_discover.c
index 14f714d..d5f5b58 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -212,13 +212,9 @@ void sas_notify_lldd_dev_gone(struct domain_device *dev)
}
 }
 
-static void sas_probe_devices(struct work_struct *work)
+static void sas_probe_devices(struct asd_sas_port *port)
 {
struct domain_device *dev, *n;
-   struct sas_discovery_event *ev = to_sas_discovery_event(work);
-   struct asd_sas_port *port = ev->port;
-
-   clear_bit(DISCE_PROBE, &port->disc.pending);
 
/* devices must be domain members before link recovery and probe */
list_for_each_entry(dev, &port->disco_list, disco_list_node) {
@@ -294,7 +290,6 @@ int sas_discover_end_dev(struct domain_device *dev)
res = sas_notify_lldd_dev_found(dev);
if (res)
return res;
-   sas_discover_event(dev->port, DISCE_PROBE);
 
return 0;
 }
@@ -353,13 +348,9 @@ static void sas_unregister_common_dev(struct asd_sas_port 
*port, struct domain_d
sas_put_device(dev);
 }
 
-static void sas_destruct_devices(struct work_struct *work)
+void sas_destruct_devices(struct asd_sas_port *port)
 {
struct domain_device *dev, *n;
-   struct sas_discovery_event *ev = to_sas_discovery_event(work);
-   struct asd_sas_port *port = ev->port;
-
-   clear_bit(DISCE_DESTRUCT, &port->disc.pending);
 
list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) {
list_del_init(>disco_list_node);
@@ -370,6 +361,16 @@ static void sas_destruct_devices(struct work_struct *work)
}
 }
 
+void sas_destruct_ports(struct asd_sas_port *port)
+{
+   struct sas_port *sas_port, *p;
+
list_for_each_entry_safe(sas_port, p, &port->sas_port_del_list, 

[RESEND PATCH v4 1/6] libsas: Use dynamic alloced work to avoid sas event lost

2017-09-19 Thread Jason Yan
Now libsas hotplug work is static: every sas event type has its own
static work, and the LLDD driver queues the hotplug work into shost->work_q.
If the LLDD driver burst-posts lots of hotplug events to libsas, the hotplug
events may be left pending in the workqueue like

shost->work_q
new work[PORTE_BYTES_DMAED] --> |[PHYE_LOSS_OF_SIGNAL][PORTE_BYTES_DMAED] -> 
processing
|<---wait worker to process>|

In this case, when a new PORTE_BYTES_DMAED event comes in, libsas tries to
queue it to shost->work_q, but this work is already pending, so it would be
lost. Finally, libsas deletes the related sas port and sas devices, but the
LLDD driver expects libsas to add the sas port and devices (last sas event).

This patch uses dynamically allocated work to avoid this issue.

Signed-off-by: Yijing Wang 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
Signed-off-by: Jason Yan 
---
 drivers/scsi/libsas/sas_event.c| 75 +-
 drivers/scsi/libsas/sas_init.c | 27 --
 drivers/scsi/libsas/sas_internal.h |  6 +++
 drivers/scsi/libsas/sas_phy.c  | 44 +-
 drivers/scsi/libsas/sas_port.c | 18 -
 include/scsi/libsas.h  | 16 +---
 6 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 0bb9eef..6db941d 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -29,7 +29,8 @@
 
 int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
 {
-   int rc = 0;
+   /* it's added to the defer_q when draining so return succeed */
+   int rc = 1;
 
if (!test_bit(SAS_HA_REGISTERED, &ha->state))
return 0;
@@ -44,19 +45,15 @@ int sas_queue_work(struct sas_ha_struct *ha, struct 
sas_work *sw)
return rc;
 }
 
-static int sas_queue_event(int event, unsigned long *pending,
-   struct sas_work *work,
+static int sas_queue_event(int event, struct sas_work *work,
struct sas_ha_struct *ha)
 {
-   int rc = 0;
+   unsigned long flags;
+   int rc;
 
-   if (!test_and_set_bit(event, pending)) {
-   unsigned long flags;
-
-   spin_lock_irqsave(&ha->lock, flags);
-   rc = sas_queue_work(ha, work);
-   spin_unlock_irqrestore(&ha->lock, flags);
-   }
+   spin_lock_irqsave(&ha->lock, flags);
+   rc = sas_queue_work(ha, work);
+   spin_unlock_irqrestore(&ha->lock, flags);
 
return rc;
 }
@@ -66,6 +63,7 @@ void __sas_drain_work(struct sas_ha_struct *ha)
 {
struct workqueue_struct *wq = ha->core.shost->work_q;
struct sas_work *sw, *_sw;
+   int ret;
 
set_bit(SAS_HA_DRAINING, &ha->state);
/* flush submitters */
@@ -78,7 +76,11 @@ void __sas_drain_work(struct sas_ha_struct *ha)
clear_bit(SAS_HA_DRAINING, &ha->state);
list_for_each_entry_safe(sw, _sw, &ha->defer_q, drain_node) {
list_del_init(&sw->drain_node);
-   sas_queue_work(ha, sw);
+   ret = sas_queue_work(ha, sw);
+   if (ret != 1) {
+   struct asd_sas_event *ev = to_asd_sas_event(&sw->work);
+   sas_free_event(ev);
+   }
}
spin_unlock_irq(&ha->lock);
 }
@@ -119,29 +121,68 @@ void sas_enable_revalidation(struct sas_ha_struct *ha)
if (!test_and_clear_bit(ev, &d->pending))
continue;
 
-   sas_queue_event(ev, &d->pending, &d->disc_work[ev].work, ha);
+   sas_queue_event(ev, &d->disc_work[ev].work, ha);
}
mutex_unlock(&ha->disco_mutex);
 }
 
+
+static void sas_port_event_worker(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+
+   sas_port_event_fns[ev->event](work);
+   sas_free_event(ev);
+}
+
+static void sas_phy_event_worker(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+
+   sas_phy_event_fns[ev->event](work);
+   sas_free_event(ev);
+}
+
 static int sas_notify_port_event(struct asd_sas_phy *phy, enum port_event 
event)
 {
+   struct asd_sas_event *ev;
struct sas_ha_struct *ha = phy->ha;
+   int ret;
 
BUG_ON(event >= PORT_NUM_EVENTS);
 
-   return sas_queue_event(event, &phy->port_events_pending,
-  &phy->port_events[event].work, ha);
+   ev = sas_alloc_event(phy);
+   if (!ev)
+   return -ENOMEM;
+
+   INIT_SAS_EVENT(ev, sas_port_event_worker, phy, event);
+
+   ret = sas_queue_event(event, &ev->work, ha);
+   if (ret != 1)
+   sas_free_event(ev);
+
+   return ret;
 }
 
 int sas_notify_phy_event(struct 

[RESEND PATCH v4 5/6] libsas: use flush_workqueue to process disco events synchronously

2017-09-19 Thread Jason Yan
Use flush_workqueue to ensure the disco and revalidate events are processed
synchronously.
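
For context, a minimal sketch of the pattern the hunks below apply, assuming
the dedicated disco_q workqueue added earlier in this series:

	/* queue the discovery work on the discovery workqueue ... */
	sas_discover_event(phy->port, DISCE_DISCOVER_DOMAIN);
	/*
	 * ... then block until every work item already queued on disco_q has
	 * finished, so the caller continues only once discovery is done.
	 */
	flush_workqueue(phy->port->ha->disco_q);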

Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/libsas/sas_port.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index 9326628..64722f4 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -192,6 +192,7 @@ static void sas_form_port(struct asd_sas_phy *phy)
si->dft->lldd_port_formed(phy);
 
sas_discover_event(phy->port, DISCE_DISCOVER_DOMAIN);
+   flush_workqueue(sas_ha->disco_q);
 }
 
 /**
@@ -277,6 +278,9 @@ void sas_porte_broadcast_rcvd(struct work_struct *work)
 
SAS_DPRINTK("broadcast received: %d\n", prim);
sas_discover_event(phy->port, DISCE_REVALIDATE_DOMAIN);
+
+   if (phy->port)
+   flush_workqueue(phy->port->ha->disco_q);
 }
 
 void sas_porte_link_reset_err(struct work_struct *work)
-- 
2.5.0



[RESEND PATCH v4 2/6] libsas: shut down the PHY if events reached the threshold

2017-09-19 Thread Jason Yan
If the PHY bursts too many events, we will allocate a lot of events for the
worker. This may lead to memory exhaustion.

Dan Williams suggested shutting down the PHY if the events reach the
threshold, because in this case the PHY may have gone into some erroneous
state. Users can re-enable the PHY via sysfs if they want.

We cannot use a fixed memory pool because if we run out of events, the
shutdown event and the loss-of-signal event will be lost too. The events
still need to be allocated and processed in this case.
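
For illustration, a minimal sketch (not from the patch; the pool name is made
up) of the rejected fixed-pool alternative and its failure mode:

	/*
	 * Under memory pressure a fixed, preallocated pool can run dry, and
	 * then the allocation fails regardless of which event is being
	 * reported, so even PHYE_SHUTDOWN and PHYE_LOSS_OF_SIGNAL would be
	 * dropped.  Hence the patch keeps kmem_cache allocation and adds a
	 * threshold instead.
	 */
	ev = mempool_alloc(sas_event_pool, GFP_ATOMIC);
	if (!ev)
		return NULL;	/* critical event lost as well */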

Suggested-by: Dan Williams 
Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
---
 drivers/scsi/libsas/sas_init.c | 21 -
 drivers/scsi/libsas/sas_phy.c  | 31 ++-
 include/scsi/libsas.h  |  6 ++
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index 85c278a..b1e03d5 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -122,6 +122,8 @@ int sas_register_ha(struct sas_ha_struct *sas_ha)
INIT_LIST_HEAD(&sas_ha->defer_q);
INIT_LIST_HEAD(&sas_ha->eh_dev_q);
 
+   sas_ha->event_thres = SAS_PHY_SHUTDOWN_THRES;
+
error = sas_register_phys(sas_ha);
if (error) {
printk(KERN_NOTICE "couldn't register sas phys:%d\n", error);
@@ -556,14 +558,31 @@ EXPORT_SYMBOL_GPL(sas_domain_attach_transport);
 
 struct asd_sas_event *sas_alloc_event(struct asd_sas_phy *phy)
 {
+   struct asd_sas_event *event;
gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
 
-   return kmem_cache_zalloc(sas_event_cache, flags);
+   event = kmem_cache_zalloc(sas_event_cache, flags);
+   if (!event)
+   return NULL;
+
+   atomic_inc(&phy->event_nr);
+   if (atomic_read(&phy->event_nr) > phy->ha->event_thres &&
+   !phy->in_shutdown) {
+   phy->in_shutdown = 1;
+   sas_printk("The phy%02d bursting events, shut it down.\n",
+  phy->id);
+   sas_notify_phy_event(phy, PHYE_SHUTDOWN);
+   }
+
+   return event;
 }
 
 void sas_free_event(struct asd_sas_event *event)
 {
+   struct asd_sas_phy *phy = event->phy;
+
kmem_cache_free(sas_event_cache, event);
+   atomic_dec(&phy->event_nr);
 }
 
 /* -- SAS Class register/unregister -- */
diff --git a/drivers/scsi/libsas/sas_phy.c b/drivers/scsi/libsas/sas_phy.c
index 59f8292..3df1eec 100644
--- a/drivers/scsi/libsas/sas_phy.c
+++ b/drivers/scsi/libsas/sas_phy.c
@@ -35,6 +35,7 @@ static void sas_phye_loss_of_signal(struct work_struct *work)
struct asd_sas_event *ev = to_asd_sas_event(work);
struct asd_sas_phy *phy = ev->phy;
 
+   phy->in_shutdown = 0;
phy->error = 0;
sas_deform_port(phy, 1);
 }
@@ -44,6 +45,7 @@ static void sas_phye_oob_done(struct work_struct *work)
struct asd_sas_event *ev = to_asd_sas_event(work);
struct asd_sas_phy *phy = ev->phy;
 
+   phy->in_shutdown = 0;
phy->error = 0;
 }
 
@@ -105,6 +107,32 @@ static void sas_phye_resume_timeout(struct work_struct 
*work)
 }
 
 
+static void sas_phye_shutdown(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+   struct asd_sas_phy *phy = ev->phy;
+   struct sas_ha_struct *sas_ha = phy->ha;
+   struct sas_internal *i =
+   to_sas_internal(sas_ha->core.shost->transportt);
+
+   if (phy->enabled && i->dft->lldd_control_phy) {
+   int ret;
+
+   phy->error = 0;
+   phy->enabled = 0;
+   ret = i->dft->lldd_control_phy(phy, PHY_FUNC_DISABLE, NULL);
+   if (ret)
+   sas_printk("lldd disable phy%02d returned %d\n",
+   phy->id, ret);
+
+   } else if (!i->dft->lldd_control_phy)
+   sas_printk("lldd does not support phy%02d control\n", phy->id);
+   else
+   sas_printk("phy%02d is not enabled, cannot shutdown\n",
+   phy->id);
+
+}
+
 /* -- Phy class registration -- */
 
 int sas_register_phys(struct sas_ha_struct *sas_ha)
@@ -116,6 +144,7 @@ int sas_register_phys(struct sas_ha_struct *sas_ha)
struct asd_sas_phy *phy = sas_ha->sas_phy[i];
 
phy->error = 0;
+   atomic_set(&phy->event_nr, 0);
INIT_LIST_HEAD(&phy->port_phy_el);
 
phy->port = NULL;
@@ -151,5 +180,5 @@ const work_func_t sas_phy_event_fns[PHY_NUM_EVENTS] = {
[PHYE_OOB_ERROR] = sas_phye_oob_error,
[PHYE_SPINUP_HOLD] = sas_phye_spinup_hold,
[PHYE_RESUME_TIMEOUT] = sas_phye_resume_timeout,
-
+   [PHYE_SHUTDOWN] = sas_phye_shutdown,
 };
diff --git