date:20170406

RE: [PATCH] scsi: aacraid: fix PCI error recovery path

2017-04-06 Thread Dave Carroll

> From: Guilherme G. Piccoli [mailto:gpicc...@linux.vnet.ibm.com]
> Sent: Thursday, April 06, 2017 3:12 PM
> To: dl-esc-Aacraid Linux Driver 
> Cc: gpicc...@linux.vnet.ibm.com; linux-scsi@vger.kernel.org; Raghava Aditya
> Renukunta 
> Subject: [PATCH] scsi: aacraid: fix PCI error recovery path
> 
> 
> During a PCI error recovery, if aac_check_health() is not aware that a PCI 
> error
> happened and we have an offline PCI channel, it might trigger some errors 
> (like
> NULL pointer dereference) and inhibit the error recovery process to complete.
> 
> This patch makes the health check procedure aware of PCI channel issues, and 
> in
> case of error recovery process, the function
> aac_adapter_check_health() returns -1 and let the recovery process to complete
> successfully. This patch was tested on upstream kernel
> v4.11-rc5 in PowerPC ppc64le architecture with adapter 9005:028d
> (VID:DID) - the error recovery procedure was able to recover fine.
> 
> Fixes: 5c63f7f710bd ("aacraid: Added EEH support")
> Cc: sta...@vger.kernel.org # v4.6+
> Signed-off-by: Guilherme G. Piccoli 
> ---
Reviewed-by: Dave Carroll

Re: [PATCH] scsi: qla2xxx: remove some redundant pointer assignments

2017-04-06 Thread Himanshu Madhani



On Thu, 6 Apr 2017, 4:19am, Colin King wrote:

> From: Colin Ian King 
> 
> There are several local or function parameter pointers that are
> being assigned NULL after a kfree where and these have no effect
> and hence can be removed.
> 
> Fixes various cppcheck warnings:
> 
> "Assignment of function parameter has no effect outside the
> function. Did you forget dereferencing it"
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/scsi/qla2xxx/qla_os.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 83d61d2142e9..1c7957903283 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -423,7 +423,6 @@ static void qla2x00_free_req_que(struct qla_hw_data *ha, 
> struct req_que *req)
>   kfree(req->outstanding_cmds);
>  
>   kfree(req);
> - req = NULL;
>  }
>  
>  static void qla2x00_free_rsp_que(struct qla_hw_data *ha, struct rsp_que *rsp)
> @@ -439,7 +438,6 @@ static void qla2x00_free_rsp_que(struct qla_hw_data *ha, 
> struct rsp_que *rsp)
>   rsp->ring, rsp->dma);
>   }
>   kfree(rsp);
> - rsp = NULL;
>  }
>  
>  static void qla2x00_free_queues(struct qla_hw_data *ha)
> @@ -653,7 +651,6 @@ qla2x00_sp_free_dma(void *ptr)
>   ha->gbl_dsd_inuse -= ctx1->dsd_use_cnt;
>   ha->gbl_dsd_avail += ctx1->dsd_use_cnt;
>   mempool_free(ctx1, ha->ctx_mempool);
> - ctx1 = NULL;
>   }
>  
>   CMD_SP(cmd) = NULL;
> @@ -3256,7 +3253,6 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct 
> pci_device_id *id)
>   }
>   pci_release_selected_regions(ha->pdev, ha->bars);
>   kfree(ha);
> - ha = NULL;
>  
>  probe_out:
>   pci_disable_device(pdev);
> @@ -3504,7 +3500,6 @@ qla2x00_remove_one(struct pci_dev *pdev)
>  
>   pci_release_selected_regions(ha->pdev, ha->bars);
>   kfree(ha);
> - ha = NULL;
>  
>   pci_disable_pcie_error_reporting(pdev);
>  
> @@ -3568,7 +3563,6 @@ void qla2x00_free_fcports(struct scsi_qla_host *vha)
>   list_del(>list);
>   qla2x00_clear_loop_id(fcport);
>   kfree(fcport);
> - fcport = NULL;
>   }
>  }
>  
> 

Looks Good. 

Acked-By: Himanshu Madhani

[PATCH] scsi: aacraid: fix PCI error recovery path

2017-04-06 Thread Guilherme G. Piccoli

During a PCI error recovery, if aac_check_health() is not aware that
a PCI error happened and we have an offline PCI channel, it might
trigger some errors (like NULL pointer dereference) and inhibit the
error recovery process to complete.

This patch makes the health check procedure aware of PCI channel
issues, and in case of error recovery process, the function
aac_adapter_check_health() returns -1 and let the recovery process
to complete successfully. This patch was tested on upstream kernel
v4.11-rc5 in PowerPC ppc64le architecture with adapter 9005:028d
(VID:DID) - the error recovery procedure was able to recover fine.

Fixes: 5c63f7f710bd ("aacraid: Added EEH support")
Cc: sta...@vger.kernel.org # v4.6+
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/scsi/aacraid/aacraid.h | 11 ---
 drivers/scsi/aacraid/commsup.c |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index d036a806f31c..d281492009fb 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -1690,9 +1690,6 @@ struct aac_dev
 #define aac_adapter_sync_cmd(dev, command, p1, p2, p3, p4, p5, p6, status, r1, 
r2, r3, r4) \
(dev)->a_ops.adapter_sync_cmd(dev, command, p1, p2, p3, p4, p5, p6, 
status, r1, r2, r3, r4)
 
-#define aac_adapter_check_health(dev) \
-   (dev)->a_ops.adapter_check_health(dev)
-
 #define aac_adapter_restart(dev, bled, reset_type) \
((dev)->a_ops.adapter_restart(dev, bled, reset_type))
 
@@ -2615,6 +2612,14 @@ static inline unsigned int cap_to_cyls(sector_t 
capacity, unsigned divisor)
return capacity;
 }
 
+static inline int aac_adapter_check_health(struct aac_dev *dev)
+{
+   if (unlikely(pci_channel_offline(dev->pdev)))
+   return -1;
+
+   return (dev)->a_ops.adapter_check_health(dev);
+}
+
 /* SCp.phase values */
 #define AAC_OWNER_MIDLEVEL 0x101
 #define AAC_OWNER_LOWLEVEL 0x102
diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index c8172f16cf33..1f4918355fdb 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -1873,7 +1873,8 @@ int aac_check_health(struct aac_dev * aac)
spin_unlock_irqrestore(>fib_lock, flagv);
 
if (BlinkLED < 0) {
-   printk(KERN_ERR "%s: Host adapter dead %d\n", aac->name, 
BlinkLED);
+   printk(KERN_ERR "%s: Host adapter is dead (or got a PCI error) 
%d\n",
+   aac->name, BlinkLED);
goto out;
}
 
-- 
2.11.0

Re: [PATCH 14/25] nbd: don't use req->errors

2017-04-06 Thread Josef Bacik


> On Apr 6, 2017, at 11:39 AM, Christoph Hellwig  wrote:
> 
> Add a nbd-specific field instead.
> 
> Signed-off-by: Christoph Hellwig 

This is fine with me, you can add,

Reviewed-by: Josef Bacik 

Thanks,

Josef

Re: kill req->errors

2017-04-06 Thread Konrad Rzeszutek Wilk

On Thu, Apr 06, 2017 at 05:39:19PM +0200, Christoph Hellwig wrote:
> Currently the request structure has an errors field that is used in
> various different ways.  The oldest drivers use it as an error count,
> blk-mq and the generic timeout code assume that it holds a Linux
> errno for block completions, and various drivers use it for internal
> status values, often overwriting them with Linux errnos later,
> that is unless they are processing passthrough requests in which
> case they'll leave their errors in it.
> 
> This series kills the ->errors field and replaced it with new fields
> in the drivers (or an odd hack in a union in struct request for
> two bitrotting old drivers).
> 
> Note that this series expects that the patch to remove the mg_disk
> driver has been applied already on top of the block for-next tree.

You wouldn't have a git tree to easily test it? Thanks.

[no subject]

2017-04-06 Thread David Buckley

Martin,

I'm rather surprised nobody else has previously reported this as well,
especially as NetApp hadn't received any reports.  The only probably
explanation I could think of is that EL 7 is still based on a 3.10
kernel so is too old to be affected, and that is likely to be what
most NetApp customers are running.

I've been trying to test some additional kernels to try and narrow the
versions affected, and the change in discard_granularity does not
correspond with discard failing (as you suggested).

With kernel 3.13,  discard works as expected with
discard_granularity=4096 and discard_max_bytes=4194304.

With kernel 3.19, discard stops working completely, with
discard_granularity=4096 and discard_max_bytes=8388608.

Do you think the change in discard_max_bytes could be relevant?  If
that comes from the VPD data it seems odd it would change.

I am wondering if part of the issue is that in my use case, UNMAP and
WRITE SAME zeros result in very different results.  With thin
provisioned LUNs, UNMAP requests result in the blocks being freed and
thus shrinks the real size of the LUN allocation.  If WRITE SAME
requests are used to zero the blocks, they remain allocated as part of
the LUN and thus the real size of the LUN grows to match the allocated
size (effectively thick-provisioning the LUN).  This matches what I am
seeing on kernels with discard not working; running 'fstrim
/lun/mount' results in the LUN growing to its max size.

Thank you again for your help with this!

-David

On Thu, Apr 6, 2017 at 10:34 AM, Martin K. Petersen
 wrote:
> David Buckley  writes:
>
> David,
>
>> As I mentioned previously, I'm fairly certain that the issue I'm
>> seeing is due to the fact that while NetApp LUNs are presented as 512B
>> logical/4K physical disks for compatibility, they actually don't
>> support requests smaller than 4K (which makes sense as NetApp LUNs are
>> actually just files allocated on the 4K-block WAFL filesystem).
>
> That may be. But they should still deallocate all the whole 4K blocks
> described by an UNMAP request. Even if head and tail are not aligned.
>
>> Let me know if there's any additional information I can provide. This
>> has resulted in a 2-3x increase in raw disk requirements for some
>> workloads (unfortunately on SSD too), and I'd love to find a solution
>> that doesn't require rolling back to a 3.10 kernel.
>
> I just posted some patches yesterday that will address this (using WRITE
> SAME w/ UNMAP for block zeroing and allowing discards to be sent using
> the UNMAP command, honoring the granularity and alignment suggested by
> the device). That's 4.13 material, though.
>
> The enterprise distros have many customers using NetApp filers. I'm a
> bit puzzled why this is the first we hear of this...
>
> --
> Martin K. Petersen  Oracle Linux Engineering

[PATCH] scsi: sd: fix warning on min_not_zero use

2017-04-06 Thread Jérémy Lefaure

When calling min_not_zero, both arguments should have the same type.
Otherwise the compiler will raise a warning:

  CC  drivers/scsi/sd.o
In file included from ./include/linux/list.h:8:0,
 from ./include/linux/module.h:9,
 from drivers/scsi/sd.c:35:
drivers/scsi/sd.c: In function ‘sd_revalidate_disk’:
./include/linux/kernel.h:755:16: warning: comparison of distinct pointer
types lacks a cast
  (void) ( == );   \
^
./include/linux/kernel.h:758:2: note: in expansion of macro ‘__min’
  __min(typeof(x), typeof(y),   \
  ^
./include/linux/kernel.h:783:39: note: in expansion of macro ‘min’
  __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
   ^~~
drivers/scsi/sd.c:2959:12: note: in expansion of macro ‘min_not_zero’
   rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
^~~~

Casting the BLK_DEF_MAX_SECTORS constant fixes this issue.

Cc:  # v4.4+
Fixes: c3e62673ee20 ("scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is 
unusable")
Signed-off-by: Jérémy Lefaure 
---
 drivers/scsi/sd.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ab9011a6257d..7dce3592033d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2955,9 +2955,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
logical_to_bytes(sdp, sdkp->opt_xfer_blocks) >= PAGE_SIZE) {
q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
-   } else
+   } else {
rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
- BLK_DEF_MAX_SECTORS);
+ (sector_t)BLK_DEF_MAX_SECTORS);
+   }
 
/* Combine with controller limits */
q->limits.max_sectors = min(rw_max, queue_max_hw_sectors(q));
-- 
2.12.2

Re: [PATCH 07/25] virtio_blk: don't use req->errors

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 05:39:26PM +0200, Christoph Hellwig wrote:
> Remove passing req->errors (which at that point is always 0) to
> blk_mq_complete_requestq, and rely on the virtio status code for the
blk_mq_complete_request ^

> serial number passthrough request.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Otherwise looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 06/25] virtio: fix spelling of virtblk_scsi_request_done

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 05:39:25PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
> ---

Fair enough,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 03/25] nvme-fc: fix status code handling in nvme_fc_fcpio_done

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 05:39:22PM +0200, Christoph Hellwig wrote:
> nvme_complete_async_event expects the little endian status code
> including the phase bit, and a new completion handler I plan to
> introduce will do so as well.
> 
> Change the status variable into the little endian format with the
> phase bit used in the NVMe CQE to fix / enable this.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 04/25] nvme: split nvme status from block req->errors

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 05:39:23PM +0200, Christoph Hellwig wrote:
> We want our own clearly defined error field for NVMe passthrough commands,
> and the request errors field is going away in its current form.
> 
> Just store the status and result field in the nvme_request field from
> hardirq completion context (using a new helper) and then generate a
> Linux errno for the block layer only when we actually need it.
> 
> Because we can't overload the status value with a negative error code
> for cancelled command we now have a flags filed in struct nvme_request
> that contains a bit for this condition.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 02/25] block: remove the blk_execute_rq return value

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 05:39:21PM +0200, Christoph Hellwig wrote:
> The function only returns -EIO if rq->errors is non-zero, which is not
> very useful and lets a large number of callers ignore the return value.
> 
> Just let the callers figure out their error themselves.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: problem with discard granularity in sd

2017-04-06 Thread Martin K. Petersen

David Buckley  writes:

David,

> As I mentioned previously, I'm fairly certain that the issue I'm
> seeing is due to the fact that while NetApp LUNs are presented as 512B
> logical/4K physical disks for compatibility, they actually don't
> support requests smaller than 4K (which makes sense as NetApp LUNs are
> actually just files allocated on the 4K-block WAFL filesystem).

That may be. But they should still deallocate all the whole 4K blocks
described by an UNMAP request. Even if head and tail are not aligned.

> Let me know if there's any additional information I can provide. This
> has resulted in a 2-3x increase in raw disk requirements for some
> workloads (unfortunately on SSD too), and I'd love to find a solution
> that doesn't require rolling back to a 3.10 kernel.

I just posted some patches yesterday that will address this (using WRITE
SAME w/ UNMAP for block zeroing and allowing discards to be sent using
the UNMAP command, honoring the granularity and alignment suggested by
the device). That's 4.13 material, though.

The enterprise distros have many customers using NetApp filers. I'm a
bit puzzled why this is the first we hear of this...

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] qla4xxx: drop redundant init_completion

2017-04-06 Thread Martin K. Petersen

Nicholas Mc Guire  writes:

> The redundant init_completion() here seems to be a cut error as
> struct scsi_qla_host only has 4 completion elements to initialize,
> thus the duplicate init_completion(disable_acb_comp) is simply
> removed.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCHv6 0/8] SCSI EH cleanup

2017-04-06 Thread Martin K. Petersen

Hannes Reinecke  writes:

Hannes,

> this is a resend of a small patchset for cleaning up SCSI EH.  Primary
> goal is to make asynchronous aborts mandatory; there hasn't been a
> single report so far where asynchronous abort won't work, so the
> 'no_async_abort' flag has never been used and will be removed with
> this patchset.  Additionally there's a cleanup to detect retries of
> failed commands and to inline commands aborts.  I've also included the
> patch to correctly reset the media access timeout counter to avoid it
> being triggered inadvertedly by several commands timing out for the
> same device.

I would still have liked to see the medium access failures be tracked on
a per-command basis. But since this is my concoction, I guess I'll have
to fix that up after I get back from vacation.

Applied 1-7 to 4.12/scsi-queue. Patch 8 needs reviews.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: hpsa patches in mkp/4.12/scsi-queue and mkp/4.11/scsi-fixes

2017-04-06 Thread Martin K. Petersen

Martin Wilck  writes:

Martin,

> I noticed that the following commits
>
> eb94588dabec scsi: hpsa: fix volume offline state
> 2ef288498087 scsi: hpsa: do not timeout reset operations
> 87b9e6aa87d9 scsi: hpsa: limit outstanding rescans
> 85b29008d8af scsi: hpsa: update check for logical volume status
>
> are included in mkp/4.11/scsi-fixes but not in mkp/4.12/scsi-queue.
> Is there a specific reason for that?

scsi-fixes and scsi-queue are usually forked from linus/master at the
about same time after the merge window closes.

 - Bugfixes for the current release go into scsi-fixes.

 - New features for next release go into scsi-queue.

Consequently, it's perfectly normal for fixes to be more "recent" than
queue. I only rebase the queue if I absolutely have to (i.e. a new
feature depends on something that happened in Linus' tree after the
merge window closed).

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors

2017-04-06 Thread Martin K. Petersen

Christoph Hellwig  writes:

> Ok, the version below simplify skip the function split entirely:

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH v2] scsi: sd: Fix capacity calculation with 32-bit sector_t

2017-04-06 Thread Martin K. Petersen

Bart Van Assche  writes:

>> We previously made sure that the reported disk capacity was less than
>> 0x blocks when the kernel was not compiled with large sector_t
>> support (CONFIG_LBDAF). However, this check assumed that the capacity
>> was reported in units of 512 bytes.
>> 
>> Add a sanity check function to ensure that we only enable disks if the
>> entire reported capacity can be expressed in terms of sector_t.
>
> Reviewed-by: Bart Van Assche 

Applied to 4.11/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-06 Thread Martin K. Petersen


Mauricio,

> The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
> introduced the 'power_status' attribute to enclosure components and
> the associated callbacks.

Applied to 4.12/scsi-queue, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-06 Thread Martin K. Petersen

Bart Van Assche  writes:

> Now that all scsi_device_get() callers check the return value of this
> function, make checking that return value mandatory.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] osd_uld: Check scsi_device_get() return value

2017-04-06 Thread Martin K. Petersen

Bart Van Assche  writes:

Bart,

> scsi_device_get() can fail. Hence check its return value.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-06 Thread Jason Gunthorpe

On Thu, Apr 06, 2017 at 08:33:38AM +0300, Sagi Grimberg wrote:
> 
> >>Note that the nvme completion queues are still on the host memory, so
> >>this means we have lost the ordering between data and completions as
> >>they go to different pcie targets.
> >
> >Hmm, in this simple up/down case with a switch, I think it might
> >actually be OK.
> >
> >Transactions might not complete at the NVMe device before the CPU
> >processes the RDMA completion, however due to the PCI-E ordering rules
> >new TLPs directed to the NVMe will complete after the RMDA TLPs and
> >thus observe the new data. (eg order preserving)
> >
> >It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

Right, this is why I qualified my statement with 'simple up/down case'

Make it any more complex and it clearly stops working sanely, but I
wouldn't worry about unusual PCI-E fabrics at this point..

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

AFAIK PCI-E ordering is explicitly per link, so things that need order
must always traverse the same link.

Jason

Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-06 Thread Logan Gunthorpe



On 05/04/17 11:33 PM, Sagi Grimberg wrote:
> 
>>> Note that the nvme completion queues are still on the host memory, so
>>> this means we have lost the ordering between data and completions as
>>> they go to different pcie targets.
>>
>> Hmm, in this simple up/down case with a switch, I think it might
>> actually be OK.
>>
>> Transactions might not complete at the NVMe device before the CPU
>> processes the RDMA completion, however due to the PCI-E ordering rules
>> new TLPs directed to the NVMe will complete after the RMDA TLPs and
>> thus observe the new data. (eg order preserving)
>>
>> It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.
> 
> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

I'll say I don't have a complete understanding of this myself. However,
my understanding is the completion coming from disk won't be sent toward
the RC until all the all the TLPs reached leg B. Then if the RC sends
TLPs to the p2p device via leg B they will be behind all the TLPs the
disk sent. Or something like that. Obviously this will only work with a
tree topology (which I believe is the only topology that makes sense for
PCI). If you had a mesh topology, then the data could route around
congestion and that would get around the ordering restrictions.

Logan

Re: [RFC 3/8] nvmet: Use p2pmem in nvme target

2017-04-06 Thread Logan Gunthorpe

Hey Sagi,

On 05/04/17 11:47 PM, Sagi Grimberg wrote:
> Because the user can get it wrong, and its our job to do what we can in
> order to prevent the user from screwing itself.

Well, "screwing" themselves seems a bit strong. It wouldn't be much
different from a lot of other tunables in the system. For example, it
would be similar to the user choosing the wrong io scheduler for their
disk or workload. If you change this setting without measuring
performance you probably don't care too much about the result anyway.

> I wasn't against it that much, I'm all for making things "just work"
> with minimal configuration steps, but I'm not sure we can get it
> right without it.

Ok, well in that case I may reconsider this in the next series.

 Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
 save an extra PCI transfer as the NVME card could just take the data
 out of it's own memory. However, at this time, cards with CMB buffers
 don't seem to be available.
>>>
>>> Even if it was available, it would be hard to make real use of this
>>> given that we wouldn't know how to pre-post recv buffers (for in-capsule
>>> data). But let's leave this out of the scope entirely...
>>
>> I don't understand what you're referring to. We'd simply use the CMB
>> buffer as a p2pmem device, why does that change anything?
> 
> I'm referring to the in-capsule data buffers pre-posts that we do.
> Because we prepare a buffer that would contain in-capsule data, we have
> no knowledge to which device the incoming I/O is directed to, which
> means we can (and will) have I/O where the data lies in CMB of device
> A but it's really targeted to device B - which sorta defeats the purpose
> of what we're trying to optimize here...

Well, the way I've had it is that each port gets one p2pmem device. So
you'd only want to put NVMe devices that will work with that p2pmem
device behind that port. Though, I can see that being a difficult
restriction seeing it probably means you'll need to have one port per
nvme device if you want to use the CMB buffer of each device. I'll have
to think about that some. Also, it's worth noting that we aren't even
optimizing in-capsule data at this time.

> Still the user can get it wrong. Not sure we can get a way without
> keeping track of this as new devices join the subsystem.

Yeah, I understand. I'll have to think some more about all of this. I'm
starting to see some ways to improve thing.s

Thanks,

Logan

[PATCH 25/25] block: remove the errors field from struct request

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 14 +-
 block/blk-exec.c |  3 +--
 block/blk-mq.c   | 10 +++---
 block/blk-timeout.c  |  1 -
 include/linux/blkdev.h   |  2 --
 include/trace/events/block.h | 17 +++--
 kernel/trace/blktrace.c  | 24 +++-
 7 files changed, 23 insertions(+), 48 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 316a5399fb15..4f95bdaa1e16 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1633,7 +1633,6 @@ void init_request_from_bio(struct request *req, struct 
bio *bio)
if (bio->bi_opf & REQ_RAHEAD)
req->cmd_flags |= REQ_FAILFAST_MASK;
 
-   req->errors = 0;
req->__sector = bio->bi_iter.bi_sector;
blk_rq_set_prio(req, rq_ioc(bio));
if (ioprio_valid(bio_prio(bio)))
@@ -2561,22 +2560,11 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
 {
int total_bytes;
 
-   trace_block_rq_complete(req->q, req, nr_bytes);
+   trace_block_rq_complete(req, error, nr_bytes);
 
if (!req->bio)
return false;
 
-   /*
-* For fs requests, rq is just carrier of independent bio's
-* and each partial completion should be handled separately.
-* Reset per-request error on each partial completion.
-*
-* TODO: tj: This is too subtle.  It would be better to let
-* low level drivers do what they see fit.
-*/
-   if (!blk_rq_is_passthrough(req))
-   req->errors = 0;
-
if (error && !blk_rq_is_passthrough(req) &&
!(req->rq_flags & RQF_QUIET)) {
char *error_type;
diff --git a/block/blk-exec.c b/block/blk-exec.c
index afa383248c7c..a9451e3b8587 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -69,8 +69,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 
if (unlikely(blk_queue_dying(q))) {
rq->rq_flags |= RQF_QUIET;
-   rq->errors = -ENXIO;
-   __blk_end_request_all(rq, rq->errors);
+   __blk_end_request_all(rq, -ENXIO);
spin_unlock_irq(q->queue_lock);
return;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a6e14a3c87ce..072f139704d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -213,7 +213,6 @@ void blk_mq_rq_ctx_init(struct request_queue *q, struct 
blk_mq_ctx *ctx,
 #endif
rq->special = NULL;
/* tag was already set */
-   rq->errors = 0;
rq->extra_len = 0;
 
INIT_LIST_HEAD(>timeout_list);
@@ -621,8 +620,7 @@ void blk_mq_abort_requeue_list(struct request_queue *q)
 
rq = list_first_entry(_list, struct request, queuelist);
list_del_init(>queuelist);
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
}
 }
 EXPORT_SYMBOL(blk_mq_abort_requeue_list);
@@ -1029,8 +1027,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
pr_err("blk-mq: bad return on queue: %d\n", ret);
case BLK_MQ_RQ_QUEUE_ERROR:
errors++;
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
break;
}
 
@@ -1445,8 +1442,7 @@ static void __blk_mq_try_issue_directly(struct request 
*rq, blk_qc_t *cookie,
 
if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
*cookie = BLK_QC_T_NONE;
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
return;
}
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a30441a200c0..cbff183f3d9f 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -89,7 +89,6 @@ static void blk_rq_timed_out(struct request *req)
ret = q->rq_timed_out_fn(req);
switch (ret) {
case BLK_EH_HANDLED:
-   /* Can we use req->errors here? */
__blk_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fa75401ea5cc..f912ddc39020 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -220,8 +220,6 @@ struct request {
 
void *special;  /* opaque pointer available for LLD use */
 
-   int errors;
-
unsigned int extra_len; /* length of alignment and padding */
 
unsigned long deadline;
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 99ed69fad041..d0dbe60d8a6d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -80,7 +80,6 @@ TRACE_EVENT(block_rq_requeue,

[PATCH 23/25] pd: remove bogus check for req->errors

2017-04-06 Thread Christoph Hellwig

The driver never sets req->errors
---
 drivers/block/paride/pd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index 82c6d02193ae..3b0ab214fe74 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -739,7 +739,6 @@ static int pd_special_command(struct pd_unit *disk,
  enum action (*func)(struct pd_unit *disk))
 {
struct request *rq;
-   int err = 0;
 
rq = blk_get_request(disk->gd->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
if (IS_ERR(rq))
@@ -748,10 +747,9 @@ static int pd_special_command(struct pd_unit *disk,
rq->special = func;
 
blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
-   err = req->errors ? -EIO : 0;
 
blk_put_request(rq);
-   return err;
+   return 0;
 }
 
 /* kernel glue structures */
-- 
2.11.0

[PATCH 18/25] blk-mq: simplify __blk_mq_complete_request

2017-04-06 Thread Christoph Hellwig

Merge blk_mq_ipi_complete_request and blk_mq_stat_add into their only
caller.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c | 21 ++---
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 393f350ebb90..a6e14a3c87ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -405,12 +405,17 @@ static void __blk_mq_complete_request_remote(void *data)
rq->q->softirq_done_fn(rq);
 }
 
-static void blk_mq_ipi_complete_request(struct request *rq)
+static void __blk_mq_complete_request(struct request *rq)
 {
struct blk_mq_ctx *ctx = rq->mq_ctx;
bool shared = false;
int cpu;
 
+   if (rq->rq_flags & RQF_STATS) {
+   blk_mq_poll_stats_start(rq->q);
+   blk_stat_add(rq);
+   }
+
if (!test_bit(QUEUE_FLAG_SAME_COMP, >q->queue_flags)) {
rq->q->softirq_done_fn(rq);
return;
@@ -431,20 +436,6 @@ static void blk_mq_ipi_complete_request(struct request *rq)
put_cpu();
 }
 
-static void blk_mq_stat_add(struct request *rq)
-{
-   if (rq->rq_flags & RQF_STATS) {
-   blk_mq_poll_stats_start(rq->q);
-   blk_stat_add(rq);
-   }
-}
-
-static void __blk_mq_complete_request(struct request *rq)
-{
-   blk_mq_stat_add(rq);
-   blk_mq_ipi_complete_request(rq);
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:the request being processed
-- 
2.11.0

[PATCH 22/25] swim3: remove (commented out) printing of req->errors

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/block/swim3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 61b3ffa4f458..ba4809c9bdba 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -343,8 +343,8 @@ static void start_request(struct floppy_state *fs)
  req->rq_disk->disk_name, req->cmd,
  (long)blk_rq_pos(req), blk_rq_sectors(req),
  bio_data(req->bio));
-   swim3_dbg("   errors=%d current_nr_sectors=%u\n",
- req->errors, blk_rq_cur_sectors(req));
+   swim3_dbg("   current_nr_sectors=%u\n",
+ blk_rq_cur_sectors(req));
 #endif
 
if (blk_rq_pos(req) >= fs->total_secs) {
-- 
2.11.0

[PATCH 24/25] blktrace: remove the unused block_rq_abort tracepoint

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 include/trace/events/block.h | 44 ++--
 kernel/trace/blktrace.c  |  9 -
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index a88ed13446ff..99ed69fad041 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -61,7 +61,16 @@ DEFINE_EVENT(block_buffer, block_dirty_buffer,
TP_ARGS(bh)
 );
 
-DECLARE_EVENT_CLASS(block_rq_with_error,
+/**
+ * block_rq_requeue - place block IO request back on a queue
+ * @q: queue holding operation
+ * @rq: block IO operation request
+ *
+ * The block operation request @rq is being placed back into queue
+ * @q.  For some reason the request was not completed and needs to be
+ * put back in the queue.
+ */
+TRACE_EVENT(block_rq_requeue,
 
TP_PROTO(struct request_queue *q, struct request *rq),
 
@@ -94,39 +103,6 @@ DECLARE_EVENT_CLASS(block_rq_with_error,
 );
 
 /**
- * block_rq_abort - abort block operation request
- * @q: queue containing the block operation request
- * @rq: block IO operation request
- *
- * Called immediately after pending block IO operation request @rq in
- * queue @q is aborted. The fields in the operation request @rq
- * can be examined to determine which device and sectors the pending
- * operation would access.
- */
-DEFINE_EVENT(block_rq_with_error, block_rq_abort,
-
-   TP_PROTO(struct request_queue *q, struct request *rq),
-
-   TP_ARGS(q, rq)
-);
-
-/**
- * block_rq_requeue - place block IO request back on a queue
- * @q: queue holding operation
- * @rq: block IO operation request
- *
- * The block operation request @rq is being placed back into queue
- * @q.  For some reason the request was not completed and needs to be
- * put back in the queue.
- */
-DEFINE_EVENT(block_rq_with_error, block_rq_requeue,
-
-   TP_PROTO(struct request_queue *q, struct request *rq),
-
-   TP_ARGS(q, rq)
-);
-
-/**
  * block_rq_complete - block IO operation completed by device driver
  * @q: queue containing the block operation request
  * @rq: block operations request
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index b2058a7f94bd..9f3624dadb09 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -716,12 +716,6 @@ static void blk_add_trace_rq(struct request_queue *q, 
struct request *rq,
rq->cmd_flags, what, rq->errors, 0, NULL);
 }
 
-static void blk_add_trace_rq_abort(void *ignore,
-  struct request_queue *q, struct request *rq)
-{
-   blk_add_trace_rq(q, rq, blk_rq_bytes(rq), BLK_TA_ABORT);
-}
-
 static void blk_add_trace_rq_insert(void *ignore,
struct request_queue *q, struct request *rq)
 {
@@ -974,8 +968,6 @@ static void blk_register_tracepoints(void)
 {
int ret;
 
-   ret = register_trace_block_rq_abort(blk_add_trace_rq_abort, NULL);
-   WARN_ON(ret);
ret = register_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);
WARN_ON(ret);
ret = register_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
@@ -1028,7 +1020,6 @@ static void blk_unregister_tracepoints(void)
unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL);
unregister_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
unregister_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);
-   unregister_trace_block_rq_abort(blk_add_trace_rq_abort, NULL);
 
tracepoint_synchronize_unregister();
 }
-- 
2.11.0

[PATCH 21/25] ataflop: switch from req->errors to req->error_count

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/block/ataflop.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 2104b1b4ccda..fa69ecd52cb5 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -617,12 +617,12 @@ static void fd_error( void )
if (!fd_request)
return;
 
-   fd_request->errors++;
-   if (fd_request->errors >= MAX_ERRORS) {
+   fd_request->error_count++;
+   if (fd_request->error_count >= MAX_ERRORS) {
printk(KERN_ERR "fd%d: too many errors.\n", SelectedDrive );
fd_end_request_cur(-EIO);
}
-   else if (fd_request->errors == RECALIBRATE_ERRORS) {
+   else if (fd_request->error_count == RECALIBRATE_ERRORS) {
printk(KERN_WARNING "fd%d: recalibrating\n", SelectedDrive );
if (SelectedDrive != -1)
SUD.track = -1;
@@ -1386,7 +1386,7 @@ static void setup_req_params( int drive )
ReqData = ReqBuffer + 512 * ReqCnt;
 
if (UseTrackbuffer)
-   read_track = (ReqCmd == READ && fd_request->errors == 0);
+   read_track = (ReqCmd == READ && fd_request->error_count == 0);
else
read_track = 0;
 
@@ -1409,8 +1409,10 @@ static struct request *set_next_request(void)
fdc_queue = 0;
if (q) {
rq = blk_fetch_request(q);
-   if (rq)
+   if (rq) {
+   rq->error_count = 0;
break;
+   }
}
} while (fdc_queue != old_pos);
 
-- 
2.11.0

[PATCH 17/25] blk-mq: remove the error argument to blk_mq_complete_request

2017-04-06 Thread Christoph Hellwig

Now that we always have a ->complete callback we can remove the direct
call to blk_mq_end_request, as well as the error argument to
blk_mq_complete_request.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c| 14 +++---
 drivers/block/loop.c  |  4 ++--
 drivers/block/mtip32xx/mtip32xx.c |  4 ++--
 drivers/block/nbd.c   |  4 ++--
 drivers/block/null_blk.c  |  2 +-
 drivers/block/virtio_blk.c|  2 +-
 drivers/block/xen-blkfront.c  |  2 +-
 drivers/md/dm-rq.c|  2 +-
 drivers/nvme/host/core.c  |  2 +-
 drivers/nvme/host/nvme.h  |  2 +-
 drivers/scsi/scsi_lib.c   |  2 +-
 include/linux/blk-mq.h|  2 +-
 12 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..393f350ebb90 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -441,14 +441,8 @@ static void blk_mq_stat_add(struct request *rq)
 
 static void __blk_mq_complete_request(struct request *rq)
 {
-   struct request_queue *q = rq->q;
-
blk_mq_stat_add(rq);
-
-   if (!q->softirq_done_fn)
-   blk_mq_end_request(rq, rq->errors);
-   else
-   blk_mq_ipi_complete_request(rq);
+   blk_mq_ipi_complete_request(rq);
 }
 
 /**
@@ -459,16 +453,14 @@ static void __blk_mq_complete_request(struct request *rq)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq, int error)
+void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
-   if (!blk_mark_rq_complete(rq)) {
-   rq->errors = error;
+   if (!blk_mark_rq_complete(rq))
__blk_mq_complete_request(rq);
-   }
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6924ec611a49..7d49d919543a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -465,7 +465,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
cmd->ret = ret;
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);
 }
 
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
@@ -1683,7 +1683,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
/* complete non-aio request */
if (!cmd->use_aio || ret) {
cmd->ret = ret ? -EIO : 0;
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);
}
 }
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index ec998e80cfaf..be6bbe5ce613 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -242,7 +242,7 @@ static void mtip_async_complete(struct mtip_port *port,
rq = mtip_rq_from_tag(dd, tag);
 
cmd->status = status;
-   blk_mq_complete_request(rq, 0);
+   blk_mq_complete_request(rq);
 }
 
 /*
@@ -4110,7 +4110,7 @@ static void mtip_no_dev_cleanup(struct request *rq, void 
*data, bool reserv)
 
if (likely(!reserv)) {
cmd->status = -ENODEV;
-   blk_mq_complete_request(rq, 0);
+   blk_mq_complete_request(rq);
} else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, >port->flags)) {
 
cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4f045fab9659..4f78c5a01d11 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -498,7 +498,7 @@ static void recv_work(struct work_struct *work)
break;
}
 
-   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd), 0);
+   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd));
}
 
/*
@@ -519,7 +519,7 @@ static void nbd_clear_req(struct request *req, void *data, 
bool reserved)
return;
cmd = blk_mq_rq_to_pdu(req);
cmd->status = -EIO;
-   blk_mq_complete_request(req, 0);
+   blk_mq_complete_request(req);
 }
 
 static void nbd_clear_que(struct nbd_device *nbd)
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 24ca85a70fd8..c27cccec368b 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -281,7 +281,7 @@ static inline void null_handle_cmd(struct nullb_cmd *cmd)
case NULL_IRQ_SOFTIRQ:
switch (queue_mode)  {
case NULL_Q_MQ:
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);
break;
case NULL_Q_RQ:
blk_complete_request(cmd->rq);
diff

[PATCH 16/25] xen-blkfront: don't use req->errors

2017-04-06 Thread Christoph Hellwig

xen-blkfron is the last users using rq->errros for passing back error to
blk-mq, and I'd like to get rid of that.  In the longer run the driver
should be moving more of the completion processing into .complete, but
this is the minimal change to move forward for now.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/xen-blkfront.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d137ef8a72be..08172f679a64 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -115,6 +115,15 @@ struct split_bio {
atomic_t pending;
 };
 
+struct blkif_req {
+   int error;
+};
+
+static inline struct blkif_req *blkif_req(struct request *rq)
+{
+   return blk_mq_rq_to_pdu(rq);
+}
+
 static DEFINE_MUTEX(blkfront_mutex);
 static const struct block_device_operations xlvbd_block_fops;
 
@@ -907,8 +916,14 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
+static void blkif_complete_rq(struct request *rq)
+{
+   blk_mq_end_request(rq, blkif_req(rq)->error);
+}
+
 static const struct blk_mq_ops blkfront_mq_ops = {
.queue_rq = blkif_queue_rq,
+   .complete = blkif_complete_rq,
 };
 
 static void blkif_set_queue_limits(struct blkfront_info *info)
@@ -969,7 +984,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
-   info->tag_set.cmd_size = 0;
+   info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
if (blk_mq_alloc_tag_set(>tag_set))
@@ -1543,7 +1558,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
unsigned long flags;
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
-   int error;
 
if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return IRQ_HANDLED;
@@ -1587,37 +1601,36 @@ static irqreturn_t blkif_interrupt(int irq, void 
*dev_id)
continue;
}
 
-   error = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
+   blkif_req(req)->error = (bret->status == BLKIF_RSP_OKAY) ? 0 : 
-EIO;
switch (bret->operation) {
case BLKIF_OP_DISCARD:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
struct request_queue *rq = info->rq;
printk(KERN_WARNING "blkfront: %s: %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
info->feature_discard = 0;
info->feature_secdiscard = 0;
queue_flag_clear(QUEUE_FLAG_DISCARD, rq);
queue_flag_clear(QUEUE_FLAG_SECERASE, rq);
}
-   blk_mq_complete_request(req, error);
break;
case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_WRITE_BARRIER:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
printk(KERN_WARNING "blkfront: %s: %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
}
if (unlikely(bret->status == BLKIF_RSP_ERROR &&
 rinfo->shadow[id].req.u.rw.nr_segments == 
0)) {
printk(KERN_WARNING "blkfront: %s: empty %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
}
-   if (unlikely(error)) {
-   if (error == -EOPNOTSUPP)
-   error = 0;
+   if (unlikely(blkif_req(req)->error)) {
+   if (blkif_req(req)->error == -EOPNOTSUPP)
+   blkif_req(req)->error = 0;
info->feature_fua = 0;
info->feature_flush = 0;
xlvbd_flush(info);
@@ -1629,11 +1642,12 @@ static irqreturn_t blkif_interrupt(int irq, void 
*dev_id)

[PATCH 15/25] mtip32xx: add a status field to struct mtip_cmd

2017-04-06 Thread Christoph Hellwig

Instead of using req->errors, which will go away.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/mtip32xx/mtip32xx.c | 16 +---
 drivers/block/mtip32xx/mtip32xx.h |  1 +
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 30076e7753bc..ec998e80cfaf 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -241,7 +241,8 @@ static void mtip_async_complete(struct mtip_port *port,
 
rq = mtip_rq_from_tag(dd, tag);
 
-   blk_mq_complete_request(rq, status);
+   cmd->status = status;
+   blk_mq_complete_request(rq, 0);
 }
 
 /*
@@ -2910,18 +2911,19 @@ static void mtip_softirq_done_fn(struct request *rq)
if (unlikely(cmd->unaligned))
up(>port->cmd_slot_unal);
 
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, cmd->status);
 }
 
 static void mtip_abort_cmd(struct request *req, void *data,
bool reserved)
 {
+   struct mtip_cmd *cmd = blk_mq_rq_to_pdu(req);
struct driver_data *dd = data;
 
dbg_printk(MTIP_DRV_NAME " Aborting request, tag = %d\n", req->tag);
 
clear_bit(req->tag, dd->port->cmds_to_issue);
-   req->errors = -EIO;
+   cmd->status = -EIO;
mtip_softirq_done_fn(req);
 }
 
@@ -3816,7 +3818,6 @@ static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx,
if (likely(!ret))
return BLK_MQ_RQ_QUEUE_OK;
 
-   rq->errors = ret;
return BLK_MQ_RQ_QUEUE_ERROR;
 }
 
@@ -4107,9 +4108,10 @@ static void mtip_no_dev_cleanup(struct request *rq, void 
*data, bool reserv)
struct driver_data *dd = (struct driver_data *)data;
struct mtip_cmd *cmd;
 
-   if (likely(!reserv))
-   blk_mq_complete_request(rq, -ENODEV);
-   else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, >port->flags)) {
+   if (likely(!reserv)) {
+   cmd->status = -ENODEV;
+   blk_mq_complete_request(rq, 0);
+   } else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, >port->flags)) {
 
cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
if (cmd->comp_func)
diff --git a/drivers/block/mtip32xx/mtip32xx.h 
b/drivers/block/mtip32xx/mtip32xx.h
index 7617888f7944..57b41528a824 100644
--- a/drivers/block/mtip32xx/mtip32xx.h
+++ b/drivers/block/mtip32xx/mtip32xx.h
@@ -352,6 +352,7 @@ struct mtip_cmd {
int retries; /* The number of retries left for this command. */
 
int direction; /* Data transfer direction */
+   int status;
 };
 
 /* Structure used to describe a port. */
-- 
2.11.0

[PATCH 14/25] nbd: don't use req->errors

2017-04-06 Thread Christoph Hellwig

Add a nbd-specific field instead.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/nbd.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 03ae72985c79..4f045fab9659 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -83,6 +83,7 @@ struct nbd_device {
 struct nbd_cmd {
struct nbd_device *nbd;
struct completion send_complete;
+   int status;
 };
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
@@ -152,16 +153,14 @@ static void nbd_size_set(struct nbd_device *nbd, struct 
block_device *bdev,
nbd_size_update(nbd, bdev);
 }
 
-static void nbd_end_request(struct nbd_cmd *cmd)
+static void nbd_complete_rq(struct request *req)
 {
-   struct nbd_device *nbd = cmd->nbd;
-   struct request *req = blk_mq_rq_from_pdu(cmd);
-   int error = req->errors ? -EIO : 0;
+   struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 
-   dev_dbg(nbd_to_dev(nbd), "request %p: %s\n", cmd,
-   error ? "failed" : "done");
+   dev_dbg(nbd_to_dev(cmd->nbd), "request %p: %s\n", cmd,
+   cmd->status ? "failed" : "done");
 
-   blk_mq_complete_request(req, error);
+   blk_mq_end_request(req, cmd->status);
 }
 
 /*
@@ -193,7 +192,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct 
request *req,
 
dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down 
connection\n");
set_bit(NBD_TIMEDOUT, >runtime_flags);
-   req->errors = -EIO;
+   cmd->status = -EIO;
 
mutex_lock(>config_lock);
sock_shutdown(nbd);
@@ -433,7 +432,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device 
*nbd, int index)
if (ntohl(reply.error)) {
dev_err(disk_to_dev(nbd->disk), "Other side returned error 
(%d)\n",
ntohl(reply.error));
-   req->errors = -EIO;
+   cmd->status = -EIO;
return cmd;
}
 
@@ -449,7 +448,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device 
*nbd, int index)
if (result <= 0) {
dev_err(disk_to_dev(nbd->disk), "Receive data 
failed (result %d)\n",
result);
-   req->errors = -EIO;
+   cmd->status = -EIO;
return cmd;
}
dev_dbg(nbd_to_dev(nbd), "request %p: got %d bytes 
data\n",
@@ -499,7 +498,7 @@ static void recv_work(struct work_struct *work)
break;
}
 
-   nbd_end_request(cmd);
+   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd), 0);
}
 
/*
@@ -519,8 +518,8 @@ static void nbd_clear_req(struct request *req, void *data, 
bool reserved)
if (!blk_mq_request_started(req))
return;
cmd = blk_mq_rq_to_pdu(req);
-   req->errors = -EIO;
-   nbd_end_request(cmd);
+   cmd->status = -EIO;
+   blk_mq_complete_request(req, 0);
 }
 
 static void nbd_clear_que(struct nbd_device *nbd)
@@ -551,7 +550,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
return -EINVAL;
}
 
-   req->errors = 0;
+   cmd->status = 0;
 
nsock = nbd->socks[index];
mutex_lock(>tx_lock);
@@ -1037,6 +1036,7 @@ static int nbd_init_request(void *data, struct request 
*rq,
 
 static const struct blk_mq_ops nbd_mq_ops = {
.queue_rq   = nbd_queue_rq,
+   .complete   = nbd_complete_rq,
.init_request   = nbd_init_request,
.timeout= nbd_xmit_timeout,
 };
-- 
2.11.0

[PATCH 20/25] floppy: switch from req->errors to req->error_count

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/block/floppy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index ce102ec47ef2..60d4c7653178 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2805,8 +2805,10 @@ static int set_next_request(void)
fdc_queue = 0;
if (q) {
current_req = blk_fetch_request(q);
-   if (current_req)
+   if (current_req) {
+   current_req->error_count = 0;
break;
+   }
}
} while (fdc_queue != old_pos);
 
@@ -2866,7 +2868,7 @@ static void redo_fd_request(void)
_floppy = floppy_type + DP->autodetect[DRS->probed_format];
} else
probing = 0;
-   errors = &(current_req->errors);
+   errors = &(current_req->error_count);
tmp = make_raw_rw_request();
if (tmp < 2) {
request_done(tmp);
-- 
2.11.0

[PATCH 19/25] block: add a error_count field to struct request

2017-04-06 Thread Christoph Hellwig

This is for the legacy floppy and ataflop drivers that currently abuse
->errors for this purpose.  It's stashed away in a union to not grow
the struct size, the other fields are either used by modern drivers
for different purposes or the I/O scheduler before queing the I/O
to drivers.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blkdev.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a03acd92ae74..fa75401ea5cc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -175,6 +175,7 @@ struct request {
struct rb_node rb_node; /* sort/lookup */
struct bio_vec special_vec;
void *completion_data;
+   int error_count; /* for legacy drivers, don't use */
};
 
/*
-- 
2.11.0

[PATCH 12/25] dm rq: don't pass irrelevant error code to blk_mq_complete_request

2017-04-06 Thread Christoph Hellwig

dm never uses rq->errors, so there is no need to pass an error argument
to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-rq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..e1cea42ec2a6 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -358,7 +358,7 @@ static void dm_complete_request(struct request *rq, int 
error)
if (!rq->q->mq_ops)
blk_complete_request(rq);
else
-   blk_mq_complete_request(rq, error);
+   blk_mq_complete_request(rq, 0);
 }
 
 /*
-- 
2.11.0

[PATCH 13/25] dm mpath: don't check for req->errors

2017-04-06 Thread Christoph Hellwig

We'll get all proper errors reported through ->end_io and ->errors will
go away soon.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-mpath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..4aa1b865e66e 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1491,7 +1491,7 @@ static int do_end_io(struct multipath *m, struct request 
*clone,
 */
int r = DM_ENDIO_REQUEUE;
 
-   if (!error && !clone->errors)
+   if (!error)
return 0;   /* I/O complete */
 
if (noretry_error(error))
-- 
2.11.0

[PATCH 09/25] scsi: introduce a new result field in struct scsi_request

2017-04-06 Thread Christoph Hellwig

This passes on the scsi_cmnd result field to users of passthrough
requests.  Currently we abuse req->errors for this purpose, but that
field will go away in its current form.

Note that the old IDE code abuses the errors field in very creative
ways and stores all kinds of different values in it.  I didn't dare
to touch this magic, so the abuses are brought forward 1:1.

Signed-off-by: Christoph Hellwig 
---
 block/bsg-lib.c|  8 
 block/bsg.c| 12 +--
 block/scsi_ioctl.c | 14 ++---
 drivers/block/cciss.c  | 42 +++---
 drivers/block/pktcdvd.c|  2 +-
 drivers/block/virtio_blk.c |  2 +-
 drivers/cdrom/cdrom.c  |  2 +-
 drivers/ide/ide-atapi.c| 10 -
 drivers/ide/ide-cd.c   | 20 +-
 drivers/ide/ide-cd_ioctl.c |  2 +-
 drivers/ide/ide-devsets.c  |  4 ++--
 drivers/ide/ide-dma.c  |  2 +-
 drivers/ide/ide-eh.c   | 36 
 drivers/ide/ide-floppy.c   | 10 -
 drivers/ide/ide-io.c   | 10 -
 drivers/ide/ide-ioctls.c   |  4 ++--
 drivers/ide/ide-park.c |  2 +-
 drivers/ide/ide-pm.c   |  8 
 drivers/ide/ide-tape.c |  4 ++--
 drivers/ide/ide-taskfile.c |  6 +++---
 drivers/scsi/osd/osd_initiator.c   |  2 +-
 drivers/scsi/osst.c|  2 +-
 drivers/scsi/qla2xxx/qla_bsg.c |  6 +++---
 drivers/scsi/scsi_lib.c| 18 +++-
 drivers/scsi/scsi_transport_sas.c  |  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/scsi/st.c  |  6 +++---
 drivers/target/target_core_pscsi.c |  2 +-
 fs/nfsd/blocklayout.c  |  4 ++--
 include/linux/ide.h|  2 +-
 include/scsi/scsi_request.h|  1 +
 31 files changed, 122 insertions(+), 125 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index cd15f9dbb147..0a23dbba2d30 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -37,7 +37,7 @@ static void bsg_destroy_job(struct kref *kref)
struct bsg_job *job = container_of(kref, struct bsg_job, kref);
struct request *rq = job->req;
 
-   blk_end_request_all(rq, rq->errors);
+   blk_end_request_all(rq, scsi_req(rq)->result);
 
put_device(job->dev);   /* release reference for the request */
 
@@ -74,7 +74,7 @@ void bsg_job_done(struct bsg_job *job, int result,
struct scsi_request *rq = scsi_req(req);
int err;
 
-   err = job->req->errors = result;
+   err = scsi_req(job->req)->result = result;
if (err < 0)
/* we're only returning the result field in the reply */
rq->sense_len = sizeof(u32);
@@ -177,7 +177,7 @@ static int bsg_create_job(struct device *dev, struct 
request *req)
  * @q: request queue to manage
  *
  * On error the create_bsg_job function should return a -Exyz error value
- * that will be set to the req->errors.
+ * that will be set to ->result.
  *
  * Drivers/subsys should pass this to the queue init function.
  */
@@ -201,7 +201,7 @@ static void bsg_request_fn(struct request_queue *q)
 
ret = bsg_create_job(dev, req);
if (ret) {
-   req->errors = ret;
+   scsi_req(req)->result = ret;
blk_end_request_all(req, ret);
spin_lock_irq(q->queue_lock);
continue;
diff --git a/block/bsg.c b/block/bsg.c
index 74835dbf0c47..e54daf9f2ee0 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -391,13 +391,13 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, 
struct sg_io_v4 *hdr,
struct scsi_request *req = scsi_req(rq);
int ret = 0;
 
-   dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->errors);
+   dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->result);
/*
 * fill in all the output members
 */
-   hdr->device_status = rq->errors & 0xff;
-   hdr->transport_status = host_byte(rq->errors);
-   hdr->driver_status = driver_byte(rq->errors);
+   hdr->device_status = req->result & 0xff;
+   hdr->transport_status = host_byte(req->result);
+   hdr->driver_status = driver_byte(req->result);
hdr->info = 0;
if (hdr->device_status || hdr->transport_status || hdr->driver_status)
hdr->info |= SG_INFO_CHECK;
@@ -431,8 +431,8 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, 
struct sg_io_v4 *hdr,
 * just a protocol response (i.e. non negative), that gets
 * processed above.
 */
-   if (!ret && rq->errors < 0)
-   ret = rq->errors;
+   if (!ret && req->result < 0)
+   ret = req->result;
 
blk_rq_unmap_user(bio);
scsi_req_free_cmd(req);
diff --git a/block/scsi_ioctl.c

[PATCH 10/25] loop: zero-fill bio on the submitting cpu

2017-04-06 Thread Christoph Hellwig

In thruth I've just audited which blk-mq drivers don't currently have a
complete callback, but I think this change is at least borderline useful.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 30 ++
 drivers/block/loop.h |  1 +
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f34e017..6924ec611a49 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -445,32 +445,27 @@ static int lo_req_flush(struct loop_device *lo, struct 
request *rq)
return ret;
 }
 
-static inline void handle_partial_read(struct loop_cmd *cmd, long bytes)
+static void lo_complete_rq(struct request *rq)
 {
-   if (bytes < 0 || op_is_write(req_op(cmd->rq)))
-   return;
+   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-   if (unlikely(bytes < blk_rq_bytes(cmd->rq))) {
+   if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
+cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
struct bio *bio = cmd->rq->bio;
 
-   bio_advance(bio, bytes);
+   bio_advance(bio, cmd->ret);
zero_fill_bio(bio);
}
+
+   blk_mq_end_request(rq, cmd->ret < 0 ? -EIO : 0);
 }
 
 static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
-   struct request *rq = cmd->rq;
-
-   handle_partial_read(cmd, ret);
 
-   if (ret > 0)
-   ret = 0;
-   else if (ret < 0)
-   ret = -EIO;
-
-   blk_mq_complete_request(rq, ret);
+   cmd->ret = ret;
+   blk_mq_complete_request(cmd->rq, 0);
 }
 
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
@@ -1686,8 +1681,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
ret = do_req_filebacked(lo, cmd->rq);
  failed:
/* complete non-aio request */
-   if (!cmd->use_aio || ret)
-   blk_mq_complete_request(cmd->rq, ret ? -EIO : 0);
+   if (!cmd->use_aio || ret) {
+   cmd->ret = ret ? -EIO : 0;
+   blk_mq_complete_request(cmd->rq, 0);
+   }
 }
 
 static void loop_queue_work(struct kthread_work *work)
@@ -1713,6 +1710,7 @@ static int loop_init_request(void *data, struct request 
*rq,
 static const struct blk_mq_ops loop_mq_ops = {
.queue_rq   = loop_queue_rq,
.init_request   = loop_init_request,
+   .complete   = lo_complete_rq,
 };
 
 static int loop_add(struct loop_device **l, int i)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c73e61..fecd3f97ef8c 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -70,6 +70,7 @@ struct loop_cmd {
struct request *rq;
struct list_head list;
bool use_aio;   /* use AIO interface to handle I/O */
+   long ret;
struct kiocb iocb;
 };
 
-- 
2.11.0

[PATCH 08/25] scsi: fix fast-fail for non-passthrough requests

2017-04-06 Thread Christoph Hellwig

Currently error is always 0 for non-passthrough requests when reaching the
scsi_noretry_cmd check in scsi_io_completion, which effectively disables
all fastfail logic.  Fix this by having a single call to
__scsi_error_from_host_byte at the beginning of the function and always
having a valid error value.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/scsi_lib.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 11972d1075f1..89b4d9e69866 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -779,21 +779,17 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
sense_valid = scsi_command_normalize_sense(cmd, );
if (sense_valid)
sense_deferred = scsi_sense_is_deferred();
+
+   if (!sense_deferred)
+   error = __scsi_error_from_host_byte(cmd, result);
}
 
if (blk_rq_is_passthrough(req)) {
-   if (result) {
-   if (sense_valid) {
-   /*
-* SG_IO wants current and deferred errors
-*/
-   scsi_req(req)->sense_len =
-   min(8 + cmd->sense_buffer[7],
-   SCSI_SENSE_BUFFERSIZE);
-   }
-   if (!sense_deferred)
-   error = __scsi_error_from_host_byte(cmd, 
result);
+   if (result && sense_valid) {
+   scsi_req(req)->sense_len = min(8 + cmd->sense_buffer[7],
+  SCSI_SENSE_BUFFERSIZE);
}
+
/*
 * __scsi_error_from_host_byte may have reset the host_byte
 */
@@ -812,13 +808,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
BUG();
return;
}
-   } else if (blk_rq_bytes(req) == 0 && result && !sense_deferred) {
-   /*
-* Flush commands do not transfers any data, and thus cannot use
-* good_bytes != blk_rq_bytes(req) as the signal for an error.
-* This sets the error explicitly for the problem case.
-*/
-   error = __scsi_error_from_host_byte(cmd, result);
}
 
/* no bidi support for !blk_rq_is_passthrough yet */
@@ -848,7 +837,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int 
good_bytes)
else if (!(req->rq_flags & RQF_QUIET))
scsi_print_sense(cmd);
result = 0;
-   /* for passthrough error may be set */
error = 0;
}
 
@@ -877,8 +865,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int 
good_bytes)
if (result == 0)
goto requeue;
 
-   error = __scsi_error_from_host_byte(cmd, result);
-
if (host_byte(result) == DID_RESET) {
/* Third party bus reset or reset for error recovery
 * reasons.  Just retry the command and see what
-- 
2.11.0

[PATCH 11/25] null_blk: don't pass always-0 req->errors to blk_mq_complete_request

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/block/null_blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index f93906ff31e8..24ca85a70fd8 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -281,7 +281,7 @@ static inline void null_handle_cmd(struct nullb_cmd *cmd)
case NULL_IRQ_SOFTIRQ:
switch (queue_mode)  {
case NULL_Q_MQ:
-   blk_mq_complete_request(cmd->rq, cmd->rq->errors);
+   blk_mq_complete_request(cmd->rq, 0);
break;
case NULL_Q_RQ:
blk_complete_request(cmd->rq);
-- 
2.11.0

[PATCH 03/25] nvme-fc: fix status code handling in nvme_fc_fcpio_done

2017-04-06 Thread Christoph Hellwig

nvme_complete_async_event expects the little endian status code
including the phase bit, and a new completion handler I plan to
introduce will do so as well.

Change the status variable into the little endian format with the
phase bit used in the NVMe CQE to fix / enable this.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/fc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index fc42172c796a..aad7f9c0be32 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1147,7 +1147,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_fc_ctrl *ctrl = op->ctrl;
struct nvme_fc_queue *queue = op->queue;
struct nvme_completion *cqe = >rsp_iu.cqe;
-   u16 status = NVME_SC_SUCCESS;
+   __le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
 
/*
 * WARNING:
@@ -1182,9 +1182,9 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
sizeof(op->rsp_iu), DMA_FROM_DEVICE);
 
if (atomic_read(>state) == FCPOP_STATE_ABORTED)
-   status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
+   status = cpu_to_le16((NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1);
else if (freq->status)
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
 
/*
 * For the linux implementation, if we have an unsuccesful
@@ -1212,7 +1212,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 */
if (freq->transferred_length !=
be32_to_cpu(op->cmd_iu.data_len)) {
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
op->nreq.result.u64 = 0;
@@ -1229,15 +1229,15 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
freq->transferred_length ||
 op->rsp_iu.status_code ||
 op->rqno != le16_to_cpu(cqe->command_id))) {
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
op->nreq.result = cqe->result;
-   status = le16_to_cpu(cqe->status) >> 1;
+   status = cqe->status;
break;
 
default:
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
 
@@ -1249,7 +1249,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
return;
}
 
-   blk_mq_complete_request(rq, status);
+   blk_mq_complete_request(rq, le16_to_cpu(status) >> 1);
 }
 
 static int
-- 
2.11.0

[PATCH 02/25] block: remove the blk_execute_rq return value

2017-04-06 Thread Christoph Hellwig

The function only returns -EIO if rq->errors is non-zero, which is not
very useful and lets a large number of callers ignore the return value.

Just let the callers figure out their error themselves.

Signed-off-by: Christoph Hellwig 
---
 block/blk-exec.c | 8 +---
 block/scsi_ioctl.c   | 3 ++-
 drivers/block/paride/pd.c| 3 ++-
 drivers/block/virtio_blk.c   | 3 ++-
 drivers/cdrom/cdrom.c| 3 ++-
 drivers/ide/ide-atapi.c  | 3 ++-
 drivers/ide/ide-cd.c | 3 ++-
 drivers/ide/ide-cd_ioctl.c   | 3 ++-
 drivers/ide/ide-devsets.c| 4 ++--
 drivers/ide/ide-disk.c   | 3 +--
 drivers/ide/ide-ioctls.c | 7 ---
 drivers/ide/ide-park.c   | 3 ++-
 drivers/ide/ide-pm.c | 3 ++-
 drivers/ide/ide-taskfile.c   | 4 ++--
 drivers/scsi/osd/osd_initiator.c | 5 -
 fs/nfsd/blocklayout.c| 4 ++--
 include/linux/blkdev.h   | 2 +-
 17 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 8cd0e9bc8dc8..afa383248c7c 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -92,11 +92,10 @@ EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
  *Insert a fully prepared request at the back of the I/O scheduler queue
  *for execution and wait for completion.
  */
-int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
+void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
   struct request *rq, int at_head)
 {
DECLARE_COMPLETION_ONSTACK(wait);
-   int err = 0;
unsigned long hang_check;
 
rq->end_io_data = 
@@ -108,10 +107,5 @@ int blk_execute_rq(struct request_queue *q, struct gendisk 
*bd_disk,
while (!wait_for_completion_io_timeout(, hang_check * 
(HZ/2)));
else
wait_for_completion_io();
-
-   if (rq->errors)
-   err = -EIO;
-
-   return err;
 }
 EXPORT_SYMBOL(blk_execute_rq);
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 82a43bb19967..b1352143f12f 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -547,7 +547,8 @@ static int __blk_send_generic(struct request_queue *q, 
struct gendisk *bd_disk,
scsi_req(rq)->cmd[0] = cmd;
scsi_req(rq)->cmd[4] = data;
scsi_req(rq)->cmd_len = 6;
-   err = blk_execute_rq(q, bd_disk, rq, 0);
+   blk_execute_rq(q, bd_disk, rq, 0);
+   err = rq->errors ? -EIO : 0;
blk_put_request(rq);
 
return err;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index b05e151c9b38..82c6d02193ae 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -747,7 +747,8 @@ static int pd_special_command(struct pd_unit *disk,
 
rq->special = func;
 
-   err = blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
+   blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
+   err = req->errors ? -EIO : 0;
 
blk_put_request(rq);
return err;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2d8290169271..eaf99022bdc6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -310,7 +310,8 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
if (err)
goto out;
 
-   err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+   blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+   err = req->errors ? -EIO : 0;
 out:
blk_put_request(req);
return err;
diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 87739649eac2..308501730ab3 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2218,7 +2218,8 @@ static int cdrom_read_cdda_bpc(struct cdrom_device_info 
*cdi, __u8 __user *ubuf,
rq->timeout = 60 * HZ;
bio = rq->bio;
 
-   if (blk_execute_rq(q, cdi->disk, rq, 0)) {
+   blk_execute_rq(q, cdi->disk, rq, 0);
+   if (rq->errors) {
struct request_sense *s = req->sense;
ret = -EIO;
cdi->last_sense = s->sense_key;
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index feb30061123b..1524797e1776 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -107,7 +107,8 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk 
*disk,
memcpy(scsi_req(rq)->cmd, pc->c, 12);
if (drive->media == ide_tape)
scsi_req(rq)->cmd[13] = REQ_IDETAPE_PC1;
-   error = blk_execute_rq(drive->queue, disk, rq, 0);
+   blk_execute_rq(drive->queue, disk, rq, 0);
+   error = rq->errors ? -EIO : 0;
 put_req:
blk_put_request(rq);
return error;
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74f1b7dc03f7..95c40afa9120 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -452,7 +452,8 @@ int

[PATCH 05/25] nvme: make nvme_error_status private

2017-04-06 Thread Christoph Hellwig

Currently it's used by the lighnvm passthrough ioctl, but we'd like to make
it private in preparation of block layer specific error code.  Lighnvm already
returns the real NVMe status anyway, so I think we can just limit it to
returning -EIO for any status set.

This will need a careful audit from the lightnvm folks, though.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 3 +--
 drivers/nvme/host/lightnvm.c | 8 
 drivers/nvme/host/nvme.h | 1 -
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b73a86a78379..fbc7d64b6f10 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -66,7 +66,7 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
-int nvme_error_status(struct request *req)
+static int nvme_error_status(struct request *req)
 {
switch (nvme_req(req)->status & 0x7ff) {
case NVME_SC_SUCCESS:
@@ -77,7 +77,6 @@ int nvme_error_status(struct request *req)
return -EIO;
}
 }
-EXPORT_SYMBOL_GPL(nvme_error_status);
 
 static inline bool nvme_req_needs_retry(struct request *req)
 {
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 3d7b92c4c577..69067833b242 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -484,7 +484,7 @@ static void nvme_nvm_end_io(struct request *rq, int error)
struct nvm_rq *rqd = rq->end_io_data;
 
rqd->ppa_status = nvme_req(rq)->result.u64;
-   rqd->error = nvme_req(req)->status;
+   rqd->error = nvme_req(rq)->status;
nvm_end_io(rqd);
 
kfree(nvme_req(rq)->cmd);
@@ -611,7 +611,7 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
__le64 *metadata = NULL;
dma_addr_t metadata_dma;
DECLARE_COMPLETION_ONSTACK(wait);
-   int ret;
+   int ret = 0;
 
rq = nvme_alloc_request(q, (struct nvme_command *)vcmd, 0,
NVME_QID_ANY);
@@ -683,8 +683,8 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
 
if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
ret = -EINTR;
-   else
-   ret = nvme_error_status(rq);
+   else if (nvme_req(rq)->status & 0x7ff)
+   ret = -EIO;
if (result)
*result = nvme_req(rq)->status & 0x7ff;
if (status)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7a03a0897e0b..4dc930d3545d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -254,7 +254,6 @@ static inline void nvme_end_request(struct request *req, 
__le16 status,
blk_mq_complete_request(req, 0);
 }
 
-int nvme_error_status(struct request *req);
 void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
-- 
2.11.0

[PATCH 07/25] virtio_blk: don't use req->errors

2017-04-06 Thread Christoph Hellwig

Remove passing req->errors (which at that point is always 0) to
blk_mq_complete_requestq, and rely on the virtio status code for the
serial number passthrough request.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/virtio_blk.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index dbc4e80680b1..8378ad480f77 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -175,19 +175,15 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
virtblk_req *vbr,
 static inline void virtblk_request_done(struct request *req)
 {
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
-   int error = virtblk_result(vbr);
 
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
virtblk_scsi_request_done(req);
break;
-   case REQ_OP_DRV_IN:
-   req->errors = (error != 0);
-   break;
}
 
-   blk_mq_end_request(req, error);
+   blk_mq_end_request(req, virtblk_result(vbr));
 }
 
 static void virtblk_done(struct virtqueue *vq)
@@ -205,7 +201,7 @@ static void virtblk_done(struct virtqueue *vq)
while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, )) != 
NULL) {
struct request *req = blk_mq_rq_from_pdu(vbr);
 
-   blk_mq_complete_request(req, req->errors);
+   blk_mq_complete_request(req, 0);
req_done = true;
}
if (unlikely(virtqueue_is_broken(vq)))
@@ -311,7 +307,7 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
goto out;
 
blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
-   err = req->errors ? -EIO : 0;
+   err = virtblk_result(blk_mq_rq_to_pdu(req));
 out:
blk_put_request(req);
return err;
-- 
2.11.0

[PATCH 06/25] virtio: fix spelling of virtblk_scsi_request_done

2017-04-06 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/block/virtio_blk.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index eaf99022bdc6..dbc4e80680b1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -111,7 +111,7 @@ static int virtblk_add_req_scsi(struct virtqueue *vq, 
struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
 }
 
-static inline void virtblk_scsi_reques_done(struct request *req)
+static inline void virtblk_scsi_request_done(struct request *req)
 {
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
struct virtio_blk *vblk = req->q->queuedata;
@@ -144,7 +144,7 @@ static inline int virtblk_add_req_scsi(struct virtqueue *vq,
 {
return -EIO;
 }
-static inline void virtblk_scsi_reques_done(struct request *req)
+static inline void virtblk_scsi_request_done(struct request *req)
 {
 }
 #define virtblk_ioctl  NULL
@@ -180,7 +180,7 @@ static inline void virtblk_request_done(struct request *req)
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
-   virtblk_scsi_reques_done(req);
+   virtblk_scsi_request_done(req);
break;
case REQ_OP_DRV_IN:
req->errors = (error != 0);
-- 
2.11.0

[PATCH 04/25] nvme: split nvme status from block req->errors

2017-04-06 Thread Christoph Hellwig

We want our own clearly defined error field for NVMe passthrough commands,
and the request errors field is going away in its current form.

Just store the status and result field in the nvme_request field from
hardirq completion context (using a new helper) and then generate a
Linux errno for the block layer only when we actually need it.

Because we can't overload the status value with a negative error code
for cancelled command we now have a flags filed in struct nvme_request
that contains a bit for this condition.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 50 +++-
 drivers/nvme/host/fc.c   | 10 -
 drivers/nvme/host/lightnvm.c |  9 +---
 drivers/nvme/host/nvme.h | 33 +
 drivers/nvme/host/pci.c  | 11 +-
 drivers/nvme/host/rdma.c |  5 ++---
 drivers/nvme/target/loop.c   |  7 ++-
 7 files changed, 65 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dc05f41c3992..b73a86a78379 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -66,11 +66,24 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
+int nvme_error_status(struct request *req)
+{
+   switch (nvme_req(req)->status & 0x7ff) {
+   case NVME_SC_SUCCESS:
+   return 0;
+   case NVME_SC_CAP_EXCEEDED:
+   return -ENOSPC;
+   default:
+   return -EIO;
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_error_status);
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
if (blk_noretry_request(req))
return false;
-   if (req->errors & NVME_SC_DNR)
+   if (nvme_req(req)->status & NVME_SC_DNR)
return false;
if (jiffies - req->start_time >= req->timeout)
return false;
@@ -81,23 +94,13 @@ static inline bool nvme_req_needs_retry(struct request *req)
 
 void nvme_complete_rq(struct request *req)
 {
-   int error = 0;
-
-   if (unlikely(req->errors)) {
-   if (nvme_req_needs_retry(req)) {
-   nvme_req(req)->retries++;
-   blk_mq_requeue_request(req,
-   !blk_mq_queue_stopped(req->q));
-   return;
-   }
-
-   if (blk_rq_is_passthrough(req))
-   error = req->errors;
-   else
-   error = nvme_error_status(req->errors);
+   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+   nvme_req(req)->retries++;
+   blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
+   return;
}
 
-   blk_mq_end_request(req, error);
+   blk_mq_end_request(req, nvme_error_status(req));
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
@@ -114,7 +117,9 @@ void nvme_cancel_request(struct request *req, void *data, 
bool reserved)
status = NVME_SC_ABORT_REQ;
if (blk_queue_dying(req->q))
status |= NVME_SC_DNR;
-   blk_mq_complete_request(req, status);
+   nvme_req(req)->status = status;
+   blk_mq_complete_request(req, 0);
+
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
@@ -357,6 +362,7 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 
if (!(req->rq_flags & RQF_DONTPREP)) {
nvme_req(req)->retries = 0;
+   nvme_req(req)->flags = 0;
req->rq_flags |= RQF_DONTPREP;
}
 
@@ -411,7 +417,10 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct 
nvme_command *cmd,
blk_execute_rq(req->q, NULL, req, at_head);
if (result)
*result = nvme_req(req)->result;
-   ret = req->errors;
+   if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+   ret = -EINTR;
+   else
+   ret = nvme_req(req)->status;
  out:
blk_mq_free_request(req);
return ret;
@@ -496,7 +505,10 @@ int __nvme_submit_user_cmd(struct request_queue *q, struct 
nvme_command *cmd,
}
  submit:
blk_execute_rq(req->q, disk, req, 0);
-   ret = req->errors;
+   if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+   ret = -EINTR;
+   else
+   ret = nvme_req(req)->status;
if (result)
*result = le32_to_cpu(nvme_req(req)->result.u32);
if (meta && !ret && !write) {
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index aad7f9c0be32..450733c8cd24 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1148,6 +1148,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_fc_queue *queue = op->queue;
struct nvme_completion *cqe = >rsp_iu.cqe;
__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
+   union nvme_result result;
 
/*
 * WARNING:
@@ -1215,7 +1216,7 @@ nvme_fc_fcpio_done(struct

[PATCH 01/25] remove the mg_disk driver

2017-04-06 Thread Christoph Hellwig

This drivers was added in 2008, but as far as a I can tell we never had a
single platform that actually registered resources for the platform driver.

It's also been unmaintained for a long time and apparently has a ATA mode
that can be driven using the IDE/libata subsystem.

Signed-off-by: Christoph Hellwig 
---
 Documentation/blockdev/mflash.txt |   84 ---
 drivers/block/Kconfig |   17 -
 drivers/block/Makefile|1 -
 drivers/block/mg_disk.c   | 1110 -
 include/linux/mg_disk.h   |   45 --
 5 files changed, 1257 deletions(-)
 delete mode 100644 Documentation/blockdev/mflash.txt
 delete mode 100644 drivers/block/mg_disk.c
 delete mode 100644 include/linux/mg_disk.h

diff --git a/Documentation/blockdev/mflash.txt 
b/Documentation/blockdev/mflash.txt
deleted file mode 100644
index f7e050551487..
--- a/Documentation/blockdev/mflash.txt
+++ /dev/null
@@ -1,84 +0,0 @@
-This document describes m[g]flash support in linux.
-
-Contents
-  1. Overview
-  2. Reserved area configuration
-  3. Example of mflash platform driver registration
-
-1. Overview
-
-Mflash and gflash are embedded flash drive. The only difference is mflash is
-MCP(Multi Chip Package) device. These two device operate exactly same way.
-So the rest mflash repersents mflash and gflash altogether.
-
-Internally, mflash has nand flash and other hardware logics and supports
-2 different operation (ATA, IO) modes. ATA mode doesn't need any new
-driver and currently works well under standard IDE subsystem. Actually it's
-one chip SSD. IO mode is ATA-like custom mode for the host that doesn't have
-IDE interface.
-
-Following are brief descriptions about IO mode.
-A. IO mode based on ATA protocol and uses some custom command. (read confirm,
-write confirm)
-B. IO mode uses SRAM bus interface.
-C. IO mode supports 4kB boot area, so host can boot from mflash.
-
-2. Reserved area configuration
-If host boot from mflash, usually needs raw area for boot loader image. All of
-the mflash's block device operation will be taken this value as start offset.
-Note that boot loader's size of reserved area and kernel configuration value
-must be same.
-
-3. Example of mflash platform driver registration
-Working mflash is very straight forward. Adding platform device stuff to board
-configuration file is all. Here is some pseudo example.
-
-static struct mg_drv_data mflash_drv_data = {
-   /* If you want to polling driver set to 1 */
-   .use_polling = 0,
-   /* device attribution */
-   .dev_attr = MG_BOOT_DEV
-};
-
-static struct resource mg_mflash_rsc[] = {
-   /* Base address of mflash */
-   [0] = {
-   .start = 0x0800,
-   .end = 0x0800 + SZ_64K - 1,
-   .flags = IORESOURCE_MEM
-   },
-   /* mflash interrupt pin */
-   [1] = {
-   .start = IRQ_GPIO(84),
-   .end = IRQ_GPIO(84),
-   .flags = IORESOURCE_IRQ
-   },
-   /* mflash reset pin */
-   [2] = {
-   .start = 43,
-   .end = 43,
-   .name = MG_RST_PIN,
-   .flags = IORESOURCE_IO
-   },
-   /* mflash reset-out pin
-* If you use mflash as storage device (i.e. other than MG_BOOT_DEV),
-* should assign this */
-   [3] = {
-   .start = 51,
-   .end = 51,
-   .name = MG_RSTOUT_PIN,
-   .flags = IORESOURCE_IO
-   }
-};
-
-static struct platform_device mflash_dev = {
-   .name = MG_DEV_NAME,
-   .id = -1,
-   .dev = {
-   .platform_data = _drv_data,
-   },
-   .num_resources = ARRAY_SIZE(mg_mflash_rsc),
-   .resource = mg_mflash_rsc
-};
-
-platform_device_register(_dev);
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a1c2e816128f..ebe8c1a6195e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -434,23 +434,6 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
-config MG_DISK
-   tristate "mGine mflash, gflash support"
-   depends on ARM && GPIOLIB
-   help
- mGine mFlash(gFlash) block device driver
-
-config MG_DISK_RES
-   int "Size of reserved area before MBR"
-   depends on MG_DISK
-   default 0
-   help
- Define size of reserved area that usually used for boot. Unit is KB.
- All of the block device operation will be taken this value as start
- offset
- Examples:
-   1024 => 1 MB
-
 config SUNVDC
tristate "Sun Virtual Disk Client support"
depends on SUN_LDOMS
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index b12c772bbeb3..5ceead8b52d7 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -19,7 +19,6 @@ obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o

kill req->errors

2017-04-06 Thread Christoph Hellwig

Currently the request structure has an errors field that is used in
various different ways.  The oldest drivers use it as an error count,
blk-mq and the generic timeout code assume that it holds a Linux
errno for block completions, and various drivers use it for internal
status values, often overwriting them with Linux errnos later,
that is unless they are processing passthrough requests in which
case they'll leave their errors in it.

This series kills the ->errors field and replaced it with new fields
in the drivers (or an odd hack in a union in struct request for
two bitrotting old drivers).

Note that this series expects that the patch to remove the mg_disk
driver has been applied already on top of the block for-next tree.

Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors

2017-04-06 Thread Varun Prakash

On Thu, Apr 06, 2017 at 09:58:58AM +0200, Christoph Hellwig wrote:
> Ok, the version below simplify skip the function split entirely:
> 
> ---
> From 7c9ca58f1d8cf53b42f14a51e02d0f3d0f12ab45 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Thu, 12 Jan 2017 11:17:29 +0100
> Subject: csiostor: switch to pci_alloc_irq_vectors
> 
> And get automatic MSI-X affinity for free.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/csiostor/csio_hw.h  |   1 -
>  drivers/scsi/csiostor/csio_isr.c | 128 
> ++-
>  2 files changed, 47 insertions(+), 82 deletions(-)
>

Looks good.

Acked-by: Varun Prakash

[PATCHv6 3/8] scsi: always send command aborts

2017-04-06 Thread Hannes Reinecke

When a command has timed out we always should be sending an
abort; with the previous code a failed abort might signal
SCSI EH to start, and all other timed out commands will
never be aborted, even though they might belong to a
different ITL nexus.

Cc: Benjamin Block 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
---
 drivers/scsi/scsi_error.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 370f6c0..cff7d9d 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -196,19 +196,7 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
return FAILED;
}
 
-   /*
-* Do not try a command abort if
-* SCSI EH has already started.
-*/
spin_lock_irqsave(shost->host_lock, flags);
-   if (scsi_host_in_recovery(shost)) {
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_INFO, scmd,
-   "not aborting, host in recovery\n"));
-   return FAILED;
-   }
-
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
spin_unlock_irqrestore(shost->host_lock, flags);
-- 
1.8.5.6

[PATCHv6 7/8] scsi: make asynchronous aborts mandatory

2017-04-06 Thread Hannes Reinecke

There hasn't been any reports for HBAs where asynchronous abort
would not work, so we should make it mandatory and remove
the fallback.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
---
 Documentation/scsi/scsi_eh.txt | 18 --
 drivers/scsi/scsi_error.c  | 81 --
 drivers/scsi/scsi_lib.c|  2 +-
 drivers/scsi/scsi_priv.h   |  3 +-
 include/scsi/scsi_host.h   |  5 ---
 5 files changed, 15 insertions(+), 94 deletions(-)

diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
index 4edb9c1c..11e447b 100644
--- a/Documentation/scsi/scsi_eh.txt
+++ b/Documentation/scsi/scsi_eh.txt
@@ -70,7 +70,7 @@ with the command.
scmd is requeued to blk queue.
 
  - otherwise
-   scsi_eh_scmd_add(scmd, 0) is invoked for the command.  See
+   scsi_eh_scmd_add(scmd) is invoked for the command.  See
[1-3] for details of this function.
 
 
@@ -103,9 +103,7 @@ function
 eh_timed_out() callback did not handle the command.
Step #2 is taken.
 
- 2. If the host supports asynchronous completion (as indicated by the
-no_async_abort setting in the host template) scsi_abort_command()
-is invoked to schedule an asynchrous abort.
+ 2. scsi_abort_command() is invoked to schedule an asynchrous abort.
 Asynchronous abort are not invoked for commands which the
 SCSI_EH_ABORT_SCHEDULED flag is set (this indicates that the command
 already had been aborted once, and this is a retry which failed),
@@ -127,16 +125,13 @@ function
 
  scmds enter EH via scsi_eh_scmd_add(), which does the following.
 
- 1. Turns on scmd->eh_eflags as requested.  It's 0 for error
-completions and SCSI_EH_CANCEL_CMD for timeouts.
+ 1. Links scmd->eh_entry to shost->eh_cmd_q
 
- 2. Links scmd->eh_entry to shost->eh_cmd_q
+ 2. Sets SHOST_RECOVERY bit in shost->shost_state
 
- 3. Sets SHOST_RECOVERY bit in shost->shost_state
+ 3. Increments shost->host_failed
 
- 4. Increments shost->host_failed
-
- 5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
+ 4. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
 
  As can be seen above, once any scmd is added to shost->eh_cmd_q,
 SHOST_RECOVERY shost_state bit is turned on.  This prevents any new
@@ -252,7 +247,6 @@ scmd->allowed.
 
  1. Error completion / time out
 ACTION: scsi_eh_scmd_add() is invoked for scmd
-   - set scmd->eh_eflags
- add scmd to shost->eh_cmd_q
- set SHOST_RECOVERY
- shost->host_failed++
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index bbc4318..53e3343 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -162,7 +162,7 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
}
}
 
-   scsi_eh_scmd_add(scmd, 0);
+   scsi_eh_scmd_add(scmd);
 }
 
 /**
@@ -221,9 +221,8 @@ static void scsi_eh_reset(struct scsi_cmnd *scmd)
 /**
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd:  scmd to run eh on.
- * @eh_flag:   optional SCSI_EH flag.
  */
-void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
+void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 {
struct Scsi_Host *shost = scmd->device->host;
unsigned long flags;
@@ -239,9 +238,6 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
 
-   if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
-   eh_flag &= ~SCSI_EH_CANCEL_CMD;
-   scmd->eh_eflags |= eh_flag;
scsi_eh_reset(scmd);
list_add_tail(>eh_entry, >eh_cmd_q);
shost->host_failed++;
@@ -275,10 +271,9 @@ enum blk_eh_timer_return scsi_times_out(struct request 
*req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_NOT_HANDLED) {
-   if (host->hostt->no_async_abort ||
-   scsi_abort_command(scmd) != SUCCESS) {
+   if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
-   scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD);
+   scsi_eh_scmd_add(scmd);
}
}
 
@@ -331,7 +326,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host 
*shost,
list_for_each_entry(scmd, work_q, eh_entry) {
if (scmd->device == sdev) {
++total_failures;
-   if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD)
+   if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
++cmd_cancel;
else
++cmd_failed;
@@

[PATCHv6 0/8] SCSI EH cleanup

2017-04-06 Thread Hannes Reinecke

Hi all,

this is a resend of a small patchset for cleaning up SCSI EH.
Primary goal is to make asynchronous aborts mandatory; there hasn't
been a single report so far where asynchronous abort won't work, so
the 'no_async_abort' flag has never been used and will be removed
with this patchset.
Additionally there's a cleanup to detect retries of failed commands and
to inline commands aborts.
I've also included the patch to correctly reset the media access
timeout counter to avoid it being triggered inadvertedly by several
commands timing out for the same device.

As usual, comments and reviews are welcome.

Changes to v1:
- Include reviews from Christoph

Changes to v2:
- Include reviews from Bart
- Add medium access timeout patch

Changes to v3:
- Add patch to inline command abort as per suggestion from Christoph
- Drop patch to not escalate failed EH commands

Changes to v4:
- Split off sdrv->eh_action into two functions

Changes to v5:
- Simplify 'scsi: inline command aborts'
- Add Reviewed-by tags

Christoph Hellwig (1):
  libsas: allow async aborts

Hannes Reinecke (7):
  scsi_error: count medium access timeout only once per EH run
  sd: Return SUCCESS in sd_eh_action() after device offline
  scsi: always send command aborts
  scsi: make eh_eflags persistent
  scsi: make scsi_eh_scmd_add() always succeed
  scsi: make asynchronous aborts mandatory
  scsi: inline command aborts

 Documentation/scsi/scsi_eh.txt  |  30 ++---
 drivers/scsi/libsas/sas_scsi_host.c |   5 -
 drivers/scsi/scsi.c |   2 -
 drivers/scsi/scsi_error.c   | 213 
 drivers/scsi/scsi_lib.c |   6 +-
 drivers/scsi/scsi_priv.h|   4 +-
 drivers/scsi/sd.c   |  29 -
 drivers/scsi/sd.h   |   1 +
 include/scsi/scsi_cmnd.h|   1 -
 include/scsi/scsi_driver.h  |   1 +
 include/scsi/scsi_eh.h  |   1 +
 include/scsi/scsi_host.h|   5 -
 12 files changed, 113 insertions(+), 185 deletions(-)

-- 
1.8.5.6

[PATCHv6 5/8] scsi: make eh_eflags persistent

2017-04-06 Thread Hannes Reinecke

If a failed command is retried and fails again we need
to enter SCSI EH, otherwise we will never be able to
recover the command.
To detect this situation we must not clear scmd->eh_eflags
when EH finishes but rather make it persistent throughout
the lifetime of the command.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Benjamin Block 
Reviewed-by: Bart Van Assche 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Christoph Hellwig 
---
 Documentation/scsi/scsi_eh.txt  | 14 +++---
 drivers/scsi/libsas/sas_scsi_host.c |  2 --
 drivers/scsi/scsi_error.c   |  4 ++--
 include/scsi/scsi_eh.h  |  1 +
 4 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
index 37eca00..4edb9c1c 100644
--- a/Documentation/scsi/scsi_eh.txt
+++ b/Documentation/scsi/scsi_eh.txt
@@ -105,11 +105,14 @@ function
 
  2. If the host supports asynchronous completion (as indicated by the
 no_async_abort setting in the host template) scsi_abort_command()
-is invoked to schedule an asynchrous abort. If that fails
-Step #3 is taken.
+is invoked to schedule an asynchrous abort.
+Asynchronous abort are not invoked for commands which the
+SCSI_EH_ABORT_SCHEDULED flag is set (this indicates that the command
+already had been aborted once, and this is a retry which failed),
+or when the EH deadline is expired. In these case Step #3 is taken.
 
- 2. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the
-command.  See [1-3] for more information.
+ 3. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the
+command.  See [1-4] for more information.
 
 [1-3] Asynchronous command aborts
 
@@ -263,7 +266,6 @@ scmd->allowed.
 
  3. scmd recovered
 ACTION: scsi_eh_finish_cmd() is invoked to EH-finish scmd
-   - clear scmd->eh_eflags
- scsi_setup_cmd_retry()
- move from local eh_work_q to local eh_done_q
 LOCKING: none
@@ -456,8 +458,6 @@ except for #1 must be implemented by eh_strategy_handler().
 
  - shost->host_failed is zero.
 
- - Each scmd's eh_eflags field is cleared.
-
  - Each scmd is in such a state that scsi_setup_cmd_retry() on the
scmd doesn't make any difference.
 
diff --git a/drivers/scsi/libsas/sas_scsi_host.c 
b/drivers/scsi/libsas/sas_scsi_host.c
index ee6b39a..87e5079 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -613,8 +613,6 @@ static void sas_eh_handle_sas_errors(struct Scsi_Host 
*shost, struct list_head *
SAS_DPRINTK("trying to find task 0x%p\n", task);
res = sas_scsi_find_task(task);
 
-   cmd->eh_eflags = 0;
-
switch (res) {
case TASK_IS_DONE:
SAS_DPRINTK("%s: task 0x%p is done\n", __func__,
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cff7d9d..4d26ff2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -188,7 +188,6 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
/*
 * Retry after abort failed, escalate to next level.
 */
-   scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd,
"previous abort failed\n"));
@@ -937,6 +936,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct 
scsi_eh_save *ses,
ses->result = scmd->result;
ses->underflow = scmd->underflow;
ses->prot_op = scmd->prot_op;
+   ses->eh_eflags = scmd->eh_eflags;
 
scmd->prot_op = SCSI_PROT_NORMAL;
scmd->eh_eflags = 0;
@@ -1000,6 +1000,7 @@ void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd, struct 
scsi_eh_save *ses)
scmd->result = ses->result;
scmd->underflow = ses->underflow;
scmd->prot_op = ses->prot_op;
+   scmd->eh_eflags = ses->eh_eflags;
 }
 EXPORT_SYMBOL(scsi_eh_restore_cmnd);
 
@@ -1132,7 +1133,6 @@ static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn)
  */
 void scsi_eh_finish_cmd(struct scsi_cmnd *scmd, struct list_head *done_q)
 {
-   scmd->eh_eflags = 0;
list_move_tail(>eh_entry, done_q);
 }
 EXPORT_SYMBOL(scsi_eh_finish_cmd);
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 98d366b..a25b328 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -31,6 +31,7 @@ extern int scsi_get_sense_info_fld(const u8 * sense_buffer, 
int sb_len,
 struct scsi_eh_save {
/* saved state */
int result;
+   int eh_eflags;
enum dma_data_direction data_direction;
unsigned underflow;
unsigned char cmd_len;
-- 
1.8.5.6

[PATCHv6 1/8] scsi_error: count medium access timeout only once per EH run

2017-04-06 Thread Hannes Reinecke

The current medium access timeout counter will be increased for
each command, so if there are enough failed commands we'll hit
the medium access timeout for even a single device failure and
the following kernel message is displayed:

sd H:C:T:L: [sdXY] Medium access timeout failure. Offlining disk!

Fix this by making the timeout per EH run, ie the counter will
only be increased once per device and EH run.

Fixes: 18a4d0a ("[SCSI] Handle disk devices which can not process medium access 
commands")
Cc: Ewan Milne 
Cc: Lawrence Obermann 
Cc: Benjamin Block 
Cc: Steffen Maier 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
---
 drivers/scsi/scsi_error.c  | 18 ++
 drivers/scsi/sd.c  | 27 ++-
 drivers/scsi/sd.h  |  1 +
 include/scsi/scsi_driver.h |  1 +
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f2cafae..370f6c0 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -221,6 +221,23 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
 }
 
 /**
+ * scsi_eh_reset - call into ->eh_action to reset internal counters
+ * @scmd:  scmd to run eh on.
+ *
+ * The scsi driver might be carrying internal state about the
+ * devices, so we need to call into the driver to reset the
+ * internal state once the error handler is started.
+ */
+static void scsi_eh_reset(struct scsi_cmnd *scmd)
+{
+   if (!blk_rq_is_passthrough(scmd->request)) {
+   struct scsi_driver *sdrv = scsi_cmd_to_driver(scmd);
+   if (sdrv->eh_reset)
+   sdrv->eh_reset(scmd);
+   }
+}
+
+/**
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd:  scmd to run eh on.
  * @eh_flag:   optional SCSI_EH flag.
@@ -249,6 +266,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
eh_flag &= ~SCSI_EH_CANCEL_CMD;
scmd->eh_eflags |= eh_flag;
+   scsi_eh_reset(scmd);
list_add_tail(>eh_entry, >eh_cmd_q);
shost->host_failed++;
scsi_eh_wakeup(shost);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d277e86..bd2a38e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -115,6 +115,7 @@
 static int sd_init_command(struct scsi_cmnd *SCpnt);
 static void sd_uninit_command(struct scsi_cmnd *SCpnt);
 static int sd_done(struct scsi_cmnd *);
+static void sd_eh_reset(struct scsi_cmnd *);
 static int sd_eh_action(struct scsi_cmnd *, int);
 static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer);
 static void scsi_disk_release(struct device *cdev);
@@ -532,6 +533,7 @@ static void sd_set_flush_flag(struct scsi_disk *sdkp)
.uninit_command = sd_uninit_command,
.done   = sd_done,
.eh_action  = sd_eh_action,
+   .eh_reset   = sd_eh_reset,
 };
 
 /*
@@ -1686,6 +1688,26 @@ static int sd_pr_clear(struct block_device *bdev, u64 
key)
 };
 
 /**
+ * sd_eh_reset - reset error handling callback
+ * @scmd:  sd-issued command that has failed
+ *
+ * This function is called by the SCSI midlayer before starting
+ * SCSI EH. When counting medium access failures we have to be
+ * careful to register it only only once per device and SCSI EH run;
+ * there might be several timed out commands which will cause the
+ * 'max_medium_access_timeouts' counter to trigger after the first
+ * SCSI EH run already and set the device to offline.
+ * So this function resets the internal counter before starting SCSI EH.
+ **/
+static void sd_eh_reset(struct scsi_cmnd *scmd)
+{
+   struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
+
+   /* New SCSI EH run, reset gate variable */
+   sdkp->ignore_medium_access_errors = false;
+}
+
+/**
  * sd_eh_action - error handling callback
  * @scmd:  sd-issued command that has failed
  * @eh_disp:   The recovery disposition suggested by the midlayer
@@ -1714,7 +1736,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int 
eh_disp)
 * process of recovering or has it suffered an internal failure
 * that prevents access to the storage medium.
 */
-   sdkp->medium_access_timed_out++;
+   if (!sdkp->ignore_medium_access_errors) {
+   sdkp->medium_access_timed_out++;
+   sdkp->ignore_medium_access_errors = true;
+   }
 
/*
 * If the device keeps failing read/write commands but TEST UNIT
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e..0cf9680 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -106,6 +106,7 @@ struct scsi_disk {
unsignedrc_basis: 2;
unsignedzoned: 2;

[PATCHv6 8/8] scsi: inline command aborts

2017-04-06 Thread Hannes Reinecke

The block layer always calls the timeout function from a workqueue
context, so there is no need to have yet another workqueue for
running command aborts.

Signed-off-by: Hannes Reinecke 
---
 drivers/scsi/scsi.c   |  2 --
 drivers/scsi/scsi_error.c | 83 +++
 drivers/scsi/scsi_lib.c   |  2 --
 drivers/scsi/scsi_priv.h  |  1 -
 include/scsi/scsi_cmnd.h  |  1 -
 5 files changed, 41 insertions(+), 48 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 7bfbcfa..fdec73e 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -115,8 +115,6 @@ void scsi_put_command(struct scsi_cmnd *cmd)
BUG_ON(list_empty(>list));
list_del_init(>list);
spin_unlock_irqrestore(>device->list_lock, flags);
-
-   BUG_ON(delayed_work_pending(>abort_work));
 }
 
 #ifdef CONFIG_SCSI_LOGGING
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 53e3343..7ce1268 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -115,11 +115,9 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
  * scmd_eh_abort_handler - Handle command aborts
  * @work:  command to be aborted.
  */
-void
-scmd_eh_abort_handler(struct work_struct *work)
+int
+scmd_eh_abort_handler(struct scsi_cmnd *scmd)
 {
-   struct scsi_cmnd *scmd =
-   container_of(work, struct scsi_cmnd, abort_work.work);
struct scsi_device *sdev = scmd->device;
int rtn;
 
@@ -127,42 +125,41 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd,
"eh timeout, not aborting\n"));
-   } else {
+   return FAILED;
+   }
+   SCSI_LOG_ERROR_RECOVERY(3,
+   scmd_printk(KERN_INFO, scmd,
+   "aborting command\n"));
+   rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
+   if (rtn != SUCCESS) {
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd,
-   "aborting command\n"));
-   rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
-   if (rtn == SUCCESS) {
-   set_host_byte(scmd, DID_TIME_OUT);
-   if (scsi_host_eh_past_deadline(sdev->host)) {
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_INFO, scmd,
-   "eh timeout, not retrying "
-   "aborted command\n"));
-   } else if (!scsi_noretry_cmd(scmd) &&
-   (++scmd->retries <= scmd->allowed)) {
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_WARNING, scmd,
-   "retry aborted command\n"));
-   scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
-   return;
-   } else {
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_WARNING, scmd,
-   "finish aborted 
command\n"));
-   scsi_finish_command(scmd);
-   return;
-   }
-   } else {
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_INFO, scmd,
-   "cmd abort %s\n",
-   (rtn == FAST_IO_FAIL) ?
-   "not send" : "failed"));
-   }
+   "cmd abort %s\n",
+   (rtn == FAST_IO_FAIL) ?
+   "not send" : "failed"));
+   return rtn;
}
-
-   scsi_eh_scmd_add(scmd);
+   set_host_byte(scmd, DID_TIME_OUT);
+   if (scsi_host_eh_past_deadline(sdev->host)) {
+   SCSI_LOG_ERROR_RECOVERY(3,
+   scmd_printk(KERN_INFO, scmd,
+   "eh timeout, not retrying "
+   "aborted command\n"));
+   return FAILED;
+   }
+   if (!scsi_noretry_cmd(scmd) &&
+  (++scmd->retries <= scmd->allowed)) {
+   SCSI_LOG_ERROR_RECOVERY(3,
+   scmd_printk(KERN_WARNING, scmd,
+   "retry aborted command\n"));
+   scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
+   } else {
+   SCSI_LOG_ERROR_RECOVERY(3,
+   scmd_printk(KERN_WARNING, scmd,
+   "finish

[PATCHv6 6/8] scsi: make scsi_eh_scmd_add() always succeed

2017-04-06 Thread Hannes Reinecke

scsi_eh_scmd_add() currently only will fail if no
error handler thread is started (which will never be the
case) or if the state machine encounters an illegal transition.

But if we're encountering an invalid state transition
chances is we cannot fixup things with the error handler.
So better add a WARN_ON for illegal host states and
make scsi_dh_scmd_add() a void function.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
---
 drivers/scsi/scsi_error.c | 41 +
 drivers/scsi/scsi_lib.c   |  4 ++--
 drivers/scsi/scsi_priv.h  |  2 +-
 3 files changed, 16 insertions(+), 31 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 4d26ff2..bbc4318 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -162,13 +162,7 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
*shost)
}
}
 
-   if (!scsi_eh_scmd_add(scmd, 0)) {
-   SCSI_LOG_ERROR_RECOVERY(3,
-   scmd_printk(KERN_WARNING, scmd,
-   "terminate aborted command\n"));
-   set_host_byte(scmd, DID_TIME_OUT);
-   scsi_finish_command(scmd);
-   }
+   scsi_eh_scmd_add(scmd, 0);
 }
 
 /**
@@ -228,28 +222,23 @@ static void scsi_eh_reset(struct scsi_cmnd *scmd)
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd:  scmd to run eh on.
  * @eh_flag:   optional SCSI_EH flag.
- *
- * Return value:
- * 0 on failure.
  */
-int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
+void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
 {
struct Scsi_Host *shost = scmd->device->host;
unsigned long flags;
-   int ret = 0;
+   int ret;
 
-   if (!shost->ehandler)
-   return 0;
+   WARN_ON_ONCE(!shost->ehandler);
 
spin_lock_irqsave(shost->host_lock, flags);
-   if (scsi_host_set_state(shost, SHOST_RECOVERY))
-   if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY))
-   goto out_unlock;
-
+   if (scsi_host_set_state(shost, SHOST_RECOVERY)) {
+   ret = scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY);
+   WARN_ON_ONCE(ret);
+   }
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
 
-   ret = 1;
if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
eh_flag &= ~SCSI_EH_CANCEL_CMD;
scmd->eh_eflags |= eh_flag;
@@ -257,9 +246,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
list_add_tail(>eh_entry, >eh_cmd_q);
shost->host_failed++;
scsi_eh_wakeup(shost);
- out_unlock:
spin_unlock_irqrestore(shost->host_lock, flags);
-   return ret;
 }
 
 /**
@@ -288,13 +275,11 @@ enum blk_eh_timer_return scsi_times_out(struct request 
*req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_NOT_HANDLED) {
-   if (!host->hostt->no_async_abort &&
-   scsi_abort_command(scmd) == SUCCESS)
-   return BLK_EH_NOT_HANDLED;
-
-   set_host_byte(scmd, DID_TIME_OUT);
-   if (!scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD))
-   rtn = BLK_EH_HANDLED;
+   if (host->hostt->no_async_abort ||
+   scsi_abort_command(scmd) != SUCCESS) {
+   set_host_byte(scmd, DID_TIME_OUT);
+   scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD);
+   }
}
 
return rtn;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba22866..ba8da90 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1593,8 +1593,8 @@ static void scsi_softirq_done(struct request *rq)
scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
break;
default:
-   if (!scsi_eh_scmd_add(cmd, 0))
-   scsi_finish_command(cmd);
+   scsi_eh_scmd_add(cmd, 0);
+   break;
}
 }
 
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 99bfc98..5be6cbf6 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -72,7 +72,7 @@ extern int scsi_dev_info_list_add_keyed(int compatible, char 
*vendor,
 extern int scsi_error_handler(void *host);
 extern int scsi_decide_disposition(struct scsi_cmnd *cmd);
 extern void scsi_eh_wakeup(struct Scsi_Host *shost);
-extern int scsi_eh_scmd_add(struct scsi_cmnd *, int);
+extern void scsi_eh_scmd_add(struct scsi_cmnd *, int);
 void scsi_eh_ready_devs(struct Scsi_Host *shost,
struct list_head *work_q,
struct list_head *done_q);
-- 
1.8.5.6

[PATCHv6 2/8] sd: Return SUCCESS in sd_eh_action() after device offline

2017-04-06 Thread Hannes Reinecke

If sd_eh_action() decides to take the device offline there is
no point in returning FAILED, as taking the device offline
is the ultimate step in SCSI EH anyway.
So further escalation via SCSI EH is not likely to make a
difference and we can as well return SUCCESS.

Cc: Benjamin Block 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bd2a38e..00b168b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1751,7 +1751,7 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int 
eh_disp)
"Medium access timeout failure. Offlining disk!\n");
scsi_device_set_state(scmd->device, SDEV_OFFLINE);
 
-   return FAILED;
+   return SUCCESS;
}
 
return eh_disp;
-- 
1.8.5.6

[PATCHv6 4/8] libsas: allow async aborts

2017-04-06 Thread Hannes Reinecke

From: Christoph Hellwig 

We now first try to call ->eh_abort_handler from a work queue, but libsas
was always failing that for no good reason.  Allow async aborts.

Reviewed-by: Johannes Thumshirn 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/libsas/sas_scsi_host.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c 
b/drivers/scsi/libsas/sas_scsi_host.c
index 9bd55bc..ee6b39a 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -491,9 +491,6 @@ int sas_eh_abort_handler(struct scsi_cmnd *cmd)
struct Scsi_Host *host = cmd->device->host;
struct sas_internal *i = to_sas_internal(host->transportt);
 
-   if (current != host->ehandler)
-   return FAILED;
-
if (!i->dft->lldd_abort_task)
return FAILED;
 
-- 
1.8.5.6

hpsa patches in mkp/4.12/scsi-queue and mkp/4.11/scsi-fixes

2017-04-06 Thread Martin Wilck

Hi Martin,

I noticed that the following commits

eb94588dabec scsi: hpsa: fix volume offline state
2ef288498087 scsi: hpsa: do not timeout reset operations
87b9e6aa87d9 scsi: hpsa: limit outstanding rescans
85b29008d8af scsi: hpsa: update check for logical volume status

are included in mkp/4.11/scsi-fixes but not in mkp/4.12/scsi-queue.
Is there a specific reason for that?

Best regards
Martin W

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

Re: [Nbd] [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers

2017-04-06 Thread Mel Gorman

On Thu, Apr 06, 2017 at 08:38:10AM +0200, Wouter Verhelst wrote:
> On Wed, Apr 05, 2017 at 01:30:31PM +0200, Michal Hocko wrote:
> > On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> > > We now have memalloc_noreclaim_{save,restore} helpers for robust setting 
> > > and
> > > clearing of PF_MEMALLOC. Let's convert the code which was using the 
> > > generic
> > > tsk_restore_flags(). No functional change.
> > 
> > It would be really great to revisit why those places outside of the mm
> > proper really need this flag. I know this is a painful exercise but I
> > wouldn't be surprised if there were abusers there.
> [...]
> > > ---
> > >  drivers/block/nbd.c  | 7 ---
> > >  drivers/scsi/iscsi_tcp.c | 7 ---
> > >  net/core/dev.c   | 7 ---
> > >  net/core/sock.c  | 7 ---
> > >  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> These were all done to make swapping over network safe. The idea is that
> if a socket has SOCK_MEMALLOC set, incoming packets for that socket can
> access PFMEMALLOC reserves (whereas other sockets cannot); this all in
> the hope that one packe destined to that socket will contain the TCP ACK
> that confirms the swapout was successful and we can now release RAM
> pages for other processes.
> 
> I don't know whether they need the PF_MEMALLOC flag specifically (not a
> kernel hacker), but they do need to interact with it at any rate.
> 

At the time it was required to get access to emergency reserves so swapping
can continue. The flip side is that the memory is then protected so pages
allocated from emergency reserves are not used for network traffic that
is not involved with swap. This means that under heavy swap load, it was
perfectly possible for unrelated traffic to get dropped for quite some
time.

-- 
Mel Gorman
SUSE Labs

[PATCH] scsi: qla2xxx: remove some redundant pointer assignments

2017-04-06 Thread Colin King

From: Colin Ian King 

There are several local or function parameter pointers that are
being assigned NULL after a kfree where and these have no effect
and hence can be removed.

Fixes various cppcheck warnings:

"Assignment of function parameter has no effect outside the
function. Did you forget dereferencing it"

Signed-off-by: Colin Ian King 
---
 drivers/scsi/qla2xxx/qla_os.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 83d61d2142e9..1c7957903283 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -423,7 +423,6 @@ static void qla2x00_free_req_que(struct qla_hw_data *ha, 
struct req_que *req)
kfree(req->outstanding_cmds);
 
kfree(req);
-   req = NULL;
 }
 
 static void qla2x00_free_rsp_que(struct qla_hw_data *ha, struct rsp_que *rsp)
@@ -439,7 +438,6 @@ static void qla2x00_free_rsp_que(struct qla_hw_data *ha, 
struct rsp_que *rsp)
rsp->ring, rsp->dma);
}
kfree(rsp);
-   rsp = NULL;
 }
 
 static void qla2x00_free_queues(struct qla_hw_data *ha)
@@ -653,7 +651,6 @@ qla2x00_sp_free_dma(void *ptr)
ha->gbl_dsd_inuse -= ctx1->dsd_use_cnt;
ha->gbl_dsd_avail += ctx1->dsd_use_cnt;
mempool_free(ctx1, ha->ctx_mempool);
-   ctx1 = NULL;
}
 
CMD_SP(cmd) = NULL;
@@ -3256,7 +3253,6 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct 
pci_device_id *id)
}
pci_release_selected_regions(ha->pdev, ha->bars);
kfree(ha);
-   ha = NULL;
 
 probe_out:
pci_disable_device(pdev);
@@ -3504,7 +3500,6 @@ qla2x00_remove_one(struct pci_dev *pdev)
 
pci_release_selected_regions(ha->pdev, ha->bars);
kfree(ha);
-   ha = NULL;
 
pci_disable_pcie_error_reporting(pdev);
 
@@ -3568,7 +3563,6 @@ void qla2x00_free_fcports(struct scsi_qla_host *vha)
list_del(>list);
qla2x00_clear_loop_id(fcport);
kfree(fcport);
-   fcport = NULL;
}
 }
 
-- 
2.11.0

Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-06 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 07:18:08PM +0200, Christoph Hellwig wrote:
> ->retries is counting the number of times a command is resubmitted, and
> be cleared on the first time we see the command.  We currently don't do
> that for non-PCIe command, which is easily fixed by moving the setup
> to common code.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Thanks a lot.

Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 1/2] block: Implement global tagset

2017-04-06 Thread Hannes Reinecke

On 04/06/2017 08:27 AM, Arun Easi wrote:
> Hi Hannes,
> 
> Thanks for taking a crack at the issue. My comments below..
> 
> On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:
> 
>> Most legacy HBAs have a tagset per HBA, not per queue. To map
>> these devices onto block-mq this patch implements a new tagset
>> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
>> to use just one tagset for all hardware queues.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  block/blk-mq-tag.c | 12 
>>  block/blk-mq.c | 10 --
>>  include/linux/blk-mq.h |  1 +
>>  3 files changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index e48bc2c..a14e76c 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
>> *tags,
>>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  busy_tag_iter_fn *fn, void *priv)
>>  {
>> -int i;
>> +int i, lim = tagset->nr_hw_queues;
>>  
>> -for (i = 0; i < tagset->nr_hw_queues; i++) {
>> +if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
>> +lim = 1;
>> +for (i = 0; i < lim; i++) {
>>  if (tagset->tags && tagset->tags[i])
>>  blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>>  }
>> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
>> *tagset,
>>  
>>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
>>  {
>> -int i, j, ret = 0;
>> +int i, j, ret = 0, lim = set->nr_hw_queues;
>>  
>>  if (!set->ops->reinit_request)
>>  goto out;
>>  
>> -for (i = 0; i < set->nr_hw_queues; i++) {
>> +if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
>> +lim = 1;
>> +for (i = 0; i < lim; i++) {
>>  struct blk_mq_tags *tags = set->tags[i];
>>  
>>  for (j = 0; j < tags->nr_tags; j++) {
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..db96ed0 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct 
>> blk_mq_tag_set *set, int hctx_idx)
>>  {
>>  int ret = 0;
>>  
>> +if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
>> +set->tags[hctx_idx] = set->tags[0];
>> +return true;
>> +}
> 
> So, this effectively make all request allocations to the same NUMA node 
> locality of the hctx_idx 0, correct? Is the performance hit you were 
> talking about in the cover letter?
> 
Yes. It does make the request allocations local to NUMA node 0, but then
this will only affect LLDDs which are actually _using_ NUMA locality
when allocating the request nodes.
However, SCSI doesn't set a NUMA node locality to begin with, so this
doesn't affect us.

No, what I meant is this:
the 'sbitmap' allocator already splits up the bitmap into several words,
which then should provide a better NUMA locality per map.
When we're using a shared global map it's unclear whether the individual
words of the sbitmap can and will be moved to the various NUMA nodes, or
whether we suffer from non-locality.

My tests so far have been inconclusive; but then I'm not happy with the
testcase anyway (using null_blk I only get 250k/250k r/w IOPs, which I
found rather disappointing).

> Do you have any other alternatives in mind? Dynamic growing/shrinking 
> tags/request-pool in hctx with a fixed base as start?
> 
> One alternative that comes to my mind is to move the divvy up logic to 
> SCSI (instead of LLD doing it), IOW:
> 
> 1. Have SCSI set tag_set.queue_depth to can_queue/nr_hw_queues
> 2. Have blk_mq_unique_tag() (or a new i/f) returning "hwq * nr_hw_queue + 
>rq->tag"
> 
> That would make the tags linear in the can_queue space, but could result 
> in poor use of LLD resource if a given hctx has used up all it's tags.
> 
Exactly. This is the method I used for implementing mq support for lpfc
and mpt3sas; however the complaint there indeed was that we might be
running into a tag starvation scenario with a large number of LUNs and
single-threaded I/O submission.

> On a related note, would not the current use of can_queue in SCSI lead to 
> poor resource utilization in MQ cases? Like, block layer allocating 
> nr_hw_queues * tags+request+driver_data.etc * can_queue, but SCSI limiting 
> the number of requests to can_queue.
> 
Yes, indeed. That's another problem which we should be looking at.
However, it's only ever relevant if we indeed implement some divvying
logic; if we move to the shared tags approach it should work as designed.

> BTW, if you would like me to try out this patch on my setup, please let me 
> know.
> 
Oh, yes. Please do.

Cheers,

Hannes
-- 
Dr. Hannes ReineckezSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D.

Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-06 Thread Christoph Hellwig

On Thu, Apr 06, 2017 at 04:35:56PM +0800, 廖亨权 wrote:
> Hi, Guys,
>   I want to ask if there is any plan to plant the NVMe driver to 
> Vxworks OS?Thank  you so much.---end quoted text---

The Linux NVMe team has no plans for a Vxworks NVMe driver at the moment.

Re: [PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-06 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 09:52:50AM -0700, Bart Van Assche wrote:
> Now that all scsi_device_get() callers check the return value of this
> function, make checking that return value mandatory.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-06 Thread Johannes Thumshirn

On Thu, Apr 06, 2017 at 12:30:43AM +, Bart Van Assche wrote:
> On Thu, 2017-04-06 at 08:27 +0800, kbuild test robot wrote:
> > All warnings (new ones prefixed by >>):
> > 
> >drivers//scsi/osd/osd_uld.c: In function 'osd_probe':
> > > > drivers//scsi/osd/osd_uld.c:467:2: warning: ignoring return value of 
> > > > 'scsi_device_get', declared with attribute warn_unused_result 
> > > > [-Wunused-result]
> > 
> >  scsi_device_get(scsi_device);
> >  ^~~~
> 
> Please ignore this warning. It is triggered because patch "osd_uld: Check
> scsi_device_get() return value" is not yet upstream.

And verified this patch is valuable ;-)

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

[PATCH] sd: Explicit type cast to fix calculating rw_max

2017-04-06 Thread Fam Zheng

Some compilers don't like BLK_DEF_MAX_SECTORS being an enum (int) when
expanding min_not_zero. Cast it to sector_t so it matches the type of
the other operand, logical_to_sectors().

Signed-off-by: Fam Zheng 
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ab9011a..8d2315a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2957,7 +2957,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
} else
rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
- BLK_DEF_MAX_SECTORS);
+ (sector_t)BLK_DEF_MAX_SECTORS);
 
/* Combine with controller limits */
q->limits.max_sectors = min(rw_max, queue_max_hw_sectors(q));
-- 
2.9.3

Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors

2017-04-06 Thread Christoph Hellwig

Ok, the version below simplify skip the function split entirely:

---
>From 7c9ca58f1d8cf53b42f14a51e02d0f3d0f12ab45 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Thu, 12 Jan 2017 11:17:29 +0100
Subject: csiostor: switch to pci_alloc_irq_vectors

And get automatic MSI-X affinity for free.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/csiostor/csio_hw.h  |   1 -
 drivers/scsi/csiostor/csio_isr.c | 128 ++-
 2 files changed, 47 insertions(+), 82 deletions(-)

diff --git a/drivers/scsi/csiostor/csio_hw.h b/drivers/scsi/csiostor/csio_hw.h
index 029bef82c057..62758e830d3b 100644
--- a/drivers/scsi/csiostor/csio_hw.h
+++ b/drivers/scsi/csiostor/csio_hw.h
@@ -95,7 +95,6 @@ enum {
 };
 
 struct csio_msix_entries {
-   unsigned short  vector; /* Assigned MSI-X vector */
void*dev_id;/* Priv object associated w/ this msix*/
chardesc[24];   /* Description of this vector */
 };
diff --git a/drivers/scsi/csiostor/csio_isr.c b/drivers/scsi/csiostor/csio_isr.c
index 2fb71c6c3b37..7c8814715711 100644
--- a/drivers/scsi/csiostor/csio_isr.c
+++ b/drivers/scsi/csiostor/csio_isr.c
@@ -383,17 +383,15 @@ csio_request_irqs(struct csio_hw *hw)
int rv, i, j, k = 0;
struct csio_msix_entries *entryp = >msix_entries[0];
struct csio_scsi_cpu_info *info;
+   struct pci_dev *pdev = hw->pdev;
 
if (hw->intr_mode != CSIO_IM_MSIX) {
-   rv = request_irq(hw->pdev->irq, csio_fcoe_isr,
-   (hw->intr_mode == CSIO_IM_MSI) ?
-   0 : IRQF_SHARED,
-   KBUILD_MODNAME, hw);
+   rv = request_irq(pci_irq_vector(pdev, 0), csio_fcoe_isr,
+   hw->intr_mode == CSIO_IM_MSI ? 0 : IRQF_SHARED,
+   KBUILD_MODNAME, hw);
if (rv) {
-   if (hw->intr_mode == CSIO_IM_MSI)
-   pci_disable_msi(hw->pdev);
csio_err(hw, "Failed to allocate interrupt line.\n");
-   return -EINVAL;
+   goto out_free_irqs;
}
 
goto out;
@@ -402,22 +400,22 @@ csio_request_irqs(struct csio_hw *hw)
/* Add the MSIX vector descriptions */
csio_add_msix_desc(hw);
 
-   rv = request_irq(entryp[k].vector, csio_nondata_isr, 0,
+   rv = request_irq(pci_irq_vector(pdev, k), csio_nondata_isr, 0,
 entryp[k].desc, hw);
if (rv) {
csio_err(hw, "IRQ request failed for vec %d err:%d\n",
-entryp[k].vector, rv);
-   goto err;
+pci_irq_vector(pdev, k), rv);
+   goto out_free_irqs;
}
 
-   entryp[k++].dev_id = (void *)hw;
+   entryp[k++].dev_id = hw;
 
-   rv = request_irq(entryp[k].vector, csio_fwevt_isr, 0,
+   rv = request_irq(pci_irq_vector(pdev, k), csio_fwevt_isr, 0,
 entryp[k].desc, hw);
if (rv) {
csio_err(hw, "IRQ request failed for vec %d err:%d\n",
-entryp[k].vector, rv);
-   goto err;
+pci_irq_vector(pdev, k), rv);
+   goto out_free_irqs;
}
 
entryp[k++].dev_id = (void *)hw;
@@ -429,51 +427,31 @@ csio_request_irqs(struct csio_hw *hw)
struct csio_scsi_qset *sqset = >sqset[i][j];
struct csio_q *q = hw->wrm.q_arr[sqset->iq_idx];
 
-   rv = request_irq(entryp[k].vector, csio_scsi_isr, 0,
+   rv = request_irq(pci_irq_vector(pdev, k), 
csio_scsi_isr, 0,
 entryp[k].desc, q);
if (rv) {
csio_err(hw,
   "IRQ request failed for vec %d err:%d\n",
-  entryp[k].vector, rv);
-   goto err;
+  pci_irq_vector(pdev, k), rv);
+   goto out_free_irqs;
}
 
-   entryp[k].dev_id = (void *)q;
+   entryp[k].dev_id = q;
 
} /* for all scsi cpus */
} /* for all ports */
 
 out:
hw->flags |= CSIO_HWF_HOST_INTR_ENABLED;
-
return 0;
 
-err:
-   for (i = 0; i < k; i++) {
-   entryp = >msix_entries[i];
-   free_irq(entryp->vector, entryp->dev_id);
-   }
-   pci_disable_msix(hw->pdev);
-
+out_free_irqs:
+   for (i = 0; i < k; i++)
+   free_irq(pci_irq_vector(pdev, i), hw->msix_entries[i].dev_id);
+   pci_free_irq_vectors(hw->pdev);
return -EINVAL;
 }
 
-static void
-csio_disable_msix(struct csio_hw

Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-06 Thread Michal Hocko

On Thu 06-04-17 09:33:44, Adrian Hunter wrote:
> On 05/04/17 14:39, Vlastimil Babka wrote:
> > On 04/05/2017 01:36 PM, Richard Weinberger wrote:
> >> Michal,
> >>
> >> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
> >>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
>  Nandsim has own functions set_memalloc() and clear_memalloc() for robust
>  setting and clearing of PF_MEMALLOC. Replace them by the new generic 
>  helpers.
>  No functional change.
> >>>
> >>> This one smells like an abuser. Why the hell should read/write path
> >>> touch memory reserves at all!
> >>
> >> Could be. Let's ask Adrian, AFAIK he wrote that code.
> >> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> > 
> > I was thinking about it and concluded that since the simulator can be
> > used as a block device where reclaimed pages go to, writing the data out
> > is a memalloc operation. Then reading can be called as part of r-m-w
> > cycle, so reading as well.
> 
> IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
> and memory reclaim waiting on nandsim.

I've got lost in the indirection. Could you describe how would reclaim
get stuck waiting on these paths please?

-- 
Michal Hocko
SUSE Labs

Re: [Nbd] [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers

2017-04-06 Thread Wouter Verhelst

On Wed, Apr 05, 2017 at 01:30:31PM +0200, Michal Hocko wrote:
> On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> > We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
> > clearing of PF_MEMALLOC. Let's convert the code which was using the generic
> > tsk_restore_flags(). No functional change.
> 
> It would be really great to revisit why those places outside of the mm
> proper really need this flag. I know this is a painful exercise but I
> wouldn't be surprised if there were abusers there.
[...]
> > ---
> >  drivers/block/nbd.c  | 7 ---
> >  drivers/scsi/iscsi_tcp.c | 7 ---
> >  net/core/dev.c   | 7 ---
> >  net/core/sock.c  | 7 ---
> >  4 files changed, 16 insertions(+), 12 deletions(-)

These were all done to make swapping over network safe. The idea is that
if a socket has SOCK_MEMALLOC set, incoming packets for that socket can
access PFMEMALLOC reserves (whereas other sockets cannot); this all in
the hope that one packe destined to that socket will contain the TCP ACK
that confirms the swapout was successful and we can now release RAM
pages for other processes.

I don't know whether they need the PF_MEMALLOC flag specifically (not a
kernel hacker), but they do need to interact with it at any rate.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-06 Thread Adrian Hunter

On 05/04/17 14:39, Vlastimil Babka wrote:
> On 04/05/2017 01:36 PM, Richard Weinberger wrote:
>> Michal,
>>
>> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
>>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
 Nandsim has own functions set_memalloc() and clear_memalloc() for robust
 setting and clearing of PF_MEMALLOC. Replace them by the new generic 
 helpers.
 No functional change.
>>>
>>> This one smells like an abuser. Why the hell should read/write path
>>> touch memory reserves at all!
>>
>> Could be. Let's ask Adrian, AFAIK he wrote that code.
>> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> 
> I was thinking about it and concluded that since the simulator can be
> used as a block device where reclaimed pages go to, writing the data out
> is a memalloc operation. Then reading can be called as part of r-m-w
> cycle, so reading as well.

IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
and memory reclaim waiting on nandsim.

Re: [PATCH 1/2] block: Implement global tagset

2017-04-06 Thread Arun Easi

Hi Hannes,

Thanks for taking a crack at the issue. My comments below..

On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:

> Most legacy HBAs have a tagset per HBA, not per queue. To map
> these devices onto block-mq this patch implements a new tagset
> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
> to use just one tagset for all hardware queues.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  block/blk-mq-tag.c | 12 
>  block/blk-mq.c | 10 --
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index e48bc2c..a14e76c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
> *tags,
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   busy_tag_iter_fn *fn, void *priv)
>  {
> - int i;
> + int i, lim = tagset->nr_hw_queues;
>  
> - for (i = 0; i < tagset->nr_hw_queues; i++) {
> + if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
> + lim = 1;
> + for (i = 0; i < lim; i++) {
>   if (tagset->tags && tagset->tags[i])
>   blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>   }
> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
> *tagset,
>  
>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
>  {
> - int i, j, ret = 0;
> + int i, j, ret = 0, lim = set->nr_hw_queues;
>  
>   if (!set->ops->reinit_request)
>   goto out;
>  
> - for (i = 0; i < set->nr_hw_queues; i++) {
> + if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
> + lim = 1;
> + for (i = 0; i < lim; i++) {
>   struct blk_mq_tags *tags = set->tags[i];
>  
>   for (j = 0; j < tags->nr_tags; j++) {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..db96ed0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct 
> blk_mq_tag_set *set, int hctx_idx)
>  {
>   int ret = 0;
>  
> + if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
> + set->tags[hctx_idx] = set->tags[0];
> + return true;
> + }

So, this effectively make all request allocations to the same NUMA node 
locality of the hctx_idx 0, correct? Is the performance hit you were 
talking about in the cover letter?

Do you have any other alternatives in mind? Dynamic growing/shrinking 
tags/request-pool in hctx with a fixed base as start?

One alternative that comes to my mind is to move the divvy up logic to 
SCSI (instead of LLD doing it), IOW:

1. Have SCSI set tag_set.queue_depth to can_queue/nr_hw_queues
2. Have blk_mq_unique_tag() (or a new i/f) returning "hwq * nr_hw_queue + 
   rq->tag"

That would make the tags linear in the can_queue space, but could result 
in poor use of LLD resource if a given hctx has used up all it's tags.

On a related note, would not the current use of can_queue in SCSI lead to 
poor resource utilization in MQ cases? Like, block layer allocating 
nr_hw_queues * tags+request+driver_data.etc * can_queue, but SCSI limiting 
the number of requests to can_queue.

BTW, if you would like me to try out this patch on my setup, please let me 
know.

Regards,
-Arun

>   set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
>   set->queue_depth, set->reserved_tags);
>   if (!set->tags[hctx_idx])
> @@ -2080,8 +2084,10 @@ static void blk_mq_free_map_and_requests(struct 
> blk_mq_tag_set *set,
>unsigned int hctx_idx)
>  {
>   if (set->tags[hctx_idx]) {
> - blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> - blk_mq_free_rq_map(set->tags[hctx_idx]);
> + if (!(set->flags & BLK_MQ_F_GLOBAL_TAGS) || hctx_idx == 0) {
> + blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> + blk_mq_free_rq_map(set->tags[hctx_idx]);
> + }
>   set->tags[hctx_idx] = NULL;
>   }
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b296a90..eee27b016 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -155,6 +155,7 @@ enum {
>   BLK_MQ_F_DEFER_ISSUE= 1 << 4,
>   BLK_MQ_F_BLOCKING   = 1 << 5,
>   BLK_MQ_F_NO_SCHED   = 1 << 6,
> + BLK_MQ_F_GLOBAL_TAGS= 1 << 7,
>   BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
>   BLK_MQ_F_ALLOC_POLICY_BITS = 1,
>  
>

Re: [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards

2017-04-06 Thread Hannes Reinecke

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" 
> 
> Separating discards and zeroout operations allows us to remove the LBPRZ
> block zeroing constraints from discards and honor the device preferences
> for UNMAP commands.
> 
> If supported by the device, we'll also choose UNMAP over one of the
> WRITE SAME variants for discards.
> 
> Signed-off-by: Martin K. Petersen 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/sd.c | 25 ++---
>  1 file changed, 6 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices

2017-04-06 Thread Hannes Reinecke

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" 
> 
> Now that zeroout and discards are distinct operations we need to
> separate the policy of choosing the appropriate command. Create a
> zeroing_mode which can be one of:
> 
> write:Zeroout assist not present, use regular WRITE
> writesame:Allow WRITE SAME(10/16) with a zeroed payload
> writesame_16_unmap:   Allow WRITE SAME(16) with UNMAP
> writesame_10_unmap:   Allow WRITE SAME(10) with UNMAP
> 
> The last two are conditional on the device being thin provisioned with
> LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.
> 
> Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
> if none of the _unmap variants are supported, regular WRITE SAME will be
> used if the device supports it.
> 
> The zeroout_mode is exported in sysfs and the detected mode for a given
> device can be overridden using the string constants above.
> 
> With this change in place we can now issue WRITE SAME(16) with UNMAP set
> for block zeroing applications that require hard guarantees and
> logical_block_size granularity. And at the same time use the UNMAP
> command with the device's preferred granulary and alignment for discard
> operations.
> 
> Signed-off-by: Martin K. Petersen 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/sd.c | 56 
> ---
>  drivers/scsi/sd.h |  8 
>  2 files changed, 61 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

76 matches

Mail list logo