Re: [PATCH v3 1/2] scsi: scsi_transport_fc: Add dummy initiator role to rport

2017-04-18 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v3 2/2] scsi: storvsc: Add support for FC rport.

2017-04-18 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: block: add a error_count field to struct request

2017-04-18 Thread h...@lst.de
On Tue, Apr 18, 2017 at 10:57:11PM +, Bart Van Assche wrote:
> Both blk-mq and the traditional block layer support a .cmd_size field to
> make the block layer core allocate driver-specific data at the end of struct
> request. Could that mechanism have been used for the error_count field?

It could, and that's what I did for the modern drivers.  It would have
been a bit of a pain for these old floppy drivers, though.
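
For reference, a minimal sketch of how that mechanism is used on the
blk-mq side; struct floppy_pdu and the function names are invented for
illustration, while blk_mq_tag_set.cmd_size and blk_mq_rq_to_pdu() are
the real block layer interfaces:

struct floppy_pdu {
	int error_count;
};

static int floppy_setup_tag_set(struct blk_mq_tag_set *set)
{
	/* ask the core to allocate this many extra bytes per request */
	set->cmd_size = sizeof(struct floppy_pdu);
	return blk_mq_alloc_tag_set(set);
}

static void floppy_note_error(struct request *rq)
{
	/* the driver payload lives directly behind struct request */
	struct floppy_pdu *pdu = blk_mq_rq_to_pdu(rq);

	pdu->error_count++;
}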


Re: scsi: introduce a new result field in struct scsi_request

2017-04-18 Thread h...@lst.de
On Tue, Apr 18, 2017 at 10:34:20PM +, Bart Van Assche wrote:
> Did you perhaps intend "req->result" instead of "rq->result"?

Yes.

> Did you intend "war" or is that perhaps a typo?

I'll fix the comment.

> > trace_scsi_dispatch_cmd_done(cmd);
> > -   blk_mq_complete_request(cmd->request, cmd->request->errors);
> > +   blk_mq_complete_request(cmd->request, 0);
> >  }
> 
> Why has cmd->request->errors been changed into 0?

Because the argument is only used to set req->errors, which we won't
look at any more.


Re: block: remove the blk_execute_rq return value

2017-04-18 Thread h...@lst.de
On Tue, Apr 18, 2017 at 10:24:15PM +, Bart Van Assche wrote:
> On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> > diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
> > index b05e151c9b38..82c6d02193ae 100644
> > --- a/drivers/block/paride/pd.c
> > +++ b/drivers/block/paride/pd.c
> > @@ -747,7 +747,8 @@ static int pd_special_command(struct pd_unit *disk,
> >  
> > rq->special = func;
> >  
> > -   err = blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
> > +   blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
> > +   err = req->errors ? -EIO : 0;
> >  
> > blk_put_request(rq);
> > return err;
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but shouldn't "req" be changed into
> "rq"? Otherwise this patch looks fine to me. If this comment gets addressed
> you can add:

It should.  But it seems no one caught this because the check gets
removed later in the series by "pd: remove bogus check for req->errors",
so I should just move that patch earlier in the series.
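
For clarity, the corrected hunk would read:

	blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
	err = rq->errors ? -EIO : 0;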


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason


Re: [PATCH] scsi: lpfc: fix potential buffer overflow.

2017-04-18 Thread Martin K. Petersen
Maurizio Lombardi  writes:

> This patch fixes a potential buffer overflow in lpfc_nvme_info_show().

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi:cxgb4i: update module description

2017-04-18 Thread Martin K. Petersen
Varun Prakash  writes:

Applied to 4.12/scsi-queue, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi: fc: remove redundant check of an unsigned long being less than zero

2017-04-18 Thread Martin K. Petersen
Colin King  writes:

> The check for an unsigned long being less than zero is always false so
> it is a redundant check and can be removed.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering
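
As a two-line illustration of the class of dead code such patches
remove (the variable name is made up):

	unsigned long expires = jiffies + timeout;

	if (expires < 0)	/* always false: unsigned values are never negative */
		return -EINVAL;	/* dead branch; gcc -Wtype-limits flags it */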


Re: [PATCH] scsi: ibmvfc: don't check for failure from mempool_alloc()

2017-04-18 Thread Martin K. Petersen
NeilBrown  writes:

> mempool_alloc() cannot fail when passed GFP_NOIO or any other gfp
> setting that is permitted to sleep.  So remove this pointless code.

Applied to 4.12/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering
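
For reference, a minimal sketch of the pattern being deleted (names are
illustrative): mempool_alloc() with a gfp mask that allows sleeping
waits for a free element instead of returning NULL, so the error branch
below can never run.

	evt = mempool_alloc(pool, GFP_NOIO);
	if (!evt)		/* cannot happen for a sleeping allocation */
		return NULL;	/* ...so this branch is simply removed */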


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entries, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two ranges (or worse, if every SGL created a range it would be 255)
> is very undesired.

I think the easiest way to get us started is to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with I/O ops containing an sglist
that is a collection of system memory and GPU pages, for example.

Cheers,
Ben.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > a
> > is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"

What do you think?

Cheers,
Ben.
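
A rough sketch of one possible shape for such a hook; every identifier
below is hypothetical and only encodes the four outcomes listed above:

enum p2p_map_action {
	P2P_MAP_BUS_DIRECT,	/* alternate bus address, don't map in the iommu */
	P2P_MAP_BUS_IOMMU,	/* alternate bus address, do map in the iommu */
	P2P_MAP_NORMAL,		/* proceed as normal */
	P2P_MAP_FAIL,		/* fail the mapping */
};

/* hung off struct dev_pagemap and called by the arch dma_map_sg for
 * each page an is_zone_device_page()-style test flags as
 * device-backed; *bus_addr is filled in for the two BUS cases */
typedef enum p2p_map_action (*p2p_map_page_fn)(struct dev_pagemap *pgmap,
					       struct device *initiator,
					       struct page *page,
					       dma_addr_t *bus_addr);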



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote:
> > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?
> 
> The main conflict is in dma_map_sg which only does get_dma_ops once but
> the sg may contain memory of different types.

We can handle that in our "overridden" dma ops.

It's a bit tricky, but they *could* break the list down into segments
and forward portions back to the original dma ops.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote:
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)

We also need to be careful whether P2P is enabled in the switch
or not.

Cheers,
Ben.



Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread James Bottomley
On Wed, 2017-04-19 at 00:02 +, Bart Van Assche wrote:
> On Tue, 2017-04-18 at 16:56 -0700, James Bottomley wrote:
> > Actually, I think 3/3 is wrong.  You need to start the queue as 
> > soon as we see it's blocked and the device has gone into DEL.  This 
> > is safe because DEL ensures nothing ever gets from the generic 
> > request function to queuecommand, so the device driver never sees
> > anything.
> 
> Hello James,
> 
> I agree that restarting the queue earlier is safe but I do not agree 
> that patch 3/3 is wrong. Can you explain why you think that patch 3/3 
> is wrong?

The shutdown occurs in driver removal, which precedes the queue start
in 3/3.  To avoid hangs and timeouts, we need the queue to be started
before the driver shutdown tries to inject commands.  They're not
fatal, but they will likely show up as annoying messages (and time
delays) which we can avoid by starting the queue the moment we go into
DEL.

James




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other". The "helper" is the
> dma api redirecting through a software-iommu that handles bus address
> translation differently than it would handle host memory dma mapping.

Do we even need that function? The dma_ops have a dma_supported()
call...

If we have those override ops built into the "dma_target" object,
then these things can make that decision knowing both the source
and target device.

Cheers,
Ben.
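
A sketch of that idea in code; struct dma_target and peers_can_dma()
are hypothetical names, used only to show that this decision site sees
both endpoints:

struct dma_target {
	struct device *provider;		/* device exporting the memory */
	const struct dma_map_ops *override_ops;	/* peer-aware ops */
};

/* unlike dma_map_ops.dma_supported(), which is handed only the
 * device initiating the DMA, this check knows both ends */
static bool dma_target_supported(struct dma_target *target,
				 struct device *initiator)
{
	return peers_can_dma(initiator, target->provider); /* hypothetical */
}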



Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 16:56 -0700, James Bottomley wrote:
> Actually, I think 3/3 is wrong.  You need to start the queue as soon as
> we see it's blocked and the device has gone into DEL.  This is safe
> because DEL ensures nothing ever gets from the generic request function
> to queuecommand, so the device driver never sees anything.

Hello James,

I agree that restarting the queue earlier is safe but I do not agree that
patch 3/3 is wrong. Can you explain why you think that patch 3/3 is wrong?

Thanks,

Bart.

Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread James Bottomley
On Tue, 2017-04-18 at 23:47 +, Bart Van Assche wrote:
> On Tue, 2017-04-18 at 08:56 -0700, James Bottomley wrote:
> > How about this approach.  It goes straight to DEL if the device is
> > blocked (skipping CANCEL).  This means that all the commands issued
> > in 
> > ->shutdown will error in the mid-layer, thus making the removal
> > proceed
> > without being stopped.
> 
> Hello James,
> 
> The three attached patches pass my tests. Please let me know how you 
> would like to proceed with patch 1/3. Would you like to submit it 
> yourself or is it OK for you if I mention you as author and add your
> Signed-off-by?

Actually, I think 3/3 is wrong.  You need to start the queue as soon as
we see it's blocked and the device has gone into DEL.  This is safe
because DEL ensures nothing ever gets from the generic request function
to queuecommand, so the device driver never sees anything.

This means the combined 1/3 3/3 patch looks like this:

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e5a2d590a104..31171204cfd1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2611,7 +2611,6 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
case SDEV_QUIESCE:
case SDEV_OFFLINE:
case SDEV_TRANSPORT_OFFLINE:
-   case SDEV_BLOCK:
break;
default:
goto illegal;
@@ -2625,6 +2624,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
case SDEV_OFFLINE:
case SDEV_TRANSPORT_OFFLINE:
case SDEV_CANCEL:
+   case SDEV_BLOCK:
case SDEV_CREATED_BLOCK:
break;
default:
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07b1d47..e477f95bf169 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1282,8 +1282,17 @@ void __scsi_remove_device(struct scsi_device *sdev)
return;
 
if (sdev->is_visible) {
-   if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
-   return;
+   /*
+* If blocked, we go straight to DEL so any commands
+* issued during the driver shutdown (like sync cache)
+* are errored
+*/
+   if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0) {
+   if (scsi_device_set_state(sdev, SDEV_DEL) != 0)
+   return;
+   else
+   scsi_start_queue(sdev);
+   }
 
bsg_unregister_queue(sdev->request_queue);
device_unregister(&sdev->sdev_dev);



Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:56 -0700, James Bottomley wrote:
> How about this approach.  It goes straight to DEL if the device is
> blocked (skipping CANCEL).  This means that all the commands issued in 
> ->shutdown will error in the mid-layer, thus making the removal proceed
> without being stopped.

Hello James,

The three attached patches pass my tests. Please let me know how you would
like to proceed with patch 1/3. Would you like to submit it yourself or is
it OK for you if I mention you as author and add your Signed-off-by?

Thanks,

Bart.

From 9482fdc8b322f15ced6f64d57f45026367604a23 Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Tue, 18 Apr 2017 10:11:02 -0700
Subject: [PATCH 1/3] Make __scsi_remove_device go straight from BLOCKED to DEL

If a device is blocked, make __scsi_remove_device() cause it to
transition to the DEL state. This means that all the commands
issued in .shutdown() will error in the mid-layer, thus making
the removal proceed without being stopped.

This patch is a slightly modified version of a patch from James
Bottomley.

Signed-off-by: Bart Van Assche 
Cc: James Bottomley 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
Cc: Benjamin Block 
---
 drivers/scsi/scsi_lib.c   |  2 +-
 drivers/scsi/scsi_sysfs.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index eecc005099b2..277c8b3ae7b0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2611,7 +2611,6 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
 		case SDEV_QUIESCE:
 		case SDEV_OFFLINE:
 		case SDEV_TRANSPORT_OFFLINE:
-		case SDEV_BLOCK:
 			break;
 		default:
 			goto illegal;
@@ -2625,6 +2624,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
 		case SDEV_OFFLINE:
 		case SDEV_TRANSPORT_OFFLINE:
 		case SDEV_CANCEL:
+		case SDEV_BLOCK:
 		case SDEV_CREATED_BLOCK:
 			break;
 		default:
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07b1d47..f95d191ec809 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1282,9 +1282,19 @@ void __scsi_remove_device(struct scsi_device *sdev)
 		return;
 
 	if (sdev->is_visible) {
-		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
+		/*
+		 * If blocked, we go straight to DEL so any commands
+		 * issued during the driver shutdown (like sync cache)
+		 * are errored.
+		 */
+		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0 &&
+		scsi_device_set_state(sdev, SDEV_DEL) != 0)
 			return;
 
+		if (sdev->sdev_state == SDEV_DEL)
+			sdev_printk(KERN_DEBUG, sdev,
+"Changed state from BLOCKED to DEL\n");
+
 		bsg_unregister_queue(sdev->request_queue);
 		device_unregister(&sdev->sdev_dev);
 		transport_remove_device(dev);
-- 
2.12.2

From c3f85b714fcfb12d43669b7f295a09f4718c2704 Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Tue, 28 Mar 2017 14:00:17 -0700
Subject: [PATCH 2/3] Introduce scsi_start_queue()

This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
Cc: Benjamin Block 
---
 drivers/scsi/scsi_lib.c  | 25 +++--
 drivers/scsi/scsi_priv.h |  1 +
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 277c8b3ae7b0..376cd1da102c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2987,6 +2987,20 @@ scsi_internal_device_block(struct scsi_device *sdev, bool wait)
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block);
  
+void scsi_start_queue(struct scsi_device *sdev)
+{
+	struct request_queue *q = sdev->request_queue;
+	unsigned long flags;
+
+	if (q->mq_ops) {
+		blk_mq_start_stopped_hw_queues(q, false);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_start_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
+}
+
 /**
  * scsi_internal_device_unblock - resume a device after a block request
  * @sdev:	device to resume
@@ -3007,9 +3021,6 @@ int
 scsi_internal_device_unblock(struct scsi_device *sdev,
 			 enum scsi_device_state new_state)
 {
-	struct request_queue *q = sdev->request_queue; 
-	unsigned long flags;
-
 	/*
 	 * Try to transition the scsi device to SDEV_RUNNING or one of the
 	 * offlined states and goose the device queue if successful.
@@ -3027,13 +3038,7 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
 		 sdev->sdev_state != SDEV_OFFLINE)
 		return -EINVAL;
 
-	if (q->mq_ops) {
-		blk_mq_start_stopped_hw_queues(q, false);
-	} else {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_start_queue(q);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	}
+	scsi_start_queue(sdev);
 
 	return 0;
 }
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index f11bd102d6d5..c7629e31a75b 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -89,6 +89,7 @@ extern void scsi_run_host_queues(struct Scsi_Host *shost);

[PATCH v3 8/8] scsi: Implement blk_mq_ops.show_rq()

2017-04-18 Thread Bart Van Assche
Show the SCSI CDB, .eh_eflags and .result for pending SCSI commands
in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: James Bottomley 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: 
---
 drivers/scsi/scsi_lib.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bc4513bf4e4..52604573e4b6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2126,6 +2126,31 @@ static void scsi_exit_rq(struct request_queue *q, struct request *rq)
scsi_free_sense_buffer(shost, cmd->sense_buffer);
 }
 
+static const char *const ehflag_name[] = {
+   [ilog2(SCSI_EH_CANCEL_CMD)]  = "CANCEL_CMD",
+   [ilog2(SCSI_EH_ABORT_SCHEDULED)] = "ABORT_SCHEDULED",
+};
+
+static void scsi_show_rq(struct seq_file *m, struct request *rq)
+{
+   struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
+   unsigned int i;
+
+   seq_puts(m, ", .cmd =");
+   for (i = 0; i < cmd->cmd_len; i++)
+   seq_printf(m, " %02x", cmd->cmnd[i]);
+   seq_puts(m, ", .eh_eflags =");
+   for (i = 0; i < sizeof(cmd->eh_eflags) * BITS_PER_BYTE; i++) {
+   if (!(cmd->eh_eflags & BIT(i)))
+   continue;
+   if (i < ARRAY_SIZE(ehflag_name) && ehflag_name[i])
+   seq_printf(m, " %s", ehflag_name[i]);
+   else
+   seq_printf(m, " %d", i);
+   }
+   seq_printf(m, ", .result = %#06x", cmd->result);
+}
+
 struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 {
struct Scsi_Host *shost = sdev->host;
@@ -2158,6 +2183,7 @@ static const struct blk_mq_ops scsi_mq_ops = {
.queue_rq   = scsi_queue_rq,
.complete   = scsi_softirq_done,
.timeout= scsi_timeout,
+   .show_rq= scsi_show_rq,
.init_request   = scsi_init_request,
.exit_request   = scsi_exit_request,
.map_queues = scsi_map_queues,
-- 
2.12.2
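
Given the format strings above, a pending READ(10) with an abort
scheduled would show up in the dispatch list roughly as follows
(appended to the core's per-request line, hence the leading comma; the
values are invented):

, .cmd = 28 00 00 00 10 00 00 00 08 00, .eh_eflags = ABORT_SCHEDULED, .result = 0x0008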



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:51:27PM -0700, Dan Williams wrote:
> > This really seems like much less trouble than trying to wrapper all
> > the arch's dma ops, and doesn't have the wonky restrictions.
> 
> I don't think the root bus iommu drivers have any business knowing or
> caring about dma happening between devices lower in the hierarchy.

Maybe not, but performance requires some odd choices in this code.. :(

> > Setting up the iommu is fairly expensive, so getting rid of the
> > batching would kill performance..
> 
> When we're crossing device and host memory boundaries how much
> batching is possible? As far as I can see you'll always be splitting
> the sgl on these dma mapping boundaries.

Splitting the sgl is different from iommu batching.

As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
the middle.

The optimum behavior is to allocate a 1MB-4K iommu range and fill it
with the CPU memory. Then return a SGL with three entries, two
pointing into the range and one to the p2p.

It is creating each range which tends to be expensive, so creating two
ranges (or worse, if every SGL created a range it would be 255) is
very undesired.

Jason
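
Spelling that example out: a 1 MB write covers 256 4K pages; with page
128 backed by P2P memory, the desired result is a single 1MB-4K iommu
range and

  sg[0]: iova,          len = 512K  (host pages   0..127)
  sg[1]: peer bus addr, len =   4K  (the P2P page)
  sg[2]: iova + 512K,   len = 508K  (host pages 129..255)

rather than one iommu range per host entry, up to 255 of them in the
worst case.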


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 04:24 PM, Jason Gunthorpe wrote:
> Try and write a stacked map_sg function like you describe and you will
> see how horrible it quickly becomes.

Yes, unfortunately, I have to agree with this statement completely.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)

Yeah, I'm also personally thinking your proposal is the way to go as
well. Dan's injected ops suggestion is interesting but I can't see how
it solves the issue completely. Your proposal is the only one that seems
to be complete to me. It just has a few minor pain points which I've
already described but are likely manageable and less than the pain
stacked dma_ops creates.

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 3:56 PM, Logan Gunthorpe  wrote:
>
>
> On 18/04/17 04:50 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe  wrote:
>>>
>>>
>>> On 18/04/17 04:28 PM, Dan Williams wrote:
 Unlike the pci bus address offset case which I think is fundamental to
 support since shipping archs do this today, I think it is ok to say
 p2p is restricted to a single sgl that gets to talk to host memory or
 a single device. That said, what's wrong with a p2p aware map_sg
 implementation calling up to the host memory map_sg implementation on
 a per sgl basis?
>>>
>>> I think Ben said they need mixed sgls and that is where this gets messy.
>>> I think I'd prefer this too given trying to enforce all sgs in a list to
>>> be one type or another could be quite difficult given the state of the
>>> scatterlist code.
>>>
> Also, what happens if p2p pages end up getting passed to a device that
> doesn't have the injected dma_ops?

 This goes back to limiting p2p to a single pci host bridge. If the p2p
 capability is coordinated with the bridge rather than between the
 individual devices then we have a central point to catch this case.
>>>
>>> Not really relevant. If these pages get to userspace (as people seem
>>> keen on doing) or a less than careful kernel driver they could easily
>>> get into the dma_map calls of devices that aren't even pci related (via
>>> an O_DIRECT operation on an incorrect file or something). The common
>>> code must reject these and can't rely on an injected dma op.
>>
>> No, we can't do that at get_user_pages() time, it will always need to
>> be up to the device driver to fail dma that it can't perform.
>
> I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
> to fail for any dma it cannot perform. Which means either all dma_ops
> providers need to be p2p aware or this logic has to be in dma_map_*
> itself. My point being: you can't rely on an injected dma_op for some
> devices to handle the fail case globally.

Ah, I see what you're saying now. Yes, we do need something that
guarantees any dma mapping implementation that gets a struct page that
it does not know how to translate properly fails the request.


Re: pd: remove bogus check for req->errors

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> The driver never sets req->errors

Reviewed-by: Bart Van Assche 

Re: block: add a error_count field to struct request

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> From: Christoph Hellwig 
> 
> This is for the legacy floppy and ataflop drivers that currently abuse
> ->errors for this purpose.  It's stashed away in a union to not grow
> the struct size, the other fields are either used by modern drivers
> for different purposes or the I/O scheduler before queing the I/O
> to drivers.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/blkdev.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5986e1250d7d..e618164462e8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -175,6 +175,7 @@ struct request {
>   struct rb_node rb_node; /* sort/lookup */
>   struct bio_vec special_vec;
>   void *completion_data;
> + int error_count; /* for legacy drivers, don't use */
>   };
>  
>   /*

Hello Christoph,

Both blk-mq and the traditional block layer support a .cmd_size field to
make the block layer core allocate driver-specific data at the end of struct
request. Could that mechanism have been used for the error_count field?

Thanks,

Bart.

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 04:50 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe  wrote:
>>
>>
>> On 18/04/17 04:28 PM, Dan Williams wrote:
>>> Unlike the pci bus address offset case which I think is fundamental to
>>> support since shipping archs do this today, I think it is ok to say
>>> p2p is restricted to a single sgl that gets to talk to host memory or
>>> a single device. That said, what's wrong with a p2p aware map_sg
>>> implementation calling up to the host memory map_sg implementation on
>>> a per sgl basis?
>>
>> I think Ben said they need mixed sgls and that is where this gets messy.
>> I think I'd prefer this too given trying to enforce all sgs in a list to
>> be one type or another could be quite difficult given the state of the
>> scatterlist code.
>>
 Also, what happens if p2p pages end up getting passed to a device that
 doesn't have the injected dma_ops?
>>>
>>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>>> capability is coordinated with the bridge rather than between the
>>> individual devices then we have a central point to catch this case.
>>
>> Not really relevant. If these pages get to userspace (as people seem
>> keen on doing) or a less than careful kernel driver they could easily
>> get into the dma_map calls of devices that aren't even pci related (via
>> an O_DIRECT operation on an incorrect file or something). The common
>> code must reject these and can't rely on an injected dma op.
> 
> No, we can't do that at get_user_pages() time, it will always need to
> be up to the device driver to fail dma that it can't perform.

I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
to fail for any dma it cannot perform. Which means either all dma_ops
providers need to be p2p aware or this logic has to be in dma_map_*
itself. My point being: you can't rely on an injected dma_op for some
devices to handle the fail case globally.




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 3:46 PM, Benjamin Herrenschmidt
 wrote:
> On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
>> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
>> > already have APIs that map BAR memory to user space, and would like to
>> > keep using them. A 'enable P2P for bar' helper function sounds better
>> > to me.
>>
>> ...and I think it's not a helper function as much as asking the bus
>> provider "can these two device dma to each other". The "helper" is the
>> dma api redirecting through a software-iommu that handles bus address
>> translation differently than it would handle host memory dma mapping.
>
> Do we even need that function? The dma_ops have a dma_supported()
> call...
>
> If we have those override ops built into the "dma_target" object,
> then these things can make that decision knowing both the source
> and target device.
>

Yes.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 3:42 PM, Jason Gunthorpe
 wrote:
> On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:
>
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this today
>
> But we can support this by modifying those arch's unique dma_ops
> directly.
>
> Eg as I explained, my p2p_same_segment_map_page() helper concept would
> do the offset adjustment for same-segment DMA.
>
> If PPC calls that in their IOMMU drivers then they will have proper
> support for this basic p2p, and the right framework to move on to more
> advanced cases of p2p.
>
> This really seems like much less trouble than trying to wrapper all
> the arch's dma ops, and doesn't have the wonky restrictions.

I don't think the root bus iommu drivers have any business knowing or
caring about dma happening between devices lower in the hierarchy.

>> I think it is ok to say p2p is restricted to a single sgl that gets
>> to talk to host memory or a single device.
>
> RDMA and GPU would be sad with this restriction...
>
>> That said, what's wrong with a p2p aware map_sg implementation
>> calling up to the host memory map_sg implementation on a per sgl
>> basis?
>
> Setting up the iommu is fairly expensive, so getting rid of the
> batching would kill performance..

When we're crossing device and host memory boundaries how much
batching is possible? As far as I can see you'll always be splitting
the sgl on these dma mapping boundaries.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe  wrote:
>
>
> On 18/04/17 04:28 PM, Dan Williams wrote:
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this today, I think it is ok to say
>> p2p is restricted to a single sgl that gets to talk to host memory or
>> a single device. That said, what's wrong with a p2p aware map_sg
>> implementation calling up to the host memory map_sg implementation on
>> a per sgl basis?
>
> I think Ben said they need mixed sgls and that is where this gets messy.
> I think I'd prefer this too given trying to enforce all sgs in a list to
> be one type or another could be quite difficult given the state of the
> scatterlist code.
>
>>> Also, what happens if p2p pages end up getting passed to a device that
>>> doesn't have the injected dma_ops?
>>
>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>> capability is coordinated with the bridge rather than between the
>> individual devices then we have a central point to catch this case.
>
> Not really relevant. If these pages get to userspace (as people seem
> keen on doing) or a less than careful kernel driver they could easily
> get into the dma_map calls of devices that aren't even pci related (via
> an O_DIRECT operation on an incorrect file or something). The common
> code must reject these and can't rely on an injected dma op.

No, we can't do that at get_user_pages() time, it will always need to
be up to the device driver to fail dma that it can't perform.


Re: blk-mq: remove the error argument to blk_mq_complete_request

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> Now that we always have a ->complete callback we can remove the direct
> call to blk_mq_end_request, as well as the error argument to
> blk_mq_complete_request.

Hello Christoph,

Please add a runtime check that issues a warning early if a .complete
callback function is missing.

Are you sure that all blk-mq drivers have a .complete callback? In the
request-errors branch of your block git tree I found the following:

static const struct blk_mq_ops rbd_mq_ops = {
.queue_rq   = rbd_queue_rq,
.init_request   = rbd_init_request,
};

static const struct blk_mq_ops ubiblock_mq_ops = {
.queue_rq   = ubiblock_queue_rq,
.init_request   = ubiblock_init_request,
};

Thanks,

Bart.
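
One possible form of the requested check (a sketch, not an actual
patch; the placement alongside the existing sanity checks in
blk_mq_alloc_tag_set() is an assumption):

	if (WARN(!set->ops->complete,
		 "blk-mq: tag set registered without a .complete callback\n"))
		return -EINVAL;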

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 04:28 PM, Dan Williams wrote:
> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today, I think it is ok to say
> p2p is restricted to a single sgl that gets to talk to host memory or
> a single device. That said, what's wrong with a p2p aware map_sg
> implementation calling up to the host memory map_sg implementation on
> a per sgl basis?

I think Ben said they need mixed sgls and that is where this gets messy.
I think I'd prefer this too given trying to enforce all sgs in a list to
be one type or another could be quite difficult given the state of the
scatterlist code.

>> Also, what happens if p2p pages end up getting passed to a device that
>> doesn't have the injected dma_ops?
> 
> This goes back to limiting p2p to a single pci host bridge. If the p2p
> capability is coordinated with the bridge rather than between the
> individual devices then we have a central point to catch this case.

Not really relevant. If these pages get to userspace (as people seem
keen on doing) or a less than careful kernel driver they could easily
get into the dma_map calls of devices that aren't even pci related (via
an O_DIRECT operation on an incorrect file or something). The common
code must reject these and can't rely on an injected dma op.

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:

> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today

But we can support this by modifying those arch's unique dma_ops
directly.

Eg as I explained, my p2p_same_segment_map_page() helper concept would
do the offset adjustment for same-segment DMA.

If PPC calls that in their IOMMU drivers then they will have proper
support for this basic p2p, and the right framework to move on to more
advanced cases of p2p.

This really seems like much less trouble than trying to wrapper all
the arch's dma ops, and doesn't have the wonky restrictions.

> I think it is ok to say p2p is restricted to a single sgl that gets
> to talk to host memory or a single device.

RDMA and GPU would be sad with this restriction...

> That said, what's wrong with a p2p aware map_sg implementation
> calling up to the host memory map_sg implementation on a per sgl
> basis?

Setting up the iommu is fairly expensive, so getting rid of the
batching would kill performance..

Jason


Re: scsi: introduce a new result field in struct scsi_request

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> --- a/block/bsg.c
> +++ b/block/bsg.c
> @@ -391,13 +391,13 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr,
>   struct scsi_request *req = scsi_req(rq);
>   int ret = 0;
>  
> - dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->errors);
> + dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->result);

Hello Christoph,

Did you perhaps intend "req->result" instead of "rq->result"?

> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -229,8 +229,8 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
>   * @rq_flags:flags for ->rq_flags
>   * @resid:   optional residual length
>   *
> - * returns the req->errors value which is the scsi_cmnd result
> - * field.
> + * Returns the scsi_cmnd result field if a command was executed, or a negative
> + * Linux error code if we didn't get that war.

Did you intend "war" or is that perhaps a typo?

> @@ -1905,7 +1904,7 @@ static int scsi_mq_prep_fn(struct request *req)
>  static void scsi_mq_done(struct scsi_cmnd *cmd)
>  {
>   trace_scsi_dispatch_cmd_done(cmd);
> - blk_mq_complete_request(cmd->request, cmd->request->errors);
> + blk_mq_complete_request(cmd->request, 0);
>  }

Why has cmd->request->errors been changed into 0?

Bart.

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 3:15 PM, Logan Gunthorpe  wrote:
>
>
> On 18/04/17 03:36 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
>>  wrote:
>>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> I think this opens an even bigger can of worms..

 No, I don't think it does. You'd only shim when the target page is
 backed by a device, not host memory, and you can figure this out by a
 is_zone_device_page()-style lookup.
>>>
>>> The bigger can of worms is how do you meaningfully stack dma_ops.
>>
>> This goes back to my original comment to make this capability a
>> function of the pci bridge itself. The kernel has an implementation of
>> a dynamically created bridge device that injects its own dma_ops for
>> the devices behind the bridge. See vmd_setup_dma_ops() in
>> drivers/pci/host/vmd.c.
>
> Well the issue I think Jason is pointing out is that the ops don't
> stack. The map_* function in the injected dma_ops needs to be able to
> call the original map_* for any page that is not p2p memory. This is
> especially annoying in the map_sg function which may need to call a
> different op based on the contents of the sgl. (And please correct me if
> I'm not seeing how this can be done in the vmd example.)

Unlike the pci bus address offset case which I think is fundamental to
support since shipping archs do this today, I think it is ok to say
p2p is restricted to a single sgl that gets to talk to host memory or
a single device. That said, what's wrong with a p2p aware map_sg
implementation calling up to the host memory map_sg implementation on
a per sgl basis?

> Also, what happens if p2p pages end up getting passed to a device that
> doesn't have the injected dma_ops?

This goes back to limiting p2p to a single pci host bridge. If the p2p
capability is coordinated with the bridge rather than between the
individual devices then we have a central point to catch this case.

...of course this is all hand wavy until someone writes the code and
proves otherwise.

> However, the concept of replacing the dma_ops for all devices behind a
> supporting bridge is interesting and may be a good piece of the final
> solution.

It's at least a proof point for injecting special behavior for devices
behind a (virtual) pci bridge without needing to go touch a bunch of
drivers.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:

> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern, which is why I suggest this as an
interim solution until all dma_ops are migrated. Ideally sg_has_p2p
would be a fast path that checked some kind of flags bit set during
sg_assign_page... 
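
A hypothetical sketch of that fast path; SG_P2P and the helpers below
do not exist, and a real version would first need a spare low bit in
page_link (the scatterlist code already claims bits 0-1 for its
chain/end markers):

#define SG_P2P	0x4UL			/* assumed-free low bit */

static inline void sg_assign_page_p2p(struct scatterlist *sg,
				      struct page *page)
{
	sg_assign_page(sg, page);
	if (is_zone_device_page(page))	/* stand-in for a real p2p test */
		sg->page_link |= SG_P2P;
}

/* per-entry check; a list-level sg_has_p2p() could OR these bits
 * together as pages are added instead of walking the whole list */
static inline bool sg_is_p2p(struct scatterlist *sg)
{
	return sg->page_link & SG_P2P;
}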

This would probably all have to be protected with CONFIG_P2P until it
becomes performance neutral. People without an iommu are not going to
want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a
determination..

I suspect the main server oriented iommu dma op will want to have
proper p2p support anyhow and will probably have their unique control
flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last
suggestion for a reason - that is all the helper does.

The intention would be for real iommu drivers to call that helper for
the one simple case and if it fails then use their own routines to
figure out if cross-segment P2P is possible and configure the iommu as
needed.
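
A rough sketch of what that one simple case could boil down to;
same_pci_segment() and the bus_offset field are assumptions for
illustration, not existing kernel API:

/* returns a bus address the caller may use without programming the
 * iommu, or DMA_ERROR_CODE when the devices do not share a segment
 * and the iommu driver must fall back to its own cross-segment
 * logic */
static dma_addr_t p2p_same_segment_map_page(struct device *initiator,
					    struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	if (!same_pci_segment(initiator, pgmap->dev))	/* hypothetical */
		return DMA_ERROR_CODE;

	/* peer BAR: bus address is the physical address plus the
	 * provider's bus offset discussed earlier in the thread */
	return page_to_phys(page) + pgmap->bus_offset;
}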

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked.

Try and write a stacked map_sg function like you describe and you will
see how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the
entire sg list. Thus a trivial implementation to iterate over all sg
list entries is not desired.

So first a sg list without p2p memory would have to be created, passed
to the lower level ops, then brought back. Remember, the returned sg
list will have a different number of entries than the original. Now
another complex loop is needed to split/merge back in the p2p sg
elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to
just putting a helper call in the existing sg iteration loop of the
actual op.  Particularly if the actual op is a no-op like no-mmu x86
would use.

Since dma mapping is a performance path we must be careful not to
create intrinsic inefficiencies with otherwise nice layering :)

Jason


Re: block: remove the blk_execute_rq return value

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:52 -0700, Christoph Hellwig wrote:
> diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
> index b05e151c9b38..82c6d02193ae 100644
> --- a/drivers/block/paride/pd.c
> +++ b/drivers/block/paride/pd.c
> @@ -747,7 +747,8 @@ static int pd_special_command(struct pd_unit *disk,
>  
>   rq->special = func;
>  
> - err = blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
> + blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
> + err = req->errors ? -EIO : 0;
>  
>   blk_put_request(rq);
>   return err;

Hello Christoph,

Sorry that I hadn't noticed this before but shouldn't "req" be changed into
"rq"? Otherwise this patch looks fine to me. If this comment gets addressed
you can add:

Reviewed-by: Bart Van Assche 



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 03:36 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
>  wrote:
>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
 I think this opens an even bigger can of worms..
>>>
>>> No, I don't think it does. You'd only shim when the target page is
>>> backed by a device, not host memory, and you can figure this out by a
>>> is_zone_device_page()-style lookup.
>>
>> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> This goes back to my original comment to make this capability a
> function of the pci bridge itself. The kernel has an implementation of
> a dynamically created bridge device that injects its own dma_ops for
> the devices behind the bridge. See vmd_setup_dma_ops() in
> drivers/pci/host/vmd.c.

Well the issue I think Jason is pointing out is that the ops don't
stack. The map_* function in the injected dma_ops needs to be able to
call the original map_* for any page that is not p2p memory. This is
especially annoying in the map_sg function which may need to call a
different op based on the contents of the sgl. (And please correct me if
I'm not seeing how this can be done in the vmd example.)

Also, what happens if p2p pages end up getting passed to a device that
doesn't have the injected dma_ops?

However, the concept of replacing the dma_ops for all devices behind a
supporting bridge is interesting and may be a good piece of the final
solution.

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
 wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>> > I think this opens an even bigger can of worms..
>>
>> No, I don't think it does. You'd only shim when the target page is
>> backed by a device, not host memory, and you can figure this out by a
>> is_zone_device_page()-style lookup.
>
> The bigger can of worms is how do you meaningfully stack dma_ops.

This goes back to my original comment to make this capability a
function of the pci bridge itself. The kernel has an implementation of
a dynamically created bridge device that injects its own dma_ops for
the devices behind the bridge. See vmd_setup_dma_ops() in
drivers/pci/host/vmd.c.

> What does the p2p provider do when it detects a p2p page?

Check to see if the arch requires this offset translation that Ben
brought up and if not provide the physical address as the patches are
doing now.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 03:03 PM, Jason Gunthorpe wrote:
> What about something more incremental like this instead:
> - dma_ops will set map_sg_p2p == map_sg when they are updated to
>   support p2p, otherwise DMA on P2P pages will fail for those ops.
> - When all ops support p2p we remove the if and ops->map_sg then
>   just call map_sg_p2p
> - For now the scatterlist maintains a bit when pages are added indicating if
>   p2p memory might be present in the list.
> - Unmap for p2p and non-p2p is the same, the underlying ops driver has
>   to make it work.
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0977317c6835c2..505ed7d502053d 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -103,6 +103,9 @@ struct dma_map_ops {
>   int (*map_sg)(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir,
> unsigned long attrs);
> + int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
> +   int nents, enum dma_data_direction dir,
> +   unsigned long attrs);
>   void (*unmap_sg)(struct device *dev,
>struct scatterlist *sg, int nents,
>enum dma_data_direction dir,
> @@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>   for_each_sg(sg, s, nents, i)
>   kmemcheck_mark_initialized(sg_virt(s), s->length);
>   BUG_ON(!valid_dma_direction(dir));
> - ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
> + if (sg_has_p2p(sg)) {
> + if (ops->map_sg_p2p)
> + ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
> + else
> + return 0;
> + } else
> + ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
>   BUG_ON(ents < 0);
>   debug_dma_map_sg(dev, sg, nents, ents, dir);

I could get behind this. Though a couple of points:

1) It means that sg_has_p2p has to walk the entire sg and check every
page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
then do some operation per page. If anyone is concerned about the
dma_map performance this could be an issue.

2) Without knowing exactly what the arch specific code may need to do
it's hard to say that this is exactly the right approach. If every
dma_ops provider has to do exactly this on every page it may lead to a
lot of duplicate code:

foreach_sg page:
	if (pci_page_is_p2p(page)) {
		dma_addr = pci_p2p_map_page(page)
		if (!dma_addr)
			return 0;
		continue
	}
	...

The only thing I'm presently aware of is the segment check and applying
the offset to the physical address -- neither of which has much to do
with the specific dma_ops providers. It _may_ be that this needs to be
bus specific and not arch specific which I think is what Dan may be
getting at. So it may make sense to just have a pci_map_sg_p2p() which
takes a dma_ops struct it would use for any page that isn't a p2p page.

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > I think this opens an even bigger can of worms..
> 
> No, I don't think it does. You'd only shim when the target page is
> backed by a device, not host memory, and you can figure this out by a
> is_zone_device_page()-style lookup.

The bigger can of worms is how do you meaningfully stack dma_ops.

What does the p2p provider do when it detects a p2p page?

Jason


Re: [PATCH] scsi: lpfc: fix potential buffer overflow.

2017-04-18 Thread James Smart



On 4/18/2017 2:55 AM, Maurizio Lombardi wrote:

This patch fixes a potential buffer overflow in lpfc_nvme_info_show().

Signed-off-by: Maurizio Lombardi 
---



looks fine

-- james

Signed-off-by: James Smart 



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 2:03 PM, Jason Gunthorpe
 wrote:
> On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:
>
>> > Yes, I noticed this problem too and that makes sense. It just means
>> > every dma_ops will probably need to be modified to either support p2p
>> > pages or fail on them. Though, the only real difficulty there is that it
>> > will be a lot of work.
>>
>> I don't think you need to go touch all dma_ops, I think you can just
>> arrange for devices that are going to do dma to get redirected to a
>> p2p aware provider of operations that overrides the system default
>> dma_ops. I.e. just touch get_dma_ops().
>
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
>
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
>
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?
>
> I think this opens an even bigger can of worms..

No, I don't think it does. You'd only shim when the target page is
backed by a device, not host memory, and you can figure this out by a
is_zone_device_page()-style lookup.


Re: nvme-fc: fix status code handling in nvme_fc_fcpio_done

2017-04-18 Thread James Smart


On 4/18/2017 8:52 AM, Christoph Hellwig wrote:

From: Christoph Hellwig 

nvme_complete_async_event expects the little endian status code
including the phase bit, and a new completion handler I plan to
introduce will do so as well.

Change the status variable into the little endian format with the
phase bit used in the NVMe CQE to fix / enable this.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---



looks good

-- james

Signed-off-by: James Smart 



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:

> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

I don't follow, when does get_dma_ops() return a p2p aware provider?
It has no way to know if the DMA is going to involve p2p, get_dma_ops
is called with the device initiating the DMA.

So you'd always return the P2P shim on a system that has registered
P2P memory?

Even so, how does this shim work? dma_ops are not really intended to
be stacked. How would we make unmap work, for instance? What happens
when the underlying iommu dma ops actually natively understands p2p
and doesn't want the shim?

I think this opens an even bigger can of worms..

Let's find a strategy to safely push this into dma_ops.

What about something more incremental like this instead:
- dma_ops will set map_sg_p2p == map_sg when they are updated to
  support p2p, otherwise DMA on P2P pages will fail for those ops.
- When all ops support p2p we remove the if and ops->map_sg then
  just call map_sg_p2p
- For now the scatterlist maintains a bit when pages are added indicating if
  p2p memory might be present in the list.
- Unmap for p2p and non-p2p is the same, the underlying ops driver has
  to make it work.

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0977317c6835c2..505ed7d502053d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,6 +103,9 @@ struct dma_map_ops {
int (*map_sg)(struct device *dev, struct scatterlist *sg,
  int nents, enum dma_data_direction dir,
  unsigned long attrs);
+   int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction dir,
+ unsigned long attrs);
void (*unmap_sg)(struct device *dev,
 struct scatterlist *sg, int nents,
 enum dma_data_direction dir,
@@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
for_each_sg(sg, s, nents, i)
kmemcheck_mark_initialized(sg_virt(s), s->length);
BUG_ON(!valid_dma_direction(dir));
-   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+   if (sg_has_p2p(sg)) {
+   if (ops->map_sg_p2p)
+   ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
+   else
+   return 0;
+   } else
+   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
BUG_ON(ents < 0);
debug_dma_map_sg(dev, sg, nents, ents, dir);
 



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 02:31 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 1:29 PM, Jerome Glisse  wrote:
>>> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe 
>>> wrote:


 On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> Ultimately every dma_ops will need special code to support P2P with
> the special hardware that ops is controlling, so it makes some sense
> to start by pushing the check down there in the first place. This
> advice is partially motivated by how dma_map_sg is just a small
> wrapper around the function pointer call...

 Yes, I noticed this problem too and that makes sense. It just means
 every dma_ops will probably need to be modified to either support p2p
 pages or fail on them. Though, the only real difficulty there is that it
 will be a lot of work.
>>>
>>> I don't think you need to go touch all dma_ops, I think you can just
>>> arrange for devices that are going to do dma to get redirected to a
>>> p2p aware provider of operations that overrides the system default
>>> dma_ops. I.e. just touch get_dma_ops().
>>
>> This would not work well for everyone, for instance on GPU we usually
>> have buffer object with a mix of device memory and regular system
>> memory but call dma sg map once for the list.
>>
> 
> ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?

The main conflict is in dma_map_sg which only does get_dma_ops once but
the sg may contain memory of different types.

Logan



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 1:29 PM, Jerome Glisse  wrote:
>> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe 
>> wrote:
>> >
>> >
>> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
>> >> Ultimately every dma_ops will need special code to support P2P with
>> >> the special hardware that ops is controlling, so it makes some sense
>> >> to start by pushing the check down there in the first place. This
>> >> advice is partially motivated by how dma_map_sg is just a small
>> >> wrapper around the function pointer call...
>> >
>> > Yes, I noticed this problem too and that makes sense. It just means
>> > every dma_ops will probably need to be modified to either support p2p
>> > pages or fail on them. Though, the only real difficulty there is that it
>> > will be a lot of work.
>>
>> I don't think you need to go touch all dma_ops, I think you can just
>> arrange for devices that are going to do dma to get redirected to a
>> p2p aware provider of operations that overrides the system default
>> dma_ops. I.e. just touch get_dma_ops().
>
> This would not work well for everyone, for instance on GPU we usually
> have buffer objects with a mix of device memory and regular system
> memory but call dma_map_sg once for the list.
>

...and that dma_map goes through get_dma_ops(), so I don't see the conflict?


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jerome Glisse
> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe 
> wrote:
> >
> >
> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> >> Ultimately every dma_ops will need special code to support P2P with
> >> the special hardware that ops is controlling, so it makes some sense
> >> to start by pushing the check down there in the first place. This
> >> advice is partially motivated by how dma_map_sg is just a small
> >> wrapper around the function pointer call...
> >
> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

This would not work well for everyone, for instance on GPU we usually
have buffer objects with a mix of device memory and regular system
memory but call dma_map_sg once for the list.

Cheers,
Jérôme


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 01:48 PM, Jason Gunthorpe wrote:
> I think this is why progress on this keeps getting stuck - every
> solution is a lot of work.

Yup! There's also a ton of work just to get the iomem safety issues
addressed, let alone the dma mapping issues.

> You could try to do a dummy mapping / create a MR early on to detect
> this.

Ok, that could be a workable solution.

> FWIW, I wonder if from a RDMA perspective we have another
> problem.. Should we allow P2P memory to be used with the local DMA
> lkey? There are potential designs around virtualization that would not
> allow that. Should we mandate that P2P memory be in its own MR?

I can't say I understand these issues...

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 01:35:32PM -0600, Logan Gunthorpe wrote:

> > Ultimately every dma_ops will need special code to support P2P with
> > the special hardware that ops is controlling, so it makes some sense
> > to start by pushing the check down there in the first place. This
> > advice is partially motivated by how dma_map_sg is just a small
> > wrapper around the function pointer call...
> 
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I think this is why progress on this keeps getting stuck - every
solution is a lot of work.

> > Where p2p_same_segment_map_page checks if the two devices are on the
> > 'same switch' and if so returns the address translated to match the
> > bus address programmed into the BAR or fails. We know this case is
> > required to work by the PCI spec, so it makes sense to use it as the
> > first canned helper.
> 
> I've also suggested that this check should probably be done (or perhaps
> duplicated) before we even get to the map stage.

Since the mechanics of the check are essentially unique to every
dma_ops implementation, I would not hoist it out of the map function
without a really good reason.

> In the case of nvme-fabrics we'd probably want to let the user know
> when they try to configure it or at least fall back to allocating
> regular memory instead.

You could try to do a dummy mapping / create a MR early on to detect
this.

FWIW, I wonder if from a RDMA perspective we have another
problem.. Should we allow P2P memory to be used with the local DMA
lkey? There are potential designs around virtualization that would not
allow that. Should we mandate that P2P memory be in its own MR?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe  wrote:
>
>
> On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
>> Ultimately every dma_ops will need special code to support P2P with
>> the special hardware that ops is controlling, so it makes some sense
>> to start by pushing the check down there in the first place. This
>> advice is partially motivated by how dma_map_sg is just a small
>> wrapper around the function pointer call...
>
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I don't think you need to go touch all dma_ops, I think you can just
arrange for devices that are going to do dma to get redirected to a
p2p aware provider of operations that overrides the system default
dma_ops. I.e. just touch get_dma_ops().


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> Ultimately every dma_ops will need special code to support P2P with
> the special hardware that ops is controlling, so it makes some sense
> to start by pushing the check down there in the first place. This
> advice is partially motivated by how dma_map_sg is just a small
> wrapper around the function pointer call...

Yes, I noticed this problem too and that makes sense. It just means
every dma_ops will probably need to be modified to either support p2p
pages or fail on them. Though, the only real difficulty there is that it
will be a lot of work.

> Where p2p_same_segment_map_page checks if the two devices are on the
> 'same switch' and if so returns the address translated to match the
> bus address programmed into the BAR or fails. We know this case is
> required to work by the PCI spec, so it makes sense to use it as the
> first canned helper.

I've also suggested that this check should probably be done (or perhaps
duplicated) before we even get to the map stage. In the case of
nvme-fabrics we'd probably want to let the user know when they try to
configure it or at least fall back to allocating regular memory instead.
It would be a difficult situation to have already copied a block of data
from a NIC to p2p memory only to have it be deemed unmappable on the
NVMe device it's destined for. (Or vice-versa.) This was another issue
p2pmem was attempting to solve.

Logan


[PATCH] target: Add WRITE_VERIFY_16

2017-04-18 Thread Bryant G. Ly
This patch addresses clients, such as AIX, that need WRITE_VERIFY_16
for large volume groups.

Signed-off-by: Bryant G. Ly 
---
 drivers/target/target_core_sbc.c | 2 ++
 include/scsi/scsi_proto.h| 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index ee35c90..2060f5a 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -850,6 +850,7 @@ static sense_reason_t sbc_parse_verify(struct se_cmd *cmd, 
int *sectors,
cmd->t_task_lba = transport_lba_32(cdb);
break;
case VERIFY_16:
+   case WRITE_VERIFY_16:
*sectors = transport_get_sectors_16(cdb);
cmd->t_task_lba = transport_lba_64(cdb);
break;
@@ -962,6 +963,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops)
cmd->execute_cmd = sbc_execute_rw;
break;
case WRITE_VERIFY:
+   case WRITE_VERIFY_16:
ret = sbc_parse_verify(cmd, &sectors, &size);
if (ret)
return ret;
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 6ba66e0..ce78ec8 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -112,6 +112,7 @@
 #define WRITE_16  0x8a
 #define READ_ATTRIBUTE  0x8c
 #define WRITE_ATTRIBUTE  0x8d
+#define WRITE_VERIFY_16  0x8e
 #define VERIFY_16  0x8f
 #define SYNCHRONIZE_CACHE_16  0x91
 #define WRITE_SAME_16  0x93
-- 
2.5.4 (Apple Git-61)



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 12:30:59PM -0600, Logan Gunthorpe wrote:

> >  - The dma ops provider must be able to tell if source memory is bar
> >mapped and recover the pci device backing the mapping.
> 
> Do you mean to say that every dma-ops provider needs to be taught about
> p2p backed pages? I was hoping we could have dma_map_* just use special
> p2p dma-ops if it was passed p2p pages (though there are some
> complications to this too).

I think that is how it will end up working out if this is the path..

Ultimately every dma_ops will need special code to support P2P with
the special hardware that ops is controlling, so it makes some sense
to start by pushing the check down there in the first place. This
advice is partially motivated by how dma_map_sg is just a small
wrapper around the function pointer call...

Something like:

foo_dma_map_sg(...)
{
        for (every page in sg)
                if (page is p2p)
                        dma_addr[i] = p2p_same_segment_map_page(...);
}

Where p2p_same_segment_map_page checks if the two devices are on the
'same switch' and if so returns the address translated to match the
bus address programmed into the BAR or fails. We know this case is
required to work by the PCI spec, so it makes sense to use it as the
first canned helper.
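
To make the shape of that canned helper concrete, a rough sketch (the
p2p_page_to_* and pci_p2p_same_switch helpers are hypothetical, and
error handling is elided):

static dma_addr_t p2p_same_segment_map_page(struct device *dev,
                struct page *page, unsigned long offset)
{
        struct pci_dev *client = to_pci_dev(dev);
        struct pci_dev *provider = p2p_page_to_pci_dev(page);

        /* the PCI spec only guarantees routing below a common switch */
        if (!pci_p2p_same_switch(client, provider))
                return DMA_ERROR_CODE;

        /* translate to the bus address programmed into the provider's BAR */
        return p2p_page_to_bus_addr(page) + offset;
}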

This also proves out the basic idea that the dma ops can recover the
pci device and perform an inspection of the traversed fabric path.

From there every arch would have to expand the implementation to
support a wider range of things. Eg x86 with no iommu and no offset
could allow every address to be used based on a host bridge white
list.

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 11:00 AM, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
>> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
>> > already have APIs that map BAR memory to user space, and would like to
>> > keep using them. An 'enable P2P for bar' helper function sounds better
>> > to me.
>>
>> ...and I think it's not a helper function as much as asking the bus
>> provider "can these two devices dma to each other".
>
> What I mean is I could write in an RDMA driver:
>
>   /* Allow the memory in BAR 1 to be the target of P2P transactions */
>   pci_enable_p2p_bar(dev, 1);
>
> And not require anything else..
>
>> The "helper" is the dma api redirecting through a software-iommu
>> that handles bus address translation differently than it would
>> handle host memory dma mapping.
>
> Not sure, until we see what arches actually need to do here it is hard
> to design common helpers.
>
> Here are a few obvious things that arches will need to implement to
> support this broadly:
>
> - Virtualization might need to do a hypervisor call to get the right
>   translation, or consult some hypervisor specific description table.
>
> - Anything using IOMMUs for virtualization will need to setup IOMMU
>   permissions to allow the P2P flow, this might require translation to
>   an address cookie.
>
> - Fail if the PCI devices are in different domains, or setup hardware to
>   do completion bus/device/function translation.
>
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)
>
> So, we can eventually design helpers for various common scenarios, but
> until we see what arch code actually needs to do it seems
> premature. Much of this seems to involve interaction with some kind of
> hardware, or consultation of some kind of currently platform specific
> data, so I'm not sure what a software-iommu would be doing??
>
> The main thing to agree on is that this code belongs under dma ops and
> that arches have to support struct page mapped BAR addresses in their
> dma ops inputs. Is that reasonable?

I think we're saying the same thing by "software-iommu" and "custom
dma_ops", so yes.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 10:45 AM, Jason Gunthorpe wrote:
> From Ben's comments, I would think that the 'first class' support that
> is needed here is simply a function to return the 'struct device'
> backing a CPU address range.

Yes, and Dan's get_dev_pagemap suggestion gets us 90% of the way there.
It's just a disagreement as to what struct device is inside the pagemap.
Care needs to be taken to ensure that struct device doesn't conflict
with hmm and doesn't limit other potential future users of ZONE_DEVICE.

> If there is going to be more core support for this stuff I think it
> will be under the topic of more robustly describing the fabric to the
> core and core helpers to extract data from the description: eg compute
> the path, check if the path crosses translation, etc

Agreed, those helpers would be useful to everyone.

> I think the key agreement to get out of Logan's series is that P2P DMA
> means:
>  - The BAR will be backed by struct pages
>  - Passing the CPU __iomem address of the BAR to the DMA API is
>valid and, long term, dma ops providers are expected to fail
>or return the right DMA address

Well, yes but we have a _lot_ of work to do to make it safe to pass
around struct pages backed with __iomem. That's where our next focus
will be. I've already taken very initial steps toward this with my
scatterlist map patchset.

>  - Mapping BAR memory into userspace and back to the kernel via
>get_user_pages works transparently, and with the DMA API above

Again, we've had a lot of push back for the memory to go to userspace at
all. It does work, but people expect userspace to screw it up in a lot
of ways. Among the people pushing back on that: Christoph Hellwig has
specifically said he wants to see this stay with in-kernel users only
until the apis can be worked out. This is one of the reasons we decided
to go with enabling nvme-fabrics as everything remains in the kernel.
And with that decision we needed a common in-kernel allocation
infrastructure: this is what p2pmem really is at this point.
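
(As a purely hypothetical shape for that infrastructure, a consumer
might do something like:

        struct p2pmem_dev *p = p2pmem_find(&pci_dev->dev); /* hypothetical */
        void *buf = p2pmem_alloc(p, size);                 /* BAR-backed */

with the allocator picking a device whose BAR is DMA-reachable by all
the peers involved in the transfer.)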

>  - The dma ops provider must be able to tell if source memory is bar
>mapped and recover the pci device backing the mapping.

Do you mean to say that every dma-ops provider needs to be taught about
p2p backed pages? I was hoping we could have dma_map_* just use special
p2p dma-ops if it was passed p2p pages (though there are some
complications to this too).

> At least this is what we'd like in RDMA :)
> 
> FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> already have APIs that map BAR memory to user space, and would like to
> keep using them. An 'enable P2P for bar' helper function sounds better
> to me.

Well, in the end that will likely come down to just devm_memremap_pages
with some (presently undecided) struct device that can be used to get
special p2p dma-ops for the bus.

Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. An 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two devices dma to each other".

What I mean is I could write in an RDMA driver:

  /* Allow the memory in BAR 1 to be the target of P2P transactions */
  pci_enable_p2p_bar(dev, 1);

And not require anything else..

> The "helper" is the dma api redirecting through a software-iommu
> that handles bus address translation differently than it would
> handle host memory dma mapping.

Not sure, until we see what arches actually need to do here it is hard
to design common helpers.

Here are a few obvious things that arches will need to implement to
support this broadly:

- Virtualization might need to do a hypervisor call to get the right
  translation, or consult some hypervisor specific description table.

- Anything using IOMMUs for virtualization will need to setup IOMMU
  permissions to allow the P2P flow, this might require translation to
  an address cookie.

- Fail if the PCI devices are in different domains, or setup hardware to
  do completion bus/device/function translation.

- All platforms can succeed if the PCI devices are under the same
  'segment', but where segments begin is somewhat platform specific
  knowledge. (this is 'same switch' idea Logan has talked about)

So, we can eventually design helpers for various common scenarios, but
until we see what arch code actually needs to do it seems
premature. Much of this seems to involve interaction with some kind of
hardware, or consultation of some kind of currently platform specific
data, so I'm not sure what a software-iommu would be doing??

The main thing to agree on is that this code belongs under dma ops and
that arches have to support struct page mapped BAR addresses in their
dma ops inputs. Is that reasonable?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Dan Williams
On Tue, Apr 18, 2017 at 9:45 AM, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
>
>> Thanks :-) There's a reason why I'm insisting on this. We have constant
>> requests for this today. We have hacks in the GPU drivers to do it for
>> GPUs behind a switch, but those are just that, ad-hoc hacks in the
>> drivers. We have similar grossness around the corner with some CAPI
>> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack NVMe devices.
>
> A lot of people feel this way in the RDMA community too. We have had
> vendors shipping out of tree code to enable P2P for RDMA with GPU
> years and years now. :(
>
> Attempts to get things in mainline have always run into the same sort
> of road blocks you've identified in this thread..
>
> FWIW, I read this discussion and it sounds closer to an agreement than
> I've ever seen in the past.
>
> From Ben's comments, I would think that the 'first class' support that
> is needed here is simply a function to return the 'struct device'
> backing a CPU address range.
>
> This is the minimal required information for the arch or IOMMU code
> under the dma ops to figure out the fabric source/dest, compute the
> traffic path, determine if P2P is even possible, what translation
> hardware is crossed, and what DMA address should be used.
>
> If there is going to be more core support for this stuff I think it
> will be under the topic of more robustly describing the fabric to the
> core and core helpers to extract data from the description: eg compute
> the path, check if the path crosses translation, etc
>
> But that isn't really related to P2P, and is probably better left to
> the arch authors to figure out where they need to enhance the existing
> topology data..
>
> I think the key agreement to get out of Logan's series is that P2P DMA
> means:
>  - The BAR will be backed by struct pages
>  - Passing the CPU __iomem address of the BAR to the DMA API is
>valid and, long term, dma ops providers are expected to fail
>or return the right DMA address
>  - Mapping BAR memory into userspace and back to the kernel via
>get_user_pages works transparently, and with the DMA API above
>  - The dma ops provider must be able to tell if source memory is bar
>mapped and recover the pci device backing the mapping.
>
> At least this is what we'd like in RDMA :)
>
> FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> already have APIs that map BAR memory to user space, and would like to
> keep using them. An 'enable P2P for bar' helper function sounds better
> to me.

...and I think it's not a helper function as much as asking the bus
provider "can these two devices dma to each other". The "helper" is the
dma api redirecting through a software-iommu that handles bus address
translation differently than it would handle host memory dma mapping.
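
As a purely illustrative sketch of that redirection (the p2p check and
p2p_dma_ops are hypothetical):

static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
{
        if (dev_is_p2p_target(dev))     /* hypothetical check */
                return &p2p_dma_ops;    /* ops that emit PCI bus addresses */
        if (dev && dev->dma_ops)
                return dev->dma_ops;
        return get_arch_dma_ops(dev ? dev->bus : NULL);
}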


Re: RFC: drop the T10 OSD code and its users

2017-04-18 Thread Chandy, John
As one of those academics that Boaz was talking about, we do use the OSD driver
for various research projects, including support for OSD in parallel file
systems like OrangeFS and Lustre.

John.

> On Apr 18, 2017, at 12:24 PM, Boaz Harrosh  wrote:
> 
> On 04/12/2017 07:01 PM, Christoph Hellwig wrote:
> 
> Hi Sir Christoph
> 
>> The only real user of the T10 OSD protocol, the pNFS object layout
> >> driver, never went to the point of having shipping products, and the
> >> other two users (osdblk and exofs) were simple examples of its usage.
>> 
> 
> I understand why osdblk might be a pain, and was broken from day one, and
> should by all means go away ASAP.
> 
> But exofs should not be bothering anyone, and as far as I know it does
> not use any special APIs except the osd_uld code, of course.
> 
>> The code has been mostly unmaintained for years and is getting in the
>> way of block / SCSI changes, so I think it's finally time to drop it.
>> 
> 
> Please tell me, what are those changes you are talking about? I might be
> able to help with the conversion. I guess you mean osd_uld and the Upper SCSI API.
> Just point me at a tree where osd_uld is broken, and perhaps with a little
> guidance from you I can do a satisfactory conversion.
> 
> It is true that no new code went in for a long while, but I still from time
> to time run a setup and test that the whole stack, like iscsi-bidi and so
> on, still works.
> 
> That said, yes, only a stand-alone exofs was tested for a long time; a full
> pnfs setup is missing any supporting server. So yes, I admit that pnfs-obj is
> getting very rotten, and is most probably broken on the pnfs side of things.
> [Which I admit makes my little plea kind of moot ;-) ]
> 
> Every once in a while I get emails from students basing all kinds of
> interesting experiments on top of exofs and object-based storage. So for
> the sake of academics, and for the sake of true bidi-stack testing, might
> we want to evaluate what the upcoming cost is, and what minimum set we are
> willing to keep?
> 
> Please advise?
> 
> thanks
> Boaz
> 
>> These patches are against Jens' block for-next tree as that already
>> has various modifications of the SCSI code.
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
 
> Thanks :-) There's a reason why I'm insisting on this. We have constant
> requests for this today. We have hacks in the GPU drivers to do it for
> GPUs behind a switch, but those are just that, ad-hoc hacks in the
> drivers. We have similar grossness around the corner with some CAPI
> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack NVMe devices.

A lot of people feel this way in the RDMA community too. We have had
vendors shipping out of tree code to enable P2P for RDMA with GPU
years and years now. :(

Attempts to get things in mainline have always run into the same sort
of road blocks you've identified in this thread..

FWIW, I read this discussion and it sounds closer to an agreement than
I've ever seen in the past.

From Ben's comments, I would think that the 'first class' support that
is needed here is simply a function to return the 'struct device'
backing a CPU address range.

This is the minimal required information for the arch or IOMMU code
under the dma ops to figure out the fabric source/dest, compute the
traffic path, determine if P2P is even possible, what translation
hardware is crossed, and what DMA address should be used.

If there is going to be more core support for this stuff I think it
will be under the topic of more robustly describing the fabric to the
core and core helpers to extract data from the description: eg compute
the path, check if the path crosses translation, etc

But that isn't really related to P2P, and is probably better left to
the arch authors to figure out where they need to enhance the existing
topology data..

I think the key agreement to get out of Logan's series is that P2P DMA
means:
 - The BAR will be backed by struct pages
 - Passing the CPU __iomem address of the BAR to the DMA API is
   valid and, long term, dma ops providers are expected to fail
   or return the right DMA address
 - Mapping BAR memory into userspace and back to the kernel via
   get_user_pages works transparently, and with the DMA API above
 - The dma ops provider must be able to tell if source memory is bar
   mapped and recover the pci device backing the mapping.

At least this is what we'd like in RDMA :)

FWIW, RDMA probably wouldn't want to use a p2mem device either, we
already have APIs that map BAR memory to user space, and would like to
keep using them. An 'enable P2P for bar' helper function sounds better
to me.

Jason


Re: RFC: drop the T10 OSD code and its users

2017-04-18 Thread Boaz Harrosh
On 04/12/2017 07:01 PM, Christoph Hellwig wrote:

Hi Sir Christoph

> The only real user of the T10 OSD protocol, the pNFS object layout
> driver, never went to the point of having shipping products, and the
> other two users (osdblk and exofs) were simple examples of its usage.
> 

I understand why osdblk might be a pain, and was broken from day one, and
should by all means go away ASAP.

But exofs should not be bothering anyone, and as far as I know it does
not use any special APIs except the osd_uld code, of course.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 

Please tell me, what are those changes you are talking about? I might be
able to help with the conversion. I guess you mean osd_uld and the Upper SCSI API.
Just point me at a tree where osd_uld is broken, and perhaps with a little
guidance from you I can do a satisfactory conversion.

It is true that no new code went in for a long while, but I still from time
to time run a setup and test that the whole stack, like iscsi-bidi and so
on, still works.

That said, yes, only a stand-alone exofs was tested for a long time; a full
pnfs setup is missing any supporting server. So yes, I admit that pnfs-obj is
getting very rotten, and is most probably broken on the pnfs side of things.
[Which I admit makes my little plea kind of moot ;-) ]

Every once in a while I get emails from students basing all kinds of
interesting experiments on top of exofs and object-based storage. So for
the sake of academics, and for the sake of true bidi-stack testing, might
we want to evaluate what the upcoming cost is, and what minimum set we are
willing to keep?

Please advise?

thanks
Boaz

> These patches are against Jens' block for-next tree as that already
> has various modifications of the SCSI code.
> 



Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 08:56 -0700, James Bottomley wrote:
> How about this approach.  It goes straight to DEL if the device is
> blocked (skipping CANCEL).  This means that all the commands issued in 
> ->shutdown will error in the mid-layer, thus making the removal proceed
> without being stopped.

Hello James,

I think that's a good start but not a full solution. The block layer queue
still needs to be restarted to make sure that any requests that got queued
after the transition to the SDEV_BLOCK state and before the transition to
SDEV_DEL get processed. Anyway, I will see whether I can come up with a
patch based on this approach.
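
For the legacy (non-mq) path that could look roughly like the sketch
below, assuming the queue was stopped earlier (blk_start_queue() must
be called with the queue lock held):

        spin_lock_irq(q->queue_lock);
        blk_start_queue(q);     /* let stranded requests reach the midlayer */
        spin_unlock_irq(q->queue_lock);
        blk_cleanup_queue(q);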

Bart.


Re: [PATCH 16/22] xen-blkfront: Make use of the new sg_map helper function

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 09:50 AM, Konrad Rzeszutek Wilk wrote:
> I am not sure if you know, but you can add on each patch the respective
> maintainer via 'CC'. That way you can have certain maintainers CCed only
> on the subsystems they cover. You put it after (or before) your SoB and
> git send-email happilly picks it up.

Yes, but I've seen some maintainers complain when they receive a patch
with no context (ie. cover letter and first patch). So I chose to do it
this way. I expect in this situation, no matter what you do, someone is
going to complain about the approach chosen.

Thanks anyway for the tip.

Logan


Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread James Bottomley
On Tue, 2017-04-18 at 16:44 +0200, Benjamin Block wrote:
> Hej Bart,
> 
> sorry for the late-ish reply, had a long weekend.
> 
> On Thu, Apr 13, 2017 at 12:28:54AM +, Bart Van Assche wrote:
> > On Wed, 2017-04-12 at 16:41 +0200, Benjamin Block wrote:
> > > On Mon, Apr 10, 2017 at 10:54:01AM -0700, Bart Van Assche wrote:
> > > > [ ... ]
> > > OK, so I take it the problem is when the queue is stopped, then 
> > > the completion in blk_execute_rq() will never be triggered and 
> > > then we wait for a timeout there, or potentially forever?
> > 
> > Hello Benjamin,
> > 
> > Thanks for the review.
> > 
> > If a request is queued after a queue has been stopped then that 
> > request will never be started and hence even the timeout timer 
> > won't be started. blk_cleanup_queue() hangs if invoked for a 
> > stopped queue and one or more requests have not yet been started.
> > 
> > > But then what is the point in trying to do it async here anyway? 
> > > Won't that just be doomed in the same way, just that we don't see 
> > > the effect?
> > 
> > Have you noticed that patch 4/4 in this series restarts the queue 
> > just before calling blk_cleanup_queue()?
> > 
> > Anyway, can you have a look at the patch below and see whether this 
> > new version addresses all the concerns you had reported in your
> > previous e-mail?
> > 
> 
> Yes, the code- and comment-changes in sd_shutdown() are good. 
> Apparently there is something new with the done-function now, but you 
> got that from Israel.
> 
> I still wonder why we try 'so hard' scheduling a command for a dead
> device, but as that seems to be the status quo, and only lacks in the
> case where the LLD is already half-way gone, it's ok for me too. I
> mean, the order is a bit screwed.. we apparently first remove the 
> driver and post-factum try to drain the queue.. that is strange.

Yes, I've said all along that adding asynchronicity is only going to
add more problems.

The priority has got to be the removal we've been ordered to do rather
than waiting around to see if the transport comes back so we can send a
final flush.

How about this approach?  It goes straight to DEL if the device is
blocked (skipping CANCEL).  This means that all the commands issued in 
->shutdown will error in the mid-layer, thus making the removal proceed
without being stopped.

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e5a2d590a104..31171204cfd1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2611,7 +2611,6 @@ scsi_device_set_state(struct scsi_device *sdev, enum 
scsi_device_state state)
case SDEV_QUIESCE:
case SDEV_OFFLINE:
case SDEV_TRANSPORT_OFFLINE:
-   case SDEV_BLOCK:
break;
default:
goto illegal;
@@ -2625,6 +2624,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum 
scsi_device_state state)
case SDEV_OFFLINE:
case SDEV_TRANSPORT_OFFLINE:
case SDEV_CANCEL:
+   case SDEV_BLOCK:
case SDEV_CREATED_BLOCK:
break;
default:
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07b1d47..788309e307e9 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1282,8 +1282,14 @@ void __scsi_remove_device(struct scsi_device *sdev)
return;
 
if (sdev->is_visible) {
+   /*
+* If blocked, we go straight to DEL so any commands
+* issued during the driver shutdown (like sync cache)
+* are errored
+*/
if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
-   return;
+   if (scsi_device_set_state(sdev, SDEV_DEL) != 0)
+   return;
 
bsg_unregister_queue(sdev->request_queue);
device_unregister(&sdev->sdev_dev);




block: add a error_count field to struct request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

This is for the legacy floppy and ataflop drivers that currently abuse
->errors for this purpose.  It's stashed away in a union so as not to
grow the struct size; the other fields are either used by modern
drivers for different purposes or by the I/O scheduler before queueing
the I/O to drivers.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blkdev.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5986e1250d7d..e618164462e8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -175,6 +175,7 @@ struct request {
struct rb_node rb_node; /* sort/lookup */
struct bio_vec special_vec;
void *completion_data;
+   int error_count; /* for legacy drivers, don't use */
};
 
/*
-- 
2.11.0



ataflop: switch from req->errors to req->error_count

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
---
 drivers/block/ataflop.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 2104b1b4ccda..fa69ecd52cb5 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -617,12 +617,12 @@ static void fd_error( void )
if (!fd_request)
return;
 
-   fd_request->errors++;
-   if (fd_request->errors >= MAX_ERRORS) {
+   fd_request->error_count++;
+   if (fd_request->error_count >= MAX_ERRORS) {
printk(KERN_ERR "fd%d: too many errors.\n", SelectedDrive );
fd_end_request_cur(-EIO);
}
-   else if (fd_request->errors == RECALIBRATE_ERRORS) {
+   else if (fd_request->error_count == RECALIBRATE_ERRORS) {
printk(KERN_WARNING "fd%d: recalibrating\n", SelectedDrive );
if (SelectedDrive != -1)
SUD.track = -1;
@@ -1386,7 +1386,7 @@ static void setup_req_params( int drive )
ReqData = ReqBuffer + 512 * ReqCnt;
 
if (UseTrackbuffer)
-   read_track = (ReqCmd == READ && fd_request->errors == 0);
+   read_track = (ReqCmd == READ && fd_request->error_count == 0);
else
read_track = 0;
 
@@ -1409,8 +1409,10 @@ static struct request *set_next_request(void)
fdc_queue = 0;
if (q) {
rq = blk_fetch_request(q);
-   if (rq)
+   if (rq) {
+   rq->error_count = 0;
break;
+   }
}
} while (fdc_queue != old_pos);
 
-- 
2.11.0



pd: remove bogus check for req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

The driver never sets req->errors, so checking it after blk_execute_rq()
is pointless.
---
 drivers/block/paride/pd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index 82c6d02193ae..3b0ab214fe74 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -739,7 +739,6 @@ static int pd_special_command(struct pd_unit *disk,
  enum action (*func)(struct pd_unit *disk))
 {
struct request *rq;
-   int err = 0;
 
rq = blk_get_request(disk->gd->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
if (IS_ERR(rq))
@@ -748,10 +747,9 @@ static int pd_special_command(struct pd_unit *disk,
rq->special = func;
 
blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
-   err = req->errors ? -EIO : 0;
 
blk_put_request(rq);
-   return err;
+   return 0;
 }
 
 /* kernel glue structures */
-- 
2.11.0



blk-mq: remove the error argument to blk_mq_complete_request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Now that we always have a ->complete callback we can remove the direct
call to blk_mq_end_request, as well as the error argument to
blk_mq_complete_request.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c| 15 +++
 drivers/block/loop.c  |  4 ++--
 drivers/block/mtip32xx/mtip32xx.c |  4 ++--
 drivers/block/nbd.c   |  4 ++--
 drivers/block/null_blk.c  |  2 +-
 drivers/block/virtio_blk.c|  2 +-
 drivers/block/xen-blkfront.c  |  2 +-
 drivers/md/dm-rq.c|  2 +-
 drivers/nvme/host/core.c  |  2 +-
 drivers/nvme/host/nvme.h  |  2 +-
 drivers/scsi/scsi_lib.c   |  2 +-
 include/linux/blk-mq.h|  2 +-
 12 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2ef7b460924..0c19df66f5a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -442,17 +442,10 @@ static void blk_mq_stat_add(struct request *rq)
 
 static void __blk_mq_complete_request(struct request *rq)
 {
-   struct request_queue *q = rq->q;
-
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
-
blk_mq_stat_add(rq);
-
-   if (!q->softirq_done_fn)
-   blk_mq_end_request(rq, rq->errors);
-   else
-   blk_mq_ipi_complete_request(rq);
+   blk_mq_ipi_complete_request(rq);
 }
 
 /**
@@ -463,16 +456,14 @@ static void __blk_mq_complete_request(struct request *rq)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq, int error)
+void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
-   if (!blk_mark_rq_complete(rq)) {
-   rq->errors = error;
+   if (!blk_mark_rq_complete(rq))
__blk_mq_complete_request(rq);
-   }
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 86351b3f7350..994403efee19 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -465,7 +465,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
cmd->ret = ret;
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);
 }
 
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
@@ -1685,7 +1685,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
/* complete non-aio request */
if (!cmd->use_aio || ret) {
cmd->ret = ret ? -EIO : 0;
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);
}
 }
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 7406de29db58..66a6bd83faae 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -242,7 +242,7 @@ static void mtip_async_complete(struct mtip_port *port,
rq = mtip_rq_from_tag(dd, tag);
 
cmd->status = status;
-   blk_mq_complete_request(rq, 0);
+   blk_mq_complete_request(rq);
 }
 
 /*
@@ -4109,7 +4109,7 @@ static void mtip_no_dev_cleanup(struct request *rq, void 
*data, bool reserv)
 
if (likely(!reserv)) {
cmd->status = -ENODEV;
-   blk_mq_complete_request(rq, 0);
+   blk_mq_complete_request(rq);
} else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &dd->port->flags)) {
 
cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 12ec881fd699..65341ed3fa11 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -635,7 +635,7 @@ static void recv_work(struct work_struct *work)
break;
}
 
-   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd), 0);
+   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd));
}
atomic_dec(&config->recv_threads);
wake_up(&config->recv_wq);
@@ -651,7 +651,7 @@ static void nbd_clear_req(struct request *req, void *data, 
bool reserved)
return;
cmd = blk_mq_rq_to_pdu(req);
cmd->status = -EIO;
-   blk_mq_complete_request(req, 0);
+   blk_mq_complete_request(req);
 }
 
 static void nbd_clear_que(struct nbd_device *nbd)
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 24ca85a70fd8..c27cccec368b 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -281,7 +281,7 @@ static inline void null_handle_cmd(struct nullb_cmd *cmd)
case NULL_IRQ_SOFTIRQ:
switch (queue_mode)  {
case NULL_Q_MQ:
-   blk_mq_complete_request(cmd->rq, 0);
+   blk_mq_complete_request(cmd->rq);

dm rq: don't pass irrelevant error code to blk_mq_complete_request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

dm never uses rq->errors, so there is no need to pass an error argument
to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-rq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index e60f1b6845be..1173be21f6f6 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -363,7 +363,7 @@ static void dm_complete_request(struct request *rq, int 
error)
if (!rq->q->mq_ops)
blk_complete_request(rq);
else
-   blk_mq_complete_request(rq, error);
+   blk_mq_complete_request(rq, 0);
 }
 
 /*
-- 
2.11.0



scsi: introduce a new result field in struct scsi_request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

This passes on the scsi_cmnd result field to users of passthrough
requests.  Currently we abuse req->errors for this purpose, but that
field will go away in its current form.

Note that the old IDE code abuses the errors field in very creative
ways and stores all kinds of different values in it.  I didn't dare
to touch this magic, so the abuses are brought forward 1:1.

Signed-off-by: Christoph Hellwig 
---
 block/bsg-lib.c|  8 
 block/bsg.c| 12 +--
 block/scsi_ioctl.c | 14 ++---
 drivers/block/cciss.c  | 42 +++---
 drivers/block/pktcdvd.c|  2 +-
 drivers/block/virtio_blk.c |  2 +-
 drivers/cdrom/cdrom.c  |  2 +-
 drivers/ide/ide-atapi.c| 10 -
 drivers/ide/ide-cd.c   | 20 +-
 drivers/ide/ide-cd_ioctl.c |  2 +-
 drivers/ide/ide-devsets.c  |  4 ++--
 drivers/ide/ide-dma.c  |  2 +-
 drivers/ide/ide-eh.c   | 36 
 drivers/ide/ide-floppy.c   | 10 -
 drivers/ide/ide-io.c   | 10 -
 drivers/ide/ide-ioctls.c   |  4 ++--
 drivers/ide/ide-park.c |  2 +-
 drivers/ide/ide-pm.c   |  8 
 drivers/ide/ide-tape.c |  4 ++--
 drivers/ide/ide-taskfile.c |  6 +++---
 drivers/scsi/osd/osd_initiator.c   |  2 +-
 drivers/scsi/osst.c|  2 +-
 drivers/scsi/qla2xxx/qla_bsg.c |  6 +++---
 drivers/scsi/scsi_lib.c| 15 +++---
 drivers/scsi/scsi_transport_sas.c  |  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/scsi/st.c  |  6 +++---
 drivers/target/target_core_pscsi.c |  2 +-
 fs/nfsd/blocklayout.c  |  4 ++--
 include/linux/ide.h|  2 +-
 include/scsi/scsi_request.h|  1 +
 31 files changed, 122 insertions(+), 122 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index cd15f9dbb147..0a23dbba2d30 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -37,7 +37,7 @@ static void bsg_destroy_job(struct kref *kref)
struct bsg_job *job = container_of(kref, struct bsg_job, kref);
struct request *rq = job->req;
 
-   blk_end_request_all(rq, rq->errors);
+   blk_end_request_all(rq, scsi_req(rq)->result);
 
put_device(job->dev);   /* release reference for the request */
 
@@ -74,7 +74,7 @@ void bsg_job_done(struct bsg_job *job, int result,
struct scsi_request *rq = scsi_req(req);
int err;
 
-   err = job->req->errors = result;
+   err = scsi_req(job->req)->result = result;
if (err < 0)
/* we're only returning the result field in the reply */
rq->sense_len = sizeof(u32);
@@ -177,7 +177,7 @@ static int bsg_create_job(struct device *dev, struct 
request *req)
  * @q: request queue to manage
  *
  * On error the create_bsg_job function should return a -Exyz error value
- * that will be set to the req->errors.
+ * that will be set to ->result.
  *
  * Drivers/subsys should pass this to the queue init function.
  */
@@ -201,7 +201,7 @@ static void bsg_request_fn(struct request_queue *q)
 
ret = bsg_create_job(dev, req);
if (ret) {
-   req->errors = ret;
+   scsi_req(req)->result = ret;
blk_end_request_all(req, ret);
spin_lock_irq(q->queue_lock);
continue;
diff --git a/block/bsg.c b/block/bsg.c
index 74835dbf0c47..e54daf9f2ee0 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -391,13 +391,13 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, 
struct sg_io_v4 *hdr,
struct scsi_request *req = scsi_req(rq);
int ret = 0;
 
-   dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->errors);
+   dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->result);
+   dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->result);
/*
 * fill in all the output members
 */
-   hdr->device_status = rq->errors & 0xff;
-   hdr->transport_status = host_byte(rq->errors);
-   hdr->driver_status = driver_byte(rq->errors);
+   hdr->device_status = req->result & 0xff;
+   hdr->transport_status = host_byte(req->result);
+   hdr->driver_status = driver_byte(req->result);
hdr->info = 0;
if (hdr->device_status || hdr->transport_status || hdr->driver_status)
hdr->info |= SG_INFO_CHECK;
@@ -431,8 +431,8 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, 
struct sg_io_v4 *hdr,
 * just a protocol response (i.e. non negative), that gets
 * processed above.
 */
-   if (!ret && rq->errors < 0)
-   ret = rq->errors;
+   if (!ret && req->result < 0)
+   ret = req->result;
 
blk_rq_unmap_user(bio);
scsi_req_free_cmd(req);
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c

block: remove the errors field from struct request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
Acked-by: Roger Pau Monné 
Reviewed-by: Konrad Rzeszutek Wilk 
---
 block/blk-core.c | 14 +-
 block/blk-exec.c |  3 +--
 block/blk-mq.c   | 10 +++---
 block/blk-timeout.c  |  1 -
 include/linux/blkdev.h   |  2 --
 include/trace/events/block.h | 17 +++--
 kernel/trace/blktrace.c  | 26 --
 7 files changed, 24 insertions(+), 49 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8654aa0cef6d..48f522a02f54 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1633,7 +1633,6 @@ void init_request_from_bio(struct request *req, struct 
bio *bio)
if (bio->bi_opf & REQ_RAHEAD)
req->cmd_flags |= REQ_FAILFAST_MASK;
 
-   req->errors = 0;
req->__sector = bio->bi_iter.bi_sector;
blk_rq_set_prio(req, rq_ioc(bio));
if (ioprio_valid(bio_prio(bio)))
@@ -2567,22 +2566,11 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
 {
int total_bytes;
 
-   trace_block_rq_complete(req->q, req, nr_bytes);
+   trace_block_rq_complete(req, error, nr_bytes);
 
if (!req->bio)
return false;
 
-   /*
-* For fs requests, rq is just carrier of independent bio's
-* and each partial completion should be handled separately.
-* Reset per-request error on each partial completion.
-*
-* TODO: tj: This is too subtle.  It would be better to let
-* low level drivers do what they see fit.
-*/
-   if (!blk_rq_is_passthrough(req))
-   req->errors = 0;
-
if (error && !blk_rq_is_passthrough(req) &&
!(req->rq_flags & RQF_QUIET)) {
char *error_type;
diff --git a/block/blk-exec.c b/block/blk-exec.c
index afa383248c7c..a9451e3b8587 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -69,8 +69,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 
if (unlikely(blk_queue_dying(q))) {
rq->rq_flags |= RQF_QUIET;
-   rq->errors = -ENXIO;
-   __blk_end_request_all(rq, rq->errors);
+   __blk_end_request_all(rq, -ENXIO);
spin_unlock_irq(q->queue_lock);
return;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1f98db082da3..f5e3cf153ad7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -213,7 +213,6 @@ void blk_mq_rq_ctx_init(struct request_queue *q, struct 
blk_mq_ctx *ctx,
 #endif
rq->special = NULL;
/* tag was already set */
-   rq->errors = 0;
rq->extra_len = 0;
 
INIT_LIST_HEAD(&rq->timeout_list);
@@ -624,8 +623,7 @@ void blk_mq_abort_requeue_list(struct request_queue *q)
 
rq = list_first_entry(&rq_list, struct request, queuelist);
list_del_init(&rq->queuelist);
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
}
 }
 EXPORT_SYMBOL(blk_mq_abort_requeue_list);
@@ -1032,8 +1030,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
pr_err("blk-mq: bad return on queue: %d\n", ret);
case BLK_MQ_RQ_QUEUE_ERROR:
errors++;
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
break;
}
 
@@ -1484,8 +1481,7 @@ static void __blk_mq_try_issue_directly(struct request 
*rq, blk_qc_t *cookie,
 
if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
*cookie = BLK_QC_T_NONE;
-   rq->errors = -EIO;
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, -EIO);
return;
}
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a30441a200c0..cbff183f3d9f 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -89,7 +89,6 @@ static void blk_rq_timed_out(struct request *req)
ret = q->rq_timed_out_fn(req);
switch (ret) {
case BLK_EH_HANDLED:
-   /* Can we use req->errors here? */
__blk_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e618164462e8..d40ac947ae26 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -220,8 +220,6 @@ struct request {
 
void *special;  /* opaque pointer available for LLD use */
 
-   int errors;
-
unsigned int extra_len; /* length of alignment and padding */
 
unsigned long deadline;
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 99ed69fad041..d0dbe60d8a6d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h

mtip32xx: add a status field to struct mtip_cmd

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Add a driver-private status field to struct mtip_cmd instead of using
req->errors, which will go away.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/mtip32xx/mtip32xx.c | 16 +---
 drivers/block/mtip32xx/mtip32xx.h |  1 +
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 05e3e664ea1b..7406de29db58 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -241,7 +241,8 @@ static void mtip_async_complete(struct mtip_port *port,
 
rq = mtip_rq_from_tag(dd, tag);
 
-   blk_mq_complete_request(rq, status);
+   cmd->status = status;
+   blk_mq_complete_request(rq, 0);
 }
 
 /*
@@ -2910,18 +2911,19 @@ static void mtip_softirq_done_fn(struct request *rq)
if (unlikely(cmd->unaligned))
up(&dd->port->cmd_slot_unal);
 
-   blk_mq_end_request(rq, rq->errors);
+   blk_mq_end_request(rq, cmd->status);
 }
 
 static void mtip_abort_cmd(struct request *req, void *data,
bool reserved)
 {
+   struct mtip_cmd *cmd = blk_mq_rq_to_pdu(req);
struct driver_data *dd = data;
 
dbg_printk(MTIP_DRV_NAME " Aborting request, tag = %d\n", req->tag);
 
clear_bit(req->tag, dd->port->cmds_to_issue);
-   req->errors = -EIO;
+   cmd->status = -EIO;
mtip_softirq_done_fn(req);
 }
 
@@ -3816,7 +3818,6 @@ static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx,
if (likely(!ret))
return BLK_MQ_RQ_QUEUE_OK;
 
-   rq->errors = ret;
return BLK_MQ_RQ_QUEUE_ERROR;
 }
 
@@ -4106,9 +4107,11 @@ static void mtip_no_dev_cleanup(struct request *rq, void *data, bool reserv)
struct driver_data *dd = (struct driver_data *)data;
struct mtip_cmd *cmd;
 
-   if (likely(!reserv))
-   blk_mq_complete_request(rq, -ENODEV);
-   else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &dd->port->flags)) {
+   if (likely(!reserv)) {
+   cmd = blk_mq_rq_to_pdu(rq);
+   cmd->status = -ENODEV;
+   blk_mq_complete_request(rq, 0);
+   } else if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &dd->port->flags)) {
 
cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
if (cmd->comp_func)
diff --git a/drivers/block/mtip32xx/mtip32xx.h 
b/drivers/block/mtip32xx/mtip32xx.h
index 7617888f7944..57b41528a824 100644
--- a/drivers/block/mtip32xx/mtip32xx.h
+++ b/drivers/block/mtip32xx/mtip32xx.h
@@ -352,6 +352,7 @@ struct mtip_cmd {
int retries; /* The number of retries left for this command. */
 
int direction; /* Data transfer direction */
+   int status;
 };
 
 /* Structure used to describe a port. */
-- 
2.11.0



kill req->errors V2

2017-04-18 Thread Christoph Hellwig
Currently the request structure has an errors field that is used in
various different ways.  The oldest drivers use it as an error count,
blk-mq and the generic timeout code assume that it holds a Linux
errno for block completions, and various drivers use it for internal
status values, often overwriting them with Linux errnos later; that
is, unless they are processing passthrough requests, in which case
they'll leave their errors in it.

This series kills the ->errors field and replaces it with new fields
in the drivers (or an odd hack in a union in struct request for
two bitrotting old drivers).
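
The conversion pattern is roughly the following sketch, using the nbd
names from later in this series (once the whole series is applied,
blk_mq_complete_request() also loses its error argument):

        /* before: status smuggled through the shared request field */
        req->errors = -EIO;
        blk_mq_complete_request(req, 0);

        /* after: status lives in the driver's per-command data */
        struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);

        cmd->status = -EIO;
        blk_mq_complete_request(req);   /* ->complete reads cmd->status */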

Also available in the following git tree:

git://git.infradead.org/users/hch/block.git request-errors

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/request-errors


Changes since V1:
 - rebased on top the latest block for-next tree
 - fix error handling in nfsd blocklayout
 - dropped "scsi: fix fast-fail for non-passthrough requests"


nbd: don't use req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Add a nbd-specific field instead.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Josef Bacik 
---
 drivers/block/nbd.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b78f23ce2395..12ec881fd699 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -115,6 +115,7 @@ struct nbd_cmd {
int index;
int cookie;
struct completion send_complete;
+   int status;
 };
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
@@ -244,16 +245,14 @@ static void nbd_size_set(struct nbd_device *nbd, loff_t 
blocksize,
nbd_size_update(nbd);
 }
 
-static void nbd_end_request(struct nbd_cmd *cmd)
+static void nbd_complete_rq(struct request *req)
 {
-   struct nbd_device *nbd = cmd->nbd;
-   struct request *req = blk_mq_rq_from_pdu(cmd);
-   int error = req->errors ? -EIO : 0;
+   struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 
-   dev_dbg(nbd_to_dev(nbd), "request %p: %s\n", cmd,
-   error ? "failed" : "done");
+   dev_dbg(nbd_to_dev(cmd->nbd), "request %p: %s\n", cmd,
+   cmd->status ? "failed" : "done");
 
-   blk_mq_complete_request(req, error);
+   blk_mq_end_request(req, cmd->status);
 }
 
 /*
@@ -286,7 +285,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct 
request *req,
struct nbd_config *config;
 
if (!refcount_inc_not_zero(&nbd->config_refs)) {
-   req->errors = -EIO;
+   cmd->status = -EIO;
return BLK_EH_HANDLED;
}
 
@@ -331,7 +330,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct 
request *req,
"Connection timed out\n");
}
set_bit(NBD_TIMEDOUT, &config->runtime_flags);
-   req->errors = -EIO;
+   cmd->status = -EIO;
sock_shutdown(nbd);
nbd_config_put(nbd);
 
@@ -574,7 +573,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device 
*nbd, int index)
if (ntohl(reply.error)) {
dev_err(disk_to_dev(nbd->disk), "Other side returned error 
(%d)\n",
ntohl(reply.error));
-   req->errors = -EIO;
+   cmd->status = -EIO;
return cmd;
}
 
@@ -599,7 +598,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device 
*nbd, int index)
 */
if (nbd_disconnected(config) ||
config->num_connections <= 1) {
-   req->errors = -EIO;
+   cmd->status = -EIO;
return cmd;
}
return ERR_PTR(-EIO);
@@ -636,7 +635,7 @@ static void recv_work(struct work_struct *work)
break;
}
 
-   nbd_end_request(cmd);
+   blk_mq_complete_request(blk_mq_rq_from_pdu(cmd), 0);
}
atomic_dec(&config->recv_threads);
wake_up(&config->recv_wq);
@@ -651,8 +650,8 @@ static void nbd_clear_req(struct request *req, void *data, 
bool reserved)
if (!blk_mq_request_started(req))
return;
cmd = blk_mq_rq_to_pdu(req);
-   req->errors = -EIO;
-   nbd_end_request(cmd);
+   cmd->status = -EIO;
+   blk_mq_complete_request(req, 0);
 }
 
 static void nbd_clear_que(struct nbd_device *nbd)
@@ -740,7 +739,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
nbd_config_put(nbd);
return -EINVAL;
}
-   req->errors = 0;
+   cmd->status = 0;
 again:
nsock = config->socks[index];
mutex_lock(&nsock->tx_lock);
@@ -1408,6 +1407,7 @@ static int nbd_init_request(void *data, struct request 
*rq,
 
 static const struct blk_mq_ops nbd_mq_ops = {
.queue_rq   = nbd_queue_rq,
+   .complete   = nbd_complete_rq,
.init_request   = nbd_init_request,
.timeout= nbd_xmit_timeout,
 };
-- 
2.11.0



dm mpath: don't check for req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

We'll get all proper errors reported through ->end_io and ->errors will
go away soon.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-mpath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ab55955ed704..2950b145443d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1492,7 +1492,7 @@ static int do_end_io(struct multipath *m, struct request 
*clone,
 */
int r = DM_ENDIO_REQUEUE;
 
-   if (!error && !clone->errors)
+   if (!error)
return 0;   /* I/O complete */
 
if (noretry_error(error))
-- 
2.11.0



virtio_blk: don't use req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Remove passing req->errors (which at that point is always 0) to
blk_mq_complete_request, and rely on the virtio status code for the
serial number passthrough request.
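
For reference, the virtio status translation relied on here, as a
paraphrased sketch of virtblk_result() as of this series (see the tree
for the authoritative version):

static inline int virtblk_result(struct virtblk_req *vbr)
{
	switch (vbr->status) {
	case VIRTIO_BLK_S_OK:
		return 0;
	case VIRTIO_BLK_S_UNSUPP:
		return -ENOTTY;
	default:
		return -EIO;
	}
}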

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/block/virtio_blk.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index dbc4e80680b1..8378ad480f77 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -175,19 +175,15 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
virtblk_req *vbr,
 static inline void virtblk_request_done(struct request *req)
 {
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
-   int error = virtblk_result(vbr);
 
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
virtblk_scsi_request_done(req);
break;
-   case REQ_OP_DRV_IN:
-   req->errors = (error != 0);
-   break;
}
 
-   blk_mq_end_request(req, error);
+   blk_mq_end_request(req, virtblk_result(vbr));
 }
 
 static void virtblk_done(struct virtqueue *vq)
@@ -205,7 +201,7 @@ static void virtblk_done(struct virtqueue *vq)
while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != 
NULL) {
struct request *req = blk_mq_rq_from_pdu(vbr);
 
-   blk_mq_complete_request(req, req->errors);
+   blk_mq_complete_request(req, 0);
req_done = true;
}
if (unlikely(virtqueue_is_broken(vq)))
@@ -311,7 +307,7 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
goto out;
 
blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
-   err = req->errors ? -EIO : 0;
+   err = virtblk_result(blk_mq_rq_to_pdu(req));
 out:
blk_put_request(req);
return err;
-- 
2.11.0



null_blk: don't pass always-0 req->errors to blk_mq_complete_request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
---
 drivers/block/null_blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index f93906ff31e8..24ca85a70fd8 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -281,7 +281,7 @@ static inline void null_handle_cmd(struct nullb_cmd *cmd)
case NULL_IRQ_SOFTIRQ:
switch (queue_mode)  {
case NULL_Q_MQ:
-   blk_mq_complete_request(cmd->rq, cmd->rq->errors);
+   blk_mq_complete_request(cmd->rq, 0);
break;
case NULL_Q_RQ:
blk_complete_request(cmd->rq);
-- 
2.11.0



nvme: split nvme status from block req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

We want our own clearly defined error field for NVMe passthrough commands,
and the request errors field is going away in its current form.

Just store the status and result field in the nvme_request field from
hardirq completion context (using a new helper) and then generate a
Linux errno for the block layer only when we actually need it.

Because we can't overload the status value with a negative error code
for cancelled commands, we now have a flags field in struct nvme_request
that contains a bit for this condition.
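
The new helper looks roughly like this (paraphrased; see the nvme.h
hunk for the authoritative version). It stashes the raw status and
result in struct nvme_request and hands blk-mq a zero error code:

static inline void nvme_end_request(struct request *req, __le16 status,
		union nvme_result result)
{
	struct nvme_request *rq = nvme_req(req);

	rq->status = le16_to_cpu(status) >> 1;	/* drop the phase bit */
	rq->result = result;
	blk_mq_complete_request(req, 0);
}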

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 50 +++-
 drivers/nvme/host/fc.c   | 10 -
 drivers/nvme/host/lightnvm.c |  9 +---
 drivers/nvme/host/nvme.h | 33 +
 drivers/nvme/host/pci.c  | 11 +-
 drivers/nvme/host/rdma.c |  5 ++---
 drivers/nvme/target/loop.c   |  7 ++-
 7 files changed, 65 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d33f829c3ab7..c6f256d74b6b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -66,11 +66,24 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
+int nvme_error_status(struct request *req)
+{
+   switch (nvme_req(req)->status & 0x7ff) {
+   case NVME_SC_SUCCESS:
+   return 0;
+   case NVME_SC_CAP_EXCEEDED:
+   return -ENOSPC;
+   default:
+   return -EIO;
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_error_status);
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
if (blk_noretry_request(req))
return false;
-   if (req->errors & NVME_SC_DNR)
+   if (nvme_req(req)->status & NVME_SC_DNR)
return false;
if (jiffies - req->start_time >= req->timeout)
return false;
@@ -81,23 +94,13 @@ static inline bool nvme_req_needs_retry(struct request *req)
 
 void nvme_complete_rq(struct request *req)
 {
-   int error = 0;
-
-   if (unlikely(req->errors)) {
-   if (nvme_req_needs_retry(req)) {
-   nvme_req(req)->retries++;
-   blk_mq_requeue_request(req,
-   !blk_mq_queue_stopped(req->q));
-   return;
-   }
-
-   if (blk_rq_is_passthrough(req))
-   error = req->errors;
-   else
-   error = nvme_error_status(req->errors);
+   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+   nvme_req(req)->retries++;
+   blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
+   return;
}
 
-   blk_mq_end_request(req, error);
+   blk_mq_end_request(req, nvme_error_status(req));
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
@@ -114,7 +117,9 @@ void nvme_cancel_request(struct request *req, void *data, 
bool reserved)
status = NVME_SC_ABORT_REQ;
if (blk_queue_dying(req->q))
status |= NVME_SC_DNR;
-   blk_mq_complete_request(req, status);
+   nvme_req(req)->status = status;
+   blk_mq_complete_request(req, 0);
+
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
@@ -357,6 +362,7 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 
if (!(req->rq_flags & RQF_DONTPREP)) {
nvme_req(req)->retries = 0;
+   nvme_req(req)->flags = 0;
req->rq_flags |= RQF_DONTPREP;
}
 
@@ -413,7 +419,10 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct 
nvme_command *cmd,
blk_execute_rq(req->q, NULL, req, at_head);
if (result)
*result = nvme_req(req)->result;
-   ret = req->errors;
+   if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+   ret = -EINTR;
+   else
+   ret = nvme_req(req)->status;
  out:
blk_mq_free_request(req);
return ret;
@@ -498,7 +507,10 @@ int __nvme_submit_user_cmd(struct request_queue *q, struct 
nvme_command *cmd,
}
  submit:
blk_execute_rq(req->q, disk, req, 0);
-   ret = req->errors;
+   if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+   ret = -EINTR;
+   else
+   ret = nvme_req(req)->status;
if (result)
*result = le32_to_cpu(nvme_req(req)->result.u32);
if (meta && !ret && !write) {
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index aad7f9c0be32..450733c8cd24 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1148,6 +1148,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_fc_queue *queue = op->queue;
struct nvme_completion *cqe = &op->rsp_iu.cqe;
__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
+   union nvme_result result;
 
/*
 * WARNING:
@@

virtio: fix spelling of virtblk_scsi_request_done

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Bart Van Assche 
---
 drivers/block/virtio_blk.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index eaf99022bdc6..dbc4e80680b1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -111,7 +111,7 @@ static int virtblk_add_req_scsi(struct virtqueue *vq, 
struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
 }
 
-static inline void virtblk_scsi_reques_done(struct request *req)
+static inline void virtblk_scsi_request_done(struct request *req)
 {
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
struct virtio_blk *vblk = req->q->queuedata;
@@ -144,7 +144,7 @@ static inline int virtblk_add_req_scsi(struct virtqueue *vq,
 {
return -EIO;
 }
-static inline void virtblk_scsi_reques_done(struct request *req)
+static inline void virtblk_scsi_request_done(struct request *req)
 {
 }
 #define virtblk_ioctl  NULL
@@ -180,7 +180,7 @@ static inline void virtblk_request_done(struct request *req)
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
-   virtblk_scsi_reques_done(req);
+   virtblk_scsi_request_done(req);
break;
case REQ_OP_DRV_IN:
req->errors = (error != 0);
-- 
2.11.0



floppy: switch from req->errors to req->error_count

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
---
 drivers/block/floppy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index ce102ec47ef2..60d4c7653178 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2805,8 +2805,10 @@ static int set_next_request(void)
fdc_queue = 0;
if (q) {
current_req = blk_fetch_request(q);
-   if (current_req)
+   if (current_req) {
+   current_req->error_count = 0;
break;
+   }
}
} while (fdc_queue != old_pos);
 
@@ -2866,7 +2868,7 @@ static void redo_fd_request(void)
_floppy = floppy_type + DP->autodetect[DRS->probed_format];
} else
probing = 0;
-   errors = &(current_req->errors);
+   errors = &(current_req->error_count);
tmp = make_raw_rw_request();
if (tmp < 2) {
request_done(tmp);
-- 
2.11.0



nvme-fc: fix status code handling in nvme_fc_fcpio_done

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

nvme_complete_async_event expects the little-endian status code
including the phase bit, and a new completion handler I plan to
introduce will expect the same.

Change the status variable to the little-endian format with the phase
bit, as used in the NVMe CQE, to fix / enable this.
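
As a short worked example of the encoding: in the CQE status field,
bit 0 is the phase tag and bits 15:1 carry the status code, hence the
shifts when building and consuming the little-endian value:

	__le16 wire = cpu_to_le16(NVME_SC_ABORT_REQ << 1);	/* encode */
	u16 sc = le16_to_cpu(wire) >> 1;			/* decode */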

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/fc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index fc42172c796a..aad7f9c0be32 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1147,7 +1147,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_fc_ctrl *ctrl = op->ctrl;
struct nvme_fc_queue *queue = op->queue;
struct nvme_completion *cqe = &op->rsp_iu.cqe;
-   u16 status = NVME_SC_SUCCESS;
+   __le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
 
/*
 * WARNING:
@@ -1182,9 +1182,9 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
sizeof(op->rsp_iu), DMA_FROM_DEVICE);
 
if (atomic_read(&op->state) == FCPOP_STATE_ABORTED)
-   status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
+   status = cpu_to_le16((NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1);
else if (freq->status)
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
 
/*
 * For the linux implementation, if we have an unsuccesful
@@ -1212,7 +1212,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 */
if (freq->transferred_length !=
be32_to_cpu(op->cmd_iu.data_len)) {
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
op->nreq.result.u64 = 0;
@@ -1229,15 +1229,15 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
freq->transferred_length ||
 op->rsp_iu.status_code ||
 op->rqno != le16_to_cpu(cqe->command_id))) {
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
op->nreq.result = cqe->result;
-   status = le16_to_cpu(cqe->status) >> 1;
+   status = cqe->status;
break;
 
default:
-   status = NVME_SC_FC_TRANSPORT_ERROR;
+   status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
goto done;
}
 
@@ -1249,7 +1249,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
return;
}
 
-   blk_mq_complete_request(rq, status);
+   blk_mq_complete_request(rq, le16_to_cpu(status) >> 1);
 }
 
 static int
-- 
2.11.0



block: remove the blk_execute_rq return value

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

The function only returns -EIO if rq->errors is non-zero, which is not
very useful and lets a large number of callers ignore the return value.

Just let the callers figure out their error themselves.
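
The resulting calling convention, as a sketch of the pattern repeated
throughout the diff below:

	blk_execute_rq(q, disk, rq, 0);
	err = rq->errors ? -EIO : 0;	/* each caller picks its own errno */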

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 block/blk-exec.c | 8 +---
 block/scsi_ioctl.c   | 3 ++-
 drivers/block/paride/pd.c| 3 ++-
 drivers/block/virtio_blk.c   | 3 ++-
 drivers/cdrom/cdrom.c| 3 ++-
 drivers/ide/ide-atapi.c  | 3 ++-
 drivers/ide/ide-cd.c | 3 ++-
 drivers/ide/ide-cd_ioctl.c   | 3 ++-
 drivers/ide/ide-devsets.c| 4 ++--
 drivers/ide/ide-disk.c   | 3 +--
 drivers/ide/ide-ioctls.c | 7 ---
 drivers/ide/ide-park.c   | 3 ++-
 drivers/ide/ide-pm.c | 3 ++-
 drivers/ide/ide-taskfile.c   | 4 ++--
 drivers/scsi/osd/osd_initiator.c | 5 -
 fs/nfsd/blocklayout.c| 5 +++--
 include/linux/blkdev.h   | 2 +-
 17 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 8cd0e9bc8dc8..afa383248c7c 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -92,11 +92,10 @@ EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
  *Insert a fully prepared request at the back of the I/O scheduler queue
  *for execution and wait for completion.
  */
-int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
+void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
   struct request *rq, int at_head)
 {
DECLARE_COMPLETION_ONSTACK(wait);
-   int err = 0;
unsigned long hang_check;
 
rq->end_io_data = &wait;
@@ -108,10 +107,5 @@ int blk_execute_rq(struct request_queue *q, struct gendisk 
*bd_disk,
while (!wait_for_completion_io_timeout(&wait, hang_check * 
(HZ/2)));
else
wait_for_completion_io(&wait);
-
-   if (rq->errors)
-   err = -EIO;
-
-   return err;
 }
 EXPORT_SYMBOL(blk_execute_rq);
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 82a43bb19967..b1352143f12f 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -547,7 +547,8 @@ static int __blk_send_generic(struct request_queue *q, 
struct gendisk *bd_disk,
scsi_req(rq)->cmd[0] = cmd;
scsi_req(rq)->cmd[4] = data;
scsi_req(rq)->cmd_len = 6;
-   err = blk_execute_rq(q, bd_disk, rq, 0);
+   blk_execute_rq(q, bd_disk, rq, 0);
+   err = rq->errors ? -EIO : 0;
blk_put_request(rq);
 
return err;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index b05e151c9b38..82c6d02193ae 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -747,7 +747,8 @@ static int pd_special_command(struct pd_unit *disk,
 
rq->special = func;
 
-   err = blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
+   blk_execute_rq(disk->gd->queue, disk->gd, rq, 0);
+   err = rq->errors ? -EIO : 0;
 
blk_put_request(rq);
return err;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2d8290169271..eaf99022bdc6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -310,7 +310,8 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
if (err)
goto out;
 
-   err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+   blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+   err = req->errors ? -EIO : 0;
 out:
blk_put_request(req);
return err;
diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 87739649eac2..308501730ab3 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2218,7 +2218,8 @@ static int cdrom_read_cdda_bpc(struct cdrom_device_info 
*cdi, __u8 __user *ubuf,
rq->timeout = 60 * HZ;
bio = rq->bio;
 
-   if (blk_execute_rq(q, cdi->disk, rq, 0)) {
+   blk_execute_rq(q, cdi->disk, rq, 0);
+   if (rq->errors) {
struct request_sense *s = req->sense;
ret = -EIO;
cdi->last_sense = s->sense_key;
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index feb30061123b..1524797e1776 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -107,7 +107,8 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk 
*disk,
memcpy(scsi_req(rq)->cmd, pc->c, 12);
if (drive->media == ide_tape)
scsi_req(rq)->cmd[13] = REQ_IDETAPE_PC1;
-   error = blk_execute_rq(drive->queue, disk, rq, 0);
+   blk_execute_rq(drive->queue, disk, rq, 0);
+   error = rq->errors ? -EIO : 0;
 put_req:
blk_put_request(rq);
return error;
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74f1b7dc03f7..95c40afa9120 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c

blk-mq: simplify __blk_mq_complete_request

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Merge blk_mq_ipi_complete_request and blk_mq_stat_add into their only
caller.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c | 25 -
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0c19df66f5a0..1f98db082da3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -406,12 +406,19 @@ static void __blk_mq_complete_request_remote(void *data)
rq->q->softirq_done_fn(rq);
 }
 
-static void blk_mq_ipi_complete_request(struct request *rq)
+static void __blk_mq_complete_request(struct request *rq)
 {
struct blk_mq_ctx *ctx = rq->mq_ctx;
bool shared = false;
int cpu;
 
+   if (rq->internal_tag != -1)
+   blk_mq_sched_completed_request(rq);
+   if (rq->rq_flags & RQF_STATS) {
+   blk_mq_poll_stats_start(rq->q);
+   blk_stat_add(rq);
+   }
+
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
return;
@@ -432,22 +439,6 @@ static void blk_mq_ipi_complete_request(struct request *rq)
put_cpu();
 }
 
-static void blk_mq_stat_add(struct request *rq)
-{
-   if (rq->rq_flags & RQF_STATS) {
-   blk_mq_poll_stats_start(rq->q);
-   blk_stat_add(rq);
-   }
-}
-
-static void __blk_mq_complete_request(struct request *rq)
-{
-   if (rq->internal_tag != -1)
-   blk_mq_sched_completed_request(rq);
-   blk_mq_stat_add(rq);
-   blk_mq_ipi_complete_request(rq);
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:the request being processed
-- 
2.11.0



loop: zero-fill bio on the submitting cpu

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

In truth I've just audited which blk-mq drivers don't currently have a
complete callback, but I think this change is at least borderline useful.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
---
 drivers/block/loop.c | 30 ++
 drivers/block/loop.h |  1 +
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3081d83d2ea3..86351b3f7350 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -445,32 +445,27 @@ static int lo_req_flush(struct loop_device *lo, struct 
request *rq)
return ret;
 }
 
-static inline void handle_partial_read(struct loop_cmd *cmd, long bytes)
+static void lo_complete_rq(struct request *rq)
 {
-   if (bytes < 0 || op_is_write(req_op(cmd->rq)))
-   return;
+   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-   if (unlikely(bytes < blk_rq_bytes(cmd->rq))) {
+   if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
+cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
struct bio *bio = cmd->rq->bio;
 
-   bio_advance(bio, bytes);
+   bio_advance(bio, cmd->ret);
zero_fill_bio(bio);
}
+
+   blk_mq_end_request(rq, cmd->ret < 0 ? -EIO : 0);
 }
 
 static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
-   struct request *rq = cmd->rq;
-
-   handle_partial_read(cmd, ret);
 
-   if (ret > 0)
-   ret = 0;
-   else if (ret < 0)
-   ret = -EIO;
-
-   blk_mq_complete_request(rq, ret);
+   cmd->ret = ret;
+   blk_mq_complete_request(cmd->rq, 0);
 }
 
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
@@ -1688,8 +1683,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
ret = do_req_filebacked(lo, cmd->rq);
  failed:
/* complete non-aio request */
-   if (!cmd->use_aio || ret)
-   blk_mq_complete_request(cmd->rq, ret ? -EIO : 0);
+   if (!cmd->use_aio || ret) {
+   cmd->ret = ret ? -EIO : 0;
+   blk_mq_complete_request(cmd->rq, 0);
+   }
 }
 
 static void loop_queue_work(struct kthread_work *work)
@@ -1715,6 +1712,7 @@ static int loop_init_request(void *data, struct request 
*rq,
 static const struct blk_mq_ops loop_mq_ops = {
.queue_rq   = loop_queue_rq,
.init_request   = loop_init_request,
+   .complete   = lo_complete_rq,
 };
 
 static int loop_add(struct loop_device **l, int i)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c73e61..fecd3f97ef8c 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -70,6 +70,7 @@ struct loop_cmd {
struct request *rq;
struct list_head list;
bool use_aio;   /* use AIO interface to handle I/O */
+   long ret;
struct kiocb iocb;
 };
 
-- 
2.11.0



swim3: remove (commented out) printing of req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
---
 drivers/block/swim3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 61b3ffa4f458..ba4809c9bdba 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -343,8 +343,8 @@ static void start_request(struct floppy_state *fs)
  req->rq_disk->disk_name, req->cmd,
  (long)blk_rq_pos(req), blk_rq_sectors(req),
  bio_data(req->bio));
-   swim3_dbg("   errors=%d current_nr_sectors=%u\n",
- req->errors, blk_rq_cur_sectors(req));
+   swim3_dbg("   current_nr_sectors=%u\n",
+ blk_rq_cur_sectors(req));
 #endif
 
if (blk_rq_pos(req) >= fs->total_secs) {
-- 
2.11.0



nvme: make nvme_error_status private

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Currently it's used by the lightnvm passthrough ioctl, but we'd like to make
it private in preparation for block layer specific error codes.  Lightnvm
already returns the real NVMe status anyway, so I think we can just limit it
to returning -EIO for any status set.

This will need a careful audit from the lightnvm folks, though.
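
As background for the 0x7ff mask, a sketch based on the NVMe status
field layout: nvme_req(req)->status holds the CQE status field shifted
right by one, so bits 7:0 are the Status Code, bits 10:8 the Status
Code Type, and bits 13 and 14 the More and DNR flags.  The mask (named
here only for illustration) keeps just SCT+SC:

#define NVME_SCT_SC_MASK	0x7ff	/* SCT (bits 10:8) + SC (bits 7:0) */

	if (nvme_req(rq)->status & NVME_SCT_SC_MASK)
		ret = -EIO;	/* any failure collapses to -EIO here */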

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 3 +--
 drivers/nvme/host/lightnvm.c | 6 +++---
 drivers/nvme/host/nvme.h | 1 -
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c6f256d74b6b..805f250315ec 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -66,7 +66,7 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
-int nvme_error_status(struct request *req)
+static int nvme_error_status(struct request *req)
 {
switch (nvme_req(req)->status & 0x7ff) {
case NVME_SC_SUCCESS:
@@ -77,7 +77,6 @@ int nvme_error_status(struct request *req)
return -EIO;
}
 }
-EXPORT_SYMBOL_GPL(nvme_error_status);
 
 static inline bool nvme_req_needs_retry(struct request *req)
 {
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f85e6e57d641..7f407d5f0095 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -599,7 +599,7 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
__le64 *metadata = NULL;
dma_addr_t metadata_dma;
DECLARE_COMPLETION_ONSTACK(wait);
-   int ret;
+   int ret = 0;
 
rq = nvme_alloc_request(q, (struct nvme_command *)vcmd, 0,
NVME_QID_ANY);
@@ -671,8 +671,8 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
 
if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
ret = -EINTR;
-   else
-   ret = nvme_error_status(rq);
+   else if (nvme_req(rq)->status & 0x7ff)
+   ret = -EIO;
if (result)
*result = nvme_req(rq)->status & 0x7ff;
if (status)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d7330f75632d..550037f5efea 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -254,7 +254,6 @@ static inline void nvme_end_request(struct request *req, 
__le16 status,
blk_mq_complete_request(req, 0);
 }
 
-int nvme_error_status(struct request *req);
 void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
-- 
2.11.0



blktrace: remove the unused block_rq_abort tracepoint

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
---
 include/trace/events/block.h | 44 ++--
 kernel/trace/blktrace.c  |  9 -
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index a88ed13446ff..99ed69fad041 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -61,7 +61,16 @@ DEFINE_EVENT(block_buffer, block_dirty_buffer,
TP_ARGS(bh)
 );
 
-DECLARE_EVENT_CLASS(block_rq_with_error,
+/**
+ * block_rq_requeue - place block IO request back on a queue
+ * @q: queue holding operation
+ * @rq: block IO operation request
+ *
+ * The block operation request @rq is being placed back into queue
+ * @q.  For some reason the request was not completed and needs to be
+ * put back in the queue.
+ */
+TRACE_EVENT(block_rq_requeue,
 
TP_PROTO(struct request_queue *q, struct request *rq),
 
@@ -94,39 +103,6 @@ DECLARE_EVENT_CLASS(block_rq_with_error,
 );
 
 /**
- * block_rq_abort - abort block operation request
- * @q: queue containing the block operation request
- * @rq: block IO operation request
- *
- * Called immediately after pending block IO operation request @rq in
- * queue @q is aborted. The fields in the operation request @rq
- * can be examined to determine which device and sectors the pending
- * operation would access.
- */
-DEFINE_EVENT(block_rq_with_error, block_rq_abort,
-
-   TP_PROTO(struct request_queue *q, struct request *rq),
-
-   TP_ARGS(q, rq)
-);
-
-/**
- * block_rq_requeue - place block IO request back on a queue
- * @q: queue holding operation
- * @rq: block IO operation request
- *
- * The block operation request @rq is being placed back into queue
- * @q.  For some reason the request was not completed and needs to be
- * put back in the queue.
- */
-DEFINE_EVENT(block_rq_with_error, block_rq_requeue,
-
-   TP_PROTO(struct request_queue *q, struct request *rq),
-
-   TP_ARGS(q, rq)
-);
-
-/**
  * block_rq_complete - block IO operation completed by device driver
  * @q: queue containing the block operation request
  * @rq: block operations request
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index b2058a7f94bd..9f3624dadb09 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -716,12 +716,6 @@ static void blk_add_trace_rq(struct request_queue *q, 
struct request *rq,
rq->cmd_flags, what, rq->errors, 0, NULL);
 }
 
-static void blk_add_trace_rq_abort(void *ignore,
-  struct request_queue *q, struct request *rq)
-{
-   blk_add_trace_rq(q, rq, blk_rq_bytes(rq), BLK_TA_ABORT);
-}
-
 static void blk_add_trace_rq_insert(void *ignore,
struct request_queue *q, struct request *rq)
 {
@@ -974,8 +968,6 @@ static void blk_register_tracepoints(void)
 {
int ret;
 
-   ret = register_trace_block_rq_abort(blk_add_trace_rq_abort, NULL);
-   WARN_ON(ret);
ret = register_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);
WARN_ON(ret);
ret = register_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
@@ -1028,7 +1020,6 @@ static void blk_unregister_tracepoints(void)
unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL);
unregister_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
unregister_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);
-   unregister_trace_block_rq_abort(blk_add_trace_rq_abort, NULL);
 
tracepoint_synchronize_unregister();
 }
-- 
2.11.0



xen-blkfront: don't use req->errors

2017-04-18 Thread Christoph Hellwig
From: Christoph Hellwig 

xen-blkfront is the last user of rq->errors for passing errors back to
blk-mq, and I'd like to get rid of that.  In the longer run the driver
should be moving more of the completion processing into .complete, but
this is the minimal change to move forward for now.

Signed-off-by: Christoph Hellwig 
Acked-by: Roger Pau Monné 
Reviewed-by: Konrad Rzeszutek Wilk 
---
 drivers/block/xen-blkfront.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d137ef8a72be..08172f679a64 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -115,6 +115,15 @@ struct split_bio {
atomic_t pending;
 };
 
+struct blkif_req {
+   int error;
+};
+
+static inline struct blkif_req *blkif_req(struct request *rq)
+{
+   return blk_mq_rq_to_pdu(rq);
+}
+
 static DEFINE_MUTEX(blkfront_mutex);
 static const struct block_device_operations xlvbd_block_fops;
 
@@ -907,8 +916,14 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
+static void blkif_complete_rq(struct request *rq)
+{
+   blk_mq_end_request(rq, blkif_req(rq)->error);
+}
+
 static const struct blk_mq_ops blkfront_mq_ops = {
.queue_rq = blkif_queue_rq,
+   .complete = blkif_complete_rq,
 };
 
 static void blkif_set_queue_limits(struct blkfront_info *info)
@@ -969,7 +984,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
-   info->tag_set.cmd_size = 0;
+   info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
if (blk_mq_alloc_tag_set(&info->tag_set))
@@ -1543,7 +1558,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
unsigned long flags;
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
-   int error;
 
if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return IRQ_HANDLED;
@@ -1587,37 +1601,36 @@ static irqreturn_t blkif_interrupt(int irq, void 
*dev_id)
continue;
}
 
-   error = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
+   blkif_req(req)->error = (bret->status == BLKIF_RSP_OKAY) ? 0 : 
-EIO;
switch (bret->operation) {
case BLKIF_OP_DISCARD:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
struct request_queue *rq = info->rq;
printk(KERN_WARNING "blkfront: %s: %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
info->feature_discard = 0;
info->feature_secdiscard = 0;
queue_flag_clear(QUEUE_FLAG_DISCARD, rq);
queue_flag_clear(QUEUE_FLAG_SECERASE, rq);
}
-   blk_mq_complete_request(req, error);
break;
case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_WRITE_BARRIER:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
printk(KERN_WARNING "blkfront: %s: %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
}
if (unlikely(bret->status == BLKIF_RSP_ERROR &&
 rinfo->shadow[id].req.u.rw.nr_segments == 
0)) {
printk(KERN_WARNING "blkfront: %s: empty %s op 
failed\n",
   info->gd->disk_name, 
op_name(bret->operation));
-   error = -EOPNOTSUPP;
+   blkif_req(req)->error = -EOPNOTSUPP;
}
-   if (unlikely(error)) {
-   if (error == -EOPNOTSUPP)
-   error = 0;
+   if (unlikely(blkif_req(req)->error)) {
+   if (blkif_req(req)->error == -EOPNOTSUPP)
+   blkif_req(req)->error = 0;
info->feature_fua = 0;
info->feature_flush = 0;
xlvbd_flush(info);
@@

Re: [PATCH 16/22] xen-blkfront: Make use of the new sg_map helper function

2017-04-18 Thread Konrad Rzeszutek Wilk
On Tue, Apr 18, 2017 at 09:42:20AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 18/04/17 08:27 AM, Konrad Rzeszutek Wilk wrote:
> > Interesting that you didn't CC any of the maintainers. Could you 
> > do that in the future please?
> 
> Please read the cover letter. The distribution list for the patchset
> would have been way too large to cc every maintainer (even as limited as
> it was, I had mailing lists yelling at me). My plan was to get buy in

I am not sure if you know, but you can add on each patch the respective
maintainer via 'CC'. That way you can have certain maintainers CCed only
on the subsystems they cover. You put it after (or before) your SoB and
git send-email happilly picks it up.

It does mean that for every patch you have to run something like this:

$ more add_cc 
#!/bin/bash

git diff HEAD^.. > /tmp/a
echo "---"
scripts/get_maintainer.pl --no-l /tmp/a | while read maintainer
do
	echo "Cc: $maintainer"
done

Or such.


> for the first patch, get it merged and resend the rest independently to
> their respective maintainers. Of course, though, I'd be open to other
> suggestions.
> 
> >>>
> >>> Signed-off-by: Logan Gunthorpe 
> >>> ---
> >>>  drivers/block/xen-blkfront.c | 33 +++--
> >>>  1 file changed, 27 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >>> index 5067a0a..7dcf41d 100644
> >>> --- a/drivers/block/xen-blkfront.c
> >>> +++ b/drivers/block/xen-blkfront.c
> >>> @@ -807,8 +807,19 @@ static int blkif_queue_rw_req(struct request *req, 
> >>> struct blkfront_ring_info *ri
> >>>   BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >>>
> >>>   if (setup.need_copy) {
> >>> - setup.bvec_off = sg->offset;
> >>> - setup.bvec_data = kmap_atomic(sg_page(sg));
> >>> + setup.bvec_off = 0;
> >>> + setup.bvec_data = sg_map(sg, SG_KMAP_ATOMIC);
> >>> + if (IS_ERR(setup.bvec_data)) {
> >>> + /*
> >>> +  * This should really never happen unless
> >>> +  * the code is changed to use memory that is
> >>> +  * not mappable in the sg. Seeing there is a
> >>> +  * questionable error path out of here,
> >>> +  * we WARN.
> >>> +  */
> >>> + WARN(1, "Non-mappable memory used in sg!");
> >>> + return 1;
> >>> + }
> >> ...
> >>
> >> Perhaps add a flag to mark failure as 'unexpected' and trace (and panic?)
> >> inside sg_map().
> 
> Thanks, that's a good suggestion. I'll make the change for v2.
> 
> Logan


Re: [PATCH 05/22] drm/i915: Make use of the new sg_map helper function

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 12:44 AM, Daniel Vetter wrote:
> On Thu, Apr 13, 2017 at 04:05:18PM -0600, Logan Gunthorpe wrote:
>> This is a single straightforward conversion from kmap to sg_map.
>>
>> Signed-off-by: Logan Gunthorpe 
> 
> Acked-by: Daniel Vetter 
> 
> Probably makes sense to merge through some other tree, but please be aware
> of the considerable churn rate in i915 (i.e. make sure your tree is in
> linux-next before you send a pull request for this). Plane B would be to
> get the prep patch in first and then merge the i915 conversion one kernel
> release later.

Yes, as per what I said in my cover letter, I was leaning towards a
"Plan B" style approach.

Logan


Re: [PATCH 16/22] xen-blkfront: Make use of the new sg_map helper function

2017-04-18 Thread Logan Gunthorpe


On 18/04/17 08:27 AM, Konrad Rzeszutek Wilk wrote:
> Interesting that you didn't CC any of the maintainers. Could you 
> do that in the future please?

Please read the cover letter. The distribution list for the patchset
would have been way too large to cc every maintainer (even as limited as
it was, I had mailing lists yelling at me). My plan was to get buy in
for the first patch, get it merged and resend the rest independently to
their respective maintainers. Of course, though, I'd be open to other
suggestions.

>>>
>>> Signed-off-by: Logan Gunthorpe 
>>> ---
>>>  drivers/block/xen-blkfront.c | 33 +++--
>>>  1 file changed, 27 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>> index 5067a0a..7dcf41d 100644
>>> --- a/drivers/block/xen-blkfront.c
>>> +++ b/drivers/block/xen-blkfront.c
>>> @@ -807,8 +807,19 @@ static int blkif_queue_rw_req(struct request *req, 
>>> struct blkfront_ring_info *ri
>>> BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>>>
>>> if (setup.need_copy) {
>>> -   setup.bvec_off = sg->offset;
>>> -   setup.bvec_data = kmap_atomic(sg_page(sg));
>>> +   setup.bvec_off = 0;
>>> +   setup.bvec_data = sg_map(sg, SG_KMAP_ATOMIC);
>>> +   if (IS_ERR(setup.bvec_data)) {
>>> +   /*
>>> +* This should really never happen unless
>>> +* the code is changed to use memory that is
>>> +* not mappable in the sg. Seeing there is a
>>> +* questionable error path out of here,
>>> +* we WARN.
>>> +*/
>>> +   WARN(1, "Non-mappable memory used in sg!");
>>> +   return 1;
>>> +   }
>> ...
>>
>> Perhaps add a flag to mark failure as 'unexpected' and trace (and panic?)
>> inside sg_map().

Thanks, that's a good suggestion. I'll make the change for v2.

Logan


Re: [PATCH v3 0/4] Avoid that __scsi_remove_device() hangs

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 14:58 +0300, Israel Rukshin wrote:
> I tested those patches and I got a NULL dereference at sd_sync_cache_done().
> The test is unloading ib_srp while one port is down.
> The previous version worked fine.
> 
> From the log:
> [  190.272412] BUG: unable to handle kernel NULL pointer dereference at 
> 02f0
> [  190.281102] IP: sd_sync_cache_done+0x1b/0x80 [sd_mod]

Hello Israel,

Thanks for testing. I assume that this data refers to the sd_printk() statement?
That statement was executed properly in my tests. Anyway, I will leave out that
statement, retest and repost this patch series. There is a del_gendisk() call in
sd just before the sd_shutdown() call so that means it's not safe to access the
disk pointer from anywhere in sd_shutdown().

Bart.

Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread Bart Van Assche
On Tue, 2017-04-18 at 16:44 +0200, Benjamin Block wrote:
> I still wonder why we try 'so hard' scheduling a command for a dead
> device, but as that seems to be the status quo, and only lacks in the
> case where the LLD is already half-way gone, its ok for me too. I mean,
> the order is a bit screwed.. we apparently first remove the driver and
> post-factum try to drain the queue.. that is strange.

Hello Benjamin,

That's indeed strange. But I'm not sure whether it is possible to address
this without changing all SCSI LLDs.

Bart.

Re: [PATCH] scsi: lpfc: fix potential buffer overflow.

2017-04-18 Thread Ewan D. Milne
On Tue, 2017-04-18 at 11:55 +0200, Maurizio Lombardi wrote:
> This patch fixes a potential buffer overflow in lpfc_nvme_info_show().
> 
> Signed-off-by: Maurizio Lombardi 
> ---
>  drivers/scsi/lpfc/lpfc_attr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
> index 22819af..1ce252f 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -181,7 +181,7 @@
>   wwn_to_u64(vport->fc_nodename.u.wwn),
>   phba->targetport->port_id);
>  
> - len += snprintf(buf + len, PAGE_SIZE,
> + len += snprintf(buf + len, PAGE_SIZE - len,
>   "\nNVME Target: Statistics\n");
>   tgtp = (struct lpfc_nvmet_tgtport *)phba->targetport->private;
>   len += snprintf(buf+len, PAGE_SIZE-len,
> @@ -326,7 +326,7 @@
>   }
>   spin_unlock_irq(shost->host_lock);
>  
> - len += snprintf(buf + len, PAGE_SIZE, "\nNVME Statistics\n");
> + len += snprintf(buf + len, PAGE_SIZE - len, "\nNVME Statistics\n");
>   len += snprintf(buf+len, PAGE_SIZE-len,
>   "LS: Xmt %016llx Cmpl %016llx\n",
>   phba->fc4NvmeLsRequests,

Reviewed-by: Ewan D. Milne 




Re: [PATCH 17/25] blk-mq: remove the error argument to blk_mq_complete_request

2017-04-18 Thread Konrad Rzeszutek Wilk
On Tue, Apr 18, 2017 at 04:03:21PM +0100, Roger Pau Monné wrote:
> On Thu, Apr 06, 2017 at 05:39:36PM +0200, Christoph Hellwig wrote:
> > Now that we always have a ->complete callback we can remove the direct
> > call to blk_mq_end_request, as well as the error argument to
> > blk_mq_complete_request.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> For blkfront:
> 
> Acked-by: Roger Pau Monné 
> 

And also
Reviewed-by: Konrad Rzeszutek Wilk 


Re: [PATCH 16/25] xen-blkfront: don't use req->errors

2017-04-18 Thread Konrad Rzeszutek Wilk
On Tue, Apr 18, 2017 at 04:00:51PM +0100, Roger Pau Monné wrote:
> On Thu, Apr 06, 2017 at 05:39:35PM +0200, Christoph Hellwig wrote:
> > xen-blkfront is the last user of rq->errors for passing errors back to
> > blk-mq, and I'd like to get rid of that.  In the longer run the driver
> > should be moving more of the completion processing into .complete, but
> > this is the minimal change to move forward for now.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Acked-by: Roger Pau Monné 
> 

Reviewed-by: Konrad Rzeszutek Wilk 

Thanks!


Re: [PATCH 17/25] blk-mq: remove the error argument to blk_mq_complete_request

2017-04-18 Thread Roger Pau Monné
On Thu, Apr 06, 2017 at 05:39:36PM +0200, Christoph Hellwig wrote:
> Now that we always have a ->complete callback we can remove the direct
> call to blk_mq_end_request, as well as the error argument to
> blk_mq_complete_request.
> 
> Signed-off-by: Christoph Hellwig 

For blkfront:

Acked-by: Roger Pau Monné 



Re: [PATCH 16/25] xen-blkfront: don't use req->errors

2017-04-18 Thread Roger Pau Monné
On Thu, Apr 06, 2017 at 05:39:35PM +0200, Christoph Hellwig wrote:
> > xen-blkfront is the last user of rq->errors for passing errors back to
> blk-mq, and I'd like to get rid of that.  In the longer run the driver
> should be moving more of the completion processing into .complete, but
> this is the minimal change to move forward for now.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Roger Pau Monné 



Re: [PATCH v3 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-18 Thread Benjamin Block
Hej Bart,

sorry for the late-ish reply, had a long weekend.

On Thu, Apr 13, 2017 at 12:28:54AM +, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 16:41 +0200, Benjamin Block wrote:
> > On Mon, Apr 10, 2017 at 10:54:01AM -0700, Bart Van Assche wrote:
> > > [ ... ]
> > OK, so I take it the problem is when the queue is stopped, then the
> > completion in blk_execute_rq() will never be triggered and then we wait
> > for a timeout there, or potentially forever?
>
> Hello Benjamin,
>
> Thanks for the review.
>
> If a request is queued after a queue has been stopped then that request
> will never be started and hence even the timeout timer won't be started.
> blk_cleanup_queue() hangs if invoked for a stopped queue and one or more
> requests have not yet been started.
>
> > But then what is the point in trying to do it async here anyway? Won't
> > that just be doomed in the same way, just that we don't see the effect?
>
> Have you noticed that patch 4/4 in this series restarts the queue just
> before calling blk_cleanup_queue()?
>
> Anyway, can you have a look at the patch below and see whether this new
> version addresses all the concerns you had reported in your previous
> e-mail?
>

Yes, the code- and comment-changes in sd_shutdown() are good. Apparently
there is something new with the done-function now, but you got that from
Israel.

I still wonder why we try 'so hard' scheduling a command for a dead
device, but as that seems to be the status quo, and only lacks in the
case where the LLD is already half-way gone, it's ok for me too. I mean,
the order is a bit screwed.. we apparently first remove the driver and
post-factum try to drain the queue.. that is strange.


- Benjamin

On Mon, Apr 17, 2017 at 10:34:35AM -0700, Bart Van Assche wrote:
> This patch avoids that sd_shutdown() hangs on the SYNCHRONIZE CACHE
> command if the block layer queue has been stopped by
> scsi_target_block().
> 
> Signed-off-by: Bart Van Assche 
> Cc: Israel Rukshin 
> Cc: Max Gurtovoy 
> Cc: Hannes Reinecke 
> Cc: Benjamin Block 
> ---
>  drivers/scsi/sd.c | 45 -
>  1 file changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index fe0f7997074e..deff564fe649 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1489,6 +1489,33 @@ static unsigned int sd_check_events(struct gendisk 
> *disk, unsigned int clearing)
>   return retval;
>  }
> 
> +static void sd_sync_cache_done(struct request *rq, int e)
> +{
> + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
> +
> + sd_printk(KERN_DEBUG, sdkp, "%s\n", __func__);
> +
> + blk_put_request(rq);
> +}
> +
> +/*
> + * Issue a SYNCHRONIZE CACHE command asynchronously. Since 
> blk_cleanup_queue()
> + * waits for all commands to finish, __scsi_remove_device() will wait for the
> + * SYNCHRONIZE CACHE command to finish.
> + */
> +static int sd_sync_cache_async(struct scsi_disk *sdkp)
> +{
> + const struct scsi_device *sdp = sdkp->device;
> + const int timeout = sdp->request_queue->rq_timeout *
> + SD_FLUSH_TIMEOUT_MULTIPLIER;
> + const unsigned char cmd[10] = { SYNCHRONIZE_CACHE };
> +
> + sd_printk(KERN_DEBUG, sdkp, "%s\n", __func__);
> + return scsi_execute_async(sdp, sdkp->disk, cmd, DMA_NONE, NULL, 0,
> +   timeout, SD_MAX_RETRIES, 0, 0,
> +   sd_sync_cache_done);
> +}
> +
>  static int sd_sync_cache(struct scsi_disk *sdkp)
>  {
>   int retries, res;
> @@ -3349,13 +3376,15 @@ static int sd_start_stop_device(struct scsi_disk 
> *sdkp, int start)
>  }
> 
>  /*
> - * Send a SYNCHRONIZE CACHE instruction down to the device through
> - * the normal SCSI command structure.  Wait for the command to
> - * complete.
> + * Send a SYNCHRONIZE CACHE instruction down to the device through the normal
> + * SCSI command structure. When stopping the disk, wait for the command to
> + * complete. When not stopping the disk, the blk_cleanup_queue() call in
> + * __scsi_remove_device() will wait for this command to complete.
>   */
>  static void sd_shutdown(struct device *dev)
>  {
>   struct scsi_disk *sdkp = dev_get_drvdata(dev);
> + bool stop_disk;
> 
>   if (!sdkp)
>   return; /* this can happen */
> @@ -3363,12 +3392,18 @@ static void sd_shutdown(struct device *dev)
>   if (pm_runtime_suspended(dev))
>   return;
> 
> + stop_disk = system_state != SYSTEM_RESTART &&
> + sdkp->device->manage_start_stop;
> +
>   if (sdkp->WCE && sdkp->media_present) {
>   sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
> - sd_sync_cache(sdkp);
> + if (stop_disk)
> + sd_sync_cache(sdkp);
> + else
> + sd_sync_cache_async(sdkp);
>   }
> 
> - if (system_state != SYSTEM_RESTART && sdkp->device->manage_start_stop) {
> + if (stop_disk) 

Re: kill req->errors

2017-04-18 Thread Konrad Rzeszutek Wilk
On Fri, Apr 07, 2017 at 09:11:24AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 06, 2017 at 04:00:24PM -0400, Konrad Rzeszutek Wilk wrote:
> > You wouldn't have a git tree to easily test it? Thanks.
> 
> git://git.infradead.org/users/hch/block.git request-errors
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/request-errors

I tested it with Xen and it all worked out. Albeit in hindsight I didn't check
the ones that matter - that is the error paths (duh!).

So let me do that now.

