Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-12 Thread Sinan Kaya
On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)

It feels like this code tried to be a generic p2pdma provider first, then
got converted to PCI, yet all the dev parameters are still struct device.

Maybe the dev parameter should also be struct pci_dev so that you can get
rid of all the to_pci_dev() calls in this code, including the
find_parent_pci_dev() function.
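
Something like this, as a rough sketch of the suggested change (the
second prototype is my illustration, not code from the patch):

	/* current: callers pass a struct device and the implementation
	 * converts to a pci_dev internally */
	int pci_p2pdma_add_client(struct list_head *head, struct device *dev);

	/* suggested: take the PCI device directly, so to_pci_dev() and
	 * find_parent_pci_dev() become unnecessary */
	int pci_p2pdma_add_client(struct list_head *head, struct pci_dev *pdev);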

Regarding the switch business, it is amazing how much trouble you went to
in order to limit this functionality to very specific hardware.

I thought we had reached an agreement that the code would not impose
any limits on what the user wants.

What happened to all the emails we exchanged?

I understand you are coming from what you tested. Considering that you
are signing up to provide generic PCI functionality in the kernel,
why don't you just blacklist the products that you had problems with and
still allow other architectures to use your code with their root ports?

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-12 Thread Dou Liyang

Hi Thomas,

At 03/09/2018 11:08 PM, Thomas Gleixner wrote:
[...]


> I'm not sure if there is a clear indicator whether physical hotplug is
> supported or not, but the ACPI folks (x86) and architecture maintainers

+cc Rafael


> should be able to answer that question. I have a machine which says:
>
>    smpboot: Allowing 128 CPUs, 96 hotplug CPUs
>
> There is definitely no way to hotplug anything on that machine and sure the


AFAIK, in ACPI-based dynamic reconfiguration there is no clear
indicator. In theory, if the ACPI tables contain hotpluggable
CPU resources, the OS can support physical hotplug.

For your machine, do the CPUs support multi-threading without it
being enabled?

Also, sometimes we should not trust the number of possible CPUs. I have
seen a case where the BIOS told ACPI that the machine could support
physical CPU hotplug, but there were actually no hardware slots in the
machine. The ACPI tables are like user input: they should be validated
before we use them.


> existing spread algorithm will waste vectors to no end.
>
> Sure then there is virt, which can pretend to have a gazillion of possible
> hotpluggable CPUs, but virt is an insanity on its own. Though someone might
> come up with reasonable heuristics for that as well.
>
> Thoughts?


Do we have to map the vectors to CPUs statically? Can we map them when
we hotplug/enable a possible CPU?

Thanks,

dou




Re: [PATCH] device_handler: remove VLAs

2018-03-12 Thread Martin K. Petersen

Stephen,

> In preparation to enabling -Wvla, remove VLAs and replace them with
> fixed-length arrays instead.
>
> scsi_dh_{alua,emc,rdac} use variable-length array declarations to
> store command blocks, with the appropriate size as determined by
> COMMAND_SIZE. This patch replaces these with fixed-sized arrays using
> MAX_COMMAND_SIZE, so that the array size can be determined at compile
> time.
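
The pattern being removed looks roughly like this (a sketch; the opcode
and array name are illustrative):

	/* before: COMMAND_SIZE() indexes a lookup table at runtime, so this
	 * is a variable-length array even when the opcode is a constant */
	unsigned char cdb[COMMAND_SIZE(MODE_SELECT)];

	/* after: worst-case size, known at compile time */
	unsigned char cdb[MAX_COMMAND_SIZE];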

Applied to 4.17/scsi-queue. Thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-12 Thread Sinan Kaya
On 3/12/2018 9:55 PM, Sinan Kaya wrote:
> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>> -if (nvmeq->sq_cmds_io)
> 
> I think you should keep the code as it is for the case where
> (!nvmeq->sq_cmds_is_io && nvmeq->sq_cmds_io)

Never mind. I misunderstood the code.


> 
> You are changing the behavior for NVMe drives with CMB buffers.
> You can change the if statement here with the statement above.
> 
>> -memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>> -else
>> -memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>> +memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>  
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-12 Thread Sinan Kaya
On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> - if (nvmeq->sq_cmds_io)

I think you should keep the code as it is for the case where
(!nvmeq->sq_cmds_is_io && nvmeq->sq_cmds_io)

You are changing the behavior for NVMe drives with CMB buffers.
You can change the if statement here with the statement above.

> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> - else
> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>  


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-03-12 Thread Logan Gunthorpe

On 3/12/2018 1:41 PM, Jonathan Corbet wrote:

This all seems good, but...could we consider moving this documentation to
driver-api/PCI as it's converted to RST?  That would keep it together with
similar materials and bring a bit more coherence to Documentation/ as a
whole.


Yup, I'll change this for the next revision.

Thanks,

Logan


Re: [PATCH] device_handler: remove VLAs

2018-03-12 Thread Stephen Kitt
On Mon, 12 Mar 2018 15:41:26 +, Bart Van Assche 
wrote:
> On Sat, 2018-03-10 at 14:14 +0100, Stephen Kitt wrote:
> > The two patches I sent were supposed to be alternative solutions; see
> > https://marc.info/?l=linux-scsi=152063671005295=2 for the
> > introduction (I seem to have messed up the headers, so the mails didn’t
> > end up threaded properly).  
> 
> The two patches arrived in my inbox several minutes before the cover
> letter. In the e-mail header of the cover letter I found the following:
> 
> X-Greylist: delayed 1810 seconds by postgrey-1.27 at vger.kernel.org; Fri,
> 09 Mar 2018 18:05:08 EST
> 
> Does this mean that the delay happened due to vger server's anti-spam
> algorithm?

That’s right, the greylisting part of its anti-spam measures.

Regards,

Stephen




Re: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-03-12 Thread Jonathan Corbet
On Mon, 12 Mar 2018 13:35:19 -0600
Logan Gunthorpe  wrote:

> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.

This all seems good, but...could we consider moving this documentation to
driver-api/PCI as it's converted to RST?  That would keep it together with
similar materials and bring a bit more coherence to Documentation/ as a
whole.

Thanks,

jon


[PATCH v3 04/11] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-12 Thread Logan Gunthorpe
For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch will be in the same IOMMU group.
This implies that individual devices behind any switch will not be
able to be assigned to separate VMs because there is no isolation
between them. Additionally, any malicious PCIe device will be able
to DMA to memory exposed by other EPs in the same domain, as TLPs will
not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with
custom hardware designed for purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have designed the system in such a way that
devices that require isolation will be separate from those using P2P
transactions.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/Kconfig|  9 +
 drivers/pci/p2pdma.c   | 44 
 drivers/pci/pci.c  |  6 ++
 include/linux/pci-p2pdma.h |  5 +
 4 files changed, 64 insertions(+)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index d59f6f5ddfcd..c7a9d155baca 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -138,6 +138,15 @@ config PCI_P2PDMA
  it's hard to tell which support it at all, so at this time you
  will need a PCIe switch.
 
+ Enabling this option will also disable ACS on all ports behind
+ any PCIe switch. This effectively puts all devices behind any
+ switch into the same IOMMU group. This implies that individual
+ devices behind any switch will not be able to be assigned to
+ separate VMs because there is no isolation between them.
+ Additionally, any malicious PCIe devices will be able to DMA
+ to memory exposed by other EPs in the same domain as TLPs will
+ not be checked by the IOMMU.
+
  If unsure, say N.
 
 config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ab810c3a93eb..3e70b0662def 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -264,6 +264,50 @@ static struct pci_dev *get_upstream_bridge_port(struct 
pci_dev *pdev)
 }
 
 /*
+ * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
+ * bridges/switches
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any downstream port in any switch in order for
+ * the TLPs to not be forwarded up to the RC which is not what we want
+ * for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any switch to be in the same IOMMU
+ * group. At this time there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
+ */
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+   struct pci_dev *up;
+   int pos;
+   u16 ctrl;
+
+   up = get_upstream_bridge_port(pdev);
+   if (!up)
+   return 0;
+   pci_dev_put(up);
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+   if (!pos)
+   return 0;
+
+   pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
+
+   pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+   ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
+
+   pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
+
+   return 1;
+}
+
+/*
  * This function checks if two PCI devices are behind the same switch.
  * (ie. they share the same second upstream port as returned by
  *  get_upstream_bridge_port().)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..e5da8f482e94 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2826,6 +2827,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
  */
 void pci_enable_acs(struct pci_dev *dev)
 {
+#ifdef CONFIG_PCI_P2PDMA
+   if (pci_p2pdma_disable_acs(dev))
+   return;
+#endif
+
if (!pci_acs_enable)
return;
 
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 59eb218bdb25..2a2bf2ca018e 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -18,6 +18,7 @@ struct block_device;
 struct scatterlist;
 
 #ifdef CONFIG_PCI_P2PDMA
+int 

[PATCH v3 06/11] block: Introduce PCI P2P flags for request and request queue

2018-03-12 Thread Logan Gunthorpe
QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue
supports targeting P2P memory.

REQ_PCI_P2P is introduced to indicate a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queues have the QUEUE_FLAG_PCI_P2P
flag set.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Sagi Grimberg 
---
 block/blk-core.c  |  3 +++
 include/linux/blk_types.h | 18 +-
 include/linux/blkdev.h|  3 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fadd..a2f113738b85 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2183,6 +2183,9 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
 
+   if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+   goto not_supported;
+
if (should_fail_bio(bio))
goto end_io;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bf18b95ed92d..490122c85b3f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -274,6 +274,10 @@ enum req_flag_bits {
__REQ_BACKGROUND,   /* background IO */
__REQ_NOWAIT,   /* Don't wait if request will block */
 
+#ifdef CONFIG_PCI_P2PDMA
+   __REQ_PCI_P2PDMA,   /* request is to/from P2P memory */
+#endif
+
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
 
@@ -298,6 +302,18 @@ enum req_flag_bits {
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
 
+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA (1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req) ((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA 0
+#define REQ_IS_PCI_P2PDMA(req) 0
+#endif /* CONFIG_PCI_P2PDMA */
+
#define REQ_NOUNMAP    (1ULL << __REQ_NOUNMAP)
 
#define REQ_DRV    (1ULL << __REQ_DRV)
@@ -306,7 +322,7 @@ enum req_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
 #define REQ_NOMERGE_FLAGS \
-   (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+   (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)
 
 #define bio_op(bio) \
((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b69c12..0b4a386c73ea 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED    28  /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY    29  /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA  30  /* device supports pci p2p requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
@@ -793,6 +794,8 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
 #define blk_queue_dax(q)   test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)  \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)\
+   test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.11.0
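
As a usage sketch (illustrative; the driver/submitter split below is an
assumption, and queue_flag_set_unlocked() is the helper used elsewhere
in this series):

	/* driver side: declare that this request queue can target P2P memory */
	queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, q);

	/* submitter side: tag a bio whose data pages are all P2P memory */
	bio->bi_opf |= REQ_PCI_P2PDMA;

	/*
	 * generic_make_request_checks() then rejects the bio unless the
	 * queue has QUEUE_FLAG_PCI_P2PDMA set, and REQ_NOMERGE_FLAGS keeps
	 * such requests from being merged with others.
	 */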



[PATCH v3 07/11] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()

2018-03-12 Thread Logan Gunthorpe
In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
called to map the correct PCI bus address.

To do this, check the first page in the scatter list to see if it is P2P
memory or not. At the moment, scatter lists that contain P2P memory must
be homogeneous so if the first page is P2P the entire SGL should be P2P.

Signed-off-by: Logan Gunthorpe 
---
 drivers/infiniband/core/rw.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..f495e8a7f8ac 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;
 
-   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+   if (is_pci_p2pdma_page(sg_page(sg)))
+   ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+   else
+   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -602,7 +607,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp, u8 port_num,
break;
}
 
-   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   if (is_pci_p2pdma_page(sg_page(sg)))
+   pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+   sg_cnt, dir);
+   else
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
-- 
2.11.0



[PATCH v3 11/11] nvmet: Optionally use PCI P2P memory

2018-03-12 Thread Logan Gunthorpe
We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it an no devices are found, then the system will
silently fall back on using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device, falling back to system memory should
anything fail.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/target/configfs.c |  67 ++
 drivers/nvme/target/core.c | 106 -
 drivers/nvme/target/io-cmd.c   |   3 ++
 drivers/nvme/target/nvmet.h|  12 +
 drivers/nvme/target/rdma.c |  32 +++--
 5 files changed, 214 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e6b2d2af81b6..6ca8c712f0d3 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "nvmet.h"
 
@@ -867,12 +869,77 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
 }
 
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page)
+{
+   struct nvmet_port *port = to_nvmet_port(item);
+
+   if (!port->use_p2pmem)
+   return sprintf(page, "none\n");
+
+   if (!port->p2p_dev)
+   return sprintf(page, "auto\n");
+
+   return sprintf(page, "%s\n", pci_name(port->p2p_dev));
+}
+
+static ssize_t nvmet_p2pmem_store(struct config_item *item,
+ const char *page, size_t count)
+{
+   struct nvmet_port *port = to_nvmet_port(item);
+   struct device *dev;
+   struct pci_dev *p2p_dev = NULL;
+   bool use_p2pmem;
+
+   switch (page[0]) {
+   case 'y':
+   case 'Y':
+   case 'a':
+   case 'A':
+   use_p2pmem = true;
+   break;
+   case 'n':
+   case 'N':
+   use_p2pmem = false;
+   break;
+   default:
+   dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+   if (!dev) {
+   pr_err("No such PCI device: %s\n", page);
+   return -ENODEV;
+   }
+
+   use_p2pmem = true;
+   p2p_dev = to_pci_dev(dev);
+
+   if (!pci_has_p2pmem(p2p_dev)) {
+   pr_err("PCI device has no peer-to-peer memory: %s\n",
+  page);
+   pci_dev_put(p2p_dev);
+   return -ENODEV;
+   }
+   }
+
+   down_write(&nvmet_config_sem);
+   port->use_p2pmem = use_p2pmem;
+   pci_dev_put(port->p2p_dev);
+   port->p2p_dev = p2p_dev;
+   up_write(&nvmet_config_sem);
+
+   return count;
+}
+CONFIGFS_ATTR(nvmet_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
    &nvmet_attr_addr_adrfam,
    &nvmet_attr_addr_treq,
    &nvmet_attr_addr_traddr,
    &nvmet_attr_addr_trsvcid,
    &nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+   &nvmet_attr_p2pmem,
+#endif
NULL,
 };
 
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index a78029e4e5f4..ab3cc7135ae8 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nvmet.h"
 
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
percpu_ref_put(&ns->ref);
 }
 
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+  struct nvmet_ns *ns)
+{
+   int ret;
+
+   if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+   pr_err("peer-to-peer DMA is not supported by %s\n",
+  ns->device_path);
+   return -EINVAL;
+   }
+
+   ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+   if (ret)
+   pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+  ns->device_path, ret);
+
+   return ret;
+}
+
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
 
+   list_for_each_entry(ctrl, 

[PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-12 Thread Logan Gunthorpe
Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO SQ.

If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/pci.c | 75 +++--
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b738f03..1fb57fa42dd0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nvme.h"
 
@@ -91,9 +92,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
-   void __iomem *cmb;
-   pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+   bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -148,7 +148,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t q_lock;
struct nvme_command *sq_cmds;
-   struct nvme_command __iomem *sq_cmds_io;
+   bool sq_cmds_is_io;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
dma_addr_t sq_dma_addr;
@@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 {
u16 tail = nvmeq->sq_tail;
 
-   if (nvmeq->sq_cmds_io)
-   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
-   else
-   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 
if (++tail == nvmeq->q_depth)
tail = 0;
@@ -1287,9 +1284,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
-   if (nvmeq->sq_cmds)
-   dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
-   nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+   if (nvmeq->sq_cmds) {
+   if (nvmeq->sq_cmds_is_io)
+   pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+   nvmeq->sq_cmds,
+   SQ_SIZE(nvmeq->q_depth));
+   else
+   dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+   }
 }
 
 static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1369,12 +1375,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int 
nr_io_queues,
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
 {
-   /* CMB SQEs will be mapped before creation */
-   if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
-   return 0;
+   struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+   if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+   nvmeq->sq_cmds);
+   nvmeq->sq_cmds_is_io = true;
+   }
+
+   if (!nvmeq->sq_cmds) {
+   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+   &nvmeq->sq_dma_addr, GFP_KERNEL);
+   nvmeq->sq_cmds_is_io = false;
+   }
 
-   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
-   &nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@@ -1450,13 +1465,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
struct nvme_dev *dev = nvmeq->dev;
int result;
 
-   if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
- dev->ctrl.page_size);
-   nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
-   nvmeq->sq_cmds_io = dev->cmb + offset;
-   }
-
nvmeq->cq_vector = qid - 1;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
@@ -1689,9 +1697,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
 
-   if (!use_cmb_sqes)
-   return;
-
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1708,11 +1713,15 @@ static void nvme_map_cmb(struct nvme_dev 

[PATCH v3 03/11] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

2018-03-12 Thread Logan Gunthorpe
The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
addresses when using P2P memory.

For this, we assume that an SGL passed to these functions contain all
P2P memory or no P2P memory.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 51 ++
 include/linux/memremap.h   |  1 +
 include/linux/pci-p2pdma.h | 13 
 3 files changed, 65 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index fd4789566a56..ab810c3a93eb 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -190,6 +190,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, 
size_t size,
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+   pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+   pci_resource_start(pdev, bar);
 
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -731,3 +733,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2pdma->p2pmem_published = publish;
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Returns the number of SG entries mapped
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+   struct dev_pagemap *pgmap;
+   struct scatterlist *s;
+   phys_addr_t paddr;
+   int i;
+
+   /*
+* p2pdma mappings are not compatible with devices that use
+* dma_virt_ops.
+*/
+   if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+   return 0;
+
+   for_each_sg(sg, s, nents, i) {
+   pgmap = sg_page(s)->pgmap;
+   paddr = sg_phys(s);
+
+   s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+   sg_dma_len(s) = s->length;
+   }
+
+   return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+   u64 pci_p2pdma_bus_offset;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 1f7856ff098b..59eb218bdb25 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct 
scatterlist **sgl,
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
unsigned int nents);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -97,5 +101,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline int pci_p2pdma_map_sg(struct device *dev,
+   struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+   return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+   struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
 #endif /* CONFIG_PCI_P2PDMA */
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.11.0



[PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-03-12 Thread Logan Gunthorpe
Add a restructured text file describing how to write drivers
with support for P2P DMA transactions. The document describes
how to use the APIs that were added in the previous few
commits.

Also adds an index for the PCI documentation tree even though this
is the only PCI document that has been converted to restructured text
at this time.

Signed-off-by: Logan Gunthorpe 
Cc: Jonathan Corbet 
---
 Documentation/PCI/index.rst  |  14 
 Documentation/PCI/p2pdma.rst | 164 +++
 Documentation/index.rst  |   3 +-
 3 files changed, 180 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/PCI/index.rst
 create mode 100644 Documentation/PCI/p2pdma.rst

diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
new file mode 100644
index ..2fdc4b3c291d
--- /dev/null
+++ b/Documentation/PCI/index.rst
@@ -0,0 +1,14 @@
+==================================
+Linux PCI Driver Developer's Guide
+==================================
+
+.. toctree::
+
+   p2pdma
+
+.. only::  subproject and html
+
+   Indices
+   ===
+
+   * :ref:`genindex`
diff --git a/Documentation/PCI/p2pdma.rst b/Documentation/PCI/p2pdma.rst
new file mode 100644
index ..d7edd48a3941
--- /dev/null
+++ b/Documentation/PCI/p2pdma.rst
@@ -0,0 +1,164 @@
+============================
+PCI Peer-to-Peer DMA Support
+============================
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two endpoints on the bus. This type of transaction is
+henceforth called Peer-to-Peer (or P2P). However, there are a number of
+issues that make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI Root Complexes are not required
+to support forwarding packets between Root Ports. To make things worse,
+there is no simple way to determine if a given Root Complex supports
+this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
+the kernel only supports doing P2P when the endpoints involved are all
+behind a PCIe Switch as this guarantees the packets will always be routable.
+
+The second issue is that to make use of existing interfaces in Linux,
+memory that is used for P2P transactions needs to be backed by struct
+pages. However, PCI BARs are not typically cache coherent so there are
+a few corner case gotchas with these pages so developers need to
+be careful about what they do with them.
+
+
+Driver Writer's Guide
+=====================
+
+In a given P2P implementation there may be three or more different
+types of kernel drivers in play:
+
+* Providers - A driver which provides or publishes P2P resources like
+  memory or doorbell registers to other drivers.
+* Clients - A driver which makes use of a resource by setting up a
+  DMA transaction to it.
+* Orchestrators - A driver which orchestrates the flow of data between
+  clients and providers
+
+In many cases there could be overlap between these three types (ie.
+it may be typical for a driver to be both a provider and a client).
+
+For example, in the NVMe Target Copy Offload implementation:
+
+* The NVMe PCI driver is both a client, provider and orchestrator
+  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
+  resource (provider), it accepts P2P memory pages as buffers in requests
+  to be used directly (client) and it can also make use of the CMB as
+  submission queue entries.
+* The RDMA driver is a client in this arrangement so that an RNIC
+  can DMA directly to the memory exposed by the NVME device.
+* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
+  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
+
+This is currently the only arrangement supported by the kernel but
+one could imagine slight tweaks to this that would allow for the same
+functionality. For example, if a specific RNIC added a BAR with some
+memory behind it, its driver could add support as a P2P provider and
+then the NVMe Target could use the RNIC's memory instead of the CMB
+in cases where the NVMe cards in use do not have CMB support.
+
+
+Provider Drivers
+----------------
+
+A provider simply needs to register a BAR (or a portion of a BAR)
+as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
+This will register struct pages for all the specified memory.
+
+After that it may optionally publish all of its resources as
+P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
+any orchestrator drivers to find and use the memory. When marked in
+this way, the resource must be regular memory with no side effects.
+
+For the time being this is fairly rudimentary in that all resources
+are typically going to be P2P memory. Future work will likely expand
+this to include other types of resources like doorbells.
+
+
+Client Drivers
+--------------
+
+A client driver typically only has to conditionally change its DMA map
+routine to use the mapping 

[PATCH v3 00/11] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-12 Thread Logan Gunthorpe
Hi Everyone,

Here's v3 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc5.

Thanks,

Logan


Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
  and Kconfig doc. Also, the code that disables the ACS bits is surrounded
  explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
  is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
  are closest to (or the same as) the clients using them. In cases
  of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
  may be explicitly specified, bypassing the logic in pci_p2pmem_find().
  (Note: it's still enforced that the provider must be behind the
   same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
  as a bunch of cleanup and spelling fixes he pointed out in the last
  series.

* To address Alex's ACS concerns, we change to a simpler method of
  just disabling ACS behind switches for any kernel that has
  CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
  fairly simply handle Jason's concerns that this work might break with
  the HFI, QIB and rxe drivers that use the virtual ops to implement
  their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, at present the
trade-off is a reduction in overall throughput. (This is largely due
to hardware issues that will certainly improve in the future.)

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a white list of known good
root complexes.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
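
On the provider side this amounts to the following (a sketch; the
signatures are as in the patches, error handling elided):

	/* register (part of) a BAR as P2P DMA memory; struct pages are
	 * created for it internally via devm_memremap_pages() */
	error = pci_p2pdma_add_resource(pdev, bar, size, offset);
	if (!error)
		pci_p2pmem_publish(pdev, true);	/* let orchestrators find it */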

Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transactions and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.

In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked not to be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.

Logan Gunthorpe (11):
  PCI/P2PDMA: Support peer-to-peer memory
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Add 

[PATCH v3 10/11] nvme-pci: Add a quirk for a pseudo CMB

2018-03-12 Thread Logan Gunthorpe
Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for using CMBLOC and
CMBSIZE.

We'd like to use some of these older cards to test P2P memory.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Sagi Grimberg 
---
 drivers/nvme/host/nvme.h |  7 +++
 drivers/nvme/host/pci.c  | 24 
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1fb2b6603d49..d1381bfc40f1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -83,6 +83,13 @@ enum nvme_quirks {
 * Supports the LighNVM command set if indicated in vs[1].
 */
NVME_QUIRK_LIGHTNVM = (1 << 6),
+
+   /*
+* Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+* NVRAM that have CMB-like memory on a BAR but do not set
+* CMBLOC or CMBSZ.
+*/
+   NVME_QUIRK_PSEUDO_CMB_BAR4  = (1 << 7),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0ebab7ab4d7e..a798e08a07bc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1683,6 +1683,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
 }
 static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
 
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+   return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+   (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+   ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
 static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
 {
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1702,10 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
 
-   dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
-   if (!dev->cmbsz)
-   return;
-   dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+   if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+   dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+   dev->cmbloc = 4;
+   } else {
+   dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+   if (!dev->cmbsz)
+   return;
+   dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+   }
 
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2719,6 +2731,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807),   /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+   { PCI_DEVICE(0x11f8, 0xf117),   /* Microsemi NVRAM adaptor */
+   .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+   { PCI_DEVICE(0x1db1, 0x0002),   /* Everspin nvNitro adaptor */
+   .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4,  },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.11.0



[PATCH v3 02/11] PCI/P2PDMA: Add sysfs group to display p2pmem stats

2018-03-12 Thread Logan Gunthorpe
Add a sysfs group to display statistics about P2P memory that is
registered in each PCI device.

Attributes in the group display the total amount of P2P memory, the
amount available and whether it is published or not.

Signed-off-by: Logan Gunthorpe 
---
 Documentation/ABI/testing/sysfs-bus-pci | 25 +++
 drivers/pci/p2pdma.c| 54 +
 2 files changed, 79 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..044812c816d0 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
 
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What:  /sys/bus/pci/devices/.../p2pmem/available
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains the amount of memory that has not been
+   allocated (in decimal).
+
+What:  /sys/bus/pci/devices/.../p2pmem/size
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains the total amount of memory that the device
+   provides (in decimal).
+
+What:  /sys/bus/pci/devices/.../p2pmem/published
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains a '1' if the memory has been published for
+   use inside the kernel or a '0' if it is only intended
+   for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0ee917381dce..fd4789566a56 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -24,6 +24,54 @@ struct pci_p2pdma {
bool p2pmem_published;
 };
 
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   size_t size = 0;
+
+   if (pdev->p2pdma->pool)
+   size = gen_pool_size(pdev->p2pdma->pool);
+
+   return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute 
*attr,
+ char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   size_t avail = 0;
+
+   if (pdev->p2pdma->pool)
+   avail = gen_pool_avail(pdev->p2pdma->pool);
+
+   return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute 
*attr,
+ char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return snprintf(buf, PAGE_SIZE, "%d\n",
+   pdev->p2pdma->p2pmem_published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+   &dev_attr_size.attr,
+   &dev_attr_available.attr,
+   &dev_attr_published.attr,
+   NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+   .attrs = p2pmem_attrs,
+   .name = "p2pmem",
+};
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
struct pci_p2pdma *p2p =
@@ -53,6 +101,7 @@ static void pci_p2pdma_release(void *data)
percpu_ref_exit(&pdev->p2pdma->devmap_ref);
 
gen_pool_destroy(pdev->p2pdma->pool);
+   sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
 }
 
@@ -83,9 +132,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
pdev->p2pdma = p2p;
 
+   error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+   if (error)
+   goto out_pool_destroy;
+
return 0;
 
 out_pool_destroy:
+   pdev->p2pdma = NULL;
gen_pool_destroy(p2p->pool);
 out:
devm_kfree(&pdev->dev, p2p);
-- 
2.11.0



[PATCH v3 09/11] nvme-pci: Add support for P2P memory in requests

2018-03-12 Thread Logan Gunthorpe
For P2P requests, we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Sagi Grimberg 
---
 drivers/nvme/host/core.c |  4 
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 19 +++
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7aeca5db7916..c7c5de116720 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2949,7 +2949,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d733b14ede9d..1fb2b6603d49 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -290,6 +290,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
+#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1fb57fa42dd0..0ebab7ab4d7e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -796,8 +796,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
goto out;
 
ret = BLK_STS_RESOURCE;
-   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-   DMA_ATTR_NO_WARN);
+
+   if (REQ_IS_PCI_P2PDMA(req))
+   nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+   else
+   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+dma_dir,  DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;
 
@@ -842,7 +847,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
 
if (iod->nents) {
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+   if (REQ_IS_PCI_P2PDMA(req))
+   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+   dma_dir);
+   else
+   dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req)) {
if (req_op(req) == REQ_OP_READ)
nvme_dif_remap(req, nvme_dif_complete);
@@ -2426,7 +2436,8 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, 
u32 off, u64 *val)
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED,
+   .flags  = NVME_F_METADATA_SUPPORTED |
+ NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
-- 
2.11.0



[PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-12 Thread Logan Gunthorpe
Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
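
Putting that together, the flow looks roughly like this (a sketch; the
pci_p2pmem_find() and pci_alloc_p2pmem() signatures are assumptions for
illustration):

	LIST_HEAD(clients);
	struct pci_dev *p2p_dev;
	void *buf;

	/* collect every device that will touch the P2P buffers */
	pci_p2pdma_add_client(&clients, &rnic_pdev->dev);
	pci_p2pdma_add_client(&clients, &nvme_pdev->dev);

	/* find a provider compatible with (behind the same switch as)
	 * all of the clients */
	p2p_dev = pci_p2pmem_find(&clients);
	if (p2p_dev)
		buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);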

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but would significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVME
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
we have no way of knowing whether peer-to-peer routing between PCIe
Root Ports is supported (PCIe r4.0, sec 1.3.1). Additionally, the
benefits of P2P transfers that go through the RC are limited to
reducing DRAM usage and, in some cases, coding convenience.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/Kconfig|  16 ++
 drivers/pci/Makefile   |   1 +
 drivers/pci/p2pdma.c   | 679 +
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h | 101 +++
 include/linux/pci.h|   4 +
 6 files changed, 819 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..d59f6f5ddfcd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID
 
  If unsure, say N.
 
+config PCI_P2PDMA
+   bool "PCI peer-to-peer transfer support"
+   depends on ZONE_DEVICE
+   select GENERIC_ALLOCATOR
+   help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it at all, so at this time you
+ will need a PCIe switch.
+
+ If unsure, say N.
+
 config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
 
 obj-$(CONFIG_PCI_ATS) += ats.o
 obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
 
 #
 # ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..0ee917381dce
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,679 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct pci_p2pdma {
+   struct percpu_ref devmap_ref;
+   struct completion devmap_ref_done;
+   struct gen_pool *pool;
+   bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+   struct pci_p2pdma *p2p =
+   container_of(ref, struct pci_p2pdma, devmap_ref);
+
+   complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+   struct percpu_ref *ref = data;
+
+   if (percpu_ref_is_dying(ref))
+   return;
+
+   percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+   struct pci_dev *pdev = data;
+
+   if (!pdev->p2pdma)
+   return;
+
+   wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+   percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+   gen_pool_destroy(pdev->p2pdma->pool);
+   

Re: [PATCH 1/2] direct-io: Remove unused DIO_ASYNC_EXTEND flag

2018-03-12 Thread Jens Axboe
On 3/12/18 2:54 AM, Nikolay Borisov wrote:
> 
> 
> On 23.02.2018 13:45, Nikolay Borisov wrote:
>> This flag was added by 6039257378e4 ("direct-io: add flag to allow aio
>> writes beyond i_size") to support XFS. However, with the rework of
>> XFS' DIO's path to use iomap in acdda3aae146 ("xfs: use iomap_dio_rw")
>> it became redundant. So let's remove it.
>>
>> Signed-off-by: Nikolay Borisov 
> 
> Jens,
> 
> On a second look I think you are the more appropriate person to take
> these patches. So do you have any objections to merging those via the
> block tree? (I did CC you but didn't cc linux-block.)

Both look fine to me, I can add them for 4.17. Thanks.

-- 
Jens Axboe



[PATCH] nbd: update size when connected

2018-03-12 Thread Josef Bacik
From: Josef Bacik 

I messed up changing the size of an NBD device while it was connected by
not actually updating the device or doing the uevent.  Fix this by
updating everything if we're connected and we change the size.

cc: sta...@vger.kernel.org
Fixes: 639812a ("nbd: don't set the device size until we're connected")
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 86258b00a1d4..7106b98a35fb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -243,6 +243,8 @@ static void nbd_size_set(struct nbd_device *nbd, loff_t 
blocksize,
struct nbd_config *config = nbd->config;
config->blksize = blocksize;
config->bytesize = blocksize * nr_blocks;
+   if (nbd->task_recv != NULL)
+   nbd_size_update(nbd);
 }
 
 static void nbd_complete_rq(struct request *req)
-- 
2.14.3



Re: [PATCH] device_handler: remove VLAs

2018-03-12 Thread Bart Van Assche
On Sat, 2018-03-10 at 14:14 +0100, Stephen Kitt wrote:
> The two patches I sent were supposed to be alternative solutions; see
> https://marc.info/?l=linux-scsi=152063671005295=2 for the introduction (I
> seem to have messed up the headers, so the mails didn’t end up threaded
> properly).

The two patches arrived in my inbox several minutes before the cover letter. In
the e-mail header of the cover letter I found the following:

X-Greylist: delayed 1810 seconds by postgrey-1.27 at vger.kernel.org; Fri, 09 
Mar 2018 18:05:08 EST

Does this mean that the delay happened due to vger server's anti-spam algorithm?

Bart.





RE: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue

2018-03-12 Thread Don Brace
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, March 08, 2018 9:32 PM
> To: James Bottomley ; Jens Axboe
> ; Martin K . Petersen 
> Cc: Christoph Hellwig ; linux-s...@vger.kernel.org; linux-
> bl...@vger.kernel.org; Meelis Roos ; Don Brace
> ; Kashyap Desai
> ; Laurence Oberman
> ; Mike Snitzer ; Ming Lei
> ; Hannes Reinecke ; James Bottomley
> ; Artem Bityutskiy
> 
> Subject: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue
> 
> Since 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs"),
> an msix vector can be created without any online CPU mapped to it, in
> which case a command's completion may not be notified.
> 
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> mapping table to choose the reply queue when queuing a command.
> 
> The chosen reply queue is thus guaranteed to be active, which fixes the
> IO hang caused by using an inactive reply queue that has no online CPU
> mapped.
> 
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Tested-by: Don Brace 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei 
> ---

Acked-by: Don Brace 
Tested-by: Don Brace 
* Rebuilt test rig: applied the following patches to Linus's tree 4.16.0-rc4+:
    [PATCH V4 1_4] scsi: hpsa: fix selection of reply queue - Ming Lei - 2018-03-08 2132.eml
    [PATCH V4 3_4] scsi: introduce force_blk_mq - Ming Lei - 2018-03-08 2132.eml
* fio tests on 6 LVs on P441 controller (fw 6.59), 5 days.
* fio tests on 10 HBA disks on P431 (fw 4.54) controller, 3 days (concurrent with P441 tests).

>  drivers/scsi/hpsa.c | 73 
> +++--
>  drivers/scsi/hpsa.h |  1 +
>  2 files changed, 55 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 5293e6827ce5..3a9eca163db8 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -1045,11 +1045,7 @@ static void set_performant_mode(struct ctlr_info
> *h, struct CommandList *c,
> c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
> if (unlikely(!h->msix_vectors))
> return;
> -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -   c->Header.ReplyQueue =
> -   raw_smp_processor_id() % h->nreply_queues;
> -   else
> -   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
> +   c->Header.ReplyQueue = reply_queue;
> }
>  }
> 
> @@ -1063,10 +1059,7 @@ static void set_ioaccel1_performant_mode(struct
> ctlr_info *h,
>  * Tell the controller to post the reply to the queue for this
>  * processor.  This seems to give the best I/O throughput.
>  */
> -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
> -   else
> -   cp->ReplyQueue = reply_queue % h->nreply_queues;
> +   cp->ReplyQueue = reply_queue;
> /*
>  * Set the bits in the address sent down to include:
>  *  - performant mode bit (bit 0)
> @@ -1087,10 +1080,7 @@ static void
> set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
> /* Tell the controller to post the reply to the queue for this
>  * processor.  This seems to give the best I/O throughput.
>  */
> -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -   cp->reply_queue = smp_processor_id() % h->nreply_queues;
> -   else
> -   cp->reply_queue = reply_queue % h->nreply_queues;
> +   cp->reply_queue = reply_queue;
> /* Set the bits in the address sent down to include:
>  *  - performant mode bit not used in ioaccel mode 2
>  *  - pull count (bits 0-3)
> @@ -1109,10 +1099,7 @@ static void set_ioaccel2_performant_mode(struct
> ctlr_info *h,
>
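
(The quoted diff is truncated here in the archive. Condensed, the mechanism
of the fix is roughly the following sketch; the real patch plumbs the chosen
reply_queue through the submit paths shown above:)

	/* init: reply_map[] is built from pci_irq_get_affinity(), so the
	 * submitting CPU is always in the affinity mask of the vector it
	 * maps to; submit: pick the reply queue for the current CPU */
	cpu = raw_smp_processor_id();
	c->Header.ReplyQueue = h->reply_map[cpu];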

Re: [PATCH 1/2] direct-io: Remove unused DIO_ASYNC_EXTEND flag

2018-03-12 Thread Nikolay Borisov


On 23.02.2018 13:45, Nikolay Borisov wrote:
> This flag was added by 6039257378e4 ("direct-io: add flag to allow aio
> writes beyond i_size") to support XFS. However, with the rework of
> XFS' DIO's path to use iomap in acdda3aae146 ("xfs: use iomap_dio_rw")
> it became redundant. So let's remove it.
> 
> Signed-off-by: Nikolay Borisov 

Jens,

On a second look I think you are the more appropriate person to take
these patches. So do you have any objections to merging those via the
block tree? (I did CC you but didn't cc linux-block.)

> ---
>  fs/direct-io.c | 3 +--
>  include/linux/fs.h | 3 ---
>  2 files changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a0ca9e48e993..99a81c49bce9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1252,8 +1252,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
> *inode,
>*/
>   if (is_sync_kiocb(iocb))
>   dio->is_async = false;
> - else if (!(dio->flags & DIO_ASYNC_EXTEND) &&
> -  iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
> + else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
>   dio->is_async = false;
>   else
>   dio->is_async = true;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2a815560fda0..260c233e7375 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2977,9 +2977,6 @@ enum {
>   /* filesystem does not support filling holes */
>   DIO_SKIP_HOLES  = 0x02,
>  
> - /* filesystem can handle aio writes beyond i_size */
> - DIO_ASYNC_EXTEND = 0x04,
> -
>   /* inode/fs/bdev does not need truncate protection */
>   DIO_SKIP_DIO_COUNT = 0x08,
>  };
> 


Re: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue

2018-03-12 Thread Ming Lei
On Mon, Mar 12, 2018 at 08:52:02AM +0100, Christoph Hellwig wrote:
> On Sat, Mar 10, 2018 at 11:01:43PM +0800, Ming Lei wrote:
> > > I really dislike this being open coded in drivers.  It really should
> > > be a helper shared with the blk-mq map building that drivers just use.
> > > 
> > > For now just have a low-level blk_pci_map_queues that
> > > blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
> > > it might make sense to change the blk-mq callout to that low-level
> > > prototype as well.
> > 
> > The way of selecting the reply queue is needed for non-scsi_mq too.
> 
> Which still doesn't prevent you from using a common helper.

The only common code is the following part:

+   for (queue = 0; queue < instance->msix_vectors; queue++) {
+   mask = pci_irq_get_affinity(instance->pdev, queue);
+   if (!mask)
+   goto fallback;
+
+   for_each_cpu(cpu, mask)
+   instance->reply_map[cpu] = queue;
+   }

For megaraid_sas, the fallback code needs to handle the mapping in the
following way for legacy vectors:

   for_each_possible_cpu(cpu)
   instance->reply_map[cpu] = cpu % instance->msix_vectors;


So I'm not sure it is worth a common helper, given there may not be
other potential users of it.
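
For reference, wrapped up as the helper suggested above it would look
roughly like this (a sketch only; the reply_map parameter and the error
convention are assumptions, and callers would still provide their own
legacy-vector fallback):

	static int blk_pci_map_queues(struct pci_dev *pdev, unsigned int nvecs,
				      unsigned int *reply_map)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < nvecs; queue++) {
			mask = pci_irq_get_affinity(pdev, queue);
			if (!mask)
				return -EINVAL;

			for_each_cpu(cpu, mask)
				reply_map[cpu] = queue;
		}
		return 0;
	}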

Thanks,
Ming


Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-12 Thread Ming Lei
On Fri, Mar 09, 2018 at 10:24:45AM -0700, Keith Busch wrote:
> On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> > 
> > So I suspect we'll need to go with a patch like this, just with a way
> > better changelog.
> 
> I have to agree this is required for that use case. I'll run some
> quick tests and propose an alternate changelog.
> 
> Longer term, the current way we're including offline present cpus either
> (a) has the driver allocate resources it can't use or (b) spreads the
> ones it can use thinner than they need to be. Why don't we rerun the
> irq spread under a hot cpu notifier for only online CPUs?

4b855ad371 ("blk-mq: Create hctx for each present CPU") removed the
handling of mapping changes via the hot cpu notifier. Not only was the
code cleaned up, it also fixed a very complicated queue dependency issue:

- loop/dm-rq queue depends on underlying queue
- for NVMe, IO queue depends on admin queue

If freezing the queue can be avoided in the CPU notifier, it should be
fine to do that; otherwise it needs to be avoided.

Thanks,
Ming


Re: [PATCH V4 4/4] scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity

2018-03-12 Thread Ming Lei
On Sat, Mar 10, 2018 at 11:15:20AM +0100, Christoph Hellwig wrote:
> This looks generally fine to me:
> 
> Reviewed-by: Christoph Hellwig 
> 
> As a follow on we should probably kill virtscsi_queuecommand_single and
> thus virtscsi_host_template_single as well.
> > Given storage IO is always C/S model, there isn't such issue with 
> > SCSI_MQ(blk-mq),
> 
> What does C/S mean here?

Client–Server.

> 
> > @@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct 
> > Scsi_Host *sh,
> > struct scsi_cmnd *sc)
> >  {
> > struct virtio_scsi *vscsi = shost_priv(sh);
> > -   struct virtio_scsi_target_state *tgt =
> > -   scsi_target(sc->device)->hostdata;
> >  
> > -   atomic_inc(>reqs);
> > return virtscsi_queuecommand(vscsi, >req_vqs[0], sc);
> >  }
> 
> >  static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> >struct scsi_cmnd *sc)
> >  {
> > struct virtio_scsi *vscsi = shost_priv(sh);
> > -   struct virtio_scsi_target_state *tgt =
> > -   scsi_target(sc->device)->hostdata;
> > -   struct virtio_scsi_vq *req_vq;
> > -
> > -   if (shost_use_blk_mq(sh))
> > -   req_vq = virtscsi_pick_vq_mq(vscsi, sc);
> > -   else
> > -   req_vq = virtscsi_pick_vq(vscsi, tgt);
> > +   struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);
> >  
> > return virtscsi_queuecommand(vscsi, req_vq, sc);
> 
> Given how virtscsi_pick_vq_mq works virtscsi_queuecommand_single and
> virtscsi_queuecommand_multi now have identical behavior.  That means
> virtscsi_queuecommand_single should be removed, and
> virtscsi_queuecommand_multi should be merged into virtscsi_queuecommand,

OK.
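
(For concreteness, the merged entry point would presumably take this shape;
a sketch, with the current virtscsi_queuecommand body assumed renamed to a
local helper:)

	static int virtscsi_queuecommand(struct Scsi_Host *sh,
					 struct scsi_cmnd *sc)
	{
		struct virtio_scsi *vscsi = shost_priv(sh);
		struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);

		/* __virtscsi_queuecommand: hypothetical name for the old body */
		return __virtscsi_queuecommand(vscsi, req_vq, sc);
	}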

> 
> > @@ -823,6 +768,7 @@ static struct scsi_host_template 
> > virtscsi_host_template_single = {
> > .target_alloc = virtscsi_target_alloc,
> > .target_destroy = virtscsi_target_destroy,
> > .track_queue_depth = 1,
> > +   .force_blk_mq = 1,
> 
> This probably isn't strictly needed.  That being said with your
> change we could probably just drop virtscsi_host_template_single entirely.
> 

OK.

Thanks,
Ming


Re: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue

2018-03-12 Thread Christoph Hellwig
On Sat, Mar 10, 2018 at 11:01:43PM +0800, Ming Lei wrote:
> > I really dislike this being open coded in drivers.  It really should
> > be a helper shared with the blk-mq map building that drivers just use.
> > 
> > For now just have a low-level blk_pci_map_queues that
> > blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
> > it might make sense to change the blk-mq callout to that low-level
> > prototype as well.
> 
> The way of selecting the reply queue is needed for non-scsi_mq too.

Which still doesn't prevent you from using a common helper.


Re: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue

2018-03-12 Thread Bityutskiy, Artem
Linux-Regression-ID: lr#15a115

On Fri, 2018-03-09 at 11:32 +0800, Ming Lei wrote:
> Since 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs"),
> an msix vector can be created without any online CPU mapped to it, in
> which case a command's completion may not be notified.
> 
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> mapping table to choose the reply queue when queuing a command.
> 
> The chosen reply queue is thus guaranteed to be active, which fixes the
> IO hang caused by using an inactive reply queue that has no online CPU
> mapped.
> 
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Tested-by: Don Brace 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei 

Tested-by: Artem Bityutskiy 
Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com

These 2 patches make the Dell R640 regression that I reported go away.
Tested on top of v4.16-rc5, thanks!

-- 
Best Regards,
Artem Bityutskiy


Re: [PATCH] device_handler: remove VLAs

2018-03-12 Thread Hannes Reinecke
On 03/09/2018 11:32 PM, Stephen Kitt wrote:
> In preparation to enabling -Wvla, remove VLAs and replace them with
> fixed-length arrays instead.
> 
> scsi_dh_{alua,emc,rdac} use variable-length array declarations to
> store command blocks, with the appropriate size as determined by
> COMMAND_SIZE. This patch replaces these with fixed-sized arrays using
> MAX_COMMAND_SIZE, so that the array size can be determined at compile
> time.
> 
> This was prompted by https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Stephen Kitt 
> ---
>  drivers/scsi/device_handler/scsi_dh_alua.c | 8 
>  drivers/scsi/device_handler/scsi_dh_emc.c  | 2 +-
>  drivers/scsi/device_handler/scsi_dh_rdac.c | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
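The change itself reduces to the following pattern (a simplified sketch of
one command buffer, not the exact hunks; COMMAND_SIZE() expands to a runtime
table lookup, which is why these declarations are VLAs even with a constant
opcode):

	/* before: array length computed at run time */
	u8 cdb[COMMAND_SIZE(MAINTENANCE_IN)];

	/* after: compile-time constant upper bound */
	u8 cdb[MAX_COMMAND_SIZE];
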
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)