Re: block and nvme polling improvements

2018-11-26 Thread Max Gurtovoy



On 11/21/2018 6:23 PM, Christoph Hellwig wrote:

Hi all,

this series optimizes a few bits in the block layer and nvme code
related to polling.

It starts by moving the queue types recently introduced entirely into
the block layer instead of requiring an indirect call for them.

It then switches nvme and the block layer to only allow polling
with separate poll queues, which allows us to realize the following
benefits:

  - poll queues can safely avoid disabling irqs on any locks
(we already do that in NVMe, but it isn't 100% kosher as-is)
  - regular interrupt driven queues can drop the CQ lock entirely,
as we won't race for completing CQs

Then we drop the NVMe RDMA polling code, as it doesn't follow the new model,
and remove the nvme multipath polling code including the block hooks
for it, which didn't make much sense to start with given that we
started bypassing the multipath code for single controller subsystems
early on.  Last but not least we enable polling in the block layer
by default if the underlying driver has poll queues, as that already
requires explicit user action.

Note that it would be really nice to have polling back for RDMA with
dedicated poll queues, but that might take a while.  Also based on
Jens' polling aio patches we could now implement a model in nvmet
where we have a thread polling both the backend nvme device and
the RDMA CQs, which might give us some pretty nice performance
(I know Sagi looked into something similar a while ago).


That's interesting. Can you share more on that? What is the model 
(thread per core)?


I'll take a deeper look into the polling aio patches from Jens.
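To make the question concrete: is it something like a per-thread loop that 
polls both completion sources? All names below are hypothetical; this is only 
an illustration, not existing nvmet code:

static int nvmet_poll_thread(void *data)
{
	struct poll_ctx *ctx = data;	/* hypothetical: one context per polling thread/core */

	while (!kthread_should_stop()) {
		int found = 0;

		found += poll_rdma_cq(ctx->cq);		/* hypothetical wrapper: drain the RDMA CQ */
		found += poll_backend_nvme(ctx->bdev);	/* hypothetical wrapper: poll the backend nvme queue */
		if (!found)
			cond_resched();			/* nothing completed, yield the cpu */
	}
	return 0;
}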




A git tree is also available at:

 git://git.infradead.org/users/hch/misc.git nvme-polling

Gitweb:

 
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-polling


Re: [PATCH 1/6] nvme: don't disable local ints for polled queue

2018-11-14 Thread Max Gurtovoy



On 11/10/2018 5:13 PM, Jens Axboe wrote:

A polled queue doesn't trigger interrupts, so it's always safe
to grab the queue lock without disabling interrupts.



Jens,

Can you share the performance gain from this change?




Cc: Keith Busch 
Cc: linux-n...@lists.infradead.org
Signed-off-by: Jens Axboe 
---
  drivers/nvme/host/pci.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6aa86dfcb32c..a6e3fbddfadf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1061,15 +1061,26 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
  
  static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)

  {
+   unsigned long flags = 0; /* gcc 7.x fail */
u16 start, end;
-   bool found;
+   bool found, has_irq;
  
  	if (!nvme_cqe_pending(nvmeq))

return 0;
  
-	spin_lock_irq(&nvmeq->cq_lock);

+   /*
+* Polled queue doesn't have an IRQ, no need to disable ints
+*/
+   has_irq = !nvmeq->polled;
+   if (has_irq)
+   local_irq_save(flags);
+
+   spin_lock(&nvmeq->cq_lock);
	found = nvme_process_cq(nvmeq, &start, &end, tag);
-   spin_unlock_irq(&nvmeq->cq_lock);
+   spin_unlock(&nvmeq->cq_lock);
+
+   if (has_irq)
+   local_irq_restore(flags);
  
  	nvme_complete_cqes(nvmeq, start, end);

return found;


Re: [PATCH] block: fix rdma queue mapping

2018-08-19 Thread Max Gurtovoy

Hi Sagi,
did you have a chance to look at Israel's and my fixes that we 
attached to the first thread?


There are a few issues with this approach. For example, if there is no 
"free" CPU in the mask for Qi, you take a CPU from the Qi+j mask.


Also, if we have a non-symmetrical affinity, e.g. for 2 queues on a 
system with 32 CPUs, the mapping is 16 CPUs for each queue (and not 
according to the user affinity...).
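As a toy illustration of that last point (this is not the kernel mapping code, 
only the behaviour I'm describing): with 32 CPUs and 2 queues, a plain even 
split always assigns 16 CPUs per queue, regardless of the affinity masks the 
user configured:

static unsigned int even_split_queue(unsigned int cpu, unsigned int nr_cpus,
				     unsigned int nr_queues)
{
	/*
	 * 32 CPUs, 2 queues: CPUs 0..15 -> Q0, CPUs 16..31 -> Q1,
	 * no matter what per-queue affinity the user asked for.
	 */
	return cpu / (nr_cpus / nr_queues);
}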


-Max.


[PATCH v5 1/3] block: move ref_tag calculation func to the block layer

2018-07-29 Thread Max Gurtovoy
Currently this function is implemented in the scsi layer, but its
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - fix failures in case kernel configured with !CONFIG_BLK_DEV_INTEGRITY

---
 drivers/infiniband/ulp/iser/iser_memory.c |  2 +-
 drivers/nvme/host/core.c  |  3 +--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c  |  2 +-
 drivers/scsi/sd_dif.c |  4 ++--
 include/linux/t10-pi.h| 10 ++
 include/scsi/scsi_cmnd.h  |  7 +--
 6 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index ca844a9..130bf16 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -311,7 +311,7 @@ void iser_unreg_mem_fastreg(struct iscsi_iser_task 
*iser_task,
 {
domain->sig_type = IB_SIG_TYPE_T10_DIF;
domain->sig.dif.pi_interval = scsi_prot_interval(sc);
-   domain->sig.dif.ref_tag = scsi_prot_ref_tag(sc);
+   domain->sig.dif.ref_tag = t10_pi_ref_tag(sc->request);
/*
 * At the moment we hard code those, but in the future
 * we will take them from sc.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf65501..191177b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -627,8 +627,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
break;
}
}
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b8d131a..dd738ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4568,7 +4568,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd 
*scmd, bool pending)
MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG |
MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD;
mpi_request->CDB.EEDP32.PrimaryReferenceTag =
-   cpu_to_be32(scsi_prot_ref_tag(scmd));
+   cpu_to_be32(t10_pi_ref_tag(scmd->request));
break;
 
case SCSI_PROT_DIF_TYPE3:
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 9035380..d8de43d 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -124,7 +124,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
if (sdkp->protection_type == T10_PI_TYPE3_PROTECTION)
return;
 
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -176,7 +176,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int 
good_bytes)
return;
 
intervals = good_bytes / scsi_prot_interval(scmd);
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index c6aa8a3..c40511f 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -37,6 +37,16 @@ struct t10_pi_tuple {
 #define T10_PI_APP_ESCAPE cpu_to_be16(0xffff)
 #define T10_PI_REF_ESCAPE cpu_to_be32(0xffffffff)
 
+static inline u32 t10_pi_ref_tag(struct request *rq)
+{
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+   return blk_rq_pos(rq) >>
+   (rq->q->integrity.interval_exp - 9) & 0xffffffff;
+#else
+   return -1U;
+#endif
+}
+
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const struct blk_integrity_profile t10_pi_type1_ip;
 extern const struct blk_integrity_profile t10_pi_type3_crc;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index aaf1e97..cae229b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -313,12 +314,6 @@ static inline unsigned int scsi_prot_interval(struct 
scsi_cmnd *scmd)
return scmd->device->sector_size;
 }
 
-static inline u32 scsi_prot_ref_tag(struct scsi_cmnd *scmd)
-{
-   return blk_rq_pos(scmd->request) >>
-   (ilog2(scsi_prot_interval(scmd)) - 9) & 0xff

[PATCH v5 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-29 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - added Martin's Reviewed-by.

changes from v3:
 - kmap_atomic/kunmap_atomic the same address
 - declare pi struct inside the inner for loop
 - check "intervals" inside the for loop condition

changes from v2:
 - convert comments to kerneldoc format
 - removed SCSI specific comment
 - fix kmap_atomic/kunmap_atomic addresses
 - fix iteration over t10_pi_tuple's

changes from v1 (Christoph, Martin and Keith comments):
 - moved the functions to t10-pi.c
 - updated tuple size
 - changed local variables scope
 - remove/add new lines

---

 block/t10-pi.c | 110 +++
 drivers/scsi/sd.c  |   8 ++--
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 -
 include/linux/t10-pi.h |   3 ++
 5 files changed, 118 insertions(+), 125 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..62aed77 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -184,3 +184,113 @@ static blk_status_t t10_pi_type3_verify_ip(struct 
blk_integrity_iter *iter)
.verify_fn  = t10_pi_type3_verify_ip,
 };
 EXPORT_SYMBOL(t10_pi_type3_ip);
+
+/**
+ * t10_pi_prepare - prepare PI prior submitting request to device
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Remap protection information to match the
+ * physical LBA.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   void *p, *pmap;
+   unsigned int j;
+
+   pmap = kmap_atomic(iv.bv_page);
+   p = pmap + iv.bv_offset;
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   struct t10_pi_tuple *pi = p;
+
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   p += tuple_sz;
+   }
+
+   kunmap_atomic(pmap);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(t10_pi_prepare);
+
+/**
+ * t10_pi_complete - prepare PI prior returning request to the block layer
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ * @intervals:   total elements to prepare
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Since the physical start sector was submitted
+ * to the device, we should remap it back to virtual values expected by the
+ * block layer.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_complete(struct request *rq, u8 protection_type,
+unsigned int intervals)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec

[PATCH v5 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-29 Thread Max Gurtovoy
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Reviewed-by: Martin K. Petersen 
Acked-by: Keith Busch 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - Added Martin's and Keith's signatures

---
 drivers/nvme/host/core.c | 18 
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 20 insertions(+), 82 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 191177b..b57abe5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -617,6 +617,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   t10_pi_prepare(req, ns->pi_type);
}
 
switch (ns->pi_type) {
@@ -637,6 +639,22 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
*ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_req(req)->status == 0) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   t10_pi_complete(req, ns->pi_type,
+   blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33d..dfc01ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, 
sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddd441b..03beac5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct 
request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- * The virtual start sector is the one that was originally submitted by the
- * block layer. Due to partitioning, MD/DM cloning, etc. the actual physical
- * start sector may be different. Remap protection information to match the
- * physical LBA on writes, and back to the original seed on reads.
- *
- * Type 0 and 3 do not have a ref tag, so no remapping required.
- */
-static void nvme_dif_remap(struct request *req,
-   void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
-{
-   struct nvme_ns *ns = req->rq_disk->private_data;
-   struct bio_integrity_payload *bip;
-   struct t10_pi_tuple *pi;
-   void *p, *pmap;
-   u32 i, nlb, ts, phys, virt;
-
-   if (!ns->pi_type || ns->pi_type == NVME_NS_DPS_PI_TYPE3)
-   return;
-
-   bip = bio_integrity(req->bio);
-   if (!bip)
-   return;
-
-   pmap = kmap_atomic(bip->bip_vec->bv_p

Re: [PATCH v4 1/3] block: move ref_tag calculation func to the block layer

2018-07-28 Thread Max Gurtovoy




On 7/27/2018 6:38 PM, Jens Axboe wrote:

On 7/27/18 9:21 AM, Jens Axboe wrote:

On 7/25/18 9:46 AM, Max Gurtovoy wrote:

Currently this function is implemented in the scsi layer, but its
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.


Applied 1-3 for 4.19.


This:

static inline u32 t10_pi_ref_tag(struct request *rq)
{
 return blk_rq_pos(rq) >>
 (rq->q->integrity.interval_exp - 9) & 0xffffffff;
}

won't work for !CONFIG_BLK_DEV_INTEGRITY. I didn't want to make the
change, but it looks like it should either return -1U, as the value
should mean nothing if BLK_DEV_INTEGRITY isn't set, or just ignore the
shift and return blk_rq_pos(rq) & 0xffffffff.

Please fixup and resend the series.



Will this be good enough:

diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 81ae4c4..5a427c2 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -39,8 +39,12 @@ struct t10_pi_tuple {

 static inline u32 t10_pi_ref_tag(struct request *rq)
 {
+#ifdef CONFIG_BLK_DEV_INTEGRITY
return blk_rq_pos(rq) >>
(rq->q->integrity.interval_exp - 9) & 0xffffffff;
+#else
+   return -1U;
+#endif
 }

 extern const struct blk_integrity_profile t10_pi_type1_crc;
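As a sanity check on the shift itself (my own worked example, not part of the 
patch): blk_rq_pos() is in 512-byte sectors and integrity.interval_exp is the 
log2 of the protection interval in bytes, so for a 4096-byte interval 
(interval_exp = 12) the expression evaluates to (blk_rq_pos(rq) >> 3) & 
0xffffffff, i.e. the number of 4K intervals from the start of the device, 
truncated to 32 bits.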


[PATCH v4 1/3] block: move ref_tag calculation func to the block layer

2018-07-25 Thread Max Gurtovoy
Currently this function is implemented in the scsi layer, but its
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 drivers/infiniband/ulp/iser/iser_memory.c | 2 +-
 drivers/nvme/host/core.c  | 3 +--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c  | 2 +-
 drivers/scsi/sd_dif.c | 4 ++--
 include/linux/t10-pi.h| 6 ++
 include/scsi/scsi_cmnd.h  | 7 +--
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index ca844a9..130bf16 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -311,7 +311,7 @@ void iser_unreg_mem_fastreg(struct iscsi_iser_task 
*iser_task,
 {
domain->sig_type = IB_SIG_TYPE_T10_DIF;
domain->sig.dif.pi_interval = scsi_prot_interval(sc);
-   domain->sig.dif.ref_tag = scsi_prot_ref_tag(sc);
+   domain->sig.dif.ref_tag = t10_pi_ref_tag(sc->request);
/*
 * At the moment we hard code those, but in the future
 * we will take them from sc.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf65501..191177b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -627,8 +627,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
break;
}
}
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b8d131a..dd738ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4568,7 +4568,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd 
*scmd, bool pending)
MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG |
MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD;
mpi_request->CDB.EEDP32.PrimaryReferenceTag =
-   cpu_to_be32(scsi_prot_ref_tag(scmd));
+   cpu_to_be32(t10_pi_ref_tag(scmd->request));
break;
 
case SCSI_PROT_DIF_TYPE3:
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 9035380..d8de43d 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -124,7 +124,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
if (sdkp->protection_type == T10_PI_TYPE3_PROTECTION)
return;
 
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -176,7 +176,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int 
good_bytes)
return;
 
intervals = good_bytes / scsi_prot_interval(scmd);
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index c6aa8a3..635855e 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -37,6 +37,12 @@ struct t10_pi_tuple {
 #define T10_PI_APP_ESCAPE cpu_to_be16(0xffff)
 #define T10_PI_REF_ESCAPE cpu_to_be32(0xffffffff)
 
+static inline u32 t10_pi_ref_tag(struct request *rq)
+{
+   return blk_rq_pos(rq) >>
+   (rq->q->integrity.interval_exp - 9) & 0xffffffff;
+}
+
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const struct blk_integrity_profile t10_pi_type1_ip;
 extern const struct blk_integrity_profile t10_pi_type3_crc;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index aaf1e97..cae229b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -313,12 +314,6 @@ static inline unsigned int scsi_prot_interval(struct 
scsi_cmnd *scmd)
return scmd->device->sector_size;
 }
 
-static inline u32 scsi_prot_ref_tag(struct scsi_cmnd *scmd)
-{
-   return blk_rq_pos(scmd->request) >>
-   (ilog2(scsi_prot_interval(scmd)) - 9) & 0xffffffff;
-}
-
 static inline unsigned scsi_prot_sg_count(struct scsi_cmnd *cmd)
 {
return cmd->prot_sdb ? cmd->prot_sdb->table.nents : 0;
-- 
1.8.3.1



[PATCH v4 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-25 Thread Max Gurtovoy
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 drivers/nvme/host/core.c | 18 
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 20 insertions(+), 82 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 191177b..b57abe5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -617,6 +617,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   t10_pi_prepare(req, ns->pi_type);
}
 
switch (ns->pi_type) {
@@ -637,6 +639,22 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
*ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_req(req)->status == 0) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   t10_pi_complete(req, ns->pi_type,
+   blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33d..dfc01ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, 
sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddd441b..03beac5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct 
request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- * The virtual start sector is the one that was originally submitted by the
- * block layer. Due to partitioning, MD/DM cloning, etc. the actual physical
- * start sector may be different. Remap protection information to match the
- * physical LBA on writes, and back to the original seed on reads.
- *
- * Type 0 and 3 do not have a ref tag, so no remapping required.
- */
-static void nvme_dif_remap(struct request *req,
-   void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
-{
-   struct nvme_ns *ns = req->rq_disk->private_data;
-   struct bio_integrity_payload *bip;
-   struct t10_pi_tuple *pi;
-   void *p, *pmap;
-   u32 i, nlb, ts, phys, virt;
-
-   if (!ns->pi_type || ns->pi_type == NVME_NS_DPS_PI_TYPE3)
-   return;
-
-   bip = bio_integrity(req->bio);
-   if (!bip)
-   return;
-
-   pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_offset;
-
-   p = pmap;
-   virt = bip_get_seed(bip);
-   phys = nvme_block_nr(n

[PATCH v4 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-25 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v3:
 - kmap_atomic/kunmap_atomic the same address
 - declare pi struct inside the inner for loop
 - check "intervals" inside the for loop condition

changes from v2:
 - convert comments to kerneldoc format
 - removed SCSI specific comment
 - fix kmap_atomic/kunmap_atomic addresses
 - fix iteration over t10_pi_tuple's

changes from v1 (Christoph, Martin and Keith comments):
 - moved the functions to t10-pi.c
 - updated tuple size
 - changed local variables scope
 - remove/add new lines

---
 block/t10-pi.c | 110 +++
 drivers/scsi/sd.c  |   8 ++--
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 -
 include/linux/t10-pi.h |   3 ++
 5 files changed, 118 insertions(+), 125 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..62aed77 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -184,3 +184,113 @@ static blk_status_t t10_pi_type3_verify_ip(struct 
blk_integrity_iter *iter)
.verify_fn  = t10_pi_type3_verify_ip,
 };
 EXPORT_SYMBOL(t10_pi_type3_ip);
+
+/**
+ * t10_pi_prepare - prepare PI prior submitting request to device
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Remap protection information to match the
+ * physical LBA.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   void *p, *pmap;
+   unsigned int j;
+
+   pmap = kmap_atomic(iv.bv_page);
+   p = pmap + iv.bv_offset;
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   struct t10_pi_tuple *pi = p;
+
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   p += tuple_sz;
+   }
+
+   kunmap_atomic(pmap);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(t10_pi_prepare);
+
+/**
+ * t10_pi_complete - prepare PI prior returning request to the block layer
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ * @intervals:   total elements to prepare
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Since the physical start sector was submitted
+ * to the device, we should remap it back to virtual values expected by the
+ * block layer.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_complete(struct request *rq, u8 protection_type,
+unsigned int intervals)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   bip_for_each_vec(iv, bi

Re: [PATCH v3 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-25 Thread Max Gurtovoy




On 7/25/2018 2:22 PM, Christoph Hellwig wrote:

+   pmap = kmap_atomic(iv.bv_page) + iv.bv_offset;
+   p = pmap;


Maybe:

pmap = p = kmap_atomic(iv.bv_page) + iv.bv_offset;


+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   pi = (struct t10_pi_tuple *)p;


No need for the cast, also the pi declaration can be moved into the
inner scope:

struct t10_pi_tuple *pi = p;


+   pmap = kmap_atomic(iv.bv_page) + iv.bv_offset;
+   p = pmap;
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   if (intervals == 0) {
+   kunmap_atomic(pmap);
+   return;
+   }
+   pi = (struct t10_pi_tuple *)p;


Same here.

Also the intervals check would make sense in the for loop I think, e.g.:

pmap = p = kmap_atomic(iv.bv_page) + iv.bv_offset;
for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
struct t10_pi_tuple *pi = p;




Yes, it makes sense. We may do a few more iterations before we return, but I 
guess that's rare and it's worth having fewer lines of code.

I'll update it for the next version.


[PATCH v3 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-25 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v2:
 - convert comments to kerneldoc format
 - removed SCSI specific comment
 - fix kmap_atomic/kunmap_atomic addresses
 - fix iteration over t10_pi_tuple's

changes from v1 (Christoph, Martin and Keith comments):
 - moved the functions to t10-pi.c
 - updated tuple size
 - changed local variables scope
 - remove/add new lines

---
 block/t10-pi.c | 114 +
 drivers/scsi/sd.c  |   8 ++--
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 
 include/linux/t10-pi.h |   3 ++
 5 files changed, 122 insertions(+), 125 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..6d2acfe 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -184,3 +184,117 @@ static blk_status_t t10_pi_type3_verify_ip(struct 
blk_integrity_iter *iter)
.verify_fn  = t10_pi_type3_verify_ip,
 };
 EXPORT_SYMBOL(t10_pi_type3_ip);
+
+/**
+ * t10_pi_prepare - prepare PI prior submitting request to device
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Remap protection information to match the
+ * physical LBA.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   struct t10_pi_tuple *pi;
+   void *p, *pmap;
+   unsigned int j;
+
+   pmap = kmap_atomic(iv.bv_page) + iv.bv_offset;
+   p = pmap;
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   pi = (struct t10_pi_tuple *)p;
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   p += tuple_sz;
+   }
+
+   kunmap_atomic(pmap);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(t10_pi_prepare);
+
+/**
+ * t10_pi_complete - prepare PI prior returning request to the block layer
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ * @intervals:   total elements to prepare
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Since the physical start sector was submitted
+ * to the device, we should remap it back to virtual values expected by the
+ * block layer.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_complete(struct request *rq, u8 protection_type,
+unsigned int intervals)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   bip_for_each_vec(iv, bip, iter) {
+   struct t10_pi_tuple *pi;
+   void *p, *pmap;
+   unsigned int j;
+
+  

[PATCH v3 1/3] block: move ref_tag calculation func to the block layer

2018-07-25 Thread Max Gurtovoy
Currently this function is implemented in the scsi layer, but its
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 drivers/infiniband/ulp/iser/iser_memory.c | 2 +-
 drivers/nvme/host/core.c  | 3 +--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c  | 2 +-
 drivers/scsi/sd_dif.c | 4 ++--
 include/linux/t10-pi.h| 6 ++
 include/scsi/scsi_cmnd.h  | 7 +--
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index ca844a9..130bf16 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -311,7 +311,7 @@ void iser_unreg_mem_fastreg(struct iscsi_iser_task 
*iser_task,
 {
domain->sig_type = IB_SIG_TYPE_T10_DIF;
domain->sig.dif.pi_interval = scsi_prot_interval(sc);
-   domain->sig.dif.ref_tag = scsi_prot_ref_tag(sc);
+   domain->sig.dif.ref_tag = t10_pi_ref_tag(sc->request);
/*
 * At the moment we hard code those, but in the future
 * we will take them from sc.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf65501..191177b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -627,8 +627,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
break;
}
}
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b8d131a..dd738ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4568,7 +4568,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd 
*scmd, bool pending)
MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG |
MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD;
mpi_request->CDB.EEDP32.PrimaryReferenceTag =
-   cpu_to_be32(scsi_prot_ref_tag(scmd));
+   cpu_to_be32(t10_pi_ref_tag(scmd->request));
break;
 
case SCSI_PROT_DIF_TYPE3:
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 9035380..d8de43d 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -124,7 +124,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
if (sdkp->protection_type == T10_PI_TYPE3_PROTECTION)
return;
 
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -176,7 +176,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int 
good_bytes)
return;
 
intervals = good_bytes / scsi_prot_interval(scmd);
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index c6aa8a3..635855e 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -37,6 +37,12 @@ struct t10_pi_tuple {
 #define T10_PI_APP_ESCAPE cpu_to_be16(0xffff)
 #define T10_PI_REF_ESCAPE cpu_to_be32(0xffffffff)
 
+static inline u32 t10_pi_ref_tag(struct request *rq)
+{
+   return blk_rq_pos(rq) >>
+   (rq->q->integrity.interval_exp - 9) & 0xffffffff;
+}
+
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const struct blk_integrity_profile t10_pi_type1_ip;
 extern const struct blk_integrity_profile t10_pi_type3_crc;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index aaf1e97..cae229b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -313,12 +314,6 @@ static inline unsigned int scsi_prot_interval(struct 
scsi_cmnd *scmd)
return scmd->device->sector_size;
 }
 
-static inline u32 scsi_prot_ref_tag(struct scsi_cmnd *scmd)
-{
-   return blk_rq_pos(scmd->request) >>
-   (ilog2(scsi_prot_interval(scmd)) - 9) & 0xffffffff;
-}
-
 static inline unsigned scsi_prot_sg_count(struct scsi_cmnd *cmd)
 {
return cmd->prot_sdb ? cmd->prot_sdb->table.nents : 0;
-- 
1.8.3.1



[PATCH v3 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-25 Thread Max Gurtovoy
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v2:
 - add Reviewed-by tag
changes from v1:
 - check status of nvme_req instead of converting to BLK_STS

---
 drivers/nvme/host/core.c | 18 
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 20 insertions(+), 82 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 191177b..b57abe5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -617,6 +617,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   t10_pi_prepare(req, ns->pi_type);
}
 
switch (ns->pi_type) {
@@ -637,6 +639,22 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
*ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_req(req)->status == 0) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   t10_pi_complete(req, ns->pi_type,
+   blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33d..dfc01ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, 
sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddd441b..03beac5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct 
request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- * The virtual start sector is the one that was originally submitted by the
- * block layer. Due to partitioning, MD/DM cloning, etc. the actual physical
- * start sector may be different. Remap protection information to match the
- * physical LBA on writes, and back to the original seed on reads.
- *
- * Type 0 and 3 do not have a ref tag, so no remapping required.
- */
-static void nvme_dif_remap(struct request *req,
-   void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
-{
-   struct nvme_ns *ns = req->rq_disk->private_data;
-   struct bio_integrity_payload *bip;
-   struct t10_pi_tuple *pi;
-   void *p, *pmap;
-   u32 i, nlb, ts, phys, virt;
-
-   if (!ns->pi_type || ns->pi_type == NVME_NS_DPS_PI_TYPE3)
-   return;
-
-   bip = bio_integrity(req->bio);
-   if (!bip)
-   return;
-
-   pmap = km

Re: [PATCH 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-24 Thread Max Gurtovoy




On 7/24/2018 11:33 PM, Keith Busch wrote:

On Tue, Jul 24, 2018 at 04:33:41PM +0300, Max Gurtovoy wrote:

+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   struct t10_pi_tuple *pi = kmap_atomic(iv.bv_page) +
+   iv.bv_offset;
+   unsigned int j;
+
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   pi += tuple_sz;
+   }
+
+   kunmap_atomic(pi);
+   }


Since you're incrementing 'pi', you end up unmapping an address that
you didn't map. It does appears harmless in current kunmap_atomic()
implementation, though.


Ohh, good catch. This code is taken from the SCSI remap, but I guess 
it's time to fix some stuff while we're doing this patchset.

I'll use another variable pi_map that will be unmapped.

in the nvme code:

pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_offset;

...
but

kunmap_atomic(pmap);

should we unmap the same bv_page we mapped (without the offset)?



You are also incrementing 'pi' by too many bytes since it is of type
struct t10_pi_tuple. The nvme driver used void* to make the pointer
arithmetic easier.



I'll use void*.
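For the record, the inner loop shape I'll switch to (variables as in the 
patch; this is what the later respins in this thread end up using): keep the 
kmap_atomic() return value in its own variable so the exact mapped address is 
what gets unmapped, and step through the tuples with a void pointer so the 
arithmetic is in bytes:

	pmap = kmap_atomic(iv.bv_page);
	p = pmap + iv.bv_offset;	/* void *p: advances in bytes, not sizeof(*pi) units */
	for (j = 0; j < iv.bv_len; j += tuple_sz) {
		struct t10_pi_tuple *pi = p;

		if (be32_to_cpu(pi->ref_tag) == virt)
			pi->ref_tag = cpu_to_be32(ref_tag);
		virt++;
		ref_tag++;
		p += tuple_sz;
	}
	kunmap_atomic(pmap);		/* unmap exactly what kmap_atomic() returned */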
Thanks



Re: [PATCH 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-24 Thread Max Gurtovoy




On 7/24/2018 4:55 PM, Christoph Hellwig wrote:

+/*
+ * The virtual start sector is the one that was originally submitted
+ * by the block layer. Due to partitioning, MD/DM cloning, etc. the
+ * actual physical start sector is likely to be different.  Remap
+ * protection information to match the physical LBA.
+ *
+ * From SCSI protocol perspective there's a slight difference between
+ * Type 1 and 2.  The latter uses command's 32-byte exclusively, and the
+ * reference tag is seeded in the command.  This gives us the potential to
+ * avoid virt->phys remapping during write.  However, at read time we
+ * don't know whether the virt sector is the same as when we wrote it
+ * (we could be reading from real disk as opposed to MD/DM device.  So
+ * we always remap Type 2 making it identical to Type 1.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */


Please convert the comments for  t10_pi_prepare/t10_pi_complete to
kerneldoc format, and remove the mention of 32-byte CDBs, which
are SCSI specific.



something like this:

/**
 * t10_pi_prepare() - prepare PI prior submitting request to device
 * @rq:   request with PI that should be prepared
 * @protection_type:  PI type (Type 1/Type 2/Type 3)
 *
 * Description:
 *For Type 1/Type 2, the virtual start sector is the one that was
 *originally submitted by the block layer for the ref_tag usage. Due to
 *partitioning, MD/DM cloning, etc. the actual physical start sector is
 *likely to be different. Remap protection information to match the
 *physical LBA.
 *
 *Type 3 does not have a reference tag so no remapping is required.
 */


Otherwise this looks good to me.

Note that after this patch only sd_dif_config_host is left in sd_dif.c,
at which point it might be worth merging into sd.c as a separate patch.



[PATCH 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-24 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v1 (Christoph, Martin and Keith comments):
 - moved the functions to t10-pi.c
 - updated tuple size
 - changed local variables scope
 - remove/add new lines

---
 block/t10-pi.c | 100 +++
 drivers/scsi/sd.c  |   8 ++--
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 -
 include/linux/t10-pi.h |   3 ++
 5 files changed, 108 insertions(+), 125 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..5f2a177 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -184,3 +184,103 @@ static blk_status_t t10_pi_type3_verify_ip(struct 
blk_integrity_iter *iter)
.verify_fn  = t10_pi_type3_verify_ip,
 };
 EXPORT_SYMBOL(t10_pi_type3_ip);
+
+/*
+ * The virtual start sector is the one that was originally submitted
+ * by the block layer. Due to partitioning, MD/DM cloning, etc. the
+ * actual physical start sector is likely to be different.  Remap
+ * protection information to match the physical LBA.
+ *
+ * From SCSI protocol perspective there's a slight difference between
+ * Type 1 and 2.  The latter uses command's 32-byte exclusively, and the
+ * reference tag is seeded in the command.  This gives us the potential to
+ * avoid virt->phys remapping during write.  However, at read time we
+ * don't know whether the virt sector is the same as when we wrote it
+ * (we could be reading from real disk as opposed to MD/DM device.  So
+ * we always remap Type 2 making it identical to Type 1.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   struct t10_pi_tuple *pi = kmap_atomic(iv.bv_page) +
+   iv.bv_offset;
+   unsigned int j;
+
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   pi += tuple_sz;
+   }
+
+   kunmap_atomic(pi);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(t10_pi_prepare);
+
+/*
+ * Remap physical sector values in the reference tag to the virtual
+ * values expected by the block layer.
+ */
+void t10_pi_complete(struct request *rq, u8 protection_type,
+unsigned int intervals)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   bip_for_each_vec(iv, bip, iter) {
+   struct t10_pi_tuple *pi = kmap_atomic(iv.bv_page) +
+   iv.bv_offset;
+   unsigned int j;
+
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   if (intervals == 0) {
+   kunmap_atomic(pi);
+   return;
+   }
+   if (be32_to_cpu(pi->ref_tag) == ref_tag)
+   pi->ref_tag = cpu_to_be32(virt);
+   virt++;
+   ref_tag++;
+   interval

[PATCH 1/3] block: move ref_tag calculation func to the block layer

2018-07-24 Thread Max Gurtovoy
Currently this function is implemented in the scsi layer, but its
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 drivers/infiniband/ulp/iser/iser_memory.c | 2 +-
 drivers/nvme/host/core.c  | 3 +--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c  | 2 +-
 drivers/scsi/sd_dif.c | 4 ++--
 include/linux/t10-pi.h| 6 ++
 include/scsi/scsi_cmnd.h  | 7 +--
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index ca844a9..130bf16 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -311,7 +311,7 @@ void iser_unreg_mem_fastreg(struct iscsi_iser_task 
*iser_task,
 {
domain->sig_type = IB_SIG_TYPE_T10_DIF;
domain->sig.dif.pi_interval = scsi_prot_interval(sc);
-   domain->sig.dif.ref_tag = scsi_prot_ref_tag(sc);
+   domain->sig.dif.ref_tag = t10_pi_ref_tag(sc->request);
/*
 * At the moment we hard code those, but in the future
 * we will take them from sc.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf65501..191177b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -627,8 +627,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
break;
}
}
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b8d131a..dd738ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4568,7 +4568,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd 
*scmd, bool pending)
MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG |
MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD;
mpi_request->CDB.EEDP32.PrimaryReferenceTag =
-   cpu_to_be32(scsi_prot_ref_tag(scmd));
+   cpu_to_be32(t10_pi_ref_tag(scmd->request));
break;
 
case SCSI_PROT_DIF_TYPE3:
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 9035380..d8de43d 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -124,7 +124,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
if (sdkp->protection_type == T10_PI_TYPE3_PROTECTION)
return;
 
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -176,7 +176,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int 
good_bytes)
return;
 
intervals = good_bytes / scsi_prot_interval(scmd);
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index c6aa8a3..635855e 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -37,6 +37,12 @@ struct t10_pi_tuple {
 #define T10_PI_APP_ESCAPE cpu_to_be16(0xffff)
 #define T10_PI_REF_ESCAPE cpu_to_be32(0xffffffff)
 
+static inline u32 t10_pi_ref_tag(struct request *rq)
+{
+   return blk_rq_pos(rq) >>
+   (rq->q->integrity.interval_exp - 9) & 0xffffffff;
+}
+
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const struct blk_integrity_profile t10_pi_type1_ip;
 extern const struct blk_integrity_profile t10_pi_type3_crc;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index aaf1e97..cae229b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -313,12 +314,6 @@ static inline unsigned int scsi_prot_interval(struct 
scsi_cmnd *scmd)
return scmd->device->sector_size;
 }
 
-static inline u32 scsi_prot_ref_tag(struct scsi_cmnd *scmd)
-{
-   return blk_rq_pos(scmd->request) >>
-   (ilog2(scsi_prot_interval(scmd)) - 9) & 0xffffffff;
-}
-
 static inline unsigned scsi_prot_sg_count(struct scsi_cmnd *cmd)
 {
return cmd->prot_sdb ? cmd->prot_sdb->table.nents : 0;
-- 
1.8.3.1



[PATCH 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-24 Thread Max Gurtovoy
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v1:
 - check status of nvme_req instead of converting to BLK_STS

---
 drivers/nvme/host/core.c | 18 
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 20 insertions(+), 82 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 191177b..b57abe5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -617,6 +617,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   t10_pi_prepare(req, ns->pi_type);
}
 
switch (ns->pi_type) {
@@ -637,6 +639,22 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
*ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_req(req)->status == 0) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   t10_pi_complete(req, ns->pi_type,
+   blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33d..dfc01ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, 
sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddd441b..03beac5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct 
request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- * The virtual start sector is the one that was originally submitted by the
- * block layer. Due to partitioning, MD/DM cloning, etc. the actual physical
- * start sector may be different. Remap protection information to match the
- * physical LBA on writes, and back to the original seed on reads.
- *
- * Type 0 and 3 do not have a ref tag, so no remapping required.
- */
-static void nvme_dif_remap(struct request *req,
-   void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
-{
-   struct nvme_ns *ns = req->rq_disk->private_data;
-   struct bio_integrity_payload *bip;
-   struct t10_pi_tuple *pi;
-   void *p, *pmap;
-   u32 i, nlb, ts, phys, virt;
-
-   if (!ns->pi_type || ns->pi_type == NVME_NS_DPS_PI_TYPE3)
-   return;
-
-   bip = bio_integrity(req->bio);
-   if (!bip)
-   return;
-
-   pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_o

Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-24 Thread Max Gurtovoy




On 7/24/2018 4:54 AM, Martin K. Petersen wrote:


Christoph,


+void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
+  u32 ref_tag)
+{


Maybe call this blk_t10_pi_prepare?


The rest of these functions have a blk_integrity_ prefix. So either
stick with that or put the functions in t10-pi.c and use a t10_pi_
prefix.

I'm a bit torn on placement since the integrity metadata could contain
other stuff than T10 PI. But the remapping is very specific to T10 PI.


diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9421d9877730..4186bf027c59 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1119,7 +1119,9 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
*SCpnt)
SCpnt->cmnd[0] = WRITE_6;
  
  		if (blk_integrity_rq(rq))

-   sd_dif_prepare(SCpnt);
+   blk_integrity_dif_prepare(SCpnt->request,
+ sdkp->protection_type,
+ scsi_prot_ref_tag(SCpnt));


scsi_prot_ref_tag could be moved to the block layer as it only uses
the sector in the request and the sector size, which we can get
from the gendisk as well.  We then don't need to pass it to the function.


For Type 2, the PI can be at intervals different from the logical block
size (although we don't support that yet). We should use the
blk_integrity profile interval instead of assuming sector size.

And wrt. Keith's comment: The tuple_size should be the one from the
integrity profile as well, not sizeof(struct t10_pi_tuple).



Ok so I'll use rq->q->integrity.tuple_size as the tuple_sz and increment 
j and pi according to it, right ?
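
Something along these lines, using the t10_pi_ naming Martin suggested (rough,
untested sketch based on the blk_integrity fields discussed above; the real
patch may end up different, and checks like BIP_MAPPED_INTEGRITY are omitted
here for brevity):

	static void t10_pi_prepare(struct request *rq, u32 ref_tag)
	{
		unsigned int tuple_sz = rq->q->integrity.tuple_size;
		u32 phys = ref_tag, virt;
		struct bio *bio;

		__rq_for_each_bio(bio, rq) {
			struct bio_integrity_payload *bip = bio_integrity(bio);
			struct bio_vec iv;
			struct bvec_iter iter;
			unsigned int j;

			virt = bip_get_seed(bip) & 0xffffffff;

			bip_for_each_vec(iv, bip, iter) {
				/* step through the PI buffer using the profile tuple size */
				void *p = kmap_atomic(iv.bv_page) + iv.bv_offset;

				for (j = 0; j < iv.bv_len; j += tuple_sz) {
					struct t10_pi_tuple *pi = p;

					if (be32_to_cpu(pi->ref_tag) == virt)
						pi->ref_tag = cpu_to_be32(phys);
					virt++;
					phys++;
					p += tuple_sz;
				}
				kunmap_atomic(p);
			}
		}
	}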


[PATCH 2/2] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-22 Thread Max Gurtovoy
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 drivers/nvme/host/core.c | 23 +--
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 23 insertions(+), 84 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..0d94d3eb641c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -591,6 +591,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
	nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
 
if (ns->ms) {
+   u32 ref_tag = nvme_block_nr(ns, blk_rq_pos(req));
/*
 * If formated with metadata, the block layer always provides a
 * metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled.  Else
@@ -601,6 +602,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   blk_integrity_dif_prepare(req, ns->pi_type, ref_tag);
}
 
switch (ns->pi_type) {
@@ -611,8 +614,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(ref_tag);
break;
}
}
@@ -622,6 +624,23 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
*ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_error_status(req) == BLK_STS_OK) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   blk_integrity_dif_complete(req, ns->pi_type,
+  nvme_block_nr(ns, blk_rq_pos(req)),
+  blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33df3b2f..dfc01ffb30df 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, 
sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl);
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba943f211687..20851d17acf3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct 
request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- 

[PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-22 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---
 block/blk-integrity.c  | 111 
 drivers/scsi/sd.c  |  12 --
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 -
 include/linux/blkdev.h |  14 ++
 5 files changed, 134 insertions(+), 125 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 6121611e1316..66b095a866d3 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -21,6 +21,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -451,3 +452,113 @@ void blk_integrity_del(struct gendisk *disk)
	kobject_del(&disk->integrity_kobj);
	kobject_put(&disk->integrity_kobj);
 }
+
+/*
+ * The virtual start sector is the one that was originally submitted
+ * by the block layer. Due to partitioning, MD/DM cloning, etc. the
+ * actual physical start sector is likely to be different.  Remap
+ * protection information to match the physical LBA.
+ *
+ * From a protocol perspective there's a slight difference between
+ * Type 1 and 2.  The latter uses 32-byte commands exclusively, and the
+ * reference tag is seeded in the command.  This gives us the potential to
+ * avoid virt->phys remapping during write.  However, at read time we
+ * don't know whether the virt sector is the same as when we wrote it
+ * (we could be reading from a real disk as opposed to an MD/DM device).  So
+ * we always remap Type 2 making it identical to Type 1.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
+  u32 ref_tag)
+{
+   const int tuple_sz = sizeof(struct t10_pi_tuple);
+   struct bio *bio;
+   struct t10_pi_tuple *pi;
+   u32 phys, virt;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   phys = ref_tag;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   struct bio_vec iv;
+   struct bvec_iter iter;
+   unsigned int j;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   virt = bip_get_seed(bip) & 0xffffffff;
+
+   bip_for_each_vec(iv, bip, iter) {
+   pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+
+   for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {
+
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(phys);
+
+   virt++;
+   phys++;
+   }
+
+   kunmap_atomic(pi);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(blk_integrity_dif_prepare);
+
+/*
+ * Remap physical sector values in the reference tag to the virtual
+ * values expected by the block layer.
+ */
+void blk_integrity_dif_complete(struct request *rq, u8 protection_type,
+   u32 ref_tag, unsigned int intervals)
+{
+   const int tuple_sz = sizeof(struct t10_pi_tuple);
+   struct bio *bio;
+   struct t10_pi_tuple *pi;
+   unsigned int j;
+   u32 phys, virt;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   phys = ref_tag;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   virt = bip_get_seed(bip) & 0xffffffff;
+
+   bip_for_each_vec(iv, bip, iter) {
+   pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+
+   for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {
+
+   if (intervals == 0) {
+   kunmap_atomic(pi);
+   return;
+   }
+
+   if (be32_to_cpu(pi->ref_tag) == phys)
+   pi->ref_tag = cpu_to_be32(virt);
+
+   virt++;
+   phys++;
+   intervals--;
+   }
+
+   kunmap_atomic(pi);
+   }
+   }
+}
+EXPORT_SYMBOL(blk_integrity_dif_complete);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9421d9877730..4186bf027c59 100644
--- a/drivers

Re: how can one drain MQ request queue ?

2018-02-22 Thread Max Gurtovoy



On 2/22/2018 4:59 AM, Ming Lei wrote:

Hi Max,


Hi Ming,



On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:

hi all,
is there a way to drain a blk-mq based request queue (similar to
blk_drain_queue for non MQ) ?


Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
based request queue, but it may not work well when the hardware is broken.


I tried that, but the path failover takes ~cmd_timeout seconds and this 
is not good enough...






I try to fix the following situation:
Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
ports during traffic using fio and making sure the traffic never fails.

when the switch port goes down the initiator driver start an error recovery


What is the code you are referring to?


from nvme_rdma driver:

static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, err_work);

	nvme_stop_keep_alive(&ctrl->ctrl);

	if (ctrl->ctrl.queue_count > 1) {
		nvme_stop_queues(&ctrl->ctrl);
		blk_mq_tagset_busy_iter(&ctrl->tag_set,
					nvme_cancel_request, &ctrl->ctrl);
		nvme_rdma_destroy_io_queues(ctrl, false);
	}

	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
				nvme_cancel_request, &ctrl->ctrl);
	nvme_rdma_destroy_admin_queue(ctrl, false);

	/*
	 * queues are not alive anymore, so restart the queues to fail fast
	 * new IO
	 */
	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
	nvme_start_queues(&ctrl->ctrl);

	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
		/* state change failure should never happen */
		WARN_ON_ONCE(1);
		return;
	}

nvme_rdma_reconnect_or_remove(ctrl);
}





process
- blk_mq_quiesce_queue for each namespace request queue


blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
low level driver, and new requests still can be allocated, but can't be
dispatched until the queue becomes unquiesced.
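
To illustrate the difference (not from any patch, just the semantics as
described above):

	blk_mq_quiesce_queue(q);	/* dispatch paused: no ->queue_rq() calls */
	/* ... requests can still be allocated and queued ... */
	blk_mq_unquiesce_queue(q);

	blk_mq_freeze_queue(q);		/* waits until all outstanding requests complete */
	/* ... queue fully drained, new submitters block ... */
	blk_mq_unfreeze_queue(q);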


- cancel all requests of the tagset using blk_mq_tagset_busy_iter


Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
requests, and it depends on the implementation of the busy_tag_iter_fn;
timed-out requests can't be covered by blk_mq_tagset_busy_iter().


How can we deal with timed-out commands ?




So blk_mq_tagset_busy_iter() is often used in the error recovery path, such
as nvme_dev_disable(), which is usually used when resetting the PCIe NVMe controller.


- destroy the QPs/RDMA connections and MR pools
- blk_mq_unquiesce_queue for each namespace request queue
- reconnect to the target (after creating RDMA resources again)

During the QP destruction, I see a warning that not all the memory regions
were returned to the mr_pool. For every request we get from the block layer
(well, almost every request) we get an MR from the MR pool.
So what I see is that, depending on the timing, some requests are
dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
the QP and the MR pool. Probably these requests were inserted during
quiescing,


Yes.


maybe we need to update the nvmf_check_init_req to check that the ctrl 
is in NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think 
about it and test it.
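
A very rough, untested sketch of that idea (names approximate; connect and
passthrough commands would still need the nvmf_check_init_req() treatment):

	static blk_status_t queue_rq_sketch(struct blk_mq_hw_ctx *hctx,
					    const struct blk_mq_queue_data *bd)
	{
		struct nvme_rdma_queue *queue = hctx->driver_data;
		struct nvme_rdma_ctrl *ctrl = queue->ctrl;

		/* fail fast instead of touching a QP/MR pool that is being torn down */
		if (unlikely(ctrl->ctrl.state != NVME_CTRL_LIVE))
			return BLK_STS_IOERR;

		/* ... normal MR allocation and command posting ... */
		return BLK_STS_OK;
	}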





and I want to flush/drain them before I destroy the QP.


As mentioned above, you can't do that by blk_mq_quiesce_queue() &
blk_mq_tagset_busy_iter().

The PCIe NVMe driver takes two steps for the error recovery: nvme_dev_disable() 
&
nvme_reset_work(), and you may consider the similar approach, but the in-flight
requests won't be drained in this case because they can be requeued.

Could you explain a bit what your exact problem is?


The problem is that I assign an MR from the QP's mr_pool for each call to 
nvme_rdma_queue_rq. During the error recovery I destroy the QP and the 
mr_pool, *but* some MRs are missing and not returned to the pool.





Thanks,
Ming



Thanks,
Max.



how can one drain MQ request queue ?

2018-02-20 Thread Max Gurtovoy

hi all,
is there a way to drain a blk-mq based request queue (similar to 
blk_drain_queue for non MQ) ?


I try to fix the following situation:
Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch 
ports during traffic using fio and making sure the traffic never fails.


when the switch port goes down the initiator driver starts an error 
recovery process

- blk_mq_quiesce_queue for each namespace request queue
- cancel all requests of the tagset using blk_mq_tagset_busy_iter
- destroy the QPs/RDMA connections and MR pools
- blk_mq_unquiesce_queue for each namespace request queue
- reconnect to the target (after creating RDMA resources again)

During the QP destruction, I see a warning that not all the memory 
regions were returned to the mr_pool. For every request we get from the 
block layer (well, almost every request) we get an MR from the MR pool.
So what I see is that, depending on the timing, some requests are 
dispatched/completed after we blk_mq_unquiesce_queue and after we 
destroy the QP and the MR pool. Probably these requests were inserted 
during quiescing, and I want to flush/drain them before I destroy the QP.


Is there a way in the block layer to do this (we don't want to 
destroy the tagset and the request_queue on each reconnection)?


I'm open to suggestions :)

Cheers,
Max.


Re: kernel 4.14 how to install Mellanox connectx-3 infiniband driver?

2017-11-09 Thread Max Gurtovoy



On 11/9/2017 5:20 PM, Tony Yang wrote:

Hi, All
I downloaded the nvme-with-multipath kernel; the kernel version is
4.14. I ran into a problem: I use the Mellanox ConnectX-3 InfiniBand
driver, but the 4.14 kernel is too new to install the InfiniBand
driver on. Has anyone run into the same situation? How can it be
solved? Thank you for your help.



Are you talking about MLNX_OFED? What do you mean by "infiniband driver"?


Re: [PATCH 04/17] block: add a blk_steal_bios helper

2017-10-24 Thread Max Gurtovoy



On 10/23/2017 5:51 PM, Christoph Hellwig wrote:

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
---
  block/blk-core.c   | 20 
  include/linux/blkdev.h |  2 ++
  2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8c80f39f5fe..e804529e65a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2767,6 +2767,26 @@ struct request *blk_fetch_request(struct request_queue 
*q)
  }
  EXPORT_SYMBOL(blk_fetch_request);
  
+/*

+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */


Maybe we can add to the comment that "list" is the destination for the 
stolen bios.



+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+   if (rq->bio) {
+   if (list->tail)
+   list->tail->bi_next = rq->bio;
+   else
+   list->head = rq->bio;
+   list->tail = rq->biotail;


if list->tail != NULL don't we lose the "list->tail->bi_next = rq->bio;" 
assignment after assigning "list->tail = rq->biotail;" ?



+   }
+
+   rq->bio = NULL;


we can move this NULL assignment inside the big "if", but I'm not sure 
about the next two assignments (a sketch of what I mean follows the hunk 
below).

Anyway, not a big deal.


+   rq->biotail = NULL;
+   rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+




Re: [PATCH] blk-mq: Iterate also over sched_tags requests at blk_mq_tagset_iter()

2017-10-23 Thread Max Gurtovoy



diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c81b40e..c290de0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set 
*set, void *data,

  }
  }
+    for (i = 0; i < set->nr_hw_queues; i++) {
+    struct blk_mq_tags *sched_tags = set->sched_tags[i];
+
+    if (!sched_tags)
+    continue;
+
+    for (j = 0; j < sched_tags->nr_tags; j++) {
+    if (!sched_tags->static_rqs[j])
+    continue;
+
+    ret = fn(data, sched_tags->static_rqs[j]);
+    if (ret)
+    goto out;
+    }
+    }


If both a scheduler tag and a driver tag have been assigned to a 
request, can this cause

blk_mq_tagset_iter() to call fn() twice for the same request?


It will; it's not a big issue for the reinit functionality, but it might be if
used for something else. I don't think it's a good idea to go down this
route.


Don't you think sched_tags should be part of the tagset (as the driver tags are)?
If so, blk_mq_tagset_iter should call fn() on both scheduler and driver 
tags. Otherwise, let's call it blk_mq_tags_iter instead and pass a struct 
blk_mq_tags *.


The private case of fn() == reinit() for NVMEoF RDMA can be solved by an MR 
pool, but maybe we should look at the general case too.
Either way, we currently can't use a scheduler for NVMEoF RDMA 
because of that issue.
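
A sketch of the blk_mq_tags_iter() shape I mean (illustrative only, not a
tested patch): it walks a single struct blk_mq_tags, so it could be called
once for the driver tags and once for sched_tags:

	static int blk_mq_tags_iter(struct blk_mq_tags *tags, void *data,
				    int (*fn)(void *, struct request *))
	{
		int i, ret = 0;

		if (!tags)
			return 0;

		for (i = 0; i < tags->nr_tags; i++) {
			if (!tags->static_rqs[i])
				continue;

			ret = fn(data, tags->static_rqs[i]);
			if (ret)
				break;
		}

		return ret;
	}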


Re: [PATCH 0/3] Move tagset reinit to its only current consumer, nvme-core

2017-09-07 Thread Max Gurtovoy



On 9/6/2017 6:30 PM, Sagi Grimberg wrote:

As suggested by Bart Van Assche, get rid of
blk_mq_reinit_tagset and move it to nvme-core (its
only current consumer).

Instead, introduce a more generic tagset iterator helper.

Sagi Grimberg (3):
   block: introduce blk_mq_tagset_iter
   nvme: introduce nvme_reinit_tagset
   block: remove blk_mq_reinit_tagset

  block/blk-mq-tag.c   | 11 +--
  drivers/nvme/host/core.c | 13 +
  drivers/nvme/host/fc.c   |  3 ++-
  drivers/nvme/host/nvme.h |  2 ++
  drivers/nvme/host/rdma.c |  7 +++
  include/linux/blk-mq.h   |  4 ++--
  6 files changed, 27 insertions(+), 13 deletions(-)



The series looks good to me,

Reviewed-by: Max Gurtovoy <m...@mellanox.com>

BTW, if we're talking about the reinit_tagset, can you explain the 
motivation for dereg_mr and allocating a new MR for the RDMA transport layer?

We don't do it in iSER/SRP, so I wonder what is special here.
I got some local protection errors in MP tests and under high-load fio 
(which caused cmd TMOs to expire, the cmds to be aborted, and then a reconnect).
I wonder if there is a connection between the two. I'm also debugging a 
situation where we might free/dereg/local-invalidate the MR before 
the QP has moved to the ERR state.


Re: [PATCH] blk-mq: Fix queue usage on failed request allocation

2017-08-15 Thread Max Gurtovoy



On 8/14/2017 11:40 PM, Keith Busch wrote:

blk_mq_get_request() does not release the caller's queue usage counter
when allocation fails. The caller still needs to account for its own
queue usage when it is unable to allocate a request.

Fixes: 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when 
allocating rq failed")

Reported-by: Max Gurtovoy <m...@mellanox.com>
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---


tested with 4.13-rc5+ using the following commands in a loop:

1. modprobe nvme
2. sleep 10
3. modprobe -r nvme

Looks good,

Tested-by: Max Gurtovoy <m...@mellanox.com>


Re: [BUG] nvme driver crash

2017-07-27 Thread Max Gurtovoy



On 7/26/2017 10:12 PM, Omar Sandoval wrote:

On Wed, Jul 26, 2017 at 10:34:43AM -0700, Christoph Hellwig wrote:

On Tue, Jul 25, 2017 at 03:24:08PM -0700, Shaohua Li wrote:

Disable CONIFG_SMP, kernel crashes at boot time, here is the log.


I can reproduce the issue.  Unfortunately the address in the bug
doesn't make any sense to me when resolving it using gdb, as it just
points to the line where blk_mq_init_queue calls
blk_alloc_queue_node.

Can you check if you get better results in your build?


It's crashing on that line because ctrl->tagset is NULL, which is
because...

irq_create_affinity_masks() returns NULL on !CONFIG_SMP
-> pci_irq_get_affinity() returns NULL
-> blk_mq_pci_map_queues() returns -EINVAL
-> blk_mq_alloc_tag_set() returns -EINVAL
-> nvme_dev_add() doesn't set ctrl->tagset

The two-fold fix would be to make the nvme driver handle
blk_mq_alloc_tag_set() failing and to fall back to a dumb mapping in
blk_mq_pci_map_queues(), but I don't know what the best way to do those
is.
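
One possible shape of that dumb-mapping fallback on the blk_mq_pci_map_queues()
side (sketch only, untested):

	int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < set->nr_hw_queues; queue++) {
			mask = pci_irq_get_affinity(pdev, queue);
			if (!mask)
				goto fallback;

			for_each_cpu(cpu, mask)
				set->mq_map[cpu] = queue;
		}

		return 0;

	fallback:
		/* no affinity info (e.g. !CONFIG_SMP): map everything to queue 0 */
		WARN_ON_ONCE(set->nr_hw_queues > 1);
		for_each_possible_cpu(cpu)
			set->mq_map[cpu] = 0;
		return 0;
	}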



Adding Sagi and Keith.

Christoph,
I've send some fix few months ago to that but haven't got a green light:


nvme: don't ignore tagset allocation failures

the nvme_dev_add() function silently ignores failures.
In case blk_mq_alloc_tag_set fails, we hit a NULL deref while
calling blk_mq_init_queue during nvme_alloc_ns with tagset == NULL.
Instead, don't issue the scan_work in case the tagset allocation
failed, and leave the ctrl functional.

Signed-off-by: Max Gurtovoy <m...@mellanox.com>
Reviewed-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/core.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9b3b57f..493722a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2115,9 +2115,9 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
/*
 * Do not queue new scan work when a controller is reset during
-* removal.
+* removal or if the tagset doesn't exist.
 */
-   if (ctrl->state == NVME_CTRL_LIVE)
+   if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
	schedule_work(&ctrl->scan_work);
 }
 EXPORT_SYMBOL_GPL(nvme_queue_scan);


maybe we can rebase and consider it again ?



[PATCH v2 1/1] null_blk: fix error flow for shared tags during module_init

2017-07-06 Thread Max Gurtovoy
In case we use the shared tags feature, blk_mq_alloc_tag_set might fail
during module initialization. In that case, fail the load with a
suitable error code. Also move the tagset initialization process after
defining the number of submission queues.

Signed-off-by: Max Gurtovoy <m...@mellanox.com>
---

Changes from v1:
- fail module loading in case of tagset initialization failure
- call blk_mq_free_tag_set in error flow

---
 drivers/block/null_blk.c |   18 +-
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 71f4422..85c24ca 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -844,9 +844,6 @@ static int __init null_init(void)
queue_mode = NULL_Q_MQ;
}
 
-   if (queue_mode == NULL_Q_MQ && shared_tags)
-   null_init_tag_set(_set);
-
if (queue_mode == NULL_Q_MQ && use_per_node_hctx) {
if (submit_queues < nr_online_nodes) {
pr_warn("null_blk: submit_queues param is set to %u.",
@@ -858,11 +855,19 @@ static int __init null_init(void)
else if (!submit_queues)
submit_queues = 1;
 
+   if (queue_mode == NULL_Q_MQ && shared_tags) {
+   ret = null_init_tag_set(_set);
+   if (ret)
+   return ret;
+   }
+
	mutex_init(&lock);
 
null_major = register_blkdev(0, "nullb");
-   if (null_major < 0)
-   return null_major;
+   if (null_major < 0) {
+   ret = null_major;
+   goto err_tagset;
+   }
 
if (use_lightnvm) {
ppa_cache = kmem_cache_create("ppa_cache", 64 * sizeof(u64),
@@ -891,6 +896,9 @@ static int __init null_init(void)
kmem_cache_destroy(ppa_cache);
 err_ppa:
unregister_blkdev(null_major, "nullb");
+err_tagset:
+   if (queue_mode == NULL_Q_MQ && shared_tags)
+   blk_mq_free_tag_set(_set);
return ret;
 }
 
-- 
1.7.1



Re: [PATCH 1/1] null_blk: fix error flow for shared tags during module_init

2017-07-06 Thread Max Gurtovoy



On 7/6/2017 5:02 PM, Jens Axboe wrote:

On 07/06/2017 07:24 AM, Max Gurtovoy wrote:

In case we use the shared tags feature, blk_mq_alloc_tag_set might fail
during module initialization. Check the return value and default to running
without a shared tag set before continuing. Also move the tagset
initialization process after defining the number of submission queues.


Fail the load instead. For two reasons:

1) We could be loading to test this explicitly. null_blk isn't a regular
   driver, it's not like we're bringing up the system on this thing.
   Hence I feel it's more important that it loads with what we asked for,
   or not load at all.

2) If initializing one tag set fails, the unshared case will surely
   fail too.



Ok. I'll send V2 version soon.


[PATCH 1/1] null_blk: fix error flow for shared tags during module_init

2017-07-06 Thread Max Gurtovoy
In case we use the shared tags feature, blk_mq_alloc_tag_set might fail
during module initialization. Check the return value and default to running
without a shared tag set before continuing. Also move the tagset
initialization process after defining the number of submission queues.

Signed-off-by: Max Gurtovoy <m...@mellanox.com>
---
 drivers/block/null_blk.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 71f4422..0b7e7e1 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -844,9 +844,6 @@ static int __init null_init(void)
queue_mode = NULL_Q_MQ;
}
 
-   if (queue_mode == NULL_Q_MQ && shared_tags)
-   null_init_tag_set(_set);
-
if (queue_mode == NULL_Q_MQ && use_per_node_hctx) {
if (submit_queues < nr_online_nodes) {
pr_warn("null_blk: submit_queues param is set to %u.",
@@ -858,6 +855,15 @@ static int __init null_init(void)
else if (!submit_queues)
submit_queues = 1;
 
+   if (queue_mode == NULL_Q_MQ && shared_tags) {
+   ret = null_init_tag_set(_set);
+   if (ret) {
+   pr_warn("null_blk: Can't use shared tags %d\n", ret);
+   pr_warn("null_blk: defaults to non shared tags\n");
+   shared_tags = false;
+   }
+   }
+
	mutex_init(&lock);
 
null_major = register_blkdev(0, "nullb");
-- 
1.7.1



Re: [PATCH 1/1] blk-mq: map all HWQ also in hyperthreaded system

2017-07-05 Thread Max Gurtovoy



On 7/5/2017 10:59 AM, Johannes Thumshirn wrote:

On Wed, Jun 28, 2017 at 03:44:40PM +0300, Max Gurtovoy wrote:

This patch performs sequential mapping between CPUs and queues.
In case the system has more CPUs than HWQs then there are still
CPUs to map to HWQs. In hyperthreaded system, map the unmapped CPUs
and their siblings to the same HWQ.
This actually fixes a bug where unmapped HWQs were found in a system with
2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
running NVMEoF (which opens up to a maximum of 64 HWQs).


Christoph/Sagi/Keith,

any updates on this patch? Without it I'm not able to run NVMf on a box with 44
cores and 88 threads w/o adding -i 44 to the nvme connect statement.

Thanks,
Johannes



Hi Johannes,
this was already merged to the main tree (Jens added it to his pull request).

Cheers,
Max.


Re: NVMe induced NULL deref in bt_iter()

2017-07-03 Thread Max Gurtovoy



On 7/3/2017 3:03 PM, Ming Lei wrote:

On Mon, Jul 03, 2017 at 01:07:44PM +0300, Sagi Grimberg wrote:

Hi Ming,


Yeah, the above change is correct; for canceling requests in this
way we should use blk_mq_quiesce_queue().


I still don't understand why blk_mq_flush_busy_ctxs should hit a NULL
deref if we don't touch the tagset...


It looks like no one mentioned the steps for reproduction, so it isn't easy
to understand the related use case. Could anyone share the steps for
reproduction?


Hi Ming,
I create 500 namespaces per subsystem (using a CX4 target and a C-IB 
initiator, but I also saw it in a CX5 vs. CX5 setup).
The NULL deref happens when I remove all the configuration on the target (1 
port, 1 subsystem, 500 namespaces, and unloading the nvmet modules) during 
traffic to 1 nvme device/ns from the initiator.
I get the NULL deref in the blk_mq_flush_busy_ctxs function, which calls 
sbitmap_for_each_set, on the initiator. It seems like the "struct 
sbitmap_word *word = &sb->map[i];" is NULL. It actually might not be 
NULL at the beginning of the function and become NULL while running the 
while loop there.






Also, I'm wondering in what case we shouldn't use
blk_mq_quiesce_queue()? Maybe we should unexport blk_mq_stop_hw_queues()
and blk_mq_start_stopped_hw_queues() and use the quiesce/unquiesce
equivalent always?


There is at least one case in which we have to use stop queues:

- when QUEUE_BUSY (now it becomes BLK_STS_RESOURCE) happens, some drivers
need to stop queues to avoid hurting the CPU, such as virtio-blk, ...
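
For reference, a minimal sketch of that stop-queue pattern in a hypothetical
driver's ->queue_rq() (example_hw_submit() is made up; this is not the actual
virtio-blk code):

	static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
					     const struct blk_mq_queue_data *bd)
	{
		int err = example_hw_submit(bd->rq);	/* hypothetical submit helper */

		if (err == -ENOSPC) {
			/* HW ring full: stop the queue so blk-mq stops re-dispatching */
			blk_mq_stop_hw_queue(hctx);
			return BLK_STS_RESOURCE;
		}

		return BLK_STS_OK;
	}

	/* ...and once the hardware frees up slots, the completion path restarts it: */
	blk_mq_start_stopped_hw_queues(q, true);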



The only fishy usage is in nvme_fc_start_fcp_op(), where if submission
failed the code stops the hw queues and delays them, but I think it should
be handled differently...


It looks like the old way of scsi-mq, but scsi has removed this way and
avoids stopping the queue.


Thanks,
Ming



Re: NVMe induced NULL deref in bt_iter()

2017-07-02 Thread Max Gurtovoy



On 7/2/2017 2:56 PM, Sagi Grimberg wrote:



On 02/07/17 13:45, Max Gurtovoy wrote:



On 6/30/2017 8:26 PM, Jens Axboe wrote:

Hi Max,


Hi Jens,



I remembered you reporting this. I think this is a regression introduced
with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
is, but that's not indexable by the tag we find. So I think we need to
guard those with a NULL check. The actual requests themselves are
static, so we know the memory itself isn't going away. But if we race
with completion, we could find a NULL there, validly.

Since you could reproduce it, can you try the below?


I still can repro the null deref with this patch applied.



diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..b856b2827157 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap,
unsigned int bitnr, void *data)
 bitnr += tags->nr_reserved_tags;
 rq = tags->rqs[bitnr];

-if (rq->q == hctx->queue)
+if (rq && rq->q == hctx->queue)
 iter_data->fn(hctx, rq, iter_data->data, reserved);
 return true;
 }
@@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap,
unsigned int bitnr, void *data)
 if (!reserved)
 bitnr += tags->nr_reserved_tags;
 rq = tags->rqs[bitnr];
-
-iter_data->fn(rq, iter_data->data, reserved);
+if (rq)
+iter_data->fn(rq, iter_data->data, reserved);
 return true;
 }


see the attached file for dmesg output.

output of gdb:

(gdb) list *(blk_mq_flush_busy_ctxs+0x48)
0x8127b108 is in blk_mq_flush_busy_ctxs
(./include/linux/sbitmap.h:234).
229
230 for (i = 0; i < sb->map_nr; i++) {
231 struct sbitmap_word *word = &sb->map[i];
232 unsigned int off, nr;
233
234 if (!word->word)
235 continue;
236
237 nr = 0;
238 off = i << sb->shift;


when I change the "if (!word->word)" to "if (word && !word->word)"
I get a null deref at "nr = find_next_bit(&word->word, word->depth,
nr);". Seems like somehow word becomes NULL.

Adding the linux-nvme guys too.
Sagi has mentioned that this can be NULL only if we remove the tagset
while I/O is trying to get a tag; when killing the target we get into
error recovery and periodic reconnects, which do _NOT_ include freeing
the tagset, so this is probably the admin tagset.

Sagi,
you've mentioned a patch for centralizing the treatment of the admin
tagset in the nvme core. I think I missed this patch, so can you
please send a pointer to it and I'll check if it helps?


Hmm,

In the above flow we should not be freeing the tag_set, not even the admin
one. The target keeps removing namespaces and finally removes the
subsystem, which generates an error recovery flow. What we at least try
to do is:

1. mark rdma queues as not live
2. stop all the sw queues (admin and io)
3. fail inflight I/Os
4. restart all sw queues (to fast fail until we recover)

We shouldn't be freeing the tagsets (although we might update them
when we recover and the cpu map has changed - which I don't think is happening).

However, I do see a difference between bt_tags_for_each
and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).

Unrelated to this, I think we should quiesce/unquiesce the admin_q
instead of stop/start because it respects the submission path RCU [1].

It might hide the issue, but given that we never free the tagset it
seems like it's not in nvme-rdma (Max, can you see if this makes the
issue go away?)


Yes, this fixes the null deref issue.
I ran some additional login/logout tests, which passed too.
This fix is also important for stable kernels (with the needed backports of 
the blk_mq_quiesce_queue/blk_mq_unquiesce_queue functions).

You can add my:
Tested-by: Max Gurtovoy <m...@mellanox.com>
Reviewed-by: Max Gurtovoy <m...@mellanox.com>

Let me know if you want me to push this fix to the mailing list to save 
time (can we make it to 4.12 ?)




[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e3996db22738..094873a4ee38 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct
work_struct *work)

if (ctrl->ctrl.queue_count > 1)
		nvme_stop_queues(&ctrl->ctrl);
-   blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+   blk_mq_quiesce_queue(ctrl->ctrl.admin_q);

/* We must take care of fastfail/requeue all our inflight
requests */
if (ctrl->ctrl.queue_count > 1)
@@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct
work_struct *work)
 * queues are not a live anymore, so restart the queues to fail
fast
 * new IO
 */
-   blk_mq_start_stopped_hw_queues(c

Re: [PATCH 1/1] blk-mq: map all HWQ also in hyperthreaded system

2017-06-28 Thread Max Gurtovoy



On 6/28/2017 6:01 PM, Sagi Grimberg wrote:



Can you please test with my patchset on converting nvme-rdma to
MSIX based mapping (I assume you are testing with mlx5 yes)?


Sure. Is V6 the last version of the patchset?
I'll test it with a ConnectX-5 adapter and send the results.


Yes.


I'd be very much interested to know if the original problem
exists with this applied.


it will exist in case set->nr_hw_queues > dev->num_comp_vectors.


We don't ask for more hw queues than num_comp_vectors.



I've tested Sagi's patches and they fix the connection establishment bug 
for NVMEoF.


here are the results:

fio 72 jobs, 128 iodepth.
NVMEoF register_always=N
1 Subsystem, 1 namespace

num_comp_vector is 60 and therefore num_queues is 60.
I ran a comparison against my original patch with 60 queues and also with 64 
queues (possible with my patch because there is no num_comp_vectors limitation).


bs     IOPS (read, queues=60(Sagi)/60(Max)/64(Max))
-----  --------------------------------------------
512    3424.9K/3587.8K/3619.2K
1k     3421.8K/3613.5K/3630.6K
2k     3416.4K/3605.7K/3630.2K
4k     2361.6K/2399.9K/2404.1K
8k     1368.7K/1370.7K/1370.6K
16k    692K/691K/692K
32k    345K/348K/348K
64k    175K/174K/174K
128k   88K/87K/87K

bs     IOPS (write, queues=60(Sagi)/60(Max)/64(Max))
-----  ---------------------------------------------
512    3243.6K/3329.7K/3392.9K
1k     3249.7K/3341.2K/3379.2K
2k     3251.2K/3336.9K/3385.9K
4k     2685.8K/2683.9K/2683.3K
8k     1336.6K/1355.1K/1361.6K
16k    690K/690K/691K
32k    348K/348K/348K
64k    174K/174K/174K
128k   87K/87K/87K

My conclusion is that Sagi's patch is correct (although we see a little 
less performance: 100K-200K less for small block sizes), so you can add:


Tested-by: Max Gurtovoy <m...@mellanox.com>

Nevertheless, we should review and consider pushing my fixes to the 
block layer for other users of this mapping function.


Re: [PATCH 1/1] blk-mq: map all HWQ also in hyperthreaded system

2017-06-28 Thread Max Gurtovoy



On 6/28/2017 5:58 PM, Sagi Grimberg wrote:



+static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
+  const struct cpumask *online_mask)
  {
-return cpu * nr_queues / nr_cpus;
+/*
+ * Non online CPU will be mapped to queue index 0.
+ */
+if (!cpumask_test_cpu(cpu, online_mask))
+return 0;


Why not map offline cpus to what they would've mapped to if they were
online?


I didn't change the logic for offline CPUs.
Should it be done in this patch?




+if (cpu < nr_queues) {
+map[cpu] = cpu_to_queue_index(nr_queues, cpu, online_mask);
+} else {
+first_sibling = get_first_sibling(cpu);
+if (first_sibling == cpu)
+map[cpu] = cpu_to_queue_index(nr_queues, cpu,
online_mask);


This seems to be the only reference to online_mask left (on each side of
the if statement). I think you can just not pass it and use
cpu_online_mask in cpu_to_queue_index.


Another user of cpu_to_queue_index may want to pass a different mask.
Currently it's a static function, so I guess we can do it.


Re: [PATCH 1/1] blk-mq: map all HWQ also in hyperthreaded system

2017-06-28 Thread Max Gurtovoy



On 6/28/2017 5:38 PM, Sagi Grimberg wrote:

Hi Max,


Hi Sagi,




This patch performs sequential mapping between CPUs and queues.
In case the system has more CPUs than HWQs then there are still
CPUs to map to HWQs. In hyperthreaded system, map the unmapped CPUs
and their siblings to the same HWQ.
This actually fixes a bug that found unmapped HWQs in a system with
2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
running NVMEoF (opens upto maximum of 64 HWQs).


The explanation can be a bit clearer...

I still need to take a look at the patch itself, but do note that
ideally we will never get to blk_mq_map_queues since we prefer
to map queues based on MSIX assignments. for nvme-rdma, this is
merely a fallback. And looking ahead, MSIX based mapping should
be the primary mapping logic.


we still have a fallback option in your series, so we surely need some fix 
for blk_mq_map_queues (also for stable kernels, IMO. Jens/Christoph?).




Can you please test with my patchset on converting nvme-rdma to
MSIX based mapping (I assume you are testing with mlx5 yes)?


Sure. Is V6 the last version of the patchset?
I'll test it with a ConnectX-5 adapter and send the results.


I'd be very much interested to know if the original problem
exists with this applied.


it will exist in case set->nr_hw_queues > dev->num_comp_vectors.



I'll take a closer look into the patch.


Thanks.




[PATCH 1/1] blk-mq: map all HWQ also in hyperthreaded system

2017-06-28 Thread Max Gurtovoy
This patch performs sequential mapping between CPUs and queues.
In case the system has more CPUs than HWQs, there are still
CPUs to map to HWQs. In a hyperthreaded system, map the unmapped CPUs
and their siblings to the same HWQ.
This actually fixes a bug where unmapped HWQs were found in a system with
2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
running NVMEoF (which opens up to a maximum of 64 HWQs).
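
For example, on that 72-CPU box with 64 HWQs (and assuming the common
enumeration where CPU i and CPU i+36 are hyperthread siblings), CPUs 0-63
map 1:1 to queues 0-63, and CPUs 64-71 are mapped to the queues of their
siblings (CPUs 28-35), so every HWQ ends up with at least one CPU.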

Performance results running fio (72 jobs, 128 iodepth)
using null_blk (w/w.o patch):

bs     IOPS (read, submit_queues=72)  IOPS (write, submit_queues=72)  IOPS (read, submit_queues=24)  IOPS (write, submit_queues=24)
-----  -----------------------------  ------------------------------  -----------------------------  ------------------------------
512    4890.4K/4723.5K                4524.7K/4324.2K                 4280.2K/4264.3K                3902.4K/3909.5K
1k     4910.1K/4715.2K                4535.8K/4309.6K                 4296.7K/4269.1K                3906.8K/3914.9K
2k     4906.3K/4739.7K                4526.7K/4330.6K                 4301.1K/4262.4K                3890.8K/3900.1K
4k     4918.6K/4730.7K                4556.1K/4343.6K                 4297.6K/4264.5K                3886.9K/3893.9K
8k     4906.4K/4748.9K                4550.9K/4346.7K                 4283.2K/4268.8K                3863.4K/3858.2K
16k    4903.8K/4782.6K                4501.5K/4233.9K                 4292.3K/4282.3K                3773.1K/3773.5K
32k    4885.8K/4782.4K                4365.9K/4184.2K                 4307.5K/4289.4K                3780.3K/3687.3K
64k    4822.5K/4762.7K                2752.8K/2675.1K                 4308.8K/4312.3K                2651.5K/2655.7K
128k   2388.5K/2313.8K                1391.9K/1375.7K                 2142.8K/2152.2K                1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <m...@mellanox.com>
---
 block/blk-mq-cpumap.c |   68 -
 1 files changed, 22 insertions(+), 46 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8e61e86..2cca4fc 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,10 +14,15 @@
 #include "blk.h"
 #include "blk-mq.h"
 
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
- const int cpu)
+static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
+ const struct cpumask *online_mask)
 {
-   return cpu * nr_queues / nr_cpus;
+   /*
+* Non online CPU will be mapped to queue index 0.
+*/
+   if (!cpumask_test_cpu(cpu, online_mask))
+   return 0;
+   return cpu % nr_queues;
 }
 
 static int get_first_sibling(unsigned int cpu)
@@ -36,55 +41,26 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
const struct cpumask *online_mask = cpu_online_mask;
-   unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-   cpumask_var_t cpus;
-
-   if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-   return -ENOMEM;
-
-   cpumask_clear(cpus);
-   nr_cpus = nr_uniq_cpus = 0;
-   for_each_cpu(i, online_mask) {
-   nr_cpus++;
-   first_sibling = get_first_sibling(i);
-   if (!cpumask_test_cpu(first_sibling, cpus))
-   nr_uniq_cpus++;
-   cpumask_set_cpu(i, cpus);
-   }
-
-   queue = 0;
-   for_each_possible_cpu(i) {
-   if (!cpumask_test_cpu(i, online_mask)) {
-   map[i] = 0;
-   continue;
-   }
+   unsigned int cpu, first_sibling;
 
+   for_each_possible_cpu(cpu) {
/*
-* Easy case - we have equal or more hardware queues. Or
-* there are no thread siblings to take into account. Do
-* 1:1 if enough, or sequential mapping if less.
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per core, then map sibling threads to the same queue for
+* performance optimizations.
 */
-   if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-   map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-   queue++;
-   continue;
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu, 
online_mask);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu, 
online_mask);
+   else
+ 

Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-04 Thread Max Gurtovoy


Any feedback is welcome.


Hi Sagi,

The patchset looks good, and of course we can add support for more 
drivers in the future.

Have you run any performance testing with the nvmf initiator?




Sagi Grimberg (6):
  mlx5: convert to generic pci_alloc_irq_vectors
  mlx5: move affinity hints assignments to generic code
  RDMA/core: expose affinity mappings per completion vector
  mlx5: support ->get_vector_affinity
  block: Add rdma affinity based queue mapping helper
  nvme-rdma: use intelligent affinity based queue mappings

 block/Kconfig  |   5 +
 block/Makefile |   1 +
 block/blk-mq-rdma.c|  56 +++
 drivers/infiniband/hw/mlx5/main.c  |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 106 +++--
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|   1 -
 drivers/nvme/host/rdma.c   |  13 +++
 include/linux/blk-mq-rdma.h|  10 ++
 include/linux/mlx5/driver.h|   2 -
 include/rdma/ib_verbs.h|  24 +
 14 files changed, 138 insertions(+), 108 deletions(-)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h



Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper

2017-04-04 Thread Max Gurtovoy



diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
new file mode 100644
index ..d402f7c93528
--- /dev/null
+++ b/block/blk-mq-rdma.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (c) 2017 Sagi Grimberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */


shouldn't you include   and  like in 
commit 8ec2ef2b66ea2f that fixes blk-mq-pci.c ?



+#include 
+#include 
+#include 
+#include 
+#include "blk-mq.h"


Is this include needed ?



+
+/**
+ * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
+ * @set:   tagset to provide the mapping for
+ * @dev:   rdma device associated with @set.
+ * @first_vec: first interrupt vectors to use for queues (usually 0)
+ *
+ * This function assumes the rdma device @dev has at least as many available
+ * interrupt vectors as @set has queues.  It will then query its affinity mask
+ * and build a queue mapping that maps a queue to the CPUs that have irq affinity
+ * for the corresponding vector.
+ *
+ * In case either the driver passed a @dev with less vectors than
+ * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
+ * vector, we fallback to the naive mapping.
+ */
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec)
+{
+   const struct cpumask *mask;
+   unsigned int queue, cpu;
+
+   if (set->nr_hw_queues > dev->num_comp_vectors)
+   goto fallback;
+
+   for (queue = 0; queue < set->nr_hw_queues; queue++) {
+   mask = ib_get_vector_affinity(dev, first_vec + queue);
+   if (!mask)
+   goto fallback;


Christoph,
we can use a fallback also in blk-mq-pci.c in case 
pci_irq_get_affinity fails, right?



+
+   for_each_cpu(cpu, mask)
+   set->mq_map[cpu] = queue;
+   }
+
+   return 0;
+fallback:
+   return blk_mq_map_queues(set);
+}
+EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);


Otherwise, Looks good.

Reviewed-by: Max Gurtovoy <m...@mellanox.com>
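
For reference, patch 6/6 of the series consumes the helper from the ULP side
roughly like this (reconstructed sketch; field names as I recall them, the
exact code is in that patch):

	static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
	{
		struct nvme_rdma_ctrl *ctrl = set->driver_data;

		return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
	}

	static const struct blk_mq_ops nvme_rdma_mq_ops = {
		/* ... */
		.map_queues	= nvme_rdma_map_queues,
	};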