Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-29 Thread jianchao.wang



On 07/27/2018 11:29 PM, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote:
>> Something like:
>>
>> percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>> if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>         ret = -EBUSY;
>>         pm_runtime_mark_last_busy(q->dev);
>> } else {
>>         blk_set_preempt_only(q);
>>         if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>                 ret = -EBUSY;
>>                 pm_runtime_mark_last_busy(q->dev);
>>                 blk_clear_preempt_only(q);
>>         } else {
>>                 q->rpm_status = RPM_SUSPENDING;
>>         }
>> }
> 
> I think this code is racy. Because there is no memory barrier in
> blk_queue_enter() between the percpu_ref_tryget_live() and the
> blk_queue_preempt_only() calls, the context that sets the PREEMPT_ONLY flag
> has to use synchronize_rcu() or call_rcu() to ensure that blk_queue_enter()
> sees the PREEMPT_ONLY flag after it has called percpu_ref_tryget_live().
> See also http://lwn.net/Articles/573497/.

Yes, a synchronize_rcu() is indeed needed here to ensure that a caller of
blk_queue_enter() either gets q_usage_counter successfully, incrementing it,
or fails without incrementing it.
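For concreteness, here is a minimal sketch of the ordering this implies
(illustrative only, not the posted patch; it assumes the suspend path runs
in process context, since synchronize_rcu() may block):

/*
 * Illustrative suspend-side ordering: publish PREEMPT_ONLY first, wait
 * one RCU grace period so every blk_queue_enter() that already passed
 * percpu_ref_tryget_live() is reflected in the counter, then re-check
 * the counter in atomic mode.
 */
blk_set_preempt_only(q);
synchronize_rcu();
percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
if (!percpu_ref_is_zero(&q->q_usage_counter)) {
	/* A normal request slipped in; abort the suspend. */
	ret = -EBUSY;
	pm_runtime_mark_last_busy(q->dev);
	blk_clear_preempt_only(q);
} else {
	q->rpm_status = RPM_SUSPENDING;
}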

Thanks
Jianchao


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-29 Thread jianchao.wang


Hi Bart

On 07/27/2018 11:09 PM, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote:
>> If q_usage_counter is not zero here, we will leave the request_queue in 
>> preempt only mode.
> 
> That's on purpose. If q_usage_counter is not zero then
> blk_pre_runtime_suspend() will return -EBUSY. That error code will be
> passed to blk_post_runtime_suspend() and that will cause that function
> to clear QUEUE_FLAG_PREEMPT_ONLY.
> 

static int sdev_runtime_suspend(struct device *dev)
{
	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
	struct scsi_device *sdev = to_scsi_device(dev);
	int err = 0;

	err = blk_pre_runtime_suspend(sdev->request_queue);
	if (err)
		return err;
	if (pm && pm->runtime_suspend)
		err = pm->runtime_suspend(dev);
	blk_post_runtime_suspend(sdev->request_queue, err);

	return err;
}

If blk_pre_runtime_suspend() returns -EBUSY, blk_post_runtime_suspend() will
not be invoked.
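If the intent is that the flag gets cleared even when the pre hook fails,
one option (a sketch, not the posted patch) is to call the post hook
unconditionally and let it clean up:

static int sdev_runtime_suspend(struct device *dev)
{
	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
	struct scsi_device *sdev = to_scsi_device(dev);
	int err;

	err = blk_pre_runtime_suspend(sdev->request_queue);
	/* No early return: blk_post_runtime_suspend() always runs, so it
	 * can clear QUEUE_FLAG_PREEMPT_ONLY after a failed suspend too. */
	if (!err && pm && pm->runtime_suspend)
		err = pm->runtime_suspend(dev);
	blk_post_runtime_suspend(sdev->request_queue, err);

	return err;
}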

>> The request_queue should be set to preempt only mode only when we
>> confirm we could set rpm_status to RPM_SUSPENDING or RPM_RESUMING.
> 
> Why do you think this?

https://marc.info/?l=linux-scsi&m=133727953625963&w=2
"
If q->rpm_status is RPM_SUSPENDED, they shouldn't do anything -- act as
though the queue is empty.  If q->rpm_status is RPM_SUSPENDING or
RPM_RESUMING, they should hand over the request only if it has the
REQ_PM flag set.
"
In addition, if we set preempt only unconditionally here, normal IO will be
blocked for the whole duration of blk_pre_runtime_suspend. In your patch,
q_usage_counter is switched to atomic mode, which could take some time. Is
that really OK?
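To make those quoted rules concrete, a sketch of the dispatch-side check
(names are illustrative; the quoted mail predates the REQ_PM to RQF_PM
rename, and the legacy blk_pm_peek_request() in block/blk-core.c implements
essentially this policy):

static bool rpm_may_dispatch(struct request_queue *q, struct request *rq)
{
	switch (q->rpm_status) {
	case RPM_SUSPENDED:
		return false;			/* act as if the queue is empty */
	case RPM_SUSPENDING:
	case RPM_RESUMING:
		return rq->rq_flags & RQF_PM;	/* PM requests only */
	default:				/* RPM_ACTIVE */
		return true;
	}
}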

Thanks
Jianchao
> 
> Bart.
> 
> 


Re: [PATCH] block/023: performance test on queue creation & cleanup

2018-07-29 Thread Ming Lei
On Wed, Jul 25, 2018 at 02:33:14PM -0700, Omar Sandoval wrote:
> On Wed, Jul 04, 2018 at 01:29:56PM +0800, Ming Lei wrote:
> > SCSI may have lots of channels, targets or LUNs, so it may
> > take a long time to create and clean up queues.
> > 
> > So introduce block/023, which uses null_blk to run this test
> > in both blk-mq and legacy mode, then compares the two and checks
> > the difference.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  tests/block/023 | 101 
> > 
> >  tests/block/023.out |   2 ++
> >  2 files changed, 103 insertions(+)
> >  create mode 100755 tests/block/023
> >  create mode 100755 tests/block/023.out
> 
> Hi, Ming, is this a regression test for "blk-mq: remove
> synchronize_rcu() from blk_mq_del_queue_tag_set()" and "blk-mq: avoid to
> synchronize rcu inside blk_cleanup_queue()"?

Yeah, it is.

It also shows that destroying a queue may take much more time than
creating one, and there seems to be no solution for that yet.

Thanks,
Ming


[PATCH v5 1/3] block: move ref_tag calculation func to the block layer

2018-07-29 Thread Max Gurtovoy
Currently this function is implemented in the scsi layer, but its
proper place is the block layer, since T10-PI is a general data
integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - fix failures in case kernel configured with !CONFIG_BLK_DEV_INTEGRITY

---
 drivers/infiniband/ulp/iser/iser_memory.c |  2 +-
 drivers/nvme/host/core.c  |  3 +--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c  |  2 +-
 drivers/scsi/sd_dif.c |  4 ++--
 include/linux/t10-pi.h| 10 ++
 include/scsi/scsi_cmnd.h  |  7 +--
 6 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index ca844a9..130bf16 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -311,7 +311,7 @@ void iser_unreg_mem_fastreg(struct iscsi_iser_task *iser_task,
 {
domain->sig_type = IB_SIG_TYPE_T10_DIF;
domain->sig.dif.pi_interval = scsi_prot_interval(sc);
-   domain->sig.dif.ref_tag = scsi_prot_ref_tag(sc);
+   domain->sig.dif.ref_tag = t10_pi_ref_tag(sc->request);
/*
 * At the moment we hard code those, but in the future
 * we will take them from sc.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf65501..191177b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -627,8 +627,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
case NVME_NS_DPS_PI_TYPE2:
control |= NVME_RW_PRINFO_PRCHK_GUARD |
NVME_RW_PRINFO_PRCHK_REF;
-   cmnd->rw.reftag = cpu_to_le32(
-   nvme_block_nr(ns, blk_rq_pos(req)));
+   cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
break;
}
}
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b8d131a..dd738ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4568,7 +4568,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd *scmd, bool pending)
MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG |
MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD;
mpi_request->CDB.EEDP32.PrimaryReferenceTag =
-   cpu_to_be32(scsi_prot_ref_tag(scmd));
+   cpu_to_be32(t10_pi_ref_tag(scmd->request));
break;
 
case SCSI_PROT_DIF_TYPE3:
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 9035380..d8de43d 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -124,7 +124,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
if (sdkp->protection_type == T10_PI_TYPE3_PROTECTION)
return;
 
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -176,7 +176,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int good_bytes)
return;
 
intervals = good_bytes / scsi_prot_interval(scmd);
-   phys = scsi_prot_ref_tag(scmd);
+   phys = t10_pi_ref_tag(scmd->request);
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_integrity_payload *bip = bio_integrity(bio);
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index c6aa8a3..c40511f 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -37,6 +37,16 @@ struct t10_pi_tuple {
 #define T10_PI_APP_ESCAPE cpu_to_be16(0xffff)
 #define T10_PI_REF_ESCAPE cpu_to_be32(0xffffffff)
 
+static inline u32 t10_pi_ref_tag(struct request *rq)
+{
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+   return blk_rq_pos(rq) >>
+   (rq->q->integrity.interval_exp - 9) & 0xffffffff;
+#else
+   return -1U;
+#endif
+}
+
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const struct blk_integrity_profile t10_pi_type1_ip;
 extern const struct blk_integrity_profile t10_pi_type3_crc;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index aaf1e97..cae229b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -4,6 +4,7 @@
 
 #include <linux/dma-mapping.h>
 #include <linux/blkdev.h>
+#include <linux/t10-pi.h>
 #include <linux/list.h>
 #include <linux/types.h>
 #include <linux/timer.h>
@@ -313,12 +314,6 @@ static inline unsigned int scsi_prot_interval(struct scsi_cmnd *scmd)
return scmd->device->sector_size;
 }
 
-static inline u32 scsi_prot_ref_tag(struct scsi_cmnd *scmd)
-{
-   return blk_rq_pos(scmd->request) >>
-   (ilog2(scsi_prot_interval(scmd)) - 9) & 0xffffffff;
-}
-
 static inline unsigned scsi_prot_sg_count(struct scsi_cmnd *cmd)
 {

[PATCH v5 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-29 Thread Max Gurtovoy
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - added Martin's Reviewed-by.

changes from v3:
 - kmap_atomic/kunmap_atomic the same address
 - declare pi struct inside the inner for loop
 - check "intervals" inside the for loop condition

changes from v2:
 - convert comments to kerneldoc format
 - removed SCSI specific comment
 - fix kmap_atomic/kunmap_atomic addresses
 - fix iteration over t10_pi_tuple's

changes from v1 (Christoph, Martin and Keith comments):
 - moved the functions to t10-pi.c
 - updated tuple size
 - changed local variables scope
 - remove/add new lines

---

 block/t10-pi.c | 110 +++
 drivers/scsi/sd.c  |   8 ++--
 drivers/scsi/sd.h  |   9 
 drivers/scsi/sd_dif.c  | 113 -
 include/linux/t10-pi.h |   3 ++
 5 files changed, 118 insertions(+), 125 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..62aed77 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -184,3 +184,113 @@ static blk_status_t t10_pi_type3_verify_ip(struct blk_integrity_iter *iter)
.verify_fn  = t10_pi_type3_verify_ip,
 };
 EXPORT_SYMBOL(t10_pi_type3_ip);
+
+/**
+ * t10_pi_prepare - prepare PI prior to submitting the request to the device
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Remap protection information to match the
+ * physical LBA.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_prepare(struct request *rq, u8 protection_type)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   /* Already remapped? */
+   if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
+   break;
+
+   bip_for_each_vec(iv, bip, iter) {
+   void *p, *pmap;
+   unsigned int j;
+
+   pmap = kmap_atomic(iv.bv_page);
+   p = pmap + iv.bv_offset;
+   for (j = 0; j < iv.bv_len; j += tuple_sz) {
+   struct t10_pi_tuple *pi = p;
+
+   if (be32_to_cpu(pi->ref_tag) == virt)
+   pi->ref_tag = cpu_to_be32(ref_tag);
+   virt++;
+   ref_tag++;
+   p += tuple_sz;
+   }
+
+   kunmap_atomic(pmap);
+   }
+
+   bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+   }
+}
+EXPORT_SYMBOL(t10_pi_prepare);
+
+/**
+ * t10_pi_complete - prepare PI prior to returning the request to the block layer
+ * @rq:  request with PI that should be prepared
+ * @protection_type: PI type (Type 1/Type 2/Type 3)
+ * @intervals:   total elements to prepare
+ *
+ * For Type 1/Type 2, the virtual start sector is the one that was
+ * originally submitted by the block layer for the ref_tag usage. Due to
+ * partitioning, MD/DM cloning, etc. the actual physical start sector is
+ * likely to be different. Since the physical start sector was submitted
+ * to the device, we should remap it back to virtual values expected by the
+ * block layer.
+ *
+ * Type 3 does not have a reference tag so no remapping is required.
+ */
+void t10_pi_complete(struct request *rq, u8 protection_type,
+unsigned int intervals)
+{
+   const int tuple_sz = rq->q->integrity.tuple_size;
+   u32 ref_tag = t10_pi_ref_tag(rq);
+   struct bio *bio;
+
+   if (protection_type == T10_PI_TYPE3_PROTECTION)
+   return;
+
+   __rq_for_each_bio(bio, rq) {
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   u32 virt = bip_get_seed(bip) & 0xffffffff;
+   struct bio_vec iv;
+   struct bvec_iter iter;
+
+   bip_for_each_vec(iv, 

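The remapping described in the kerneldoc above can be modeled in a few lines
of standalone C (illustrative only: the tuple is reduced to just the ref tag,
a little-endian host is assumed for the byte swaps, and the kernel's
bio/kmap_atomic() iteration is replaced by a plain array):

#include <stdint.h>
#include <stdio.h>

struct pi_tuple {
	uint32_t ref_tag;	/* stored big-endian, as on the wire */
};

static uint32_t swap32(uint32_t v) { return __builtin_bswap32(v); }

/* Walk n tuples, rewriting tags that still carry the virtual value. */
static void remap_ref_tags(struct pi_tuple *pi, unsigned int n,
			   uint32_t virt, uint32_t phys)
{
	for (unsigned int i = 0; i < n; i++, virt++, phys++)
		if (swap32(pi[i].ref_tag) == virt)
			pi[i].ref_tag = swap32(phys);
}

int main(void)
{
	struct pi_tuple pi[4];

	for (unsigned int i = 0; i < 4; i++)
		pi[i].ref_tag = swap32(100 + i);	/* virtual tags 100..103 */
	remap_ref_tags(pi, 4, 100, 5000);		/* physical start: 5000 */
	printf("%u\n", swap32(pi[0].ref_tag));		/* prints 5000 */
	return 0;
}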
[PATCH v5 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-29 Thread Max Gurtovoy
Also move the remapping logic to the nvme core driver instead of
implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they implement metadata
support).

Suggested-by: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Martin K. Petersen 
Reviewed-by: Martin K. Petersen 
Acked-by: Keith Busch 
Signed-off-by: Max Gurtovoy 
---

changes from v4:
 - Added Martin's and Keith's signatures

---
 drivers/nvme/host/core.c | 18 
 drivers/nvme/host/nvme.h |  9 +-
 drivers/nvme/host/pci.c  | 75 +---
 3 files changed, 20 insertions(+), 82 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 191177b..b57abe5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -617,6 +617,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+   } else if (req_op(req) == REQ_OP_WRITE) {
+   t10_pi_prepare(req, ns->pi_type);
}
 
switch (ns->pi_type) {
@@ -637,6 +639,22 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
return 0;
 }
 
+void nvme_cleanup_cmd(struct request *req)
+{
+   if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
+   nvme_req(req)->status == 0) {
+   struct nvme_ns *ns = req->rq_disk->private_data;
+
+   t10_pi_complete(req, ns->pi_type,
+   blk_rq_bytes(req) >> ns->lba_shift);
+   }
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+   kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+   }
+}
+EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
+
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33d..dfc01ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,14 +356,6 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
return (sector >> (ns->lba_shift - 9));
 }
 
-static inline void nvme_cleanup_cmd(struct request *req)
-{
-   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-   kfree(page_address(req->special_vec.bv_page) +
- req->special_vec.bv_offset);
-   }
-}
-
 static inline void nvme_end_request(struct request *req, __le16 status,
union nvme_result result)
 {
@@ -420,6 +412,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
+void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddd441b..03beac5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,73 +535,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
mempool_free(iod->sg, dev->iod_mempool);
 }
 
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == v)
-   pi->ref_tag = cpu_to_be32(p);
-}
-
-static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
-{
-   if (be32_to_cpu(pi->ref_tag) == p)
-   pi->ref_tag = cpu_to_be32(v);
-}
-
-/**
- * nvme_dif_remap - remaps ref tags to bip seed and physical lba
- *
- * The virtual start sector is the one that was originally submitted by the
- * block layer. Due to partitioning, MD/DM cloning, etc. the actual physical
- * start sector may be different. Remap protection information to match the
- * physical LBA on writes, and back to the original seed on reads.
- *
- * Type 0 and 3 do not have a ref tag, so no remapping required.
- */
-static void nvme_dif_remap(struct request *req,
-   void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
-{
-   struct nvme_ns *ns = req->rq_disk->private_data;
-   struct bio_integrity_payload *bip;
-   struct t10_pi_tuple *pi;
-   void *p, *pmap;
-   u32 i, nlb, ts, phys, virt;
-
-   if (!ns->pi_type || ns->pi_type == NVME_NS_DPS_PI_TYPE3)
-   return;
-
-   bip = bio_integrity(req->bio);
-   if (!bip)
-   return;
-
-   pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_offset;
-
-   p = pmap;
-   virt = bip_get_seed(bip);
-   phys = nvme_block_nr(ns,