Re: [PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-18 Thread Leon Romanovsky
On Sun, Sep 18, 2016 at 07:40:29PM -0500, Steve Wise wrote:
> > On Fri, Sep 16, 2016 at 07:54:52AM -0700, Steve Wise wrote:
> > > When processing a REG_MR work request, if fw supports the
> > > FW_RI_NSMR_TPTE_WR work request, and if the page list for this
> > > registration is <= 2 pages, and the current state of the mr is INVALID,
> > > then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
> > > to write.  This avoids FW having to do an async read of the TPTE blocking
> > > the SQ until the read completes.
> > >
> > > To know if the current MR state is INVALID or not, iw_cxgb4 must track the
> > > state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
> > > and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
> > > and when RECV completions are processed that include a local invalidation.
> > >
> > > This optimization increases small IO IOPS for both iSER and NVMF.
> > >
> > > Signed-off-by: Steve Wise 
> > > ---
> >
> > <...>
> >
> > > +   struct ib_reg_wr *wr, struct c4iw_mr *mhp,
> > > +   u8 *len16)
> > > +{
> > > + __be64 *p = (__be64 *)fr->pbl;
> > > +
> > > + fr->r2 = cpu_to_be32(0);
> >
> > Is there any difference between the line above and "fr->r2 = 0"?
>
> It makes sparse happy, IIRC...

Strange, but ok :)
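
For reference: sparse flags implicit conversions between host-order
integers and "restricted" bitwise types such as __be32, which is why
endian-annotated fields are normally written through cpu_to_be32().
(The integer constant 0 is special-cased by sparse and accepted either
way.) A minimal sketch of the warning involved, using a hypothetical
struct rather than anything from this patch:

	/* Hypothetical example, not from the patch itself.
	 * __be32 comes from <linux/types.h>, cpu_to_be32() from the
	 * byteorder headers.
	 */
	struct example_wr {
		__be32 r2;			/* big-endian wire field */
	};

	static void fill(struct example_wr *wr)
	{
		wr->r2 = 1;			/* sparse: incorrect type in assignment
						   (expected restricted __be32, got int) */
		wr->r2 = cpu_to_be32(1);	/* clean: explicit conversion */
	}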



RE: [PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-18 Thread Steve Wise
> On Fri, Sep 16, 2016 at 07:54:52AM -0700, Steve Wise wrote:
> > When processing a REG_MR work request, if fw supports the
> > FW_RI_NSMR_TPTE_WR work request, and if the page list for this
> > registration is <= 2 pages, and the current state of the mr is INVALID,
> > then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
> > to write.  This avoids FW having to do an async read of the TPTE blocking
> > the SQ until the read completes.
> >
> > To know if the current MR state is INVALID or not, iw_cxgb4 must track the
> > state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
> > and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
> > and when RECV completions are processed that include a local invalidation.
> >
> > This optimization increases small IO IOPS for both iSER and NVMF.
> >
> > Signed-off-by: Steve Wise 
> > ---
> 
> <...>
> 
> > +	struct ib_reg_wr *wr, struct c4iw_mr *mhp,
> > +	u8 *len16)
> > +{
> > +	__be64 *p = (__be64 *)fr->pbl;
> > +
> > +	fr->r2 = cpu_to_be32(0);
> 
> Is there any difference between the line above and "fr->r2 = 0"?

It makes sparse happy, IIRC...





Re: [PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-18 Thread Leon Romanovsky
On Fri, Sep 16, 2016 at 07:54:52AM -0700, Steve Wise wrote:
> When processing a REG_MR work request, if fw supports the
> FW_RI_NSMR_TPTE_WR work request, and if the page list for this
> registration is <= 2 pages, and the current state of the mr is INVALID,
> then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
> to write.  This avoids FW having to do an async read of the TPTE blocking
> the SQ until the read completes.
>
> To know if the current MR state is INVALID or not, iw_cxgb4 must track the
> state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
> and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
> and when RECV completions are processed that include a local invalidation.
>
> This optimization increases small IO IOPS for both iSER and NVMF.
>
> Signed-off-by: Steve Wise 
> ---

<...>

> +   struct ib_reg_wr *wr, struct c4iw_mr *mhp,
> +   u8 *len16)
> +{
> + __be64 *p = (__be64 *)fr->pbl;
> +
> + fr->r2 = cpu_to_be32(0);

Is there any difference between the line above and "fr->r2 = 0"?




[PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-16 Thread Steve Wise
When processing a REG_MR work request, if fw supports the
FW_RI_NSMR_TPTE_WR work request, and if the page list for this
registration is <= 2 pages, and the current state of the mr is INVALID,
then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
to write.  This avoids FW having to do an async read of the TPTE blocking
the SQ until the read completes.

To know if the current MR state is INVALID or not, iw_cxgb4 must track the
state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
and when RECV completions are processed that include a local invalidation.

This optimization increases small IO IOPS for both iSER and NVMF.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c  | 17 +++
 drivers/infiniband/hw/cxgb4/mem.c |  2 +-
 drivers/infiniband/hw/cxgb4/qp.c  | 67 +++
 drivers/infiniband/hw/cxgb4/t4.h  |  4 +-
 drivers/infiniband/hw/cxgb4/t4fw_ri_api.h | 12 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h |  1 +
 6 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index ac926c9..867b8cf 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -666,6 +666,18 @@ skip_cqe:
return ret;
 }
 
+static void invalidate_mr(struct c4iw_dev *rhp, u32 rkey)
+{
+   struct c4iw_mr *mhp;
+   unsigned long flags;
+
+   spin_lock_irqsave(&rhp->lock, flags);
+   mhp = get_mhp(rhp, rkey >> 8);
+   if (mhp)
+   mhp->attr.state = 0;
+   spin_unlock_irqrestore(&rhp->lock, flags);
+}
+
 /*
  * Get one cq entry from c4iw and map it to openib.
  *
@@ -721,6 +733,7 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct ib_wc *wc)
CQE_OPCODE(&cqe) == FW_RI_SEND_WITH_SE_INV) {
wc->ex.invalidate_rkey = CQE_WRID_STAG(&cqe);
wc->wc_flags |= IB_WC_WITH_INVALIDATE;
+   invalidate_mr(qhp->rhp, wc->ex.invalidate_rkey);
}
} else {
switch (CQE_OPCODE(&cqe)) {
@@ -746,6 +759,10 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct ib_wc *wc)
break;
case FW_RI_FAST_REGISTER:
wc->opcode = IB_WC_REG_MR;
+
+   /* Invalidate the MR if the fastreg failed */
+   if (CQE_STATUS(&cqe) != T4_ERR_SUCCESS)
+   invalidate_mr(qhp->rhp, CQE_WRID_FR_STAG(&cqe));
break;
default:
printk(KERN_ERR MOD "Unexpected opcode %d "
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 0b91b0f..80e2774 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -695,7 +695,7 @@ struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
mhp->attr.pdid = php->pdid;
mhp->attr.type = FW_RI_STAG_NSMR;
mhp->attr.stag = stag;
-   mhp->attr.state = 1;
+   mhp->attr.state = 0;
mmid = (stag) >> 8;
mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
if (insert_handle(rhp, &rhp->mmidr, mhp, mmid)) {
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index edb1172..3467b90 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -609,10 +609,42 @@ static int build_rdma_recv(struct c4iw_qp *qhp, union t4_recv_wr *wqe,
return 0;
 }
 
+static void build_tpte_memreg(struct fw_ri_fr_nsmr_tpte_wr *fr,
+ struct ib_reg_wr *wr, struct c4iw_mr *mhp,
+ u8 *len16)
+{
+   __be64 *p = (__be64 *)fr->pbl;
+
+   fr->r2 = cpu_to_be32(0);
+   fr->stag = cpu_to_be32(mhp->ibmr.rkey);
+
+   fr->tpte.valid_to_pdid = cpu_to_be32(FW_RI_TPTE_VALID_F |
+   FW_RI_TPTE_STAGKEY_V((mhp->ibmr.rkey & FW_RI_TPTE_STAGKEY_M)) |
+   FW_RI_TPTE_STAGSTATE_V(1) |
+   FW_RI_TPTE_STAGTYPE_V(FW_RI_STAG_NSMR) |
+   FW_RI_TPTE_PDID_V(mhp->attr.pdid));
+   fr->tpte.locread_to_qpid = cpu_to_be32(
+   FW_RI_TPTE_PERM_V(c4iw_ib_to_tpt_access(wr->access)) |
+   FW_RI_TPTE_ADDRTYPE_V(FW_RI_VA_BASED_TO) |
+   FW_RI_TPTE_PS_V(ilog2(wr->mr->page_size) - 12));
+   fr->tpte.nosnoop_pbladdr = cpu_to_be32(FW_RI_TPTE_PBLADDR_V(
+   PBL_OFF(&mhp->rhp->rdev, mhp->attr.pbl_addr)>>3));
+   fr->tpte.dca_mwbcnt_pstag = cpu_to_be32(0);
+   fr->tpte.len_hi = cpu_to_be32(0);
+   fr->tpte.len_lo = cpu_to_be32(mhp->ibmr.length);
+   fr->tpte.va_hi = cpu_to_be32(mhp->ibmr.iova >> 32);
+   fr->tpte.va_lo_fbo = cpu_to_be32(mhp->ibmr.iova & 0xffffffff);
+
+   p[0] = cpu_to_be64((u64)mhp->mpl[0]);
+   p[1] = cpu_to_be64((u64)(mhp->mpl_len > 1 ? mhp->mpl[1] : 0));
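
The fast-path decision described in the commit message hinges on three
conditions.  A minimal sketch of that check written as a helper; the
field names (fr_nsmr_tpte_wr_support, mpl_len) are assumptions about
the surrounding driver structures, not code copied from the posted
diff:

	/* Sketch only: when can a REG_MR use FW_RI_NSMR_TPTE_WR? */
	static bool use_tpte_fast_path(struct c4iw_qp *qhp, struct c4iw_mr *mhp)
	{
		return qhp->rhp->rdev.lldi.fr_nsmr_tpte_wr_support &&	/* FW support    */
		       mhp->attr.state == 0 &&				/* MR is INVALID */
		       mhp->mpl_len <= 2;				/* <= 2 pages    */
	}

When this holds, the SQ post builds the work request with
build_tpte_memreg() above and FW writes the inline TPTE directly;
otherwise the driver falls back to the existing FW_RI_FR_NSMR_WR path,
where FW must read the TPTE asynchronously.  Per the commit message, a
posted REG_MR marks mhp->attr.state valid, while invalidate_mr() clears
it again on local invalidations and on a failed fastreg.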