Re: (subset) [PATCH vhost 0/7] vdpa/mlx5: Add support for resumable vqs

2023-12-04 Thread Leon Romanovsky


On Fri, 01 Dec 2023 12:48:50 +0200, Dragos Tatulea wrote:
> Add support for resumable vqs in the driver. This is a firmware feature
> that can be used for the following benefits:
> - Full device .suspend/.resume.
> - .set_map doesn't need to destroy and create new vqs anymore just to
>   update the map. When resumable vqs are supported it is enough to
>   suspend the vqs, set the new maps, and then resume the vqs.
> 
> [...]

Applied, thanks!

[1/7] vdpa/mlx5: Expose resumable vq capability
  https://git.kernel.org/rdma/rdma/c/b24910e1be0e76

Best regards,
-- 
Leon Romanovsky 



Re: [Patch v5 0/5] RDMA/mana_ib

2023-09-11 Thread Leon Romanovsky
On Thu, Sep 07, 2023 at 09:52:34AM -0700, sharmaa...@linuxonhyperv.com wrote:
> From: Ajay Sharma 
> 
> Change from v4:
> Send qp fatal error event to the context that
> created the qp. Add lookup table for qp.
> 
> Ajay Sharma (5):
>   RDMA/mana_ib : Rename all mana_ib_dev type variables to mib_dev
>   RDMA/mana_ib : Register Mana IB  device with Management SW
>   RDMA/mana_ib : Create adapter and Add error eq
>   RDMA/mana_ib : Query adapter capabilities
>   RDMA/mana_ib : Send event to qp

I didn't look very deep into the series and has three very initial comments.
1. Please do git log drivers/infiniband/hw/mana/ and use same format for
commit messages.
2. Don't invent your own index-to-qp query mechanism in last patch and use 
xarray.
3. Once you decided to export mana_gd_register_device, it hinted me that
it is time to move to auxbus infrastructure.

Thanks

> 
>  drivers/infiniband/hw/mana/cq.c   |  12 +-
>  drivers/infiniband/hw/mana/device.c   |  81 +++--
>  drivers/infiniband/hw/mana/main.c | 288 +-
>  drivers/infiniband/hw/mana/mana_ib.h  | 102 ++-
>  drivers/infiniband/hw/mana/mr.c   |  42 ++-
>  drivers/infiniband/hw/mana/qp.c   |  86 +++---
>  drivers/infiniband/hw/mana/wq.c   |  21 +-
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 152 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c |   3 +
>  include/net/mana/gdma.h   |  16 +-
>  10 files changed, 545 insertions(+), 258 deletions(-)
> 
> -- 
> 2.25.1
> 



Re: [PATCH for-next v3 1/7] RDMA/rxe: Convert triple tasklets to use workqueue

2022-12-28 Thread Leon Romanovsky
On Wed, Dec 28, 2022 at 10:56:11AM -0600, Bob Pearson wrote:
> On 12/23/22 00:51, Daisuke Matsuda wrote:
> > In order to implement On-Demand Paging on the rxe driver, triple tasklets
> > (requester, responder, and completer) must be allowed to sleep so that they
> > can trigger page fault when pages being accessed are not present.
> > 
> > This patch replaces the tasklets with a workqueue, but still allows direct-
> > call of works from softirq context if it is obvious that MRs are not going
> > to be accessed and there is no work being processed in the workqueue.
> 
> There are already at least two patch sets that do this waiting to get upstream

I don't see any unhandled RXE series, except of this one patch [1],
which is one out larger series.

[1] 
https://patchwork.kernel.org/project/linux-rdma/patch/20221029031331.64518-1-rpearson...@gmail.com/

> Bob



Re: [RFC PATCH v2 0/7] On-Demand Paging on SoftRoCE

2022-11-16 Thread Leon Romanovsky
On Fri, Nov 11, 2022 at 06:22:21PM +0900, Daisuke Matsuda wrote:
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.

<...>

> Daisuke Matsuda (7):
>   IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
>   RDMA/rxe: Convert the triple tasklets to workqueues
>   RDMA/rxe: Cleanup code for responder Atomic operations
>   RDMA/rxe: Add page invalidation support
>   RDMA/rxe: Allow registering MRs for On-Demand Paging
>   RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
>   RDMA/rxe: Add support for the traditional Atomic operations with ODP

It is a shame that such cool feature is not progressing.
RXE folks, can you please review it?

Thanks



Re: [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP

2022-09-08 Thread Leon Romanovsky
On Wed, Sep 07, 2022 at 11:43:04AM +0900, Daisuke Matsuda wrote:
> rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
> it to load payloads of requesting packets; responder uses it to process
> Send, Write, and Read operaetions; completer uses it to copy data from
> response packets of Read and Atomic operations to a user MR.
> 
> Allow these operations to be used with ODP by adding a counterpart function
> rxe_odp_mr_copy(). It is comprised of the following steps:
>  1. Check the driver page table(umem_odp->dma_list) to see if pages being
> accessed are present with appropriate permission.
>  2. If necessary, trigger page fault to map the pages.
>  3. Convert their user space addresses to kernel logical addresses using
> PFNs in the driver page table(umem_odp->pfn_list).
>  4. Execute data copy fo/from the pages.
> 
> umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> is not changed while it is checked and that mapped pages are not
> invalidated before data copy completes.
> 
> Signed-off-by: Daisuke Matsuda 
> ---
>  drivers/infiniband/sw/rxe/rxe.c  |  10 ++
>  drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
>  drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++
>  drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
>  5 files changed, 190 insertions(+), 3 deletions(-)

<...>

> +/* umem mutex is always locked when returning from this function. */
> +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 
> flags)
> +{
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> + const int max_tries = 3;
> + int cnt = 0;
> +
> + int err;
> + u64 perm;
> + bool need_fault;
> +
> + if (unlikely(length < 1))
> + return -EINVAL;
> +
> + perm = ODP_READ_ALLOWED_BIT;
> + if (!(flags & RXE_PAGEFAULT_RDONLY))
> + perm |= ODP_WRITE_ALLOWED_BIT;
> +
> + mutex_lock(_odp->umem_mutex);
> +
> + /*
> +  * A successful return from rxe_odp_do_pagefault() does not guarantee
> +  * that all pages in the range became present. Recheck the DMA address
> +  * array, allowing max 3 tries for pagefault.
> +  */
> + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> + iova, length, perm))) {
> +
> + if (cnt >= max_tries)
> + break;
> +
> + mutex_unlock(_odp->umem_mutex);
> +
> + /* rxe_odp_do_pagefault() locks the umem mutex. */

Maybe it is correct and safe to release lock in the middle, but it is
not clear. The whole pattern of taking lock in one function and later
releasing it in another doesn't look right to me.

Thanks



Re: [PATCH] net/rds: Avoid potential use after free in rds_send_remove_from_sock

2021-04-20 Thread Leon Romanovsky
On Tue, Apr 20, 2021 at 12:09:06PM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 07:09:12PM -0500, Aditya Pakki wrote:
> > In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource
> > is freed and later under spinlock, causing potential use-after-free.
> > Set the free pointer to NULL to avoid undefined behavior.
> > 
> > Signed-off-by: Aditya Pakki 
> > ---
> >  net/rds/message.c | 1 +
> >  net/rds/send.c| 2 +-
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> Dave, Jakub
> 
> Please revert this patch, given responses from Eric and Al together
> with this response from Greg here 
> https://lore.kernel.org/lkml/YH5/i7ovsjsmq...@kroah.com

https://lore.kernel.org/lkml/yh5%2fi7ovsjsmq...@kroah.com/

> 
> BTW, I looked on the rds code too and agree with Eric, this patch
> is a total garbage.
> 
> Thanks


Re: [PATCH] net/rds: Avoid potential use after free in rds_send_remove_from_sock

2021-04-20 Thread Leon Romanovsky
On Tue, Apr 06, 2021 at 07:09:12PM -0500, Aditya Pakki wrote:
> In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource
> is freed and later under spinlock, causing potential use-after-free.
> Set the free pointer to NULL to avoid undefined behavior.
> 
> Signed-off-by: Aditya Pakki 
> ---
>  net/rds/message.c | 1 +
>  net/rds/send.c| 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)

Dave, Jakub

Please revert this patch, given responses from Eric and Al together
with this response from Greg here 
https://lore.kernel.org/lkml/YH5/i7ovsjsmq...@kroah.com

BTW, I looked on the rds code too and agree with Eric, this patch
is a total garbage.

Thanks


Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match

2021-04-18 Thread Leon Romanovsky
On Mon, Apr 19, 2021 at 12:27:22PM +0800, Alice Guo (OSS) wrote:
> From: Alice Guo 
> 
> Update all the code that use soc_device_match because add support for
> soc_device_match returning -EPROBE_DEFER.
> 
> Signed-off-by: Alice Guo 
> ---
>  drivers/bus/ti-sysc.c |  2 +-
>  drivers/clk/renesas/r8a7795-cpg-mssr.c|  4 +++-
>  drivers/clk/renesas/rcar-gen2-cpg.c   |  2 +-
>  drivers/clk/renesas/rcar-gen3-cpg.c   |  2 +-
>  drivers/dma/fsl-dpaa2-qdma/dpaa2-qdma.c   |  7 ++-
>  drivers/dma/ti/k3-psil.c  |  3 +++
>  drivers/dma/ti/k3-udma.c  |  2 +-
>  drivers/gpu/drm/bridge/nwl-dsi.c  |  2 +-
>  drivers/gpu/drm/meson/meson_drv.c |  4 +++-
>  drivers/gpu/drm/omapdrm/dss/dispc.c   |  2 +-
>  drivers/gpu/drm/omapdrm/dss/dpi.c |  4 +++-
>  drivers/gpu/drm/omapdrm/dss/dsi.c |  3 +++
>  drivers/gpu/drm/omapdrm/dss/dss.c |  3 +++
>  drivers/gpu/drm/omapdrm/dss/hdmi4_core.c  |  3 +++
>  drivers/gpu/drm/omapdrm/dss/venc.c|  4 +++-
>  drivers/gpu/drm/omapdrm/omap_drv.c|  3 +++
>  drivers/gpu/drm/rcar-du/rcar_du_crtc.c|  4 +++-
>  drivers/gpu/drm/rcar-du/rcar_lvds.c   |  2 +-
>  drivers/gpu/drm/tidss/tidss_dispc.c   |  4 +++-
>  drivers/iommu/ipmmu-vmsa.c|  7 +--
>  drivers/media/platform/rcar-vin/rcar-core.c   |  2 +-
>  drivers/media/platform/rcar-vin/rcar-csi2.c   |  2 +-
>  drivers/media/platform/vsp1/vsp1_uif.c|  4 +++-
>  drivers/mmc/host/renesas_sdhi_core.c  |  2 +-
>  drivers/mmc/host/renesas_sdhi_internal_dmac.c |  2 +-
>  drivers/mmc/host/sdhci-of-esdhc.c | 21 ++-
>  drivers/mmc/host/sdhci-omap.c |  2 +-
>  drivers/mmc/host/sdhci_am654.c|  2 +-
>  drivers/net/ethernet/renesas/ravb_main.c  |  4 +++-
>  drivers/net/ethernet/ti/am65-cpsw-nuss.c  |  2 +-
>  drivers/net/ethernet/ti/cpsw.c|  2 +-
>  drivers/net/ethernet/ti/cpsw_new.c|  2 +-
>  drivers/phy/ti/phy-omap-usb2.c|  4 +++-
>  drivers/pinctrl/renesas/core.c|  2 +-
>  drivers/pinctrl/renesas/pfc-r8a7790.c |  5 -
>  drivers/pinctrl/renesas/pfc-r8a7794.c |  5 -
>  drivers/soc/fsl/dpio/dpio-driver.c| 13 
>  drivers/soc/renesas/r8a774c0-sysc.c   |  5 -
>  drivers/soc/renesas/r8a7795-sysc.c|  2 +-
>  drivers/soc/renesas/r8a77990-sysc.c   |  5 -
>  drivers/soc/ti/k3-ringacc.c   |  2 +-
>  drivers/staging/mt7621-pci/pci-mt7621.c   |  2 +-
>  drivers/thermal/rcar_gen3_thermal.c   |  4 +++-
>  drivers/thermal/ti-soc-thermal/ti-bandgap.c   | 10 +++--
>  drivers/usb/gadget/udc/renesas_usb3.c |  2 +-
>  drivers/usb/host/ehci-platform.c  |  4 +++-
>  drivers/usb/host/xhci-rcar.c  |  2 +-
>  drivers/watchdog/renesas_wdt.c|  2 +-
>  48 files changed, 131 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/bus/ti-sysc.c b/drivers/bus/ti-sysc.c
> index 5fae60f8c135..00c59aa217c1 100644
> --- a/drivers/bus/ti-sysc.c
> +++ b/drivers/bus/ti-sysc.c
> @@ -2909,7 +2909,7 @@ static int sysc_init_soc(struct sysc *ddata)
>   }
>  
>   match = soc_device_match(sysc_soc_feat_match);
> - if (!match)
> + if (!match || IS_ERR(match))
>   return 0;
>  
>   if (match->data)
> diff --git a/drivers/clk/renesas/r8a7795-cpg-mssr.c 
> b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> index c32d2c678046..90a18336a4c3 100644
> --- a/drivers/clk/renesas/r8a7795-cpg-mssr.c
> +++ b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> @@ -439,6 +439,7 @@ static const unsigned int r8a7795es2_mod_nullify[] 
> __initconst = {
>  
>  static int __init r8a7795_cpg_mssr_init(struct device *dev)
>  {
> + const struct soc_device_attribute *match;
>   const struct rcar_gen3_cpg_pll_config *cpg_pll_config;
>   u32 cpg_mode;
>   int error;
> @@ -453,7 +454,8 @@ static int __init r8a7795_cpg_mssr_init(struct device 
> *dev)
>   return -EINVAL;
>   }
>  
> - if (soc_device_match(r8a7795es1)) {
> + match = soc_device_match(r8a7795es1);
> + if (!IS_ERR(match) && match) {

"if (!IS_ERR_OR_NULL(match))" in all places.

Thanks


[PATCH rdma-next 0/3] CMA fixes

2021-04-18 Thread Leon Romanovsky
From: Leon Romanovsky 

Another round of fixes to cma.c

Parav Pandit (1):
  RDMA/cma: Skip device which doesn't support CM

Shay Drory (2):
  RDMA/core: Fix check of device in rdma_listen()
  RDMA/core: Add CM to restrack after successful attachment to a device

 drivers/infiniband/core/cma.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

-- 
2.30.2



Re: [RFC V2 PATCH 11/12] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-04-18 Thread Leon Romanovsky
On Tue, Apr 13, 2021 at 11:22:16AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() still need to handle. Use DMA API to map/umap these
> memory during sending/receiving packet and Hyper-V DMA ops callback
> will use swiotlb fucntion to allocate bounce buffer and copy data
> from/to bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/net/hyperv/hyperv_net.h   |  11 +++
>  drivers/net/hyperv/netvsc.c   | 137 --
>  drivers/net/hyperv/rndis_filter.c |   3 +
>  3 files changed, 144 insertions(+), 7 deletions(-)

<...>

> + packet->dma_range = kzalloc(sizeof(struct dma_range) * page_count,
> +   GFP_KERNEL);
> + if (!packet->dma_range)
> + return -ENOMEM;
> +
> + for (i = 0; i < page_count; i++) {
> + char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
> +  + pb[i].offset);
> + u32 len = pb[i].len;
> +
> + dma = dma_map_single(_dev->device, src, len,
> +  DMA_TO_DEVICE);
> + if (dma_mapping_error(_dev->device, dma))
> + return -ENOMEM;

Don't you leak dma_range here?

BTW, It will be easier if you CC all on all patches, so we will be able
to get whole context.

Thanks


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-13 Thread Leon Romanovsky
On Tue, Apr 13, 2021 at 11:13:38AM +, Haakon Bugge wrote:
> 
> 
> > On 13 Apr 2021, at 08:29, Leon Romanovsky  wrote:
> > 
> > On Mon, Apr 12, 2021 at 07:58:47PM -0300, Jason Gunthorpe wrote:
> >> On Wed, Mar 31, 2021 at 08:43:12PM +0200, Håkon Bugge wrote:
> >>> ib_modify_qp() is an expensive operation on some HCAs running
> >>> virtualized. This series removes two ib_modify_qp() calls from RDS.
> >>> 
> >>> I am sending this as a v3, even though it is the first sent to
> >>> net. This because the IB Core commit has reach v3.
> >>> 
> >>> Håkon Bugge (2):
> >>>  IB/cma: Introduce rdma_set_min_rnr_timer()
> >>>  rds: ib: Remove two ib_modify_qp() calls
> >> 
> >> Applied to rdma for-next, thanks
> > 
> > Jason,
> > 
> > It should be 
> > +   WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT);
> 
> With no return you will arm the setting of the timer and subsequently get an 
> error from the modify_qp later.

The addition of WARN_ON() means that this is programmer error to get
such input. Historically, in-kernel API doesn't need to have protection
from other kernel developers.

Thanks

> 
> 
> Håkon
> 
> > 
> > and not
> > +   if (WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT))
> > +   return -EINVAL;
> > 
> > Thanks
> > 
> >> 
> >> Jason
> 


Re: [PATCH 0/1] Use of /sys/bus/pci/devices/…/index for non-SMBIOS platforms

2021-04-13 Thread Leon Romanovsky
On Tue, Apr 13, 2021 at 08:57:19AM +0200, Niklas Schnelle wrote:
> On Tue, 2021-04-13 at 08:39 +0300, Leon Romanovsky wrote:
> > On Mon, Apr 12, 2021 at 03:59:04PM +0200, Niklas Schnelle wrote:
> > > Hi Narendra, Hi All,
> > > 
> > > According to Documentation/ABI/testing/sysfs-bus-pci you are responsible
> > > for the index device attribute that is used by systemd to create network
> > > interface names.
> > > 
> > > Now we would like to reuse this attribute for firmware provided PCI
> > > device index numbers on the s390 architecture which doesn't have
> > > SMBIOS/DMI nor ACPI. All code changes are within our architecture
> > > specific code but I'd like to get some Acks for this reuse. I've sent an
> > > RFC version of this patch on 15th of March with the subject:
> > > 
> > >s390/pci: expose a PCI device's UID as its index
> > > 
> > > but got no response. Would it be okay to re-use this attribute for
> > > essentially the same purpose but with index numbers provided by
> > > a different platform mechanism? I think this would be cleaner than
> > > further proliferation of /sys/bus/pci/devices//xyz_index
> > > attributes and allows re-use of the existing userspace infrastructure.
> > 
> > I'm missing an explanation that this change is safe for systemd and
> > they don't have some hard-coded assumption about the meaning of existing
> > index on s390.
> > 
> > Thanks
> 
> 
> Sure, good point. So first off yes this change does create new index
> based names also on existing systemd versions, this is known and
> intended and we'll certainly closely collaborate with any distributions
> wishing to backport this change.
> 
> As for being otherwise safe or having unintended consequences, Viktor
> (see R-b) and I recently got the following PR merged in that exact area
> of systemd to fix how hotplug slot derived interface names are
> generated:
> https://github.com/systemd/systemd/pull/19017
> In working on that we did also analyse the use of the index attribute
> for hidden assumptions and tested with this attribute added. Arguably,
> as the nature of that PR shows we haven't had a perfect track record of
> keeping this monitored but will in the future as PCI based NICs become
> increasingly important for our platform. We also have special NIC
> naming logic in the same area for our channel based platform specific
> NICs which was also contributed by Viktor.

Thanks, this PR is exciting to read, very warm words were said about
kernel developers :). Can you please summarize that will be the breakage
in old systemd if this index will be overloaded?

Thanks

> 
> Thanks,
> Niklas
> 


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-13 Thread Leon Romanovsky
On Mon, Apr 12, 2021 at 07:58:47PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 08:43:12PM +0200, Håkon Bugge wrote:
> > ib_modify_qp() is an expensive operation on some HCAs running
> > virtualized. This series removes two ib_modify_qp() calls from RDS.
> > 
> > I am sending this as a v3, even though it is the first sent to
> > net. This because the IB Core commit has reach v3.
> > 
> > Håkon Bugge (2):
> >   IB/cma: Introduce rdma_set_min_rnr_timer()
> >   rds: ib: Remove two ib_modify_qp() calls
> 
> Applied to rdma for-next, thanks

Jason,

It should be 
+   WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT);

and not
+   if (WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT))
+   return -EINVAL;

Thanks

> 
> Jason


Re: [PATCH 0/1] Use of /sys/bus/pci/devices/…/index for non-SMBIOS platforms

2021-04-12 Thread Leon Romanovsky
On Mon, Apr 12, 2021 at 03:59:04PM +0200, Niklas Schnelle wrote:
> Hi Narendra, Hi All,
> 
> According to Documentation/ABI/testing/sysfs-bus-pci you are responsible
> for the index device attribute that is used by systemd to create network
> interface names.
> 
> Now we would like to reuse this attribute for firmware provided PCI
> device index numbers on the s390 architecture which doesn't have
> SMBIOS/DMI nor ACPI. All code changes are within our architecture
> specific code but I'd like to get some Acks for this reuse. I've sent an
> RFC version of this patch on 15th of March with the subject:
> 
>s390/pci: expose a PCI device's UID as its index
> 
> but got no response. Would it be okay to re-use this attribute for
> essentially the same purpose but with index numbers provided by
> a different platform mechanism? I think this would be cleaner than
> further proliferation of /sys/bus/pci/devices//xyz_index
> attributes and allows re-use of the existing userspace infrastructure.

I'm missing an explanation that this change is safe for systemd and
they don't have some hard-coded assumption about the meaning of existing
index on s390.

Thanks


Re: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-12 Thread Leon Romanovsky
On Mon, Apr 12, 2021 at 08:35:32AM +, Dexuan Cui wrote:
> > From: Leon Romanovsky 
> > Sent: Monday, April 12, 2021 12:46 AM
> > To: Dexuan Cui 
> > > ...
> > > +#define ANA_MAJOR_VERSION0
> > > +#define ANA_MINOR_VERSION1
> > > +#define ANA_MICRO_VERSION1
> > 
> > Please don't introduce drier versions.
> 
> This is not the usual "driver version", though it's called  "drv version" :-)
> As you can see, the driver does not use the macro MODULE_VERSION().
> 
> Here the "drv version" actually means the version of the VF-to-PF protocol,
> with which the Azure Network Adapter ethernet NIC driver (i.e. the VF driver)
> talks to the PF driver.  The protocol version determines the formats of the
> messages that are sent from the VF driver to the PF driver, e.g. query the
> MAC address, create Send/Receive queues, configure RSS, etc.
> 
> Currently the protocol versin is 0.1.1 You may ask why it's called
> "drv version" rather than "protocol version" -- it's because the PF driver
> calls it that way, so I think here the VF driver may as well use the same
> name. BTW, the "drv ver" info is passed to the PF driver in the below
> function:

Ohh, yes, the "driver version" is not the ideal name for that.

I already looked on it in previous patch, came to the conclusion about
the protocol and forgot :(.

> 
> static int mana_query_client_cfg(struct ana_context *ac, u32 drv_major_ver,
>  u32 drv_minor_ver, u32 drv_micro_ver,
>  u16 *max_num_vports)
> {
> struct gdma_context *gc = ac->gdma_dev->gdma_context;
> struct ana_query_client_cfg_resp resp = {};
> struct ana_query_client_cfg_req req = {};
> struct device *dev = gc->dev;
> int err = 0;
> 
> mana_gd_init_req_hdr(, ANA_QUERY_CLIENT_CONFIG,
>  sizeof(req), sizeof(resp));
> req.drv_major_ver = drv_major_ver;
> req.drv_minor_ver = drv_minor_ver;
> req.drv_micro_ver = drv_micro_ver;
> 
> err = mana_send_request(ac, , sizeof(req), , sizeof(resp));
> 
> Thanks,
> Dexuan
> 


Re: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-12 Thread Leon Romanovsky
On Sun, Apr 11, 2021 at 07:34:55PM -0700, Dexuan Cui wrote:
> Add a VF driver for Microsoft Azure Network Adapter (MANA) that will be
> available in the future.
> 
> Co-developed-by: Haiyang Zhang 
> Signed-off-by: Haiyang Zhang 
> Co-developed-by: Shachar Raindel 
> Signed-off-by: Shachar Raindel 
> Signed-off-by: Dexuan Cui 
> ---
>  MAINTAINERS   |4 +-
>  drivers/net/ethernet/Kconfig  |1 +
>  drivers/net/ethernet/Makefile |1 +
>  drivers/net/ethernet/microsoft/Kconfig|   29 +
>  drivers/net/ethernet/microsoft/Makefile   |5 +
>  drivers/net/ethernet/microsoft/mana/Makefile  |6 +
>  drivers/net/ethernet/microsoft/mana/gdma.h|  728 +++
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 1525 +
>  .../net/ethernet/microsoft/mana/hw_channel.c  |  854 
>  .../net/ethernet/microsoft/mana/hw_channel.h  |  190 ++
>  drivers/net/ethernet/microsoft/mana/mana.h|  549 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 1924 +
>  .../ethernet/microsoft/mana/mana_ethtool.c|  252 +++
>  .../net/ethernet/microsoft/mana/shm_channel.c |  298 +++
>  .../net/ethernet/microsoft/mana/shm_channel.h |   21 +
>  15 files changed, 6386 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/microsoft/Kconfig
>  create mode 100644 drivers/net/ethernet/microsoft/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma_main.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_en.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_ethtool.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.h
> 

<...>

> +/* Microsoft Azure Network Adapter (MANA)'s definitions
> + *
> + * Structures labeled with "HW DATA" are exchanged with the hardware. All of
> + * them are naturally aligned and hence don't need __packed.
> + */
> +
> +#define ANA_MAJOR_VERSION0
> +#define ANA_MINOR_VERSION1
> +#define ANA_MICRO_VERSION1

Please don't introduce drier versions.

Thanks


Re: [PATCH net-next 4/7] net: ipa: ipa_stop() does not return an error

2021-04-12 Thread Leon Romanovsky
On Sun, Apr 11, 2021 at 08:42:15AM -0500, Alex Elder wrote:
> On 4/11/21 8:28 AM, Leon Romanovsky wrote:
> >> I think *not* checking an available return value is questionable
> >> practice.  I'd really rather have a build option for a
> >> "__need_not_check" tag and have "must_check" be the default.
> > __need_not_check == void ???
> 
> I'm not sure I understand your statement here, but...

We are talking about the same thing. My point was that __need_not_check
is actually void. The API author was supposed to declare that by
declaring that function doesn't return anything.

Thanks


Re: [PATCH net-next 4/7] net: ipa: ipa_stop() does not return an error

2021-04-11 Thread Leon Romanovsky
On Sun, Apr 11, 2021 at 08:09:55AM -0500, Alex Elder wrote:
> On 4/11/21 1:34 AM, Leon Romanovsky wrote:
> > On Fri, Apr 09, 2021 at 01:07:19PM -0500, Alex Elder wrote:
> >> In ipa_modem_stop(), if the modem netdev pointer is non-null we call
> >> ipa_stop().  We check for an error and if one is returned we handle
> >> it.  But ipa_stop() never returns an error, so this extra handling
> >> is unnecessary.  Simplify the code in ipa_modem_stop() based on the
> >> knowledge no error handling is needed at this spot.
> >>
> >> Signed-off-by: Alex Elder 
> >> ---
> >>  drivers/net/ipa/ipa_modem.c | 18 --
> >>  1 file changed, 4 insertions(+), 14 deletions(-)
> > 
> > <...>
> > 
> >> +  /* Stop the queue and disable the endpoints if it's open */
> >>if (netdev) {
> >> -  /* Stop the queue and disable the endpoints if it's open */
> >> -  ret = ipa_stop(netdev);
> >> -  if (ret)
> >> -  goto out_set_state;
> >> -
> >> +  (void)ipa_stop(netdev);
> > 
> > This void casting is not needed here and in more general case sometimes
> > even be seen as a mistake, for example if the returned attribute declared
> > as __must_check.
> 
> I accept your point but I feel like it's sort of a 50/50 thing.
> 
> I think *not* checking an available return value is questionable
> practice.  I'd really rather have a build option for a
> "__need_not_check" tag and have "must_check" be the default.

__need_not_check == void ???

> 
> The void cast here says "I know this returns a result, but I am
> intentionally not checking it."  If it had been __must_check I
> would certainly have checked it.  
> 
> That being said, I don't really care that much, so I'll plan
> to post version 2, which will drop this cast (I'll probably
> add a comment though).

Thanks

> 
> Thanks.
> 
>   -Alex
> 
> > 
> > Thanks
> > 
> 


Re: [PATCH v2] PCI: merge slot and bus reset implementations

2021-04-11 Thread Leon Romanovsky
On Thu, Apr 08, 2021 at 06:23:40PM +, Raphael Norwitz wrote:
> Slot resets are bus resets with additional logic to prevent a device
> from being removed during the reset. Currently slot and bus resets have
> separate implementations in pci.c, complicating higher level logic. As
> discussed on the mailing list, they should be combined into a generic
> function which performs an SBR. This change adds a function,
> pci_reset_bus_function(), which first attempts a slot reset and then
> attempts a bus reset if -ENOTTY is returned, such that there is now a
> single device agnostic function to perform an SBR.
> 
> This new function is also needed to add SBR reset quirks and therefore
> is exposed in pci.h.
> 
> Link: https://lkml.org/lkml/2021/3/23/911
> 
> Suggested-by: Alex Williamson 
> Signed-off-by: Amey Narkhede 
> Signed-off-by: Raphael Norwitz 
> ---
>  drivers/pci/pci.c   | 19 +++
>  include/linux/pci.h |  1 +
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH net-next 4/7] net: ipa: ipa_stop() does not return an error

2021-04-11 Thread Leon Romanovsky
On Fri, Apr 09, 2021 at 01:07:19PM -0500, Alex Elder wrote:
> In ipa_modem_stop(), if the modem netdev pointer is non-null we call
> ipa_stop().  We check for an error and if one is returned we handle
> it.  But ipa_stop() never returns an error, so this extra handling
> is unnecessary.  Simplify the code in ipa_modem_stop() based on the
> knowledge no error handling is needed at this spot.
> 
> Signed-off-by: Alex Elder 
> ---
>  drivers/net/ipa/ipa_modem.c | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)

<...>

> + /* Stop the queue and disable the endpoints if it's open */
>   if (netdev) {
> - /* Stop the queue and disable the endpoints if it's open */
> - ret = ipa_stop(netdev);
> - if (ret)
> - goto out_set_state;
> -
> + (void)ipa_stop(netdev);

This void casting is not needed here and in more general case sometimes
even be seen as a mistake, for example if the returned attribute declared
as __must_check.

Thanks


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 09:59:52PM +, Dexuan Cui wrote:
> > From: Leon Romanovsky 
> > Sent: Wednesday, April 7, 2021 5:45 AM
> > > >
> > > > BTW, you don't need to write { 0 }, the {} is enough.
> > >
> > > Thanks for the suggestion! I'll use {0} in v2.
> > 
> > You missed the point, "{ 0 }" change to be "{}" without 0.
> 
> Got it. Will make the suggested change.

The numbers are not important, if you are curious, read this thread, it
talks about {}, {0}, memset(0,..) and padding :)
https://lore.kernel.org/linux-rdma/20200730192026.110246-1-yepeilin...@gmail.com/

> 
> FWIW, {0} and { 0 } are still widely used, but it looks like
> {} is indeed more preferred:
> 
> $ grep "= {};" drivers/net/  -nr  | wc -l
> 829
> 
> $ grep "= {0};" drivers/net/  -nr  | wc -l
> 708
> 
> $ grep "= {};" kernel/  -nr  | wc -l
> 29
> 
> $ grep "= {0};" kernel/  -nr  | wc -l
> 4


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 03:05:26PM +, Haiyang Zhang wrote:
> 
> 
> > -Original Message-
> > From: Leon Romanovsky 
> > Sent: Wednesday, April 7, 2021 10:55 AM
> > To: Haiyang Zhang 
> > Cc: Dexuan Cui ; da...@davemloft.net;
> > k...@kernel.org; KY Srinivasan ; Stephen Hemminger
> > ; wei@kernel.org; Wei Liu
> > ; net...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; linux-hyp...@vger.kernel.org
> > Subject: Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure
> > Network Adapter (MANA)
> > 
> > On Wed, Apr 07, 2021 at 02:41:45PM +, Haiyang Zhang wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Leon Romanovsky 
> > > > Sent: Wednesday, April 7, 2021 8:51 AM
> > > > To: Dexuan Cui 
> > > > Cc: da...@davemloft.net; k...@kernel.org; KY Srinivasan
> > > > ; Haiyang Zhang ;
> > Stephen
> > > > Hemminger ; wei@kernel.org; Wei Liu
> > > > ; net...@vger.kernel.org; linux-
> > > > ker...@vger.kernel.org; linux-hyp...@vger.kernel.org
> > > > Subject: Re: [PATCH net-next] net: mana: Add a driver for Microsoft
> > > > Azure Network Adapter (MANA)
> > > >
> > > > On Wed, Apr 07, 2021 at 08:40:13AM +, Dexuan Cui wrote:
> > > > > > From: Leon Romanovsky 
> > > > > > Sent: Wednesday, April 7, 2021 1:10 AM
> > > > > >
> > > > > > <...>
> > > > > >
> > > > > > > +int gdma_verify_vf_version(struct pci_dev *pdev) {
> > > > > > > + struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > > > > + struct gdma_verify_ver_req req = { 0 };
> > > > > > > + struct gdma_verify_ver_resp resp = { 0 };
> > > > > > > + int err;
> > > > > > > +
> > > > > > > + gdma_init_req_hdr(,
> > GDMA_VERIFY_VF_DRIVER_VERSION,
> > > > > > > +   sizeof(req), sizeof(resp));
> > > > > > > +
> > > > > > > + req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> > > > > > > + req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> > > > > > > +
> > > > > > > + err = gdma_send_request(gc, sizeof(req), , sizeof(resp),
> > );
> > > > > > > + if (err || resp.hdr.status) {
> > > > > > > + pr_err("VfVerifyVersionOutput: %d, status=0x%x\n",
> > err,
> > > > > > > +resp.hdr.status);
> > > > > > > + return -EPROTO;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > >
> > > > > > <...>
> > > > > > > + err = gdma_verify_vf_version(pdev);
> > > > > > > + if (err)
> > > > > > > + goto remove_irq;
> > > > > >
> > > > > > Will this VF driver be used in the guest VM? What will prevent
> > > > > > from users
> > > > to
> > > > > > change it?
> > > > > > I think that such version negotiation scheme is not allowed.
> > > > >
> > > > > Yes, the VF driver is expected to run in a Linux VM that runs on 
> > > > > Azure.
> > > > >
> > > > > Currently gdma_verify_vf_version() just tells the PF driver that
> > > > > the VF
> > > > driver
> > > > > is only able to support GDMA_PROTOCOL_V1, and want to use
> > > > > GDMA_PROTOCOL_V1's message formats to talk to the PF driver later.
> > > > >
> > > > > enum {
> > > > > GDMA_PROTOCOL_UNDEFINED = 0,
> > > > > GDMA_PROTOCOL_V1 = 1,
> > > > > GDMA_PROTOCOL_FIRST = GDMA_PROTOCOL_V1,
> > > > > GDMA_PROTOCOL_LAST = GDMA_PROTOCOL_V1,
> > > > > GDMA_PROTOCOL_VALUE_MAX
> > > > > };
> > > > >
> > > > > The PF driver is supposed to always support GDMA_PROTOCOL_V1, so I
> > > > expect
> > > > > here gdma_verify_vf_version() should succeed. If a user changes
> > > > > the Linux
> > > > VF
> > > > > driver and try to use a protocol version not supported by the PF
> > > > > driver,
> > > > then
> > > > > gdma_verify_vf_version() will fail; later, if the VF driver tries
> > > > > to talk to the
> > > > PF
> > > > > driver using an unsupported message format, the PF driver will
> > > > > return a
> > > > failure.
> > > >
> > > > The worry is not for the current code, but for the future one when
> > > > you will support v2, v3 e.t.c. First, your code will look like a
> > > > spaghetti and second, users will try and mix vX with "unsupported"
> > > > commands just for the fun.
> > >
> > > In the future, if the protocol version updated on the host side,
> > > guests need to support different host versions because not all hosts
> > > are updated (simultaneously). So this negotiation is necessary to know
> > > the supported version, and decide the proper command version to use.
> > 
> > And how do other paravirtual drivers solve this negotiation scheme?
> 
> I saw some other drivers used version negotiation too, for example:

I see, thanks.

> 
> /**
>  *  ixgbevf_negotiate_api_version_vf - Negotiate supported API version
>  *  @hw: pointer to the HW structure
>  *  @api: integer containing requested API version
>  **/
> static int ixgbevf_negotiate_api_version_vf(struct ixgbe_hw *hw, int api)
> {
> 
> Thanks,
> - Haiyang


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 02:41:45PM +, Haiyang Zhang wrote:
> 
> 
> > -Original Message-
> > From: Leon Romanovsky 
> > Sent: Wednesday, April 7, 2021 8:51 AM
> > To: Dexuan Cui 
> > Cc: da...@davemloft.net; k...@kernel.org; KY Srinivasan
> > ; Haiyang Zhang ; Stephen
> > Hemminger ; wei@kernel.org; Wei Liu
> > ; net...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; linux-hyp...@vger.kernel.org
> > Subject: Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure
> > Network Adapter (MANA)
> > 
> > On Wed, Apr 07, 2021 at 08:40:13AM +, Dexuan Cui wrote:
> > > > From: Leon Romanovsky 
> > > > Sent: Wednesday, April 7, 2021 1:10 AM
> > > >
> > > > <...>
> > > >
> > > > > +int gdma_verify_vf_version(struct pci_dev *pdev)
> > > > > +{
> > > > > + struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > > + struct gdma_verify_ver_req req = { 0 };
> > > > > + struct gdma_verify_ver_resp resp = { 0 };
> > > > > + int err;
> > > > > +
> > > > > + gdma_init_req_hdr(, GDMA_VERIFY_VF_DRIVER_VERSION,
> > > > > +   sizeof(req), sizeof(resp));
> > > > > +
> > > > > + req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> > > > > + req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> > > > > +
> > > > > + err = gdma_send_request(gc, sizeof(req), , sizeof(resp), 
> > > > > );
> > > > > + if (err || resp.hdr.status) {
> > > > > + pr_err("VfVerifyVersionOutput: %d, status=0x%x\n", err,
> > > > > +resp.hdr.status);
> > > > > + return -EPROTO;
> > > > > + }
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > >
> > > > <...>
> > > > > + err = gdma_verify_vf_version(pdev);
> > > > > + if (err)
> > > > > + goto remove_irq;
> > > >
> > > > Will this VF driver be used in the guest VM? What will prevent from 
> > > > users
> > to
> > > > change it?
> > > > I think that such version negotiation scheme is not allowed.
> > >
> > > Yes, the VF driver is expected to run in a Linux VM that runs on Azure.
> > >
> > > Currently gdma_verify_vf_version() just tells the PF driver that the VF
> > driver
> > > is only able to support GDMA_PROTOCOL_V1, and want to use
> > > GDMA_PROTOCOL_V1's message formats to talk to the PF driver later.
> > >
> > > enum {
> > > GDMA_PROTOCOL_UNDEFINED = 0,
> > > GDMA_PROTOCOL_V1 = 1,
> > > GDMA_PROTOCOL_FIRST = GDMA_PROTOCOL_V1,
> > > GDMA_PROTOCOL_LAST = GDMA_PROTOCOL_V1,
> > > GDMA_PROTOCOL_VALUE_MAX
> > > };
> > >
> > > The PF driver is supposed to always support GDMA_PROTOCOL_V1, so I
> > expect
> > > here gdma_verify_vf_version() should succeed. If a user changes the Linux
> > VF
> > > driver and try to use a protocol version not supported by the PF driver,
> > then
> > > gdma_verify_vf_version() will fail; later, if the VF driver tries to talk 
> > > to the
> > PF
> > > driver using an unsupported message format, the PF driver will return a
> > failure.
> > 
> > The worry is not for the current code, but for the future one when you will
> > support v2, v3 e.t.c. First, your code will look like a spaghetti and
> > second, users will try and mix vX with "unsupported" commands just for the
> > fun.
> 
> In the future, if the protocol version updated on the host side, guests need 
> to support different host versions because not all hosts are updated 
> (simultaneously). So this negotiation is necessary to know the supported 
> version, and decide the proper command version to use. 

And how do other paravirtual drivers solve this negotiation scheme?

> 
> If any user try "unsupported commands just for the fun", the host will deny 
> and return an error.
> 
> Thanks,
> - Haiyang


Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 06:36:01PM +0530, ameynarkhed...@gmail.com wrote:
> On 21/04/07 03:30PM, Leon Romanovsky wrote:
> > On Wed, Apr 07, 2021 at 01:53:56PM +0530, ameynarkhed...@gmail.com wrote:
> > > On 21/04/07 10:23AM, Leon Romanovsky wrote:
> > > > On Tue, Apr 06, 2021 at 08:16:26AM -0600, Alex Williamson wrote:
> > > > > On Sun, 4 Apr 2021 11:04:32 +0300
> > > > > Leon Romanovsky  wrote:
> > > > >
> > > > > > On Thu, Apr 01, 2021 at 10:56:16AM -0600, Alex Williamson wrote:
> > > > > > > On Thu, 1 Apr 2021 15:27:37 +0300
> > > > > > > Leon Romanovsky  wrote:
> > > > > > >
> > > > > > > > On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:
> > > > > > > > > Slot resets are bus resets with additional logic to prevent a 
> > > > > > > > > device
> > > > > > > > > from being removed during the reset. Currently slot and bus 
> > > > > > > > > resets have
> > > > > > > > > separate implementations in pci.c, complicating higher level 
> > > > > > > > > logic. As
> > > > > > > > > discussed on the mailing list, they should be combined into a 
> > > > > > > > > generic
> > > > > > > > > function which performs an SBR. This change adds a function,
> > > > > > > > > pci_reset_bus_function(), which first attempts a slot reset 
> > > > > > > > > and then
> > > > > > > > > attempts a bus reset if -ENOTTY is returned, such that there 
> > > > > > > > > is now a
> > > > > > > > > single device agnostic function to perform an SBR.
> > > > > > > > >
> > > > > > > > > This new function is also needed to add SBR reset quirks and 
> > > > > > > > > therefore
> > > > > > > > > is exposed in pci.h.
> > > > > > > > >
> > > > > > > > > Link: https://lkml.org/lkml/2021/3/23/911
> > > > > > > > >
> > > > > > > > > Suggested-by: Alex Williamson 
> > > > > > > > > Signed-off-by: Amey Narkhede 
> > > > > > > > > Signed-off-by: Raphael Norwitz 
> > > > > > > > > ---
> > > > > > > > >  drivers/pci/pci.c   | 17 +
> > > > > > > > >  include/linux/pci.h |  1 +
> > > > > > > > >  2 files changed, 10 insertions(+), 8 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > > > > > > index 16a17215f633..12a91af2ade4 100644
> > > > > > > > > --- a/drivers/pci/pci.c
> > > > > > > > > +++ b/drivers/pci/pci.c
> > > > > > > > > @@ -4982,6 +4982,13 @@ static int 
> > > > > > > > > pci_dev_reset_slot_function(struct pci_dev *dev, int probe)
> > > > > > > > >   return pci_reset_hotplug_slot(dev->slot->hotplug, 
> > > > > > > > > probe);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > > > > > > > > +{
> > > > > > > > > + int rc = pci_dev_reset_slot_function(dev, probe);
> > > > > > > > > +
> > > > > > > > > + return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, 
> > > > > > > > > probe) : rc;
> > > > > > > >
> > > > > > > > The previous coding style is preferable one in the Linux kernel.
> > > > > > > > int rc = pci_dev_reset_slot_function(dev, probe);
> > > > > > > > if (rc != -ENOTTY)
> > > > > > > >   return rc;
> > > > > > > > return pci_parent_bus_reset(dev, probe);
> > > > > > >
> > > > > > >
> > > > > > > That'd be news to me, do you have a reference?  I've never seen
> > > > > > > complaints for ternaries previously.  Thanks,
> > > > > >
> > > > > > The complaint is not to ternaries, but to the function call as one 
> > > > > > of
> > > > > > the parameters, that makes it harder to read.
> > > > >
> > > > > Sorry, I don't find a function call as a parameter to a ternary to be
> > > > > extraordinary, nor do I find it to be a discouraged usage model within
> > > > > the kernel.  This seems like a pretty low bar for hard to read code.
> > > >
> > > > It is up to us where this bar is set.
> > > >
> > > > Thanks
> > > On the side note there are plenty of places where this pattern is used
> > > though
> > > for example -
> > > kernel/time/clockevents.c:328:
> > > return force ? clockevents_program_min_delta(dev) : -ETIME;
> > >
> > > kernel/trace/trace_kprobe.c:233:
> > > return tk ? within_error_injection_list(trace_kprobe_address(tk)) :
> > >false;
> > >
> > > kernel/signal.c:3104:
> > > return oset ? put_compat_sigset(oset, _set, sizeof(*oset)) : 0;
> > > etc
> >
> > Did you look when they were introduced?
> >
> > Thanks
> >
> that code trace_kprobe in 2 years old.
> If you want more recent example checkout
> drivers/pci/controller/pcie-brcmstb.c:1112,1117:
> return pcie->rescal ? brcm_phy_cntl(pcie, 1) : 0;
> which was introduced 7 months ago.
> There are lot of examples in pci.c also.

Yeah, I know, copy-paste is a powerful tool.

Can we please progress with this patch instead of doing
archaeological research?

Thanks

> 
> Thanks,
> Amey


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 08:40:13AM +, Dexuan Cui wrote:
> > From: Leon Romanovsky 
> > Sent: Wednesday, April 7, 2021 1:10 AM
> > 
> > <...>
> > 
> > > +int gdma_verify_vf_version(struct pci_dev *pdev)
> > > +{
> > > + struct gdma_context *gc = pci_get_drvdata(pdev);
> > > + struct gdma_verify_ver_req req = { 0 };
> > > + struct gdma_verify_ver_resp resp = { 0 };
> > > + int err;
> > > +
> > > + gdma_init_req_hdr(, GDMA_VERIFY_VF_DRIVER_VERSION,
> > > +   sizeof(req), sizeof(resp));
> > > +
> > > + req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> > > + req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> > > +
> > > + err = gdma_send_request(gc, sizeof(req), , sizeof(resp), );
> > > + if (err || resp.hdr.status) {
> > > + pr_err("VfVerifyVersionOutput: %d, status=0x%x\n", err,
> > > +resp.hdr.status);
> > > + return -EPROTO;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > 
> > <...>
> > > + err = gdma_verify_vf_version(pdev);
> > > + if (err)
> > > + goto remove_irq;
> > 
> > Will this VF driver be used in the guest VM? What will prevent from users to
> > change it?
> > I think that such version negotiation scheme is not allowed.
> 
> Yes, the VF driver is expected to run in a Linux VM that runs on Azure.
> 
> Currently gdma_verify_vf_version() just tells the PF driver that the VF driver
> is only able to support GDMA_PROTOCOL_V1, and want to use
> GDMA_PROTOCOL_V1's message formats to talk to the PF driver later.
> 
> enum {
> GDMA_PROTOCOL_UNDEFINED = 0,
> GDMA_PROTOCOL_V1 = 1,
> GDMA_PROTOCOL_FIRST = GDMA_PROTOCOL_V1,
> GDMA_PROTOCOL_LAST = GDMA_PROTOCOL_V1,
> GDMA_PROTOCOL_VALUE_MAX
> };
> 
> The PF driver is supposed to always support GDMA_PROTOCOL_V1, so I expect
> here gdma_verify_vf_version() should succeed. If a user changes the Linux VF
> driver and try to use a protocol version not supported by the PF driver, then
> gdma_verify_vf_version() will fail; later, if the VF driver tries to talk to 
> the PF
> driver using an unsupported message format, the PF driver will return a 
> failure.

The worry is not for the current code, but for the future one when you will
support v2, v3 e.t.c. First, your code will look like a spaghetti and
second, users will try and mix vX with "unsupported" commands just for the
fun.

Thanks

> 
> Thanks,
> -- Dexuan


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 08:28:45AM +, Dexuan Cui wrote:
> > From: Leon Romanovsky 
> > Sent: Wednesday, April 7, 2021 1:15 AM
> > > ...
> > > int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
> > > {
> > > struct gdma_generate_test_event_req req = { 0 };
> > > struct gdma_general_resp resp = { 0 };
> > 
> > BTW, you don't need to write { 0 }, the {} is enough.
>  
> Thanks for the suggestion! I'll use {0} in v2. 

You missed the point, "{ 0 }" change to be "{}" without 0.

Thanks


Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 01:53:56PM +0530, ameynarkhed...@gmail.com wrote:
> On 21/04/07 10:23AM, Leon Romanovsky wrote:
> > On Tue, Apr 06, 2021 at 08:16:26AM -0600, Alex Williamson wrote:
> > > On Sun, 4 Apr 2021 11:04:32 +0300
> > > Leon Romanovsky  wrote:
> > >
> > > > On Thu, Apr 01, 2021 at 10:56:16AM -0600, Alex Williamson wrote:
> > > > > On Thu, 1 Apr 2021 15:27:37 +0300
> > > > > Leon Romanovsky  wrote:
> > > > >
> > > > > > On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:
> > > > > > > Slot resets are bus resets with additional logic to prevent a 
> > > > > > > device
> > > > > > > from being removed during the reset. Currently slot and bus 
> > > > > > > resets have
> > > > > > > separate implementations in pci.c, complicating higher level 
> > > > > > > logic. As
> > > > > > > discussed on the mailing list, they should be combined into a 
> > > > > > > generic
> > > > > > > function which performs an SBR. This change adds a function,
> > > > > > > pci_reset_bus_function(), which first attempts a slot reset and 
> > > > > > > then
> > > > > > > attempts a bus reset if -ENOTTY is returned, such that there is 
> > > > > > > now a
> > > > > > > single device agnostic function to perform an SBR.
> > > > > > >
> > > > > > > This new function is also needed to add SBR reset quirks and 
> > > > > > > therefore
> > > > > > > is exposed in pci.h.
> > > > > > >
> > > > > > > Link: https://lkml.org/lkml/2021/3/23/911
> > > > > > >
> > > > > > > Suggested-by: Alex Williamson 
> > > > > > > Signed-off-by: Amey Narkhede 
> > > > > > > Signed-off-by: Raphael Norwitz 
> > > > > > > ---
> > > > > > >  drivers/pci/pci.c   | 17 +
> > > > > > >  include/linux/pci.h |  1 +
> > > > > > >  2 files changed, 10 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > > > > index 16a17215f633..12a91af2ade4 100644
> > > > > > > --- a/drivers/pci/pci.c
> > > > > > > +++ b/drivers/pci/pci.c
> > > > > > > @@ -4982,6 +4982,13 @@ static int 
> > > > > > > pci_dev_reset_slot_function(struct pci_dev *dev, int probe)
> > > > > > >   return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
> > > > > > >  }
> > > > > > >
> > > > > > > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > > > > > > +{
> > > > > > > + int rc = pci_dev_reset_slot_function(dev, probe);
> > > > > > > +
> > > > > > > + return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc;
> > > > > >
> > > > > > The previous coding style is preferable one in the Linux kernel.
> > > > > > int rc = pci_dev_reset_slot_function(dev, probe);
> > > > > > if (rc != -ENOTTY)
> > > > > >   return rc;
> > > > > > return pci_parent_bus_reset(dev, probe);
> > > > >
> > > > >
> > > > > That'd be news to me, do you have a reference?  I've never seen
> > > > > complaints for ternaries previously.  Thanks,
> > > >
> > > > The complaint is not to ternaries, but to the function call as one of
> > > > the parameters, that makes it harder to read.
> > >
> > > Sorry, I don't find a function call as a parameter to a ternary to be
> > > extraordinary, nor do I find it to be a discouraged usage model within
> > > the kernel.  This seems like a pretty low bar for hard to read code.
> >
> > It is up to us where this bar is set.
> >
> > Thanks
> On the side note there are plenty of places where this pattern is used
> though
> for example -
> kernel/time/clockevents.c:328:
> return force ? clockevents_program_min_delta(dev) : -ETIME;
> 
> kernel/trace/trace_kprobe.c:233:
> return tk ? within_error_injection_list(trace_kprobe_address(tk)) :
>false;
> 
> kernel/signal.c:3104:
> return oset ? put_compat_sigset(oset, _set, sizeof(*oset)) : 0;
> etc

Did you look when they were introduced?

Thanks

> 
> Thanks,
> Amey


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Wed, Apr 07, 2021 at 08:02:17AM +, Dexuan Cui wrote:
> > From: Andrew Lunn 
> > Sent: Tuesday, April 6, 2021 6:08 PM
> > To: Dexuan Cui 
> > 
> > > +static int gdma_query_max_resources(struct pci_dev *pdev)
> > > +{
> > > + struct gdma_context *gc = pci_get_drvdata(pdev);
> > > + struct gdma_general_req req = { 0 };
> > > + struct gdma_query_max_resources_resp resp = { 0 };
> > > + int err;
> > 
> > Network drivers need to use reverse christmas tree. I spotted a number
> > of functions getting this wrong. Please review the whole driver.
> 
> Hi Andrew, thanks for the quick comments!
> 
> I think In general the patch follows the reverse christmas tree style.
> 
> For the function you pointed out, I didn't follow the reverse
> christmas tree style because I wanted to keep the variable defines
> more natural, i.e. first define 'req' and then 'resp'.
> 
> I can swap the 2 lines here, i.e. define 'resp' first, but this looks a little
> strange to me, because in some other functions the 'req' should be
> defined first, e.g. 
> 
> int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
> {
> struct gdma_generate_test_event_req req = { 0 };
> struct gdma_general_resp resp = { 0 };

BTW, you don't need to write { 0 }, the {} is enough.

Thanks


Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Leon Romanovsky
On Tue, Apr 06, 2021 at 04:23:21PM -0700, Dexuan Cui wrote:
> Add a VF driver for Microsoft Azure Network Adapter (MANA) that will be
> available in the future.
> 
> Co-developed-by: Haiyang Zhang 
> Signed-off-by: Haiyang Zhang 
> Signed-off-by: Dexuan Cui 
> ---
>  MAINTAINERS   |4 +-
>  drivers/net/ethernet/Kconfig  |1 +
>  drivers/net/ethernet/Makefile |1 +
>  drivers/net/ethernet/microsoft/Kconfig|   29 +
>  drivers/net/ethernet/microsoft/Makefile   |5 +
>  drivers/net/ethernet/microsoft/mana/Makefile  |6 +
>  drivers/net/ethernet/microsoft/mana/gdma.h|  731 +++
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 1500 +
>  .../net/ethernet/microsoft/mana/hw_channel.c  |  851 
>  .../net/ethernet/microsoft/mana/hw_channel.h  |  181 ++
>  drivers/net/ethernet/microsoft/mana/mana.h|  529 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 1861 +
>  .../ethernet/microsoft/mana/mana_ethtool.c|  276 +++
>  .../net/ethernet/microsoft/mana/shm_channel.c |  290 +++
>  .../net/ethernet/microsoft/mana/shm_channel.h |   19 +
>  15 files changed, 6283 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/microsoft/Kconfig
>  create mode 100644 drivers/net/ethernet/microsoft/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma_main.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_en.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_ethtool.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.h

<...>

> +int gdma_verify_vf_version(struct pci_dev *pdev)
> +{
> + struct gdma_context *gc = pci_get_drvdata(pdev);
> + struct gdma_verify_ver_req req = { 0 };
> + struct gdma_verify_ver_resp resp = { 0 };
> + int err;
> +
> + gdma_init_req_hdr(, GDMA_VERIFY_VF_DRIVER_VERSION,
> +   sizeof(req), sizeof(resp));
> +
> + req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> + req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> +
> + err = gdma_send_request(gc, sizeof(req), , sizeof(resp), );
> + if (err || resp.hdr.status) {
> + pr_err("VfVerifyVersionOutput: %d, status=0x%x\n", err,
> +resp.hdr.status);
> + return -EPROTO;
> + }
> +
> + return 0;
> +}

<...>

> +static int gdma_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> +{
> + struct gdma_context *gc;
> + void __iomem *bar0_va;
> + int bar = 0;
> + int err;
> +
> + err = pci_enable_device(pdev);
> + if (err)
> + return -ENXIO;
> +
> + pci_set_master(pdev);
> +
> + err = pci_request_regions(pdev, "gdma");
> + if (err)
> + goto disable_dev;
> +
> + err = dma_set_mask_and_coherent(>dev, DMA_BIT_MASK(64));
> + if (err)
> + goto release_region;
> +
> + err = -ENOMEM;
> + gc = vzalloc(sizeof(*gc));
> + if (!gc)
> + goto release_region;
> +
> + bar0_va = pci_iomap(pdev, bar, 0);
> + if (!bar0_va)
> + goto free_gc;
> +
> + gc->bar0_va = bar0_va;
> + gc->pci_dev = pdev;
> +
> + pci_set_drvdata(pdev, gc);
> +
> + gdma_init_registers(pdev);
> +
> + shm_channel_init(>shm_channel, gc->shm_base);
> +
> + err = gdma_setup_irqs(pdev);
> + if (err)
> + goto unmap_bar;
> +
> + mutex_init(>eq_test_event_mutex);
> +
> + err = hwc_create_channel(gc);
> + if (err)
> + goto remove_irq;
> +
> + err = gdma_verify_vf_version(pdev);
> + if (err)
> + goto remove_irq;

Will this VF driver be used in the guest VM? What will prevent from users to 
change it? 
I think that such version negotiation scheme is not allowed.

Thanks


Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-07 Thread Leon Romanovsky
On Tue, Apr 06, 2021 at 08:16:26AM -0600, Alex Williamson wrote:
> On Sun, 4 Apr 2021 11:04:32 +0300
> Leon Romanovsky  wrote:
> 
> > On Thu, Apr 01, 2021 at 10:56:16AM -0600, Alex Williamson wrote:
> > > On Thu, 1 Apr 2021 15:27:37 +0300
> > > Leon Romanovsky  wrote:
> > >   
> > > > On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:  
> > > > > Slot resets are bus resets with additional logic to prevent a device
> > > > > from being removed during the reset. Currently slot and bus resets 
> > > > > have
> > > > > separate implementations in pci.c, complicating higher level logic. As
> > > > > discussed on the mailing list, they should be combined into a generic
> > > > > function which performs an SBR. This change adds a function,
> > > > > pci_reset_bus_function(), which first attempts a slot reset and then
> > > > > attempts a bus reset if -ENOTTY is returned, such that there is now a
> > > > > single device agnostic function to perform an SBR.
> > > > > 
> > > > > This new function is also needed to add SBR reset quirks and therefore
> > > > > is exposed in pci.h.
> > > > > 
> > > > > Link: https://lkml.org/lkml/2021/3/23/911
> > > > > 
> > > > > Suggested-by: Alex Williamson 
> > > > > Signed-off-by: Amey Narkhede 
> > > > > Signed-off-by: Raphael Norwitz 
> > > > > ---
> > > > >  drivers/pci/pci.c   | 17 +
> > > > >  include/linux/pci.h |  1 +
> > > > >  2 files changed, 10 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > > index 16a17215f633..12a91af2ade4 100644
> > > > > --- a/drivers/pci/pci.c
> > > > > +++ b/drivers/pci/pci.c
> > > > > @@ -4982,6 +4982,13 @@ static int pci_dev_reset_slot_function(struct 
> > > > > pci_dev *dev, int probe)
> > > > >   return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
> > > > >  }
> > > > >  
> > > > > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > > > > +{
> > > > > + int rc = pci_dev_reset_slot_function(dev, probe);
> > > > > +
> > > > > + return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc; 
> > > > >
> > > > 
> > > > The previous coding style is preferable one in the Linux kernel.
> > > > int rc = pci_dev_reset_slot_function(dev, probe);
> > > > if (rc != -ENOTTY)
> > > >   return rc;
> > > > return pci_parent_bus_reset(dev, probe);  
> > > 
> > > 
> > > That'd be news to me, do you have a reference?  I've never seen
> > > complaints for ternaries previously.  Thanks,  
> > 
> > The complaint is not to ternaries, but to the function call as one of
> > the parameters, that makes it harder to read.
> 
> Sorry, I don't find a function call as a parameter to a ternary to be
> extraordinary, nor do I find it to be a discouraged usage model within
> the kernel.  This seems like a pretty low bar for hard to read code.

It is up to us where this bar is set.

Thanks


Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

2021-04-06 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 02:01:16PM -0400, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> > From: Avihai Horon 
> > 
> > Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> > local_dma_lkey.
> > 
> > This will take effect only for devices that don't pre-allocate the lkey
> > but allocate it per PD allocation.
> > 
> > Signed-off-by: Avihai Horon 
> > Reviewed-by: Michael Guralnik 
> > Signed-off-by: Leon Romanovsky 
> > ---
> >   drivers/infiniband/core/verbs.c  | 3 ++-
> >   drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> >   2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/core/verbs.c 
> > b/drivers/infiniband/core/verbs.c
> > index a1782f8a6ca0..9b719f7d6fd5 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, 
> > unsigned int flags,
> > if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> > pd->local_dma_lkey = device->local_dma_lkey;
> > else
> > -   mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +   mr_access_flags |=
> > +   IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
> 
> So, do local_dma_lkey's get relaxed ordering unconditionally?

Yes, in mlx5, this lkey is created on the fly.

Thanks


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-05 Thread Leon Romanovsky
On Tue, Apr 06, 2021 at 07:27:17AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 08:23:28AM +0300, Leon Romanovsky wrote:
> > The same proposal (enable unconditionally) was raised during
> > submission preparations and we decided to follow same pattern
> > as other verbs objects which receive flag parameter.
> 
> A flags argument can be added when it actually is needed.  Using it
> to pass an argument enabled by all ULPs just gets us back to the bad
> old days of complete crap APIs someone drew up on a whiteboard.

Let's wait till Jason wakes up, before jumping to conclusions.
It was his request to update all ULPs.

> 
> I think we need to:
> 
>  a) document the semantics
>  b) sort out any technical concerns
>  c) just enable the damn thing

Sure

> 
> instead of requiring some form of cargo culting.


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 03:46:18PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:55AM +0300, Leon Romanovsky wrote:
> > From: Avihai Horon 
> > 
> > Add access flags parameter to ib_alloc_mr() and to ib_mr_pool_init(),
> > and refactor relevant code. This parameter is used to pass MR access
> > flags during MR allocation.
> > 
> > In the following patches, the new access flags parameter will be used
> > to enable Relaxed Ordering for ib_alloc_mr() and ib_mr_pool_init() users.
> 
> So this weirds up a new RELAXED_ORDERING flag without ever mentioning
> that flag in the commit log, never mind what it actually does.

We will improve commit messages.

Thanks


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 08:27:16AM -0700, Bart Van Assche wrote:
> On 4/4/21 10:23 PM, Leon Romanovsky wrote:
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index bed4cfe50554..59138174affa 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -2444,10 +2444,10 @@ struct ib_device_ops {
> >struct ib_udata *udata);
> > int (*dereg_mr)(struct ib_mr *mr, struct ib_udata *udata);
> > struct ib_mr *(*alloc_mr)(struct ib_pd *pd, enum ib_mr_type mr_type,
> > - u32 max_num_sg);
> > + u32 max_num_sg, u32 access);
> > struct ib_mr *(*alloc_mr_integrity)(struct ib_pd *pd,
> > u32 max_num_data_sg,
> > -   u32 max_num_meta_sg);
> > +   u32 max_num_meta_sg, u32 access);
> > int (*advise_mr)(struct ib_pd *pd,
> >  enum ib_uverbs_advise_mr_advice advice, u32 flags,
> >  struct ib_sge *sg_list, u32 num_sge,
> > @@ -4142,11 +4142,10 @@ static inline int ib_dereg_mr(struct ib_mr *mr)
> >  }
> >  
> >  struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
> > - u32 max_num_sg);
> > + u32 max_num_sg, u32 access);
> >  
> > -struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
> > -   u32 max_num_data_sg,
> > -   u32 max_num_meta_sg);
> > +struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_data_sg,
> > +   u32 max_num_meta_sg, u32 access);
> >  
> >  /**
> >   * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
> > diff --git a/include/rdma/mr_pool.h b/include/rdma/mr_pool.h
> > index e77123bcb43b..2a0ee791037d 100644
> > --- a/include/rdma/mr_pool.h
> > +++ b/include/rdma/mr_pool.h
> > @@ -11,7 +11,8 @@ struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct 
> > list_head *list);
> >  void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr 
> > *mr);
> >  
> >  int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
> > -   enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg);
> > +   enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
> > +   u32 access);
> >  void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list);
> >  
> >  #endif /* _RDMA_MR_POOL_H */
> 
> Does the new 'access' argument only control whether or not PCIe relaxed
> ordering is enabled? It seems wrong to me to make enabling of PCIe
> relaxed ordering configurable. I think this mechanism should be enabled
> unconditionally if the HCA supports it.

The same proposal (enable unconditionally) was raised during
submission preparations and we decided to follow same pattern
as other verbs objects which receive flag parameter.

Thanks

> 
> Thanks,
> 
> Bart.


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 11:42:31PM +, Chuck Lever III wrote:
> 
> 
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe  wrote:
> > 
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky 
> >>> 
> >>>> From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

I will ask to provide more details.

> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

We don't have such old systems too.

> 
> 
> > AMD dual socket systems are well known to benefit from relaxed
> > ordering, people have been doing this in userspace for a while now
> > with the opt in.
> 
> 
> --
> Chuck Lever
> 
> 
> 


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Leon Romanovsky
On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> 
> Did you test this patchset with CPU does not support relaxed ordering?

I don't think so, the CPUs that don't support RO are Intel's 
fourth/fifth-generation
and they are not interesting from performance point of view.

> 
> We observed significantly performance degradation when run perftest with
> relaxed ordering enabled over old CPU.
> 
> https://github.com/linux-rdma/perftest/issues/116

The perftest is slightly different, but you pointed to the valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
and arguably this was needed to be done in perftest too.

Thanks

> 
> thanks
> 


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> > 
> > >From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

This system is in use by our storage oriented customer who performed the
test. He runs drivers/infiniband/* stack from the upstream, simply backported
to specific kernel version.

The performance boost is seen in other systems too.

> 
> Also if you enable this for basically all kernel ULPs, why not have
> an opt-out into strict ordering for the cases that need it (if there are
> any).

The RO property is optional, it can only improve. In addition, all in-kernel 
ULPs
don't need strict ordering. I can be mistaken here and Jason will correct me, it
is because of two things: ULP doesn't touch data before CQE and DMA API 
prohibits it.

Thanks


Re: [PATCH net 1/2] net: hns3: Remove the left over redundant check & assignment

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 12:26:37PM +, Salil Mehta wrote:
> Hi Leon,
> Thanks for the review.
> 
> > From: Leon Romanovsky [mailto:l...@kernel.org]
> > Sent: Sunday, April 4, 2021 7:26 AM
> > To: Salil Mehta 
> > Cc: da...@davemloft.net; k...@kernel.org; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Linuxarm ;
> > linux...@openeuler.org
> > Subject: Re: [PATCH net 1/2] net: hns3: Remove the left over redundant check
> > & assignment
> > 
> > On Sat, Apr 03, 2021 at 02:35:19AM +0100, Salil Mehta wrote:
> > > This removes the left over check and assignment which is no longer used
> > > anywhere in the function and should have been removed as part of the
> > > below mentioned patch.
> > >
> > > Fixes: 012fcb52f67c ("net: hns3: activate reset timer when calling
> > reset_event")
> > > Signed-off-by: Salil Mehta 
> > > ---
> > >  drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> > b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> > > index e3f81c7e0ce7..7ad0722383f5 100644
> > > --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> > > +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> > > @@ -3976,8 +3976,6 @@ static void hclge_reset_event(struct pci_dev *pdev,
> > struct hnae3_handle *handle)
> > >* want to make sure we throttle the reset request. Therefore, we will
> > >* not allow it again before 3*HZ times.
> > >*/
> > > - if (!handle)
> > > - handle = >vport[0].nic;
> > 
> > The comment above should be updated too, and probably the signature of
> > hclge_reset_event() worth to be changed.
> 
> 
> Yes, true. Both the comment and the prototype will be updated in near future.
> I can assure you this did not go un-noticed during the change. There are
> some internal subtleties which I am trying to sort out. Those might come
> as part of different patch-set which deals with other related changes as well.

I can buy such explanation for the change in function signature, but have hard
time to believe that extra commit is needed to change comment above.

Thanks

> 
> The current change(and some other) will pave the way for necessary refactoring
> Of the code being done.
> 
> 
> > 
> > Thanks
> > 
> > >
> > >   if (time_before(jiffies, (hdev->last_reset_time +
> > > HCLGE_RESET_INTERVAL))) {
> > > --
> > > 2.17.1
> > >


Re: [PATCH rdma-next 1/8] RDMA/core: Check if client supports IB device or not

2021-04-05 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 09:20:32AM +0300, Gal Pressman wrote:
> On 05/04/2021 8:49, Leon Romanovsky wrote:
> > From: Parav Pandit 
> > 
> > RDMA devices are of different transport(iWarp, IB, RoCE) and have
> > different attributes.
> > Not all clients are interested in all type of devices.
> > 
> > Implement a generic callback that each IB client can implement to decide
> > if client add() or remove() should be done by the IB core or not for a
> > given IB device, client combination.
> > 
> > Signed-off-by: Parav Pandit 
> > Signed-off-by: Leon Romanovsky 
> > ---
> >  drivers/infiniband/core/device.c | 3 +++
> >  include/rdma/ib_verbs.h  | 9 +
> >  2 files changed, 12 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/device.c 
> > b/drivers/infiniband/core/device.c
> > index c660cef66ac6..c9af2deba8c1 100644
> > --- a/drivers/infiniband/core/device.c
> > +++ b/drivers/infiniband/core/device.c
> > @@ -691,6 +691,9 @@ static int add_client_context(struct ib_device *device,
> > if (!device->kverbs_provider && !client->no_kverbs_req)
> > return 0;
> >  
> > +   if (client->is_supported && !client->is_supported(device))
> > +   return 0;
> 
> Isn't it better to remove the kverbs_provider flag (from previous if 
> statement)
> and unify it with this generic support check?

I thought about it, but didn't find it worth. The kverbs_provider needs
to be provided by device and all ULPs except uverbs will have the same check.

Thanks


Re: [PATCH] RDMA/addr: potential uninitialized variable in ib_nl_process_good_ip_rsep()

2021-04-04 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 08:41:41AM +0300, Dan Carpenter wrote:
> Could you send that and give me a reported-by?  I'm going AFK for a
> week.

Sure, we will handle it.

Thanks for the report.

> 
> regards,
> dan carpenter
> 


[PATCH rdma-next 8/8] net/rds: Move to client_supported callback

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

Use newly introduced client_supported() callback to avoid client
additional if the RDMA device is not of IB type or if it doesn't
support device memory extensions.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 net/rds/ib.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 24c9a9005a6f..bd2ff7d5a718 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -125,18 +125,23 @@ void rds_ib_dev_put(struct rds_ib_device *rds_ibdev)
queue_work(rds_wq, _ibdev->free_work);
 }
 
-static int rds_ib_add_one(struct ib_device *device)
+static bool rds_client_supported(struct ib_device *device)
 {
-   struct rds_ib_device *rds_ibdev;
-   int ret;
-
/* Only handle IB (no iWARP) devices */
if (device->node_type != RDMA_NODE_IB_CA)
-   return -EOPNOTSUPP;
+   return false;
 
/* Device must support FRWR */
if (!(device->attrs.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
-   return -EOPNOTSUPP;
+   return false;
+
+   return true;
+}
+
+static int rds_ib_add_one(struct ib_device *device)
+{
+   struct rds_ib_device *rds_ibdev;
+   int ret;
 
rds_ibdev = kzalloc_node(sizeof(struct rds_ib_device), GFP_KERNEL,
 ibdev_to_node(device));
@@ -288,7 +293,8 @@ static void rds_ib_remove_one(struct ib_device *device, 
void *client_data)
 struct ib_client rds_ib_client = {
.name   = "rds_ib",
.add= rds_ib_add_one,
-   .remove = rds_ib_remove_one
+   .remove = rds_ib_remove_one,
+   .is_supported = rds_client_supported,
 };
 
 static int rds_ib_conn_info_visitor(struct rds_connection *conn,
-- 
2.30.2



[PATCH rdma-next 4/8] IB/core: Skip device which doesn't have necessary capabilities

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

If device doesn't have multicast capability, avoid client registration
for it. This saves 16Kbytes of memory for a RDMA device consist of 128
ports.

If device doesn't support subnet administration, avoid client
registration for it. This saves 8Kbytes of memory for a RDMA device
consist of 128 ports.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/multicast.c | 15 ++-
 drivers/infiniband/core/sa_query.c  | 15 ++-
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/multicast.c 
b/drivers/infiniband/core/multicast.c
index a5dd4b7a74bc..8c81acc24e3e 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -44,11 +44,13 @@
 
 static int mcast_add_one(struct ib_device *device);
 static void mcast_remove_one(struct ib_device *device, void *client_data);
+static bool mcast_client_supported(struct ib_device *device);
 
 static struct ib_client mcast_client = {
.name   = "ib_multicast",
.add= mcast_add_one,
-   .remove = mcast_remove_one
+   .remove = mcast_remove_one,
+   .is_supported = mcast_client_supported,
 };
 
 static struct ib_sa_client sa_client;
@@ -816,6 +818,17 @@ static void mcast_event_handler(struct ib_event_handler 
*handler,
}
 }
 
+static bool mcast_client_supported(struct ib_device *device)
+{
+   u32 i;
+
+   rdma_for_each_port(device, i) {
+   if (rdma_cap_ib_mcast(device, i))
+   return true;
+   }
+   return false;
+}
+
 static int mcast_add_one(struct ib_device *device)
 {
struct mcast_device *dev;
diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index 9a4a49c37922..7e00e24d9423 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -176,11 +176,13 @@ static const struct nla_policy 
ib_nl_policy[LS_NLA_TYPE_MAX] = {
 
 static int ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device, void *client_data);
+static bool ib_sa_client_supported(struct ib_device *device);
 
 static struct ib_client sa_client = {
.name   = "sa",
.add= ib_sa_add_one,
-   .remove = ib_sa_remove_one
+   .remove = ib_sa_remove_one,
+   .is_supported = ib_sa_client_supported,
 };
 
 static DEFINE_XARRAY_FLAGS(queries, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
@@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler *handler,
}
 }
 
+static bool ib_sa_client_supported(struct ib_device *device)
+{
+   unsigned int i;
+
+   rdma_for_each_port(device, i) {
+   if (rdma_cap_ib_sa(device, i))
+   return true;
+   }
+   return false;
+}
+
 static int ib_sa_add_one(struct ib_device *device)
 {
struct ib_sa_device *sa_dev;
-- 
2.30.2



[PATCH rdma-next 5/8] IB/IPoIB: Skip device which doesn't have InfiniBand port

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

Skip RDMA device which doesn't have InfiniBand ports using newly
introduced client_supported() callback.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 8f769ebaacc6..b02c10dea242 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -93,6 +93,7 @@ static struct net_device *ipoib_get_net_dev_by_params(
struct ib_device *dev, u32 port, u16 pkey,
const union ib_gid *gid, const struct sockaddr *addr,
void *client_data);
+static bool ipoib_client_supported(struct ib_device *device);
 static int ipoib_set_mac(struct net_device *dev, void *addr);
 static int ipoib_ioctl(struct net_device *dev, struct ifreq *ifr,
   int cmd);
@@ -102,6 +103,7 @@ static struct ib_client ipoib_client = {
.add= ipoib_add_one,
.remove = ipoib_remove_one,
.get_net_dev_by_params = ipoib_get_net_dev_by_params,
+   .is_supported = ipoib_client_supported,
 };
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
@@ -2530,6 +2532,17 @@ static struct net_device *ipoib_add_port(const char 
*format,
return ERR_PTR(-ENOMEM);
 }
 
+static bool ipoib_client_supported(struct ib_device *device)
+{
+   u32 i;
+
+   rdma_for_each_port(device, i) {
+   if (rdma_protocol_ib(device, i))
+   return true;
+   }
+   return false;
+}
+
 static int ipoib_add_one(struct ib_device *device)
 {
struct list_head *dev_list;
-- 
2.30.2



[PATCH rdma-next 6/8] IB/opa_vnic: Move to client_supported callback

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

Move to newly introduced client_supported callback
Avoid client registration using newly introduced helper callback if the
IB device doesn't have OPA VNIC capability.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c
index cecf0f7cadf9..58658eba97dd 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c
@@ -121,6 +121,7 @@ static struct ib_client opa_vnic_client = {
.name   = opa_vnic_driver_name,
.add= opa_vnic_vema_add_one,
.remove = opa_vnic_vema_rem_one,
+   .is_supported = rdma_cap_opa_vnic,
 };
 
 /**
@@ -993,9 +994,6 @@ static int opa_vnic_vema_add_one(struct ib_device *device)
struct opa_vnic_ctrl_port *cport;
int rc, size = sizeof(*cport);
 
-   if (!rdma_cap_opa_vnic(device))
-   return -EOPNOTSUPP;
-
size += device->phys_port_cnt * sizeof(struct opa_vnic_vema_port);
cport = kzalloc(size, GFP_KERNEL);
if (!cport)
-- 
2.30.2



[PATCH rdma-next 7/8] net/smc: Move to client_supported callback

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

Use newly introduced client_supported() callback to avoid client
additional if the RDMA device is not of IB type.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 net/smc/smc_ib.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 6b65c5d1f957..f7186d9d1299 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -767,6 +767,11 @@ void smc_ib_ndev_change(struct net_device *ndev, unsigned 
long event)
mutex_unlock(_ib_devices.mutex);
 }
 
+static bool smc_client_supported(struct ib_device *ibdev)
+{
+   return ibdev->node_type == RDMA_NODE_IB_CA;
+}
+
 /* callback function for ib_register_client() */
 static int smc_ib_add_dev(struct ib_device *ibdev)
 {
@@ -774,9 +779,6 @@ static int smc_ib_add_dev(struct ib_device *ibdev)
u8 port_cnt;
int i;
 
-   if (ibdev->node_type != RDMA_NODE_IB_CA)
-   return -EOPNOTSUPP;
-
smcibdev = kzalloc(sizeof(*smcibdev), GFP_KERNEL);
if (!smcibdev)
return -ENOMEM;
@@ -840,6 +842,7 @@ static struct ib_client smc_ib_client = {
.name   = "smc_ib",
.add= smc_ib_add_dev,
.remove = smc_ib_remove_dev,
+   .is_supported = smc_client_supported,
 };
 
 int __init smc_ib_register_client(void)
-- 
2.30.2



[PATCH rdma-next 1/8] RDMA/core: Check if client supports IB device or not

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

RDMA devices are of different transport(iWarp, IB, RoCE) and have
different attributes.
Not all clients are interested in all type of devices.

Implement a generic callback that each IB client can implement to decide
if client add() or remove() should be done by the IB core or not for a
given IB device, client combination.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/device.c | 3 +++
 include/rdma/ib_verbs.h  | 9 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c660cef66ac6..c9af2deba8c1 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -691,6 +691,9 @@ static int add_client_context(struct ib_device *device,
if (!device->kverbs_provider && !client->no_kverbs_req)
return 0;
 
+   if (client->is_supported && !client->is_supported(device))
+   return 0;
+
down_write(>client_data_rwsem);
/*
 * So long as the client is registered hold both the client and device
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 59138174affa..777fbcbd4858 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2756,6 +2756,15 @@ struct ib_client {
const union ib_gid *gid,
const struct sockaddr *addr,
void *client_data);
+   /*
+* Returns if the client is supported for a given device or not.
+* @dev: An RDMA device to check if client can support this RDMA or not.
+*
+* A client that is interested in specific device attributes, should
+* implement it to check if client can be supported for this device or
+* not.
+*/
+   bool (*is_supported)(struct ib_device *dev);
 
refcount_t uses;
struct completion uses_zero;
-- 
2.30.2



[PATCH rdma-next 3/8] IB/cm: Skip device which doesn't support IB CM

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

There are at least 3 types of RDMA devices which do not support IB CM.
They are
(1) A (eswitch) switchdev RDMA device,
(2) iWARP device and
(3) RDMA device without a RoCE capability

Hence, avoid IB CM initialization for such devices.

This saves 8Kbytes of memory for eswitch device consist of 512 ports and
also avoids unnecessary initialization for all above 3 types of devices.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/cm.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 8a7791ebae69..5025f2c1347b 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -87,6 +87,7 @@ struct cm_id_private;
 struct cm_work;
 static int cm_add_one(struct ib_device *device);
 static void cm_remove_one(struct ib_device *device, void *client_data);
+static bool cm_supported(struct ib_device *device);
 static void cm_process_work(struct cm_id_private *cm_id_priv,
struct cm_work *work);
 static int cm_send_sidr_rep_locked(struct cm_id_private *cm_id_priv,
@@ -103,7 +104,8 @@ static int cm_send_rej_locked(struct cm_id_private 
*cm_id_priv,
 static struct ib_client cm_client = {
.name   = "cm",
.add= cm_add_one,
-   .remove = cm_remove_one
+   .remove = cm_remove_one,
+   .is_supported = cm_supported,
 };
 
 static struct ib_cm {
@@ -4371,6 +4373,17 @@ static void cm_remove_port_fs(struct cm_port *port)
 
 }
 
+static bool cm_supported(struct ib_device *device)
+{
+   u32 i;
+
+   rdma_for_each_port(device, i) {
+   if (rdma_cap_ib_cm(device, i))
+   return true;
+   }
+   return false;
+}
+
 static int cm_add_one(struct ib_device *ib_device)
 {
struct cm_device *cm_dev;
-- 
2.30.2



[PATCH rdma-next 2/8] RDMA/cma: Skip device which doesn't support CM

2021-04-04 Thread Leon Romanovsky
From: Parav Pandit 

A switchdev RDMA device do not support IB CM. When such device is added
to the RDMA CM's device list, when application invokes rdma_listen(),
cma attempts to listen to such device, however it has IB CM attribute
disabled.

Due to this, rdma_listen() call fails to listen for other non
switchdev devices as well.

A below error message can be seen.

infiniband mlx5_0: RDMA CMA: cma_listen_on_dev, error -38

A failing call flow is below.

rdma_listen()
  cma_listen_on_all()
cma_listen_on_dev()
  _cma_attach_to_dev()
  rdma_listen() <- fails on a specific switchdev device

Hence, when a IB device doesn't support IB CM or IW CM, avoid adding
such device to the cma list.

Signed-off-by: Parav Pandit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/cma.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 42a1c8955c50..80156faf90de 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -157,11 +157,13 @@ EXPORT_SYMBOL(rdma_res_to_id);
 
 static int cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device, void *client_data);
+static bool cma_supported(struct ib_device *device);
 
 static struct ib_client cma_client = {
.name   = "cma",
.add= cma_add_one,
-   .remove = cma_remove_one
+   .remove = cma_remove_one,
+   .is_supported = cma_supported,
 };
 
 static struct ib_sa_client sa_client;
@@ -4870,6 +4872,17 @@ static void cma_process_remove(struct cma_device 
*cma_dev)
wait_for_completion(_dev->comp);
 }
 
+static bool cma_supported(struct ib_device *device)
+{
+   u32 i;
+
+   rdma_for_each_port(device, i) {
+   if (rdma_cap_ib_cm(device, i) || rdma_cap_iw_cm(device, i))
+   return true;
+   }
+   return false;
+}
+
 static int cma_add_one(struct ib_device *device)
 {
struct rdma_id_private *to_destroy;
-- 
2.30.2



[PATCH rdma-next 0/8] Generalize if ULP supported check

2021-04-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This series adds new callback to check if ib client is supported/not_supported.
Such general callback allows us to save memory footprint by not starting
on devices that not going to work on them anyway.

Thanks

Parav Pandit (8):
  RDMA/core: Check if client supports IB device or not
  RDMA/cma: Skip device which doesn't support CM
  IB/cm: Skip device which doesn't support IB CM
  IB/core: Skip device which doesn't have necessary capabilities
  IB/IPoIB: Skip device which doesn't have InfiniBand port
  IB/opa_vnic: Move to client_supported callback
  net/smc: Move to client_supported callback
  net/rds: Move to client_supported callback

 drivers/infiniband/core/cm.c  | 15 +-
 drivers/infiniband/core/cma.c | 15 +-
 drivers/infiniband/core/device.c  |  3 +++
 drivers/infiniband/core/multicast.c   | 15 +-
 drivers/infiniband/core/sa_query.c| 15 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 13 
 .../infiniband/ulp/opa_vnic/opa_vnic_vema.c   |  4 +---
 include/rdma/ib_verbs.h   |  9 +
 net/rds/ib.c  | 20 ---
 net/smc/smc_ib.c  |  9 ++---
 10 files changed, 101 insertions(+), 17 deletions(-)

-- 
2.30.2



[PATCH rdma-next 10/10] xprtrdma: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for xprtrdma.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 net/sunrpc/xprtrdma/frwr_ops.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cfbdd197cdfe..f9334c0a1a13 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -135,7 +135,8 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr *mr)
struct ib_mr *frmr;
int rc;
 
-   frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth, 0);
+   frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth,
+  IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr))
goto out_mr_err;
 
@@ -339,9 +340,10 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt 
*r_xprt,
reg_wr = >frwr.fr_regwr;
reg_wr->mr = ibmr;
reg_wr->key = ibmr->rkey;
-   reg_wr->access = writing ?
-IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-IB_ACCESS_REMOTE_READ;
+   reg_wr->access =
+   (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+  IB_ACCESS_REMOTE_READ) |
+   IB_ACCESS_RELAXED_ORDERING;
 
mr->mr_handle = ibmr->rkey;
mr->mr_length = ibmr->length;
-- 
2.30.2



[PATCH rdma-next 09/10] net/smc: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for smc.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 net/smc/smc_ib.c | 3 ++-
 net/smc/smc_wr.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 4e91ed3dc265..6b65c5d1f957 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -579,7 +579,8 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int 
access_flags,
return 0; /* already done */
 
buf_slot->mr_rx[link_idx] =
-   ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order, 0);
+   ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order,
+   IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(buf_slot->mr_rx[link_idx])) {
int rc;
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index cbc73a7e4d59..78e9650621f1 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -555,7 +555,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
lnk->wr_reg.wr.num_sge = 0;
lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
lnk->wr_reg.wr.opcode = IB_WR_REG_MR;
-   lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
+   lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_RELAXED_ORDERING;
 }
 
 void smc_wr_free_link(struct smc_link *lnk)
-- 
2.30.2



[PATCH rdma-next 08/10] net/rds: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for rds.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 net/rds/ib_frmr.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 694eb916319e..1a60c2c90c78 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,7 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct 
rds_ib_device *rds_ibdev,
 
frmr = >u.frmr;
frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
-  pool->max_pages, 0);
+  pool->max_pages, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr->mr)) {
pr_warn("RDS/IB: %s failed to allocate MR", __func__);
err = PTR_ERR(frmr->mr);
@@ -156,9 +156,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->rkey;
-   reg_wr.access = IB_ACCESS_LOCAL_WRITE |
-   IB_ACCESS_REMOTE_READ |
-   IB_ACCESS_REMOTE_WRITE;
+   reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
reg_wr.wr.send_flags = IB_SEND_SIGNALED;
 
ret = ib_post_send(ibmr->ic->i_cm_id->qp, _wr.wr, NULL);
-- 
2.30.2



[PATCH rdma-next 06/10] nvme-rdma: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for nvme.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/nvme/host/rdma.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4dbc17311e0b..8f106b20b39c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -532,9 +532,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue 
*queue)
 */
pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
ret = ib_mr_pool_init(queue->qp, >qp->rdma_mrs,
- queue->queue_size,
- IB_MR_TYPE_MEM_REG,
- pages_per_mr, 0, 0);
+ queue->queue_size, IB_MR_TYPE_MEM_REG,
+ pages_per_mr, 0, IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize MR pool sized %d for QID %d\n",
@@ -545,7 +544,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue 
*queue)
if (queue->pi_support) {
ret = ib_mr_pool_init(queue->qp, >qp->sig_mrs,
  queue->queue_size, IB_MR_TYPE_INTEGRITY,
- pages_per_mr, pages_per_mr, 0);
+ pages_per_mr, pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize PI MR pool sized %d for 
QID %d\n",
@@ -1382,9 +1382,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue 
*queue,
req->reg_wr.wr.num_sge = 0;
req->reg_wr.mr = req->mr;
req->reg_wr.key = req->mr->rkey;
-   req->reg_wr.access = IB_ACCESS_LOCAL_WRITE |
-IB_ACCESS_REMOTE_READ |
-IB_ACCESS_REMOTE_WRITE;
+   req->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_RELAXED_ORDERING;
 
sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
@@ -1488,9 +1488,8 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue 
*queue,
wr->wr.send_flags = 0;
wr->mr = req->mr;
wr->key = req->mr->rkey;
-   wr->access = IB_ACCESS_LOCAL_WRITE |
-IB_ACCESS_REMOTE_READ |
-IB_ACCESS_REMOTE_WRITE;
+   wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
 
sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
-- 
2.30.2



[PATCH rdma-next 07/10] cifs: smbd: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for smbd.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 fs/cifs/smbdirect.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 647098a5cf3b..1e86dc8bbe85 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2178,8 +2178,10 @@ static void smbd_mr_recovery_work(struct work_struct 
*work)
continue;
}
 
-   smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
-  info->max_frmr_depth, 0);
+   smbdirect_mr->mr =
+   ib_alloc_mr(info->pd, info->mr_type,
+   info->max_frmr_depth,
+   IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x 
max_frmr_depth=%x\n",
info->mr_type,
@@ -2244,7 +2246,8 @@ static int allocate_mr_list(struct smbd_connection *info)
if (!smbdirect_mr)
goto out;
smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
-  info->max_frmr_depth, 0);
+  info->max_frmr_depth,
+  IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x 
max_frmr_depth=%x\n",
info->mr_type, info->max_frmr_depth);
@@ -2406,9 +2409,10 @@ struct smbd_mr *smbd_register_mr(
reg_wr->wr.send_flags = IB_SEND_SIGNALED;
reg_wr->mr = smbdirect_mr->mr;
reg_wr->key = smbdirect_mr->mr->rkey;
-   reg_wr->access = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
+   reg_wr->access =
+   (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+  IB_ACCESS_REMOTE_READ) |
+   IB_ACCESS_RELAXED_ORDERING;
 
/*
 * There is no need for waiting for complemtion on ib_post_send
-- 
2.30.2



[PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
local_dma_lkey.

This will take effect only for devices that don't pre-allocate the lkey
but allocate it per PD allocation.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/verbs.c  | 3 ++-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a1782f8a6ca0..9b719f7d6fd5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, 
unsigned int flags,
if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
pd->local_dma_lkey = device->local_dma_lkey;
else
-   mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
+   mr_access_flags |=
+   IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
 
if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
pr_warn("%s: enabling unsafe global rkey\n", caller);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c 
b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index b3fa783698a0..d74827694f92 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
int ret;
 
/* Support only LOCAL_WRITE flag for DMA MRs */
+   acc &= ~IB_ACCESS_RELAXED_ORDERING;
if (acc & ~IB_ACCESS_LOCAL_WRITE) {
dev_warn(>pdev->dev,
 "unsupported dma mr access flags %#x\n", acc);
-- 
2.30.2



[PATCH rdma-next 05/10] RDMA/srp: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for srp.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 8481ad769ba4..0026660c3ead 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -436,7 +436,8 @@ static struct srp_fr_pool *srp_create_fr_pool(struct 
ib_device *device,
mr_type = IB_MR_TYPE_MEM_REG;
 
for (i = 0, d = >desc[0]; i < pool->size; i++, d++) {
-   mr = ib_alloc_mr(pd, mr_type, max_page_list_len, 0);
+   mr = ib_alloc_mr(pd, mr_type, max_page_list_len,
+IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(mr)) {
ret = PTR_ERR(mr);
if (ret == -ENOMEM)
@@ -1487,9 +1488,8 @@ static int srp_map_finish_fr(struct srp_map_state *state,
wr.wr.send_flags = 0;
wr.mr = desc->mr;
wr.key = desc->mr->rkey;
-   wr.access = (IB_ACCESS_LOCAL_WRITE |
-IB_ACCESS_REMOTE_READ |
-IB_ACCESS_REMOTE_WRITE);
+   wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING);
 
*state->fr.next++ = desc;
state->nmdesc++;
-- 
2.30.2



[PATCH rdma-next 04/10] RDMA/rtrs: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering fro rtrs client and server.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/rtrs/rtrs-clt.c |  6 --
 drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c 
b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 0d3960ed5b2b..a3fbb47a3574 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1099,7 +1099,8 @@ static int rtrs_clt_read_req(struct rtrs_clt_io_req *req)
.mr = req->mr,
.key = req->mr->rkey,
.access = (IB_ACCESS_LOCAL_WRITE |
-  IB_ACCESS_REMOTE_WRITE),
+  IB_ACCESS_REMOTE_WRITE |
+  IB_ACCESS_RELAXED_ORDERING),
};
wr = 
 
@@ -1260,7 +1261,8 @@ static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
goto out;
 
req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
- sess->max_pages_per_mr, 0);
+ sess->max_pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(req->mr)) {
err = PTR_ERR(req->mr);
req->mr = NULL;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c 
b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 575f31ff20fd..c28ed5e2245d 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -312,8 +312,8 @@ static int rdma_write_sg(struct rtrs_srv_op *id)
rwr.mr = srv_mr->mr;
rwr.wr.send_flags = 0;
rwr.key = srv_mr->mr->rkey;
-   rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+   rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -432,8 +432,8 @@ static int send_io_resp_imm(struct rtrs_srv_con *con, 
struct rtrs_srv_op *id,
rwr.wr.send_flags = 0;
rwr.mr = srv_mr->mr;
rwr.key = srv_mr->mr->rkey;
-   rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+   rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -638,7 +638,7 @@ static int map_cont_bufs(struct rtrs_srv_sess *sess)
goto free_sg;
}
mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
-sgt->nents, 0);
+sgt->nents, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(mr)) {
err = PTR_ERR(mr);
goto unmap_sg;
@@ -823,8 +823,9 @@ static int process_info_req(struct rtrs_srv_con *con,
rwr[mri].wr.send_flags = 0;
rwr[mri].mr = mr;
rwr[mri].key = mr->rkey;
-   rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
-  IB_ACCESS_REMOTE_WRITE);
+   rwr[mri].access =
+   (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_RELAXED_ORDERING);
reg_wr = [mri].wr;
}
 
-- 
2.30.2



[PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Add access flags parameter to ib_alloc_mr() and to ib_mr_pool_init(),
and refactor relevant code. This parameter is used to pass MR access
flags during MR allocation.

In the following patches, the new access flags parameter will be used
to enable Relaxed Ordering for ib_alloc_mr() and ib_mr_pool_init() users.

Signed-off-by: Avihai Horon 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/mr_pool.c |  7 +-
 drivers/infiniband/core/rw.c  | 12 ++--
 drivers/infiniband/core/verbs.c   | 23 +--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c  |  2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h  |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h|  2 +-
 drivers/infiniband/hw/cxgb4/mem.c |  2 +-
 drivers/infiniband/hw/hns/hns_roce_device.h   |  2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c   |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c |  3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |  2 +-
 drivers/infiniband/hw/mlx4/mr.c   |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c   | 61 
 drivers/infiniband/hw/mlx5/wr.c   | 69 ++-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
 drivers/infiniband/hw/qedr/verbs.c|  2 +-
 drivers/infiniband/hw/qedr/verbs.h|  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  3 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c |  3 +-
 drivers/infiniband/sw/rdmavt/mr.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.c |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.h |  2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c  |  4 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c|  2 +-
 drivers/infiniband/ulp/rtrs/rtrs-srv.c|  2 +-
 drivers/infiniband/ulp/srp/ib_srp.c   |  2 +-
 drivers/nvme/host/rdma.c  |  4 +-
 fs/cifs/smbdirect.c   |  7 +-
 include/rdma/ib_verbs.h   | 11 ++-
 include/rdma/mr_pool.h|  3 +-
 net/rds/ib_frmr.c |  2 +-
 net/smc/smc_ib.c  |  2 +-
 net/sunrpc/xprtrdma/frwr_ops.c|  2 +-
 37 files changed, 163 insertions(+), 105 deletions(-)

diff --git a/drivers/infiniband/core/mr_pool.c 
b/drivers/infiniband/core/mr_pool.c
index c0e2df128b34..b869c3487475 100644
--- a/drivers/infiniband/core/mr_pool.c
+++ b/drivers/infiniband/core/mr_pool.c
@@ -34,7 +34,8 @@ void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, 
struct ib_mr *mr)
 EXPORT_SYMBOL(ib_mr_pool_put);
 
 int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
-   enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg)
+   enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
+   u32 access)
 {
struct ib_mr *mr;
unsigned long flags;
@@ -43,9 +44,9 @@ int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, 
int nr,
for (i = 0; i < nr; i++) {
if (type == IB_MR_TYPE_INTEGRITY)
mr = ib_alloc_mr_integrity(qp->pd, max_num_sg,
-  max_num_meta_sg);
+  max_num_meta_sg, access);
else
-   mr = ib_alloc_mr(qp->pd, type, max_num_sg);
+   mr = ib_alloc_mr(qp->pd, type, max_num_sg, access);
if (IS_ERR(mr)) {
ret = PTR_ERR(mr);
goto out;
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index a588c2038479..d5a0038e82a4 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -110,7 +110,7 @@ static int rdma_rw_init_one_mr(struct ib_qp *qp, u32 
port_num,
 
reg->reg_wr.wr.opcode = IB_WR_REG_MR;
reg->reg_wr.mr = reg->mr;
-   reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+   reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
if (rdma_protocol_iwarp(qp->device, port_num))
reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
count++;
@@ -437,7 +437,8 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
ctx->reg->reg_wr.wr.wr_cqe = NULL;
ctx->reg->reg_wr.wr.num_sge = 0;
ctx->reg->reg_wr.wr.send_flags = 0;
-   ctx->reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+   ctx->reg->reg_wr.access =
+   IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
if (rdma

[PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering

2021-04-04 Thread Leon Romanovsky
From: Avihai Horon 

Enable Relaxed Ordering for iser.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/iser/iser_memory.c | 10 --
 drivers/infiniband/ulp/iser/iser_verbs.c  |  6 --
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index afec40da9b58..29849c9e82e8 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -271,9 +271,8 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
wr->wr.send_flags = 0;
wr->mr = mr;
wr->key = mr->rkey;
-   wr->access = IB_ACCESS_LOCAL_WRITE |
-IB_ACCESS_REMOTE_READ |
-IB_ACCESS_REMOTE_WRITE;
+   wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
rsc->mr_valid = 1;
 
sig_reg->sge.lkey = mr->lkey;
@@ -318,9 +317,8 @@ static int iser_fast_reg_mr(struct iscsi_iser_task 
*iser_task,
wr->wr.num_sge = 0;
wr->mr = mr;
wr->key = mr->rkey;
-   wr->access = IB_ACCESS_LOCAL_WRITE  |
-IB_ACCESS_REMOTE_WRITE |
-IB_ACCESS_REMOTE_READ;
+   wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_REMOTE_READ | IB_ACCESS_RELAXED_ORDERING;
 
rsc->mr_valid = 1;
 
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3c370ee25f2f..1e236b1cf29b 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -121,7 +121,8 @@ iser_create_fastreg_desc(struct iser_device *device,
else
mr_type = IB_MR_TYPE_MEM_REG;
 
-   desc->rsc.mr = ib_alloc_mr(pd, mr_type, size, 0);
+   desc->rsc.mr = ib_alloc_mr(pd, mr_type, size,
+   IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.mr)) {
ret = PTR_ERR(desc->rsc.mr);
iser_err("Failed to allocate ib_fast_reg_mr err=%d\n", ret);
@@ -129,7 +130,8 @@ iser_create_fastreg_desc(struct iser_device *device,
}
 
if (pi_enable) {
-   desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size, 0);
+   desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size,
+   IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.sig_mr)) {
ret = PTR_ERR(desc->rsc.sig_mr);
iser_err("Failed to allocate sig_mr err=%d\n", ret);
-- 
2.30.2



[PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-04 Thread Leon Romanovsky
From: Leon Romanovsky 

>From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on a NVIDIA A100 in order
to check performance of storage infrastructure over xprtrdma:

Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec

With relaxed ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec

Thanks

Avihai Horon (10):
  RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  RDMA/iser: Enable Relaxed Ordering
  RDMA/rtrs: Enable Relaxed Ordering
  RDMA/srp: Enable Relaxed Ordering
  nvme-rdma: Enable Relaxed Ordering
  cifs: smbd: Enable Relaxed Ordering
  net/rds: Enable Relaxed Ordering
  net/smc: Enable Relaxed Ordering
  xprtrdma: Enable Relaxed Ordering

 drivers/infiniband/core/mr_pool.c |  7 +-
 drivers/infiniband/core/rw.c  | 12 ++--
 drivers/infiniband/core/verbs.c   | 26 +--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c  |  2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h  |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h|  2 +-
 drivers/infiniband/hw/cxgb4/mem.c |  2 +-
 drivers/infiniband/hw/hns/hns_roce_device.h   |  2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c   |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c |  3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |  2 +-
 drivers/infiniband/hw/mlx4/mr.c   |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c   | 61 
 drivers/infiniband/hw/mlx5/wr.c   | 69 ++-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
 drivers/infiniband/hw/qedr/verbs.c|  2 +-
 drivers/infiniband/hw/qedr/verbs.h|  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  4 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c |  3 +-
 drivers/infiniband/sw/rdmavt/mr.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.c |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.h |  2 +-
 drivers/infiniband/ulp/iser/iser_memory.c | 10 ++-
 drivers/infiniband/ulp/iser/iser_verbs.c  |  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c|  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-srv.c| 15 ++--
 drivers/infiniband/ulp/srp/ib_srp.c   |  8 +--
 drivers/nvme/host/rdma.c  | 19 +++--
 fs/cifs/smbdirect.c   | 17 +++--
 include/rdma/ib_verbs.h   | 11 ++-
 include/rdma/mr_pool.h|  3 +-
 net/rds/ib_frmr.c |  7 +-
 net/smc/smc_ib.c  |  3 +-
 net/smc/smc_wr.c  |  3 +-
 net/sunrpc/xprtrdma/frwr_ops.c| 10 +--
 39 files changed, 209 insertions(+), 140 deletions(-)

-- 
2.30.2



Re: [PATCH] net/mlx5: fix kfree mismatch in indir_table.c

2021-04-04 Thread Leon Romanovsky
On Mon, Apr 05, 2021 at 10:53:39AM +0800, Xiaoming Ni wrote:
> Memory allocated by kvzalloc() should be freed by kvfree().
> 
> Fixes: 34ca65352ddf2 ("net/mlx5: E-Switch, Indirect table infrastructur")
> Signed-off-by: Xiaoming Ni 
> ---
>  .../net/ethernet/mellanox/mlx5/core/esw/indir_table.c  | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH] RDMA/addr: potential uninitialized variable in ib_nl_process_good_ip_rsep()

2021-04-04 Thread Leon Romanovsky
On Sun, Apr 04, 2021 at 04:13:17PM +0300, Mark Bloch wrote:
> On 4/4/21 1:33 PM, Leon Romanovsky wrote:
> > On Fri, Apr 02, 2021 at 02:47:23PM +0300, Dan Carpenter wrote:
> >> The nla_len() is less than or equal to 16.  If it's less than 16 then
> >> end of the "gid" buffer is uninitialized.
> >>
> >> Fixes: ae43f8286730 ("IB/core: Add IP to GID netlink offload")
> >> Signed-off-by: Dan Carpenter 
> >> ---
> >> I just spotted this in review.  I think it's a bug but I'm not 100%.
> > 
> > I tend to agree with you, that it is a bug.
> > 
> > LS_NLA_TYPE_DGID is declared as NLA_BINARY which doesn't complain if
> > data is less than declared ".len". However, the fix needs to be in
> > ib_nl_is_good_ip_resp(), it shouldn't return "true" if length not equal
> > to 16.
> 
> What about just updating the policy? The bellow diff should work I believe.

I didn't know about ".validation_type", but yes this change will be enough.

> 
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index 0abce004a959..65e3e7df8a4b 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -76,7 +76,9 @@ static struct workqueue_struct *addr_wq;
>  
>  static const struct nla_policy ib_nl_addr_policy[LS_NLA_TYPE_MAX] = {
> [LS_NLA_TYPE_DGID] = {.type = NLA_BINARY,
> -   .len = sizeof(struct rdma_nla_ls_gid)},
> +   .len = sizeof(struct rdma_nla_ls_gid),
> +   .validation_type = NLA_VALIDATE_MIN,
> +   .min = sizeof(struct rdma_nla_ls_gid)},
>  };
>  
>  static inline bool ib_nl_is_good_ip_resp(const struct nlmsghdr *nlh)
> 
> > 
> > Thanks
> > 
> 
> Mark


Re: [PATCH v3] IB/mlx5: Reduce max order of memory allocated for xlt update

2021-04-04 Thread Leon Romanovsky
On Sat, Apr 03, 2021 at 04:53:55AM +, Praveen Kumar Kannoju wrote:
> To update xlt (during mlx5_ib_reg_user_mr()), the driver can request up to
> 1 MB (order-8) memory, depending on the size of the MR. This costly
> allocation can sometimes take very long to return (a few seconds). This
> causes the calling application to hang for a long time, especially when the
> system is fragmented.  To avoid these long latency spikes, the calls the
> higher order allocations need to fail faster in case they are not
> available. In order to acheive this we need __GFP_NORETRY flag in the
> gfp_mask before during fetching the free pages. This patch adds this flag
> to the mask.
> 
> Signed-off-by: Praveen Kumar Kannoju 
> ---
>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Thanks a lot,
Acked-by: Leon Romanovsky 


Re: [PATCH] RDMA/addr: potential uninitialized variable in ib_nl_process_good_ip_rsep()

2021-04-04 Thread Leon Romanovsky
On Fri, Apr 02, 2021 at 02:47:23PM +0300, Dan Carpenter wrote:
> The nla_len() is less than or equal to 16.  If it's less than 16 then
> end of the "gid" buffer is uninitialized.
> 
> Fixes: ae43f8286730 ("IB/core: Add IP to GID netlink offload")
> Signed-off-by: Dan Carpenter 
> ---
> I just spotted this in review.  I think it's a bug but I'm not 100%.

I tend to agree with you, that it is a bug.

LS_NLA_TYPE_DGID is declared as NLA_BINARY which doesn't complain if
data is less than declared ".len". However, the fix needs to be in
ib_nl_is_good_ip_resp(), it shouldn't return "true" if length not equal
to 16.

Thanks


Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-04 Thread Leon Romanovsky
On Thu, Apr 01, 2021 at 10:56:16AM -0600, Alex Williamson wrote:
> On Thu, 1 Apr 2021 15:27:37 +0300
> Leon Romanovsky  wrote:
> 
> > On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:
> > > Slot resets are bus resets with additional logic to prevent a device
> > > from being removed during the reset. Currently slot and bus resets have
> > > separate implementations in pci.c, complicating higher level logic. As
> > > discussed on the mailing list, they should be combined into a generic
> > > function which performs an SBR. This change adds a function,
> > > pci_reset_bus_function(), which first attempts a slot reset and then
> > > attempts a bus reset if -ENOTTY is returned, such that there is now a
> > > single device agnostic function to perform an SBR.
> > > 
> > > This new function is also needed to add SBR reset quirks and therefore
> > > is exposed in pci.h.
> > > 
> > > Link: https://lkml.org/lkml/2021/3/23/911
> > > 
> > > Suggested-by: Alex Williamson 
> > > Signed-off-by: Amey Narkhede 
> > > Signed-off-by: Raphael Norwitz 
> > > ---
> > >  drivers/pci/pci.c   | 17 +
> > >  include/linux/pci.h |  1 +
> > >  2 files changed, 10 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index 16a17215f633..12a91af2ade4 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -4982,6 +4982,13 @@ static int pci_dev_reset_slot_function(struct 
> > > pci_dev *dev, int probe)
> > >   return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
> > >  }
> > >  
> > > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > > +{
> > > + int rc = pci_dev_reset_slot_function(dev, probe);
> > > +
> > > + return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc;  
> > 
> > The previous coding style is preferable one in the Linux kernel.
> > int rc = pci_dev_reset_slot_function(dev, probe);
> > if (rc != -ENOTTY)
> >   return rc;
> > return pci_parent_bus_reset(dev, probe);
> 
> 
> That'd be news to me, do you have a reference?  I've never seen
> complaints for ternaries previously.  Thanks,

The complaint is not to ternaries, but to the function call as one of
the parameters, that makes it harder to read.

Thanks

> 
> Alex
> 


Re: [PATCH net 1/2] net: hns3: Remove the left over redundant check & assignment

2021-04-04 Thread Leon Romanovsky
On Sat, Apr 03, 2021 at 02:35:19AM +0100, Salil Mehta wrote:
> This removes the left over check and assignment which is no longer used
> anywhere in the function and should have been removed as part of the
> below mentioned patch.
> 
> Fixes: 012fcb52f67c ("net: hns3: activate reset timer when calling 
> reset_event")
> Signed-off-by: Salil Mehta 
> ---
>  drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
> b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> index e3f81c7e0ce7..7ad0722383f5 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -3976,8 +3976,6 @@ static void hclge_reset_event(struct pci_dev *pdev, 
> struct hnae3_handle *handle)
>* want to make sure we throttle the reset request. Therefore, we will
>* not allow it again before 3*HZ times.
>*/
> - if (!handle)
> - handle = >vport[0].nic;

The comment above should be updated too, and probably the signature of
hclge_reset_event() worth to be changed.

Thanks

>  
>   if (time_before(jiffies, (hdev->last_reset_time +
> HCLGE_RESET_INTERVAL))) {
> -- 
> 2.17.1
> 


Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-01 Thread Leon Romanovsky
On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:
> Slot resets are bus resets with additional logic to prevent a device
> from being removed during the reset. Currently slot and bus resets have
> separate implementations in pci.c, complicating higher level logic. As
> discussed on the mailing list, they should be combined into a generic
> function which performs an SBR. This change adds a function,
> pci_reset_bus_function(), which first attempts a slot reset and then
> attempts a bus reset if -ENOTTY is returned, such that there is now a
> single device agnostic function to perform an SBR.
> 
> This new function is also needed to add SBR reset quirks and therefore
> is exposed in pci.h.
> 
> Link: https://lkml.org/lkml/2021/3/23/911
> 
> Suggested-by: Alex Williamson 
> Signed-off-by: Amey Narkhede 
> Signed-off-by: Raphael Norwitz 
> ---
>  drivers/pci/pci.c   | 17 +
>  include/linux/pci.h |  1 +
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 16a17215f633..12a91af2ade4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4982,6 +4982,13 @@ static int pci_dev_reset_slot_function(struct pci_dev 
> *dev, int probe)
>   return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
>  }
>  
> +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> +{
> + int rc = pci_dev_reset_slot_function(dev, probe);
> +
> + return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc;

The previous coding style is preferable one in the Linux kernel.
int rc = pci_dev_reset_slot_function(dev, probe);
if (rc != -ENOTTY)
  return rc;
return pci_parent_bus_reset(dev, probe);


> +}
> +
>  static void pci_dev_lock(struct pci_dev *dev)
>  {
>   pci_cfg_access_lock(dev);
> @@ -5102,10 +5109,7 @@ int __pci_reset_function_locked(struct pci_dev *dev)
>   rc = pci_pm_reset(dev, 0);
>   if (rc != -ENOTTY)
>   return rc;
> - rc = pci_dev_reset_slot_function(dev, 0);
> - if (rc != -ENOTTY)
> - return rc;
> - return pci_parent_bus_reset(dev, 0);
> + return pci_reset_bus_function(dev, 0);
>  }
>  EXPORT_SYMBOL_GPL(__pci_reset_function_locked);
>  
> @@ -5135,13 +5139,10 @@ int pci_probe_reset_function(struct pci_dev *dev)
>   if (rc != -ENOTTY)
>   return rc;
>   rc = pci_pm_reset(dev, 1);
> - if (rc != -ENOTTY)
> - return rc;
> - rc = pci_dev_reset_slot_function(dev, 1);
>   if (rc != -ENOTTY)
>   return rc;
>  
> - return pci_parent_bus_reset(dev, 1);
> + return pci_reset_bus_function(dev, 1);
>  }
>  
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 86c799c97b77..979d54335ac1 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1228,6 +1228,7 @@ int pci_probe_reset_bus(struct pci_bus *bus);
>  int pci_reset_bus(struct pci_dev *dev);
>  void pci_reset_secondary_bus(struct pci_dev *dev);
>  void pcibios_reset_secondary_bus(struct pci_dev *dev);
> +int pci_reset_bus_function(struct pci_dev *dev, int probe);
>  void pci_update_resource(struct pci_dev *dev, int resno);
>  int __must_check pci_assign_resource(struct pci_dev *dev, int i);
>  int __must_check pci_reassign_resource(struct pci_dev *dev, int i, 
> resource_size_t add_size, resource_size_t align);
> -- 
> 2.20.1


Re: [PATCH -next] RDMA/uverbs: Fix -Wunused-function warning

2021-04-01 Thread Leon Romanovsky
On Thu, Apr 01, 2021 at 10:10:28AM +0800, YueHaibing wrote:
> make W=1 warns this:
> 
> In file included from drivers/infiniband/sw/rdmavt/mmap.c:51:0:
> ./include/rdma/uverbs_ioctl.h:937:1:
>  warning: ‘_uverbs_get_const_unsigned’ defined but not used 
> [-Wunused-function]
>  _uverbs_get_const_unsigned(u64 *to,
>  ^~
> ./include/rdma/uverbs_ioctl.h:930:1:
>  warning: ‘_uverbs_get_const_signed’ defined but not used [-Wunused-function]
>  _uverbs_get_const_signed(s64 *to, const struct uverbs_attr_bundle 
> *attrs_bundle,
>  ^~~~
> 
> Make these functions inline to fix this warnings.
> 
> Signed-off-by: YueHaibing 
> ---
>  include/rdma/uverbs_ioctl.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH for-next v3 1/2] IB/cma: Introduce rdma_set_min_rnr_timer()

2021-04-01 Thread Leon Romanovsky
On Wed, Mar 31, 2021 at 08:43:13PM +0200, Håkon Bugge wrote:
> Introduce the ability for kernel ULPs to adjust the minimum RNR Retry
> timer. The INIT -> RTR transition executed by RDMA CM will be used for
> this adjustment. This avoids an additional ib_modify_qp() call.
> 
> rdma_set_min_rnr_timer() must be called before the call to
> rdma_connect() on the active side and before the call to rdma_accept()
> on the passive side.
> 
> The default value of RNR Retry timer is zero, which translates to 655
> ms. When the receiver is not ready to accept a send messages, it
> encodes the RNR Retry timer value in the NAK. The requestor will then
> wait at least the specified time value before retrying the send.
> 
> The 5-bit value to be supplied to the rdma_set_min_rnr_timer() is
> documented in IBTA Table 45: "Encoding for RNR NAK Timer Field".
> 
> Signed-off-by: Håkon Bugge 
> Acked-by: Jason Gunthorpe 
> ---
>  drivers/infiniband/core/cma.c  | 41 
> ++
>  drivers/infiniband/core/cma_priv.h |  2 ++
>  include/rdma/rdma_cm.h |  2 ++
>  3 files changed, 45 insertions(+)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 9409651..5ce097d 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -852,6 +852,7 @@ static void cma_id_put(struct rdma_id_private *id_priv)
>   id_priv->id.qp_type = qp_type;
>   id_priv->tos_set = false;
>   id_priv->timeout_set = false;
> + id_priv->min_rnr_timer_set = false;
>   id_priv->gid_type = IB_GID_TYPE_IB;
>   spin_lock_init(_priv->lock);
>   mutex_init(_priv->qp_mutex);
> @@ -1141,6 +1142,9 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct 
> ib_qp_attr *qp_attr,
>   if ((*qp_attr_mask & IB_QP_TIMEOUT) && id_priv->timeout_set)
>   qp_attr->timeout = id_priv->timeout;
>  
> + if ((*qp_attr_mask & IB_QP_MIN_RNR_TIMER) && id_priv->min_rnr_timer_set)
> + qp_attr->min_rnr_timer = id_priv->min_rnr_timer;
> +
>   return ret;
>  }
>  EXPORT_SYMBOL(rdma_init_qp_attr);
> @@ -2615,6 +2619,43 @@ int rdma_set_ack_timeout(struct rdma_cm_id *id, u8 
> timeout)
>  }
>  EXPORT_SYMBOL(rdma_set_ack_timeout);
>  
> +/**
> + * rdma_set_min_rnr_timer() - Set the minimum RNR Retry timer of the
> + * QP associated with a connection identifier.
> + * @id: Communication identifier to associated with service type.
> + * @min_rnr_timer: 5-bit value encoded as Table 45: "Encoding for RNR NAK
> + *  Timer Field" in the IBTA specification.
> + *
> + * This function should be called before rdma_connect() on active
> + * side, and on passive side before rdma_accept(). The timer value
> + * will be associated with the local QP. When it receives a send it is
> + * not read to handle, typically if the receive queue is empty, an RNR
> + * Retry NAK is returned to the requester with the min_rnr_timer
> + * encoded. The requester will then wait at least the time specified
> + * in the NAK before retrying. The default is zero, which translates
> + * to a minimum RNR Timer value of 655 ms.
> + *
> + * Return: 0 for success
> + */
> +int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer)
> +{
> + struct rdma_id_private *id_priv;
> +
> + /* It is a five-bit value */
> + if (min_rnr_timer & 0xe0)
> + return -EINVAL;
> +
> + if (id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT)
> + return -EINVAL;

This is in-kernel API and safe to use WARN_ON() instead of returning
error which RDS is not checking anyway.

Thanks


Re: [PATCH] mm: add ___GFP_NOINIT flag which disables zeroing on alloc

2021-03-30 Thread Leon Romanovsky
On Mon, Mar 29, 2021 at 07:12:55PM +0900, Hyunsoon Kim wrote:
> On Mon, Mar 29, 2021 at 09:34:31AM +0300, Leon Romanovsky wrote:
> > On Mon, Mar 29, 2021 at 02:29:10PM +0900, Hyunsoon Kim wrote:
> > > This patch allows programmer to avoid zero initialization on page
> > > allocation even when the kernel config "CONFIG_INIT_ON_ALLOC_DEFAULT"
> > > is enabled. The configuration is made to prevent uninitialized
> > > heap memory flaws, and Android has applied this for security and
> > > deterministic execution times. Please refer to below.
> > > 
> > > https://android-review.googlesource.com/c/kernel/common/+/1235132
> > > 
> > > However, there is a case that the zeroing page memory is unnecessary
> > > when the page is used on specific purpose and will be zeroed
> > > automatically by hardware that accesses the memory through DMA.
> > > For instance, page allocation used for IP packet reception from Exynos
> > > modem is solely used for packet reception. Although the page will be
> > > freed eventually and reused for some other purpose, initialization at
> > > that moment of reuse will be sufficient to avoid uninitialized heap
> > > memory flaws. To support this kind of control, this patch creates new
> > > gfp type called ___GFP_NOINIT, that allows no zeroing at the moment
> > > of page allocation, called by many related APIs such as page_frag_alloc,
> > > alloc_pages, etc.
> > > 
> > > Signed-off-by: Hyunsoon Kim 
> > > ---
> > >  include/linux/gfp.h | 2 ++
> > >  include/linux/mm.h  | 4 +++-
> > >  2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > Let's assume that we will use this new flag, and users are smart enough
> > to figure when it needs to be used, what will be the performance gain?
> > 
> > Thanks
> 
> For instance, there are four memory access (either read or write) done
> by the system; memory write due to page allocation for reserving memory
> for modem hardware, memory write on the page by modem hardware,
> read and write incurred by copy_to_user operation by iperf reading
> the incoming network data. Theoretically, we can expect 1/4 of power
> saving on DRAM bandwidth. By performing simple iperf test with download
> UDP 800Mbps, we saw 5-6mA power gain by disabling
> CONFIG_INIT_ON_ALLOC_DEFAULT.

I'm more interested to see real results.

Thanks

> 
> Thanks




Re: [PATCH] mm: add ___GFP_NOINIT flag which disables zeroing on alloc

2021-03-29 Thread Leon Romanovsky
On Mon, Mar 29, 2021 at 02:29:10PM +0900, Hyunsoon Kim wrote:
> This patch allows programmer to avoid zero initialization on page
> allocation even when the kernel config "CONFIG_INIT_ON_ALLOC_DEFAULT"
> is enabled. The configuration is made to prevent uninitialized
> heap memory flaws, and Android has applied this for security and
> deterministic execution times. Please refer to below.
> 
> https://android-review.googlesource.com/c/kernel/common/+/1235132
> 
> However, there is a case that the zeroing page memory is unnecessary
> when the page is used on specific purpose and will be zeroed
> automatically by hardware that accesses the memory through DMA.
> For instance, page allocation used for IP packet reception from Exynos
> modem is solely used for packet reception. Although the page will be
> freed eventually and reused for some other purpose, initialization at
> that moment of reuse will be sufficient to avoid uninitialized heap
> memory flaws. To support this kind of control, this patch creates new
> gfp type called ___GFP_NOINIT, that allows no zeroing at the moment
> of page allocation, called by many related APIs such as page_frag_alloc,
> alloc_pages, etc.
> 
> Signed-off-by: Hyunsoon Kim 
> ---
>  include/linux/gfp.h | 2 ++
>  include/linux/mm.h  | 4 +++-
>  2 files changed, 5 insertions(+), 1 deletion(-)

Let's assume that we will use this new flag, and users are smart enough
to figure when it needs to be used, what will be the performance gain?

Thanks


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-27 Thread Leon Romanovsky
On Fri, Mar 26, 2021 at 08:20:07AM -0600, Alex Williamson wrote:
> On Fri, 26 Mar 2021 09:40:30 +0300
> Leon Romanovsky  wrote:

<...>

> > 
> > It supports by writing: echo "bus,pm" > reset_methods.
> > Regarding comma, IMHO it is easiest pattern for the parsing.
> > 
> > Anyway, The in-kernel implementation is not important to me.
> 
> Too bad, it should have been apparent from the sample code that it was
> using a comma separated list with re-ordering support.  Thanks,

Excellent, both of us think that "bus,pm" is the easiest way to
implement policy decision.

Thanks

> 
> Alex
> 


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-26 Thread Leon Romanovsky
On Fri, Mar 26, 2021 at 10:18:25AM +0100, Krzysztof Wilczyński wrote:
> Hello,
> 
> [...]
> 
> Aside of the sysfs interface, would this new functionality also require
> anything to be overridden at boot time via passing some command-line
> arguments?  Not sure how relevant such thing would be to device, but,
> whatnot reset, though.

This is per-device property and can't be universally correct like kernel
command-line arguments. I don't think that we need to add such functionality.

> 
> I am curious whether there would be a need for anything like that.

I prefer not.

> 
> Krzysztof


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-26 Thread Leon Romanovsky
On Thu, Mar 25, 2021 at 11:53:24AM -0600, Alex Williamson wrote:
> On Thu, 25 Mar 2021 18:09:58 +0200
> Leon Romanovsky  wrote:
> 
> > On Thu, Mar 25, 2021 at 08:55:04AM -0600, Alex Williamson wrote:
> > > On Thu, 25 Mar 2021 10:37:54 +0200
> > > Leon Romanovsky  wrote:
> > >   
> > > > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:  
> > > > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > > > Leon Romanovsky  wrote:
> > > > 
> > > > <...>
> > > >   
> > > > > > Yes, and real testing/debugging almost always requires kernel 
> > > > > > rebuild.
> > > > > > Everything else is waste of time.
> > > > > 
> > > > > Sorry, this is nonsense.  Allowing users to debug issues without a 
> > > > > full
> > > > > kernel rebuild is a good thing.
> > > > 
> > > > It is far from debug, this interface doesn't give you any answers why
> > > > the reset didn't work, it just helps you to find the one that works.
> > > > 
> > > > Unless you believe that this information will be enough to understand
> > > > the root cause, you will need to ask from the user to perform extra
> > > > tests, maybe try some quirk. All of that requires from the users to
> > > > rebuild their kernel.
> > > > 
> > > > So no, it is not debug.  
> > > 
> > > It allows a user to experiment to determine (a) my device doesn't work
> > > in a given scenario with the default configuration, but (b) if I change
> > > the reset to this other thing it does work.  That is a step in
> > > debugging.
> > > 
> > > It's absurd to think that a sysfs attribute could provide root cause,
> > > but it might be enough for someone to further help that user.  It would
> > > be a useful clue for a bug report.  Yes, reaching root cause might
> > > involve building a kernel, but that doesn't invalidate that having a
> > > step towards debugging in the base kernel might be a useful tool.  
> > 
> > Let's agree to do not agree.
> > 
> > >   
> > > > > > > > > For policy preference, I already described how I've 
> > > > > > > > > configured QEMU to
> > > > > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > > > > specification
> > > > > > > > > regarding the scope of a PM "soft reset".  This interface 
> > > > > > > > > would allow a
> > > > > > > > > system policy to do that same thing.
> > > > > > > > > 
> > > > > > > > > I don't think anyone is suggesting this as a means to avoid 
> > > > > > > > > quirks that
> > > > > > > > > would resolve reset issues and create the best default 
> > > > > > > > > general behavior.
> > > > > > > > > This provides a mechanism to test various reset methods, and 
> > > > > > > > > thereby
> > > > > > > > > identify broken methods, and set a policy.  Sure, that policy 
> > > > > > > > > might be
> > > > > > > > > to avoid a broken reset in the interim before it gets quirked 
> > > > > > > > > and
> > > > > > > > > there's potential for abuse there, but I think the benefits 
> > > > > > > > > outweigh
> > > > > > > > > the risks.
> > > > > > > > 
> > > > > > > > This interface is proposed as first class citizen in the 
> > > > > > > > general sysfs
> > > > > > > > layout. Of course, it will be seen as a way to bypass the 
> > > > > > > > kernel.
> > > > > > > > 
> > > > > > > > At least, put it under CONFIG_EXPERT option, so no distro will 
> > > > > > > > enable it
> > > > > > > > by default.  
> > > > > > > 
> > > > > > > Of course we're proposing it to be accessible, it should also 
> > > > > > > require
> > > > > > > admin privileges to modify, sysfs has lots of such things.  If 
> > > > > > > it's
> > > &g

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Leon Romanovsky
On Thu, Mar 25, 2021 at 10:52:57PM +0530, Amey Narkhede wrote:
> On 21/03/25 06:09PM, Leon Romanovsky wrote:
> > On Thu, Mar 25, 2021 at 08:55:04AM -0600, Alex Williamson wrote:
> > > On Thu, 25 Mar 2021 10:37:54 +0200
> > > Leon Romanovsky  wrote:
> > >
> > > > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> > > > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > > > Leon Romanovsky  wrote:
> > > >
> > > > <...>
> > > >
> > > > > > Yes, and real testing/debugging almost always requires kernel 
> > > > > > rebuild.
> > > > > > Everything else is waste of time.
> > > > >
> > > > > Sorry, this is nonsense.  Allowing users to debug issues without a 
> > > > > full
> > > > > kernel rebuild is a good thing.
> > > >
> > > > It is far from debug, this interface doesn't give you any answers why
> > > > the reset didn't work, it just helps you to find the one that works.
> > > >
> > > > Unless you believe that this information will be enough to understand
> > > > the root cause, you will need to ask from the user to perform extra
> > > > tests, maybe try some quirk. All of that requires from the users to
> > > > rebuild their kernel.
> > > >
> > > > So no, it is not debug.
> > >
> > > It allows a user to experiment to determine (a) my device doesn't work
> > > in a given scenario with the default configuration, but (b) if I change
> > > the reset to this other thing it does work.  That is a step in
> > > debugging.
> > >
> > > It's absurd to think that a sysfs attribute could provide root cause,
> > > but it might be enough for someone to further help that user.  It would
> > > be a useful clue for a bug report.  Yes, reaching root cause might
> > > involve building a kernel, but that doesn't invalidate that having a
> > > step towards debugging in the base kernel might be a useful tool.
> >
> > Let's agree to do not agree.
> >
> > >
> > > > > > > > > For policy preference, I already described how I've 
> > > > > > > > > configured QEMU to
> > > > > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > > > > specification
> > > > > > > > > regarding the scope of a PM "soft reset".  This interface 
> > > > > > > > > would allow a
> > > > > > > > > system policy to do that same thing.
> > > > > > > > >
> > > > > > > > > I don't think anyone is suggesting this as a means to avoid 
> > > > > > > > > quirks that
> > > > > > > > > would resolve reset issues and create the best default 
> > > > > > > > > general behavior.
> > > > > > > > > This provides a mechanism to test various reset methods, and 
> > > > > > > > > thereby
> > > > > > > > > identify broken methods, and set a policy.  Sure, that policy 
> > > > > > > > > might be
> > > > > > > > > to avoid a broken reset in the interim before it gets quirked 
> > > > > > > > > and
> > > > > > > > > there's potential for abuse there, but I think the benefits 
> > > > > > > > > outweigh
> > > > > > > > > the risks.
> > > > > > > >
> > > > > > > > This interface is proposed as first class citizen in the 
> > > > > > > > general sysfs
> > > > > > > > layout. Of course, it will be seen as a way to bypass the 
> > > > > > > > kernel.
> > > > > > > >
> > > > > > > > At least, put it under CONFIG_EXPERT option, so no distro will 
> > > > > > > > enable it
> > > > > > > > by default.
> > > > > > >
> > > > > > > Of course we're proposing it to be accessible, it should also 
> > > > > > > require
> > > > > > > admin privileges to modify, sysfs has lots of such things.  If 
> > > > > > > it's
> > > > > > > relegated to non-default accessibility, it won't be used for 
>

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Leon Romanovsky
On Thu, Mar 25, 2021 at 09:56:37PM +0530, Amey Narkhede wrote:
> On 21/03/25 10:37AM, Leon Romanovsky wrote:
> > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > Leon Romanovsky  wrote:

<...>

> > I expect that QEMU sets same reset policy for all devices at the same
> > time instead of trying per-device to guess which one works.
> >
> The current reset attribute does the same thing internally you described
> at the end.
> > And yes, you will be able to bypass kernel, but at least this interface
> > will be broader than initial one that serves only SO and workarounds.
> >
> What does it mean by "bypassing" kernel?
> I don't see any problem with SO and workaround if that is the only
> way an user can use their device. Why are you expecting every vendor to
> develop quirk? Also I don't see any point of using linked list to
> unnecessarily complicate a simple thing.

Please reread our conversation with Alex, it has answers to both of your
questions.

Thanks

> 
> Thanks,
> Amey


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Leon Romanovsky
On Thu, Mar 25, 2021 at 08:55:04AM -0600, Alex Williamson wrote:
> On Thu, 25 Mar 2021 10:37:54 +0200
> Leon Romanovsky  wrote:
> 
> > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > Leon Romanovsky  wrote:  
> > 
> > <...>
> > 
> > > > Yes, and real testing/debugging almost always requires kernel rebuild.
> > > > Everything else is waste of time.  
> > > 
> > > Sorry, this is nonsense.  Allowing users to debug issues without a full
> > > kernel rebuild is a good thing.  
> > 
> > It is far from debug, this interface doesn't give you any answers why
> > the reset didn't work, it just helps you to find the one that works.
> > 
> > Unless you believe that this information will be enough to understand
> > the root cause, you will need to ask from the user to perform extra
> > tests, maybe try some quirk. All of that requires from the users to
> > rebuild their kernel.
> > 
> > So no, it is not debug.
> 
> It allows a user to experiment to determine (a) my device doesn't work
> in a given scenario with the default configuration, but (b) if I change
> the reset to this other thing it does work.  That is a step in
> debugging.
> 
> It's absurd to think that a sysfs attribute could provide root cause,
> but it might be enough for someone to further help that user.  It would
> be a useful clue for a bug report.  Yes, reaching root cause might
> involve building a kernel, but that doesn't invalidate that having a
> step towards debugging in the base kernel might be a useful tool.

Let's agree to do not agree.

> 
> > > > > > > For policy preference, I already described how I've configured 
> > > > > > > QEMU to
> > > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > > specification
> > > > > > > regarding the scope of a PM "soft reset".  This interface would 
> > > > > > > allow a
> > > > > > > system policy to do that same thing.
> > > > > > > 
> > > > > > > I don't think anyone is suggesting this as a means to avoid 
> > > > > > > quirks that
> > > > > > > would resolve reset issues and create the best default general 
> > > > > > > behavior.
> > > > > > > This provides a mechanism to test various reset methods, and 
> > > > > > > thereby
> > > > > > > identify broken methods, and set a policy.  Sure, that policy 
> > > > > > > might be
> > > > > > > to avoid a broken reset in the interim before it gets quirked and
> > > > > > > there's potential for abuse there, but I think the benefits 
> > > > > > > outweigh
> > > > > > > the risks.  
> > > > > > 
> > > > > > This interface is proposed as first class citizen in the general 
> > > > > > sysfs
> > > > > > layout. Of course, it will be seen as a way to bypass the kernel.
> > > > > > 
> > > > > > At least, put it under CONFIG_EXPERT option, so no distro will 
> > > > > > enable it
> > > > > > by default.
> > > > > 
> > > > > Of course we're proposing it to be accessible, it should also require
> > > > > admin privileges to modify, sysfs has lots of such things.  If it's
> > > > > relegated to non-default accessibility, it won't be used for testing
> > > > > and it won't be available for system policy and it's pointless.
> > > > 
> > > > We probably have difference in view of what testing is. I expect from
> > > > the users who experience issues with reset to do extra steps and one of
> > > > them is to require from them to compile their kernel.  
> > > 
> > > I would define the ability to generate a CI test that can pick a
> > > device, unbind it from its driver, and iterate reset methods as a
> > > worthwhile improvement in testing.  
> > 
> > Who is going to run this CI? At least all kernel CIs (external and
> > internal to HW vendors) that I'm familiar are building kernel themselves.
> > 
> > Distro kernel is too bloat to be really usable for CI.
> 
> At this point I'm suspicious you're trolling.  A distro kernel CI
> certainly uses the kernel they intend to ship and support in their
> environm

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Leon Romanovsky
On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> On Wed, 24 Mar 2021 17:13:56 +0200
> Leon Romanovsky  wrote:

<...>

> > Yes, and real testing/debugging almost always requires kernel rebuild.
> > Everything else is waste of time.
> 
> Sorry, this is nonsense.  Allowing users to debug issues without a full
> kernel rebuild is a good thing.

It is far from debug, this interface doesn't give you any answers why
the reset didn't work, it just helps you to find the one that works.

Unless you believe that this information will be enough to understand
the root cause, you will need to ask from the user to perform extra
tests, maybe try some quirk. All of that requires from the users to
rebuild their kernel.

So no, it is not debug.

> 
> > > > > For policy preference, I already described how I've configured QEMU to
> > > > > prefer a bus reset rather than a PM reset due to lack of specification
> > > > > regarding the scope of a PM "soft reset".  This interface would allow 
> > > > > a
> > > > > system policy to do that same thing.
> > > > > 
> > > > > I don't think anyone is suggesting this as a means to avoid quirks 
> > > > > that
> > > > > would resolve reset issues and create the best default general 
> > > > > behavior.
> > > > > This provides a mechanism to test various reset methods, and thereby
> > > > > identify broken methods, and set a policy.  Sure, that policy might be
> > > > > to avoid a broken reset in the interim before it gets quirked and
> > > > > there's potential for abuse there, but I think the benefits outweigh
> > > > > the risks.
> > > > 
> > > > This interface is proposed as first class citizen in the general sysfs
> > > > layout. Of course, it will be seen as a way to bypass the kernel.
> > > > 
> > > > At least, put it under CONFIG_EXPERT option, so no distro will enable it
> > > > by default.  
> > > 
> > > Of course we're proposing it to be accessible, it should also require
> > > admin privileges to modify, sysfs has lots of such things.  If it's
> > > relegated to non-default accessibility, it won't be used for testing
> > > and it won't be available for system policy and it's pointless.  
> > 
> > We probably have difference in view of what testing is. I expect from
> > the users who experience issues with reset to do extra steps and one of
> > them is to require from them to compile their kernel.
> 
> I would define the ability to generate a CI test that can pick a
> device, unbind it from its driver, and iterate reset methods as a
> worthwhile improvement in testing.

Who is going to run this CI? At least all kernel CIs (external and
internal to HW vendors) that I'm familiar are building kernel themselves.

Distro kernel is too bloat to be really usable for CI.

> 
> > The root permissions doesn't protect from anything, SO lovers will use
> > root without even thinking twice.
> 
> Yes, with great power comes great responsibility.  Many admins ignore
> this.  That's far beyond the scope of this series.

<...>

> > I'm trying to help you with your use case of providing reset policy
> > mechanism, which can be without CONFIG_EXPERT. However if you want
> > to continue path of having specific reset type only, please ensure
> > that this is not taken to the "bypass kernel" direction.
> 
> You've lost me, are you saying you'd be in favor of an interface that
> allows an admin to specify an arbitrary list of reset methods because
> that's somehow more in line with a policy choice than a userspace
> workaround?  This seems like unnecessary bloat because (a) it allows
> the same bypass mechanism, and (b) a given device is only going to use
> a single method anyway, so the functionality is unnecessary.  Please
> help me understand how this favors the policy use case.  Thanks,

The policy decision is global logic that is easier to grasp. At some
point of our discussion, you presented the case where PM reset is not
defined well and you prefer to do bus reset (something like that).

I expect that QEMU sets same reset policy for all devices at the same
time instead of trying per-device to guess which one works.

And yes, you will be able to bypass kernel, but at least this interface
will be broader than initial one that serves only SO and workarounds.

Thanks

> 
> Alex
> 


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-24 Thread Leon Romanovsky
On Wed, Mar 24, 2021 at 08:37:43AM -0600, Alex Williamson wrote:
> On Wed, 24 Mar 2021 12:03:00 +0200
> Leon Romanovsky  wrote:
> 
> > On Mon, Mar 22, 2021 at 11:10:03AM -0600, Alex Williamson wrote:
> > > On Sun, 21 Mar 2021 10:40:55 +0200
> > > Leon Romanovsky  wrote:
> > >   
> > > > On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:  
> > > > > On Sat, 20 Mar 2021 11:10:08 +0200
> > > > > Leon Romanovsky  wrote:
> > > > > > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote:
> > > > > >  
> > > > > > > 
> > > > > > > What if we taint the kernel or pci_warn() for cases where either 
> > > > > > > all
> > > > > > > the reset methods are disabled, ie. 'echo none > reset_method', 
> > > > > > > or any
> > > > > > > time a device specific method is disabled?  
> > > > > > 
> > > > > > What does it mean "none"? Does it mean nothing supported? If yes, I 
> > > > > > think that
> > > > > > pci_warn() will be enough. At least for me, taint is usable during 
> > > > > > debug stages,
> > > > > > probably if device doesn't crash no one will look to see 
> > > > > > /proc/sys/kernel/tainted.
> > > > > 
> > > > > "none" as implemented in this patch, clearing the enabled function
> > > > > reset methods.
> > > > 
> > > > It is far from intuitive, the empty string will be easier to understand,
> > > > because "none" means no reset at all.  
> > > 
> > > "No reset at all" is what "none" achieves, the
> > > pci_dev.reset_methods_enabled bitmap is cleared.  We can use an empty
> > > string, but I think we want a way to clear all enabled resets and a way
> > > to return it to the default.  I could see arguments for an empty string
> > > serving either purpose, so this version proposed explicitly using
> > > "none" and "default", as included in the ABI update.  
> > 
> > I will stick with "default" only and leave "none" for something else.
> 
> Are you suggesting writing "default" restores the unmodified behavior
> and writing an empty string clears all enabled reset methods?
>  
> > > > > > > I'd almost go so far as to prevent disabling a device specific 
> > > > > > > reset
> > > > > > > altogether, but for example should a device specific reset that 
> > > > > > > fixes
> > > > > > > an aspect of FLR behavior prevent using a bus reset?  I'd prefer 
> > > > > > > in that
> > > > > > > case if direct FLR were disabled via a device flag introduced 
> > > > > > > with the
> > > > > > > quirk and the remaining resets can still be selected by 
> > > > > > > preference.  
> > > > > > 
> > > > > > I don't know enough to discuss the PCI details, but you raised good 
> > > > > > point.
> > > > > > This sysfs is user visible API that is presented as is from device 
> > > > > > point
> > > > > > of view. It can be easily run into problems if PCI/core doesn't 
> > > > > > work with
> > > > > > user's choice.
> > > > > > 
> > > > > > > 
> > > > > > > Theoretically all the other reset methods work and are available, 
> > > > > > > it's
> > > > > > > only a policy decision which to use, right?  
> > > > > > 
> > > > > > But this patch was presented as a way to overcome situations where
> > > > > > supported != working and user magically knows which reset type to 
> > > > > > set.
> > > > > 
> > > > > It's not magic, the new sysfs attributes expose which resets are
> > > > > enabled and the order that they're used, the user can simply select 
> > > > > the
> > > > > next one.  Being able to bypass a broken reset method is a helpful 
> > > > > side
> > > > > effect of getting to select a preferred reset method.
> > > > 
> > > > Magic in a sense that user has no idea what those resets mean, the
&

Re: [PATCH -next] IB/srpt: Fix passing zero to 'PTR_ERR'

2021-03-24 Thread Leon Romanovsky
On Wed, Mar 24, 2021 at 10:09:39PM +0800, YueHaibing wrote:
> Fix smatch warning:
> 
> drivers/infiniband/ulp/srpt/ib_srpt.c:2341 srpt_cm_req_recv() warn: passing 
> zero to 'PTR_ERR'
> 
> Use PTR_ERR_OR_ZERO instead of PTR_ERR
> 
> Fixes: 847462de3a0a ("IB/srpt: Fix srpt_cm_req_recv() error path (1/2)")
> Signed-off-by: YueHaibing 
> ---
>  drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c 
> b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 6be60aa5ffe2..3ff24b5048ac 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -2338,7 +2338,7 @@ static int srpt_cm_req_recv(struct srpt_device *const 
> sdev,
>  
>   if (IS_ERR_OR_NULL(ch->sess)) {
>   WARN_ON_ONCE(ch->sess == NULL);
> - ret = PTR_ERR(ch->sess);
> + ret = PTR_ERR_OR_ZERO(ch->sess);

It is crazy, in first line, we checked ch->sess and allowed it to be NULL,
later caused to kernel panic and set ret to success.

>   ch->sess = NULL;
>   pr_info("Rejected login for initiator %s: ret = %d.\n",
>   ch->sess_name, ret);
> -- 
> 2.22.0
> 


Re: [PATCH 1/2] net: ipv4: route.c: add likely() statements

2021-03-24 Thread Leon Romanovsky
On Wed, Mar 24, 2021 at 07:01:19PM +0800, Yejune Deng wrote:
> My reasons are as following: ipv4_confirm_neigh() belongs to
> ipv4_dst_ops that family is AF_INET, and ipv4_neigh_lookup() is also
> added likely() when rt->rt_gw_family is equal to AF_INET.

It is part of that cargo cult. Please support your claim with
performance numbers when this likely/unlikely will give any difference.

Thanks

> 
> On Wed, Mar 24, 2021 at 6:34 PM Leon Romanovsky  wrote:
> >
> > On Wed, Mar 24, 2021 at 11:09:22AM +0800, Yejune Deng wrote:
> > > Add likely() statements in ipv4_confirm_neigh() for 'rt->rt_gw_family
> > > == AF_INET'.
> >
> > Why? Such macros are beneficial in only specific cases, most of the time,
> > likely/unlikely is cargo cult.
> >
> > >
> > > Signed-off-by: Yejune Deng 
> > > ---
> > >  net/ipv4/route.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index fa68c2612252..5762d9bc671c 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -440,7 +440,7 @@ static void ipv4_confirm_neigh(const struct dst_entry 
> > > *dst, const void *daddr)
> > >   struct net_device *dev = dst->dev;
> > >   const __be32 *pkey = daddr;
> > >
> > > - if (rt->rt_gw_family == AF_INET) {
> > > + if (likely(rt->rt_gw_family == AF_INET)) {
> > >   pkey = (const __be32 *)>rt_gw4;
> > >   } else if (rt->rt_gw_family == AF_INET6) {
> > >   return __ipv6_confirm_neigh_stub(dev, >rt_gw6);
> > > --
> > > 2.29.0
> > >


Re: [PATCH 1/2] net: ipv4: route.c: add likely() statements

2021-03-24 Thread Leon Romanovsky
On Wed, Mar 24, 2021 at 11:09:22AM +0800, Yejune Deng wrote:
> Add likely() statements in ipv4_confirm_neigh() for 'rt->rt_gw_family
> == AF_INET'.

Why? Such macros are beneficial in only specific cases, most of the time,
likely/unlikely is cargo cult.

> 
> Signed-off-by: Yejune Deng 
> ---
>  net/ipv4/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index fa68c2612252..5762d9bc671c 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -440,7 +440,7 @@ static void ipv4_confirm_neigh(const struct dst_entry 
> *dst, const void *daddr)
>   struct net_device *dev = dst->dev;
>   const __be32 *pkey = daddr;
>  
> - if (rt->rt_gw_family == AF_INET) {
> + if (likely(rt->rt_gw_family == AF_INET)) {
>   pkey = (const __be32 *)>rt_gw4;
>   } else if (rt->rt_gw_family == AF_INET6) {
>   return __ipv6_confirm_neigh_stub(dev, >rt_gw6);
> -- 
> 2.29.0
> 


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-24 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 11:10:03AM -0600, Alex Williamson wrote:
> On Sun, 21 Mar 2021 10:40:55 +0200
> Leon Romanovsky  wrote:
> 
> > On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:
> > > On Sat, 20 Mar 2021 11:10:08 +0200
> > > Leon Romanovsky  wrote:  
> > > > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote:   
> > > > > 
> > > > > What if we taint the kernel or pci_warn() for cases where either all
> > > > > the reset methods are disabled, ie. 'echo none > reset_method', or any
> > > > > time a device specific method is disabled?
> > > > 
> > > > What does it mean "none"? Does it mean nothing supported? If yes, I 
> > > > think that
> > > > pci_warn() will be enough. At least for me, taint is usable during 
> > > > debug stages,
> > > > probably if device doesn't crash no one will look to see 
> > > > /proc/sys/kernel/tainted.  
> > > 
> > > "none" as implemented in this patch, clearing the enabled function
> > > reset methods.  
> > 
> > It is far from intuitive, the empty string will be easier to understand,
> > because "none" means no reset at all.
> 
> "No reset at all" is what "none" achieves, the
> pci_dev.reset_methods_enabled bitmap is cleared.  We can use an empty
> string, but I think we want a way to clear all enabled resets and a way
> to return it to the default.  I could see arguments for an empty string
> serving either purpose, so this version proposed explicitly using
> "none" and "default", as included in the ABI update.

I will stick with "default" only and leave "none" for something else.

> 
> > > > > I'd almost go so far as to prevent disabling a device specific reset
> > > > > altogether, but for example should a device specific reset that fixes
> > > > > an aspect of FLR behavior prevent using a bus reset?  I'd prefer in 
> > > > > that
> > > > > case if direct FLR were disabled via a device flag introduced with the
> > > > > quirk and the remaining resets can still be selected by preference.   
> > > > >  
> > > > 
> > > > I don't know enough to discuss the PCI details, but you raised good 
> > > > point.
> > > > This sysfs is user visible API that is presented as is from device point
> > > > of view. It can be easily run into problems if PCI/core doesn't work 
> > > > with
> > > > user's choice.
> > > >   
> > > > > 
> > > > > Theoretically all the other reset methods work and are available, it's
> > > > > only a policy decision which to use, right?
> > > > 
> > > > But this patch was presented as a way to overcome situations where
> > > > supported != working and user magically knows which reset type to set.  
> > > 
> > > It's not magic, the new sysfs attributes expose which resets are
> > > enabled and the order that they're used, the user can simply select the
> > > next one.  Being able to bypass a broken reset method is a helpful side
> > > effect of getting to select a preferred reset method.  
> > 
> > Magic in a sense that user has no idea what those resets mean, the
> > expectation is that he will blindly iterate till something works.
> 
> Which ought to actually be a safe thing to do.  We should have quirks to
> exclude resets that are known broken but still probe as present and I'd
> be perfectly fine if we issue a warning if the user disables all resets
> for a given device.
>  
> > > > If you want to take this patch to be policy decision tool,
> > > > it will need to accept "reset_type1,reset_type2,..." sort of input,
> > > > so fallback will work natively.  
> > > 
> > > I don't see that as a requirement.  We have fall-through support in the
> > > kernel, but for a given device we're really only ever going to make use
> > > of one of those methods.  If a user knows enough about a device to have
> > > a preference, I think it can be singular.  That also significantly
> > > simplifies the interface and supporting code.  Thanks,  
> > 
> > I'm struggling to get requirements from this thread. You talked about
> > policy decision to overtake fallback mechanism, Amey wanted to avoid
> > quirks.
> > 
> > Do you have an example of such devices or we are talking about
> > theoretical case?

Re: [PATCH v2] net: make unregister netdev warning timeout configurable

2021-03-24 Thread Leon Romanovsky
On Tue, Mar 23, 2021 at 07:49:23AM +0100, Dmitry Vyukov wrote:
> netdev_wait_allrefs() issues a warning if refcount does not drop to 0
> after 10 seconds. While 10 second wait generally should not happen
> under normal workload in normal environment, it seems to fire falsely
> very often during fuzzing and/or in qemu emulation (~10x slower).
> At least it's not possible to understand if it's really a false
> positive or not. Automated testing generally bumps all timeouts
> to very high values to avoid flake failures.
> Add net.core.netdev_unregister_timeout_secs sysctl to make
> the timeout configurable for automated testing systems.
> Lowering the timeout may also be useful for e.g. manual bisection.
> The default value matches the current behavior.
> 
> Signed-off-by: Dmitry Vyukov 
> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
> Changes since v1:
>  - use sysctl instead of a config
> ---
>  Documentation/admin-guide/sysctl/net.rst | 11 +++
>  include/linux/netdevice.h|  1 +
>  net/core/dev.c   |  6 +-
>  net/core/sysctl_net_core.c   | 10 ++
>  4 files changed, 27 insertions(+), 1 deletion(-)
> 

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH v2] infiniband: Fix a use after free in isert_connect_request

2021-03-24 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 09:13:25AM -0700, Lv Yunlong wrote:
> The device is got by isert_device_get() with refcount is 1,
> and is assigned to isert_conn by isert_conn->device = device.
> When isert_create_qp() failed, device will be freed with
> isert_device_put().
> 
> Later, the device is used in isert_free_login_buf(isert_conn)
> by the isert_conn->device->ib_device statement. This patch
> free the device in the correct order.
> 
> Signed-off-by: Lv Yunlong 
> ---
>  drivers/infiniband/ulp/isert/ib_isert.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 

Thanks,
Reviewed-by: Leon Romanovsky 


Re: Re: [PATCH] infiniband: Fix a use after free in isert_connect_request

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 10:51:35PM +0800, lyl2...@mail.ustc.edu.cn wrote:
> 
> 
> 
> > -原始邮件-
> > 发件人: "Leon Romanovsky" 
> > 发送时间: 2021-03-22 22:27:17 (星期一)
> > 收件人: "Lv Yunlong" 
> > 抄送: s...@grimberg.me, dledf...@redhat.com, j...@ziepe.ca, 
> > linux-r...@vger.kernel.org, target-de...@vger.kernel.org, 
> > linux-kernel@vger.kernel.org
> > 主题: Re: [PATCH] infiniband: Fix a use after free in isert_connect_request
> > 
> > On Mon, Mar 22, 2021 at 06:53:55AM -0700, Lv Yunlong wrote:
> > > The device is got by isert_device_get() with refcount is 1,
> > > and is assigned to isert_conn by isert_conn->device = device.
> > > When isert_create_qp() failed, device will be freed with
> > > isert_device_put().
> > > 
> > > Later, the device is used in isert_free_login_buf(isert_conn)
> > > by the isert_conn->device->ib_device statement. My patch
> > > exchanges the callees order to free the device late.
> > > 
> > > Signed-off-by: Lv Yunlong 
> > > ---
> > >  drivers/infiniband/ulp/isert/ib_isert.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > The fix needs to be change of isert_free_login_buf() from
> > isert_free_login_buf(isert_conn) to be isert_free_login_buf(isert_conn, 
> > cma_id->device)
> > 
> > Thanks
> > 
> > > 
> > > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
> > > b/drivers/infiniband/ulp/isert/ib_isert.c
> > > index 7305ed8976c2..d8a533e346b0 100644
> > > --- a/drivers/infiniband/ulp/isert/ib_isert.c
> > > +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> > > @@ -473,10 +473,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, 
> > > struct rdma_cm_event *event)
> > >  
> > >  out_destroy_qp:
> > >   isert_destroy_qp(isert_conn);
> > > -out_conn_dev:
> > > - isert_device_put(device);
> > >  out_rsp_dma_map:
> > >   isert_free_login_buf(isert_conn);
> > > +out_conn_dev:
> > > + isert_device_put(device);
> > >  out:
> > >   kfree(isert_conn);
> > >   rdma_reject(cma_id, NULL, 0, IB_CM_REJ_CONSUMER_DEFINED);
> > > -- 
> > > 2.25.1
> > > 
> > > 
> 
> I see that function isert_free_login_buf(struct isert_conn *isert_conn) has 
> only
> a parameter,  do you mean i need change the implementation of 
> isert_free_login_buf?
> 
> I'm sorry to say that i am unfamilar with this module and afraid of making 
> more mistakes,
> because this function is being called elsewhere as well.
> Could you help me to fix this issue? Or just fix it and tell me your commit 
> number?

After checking how isert_connect_release() is implemented, it looks like
this will fix:

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index 7305ed8976c2..18266f07c58d 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -438,23 +438,23 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct 
rdma_cm_event *event)
isert_init_conn(isert_conn);
isert_conn->cm_id = cma_id;
 
-   ret = isert_alloc_login_buf(isert_conn, cma_id->device);
-   if (ret)
-   goto out;
-
device = isert_device_get(cma_id);
if (IS_ERR(device)) {
ret = PTR_ERR(device);
-   goto out_rsp_dma_map;
+   goto out;
}
isert_conn->device = device;
 
+   ret = isert_alloc_login_buf(isert_conn, cma_id->device);
+   if (ret)
+   goto out_conn_dev;
+
isert_set_nego_params(isert_conn, >param.conn);
 
isert_conn->qp = isert_create_qp(isert_conn, cma_id);
if (IS_ERR(isert_conn->qp)) {
ret = PTR_ERR(isert_conn->qp);
-   goto out_conn_dev;
+   goto out_rsp_dma_map;
}
 
ret = isert_login_post_recv(isert_conn);
@@ -473,10 +473,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct 
rdma_cm_event *event)
 
 out_destroy_qp:
isert_destroy_qp(isert_conn);
-out_conn_dev:
-   isert_device_put(device);
 out_rsp_dma_map:
isert_free_login_buf(isert_conn);
+out_conn_dev:
+   isert_device_put(device);
 out:
kfree(isert_conn);
rdma_reject(cma_id, NULL, 0, IB_CM_REJ_CONSUMER_DEFINED);


Re: [PATCH] infiniband: Fix a use after free in isert_connect_request

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 06:53:55AM -0700, Lv Yunlong wrote:
> The device is got by isert_device_get() with refcount is 1,
> and is assigned to isert_conn by isert_conn->device = device.
> When isert_create_qp() failed, device will be freed with
> isert_device_put().
> 
> Later, the device is used in isert_free_login_buf(isert_conn)
> by the isert_conn->device->ib_device statement. My patch
> exchanges the callees order to free the device late.
> 
> Signed-off-by: Lv Yunlong 
> ---
>  drivers/infiniband/ulp/isert/ib_isert.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

The fix needs to be change of isert_free_login_buf() from
isert_free_login_buf(isert_conn) to be isert_free_login_buf(isert_conn, 
cma_id->device)

Thanks

> 
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
> b/drivers/infiniband/ulp/isert/ib_isert.c
> index 7305ed8976c2..d8a533e346b0 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> @@ -473,10 +473,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct 
> rdma_cm_event *event)
>  
>  out_destroy_qp:
>   isert_destroy_qp(isert_conn);
> -out_conn_dev:
> - isert_device_put(device);
>  out_rsp_dma_map:
>   isert_free_login_buf(isert_conn);
> +out_conn_dev:
> + isert_device_put(device);
>  out:
>   kfree(isert_conn);
>   rdma_reject(cma_id, NULL, 0, IB_CM_REJ_CONSUMER_DEFINED);
> -- 
> 2.25.1
> 
> 


Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 08:17:59AM -0500, Alex Elder wrote:
> On 3/22/21 1:40 AM, Leon Romanovsky wrote:
> > > I'd like to suggest a plan so I can begin to make progress,
> > > but do so in a way you/others think is satisfactory.
> > > - I would first like to fix the existing bugs, namely that
> > >if IPA_VALIDATION is defined there are build errors, and
> > >that IPA_VALIDATION is not consistently used.  That is
> > >this 2-patch series.
> > The thing is that IPA_VALIDATION is not defined in the upstream kernel.
> > There is nothing to be fixed in netdev repository
> > 
> > > - I assure you that my goal is to simplify the code that
> > >does this sort of checking.  So here are some specific
> > >things I can implement in the coming weeks toward that:
> > >  - Anything that can be checked at build time, will
> > >be checked with BUILD_BUG_ON().
> > +1
> > 
> > >  - Anything checked with BUILD_BUG_ON() will*not*
> > >be conditional.  I.e. it won't be inside an
> > >#ifdef IPA_VALIDATION block.
> > >  - I will review all remaining VALIDATION code (which
> > >can't--or can't always--be checked at build time),
> > >If it looks prudent to make it*always*  be checked,
> > >I will make it always be checked (not conditional
> > >on IPA_VALIDATION).
> > +1
> > 
> > > The result should clearly separate checks that can be done
> > > at build time from those that can't.
> > > 
> > > And with what's left (especially on that third sub-bullet)
> > > I might have some better examples with which to argue
> > > for something different.  Or I might just concede that
> > > you were right all along.
> > I hope so.
> 
> I came up with a solution last night that I'm going to try
> to implement.  I will still do the things I mention above.
> 
> The solution is to create a user space tool inside the
> drivers/net/ipa directory that will link with the kernel
> source files and will perform all the basic one-time checks
> I want to make.

We are not talking about putting this tool in upstream repo, right?
If yes, please get buy-in from David/Jakub first.

Thanks


Re: [PATCH net-next v2 0/2] net: ipa: fix validation

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 08:22:20AM -0500, Alex Elder wrote:
> On 3/20/21 9:17 AM, Alex Elder wrote:
> > There is sanity checking code in the IPA driver that's meant to be
> > enabled only during development.  This allows the driver to make
> > certain assumptions, but not have to verify those assumptions are
> > true at (operational) runtime.  This code is built conditional on
> > IPA_VALIDATION, set (if desired) inside the IPA makefile.
> > 
> > Unfortunately, this validation code has some errors.  First, there
> > are some mismatched arguments supplied to some dev_err() calls in
> > ipa_cmd_table_valid() and ipa_cmd_header_valid(), and these are
> > exposed if validation is enabled.  Second, the tag that enables
> > this conditional code isn't used consistently (it's IPA_VALIDATE
> > in some spots and IPA_VALIDATION in others).
> > 
> > This series fixes those two problems with the conditional validation
> > code.
> 
> After much back-and-forth with Leon Romanovsky:
> 
>   --> I retract this series <--
> 
> I will include these patches in a future series that will
> do cleanup of this validation code more completely.

Thanks a lot.

> 
> Thanks.
> 
>   -Alex
> 
> > Version 2 removes the two patches that introduced ipa_assert().  It
> > also modifies the description in the first patch so that it mentions
> > the changes made to ipa_cmd_table_valid().
> > 
> > -Alex
> > 
> > Alex Elder (2):
> >net: ipa: fix init header command validation
> >net: ipa: fix IPA validation
> > 
> >   drivers/net/ipa/Makefile   |  2 +-
> >   drivers/net/ipa/gsi_trans.c|  8 ++---
> >   drivers/net/ipa/ipa_cmd.c  | 54 ++
> >   drivers/net/ipa/ipa_cmd.h  |  6 ++--
> >   drivers/net/ipa/ipa_endpoint.c |  6 ++--
> >   drivers/net/ipa/ipa_main.c |  6 ++--
> >   drivers/net/ipa/ipa_mem.c  |  6 ++--
> >   drivers/net/ipa/ipa_table.c|  6 ++--
> >   drivers/net/ipa/ipa_table.h|  6 ++--
> >   9 files changed, 58 insertions(+), 42 deletions(-)
> > 
> 


Re: [PATCH rdma-next 0/2] Spring cleanup

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 10:00:12AM -0300, Jason Gunthorpe wrote:
> On Sun, Mar 14, 2021 at 03:39:06PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> > 
> > Bunch of cleanup in RDMA subsystem.
> > 
> > Leon Romanovsky (2):
> >   RDMA: Fix kernel-doc compilation warnings
> >   RDMA: Delete not-used static inline functions
> 
> Applied to for-next
> 
> How did you find the unused static inline functions?

Accidentally spotted such in pvrdma and later wrote one liner to create
me a list of functions to delete.

Thanks

> 
> Thanks,
> Jason


Re: [PATCH net-next] [v2] misdn: avoid -Wempty-body warning

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 01:14:47PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> gcc warns about a pointless condition:
> 
> drivers/isdn/hardware/mISDN/hfcmulti.c: In function 'hfcmulti_interrupt':
> drivers/isdn/hardware/mISDN/hfcmulti.c:2752:17: error: suggest braces around 
> empty body in an 'if' statement [-Werror=empty-body]
>  2752 | ; /* external IRQ */
> 
> As the check has no effect, just remove it.
> 
> Suggested-by: Leon Romanovsky 
> Signed-off-by: Arnd Bergmann 
> ---
> v2: remove the line instead of adding {}
> ---
>  drivers/isdn/hardware/mISDN/hfcmulti.c | 2 --
>  1 file changed, 2 deletions(-)
> 

Thanks, interesting when we will delete whole drivers/isdn :)

Reviewed-by: Leon Romanovsky 


Re: [PATCH net-next 1/5] misdn: avoid -Wempty-body warning

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 12:24:20PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 22, 2021 at 11:55 AM Leon Romanovsky  wrote:
> > On Mon, Mar 22, 2021 at 11:43:31AM +0100, Arnd Bergmann wrote:
> > > From: Arnd Bergmann 
> > >
> > > gcc warns about a pointless condition:
> > >
> > > drivers/isdn/hardware/mISDN/hfcmulti.c: In function 'hfcmulti_interrupt':
> > > drivers/isdn/hardware/mISDN/hfcmulti.c:2752:17: error: suggest braces 
> > > around empty body in an 'if' statement [-Werror=empty-body]
> > >  2752 | ; /* external IRQ */
> > >
> > > Change this as suggested by gcc, which also fits the style of the
> > > other conditions in this function.
> > >
> > > Signed-off-by: Arnd Bergmann 
> > > ---
> > >  drivers/isdn/hardware/mISDN/hfcmulti.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/isdn/hardware/mISDN/hfcmulti.c 
> > > b/drivers/isdn/hardware/mISDN/hfcmulti.c
> > > index 7013a3f08429..8ab0fde758d2 100644
> > > --- a/drivers/isdn/hardware/mISDN/hfcmulti.c
> > > +++ b/drivers/isdn/hardware/mISDN/hfcmulti.c
> > > @@ -2748,8 +2748,9 @@ hfcmulti_interrupt(int intno, void *dev_id)
> > >   if (hc->ctype != HFC_TYPE_E1)
> > >   ph_state_irq(hc, r_irq_statech);
> > >   }
> > > - if (status & V_EXT_IRQSTA)
> > > - ; /* external IRQ */
> > > + if (status & V_EXT_IRQSTA) {
> > > + /* external IRQ */
> > > + }
> >
> > Any reason do not delete this hunk?
> 
> I don't care either way, I only kept it because it was apparently left there
> on purpose by the original author, as seen by the comment.

I personally would delete it.

Thanks

> 
> Arnd


Re: [PATCH net-next 1/5] misdn: avoid -Wempty-body warning

2021-03-22 Thread Leon Romanovsky
On Mon, Mar 22, 2021 at 11:43:31AM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> gcc warns about a pointless condition:
> 
> drivers/isdn/hardware/mISDN/hfcmulti.c: In function 'hfcmulti_interrupt':
> drivers/isdn/hardware/mISDN/hfcmulti.c:2752:17: error: suggest braces around 
> empty body in an 'if' statement [-Werror=empty-body]
>  2752 | ; /* external IRQ */
> 
> Change this as suggested by gcc, which also fits the style of the
> other conditions in this function.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/isdn/hardware/mISDN/hfcmulti.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/isdn/hardware/mISDN/hfcmulti.c 
> b/drivers/isdn/hardware/mISDN/hfcmulti.c
> index 7013a3f08429..8ab0fde758d2 100644
> --- a/drivers/isdn/hardware/mISDN/hfcmulti.c
> +++ b/drivers/isdn/hardware/mISDN/hfcmulti.c
> @@ -2748,8 +2748,9 @@ hfcmulti_interrupt(int intno, void *dev_id)
>   if (hc->ctype != HFC_TYPE_E1)
>   ph_state_irq(hc, r_irq_statech);
>   }
> - if (status & V_EXT_IRQSTA)
> - ; /* external IRQ */
> + if (status & V_EXT_IRQSTA) {
> + /* external IRQ */
> + }

Any reason do not delete this hunk?

>   if (status & V_LOST_STA) {
>   /* LOST IRQ */
>   HFC_outb(hc, R_INC_RES_FIFO, V_RES_LOST); /* clear irq! */
> -- 
> 2.29.2
> 


Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation

2021-03-22 Thread Leon Romanovsky
On Sun, Mar 21, 2021 at 12:19:02PM -0500, Alex Elder wrote:
> On 3/21/21 8:49 AM, Leon Romanovsky wrote:
> > On Sun, Mar 21, 2021 at 08:21:24AM -0500, Alex Elder wrote:
> >> On 3/21/21 3:21 AM, Leon Romanovsky wrote:
> >>> On Sat, Mar 20, 2021 at 09:17:29AM -0500, Alex Elder wrote:
> >>>> There are blocks of IPA code that sanity-check various values, at
> >>>> compile time where possible.  Most of these checks can be done once
> >>>> during development but skipped for normal operation.  These checks
> >>>> permit the driver to make certain assumptions, thereby avoiding the
> >>>> need for runtime error checking.
> >>>>
> >>>> The checks are defined conditionally, but not consistently.  In
> >>>> some cases IPA_VALIDATION enables the optional checks, while in
> >>>> others IPA_VALIDATE is used.
> >>>>
> >>>> Fix this by using IPA_VALIDATION consistently.
> >>>>
> >>>> Signed-off-by: Alex Elder 
> >>>> ---
> >>>>   drivers/net/ipa/Makefile   | 2 +-
> >>>>   drivers/net/ipa/gsi_trans.c| 8 
> >>>>   drivers/net/ipa/ipa_cmd.c  | 4 ++--
> >>>>   drivers/net/ipa/ipa_cmd.h  | 6 +++---
> >>>>   drivers/net/ipa/ipa_endpoint.c | 6 +++---
> >>>>   drivers/net/ipa/ipa_main.c | 6 +++---
> >>>>   drivers/net/ipa/ipa_mem.c  | 6 +++---
> >>>>   drivers/net/ipa/ipa_table.c| 6 +++---
> >>>>   drivers/net/ipa/ipa_table.h| 6 +++---
> >>>>   9 files changed, 25 insertions(+), 25 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/ipa/Makefile b/drivers/net/ipa/Makefile
> >>>> index afe5df1e6..014ae36ac6004 100644
> >>>> --- a/drivers/net/ipa/Makefile
> >>>> +++ b/drivers/net/ipa/Makefile
> >>>> @@ -1,5 +1,5 @@
> >>>>   # Un-comment the next line if you want to validate configuration data
> >>>> -#ccflags-y  +=  -DIPA_VALIDATE
> >>>> +# ccflags-y +=  -DIPA_VALIDATION
> >>>
> >>> Maybe netdev folks think differently here, but general rule that dead
> >>> code and closed code is such, is not acceptable to in Linux kernel.
> >>>
> >>> <...>
> >>
> >> What is the purpose of CONFIG_KGDB?  Or CONFIG_DEBUG_KERNEL?
> >> Would you prefer I expose this through a kconfig option?  I
> >> intentionally did not do that, because I really intended it
> >> to be only for development, so defined it in the Makefile.
> >> But I have no objection to making it configurable that way.
> > 
> > I prefer you to follow netdev/linux kernel rules of development.
> > The upstream repository and drivers/net/* folder especially are not
> > the place to put code used for the development.
> 
> How do I add support for new versions of the hardware as
> it evolves?

Exactly like all other driver developers do. You are not different here.

1. Clean your driver to have stable base without dead/debug code.
2. Send patch series per-feature/hardware enablement on top of this base.

> 
> What I started supporting (v3.5.1) was in some respects
> relatively old.  Version 4.2 is newer, and the v4.5 and
> beyond are for products that are relatively new on the
> market.

I see that it was submitted in 2018, we have many large drivers
that by far older than IPA. For example, mlx5 supports more than 5
generations of hardware and was added in 2013.

> 
> Some updates to IPA (like 4.0+ after 3.5.1, or 4.5+
> after 4.2) include substantial updates to the way the
> hardware works.  The code can't support the new hardware
> without being adapted and generalized to support both
> old and new.

It is ok.

> 
> My goal is to get upstream support for IPA for all
> Qualcomm SoCs that have it.  But the hardware design
> is evolving; Qualcomm is actively developing their
> architecture so they can support new technologies
> (e.g. cellular 5G).  Development of the driver is
> simply *necessary*.

No argue here.

> 
> The assertions I proposed and checks like this are
> intended as an *aid* to the active development I
> have been doing.

They need to be local to your development environment.
It is perfectly fine if you keep extra debug patch internally.

> 
> They may look like hacky debugging--checking errors
> that can't happen.  They aren't that at all--they're
> intended to the compiler help me develop correct code,
> given I *know* it will 

Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation

2021-03-21 Thread Leon Romanovsky
On Sun, Mar 21, 2021 at 08:21:24AM -0500, Alex Elder wrote:
> On 3/21/21 3:21 AM, Leon Romanovsky wrote:
> > On Sat, Mar 20, 2021 at 09:17:29AM -0500, Alex Elder wrote:
> > > There are blocks of IPA code that sanity-check various values, at
> > > compile time where possible.  Most of these checks can be done once
> > > during development but skipped for normal operation.  These checks
> > > permit the driver to make certain assumptions, thereby avoiding the
> > > need for runtime error checking.
> > > 
> > > The checks are defined conditionally, but not consistently.  In
> > > some cases IPA_VALIDATION enables the optional checks, while in
> > > others IPA_VALIDATE is used.
> > > 
> > > Fix this by using IPA_VALIDATION consistently.
> > > 
> > > Signed-off-by: Alex Elder 
> > > ---
> > >   drivers/net/ipa/Makefile   | 2 +-
> > >   drivers/net/ipa/gsi_trans.c| 8 
> > >   drivers/net/ipa/ipa_cmd.c  | 4 ++--
> > >   drivers/net/ipa/ipa_cmd.h  | 6 +++---
> > >   drivers/net/ipa/ipa_endpoint.c | 6 +++---
> > >   drivers/net/ipa/ipa_main.c | 6 +++---
> > >   drivers/net/ipa/ipa_mem.c  | 6 +++---
> > >   drivers/net/ipa/ipa_table.c| 6 +++---
> > >   drivers/net/ipa/ipa_table.h| 6 +++---
> > >   9 files changed, 25 insertions(+), 25 deletions(-)
> > > 
> > > diff --git a/drivers/net/ipa/Makefile b/drivers/net/ipa/Makefile
> > > index afe5df1e6..014ae36ac6004 100644
> > > --- a/drivers/net/ipa/Makefile
> > > +++ b/drivers/net/ipa/Makefile
> > > @@ -1,5 +1,5 @@
> > >   # Un-comment the next line if you want to validate configuration data
> > > -#ccflags-y   +=  -DIPA_VALIDATE
> > > +# ccflags-y  +=  -DIPA_VALIDATION
> > 
> > Maybe netdev folks think differently here, but general rule that dead
> > code and closed code is such, is not acceptable to in Linux kernel.
> > 
> > <...>
> 
> What is the purpose of CONFIG_KGDB?  Or CONFIG_DEBUG_KERNEL?
> Would you prefer I expose this through a kconfig option?  I
> intentionally did not do that, because I really intended it
> to be only for development, so defined it in the Makefile.
> But I have no objection to making it configurable that way.

I prefer you to follow netdev/linux kernel rules of development.
The upstream repository and drivers/net/* folder especially are not
the place to put code used for the development.

> 
> > > -#ifdef IPA_VALIDATE
> > > +#ifdef IPA_VALIDATION
> > >   if (!size || size % 8)
> > >   return -EINVAL;
> > >   if (count < max_alloc)
> > >   return -EINVAL;
> > >   if (!max_alloc)
> > >   return -EINVAL;
> > > -#endif /* IPA_VALIDATE */
> > > +#endif /* IPA_VALIDATION */
> > 
> > If it is possible to supply those values, the check should be always and
> > not only under some closed config option.
> 
> These are assertions.
> 
> There is no need to test them for working code.  If
> I run the code successfully with these tests enabled
> exactly once, and they are satisfied, then every time
> the code is run thereafter they will pass.  So I want
> to check them when debugging/developing only.  That
> way there is a mistake, it gets caught, but otherwise
> there's no pointless argument checking done.
> 
> I'll explain the first check; the others have similar
> explanation.
> 
> In the current code, the passed size is sizeof(struct)
> for three separate structures.
>   - If the structure size changes, I want to be
> sure the constraint is still honored
>   - The code will break of someone happens
> to pass a size of 0.  I don't expect that to
> ever happen, but this states that requirement.
> 
> This is an optimization, basically, but one that
> allows the assumed conditions to be optionally
> verified.

Everything above as an outcome of attempting to mix constant vs. run-time
checks. If "size" is constant, the use of BUILD_BIG_ON() will help not only
you but other developers to catch the errors too. The assumption that you alone
are working on this code, can or can't be correct.

If "size" is not constant, you should check it always.

> 
> > >   /* By allocating a few extra entries in our pool (one less
> > >* than the maximum number that will be requested in a
> > > @@ -140,14 +140,14 @@ int gsi_trans_pool_init_dma(struct device *dev, 
> > > struct gsi_tra

Re: linux-next: manual merge of the net tree with Linus' tree

2021-03-21 Thread Leon Romanovsky
On Sat, Mar 20, 2021 at 12:42:06PM -0700, Linus Torvalds wrote:
> On Sat, Mar 20, 2021 at 12:28 PM Marc Kleine-Budde  
> wrote:
> >
> > Good idea. I'll send a pull request to David and Jakub.
> 
> I don't think the revert is necessary. The conflict is so trivial that
> it doesn't really matter.

<...>

> But something like this that just removes the
> MODULE_SUPPORTED_DEVICE() thing that basically never gets touched
> anyway, and we happened to be unlucky in *one* file? Not a worry at
> all.

For me this specific revert is a way to reduce the overhead from the maintainer
when they prepare PRs. At least for me, PRs are most time consuming tasks.

No patch - no conflict - less worries.

Thanks


Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-21 Thread Leon Romanovsky
On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:
> On Sat, 20 Mar 2021 11:10:08 +0200
> Leon Romanovsky  wrote:
> > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote: 
> > > 
> > > What if we taint the kernel or pci_warn() for cases where either all
> > > the reset methods are disabled, ie. 'echo none > reset_method', or any
> > > time a device specific method is disabled?  
> > 
> > What does it mean "none"? Does it mean nothing supported? If yes, I think 
> > that
> > pci_warn() will be enough. At least for me, taint is usable during debug 
> > stages,
> > probably if device doesn't crash no one will look to see 
> > /proc/sys/kernel/tainted.
> 
> "none" as implemented in this patch, clearing the enabled function
> reset methods.

It is far from intuitive, the empty string will be easier to understand,
because "none" means no reset at all.

> 
> > > I'd almost go so far as to prevent disabling a device specific reset
> > > altogether, but for example should a device specific reset that fixes
> > > an aspect of FLR behavior prevent using a bus reset?  I'd prefer in that
> > > case if direct FLR were disabled via a device flag introduced with the
> > > quirk and the remaining resets can still be selected by preference.  
> > 
> > I don't know enough to discuss the PCI details, but you raised good point.
> > This sysfs is user visible API that is presented as is from device point
> > of view. It can be easily run into problems if PCI/core doesn't work with
> > user's choice.
> > 
> > > 
> > > Theoretically all the other reset methods work and are available, it's
> > > only a policy decision which to use, right?  
> > 
> > But this patch was presented as a way to overcome situations where
> > supported != working and user magically knows which reset type to set.
> 
> It's not magic, the new sysfs attributes expose which resets are
> enabled and the order that they're used, the user can simply select the
> next one.  Being able to bypass a broken reset method is a helpful side
> effect of getting to select a preferred reset method.

Magic in a sense that user has no idea what those resets mean, the
expectation is that he will blindly iterate till something works.

> 
> > If you want to take this patch to be policy decision tool,
> > it will need to accept "reset_type1,reset_type2,..." sort of input,
> > so fallback will work natively.
> 
> I don't see that as a requirement.  We have fall-through support in the
> kernel, but for a given device we're really only ever going to make use
> of one of those methods.  If a user knows enough about a device to have
> a preference, I think it can be singular.  That also significantly
> simplifies the interface and supporting code.  Thanks,

I'm struggling to get requirements from this thread. You talked about
policy decision to overtake fallback mechanism, Amey wanted to avoid
quirks.

Do you have an example of such devices or we are talking about
theoretical case?

And I don't see why simple line parser with loop iterator over strchr()
suddenly becomes complicated code.

Thanks

> 
> Alex
> 


Re: [PATCH] net: make unregister netdev warning timeout configurable

2021-03-21 Thread Leon Romanovsky
On Sat, Mar 20, 2021 at 03:28:51PM +0100, Dmitry Vyukov wrote:
> netdev_wait_allrefs() issues a warning if refcount does not drop to 0
> after 10 seconds. While 10 second wait generally should not happen
> under normal workload in normal environment, it seems to fire falsely
> very often during fuzzing and/or in qemu emulation (~10x slower).
> At least it's not possible to understand if it's really a false
> positive or not. Automated testing generally bumps all timeouts
> to very high values to avoid flake failures.
> Make the timeout configurable for automated testing systems.
> Lowering the timeout may also be useful for e.g. manual bisection.
> The default value matches the current behavior.
> 
> Signed-off-by: Dmitry Vyukov 
> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  net/Kconfig| 12 
>  net/core/dev.c |  4 +++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 

Our verification team would like to see this change too.

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation

2021-03-21 Thread Leon Romanovsky
On Sat, Mar 20, 2021 at 09:17:29AM -0500, Alex Elder wrote:
> There are blocks of IPA code that sanity-check various values, at
> compile time where possible.  Most of these checks can be done once
> during development but skipped for normal operation.  These checks
> permit the driver to make certain assumptions, thereby avoiding the
> need for runtime error checking.
> 
> The checks are defined conditionally, but not consistently.  In
> some cases IPA_VALIDATION enables the optional checks, while in
> others IPA_VALIDATE is used.
> 
> Fix this by using IPA_VALIDATION consistently.
> 
> Signed-off-by: Alex Elder 
> ---
>  drivers/net/ipa/Makefile   | 2 +-
>  drivers/net/ipa/gsi_trans.c| 8 
>  drivers/net/ipa/ipa_cmd.c  | 4 ++--
>  drivers/net/ipa/ipa_cmd.h  | 6 +++---
>  drivers/net/ipa/ipa_endpoint.c | 6 +++---
>  drivers/net/ipa/ipa_main.c | 6 +++---
>  drivers/net/ipa/ipa_mem.c  | 6 +++---
>  drivers/net/ipa/ipa_table.c| 6 +++---
>  drivers/net/ipa/ipa_table.h| 6 +++---
>  9 files changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/ipa/Makefile b/drivers/net/ipa/Makefile
> index afe5df1e6..014ae36ac6004 100644
> --- a/drivers/net/ipa/Makefile
> +++ b/drivers/net/ipa/Makefile
> @@ -1,5 +1,5 @@
>  # Un-comment the next line if you want to validate configuration data
> -#ccflags-y   +=  -DIPA_VALIDATE
> +# ccflags-y  +=  -DIPA_VALIDATION

Maybe netdev folks think differently here, but general rule that dead
code and closed code is such, is not acceptable to in Linux kernel.

<...>

>  
> -#ifdef IPA_VALIDATE
> +#ifdef IPA_VALIDATION
>   if (!size || size % 8)
>   return -EINVAL;
>   if (count < max_alloc)
>   return -EINVAL;
>   if (!max_alloc)
>   return -EINVAL;
> -#endif /* IPA_VALIDATE */
> +#endif /* IPA_VALIDATION */

If it is possible to supply those values, the check should be always and
not only under some closed config option.

>  
>   /* By allocating a few extra entries in our pool (one less
>* than the maximum number that will be requested in a
> @@ -140,14 +140,14 @@ int gsi_trans_pool_init_dma(struct device *dev, struct 
> gsi_trans_pool *pool,
>   dma_addr_t addr;
>   void *virt;
>  
> -#ifdef IPA_VALIDATE
> +#ifdef IPA_VALIDATION
>   if (!size || size % 8)
>   return -EINVAL;
>   if (count < max_alloc)
>   return -EINVAL;
>   if (!max_alloc)
>   return -EINVAL;
> -#endif /* IPA_VALIDATE */
> +#endif /* IPA_VALIDATION */

Same

<...>

>  {
> -#ifdef IPA_VALIDATE
> +#ifdef IPA_VALIDATION
>   /* At one time we assumed a 64-bit build, allowing some do_div()
>* calls to be replaced by simple division or modulo operations.
>* We currently only perform divide and modulo operations on u32,
> @@ -768,7 +768,7 @@ static void ipa_validate_build(void)
>   BUILD_BUG_ON(!ipa_aggr_granularity_val(IPA_AGGR_GRANULARITY));
>   BUILD_BUG_ON(ipa_aggr_granularity_val(IPA_AGGR_GRANULARITY) >
>   field_max(AGGR_GRANULARITY_FMASK));
> -#endif /* IPA_VALIDATE */
> +#endif /* IPA_VALIDATION */

BUILD_BUG_ON()s are checked during compilation and not during runtime
like IPA_VALIDATION promised.

IMHO, the issue here is that this IPA code isn't release quality but
some debug drop variant and it is far from expected from submitted code.

Thanks


Re: [PATCH master] module: remove never implemented MODULE_SUPPORTED_DEVICE

2021-03-20 Thread Leon Romanovsky
On Thu, Mar 18, 2021 at 10:55:51AM -0700, Linus Torvalds wrote:
> On Thu, Mar 18, 2021 at 10:49 AM Leon Romanovsky  wrote:
> >
> > No, I opened patch and added the note manually, so it is definitely my VIM.
> > Most likely this part of my .vimrc caused it.
> 
> Ok, that would do it.
> 
> Yeah, whitespace is easy to "fix" at patch application time, but it
> really is meaningful and you never should change whitespace for
> patches.
> 
> Maybe you can limit your rules to just particular file types
> (although, honestly, I think it's bad for headers and C files too when
> it then causes entirely irrelevant and independent changes - you only
> want your own _new_ edits to be whitespace-clean, not fix other random
> issues).
> 
> Better yet, maybe not "fix whitespace" at all, but have some code
> coloring logic that just points out bad whitespace? I use "git diff"
> myself, with colorization being default for tty operations:
> 
> [color]
> ui=auto
> 
> so that then "git diff" will show you your (new) evil whitespace
> errors when you review your changes, but won't complain about existing
> whitespace issues..

Yeah, my color.ui is "always", so I will simply remove the problematic
line from VIM and won't change whitespaces at all.

Thanks

> 
>   Linus


  1   2   3   4   5   6   7   8   9   10   >