Re: [PATCH rdma-next v2 05/17] RDMA/counter: Add set/clear per-port auto mode support

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:41AM +0300, Leon Romanovsky wrote:
> From: Mark Zhang 
> 
> Add an API to support set/clear per-port auto mode.
> 
> Signed-off-by: Mark Zhang 
> Reviewed-by: Majd Dibbiny 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/Makefile   |  2 +-
>  drivers/infiniband/core/counters.c | 77 ++
>  drivers/infiniband/core/device.c   |  4 ++
>  include/rdma/ib_verbs.h|  2 +
>  include/rdma/rdma_counter.h| 24 ++
>  include/uapi/rdma/rdma_netlink.h   | 26 ++
>  6 files changed, 134 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/infiniband/core/counters.c
> 
> diff --git a/drivers/infiniband/core/Makefile 
> b/drivers/infiniband/core/Makefile
> index 313f2349b518..cddf748c15c9 100644
> +++ b/drivers/infiniband/core/Makefile
> @@ -12,7 +12,7 @@ ib_core-y :=packer.o ud_header.o 
> verbs.o cq.o rw.o sysfs.o \
>   device.o fmr_pool.o cache.o netlink.o \
>   roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
>   multicast.o mad.o smi.o agent.o mad_rmpp.o \
> - nldev.o restrack.o
> + nldev.o restrack.o counters.o
>  
>  ib_core-$(CONFIG_SECURITY_INFINIBAND) += security.o
>  ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
> diff --git a/drivers/infiniband/core/counters.c 
> b/drivers/infiniband/core/counters.c
> new file mode 100644
> index ..bda8d945a758
> +++ b/drivers/infiniband/core/counters.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2019 Mellanox Technologies. All rights reserved.
> + */
> +#include 
> +#include 
> +
> +#include "core_priv.h"
> +#include "restrack.h"
> +
> +#define ALL_AUTO_MODE_MASKS (RDMA_COUNTER_MASK_QP_TYPE)
> +
> +static int __counter_set_mode(struct rdma_counter_mode *curr,
> +   enum rdma_nl_counter_mode new_mode,
> +   enum rdma_nl_counter_mask new_mask)
> +{
> + if ((new_mode == RDMA_COUNTER_MODE_AUTO) &&
> + ((new_mask & (~ALL_AUTO_MODE_MASKS)) ||
> +  (curr->mode != RDMA_COUNTER_MODE_NONE)))
> + return -EINVAL;
> +
> + curr->mode = new_mode;
> + curr->mask = new_mask;
> + return 0;
> +}
> +
> +/**
> + * rdma_counter_set_auto_mode() - Turn on/off per-port auto mode
> + *
> + * When @on is true, the @mask must be set
> + */
> +int rdma_counter_set_auto_mode(struct ib_device *dev, u8 port,
> +bool on, enum rdma_nl_counter_mask mask)
> +{
> + struct rdma_port_counter *port_counter;
> + int ret;
> +
> + if (!rdma_is_port_valid(dev, port))
> + return -EINVAL;
> +
> + port_counter = &dev->port_data[port].port_counter;
> + mutex_lock(&port_counter->lock);
> + if (on) {
> + ret = __counter_set_mode(&port_counter->mode,
> +  RDMA_COUNTER_MODE_AUTO, mask);
> + } else {
> + if (port_counter->mode.mode != RDMA_COUNTER_MODE_AUTO) {
> + ret = -EINVAL;
> + goto out;
> + }
> + ret = __counter_set_mode(&port_counter->mode,
> +  RDMA_COUNTER_MODE_NONE, 0);
> + }
> +
> +out:
> + mutex_unlock(&port_counter->lock);
> + return ret;
> +}
> +
> +void rdma_counter_init(struct ib_device *dev)
> +{
> + struct rdma_port_counter *port_counter;
> + u32 port;
> +
> + if (!dev->ops.alloc_hw_stats)
> + return;
> +
> + rdma_for_each_port(dev, port) {
> + port_counter = &dev->port_data[port].port_counter;
> + port_counter->mode.mode = RDMA_COUNTER_MODE_NONE;
> + mutex_init(&port_counter->lock);
> + }
> +}
> +
> +void rdma_counter_cleanup(struct ib_device *dev)
> +{
> +}

Please don't add empty functions

> @@ -1304,6 +1307,7 @@ static void __ib_unregister_device(struct ib_device 
> *ib_dev)
>   goto out;
>  
>   disable_device(ib_dev);
> + rdma_counter_cleanup(ib_dev);

This is the wrong place to call this. The patch that actually adds a
body to it only does kfrees, so it is properly called
'rdma_counter_release' and it belongs in ib_device_release().

And it shouldn't test hw_stats, and it shouldn't have a 'fail' stanza
for allocation either.
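
As a rough illustration of what is being asked for (a sketch, not the
follow-up patch; it assumes the eventual cleanup only has to free the
per-port hstats allocation added later in the series):

void rdma_counter_release(struct ib_device *dev)
{
        struct rdma_port_counter *port_counter;
        u32 port;

        /* Called from ib_device_release(): nothing to unwind, just kfree */
        rdma_for_each_port(dev, port) {
                port_counter = &dev->port_data[port].port_counter;
                kfree(port_counter->hstats);
        }
}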

Jason


Re: [PATCH rdma-next v2 13/17] RDMA/core: Get sum value of all counters when perform a sysfs stat read

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:49AM +0300, Leon Romanovsky wrote:
> From: Mark Zhang 
> 
> Since a QP can only be bound to one counter, then if it is bound to a
> separate counter, for backward compatibility purpose, the statistic
> value must be:
> * stat of default counter
> + stat of all running allocated counters
> + stat of all deallocated counters (history stats)
> 
> Signed-off-by: Mark Zhang 
> Reviewed-by: Majd Dibbiny 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/counters.c | 99 +-
>  drivers/infiniband/core/device.c   |  8 ++-
>  drivers/infiniband/core/sysfs.c| 10 ++-
>  include/rdma/rdma_counter.h|  5 +-
>  4 files changed, 113 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/core/counters.c 
> b/drivers/infiniband/core/counters.c
> index 36cd9eca1e46..f598b1cdb241 100644
> +++ b/drivers/infiniband/core/counters.c
> @@ -146,6 +146,20 @@ static int __rdma_counter_bind_qp(struct rdma_counter 
> *counter,
>   return ret;
>  }
>  
> +static void counter_history_stat_update(const struct rdma_counter *counter)
> +{
> + struct ib_device *dev = counter->device;
> + struct rdma_port_counter *port_counter;
> + int i;
> +
> + port_counter = &dev->port_data[counter->port].port_counter;
> + if (!port_counter->hstats)
> + return;
> +
> + for (i = 0; i < counter->stats->num_counters; i++)
> + port_counter->hstats->value[i] += counter->stats->value[i];
> +}
> +
>  static int __rdma_counter_unbind_qp(struct ib_qp *qp, bool force)
>  {
>   struct rdma_counter *counter = qp->counter;
> @@ -285,8 +299,10 @@ int rdma_counter_unbind_qp(struct ib_qp *qp, bool force)
>   return ret;
>  
>   rdma_restrack_put(&counter->res);
> - if (atomic_dec_and_test(&counter->usecnt))
> + if (atomic_dec_and_test(&counter->usecnt)) {
> + counter_history_stat_update(counter);
>   rdma_counter_dealloc(counter);
> + }
>  
>   return 0;
>  }
> @@ -307,21 +323,98 @@ int rdma_counter_query_stats(struct rdma_counter 
> *counter)
>   return ret;
>  }
>  
> -void rdma_counter_init(struct ib_device *dev)
> +static u64 get_running_counters_hwstat_sum(struct ib_device *dev,
> +u8 port, u32 index)
> +{
> + struct rdma_restrack_entry *res;
> + struct rdma_restrack_root *rt;
> + struct rdma_counter *counter;
> + unsigned long id = 0;
> + u64 sum = 0;
> +
> + rt = &dev->res[RDMA_RESTRACK_COUNTER];
> + xa_lock(&rt->xa);
> + xa_for_each(&rt->xa, id, res) {
> + if (!rdma_restrack_get(res))
> + continue;

Why do we need to get refcounts if we are holding the xa_lock?

> +
> + counter = container_of(res, struct rdma_counter, res);
> + if ((counter->device != dev) || (counter->port != port))
> + goto next;
> +
> + if (rdma_counter_query_stats(counter))
> + goto next;

And rdma_counter_query_stats does

+   mutex_lock(&counter->lock);

So this was never tested as it will insta-crash with lockdep.

Presumably this is why it is using xa_for_each and restrack_get - but
it needs to drop the lock after successful get.

This sort of comment applies to nearly every place in this series that
uses xa_for_each.

This needs to be tested with lockdep.
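
For reference, a sketch of the locking pattern being asked for, based on
the get_running_counters_hwstat_sum() above (illustrative, not the
respin): take the restrack reference under xa_lock, then drop the lock
before calling anything that sleeps or takes counter->lock.

static u64 get_running_counters_hwstat_sum(struct ib_device *dev,
                                           u8 port, u32 index)
{
        struct rdma_restrack_root *rt = &dev->res[RDMA_RESTRACK_COUNTER];
        struct rdma_restrack_entry *res;
        struct rdma_counter *counter;
        unsigned long id = 0;
        u64 sum = 0;

        xa_lock(&rt->xa);
        xa_for_each(&rt->xa, id, res) {
                if (!rdma_restrack_get(res))
                        continue;

                /* Drop the spinlock; the reference keeps the entry alive */
                xa_unlock(&rt->xa);

                counter = container_of(res, struct rdma_counter, res);
                if (counter->device == dev && counter->port == port &&
                    !rdma_counter_query_stats(counter))
                        sum += counter->stats->value[index];

                rdma_restrack_put(res);
                xa_lock(&rt->xa);
        }
        xa_unlock(&rt->xa);

        return sum;
}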

Jason


Re: [PATCH rdma-next v2 06/17] RDMA/counter: Add "auto" configuration mode support

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:42AM +0300, Leon Romanovsky wrote:

>  void rdma_counter_init(struct ib_device *dev)
>  {
>   struct rdma_port_counter *port_counter;
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index 9204b4251fc8..dfaa57de871f 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -2349,6 +2349,8 @@ void ib_set_device_ops(struct ib_device *dev, const 
> struct ib_device_ops *ops)
>   SET_DEVICE_OP(dev_ops, set_vf_guid);
>   SET_DEVICE_OP(dev_ops, set_vf_link_state);
>   SET_DEVICE_OP(dev_ops, unmap_fmr);
> + SET_DEVICE_OP(dev_ops, counter_bind_qp);
> + SET_DEVICE_OP(dev_ops, counter_unbind_qp);

Keep sorted
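
i.e. the new entries should land in their alphabetical position rather
than at the end, roughly (surrounding entries shown only to illustrate
where they sort):

        SET_DEVICE_OP(dev_ops, counter_bind_qp);
        SET_DEVICE_OP(dev_ops, counter_unbind_qp);
        SET_DEVICE_OP(dev_ops, create_ah);
        SET_DEVICE_OP(dev_ops, create_cq);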

Jason


Re: [PATCH rdma-next v2 06/17] RDMA/counter: Add "auto" configuration mode support

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:42AM +0300, Leon Romanovsky wrote:

> +/**
> + * rdma_counter_unbind_qp - Unbind a qp from a counter
> + * @force:
> + *   true - Decrease the counter ref-count anyway (e.g., qp destroy)
> + */
> +int rdma_counter_unbind_qp(struct ib_qp *qp, bool force)
> +{
> + struct rdma_counter *counter = qp->counter;
> + int ret;
> +
> + if (!counter)
> + return -EINVAL;
> +
> + ret = __rdma_counter_unbind_qp(qp, force);
> + if (ret && !force)
> + return ret;
> +
> + rdma_restrack_put(&counter->res);
> + if (atomic_dec_and_test(&counter->usecnt))
> + rdma_counter_dealloc(counter);

An atomic that does kfree when it reaches zero should be implemented
with a kref.
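
A hedged sketch of the kref pattern being suggested, assuming struct
rdma_counter grows a 'struct kref kref' member in place of the usecnt
atomic; the names are illustrative, not from the series:

static void rdma_counter_free(struct kref *kref)
{
        struct rdma_counter *counter =
                container_of(kref, struct rdma_counter, kref);

        rdma_restrack_del(&counter->res);
        kfree(counter);
}

/* bind:   kref_get(&counter->kref);                    */
/* unbind: kref_put(&counter->kref, rdma_counter_free); */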

Jason


Re: [PATCH rdma-next v2 11/17] RDMA/netlink: Implement counter dumpit calback

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:47AM +0300, Leon Romanovsky wrote:
> From: Mark Zhang 
> 
> This patch adds the ability to return all available counters
> together with their properties and hwstats.
> 
> Signed-off-by: Mark Zhang 
> Reviewed-by: Majd Dibbiny 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/counters.c |  28 +
>  drivers/infiniband/core/device.c   |   2 +
>  drivers/infiniband/core/nldev.c| 173 +
>  include/rdma/ib_verbs.h|  10 ++
>  include/rdma/rdma_counter.h|   3 +
>  include/uapi/rdma/rdma_netlink.h   |  10 +-
>  6 files changed, 225 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/counters.c 
> b/drivers/infiniband/core/counters.c
> index 665e0d43c21b..36cd9eca1e46 100644
> +++ b/drivers/infiniband/core/counters.c
> @@ -62,6 +62,9 @@ static struct rdma_counter *rdma_counter_alloc(struct 
> ib_device *dev, u8 port,
>  {
>   struct rdma_counter *counter;
>  
> + if (!dev->ops.counter_alloc_stats)
> + return NULL;
> +

Seems weird to add this now, why was it Ok to have counters prior to
this patch?

Jason


Re: [PATCH rdma-next v2 11/17] RDMA/netlink: Implement counter dumpit calback

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:47AM +0300, Leon Romanovsky wrote:
> From: Mark Zhang 
> 
> This patch adds the ability to return all available counters
> together with their properties and hwstats.
> 
> Signed-off-by: Mark Zhang 
> Reviewed-by: Majd Dibbiny 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/counters.c |  28 +
>  drivers/infiniband/core/device.c   |   2 +
>  drivers/infiniband/core/nldev.c| 173 +
>  include/rdma/ib_verbs.h|  10 ++
>  include/rdma/rdma_counter.h|   3 +
>  include/uapi/rdma/rdma_netlink.h   |  10 +-
>  6 files changed, 225 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/counters.c 
> b/drivers/infiniband/core/counters.c
> index 665e0d43c21b..36cd9eca1e46 100644
> +++ b/drivers/infiniband/core/counters.c
> @@ -62,6 +62,9 @@ static struct rdma_counter *rdma_counter_alloc(struct 
> ib_device *dev, u8 port,
>  {
>   struct rdma_counter *counter;
>  
> + if (!dev->ops.counter_alloc_stats)
> + return NULL;
> +
>   counter = kzalloc(sizeof(*counter), GFP_KERNEL);
>   if (!counter)
>   return NULL;
> @@ -69,16 +72,25 @@ static struct rdma_counter *rdma_counter_alloc(struct 
> ib_device *dev, u8 port,
>   counter->device= dev;
>   counter->port  = port;
>   counter->res.type  = RDMA_RESTRACK_COUNTER;
> + counter->stats = dev->ops.counter_alloc_stats(counter);
> + if (!counter->stats)
> + goto err_stats;
> +
>   counter->mode.mode = mode;
>   atomic_set(&counter->usecnt, 0);
>   mutex_init(&counter->lock);
>  
>   return counter;
> +
> +err_stats:
> + kfree(counter);
> + return NULL;
>  }
>  
>  static void rdma_counter_dealloc(struct rdma_counter *counter)
>  {
>   rdma_restrack_del(&counter->res);
> + kfree(counter->stats);
>   kfree(counter);
>  }
>  
> @@ -279,6 +291,22 @@ int rdma_counter_unbind_qp(struct ib_qp *qp, bool force)
>   return 0;
>  }
>  
> +int rdma_counter_query_stats(struct rdma_counter *counter)
> +{
> + int ret;
> +
> + struct ib_device *dev = counter->device;
> +

Extra blank line
Something about festive trees
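
("Festive trees" is the reverse-Christmas-tree convention: local
declarations ordered longest line first, with no blank line splitting
them.) For example (illustrative):

        struct ib_device *dev = counter->device;
        int ret;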

Jason


Re: [PATCH rdma-next v2 13/17] RDMA/core: Get sum value of all counters when perform a sysfs stat read

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:49AM +0300, Leon Romanovsky wrote:
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index c56ffc61ab1e..8ae4906a60e7 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -1255,7 +1255,11 @@ int ib_register_device(struct ib_device *device, const 
> char *name)
>   goto dev_cleanup;
>   }
>  
> - rdma_counter_init(device);
> + ret = rdma_counter_init(device);
> + if (ret) {
> + dev_warn(&device->dev, "Couldn't initialize counter\n");
> + goto sysfs_cleanup;
> + }

Don't place these things randomly. If there is some reason it should be
after sysfs it needs a comment; otherwise, if it is just allocating
memory, it belongs earlier, and the unwind should be done in release.

I also think it is very strange/wrong that both sysfs and counters are
allocating the same alloc_hw_stats object

Why can't they share?

Jason


Re: [PATCH rdma-next v2 17/17] RDMA/nldev: Allow get default counter statistics through RDMA netlink

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:53AM +0300, Leon Romanovsky wrote:
> From: Mark Zhang 
> 
> This patch adds the ability to return the hwstats of per-port default
> counters (which can also be queried through sysfs nodes).
> 
> Signed-off-by: Mark Zhang 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/nldev.c | 101 +++-
>  1 file changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
> index 53c1d2d82a06..cb2dd38f49f1 100644
> +++ b/drivers/infiniband/core/nldev.c
> @@ -1709,6 +1709,98 @@ static int nldev_stat_del_doit(struct sk_buff *skb, 
> struct nlmsghdr *nlh,
>   return ret;
>  }
>  
> +static int nldev_res_get_default_counter_doit(struct sk_buff *skb,
> +   struct nlmsghdr *nlh,
> +   struct netlink_ext_ack *extack,
> +   struct nlattr *tb[])
> +{
> + struct rdma_hw_stats *stats;
> + struct nlattr *table_attr;
> + struct ib_device *device;
> + int ret, num_cnts, i;
> + struct sk_buff *msg;
> + u32 index, port;
> + u64 v;
> +
> + if (!tb[RDMA_NLDEV_ATTR_DEV_INDEX] || !tb[RDMA_NLDEV_ATTR_PORT_INDEX])
> + return -EINVAL;
> +
> + index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
> + device = ib_device_get_by_index(sock_net(skb->sk), index);
> + if (!device)
> + return -EINVAL;
> +
> + if (!device->ops.alloc_hw_stats || !device->ops.get_hw_stats) {
> + ret = -EINVAL;
> + goto err;
> + }
> +
> + port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
> + if (!rdma_is_port_valid(device, port)) {
> + ret = -EINVAL;
> + goto err;
> + }
> +
> + msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> + if (!msg) {
> + ret = -ENOMEM;
> + goto err;
> + }
> +
> + nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
> + RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
> +  RDMA_NLDEV_CMD_STAT_GET),
> + 0, 0);
> +
> + if (fill_nldev_handle(msg, device) ||
> + nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, port)) {
> + ret = -EMSGSIZE;
> + goto err_msg;
> + }
> +
> + stats = device->ops.alloc_hw_stats(device, port);
> + if (!stats) {
> + ret = -ENOMEM;
> + goto err_msg;
> + }

Why do we need yet another one of these to be allocated?

> + num_cnts = device->ops.get_hw_stats(device, stats, port, 0);

Is '0' right here?

Jason


Re: [PATCH rdma-next v2 00/17] Statistics counter support

2019-05-22 Thread Jason Gunthorpe
On Mon, Apr 29, 2019 at 11:34:36AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Changelog:
>  v1 -> v2:
>  * Rebased to latest rdma-next
>  v0 -> v1:
>  * Changed wording of counter comment
>  * Removed unneeded assignments
>  * Added extra patch to present global counters
> 
>  * I didn't change QP type from int to be enum ib_qp_type,
>because it caused to cyclic dependency between ib_verbs.h and
>rdma_counter.h.
> 
> 
> Hi,
> 
> This series from Mark provides dynamic statistics infrastructure.
> He uses netlink interface to configure and retrieve those counters.
> 
> This infrastructure allows to users monitor various objects by binding
> to them counters. As the beginning, we used QP object as target for
> those counters, but future patches will include ODP MR information too.
> 
> Two binding modes are supported:
>  - Auto: This allows a user to build automatic set of objects to a counter
>according to common criteria. For example in a per-type scheme, where in
>one process all QPs with same QP type are bound automatically to a single
>counter.
>  - Manual: This allows a user to manually bind objects on a counter.
> 
> Those two modes are mutual-exclusive with separation between processes,
> objects created by different processes cannot be bound to a same counter.
> 
> For objects which don't support counter binding, we will return
> pre-allocated counters.
> 
> $ rdma statistic qp set link mlx5_2/1 auto type on
> $ rdma statistic qp set link mlx5_2/1 auto off
> $ rdma statistic qp bind link mlx5_2/1 lqpn 178
> $ rdma statistic qp unbind link mlx5_2/1 cntn 4 lqpn 178
> $ rdma statistic show
> $ rdma statistic qp mode

Can you please include the command outputs?

Jason


Re: [PATCH rdma-next v2 13/17] RDMA/core: Get sum value of all counters when perform a sysfs stat read

2019-05-29 Thread Jason Gunthorpe
On Wed, May 29, 2019 at 02:15:44PM +0300, Leon Romanovsky wrote:
> On Wed, May 22, 2019 at 02:10:42PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2019 at 11:34:49AM +0300, Leon Romanovsky wrote:
> > > From: Mark Zhang 
> > >
> > > Since a QP can only be bound to one counter, then if it is bound to a
> > > separate counter, for backward compatibility purpose, the statistic
> > > value must be:
> > > * stat of default counter
> > > + stat of all running allocated counters
> > > + stat of all deallocated counters (history stats)
> > >
> > > Signed-off-by: Mark Zhang 
> > > Reviewed-by: Majd Dibbiny 
> > > Signed-off-by: Leon Romanovsky 
> > >  drivers/infiniband/core/counters.c | 99 +-
> > >  drivers/infiniband/core/device.c   |  8 ++-
> > >  drivers/infiniband/core/sysfs.c| 10 ++-
> > >  include/rdma/rdma_counter.h|  5 +-
> > >  4 files changed, 113 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/core/counters.c 
> > > b/drivers/infiniband/core/counters.c
> > > index 36cd9eca1e46..f598b1cdb241 100644
> > > +++ b/drivers/infiniband/core/counters.c
> > > @@ -146,6 +146,20 @@ static int __rdma_counter_bind_qp(struct 
> > > rdma_counter *counter,
> > >   return ret;
> > >  }
> > >
> > > +static void counter_history_stat_update(const struct rdma_counter 
> > > *counter)
> > > +{
> > > + struct ib_device *dev = counter->device;
> > > + struct rdma_port_counter *port_counter;
> > > + int i;
> > > +
> > > + port_counter = &dev->port_data[counter->port].port_counter;
> > > + if (!port_counter->hstats)
> > > + return;
> > > +
> > > + for (i = 0; i < counter->stats->num_counters; i++)
> > > + port_counter->hstats->value[i] += counter->stats->value[i];
> > > +}
> > > +
> > >  static int __rdma_counter_unbind_qp(struct ib_qp *qp, bool force)
> > >  {
> > >   struct rdma_counter *counter = qp->counter;
> > > @@ -285,8 +299,10 @@ int rdma_counter_unbind_qp(struct ib_qp *qp, bool 
> > > force)
> > >   return ret;
> > >
> > >   rdma_restrack_put(&counter->res);
> > > - if (atomic_dec_and_test(&counter->usecnt))
> > > + if (atomic_dec_and_test(&counter->usecnt)) {
> > > + counter_history_stat_update(counter);
> > >   rdma_counter_dealloc(counter);
> > > + }
> > >
> > >   return 0;
> > >  }
> > > @@ -307,21 +323,98 @@ int rdma_counter_query_stats(struct rdma_counter 
> > > *counter)
> > >   return ret;
> > >  }
> > >
> > > -void rdma_counter_init(struct ib_device *dev)
> > > +static u64 get_running_counters_hwstat_sum(struct ib_device *dev,
> > > +u8 port, u32 index)
> > > +{
> > > + struct rdma_restrack_entry *res;
> > > + struct rdma_restrack_root *rt;
> > > + struct rdma_counter *counter;
> > > + unsigned long id = 0;
> > > + u64 sum = 0;
> > > +
> > > + rt = &dev->res[RDMA_RESTRACK_COUNTER];
> > > + xa_lock(&rt->xa);
> > > + xa_for_each(&rt->xa, id, res) {
> > > + if (!rdma_restrack_get(res))
> > > + continue;
> >
> > Why do we need to get refcounts if we are holding the xa_lock?
> 
> Don't we need to protect an entry itself from disappearing?

xa_lock prevents xa_erase and xa_erase should be done before any
parallel kfree.

Jason


Re: [PATCH rdma-next v2 13/17] RDMA/core: Get sum value of all counters when perform a sysfs stat read

2019-05-29 Thread Jason Gunthorpe
On Wed, May 29, 2019 at 02:05:24PM +0300, Leon Romanovsky wrote:
> On Wed, May 22, 2019 at 02:26:36PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2019 at 11:34:49AM +0300, Leon Romanovsky wrote:
> > > diff --git a/drivers/infiniband/core/device.c 
> > > b/drivers/infiniband/core/device.c
> > > index c56ffc61ab1e..8ae4906a60e7 100644
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -1255,7 +1255,11 @@ int ib_register_device(struct ib_device *device, 
> > > const char *name)
> > >   goto dev_cleanup;
> > >   }
> > >
> > > - rdma_counter_init(device);
> > > + ret = rdma_counter_init(device);
> > > + if (ret) {
> > > + dev_warn(&device->dev, "Couldn't initialize counter\n");
> > > + goto sysfs_cleanup;
> > > + }
> >
> > Don't put this things randomly, if there is some reason it should be
> > after sysfs it needs a comment, otherwise if it is just allocating
> > memory it belongs earlier, and the unwind should be done in release.
> >
> > I also think it is very strange/wrong that both sysfs and counters are
> > allocating the same alloc_hw_stats object
> >
> > Why can't they share?
> 
> They can, but we wanted to separate "legacy" counters which were exposed
> through sysfs and "new" counters which can be enabled/disable automatically.

Is there any cross-contamination through the hw_stats? If not, they
should just share.

Jason


Re: [pull request][for-next 0/9] Generic DIM lib for netdev and RDMA

2019-06-06 Thread Jason Gunthorpe
On Thu, Jun 06, 2019 at 10:19:41AM +0300, Max Gurtovoy wrote:
> > > Solution:
> > > - Common logic is declared in include/linux/dim.h and implemented in
> > >lib/dim/dim.c
> > > - Net DIM (existing) logic is declared in include/linux/net_dim.h and
> > >implemented in lib/dim/net_dim.c, which uses the common logic from 
> > > dim.h
> > > - Any new DIM logic will be declared in "/include/linux/new_dim.h" and
> > > implemented in "lib/dim/new_dim.c".
> > > - This new implementation will expose modified versions of profiles,
> > >dim_step() and dim_decision().
> > > 
> > > Pros for this solution are:
> > > - Zero impact on existing net_dim implementation and usage
> > > - Relatively more code reuse (compared to two separate solutions)
> > > - Increased extensibility
> > > 
> > > Tal Gilboa (6):
> > >linux/dim: Move logic to dim.h
> > >linux/dim: Remove "net" prefix from internal DIM members
> > >linux/dim: Rename externally exposed macros
> > >linux/dim: Rename net_dim_sample() to net_dim_update_sample()
> > >linux/dim: Rename externally used net_dim members
> > >linux/dim: Move implementation to .c files
> > > 
> > > Yamin Friedman (3):
> > >linux/dim: Add completions count to dim_sample
> > >linux/dim: Implement rdma_dim
> > >RDMA/core: Provide RDMA DIM support for ULPs
> > Saeed,
> > 
> > No, for the RDMA patches.
> > We need to see usage of those APIs before merging.
> 
> I've asked Yamin to prepare patches for NVMeoF initiator and target for
> review, so I guess he has it on his plate (this is how he tested it..).
> 
> It might cause conflict with NVMe/blk branch maintained by Sagi, Christoph
> and Jens.

It looks like nvme could pull this series + the RDMA patches into the
nvme tree via PR? I'm not familiar with how that tree works.

But we need to get the patches posted right away..

Jason


Re: [pull request][for-next 0/9] Generic DIM lib for netdev and RDMA

2019-06-07 Thread Jason Gunthorpe
On Fri, Jun 07, 2019 at 06:14:11PM +, Saeed Mahameed wrote:
> On Thu, 2019-06-06 at 13:07 +0000, Jason Gunthorpe wrote:
> > On Thu, Jun 06, 2019 at 10:19:41AM +0300, Max Gurtovoy wrote:
> > > > > Solution:
> > > > > - Common logic is declared in include/linux/dim.h and
> > > > > implemented in
> > > > >lib/dim/dim.c
> > > > > - Net DIM (existing) logic is declared in
> > > > > include/linux/net_dim.h and
> > > > >implemented in lib/dim/net_dim.c, which uses the common
> > > > > logic from dim.h
> > > > > - Any new DIM logic will be declared in
> > > > > "/include/linux/new_dim.h" and
> > > > > implemented in "lib/dim/new_dim.c".
> > > > > - This new implementation will expose modified versions of
> > > > > profiles,
> > > > >dim_step() and dim_decision().
> > > > > 
> > > > > Pros for this solution are:
> > > > > - Zero impact on existing net_dim implementation and usage
> > > > > - Relatively more code reuse (compared to two separate
> > > > > solutions)
> > > > > - Increased extensibility
> > > > > 
> > > > > Tal Gilboa (6):
> > > > >linux/dim: Move logic to dim.h
> > > > >linux/dim: Remove "net" prefix from internal DIM members
> > > > >linux/dim: Rename externally exposed macros
> > > > >linux/dim: Rename net_dim_sample() to
> > > > > net_dim_update_sample()
> > > > >linux/dim: Rename externally used net_dim members
> > > > >linux/dim: Move implementation to .c files
> > > > > 
> > > > > Yamin Friedman (3):
> > > > >linux/dim: Add completions count to dim_sample
> > > > >linux/dim: Implement rdma_dim
> > > > >RDMA/core: Provide RDMA DIM support for ULPs
> > > > Saeed,
> > > > 
> > > > No, for the RDMA patches.
> > > > We need to see usage of those APIs before merging.
> > > 
> > > I've asked Yamin to prepare patches for NVMeoF initiator and target
> > > for
> > > review, so I guess he has it on his plate (this is how he tested
> > > it..).
> > > 
> > > It might cause conflict with NVMe/blk branch maintained by Sagi,
> > > Christoph
> > > and Jens.
> > 
> > It looks like nvme could pull this series + the RDMA patches into the
> > nvme tree via PR? I'm not familiar with how that tree works.
> > 
> > But we need to get the patches posted right away..
> > 
> 
> What do you suggest here ?
> I think the netdev community also deserve to see the rdma patches, at
> least with an external link, I can drop the last patch (or two ) ? but
> i need an external rdma link for people who are going to review this
> series.

Yes, all the patches need to be posted. We should have a 'double
branch' where you send the linux/dim & net stuff to net and then we
add the RDMA stuff on top and send to nvme & rdma with the ULP
patches

Assuming nvme takes pull requests.

But the whole thing should be posted as a single series on the list to
get acks before the PRs are generated.

Similar to how we've run the mlx5 shared branch

Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-25 Thread Jason Gunthorpe
On Tue, Jul 09, 2019 at 05:17:30PM +0300, Michal Kalderon wrote:
> Create some common API's for adding entries to a xa_mmap.
> Searching for an entry and freeing one.
> 
> The code was copied from the efa driver almost as is, just renamed
> function to be generic and not efa specific.
> 
> Signed-off-by: Ariel Elior 
> Signed-off-by: Michal Kalderon 
>  drivers/infiniband/core/device.c  |   1 +
>  drivers/infiniband/core/rdma_core.c   |   1 +
>  drivers/infiniband/core/uverbs_cmd.c  |   1 +
>  drivers/infiniband/core/uverbs_main.c | 135 
> ++
>  include/rdma/ib_verbs.h   |  46 
>  5 files changed, 184 insertions(+)
> 
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index 8a6ccb936dfe..a830c2c5d691 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev, const 
> struct ib_device_ops *ops)
>   SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
>   SET_DEVICE_OP(dev_ops, map_phys_fmr);
>   SET_DEVICE_OP(dev_ops, mmap);
> + SET_DEVICE_OP(dev_ops, mmap_free);
>   SET_DEVICE_OP(dev_ops, modify_ah);
>   SET_DEVICE_OP(dev_ops, modify_cq);
>   SET_DEVICE_OP(dev_ops, modify_device);
> diff --git a/drivers/infiniband/core/rdma_core.c 
> b/drivers/infiniband/core/rdma_core.c
> index ccf4d069c25c..1ed01b02401f 100644
> +++ b/drivers/infiniband/core/rdma_core.c
> @@ -816,6 +816,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file 
> *ufile,
>  
>   rdma_restrack_del(&ucontext->res);
>  
> + rdma_user_mmap_entries_remove_free(ucontext);
>   ib_dev->ops.dealloc_ucontext(ucontext);
>   kfree(ucontext);
>  
> diff --git a/drivers/infiniband/core/uverbs_cmd.c 
> b/drivers/infiniband/core/uverbs_cmd.c
> index 7ddd0e5bc6b3..44c0600245e4 100644
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct 
> uverbs_attr_bundle *attrs)
>  
>   mutex_init(&ucontext->per_mm_list_lock);
>   INIT_LIST_HEAD(&ucontext->per_mm_list);
> + xa_init(&ucontext->mmap_xa);
>  
>   ret = get_unused_fd_flags(O_CLOEXEC);
>   if (ret < 0)
> diff --git a/drivers/infiniband/core/uverbs_main.c 
> b/drivers/infiniband/core/uverbs_main.c
> index 11c13c1381cf..4b909d7b97de 100644
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -965,6 +965,141 @@ int rdma_user_mmap_io(struct ib_ucontext *ucontext, 
> struct vm_area_struct *vma,
>  }
>  EXPORT_SYMBOL(rdma_user_mmap_io);
>  
> +static inline u64
> +rdma_user_mmap_get_key(const struct rdma_user_mmap_entry *entry)
> +{
> + return (u64)entry->mmap_page << PAGE_SHIFT;
> +}
> +
> +/**
> + * rdma_user_mmap_entry_get() - Get an entry from the mmap_xa.
> + *
> + * @ucontext: associated user context.
> + * @key: The key received from rdma_user_mmap_entry_insert which
> + * is provided by user as the address to map.
> + * @len: The length the user wants to map
> + *
> + * This function is called when a user tries to mmap a key it
> + * initially received from the driver. They key was created by
> + * the function rdma_user_mmap_entry_insert.
> + *
> + * Return an entry if exists or NULL if there is no match.
> + */
> +struct rdma_user_mmap_entry *
> +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64 len)
> +{
> + struct rdma_user_mmap_entry *entry;
> + u64 mmap_page;
> +
> + mmap_page = key >> PAGE_SHIFT;
> + if (mmap_page > U32_MAX)
> + return NULL;
> +
> + entry = xa_load(&ucontext->mmap_xa, mmap_page);
> + if (!entry || entry->length != len)
> + return NULL;
> +
> + ibdev_dbg(ucontext->device,
> +   "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> +   entry->obj, key, entry->address, entry->length);
> +
> + return entry;
> +}
> +EXPORT_SYMBOL(rdma_user_mmap_entry_get);

It is a mistake we keep making, and maybe the war is hopelessly lost
now, but functions called from a driver should not be part of the
ib_uverbs module - ideally uverbs is an optional module. They should
be in ib_core.

Maybe put this in ib_core_uverbs.c ?

Kamal, you've been tackling various cleanups, maybe making ib_uverbs
unloadable again is something you'd be keen on?
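
A sketch of the build-level split being suggested, using only the file
name proposed above (illustrative):

# drivers/infiniband/core/Makefile: driver-facing helpers such as the
# mmap xarray code go into ib_core, so a bound driver never drags in the
# /dev/infiniband/uverbsN module.
ib_core-y += ib_core_uverbs.o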

> +/**
> + * rdma_user_mmap_entry_insert() - Allocate and insert an entry to the 
> mmap_xa.
> + *
> + * @ucontext: associated user context.
> + * @obj: opaque driver object that will be stored in the entry.
> + * @address: The address that will be mmapped to the user
> + * @length: Length of the address that will be mmapped
> + * @mmap_flag: opaque driver flags related to the address (For
> + *   example could be used for cachability)
> + *
> + * This function should be called by drivers that use the rdma_user_mmap
> + * interface for handling user mmapped addresses. The database is handled in
> + * the core and helper functions are provided to insert entries into the
> + * database a

Re: [PATCH v6 rdma-next 5/6] RDMA/qedr: Add doorbell overflow recovery support

2019-07-25 Thread Jason Gunthorpe
On Tue, Jul 09, 2019 at 05:17:34PM +0300, Michal Kalderon wrote:

> +static int qedr_init_user_db_rec(struct ib_udata *udata,
> +  struct qedr_dev *dev, struct qedr_userq *q,
> +  bool requires_db_rec)
> +{
> + struct qedr_ucontext *uctx =
> + rdma_udata_to_drv_context(udata, struct qedr_ucontext,
> +   ibucontext);
> +
> + /* Aborting for non doorbell userqueue (SRQ) or non-supporting lib */
> + if (requires_db_rec == 0 || !uctx->db_rec)
> + return 0;
> +
> + /* Allocate a page for doorbell recovery, add to mmap ) */
> + q->db_rec_data = (void *)get_zeroed_page(GFP_KERNEL);

I now think this needs to be GFP_USER and our other drivers have a bug
here as well..
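
i.e. something along these lines (illustrative, not the respin);
GFP_USER marks the allocation as done on behalf of the userspace process,
which fits a page that will be mmapped to the user:

        q->db_rec_data = (void *)get_zeroed_page(GFP_USER);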

Jason


Re: [PATCH v6 rdma-next 0/6] RDMA/qedr: Use the doorbell overflow recovery mechanism for RDMA

2019-07-25 Thread Jason Gunthorpe
On Tue, Jul 09, 2019 at 05:17:29PM +0300, Michal Kalderon wrote:
> This patch series uses the doorbell overflow recovery mechanism
> introduced in
> commit 36907cd5cd72 ("qed: Add doorbell overflow recovery mechanism")
> for rdma ( RoCE and iWARP )
> 
> The first three patches modify the core code to contain helper
> functions for managing mmap_xa inserting, getting and freeing
> entries. The code was taken almost as is from the efa driver.
> There is still an open discussion on whether we should take
> this even further and make the entire mmap generic. Until a
> decision is made, I only created the database API and modified
> the efa and qedr driver to use it. The doorbell recovery code will be based
> on the common code.
> 
> Efa driver was compile tested only.
> 
> rdma-core pull request #493
> 
> Changes from V5:
> - Switch between driver dealloc_ucontext and mmap_entries_remove.
> - No need to verify the key after using the key to load an entry from
>   the mmap_xa.
> - Change mmap_free api to pass an 'entry' object.
> - Add documentation for mmap_free and for newly exported functions.
> - Fix some extra/missing line breaks.

Let's do SIW now as well; it has the same xa scheme copied from EFA.

Thanks,
Jason


Re: [PATCH v6 rdma-next 4/6] qed*: Change dpi_addr to be denoted with __iomem

2019-07-25 Thread Jason Gunthorpe
On Tue, Jul 09, 2019 at 05:17:33PM +0300, Michal Kalderon wrote:
> Several casts were required around dpi_addr parameter in qed_rdma_if.h
> This is an address on the doorbell bar and should therefore be marked
> with __iomem.
> 
> Reported-by: Jason Gunthorpe 
> Signed-off-by: Ariel Elior 
> Signed-off-by: Michal Kalderon 
>  drivers/infiniband/hw/qedr/main.c  | 2 +-
>  drivers/infiniband/hw/qedr/qedr.h  | 2 +-
>  drivers/net/ethernet/qlogic/qed/qed_rdma.c | 5 ++---
>  include/linux/qed/qed_rdma_if.h| 2 +-
>  4 files changed, 5 insertions(+), 6 deletions(-)

More lines are RDMA than net, so this patch was applied to for-next.

Thanks,
Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-25 Thread Jason Gunthorpe
On Thu, Jul 25, 2019 at 07:34:15PM +, Michal Kalderon wrote:
> > > + ibdev_dbg(ucontext->device,
> > > +   "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> > removed\n",
> > > +   entry->obj, key, entry->address, entry->length);
> > > +
> > > + return entry;
> > > +}
> > > +EXPORT_SYMBOL(rdma_user_mmap_entry_get);
> > 
> > It is a mistake we keep making, and maybe the war is hopelessly lost now,
> > but functions called from a driver should not be part of the ib_uverbs 
> > module
> > - ideally uverbs is an optional module. They should be in ib_core.
> > 
> > Maybe put this in ib_core_uverbs.c ?

> But if there isn't ib_uverbs user apps can't be run right ? and then
> these functions Won't get called anyway ?

Right, but we don't want loading the driver to force creating
/dev/infiniband/uverbs - so the driver support component of uverbs
should live in ib_core, and the /dev/ component should be in ib_uverbs.

> > > + xa_lock(&ucontext->mmap_xa);
> > > + if (check_add_overflow(ucontext->mmap_xa_page,
> > > +(u32)(length >> PAGE_SHIFT),
> > 
> > Should this be divide round up ?

> For cases that length is not rounded to PAGE_SHIFT? 

It should never happen, but yes
 
> > 
> > > +&next_mmap_page))
> > > + goto err_unlock;
> > 
> > I still don't like that this algorithm latches into a permanent failure 
> > when the
> > xa_page wraps.
> > 
> > It seems worth spending a bit more time here to tidy this.. Keep using the
> > mmap_xa_page scheme, but instead do something like
> > 
> > alloc_cyclic_range():
> > 
> > while () {
> >// Find first empty element in a cyclic way
> >xa_page_first = mmap_xa_page;
> >xa_find(xa, &xa_page_first, U32_MAX, XA_FREE_MARK)
> > 
> >// Is there a enough room to have the range?
> >if (check_add_overflow(xa_page_first, npages, &xa_page_end)) {
> >   mmap_xa_page = 0;
> >   continue;
> >}
> > 
> >// See if the element before intersects
> >elm = xa_find(xa, &zero, xa_page_end, 0);
> >if (elm && intersects(xa_page_first, xa_page_last, elm->first, 
> > elm->last)) {
> >   mmap_xa_page = elm->last + 1;
> >   continue
> >}
> > 
> >// xa_page_first -> xa_page_end should now be free
> >xa_insert(xa, xa_page_start, entry);
> >mmap_xa_page = xa_page_end + 1;
> >return xa_page_start;
> > }
> > 
> > Approximately, please check it.

> But we don't free entires from the xa_array ( only when ucontext is 
> destroyed) so how will 
> There be an empty element after we wrap ?  

Oh!

That should be fixed up too: in the general case, if a user is
creating/destroying driver objects in a loop we don't want memory usage
to be unbounded.

The rdma_user_mmap stuff has VMA ops that can refcount the xa entry
and now that this is core code it is easy enough to harmonize the two
things and track the xa side from the struct rdma_umap_priv
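
A loose sketch of that idea with made-up names ('entry', 'ref' and
rdma_user_mmap_entry_free are hypothetical here); only the refcounting
addition is shown, the existing list handling in these vm_ops is omitted:

static void rdma_umap_open(struct vm_area_struct *vma)
{
        struct rdma_umap_priv *priv = vma->vm_private_data;

        kref_get(&priv->entry->ref);
}

static void rdma_umap_close(struct vm_area_struct *vma)
{
        struct rdma_umap_priv *priv = vma->vm_private_data;

        /* Last mapping gone: free the entry and release its xa range */
        kref_put(&priv->entry->ref, rdma_user_mmap_entry_free);
}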

The question is, does EFA or qedr have a use model for this that
allows a userspace verb to create/destroy in a loop? ie do we need to
fix this right now?

Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-26 Thread Jason Gunthorpe
On Fri, Jul 26, 2019 at 08:42:07AM +, Michal Kalderon wrote:

> > > But we don't free entires from the xa_array ( only when ucontext is
> > > destroyed) so how will There be an empty element after we wrap ?
> > 
> > Oh!
> > 
> > That should be fixed up too, in the general case if a user is
> > creating/destroying driver objects in loop we don't want memory usage to
> > be unbounded.
> > 
> > The rdma_user_mmap stuff has VMA ops that can refcount the xa entry and
> > now that this is core code it is easy enough to harmonize the two things and
> > track the xa side from the struct rdma_umap_priv
> > 
> > The question is, does EFA or qedr have a use model for this that allows a
> > userspace verb to create/destroy in a loop? ie do we need to fix this right
> > now?

> The mapping occurs for every qp and cq creation. So yes.
>
> So do you mean add a ref-cnt to the xarray entry and from umap
> decrease the refcnt and free?

Yes, free the entry (release the HW resource) and release the xa_array
ID.

Then we may as well not use cyclic allocation for the xa; just the
algorithm above would be OK.

The zap should also clear the refs, and then when the ucontext is
destroyed we can just WARN_ON if the xarray is not already empty. Either
all the vmas were destroyed or all were zapped.

Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-29 Thread Jason Gunthorpe
On Mon, Jul 29, 2019 at 04:53:38PM +0300, Gal Pressman wrote:
> On 29/07/2019 15:58, Michal Kalderon wrote:
> >> From: linux-rdma-ow...@vger.kernel.org  >> ow...@vger.kernel.org> On Behalf Of Jason Gunthorpe
> >>
> >>> + xa_lock(&ucontext->mmap_xa);
> >>> + if (check_add_overflow(ucontext->mmap_xa_page,
> >>> +(u32)(length >> PAGE_SHIFT),
> >>> +&next_mmap_page))
> >>> + goto err_unlock;
> >>
> >> I still don't like that this algorithm latches into a permanent failure 
> >> when the
> >> xa_page wraps.
> >>
> >> It seems worth spending a bit more time here to tidy this.. Keep using the
> >> mmap_xa_page scheme, but instead do something like
> >>
> >> alloc_cyclic_range():
> >>
> >> while () {
> >>// Find first empty element in a cyclic way
> >>xa_page_first = mmap_xa_page;
> >>xa_find(xa, &xa_page_first, U32_MAX, XA_FREE_MARK)
> >>
> >>// Is there a enough room to have the range?
> >>if (check_add_overflow(xa_page_first, npages, &xa_page_end)) {
> >>   mmap_xa_page = 0;
> >>   continue;
> >>}
> >>
> >>// See if the element before intersects
> >>elm = xa_find(xa, &zero, xa_page_end, 0);
> >>if (elm && intersects(xa_page_first, xa_page_last, elm->first, 
> >> elm->last)) {
> >>   mmap_xa_page = elm->last + 1;
> >>   continue
> >>}
> >>
> >>// xa_page_first -> xa_page_end should now be free
> >>xa_insert(xa, xa_page_start, entry);
> >>mmap_xa_page = xa_page_end + 1;
> >>return xa_page_start;
> >> }
> >>
> >> Approximately, please check it.
> > Gal & Jason, 
> > 
> > Coming back to the mmap_xa_page algorithm. I couldn't find some background 
> > on this. 
> > Why do you need the length to be represented in the mmap_xa_page ?  
> > Why not simply use xa_alloc_cyclic ( like in siw )

I think siw is dealing with only PAGE_SIZE objects, efa had variable
sized ones.

> > This is simply a key to a mmap object... 
> 
> The intention was that the entry would "occupy" number of xarray elements
> according to its size (in pages). It wasn't initially like this, but IIRC this
> was preferred by Jason.

It is not so critical, maybe we could drop it if it is really
simplifying. But it doesn't look so hard to make an xa algorithm that
will be OK.

The offset/length is shown in things like lsof and what not, and from
a debugging perspective it makes a lot more sense if the offset/length
are sensible, ie they should not overlap.
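
A rough, untested elaboration of that: reserve every page index the
entry spans, so pgoff lookups and overlap checks are plain xa_load()s.
Cyclic wrap-around and locking are omitted; the names mirror the patch
under review, not the final core helper.

static int mmap_entry_reserve(struct ib_ucontext *ucontext,
                              struct rdma_user_mmap_entry *entry, u32 npages)
{
        u32 first = ucontext->mmap_xa_page;
        u32 i;
        int ret;

        for (i = 0; i < npages; i++) {
                ret = xa_insert(&ucontext->mmap_xa, first + i, entry,
                                GFP_KERNEL);
                if (ret)        /* -EBUSY if the slot is taken */
                        goto err;
        }

        entry->mmap_page = first;
        ucontext->mmap_xa_page = first + npages;
        return 0;

err:
        while (i--)
                xa_erase(&ucontext->mmap_xa, first + i);
        return ret;
}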

Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-29 Thread Jason Gunthorpe
On Sun, Jul 28, 2019 at 11:45:56AM +0300, Gal Pressman wrote:
> On 26/07/2019 16:23, Jason Gunthorpe wrote:
> > On Fri, Jul 26, 2019 at 08:42:07AM +, Michal Kalderon wrote:
> > 
> >>>> But we don't free entires from the xa_array ( only when ucontext is
> >>>> destroyed) so how will There be an empty element after we wrap ?
> >>>
> >>> Oh!
> >>>
> >>> That should be fixed up too, in the general case if a user is
> >>> creating/destroying driver objects in loop we don't want memory usage to
> >>> be unbounded.
> >>>
> >>> The rdma_user_mmap stuff has VMA ops that can refcount the xa entry and
> >>> now that this is core code it is easy enough to harmonize the two things 
> >>> and
> >>> track the xa side from the struct rdma_umap_priv
> >>>
> >>> The question is, does EFA or qedr have a use model for this that allows a
> >>> userspace verb to create/destroy in a loop? ie do we need to fix this 
> >>> right
> >>> now?
> > 
> >> The mapping occurs for every qp and cq creation. So yes.
> >>
> >> So do you mean add a ref-cnt to the xarray entry and from umap
> >> decrease the refcnt and free?
> > 
> > Yes, free the entry (release the HW resource) and release the xa_array
> > ID.
> 
> This is a bit tricky for EFA.
> The UAR BAR resources (LLQ for example) aren't cleaned up until the UAR is
> deallocated, so many of the entries won't really be freed when the refcount
> reaches zero (i.e the HW considers these entries as refcounted as long as the
> UAR exists). The best we can do is free the DMA buffers for appropriate 
> entries.

Drivers can still defer HW destruction until the ucontext is destroyed,
but this gives an option to move it sooner, which it looks like the other
drivers do need, as they can allocate these things in userspace loops.

Jason


Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions

2019-07-29 Thread Jason Gunthorpe
On Sun, Jul 28, 2019 at 12:30:51PM +0300, Kamal Heib wrote:
> > Maybe put this in ib_core_uverbs.c ?
> > 
> > Kamal, you've been tackling various cleanups, maybe making ib_uverbs
> > unloadable again is something you'd be keen on?
> >
> 
> Yes, Could you please give some background on that?

Most of it is my fault from being too careless, but the general notion
is that all of these

$ grep EXPORT_SYMBOL uverbs_main.c uverbs_cmd.c  uverbs_marshall.c  rdma_core.c 
uverbs_std_types*.c uverbs_uapi.c 
uverbs_main.c:EXPORT_SYMBOL(ib_uverbs_get_ucontext_file);
uverbs_main.c:EXPORT_SYMBOL(rdma_user_mmap_io);
uverbs_cmd.c:EXPORT_SYMBOL(flow_resources_alloc);
uverbs_cmd.c:EXPORT_SYMBOL(ib_uverbs_flow_resources_free);
uverbs_cmd.c:EXPORT_SYMBOL(flow_resources_add);
uverbs_marshall.c:EXPORT_SYMBOL(ib_copy_ah_attr_to_user);
uverbs_marshall.c:EXPORT_SYMBOL(ib_copy_qp_attr_to_user);
uverbs_marshall.c:EXPORT_SYMBOL(ib_copy_path_rec_to_user);
uverbs_marshall.c:EXPORT_SYMBOL(ib_copy_path_rec_from_user);
rdma_core.c:EXPORT_SYMBOL(uverbs_idr_class);
rdma_core.c:EXPORT_SYMBOL(uverbs_close_fd);
rdma_core.c:EXPORT_SYMBOL(uverbs_fd_class);
uverbs_std_types.c:EXPORT_SYMBOL(uverbs_destroy_def_handler);

Need to go into some 'ib_core uverbs support' .c file in the ib_core,
be moved to a header inline, or moved otherwise

Maybe it is now unrealistic that the uapi is so complicated, ie
uverbs_close_fd is just not easy to fixup..

Maybe the only ones that need fixing are ib_uverbs_get_ucontext_file and
rdma_user_mmap_io, as a lot of drivers are entangled on those now.

The other stuff is much harder..

Jason


Re: [PATCH rdma-next 0/3] ODP support for mlx5 DC QPs

2019-08-01 Thread Jason Gunthorpe
On Thu, Aug 01, 2019 at 03:21:36PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> From Michael,
> 
> The series adds support for on-demand paging for DC transport.
> Adding handling of DC WQE parsing upon page faults and exposing
> capabilities.
> 
> As DC is mlx-only transport, the capabilities are exposed to the user
> using the direct-verbs mechanism. Namely through the mlx5dv_query_device.

The cover letter should link to the RDMA core PR that uses the new
API...

Jason


Re: [PATCH mlx5-next 1/3] IB/mlx5: Query ODP capabilities for DC

2019-08-01 Thread Jason Gunthorpe
On Thu, Aug 01, 2019 at 03:21:37PM +0300, Leon Romanovsky wrote:

> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index ec571fd7fcf8..5eae8d734435 100644
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -944,7 +944,9 @@ struct mlx5_ifc_odp_cap_bits {
>  
>   struct mlx5_ifc_odp_per_transport_service_cap_bits xrc_odp_caps;
>  
> - u8 reserved_at_100[0x700];
> + struct mlx5_ifc_odp_per_transport_service_cap_bits dc_odp_caps;
> +
> + u8 reserved_at_100[0x6E0];
>  };

Not splitting this to mlx5-next?

Jason


Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-14 Thread Jason Gunthorpe
On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:

> Where I think you and I disagree is that I really think the MSI-X
> table size should be fixed at one value for the life of the VF.
> Instead of changing the table size it should be the number of vectors
> that are functional that should be the limit. Basically there should
> be only some that are functional and some that are essentially just
> dummy vectors that you can write values into that will never be used.

Ignoring the PCI config space to learn the # of available MSI-X
vectors is a big break from how the device's programming ABI works.

Or stated another way, that isn't compatible with any existing drivers
so it is basically not a useful approach as it can't be deployed.

I don't know why you think that is better.

Jason


Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-14 Thread Jason Gunthorpe
On Thu, Jan 14, 2021 at 09:55:24AM -0800, Alexander Duyck wrote:
> On Thu, Jan 14, 2021 at 8:49 AM Jason Gunthorpe  wrote:
> >
> > On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> >
> > > Where I think you and I disagree is that I really think the MSI-X
> > > table size should be fixed at one value for the life of the VF.
> > > Instead of changing the table size it should be the number of vectors
> > > that are functional that should be the limit. Basically there should
> > > be only some that are functional and some that are essentially just
> > > dummy vectors that you can write values into that will never be used.
> >
> > Ignoring the PCI config space to learn the # of available MSI-X
> > vectors is big break on the how the device's programming ABI works.
> >
> > Or stated another way, that isn't compatible with any existing drivers
> > so it is basically not a useful approach as it can't be deployed.
> >
> > I don't know why you think that is better.
> >
> > Jason
> 
> First off, this is technically violating the PCIe spec section 7.7.2.2
> because this is the device driver modifying the Message Control
> register for a device, even if it is the PF firmware modifying the VF.
> The table size is something that should be set and fixed at device
> creation and not changed.

The word "violating" is rather an over-reaction, at worst this is an
extension.

> The MSI-X table is essentially just an MMIO resource, and I believe it
> should not be resized, just as you wouldn't expect any MMIO BAR to be
> dynamically resized. 

Resizing the BAR is already defined; see commit 276b738deb5b ("PCI:
Add resizable BAR infrastructure").

As you say BAR and MSI vector table are not so different, it seems
perfectly in line with current PCI sig thinking to allow resizing the
MSI as well

> Many drivers don't make use of the full MSI-X table nor do they
> bother reading the size. We just populate a subset of the table
> based on the number of interrupt causes we will need to associate to
> interrupt handlers. 

This isn't about "many drivers"; this is about what mlx5 does in all
the various OS drivers it has, and mlx5 makes sophisticated use of
MSI-X.

> What I see this patch doing is trying to push driver PF policy onto
> the VF PCIe device configuration space dynamically.

Huh? This is using the PF to dynamically reconfigure a child VF beyond
what the PCI spec defined. This is done safely under Linux because no
driver is bound when it is reconfigured, and any stale config data is
flushed out of any OS caches.

This is also why there is not a strong desire to standardize an ECN at
PCI-sig, the rules for how resizing can work are complicated and OS
specific.

> Having some limited number of interrupt causes should really be what
> is limiting things here. 

MSI inherently requires dedicated on-die resources to implement, so
every device has a maximum # of MSI vectors it can currently
expose. This is a consequence of various PCI rules and applies to
all devices.

Making effective use of this limited pool requires a hard restriction
enforced by the secure domain (hypervisor and FW) on every
user. Every driver attached to the function needs to be aware of the
limit the secure domain enforces in order to operate properly. It has
nothing to do with "limited number of interrupt causes".

The standards based way to communicate that limit is the MSI table cap
size.

To complain that changing the MSI table cap size dynamically is
non-standard, and then offer up a completely non-standard way to operate
MSI instead, seems to miss the entire point.

The important standard is to keep the PCI config space acting per-spec
so all the various consumers can work as-is. The extension is to only
modify the rare hypervisor to support a dynamic MSI resizing extension
for VFs.

As far as applicability, any device working at high scale with MSI and
VMs is going to need this. Dynamically assigning the limited MSI HW is
really required to support the universe of VM configurations people
want. eg generally I would expect a VM to receive the number of MSI
vectors equal to the number of CPUs the VM gets.

> I see that being mostly a thing between the firmware and the VF in
> terms of configuration and not something that necessarily has to be
> pushed down onto the PCIe configuration space itself.

If the mlx5 drivers had been designed long ago to never use standards-based
MSI and instead did something internal with FW, you might have a point,
but they weren't. All the mlx5 drivers use standards-based MSI and
expect the config space to be correct.

Jason


Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-14 Thread Jason Gunthorpe
On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:

> > As you say BAR and MSI vector table are not so different, it seems
> > perfectly in line with current PCI sig thinking to allow resizing the
> > MSI as well
> 
> The resizing of a BAR has an extended capability that goes with it and
> isn't something that the device can just do on a whim. This patch set
> is not based on implementing some ECN for resizable MSI-X tables. I
> view it as arbitrarily rewriting the table size for a device after it
> is created.

The only difference is that resizing the BAR is backed by an ECN, and this
is an extension. The device does not "do it on a whim"; the OS tells it
when to change the size, exactly like for BAR resizing.

> In addition Leon still hasn't answered my question on preventing the
> VF driver from altering entries beyond the ones listed in the table.

Of course this is blocked, the FW completely revokes the HW resource
backing the vectors.

> From what I can tell, the mlx5 Linux driver never reads the MSI-X
> flags register so it isn't reading the MSI-X size either.

I don't know why you say that. All Linux drivers call into something
like pci_alloc_irq_vectors() requesting a maximum # of vectors and
that call returns the number actually allocated. Drivers can request more
vectors than the system provides, which is what mlx5 does.

Under the call chain of pci_alloc_irq_vectors() it calls
pci_msix_vec_count() which does

pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
return msix_table_size(control);

And eventually uses that to return the result to the driver.

So, yes, it reads the config space and ensures it doesn't allocate
more vectors than that.

Every driver using MSI does this in Linux.

Adjusting config space *directly* limits the number of vectors the
driver allocates.

You should be able to find the call chain in mlx5 based on the above
guidance.
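
For reference, the usual driver-side call looks roughly like this
(illustrative, not quoted from mlx5); the result is already clamped to
the MSI-X table size read from the function's config space:

        int nvec;

        nvec = pci_alloc_irq_vectors(pdev, 1, wanted, PCI_IRQ_MSIX);
        if (nvec < 0)
                return nvec;
        /* nvec <= msix_table_size() of this PF/VF */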

> At a minimum I really think we need to go through and have a clear
> definition on when updating the MSI-X table size is okay and when it
> is not. I am not sure just saying to not update it when a driver isn't
> attached is enough to guarantee all that.

If you know of a real issue then please state it; otherwise don't
fearmonger about "maybe" issues that don't exist.

> What we are talking about is the MSI-X table size. Not the number of
> MSI-X vectors being requested by the device driver. Those are normally
> two seperate things.

Yes, table size is what is critical. The number of entries in that BAR
memory is what needs to be controlled.

> > The standards based way to communicate that limit is the MSI table cap
> > size.
> 
> This only defines the maximum number of entries, not how many have to be used.

A driver can't use entries beyond the cap. We are not trying to
reclaim vectors that are available but not used by the OS.

> I'm not offering up a non-standard way to do this. Just think about
> it. If I have a device that defines an MSI-X table size of 2048 but
> makes use of only one vector how would that be any different than what
> I am suggesting where you size your VF to whatever the maximum is you
> need but only make use of some fixed number from the hardware.

That is completely different: the hypervisor has no idea how
many vectors a guest OS will create. The FW is told to only permit 1
vector. How is the guest to know this if we don't update the config
space *as the standard requires* ?

> I will repeat what I said before. Why can't this be handled via the
> vfio interface? 

1) The FW has to be told of the limit and everything has to be in sync
   If the FW is out of sync with the config space then everything
   breaks if the user makes even a small mistake - for instance
   forgetting to use the ioctl to override vfio. This is needlessly
   frail and complicated.

2) VFIO needs to know how to tell the FW the limit so it can override
   the config space with emulation. This is all device specific and I
   don't see that adding an extension to vfio is any better than
   adding one here.

3) VFIO doesn't cover any other driver that binds to the VF, so
   this "solution" leaves the normal mlx5_core functionally broken on
   VFs which is a major regression.

Overall the entire idea to have the config space not reflect the
actual current device configuration seems completely wrong to me - why
do this? For what gain? It breaks everything.

Jason


Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-15 Thread Jason Gunthorpe
On Thu, Jan 14, 2021 at 01:43:57PM -0800, Alexander Duyck wrote:

> > > In addition Leon still hasn't answered my question on preventing the
> > > VF driver from altering entries beyond the ones listed in the table.
> >
> > Of course this is blocked, the FW completely revokes the HW resource
> > backing the vectors.
> 
> One of the troubles with this is that I am having to take your word
> for it.

This is a Linux patch review, not a security review of a HW
implementation. There are a million ways to screw up a PCI device
implementation and in SRIOV the PCI device HW implementation forms
part of the trust base of the hypervisor.

The only question here is whether the HW API can be implemented
securely and whether the Linux code is appropriate.

In this case the mlx5 HW is implemented correctly and securely; if you
don't believe that, you are free not to use it.

> What it defines is the aperture available in MMIO to define the
> possible addresses and values to be written to trigger the
> interrupts. The device itself plays a large role in defining the
> number of interrupts ultimately requested.

Again you are confused about what is going on here - this is about
reconfiguring the HW so that MSI vector entries exist or not - it has
absolutely nothing to do with the driver. We are not optimizing for
the case where the driver does not use MSI vectors the VF has
available.

> > > At a minimum I really think we need to go through and have a clear
> > > definition on when updating the MSI-X table size is okay and when it
> > > is not. I am not sure just saying to not update it when a driver isn't
> > > attached is enough to guarantee all that.
> >
> > If you know of a real issue then please state it, otherwise don't
> > fear-monger "maybe" issues that don't exist.
>
> Well I don't have visibility into your firmware so I am not sure what
> is going on in response to this command so forgive me when I do a bit
> of fear mongering when somebody tells me that all this patch set does
> is modify the VF configuration space.

You were not talking about the FW, "is okay and when it is not" is a
*Linux* question.

> > > What we are talking about is the MSI-X table size. Not the number of
> > > MSI-X vectors being requested by the device driver. Those are normally
> > > two separate things.
> >
> > Yes, table size is what is critical. The number of entries in that BAR
> > memory is what needs to be controlled.
> 
> That is where we disagree. 

Huh? You are disagreeing this is how the mlx5 device works?

> Normally as a part of that the device itself will place some
> limit on how many causes and vectors you can associate before you even
> get to the MSI-X table.

For mlx5 this cause limit is huge. With IMS it can even be higher than
the 2K MSI-X limit. Remember on an x86 system you get 256 interrupt
vectors per CPU *and* per vCPU, so with interrupt remapping there can
be huge numbers of interrupts required.

Your "normally" is for simplistic fixed function HW devices not
intended for use at this scale.

> The MSI-X table size is usually a formality that defines the upper
> limit on the number of entries the device might request.

It is not a formality. PCI rules require *actual physical HW* to be
dedicated to the MSI vector entries.

Think of it like this - the device has a large global MSI-X table of
say 2K entries. This is the actual physical HW SRAM backing MSI
entries required by PCIe.

The HW will map the MSI-X table BAR space in every PF/VF to a slice of
that global table. If the PCI Cap says 8 entries then the MSI-X page has
only 8 entries, everything else is /dev/null.

Global MSI entries cannot be shared - the total of all PF/VFs cap
field must not be more than 2K.

One application requires 2K MSI-X on a single function because it uses
VDPA devices and VT-d interrupt remapping.

Another application requires 16 MSI-X on 128 VFs because it is using
SRIOV with VMs having 16 vCPUs.

The HW is configured differently in both cases. It is not something
that can be faked with VFIO!

> > That is completely different, in the hypervisor there is no idea how
> > many vectors a guest OS will create. The FW is told to only permit 1
> > vector. How is the guest to know this if we don't update the config
> > space *as the standard requires* ?
> 
> Doesn't the guest driver talk to the firmware? Last I knew it had to
> request additional resources such as queues and those come from the
> firmware don't they?

That is not how things work. Because VFIO has to be involved in
setting up interrupt remapping through its MSI emulation we don't get
to use a dynamic FW only path as something like IMS might imagine.

That would be so much better, but lots of things are not ready for
that.

> > 1) The FW has to be told of the limit and everything has to be in sync
> >If the FW is out of sync with the config space then everything
> >breaks if the user makes even a small mistake - for instance
> >forgetting to use the ioctl to override vfio. This is needles

Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-15 Thread Jason Gunthorpe
On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:

> That said, it only works at the driver level. So if the firmware is
> the one that is having to do this it also occurred to me that if this
> update happened on FLR that would probably be preferred. 

FLR is not free, I'd prefer not to require it just for some
philosophical reason.

> Since the mlx5 already supports devlink I don't see any reason why the
> driver couldn't be extended to also support the devlink resource
> interface and apply it to interrupts.

So you are OK with the PF changing the VF as long as it is devlink not
sysfs? Seems rather arbitrary?

Leon knows best, but if I recall devlink becomes wonky when the VF
driver doesn't provide a devlink instance. How does it do reload of a
VF then?

I think you end up with essentially the same logic as presented here
with sysfs.

> > It is possible for vfio to fake the MSI-X capability and limit what a
> > user can access, but I don't think that's what is being done here.
> 
> Yeah, I am assuming that is what is being done here. 

Just to be really clear, that assumption is wrong

Jason


Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-18 Thread Jason Gunthorpe
On Fri, Jan 15, 2021 at 08:32:19PM -0800, Alexander Duyck wrote:
> On Fri, Jan 15, 2021 at 6:06 AM Jason Gunthorpe  wrote:
> >
> > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> >
> > > That said, it only works at the driver level. So if the firmware is
> > > the one that is having to do this it also occurred to me that if this
> > > update happened on FLR that would probably be preferred.
> >
> > FLR is not free, I'd prefer not to require it just for some
> > philosophical reason.
> 
> It wasn't so much a philosophical thing as the fact that it can sort
> of take the place of a reload.

Asserting no driver is present and doing some SW-only "FLR" is pretty
much the same thing.

We can't issue FLR unless no driver is present anyhow, so really all
this does is add a useless step. If some HW needs FLR then it can do
it here, but I don't see value in injecting it when not needed.

Yes, if we were PCI-SIG we'd probably insist that a FLR be done, but
we are not PCI-SIG, this is just Linux, and asserting there are no
users of the MSI is sufficient.

> However looking over the mlx5 code I don't see any handling of FLR
> in there so I am assuming that is handled by the firmware.

The device does the device side of the FLR, the mlx5 driver should
trigger FLR during error recovery flows.

> It is about the setup of things. The sysfs existing in the VF is kind
> of ugly since it is a child device calling up to the parent and
> telling it how it is supposed to be configured. 

Well, the logical place to put that sysfs file is under the VF,
otherwise it becomes ugly in a different way. I agree it would be
nicer if the file only existed when the right driver is loaded, and
there was a better way to get from the PF to VF.
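
Roughly, the flow being argued for looks like the sketch below; the callback and FW helper names are hypothetical, not the interface that was eventually merged:

    #include <linux/errno.h>
    #include <linux/pci.h>

    /* Hypothetical FW call: tell the device how many MSI-X entries to back
     * for this VF so its config space reports the new table size.
     */
    int example_fw_set_vf_msix_size(struct pci_dev *pf_dev,
                                    struct pci_dev *vf_dev, int count);

    /* A write to the per-VF sysfs file ends up here, in the PF driver. */
    static int example_set_vf_msix_count(struct pci_dev *vf_dev, int count)
    {
            struct pci_dev *pf_dev = pci_physfn(vf_dev);

            /* only legal while no driver is bound to the VF */
            if (vf_dev->dev.driver)
                    return -EBUSY;

            return example_fw_set_vf_msix_size(pf_dev, vf_dev, count);
    }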

> I'm sure in theory we could probably even have the VF request
> something like that itself through some sort of mailbox and cut out
> the middle-man but that would be even uglier.

No, not ever. The VF is in a security domain that can't make those
kinds of changes to itself.

> In my mind it was the PF driver providing a devlink instance for the
> VF if a driver isn't loaded.

I think hacking up devlink to provide dummy devlink objects for VFs
that otherwise wouldn't exist and then ensuring handover to/from real
drivers that might want those objects natively, just for the sake of
using devlink instead of the existing PCI sysfs, is major overkill.

If we are even thinking of moving PCI to devlink I'd want to see
devlink taken out of net and a whole devlink PCI subsystem
infrastructure created to manage all this sanely.

Hacking a subsystem into devlink on the side with some small niche
feature is not the way to approach such fundamental things.

I also don't know if PCI will get much value from netlinkification, or
if devlink is even the right netlink representation for PCI in the
first place.

Jason


Re: [PATCH rdma-rc] Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"

2021-01-19 Thread Jason Gunthorpe
On Sun, Jan 17, 2021 at 11:26:33AM +0200, Leon Romanovsky wrote:
> From: Parav Pandit 
> 
> This reverts commit fbdd0049d98d44914fc57d4b91f867f4996c787b.
> 
> Due to commit in fixes tag, netdevice events were received only
> in one net namespace of mlx5_core_dev. Due to this when netdevice
> events arrive in net namespace other than net namespace of mlx5_core_dev,
> they are missed.
> 
> This results in empty GID table due to RDMA device being detached from
> its net device.
> 
> Hence, revert back to receive netdevice events in all net namespaces to
> restore back RDMA functionality in non init_net net namespace.
> 
> Fixes: fbdd0049d98d ("RDMA/mlx5: Fix devlink deadlock on net namespace 
> deletion")
> Signed-off-by: Parav Pandit 
> Signed-off-by: Leon Romanovsky 
> ---
>  drivers/infiniband/hw/mlx5/main.c  |  6 ++
>  .../net/ethernet/mellanox/mlx5/core/lib/mlx5.h |  5 +
>  include/linux/mlx5/driver.h| 18 --
>  3 files changed, 7 insertions(+), 22 deletions(-)

Applied to for-rc, thanks

Jason


Re: [net-next V9 14/14] net/mlx5: Add devlink subfunction port documentation

2021-01-21 Thread Jason Gunthorpe
On Thu, Jan 21, 2021 at 12:59:55PM -0800, Samudrala, Sridhar wrote:

> > +              mlx5_core.sf.4
> > +       (subfunction auxiliary device)
> > +                /    \
> > +               /      \
> > +              /        \
> > +   mlx5_core.eth.4    mlx5_core.rdma.4
> > +   (sf eth aux dev)   (sf rdma aux dev)
> > +         |                  |
> > +         |                  |
> > +      p0sf88             mlx5_0
> > +   (sf netdev)        (sf rdma device)
> 
> This picture seems to indicate that when SF is activated, a sub
> function auxiliary device is created 

Yes

> and when a driver is bound to that sub function aux device and
> probed, 2 additional auxiliary devices are created.  

More than two, but yes

> Is this correct? Are all these auxiliary devices seen on the same
> aux bus?  

Yes

> Why do we need another sf eth aux device?

The first aux device represents the physical HW and mlx5_core binds to it,
the analog is like a pci_device.

The other aux devices represent the subsystem split of the mlx5 driver
- mlx5_core creates them and each subsystem in turn binds to the
mlx5_core driver. This already exists, and Intel will be doing this as
well whenever the RDMA driver is posted again..
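
For reference, a minimal sketch of what binding to one of those aux devices looks like with the auxiliary bus API; the id string follows the modname.devname convention, and the probe body is a placeholder, not the real mlx5 subsystem driver:

    #include <linux/auxiliary_bus.h>
    #include <linux/module.h>

    static int example_probe(struct auxiliary_device *adev,
                             const struct auxiliary_device_id *id)
    {
            /* adev->dev.parent is the device that created this aux device */
            dev_info(&adev->dev, "bound to %s.%u\n", adev->name, adev->id);
            return 0;
    }

    static void example_remove(struct auxiliary_device *adev)
    {
    }

    static const struct auxiliary_device_id example_id_table[] = {
            { .name = "mlx5_core.rdma" },
            {}
    };
    MODULE_DEVICE_TABLE(auxiliary, example_id_table);

    static struct auxiliary_driver example_driver = {
            .probe = example_probe,
            .remove = example_remove,
            .id_table = example_id_table,
    };
    module_auxiliary_driver(example_driver);

    MODULE_LICENSE("GPL");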

Jason


Re: [PATCH rdma-next] RDMA/mlx4: remove bogus dev_base_lock usage

2020-12-10 Thread Jason Gunthorpe
On Tue, Dec 08, 2020 at 09:39:28PM +0200, Vladimir Oltean wrote:
> It is not clear what this lock protects. If the authors wanted to ensure
> that "dev" does not disappear, that is impossible, given the following
> code path:
> 
> mlx4_ib_netdev_event (under RTNL mutex)
> -> mlx4_ib_scan_netdevs
>-> mlx4_ib_update_qps
> 
> Also, the dev_base_lock does not protect dev->dev_addr either.
> 
> So it serves no purpose here. Remove it.
> 
> Reviewed-by: Leon Romanovsky 
> Signed-off-by: Vladimir Oltean 
> ---
>  drivers/infiniband/hw/mlx4/main.c | 3 ---
>  1 file changed, 3 deletions(-)

Applied to for-next, thanks

Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2020 at 10:47:36AM -0800, Alexander Duyck wrote:

> > Jason and Saeed explained this in great detail a few weeks back in the v0 version
> > of the patchset at [1], [2] and [3].
> > I better not repeat all of it here again. Please go through it.
> > If you want to read the precursor to it, the RFC from Jiri at [4] also
> > explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My concern
> is more the use of marketing speak versus actual functionality. The
> way this is being setup it sounds like it is useful for virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.

If we do a virtualization version then it will take a SF and instead
of loading a mlx5_core on the SF aux device, we will load some
vfio_mdev_mlx5 driver which will convert the SF aux device into a
/dev/vfio/*

This is essentially the same as how you'd take a PCI VF and replace
mlx5_core with vfio-pci to get /dev/vfio/*. It has to be a special
mdev driver because it sits on the SF aux device, not on the VF PCI
device.

The vfio_mdev_mlx5 driver will create what Intel calls an SIOV ADI
from the SF, in other words the SF is already a superset of what a
SIOV ADI should be.

This matches very nicely the driver model in Linux, and I don't think
it becomes more muddied as we go along. If anything it is becoming
more clear and sane as things progress.

> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that for networking devices perhaps that using the mdev interface
> would be deprecated and we would need to go through devlink. However
> before we do that we need to make sure we have this completely
> standardized.

mdev is for creating /dev/vfio/* interfaces in userspace. Using it for
anything else is a bad abuse of the driver model.

We had this debate endlessly already.

AFAIK, there is nothing to deprecate, there are no mdev_drivers in
drivers/net, and none should ever be added. The only mdev_driver that
should ever exists is in vfio_mdev.c

If someone is using a mdev_driver in drivers/net out of tree then they
will need to convert to an aux driver for in-tree.

> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right" it seems more like "VMDq done better". The
> fact that interrupts are shared between the subfunctions is telling.

The interrupt sharing is a consequence of having an ADI-like model
without relying on IMS. When IMS works then shared interrupts won't be
very necessary. Otherwise there is no choice but to share the MSI
table of the function.

> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues belongs
> to a specific queue. From what I can can tell the only difference is
> that there is isolation of the pool into specific pages in the BAR.
> Which is essentially a requirement for mediated devices so that they
> can be direct assigned.

No, I said this to Jakub, mlx5 SFs have very little to do with
queues. There is no 'queue' HW element that needs partitioning.

The SF is a hardware security boundary that wraps every operation a
mlx5 device can do. This is why it is an ADI. It is not a crappy ADI
that relies on hypervisor emulation, it is the real thing, just like a
SRIOV VF. You stick it in the VM and the guest can directly talk to
the HW. The HW provides the security.

I can't put focus on this enough: A mlx5 SF can run a *full RDMA
stack*. This means the driver can create all the RDMA HW objects and
resources under the SF. This is *not* just steering some ethernet
traffic to a few different ethernet queues like VMDq is.

The Intel analog to a SF is a *full virtual function* on one of the
Intel iWarp capable NICs, not VMDq.

> Assuming at some point one of the flavours is a virtio-net style
> interface you could eventually get to the point of something similar
> to what seems to have been the goal of mdev which was meant to address
> these two points.

mlx5 already supports VDPA virtio-net on PF/VF and with this series SF
too.

ie you can take a SF, bind the vdpa_mlx5 driver, and get a fully HW
accelerated "ADI" that does virtio-net. This can be assigned to a
guest and shows up as a PCI virtio-net netdev. With VT-d guest packet
tx/rx on this netdev never uses the hypervisor CPU.

> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?

It is best to think of the SF as an ADI, so if you change something in
the PF and that causes th

Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote:

> > not just devlink and switchdev, auxbus was also introduced to
> > standardize some of the interfaces.
> 
> The auxbus is just there to make up for the fact that there isn't
> another bus type for this though. I would imagine otherwise this would
> be on some sort of platform bus.

Please lets not start this again. This was gone over with Greg for
literally a year and a half and he explicitly NAK'd platform bus for
this purpose.

Aux bus exists to connect different kernel subsystems that touch the
same HW block together. Here we have the mlx5_core subsystem, vdpa,
rdma, and netdev all being linked together using auxbus.

It is kind of like what MFD does, but again, using MFD for this was
also NAK'd by Greg.

At the very worst we might sometime find out there is enough common
stuff between ADIs that we get an ADI bus, but I'm not
optimistic. So far it looks like there is no commonality.

Aux bus has at least 4 users already in various stages of submission,
and many other target areas that should be replaced by it.

> I would really like to see is a solid standardization of what this is.
> Otherwise the comparison is going to be made. Especially since a year
> ago Mellanox was pushing this as an mdev type interface. 

mdev was NAK'd too.

mdev is only for creating /dev/vfio/*.

> That is all well and good. However if we agree that SR-IOV wasn't done
> right saying that you are spinning up something that works just like
> SR-IOV isn't all that appealing, is it?

Fitting into some universal least-common-denominator was never a goal
for SR-IOV, so I wouldn't agree it was done wrong. 

> I am talking about my perspective. From what I have seen, one-off
> features that are only available from specific vendors are a pain to
> deal with and difficult to enable when you have to support multiple
> vendors within your ecosystem. What you end up going for is usually
> the lowest common denominator because you ideally want to be able to
> configure all your devices the same and have one recipe for setup.

So encourage other vendors to support the switchdev model for managing
VFs and ADIs!

> I'm not saying you cannot enable those features. However at the same
> time I am saying it would be nice to have a vendor neutral way of
> dealing with those if we are going to support SF, ideally with some
> sort of software fallback that may not perform as well but will at
> least get us the same functionality.

Is it really true there is no way to create a software device on a
switchdev today? I looked for a while and couldn't find
anything. openvswitch can do this, so it does seem like a gap, but
this has nothing to do with this series.

A software switchdev path should still end up with the representor and
user facing netdev, and the behavior of the two netdevs should be
identical to the VF switchdev flow we already have today.

SF doesn't change any of this, it just shines a light that, yes,
people actually have been using VFs with netdevs in containers and
switchdev, as part of their operations.

FWIW, I view this as a positive because it shows the switchdev model
is working very well and seeing adoption beyond the original idea of
controlling VMs with SRIOV.

> I'm trying to remember which netdev conference it was. I referred to
> this as a veth switchdev offload when something like this was first
> brought up. 

Sure, though I think the way you'd create such a thing might be
different. These APIs are really about creating an ADI that might be
assigned to a VM and never have a netdev.

It would be nonsense to create a veth-switchdev thing without a
netdev, and there have been various past attempts already NAK'd to
transform a netdev into an ADI.

Anyhow, if such a thing exists someday it could make sense to
automatically substitute the HW version using a SF, if available.

> could address those needs would be a good way to go for this as it
> would force everyone to come together and define a standardized
> feature set that all of the vendors would want to expose.

I would say switchdev is already the standard feature set.
 
Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2020 at 05:12:33PM -0800, Edwin Peer wrote:

> 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
> addresses this limit for VFs? 

They can't, the Bus/Device/Function is limited by protocol and
changing that would upend the entire PCI world.

Instead PCI-SIG said PASID is the way forward.

> If that were the only problem with VFs, then fixing it once there
> would be cleaner. 

Maybe, but half the problem with VFs is how HW expensive they are. The
mlx5 SF version is not such a good example, but Intel has shown in
other recent patches, like for their idxd, that the HW side of an ADI
can be very simple and hypervisor emulation can build a simple HW
capability into a full ADI for assignment to a guest.

A lot of the trappings that PCI-SIG requires to be implemented in HW
for a VF, like PCI config space, MSI tables, BAR space, etc. is all
just dead weight when scaling up to 1000's of VFs.

The ADI scheme is not that bad, the very simplest HW is just a queue
that can have all DMA contained by a PASID and can trigger an
addr/data interrupt message. Much less HW costly than a SRIOV VF.

Regardless, Intel kicked this path off years ago when they published
their SIOV cookbook and everyone started integrating PASID support
into their IOMMUs and working on ADIs. The mlx5 SFs are kind of early
because the HW is flexible enough to avoid the parts of SIOV that are
not ready or widely deployed yet, like IMS and PASID.

> Like you, I would also prefer a more common infrastructure for
> exposing something based on VirtIO/VMDq as the container/VM facing
> netdevs. 

A major point is to get switchdev.

> I also don't see how this tackles container/VF portability,
> migration of workloads, kernel network stack bypass, or any of the
> other legacy limitations regarding SR-IOV VFs

It isn't meant to. SF/ADI are just a way to have more VFs than PCI-SIG
can support.

Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote:

> > > I would really like to see is a solid standardization of what this is.
> > > Otherwise the comparison is going to be made. Especially since a year
> > > ago Mellanox was pushing this as an mdev type interface.
> >
> > mdev was NAK'd too.
> >
> > mdev is only for creating /dev/vfio/*.
> 
> Agreed. However my worry is that as we start looking to make this
> support virtualization it will still end up swinging more toward
> mdev.

Of course. mdev is also the only way to create a /dev/vfio/* :)

So all paths that want to use vfio must end up creating a mdev.

Here we would choose to create the mdev on top of the SF aux device.
There isn't really anything mlx5 specific about that decision. 

The SF models the vendor specific ADI in the driver model.

> It isn't so much about right or wrong but he use cases. My experience
> has been that SR-IOV ends up being used for very niche use cases where
> you are direct assigning it into either DPDK or some NFV VM and you
> are essentially building the application around the NIC. It is all
> well and good, but for general virtualization it never really caught
> on.

Sure
 
> > So encourage other vendors to support the switchdev model for managing
> > VFs and ADIs!
> 
> Ugh, don't get me started on switchdev. The biggest issue as I see it
> with switchdev is that you have to have a true switch in order to
> really be able to use it. 

That cuts both ways, suggesting HW with a true switch model itself
with VMDq is equally problematic.

> As such dumbed down hardware like the ixgbe for instance cannot use
> it since it defaults to outputting anything that doesn't have an
> existing rule to the external port. If we could tweak the design to
> allow for more dumbed down hardware it would probably be much easier
> to get wider adoption.

I'd agree with this

> interface, but keep the SF interface simple. Then you can back it with
> whatever you want, but without having to have a vendor specific
> version of the interface being plugged into the guest or container.

The entire point *is* to create the vendor version because that serves
the niche cases where SRIOV assignment is already being used.

Having a general solution that can't do vendor SRIOV is useful for
other application, but doesn't eliminate the need for the SRIOV case.

> One of the reasons why virtio-net is being pushed as a common
> interface for vendors is for this reason. It is an interface that can
> be emulated by software or hardware and it allows the guest to run on
> any arbitrary hardware.

Yes, and there is mlx5_vdpa to support this usecase, and it binds to
the SF. Of course all of that is vendor specific too, the driver to
convert HW specific register programming into a virtio-net ADI has to
live *somewhere*

> It has plenty to do with this series. This topic has been under
> discussion since something like 2017 when Mellanox first brought it up
> at Netdev 2.1. At the time I told them they should implement this as a
> veth offload. 

veth doesn't give an ADI, it is useless for these niche cases.

veth offload might be interesting for some container case, but feels
like writing an enormous amount of code to accomplish nothing new...

> Then it becomes obvious what the fallback becomes as you can place
> packets into one end of a veth and it comes out the other, just like
> a switchdev representor and the SF in this case. It would make much
> more sense to do it this way rather than setting up yet another
> vendor proprietary interface pair.

I agree it makes sense to have an all SW veth-like option, but I
wouldn't try to make that the entry point for all the HW
acceleration or to serve the niche SRIOV use cases, or to represent an
ADI.

It just can't do that and it would make a huge mess if you tried to
force it. Didn't Intel already try this once with trying to use the
macvlan netdev and its queue offload to build an ADI?

> > Anyhow, if such a thing exists someday it could make sense to
> > automatically substitute the HW version using a SF, if available.
> 
> The main problem as I see it is the fact that the SF interface is
> bound too tightly to the hardware. 

That is goal here. This is not about creating just a netdev, this is
about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.

The SF has to support all of that completely. Focusing only on the
one use case of netdevs in containers misses the bigger picture. 

Yes, lots of this stuff is niche, but niche stuff needs to be
supported too.

> Yes, it is a standard feature set for the control plane. However for
> the data-path it is somewhat limited as I feel it only describes what
> goes through the switch.

Sure, I think that is its main point.

> Not the interfaces that are exposed as the endpoints. 

It came from modeling physical HW so the endpoints are 'physical'
things like actual HW switch ports, or SRIOV VFs, ADI, etc.

> It is the problem of that last bit and ho

Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-16 Thread Jason Gunthorpe
On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:

> > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > with switchdev is that you have to have a true switch in order to
> > > really be able to use it.
> >
> > That cuts both ways, suggesting HW with a true switch model itself
> > with VMDq is equally problematic.
> 
> Yes and no. For example the macvlan offload I had setup could be
> configured both ways and it made use of VMDq. I'm not necessarily
> arguing that we need to do VMDq here, however at the same time saying
> that this is only meant to replace SR-IOV becomes problematic since we
> already have SR-IOV so why replace it with something that has many of
> the same limitations?

Why? Because SR-IOV is the *only* option for many use cases. Still. I
said this already, something more generic does not magically eliminate
SR-IOV.

The SIOV ADI model is a small refinement to the existing VF scheme, it
is completely parallel to making more generic things.

It is not "repeating mistakes" it is accepting the limitations of
SR-IOV because benefits exist and applications need those benefits.
 
> That said I understand your argument, however I view the elimination
> of SR-IOV to be something we do after we get this interface right and
> can justify doing so. 

Elimination of SR-IOV isn't even a goal here!

> Also it might be useful to call out the flavours and planned flavours
> in the cover page. Admittedly the description is somewhat lacking in
> that regard.

This is more of a general switchdev remark though. In the switchdev
model you have the switch and a switch port. Each port has a
switchdev representor on the switch side and a "user port" of some
kind.

It can be a physical thing:
 - SFP
 - QSFP
 - WiFi Antennae

It could be a semi-physical thing outside the view of the kernel:
 - SmartNIC VF/SF attached to another CPU

It can be a semi-physical thing in view of this kernel:
 - SRIOV VF (struct pci device)
 - SF (struct aux device)

It could be a SW construct in this kernel:
 - netdev (struct net device)

*all* of these different port types are needed. Probably more down the
road!

Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
port" type here. Instead creating the user port pci or aux device
allows the user to use the Linux driver model to control what happens
to the pci/aux device next.

> I would argue that is one of the reasons why this keeps being
> compared to either VMDq or VMQ as it is something that SR-IOV has
> yet to fully replace and has many features that would be useful in
> an interface that is a subpartition of an existing interface.

In what sense do switchdev and a VF not fully replace macvlan VMDq?

> The Intel drivers still have the macvlan as the assignable ADI and
> make use of VMDq to enable it.

Is this in-tree or only in the proprietary driver? AFAIK there is no
in-tree way to extract the DMA queue from the macvlan netdev into
userspace..

Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
one. It doesn't really make sense to compare a SW dataplane to a HW
one. HW dataplanes come with limitations and require special driver
code.

> The limitation as I see it is that the macvlan interface doesn't allow
> for much in the way of custom offloads and the Intel hardware doesn't
> support switchdev. As such it is good for a basic interface, but
> doesn't really do well in terms of supporting advanced vendor-specific
> features.

I don't know what it is that prevents Intel from modeling their
selector HW in switchdev, but I think it is on them to work with the
switchdev folks to figure something out.

I'm a bit surprised HW that can do macvlan can't be modeled with
switchdev? What is missing?

> > That is goal here. This is not about creating just a netdev, this is
> > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> 
> One issue is right now we are only seeing the rdma and netdev. It is
> kind of backwards as it is using the ADIs on the host when this was
> really meant to be used for things like mdev.

This is the second 15-patch series on this path already. It is not
possible to pack every single thing into this series. This is the
micro step of introducing the SF idea and using SF==VF to show how the
driver stack works. The minimal changes to the existing drivers
imply this can support an ADI as well.

Further, this does already show an ADI! vdpa_mlx5 will run on the
VF/SF and eventually causes qemu to build a virtio-net ADI that
directly passes HW DMA rings into the guest.

Isn't this exactly the kind of generic SRIOV replacement option you
have been asking for? Doesn't this completely supersede stuff built on
macvlan?

> expected to work. The swtichdev API puts some restrictions in place
> but there still ends up being parts without any definition.

I'm curious what you see as needing definition here? 

In the SRIOV model the HW register programming API is device
specific.

The swit

Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote:

> You say this will scale better but I am not even sure about that. The
> fact is SR-IOV could scale to 256 VFs, but for networking I kind of
> doubt the limitation would have been the bus number and would more
> likely be issues with packet replication and PCIe throughput,
> especially when you start dealing with east-west traffic within the
> same system.

We have been seeing deployments already hitting the 256 limit. This is
not a "theoretical use" patch set. There are already VM and container
farms with SW networking that support much more than 256 VM/containers
per server.

The optimization here is to reduce the hypervisor workload and free up
CPU cycles for the VMs/containers to consume. This means less handling
of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.

Even the extra DMA on the NIC is not really a big deal. These are 400G
NICs with big fast PCI. If you top them out you are already doing an
aggregate of 400G of network traffic. That is a big number for a
single server, it is OK.

Someone might feel differently if they did this on a 10/40G NIC, in
which case this is not the solution for their application.

> Sorry you used the word "replace", and my assumption here was that the
> goal is to get something in place that can take the place of SR-IOV so
> that you wouldn't be maintaining the two systems at the same time.
> That is my concern as I don't want us having SR-IOV, and then several
> flavors of SIOV. We need to decide on one thing that will be the way
> forward.

SRIOV has to continue until the PASID and IMS platform features are
widely available and mature. It will probably be 10 years before we
see most people able to use SIOV for everything they want.

I think we will see lots of SIOV variants, I know Intel is already
pushing SIOV parts outside netdev.

> I get that. That is why I said switchdev isn't a standard for the
> endpoint. One of the biggest issues with SR-IOV that I have seen is
> the fact that the last piece isn't really defined. We never did a good
> job of defining how the ADI should look to the guest and as a result
> it kind of stalled in adoption.

The ADI is supposed to present the HW programming API that is
desired. It is always up to the implementation.

SIOV was never a project to standardize HW programming models like
virtio-net, NVMe, etc.

> > I'm a bit surprised HW that can do macvlan can't be modeled with
> > switchdev? What is missing?
> 
> If I recall it was the fact that the hardware defaults to transmitting
> everything that doesn't match an existing rule to the external port
> unless it comes from the external port.

That seems small enough it should be resolvable, IMHO. eg some new
switch rule that matches that specific HW behavior?

> Something like the vdpa model is more like what I had in mind. Only
> vdpa only works for the userspace networking case.

That's because making a driver that converts the native HW to VDPA and
then running a generic netdev on the resulting virtio-net is a pretty
wild thing to do. I can't really think of an actual use case.

> Basically the idea is to have an assignable device interface that
> isn't directly tied to the hardware. 

The switchdev model is to create a switch port. As I explained in
Linux we see "pci device" and "aux device" as being some "user port"
options to access that switch.

If you want a "generic device" that is fine, but what exactly is that
programming interface in Linux? Sketch out an API, where does the idea
go? What does the driver that implements it look like? What consumes
it?

Should this be a great idea, then a mlx5 version of this will still be
to create an SF aux device, bind mlx5_core, then bind "generic device"
on top of that. This is simply a reflection of how the mlx5 HW/SW
layering works. Squashing all of this into a single layer is work with
a bad ROI.

> they are pushed into containers you don't have to rip them out if for
> some reason you need to change the network configuration. 

Why would you need to rip them out?

Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote:

> That has been the case for a long time. However it had been my
> experience that SR-IOV never scaled well to meet those needs and so it
> hadn't been used in such deployments.

Seems to be going quite well here, perhaps the applications are
different.

> > The optimization here is to reduce the hypervisor workload and free up
> > CPU cycles for the VMs/containers to consume. This means less handling
> > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
> >
> > Even the extra DMA on the NIC is not really a big deal. These are 400G
> > NICs with big fast PCI. If you top them out you are already doing an
> > aggregate of 400G of network traffic. That is a big number for a
> > single sever, it is OK.
> 
> Yes, but at a certain point you start bumping up against memory
> throughput limitations as well. Doubling up the memory footprint by
> having the device have to write to new pages instead of being able to
> do something like pinning and zero-copy would be expensive.

You can't zero-copy when using VMs.

And when using containers every skb still has to go through all the
switching and encapsulation logic, which is not free in SW.

At a certain point the gains of avoiding the DMA copy are lost by the
costs of all the extra CPU work. The factor being optimized here is
CPU capacity.

> > Someone might feel differently if they did this on a 10/40G NIC, in
> > which case this is not the solution for their application.
> 
> My past experience was with 10/40G NIC with tens of VFs. When we start
> talking about hundreds I would imagine the overhead becomes orders of
> magnitudes worse as the problem becomes more of an n^2 issue since you
> will have n times more systems sending to n times more systems

The traffic demand is application dependent. If an application has an
n^2 traffic pattern then it needs a network to sustain that cross
sectional bandwidth regardless of how the VMs are packed.

It just becomes a design factor of the network and now the network
includes that switching component on the PCIe NIC as part of the
capacity for cross sectional BW.

There is some balance where a VM can only generate so much traffic
based on the CPU it has available, and you can design the entire
infrastructure to balance the CPU with the NIC with the switches and
come to some packing factor of VMs. 

As CPU constrains VM performance, removing CPU overheads from the
system will improve packing density. A HW network data path in the VMs
is one such case that can turn to a net win if the CPU bottleneck is
bigger than the network bottleneck.

It is really oversimplifying to just say PCIe DMA copies are bad.

> receiving. As such things like broadcast traffic would end up
> consuming a fair bit of traffic.

I think you have a lot bigger network problems if your broadcast
traffic is so high that you start to worry about DMA copy performance
in a 400G NIC.

> The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
> tend to be PCIe specific specifications. What we are defining here is
> how the network interfaces presented by such devices will work.

I think we've achieved this..

> > That seems small enough it should be resolvable, IMHO. eg some new
> > switch rule that matches that specific HW behavior?
> 
> I would have to go digging to find the conversation. It was about 3 or
> 4 years ago. I seem to recall mentioning the idea of having some
> static rules but it was a no-go at the time. If we wanted to spin off
> this conversation and pull in some Intel folks I would be up for us
> revisiting it. However I'm not with Intel anymore so it would mostly
> be something I would be working on as a hobby project instead of
> anything serious.

Personally I welcome getting more drivers to implement the switchdev
model, I think it is only good for the netdev community as a whole
to understand and standardize on this.

> > > Something like the vdpa model is more like what I had in mind. Only
> > > vdpa only works for the userspace networking case.
> >
> > That's because making a driver that converts the native HW to VDPA and
> > then running a generic netdev on the resulting virtio-net is a pretty
> > wild thing to do. I can't really think of an actual use case.
> 
> I'm not talking about us drastically changing existing models. I would
> still expect the mlx5 driver to be running on top of the aux device.
> However it may be that the aux device is associated with something
> like the switchdev port as a parent 
 

That is exactly how this works. The switchdev representor and the aux
device are paired and form the analog of the veth tunnel. IIRC this
relationship with the aux device is shown in the devlink output for
the switchdev ports.

I still can't understand what you think should be changed here.

We can't get rid of the aux device, it is integral to the software
layering and essential to support any 

Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
 
> It isn't about the association, it is about who is handling the
> traffic. Going back to the macvlan model what we did is we had a group
> of rings on the device that would automatically forward unicast
> packets to the macvlan interface and would be reserved for
> transmitting packets from the macvlan interface. We took care of
> multicast and broadcast replication in software.

Okay, maybe I'm starting to see where you are coming from.

First, I think some clarity here, as I see it the devlink
infrastructure is all about creating the auxdevice for a switchdev
port.

What goes into that auxdevice is *completely* up to the driver. mlx5
is doing a SF which == VF, but that is not a requirement of the design
at all.

If an Intel driver wants to put a queue block into the aux device and
that is != VF, it is just fine.

The Intel netdev that binds to the auxdevice can transform the queue
block and specific switchdev config into a netdev identical to
accelerated macvlan. Nothing about the breaks the switchdev model.

Essentially think of it as generalizing the acceleration plugin for a
netdev. Instead of making something specific to limited macvlan, the
driver gets to provide exactly the structure that matches its HW to
provide the netdev as the user side of the switchdev port. I see no
limitation here so long as the switchdev model for controlling traffic
is followed.

Let me segue into a short story from RDMA.. We've had a netdev called
IPoIB for a long time. It is actually kind of similar to this general
thing you are talking about, in that there is a programming layer
under the IPOIB netdev called RDMA verbs that generalizes the actual
HW. Over the years this became more complicated because every new
netdev offload needed mirroring into the general RDMA verbs
API. TSO, GSO, checksum offload, endlessly onwards. It became quite
dumb in the end. We gave up and said the HW driver should directly
implement netdev. Implementing a middle API layer makes zero sense
when netdev is already perfectly suited to implement ontop of
HW. Removing SW layers caused performance to go up something like
2x.

The hard earned lesson I take from that is don't put software layers
between a struct net_device and the actual HW. The closest coupling is
really the best thing. Provide library code in the kernel to help
drivers implement common patterns when making their netdevs, do not
provide wrapper netdevs around drivers.

IMHO the approach of macvlan acceleration made some sense in 2013, but
today I would say it is mashing unrelated layers together and
polluting what should be a pure SW implementation with HW hooks.

I see from the mailing list comments this was done because creating a
device specific netdev via 'ip link add' was rightly rejected. However
here we *can* create a device specific vmdq *auxdevice*.  This is OK
because the netdev is controlling and containing the aux device via
switchdev.

So, Intel can get the "VMDQ link type" that was originally desired more
or less directly, so long as the associated switchdev port controls
the MAC filter process, not "ip link add".

And if you want to make the vmdq auxdevice into an ADI by user DMA to
queues, then sure, that model is completely sane too (vs hacking up
macvlan to expose user queues) - so long as the kernel controls the
selection of traffic into those queues and follows the switchdev
model. I would recommend creating a simple RDMA raw ethernet queue
driver over the aux device for something like this :)

> That might be a bad example, I was thinking of the issues we have had
> with VFs and direct assignment to Qemu based guests in the past.

As described, this is solved by VDPA.

> Essentially what I am getting at is that the setup in the container
> should be vendor agnostic. The interface exposed shouldn't be specific
> to any one vendor. So if I want to fire up a container or Mellanox,
> Broadcom, or some other vendor it shouldn't matter or be visible to
> the user. They should just see a vendor agnostic subfunction
> netdevice.

Agree. The agnostic container user interface here is 'struct
net_device'.

> > I have the feeling this stuff you are asking for is already done..
> 
> The case you are describing has essentially solved it for Qemu
> virtualization and direct assignment. It still doesn't necessarily
> solve it for the container case though.

The container case doesn't need solving.

Any scheme I've heard for container live migration, like CRIU,
essentially hot plugs the entire kernel in/out of a user process. We
rely on the kernel providing low leakage of the implementation details
of the struct net_device as part of it's uAPI contract. When CRIU
swaps the kernel the new kernel can have any implementation of the
container netdev it wants.

I've never heard of a use case to hot swap the implemention *under* a
netdev from a container. macvlan can't do this today. If you have a
use case here, it

Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-17 Thread Jason Gunthorpe
On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:

> Just to clarify I am not with Intel, nor do I plan to work on any
> Intel drivers related to this.

Sure
 
> I disagree here. In my mind a design where two interfaces, which both
> exist in the kernel, have to go to hardware in order to communicate is
> very limiting. The main thing I am wanting to see is the option of
> being able to pass traffic directly between the switchdev and the SF
> without the need to touch the hardware.

I view the SW bypass path you are talking about similarly to
GSO/etc. It should be accessed by the HW driver as an optional service
provided by the core netdev, not implemented as some wrapper netdev
around a HW implementation.

If you feel strongly it is needed then there is nothing standing in
the way to implement it in the switchdev auxdevice model.

It is simple enough, the HW driver's tx path would somehow detect
east/west and queue it differently, and the rx path would somehow be
able to mux in skbs from a SW queue. Not seeing any blockers here.

> > model. I would recommend creating a simple RDMA raw ethernet queue
> > driver over the aux device for something like this :)
> 
> You lost me here, I'm not seeing how RDMA and macvlan are connected.

RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
packets.
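
For instance, a bare-bones libibverbs sketch of getting such a queue - a raw packet QP, which needs CAP_NET_RAW; device selection and error handling are omitted for brevity:

    #include <infiniband/verbs.h>

    /* Sketch only: open the first RDMA device and create a raw ethernet QP,
     * i.e. a userspace-owned HW DMA queue for ethernet packets.
     */
    struct ibv_qp *example_open_raw_eth_queue(void)
    {
            struct ibv_device **list = ibv_get_device_list(NULL);
            struct ibv_context *ctx = ibv_open_device(list[0]);
            struct ibv_pd *pd = ibv_alloc_pd(ctx);
            struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
            struct ibv_qp_init_attr attr = {
                    .send_cq = cq,
                    .recv_cq = cq,
                    .cap = {
                            .max_send_wr = 64, .max_recv_wr = 64,
                            .max_send_sge = 1, .max_recv_sge = 1,
                    },
                    .qp_type = IBV_QPT_RAW_PACKET,  /* raw ethernet queue */
            };

            return ibv_create_qp(pd, &attr);        /* requires CAP_NET_RAW */
    }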

> > > Essentially what I am getting at is that the setup in the container
> > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > to any one vendor. So if I want to fire up a container or Mellanox,
> > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > the user. They should just see a vendor agnostic subfunction
> > > netdevice.
> >
> > Agree. The agnostic container user interface here is 'struct
> > net_device'.
> 
> I disagree here. The fact is a mellanox netdev, versus a broadcom
> netdev, versus an intel netdev all have a very different look at feel
> as the netdev is essentially just the base device you are building
> around.

Then fix the lack of standardization of netdev implementations!

Adding more abstraction layers isn't going to fix that fundamental
problem.

Frankly it seems a bit absurd to complain that the very basic element
of the common kernel uAPI - struct net_device - is so horribly
fragmented and vendor polluted that we can't rely on it as a stable
interface for containers.

Even if that is true, I don't believe for a second that adding a
different HW abstraction layer is going to somehow undo the mistakes
of the last 20 years.

> Again, the hot-swap isn't necessarily what I am talking about. I am
> talking about setting up a config for a set of containers in a
> datacenter. What I don't want to do is have to have one set of configs
> for an mlx5 SF, another for a broadcom SF, and yet another set for any
> other vendors out there. I would much rather have all of that dealt
> with within the namespace that is handling the switchdev setup.

If there are real problems here then I very much encourage you to start
an effort to push all the vendors to implement a consistent user
experience for the HW netdevs.

I don't know what your issues are, but it sounds like it would be a
very interesting conference presentation.

But it has nothing to do with this series.

Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-17 Thread Jason Gunthorpe
On Thu, Dec 17, 2020 at 01:05:03PM -0800, Alexander Duyck wrote:

> > I view the SW bypass path you are talking about similarly to
> > GSO/etc. It should be accessed by the HW driver as an optional service
> > provided by the core netdev, not implemented as some wrapper netdev
> > around a HW implementation.
> 
> I view it as being something that would be a part of the switchdev API
> itself. Basically the switchev and endpoint would need to be able to
> control something like this because if XDP were enabled on one end or
> the other you would need to be able to switch it off so that all of
> the packets followed the same flow and could be scanned by the XDP
> program.

To me that still all comes down to being something like an optional
offload that the HW driver can trigger if the conditions are met.

> > It is simple enough, the HW driver's tx path would somehow detect
> > east/west and queue it differently, and the rx path would somehow be
> > able to mux in skbs from a SW queue. Not seeing any blockers here.
> 
> In my mind the simple proof of concept for this would be to check for
> the multicast bit being set in the destination MAC address for packets
> coming from the subfunction. If it is then shunt to this bypass route,
> and if not then you transmit to the hardware queues. 

Sure, though I suspect a multicast optimization like this is incredibly
niche too, but it would be an interesting path to explore.

But again, there is nothing fundamental about the model here that
precludes this optional optimization.
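
A minimal sketch of that proof of concept, assuming invented helper names for the two paths (this is not existing mlx5 code):

    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical helpers standing in for the two paths. */
    static netdev_tx_t example_loopback_to_representor(struct sk_buff *skb,
                                                       struct net_device *dev)
    {
            /* hand the skb to the paired representor's rx path in software */
            dev_kfree_skb_any(skb);
            return NETDEV_TX_OK;
    }

    static netdev_tx_t example_queue_to_hw(struct sk_buff *skb,
                                           struct net_device *dev)
    {
            /* normal HW descriptor ring path */
            dev_kfree_skb_any(skb);
            return NETDEV_TX_OK;
    }

    static netdev_tx_t example_sf_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            /* shunt multicast/broadcast to the SW bypass, everything else to HW */
            if (is_multicast_ether_addr(eth_hdr(skb)->h_dest))
                    return example_loopback_to_representor(skb, dev);

            return example_queue_to_hw(skb, dev);
    }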

> > Even if that is true, I don't belive for a second that adding a
> > different HW abstraction layer is going to somehow undo the mistakes
> > of the last 20 years.
> 
> It depends on how it is done. The general idea is to address the
> biggest limitation that has occured, which is the fact that in many
> cases we don't have software offloads to take care of things when the
> hardware offloads provided by a certain piece of hardware are not
> present. 

This is really disappointing to hear. Admittedly I don't follow all
the twists and turns on the mailing list, but I thought having a SW
version of everything was one of the fundamental tenets of netdev
that truly distinguished it from something like RDMA.

> It would basically allow us to reset the feature set. If something
> cannot be offloaded in software in a reasonable way, it is not
> allowed to be present in the interface provided to a container.
> That way instead of having to do all the custom configuration in the
> container recipe it can be centralized to one container handling all
> of the switching and hardware configuration.

Well, you could start by blocking stuff without a SW fallback..

> There I disagree. Now I can agree that most of the series is about
> presenting the aux device and that part I am fine with. However when
> the aux device is a netdev and that netdev is being loaded into the
> same kernel as the switchdev port is where the red flags start flying,
> especially when we start talking about how it is the same as a VF.

Well, it happens for the same reason a VF can create a netdev,
stopping it would actually be more patches. As I said before, people
are already doing this model with VFs.

I can agree with some of your points, but this is not the series to
argue them. What you want is to start some new thread on optimizing
switchdev for the container user case.

> In my mind we are talking about how the switchdev will behave and it
> makes sense to see about defining if a east-west bypass makes sense
> and how it could be implemented, rather than saying we won't bother
> for now and potentially locking in the subfunction to virtual function
> equality.

At least for mlx5 SF == VF, that is a consequence of the HW. Any SW
bypass would need to be specially built in the mlx5 netdev running on
a VF/SF attached to a switchdev port.

I don't see anything about this part of the model that precludes ever
doing that, and I also don't see this optimization as being valuable
enough to block things "just to be sure"

> In my mind we need more than just the increased count to justify
> going to subfunctions, and I think being able to solve the east-west
> problem at least in terms of containers would be such a thing.

Increased count is pretty important for users with SRIOV.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 01:17:09PM +, Mark Brown wrote:

> As previously discussed this will need the auxiliary bus extending to
> support at least interrupts and possibly also general resources.

I thought the recent LWN article summed it up nicely: the auxiliary bus
is for gluing two subsystems together using a driver-specific software
API to connect to the HW, while MFD is for splitting physical HW into
disjoint regions of HW.

Maybe there is some overlap, but if you want to add HW representations
to the general auxiliary device then I think you are using it for the
wrong thing.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 03:52:04PM +, Mark Brown wrote:
> On Fri, Dec 18, 2020 at 10:08:54AM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 18, 2020 at 01:17:09PM +, Mark Brown wrote:
> 
> > > As previously discussed this will need the auxilliary bus extending to
> > > support at least interrupts and possibly also general resources.
> 
> > I thought the recent LWN article summed it up nicely, auxillary bus is
> > for gluing to subsystems together using a driver specific software API
> > to connect to the HW, MFD is for splitting a physical HW into disjoint
> > regions of HW.
> 
> This conflicts with the statements from Greg about not using the
> platform bus for things that aren't memory mapped or "direct firmware",
> a large proportion of MFD subfunctions are neither at least in so far as
> I can understand what direct firmware means.

I assume MFD will keep existing and it will somehow stop using
platform device for the children it builds.

That doesn't mean MFD must use aux device, so I don't see what you
mean by conflicts?

If someone has a PCI device and they want to split it up, they should
choose between aux device and MFD (assuming MFD gets fixed, as Greg
has basically blanket NAK'd adding more of them to MFD as is)

> To be honest I don't find the LWN article clarifies things particularly
> here, the rationale appears to involve some misconceptions about what
> MFDs look like.  It looks like it assumes that MFD functions have
> physically separate register sets for example which is not a reliable
> feature of MFDs, nor is the assumption that there's no shared
> functionality which appears to be there.  It also appears to assume that
> MFD subfunctions can clearly be described by ACPI (where it would be
> unidiomatic, we just don't see this happening for the MFDs that appear
> on ACPI systems and I'm not sure bindings exist within ACPI) or DT
> (where even where subfunctions are individually described it's rarely
> doing more than enumerating that things exist).

I think the MFD cell model is probably the deciding feature. If that
cell description scheme suits the device, and it is very HW focused,
then MFD is probably the answer.

The places I see aux device being used are a terrible fit for the cell
idea. If there are MFD drivers that are awkwardly crammed into that
cell description then maybe they should be aux devices?

> > Maybe there is some overlap, but if you want to add HW representations
> > to the general auxillary device then I think you are using it for the
> > wrong thing.
> 
> Even for the narrowest use case for auxiliary devices that I can think
> of I think the assumption that nobody will ever design something which
> can wire an interrupt intended to be serviced by a subfunction is a bit
> optimistic.  

mlx5, for example, uses interrupts but an aux device is not assigned
an exclusive MSI interrupt list.

These devices have a very dynamic interrupt scheme, pre-partitioning
the MSI vector table is completely the wrong API.

The "interrupt" API is more like:

   mlx5_register_event_handler(hw_object, my_function);

Which would call my_function from some MSI interrupt vector when
hw_object has an event to report. There might be 1000's of dynamic
hw_objects in the system any moment.
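As a rough illustration of that dynamic dispatch model (this uses only the
generic kernel notifier plumbing; the real mlx5 function names and event
codes are different, and this is not its actual API), the parent keeps a
chain per HW object and sub-drivers subscribe at runtime instead of being
handed a slice of the MSI vector table:

	#include <linux/notifier.h>

	struct hw_object {
		u32 id;
		/* init with ATOMIC_INIT_NOTIFIER_HEAD() when the object is created */
		struct atomic_notifier_head events;
	};

	/* Sub-driver side: subscribe to events on one HW object */
	static int my_function(struct notifier_block *nb, unsigned long event,
			       void *data)
	{
		/* runs when the parent dispatches an event for this object */
		return NOTIFY_OK;
	}

	static struct notifier_block my_nb = { .notifier_call = my_function };

	static void subscribe(struct hw_object *obj)
	{
		atomic_notifier_chain_register(&obj->events, &my_nb);
	}

	/* Parent side, called from its MSI handler for whichever object fired */
	static void parent_dispatch(struct hw_object *obj, unsigned long event)
	{
		atomic_notifier_call_chain(&obj->events, event, NULL);
	}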

As I said, I see aux device as being something that exposes a driver
specific SW API, not a list of generic HW resources.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 06:03:10PM +, Mark Brown wrote:
> On Fri, Dec 18, 2020 at 12:28:17PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 18, 2020 at 03:52:04PM +, Mark Brown wrote:
> > > On Fri, Dec 18, 2020 at 10:08:54AM -0400, Jason Gunthorpe wrote:
> 
> > > > I thought the recent LWN article summed it up nicely, auxillary bus is
> > > > for gluing to subsystems together using a driver specific software API
> > > > to connect to the HW, MFD is for splitting a physical HW into disjoint
> > > > regions of HW.
> 
> > > This conflicts with the statements from Greg about not using the
> > > platform bus for things that aren't memory mapped or "direct firmware",
> > > a large proportion of MFD subfunctions are neither at least in so far as
> > > I can understand what direct firmware means.
> 
> > I assume MFD will keep existing and it will somehow stop using
> > platform device for the children it builds.
> 
> If it's not supposed to use platform devices so I'm assuming that the
> intention is that it should use aux devices, otherwise presumably it'd
> be making some new clone of the platform bus but I've not seen anyone
> suggesting this.

I wouldn't assume that, I certainly don't want to see all the HW
related items in platform_device cloned roughly into aux device.

I've understood the bus type should be basically related to the thing
that is creating the device. In a clean view platform code creates
platform devices. DT should create DT devices, ACPI creates ACPI
devices, PNP does pnp devices, etc

So, I strongly suspect, MFD should create mfd devices on a MFD bus
type.

Alexandre's point is completely valid, and I think is the main
challenge here, somehow avoiding duplication.

If we were to look at it with some OOP viewpoint I'd say the generic
HW resource related parts should be some shared superclass between
'struct device' and 'struct platform/pnp/pci/acpi/mfd/etc_device'.

> > > To be honest I don't find the LWN article clarifies things particularly
> > > here, the rationale appears to involve some misconceptions about what
> > > MFDs look like.  It looks like it assumes that MFD functions have
> > > physically separate register sets for example which is not a reliable
> > > feature of MFDs, nor is the assumption that there's no shared
> > > functionality which appears to be there.  It also appears to assume that
> 
> > I think the MFD cell model is probably the deciding feature. If that
> > cell description scheme suites the device, and it is very HW focused,
> > then MFD is probably the answer.
> 
> > The places I see aux device being used are a terrible fit for the cell
> > idea. If there are MFD drivers that are awkardly crammed into that
> > cell description then maybe they should be aux devices?
> 
> When you say the MFD cell model it's not clear what you mean - I *think*
> you're referring to the idea of the subdevices getting all the

I mean using static "struct mfd_cell" arrays to describe things.
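For anyone not steeped in MFD, a minimal sketch of that cell model
(hypothetical "foo" chip, resource values made up): the parent describes
its children as a static table of HW slices and lets the MFD core
instantiate them.

	#include <linux/mfd/core.h>
	#include <linux/ioport.h>

	static const struct resource foo_adc_resources[] = {
		DEFINE_RES_IRQ(3),		/* IRQ index within the parent */
	};

	static const struct mfd_cell foo_cells[] = {
		{ .name = "foo-codec" },
		{ .name = "foo-gpio" },
		{
			.name		= "foo-adc",
			.resources	= foo_adc_resources,
			.num_resources	= ARRAY_SIZE(foo_adc_resources),
		},
	};

	/* called from the parent's I2C/SPI probe, dev = parent device */
	static int foo_add_children(struct device *dev, int irq_base)
	{
		return devm_mfd_add_devices(dev, PLATFORM_DEVID_NONE, foo_cells,
					    ARRAY_SIZE(foo_cells), NULL,
					    irq_base, NULL);
	}

Each cell becomes a platform_device child; the description is static HW
partitioning, which is exactly the part that does not map onto things like
mlx5's dynamically created functions.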

> Look at something like wm8994 for example - the subdevices just know
> which addresses in the device I2C/SPI regmap to work with but some of
> them have interrupts passed through to them (and could potentially also
> have separate subdevices for clocks and pinctrl).  These subdevices are
> not memory mapped, not enumerated by firmware and the hardware has
> indistinct separation of functions in the register map compared to how
> Linux models the chips.

wm8994 seems to fit in the mfd_cell static arrays pretty well..

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 07:09:11PM +, Lee Jones wrote:

> ACPI, DT and MFD are not busses.  

And yet ACPI and PNP have a bus:
  extern struct bus_type acpi_bus_type;
  extern struct bus_type pnp_bus_type;

Why? Because in the driver core if you subclass struct device and want
to bind drivers, as both PNP and ACPI do, you must place those devices
on a bus with a bus_type matching the device type. Thus subclassing
the device means subclassing the bus as well.

The purpose of the bus_type is to match drivers to devices and provide
methods to the driver core. The bus_type also defines the unique name
space of the device names.

It is confusing because the word bus immediately makes people think of
physical objects like I2C, PCI, etc, but that is not what bus_type
does in the object model of the driver core, IMHO.

So, if you subclass struct device for MFD's usage, then you must also
create a bus_type to handle driver binding. The MFD bus_type. Just
like auxiliary does.

Making a mfd subclass is the logical thing for a subsystem to do,
co-opting another subsystem's bus_type is just really weird/abusive.

auxiliary bus shows how all these parts work, and it is simple enough
to see the pieces clearly.
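To spell out what "subclassing the device means subclassing the bus" looks
like in driver-core terms, here is a bare-bones sketch (hypothetical "foo"
bus, no relation to any real subsystem): a subclassed device type needs its
own bus_type so the core can match and bind drivers to it.

	#include <linux/device.h>
	#include <linux/string.h>

	struct foo_device {
		struct device dev;		/* subclass of struct device */
		const char *api_name;		/* foo-specific matching data */
	};

	struct foo_driver {
		struct device_driver driver;
		int (*probe)(struct foo_device *fdev);
	};

	static int foo_match(struct device *dev, struct device_driver *drv)
	{
		struct foo_device *fdev = container_of(dev, struct foo_device, dev);

		/* bus-specific rule for pairing devices and drivers */
		return !strcmp(fdev->api_name, drv->name);
	}

	static int foo_probe(struct device *dev)
	{
		struct foo_driver *fdrv =
			container_of(dev->driver, struct foo_driver, driver);

		return fdrv->probe(container_of(dev, struct foo_device, dev));
	}

	static struct bus_type foo_bus_type = {
		.name	= "foo",		/* /sys/bus/foo/, its own name space */
		.match	= foo_match,
		.probe	= foo_probe,
	};

Registering the bus with bus_register(), then putting foo_device instances
on it and binding foo_driver instances through it, is all "bus" means here;
there is no physical interconnect anywhere in sight.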

Jason


Re: [net-next v4 00/15] Add mlx5 subfunction support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote:

> Also as far as the patch count complaints I have seen in a few threads
> I would be fine with splitting things up so that the devlink and aux
> device creation get handled in one set, and then we work out the
> details of mlx5 attaching to the devices and spawning of the SF
> netdevs in another since that seems to be where the debate is.

It doesn't work like that. The aux device creates a mlx5_core and
every mlx5_core can run mlx5_en.

This really isn't the series to raise this feature request. Adding an
optional short cut path to VF/SF is something that can be done later
if up to date benchmarks show it has value. There is no blocker in
this model to doing that.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 08:32:11PM +, Mark Brown wrote:

> > So, I strongly suspect, MFD should create mfd devices on a MFD bus
> > type.
> 
> Historically people did try to create custom bus types, as I have
> pointed out before there was then pushback that these were duplicating
> the platform bus so everything uses platform bus.

Yes, I vaguely remember..

I don't know what to say, it seems Greg doesn't share this view of
platform devices as a universal device.

Reading between the lines, I suppose things would have been happier
with some kind of inheritance scheme where platform devices remained
something instantiated only directly in board files, while drivers could
bind to OF/DT/ACPI/FPGA/etc device instantiations with minimal
duplication & boilerplate.

And maybe that is exactly what we have today with platform devices,
though the name is now unfortunate.

> I can't tell the difference between what it's doing and what SOF is
> doing, the code I've seen is just looking at the system it's running
> on and registering a fixed set of client devices.  It looks slightly
> different because it's registering a device at a time with some wrapper
> functions involved but that's what the code actually does.

SOF's aux bus usage in general seems weird to me, but if you think
it fits the mfd scheme of primarily describing HW to partition vs
describing a SW API then maybe it should use mfd.

The only problem with mfd as far as SOF is concerned was Greg was not
happy when he saw PCI stuff in the MFD subsystem.

This whole thing started when Intel first proposed to directly create
platform_device's in their ethernet driver and Greg had a quite strong
NAK to that.

MFD still doesn't fit what mlx5 and others in the netdev area are
trying to do. Though it could have been shoe-horned in, it would have been
really weird to create a platform device with an empty HW resource
list. At a certain point the bus type has to mean *something*!

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2020 at 10:16:58PM +0100, Alexandre Belloni wrote:

> But then again, what about non-enumerable devices on the PCI device? I
> feel this would exactly fit MFD. This is a collection of IPs that exist
> as standalone but in this case are grouped in a single device.

So, if mfd had a mfd_device and a mfd bus_type then drivers would need
to have both a mfd_driver and a platform_driver to bind. Look at
something like drivers/char/tpm/tpm_tis.c to see how a multi-probe
driver is structured
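Roughly, the tpm_tis shape is a bus-agnostic core plus one thin
registration per bus; a compressed sketch of that pattern (hypothetical
"foo" names, the real tpm_tis pairs a platform front end with a PNP one):

	#include <linux/module.h>
	#include <linux/platform_device.h>
	#include <linux/pnp.h>

	/* bus-agnostic core, shared by every front end */
	static int foo_core_init(struct device *dev, resource_size_t base)
	{
		dev_info(dev, "probing foo core at %pa\n", &base);
		return 0;
	}

	/* platform bus front end */
	static int foo_plat_probe(struct platform_device *pdev)
	{
		struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);

		return res ? foo_core_init(&pdev->dev, res->start) : -ENODEV;
	}

	static struct platform_driver foo_plat_driver = {
		.driver	= { .name = "foo" },
		.probe	= foo_plat_probe,
	};

	/* PNP bus front end, same core underneath */
	static int foo_pnp_probe(struct pnp_dev *pnp, const struct pnp_device_id *id)
	{
		struct resource *res = pnp_get_resource(pnp, IORESOURCE_MEM, 0);

		return res ? foo_core_init(&pnp->dev, res->start) : -ENODEV;
	}

	static const struct pnp_device_id foo_pnp_ids[] = {
		{ "FOO0001", 0 },
		{ "", 0 },
	};

	static struct pnp_driver foo_pnp_driver = {
		.name		= "foo_pnp",
		.id_table	= foo_pnp_ids,
		.probe		= foo_pnp_probe,
	};

	static int __init foo_init(void)
	{
		int rc = platform_driver_register(&foo_plat_driver);

		return rc ? rc : pnp_register_driver(&foo_pnp_driver);
	}
	module_init(foo_init);

The point for MFD: if there were an mfd_device and an mfd bus_type, a
driver in this position would simply grow a third, equally thin front end.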

See Mark's remarks about the old of_platform_device, to explain why we
don't have a 'dt_device' today

> Note that I then have another issue because the kernel doesn't support
> irq controllers on PCI and this is exactly what my SoC has. But for now,
> I can just duplicate the irqchip driver in the MFD driver.

I think Thomas fixed that recently on x86 at least.. 

Having to put dummy irq chip drivers in MFD (or anywhere else) sounds scary :|

> Let me point to drivers/net/ethernet/cadence/macb_pci.c which is a
> fairly recent example. It does exactly that and I'm not sure you could
> do it otherwise while still not having to duplicate most of macb_probe.

Creating a platform_device to avoid restructuring the driver's probe
and device logic to be generic is a *really* horrible reason to use a
platform device.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-03 Thread Jason Gunthorpe
On Thu, Dec 03, 2020 at 04:06:24PM +0100, Greg KH wrote:

> > ...for all the independent drivers to have a common commit baseline. It
> > is not there yet pending Greg's Ack.
> 
> I have been trying to carve out some time to review this.  At my initial
> glance, I still have objections, so please, give me a few more days to
> get this done...

There are still several more days till the merge window, but I am
going to ask Leon to get the mlx5 series, and this version of the
auxbus patch it depends on, into linux-next with the intention to
forward it to Linus if there are no substantive comments.

Regardless of fault or reason this whole 1.5 year odyssey seems to have
brought misery to everyone involved and it really is time to move on.

Leon and his team did a good deed 6 weeks ago to quickly turn around
and build another user example. For their efforts they have been
rewarded with major merge conflicts and a lot of delayed work due to
the invasive nature of the mlx5 changes. To continue to push this out
is disrespectful to him and his team's efforts.

A major part of my time as RDMA maintainer has been to bring things
away from vendor trees and into a common open-source community.  Intel
shipping a large out-of-tree RDMA driver and abandoning their in-tree
driver is really harmful. This auxbus is a substantial blocker to them
normalizing their operations, thus I view it as important to
resolve. Even after this it is going to take a long time and a lot of
effort to review their new RDMA driver.

Regards,
Jason


Re: [pull request][for-next] mlx5-next auxbus support

2020-12-04 Thread Jason Gunthorpe
On Fri, Dec 04, 2020 at 10:29:52AM -0800, Saeed Mahameed wrote:
> Hi Jakub, Jason
> 
> This pull request is targeting net-next and rdma-next branches.
> 
> This series provides mlx5 support for auxiliary bus devices.
> 
> It starts with a merge commit of tag 'auxbus-5.11-rc1' from
> gregkh/driver-core into mlx5-next, then the mlx5 patches that will convert
> mlx5 ulp devices (netdev, rdma, vdpa) to use the proper auxbus
> infrastructure instead of the internal mlx5 device and interface management
> implementation, which Leon is deleting at the end of this patchset.
> 
> Link: 
> https://lore.kernel.org/alsa-devel/20201026111849.1035786-1-l...@kernel.org/
> 
> Thanks to everyone for the joint effort !
> 
> Please pull and let me know if there's any problem.

This all looks good, thanks.

Jakub, a few notes on the shared branch process here..

In general Linus's advice has been to avoid unnecessary merges so
Saeed/Leon have tended to send PRs to one tree or the other based on
need and that PR might have a "catch up" from the other tree. I guess
this one is special because it makes lots of changes in both trees.

Once one side has pulled, the other cannot refuse the PR, so I usually
prefer to let netdev go first. I have more BW to manage trouble on the
RDMA side..

I saw your other request related to the CI failures due to the wrong
branch basis in the build bot. This means you will need to pull every
update to the mlx5 shared branch, even if it is not immediately
relevant to netdev, or have Saeed include the 'base commit' trailer
and teach the build bots to respect it..

Also, I arrange the RDMA merge window PR to be after netdev (usually
on Thursday) so that Linus sees minor RDMA stuff in the netdev
diffstat, and almost no netdev stuff in the RDMA PR.

Cheers,
Jason


Re: [PATCH -next] net/mlx5_core: remove unused including

2020-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2020 at 09:04:46AM -0800, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 08:21:00 +0200 Leon Romanovsky wrote:
> > On Tue, Dec 08, 2020 at 11:22:26AM -0800, Jakub Kicinski wrote:
> > > On Mon, 7 Dec 2020 20:14:00 +0800 Zou Wei wrote:  
> > > > Remove including  that don't need it.
> > > >
> > > > Signed-off-by: Zou Wei 
> > > >  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 1 -
> > > >  1 file changed, 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
> > > > b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> > > > index 989c70c..82ecc161 100644
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> > > > @@ -30,7 +30,6 @@
> > > >   * SOFTWARE.
> > > >   */
> > > >
> > > > -#include 
> > > >  #include 
> > > >  #include 
> > > >  #include   
> > 
> > Jakub,
> > 
> > You probably doesn't have latest net-next.
> > 
> > In the commit 17a7612b99e6 ("net/mlx5_core: Clean driver version and
> > name"), I removed "strlcpy(drvinfo->version, UTS_RELEASE,
> > sizeof(drvinfo->version));" line.
> > 
> > The patch is ok, but should have Fixes line.
> > Fixes: 17a7612b99e6 ("net/mlx5_core: Clean driver version and name")
> 
> Hm. Pretty sure our build bot gets a fresh copy before testing. 
> Must had been some timing issue, perhaps? Looks like the commit
> came in with the auxbus merge.

mlx5-next is in linux-next independently so people will be sending
fixes against stuff in linux-next before it hits net-next.

Jason


Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port

2020-11-18 Thread Jason Gunthorpe
On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:

> With Connectx-4 Lx for example the netdev can have at most 63 queues

What netdev calls a queue is really a "can the device deliver
interrupts and packets to a given per-CPU queue" and covers a whole
spectrum of smaller limits like RSS scheme, # of available interrupts,
ability of the device to create queues, etc.

CX4Lx can create a huge number of queues, but hits one of these limits,
which means netdev's specific usage can't scale up. Other stuff like
RDMA doesn't have the same limits, and has tons of queues.

What seems to be needed is a resource controller concept like cgroup
has for processes. The system is really organized into a tree:

                  physical device
                     mlx5_core
               /      |      \      \         (aux bus)
           netdev   rdma    vdpa     SF  etc
                                      |        (aux bus)
                                  mlx5_core
                                   /     \     (aux bus)
                               netdev   vdpa

And it does make a lot of sense to start to talk about limits at each
tree level.

eg the top of the tree may have 128 physical interrupts. With 128 CPU
cores that isn't enough interrupts to support all of those things
concurrently.

So the user may want to configure:
 - The first-level netdev only gets 64
 - The third-level mlx5_core gets 32
 - The final-level vdpa gets 8

Other stuff has to fight it out with the remaining shared interrupts.

In netdev land # of interrupts governs # of queues

For RDMA # of interrupts limits the CPU affinities for queues

For VDPA, # of interrupts limits the # of VMs that can use VT-d

The same story repeats for other less general resources, mlx5 also
has consumption of limited BAR space, and consumption of some limited
memory elements. These numbers are much bigger and may not need
explicit governing, but the general concept holds.

It would be very nice if the limit could be injected when the aux
device is created but before the driver is bound. I'm not sure how to
manage that though..
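Purely as a thought experiment (nothing like this exists today; the names
and structure are invented for illustration), the kind of accounting being
described is a budget at every level of that tree, charged upward when a
leaf claims dedicated vectors:

	/* Conceptual sketch only: per-node interrupt budgets for the tree
	 * above (physical device -> mlx5_core -> aux devices -> ...). */
	struct irq_budget {
		struct irq_budget *parent;
		unsigned int limit;	/* vectors this node may consume, total */
		unsigned int used;
	};

	static int irq_budget_claim(struct irq_budget *node, unsigned int n)
	{
		struct irq_budget *b;

		/* every ancestor must have room before anything is charged */
		for (b = node; b; b = b->parent)
			if (b->used + n > b->limit)
				return -ENOSPC;

		for (b = node; b; b = b->parent)
			b->used += n;

		return 0;
	}

The hard part the text points at is not this arithmetic, it is when the
limit gets set: ideally between creating the aux device and binding its
driver, which is exactly the window that does not exist yet.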

I assume other devices will be different, maybe some devices have a
limit on the number of total queues, or a limit on the number of
VDPA or RDMA devices.

Jason


Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port

2020-11-18 Thread Jason Gunthorpe
On Wed, Nov 18, 2020 at 12:36:26PM -0700, David Ahern wrote:
> On 11/18/20 11:38 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> > 
> >> With Connectx-4 Lx for example the netdev can have at most 63 queues
> > 
> > What netdev calls a queue is really a "can the device deliver
> > interrupts and packets to a given per-CPU queue" and covers a whole
> > spectrum of smaller limits like RSS scheme, # of available interrupts,
> > ability of the device to create queues, etc.
> > 
> > CX4Lx can create a huge number of queues, but hits one of these limits
> > that mean netdev's specific usage can't scale up. Other stuff like
> > RDMA doesn't have the same limits, and has tonnes of queues.
> > 
> > What seems to be needed is a resource controller concept like cgroup
> > has for processes. The system is really organized into a tree:
> > 
> >                  physical device
> >                     mlx5_core
> >               /      |      \      \         (aux bus)
> >           netdev   rdma    vdpa     SF  etc
> >                                      |        (aux bus)
> >                                  mlx5_core
> >                                   /     \     (aux bus)
> >                               netdev   vdpa
> > 
> > And it does make a lot of sense to start to talk about limits at each
> > tree level.
> > 
> > eg the top of the tree may have 128 physical interrupts. With 128 CPU
> > cores that isn't enough interrupts to support all of those things
> > concurrently.
> > 
> > So the user may want to configure:
> >  - The first level netdev only gets 64,
> >  - 3rd level mlx5_core gets 32 
> >  - Final level vdpa gets 8
> > 
> > Other stuff has to fight it out with the remaining shared interrupts.
> > 
> > In netdev land # of interrupts governs # of queues
> > 
> > For RDMA # of interrupts limits the CPU affinities for queues
> > 
> > VPDA limits the # of VMs that can use VT-d
> > 
> > The same story repeats for other less general resources, mlx5 also
> > has consumption of limited BAR space, and consumption of some limited
> > memory elements. These numbers are much bigger and may not need
> > explicit governing, but the general concept holds.
> > 
> > It would be very nice if the limit could be injected when the aux
> > device is created but before the driver is bound. I'm not sure how to
> > manage that though..
> > 
> > I assume other devices will be different, maybe some devices have a
> > limit on the number of total queues, or a limit on the number of
> > VDPA or RDMA devices.
> 
> A lot of low level resource details that need to be summarized into a
> nicer user / config perspective to specify limits / allocations.

Well, now that we have the aux bus stuff there is a nice natural place
to put things..

The aux bus owner device (mlx5_core) could have a list of available
resources

Each aux bus device (netdev/rdma/vdpa) could have a list of consumed
resources

Some API to place a limit on the consumed resources at each aux bus
device.

The tricky bit is the auto-probing/configure. By the time the user has
a chance to apply a limit the drivers are already bound and have
already done their setup. So each subsystem has to support dynamically
imposing a limit..

And I simplified things a bit above too, we actually have two kinds of
interrupt demand: sharable and dedicated. The actual need is to carve
out a bunch of dedicated interrupts and only allow subsystems that are
doing VT-d guest interrupt assignment to consume them (eg VDPA)

Jason


Re: [PATCH net-next 00/13] Add mlx5 subfunction support

2020-11-19 Thread Jason Gunthorpe
On Wed, Nov 18, 2020 at 10:22:51PM -0800, Saeed Mahameed wrote:
> > I think the biggest missing piece in my understanding is what's the
> > technical difference between an SF and a VDPA device.
> 
> Same difference as between a VF and netdev.
> SF == VF, so a full HW function.
> VDPA/RDMA/netdev/SCSI/nvme/etc.. are just interfaces (ULPs) sharing the
> same functions as always been, nothing new about this.

All the implementation details are very different, but this white
paper from Intel goes into some detail on the basic elements and
rationale for the SF concept:

https://software.intel.com/content/dam/develop/public/us/en/documents/intel-scalable-io-virtualization-technical-specification.pdf

What we are calling a sub-function here is a close cousin to what
Intel calls an Assignable Device Interface. I expect to see other
drivers following this general pattern eventually.

A SF will eventually be assignable to a VM and the VM won't be able to
tell the difference between a VF or SF providing the assignable PCI
resources.

VDPA is also assignable to a guest, but the key difference between
mlx5's SF and VDPA is what guest driver binds to the virtual PCI
function. For a SF the guest will bind mlx5_core, for VDPA the guest
will bind virtio-net.

So, the driver stack for a VM using VDPA might be

 Physical device [pci] -> mlx5_core -> [aux] -> SF -> [aux] -> mlx5_core ->
     [aux] -> mlx5_vdpa -> QEMU -> |VM| -> [pci] -> virtio_net

When Parav is talking about creating VDPA devices he means attaching
the VDPA accelerator subsystem to a mlx5_core, wherever that
mlx5_core might be attached.

To your other remark:

> > What are you NAK'ing?
> Spawning multiple netdevs from one device by slicing up its queues.

This is a bit vague. In SRIOV a device spawns multiple netdevs for a
physical port by "slicing up its physical queues" - where do you see
the crossover between VMDq (bad) and SRIOV (ok)?

I thought the issue with VMDq was more on the horrid management to
configure the traffic splitting, not the actual splitting itself?

In classic SRIOV the traffic is split by a simple non-configurable HW
switch based on MAC address of the VF.

mlx5 already has the extended version of that idea, we can run in
switchdev mode and use switchdev to configure the HW switch. Now
configurable switchdev rules split the traffic for VFs.

This SF step replaces the VF in the above, but everything else is the
same. The switchdev still splits the traffic, it still ends up in the same
nested netdev queue structure & RSS a VF/PF would use, etc, etc. No
queues are "stolen" to create the nested netdev.

From the driver perspective there is no significant difference between
sticking a netdev on a mlx5 VF or sticking a netdev on a mlx5 SF. A SF
netdev is not going in and doing deep surgery to the PF netdev to
steal queues or something.

Both VF and SF will be eventually assignable to guests, both can
support all the accelerator subsystems - VDPA, RDMA, etc. Both can
support netdev.

Compared to VMDq, I think it is really no comparison. SF/ADI is an
evolution of a SRIOV VF from something PCI-SIG controlled to something
device-specific and lighter weight.

SF/ADI come with an architectural security boundary suitable for
assignment to an untrusted guest. It is not just a jumble of queues.

VMDq is .. not that.

Actually it has been one of the open debates in the virtualization
userspace world. The approach to use switchdev to control the traffic
splitting to VMs is elegant but many drivers are not following
this design. :(

Finally, in the mlx5 model VDPA is just an "application". It asks the
device to create a 'RDMA' raw ethernet packet QP that uses rings
formed in the virtio-net specification. We can create it in the kernel
using mlx5_vdpa, and we can create it in userspace through the RDMA
subsystem. Like any "RDMA" application it is contained by the security
boundary of the PF/VF/SF the mlx5_core is running on.

Jason


Re: [PATCH net-next 00/13] Add mlx5 subfunction support

2020-11-20 Thread Jason Gunthorpe
On Thu, Nov 19, 2020 at 07:35:26PM -0800, Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that is uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!

No problem!

> The part that's blurry to me is VDPA.

Okay, I think I see where the gap is, I'm going to elaborate below so
we are clear.

> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).

I think VDPA covers a wide range of things.

The basic idea is that, starting with the all-SW virtio-net implementation,
we can move parts to HW. Each implementation will probably be a little
different here. The kernel vdpa subsystem is a toolbox to mix the
required emulation and HW capability to build a virtio-net PCI
interface.

The key question to ask of any VDPA design is "what does the VDPA
FW do with the packet once the HW accelerator has parsed the
virtio-net descriptor?".

The VDPA world has refused to agree on this due to vendor squabbling,
but mlx5 has a clear answer:

 VDPA Tx generates an ethernet packet and sends it out the SF/VF port
 through a tunnel to the representor and then on to the switchdev.

Other VDPA designs have a different answer!!

This concept is so innate to how Mellanox views the world that it does
not surprise me that the cover letters and patch descriptions don't
belabor this point much :)

I'm going to deep dive through this answer below. I think you'll see
this is the most sane and coherent architecture with the tools
available in netdev.. Mellanox thinks the VDPA world should
standardize on this design so we can have a standard control plane.

> You're saying it's a client application like any other - do I understand
> it right that the hypervisor driver will be translating descriptors
> between virtio and device-native then?

No, the hypervisor creates a QP and tells the HW that this QP's
descriptor format follows virtio-net. The QP processes those
descriptors in HW and generates ethernet packets.

A "client application like any other" means that the ethernet packets
VDPA forms are identical to the ones netdev or RDMA forms. They are
all delivered into the tunnel on the SF/VF to the representor and on
to the switch. See below

> The vdpa parent is in the hypervisor correct?
> 
> Can a VDPA device have multiple children of the same type?

I'm not sure parent/child are good words here.

The VDPA emulation runs in the hypervisor, and the virtio-net netdev
driver runs in the guest. The VDPA is attached to a switchdev port and
representor tunnel by virtue of its QPs being created under a SF/VF.

If we imagine a virtio-rdma, then you might have a SF/VF hosting both
VDPA and VDPA-RDMA which emulate two PCI devices assigned to a
VM. Both of these peer virtio's would generate ethernet packets for TX
on the SF/VF port into the tunnel through the represntor and to the
switch.

> Why do we have a representor for a SF, if the interface is actually VDPA?
> Block and net traffic can't reasonably be treated the same by the
> switch.

I think you are focusing on queues, the architecture at PF/SF/VF is
not queue based, it is packet based.

At the physical mlx5 the netdev has a switchdev. On that switch I can
create a *switch port*.

The switch port is composed of a representor and a SF/VF. They form a
tunnel for packets.

The representor is the hypervisor side of the tunnel and contains all
packets coming out of and into the SF/VF.

The SF/VF is the guest side of the tunnel and has a full NIC.

The SF/VF can be:
 - Used in the same OS as the switch
 - Assigned to a guest VM as a PCI device
 - Assigned to another processor in the SmartNIC case.

In all cases if I use a queue on a SF/VF to generate an ethernet
packet then that packet *always* goes into the tunnel to the
representor and goes into a switch. It is always contained by any
rules on the switch side. If the switch is set so the representor is
VLAN tagged then a queue on a SF/VF *cannot* escape the VLAN tag.

Similarly SF/VF cannot Rx any packets that are not sent into the
tunnel, meaning the switch controls what packets go into the
representor, through the tunnel and to the SF.

Yes, block and net traffic are all reduced to ethernet packets, sent
through the tunnel to the representor and treated by the switch. It is
no different than a physical switch. If there is to be some net/block
difference 

Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread Jason Gunthorpe
On Fri, Nov 20, 2020 at 12:21:39PM -0600, Gustavo A. R. Silva wrote:

>   IB/hfi1: Fix fall-through warnings for Clang
>   IB/mlx4: Fix fall-through warnings for Clang
>   IB/qedr: Fix fall-through warnings for Clang
>   RDMA/mlx5: Fix fall-through warnings for Clang

I picked these four to the rdma tree, thanks

Jason


Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX namespace

2020-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:
> > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:  
> > > > From: Eli Cohen 
> > > > 
> > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > DPDK to have precedence in packet processing.  
> > > 
> > > How does DPDK and VDPA relate in this context?  
> > 
> > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > Up till now, the VDPA implementation would insert a rule into the
> > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > all the incoming traffic.
> > 
> > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > MLX5_FLOW_NAMESPACE_BYPASS.
> 
> Our policy was no DPDK driver bifurcation. There's no asterisk saying
> "unless you pretend you need flow filters for RDMA, get them upstream
> and then drop the act".

Huh?

mlx5 DPDK is an *RDMA* userspace application. It links to
libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
RDMA raw ethernet QPs. It has been like this for years, it is not some
"act".

It is long-standing uABI that accelerators like RDMA/etc get to take
the traffic before netdev. This cannot be reverted. I don't really
understand what you are expecting here?

Jason


Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX namespace

2020-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:
> > > > > > From: Eli Cohen 
> > > > > > 
> > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > inserting VDPA rules before regular NIC but after bypass, thus 
> > > > > > allowing
> > > > > > DPDK to have precedence in packet processing.
> > > > > 
> > > > > How does DPDK and VDPA relate in this context?
> > > > 
> > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > Up till now, the VDPA implementation would insert a rule into the
> > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > all the incoming traffic.
> > > > 
> > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > > 
> > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > and then drop the act".  
> > 
> > Huh?
> > 
> > mlx5 DPDK is an *RDMA* userspace application. 
> 
> Forgive me for my naiveté. 
> 
> Here I thought the RDMA subsystem is for doing RDMA.

RDMA covers a wide range of accelerated networking these days.. Where
else are you going to put this stuff in the kernel?

> I'm sure if you start doing crypto over ibverbs crypto people will want
> to have a look.

Well, RDMA has had crypto transforms for a few years now too. Why would
crypto subsystem people be involved? It isn't using or duplicating
their APIs.

> > libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> > RDMA raw ethernet QPs. 
> 
> I'm not saying that's not the case. I'm saying I don't think this was
> something that netdev developers signed-off on.

Part of the point of the subsystem split was to end the fighting that
started all of it. It was very clear during the whole iWarp and TCP
Offload Engine business in the mid-2000s that netdev wanted nothing
to do with the accelerator world.

So why would netdev need sign off on any accelerator stuff?  Do you
want to start co-operating now? I'm willing to talk about how to do
that.

> And our policy on DPDK is pretty widely known.

I honestly have no idea about the netdev DPDK policy; I'm maintaining
the RDMA subsystem, not DPDK :)

> Would you mind pointing us to the introduction of raw Ethernet QPs?
> 
> Is there any production use for that without DPDK?

Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
IBA specification cira 2000. When RoCE was defined (around 2010) they
were naturally carried forward to Ethernet. The "flow steering"
concept to make raw ethernet QP useful was added to verbs around 2012
- 2013. It officially made it upstream in commit 436f2ad05a0b
("IB/core: Export ib_create/destroy_flow through uverbs")

If I recall properly the first real application was ultra low latency
ethernet processing for financial applications.

DPDK later adopted the first mlx4 PMD using this libibverbs API around
2015. Interestingly the mlx4 PMD was made through an open source
process with minimal involvement from Mellanox, based on the
pre-existing RDMA work.

Currently there are many projects, many of them open source, built on top
of the RDMA raw ethernet QP and RDMA flow steering model. It is now
long-established kernel ABI.
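For readers who have never seen that uAPI, a minimal sketch of the model
(libibverbs; error handling and all QP setup are omitted, the MAC is an
arbitrary example value, and the QP is assumed to already be an
IBV_QPT_RAW_PACKET QP on port 1):

	#include <infiniband/verbs.h>

	struct raw_eth_flow_attr {
		struct ibv_flow_attr	 attr;
		struct ibv_flow_spec_eth eth;
	};

	static struct ibv_flow *steer_dmac_to_qp(struct ibv_qp *qp)
	{
		struct raw_eth_flow_attr fa = {
			.attr = {
				.type		= IBV_FLOW_ATTR_NORMAL,
				.size		= sizeof(fa),
				.num_of_specs	= 1,
				.port		= 1,
			},
			.eth = {
				.type = IBV_FLOW_SPEC_ETH,
				.size = sizeof(struct ibv_flow_spec_eth),
				.val.dst_mac  = { 0x02, 0, 0, 0, 0, 0x01 },
				.mask.dst_mac = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
			},
		};

		/* frames matching the rule are delivered to this QP before
		 * the netdev stack ever sees them */
		return ibv_create_flow(qp, &fa.attr);
	}

That is the whole "bifurcation": a steering rule plus a raw packet QP, all
contained by the security boundary of whatever PF/VF/SF the verbs context
was opened on.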

> > It has been like this for years, it is not some "act".
> > 
> > It is long standing uABI that accelerators like RDMA/etc get to take
> > the traffic before netdev. This cannot be reverted. I don't really
> > understand what you are expecting here?
> 
> Same. I don't really know what you expect me to do either. I don't
> think I can sign-off on kernel changes needed for DPDK.

This patch is fine-tuning the shared logic that splits the traffic to
accelerator subsystems, and I don't think netdev should have a veto
here. This needs to be a consensus among the various communities and
subsystems that rely on this.

Eli did not explain this well in his commit message. When he said DPDK
he means RDMA, which is the owner of the FLOW_NAMESPACE. Each
accelerator subsystem gets hooked into this, so here VDPA is getting
its own hook, because re-using the same hook between two kernel
subsystems is buggy.

Jason


Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX namespace

2020-11-25 Thread Jason Gunthorpe
On Wed, Nov 25, 2020 at 10:54:22AM -0800, Jakub Kicinski wrote:

> > RDMA covers a wide range of accelerated networking these days.. Where
> > else are you going to put this stuff in the kernel?
> 
> IDK what else you got in there :) It's probably a case by case answer.

Hmm, yes, it seems endless sometimes :(
 
> IMHO even using libibverbs is no strong reason for things to fall under
> RDMA exclusively. Client drivers of virtio don't get silently funneled
> through a separate tree just because they use a certain spec.

I'm not sure I understand this, libibverbs is the user library to
interface with the kernel RDMA subsystem. I don't care what apps
people build on top of it, it doesn't matter to me that netdev and
DPDK have some kind of feud.

> > > I'm sure if you start doing crypto over ibverbs crypto people will want
> > > to have a look.  
> > 
> > Well, RDMA has crypto transforms for a few years now too. 
> 
> Are you talking about RDMA traffic being encrypted? That's a different
> case.

That too, but in general, anything netdev can do can be done via RDMA
in userspace. So all the kTLS and IPSEC xfrm HW offloads mlx5 supports
are all available in userspace too.

> > Part of the point of the subsystem split was to end the fighting that
> > started all of it. It was very clear during the whole iWarp and TCP
> > Offload Engine buisness in the mid 2000's that netdev wanted nothing
> > to do with the accelerator world.
> 
> I was in middle school at the time, not sure what exactly went down :)

Ah, it was quite the thing. Microsoft and Co were heavily pushing TOE
technology (Microsoft Chimney!) as the next sure thing, and I recall
DaveM & co were completely against it in Linux.

I will admit at the time I was doubtful, but in hindsight this was the
correct choice. netdev would not look like it does today if it had
been shackled by the HW implementations of the day. Instead all this
HW stuff ended up largely in RDMA and some in block with the iSCSI
mania of old. It is quite evident to me what a mess being tied to HW has
caused for a SW ecosystem. DRM and RDMA both have a very similar kind
of suffering due to this.

However - over the last 20 years it has been steadfast that there is
*always* a compelling reason for certain applications to use something
from the accelerator side. It is not for everyone, but the specialized
applications that need it, *really need it*.

For instance, it is the difference between being able to get a COVID
simulation result in a few weeks vs .. well.. never.

> But I'm going by common sense here. Perhaps there was an agreement
> I'm not aware of?

The resolution to the argument above was to split them in Linux.  Thus
what logically is networking was split up in the kernel between netdev
and the accelerator subsystems (iscsi, rdma, and so on).

The general notion is netdev doesn't have to accommodate anything an
accelerator does. If you choose to run them then you do not get to
complain that your ethtool counters are wrong, your routing tables
and tc don't work, firewalling doesn't work. Etc.

That is all broken by design.

In turn, the accelerators do their own thing, tap the traffic before
it hits netdev and so on. netdev does not care what goes on over there
and is not responsible.

I would say this is the basic unspoken agreement of the last 15 years.

Both have a right to exist in Linux. Both have a right to use the
physical ethernet port.

> > So why would netdev need sign off on any accelerator stuff?
> 
> I'm not sure why you keep saying accelerators!
> 
> What is accelerated in raw Ethernet frame access??

The nature of the traffic is not relevant.

If it goes through RDMA, it is accelerator traffic (vs netdev traffic,
which goes to netdev). Even if you want to be pedantic, in the raw
ethernet area there is lots of special accelerated HW stuff going
on. Mellanox has some really neat hard-realtime networking technology
that works on raw ethernet packets, for instance.

And of course raw ethernet is a fraction of what RDMA covers. iWarp
and RoCE are much more like you might imagine when you hear the word
accelerator.

> > Do you want to start co-operating now? I'm willing to talk about how
> > to do that.
> 
> IDK how that's even in question. I always try to bump all RDMA-looking
> stuff to linux-rdma when it's not CCed there. That's the bare minimum
> of cooperation I'd expect from anyone.

I mean co-operate in the sense of defining a scheme where the two
worlds are not completely separated and isolated.

> > > And our policy on DPDK is pretty widely known.  
> > 
> > I honestly have no idea on the netdev DPDK policy,
> > 
> > I'm maintaining the RDMA subsystem not DPDK :)
> 
> That's what I thought, but turns out DPDK is your important user.

Nonsense.

I don't have stats but the majority of people I work with using RDMA
are not using DPDK. DPDK serves two somewhat niche markets of NFV and
certain hyperscalers - RDMA covers the entire scientific computing
community and a b

Re: [PATCH net-next 1/5] IB/hfi1: switch to core handling of rx/tx byte/packet counters

2020-11-12 Thread Jason Gunthorpe
On Wed, Nov 11, 2020 at 09:03:55AM -0800, Jakub Kicinski wrote:
> On Tue, 10 Nov 2020 20:47:34 +0100 Heiner Kallweit wrote:
> > Use netdev->tstats instead of a member of hfi1_ipoib_dev_priv for storing
> > a pointer to the per-cpu counters. This allows us to use core
> > functionality for statistics handling.
> > 
> > Signed-off-by: Heiner Kallweit 
> 
> RDMA folks, ack for merging via net-next?

Yes OK

Ack-by: Jason Gunthorpe 

Jason


Re: [PATCH net-next 00/13] Add mlx5 subfunction support

2020-11-17 Thread Jason Gunthorpe
On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:

> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> > 
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
> 
> Nack for the networking part of that. It'd basically be VMDq.

What are you NAK'ing? 

It is consistent with the multi-subsystem device sharing model we've
had for ages now.

The physical ethernet port is shared between multiple accelerator
subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
VDPA, etc.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Jason Gunthorpe
On Mon, Dec 21, 2020 at 06:51:40PM +, Mark Brown wrote:

> > with some kind of inheritance scheme where platform device remained as
> > only instantiated directly in board files, while drivers could bind to
> > OF/DT/ACPI/FPGA/etc device instantiations with minimal duplication &
> > boilerplate.
> 
> Like I said in my previous message that is essentially what we have now.
> It's not worded in quite that way but it's how all the non-enumerable
> buses work.  

I think it is about halfway there. We jammed everything into platform
device and platform bus and then had a few API aspects to tell
which of the subtypes it might be.

That functions sort of like an object model with inheritance, but a
single type plus 'is it a XXX' queries is not quite the same thing.

> BTW I did have a bit of a scan through some of the ACPI devices and
> for a good proportion of them it seems fairly clear that they are
> not platform devices at all - they were mostly interacting with ACPI
> firmware functionality rather than hardware, something you can't
> really do with FDT at all.

Right, that is kind of the point. We also have cases where ACPI
devices are just an ioresource list and don't have any special
ACPIness. IIRC DT has a similar issue where there are DT drivers that
just don't work without the OF stuff. Why are they platform drivers?

IMHO the point of the bus type is to tell the driver what API set you
have. If you have an of_device then you have an OF node and can do all
the OF operations. Same for PCI/ACPI/etc.

We fake this idea out by being able to convert platform to DT and OF,
but if platform is to be the universal device then why do we have PCI
device and not a 'platform to pci' operator instead? None of this is
consistent.

Regardless of the shortcut to make everything a struct
platform_device, I think it was a mistake to put OF devices on
platform_bus. Those should have remained on some of_bus even if they
are represented by struct platform_device, with the fiddling in the core
done to make that work OK.

It is much easier to identify what a bus_type is (the unique
collection of APIs) and thus when to create those.

Whether the bus_type should contain a struct platform_device or a unique
struct then becomes a different question.

Yes that is very hacky, but it feels less hacky than the "platform
bus/device is everything and can be used everywhere" idea.

> > The only problem with mfd as far as SOF is concerned was Greg was not
> > happy when he saw PCI stuff in the MFD subsystem.
> 
> This is a huge part of the problem here - there's no clearly articulated
> logic, it's all coming back to these sorts of opinion statements about
> specific cases which aren't really something you can base anything
> on.

I agree with this, IMHO there is no really cohesive explanation for
when to create a bus vs use the "universal bus" (platform) that can
also explain the things platform is already doing.

This feels like a good conference topic someday..

> Personally I'm even struggling to identify a practical problem that
> we're trying to solve here.  Like Alexandre says what would an
> mfd_driver actually buy us?

Well, there is the minor issue of name collision eg
/sys/bus/XX/devices/* must list all devices in the system with no
collisions.

The owner of the bus is supposed to define the stable naming scheme
and all the devices are supposed to follow it. platform doesn't have
this:

$ ls /sys/bus/platform/devices/
 ACPI000C:00         dell-smbios.0       'Fixed MDIO bus.0'   INT33A1:00
 microcode           PNP0C04:00          PNP0C0B:03           PNP0C14:00
 alarmtimer.0.auto   dell-smbios.1       GHES.0               intel_rapl_msr.0
 MSFT0101:00         PNP0C0B:00          PNP0C0B:04           PNP0C14:01
 coretemp.0          efi-framebuffer.0   GHES.1               iTCO_wdt
 pcspkr              PNP0C0B:01          PNP0C0C:00           reg-dummy
 dcdbas              eisa.0              INT0800:00           kgdboc
 PNP0103:00          PNP0C0B:02          PNP0C0E:00           serial8250

Why are ACPI names in here? It looks like "because platform drivers
were used to bind to ACPI devices" 

eg INT33A1 is bound by pmc_core_driver, so the device was moved from
acpi_bus to platform_bus? How does that make sense??

Why is pmc_core_driver a platform driver instead of an ACPI one? Because some
platforms don't have ACPI and the board file properly creates a
platform device in C code.

> I have some bad news for you about the hardware description problem
> space.  Among other things we have a bunch of platform devices that
> don't have any resources exposed through the resource API but are still
> things like chips on a board, doing some combination of exposing
> resources for other devices (eg, a fixed voltage regulator) and
> consuming things like clocks or GPIOs that don't appear in the resource
> API.

So in these cases how do I use the generic platform bus API to find
the GPIOs, regulators, and so on to connect with?

If drivers take a platform device and immediately convert it to an OF
object and use OF APIs 

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Jason Gunthorpe
On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:


> > Regardless of the shortcut to make everything a struct
> > platform_device, I think it was a mistake to put OF devices on
> > platform_bus. Those should have remained on some of_bus even if they
> 
> Like I keep saying the same thing applies to all non-enumerable buses -
> exactly the same considerations exist for all the other buses like I2C
> (including the ACPI naming issue you mention below), and for that matter
> with enumerable buses which can have firmware info.

And most busses do already have their own bus type. ACPI, I2C, PCI,
etc. It is just a few that have been squished into platform, notably
OF.
 
> > are represented by struct platform_device and fiddling in the core
> > done to make that work OK.
> 
> What exactly is the fiddling in the core here, I'm a bit unclear?

I'm not sure, but I bet there is a small fallout from making bus_type
not 1:1 with the struct device type.. Would have to attempt it to see.

> > This feels like a good conference topic someday..
> 
> We should have this discussion *before* we get too far along with trying
> to implement things, we should at least have some idea where we want to
> head there.

Well, auxiliary bus is clearly following the original bus model
intention, with a dedicated bus type and a controlled naming
scheme. The debate here seems to be "what about platform bus" and
"what to do with mfd"?

> Those APIs all take a struct device for lookup so it's the same call for
> looking things up regardless of the bus the device is on or what
> firmware the system is using - where there are firmware specific lookup
> functions they're generally historical and shouldn't be used for new
> code.  It's generally something in the form
> 
>   api_type *api_get(struct device *dev, const char *name);

Well, that is a nice improvement since a few years back when I last
worked on this stuff.

But now it begs the question, why not push harder to make 'struct
device' the generic universal access point and add some resource_get()
API along these lines so even a platform_device * isn't needed?

Then the path seems much clearer, add a multi-bus-type device_driver
that has a probe(struct device *) and uses the 'universal api_get()'
style interface to find the generic 'resources'.

The actual bus types and bus structs can then be split properly
without the boilerplate that caused them all to be merged to platform,
even PCI could be substantially merged like this.

Bonus points to replace the open-coded method dispatch:

int gpiod_count(struct device *dev, const char *con_id)
{
	int count = -ENOENT;

	if (IS_ENABLED(CONFIG_OF) && dev && dev->of_node)
		count = of_gpio_get_count(dev, con_id);
	else if (IS_ENABLED(CONFIG_ACPI) && dev && ACPI_HANDLE(dev))
		count = acpi_gpio_count(dev, con_id);

	if (count < 0)
		count = platform_gpio_count(dev, con_id);

	return count;
}

With an actual bus-specific virtual function this would just be:

	return dev->bus->gpio_count(dev, con_id);
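A hypothetical sketch of what that could look like (nothing below exists in
mainline; the ->gpio_count() bus_type member is invented purely to
illustrate the dispatch):

	/* each firmware/bus type supplies its own implementation ... */
	static int of_bus_gpio_count(struct device *dev, const char *con_id)
	{
		return of_gpio_get_count(dev, con_id);	/* OF-specific lookup */
	}

	static int acpi_bus_gpio_count(struct device *dev, const char *con_id)
	{
		return acpi_gpio_count(dev, con_id);	/* ACPI-specific lookup */
	}

	/* ... and the generic helper stops caring which one it is */
	int gpiod_count(struct device *dev, const char *con_id)
	{
		if (dev && dev->bus && dev->bus->gpio_count)	/* invented member */
			return dev->bus->gpio_count(dev, con_id);

		return -ENOENT;
	}

Whether that method belongs on the bus_type or on some separate fw_type
object is the open question in the rest of this thread.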

> ...and then do the same thing for every other bus with firmware
> bindings.  If it's about the firmware interfaces it really isn't a
> platform bus specific thing.  It's not clear to me if that's what it is
> though or if this is just some tangent.

It should be split up based on the unique naming scheme and any bus
specific API elements - like raw access to ACPI or OF data or what
have you for other FW bus types.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Jason Gunthorpe
On Mon, Jan 04, 2021 at 04:51:51PM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 4:14 PM Jason Gunthorpe  wrote:
> >
> > On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
> >
> >
> > > > Regardless of the shortcut to make everything a struct
> > > > platform_device, I think it was a mistake to put OF devices on
> > > > platform_bus. Those should have remained on some of_bus even if they
> > >
> > > Like I keep saying the same thing applies to all non-enumerable buses -
> > > exactly the same considerations exist for all the other buses like I2C
> > > (including the ACPI naming issue you mention below), and for that matter
> > > with enumerable buses which can have firmware info.
> >
> > And most busses do already have their own bus type. ACPI, I2C, PCI,
> > etc. It is just a few that have been squished into platform, notably
> > OF.
> >
> 
> I'll note that ACPI is an outlier that places devices on 2 buses,
> where new acpi_driver instances are discouraged [1] in favor of
> platform_drivers. ACPI scan handlers are awkwardly integrated into the
> Linux device model.
> 
> So while I agree with sentiment that an "ACPI bus" should
> theoretically stand on its own there is legacy to unwind.
> 
> I only bring that up to keep the focus on how to organize drivers
> going forward, because trying to map some of these arguments backwards
> runs into difficulties.
> 
> [1]: 
> http://lore.kernel.org/r/cajz5v0j_rek3agddw7flvmw_7kneccg2u_hukgjzqelcy8s...@mail.gmail.com

Well, this is the exact kind of thing I think we are talking about
here..

> > It should be split up based on the unique naming scheme and any bus
> > specific API elements - like raw access to ACPI or OF data or what
> > have you for other FW bus types.
> 
> I agree that the pendulum may have swung too far towards "reuse
> existing bus_type", and auxiliary-bus unwinds some of that, but does
> the bus_type really want to be an indirection for driver apis outside
> of bus-specific operations?

If the bus is the "enumeration entity" and we define that things like
name, resources, gpio's, regulators, etc are a generic part of what is
enumerated, then it makes sense that the bus would have methods
to handle those things too.

In other words, the only way to learn what the GPIO 'resource' is, is to
ask the enumeration mechanism that is providing the bus. If the enumeration
and bus are 1:1 then you can use a function pointer on the bus type
instead of open coding a dispatch based on an indirect indication.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-05 Thread Jason Gunthorpe
On Mon, Jan 04, 2021 at 07:12:47PM -0800, Dan Williams wrote:

> I get that, but I'm fearing a gigantic bus_ops structure that has
> narrow helpers like ->gpio_count() that mean nothing to the many other
> clients of the bus. Maybe I'm overestimating the pressure there will
> be to widen the ops structure at the bus level.

If we want a 'universal device' then that stuff must live
someplace.. Open coding the dispatch as is done today is also not the
end of the world; it is just usually a sign that something is not
ideal with the object model.

Jason


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-05 Thread Jason Gunthorpe
On Tue, Jan 05, 2021 at 01:42:56PM +, Mark Brown wrote:
> On Mon, Jan 04, 2021 at 08:13:41PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
> 
> > > Like I keep saying the same thing applies to all non-enumerable buses -
> > > exactly the same considerations exist for all the other buses like I2C
> > > (including the ACPI naming issue you mention below), and for that matter
> > > with enumerable buses which can have firmware info.
> 
> > And most busses do already have their own bus type. ACPI, I2C, PCI,
> > etc. It is just a few that have been squished into platform, notably
> > OF.
> 
> You're missing the point there.  I2C is enumerated by firmware in
> exactly the same way as the platform bus is, it's not discoverable from
> the hardware (and similarly for a bunch of other buses).  If we were to
> say that we need separate device types for platform devices enumerated
> using firmware then by analogy we should do the same for devices on
> these other buses that happen to be enumerated by firmware.

No, I understand how I2C works and I think it is fine as is because
the enumeration outcome is all standard. You always end up with a
stable I2C device address (the name) and you always end up with the
I2C programming API. So it doesn't matter how I2C gets enumerated, it
is always an I2C device.

PCI does this too, pci_device gets crossed over to the DT data, but it
is still a pci_device.

I see a big difference between attaching FW data to an existing
subsystem's HW centric bus (and possibly guiding enumeration of a HW
bus from FW data) and directly creating struct devices based on FW
data unconnected to any existing subsystem.

The latter case is where the enumerating FW should stay on its own
bus_type because there is no standardized subsystem bus providing an
API or naming rules, so the FW type should provide those rules
instead.

> > With an actual bus specific virtual function:
> 
> > return dev->bus->gpio_count(dev);
> 
> That won't work, you might have a mix of enumeration types for a given
> bus type in a single system so you'd need to do this per device. 

I'm being very general here, probably what we want is a little more
formal 'fw_type' concept, so a device is on a bus and also has a FW
attachment which can provide this other data.
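A very rough sketch of that 'fw_type' shape, purely illustrative; none
of these structures exist in mainline (the closest existing notion is
fwnode_handle):

#include <linux/device.h>

/* All names below are hypothetical */
struct fw_attachment_ops {
        int (*gpio_count)(struct device *dev);
        /* ... other FW-provided resource queries ... */
};

struct fw_attachment {
        const char *fw_type;                    /* "acpi", "of", ... */
        const struct fw_attachment_ops *ops;
};

/* The device stays on its subsystem bus but carries a FW attachment */
static int dev_fw_gpio_count(struct device *dev,
                             const struct fw_attachment *fw)
{
        if (!fw || !fw->ops || !fw->ops->gpio_count)
                return -ENODEV;
        return fw->ops->gpio_count(dev);
}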

Jason


Re: [PATCH rdma-rc] RDMA/mlx5: Fix devlink deadlock on net namespace deletion

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 08:27:36AM +0300, Leon Romanovsky wrote:
> From: Parav Pandit 
> 
> When a mlx5 core devlink instance is reloaded in different net
> namespace, its associated IB device is deleted and recreated.
> 
> Example sequence is:
> $ ip netns add foo
> $ devlink dev reload pci/:00:08.0 netns foo
> $ ip netns del foo
> 
> The mlx5 IB device needs to attach and detach the netdevice to it
> through the netdev notifier chain during the load and unload sequence.
> Below is a call graph of the unload flow.
> 
> cleanup_net()
>   down_read(&pernet_ops_rwsem); <- first sem acquired
>     ops_pre_exit_list()
>       pre_exit()
>         devlink_pernet_pre_exit()
>           devlink_reload()
>             mlx5_devlink_reload_down()
>               mlx5_unload_one()
>               [...]
>                 mlx5_ib_remove()
>                   mlx5_ib_unbind_slave_port()
>                     mlx5_remove_netdev_notifier()
>                       unregister_netdevice_notifier()
>                         down_write(&pernet_ops_rwsem); <- recursive lock
> 
> Hence, when net namespace is deleted, mlx5 reload results in deadlock.
> 
> When deadlock occurs, devlink mutex is also held. This not only deadlocks
> the mlx5 device under reload, but all the processes which attempt to access
> unrelated devlink devices are deadlocked.
> 
> Hence, fix this by having the mlx5 ib driver register a per-net netdev
> notifier instead of a global one, which operates on the net namespace
> without holding the pernet_ops_rwsem.
> 
> Fixes: 4383cfcc65e7 ("net/mlx5: Add devlink reload")
> Signed-off-by: Parav Pandit 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/hw/mlx5/main.c  | 6 --
>  drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h | 5 -
>  include/linux/mlx5/driver.h| 5 +
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/main.c 
> b/drivers/infiniband/hw/mlx5/main.c
> index 944bb7691913..b1b3e563c15e 100644
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -3323,7 +3323,8 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev 
> *dev, u8 port_num)
>   int err;
> 
>   dev->port[port_num].roce.nb.notifier_call = mlx5_netdev_event;
> - err = register_netdevice_notifier(&dev->port[port_num].roce.nb);
> + err = register_netdevice_notifier_net(mlx5_core_net(dev->mdev),
> +   &dev->port[port_num].roce.nb);

This looks racy, what lock needs to be held to keep *mlx5_core_net()
stable?

>   if (err) {
>   dev->port[port_num].roce.nb.notifier_call = NULL;
>   return err;
> @@ -3335,7 +3336,8 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev 
> *dev, u8 port_num)
>  static void mlx5_remove_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num)
>  {
>   if (dev->port[port_num].roce.nb.notifier_call) {
> - unregister_netdevice_notifier(&dev->port[port_num].roce.nb);
> + unregister_netdevice_notifier_net(mlx5_core_net(dev->mdev),
> +   &dev->port[port_num].roce.nb);

This seems dangerous too, what if the mlx5_core_net changed before we
get here?

What are the rules for when devlink_net() changes?

Jason


Re: [PATCH rdma-rc] RDMA/mlx5: Fix devlink deadlock on net namespace deletion

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 01:23:23PM +, Parav Pandit wrote:
> > > - err = register_netdevice_notifier(&dev->port[port_num].roce.nb);
> > > + err = register_netdevice_notifier_net(mlx5_core_net(dev->mdev),
> > > +   &dev->port[port_num].roce.nb);
> > 
> > This looks racy, what lock needs to be held to keep *mlx5_core_net() stable?
> 
> mlx5_core_net() cannot be accessed outside of mlx5 driver's load, unload, 
> reload path.
> 
> When this is getting executed, devlink cannot be executing reload.
> This is guarded by devlink_reload_enable/disable calls done by mlx5 core.

A comment that devlink_reload_enable/disable() must be held would be
helpful
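Something like this, as an illustrative sketch only (the helper body is
an assumption about the implementation; the point is the locking rule
in the comment):

/*
 * Only valid while devlink reload cannot run, i.e. on the driver
 * load/unload/reload paths guarded by devlink_reload_enable() /
 * devlink_reload_disable(); otherwise devlink_net() may change
 * underneath the caller.
 */
static inline struct net *mlx5_core_net(struct mlx5_core_dev *dev)
{
        return devlink_net(priv_to_devlink(dev));
}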
 
> > 
> > >   if (err) {
> > >   dev->port[port_num].roce.nb.notifier_call = NULL;
> > >   return err;
> > > @@ -3335,7 +3336,8 @@ static int mlx5_add_netdev_notifier(struct
> > > mlx5_ib_dev *dev, u8 port_num)  static void
> > > mlx5_remove_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num)  {
> > >   if (dev->port[port_num].roce.nb.notifier_call) {
> > > - unregister_netdevice_notifier(&dev-
> > >port[port_num].roce.nb);
> > > + unregister_netdevice_notifier_net(mlx5_core_net(dev-
> > >mdev),
> > > +   &dev-
> > >port[port_num].roce.nb);
> > 
> > This seems dangerous too, what if the mlx5_core_net changed before we
> > get here?
>
> When I inspected the driver code, I am not aware of any code flow where
> this can change before reaching here, because registration and
> unregistration are done only in the driver load, unload and reload paths.
> Reload can happen only after devlink_reload_enable() is done.

But we enable reload right after init_one

> > What are the rules for when devlink_net() changes?
> > 
> devlink_net() changes only after unload() callback is completed in driver.

You mean mlx5_devlink_reload_down?

That seems OK then

Jason


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 12:42:15PM -0700, Nick Desaulniers wrote:
> On Sat, Oct 17, 2020 at 10:43 PM Greg KH  wrote:
> >
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > From: Tom Rix 
> > >
> > > This is an upcoming change to clean up a new warning treewide.
> > > I am wondering if the change could be one mega patch (see below) or
> > > normal patch per file about 100 patches or somewhere half way by 
> > > collecting
> > > early acks.
> >
> > Please break it up into one-patch-per-subsystem, like normal, and get it
> > merged that way.
> >
> > Sending us a patch, without even a diffstat to review, isn't going to
> > get you very far...
> 
> Tom,
> If you're able to automate this cleanup, I suggest checking in a
> script that can be run on a directory.  Then for each subsystem you
> can say in your commit "I ran scripts/fix_whatever.py on this subdir."
>  Then others can help you drive the tree wide cleanup.  Then we can
> enable -Wunreachable-code-break either by default, or W=2 right now
> might be a good idea.

I remember using clang-modernize in the past to fix issues very
similar to this; if the clang machinery can generate the warning, can't
something like clang-tidy directly generate the patch?

You can send me a patch for drivers/infiniband/* as well

Thanks,
Jason


[PATCH] RDMA: Add rdma_connect_locked()

2020-10-26 Thread Jason Gunthorpe
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED: either the
handler triggers a completion and another thread does rdma_connect(), or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handlers are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Reported-by: Guoqing Jiang 
Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/cma.c| 39 +---
 drivers/infiniband/ulp/iser/iser_verbs.c |  2 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c   |  4 +--
 drivers/nvme/host/rdma.c | 10 +++---
 include/rdma/rdma_cm.h   | 13 +---
 net/rds/ib_cm.c  |  5 +--
 6 files changed, 47 insertions(+), 26 deletions(-)

Seems people are not testing these four ULPs against rdma-next.. Here is a
quick fix for the issue:

https://lore.kernel.org/r/3b1f7767-98e2-93e0-b718-16d1c5346...@cloud.ionos.com

Jason

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 7c2ab1f2fbea37..2eaaa1292fb847 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -405,10 +405,10 @@ static int cma_comp_exch(struct rdma_id_private *id_priv,
/*
 * The FSM uses a funny double locking where state is protected by both
 * the handler_mutex and the spinlock. State is not allowed to change
-* away from a handler_mutex protected value without also holding
+* to/from a handler_mutex protected value without also holding
 * handler_mutex.
 */
-   if (comp == RDMA_CM_CONNECT)
+   if (comp == RDMA_CM_CONNECT || exch == RDMA_CM_CONNECT)
lockdep_assert_held(&id_priv->handler_mutex);
 
spin_lock_irqsave(&id_priv->lock, flags);
@@ -4038,13 +4038,20 @@ static int cma_connect_iw(struct rdma_id_private 
*id_priv,
return ret;
 }
 
-int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
+/**
+ * rdma_connect_locked - Initiate an active connection request.
+ * @id: Connection identifier to connect.
+ * @conn_param: Connection information used for connected QPs.
+ *
+ * Same as rdma_connect() but can only be called from the
+ * RDMA_CM_EVENT_ROUTE_RESOLVED handler callback.
+ */
+int rdma_connect_locked(struct rdma_cm_id *id, struct rdma_conn_param 
*conn_param)
 {
struct rdma_id_private *id_priv =
container_of(id, struct rdma_id_private, id);
int ret;
 
-   mutex_lock(&id_priv->handler_mutex);
if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT)) {
ret = -EINVAL;
goto err_unlock;
@@ -4071,6 +4078,30 @@ int rdma_connect(struct rdma_cm_id *id, struct 
rdma_conn_param *conn_param)
 err_state:
cma_comp_exch(id_priv, RDMA_CM_CONNECT, RDMA_CM_ROUTE_RESOLVED);
 err_unlock:
+   return ret;
+}
+EXPORT_SYMBOL(rdma_connect_locked);
+
+/**
+ * rdma_connect - Initiate an active connection request.
+ * @id: Connection identifier to connect.
+ * @conn_param: Connection information used for connected QPs.
+ *
+ * Users must have resolved a route for the rdma_cm_id to connect with by 
having
+ * called rdma_resolve_route before calling this routine.
+ *
+ * This call will either connect to a remote QP or obtain remote QP information
+ * for unconnected rdma_cm_id's.  The actual operation is based on the
+ * rdma_cm_id's port space.
+ */
+int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
+{
+   struct rdma_id_private *id_priv =
+   container_of(id, struct rdma_id_private, id);
+   int ret;
+
+   mutex_lock(&id_priv->handler_mutex);
+   ret = rdma_connect_locked(id, conn_param);
mutex_unlock(&id_priv->handler_mutex);
return ret;
 }
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 2f3ebc0a75d924..2bd18b00689341 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -620,7 +620,7 @@ static void iser_route_handler(struct rdma_cm_id *cma_id)
conn_param.private_data = (void *)&req_hdr;
conn_param.private_data_len = sizeof(struct iser_cm_hdr);
 
-   ret = rdma_connect(cma_id, &conn_param);
+   ret = rdma_connect_locked(cma_id, &conn_param);
if (ret) {
iser_err("failure connecting: %d\n", ret);
goto failure;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c 
b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 776e89231c52f7..f298adc02acba2 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@

Re: [PATCH rdma-rc RESEND v1] RDMA/mlx5: Fix devlink deadlock on net namespace deletion

2020-10-26 Thread Jason Gunthorpe
On Mon, Oct 26, 2020 at 03:43:59PM +0200, Parav Pandit wrote:
> When a mlx5 core devlink instance is reloaded in different net
> namespace, its associated IB device is deleted and recreated.
> 
> Example sequence is:
> $ ip netns add foo
> $ devlink dev reload pci/:00:08.0 netns foo
> $ ip netns del foo
> 
> The mlx5 IB device needs to attach and detach the netdevice to it
> through the netdev notifier chain during the load and unload sequence.
> Below is a call graph of the unload flow.
> 
> cleanup_net()
>   down_read(&pernet_ops_rwsem); <- first sem acquired
>     ops_pre_exit_list()
>       pre_exit()
>         devlink_pernet_pre_exit()
>           devlink_reload()
>             mlx5_devlink_reload_down()
>               mlx5_unload_one()
>               [...]
>                 mlx5_ib_remove()
>                   mlx5_ib_unbind_slave_port()
>                     mlx5_remove_netdev_notifier()
>                       unregister_netdevice_notifier()
>                         down_write(&pernet_ops_rwsem); <- recursive lock
> 
> Hence, when net namespace is deleted, mlx5 reload results in deadlock.
> 
> When deadlock occurs, devlink mutex is also held. This not only deadlocks
> the mlx5 device under reload, but all the processes which attempt to access
> unrelated devlink devices are deadlocked.
> 
> Hence, fix this by having the mlx5 ib driver register a per-net netdev
> notifier instead of a global one, which operates on the net namespace
> without holding the pernet_ops_rwsem.
> 
> Fixes: 4383cfcc65e7 ("net/mlx5: Add devlink reload")
> Signed-off-by: Parav Pandit 
> Signed-off-by: Leon Romanovsky 
> ---
> Changelog:
> v0->v1:
>  - updated comment for mlx5_core_net API to be used by multiple mlx5
>drivers
> ---
>  drivers/infiniband/hw/mlx5/main.c  |  6 --
>  .../net/ethernet/mellanox/mlx5/core/lib/mlx5.h |  5 -
>  include/linux/mlx5/driver.h| 18 ++
>  3 files changed, 22 insertions(+), 7 deletions(-)

Applied to for-rc, thanks

Jason


Re: [PATCH] RDMA: Add rdma_connect_locked()

2020-10-27 Thread Jason Gunthorpe
On Tue, Oct 27, 2020 at 10:01:00AM +0800, Chao Leng wrote:
> > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> > index aad829a2b50d0f..f488dc5f4c2c61 100644
> > +++ b/drivers/nvme/host/rdma.c
> > @@ -1730,11 +1730,10 @@ static void nvme_rdma_process_nvme_rsp(struct 
> > nvme_rdma_queue *queue,
> > req->result = cqe->result;
> > if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
> > -   if (unlikely(!req->mr ||
> > -wc->ex.invalidate_rkey != req->mr->rkey)) {
> > +   if (unlikely(wc->ex.invalidate_rkey != req->mr->rkey)) {
> > dev_err(queue->ctrl->ctrl.device,
> > "Bogus remote invalidation for rkey %#x\n",
> > -   req->mr ? req->mr->rkey : 0);
> > +   req->mr->rkey);
> Maybe the code version is incorrect, causing a false code rollback.

Oh wow, thanks for noticing that, I made a git fumble when doing this
:(

Jason


[PATCH rdma v2] RDMA: Add rdma_connect_locked()

2020-10-27 Thread Jason Gunthorpe
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED: either the
handler triggers a completion and another thread does rdma_connect(), or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handlers are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Link: 
https://lore.kernel.org/r/0-v1-75e124dbad74+b05-rdma_connect_locking_...@nvidia.com
Reported-and-tested-by: Guoqing Jiang 
Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar 
Acked-by: Jack Wang 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/cma.c| 40 +---
 drivers/infiniband/ulp/iser/iser_verbs.c |  2 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c   |  4 +--
 drivers/nvme/host/rdma.c |  4 +--
 include/rdma/rdma_cm.h   | 14 ++---
 net/rds/ib_cm.c  |  5 +--
 6 files changed, 46 insertions(+), 23 deletions(-)

v2:
 - Remove extra code from nvme (Chao)
 - Fix long lines (CH)

I've applied this version to rdma-rc - expecting to get these ULPs unbroken for 
rc2
release

Thanks,
Jason

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 7c2ab1f2fbea37..193c8902b9db26 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -405,10 +405,10 @@ static int cma_comp_exch(struct rdma_id_private *id_priv,
/*
 * The FSM uses a funny double locking where state is protected by both
 * the handler_mutex and the spinlock. State is not allowed to change
-* away from a handler_mutex protected value without also holding
+* to/from a handler_mutex protected value without also holding
 * handler_mutex.
 */
-   if (comp == RDMA_CM_CONNECT)
+   if (comp == RDMA_CM_CONNECT || exch == RDMA_CM_CONNECT)
lockdep_assert_held(&id_priv->handler_mutex);
 
spin_lock_irqsave(&id_priv->lock, flags);
@@ -4038,13 +4038,21 @@ static int cma_connect_iw(struct rdma_id_private 
*id_priv,
return ret;
 }
 
-int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
+/**
+ * rdma_connect_locked - Initiate an active connection request.
+ * @id: Connection identifier to connect.
+ * @conn_param: Connection information used for connected QPs.
+ *
+ * Same as rdma_connect() but can only be called from the
+ * RDMA_CM_EVENT_ROUTE_RESOLVED handler callback.
+ */
+int rdma_connect_locked(struct rdma_cm_id *id,
+   struct rdma_conn_param *conn_param)
 {
struct rdma_id_private *id_priv =
container_of(id, struct rdma_id_private, id);
int ret;
 
-   mutex_lock(&id_priv->handler_mutex);
if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT)) {
ret = -EINVAL;
goto err_unlock;
@@ -4071,6 +4079,30 @@ int rdma_connect(struct rdma_cm_id *id, struct 
rdma_conn_param *conn_param)
 err_state:
cma_comp_exch(id_priv, RDMA_CM_CONNECT, RDMA_CM_ROUTE_RESOLVED);
 err_unlock:
+   return ret;
+}
+EXPORT_SYMBOL(rdma_connect_locked);
+
+/**
+ * rdma_connect - Initiate an active connection request.
+ * @id: Connection identifier to connect.
+ * @conn_param: Connection information used for connected QPs.
+ *
+ * Users must have resolved a route for the rdma_cm_id to connect with by 
having
+ * called rdma_resolve_route before calling this routine.
+ *
+ * This call will either connect to a remote QP or obtain remote QP information
+ * for unconnected rdma_cm_id's.  The actual operation is based on the
+ * rdma_cm_id's port space.
+ */
+int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
+{
+   struct rdma_id_private *id_priv =
+   container_of(id, struct rdma_id_private, id);
+   int ret;
+
+   mutex_lock(&id_priv->handler_mutex);
+   ret = rdma_connect_locked(id, conn_param);
mutex_unlock(&id_priv->handler_mutex);
return ret;
 }
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 2f3ebc0a75d924..2bd18b00689341 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -620,7 +620,7 @@ static void iser_route_handler(struct rdma_cm_id *cma_id)
conn_param.private_data = (void *)&req_hdr;
conn_param.private_data_len = sizeof(struct iser_cm_hdr);
 
-   ret = rdma_connect(cma_id, &conn_param);
+   ret = rdma_connect_locked(cma_id, &conn_param);
if (ret) {
iser_err("failure connecting: %d\n", ret);
goto failure;
diff --git a/drivers/infiniband/ulp/rtrs/rt

Re: [PATCH rdma v2] RDMA: Add rdma_connect_locked()

2020-10-27 Thread Jason Gunthorpe
On Tue, Oct 27, 2020 at 03:19:36PM +0200, Leon Romanovsky wrote:

> > +int rdma_connect_locked(struct rdma_cm_id *id,
> > +   struct rdma_conn_param *conn_param)
> >  {
> > struct rdma_id_private *id_priv =
> > container_of(id, struct rdma_id_private, id);
> > int ret;
> >
> > -   mutex_lock(&id_priv->handler_mutex);
> > if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT)) {
> > ret = -EINVAL;
> > goto err_unlock;
> 
> Not a big deal, but this label is not correct anymore.

Oh, yep

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 193c8902b9db26..f58d19881524dc 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -4053,10 +4053,8 @@ int rdma_connect_locked(struct rdma_cm_id *id,
container_of(id, struct rdma_id_private, id);
int ret;
 
-   if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT)) {
-   ret = -EINVAL;
-   goto err_unlock;
-   }
+   if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT))
+   return -EINVAL;
 
if (!id->qp) {
id_priv->qp_num = conn_param->qp_num;
@@ -4078,7 +4076,6 @@ int rdma_connect_locked(struct rdma_cm_id *id,
return 0;
 err_state:
cma_comp_exch(id_priv, RDMA_CM_CONNECT, RDMA_CM_ROUTE_RESOLVED);
-err_unlock:
return ret;
 }
 EXPORT_SYMBOL(rdma_connect_locked);


Re: [PATCH rdma v2] RDMA: Add rdma_connect_locked()

2020-10-28 Thread Jason Gunthorpe
On Wed, Oct 28, 2020 at 11:19:14AM +0200, Maor Gottlieb wrote:
> > +   struct rdma_conn_param *conn_param)
> >   {
> > struct rdma_id_private *id_priv =
> > container_of(id, struct rdma_id_private, id);
> > int ret;
> > -   mutex_lock(&id_priv->handler_mutex);
> 
> You need to delete the mutex_unlock in success path too.

Gah. Just goes to prove I shouldn't write patches with a child on
my lap :\

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f58d19881524dc..a77750b8954db0 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -4072,7 +4072,6 @@ int rdma_connect_locked(struct rdma_cm_id *id,
ret = -ENOSYS;
if (ret)
goto err_state;
-   mutex_unlock(&id_priv->handler_mutex);
return 0;
 err_state:
cma_comp_exch(id_priv, RDMA_CM_CONNECT, RDMA_CM_ROUTE_RESOLVED);

Thanks,
Jason


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-03 Thread Jason Gunthorpe
On Sun, Nov 01, 2020 at 10:15:37PM +0200, Leon Romanovsky wrote:
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 6c218b47b9f1..5316e51e72d4 100644
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1,18 +1,27 @@
>  // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>  /* Copyright (c) 2020 Mellanox Technologies Ltd. */
> 
> +#include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> -#include 
>  #include 
> -#include "mlx5_vnet.h"
>  #include "mlx5_vdpa.h"
> 
> +MODULE_AUTHOR("Eli Cohen ");
> +MODULE_DESCRIPTION("Mellanox VDPA driver");
> +MODULE_LICENSE("Dual BSD/GPL");
> +
> +#define to_mlx5_vdpa_ndev(__mvdev) container_of(__mvdev, struct 
> mlx5_vdpa_net, mvdev)
>  #define to_mvdev(__vdev) container_of((__vdev), struct mlx5_vdpa_dev, vdev)
> 
>  #define VALID_FEATURES_MASK  
>   \
> @@ -159,6 +168,11 @@ static bool mlx5_vdpa_debug;
>   mlx5_vdpa_info(mvdev, "%s\n", #_status);
>\
>   } while (0)
> 
> +static inline u32 mlx5_vdpa_max_qps(int max_vqs)
> +{
> + return max_vqs / 2;
> +}
> +
>  static void print_status(struct mlx5_vdpa_dev *mvdev, u8 status, bool set)
>  {
>   if (status & ~VALID_STATUS_MASK)
> @@ -1928,8 +1942,11 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev)
>   }
>  }
> 
> -void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
> +static int mlx5v_probe(struct auxiliary_device *adev,
> +const struct auxiliary_device_id *id)
>  {
> + struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
> + struct mlx5_core_dev *mdev = madev->mdev;
>   struct virtio_net_config *config;
>   struct mlx5_vdpa_dev *mvdev;
>   struct mlx5_vdpa_net *ndev;
> @@ -1943,7 +1960,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>   ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, 
> mdev->device, &mlx5_vdpa_ops,
>2 * mlx5_vdpa_max_qps(max_vqs));
>   if (IS_ERR(ndev))
> - return ndev;
> + return PTR_ERR(ndev);
> 
>   ndev->mvdev.max_vqs = max_vqs;
>   mvdev = &ndev->mvdev;
> @@ -1972,7 +1989,8 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>   if (err)
>   goto err_reg;
> 
> - return ndev;
> + dev_set_drvdata(&adev->dev, ndev);
> + return 0;
> 
>  err_reg:
>   free_resources(ndev);
> @@ -1981,10 +1999,29 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>  err_mtu:
>   mutex_destroy(&ndev->reslock);
>   put_device(&mvdev->vdev.dev);
> - return ERR_PTR(err);
> + return err;
>  }
> 
> -void mlx5_vdpa_remove_dev(struct mlx5_vdpa_dev *mvdev)
> +static int mlx5v_remove(struct auxiliary_device *adev)
>  {
> + struct mlx5_vdpa_dev *mvdev = dev_get_drvdata(&adev->dev);
> +
>   vdpa_unregister_device(&mvdev->vdev);
> + return 0;
>  }
> +
> +static const struct auxiliary_device_id mlx5v_id_table[] = {
> + { .name = MLX5_ADEV_NAME ".vnet", },
> + {},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> +
> +static struct auxiliary_driver mlx5v_driver = {
> + .name = "vnet",
> + .probe = mlx5v_probe,
> + .remove = mlx5v_remove,
> + .id_table = mlx5v_id_table,
> +};

It is hard to see from the diff, but when this patch is applied the
vdpa module looks like I imagined things would look with the auxiliary
bus. It is very similar in structure to a PCI driver with the probe()
function cleanly registering with its subsystem. This is what I'd like
to see from the new Intel RDMA driver.

Greg, I think this patch is the best clean usage example.

I've looked over this series and it has the right idea and
parts. There is definitely more that can be done to improve mlx5 in
this area, but this series is well scoped and cleans a good part of
it.

Jason


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-05 Thread Jason Gunthorpe
On Thu, Nov 05, 2020 at 08:33:02AM +0100, gregkh wrote:
> > Were there any additional changes you wanted to see happen? I'll go
> > give the final set another once over, but David has been diligently
> > fixing up all the declared major issues so I expect to find at most
> > minor incremental fixups.
> 
> This is in my to-review pile, along with a load of other stuff at the
> moment:
>   $ ~/bin/mdfrm -c ~/mail/todo/
>   1709 messages in /home/gregkh/mail/todo/
> 
> So give me a chance.  There is no rush on my side for this given the
> huge delays that have happened here on the authorship side many times in
> the past :)

On the other hand Leon and his team did invest a lot of time and
effort, very quickly, to build and QA this large mlx5 series here to
give a better/second example as you requested only a few weeks ago.

> If you can review it, or anyone else, that is always most appreciated.

Dan, Leon, myself and others have looked at the auxiliary bus patch
more than a few times now. Leon in particular went over it very
carefully and a number of bugs were fixed while he developed this
series.

There seems to be nothing fundamentally wrong with it, so long as
people are fine with the colour of the shed...

Jason


Re: [PATCH mlx5-next v1 04/11] vdpa/mlx5: Make hardware definitions visible to all mlx5 devices

2020-11-05 Thread Jason Gunthorpe
On Thu, Nov 05, 2020 at 12:31:52PM -0800, Saeed Mahameed wrote:
> On Sun, 2020-11-01 at 22:15 +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> > 
> > Move mlx5_vdpa IFC header file to the general include folder, so
> > mlx5_core will be able to reuse it to check if VDPA is supported
> > prior to creating an auxiliary device.
> > 
> 
> I don't really like this, the whole idea of aux devices is that they
> get to do own logic and hide details, now we are exposing aux
> specific stuff to the bus ..  let's figure a way to avoid such
> exposure as we discussed yesterday.

Not quite, the idea is we get to have a cleaner split between the two
sides.

The device side is responsible for things centric to the device, like
"does this device actually exist", which is what is_supported is
doing.

The driver side holds the driver specific logic.

> is_supported check shouldn't belong to mlx5_core and each aux device
> (en/ib/vdpa) should implement own is_supported op and keep the details
> hidden in the aux driver like it was before this patch.

No, it really should be in the device side.

Part of the point here is to properly fix module loading. That means
the core driver must only create devices that can actually have a
driver bound to them because creating a device triggers module
loading.

For instance we do not want to auto load vdpa modules on every mlx5
system for no reason, that is not clean at all.
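As an illustration of that split, a sketch of the device-side gating;
mlx5_vnet_supported() and the surrounding names are hypothetical, while
the auxiliary bus calls are the real API:

#include <linux/auxiliary_bus.h>
#include <linux/mlx5/driver.h>
#include <linux/slab.h>

static void mlx5_adev_release(struct device *dev)
{
        kfree(container_of(dev, struct auxiliary_device, dev));
}

static int mlx5_register_vnet_adev(struct mlx5_core_dev *mdev, int idx)
{
        struct auxiliary_device *adev;
        int ret;

        if (!mlx5_vnet_supported(mdev)) /* hypothetical capability check */
                return 0;               /* no device => no vdpa module autoload */

        adev = kzalloc(sizeof(*adev), GFP_KERNEL);
        if (!adev)
                return -ENOMEM;

        adev->name = "vnet";            /* matched as MLX5_ADEV_NAME ".vnet" */
        adev->id = idx;
        adev->dev.parent = mdev->device;
        adev->dev.release = mlx5_adev_release;

        ret = auxiliary_device_init(adev);
        if (ret) {
                kfree(adev);
                return ret;
        }

        ret = auxiliary_device_add(adev);
        if (ret)
                auxiliary_device_uninit(adev);  /* drops the ref, release frees */
        return ret;
}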

Jason


Re: [PATCH mlx5-next v1 05/11] net/mlx5: Register mlx5 devices to auxiliary virtual bus

2020-11-05 Thread Jason Gunthorpe
On Thu, Nov 05, 2020 at 12:59:20PM -0800, Saeed Mahameed wrote:

> 2. you can always load a driver without its underlying device existed.
> for example, you can load a pci device driver/module and it will load
> and wait for pci devices to pop up, the subsysetem infrastructure will
> match between drivers and devices and probe them.

Yes, this works fine with this design

> struct aux_driver mlx5_vpda_aux_driver {
> 
>   .name = "vdpa",
>/* match this driver with mlx5_core devices */
>   .id_table = {"mlx5_core"}, 
>   .ops {
> /* called before probe on actual aux mlx5_core device */
>.is_supported(struct aux_device); 

This means module auto loading is impossible; we can't know to load
the module until we have loaded the module to call the is_supported code..

Jason


Re: [PATCH rdma-next v2 0/5] Get rid of custom made module dependency

2021-04-19 Thread Jason Gunthorpe
On Sun, Apr 18, 2021 at 12:09:16AM +0530, Devesh Sharma wrote:

> The host crash I indicated earlier is actually caused by patch 4 and
> not by patch 3 from this series. I spent time to root cause the

This makes a lot more sense.

The ulp_id stuff does need to go away as well though.

> problem and realized that patch-4 is touching quite many areas which
> would require much intrusive testing and validation.
> As I indicated earlier, we are implementing the PCI Aux driver
> interface at a faster pace.

Doing an aux driver doesn't mean you get to keep all these single
implementation function pointers - see the discussion around Intel's
patches.

> The problem of module referencing would be rectified with PCI aux
> change by inheritance.

The first three patches are clearly an improvement, and quite trivial,
so I'm going to take them.

Jason


Re: [PATCH rdma-next 0/2] Two fixes to -next

2021-04-20 Thread Jason Gunthorpe
On Sun, Apr 18, 2021 at 04:49:38PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> The two fixes are targeted to the -next. Maor's change fixes DM code
> that was accepted in this cycle and Parav's change doesn't qualify
> urgency of -rc8.
> 
> Thanks
> 
> Maor Gottlieb (1):
>   RDMA/mlx5: Fix type assignment for ICM DM
> 
> Parav Pandit (1):
>   IB/mlx5: Set right RoCE l3 type and roce version while deleting GID

Applied to for-next, thanks

Jason


Re: [PATCH mlx5-next] RDMA/mlx5: Allow CQ creation without attached EQs

2021-02-12 Thread Jason Gunthorpe
On Thu, Feb 11, 2021 at 10:55:49AM +0200, Leon Romanovsky wrote:
> From: Tal Gilboa 
> 
> The traditional DevX CQ creation flow goes through mlx5_core_create_cq()
> which checks that the given EQN corresponds to an existing EQ. For some
> mlx5 devices this behaviour is too strict, they expect EQN assignment
> during modify CQ stage.
> 
> Allow them to create CQ through general command interface.
> 
> Signed-off-by: Tal Gilboa 
> Signed-off-by: Leon Romanovsky 
> ---
>  drivers/infiniband/hw/mlx5/devx.c | 13 -
>  include/linux/mlx5/mlx5_ifc.h |  5 +++--
>  2 files changed, 15 insertions(+), 3 deletions(-)

Applied to for-next

Thanks,
Jason


Re: [PATCH rdma-next 0/2] Real time/free running timestamp support

2021-02-12 Thread Jason Gunthorpe
On Tue, Feb 09, 2021 at 03:11:05PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Add an extra timestamp format for mlx5_ib device.
> 
> Thanks
> 
> Aharon Landau (2):
>   net/mlx5: Add new timestamp mode bits
>   RDMA/mlx5: Fail QP creation if the device can not support the CQE TS
> 
>  drivers/infiniband/hw/mlx5/qp.c | 104 +---
>  include/linux/mlx5/mlx5_ifc.h   |  54 +++--
>  2 files changed, 145 insertions(+), 13 deletions(-)

Since this is a rdma series, and we are at the end of the cycle, I
took the IFC file directly to the rdma tree instead of through the
shared branch.

Applied to for-next, thanks

Jason


Re: [PATCH rdma-next 0/2] Real time/free running timestamp support

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 01:09:20PM -0800, Saeed Mahameed wrote:
> On Fri, 2021-02-12 at 14:10 -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 09, 2021 at 03:11:05PM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky 
> > > 
> > > Add an extra timestamp format for mlx5_ib device.
> > > 
> > > Thanks
> > > 
> > > Aharon Landau (2):
> > >   net/mlx5: Add new timestamp mode bits
> > >   RDMA/mlx5: Fail QP creation if the device can not support the CQE
> > > TS
> > > 
> > >  drivers/infiniband/hw/mlx5/qp.c | 104
> > > +---
> > >  include/linux/mlx5/mlx5_ifc.h   |  54 +++--
> > >  2 files changed, 145 insertions(+), 13 deletions(-)
> > 
> > Since this is a rdma series, and we are at the end of the cycle, I
> > took the IFC file directly to the rdma tree instead of through the
> > shared branch.
> > 
> > Applied to for-next, thanks
> > 
> 
> mmm, i was planing to resubmit this patch with the netdev real time
> support series, since the uplink representor is getting delayed, I
> thought I could submit the real time stuff today. can you wait on the
> ifc patch, i will re-send it today if you will, but it must go through
> the shared branch

Friday of rc7 is a bit late to be sending new patches for the first
time, isn't it??

But sure, if you update the shared branch right now I'll fix up rdma.git

Jason


Re: [PATCH rdma-next 0/2] Real time/free running timestamp support

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 01:19:09PM -0800, Saeed Mahameed wrote:
> On Fri, 2021-02-12 at 17:14 -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 12, 2021 at 01:09:20PM -0800, Saeed Mahameed wrote:
> > > On Fri, 2021-02-12 at 14:10 -0400, Jason Gunthorpe wrote:
> > > > On Tue, Feb 09, 2021 at 03:11:05PM +0200, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky 
> > > > > 
> > > > > Add an extra timestamp format for mlx5_ib device.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > Aharon Landau (2):
> > > > >   net/mlx5: Add new timestamp mode bits
> > > > >   RDMA/mlx5: Fail QP creation if the device can not support the
> > > > > CQE
> > > > > TS
> > > > > 
> > > > >  drivers/infiniband/hw/mlx5/qp.c | 104
> > > > > +---
> > > > >  include/linux/mlx5/mlx5_ifc.h   |  54 +++--
> > > > >  2 files changed, 145 insertions(+), 13 deletions(-)
> > > > 
> > > > Since this is a rdma series, and we are at the end of the cycle,
> > > > I
> > > > took the IFC file directly to the rdma tree instead of through
> > > > the
> > > > shared branch.
> > > > 
> > > > Applied to for-next, thanks
> > > > 
> > > 
> > > mmm, i was planing to resubmit this patch with the netdev real time
> > > support series, since the uplink representor is getting delayed, I
> > > thought I could submit the real time stuff today. can you wait on
> > > the
> > > ifc patch, i will re-send it today if you will, but it must go
> > > through
> > > the shared branch
> > 
> > Friday of rc7 is a bit late to be sending new patches for the first
> > time, isn't it??
> 
> I know, uplink representor last minute mess !
> 
> > 
> > But sure, if you update the shared branch right now I'll fix up
> > rdma.git
> > 
> 
> I can't put it in the shared branch without review, I will post it to
> the netdev/rdma lists for two days at least for review and feedback.

Well, I'm not going to take any different patches beyond right now
unless Linus does a rc8??

Just move this one IFC patch to the shared branch, it is obviously OK

Jason


Re: [PATCH mlx5-next 1/6] net/mlx5: Add new timestamp mode bits

2021-02-16 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 02:30:37PM -0800, Saeed Mahameed wrote:
> From: Aharon Landau 
> 
> These fields declare which timestamp mode is supported by the device
> per RQ/SQ/QP.
> 
> In addition add the ts_format field to the select the mode for
> RQ/SQ/QP.
> 
> Signed-off-by: Aharon Landau 
> Signed-off-by: Saeed Mahameed 
> ---
>  include/linux/mlx5/mlx5_ifc.h | 54 +++
>  1 file changed, 49 insertions(+), 5 deletions(-)

This is a commit in the shared branch now, so this series will have to
go as a pull request if it wants to go before the next rc1

Jason


Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

2021-02-16 Thread Jason Gunthorpe
On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > >
> > > But I still don't like the fact that we're calling
> > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > complication and opportunities for errors.
> > 
> > It is not different from any other code that we have in the kernel.
> 
> It *is* different.  There is a general rule that drivers should not
> call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> still true that callers of sysfs_create_files() are very special, and
> I'd prefer not to add another one.

I think the point of [1] is people should be setting up their sysfs in
the struct device attribute groups/etc before doing device_add() and
allowing the driver core to handle everything. This can be done in
a lot of cases, eg we have examples of building a dynamic list of
attributes

In other cases, calling wrappers like device_create_file() introduces
a bit more type safety, so adding a device_create_files() would be
trivial enough
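FWIW, a minimal sketch of such a device_create_files() wrapper (not an
existing helper today), built on the real device_create_file() /
device_remove_file() calls:

#include <linux/device.h>

static int device_create_files(struct device *dev,
                               const struct device_attribute **attrs)
{
        int i, ret;

        for (i = 0; attrs[i]; i++) {
                ret = device_create_file(dev, attrs[i]);
                if (ret)
                        goto undo;
        }
        return 0;

undo:
        /* Remove whatever was created before the failure */
        while (--i >= 0)
                device_remove_file(dev, attrs[i]);
        return ret;
}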

Other places in PCI are using sysfs_create_group() (and there are over
400 calls to this function in all sorts of device drivers):

drivers/pci/msi.c:       ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
drivers/pci/p2pdma.c:    error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
drivers/pci/pci-label.c: return sysfs_create_group(&pdev->dev.kobj, &smbios_attr_group);
drivers/pci/pci-label.c: return sysfs_create_group(&pdev->dev.kobj, &acpi_attr_group);

For post-driver_add() stuff, maybe this should do the same, a
"sriov_vf/" group?

And a minor cleanup would change these to use device_create_bin_file():

drivers/pci/pci-sysfs.c: retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
drivers/pci/pci-sysfs.c: retval = sysfs_create_bin_file(&pdev->dev.kobj, &pcie_config_attr);
drivers/pci/pci-sysfs.c: retval = sysfs_create_bin_file(&pdev->dev.kobj, &pci_config_attr);
drivers/pci/pci-sysfs.c: retval = sysfs_create_bin_file(&pdev->dev.kobj, attr);
drivers/pci/vpd.c:       retval = sysfs_create_bin_file(&dev->dev.kobj, attr);

I haven't worked out why pci_create_firmware_label_files() and all of
this needs to be after device_add() though.. Would be slick to put
that in the normal attribute list - we've got some examples of dynamic
pre-device_add() attribute lists in the tree (eg tpm, rdma) that work
nicely.

Jason


Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

2021-02-17 Thread Jason Gunthorpe
On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:

> > BTW, I asked more than once how these sysfs knobs should be handled
> > in the PCI/core.
> 
> Thanks for the pointers.  This is the first instance I can think of
> where we want to create PCI core sysfs files based on a driver
> binding, so there really isn't a precedent.

The MSI stuff does it today, doesn't it? eg:

virtblk_probe (this is a driver bind)
  init_vq
   virtio_find_vqs
    vp_modern_find_vqs
     vp_find_vqs
      vp_find_vqs_msix
       vp_request_msix_vectors
        pci_alloc_irq_vectors_affinity
         __pci_enable_msi_range
          msi_capability_init
           populate_msi_sysfs
            ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);

And the sysfs is removed during pci_disable_msi(), also called by the
driver

Jason


Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

2021-02-17 Thread Jason Gunthorpe
On Wed, Feb 17, 2021 at 02:28:35PM -0600, Bjorn Helgaas wrote:
> On Wed, Feb 17, 2021 at 03:25:22PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > 
> > > > BTW, I asked more than once how these sysfs knobs should be handled
> > > > in the PCI/core.
> > > 
> > > Thanks for the pointers.  This is the first instance I can think of
> > > where we want to create PCI core sysfs files based on a driver
> > > binding, so there really isn't a precedent.
> > 
> > The MSI stuff does it today, doesn't it? eg:
> > 
> > virtblk_probe (this is a driver bind)
> >   init_vq
> >    virtio_find_vqs
> >     vp_modern_find_vqs
> >      vp_find_vqs
> >       vp_find_vqs_msix
> >        vp_request_msix_vectors
> >         pci_alloc_irq_vectors_affinity
> >          __pci_enable_msi_range
> >           msi_capability_init
> >            populate_msi_sysfs
> >             ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
> > 
> > And the sysfs is removed during pci_disable_msi(), also called by the
> > driver
> 
> Yes, you're right, I didn't notice that one.
> 
> I'm not quite convinced that we clean up correctly in all cases --
> pci_disable_msix(), pci_disable_msi(), pci_free_irq_vectors(),
> pcim_release(), etc are called by several drivers, but in my quick
> look I didn't see a guaranteed-to-be-called path to the cleanup during
> driver unbind.  I probably just missed it.
 
I think the contract is the driver has to pair the msi enable with the
msi disable on its own? It is very similar to what is happening here.

Probably there are bugs in drivers on error paths, but there are
always bugs in drivers on error paths..

Jason


Re: [PATCH rdma-next] RDMA: Support more than 255 rdma ports

2021-03-26 Thread Jason Gunthorpe
On Mon, Mar 01, 2021 at 09:04:20AM +0200, Leon Romanovsky wrote:
> From: Mark Bloch 
> 
> Current code uses many different types when dealing with a port of a
> RDMA device: u8, unsigned int and u32. Switch to u32 to clean up the
> logic.
> 
> This allows us to make (at least) the core view consistent and use the same
> type. Unfortunately not all places can be converted. Many uverbs functions
> expect port to be u8 so keep those places in order not to break UAPIs.
> HW/Spec defined values must also not be changed.
> 
> With the switch to u32 we now can support devices with more than 255
> ports. U32_MAX is reserved to make control logic a bit easier to deal
> with. As a device with U32_MAX ports probably isn't going to happen any
> time soon this seems like a non issue.
> 
> When a device with more than 255 ports is created uverbs will report
> the RDMA device as having 255 ports as this is the max currently supported.
> 
> The verbs interface is not changed yet because the IBTA spec limits the
> port size in too many places to be u8 and all applications that rely on
> verbs won't be able to cope with this change. At this stage, we are
> extending the interfaces that are using vendor channel solely
> 
> Once the limitation is lifted mlx5 in switchdev mode will be able to have
> thousands of SFs created by the device. As the only instance of an RDMA
> device that reports more than 255 ports will be a representor device
> and it exposes itself as a RAW Ethernet only device, CM/MAD/IPoIB and other
> ULPs aren't affected by this change and their sysfs/interfaces that
> are exposed to userspace can remain unchanged.
> 
> While here, clean up some alignment issues and remove unneeded sanity
> checks (mainly in rdmavt).
> 
> Signed-off-by: Mark Bloch 
> Signed-off-by: Leon Romanovsky 
> ---

Applied to for-next, I suppose this means the irdma driver needs
re-spinning already.

Thanks,
Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-26 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky  wrote:
> >
> > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > NVMe and mlx5 have basically identical functionality in this 
> > > > > > respect.
> > > > > > Other devices and vendors will likely implement similar 
> > > > > > functionality.
> > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > them all.
> > > > > >
> > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > >
> > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > divorced from the operating system.
> > > > >
> > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > never want an API to make a nvme device with a driver attached offline
> > > > > because it would break the driver.
> > > >
> > > > I think the sticky part is that Linux driver attach is not visible to
> > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > is only usable when it is online.
> > > >
> > > > For NVMe, software must ask the PF to make those online/offline
> > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > Online commands [1].  How would this be integrated into this sysfs
> > > > interface?
> > >
> > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > notifier and mirrors it to the offline state, or it simply
> > > offline/onlines as part of the sequence to program the MSI change.
> > >
> > > I don't see why we need any additional modeling of this behavior.
> > >
> > > What would be the point of onlining a device without a driver?
> >
> > Agree, we should remember that we are talking about Linux kernel model
> > and implementation, where _no_driver_ means _offline_.
> 
> The only means you have of guaranteeing the driver is "offline" is by
> holding on the device lock and checking it. So it is only really
> useful for one operation and then you have to release the lock. The
> idea behind having an "offline" state would be to allow you to
> aggregate multiple potential operations into a single change.

What we really want is a solution where the SRIOV device exists for the
HW but isn't registered yet as a pci_device. We have endless problems
with needing to configure SRIOV instances at the PF before they get
plugged into the kernel, and the no-driver-autoprobe business is such a
hack.

But that is a huge problem and not this series.

Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-26 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:

> Leon has implemented a ton of variations, but I don't think having all
> the files in the PF directory was one of them.

If you promise this is the last of this bike painting adventure then
let's do it.

Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-26 Thread Jason Gunthorpe
On Sat, Mar 27, 2021 at 02:29:00AM +0900, Keith Busch wrote:
> On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> > I also want to resurrect your idea of associating
> > "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> > that idea, and it better reflects the way both mlx5 and NVMe work.
> 
> That is a better match for nvme: we can assign resources to VFs with
> the PF's "VF Enable" set to '0', so configuring VFs without requiring
> them be enumerated in sysfs is a plus. 

If the VF is not in sysfs already in the normal place, why would it be
in the special configuration directory? Do you want the driver to
somehow provide the configuration directory content?

I'm confused what you mean

As I said to Alex configuring things before they even get plugged in
sounds like the right direction..

Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-26 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:

> My concern would be that we are defining the user space interface.
> Once we have this working as a single operation I could see us having
> to support it that way going forward as somebody will script something
> not expecting an "offline" sysfs file, and the complaint would be that
> we are breaking userspace if we require the use of an "offline"
> file.

Well, we wouldn't do that. The semantic we define here is that the
msix_count interface 'auto-offlines' if that is what is required. If
we add some formal offline someday then 'auto-offline' would be a NOP
when the device is offline and do the same online/offline sequence as
today if it isn't.
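Roughly, the 'auto-offline' semantic could look like this in the store
handler. This is only a sketch, and pci_vf_set_msix_count() is a
hypothetical PF-side hook, not what the series literally implements:

#include <linux/kernel.h>
#include <linux/pci.h>

static ssize_t sriov_vf_msix_count_store(struct device *dev,
                                         struct device_attribute *attr,
                                         const char *buf, size_t count)
{
        struct pci_dev *vf = to_pci_dev(dev);
        struct pci_dev *pf = pci_physfn(vf);
        int val, ret;

        ret = kstrtoint(buf, 0, &val);
        if (ret)
                return ret;

        device_lock(&vf->dev);
        if (vf->dev.driver) {
                /* A bound driver means the VF is in use, refuse rather
                 * than yank vectors out from under it. */
                ret = -EBUSY;
        } else {
                /* "auto-offline": with no driver bound the offline/online
                 * dance is folded into the PF-side operation. */
                ret = pci_vf_set_msix_count(pf, vf, val);
        }
        device_unlock(&vf->dev);

        return ret ? ret : count;
}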

> I almost wonder if it wouldn't make sense to just partition this up to
> handle flexible resources in the future. Maybe something like having
> the directory setup such that you have "sriov_resources/msix/" and

This is supposed to be about PCI properties, that is why we are doing
it in the PCI layer.

If you want to see something that handles non-PCI properties too then
Leon needs to make the whole thing general so the device driver can
give a list of properties it wants to configure and the core manages
the thing.

But at that point, why involve the PCI core in the first place? Just
put the complex configuration in the driver, use configfs or devlink
or nvmecli or whatever is appropriate.

And we are doing that too, there will also be pre-driver configuration
in devlink for *non PCI* properties. *shrug*

Jason


Re: [PATCH rdma-next v1 5/5] net/bnxt: Use direct API instead of useless indirection

2021-03-29 Thread Jason Gunthorpe
On Mon, Mar 29, 2021 at 07:01:44AM -0700, Michael Chan wrote:
> On Mon, Mar 29, 2021 at 1:52 AM Leon Romanovsky  wrote:
> >
> > From: Leon Romanovsky 
> >
> > There is no need in any indirection complexity for one ULP user,
> > remove all this complexity in favour of direct calls to the exported
> > symbols. This allows us to greatly simplify the code.
> 
> The goal is not to have a hard dependency between the RDMA driver and
> the ethernet driver.  One day, there may be a newer ethernet driver
> for newer devices.  The RDMA driver may be the same because it
> operates at a higher level.  The hard dependency will require the
> older ethernet driver to always be loaded even if it is not needed.

Then someday you will fix it. Today you do not have this, so it needs
to be deleted.

If you ever get to that point you will need to rework this driver to
use auxiliary bus/etc, and it will look very different anyhow.

Jason


Re: [PATCH v2 05/23] ice: Add devlink params support

2021-03-29 Thread Jason Gunthorpe
On Thu, Mar 25, 2021 at 08:10:13PM +, Saleem, Shiraz wrote:

> Maybe I am missing something but I see that a devlink hot reload is
> required to enforce the update?  There isn't really a de-init
> required of PCI driver entities in this case for this rdma param.
> But only an unplug, plug of the auxdev with new value. Intuitively
> it feels more runtime-ish.
> 
> There is also a device powerof2 requirement on the maxqp which I
> don't see enforceable as it stands.
> 
> This is not super-critical for the initial submission but a nice to
> have. But I do want to brainstorm options..

devlink uapi often seems to be an adventure; can you submit this
driver without devlink (or any other uapis) and then debate how to add
them in as follow-up patches?

Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-30 Thread Jason Gunthorpe
On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:

> I think I misunderstood Greg's subdirectory comment.  We already have
> directories like this:

Yes, IIRC, Greg's remark applies if you have to start creating
directories with manual kobjects.

> and aspm_ctrl_attr_group (for "link") is nicely done with static
> attributes.  So I think we could do something like this:
> 
>   /sys/bus/pci/devices/:01:00.0/       # PF directory
>     sriov/                             # SR-IOV related stuff
>       vf_total_msix
>       vf_msix_count_BB:DD.F            # includes bus/dev/fn of first VF
>       ...
>       vf_msix_count_BB:DD.F            # includes bus/dev/fn of last VF

It looks a bit odd that it isn't a subdirectory, but this seems
reasonable.
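For reference, a sketch of how the static-attribute approach expresses
that layout: an attribute_group with a .name creates the sriov/
subdirectory without any manual kobjects (the attribute and its show
body are illustrative only):

#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t vf_total_msix_show(struct device *dev,
                                  struct device_attribute *attr, char *buf)
{
        /* Illustrative: report the PF's assignable MSI-X vector pool */
        return sysfs_emit(buf, "%u\n", 0U);
}
static DEVICE_ATTR_RO(vf_total_msix);

static struct attribute *sriov_pf_attrs[] = {
        &dev_attr_vf_total_msix.attr,
        NULL,
};

static const struct attribute_group sriov_pf_group = {
        .name  = "sriov",       /* .name makes this the sriov/ subdirectory */
        .attrs = sriov_pf_attrs,
};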

> For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> before asking the PF to assign the vectors, as Jason suggests above.

It is also not awful if it returns EBUSY if the admin hasn't done
some device-specific offline sequence.

I'm just worried adding the idea of offline here is going to open a
huge can of worms in terms of defining what it means, and the very
next ask will be to start all VFs in offline mode. This would be some
weird overlap with the no-driver-autoprobing sysfs. We've been
thinking about this a lot here and there are no easy answers.

mlx5 sort of has an offline concept too, but we have been modeling it
in devlink, which is kind of like nvme-cli for networking.

Jason


Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count

2021-03-30 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > 
> > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > directories like this:
> > 
> > Yes, IIRC, Greg's remark applies if you have to start creating
> > directories with manual kobjects.
> > 
> > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > attributes.  So I think we could do something like this:
> > > 
> > >   /sys/bus/pci/devices/:01:00.0/       # PF directory
> > >     sriov/                             # SR-IOV related stuff
> > >       vf_total_msix
> > >       vf_msix_count_BB:DD.F            # includes bus/dev/fn of first VF
> > >       ...
> > >       vf_msix_count_BB:DD.F            # includes bus/dev/fn of last VF
> > 
> > It looks a bit odd that it isn't a subdirectory, but this seems
> > reasonable.
> 
> Sorry, I missed your point; you'll have to lay it out more explicitly.
> I did intend that "sriov" *is* a subdirectory of the :01:00.0
> directory.  The full paths would be:
>
>   /sys/bus/pci/devices/:01:00.0/sriov/vf_total_msix
>   /sys/bus/pci/devices/:01:00.0/sriov/vf_msix_count_BB:DD.F
>   ...

Sorry, I was meaning what you first proposed:

   /sys/bus/pci/devices/:01:00.0/sriov/BB:DD.F/vf_msix_count

Which has the extra sub directory to organize the child VFs.

Keep in mind there are going to be a lot of VFs here, > 1k, so this
will be a huge directory.

Jason

