Re: [PATCH v4 net-next 00/21] nvme-tcp receive offloads

2021-02-21 Thread Or Gerlitz
On Thu, Feb 11, 2021 at 11:15 PM Boris Pismenny  wrote:
> Changes since v3:
> =
> * Use DDP_TCP ifdefs in iov_iter and skb iterators to minimize impact
> when compiled out (Christoph)
> * Simplify netdev references and reduce the use of
> get_netdev_for_sock (Sagi)
> * Avoid "static" in its own line, move it one line down (Christoph)
> * Pass (queue, skb, *offset) and retrieve the pdu_seq in
> nvme_tcp_resync_response (Sagi)
> * Add missing assignment of offloading_netdev to null in offload_limits
> error case (Sagi)
> * Set req->offloaded = false once -- the lifetime rules are:
> set to false on cmd_setup / set to true when ddp setup succeeds (Sagi)
> * Replace pr_info_ratelimited with dev_info_ratelimited (Sagi)
> * Add nvme_tcp_complete_request and invoke it from two similar call
> sites (Sagi)
> * Introduce nvme_tcp_req_map_sg earlier in the series (Sagi)
> * Add nvme_tcp_consume_skb and put into it a hunk from
> nvme_tcp_recv_data to handle copy with and without offload

Sagi, Christoph,

Any further comments?

Or.





> Changes since v2:
> =
> * Use skb->ddp_crc for copy offload to avoid skb_condense
> * Default mellanox driver support to no (experimental feature)
> * In iov_iter use non-ddp functions for kvec and iovec
> * Remove typecasting in nvme-tcp
>
> Changes since v1:
> =
> * Rework iov_iter copy skip if src==dst to be less intrusive (David Ahern)
> * Add tcp-ddp documentation (David Ahern)
> * Refactor mellanox driver patches into more patches (Saeed Mahameed)
> * Avoid pointer casting (David Ahern)
> * Rename nvme-tcp offload flags (Shai Malin)
> * Update cover-letter according to the above
>
> Changes since RFC v1:
> =
> * Split mlx5 driver patches to several commits
> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload
>   init/teardown to the start/stop functions.


Re: [PATCH v4 net-next 07/21] nvme-tcp: Add DDP data-path

2021-02-17 Thread Or Gerlitz
On Sun, Feb 14, 2021 at 8:30 PM David Ahern  wrote:
>
> On 2/11/21 2:10 PM, Boris Pismenny wrote:
> >
> > +static int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> > +  u16 command_id,
> > +  struct request *rq)
> > +{
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > + int ret;
> > +
> > + if (unlikely(!netdev)) {
> > + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>
> again, unnecessary. you only get here because the request is marked
> offloaded and that only happens if the netdev exists and supports DDP.
>
> > + return -EINVAL;
> > + }
> > +
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
> > + &req->ddp, rq);
> > + sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
> > + return ret;
> > +}
> > +
> > +static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> > +{
> > + struct request *rq = ddp_ctx;
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > +
> > + if (!nvme_try_complete_req(rq, req->status, req->result))
> > + nvme_complete_rq(rq);
> > +}
> > +
> > +static int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> > +   u16 command_id,
> > +   struct request *rq)
> > +{
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > + int ret;
> > +
> > + if (unlikely(!netdev)) {
> > + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>
> similarly here. you can't get here if netdev is null.
>
> > + return -EINVAL;
> > + }
> > +
> > + req->ddp.command_id = command_id;
> > + ret = nvme_tcp_req_map_sg(req, rq);
> > + if (ret)
> > + return -ENOMEM;
> > +
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
> > +  queue->sock->sk,
> > +  &req->ddp);
> > + if (!ret)
> > + req->offloaded = true;
> > + return ret;
> > +}
> > +
> >  static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> >  {
> >   struct net_device *netdev = queue->ctrl->offloading_netdev;
> > @@ -343,7 +417,7 @@ static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> >   return;
> >
> >   if (unlikely(!netdev)) {
> > - pr_info_ratelimited("%s: netdev not found\n", __func__);
> > + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>
> and per comment on the last patch, this is not needed.
>
> > @@ -849,10 +953,39 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> >
> >  static inline void nvme_tcp_end_request(struct request *rq, u16 status)
> >  {
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > + struct nvme_tcp_queue *queue = req->queue;
> > + struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
> >   union nvme_result res = {};
> >
> > - if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> > - nvme_complete_rq(rq);
> > + nvme_tcp_complete_request(rq, cpu_to_le16(status << 1), res, pdu->command_id);
> > +}
> > +
> > +
> > +static int nvme_tcp_consume_skb(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> > + unsigned int *offset, struct iov_iter *iter, int recv_len)
> > +{
> > + int ret;
> > +
> > +#ifdef CONFIG_TCP_DDP
> > + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) {
> > + if (queue->data_digest)
> > + ret = skb_ddp_copy_and_hash_datagram_iter(skb, *offset, iter, recv_len,
> > + queue->rcv_hash);
> > + else
> > + ret = skb_ddp_copy_datagram_iter(skb, *offset, iter, recv_len);
> > + } else {
> > +#endif
>
> why not make that a helper defined in the CONFIG_TCP_DDP section with an
> inline for the unset case. Keeps this code from being polluted with the
> ifdef checks.

will check that

>
> > + if (queue->data_digest)
> > + ret = skb_copy_and_hash_datagram_iter(skb, *offset, iter, recv_len,
> > + queue->rcv_hash);
> > + else
> > + ret = skb_copy_datagram_iter(skb, *offset, iter, recv_len);
> > +#ifdef CONFIG_TCP_DDP
> > + }
> > +#endif
> > +
> > + return ret;
> >  }
> >
> >  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> > @@ -899,12 +1032,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> >   recv_len = min_t(si

Re: [PATCH v4 net-next 06/21] nvme-tcp: Add DDP offload control path

2021-02-17 Thread Or Gerlitz
On Sun, Feb 14, 2021 at 8:20 PM David Ahern  wrote:
> On 2/11/21 2:10 PM, Boris Pismenny wrote:
> > @@ -223,6 +229,164 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
> >   return nvme_tcp_pdu_data_left(req) <= len;
> >  }
> >
> > +#ifdef CONFIG_TCP_DDP
> > +
> > +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> > +static const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
> > + .resync_request = nvme_tcp_resync_request,
> > +};
> > +
> > +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> > +{
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > + struct nvme_tcp_ddp_config config = {};
> > + int ret;
> > +
> > + if (!(netdev->features & NETIF_F_HW_TCP_DDP))
>
> If nvme_tcp_offload_limits does not find a dst_entry on the socket then
> offloading_netdev may be NULL at this point.

correct :( will look into that

>
> > + return -EOPNOTSUPP;
> > +
> > + config.cfg.type = TCP_DDP_NVME;
> > + config.pfv  = NVME_TCP_PFV_1_0;
> > + config.cpda = 0;
> > + config.dgst = queue->hdr_digest ?
> > + NVME_TCP_HDR_DIGEST_ENABLE : 0;
> > + config.dgst |= queue->data_digest ?
> > + NVME_TCP_DATA_DIGEST_ENABLE : 0;
> > + config.queue_size   = queue->queue_size;
> > + config.queue_id = nvme_tcp_queue_id(queue);
> > + config.io_cpu   = queue->io_cpu;
> > +
> > + dev_hold(netdev); /* put by unoffload_socket */
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev,
> > +   queue->sock->sk,
> > +   &config.cfg);
> > + if (ret) {
> > + dev_put(netdev);
> > + return ret;
> > + }
> > +
> > + inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
> > + if (netdev->features & NETIF_F_HW_TCP_DDP)
> > + set_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
> > +
> > + return ret;
> > +}
> > +
> > +static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
> > +{
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > +
> > + if (!netdev) {
> > + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>
> you are already logged in nvme_tcp_offload_limits that
> get_netdev_for_sock returned NULL; no need to do it again.

yeah, re this one and the few other places where you commented
on the same or similar thing, I tend to agree we need to go on the
kernel trusted programming paradigm and avoid over checking, will
discuss that with the team.


> > + return;
> > + }
> > +
> > + netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk);
> > +
> > + inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
> > + dev_put(netdev); /* held by offload_socket */
> > +}
> > +
> > +static int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue)
> > +{
> > + struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
> > + struct tcp_ddp_limits limits;
> > + int ret = 0;
> > +
> > + if (!netdev) {
> > + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>
> This should be more informative.

okk

> > + queue->ctrl->offloading_netdev = NULL;
> > + return -ENODEV;
> > + }
> > +
> > + if (netdev->features & NETIF_F_HW_TCP_DDP &&
> > + netdev->tcp_ddp_ops &&
> > + netdev->tcp_ddp_ops->tcp_ddp_limits)
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits);
> > + else
> > + ret = -EOPNOTSUPP;
> > +
> > + if (!ret) {
> > + queue->ctrl->offloading_netdev = netdev;
>
> you save a reference to the netdev here, but then release the refcnt
> below. That device could be deleted between this point in time and the
> initialization of all queues.

> > + dev_dbg_ratelimited(queue->ctrl->ctrl.device,
> > + "netdev %s offload limits: max_ddp_sgl_len %d\n",
> > + netdev->name, limits.max_ddp_sgl_len);
> > + queue->ctrl->ctrl.max_segments = limits.max_ddp_sgl_len;
> > + queue->ctrl->ctrl.max_hw_sectors =
> > + limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
> > + } else {
> > + queue->ctrl->offloading_netdev = NULL;
> > + }
> > +
> > + /* release the device as no offload context is established yet. */
> > + dev_put(netdev);
> > +
> > + return ret;
> > +}
> > +
> > +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> > +  struct sk_buff *skb, unsigned int offset)
> > +{
> > + u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
> > + struct net_device *netdev = queue->ctrl->off

Re: [PATCH v4 net-next 06/21] nvme-tcp: Add DDP offload control path

2021-02-14 Thread Or Gerlitz
On Sun, Feb 14, 2021 at 8:20 PM David Ahern  wrote:
> On 2/11/21 2:10 PM, Boris Pismenny wrote:
> > @@ -223,6 +229,164 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
> >   return nvme_tcp_pdu_data_left(req) <= len;
> >  }

Hi Dave,

Thanks for the continuous feedback. Folks are out this week and it seems
that for a few comments we will need to discuss internally, but anyway I
will address at least some of the comments later today/tomorrow.

Or.


Re: [PATCH v4 net-next 21/21] Documentation: add TCP DDP offload documentation

2021-02-13 Thread Or Gerlitz
On Fri, Feb 12, 2021 at 4:11 PM Nikolay Aleksandrov  wrote:

> I got interested and read through the doc, there are a few typos below.

thanks for spotting these, we will fix them


Re: [net-next V2 01/17] net/mlx5: E-Switch, Refactor setting source port

2021-02-09 Thread Or Gerlitz
On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov  wrote:
> On Mon 08 Feb 2021 at 22:22, Jakub Kicinski  wrote:
> > On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:

> >> > These operations imply that 7.7.7.5 is configured on some interface on
> >> > the host. Most likely the VF representor itself, as that aids with ARP
> >> > resolution. Is that so?

> >> The tunnel endpoint IP address is configured on VF that is represented
> >> by enp8s0f0_0 representor in example rules. The VF is on host.

> > This is very confusing, are you saying that the 7.7.7.5 is configured
> > both on VF and VFrep? Could you provide a full picture of the config
> > with IP addresses and routing?

> No, tunnel IP is configured on VF. That particular VF is in host [..]

What's the motivation for that? Isn't that introducing a 3x slowdown?


Re: [PATCH net-next v4] net: psample: Introduce stubs to remove NIC driver dependency

2021-02-09 Thread Or Gerlitz
On Tue, Feb 9, 2021 at 12:01 PM Leon Romanovsky  wrote:
>
> On Tue, Feb 09, 2021 at 11:25:33AM +0200, Or Gerlitz wrote:
> > On Tue, Feb 9, 2021 at 8:49 AM Leon Romanovsky  wrote:
> >
> > [..]
> >
> > > This is another problem with mlx5 - complete madness with config options
> > > that are not possible to test.
> > > ➜  kernel git:(rdma-next) grep -h "config MLX" drivers/net/ethernet/mellanox/mlx5/core/Kconfig | awk '{ print $2}' | sort |uniq |wc -l
> > > 19
> >
> > wait, why do you call it madness? we were advised by some users (do
> > git blame for the patches) to refine things with multiple configs and it seems
> > to work quite well -- what is it you don't like? and what's wrong with a simple grep?
>
> Yes, I am aware of these users and what and why they asked it.
>
> They didn't ask us to have new config for every feature/file, but to have
> light ethernet device.
>
> Other users are distributions and they enable all options that supported in
> the specific kernel they picked, because they don't know in advance where their distro will be used.
>
> You also don't have capacity to test various combinations, so you
> test only a small subset of common ones that are pretty standard. This is why
> you have this feeling of "work quite well".

ok, point taken

> And I'm not talking about compilations but actual regression runs.

understood

> I suggest reducing the number of configs to a small amount, something like 3-5
> options:
>  * Basic ethernet driver
>  * + ETH
>  * + RDMA
>  * + VDPA
>  * 
>  * Full mlx5 driver

doesn't sound impossible

> And there is nothing wrong with simple grep [..]

no worries


Re: [PATCH net-next v4] net: psample: Introduce stubs to remove NIC driver dependency

2021-02-09 Thread Or Gerlitz
On Tue, Feb 9, 2021 at 8:49 AM Leon Romanovsky  wrote:

[..]

> This is another problem with mlx5 - complete madness with config options
> that are not possible to test.
> ➜  kernel git:(rdma-next) grep -h "config MLX" drivers/net/ethernet/mellanox/mlx5/core/Kconfig | awk '{ print $2}' | sort |uniq |wc -l
> 19

wait, why do you call it madness? we were advised by some users (do
git blame for the patches) to refine things with multiple configs and it seems
to work quite well -- what is it you don't like? and what's wrong with a simple grep?

$ grep "config MLX5_" drivers/net/ethernet/mellanox/mlx5/core/Kconfig
config MLX5_CORE
config MLX5_ACCEL
config MLX5_FPGA
config MLX5_CORE_EN
config MLX5_EN_ARFS
config MLX5_EN_RXNFC
config MLX5_MPFS
config MLX5_ESWITCH
config MLX5_CLS_ACT
config MLX5_TC_CT
config MLX5_CORE_EN_DCB
config MLX5_CORE_IPOIB
config MLX5_FPGA_IPSEC
config MLX5_IPSEC
config MLX5_EN_IPSEC
config MLX5_FPGA_TLS
config MLX5_TLS
config MLX5_EN_TLS
config MLX5_SW_STEERING
config MLX5_SF
config MLX5_SF_MANAGER
config MLX5_EN_NVMEOTCP


Re: [pull request][net-next V2 00/17] mlx5 updates 2021-02-04

2021-02-09 Thread Or Gerlitz
On Tue, Feb 9, 2021 at 10:42 AM Or Gerlitz  wrote:
> On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed  wrote:
> > Vlad Buslov says:
>
> > Implement support for VF tunneling
>
> > Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> > uplink representor. Remove implicit and explicit assumptions of tunnel always
> > being terminated on uplink and implement necessary infrastructure for
> > configuring tunnels on VF representors and updating rules on such tunnels
> > according to routing changes.
>
> > SW TC model
>
> maybe before SW TC model, you can explain the vswitch SW model (TC is
> a vehicle to implement the SW model).

I thought my earlier post missed the list, so I reposted, but realized
now that it didn't; feel free to address either of the posts.


Re: [pull request][net-next V2 00/17] mlx5 updates 2021-02-04

2021-02-09 Thread Or Gerlitz
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed  wrote:

> Vlad Buslov says:

> Implement support for VF tunneling

> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.

> SW TC model

maybe before SW TC model, you can explain the vswitch SW model (TC is
a vehicle to implement the SW model).

SW model for VST and "classic" v-switch tunnel setup:

For example, in VST model, each virtio/vf/sf vport has a vlan
such that the v-switch tags packets going out "south" of the
vport towards the uplink, untags packets going "north" from
the uplink, matches on the vport tag and forwards them to
the vport (and does nothing for east-west traffic).

In a similar manner, in "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow associated with VNI/s marking the
tenant/s it belongs to. Same tenant east-west traffic on the
host doesn't go through any encap/decap. The v-switch adds the
relevant tunnel MD to packets/skbs sent "southward" by the end-point
and forwards it to the VTEP which applies encap based on the MD (LWT
scheme) and sends the packets to the wire. On RX, the VTEP decaps
the tunnel info from the packet, adds it as MD to the skb and
forwards the packet up into the stack where the vsw hooks it, matches
on the MD + inner tuple and then forwards it to the relevant endpoint.

HW offloads for VST and "classic" v-switch tunnel setup:

more or less straightforward based on the above

> From TC perspective VF tunnel configuration requires two rules in both
> directions:

> TX rules
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
> endpoint IP address:
> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
> representor:

> RX rules
> 1. Rule that encapsulates the tunneled flow and redirects packets from
> source VF rep to tunnel device:
> 2. Rule that redirects from tunnel device to UL rep:

mmm, it's kinda hard to follow and piece together a SW model from TC rules..

I think we need these two to begin with (in whatever order that works
better for you)

[1] Motivation for enhanced v-switch tunnel setup:

[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model

a clear SW model before HW offloads model..

>  25 files changed, 3812 insertions(+), 1057 deletions(-)

for adding almost 4K LOCs


Re: [pull request][net-next V2 00/17] mlx5 updates 2021-02-04

2021-02-08 Thread Or Gerlitz
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed  wrote:
> From: Saeed Mahameed 

> This series adds the support for VF tunneling.

> Vlad Buslov says:
> =
> Implement support for VF tunneling


> Abstract
> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.
>
> SW TC model

maybe before SW TC model, you can explain the SW model (TC is a vehicle
to implement the SW model).


SW model for VST and "classic" v-switch tunnel setup:

For example, in VST model, each virtio/vf/sf vport has a vlan such that
the v-switch tags packets going out "south" of the vport towards the
uplink, untags
packets going "north" from the uplink into the vport (and does nothing
for east-west traffic).

In a similar manner, in "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow
associated with VNI/s marking the tenant/s it belongs to. Same tenant
east-west traffic
on the host doesn't go through any encap/decap. The v-switch adds the
relevant tunnel
MD to packets/skbs sent "southward" by the end-point and forwards it
to the VTEP which applies
encap and sends the packets to the wire. On RX, the VTEP decaps the
tunnel info from the packet,
adds it as MD to the skb and forwards the packet up into the stack
where the vsw hooks it, matches
on the MD + inner tuple and then forwards it to the relevant endpoint.

HW offloads for VST and "classic" v-switch tunnel setup:

more or less straightforward based on the above

> From TC perspective VF tunnel configuration requires two rules in both
> directions:
>
> TX rules
>
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
> endpoint IP address:

> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
> representor:

> RX rules
>
> 1. Rule that encapsulates the tunneled flow and redirects packets from
> source VF rep to tunnel device:

> 2. Rule that redirects from tunnel device to UL rep:

Sorry, I am not managing to follow and piece together a SW model from TC rules..

I think we need these two to begin with:

[1] Motivation for enhanced v-switch tunnel setup:

[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model

a clear SW model before HW offloads model..


Re: [PATCH v3 net-next 08/21] nvme-tcp : Recalculate crc in the end of the capsule

2021-02-07 Thread Or Gerlitz
On Wed, Feb 3, 2021 at 11:12 AM Sagi Grimberg  wrote:
> On 2/1/21 2:04 AM, Boris Pismenny wrote:

> > @@ -290,12 +341,9 @@ int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> >   }
> >
> >   req->ddp.command_id = command_id;
> > - req->ddp.sg_table.sgl = req->ddp.first_sgl;
> > - ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
> > -  req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
> > + ret = nvme_tcp_req_map_sg(req, rq);
>
> Why didn't you introduce nvme_tcp_req_map_sg in the first place?

OK, will do

> >   if (ret)
> >   return -ENOMEM;
> > - req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
> >
> >   ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
> >queue->sock->sk,

> > @@ -1088,17 +1148,29 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
> >   if (queue->ddgst_remaining)
> >   return 0;
> >
> > - if (queue->recv_ddgst != queue->exp_ddgst) {
> > - dev_err(queue->ctrl->ctrl.device,
> > - "data digest error: recv %#x expected %#x\n",
> > - le32_to_cpu(queue->recv_ddgst),
> > - le32_to_cpu(queue->exp_ddgst));
> > - return -EIO;
> > + offload_fail = !nvme_tcp_ddp_ddgst_ok(queue);
> > + offload_en = test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags);
> > + if (!offload_en || offload_fail) {
> > + if (offload_en && offload_fail) { // software-fallback
> > + rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
> > +   pdu->command_id);
> > + nvme_tcp_ddp_ddgst_recalc(queue->rcv_hash, rq);
> > + }
> > +
> > + nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> > + if (queue->recv_ddgst != queue->exp_ddgst) {
> > + dev_err(queue->ctrl->ctrl.device,
> > + "data digest error: recv %#x expected %#x\n",
> > + le32_to_cpu(queue->recv_ddgst),
> > + le32_to_cpu(queue->exp_ddgst));
> > + return -EIO;
> > + }
>
> I still dislike this hunk. Can you split the recalc login to its
> own ddp function at least? This is just a confusing piece of code.

mmm, we added two boolean predicates (offload_en and offload_fail)
plus a comment (software-fallback) to clarify this piece... thought it can fly

> >   if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> > - struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
> > - pdu->command_id);
> > + if (!rq)
> > + rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
> > +   pdu->command_id);
>
> Why is this change needed? Maybe just move this assignment up?

OK will move it up


Re: [PATCH v3 net-next 07/21] nvme-tcp: Add DDP data-path

2021-02-04 Thread Or Gerlitz
On Wed, Feb 3, 2021 at 10:54 AM Sagi Grimberg  wrote:

> > +static
> > +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> > +   u16 command_id,
> > +   struct request *rq)
> > +{
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > + int ret;
> > +
> > + if (unlikely(!netdev)) {
> > + pr_info_ratelimited("%s: netdev not found\n", __func__);
> > + return -EINVAL;
> > + }
> > +
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
> > + &req->ddp, rq);
> > + sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
> > + req->offloaded = false;
>
> Why is the offloaded = false needed here? you also clear it when you setup.

yep, there are two places where we needlessly falsified the offloaded flag
- will remove them

The lifetime rules are - set to false on cmd setup and set to true in
ddp setup if it succeeded

>
> > + return ret;
> > +}
> > +
> > +static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> > +{
> > + struct request *rq = ddp_ctx;
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > +
> > + if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
> > + nvme_complete_rq(rq);
>
> Why is the status shifted here? it was taken from the cqe as is..

there are two cases

1. the status is taken from the cqe as is
2. the status is taken from the req with a left shift (the success-bit
IO read flow)

for #2 we already do the shift left in nvme_tcp_end_request so no need to
repeat it here,  will fix

>
> > +}
> > +
> > +static
> > +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> > +u16 command_id,
> > +struct request *rq)
> > +{
> > + struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> > + struct net_device *netdev = queue->ctrl->offloading_netdev;
> > + int ret;
> > +
> > + req->offloaded = false;
> > +
> > + if (unlikely(!netdev)) {
> > + pr_info_ratelimited("%s: netdev not found\n", __func__);
>
> dev_info_ratelimited please.

ok

> > + return -EINVAL;
> > + }
> > +
> > + req->ddp.command_id = command_id;
> > + req->ddp.sg_table.sgl = req->ddp.first_sgl;
> > + ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
> > +  req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
> > + if (ret)
> > + return -ENOMEM;
> > + req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
> > +
> > + ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
> > +  queue->sock->sk,
> > +  &req->ddp);
> > + if (!ret)
> > + req->offloaded = true;
> > + return ret;
> > +}
> > +
> >   static
> >   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> >   {
> > @@ -377,6 +446,25 @@ bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
> >
> >   #else
> >
> > +static
> > +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> > +u16 command_id,
> > +struct request *rq)
> > +{
> > + return -EINVAL;
> > +}
> > +
> > +static
> > +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> > +   u16 command_id,
> > +   struct request *rq)
> > +{
> > + return -EINVAL;
> > +}
> > +
> > +static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> > +{}
> > +
> >   static
> >   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> >   {
> > @@ -665,6 +753,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> >   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> >   struct nvme_completion *cqe)
> >   {
> > + struct nvme_tcp_request *req;
> >   struct request *rq;
> >
> >   rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
> > @@ -676,8 +765,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> >   return -EINVAL;
> >   }
> >
> > - if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> > - nvme_complete_rq(rq);
> > + req = blk_mq_rq_to_pdu(rq);
> > + if (req->offloaded) {
> > + req->status = cqe->status;
> > + req->result = cqe->result;
> > + nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
> > + } else {
> > + if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> > + nvme_complete_rq(rq);
> > + }
>
> Maybe move this to nvme_tcp_complete_request as it is called from two
> code paths.

to make sure, add

void nvme_tcp_complete_request(struct request *rq, u16 status, union nvme_result *res, u16 command_id)

and invoke it from t

Re: [PATCH v3 net-next 08/21] nvme-tcp : Recalculate crc in the end of the capsule

2021-02-04 Thread Or Gerlitz
On Wed, Feb 3, 2021 at 11:12 AM Sagi Grimberg  wrote:

> > @@ -1841,8 +1913,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
> >   nvme_tcp_restore_sock_calls(queue);
> >   cancel_work_sync(&queue->io_work);
> >
> > - if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
> > + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) ||
> > + test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags))
> >   nvme_tcp_unoffload_socket(queue);
> > +
>
> extra newline

will remove

> >   }
> >
> >   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> > @@ -1970,8 +2044,6 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
> >   {
> >   int ret;
> >
> > - to_tcp_ctrl(ctrl)->offloading_netdev = NULL;
> > -
>
> Unclear what is the intent here.

yep, unclear indeed.. will look and probably remove

as for your other comment on this patch, will get back to you later on

> >   ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
> >   if (ret)
> >   return ret;


Re: [PATCH v3 net-next 09/21] nvme-tcp: Deal with netdevice DOWN events

2021-02-04 Thread Or Gerlitz
On Wed, Feb 3, 2021 at 11:17 AM Sagi Grimberg  wrote:
> > @@ -2930,6 +2931,27 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
> >   return ERR_PTR(ret);
> >   }
> >
> > +static int nvme_tcp_netdev_event(struct notifier_block *this,
> > +  unsigned long event, void *ptr)
> > +{
> > + struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
> > + struct nvme_tcp_ctrl *ctrl;
> > +
> > + switch (event) {
> > + case NETDEV_GOING_DOWN:
> > + mutex_lock(&nvme_tcp_ctrl_mutex);
> > + list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
> > + if (ndev != ctrl->offloading_netdev)
> > + continue;
> > + nvme_tcp_error_recovery(&ctrl->ctrl);
> > + }
> > + mutex_unlock(&nvme_tcp_ctrl_mutex);
> > + flush_workqueue(nvme_reset_wq);
> > + /* we assume that the going down part of error recovery is over */
>
> Maybe phrase it as:
> /*
>   * The associated controllers teardown has completed, ddp contexts
>   * were also torn down so we should be safe to continue...
>   */

sure


Re: [PATCH net-next RESEND 2/2] net/mlx5: E-Switch, Implement devlink port function cmds to control roce

2021-02-02 Thread Or Gerlitz
On Tue, Feb 2, 2021 at 10:08 AM Yishai Hadas  wrote:
> Implement devlink port function commands to enable / disable roce.
> This is used to control the roce device capabilities.

[..]

> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -122,8 +122,9 @@ struct mlx5_vport_info {
> int link_state;
> u32 min_rate;
> u32 max_rate;
> -   boolspoofchk;
> -   booltrusted;
> +   u8  spoofchk: 1;
> +   u8  trusted: 1;
> +   u8  roce_enabled: 1;
>  };

This struct has attributes which have e-switch vport affiliation, whereas
roce enable/disable is a property of the vhca function over the e-wire --
it's not that we do something specific in the e-switching to enforce it -
sounds like a problematic location to land the bit..


Re: [PATCH v3 net-next 07/21] nvme-tcp: Add DDP data-path

2021-02-02 Thread Or Gerlitz
On Mon, Feb 1, 2021 at 7:40 PM Christoph Hellwig  wrote:
> Given how much ddp code there is can you split it into a separate file?

mmm, do we need to check the preferences or get to a consensus among
the maintainers for that one?


Re: [PATCH v3 net-next 06/21] nvme-tcp: Add DDP offload control path

2021-02-02 Thread Or Gerlitz
On Mon, Feb 1, 2021 at 7:39 PM Christoph Hellwig  wrote:
>> +static
>> +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
>
> Please use the same coding style as the rest of the file, and not some
> weirdo version.

ack


Re: [PATCH v3 net-next 01/21] iov_iter: Introduce new procedures for copy to iter/pages

2021-02-02 Thread Or Gerlitz
On Mon, Feb 1, 2021 at 7:38 PM Christoph Hellwig  wrote:
> On Mon, Feb 01, 2021 at 12:04:49PM +0200, Boris Pismenny wrote:
> > +static __always_inline __must_check
> > +size_t ddp_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > +{
> > + if (unlikely(!check_copy_size(addr, bytes, true)))
> > + return 0;
> > + else
> > + return _ddp_copy_to_iter(addr, bytes, i);
> > +}
>
> No need for the else after a return, and the normal kernel convention
> double underscores for magic internal functions.

ack for the no-else-after-a-return

Re the double underscoring, I was not sure, e.g the non-ddp counterpart
(_copy_to_iter) is single underscored

> But more importantly: does this belong into the generic header without
> and comments what the ddp means and when it should be used?

will look into this, any idea for a more suitable location?

> > +static void ddp_memcpy_to_page(struct page *page, size_t offset, const 
> > char *from, size_t len)
>
> Overly long line.  But we're also looking into generic helpers for
> this kind of things, not sure if they made it to linux-next in the
> meantime, but please check.

This is what I found in linux-next - not sure if you were referring to it

commit 11432a3cc061c39475295be533c3674c4f8a6d0b
Author: David Howells 

iov_iter: Add ITER_XARRAY

> > +size_t _ddp_copy_to_iter(const void *addr, size_t bytes, struct iov_iter 
> > *i)
> > +{
> > + const char *from = addr;
> > + if (unlikely(iov_iter_is_pipe(i)))
> > + return copy_pipe_to_iter(addr, bytes, i);
> > + if (iter_is_iovec(i))
> > + might_fault();
> > + iterate_and_advance(i, bytes, v,
> > + copyout(v.iov_base, (from += v.iov_len) - v.iov_len, 
> > v.iov_len),
> > + ddp_memcpy_to_page(v.bv_page, v.bv_offset,
> > +(from += v.bv_len) - v.bv_len, v.bv_len),
> > + memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
> > + )
> > +
> > + return bytes;
> > +}
>
> This bloats every kernel build, so please move it into a conditionally built 
> file.

ack

>  And please document the whole thing.

ok


Re: perf measure for stalled cycles per instruction on newer Intel processors

2020-10-18 Thread Or Gerlitz
On Thu, Oct 15, 2020 at 9:33 PM Andi Kleen  wrote:
> On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> > Earlier Intel processors (e.g E5-2650) support the more of classical
> > two stall events (for backend and frontend [1]) and then perf shows
> > the nice measure of stalled cycles per instruction - e.g here where we
> > have IPC of 0.91 and CSPI (see [2]) of 0.68:
>
> Don't use it. It's misleading on an out-of-order CPU because you don't
> know if it's actually limiting anything.
>
> If you want useful bottleneck data use --topdown.

So running again, this time with the below params, I got this output
where the whole right-most column is colored red. I wonder what can be
said about the amount/ratio of stalls for this app - if you could maybe
recommend some posts of yours to better understand that; I saw a comment
in the perf-stat man page and an lwn article but wasn't really able to
figure it out.

FWIW, the kernel is 5.5.7-100.fc30.x86_64 and the CPU E5-2650 0

$ perf stat  --topdown -a  taskset -c 0 $APP

[...]

 Performance counter stats for 'system wide':

                      retiring  bad speculation  frontend bound  backend bound
S0-D0-C0         1       24.9%             1.1%           16.1%          57.9%
S0-D0-C1         1       16.3%             1.3%           17.3%          65.1%
S0-D0-C2         1       17.0%             1.2%           15.3%          66.5%
S0-D0-C3         1       18.3%             0.8%            8.2%          72.8%
S0-D0-C4         1       18.1%             0.8%            8.5%          72.6%
S0-D0-C5         1       17.6%             0.8%           10.0%          71.6%
S0-D0-C6         1       18.3%             0.7%            7.4%          73.6%
S0-D0-C7         1       15.4%             1.4%           22.1%          61.2%
S1-D0-C0         1       15.9%             1.4%           16.4%          66.3%
S1-D0-C1         1       21.9%             2.6%           16.9%          58.5%
S1-D0-C2         1       20.8%             3.7%           17.1%          58.4%
S1-D0-C3         1       17.8%             1.0%            9.2%          72.1%
S1-D0-C4         1       17.8%             1.0%            9.0%          72.2%
S1-D0-C5         1       17.8%             1.0%            9.0%          72.2%
S1-D0-C6         1       17.4%             1.4%           12.8%          68.4%
S1-D0-C7         1       23.6%             4.3%           17.2%          55.0%

  13.341823591 seconds time elapsed

while running with perf stat -d gives this:

$ perf stat   -d taskset -c 0 $APP

Performance counter stats for 'taskset -c 0 ./main.gcc9.3.1':

         15,075.30 msec task-clock                #    0.900 CPUs utilized
               199      context-switches          #    0.013 K/sec
                 1      cpu-migrations            #    0.000 K/sec
           117,987      page-faults               #    0.008 M/sec
    40,907,365,540      cycles                    #    2.714 GHz
    26,431,604,986      stalled-cycles-frontend   #   64.61% frontend cycles idle
    21,734,615,045      stalled-cycles-backend    #   53.13% backend cycles idle
    35,339,765,469      instructions              #    0.86  insn per cycle
                                                  #    0.75  stalled cycles per insn


perf measure for stalled cycles per instruction on newer Intel processors

2020-10-15 Thread Or Gerlitz
Hi,

Earlier Intel processors (e.g. E5-2650) support the two more classical
stall events (for backend and frontend [1]), and perf then shows the
nice measure of stalled cycles per instruction - e.g. here, where we
have an IPC of 0.91 and a CSPI (see [2]) of 0.68:

     9,568,273,970      cycles                    #    2.679 GHz                      (53.30%)
     5,979,155,843      stalled-cycles-frontend   #   62.49% frontend cycles idle     (53.31%)
     4,874,774,413      stalled-cycles-backend    #   50.95% backend cycles idle      (53.31%)
     8,732,767,750      instructions              #    0.91  insn per cycle
                                                  #    0.68  stalled cycles per insn  (59.97%)

Running over a system with a newer processor (6254) I noted that there
are sort of a zillion (..) stall events [3], and perf -e $EVENT for them
does show their count.

However, perf stat no longer shows the "stalled cycles per insn"
computation.

Looking in the perf sources, it seems we do that only if the
backend/frontend events exist (perf_stat__print_shadow_stats function)
- am I correct in my reading of the code?

If it's the case, what's needed here to get this or similar measure back?

If it's not the case, if you can suggest how to get perf to emit this
quantity there.

Thanks,

Or.

[1] perf list | grep stalled-cycles

stalled-cycles-backend OR idle-cycles-backend  [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend[Hardware event]

[2] http://www.brendangregg.com/perf.html#CPUstatistics

[3] perf list | grep stall -A 1 (manipulated, there are more..)

cycle_activity.stalls_l3_miss
   [Execution stalls while L3 cache miss demand load is outstanding]
  cycle_activity.stalls_l1d_miss
   [Execution stalls while L1 cache miss demand load is outstanding]
  cycle_activity.stalls_l2_miss
   [Execution stalls while L2 cache miss demand load is outstanding]
  cycle_activity.stalls_mem_any
   [Execution stalls while memory subsystem has an outstanding load]
  cycle_activity.stalls_total
   [Total execution stalls]
  ild_stall.lcp
   [Core cycles the allocator was stalled due to recovery from earlier
  partial_rat_stalls.scoreboard
   [Cycles where the pipeline is stalled due to serializing operations]
  resource_stalls.any
   [Resource-related stall cycles]
  resource_stalls.sb
   [Cycles stalled due to no store buffers available. (not including
  partial_rat_stalls.scoreboard
   [Cycles where the pipeline is stalled due to serializing operations]
  resource_stalls.any
   [Resource-related stall cycles]
  resource_stalls.sb
   [Cycles stalled due to no store buffers available. (not including
draining form sync)]
  uops_executed.stall_cycles
   [Counts number of cycles no uops were dispatched to be executed on this
  uops_issued.stall_cycles
   [Cycles when Resource Allocation Table (RAT) does not issue Uops to
  uops_retired.stall_cycles
   [Cycles without actually retired uops]


Re: [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events

2020-10-10 Thread Or Gerlitz
On Fri, Oct 9, 2020 at 1:50 AM Sagi Grimberg  wrote:

> > @@ -412,8 +413,6 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue 
> > *queue,
> >   queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
> >   queue->ctrl->ctrl.max_hw_sectors =
> >   limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
> > - } else {
> > - queue->ctrl->offloading_netdev = NULL;
>
> Squash this change to the patch that introduced it.

OK, will look into that, and I guess it should be fine to make this change
as you suggested


> > + case NETDEV_GOING_DOWN:
> > + mutex_lock(&nvme_tcp_ctrl_mutex);
> > + list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
> > + if (ndev != ctrl->offloading_netdev)
> > + continue;
> > + nvme_tcp_error_recovery(&ctrl->ctrl);
> > + }
> > + mutex_unlock(&nvme_tcp_ctrl_mutex);
> > + flush_workqueue(nvme_reset_wq);
>
> Worth a small comment that this we want the err_work to complete
> here. So if someone changes workqueues he may see this.


ack


Re: iproute2 DDMMYY versioning - why?

2020-07-31 Thread Or Gerlitz
On Tue, Jul 28, 2020 at 10:51 PM Stephen Hemminger
 wrote:

> It is only an historical leftover, because 15 yrs ago that is how Alexy did it

So how about putting the burden created by this historical leftover behind
us and moving to using the kernel releases as the emitted version?

Or.


Re: [net V2 10/11] net/mlx5e: Modify uplink state on interface up/down

2020-07-28 Thread Or Gerlitz
On Tue, Jul 28, 2020 at 11:07 PM Saeed Mahameed  wrote:
> When setting the PF interface up/down, notify the firmware to update
> uplink state via MODIFY_VPORT_STATE, when E-Switch is enabled.
> This behavior will prevent sending traffic out on uplink port when PF is
> down, such as sending traffic from a VF interface which is still up.
> Currently when calling mlx5e_open/close(), the driver only sends PAOS
> command to notify the firmware to set the physical port state to
> up/down, however, it is not sufficient. When VF is in "auto" state, it
> follows the uplink state, which was not updated on mlx5e_open/close()
> before this patch.

> When switchdev mode is enabled and uplink representor is first enabled,
> set the uplink port state value back to its FW default "AUTO".

So this is not the legacy mode "auto" vf vport state but rather something else?

If this is the case, what are the semantics of the firmware "auto" state,
and how are they related to the switchdev vport representors architecture/behaviour?


> Fixes: 63bfd399de55 ("net/mlx5e: Send PAOS command on interface up/down")


iproute2 DDMMYY versioning - why?

2020-07-28 Thread Or Gerlitz
Stephen,

Taking into account that iproute releases are aligned with the kernel
ones -- is there any real reason for the confusing DDMMYY double
versioning?

I see that even the git tags go after the kernel releases..

Or.

# git remote -v
origin  git://git.kernel.org/pub/scm/network/iproute2/iproute2.git (fetch)
origin  git://git.kernel.org/pub/scm/network/iproute2/iproute2.git (push)

$ git log -p include/SNAPSHOT.h

commit 1bfa3b3f66ef48bdabe0eb2a7c14e69f481dfa25 (tag: v5.7.0)
Author: Stephen Hemminger 
Date:   Tue Jun 2 20:35:00 2020 -0700

v5.7.0

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index 0d10a9c2..0d211784 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "200330";
+static const char SNAPSHOT[] = "200602";

commit 29981db0e051cd4c53920c89dddcf3d883343a0f (tag: v5.6.0)
Author: Stephen Hemminger 
Date:   Mon Mar 30 08:06:08 2020 -0700

v5.6.0

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index c0fa1bb4..0d10a9c2 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "200127";
+static const char SNAPSHOT[] = "200330";

commit d4df55404aec707bd55c9264c666ddb1bb05d7f1 (tag: v5.5.0)
Author: Stephen Hemminger 
Date:   Mon Jan 27 05:53:09 2020 -0800

v5.5.0

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index b98ad502..c0fa1bb4 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "191125";
+static const char SNAPSHOT[] = "200127";

etc


Re: [net 01/12] net/mlx5e: Hold reference on mirred devices while accessing them

2020-07-28 Thread Or Gerlitz
On Tue, Jul 28, 2020 at 12:13 PM Saeed Mahameed  wrote:
> From: Eli Cohen 
>
> Net devices might be removed. For example, a vxlan device could be
> deleted and its ifindex would become invalid. Use dev_get_by_index()
> instead of __dev_get_by_index() to hold reference on the device while
> accessing it and release after done.

hasn't this patch been sent in the past in the same form, while we were in
the middle of discussing how to properly address this? if something changed, what?


Re: [net 11/12] net/mlx5e: Modify uplink state on interface up/down

2020-07-28 Thread Or Gerlitz
On Tue, Jul 28, 2020 at 12:16 PM Saeed Mahameed  wrote:
> From: Ron Diskin 
>
> When setting the PF interface up/down, notify the firmware to update
> uplink state via MODIFY_VPORT_STATE

How does this relate to e-switching? The patch touches the e-switch code,
but I don't see that mentioned in the change-log..

> This behavior will prevent sending traffic out on uplink port when PF is
> down, such as sending traffic from a VF interface which is still up.
> Currently when calling mlx5e_open/close(), the driver only sends PAOS
> command to notify the firmware to set the physical port state to
> up/down, however, it is not sufficient. When VF is in "auto" state, it

"auto" is nasty concept that applies only to legacy mode. However, the patch
touches the switchdev mode (representors) code, please explain...

> follows the uplink state, which was not updated on mlx5e_open/close()
> before this patch.
>
> Fixes: 63bfd399de55 ("net/mlx5e: Send PAOS command on interface up/down")
> Signed-off-by: Ron Diskin 
> Reviewed-by: Roi Dayan 
> Reviewed-by: Moshe Shemesh 


Re: [net 05/11] net/mlx5e: Hold reference on mirred devices while accessing them

2020-07-07 Thread Or Gerlitz
On Tue, Jul 7, 2020 at 5:57 PM Eli Cohen  wrote:
> On Mon, Jul 06, 2020 at 09:13:06AM +0300, Or Gerlitz wrote:
> > On Sun, Jul 5, 2020 at 10:19 AM Eli Cohen  wrote:
> >
> > so what are we protecting here against? someone removing the device
> > while the tc rule is being added?
> >
> Not necessairly. In case of ecmp, the rule may be copied to another
> eswitch. At this time, if the mirred device does not exsist anymore, we
> will crash.
>
> I saw this problem at a customer's machine and this solved the problem.
> It was an older kernel but I think we have the same issue here as well.
> All you have is the ifindex of the mirred device so what I did here is
> required.

Repeating myself, why do it in the driver and not higher in the tc stack?
if I got you correctly, the same problem can happen for sw only (skip-hw) rules


Re: [net 05/11] net/mlx5e: Hold reference on mirred devices while accessing them

2020-07-05 Thread Or Gerlitz
On Sun, Jul 5, 2020 at 10:19 AM Eli Cohen  wrote:
>
> On Fri, Jul 03, 2020 at 12:33:58PM +0300, Or Gerlitz wrote:
> > On Fri, Jul 3, 2020 at 1:24 AM Saeed Mahameed  wrote:
> > > From: Eli Cohen 
> > >
> > > Net devices might be removed. For example, a vxlan device could be
> > > deleted and its ifindex would become invalid. Use dev_get_by_index()
> > > instead of __dev_get_by_index() to hold reference on the device while
> > > accessing it and release after done.
> >
> > So if user space app installed a tc rule and then crashed or just
> > exited without
> > uninstalling the rule, the mirred (vxlan, vf rep, etc) device could
> > never be removed?
>
> Why do you think so? I decrease ref count, unconditionally, right after
> returning from mlx5e_attach_encap().

so what are we protecting here against? someone removing the device
while the tc rule is being added?

why do it in the driver and not higher in the tc stack? if I got you
correctly, the same problem can
happen for sw only (skip-hw) rules


Re: [net 05/11] net/mlx5e: Hold reference on mirred devices while accessing them

2020-07-03 Thread Or Gerlitz
On Fri, Jul 3, 2020 at 1:24 AM Saeed Mahameed  wrote:
> From: Eli Cohen 
>
> Net devices might be removed. For example, a vxlan device could be
> deleted and its ifindex would become invalid. Use dev_get_by_index()
> instead of __dev_get_by_index() to hold reference on the device while
> accessing it and release after done.

So if user space app installed a tc rule and then crashed or just
exited without
uninstalling the rule, the mirred (vxlan, vf rep, etc) device could
never be removed?

Why this is any better from the current situation?


Re: [PATCH net-next 0/9] devlink vdev

2019-10-23 Thread Or Gerlitz
On Tue, Oct 22, 2019 at 8:46 PM Yuval Avnery  wrote:
> This patchset introduces devlink vdev.

> Currently, legacy tools do not provide a comprehensive solution that can
> be used in both SmartNic and non-SmartNic mode.
> Vdev represents a device that exists on the ASIC but is not necessarily
> visible to the kernel.

I assume you refer to a function (e.g. Eth VF, NVMe) or mediated device
(e.g. the sub-functions work by Mellanox we presented in netdev 0x13)
residing on the host but not visible to the smart nic -- would be good to
clarify that here.

FWIW whatever SW you have on the smart NIC, it will still not be able
to actually provision / PT / map this device to a guest/container before/after
this patch set, and a host SW agent is still needed, agree?

worth explaining what delta/progress this series gets us in this context
(of the inability to provision from the smartnic).

Or.


Re: [PATCH net-next 0/4] page_pool: API for numa node change handling

2019-10-23 Thread Or Gerlitz
On Tue, Oct 22, 2019 at 8:04 AM Saeed Mahameed  wrote:

> CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
> NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
>
> XDP Drop/TX single core:
> NUMA  | XDP  | Before| After
> ---
> Close | Drop | 11   Mpps | 10.9 Mpps
> Far   | Drop | 4.4  Mpps | 5.8  Mpps
>
> Close | TX   | 6.5 Mpps  | 6.5 Mpps
> Far   | TX   | 4   Mpps  | 3.5  Mpps
>
> Improvement is about 30% drop packet rate, 15% tx packet rate for numa far 
> test.

there's a typo here - the TX far test results actually become worse; would
be good to clarify/fix the cover letter


Re: [RFC 00/20] Intel RDMA/IDC Driver series

2019-09-29 Thread Or Gerlitz
On Thu, Sep 26, 2019 at 7:46 PM Jeff Kirsher
 wrote:
> This series is sent out as an RFC to verify that our implementation of
> the MFD subsystem is correct to facilitate inner driver communication
> (IDC) between the new "irdma" driver to support Intel's ice and i40e
> drivers.
>
> The changes contain the modified ice and i40e driver changes using the
> MFD subsystem.  It also contains the new irdma driver which is replacing
> the i40iw driver and supports both the i40e and ice drivers.

Hi Jeff,

Can this be fetched from somewhere? didn't see a related branch at
your trees with these bits..

Or.


Re: [PATCH net v3] net/sched: cls_api: Fix nooffloaddevcnt counter when indr block call success

2019-09-24 Thread Or Gerlitz
On Mon, Sep 23, 2019 at 5:21 PM wenxu  wrote:
> 在 2019/9/23 17:42, John Hurley 写道:
> > On Mon, Sep 23, 2019 at 5:20 AM wenxu  wrote:
> >> Hi John & Jakub
> >>
> >> There are some limitations for indirect tc callback work with  skip_sw ?
> >>
> > Hi Wenxu,
> > This is not really a limitation.
> > As Or points out, indirect block offload is not supposed to work with 
> > skip_sw.
> > Indirect offload allows us to hook onto existing kernel devices (for
> > TC events we may which to offload) that are out of the control of the
> > offload driver and, therefore, should always accept software path
> > rules.
> > For example, the vxlan driver does not implement a setup_tc ndo so it
> > does not expect to run rules in hw - it should always handle
> > associated rules in the software datapath as a minimum.
> > I think accepting skip_sw rules for devices with no in-built concept
> > of hardware offload would be wrong.
> > Do you have a use case that requires skip_sw rules for such devices?

> When we use ovs to control the tc offload. The ovs kernel already provide the 
> software
> path rules so maybe user don't want other soft path.

this (programming the rule to both the tc and ovs kernel DPs) is a choice
made by the ovs user-space code and could change later. Actually, for
complex use cases such as connection tracking, multiple tables may be
used, where the 1st packet/s would jump from hw to sw TC tables, so using
skip_sw will likely not work and we'll get bug reports. The production TC
configuration we use/recommend for overlay network e-switching is "both"
(== "none", i.e. neither skip_sw nor skip_hw).

> And with skip_sw it can be easily distinguish offloaded and non-offloaded 
> rules.

per tc rule, the kernel reports offloaded vs non-offloaded packet/byte
counts, so the info is there. At the system level, I don't think it's good
that at point A in time we blocked skip_sw rules for tunnel devices and at
point B we enabled it for no real/good reason; I vote for blocking it
again, since, as Pieter explained, allowing this is not consistent with
the kernel SW model on the receiving side.


ELOed stable kernels

2019-09-19 Thread Or Gerlitz
Hi Greg,

If this is RTFM could you please point me to the Emm

AFAIR if a stable kernel is not listed at kernel.org then it is EOL by now.

Is this correct?

thanks,

Or.


Re: [PATCH net-next] tcp: force a PSH flag on TSO packets

2019-09-19 Thread Or Gerlitz
On Thu, Sep 19, 2019 at 4:46 PM Eric Dumazet  wrote:
> On 9/19/19 5:17 AM, Or Gerlitz wrote:
> > On Wed, Sep 11, 2019 at 12:54 AM Eric Dumazet  wrote:
> >> When tcp sends a TSO packet, adding a PSH flag on it
> >> reduces the sojourn time of GRO packet in GRO receivers.
> >>
> >> This is particularly the case under pressure, since RX queues
> >> receive packets for many concurrent flows.
> >>
> >> A sender can give a hint to GRO engines when it is
> >> appropriate to flush a super-packet, especially when pacing

> > Is this correct that we add here the push flag for the tcp header template
> > from which all the tcp headers for SW GSO packets will be generated?
> > Wouldn't that cause a too early flush on GRO engines at the receiver side?

> If a TSO engine is buggy enough to add the PSH on all the segments, it needs
> to be fixed urgently :)

yeah, but I guess you were not able to test this over all the TSO HWs out
there.. so if someone complains we will have to add a quirk to disable
that, let's see..


Re: [PATCH net v3] net/sched: cls_api: Fix nooffloaddevcnt counter when indr block call success

2019-09-19 Thread Or Gerlitz
On Thu, Sep 19, 2019 at 11:39 AM  wrote:
>
> From: wenxu 
>
> A vxlan or gretap device offload through indr block methord. If the device

nit: methord --> method

> successfully bind with a real hw through indr block call, It also add
> nooffloadcnt counter. This counter will lead the rule add failed in
> fl_hw_replace_filter-->tc_setup_cb_call with skip_sw flags.

wait.. indirect tc callbacks are typically used to do hw offloading
for decap rules (tunnel key unset action) set on SW devices (gretap, vxlan).

However, AFAIK, it's been a couple of years since the kernel stopped
supporting skip_sw for such rules. Did we enable it again? when? I am
somewhat far from the details, so I copied some folks..

Or.


>
> In the tc_setup_cb_call will check the nooffloaddevcnt and skip_sw flags
> as following:
> if (block->nooffloaddevcnt && err_stop)
> return -EOPNOTSUPP;
>
> So with this patch, if the indr block call success, it will not modify
> the nooffloaddevcnt counter.
>
> Fixes: 7f76fa36754b ("net: sched: register callbacks for indirect tc block 
> binds")
> Signed-off-by: wenxu 
> ---
> v3: rebase to the net
>
>  net/sched/cls_api.c | 30 +-
>  1 file changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 32577c2..c980127 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -607,11 +607,11 @@ static void tc_indr_block_get_and_ing_cmd(struct 
> net_device *dev,
> tc_indr_block_ing_cmd(dev, block, cb, cb_priv, command);
>  }
>
> -static void tc_indr_block_call(struct tcf_block *block,
> -  struct net_device *dev,
> -  struct tcf_block_ext_info *ei,
> -  enum flow_block_command command,
> -  struct netlink_ext_ack *extack)
> +static int tc_indr_block_call(struct tcf_block *block,
> + struct net_device *dev,
> + struct tcf_block_ext_info *ei,
> + enum flow_block_command command,
> + struct netlink_ext_ack *extack)
>  {
> struct flow_block_offload bo = {
> .command= command,
> @@ -621,10 +621,15 @@ static void tc_indr_block_call(struct tcf_block *block,
> .block_shared   = tcf_block_shared(block),
> .extack = extack,
> };
> +
> INIT_LIST_HEAD(&bo.cb_list);
>
> flow_indr_block_call(dev, &bo, command);
> -   tcf_block_setup(block, &bo);
> +
> +   if (list_empty(&bo.cb_list))
> +   return -EOPNOTSUPP;
> +
> +   return tcf_block_setup(block, &bo);
>  }
>
>  static bool tcf_block_offload_in_use(struct tcf_block *block)
> @@ -681,8 +686,6 @@ static int tcf_block_offload_bind(struct tcf_block 
> *block, struct Qdisc *q,
> goto no_offload_dev_inc;
> if (err)
> goto err_unlock;
> -
> -   tc_indr_block_call(block, dev, ei, FLOW_BLOCK_BIND, extack);
> up_write(&block->cb_lock);
> return 0;
>
> @@ -691,9 +694,10 @@ static int tcf_block_offload_bind(struct tcf_block 
> *block, struct Qdisc *q,
> err = -EOPNOTSUPP;
> goto err_unlock;
> }
> +   err = tc_indr_block_call(block, dev, ei, FLOW_BLOCK_BIND, extack);
> +   if (err)
> +   block->nooffloaddevcnt++;
> err = 0;
> -   block->nooffloaddevcnt++;
> -   tc_indr_block_call(block, dev, ei, FLOW_BLOCK_BIND, extack);
>  err_unlock:
> up_write(&block->cb_lock);
> return err;
> @@ -706,8 +710,6 @@ static void tcf_block_offload_unbind(struct tcf_block 
> *block, struct Qdisc *q,
> int err;
>
> down_write(&block->cb_lock);
> -   tc_indr_block_call(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
> -
> if (!dev->netdev_ops->ndo_setup_tc)
> goto no_offload_dev_dec;
> err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
> @@ -717,7 +719,9 @@ static void tcf_block_offload_unbind(struct tcf_block 
> *block, struct Qdisc *q,
> return;
>
>  no_offload_dev_dec:
> -   WARN_ON(block->nooffloaddevcnt-- == 0);
> +   err = tc_indr_block_call(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
> +   if (err)
> +   WARN_ON(block->nooffloaddevcnt-- == 0);
> up_write(&block->cb_lock);
>  }
>
> --
> 1.8.3.1
>


Re: [PATCH RFC v3 0/5] Support fraglist GRO/GSO

2019-09-19 Thread Or Gerlitz
On Wed, Sep 18, 2019 at 2:48 PM Steffen Klassert
 wrote:
> This patchset adds support to do GRO/GSO by chaining packets
> of the same flow at the SKB frag_list pointer. This avoids
> the overhead to merge payloads into one big packet, and
> on the other end, if GSO is needed it avoids the overhead
> of splitting the big packet back to the native form.
>
> Patch 1 Enables UDP GRO by default.
>
> Patch 2 adds a netdev feature flag to enable listifyed GRO,
> this implements one of the configuration options discussed
> at netconf 2019.
[..]

The slide say that linked packets travel together though the stack.

This sounds somehow similar to the approach suggested by Ed
for skb lists. I wonder what we can say on cases where each of the
approaches would function better.

Or.


Re: [PATCH net-next] tcp: force a PSH flag on TSO packets

2019-09-19 Thread Or Gerlitz
On Wed, Sep 11, 2019 at 12:54 AM Eric Dumazet  wrote:
> When tcp sends a TSO packet, adding a PSH flag on it
> reduces the sojourn time of GRO packet in GRO receivers.
>
> This is particularly the case under pressure, since RX queues
> receive packets for many concurrent flows.
>
> A sender can give a hint to GRO engines when it is
> appropriate to flush a super-packet, especially when pacing

Hi Eric,

Is this correct that we add here the push flag for the tcp header template
from which all the tcp headers for SW GSO packets will be generated?

Wouldn't that cause a too early flush on GRO engines at the receiver side?

Or.


Re: [PATCH] net/mlx5e: Allow removing representors netdev to other namespace

2019-08-04 Thread Or Gerlitz
On Thu, Aug 1, 2019 at 3:44 AM Tonghao Zhang  wrote:
> On Wed, May 22, 2019 at 12:49 PM Or Gerlitz  wrote:
> > On Wed, May 22, 2019 at 4:26 AM Tonghao Zhang  
> > wrote:

> > > I reviewed the netronome nfp reps code; nfp doesn't set
> > > NETIF_F_NETNS_LOCAL in netdev->features.
> > > And I changed the OFED codes which used for our product environment,
> > > and then send this patch to upstream.

> > The real question here is if we can provide the required separation when
> > vport rep netdevs are put into different name-spaces -- this needs deeper
> > thinking. Technically you can do that with this one liner patch but we have
> > to see if/what assumptions could be broken as of that.

> Can we add a mode parm for allowing user to switch it off/on ?

The kernel model for namespaces means a completely new copy of the
networking stack: new routing tables, new neighbour tables, everything.
It also means netdevices in different namespaces can't communicate with
each other. I tend to think that our FW/HW model doesn't support that,
and hence we can't do proper offloading of the SW model.

I suggest you approach the current maintainers (Roi and Saeed) to see
if they have different opinion.

Or.


Re: [PATCH] net/mlx5e: Fix zero table prio set by user.

2019-07-26 Thread Or Gerlitz
On Fri, Jul 26, 2019 at 12:24 AM Saeed Mahameed  wrote:
>
> On Thu, 2019-07-25 at 19:24 +0800, we...@ucloud.cn wrote:
> > From: wenxu 
> >
> > The flow_cls_common_offload prio is zero
> >
> > It leads the invalid table prio in hw.
> >
> > Error: Could not process rule: Invalid argument
> >
> > kernel log:
> > mlx5_core :81:00.0: E-Switch: Failed to create FDB Table err -22
> > (table prio: 65535, level: 0, size: 4194304)
> >
> > table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
> > should check (chain * FDB_MAX_PRIO) + prio is not 0
> >
> > Signed-off-by: wenxu 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git
> > a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> > index 089ae4d..64ca90f 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> > @@ -970,7 +970,9 @@ static int esw_add_fdb_miss_rule(struct
>
> this piece of code isn't in this function, weird how it got to the
> diff, patch applies correctly though !
>
> > mlx5_eswitch *esw)
> >   flags |= (MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT |
> > MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
> >
> > - table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
> > + table_prio = (chain * FDB_MAX_PRIO) + prio;
> > + if (table_prio)
> > + table_prio = table_prio - 1;
> >
>
> This is black magic, even before this fix.
> this -1 seems to be needed in order to call
> create_next_size_table(table_prio) with the previous "table prio" ?
> (table_prio - 1)  ?
>
> The whole thing looks wrong to me since when prio is 0 and chain is 0,
> there is not such thing table_prio - 1.
>
> mlnx eswitch guys in the cc, please advise.

basically, prio 0 is not something we ever get in the driver, since if
user space
specifies 0, the kernel generates some random non-zero prio, and we support
only prios 1-16 -- Wenxu -- what do you run to get this error?



>
> Thanks,
> Saeed.
>
> >   /* create earlier levels for correct fs_core lookup when
> >* connecting tables


Re: [for-next V2 08/10] linux/dim: Implement rdma_dim

2019-06-26 Thread Or Gerlitz
On Wed, Jun 26, 2019 at 1:03 AM Sagi Grimberg  wrote:
>
> > +void rdma_dim(struct dim *dim, u64 completions)
> > +{
> > + struct dim_sample *curr_sample = &dim->measuring_sample;
> > + struct dim_stats curr_stats;
> > + u32 nevents;
> > +
> > + dim_update_sample_with_comps(curr_sample->event_ctr + 1,
> > +  curr_sample->pkt_ctr,
> > +  curr_sample->byte_ctr,
> > +  curr_sample->comp_ctr + completions,
> > +  &dim->measuring_sample);
>
> If this is the only caller, why add pkt_ctr and byte_ctr at all?

Hi Sagi,

Thanks for the fast review and feedback. Other than the default per
ib/rdma device setup for rdma dim / adaptive moderation, which Idan
commented on (and let's discuss it there, please), the rest of the
comments seem fine and Yamin will respond to / address them in the
coming days.

Or.


Re: [PATCH] net/mlx5e: Allow removing representors netdev to other namespace

2019-05-21 Thread Or Gerlitz
On Wed, May 22, 2019 at 4:26 AM Tonghao Zhang  wrote:

> I reviewed the reps code of netronome nfp; nfp doesn't set
> NETIF_F_NETNS_LOCAL in netdev->features.
> I changed the OFED code used in our production environment,
> and then sent this patch upstream.

The real question here is whether we can provide the required separation when
vport rep netdevs are put into different name-spaces -- this needs deeper
thinking. Technically you can do that with this one-liner patch, but we have
to see if/what assumptions could be broken because of that.


Re: [PATCH] net/mlx5e: Allow removing representors netdev to other namespace

2019-05-21 Thread Or Gerlitz
On Tue, May 21, 2019 at 7:36 AM Tonghao Zhang  wrote:
> On Tue, May 21, 2019 at 4:24 AM Or Gerlitz  wrote:
> >
> > On Mon, May 20, 2019 at 3:19 PM  wrote:
> > >
> > > From: Tonghao Zhang 
> > >
> > > At most case, we use the ConnectX-5 NIC on compute node for VMs,
> > > but we will offload forwarding rules to NICs on gateway node.
> > > On the gateway node, we will install multiple NICs and set them to
> > > different dockers which contain different net namespace, different
> > > routing table. In this way, we can specify the agent process on one
> > > docker. More dockers mean more high throughput.
> >
> > The vport (uplink and VF) representor netdev stands for the e-switch
> > side of things. If you put different
> > vport devices to different namespaces, you will not be able to forward
> > between them. It's the NIC side of things
> > (VF netdevice) which can/should be put to namespaces.
> >
> > For example, with SW veth devices, suppose we have two pairs
> > (v0,v1), (v2,v3) -- we create
> > a SW switch (linux bridge, ovs) with the uplink and v0/v2 as ports all
> > in a single name space
> > and we map v1 and v3 into application containers.
> >
> > I am missing how you can make any use of vport reps belonging to the
> > same HW e-switch on different name-spaces -- maybe send a chart?
>  +---------------------------------------------+
>  |                    host                     |
>  |     docker01               docker02         |
>  |  +--------------+      +--------------+     |
>  |  | NIC (rep/vf) |      | NIC (rep/vf) |     |
>  |  +--------------+      +--------------+     |
>  |     |        |            |        |        |
>  +-----|--------|------------|--------|--------+
>        |        |            |        |
>   phy_port0  phy_port1  phy_port2  phy_port3
>        v        v            v        v
>
> For example, there are two NICs (4 phy ports) on the host. We assign
> one NIC to docker01 (all the reps and VFs of this NIC are set to
> docker01), and the other NIC to docker02. docker01/docker02 run our
> agent, which uses the tc command to offload the rules. The NIC of
> docker01 will receive packets from phy_port1, do the QoS and NAT
> (pedit action), and then forward them to phy_port0. The NIC of
> docker02 does this in the same way.

I see, so in the case you described above, you are going to move **all** the
representors of a certain e-switch into **one** name-space -- this is something
we don't have to block. However, I think we did want to disallow moving a
sub-set of the port reps into a name-space. Should look into that.

Or.


Re: [PATCH] net/mlx5e: Allow removing representors netdev to other namespace

2019-05-20 Thread Or Gerlitz
On Mon, May 20, 2019 at 3:19 PM  wrote:
>
> From: Tonghao Zhang 
>
> At most case, we use the ConnectX-5 NIC on compute node for VMs,
> but we will offload forwarding rules to NICs on gateway node.
> On the gateway node, we will install multiple NICs and set them to
> different dockers which contain different net namespace, different
> routing table. In this way, we can specify the agent process on one
> docker. More dockers mean more high throughput.

The vport (uplink and VF) representor netdev stands for the e-switch
side of things. If you put different
vport devices to different namespaces, you will not be able to forward
between them. It's the NIC side of things
(VF netdevice) which can/should be put to namespaces.

For example, with SW veth devices, suppose we have two pairs
(v0,v1), (v2,v3) -- we create
a SW switch (linux bridge, ovs) with the uplink and v0/v2 as ports all
in a single name space
and we map v1 and v3 into application containers.

I am missing how you can make any use of vport reps belonging to the
same HW e-switch on different name-spaces -- maybe send a chart?
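The SW analogue described above can be sketched roughly as follows (a hedged setup sketch using the names v0..v3 from the text, not a tested recipe):

```shell
# Two veth pairs; v0/v2 stay in the default namespace as switch ports,
# while the v1/v3 peers are mapped into application containers (namespaces).
ip link add v0 type veth peer name v1
ip link add v2 type veth peer name v3
ip link add br0 type bridge
ip link set v0 master br0
ip link set v2 master br0
ip netns add app1 && ip link set v1 netns app1
ip netns add app2 && ip link set v3 netns app2
```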


>
> The commit abd3277287c7 ("net/mlx5e: Disallow changing name-space for VF 
> representors")
> disallow it, but we can change it now for gateway use case.
>
> Signed-off-by: Tonghao Zhang 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 91e24f1..15e932f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -1409,7 +1409,7 @@ static void mlx5e_build_rep_netdev(struct net_device 
> *netdev)
> netdev->watchdog_timeo= 15 * HZ;
>
>
> -   netdev->features |= NETIF_F_HW_TC | NETIF_F_NETNS_LOCAL;
> +   netdev->features |= NETIF_F_HW_TC;
> netdev->hw_features  |= NETIF_F_HW_TC;
>
> netdev->hw_features|= NETIF_F_SG;
> --
> 1.8.3.1
>


Re: [PATCH v2] net/mlx5e: Add bonding device for indr block to offload the packet received from bonding device

2019-05-19 Thread Or Gerlitz
On Fri, May 17, 2019 at 12:45 PM  wrote:
> From: wenxu 
>
> The mlx5e driver supports lag mode. When mlx_p0 and mlx_p1 are added to
> bond0, packets received from mlx_p0 or mlx_p1 are forwarded to vf0 by an
> ingress tc flower rule. The tc rule can't be offloaded because there is
> no indr_register_block for the bonding device.

For the bonding case, the offloading strategy is tc block sharing, namely
having user-space share the tc block of the upper device (bond) with the
lower devices (mlx5 p0 and p1).

This is implemented for example in ovs.

You can read on tc block sharing in the mlxsw driver wiki [1]

Or.

[1] https://github.com/Mellanox/mlxsw/wiki/ACLs#shared-blocks-support
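As a rough sketch of what that user-space setup looks like (device names from this thread; the block index 22 is arbitrary, and this is not a tested recipe):

```shell
# Bind one shared ingress tc block to both bond lower devices,
# then attach filters to the block instead of to a single netdev.
tc qdisc add dev mlx_p0 ingress_block 22 ingress
tc qdisc add dev mlx_p1 ingress_block 22 ingress
tc filter add block 22 protocol ip flower skip_sw dst_ip 3.3.3.3 \
    action mirred egress redirect dev vf0_rep   # vf0_rep: placeholder name
```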


Re: [PATCH mlx5-next 3/4] net/mlx5: Decrease default mr cache size

2019-03-27 Thread Or Gerlitz
On Wed, Mar 27, 2019 at 3:36 PM Leon Romanovsky  wrote:
> On Wed, Mar 27, 2019 at 01:58:17PM +0200, Or Gerlitz wrote:
> > On Wed, Mar 27, 2019 at 1:41 PM Leon Romanovsky  wrote:
> > > On Wed, Mar 27, 2019 at 12:07:54PM +0200, Or Gerlitz wrote:
> > > > On Tue, Mar 19, 2019 at 11:25 AM Leon Romanovsky  
> > > > wrote:
> > > > > From: Artemy Kovalyov 
> > > > >
> > > > > Delete initialization of high order entries in mr cache to decrease 
> > > > > initial
> > > > > memory footprint. When required, the administrator can populate the
> > > > > entries with memory keys via the /sys interface.
> > > >
> > > > Please add here:
> > > >
> > > > This approach is very helpful to reduce the per HW function memory
> > > > footprint in environments such as VMs. Before the patch we see
> > > > consumption of 0.9GB per function and after the patch about 0.1GB

here it needs to be "per physical function" and not "per function"

> > > > Lets push it into stable kernels, a Fixes tag here will cause that
> > > > to happen more easily, so please consider that.

>>> I'll add it at the "apply" stage.

>> Just to make sure, by "add it" you mean the signatures, the text and
>> the Fixes that?

> Yes, of course.

thanks

>> Also, is there any reason not to eliminate the mr cache pre-population
>> all together?

> AFAIK, pre-populated cache helps MPI application to start faster and
> with out-of-box experience. Nobody seems to care enough to challenge
> this internal assumption.

I see, we are cutting ~0.8GB here, so let's just do it and worry about the
remaining 0.1GB later..

It would be good to push it into 5.1-rc so the fallout into stables will
start right away and not only by 5.2 time -- can we do that?


Re: [PATCH mlx5-next 3/4] net/mlx5: Decrease default mr cache size

2019-03-27 Thread Or Gerlitz
On Wed, Mar 27, 2019 at 1:41 PM Leon Romanovsky  wrote:
> On Wed, Mar 27, 2019 at 12:07:54PM +0200, Or Gerlitz wrote:
> > On Tue, Mar 19, 2019 at 11:25 AM Leon Romanovsky  wrote:
> > > From: Artemy Kovalyov 
> > >
> > > Delete initialization of high order entries in mr cache to decrease 
> > > initial
> > > memory footprint. When required, the administrator can populate the
> > > entries with memory keys via the /sys interface.
> >
> > Please add here:
> >
> > This approach is very helpful to reduce the per HW function memory
> > footprint in environments such as VMs. Before the patch we see
> > consumption of 0.9GB per function and after the patch about 0.1GB
> >
> > > Signed-off-by: Artemy Kovalyov 
> > > Signed-off-by: Moni Shoua 
> > > Signed-off-by: Leon Romanovsky 
> >
> > Reported-by:  Shalom Toledo 
> > Acked-by: Or Gerlitz 
> >
> > Lets push it into stable kernels, a Fixes tag here will cause that
> > to happen more easily, so please consider that.

> I'll add it at the "apply" stage.

Just to make sure, by "add it" you mean the signatures, the text and
the Fixes tag?

Also, is there any reason not to eliminate the mr cache pre-population
altogether?

Currently we consume 100-200MB per function after the patch, which is
also problematic for some environments.


Re: [PATCH mlx5-next 3/4] net/mlx5: Decrease default mr cache size

2019-03-27 Thread Or Gerlitz
On Tue, Mar 19, 2019 at 11:25 AM Leon Romanovsky  wrote:
> From: Artemy Kovalyov 
>
> Delete initialization of high order entries in mr cache to decrease initial
> memory footprint. When required, the administrator can populate the
> entries with memory keys via the /sys interface.

Please add here:

This approach is very helpful to reduce the per HW function memory
footprint in environments such as VMs. Before the patch we see
consumption of 0.9GB per function and after the patch about 0.1GB

> Signed-off-by: Artemy Kovalyov 
> Signed-off-by: Moni Shoua 
> Signed-off-by: Leon Romanovsky 

Reported-by:  Shalom Toledo 
Acked-by: Or Gerlitz 

Let's push it into stable kernels; a Fixes tag here will cause that
to happen more easily, so please consider it.

> ---
>  .../net/ethernet/mellanox/mlx5/core/main.c| 20 ---
>  1 file changed, 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 70cc906a102b..76716419370d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -164,26 +164,6 @@ static struct mlx5_profile profile[] = {
> .size   = 8,
> .limit  = 4
> },
> -   .mr_cache[16]   = {
> -   .size   = 8,
> -   .limit  = 4
> -   },
> -   .mr_cache[17]   = {
> -   .size   = 8,
> -   .limit  = 4
> -   },
> -   .mr_cache[18]   = {
> -   .size   = 8,
> -   .limit  = 4
> -   },
> -   .mr_cache[19]   = {
> -   .size   = 4,
> -   .limit  = 2
> -   },
> -   .mr_cache[20]   = {
> -   .size   = 4,
> -   .limit  = 2
> -   },
> },
>  };
>
> --
> 2.20.1
>


Re: [PATCH net-next v2 5/5] net/mlx5e: Return -EOPNOTSUPP when attempting to offload an unsupported action

2019-02-26 Thread Or Gerlitz
On Mon, Feb 25, 2019 at 1:07 PM  wrote:

> The encapsulation is not supported for mlx5 VFs. When we try to
> offload that action, -EINVAL is returned rather than -EOPNOTSUPP.
> This patch changes the returned value to avoid confusing the user.

FWIW, note that this changes the behavior towards user-space; I don't see
concrete harm done here, but we should be aware of that.

> For example: (p2p1_0 is VF net device)
> tc filter add dev p2p1_0 protocol ip  parent : prio 1 flower skip_sw \
> src_mac e4:11:22:33:44:01   \
> action tunnel_key set   \
> src_ip 1.1.1.100\
> dst_ip 1.1.1.200\
> dst_port 4789 id 100\
> action mirred egress redirect dev vxlan0
>
> "RTNETLINK answers: Invalid argument"


Re: [PATCH net-next v2 1/5] net/mlx5e: Return -EOPNOTSUPP when modify header action zero

2019-02-26 Thread Or Gerlitz
On Mon, Feb 25, 2019 at 1:06 PM  wrote:

> When the max modify header action count is zero, we return -EOPNOTSUPP
> directly. In this way, we avoid the misleading message (e.g.
> "mlx5: parsed 0 pedit actions, can't do more").
>
> This happens when offloading pedit actions on mlx VFs.

This command should work -- we support header re-write (pedit offload)
for tc NIC rules.

Is this a CX5 VF? If yes, something broke.

> For example:
> $ tc filter add dev mlx5_vf parent : protocol ip prio 1 \
> flower skip_sw dst_mac 00:10:56:fb:64:e8 \
> dst_ip 1.1.1.100 src_ip 1.1.1.200 \
> action pedit ex munge eth src set 00:10:56:b4:5d:20


Re: [PATCH net-next 2/5] net/mlx5e: Make the log friendly when decapsulation offload not supported

2019-02-24 Thread Or Gerlitz
On Sat, Feb 23, 2019 at 9:58 AM Tonghao Zhang  wrote:
> On Fri, Feb 22, 2019 at 5:07 PM Or Gerlitz  wrote:
> > On Fri, Feb 22, 2019 at 9:49 AM Tonghao Zhang  
> > wrote:
> > > On Fri, Feb 22, 2019 at 12:32 AM Or Gerlitz  wrote:
> > > > On Thu, Feb 21, 2019 at 3:42 PM  wrote:
> > > > >
> > > > > From: Tonghao Zhang 
> > > > >
> > > > > If we try to offload decapsulation actions to VFs hw, we get the log 
> > > > > [1].
> > > >
> > > > but the switching was on the tunnel  type (if (tunnel_type == [...]) -
> > > Yes, but we try to offload tc flow to VF device. For example
> > > the  p2p1_0 is VF, but not rep


>> so this should go to the nic and not esw tc offload code path in en_tc.c

> nic and esw will call parse_cls_flower() to parse the flower flows,
> and more information is shown as below.

Ok, got you. We err at the match parsing stage; this is relatively new --
up to 5.0 we were erring only at the action parsing stage. The patch seems
fine; it would be good to modify mlx5e_netdev_kind() to return "unknown"
and not the empty string ("") when the netdev doesn't have a kind assigned
(can do this in this patch).


Re: mlx5 stable backport help

2019-02-22 Thread Or Gerlitz
On Fri, Feb 22, 2019 at 12:35 AM Saeed Mahameed  wrote:
> On Thu, 2019-02-21 at 12:21 +0200, Or Gerlitz wrote:

> So we all agree that the offending patch is "net/mlx5e: Support tunnel
> encap over tagged Ethernet" even if the issue existed before,

As you said, the issue existed before we added support for offloading
tunnels in the presence of vlan on the underlay; it has been like this since
4.12, when we introduced support for neigh update in 232c001398ae ("net/mlx5e:
Add support to neighbour update flow"), which is basically the one to blame/fix.

Since some code was moved and some code was added (e->route_dev),
backporting the patch without pulling more patches can be done as I sketched
below.

Anyway, we can maybe let it go without backporting; production environments
are typically not changing their source mac in prime time. So this can be seen
more as future-proofing.

> in order to fix the issue you will have to port not only this patch but
> the whole series which claimed to fix the issue, so the fixes tag was
> wrong.. this patch on its own is no good.

>> [1] using  e->out_dev instead of e->route_dev
>> [2] use the hunks from mlx5e_tc_tun_create_header_ipv4/6() in
>> mlx5e_create_encap_header_ipv4/6()


Re: [PATCH net-next 2/5] net/mlx5e: Make the log friendly when decapsulation offload not supported

2019-02-22 Thread Or Gerlitz
On Fri, Feb 22, 2019 at 9:49 AM Tonghao Zhang  wrote:
>
> On Fri, Feb 22, 2019 at 12:32 AM Or Gerlitz  wrote:
> >
> > On Thu, Feb 21, 2019 at 3:42 PM  wrote:
> > >
> > > From: Tonghao Zhang 
> > >
> > > If we try to offload decapsulation actions to VFs hw, we get the log [1].
> >
> > but the switching was on the tunnel  type (if (tunnel_type == [...]) -
> Yes, but we try to offload tc flow to VF device. For example
> the  p2p1_0 is VF, but not rep

so this should go to the nic and not the esw tc offload code path in en_tc.c,
and the nic path (look for parse_tc_nic_actions or the like) doesn't have
a case for the tunnel key action -- you should not have got to this code
at all. Please look deeper to realize what is going on there; maybe p2p1_0
is a rep? What does "ip -d link show" give on it?


Re: [PATCH net-next 5/5] net/mlx5e: Support enable/disable VFs link state on switchdev mode

2019-02-21 Thread Or Gerlitz
On Thu, Feb 21, 2019 at 3:42 PM  wrote:
> From: Tonghao Zhang 
>
> This patch allow users to enable/disable VFs link state
> on switchdev mode.

NAK

We do it with the reps: if you change the administrative link state of a VF rep,
it will affect the operational link state of the VF, see upstream commit

20a1ea674783 net/mlx5e: Support VF vport link state control for SRIOV
switchdev mode


Re: [PATCH net-next 2/5] net/mlx5e: Make the log friendly when decapsulation offload not supported

2019-02-21 Thread Or Gerlitz
On Thu, Feb 21, 2019 at 3:42 PM  wrote:
>
> From: Tonghao Zhang 
>
> If we try to offload decapsulation actions to VFs hw, we get the log [1].

but the switching was on the tunnel type (if (tunnel_type == [...]) --
what rules caused you to get here?
What was the ingress device and what was the egress (mirred) device?


> It's not friendly, because the kind of net device is null, and we don't
> know what '0' means.
>
> [1] "mlx5_core :05:01.2 vf_0: decapsulation offload is not supported for  
> net device (0)"


Re: [PATCH net-next 1/5] net/mlx5e: Return -EOPNOTSUPP when modify header action zero

2019-02-21 Thread Or Gerlitz
On Thu, Feb 21, 2019 at 3:42 PM  wrote:
> else /* namespace is MLX5_FLOW_NAMESPACE_KERNEL - NIC offloading */
> max_actions = MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, 
> max_modify_header_actions);
>
> +   if (!max_actions) {
> +   NL_SET_ERR_MSG_MOD(extack,
> +  "don't support pedit actions, can't 
> offload");
> +   netdev_warn(priv->netdev, "don't support pedit actions, can't 
> offload\n");

It's not going to work if we keep filling the driver with duplicated
error messages -- stick to extack only.

Also, when you respin with the comments provided on this submission,
please also send a cover letter, which is going to be
"[PATCH net-next 00/05] ..." -- use --cover-letter for git format-patch
and edit it after creation.


Re: mlx5 stable backport help

2019-02-21 Thread Or Gerlitz
On Wed, Feb 20, 2019 at 10:03 PM Saeed Mahameed  wrote:
> On Tue, 2019-02-19 at 22:36 -0800, David Miller wrote:

>> I need help backporting two changes for -stable.  Namely:
> Hi Dave Thanks for trying,

here too

> > 6707f74be8621ae067d2cf1c4485900e2742c20f ("net/mlx5e: Update hw flows
> > when encap source mac changed")
>
> The fixes tag of the above commit is wrong and it should have been
>
> commit e32ee6c78efa6a32bff782bbe7a9970b018996ca
> Author: Eli Britstein 
> Date:   Mon Dec 3 17:09:54 2018 +0200
>
> net/mlx5e: Support tunnel encap over tagged Ethernet
>
> Before this patch we didn't have the support for "tunnel encap over
> tagged Ethernet" which introduced the whole route_dev logic to generate
> the encap headers, so the patch should only exist in -rc, let's just skip it.

Hi Saeed,

The issue exists prior to the commit you mentioned; we have had this code
since day one of tc tunnel offloads support, which lacked the neigh update
flow:

linux-stable.git]# git grep -pe "eth->h_source"
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c=static void
gen_vxlan_header_ipv4(struct net_device *out_dev,
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:
ether_addr_copy(eth->h_source, out_dev->dev_addr);
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c=static void
gen_vxlan_header_ipv6(struct net_device *out_dev,
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:
ether_addr_copy(eth->h_source, out_dev->dev_addr);

Later, when 232c001398ae ("net/mlx5e: Add support to neighbour update
flow") was done, we fell short and did not address the case of a source
address change. Until the commit you mentioned we didn't have the
out_dev vs route_dev notion, but the problem exists, and it's possible
to backport the patch by

[1] using  e->out_dev instead of e->route_dev
[2] use the hunks from mlx5e_tc_tun_create_header_ipv4/6() in
mlx5e_create_encap_header_ipv4/6()

Or.


Re: [PATCH net v5 1/2] net/mlx5e: Update hw flows when encap source mac changed

2019-02-05 Thread Or Gerlitz
On Wed, Jan 30, 2019 at 12:20 AM Saeed Mahameed  wrote:
>
> On Mon, 2019-01-28 at 15:28 -0800, xiangxia.m@gmail.com wrote:
> > From: Tonghao Zhang 
> >
> > When we offload tc filters to hardware, hardware flows can
> > be updated when mac of encap destination ip is changed.
> > But we ignore one case, that the mac of local encap ip can
> > be changed too, so we should also update them.
> >
> > To fix it, add route_dev in mlx5e_encap_entry struct to save
> > the local encap netdevice, and when mac changed, kernel will
> > flush all the neighbour on the netdevice and send
> > NETEVENT_NEIGH_UPDATE
> > event. The mlx5 driver will delete the flows and add them when
> > neighbour
> > available again.
> >
> > Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update
> > flow")
> > Cc: Hadar Hen Zion 
> > Signed-off-by: Tonghao Zhang 
> > Reviewed-by: Or Gerlitz 
> >
> >
>
> Thank you Tonghao and Or.
>
> Acked-by: Saeed Mahameed 

Dave, I see a copy of the patch on patchwork [1] with the status being
"not applicable" -- what is missing here? The patch should apply, AFAIK.
There was also another patch reviewed here by me and Saeed [2]; not
sure where it stands on your side.

[1] https://patchwork.ozlabs.org/patch/1023844/
[2] https://marc.info/?l=linux-netdev&m=154878514916414&w=2


Re: [PATCH 02/12 net-next,v7] net/mlx5e: support for two independent packet edit actions

2019-02-02 Thread Or Gerlitz
On Sat, Feb 2, 2019 at 5:19 PM Tonghao Zhang  wrote:

> The patch [1] is ready too. David may decide which one will be applied
> firstly. and other is rebased ?.

Your patch is for net, net-next is rebased over net

> [1] http://patchwork.ozlabs.org/patch/1032952/




Re: [PATCH net-next] net/mlx5e: Fix code style issue in mlx driver

2019-01-31 Thread Or Gerlitz
On Thu, Jan 31, 2019 at 10:49 PM Or Gerlitz  wrote:
>
> On Thu, Jan 31, 2019 at 4:11 PM  wrote:
> > From: Tonghao Zhang 
> >
> > Add the tab before '}' and keep the code style consistent.
> >
> > Signed-off-by: Tonghao Zhang 
>
> LGTM

> Reviewed-by: Or Gerlitz 

Oops -- for files starting with the en_ prefix we use the net/mlx5e: prefix
for the patch title (ethernet), and for the others net/mlx5: (core). Please
fix and re-send, add my R.B., and also make sure to copy Dave Miller.


Re: [PATCH net-next] net/mlx5e: Fix code style issue in mlx driver

2019-01-31 Thread Or Gerlitz
On Thu, Jan 31, 2019 at 4:11 PM  wrote:
> From: Tonghao Zhang 
>
> Add the tab before '}' and keep the code style consistent.
>
> Signed-off-by: Tonghao Zhang 

LGTM

Reviewed-by: Or Gerlitz 


Re: [PATCH net v5 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-29 Thread Or Gerlitz
On Tue, Jan 29, 2019 at 8:05 PM  wrote:
> From: Tonghao Zhang 
>
> In some case, we may use multiple pedit actions to modify packets.
> The command shown as below: the last pedit action is effective.
>
> $ tc filter add dev netdev_rep parent : protocol ip prio 1\
> flower skip_sw ip_proto icmp dst_ip 3.3.3.3\
> action pedit ex munge ip dst set 192.168.1.100 pipe\
> action pedit ex munge eth src set 00:00:00:00:00:01 pipe\
> action pedit ex munge eth dst set 00:00:00:00:00:02 pipe\
> action csum ip pipe\
> action tunnel_key set src_ip 1.1.1.100 dst_ip 1.1.1.200 dst_port 4789 
> id 100 \
> action mirred egress redirect dev vxlan0
>
> To fix it, we add max_mod_hdr_actions to mlx5e_tc_flow_parse_attr struction,
> max_mod_hdr_actions will store the max pedit action number we support and
> num_mod_hdr_actions indicates how many pedit action we used, and store all
> pedit action to mod_hdr_actions.
>
> Fixes: d79b6df6b10a ("net/mlx5e: Add parsing of TC pedit actions to HW 
> format")
> Cc: Or Gerlitz 
> Signed-off-by: Tonghao Zhang 
> Reviewed-by: Or Gerlitz 
> ---
> v3: Remove the unnecessary init.
> v2: Fix comment message and change tag from net-next to net.

Same here, this is good to go upstream, well done Tonghao!


Re: [PATCH net v5 1/2] net/mlx5e: Update hw flows when encap source mac changed

2019-01-29 Thread Or Gerlitz
On Tue, Jan 29, 2019 at 8:05 PM  wrote:
>
> From: Tonghao Zhang 
>
> When we offload tc filters to hardware, hardware flows can
> be updated when mac of encap destination ip is changed.
> But we ignore one case, that the mac of local encap ip can
> be changed too, so we should also update them.
>
> To fix it, add route_dev in mlx5e_encap_entry struct to save
> the local encap netdevice, and when mac changed, kernel will
> flush all the neighbour on the netdevice and send NETEVENT_NEIGH_UPDATE
> event. The mlx5 driver will delete the flows and add them when neighbour
> available again.
>
> Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
> Cc: Hadar Hen Zion 
> Signed-off-by: Tonghao Zhang 
> Reviewed-by: Or Gerlitz 

This is good to go upstream, well done Tonghao!


Re: [PATCH net v4 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-29 Thread Or Gerlitz
On Mon, Jan 28, 2019 at 6:18 PM Tonghao Zhang  wrote:
> On Mon, Jan 28, 2019 at 11:34 PM Or Gerlitz  wrote:
> > On Mon, Jan 28, 2019 at 2:10 PM Tonghao Zhang  
> > wrote:
> > > On Mon, Jan 28, 2019 at 12:40 AM Or Gerlitz  wrote:
> > > >
> > > > On Sun, Jan 27, 2019 at 1:06 PM  wrote:
> > > > > From: Tonghao Zhang 
> > > > >
> > > > > In some case, we may use multiple pedit actions to modify packets.
> > > > > The command shown as below: the last pedit action is effective.
> > > >
> > > > > @@ -2073,7 +2076,8 @@ static int alloc_mod_hdr_actions(struct 
> > > > > mlx5e_priv *priv,
> > > > > if (!parse_attr->mod_hdr_actions)
> > > > > return -ENOMEM;
> > > > >
> > > > > -   parse_attr->num_mod_hdr_actions = max_actions;
> > > > > +   parse_attr->max_mod_hdr_actions = max_actions;
> > > > > +   parse_attr->num_mod_hdr_actions = 0;
> > > >
> > > > why would we want to do this zeroing? what purpose does it serve?
> > > Because we use the num_mod_hdr_actions to store the number of actions
> > > we have parsed,
> > > and when we alloc it, we init it 0 as default.
> > >
> > > > On a probably related note, I suspect that the patch broke the caching
> > > > we do for modify header contexts, see mlx5e_attach_mod_hdr where we
> > > > look if a given set of modify header operations already has hw modify 
> > > > header
> > > > context and we use it.
> > > >
> > > > To test that, put two tc rules with different matching but same set of
> > > > modify header
> > > > (pedit) actions and see that only one modify header context is used.
> >
> > > The patch does't break the cache, I think that different matching may
> > > share the same set of pedit actions.
> >
> > I suspect it does break it.. at [1]  we have this code for the cache lookup:
> >
> > num_actions  = parse_attr->num_mod_hdr_actions;
> > [..]
> > key.actions = parse_attr->mod_hdr_actions;
> > key.num_actions = num_actions;
> >
> > hash_key = hash_mod_hdr_info(&key);
> >
> > so we are doing the cached insertion and lookup with
> > parse_attr->num_mod_hdr_actions
> > which was zeroed along the way and not accounting for the full set of
> > pedit actions, agree?

> not really, before calling mlx5e_attach_mod_hdr we call
> offload_pedit_fields, which updates attr->num_mod_hdr_actions to
> indicate how many pedit actions we parsed.

ok, got you, so why do we need this line

 parse_attr->num_mod_hdr_actions = 0;

in alloc_mod_hdr_actions()? This should already be zeroed by kzalloc
somewhere; it only got to confuse me..

I suggest removing this zeroing; otherwise the patch LGTM, once you fix it

Reviewed-by: Or Gerlitz 


Re: [PATCH net v4 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-28 Thread Or Gerlitz
On Mon, Jan 28, 2019 at 2:10 PM Tonghao Zhang  wrote:
> On Mon, Jan 28, 2019 at 12:40 AM Or Gerlitz  wrote:
> >
> > On Sun, Jan 27, 2019 at 1:06 PM  wrote:
> > > From: Tonghao Zhang 
> > >
> > > In some case, we may use multiple pedit actions to modify packets.
> > > The command shown as below: the last pedit action is effective.
> >
> > > @@ -2073,7 +2076,8 @@ static int alloc_mod_hdr_actions(struct mlx5e_priv 
> > > *priv,
> > > if (!parse_attr->mod_hdr_actions)
> > > return -ENOMEM;
> > >
> > > -   parse_attr->num_mod_hdr_actions = max_actions;
> > > +   parse_attr->max_mod_hdr_actions = max_actions;
> > > +   parse_attr->num_mod_hdr_actions = 0;
> >
> > why would we want to do this zeroing? what purpose does it serve?
> Because we use the num_mod_hdr_actions to store the number of actions
> we have parsed,
> and when we alloc it, we init it 0 as default.
>
> > On a probably related note, I suspect that the patch broke the caching
> > we do for modify header contexts, see mlx5e_attach_mod_hdr where we
> > look if a given set of modify header operations already has hw modify header
> > context and we use it.
> >
> > To test that, put two tc rules with different matching but same set of
> > modify header
> > (pedit) actions and see that only one modify header context is used.

> The patch does't break the cache, I think that different matching may
> share the same set of pedit actions.

I suspect it does break it.. at [1]  we have this code for the cache lookup:

num_actions  = parse_attr->num_mod_hdr_actions;
[..]
key.actions = parse_attr->mod_hdr_actions;
key.num_actions = num_actions;

hash_key = hash_mod_hdr_info(&key);

so we are doing the cached insertion and lookup with
parse_attr->num_mod_hdr_actions, which was zeroed along the way and is not
accounting for the full set of pedit actions -- agree?

[1] 
https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c#L179


Re: [PATCH net v4 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-27 Thread Or Gerlitz
On Sun, Jan 27, 2019 at 1:06 PM  wrote:
> From: Tonghao Zhang 
>
> In some case, we may use multiple pedit actions to modify packets.
> The command shown as below: the last pedit action is effective.

> @@ -2073,7 +2076,8 @@ static int alloc_mod_hdr_actions(struct mlx5e_priv 
> *priv,
> if (!parse_attr->mod_hdr_actions)
> return -ENOMEM;
>
> -   parse_attr->num_mod_hdr_actions = max_actions;
> +   parse_attr->max_mod_hdr_actions = max_actions;
> +   parse_attr->num_mod_hdr_actions = 0;

why would we want to do this zeroing? what purpose does it serve?

On a probably related note, I suspect that the patch broke the caching
we do for modify header contexts, see mlx5e_attach_mod_hdr where we
look if a given set of modify header operations already has hw modify header
context and we use it.

To test that, put two tc rules with different matching but same set of
modify header
(pedit) actions and see that only one modify header context is used.
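A hedged sketch of the suggested check (device names and addresses are placeholders, not from the thread):

```shell
# Two rules with different matches but an identical pedit action set;
# with caching intact they should share one HW modify-header context.
tc filter add dev rep0 ingress protocol ip prio 1 flower skip_sw \
    dst_ip 1.1.1.1 \
    action pedit ex munge eth src set 00:11:22:33:44:55 pipe \
    action mirred egress redirect dev rep1
tc filter add dev rep0 ingress protocol ip prio 2 flower skip_sw \
    dst_ip 2.2.2.2 \
    action pedit ex munge eth src set 00:11:22:33:44:55 pipe \
    action mirred egress redirect dev rep1
```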


Re: [PATCH net v4 1/2] net/mlx5e: Update hw flows when encap source mac changed

2019-01-27 Thread Or Gerlitz
On Sun, Jan 27, 2019 at 1:06 PM  wrote:
> From: Tonghao Zhang 
>
> When we offload tc filters to hardware, hardware flows can
> be updated when mac of encap destination ip is changed.
> But we ignore one case, that the mac of local encap ip can
> be changed too, so we should also update them.
>
> To fix it, add route_dev in mlx5e_encap_entry struct to save
> the local encap netdevice, and when mac changed, kernel will
> flush all the neighbour on the netdevice and send NETEVENT_NEIGH_UPDATE
> event. The mlx5 driver will delete the flows and add them when neighbour
> available again.
>
> Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
> Cc: Hadar Hen Zion 
> Signed-off-by: Tonghao Zhang 
> Reviewed-by: Or Gerlitz 

+1 again

this one is good to go upstream


Re: [PATCH net-next v3 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-25 Thread Or Gerlitz
On Tue, Jan 22, 2019 at 4:38 PM  wrote:
> From: Tonghao Zhang 
>
> In some case, we may use multiple pedit actions to modify packets.

just to clarify, this is a matter of choice, right? in other words, it
doesn't buy you extra functionality

> The command shown as below: the last pedit action is effective.

and we are leaking some memory / HW (modify header instance) objects
in the driver because of that, right? this makes your patch a fix which
should be labeled "net" and not "net-next"; same goes for the neigh update
fix, would be good if you repost them with net labeling.


Re: [PATCH net-next v3 2/2] net/mlx5e: Don't overwrite pedit action when multiple pedit used

2019-01-24 Thread Or Gerlitz
On Tue, Jan 22, 2019 at 4:38 PM  wrote:
>
> From: Tonghao Zhang 
>
> In some case, we may use multiple pedit actions to modify packets.
> The command shown as below: the last pedit action is effective.
>
> $ tc filter add dev netdev_rep parent : protocol ip prio 1  \
> flower skip_sw ip_proto icmp dst_ip 3.3.3.3 \
> action pedit ex munge ip dst set 192.168.1.100 pipe \
> action pedit ex munge eth src set 00:00:00:00:00:01 pipe\
> action pedit ex munge eth dst set 00:00:00:00:00:02 pipe\
> action csum ip pipe \
> action tunnel_key set src_ip 1.1.1.100 dst_ip 1.1.1.200 dst_port 4789 id 100 \
> action mirred egress redirect dev vxlan0
>
> To fix it, we add max_mod_hdr_actions to mlx5e_tc_flow_parse_attr struction,
> max_mod_hdr_actions will store the max pedit action number we support and
> num_mod_hdr_actions indicates how many pedit action we used, and store all
> pedit action to mod_hdr_actions.

There's a comment just above offload_pedit_fields saying:

/* On input attr->num_mod_hdr_actions tells how many HW actions can be parsed at
 * max from the SW pedit action. On success, it says how many HW actions were
 * actually parsed.
 */

I guess it will have to change now, right?


Re: [PATCH] net: mlx5: allow default ip_proto to offload

2019-01-24 Thread Or Gerlitz
On Sat, Jan 19, 2019 at 10:50 AM Tonghao Zhang  wrote:
> On Fri, Jan 18, 2019 at 10:19 PM Or Gerlitz  wrote:
> > On Thu, Jan 17, 2019 at 3:34 PM Tonghao Zhang  
> > wrote:
> > > On Thu, Jan 17, 2019 at 8:58 PM Or Gerlitz  wrote:
> > > > On Thu, Jan 17, 2019 at 11:28 AM  wrote:
> > > > > From: Tonghao Zhang 
> >
> > > with this patch, run the command [2], we will not get err log,
> > > and the filter work in hw.
> >
> > This whole thing is done for a reason which is the inability of the current 
> > HW
> > to adjust checksum/crc for few L3 protocols. Such adjustment is needed if
> > you modify some fields of L3 headers, e.g re-write src/dst IP address.
> I got it, thanks
> > > We should consider ip_proto == 0, in some case, we only
> > > modify dest ip or src ip.
> >
> > we can't let it go without clear matching on the ip protocol, as I explained
> > above. With my proposed patch you will be able to NAT many more protocols
> > (all of them except for three, and we're working to reduce that), but
> > you still need a tc rule per ip proto
>
>
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > index 608025ca5c04..affb523e0e35 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > @@ -2167,11 +2167,11 @@ static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
> > }
> >
> > ip_proto = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ip_protocol);
> > -   if (modify_ip_header && ip_proto != IPPROTO_TCP &&
> > -   ip_proto != IPPROTO_UDP && ip_proto != IPPROTO_ICMP) {
> > +   if (modify_ip_header && (ip_proto == IPPROTO_ICMPV6 ||
> > +   ip_proto == IPPROTO_SCTP || ip_proto == IPPROTO_UDPLITE)) {
> > NL_SET_ERR_MSG_MOD(extack,
> > -  "can't offload re-write of non TCP/UDP");
> > -   pr_info("can't offload re-write of ip proto %d\n", ip_proto);
> > +  "can't offload this re-write of IP addresses");
> > +   pr_info("can't offload re-write of IP addrs for ip proto %d\n", ip_proto);
> > return false;
> > }

> This patch work for me too, because ip_proto == 0 will not return err(
> and my patch allow ip_proto == 0 and not return err) and will you send
> it to net-next ? because i can't find it in net-next.

basically, I was planning to upstream it on this cycle, but your
comment below is
something I need to look at

> and one question, In your patch, should we check ip_proto is valid ?
> for example, ip_proto == 18,  is not valid protocol.

yeah, this becomes a bit ugly, I need to see how to address that

> flower ip_proto 18


Re: [PATCH net-next v3 1/2] net/mlx5e: Update hw flows when encap source mac changed

2019-01-24 Thread Or Gerlitz
On Tue, Jan 22, 2019 at 4:38 PM  wrote:
> From: Tonghao Zhang 
>
> When we offload tc filters to hardware, hardware flows can
> be updated when mac of encap destination ip is changed.
> But we ignore one case, that the mac of local encap ip can
> be changed too, so we should also update them.
>
> To fix it, add route_dev in mlx5e_encap_entry struct to save
> the local encap netdevice, and when mac changed, kernel will
> flush all the neighbour on the netdevice and send NETEVENT_NEIGH_UPDATE
> event. The mlx5 driver will delete the flows and add them when neighbour
> available again.
>
> Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
> Cc: Hadar Hen Zion 
> Signed-off-by: Tonghao Zhang 

LGTM

Reviewed-by: Or Gerlitz 


Re: [PATCH net-next v2 1/2] net/mlx5e: Update hw flows when encap source mac changed

2019-01-21 Thread Or Gerlitz
On Mon, Jan 21, 2019 at 1:20 PM  wrote:
>
> From: Tonghao Zhang 
>
> When we offload tc filters to hardware, hardware flows can
> be updated when mac of encap destination ip is changed.
> But we ignore one case, that the mac of local encap ip can
> be changed too, so we should also update them.
>
> To fix it, add route_dev in mlx5e_encap_entry struct to save
> the local encap netdevice, and when mac changed, kernel will
> flush all the neighbour on the netdevice and send NETEVENT_NEIGH_UPDATE
> event. The mlx5 driver will delete the flows and add them when neighbour
> available again.
>
> Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
> Cc: Hadar Hen Zion 
> Signed-off-by: Tonghao Zhang 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c | 1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c| 4 
>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.h| 1 +
>  3 files changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> index 046948e..156b2b3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> @@ -256,6 +256,7 @@ int mlx5e_tc_tun_create_header_ipv4(struct mlx5e_priv *priv,

please address the ipv6 tunnel path as well

> e->m_neigh.family = n->ops->family;
> memcpy(&e->m_neigh.dst_ip, n->primary_key, n->tbl->key_len);
> e->out_dev = out_dev;
> +   e->route_dev = route_dev;



>
> /* It's important to add the neigh to the hash table before checking
>  * the neigh validity state. So if we'll get a notification, in case the
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 96cc0c6..c4b9073 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -594,6 +594,10 @@ static void mlx5e_rep_update_flows(struct mlx5e_priv *priv,
> if (neigh_connected && !(e->flags & MLX5_ENCAP_ENTRY_VALID)) {
> ether_addr_copy(e->h_dest, ha);
> ether_addr_copy(eth->h_dest, ha);
> +   /* Update the encap source mac, in case that we delete
> +* the flows when encap source mac changed.
> +*/
> +   ether_addr_copy(eth->h_source, e->route_dev->dev_addr);

there must be some other location in the code which currently sets
the source mac. I need to look deeper, but I suspect it has to be removed
and only the one you added should stay


Re: [PATCH] net: mlx5: allow default ip_proto to offload

2019-01-18 Thread Or Gerlitz
On Thu, Jan 17, 2019 at 3:34 PM Tonghao Zhang  wrote:
> On Thu, Jan 17, 2019 at 8:58 PM Or Gerlitz  wrote:
> > On Thu, Jan 17, 2019 at 11:28 AM  wrote:
> > > From: Tonghao Zhang 

> with this patch, run the command [2], we will not get err log,
> and the filter work in hw.

This whole thing is done for a reason which is the inability of the current HW
to adjust checksum/crc for few L3 protocols. Such adjustment is needed if
you modify some fields of L3 headers, e.g re-write src/dst IP address.

> We should consider ip_proto == 0, in some case, we only
> modify dest ip or src ip.

we can't let it go without clear matching on the ip protocol, as I explained
above. With my proposed patch you will be able to NAT many more protocols
(all of them except for three, and we're working to reduce that), but
you still need a tc rule per ip proto
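To sketch the difference between the current allowlist and the proposed denylist (the constants are IANA protocol numbers; the function names are illustrative, not the kernel's):

```python
# IANA IP protocol numbers used below.
IPPROTO_ICMP, IPPROTO_TCP, IPPROTO_UDP = 1, 6, 17
IPPROTO_GRE, IPPROTO_ICMPV6, IPPROTO_SCTP, IPPROTO_UDPLITE = 47, 58, 132, 136

def can_offload_old(modify_ip_header, ip_proto):
    # Current check: allowlist - IP header re-write only for TCP/UDP/ICMP.
    return not (modify_ip_header and
                ip_proto not in (IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMP))

def can_offload_new(modify_ip_header, ip_proto):
    # Proposed check: denylist - reject only the protocols for which the
    # HW cannot adjust the L4 checksum after the re-write.
    return not (modify_ip_header and
                ip_proto in (IPPROTO_ICMPV6, IPPROTO_SCTP, IPPROTO_UDPLITE))

# NAT of a GRE flow: rejected by the old check, allowed by the new one.
assert not can_offload_old(True, IPPROTO_GRE)
assert can_offload_new(True, IPPROTO_GRE)
# Note the new check also passes ip_proto == 0 (no protocol matched) -
# which is why a clear per-protocol match is still required in the rule.
assert can_offload_new(True, 0)
```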


Re: [PATCH] net: mlx5: update hw flows when encap source mac changed

2019-01-17 Thread Or Gerlitz
On Fri, Jan 11, 2019 at 4:19 PM  wrote:
>
> From: Tonghao Zhang 

We have common prefix, please use it:

net/mlx5e: Update offloaded flows when encap source mac changed

> When we offload tc filters to hardware, hardware flows can
> be updated when mac of encap destination ip is changed.
> But we ignore one case, that the mac of local encap ip can
> be changed too, so we should also update them.
>
> To fix it, add route_dev in mlx5e_encap_entry struct to save
> the local encap netdevice, and when mac changed, check it and
> update flows.

The problem seems to exist and the solution seems to be proper

I will look into it later, please resend with [PATCH net-next] and
a few nit fixes

> Fixs: 232c00139 ("net/mlx5e: Add support to neighbour update flow")

should be Fixes: and get 12 hex digits (git log --oneline --abbrev=12)


Re: [PATCH] net: mlx5: allow default ip_proto to offload

2019-01-17 Thread Or Gerlitz
On Thu, Jan 17, 2019 at 11:28 AM  wrote:
> From: Tonghao Zhang 

with the current code, if a modification of the ip header is required
and the ip protocol is not one of tcp, udp or icmp - we err

This is done in purpose, and we don't want to allow offloading
this header re-write for unknown ip protocol

> Allow default ip_proto to offload, so icmp, tcp, and udp
> will match the flow as show below, otherwise we must type the
> ip_proto for icmp, tcp and udp respectively.
>
> $ tc filter add dev netdev01_rep parent : protocol ip prio 1 \
> flower skip_sw dst_ip 3.3.3.3 \
> action pedit ex munge ip dst set 192.168.1.100 pipe \
> action csum ip pipe \
> action mirred egress redirect dev netdev02_rep

this flow specifies the ip protocol (1, which is icmp)

> Signed-off-by: Tonghao Zhang 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index 2ee377a..2a29428 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -2219,7 +2219,7 @@ static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
> }
>
> ip_proto = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ip_protocol);
> -   if (modify_ip_header && ip_proto != IPPROTO_TCP &&
> +   if (modify_ip_header && ip_proto != 0 && ip_proto != IPPROTO_TCP &&
> ip_proto != IPPROTO_UDP && ip_proto != IPPROTO_ICMP) {
> NL_SET_ERR_MSG_MOD(extack,
>"can't offload re-write of non TCP/UDP");

but under your patch we will not err for unknown ip protocol

I have lost you

BTW - I plan to change the polarity of the check here with the below
patch - please see if it helps your use-case:

commit 09b972083266fa8cfe2f24e1c264905d5cd021ed
Author: Or Gerlitz 
Date:   Wed Oct 31 18:42:21 2018 +0200

net/mlx5e: Allow TC offload of IP header re-write for more protocols

So far we allowed re-writing of IP header only for three protocols
(tcp, udp and icmpv4). This is too limiting, e.g for cases where
users want to apply offloading of NAT.

Take a complementary approach and allow this for a wider set of IP
protocols -- all of them except for three (icmpv6, sctp and udp-lite).
For these protos the current HW isn't capable to properly adjust the
l4 checksum while doing the modification <--- UPDATE - we probably
can do icmpv6

Signed-off-by: Or Gerlitz 


diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 608025ca5c04..affb523e0e35 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2167,11 +2167,11 @@ static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
}

ip_proto = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ip_protocol);
-   if (modify_ip_header && ip_proto != IPPROTO_TCP &&
-   ip_proto != IPPROTO_UDP && ip_proto != IPPROTO_ICMP) {
+   if (modify_ip_header && (ip_proto == IPPROTO_ICMPV6 ||
+   ip_proto == IPPROTO_SCTP || ip_proto == IPPROTO_UDPLITE)) {
NL_SET_ERR_MSG_MOD(extack,
-  "can't offload re-write of non TCP/UDP");
-   pr_info("can't offload re-write of ip proto %d\n", ip_proto);
+  "can't offload this re-write of IP addresses");
+   pr_info("can't offload re-write of IP addrs for ip proto %d\n", ip_proto);
return false;
}


Re: [PATCH iproute-next] tc: m_tunnel_key: Allow key-less tunnels

2019-01-13 Thread Or Gerlitz
On Thu, Jan 10, 2019 at 3:05 PM Adi Nissim  wrote:
> Change the id parameter of the tunnel_key set action from mandatory to
> optional.
> Some tunneling protocols (e.g. GRE) specify the id as an optional field.

Hi,

The kernel code for the tunnel key actions allows keyless tunnels as
of 5.0 (commit
80ef0f22ceda "net/sched: act_tunnel_key: Allow key-less tunnels") - so this one
here is actually for iproute and not iproute-next, Stephen, could you look?

Or.


Re: [PATCH net-next,v6 00/12] add flow_rule infrastructure

2018-12-20 Thread Or Gerlitz
On Thu, Dec 20, 2018 at 2:35 PM Pablo Neira Ayuso  wrote:
> On Wed, Dec 19, 2018 at 04:26:53PM -0800, Jakub Kicinski wrote:

> > I'm confused, could you rephrase?  How does you work help such devices?
> > How is tc not suitable for them?

> There are two HW offload usecases:
>
> #1 Policy resides in software, CPU host sees initial packets, based on
>policy, you place flows into hardware via nf_flow_table infrastructure.
>This usecase is fine in your NIC since you assume host CPU can cope
>with policy in software for these few initial packets of the flow.
>However, switches usually have a small CPU to run control plane
>software only. There we _cannot_ use this approach.
>
> #2 Policy resides in hardware. For the usecase of switches with small
>CPU, the ACL is deployed in hardware. We use the host CPU to run
>control plane configurations only.
>
> This patchset _is not_ related to #1, this patchset _is_ related to #2.

confused, isn't this patch set related to connection tracking offloads
on modern NIC HWs?

> So far, there is infrastructure in Netfilter to do #1, it should be
> possible to use it from TC too. In TC, there is infrastructure for #2
> which can be reused from Netfilter.


Re: linux-next: manual merge of the net-next tree with the net tree

2018-12-19 Thread Or Gerlitz
On Thu, Dec 20, 2018 at 4:47 AM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>
> between commit:
>
>   8956f0014ea5 ("net/mlx5e: Fix default amount of channels for VF representors")
>
> from the net tree and commit:
>
>   d9ee0491c2ff ("net/mlx5e: Use dedicated uplink vport netdev representor")
>
> from the net-next tree.
>
> I fixed it up (I just used the net-next tree version) and can carry the
> fix as necessary.  [..]

Yes, this is correct, thanks!


Re: [net-next 05/11] net/mlx5e: Fail attempt to offload e-switch TC flows with egress upper devices

2018-12-19 Thread Or Gerlitz
On Thu, Dec 20, 2018 at 12:16 AM Saeed Mahameed  wrote:
>
> From: Eli Britstein 
>
> We use the switchdev parent HW id helper to identify if the mirred device
> shares the same ASIC/port with the ingress device. This can get us wrong in
> the presence of upper devices (e.g vlan, team, etc) set over the HW devices
> (VF or uplink representors), b/c the switchdev ID is retrieved recursively.
>
> To fail offload attempts in such cases, we condition the check on the egress
> device to have not only the same switchdev ID but also the relevant mlx5 
> netdev ops.

The patch was made before we pushed vf lag.. need to fix it up a bit:

net/mlx5e: Fail attempt to offload e-switch TC flows with egress vlan devices

We use the switchdev parent HW id helper to identify if the mirred
device shares the same ASIC/port with the ingress device. This can get
us wrong in the presence of vlan upper devices set over the HW devices
(VF or uplink representors), b/c the switchdev ID is retrieved
recursively. To fail offload attempts in such cases, we condition the
check on the egress device to have not only the same switchdev ID but
also the relevant mlx5 netdev ops.


>
> Fixes: 03a9d11e6eeb ('net/mlx5e: Add TC drop and mirred/redirect action parsing for SRIOV offloads')
> Signed-off-by: Eli Britstein 
> Reviewed-by: Roi Dayan 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 9 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.h | 3 +++
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c  | 3 +++
>  3 files changed, 15 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 91c3eb85f32e..f414f19c1159 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -1316,6 +1316,15 @@ static const struct net_device_ops mlx5e_netdev_ops_uplink_rep = {
> .ndo_get_vf_stats= mlx5e_get_vf_stats,
>  };
>
> +bool mlx5e_eswitch_rep(struct net_device *netdev)
> +{
> +   if (netdev->netdev_ops == &mlx5e_netdev_ops_vf_rep ||
> +   netdev->netdev_ops == &mlx5e_netdev_ops_uplink_rep)
> +   return true;
> +
> +   return false;
> +}
> +
>  static void mlx5e_build_rep_params(struct net_device *netdev)
>  {
> struct mlx5e_priv *priv = netdev_priv(netdev);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
> index 5645d3cef1bb..edd722824697 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
> @@ -176,6 +176,9 @@ void mlx5e_rep_encap_entry_detach(struct mlx5e_priv *priv,
>   struct mlx5e_encap_entry *e);
>
>  void mlx5e_rep_queue_neigh_stats_work(struct mlx5e_priv *priv);
> +
> +bool mlx5e_eswitch_rep(struct net_device *netdev);
> +
>  #else /* CONFIG_MLX5_ESWITCH */
>  static inline bool mlx5e_is_uplink_rep(struct mlx5e_priv *priv) { return false; }
>  static inline int mlx5e_add_sqs_fwd_rules(struct mlx5e_priv *priv) { return 0; }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index c1a9120412b8..a4c6287dfd65 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -2578,6 +2578,9 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
> struct net_device *uplink_dev = mlx5_eswitch_uplink_get_proto_dev(esw, REP_ETH);
> struct net_device *uplink_upper = netdev_master_upper_dev_get(uplink_dev);
>
> +   if (!mlx5e_eswitch_rep(out_dev))
> +   return -EOPNOTSUPP;
> +

no, out_dev can be the team/bond device and this way we break vf lag

> if (uplink_upper &&
> netif_is_lag_master(uplink_upper) &&
> uplink_upper == out_dev)

should move it here, after we realize that and potentially make
out_dev be the uplink rep

Otherwise, Acked-by: Or Gerlitz 


Re: [PATCH net] net/sched: cls_flower: Remove old entries from rhashtable

2018-12-19 Thread Or Gerlitz
On Wed, Dec 19, 2018 at 9:41 PM Roi Dayan  wrote:
>
> When replacing a rule we add the new rule to the rhashtable
> but only remove the old if not in skip_sw.
> This commit fix this and remove the old rule anyway.
>
> Fixes: 35cc3cefc4de ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
> Signed-off-by: Roi Dayan 
> Reviewed-by: Vlad Buslov 

Acked-by: Or Gerlitz 


Re: linux-next: manual merge of the net-next tree with the net tree

2018-12-17 Thread Or Gerlitz
On Mon, Dec 17, 2018 at 11:29 PM Saeed Mahameed
 wrote:
> On Sun, Dec 16, 2018 at 4:25 PM Stephen Rothwell  
> wrote:

> > I fixed it up (see below) and can carry the fix as necessary. This

> Looks good to me.

here too


> > Today's linux-next merge of the net-next tree got a conflict in:
> >   drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > between commit:

> >   154e62abe9cd ("net/mlx5e: Properly initialize flow attributes for slow path eswitch rule deletion")
> > from the net tree and commit:

> >   e88afe759a49 ("net/mlx5e: Err if asked to mirror a goto chain tc eswitch rule")
> >   e85e02bad29e ("net/mlx5: E-Switch, Rename esw attr mirror count field")
> > from the net-next tree.

Just a note,

e88afe759a49 ("net/mlx5e: Err if asked to mirror a goto chain tc eswitch rule")

is from net and not net-next

see it here [1] among the top 10 patches

[1] https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/log/?qt=grep&q=mlx5


Re: [net-next,v5,00/12] add flow_rule infrastructure

2018-12-13 Thread Or Gerlitz
On Wed, Dec 12, 2018 at 12:17 AM Jakub Kicinski
 wrote:
>
> On Tue, 11 Dec 2018 22:59:47 +0200, Or Gerlitz wrote:
> > To put it a bit more clearly, donno if my concerns are to the extent of
> > being fundamental, but yesknow that they were not sufficiently addressed.
> >
> > TC is the leading kernel CA system for about 2.5 decades, so I am not
> > clear what we want to IR the TC offload path and not TCfy the ethtool
> > and Co offloads path.
>
> I'm not 100% clear on what the proposal would be here.  Would we
> build a flow dissector and allocate fake TC actions?  Would we use
> setup_tc hook?  My gut worry is that we would just end up with the
> worst of all worlds if we do something like this :(  (already to my
> taste this API leaks too much TC through)

yeah it was something in this direction, but I hear that you don't like it and
I guess Jiri is also not advocating for that, so maybe that not a right choice.

> Back to the elephant in the room it would certainly aid "nft offload"
> adoption if drivers didn't even know it was not a real flower being
> offloaded :)

after Dave put the EIR on the thread, someone put on the table a photo of
an office scene with a nice elephant looking on at the happenings, so I have
it every morning now..

> > Going forward to 2019 HWs that can offload OVS/OF (flow) metering,
> > do we really want to IR the TC policers which follow IEEE or a like specs?
>
> Specs are good (y)

I understand you are saying we can IR the TC policers, let it be.

> > Still, seems that other folks on the drivers yard are ok and even happy with
> > the IR direction/implementation, I see that Jiri acked it all.

> > I guess we need some voices to speak, would love to hear the whole
> > of the CCed JJJ triplet speaking up.

> I don't care much either way.  One thing I really don't look forward to
> is trying to do backports and send stable fixes after this conversion.

M2

> Maybe having a side library that could take a ethtool/flower/nft flow
> and return common IR representation of that flow would be less painful?
> Drivers could then migrate at its own pace, for new functionality etc.
> Was that discussed before?  I may have lost track of this discussion...

Currently all drivers are ported to use the IR on the tc/flower offload path

End of the day, seems that Jiri and Jakub are good with this and I didn't hear
any further rejections from other driver folks, so I guess my concerns
were basically
addressed by them.


Re: [net-next,v5,00/12] add flow_rule infrastructure

2018-12-11 Thread Or Gerlitz
On Tue, Dec 11, 2018 at 9:15 PM David Miller  wrote:
> From: Florian Westphal 
> > Pablo Neira Ayuso  wrote:
> >> This is another iteration of the in-kernel intermediate representation
> >> (IR) that allows to express ACL hardware offloads using one unified
> >> representation from the driver side for the ethtool and the tc
> >> frontends [1] [2] [3].

> > This is marked 'rejected' in patchwork, what happened?

> Pablo is not being honest about his true intentions and long term
> goals with this series.  Or also brought up fundamental objections
> against previous versions of this series which were not sufficiently
> addressed by Pablo.

Hi Dave,

To put it a bit more clearly, donno if my concerns are to the extent of
being fundamental, but yesknow that they were not sufficiently addressed.

TC is the leading kernel CA system for about 2.5 decades, so I am not
clear what we want to IR the TC offload path and not TCfy the ethtool
and Co offloads path.
Going forward to 2019 HWs that can offload OVS/OF (flow) metering,
do we really want to IR the TC policers which follow IEEE or a like specs?

Still, seems that other folks on the drivers yard are ok and even happy with
the IR direction/implementation, I see that Jiri acked it all.

I guess we need some voices to speak, would love to hear the whole
of the CCed JJJ triplet speaking up.

> My personal concern has been brought up several times.

I will go dig for it, that's indeed hard when concerns are not replied to :(


> I am not going to restate the same thing over and over, and he hasn't
> fixed up his introductory posting in response to my concerns in any
> meaningful and full way.
>
> Therefore, if I feel like my and Or's feedback will keep being
> ignored, I will keep rejecting these changes.


Re: [net-next 12/12] net/sched: Remove egdev mechanism

2018-12-10 Thread Or Gerlitz
On Tue, Dec 11, 2018 at 3:09 AM Jakub Kicinski
 wrote:
>
> On Mon, 10 Dec 2018 16:27:01 -0800, Saeed Mahameed wrote:
> > From: Oz Shlomo 
> >
> > The egdev mechanism was replaced by the TC indirect block notifications
> > platform.
> >
> > Signed-off-by: Oz Shlomo 
> > Reviewed-by: Eli Britstein 
> > Reviewed-by: Jiri Pirko 
> > Cc: John Hurley 
> > Cc: Jakub Kicinski 
> > Signed-off-by: Saeed Mahameed 

> Signed-off-by: Jakub Kicinski 
> Thank you!

Here too

Acked-by: Or Gerlitz 

and let egdev RIP..


Re: [PATCH mlx5-next 07/10] net/mlx5: E-Switch, Change vhca id valid bool field to bit flag

2018-12-10 Thread Or Gerlitz
On Mon, Dec 10, 2018 at 10:09 PM Saeed Mahameed
 wrote:
>
> On Mon, Dec 10, 2018 at 8:12 AM Jason Gunthorpe  wrote:
> >
> > On Sun, Dec 09, 2018 at 07:04:39PM -0800, Saeed Mahameed wrote:
> > > From: Eli Britstein 
> > >
> > > Change the driver flow destination struct to use bit flags with the vhca
> > > id valid being the 1st one. Done to avoid bool fields in structs, as
> > > warned by static checkers, with no functionality change.
> >
> > This is a thing now? I thought the warning was not to use bool with
> > bitfields for some reason?
> >
>
> Yea, the commit message is kinda silly,

hey, sometimes you just want to make your boss happy:

CHECK: Avoid using bool structure members because of possible
alignment issues - see: https://lkml.org/lkml/2017/11/21/384
#99: FILE: include/linux/mlx5/fs.h:99:
+   boolvhca_id_valid;

missed anything?


Re: [PATCH mlx5-next 07/10] net/mlx5: E-Switch, Change vhca id valid bool field to bit flag

2018-12-10 Thread Or Gerlitz
On Mon, Dec 10, 2018 at 6:31 PM Jason Gunthorpe  wrote:
> On Sun, Dec 09, 2018 at 07:04:39PM -0800, Saeed Mahameed wrote:
> > From: Eli Britstein 
> >
> > Change the driver flow destination struct to use bit flags with the vhca
> > id valid being the 1st one. Done to avoid bool fields in structs, as
> > warned by static checkers, with no functionality change.
>
> This is a thing now? I thought the warning was not to use bool with
> bitfields for some reason?

we have a downstream patch that puts more content into the flags field, so
the change is justified also as a pre-step to that commit


Re: linux-next: manual merge of the net-next tree with the net tree

2018-12-10 Thread Or Gerlitz
On Mon, Dec 10, 2018 at 3:38 AM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   net/sched/cls_flower.c
>
> between commit:
>
>   35cc3cefc4de ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
>
> from the net tree and commit:
>
>   5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This

[..]

The fix LGTM, thanks for addressing this.

Or.


[PATCH net] net/sched: cls_flower: Reject duplicated rules also under skip_sw

2018-12-09 Thread Or Gerlitz
Currently, duplicated rules are rejected only for skip_hw or "none",
hence allowing users to push duplicates into HW for no reason.

Use the flower tables to protect for that.

Signed-off-by: Or Gerlitz 
Signed-off-by: Paul Blakey 
Reported-by: Chris Mi 
---
changes from RFC:
 - properly handle delete as well

 net/sched/cls_flower.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c6c327874abc..71312d7bd8f4 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1238,18 +1238,16 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
if (err)
goto errout_idr;
 
-   if (!tc_skip_sw(fnew->flags)) {
-   if (!fold && fl_lookup(fnew->mask, &fnew->mkey)) {
-   err = -EEXIST;
-   goto errout_mask;
-   }
-
-   err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-fnew->mask->filter_ht_params);
-   if (err)
-   goto errout_mask;
+   if (!fold && fl_lookup(fnew->mask, &fnew->mkey)) {
+   err = -EEXIST;
+   goto errout_mask;
}
 
+   err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
+fnew->mask->filter_ht_params);
+   if (err)
+   goto errout_mask;
+
if (!tc_skip_hw(fnew->flags)) {
err = fl_hw_replace_filter(tp, fnew, extack);
if (err)
@@ -1303,9 +1301,8 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
struct cls_fl_head *head = rtnl_dereference(tp->root);
struct cls_fl_filter *f = arg;
 
-   if (!tc_skip_sw(f->flags))
-   rhashtable_remove_fast(&f->mask->ht, &f->ht_node,
-  f->mask->filter_ht_params);
+   rhashtable_remove_fast(&f->mask->ht, &f->ht_node,
+  f->mask->filter_ht_params);
__fl_delete(tp, f, extack);
*last = list_empty(&head->masks);
return 0;
-- 
2.17.0.2884.g4ede3d4
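The control flow of the fix above can be sketched with a toy Python model where a dict stands in for the flower rhashtable (names illustrative, not the kernel code):

```python
class Table:
    """Toy fl_change(): the dict models the flower mask rhashtable."""
    def __init__(self, fixed):
        self.ht, self.fixed = {}, fixed

    def change(self, key, flags):
        # Before the fix, the duplicate lookup and insertion were guarded
        # by !tc_skip_sw(flags); after it, they are unconditional.
        if self.fixed or "skip_sw" not in flags:
            if key in self.ht:
                raise FileExistsError("EEXIST")
            self.ht[key] = flags
        # (hardware insertion for !skip_hw rules would follow here)

old, new = Table(fixed=False), Table(fixed=True)

# skip_sw duplicates slip through the old logic...
old.change("5-tuple-A", {"skip_sw"})
old.change("5-tuple-A", {"skip_sw"})        # no error: pushed to HW again

# ...and are rejected by the fixed one.
new.change("5-tuple-A", {"skip_sw"})
try:
    new.change("5-tuple-A", {"skip_sw"})
    raised = False
except FileExistsError:
    raised = True
assert raised
```

The same reasoning explains the delete hunk: since every rule is now in the table, removal must also be unconditional.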



[PATCH RFC net] net/sched: cls_flower: Reject duplicated rules also under skip_sw

2018-12-03 Thread Or Gerlitz
Currently, duplicated rules are rejected only for skip_hw or "none",
hence allowing users to push duplicates into HW for no reason.

Use the flower tables to protect for that.

Signed-off-by: Or Gerlitz 
Signed-off-by: Paul Blakey 
Reported-by: Chris Mi 
---
 net/sched/cls_flower.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c6c327874abc..3868e5e73acf 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1238,18 +1238,16 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
if (err)
goto errout_idr;
 
-   if (!tc_skip_sw(fnew->flags)) {
-   if (!fold && fl_lookup(fnew->mask, &fnew->mkey)) {
-   err = -EEXIST;
-   goto errout_mask;
-   }
-
-   err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-fnew->mask->filter_ht_params);
-   if (err)
-   goto errout_mask;
+   if (!fold && fl_lookup(fnew->mask, &fnew->mkey)) {
+   err = -EEXIST;
+   goto errout_mask;
}
 
+   err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
+fnew->mask->filter_ht_params);
+   if (err)
+   goto errout_mask;
+
if (!tc_skip_hw(fnew->flags)) {
err = fl_hw_replace_filter(tp, fnew, extack);
if (err)
-- 
2.17.0.2884.g4ede3d4



[PATCH net-next V2 0/2] net/sched: act_tunnel_key: support key-less tunnels

2018-12-02 Thread Or Gerlitz
This short series from Adi Nissim adds support for key-less tunnels
in the tc tunnel key action, which is needed for some GRE use-cases.

changes from V0:
 - address a build warning spotted by kbuild; make sure to always init
   the tunnel key to zero

Adi Nissim (2):
  net/sched: act_tunnel_key: Allow key-less tunnels
  net/sched: act_tunnel_key: Don't dump dst port if it wasn't set

 net/sched/act_tunnel_key.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

-- 
2.17.0.2884.g4ede3d4



[PATCH net-next V2 2/2] net/sched: act_tunnel_key: Don't dump dst port if it wasn't set

2018-12-02 Thread Or Gerlitz
From: Adi Nissim 

It's possible to set a tunnel without a destination port. However,
on dump(), a zero dst port is returned to user space even if it was not
set, fix that.

Note that so far it wasn't required, b/c key-less tunnels were not
supported and the UDP tunnels do require a destination port.

Signed-off-by: Adi Nissim 
Reviewed-by: Oz Shlomo 
Acked-by: Jiri Pirko 
---
 net/sched/act_tunnel_key.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index fad438c0f7a7..c3b90fadaff6 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -512,7 +512,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id)) ||
 tunnel_key_dump_addresses(skb,
  &params->tcft_enc_metadata->u.tun_info) ||
-   nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
+   (key->tp_dst &&
+ nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT,
+  key->tp_dst)) ||
nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
   !(key->tun_flags & TUNNEL_CSUM)) ||
tunnel_key_opts_dump(skb, info))
-- 
2.17.0.2884.g4ede3d4



[PATCH net-next V2 1/2] net/sched: act_tunnel_key: Allow key-less tunnels

2018-12-02 Thread Or Gerlitz
From: Adi Nissim 

Allow setting a tunnel without a tunnel key. This is required for
tunneling protocols, such as GRE, that define the key as an optional
field.

Signed-off-by: Adi Nissim 
Acked-by: Or Gerlitz 
Reviewed-by: Oz Shlomo 
Acked-by: Jiri Pirko 
---
 net/sched/act_tunnel_key.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 4cca8f274662..fad438c0f7a7 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -210,9 +210,9 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
struct tcf_tunnel_key *t;
bool exists = false;
__be16 dst_port = 0;
+   __be64 key_id = 0;
int opts_len = 0;
-   __be64 key_id;
-   __be16 flags;
+   __be16 flags = 0;
u8 tos, ttl;
int ret = 0;
int err;
@@ -246,15 +246,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
case TCA_TUNNEL_KEY_ACT_RELEASE:
break;
case TCA_TUNNEL_KEY_ACT_SET:
-   if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
-   NL_SET_ERR_MSG(extack, "Missing tunnel key id");
-   ret = -EINVAL;
-   goto err_out;
-   }
+   if (tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+   __be32 key32;
 
-   key_id = key32_to_tunnel_id(nla_get_be32(tb[TCA_TUNNEL_KEY_ENC_KEY_ID]));
+   key32 = nla_get_be32(tb[TCA_TUNNEL_KEY_ENC_KEY_ID]);
+   key_id = key32_to_tunnel_id(key32);
+   flags = TUNNEL_KEY;
+   }
 
-   flags = TUNNEL_KEY | TUNNEL_CSUM;
+   flags |= TUNNEL_CSUM;
if (tb[TCA_TUNNEL_KEY_NO_CSUM] &&
nla_get_u8(tb[TCA_TUNNEL_KEY_NO_CSUM]))
flags &= ~TUNNEL_CSUM;
@@ -508,7 +508,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
struct ip_tunnel_key *key = &info->key;
__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
-   if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
+   if (((key->tun_flags & TUNNEL_KEY) &&
+nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id)) ||
tunnel_key_dump_addresses(skb,
  &params->tcft_enc_metadata->u.tun_info) ||
nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
-- 
2.17.0.2884.g4ede3d4



[PATCH net-next 2/2] net/sched: act_tunnel_key: Don't dump dst port if it wasn't set

2018-11-28 Thread Or Gerlitz
From: Adi Nissim 

It's possible to set a tunnel without a destination port. However,
on dump(), a zero dst port is returned to user space even if it was not
set, fix that.

Note that so far it wasn't required, b/c key-less tunnels were not
supported and the UDP tunnels do require a destination port.

Signed-off-by: Adi Nissim 
Reviewed-by: Oz Shlomo 
Acked-by: Jiri Pirko 
---
 net/sched/act_tunnel_key.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 2d8edee6a7c9..9017bc20f92b 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -512,7 +512,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id)) ||
 tunnel_key_dump_addresses(skb,
  &params->tcft_enc_metadata->u.tun_info) ||
-   nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
+   (key->tp_dst &&
+ nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT,
+  key->tp_dst)) ||
nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
   !(key->tun_flags & TUNNEL_CSUM)) ||
tunnel_key_opts_dump(skb, info))
-- 
2.17.0.2884.g4ede3d4



[PATCH net-next 0/2] net/sched: act_tunnel_key: support key-less tunnels

2018-11-28 Thread Or Gerlitz
This short series from Adi Nissim adds support for key-less tunnels
in the tc tunnel key action, which is needed for some GRE use-cases.

Adi Nissim (2):
  net/sched: act_tunnel_key: Allow key-less tunnels
  net/sched: act_tunnel_key: Don't dump dst port if it wasn't set

 net/sched/act_tunnel_key.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

-- 
2.17.0.2884.g4ede3d4



[PATCH net-next 1/2] net/sched: act_tunnel_key: Allow key-less tunnels

2018-11-28 Thread Or Gerlitz
From: Adi Nissim 

Allow setting a tunnel without a tunnel key. This is required for
tunneling protocols, such as GRE, that define the key as an optional
field.

Signed-off-by: Adi Nissim 
Acked-by: Or Gerlitz 
Reviewed-by: Oz Shlomo 
Acked-by: Jiri Pirko 
---
 net/sched/act_tunnel_key.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 4cca8f274662..2d8edee6a7c9 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -211,8 +211,8 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
bool exists = false;
__be16 dst_port = 0;
int opts_len = 0;
+   __be16 flags = 0;
__be64 key_id;
-   __be16 flags;
u8 tos, ttl;
int ret = 0;
int err;
@@ -246,15 +246,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
case TCA_TUNNEL_KEY_ACT_RELEASE:
break;
case TCA_TUNNEL_KEY_ACT_SET:
-   if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
-   NL_SET_ERR_MSG(extack, "Missing tunnel key id");
-   ret = -EINVAL;
-   goto err_out;
-   }
+   if (tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+   __be32 key32;
 
-   key_id = key32_to_tunnel_id(nla_get_be32(tb[TCA_TUNNEL_KEY_ENC_KEY_ID]));
+   key32 = nla_get_be32(tb[TCA_TUNNEL_KEY_ENC_KEY_ID]);
+   key_id = key32_to_tunnel_id(key32);
+   flags = TUNNEL_KEY;
+   }
 
-   flags = TUNNEL_KEY | TUNNEL_CSUM;
+   flags |= TUNNEL_CSUM;
if (tb[TCA_TUNNEL_KEY_NO_CSUM] &&
nla_get_u8(tb[TCA_TUNNEL_KEY_NO_CSUM]))
flags &= ~TUNNEL_CSUM;
@@ -508,7 +508,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
struct ip_tunnel_key *key = &info->key;
__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
-   if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
+   if (((key->tun_flags & TUNNEL_KEY) &&
+nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id)) ||
tunnel_key_dump_addresses(skb,
  &params->tcft_enc_metadata->u.tun_info) ||
nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
-- 
2.17.0.2884.g4ede3d4



Re: [PATCH 00/12 net-next,v2] add flow_rule infrastructure

2018-11-19 Thread Or Gerlitz
On Mon, Nov 19, 2018 at 10:14 PM David Miller  wrote:
> From: Pablo Neira Ayuso 
> Date: Mon, 19 Nov 2018 01:15:07 +0100
>
> > This patchset introduces a kernel intermediate representation (IR) to
> > express ACL hardware offloads, as already described in previous RFC and
> > v1 patchset [1] [2]. The idea is to normalize the frontend U/APIs to use
> > the flow dissectors and the flow actions so drivers can reuse the
> > existing TC offload driver codebase - that has been converted to use the
> > flow_rule infrastructure.
>
> I'm go to bring up the elephant in the room.

> I think the real motivation here is to offload netfilter rules to HW,
> and you should be completely honest about that.

Thanks Dave for clarifying.

So.. (A) why isn't the TC path enough for CT offloading? if we could have
just done it, that would have sounded really cool. (B) why do we have to deal
with EIRs (Elephants In Rooms)? (C) who can address A && B?

Or.


Re: [PATCH 00/10] add flow_rule infrastructure

2018-11-16 Thread Or Gerlitz
On Fri, Nov 16, 2018 at 3:43 AM Pablo Neira Ayuso  wrote:
> This patchset introduces a kernel intermediate representation (IR) to
> express ACL hardware offloads, this is heavily based on the existing
> flow dissector infrastructure and the TC actions. This IR can be used by
> different frontend ACL interfaces such as ethtool_rxnfc and tc to

any reason to keep aRFS out?

> represent ACL hardware offloads. Main goal is to simplify the
> development of ACL hardware offloads for the existing frontend
> interfaces, the idea is that driver developers do not need to add one
> specific parser for each ACL frontend, instead each frontend can just
> generate this flow_rule IR and pass it to drivers to populate the
> hardware IR.

yeah, but we are adding one more chain (IR), today we have

kernel frontend U/API X --> kernel parser Y --> driver --> HW API
kernel frontend U/API Z --> kernel parser W --> driver --> HW API

and we move to

kernel frontend U/API X --> kernel parser Y --> IR --> driver --> HW API
kernel frontend U/API Z --> kernel parser W --> IR --> driver --> HW API

So instead of

Y --> HW
W --> HW

we have IR --> HW while losing the TC semantics and spirit in the drivers.

We could have normalized all the U/APIs to use the flow dissectors and
the TC actions and then have the drivers deal with TC only.

IMHO flow dissectors and TC actions are the right approach to deal with ACLs
HW offloading. They properly reflect how ACLs work in modern HW pipelines.

>
> .   ethtool_rxnfc   tc
>|   (ioctl)(netlink)
>|  | | translate native
>   Frontend |  | |  interface representation
>|  | |  to flow_rule IR
>|  | |
> . \/\/
> . flow_rule IR
>||
>Drivers || parsing of flow_rule IR
>||  to populate hardware IR
>|   \/
> .  hardware IR (driver)
>
> For design and implementation details, please have a look at:
>
> https://lwn.net/Articles/766695/

I will further look next week, but as this is not marked as RFC (and
not with net-next
on the title), can we consider this still a discussion and not final/review?

> As an example, with this patchset, it should be possible to simplify the
> existing net/qede driver which already has two parsers to populate the
> hardware IR, one for ethtool_rxnfc interface and another for tc.

I think it would be fair to ask for one such driver porting to see the
impact/benefit.

Or.


Re: [PATCH net-next 0/6] net: sched: indirect tc block cb registration

2018-11-12 Thread Or Gerlitz
On Mon, Nov 12, 2018 at 6:13 AM Jakub Kicinski
 wrote:
> On Sun, 11 Nov 2018 09:55:35 -0800 (PST), David Miller wrote:
> > From: Jakub Kicinski 
> > Date: Fri,  9 Nov 2018 21:21:25 -0800
> >
> > > John says:
> > >
> > > This patchset introduces an alternative to egdev offload by allowing a
> > > driver to register for block updates when an external device (e.g. tunnel
> > > netdev) is bound to a TC block. Drivers can track new netdevs or register
> > > to existing ones to receive information on such events. Based on this,
> > > they may register for block offload rules using already existing
> > > functions.
> > >
> > > The patchset also implements this new indirect block registration in the
> > > NFP driver to allow the offloading of tunnel rules. The use of egdev
> > > offload (which is currently only used for tunnel offload) is subsequently
> > > removed.
> >
> > Really nice.  Series applied.
> >
> > Can the Mellanox folks use this too so that we can remove egdev altogether?
> > mlx5 is the only remaining user.
>
> I believe Or and Oz are working on mlx5 counterpart, hopefully to land
> in this cycle? :)

yeah, Oz is working on that, plumbing the code [..] as we speak, submission
is planned for this cycle.

Or.

