from:"Yotam Gigi"

Re: [PATCH net 3/3] net: sched: ife: check on metadata length

2018-04-18 Thread yotam gigi

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <ar...@mojatatu.com> wrote:
> This patch checks if sk buffer is available to dererence ife header. If
> not then NULL will returned to signal an malformed ife packet. This
> avoids to crashing the kernel from outside.
>
> Signed-off-by: Alexander Aring <ar...@mojatatu.com>

Reviewed-by: Yotam Gigi <yotam...@gmail.com>

> ---
>  net/ife/ife.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 8632d2685efb..7c100034fbee 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
> u16 ifehdrln;
>
> ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
> +   if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
> +   return NULL;
> +
> ifehdrln = ntohs(ifehdr->metalen);
> total_pull = skb->dev->hard_header_len + ifehdrln;
>
> --
> 2.11.0
>

Re: [PATCH net 2/3] net: sched: ife: handle malformed tlv length

2018-04-18 Thread yotam gigi

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <ar...@mojatatu.com> wrote:
> There is currently no handling to check on a invalid tlv length. This
> patch adds such handling to avoid killing the kernel with a malformed
> ife packet.

That's very important. Thanks for that!

>
> Signed-off-by: Alexander Aring <ar...@mojatatu.com>
> ---
>  include/net/ife.h   |  3 ++-
>  net/ife/ife.c   | 35 +--
>  net/sched/act_ife.c |  7 ++-
>  3 files changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/ife.h b/include/net/ife.h
> index 44b9c00f7223..e117617e3c34 100644
> --- a/include/net/ife.h
> +++ b/include/net/ife.h
> @@ -12,7 +12,8 @@
>  void *ife_encode(struct sk_buff *skb, u16 metalen);
>  void *ife_decode(struct sk_buff *skb, u16 *metalen);
>
> -void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 
> *totlen);
> +void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 
> *attrtype,
> + u16 *dlen, u16 *totlen);
>  int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
> const void *dval);
>
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 7d1ec76e7f43..8632d2685efb 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -92,12 +92,43 @@ struct meta_tlvhdr {
> __be16 len;
>  };
>
> +static inline bool __ife_tlv_meta_valid(const unsigned char *skbdata,
> +   const unsigned char *ifehdr_end)
> +{
> +   const struct meta_tlvhdr *tlv;
> +   u16 tlvlen;
> +
> +   if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
> +   return false;
> +
> +   tlv = (struct meta_tlvhdr *)skbdata;
> +   tlvlen = ntohs(tlv->len);
> +
> +   /* tlv length field is inc header, check on minimum */
> +   if (tlvlen < NLA_HDRLEN)
> +   return false;
> +
> +   /* overflow by NLA_ALIGN check */
> +   if (NLA_ALIGN(tlvlen) < tlvlen)
> +   return false;
> +
> +   if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
> +   return false;
> +
> +   return true;
> +}
> +
>  /* Caller takes care of presenting data in network order
>   */
> -void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 
> *totlen)
> +void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 
> *attrtype,
> + u16 *dlen, u16 *totlen)

Now it is not critical, but it looks a bit odd to me: the function ife_decode
returns "metalen" - maybe it is better to change it too to return ifehdr_end?
If not, the caller has to calculate it himself for no particular reason.
Having both metalen and ifehdr_end is redundant, so we should stick to only
one of these.

Other than that,
Reviewed-by: Yotam Gigi <yotam...@gmail.com>

>  {
> -   struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
> +   struct meta_tlvhdr *tlv;
> +
> +   if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
> +   return NULL;
>
> +   tlv = (struct meta_tlvhdr *)skbdata;
> *dlen = ntohs(tlv->len) - NLA_HDRLEN;
> *attrtype = ntohs(tlv->type);
>
> diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
> index 49b8ab551fbe..8527cfdc446d 100644
> --- a/net/sched/act_ife.c
> +++ b/net/sched/act_ife.c
> @@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const 
> struct tc_action *a,
> u16 mtype;
> u16 dlen;
>
> -   curr_data = ife_tlv_meta_decode(tlv_data, , , 
> NULL);
> +   curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, ,
> +   , NULL);
> +   if (!curr_data) {
> +   qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
> +   return TC_ACT_SHOT;
> +   }
>
> if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
> /* abuse overlimits to count when we receive metadata
> --
> 2.11.0
>

Re: [PATCH net 1/3] net: sched: ife: signal not finding metaid

2018-04-18 Thread yotam gigi

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <ar...@mojatatu.com> wrote:
> We need to record stats for received metadata that we dont know how
> to process. Have find_decode_metaid() return -ENOENT to capture this.

Agree.

>
> Signed-off-by: Alexander Aring <ar...@mojatatu.com>

Reviewed-by: Yotam Gigi <yotam...@gmail.com>

> ---
>  net/sched/act_ife.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
> index a5994cf0512b..49b8ab551fbe 100644
> --- a/net/sched/act_ife.c
> +++ b/net/sched/act_ife.c
> @@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct 
> tcf_ife_info *ife,
> }
> }
>
> -   return 0;
> +   return -ENOENT;
>  }
>
>  static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
> --
> 2.11.0
>

Re: [PATCH iproute2 1/1] tc: use get_u32() in psample action to match types

2018-03-13 Thread yotam gigi

On Tue, Mar 13, 2018 at 11:16 PM, Roman Mashak <m...@mojatatu.com> wrote:

Makes sense :)

Acked-by: Yotam Gigi <yotam...@gmail.com>

> Signed-off-by: Roman Mashak <m...@mojatatu.com>
> ---
>  tc/m_sample.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tc/m_sample.c b/tc/m_sample.c
> index ff5ee6bd1ef6..dff986f5 100644
> --- a/tc/m_sample.c
> +++ b/tc/m_sample.c
> @@ -65,7 +65,7 @@ static int parse_sample(struct action_util *a, int *argc_p, 
> char ***argv_p,
> while (argc > 0) {
> if (matches(*argv, "rate") == 0) {
> NEXT_ARG();
> -   if (get_unsigned(, *argv, 10) != 0) {
> +   if (get_u32(, *argv, 10) != 0) {
> fprintf(stderr, "Illegal rate %s\n", *argv);
> usage();
> return -1;
> @@ -73,7 +73,7 @@ static int parse_sample(struct action_util *a, int *argc_p, 
> char ***argv_p,
> rate_set = true;
> } else if (matches(*argv, "group") == 0) {
> NEXT_ARG();
> -   if (get_unsigned(, *argv, 10) != 0) {
> +   if (get_u32(, *argv, 10) != 0) {
> fprintf(stderr, "Illegal group num %s\n",
> *argv);
> usage();
> @@ -82,7 +82,7 @@ static int parse_sample(struct action_util *a, int *argc_p, 
> char ***argv_p,
> group_set = true;
> } else if (matches(*argv, "trunc") == 0) {
> NEXT_ARG();
> -   if (get_unsigned(, *argv, 10) != 0) {
> +   if (get_u32(, *argv, 10) != 0) {
> fprintf(stderr, "Illegal truncation size 
> %s\n",
> *argv);
> usage();
> --
> 2.7.4
>

[PATCH iproute2 net-next] ip: mroute: Print offload indication

2017-10-08 Thread Yotam Gigi

Since kernel net-next commit c7c0bbeae950 ("net: ipmr: Add MFC offload
indication") the kernel indicates on an MFC entry whether it was offloaded
using the RTNH_F_OFFLOAD flag. Update the "ip mroute show" command to
indicate when a route is offloaded, similarly to the "ip route show"
command.

Example output:
$ ip mroute
(0.0.0.0, 239.255.0.1)  Iif: sw1p7  Oifs: t_br0 State: resolved offload
(192.168.1.1, 239.255.0.1)  Iif: sw1p7  Oifs: sw1p4 State: resolved offload

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 ip/ipmroute.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index b51c23c..453a6cf 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -161,6 +161,8 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
}
fprintf(fp, " State: %s",
r->rtm_flags & RTNH_F_UNRESOLVED ? "unresolved" : "resolved");
+   if (r->rtm_flags & RTNH_F_OFFLOAD)
+   fprintf(fp, " offload");
if (show_stats && tb[RTA_MFC_STATS]) {
struct rta_mfc_stats *mfcs = RTA_DATA(tb[RTA_MFC_STATS]);
 
-- 
2.8.4

Re: [patch net-next 1/6] net: bridge: Use the MDB_RTR_TYPE_TEMP on bridge device too

2017-10-08 Thread Yotam Gigi

On 10/08/2017 12:42 PM, Nikolay Aleksandrov wrote:
> On 08/10/17 12:39, Nikolay Aleksandrov wrote:
>> On 08/10/17 08:23, Yotam Gigi wrote:
>>> On 10/05/2017 03:09 PM, Nikolay Aleksandrov wrote:
>>>> On 05/10/17 13:36, Jiri Pirko wrote:
>>>>> From: Yotam Gigi <yot...@mellanox.com>
>>>>>
>>>>> Every bridge port is in one of four mcast router port states:
>>>>>  - MDB_RTR_TYPE_PERM - the port is set by the user to be an mrouter port
>>>>>regardless of IGMP queries.
>>>>>  - MDB_RTR_TYPE_DISABLED - the port is set by the user to not be an 
>>>>> mrouter
>>>>>port regardless of IGMP queries.
>>>>>  - MDB_RTR_TYPE_TEMP - the port is set by the user to be in mcast router
>>>>>learning state, but currently it is not an mrouter port as no IGMP 
>>>>> query
>>>>>has been received by it for the last multicast_querier_interval.
>>>>>  - MDB_RTR_TYPE_TEMP_QUERY - the port is set by the user to be in mcast
>>>>>router learning state, and currently it is an mrouter port due to an
>>>>>IGMP query that has been received by it during the passed
>>>>>multicast_querier_interval.
>>>> I think you got the last two partially mixed up, MDB_RTR_TYPE_TEMP marks 
>>>> the port as a router
>>>> regardless if there were any igmp queries, while TYPE_TEMP_QUERY means 
>>>> it's in learning
>>>> state. It is the timer (armed vs not) that defines if currently the port 
>>>> is a router
>>>> when one of the TEMP/TEMP_QUERY are set. In the _TEMP case it is always 
>>>> armed as it
>>>> is refreshed by user or igmp queries which was the point of that mode.
>>>> So this means in br_multicast_router() just check for the timer_pending or 
>>>> perm mode.
>>>
>>> As much as I tried to make this clear, it seems like I failed :)
>>>
>>> The 4 states I described are currently the "bridged port" states, not the
>>> "bridge device" state. A bridged port has these 4 states, all can be set by 
>>> the
>>> user, while the bridge device only uses 3 of these states. This patch makes 
>>> the
>>> bridge device use the 4 states too. I thought it makes sense.
>> (disclaimer: this is all about bridge ports, not bridge device)
>> Right, I'll try to explain again: _TEMP always marks the port as a mcast 
>> router,
>> it does not put it into just learning state waiting for an igmp query and it 
>> can
>> be refreshed by either a query or the user again setting the port in _TEMP.
>> While _TEMP_QUERY puts the port in learning state waiting for a query to 
>> become
>> a router, and _TEMP downgrades to _TEMP_QUERY if it expires.
>>
>> Does that make it clearer ?
>>
>> so for _TEMP you say:
>>>>>  - MDB_RTR_TYPE_TEMP - the port is set by the user to be in mcast router
>>>>>learning state, but currently it is not an mrouter port as no IGMP 
>>>>> query
>>>>>has been received by it for the last multicast_querier_interval.
>> which is not the case, it is always a router when that mode is set on a port.
>> Same for _TEMP_QUERY.
> Err, sorry by same I meant it is not correct, not that the _TEMP definition 
> is the same.
> Need to get coffee :-)

Ho, I see  that I was clear but not right :)

I had a look at the code and seems like you are right - for some reason I
thought that _TEMP is learning-active and _TEMP_QUERY is learning-inactive, and
the ports change states when according to IGMP queries.


>
>>> The first paragraph describes the current states of a bridged port, and the
>>> second one explains the difference between bridged port and bridge device. I
>>> will (try to) make it clearer if we agree on resending this patch.
>>>
>>> Is it clearer now?


Yes it is.

>>>
>>>
>>>> In the port code you have the following transitions:
>>>>  _TEMP -> TEMP_QUERY (on timer fire or user-set val, port becomes learning 
>>>> only)
>>>>  _TEMP -> _TEMP (noop on user refresh or igmp query, timer refreshes)
>>>>  _TEMP_QUERY -> _TEMP_QUERY (on igmp query the timer is armed, port 
>>>> becomes router)
>>>>
>>>> you never have _TEMP_QUERY -> _TEMP, which you're using here to denote the 
>>>> timer
>>>> getting armed and the bridge becoming a router.
>>>
>>> I am not sure I g

Re: [patch net-next 1/6] net: bridge: Use the MDB_RTR_TYPE_TEMP on bridge device too

2017-10-07 Thread Yotam Gigi

On 10/05/2017 03:09 PM, Nikolay Aleksandrov wrote:
> On 05/10/17 13:36, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Every bridge port is in one of four mcast router port states:
>>  - MDB_RTR_TYPE_PERM - the port is set by the user to be an mrouter port
>>regardless of IGMP queries.
>>  - MDB_RTR_TYPE_DISABLED - the port is set by the user to not be an mrouter
>>port regardless of IGMP queries.
>>  - MDB_RTR_TYPE_TEMP - the port is set by the user to be in mcast router
>>learning state, but currently it is not an mrouter port as no IGMP query
>>has been received by it for the last multicast_querier_interval.
>>  - MDB_RTR_TYPE_TEMP_QUERY - the port is set by the user to be in mcast
>>router learning state, and currently it is an mrouter port due to an
>>IGMP query that has been received by it during the passed
>>multicast_querier_interval.
> I think you got the last two partially mixed up, MDB_RTR_TYPE_TEMP marks the 
> port as a router
> regardless if there were any igmp queries, while TYPE_TEMP_QUERY means it's 
> in learning
> state. It is the timer (armed vs not) that defines if currently the port is a 
> router
> when one of the TEMP/TEMP_QUERY are set. In the _TEMP case it is always armed 
> as it
> is refreshed by user or igmp queries which was the point of that mode.
> So this means in br_multicast_router() just check for the timer_pending or 
> perm mode.


As much as I tried to make this clear, it seems like I failed :)

The 4 states I described are currently the "bridged port" states, not the
"bridge device" state. A bridged port has these 4 states, all can be set by the
user, while the bridge device only uses 3 of these states. This patch makes the
bridge device use the 4 states too. I thought it makes sense.

The first paragraph describes the current states of a bridged port, and the
second one explains the difference between bridged port and bridge device. I
will (try to) make it clearer if we agree on resending this patch.

Is it clearer now?


>
> In the port code you have the following transitions:
>  _TEMP -> TEMP_QUERY (on timer fire or user-set val, port becomes learning 
> only)
>  _TEMP -> _TEMP (noop on user refresh or igmp query, timer refreshes)
>  _TEMP_QUERY -> _TEMP_QUERY (on igmp query the timer is armed, port becomes 
> router)
>
> you never have _TEMP_QUERY -> _TEMP, which you're using here to denote the 
> timer
> getting armed and the bridge becoming a router.


I am not sure I got this one. I do address that: when an IGMP query is recieved
and the current state is _TEMP_QUERY, I arm the timer and set the state to
_TEMP. I marked that place on the patch, so you can see below.


>
>> The bridge device (brX) itself can also be configured by the user to be
>> either fixed, disabled or learning mrouter port states, but currently there
>> is no distinction between the MDB_RTR_TYPE_TEMP_QUERY and MDB_RTR_TYPE_TEMP
>> in the bridge internal state. Due to that, when an IGMP query is received,
>> it is not straightforward to tell whether it changes the bridge device
>> mrouter port status or not.
> But before this patch the bridge device could not get that set.
>
>> Further patches in this patch-set will introduce notifications upon the
>> bridge device mrouter port state. In order to prevent resending bridge
>> mrouter notification when it is not needed, such distinction is necessary.
>>
> Granted the bridge device hasn't got a way to clearly distinguish the 
> transitions
> without the chance for a race and if using the timer one could get an 
> unnecessary

Is there a race condition?

> notification but that seems like a corner case when the timer fires exactly 
> at the
> same time as the igmp query is received. Can't it be handled by just checking 
> if
> the new state is different in the notification receiver ?
> If it can't and is a problem then I'd prefer to add a new boolean to denote 
> that
> router on/off transition rather than doing this.
>
>> Hence, add the distinction between MDB_RTR_TYPE_TEMP and
>> MDB_RTR_TYPE_TEMP_QUERY states for the bridge device, similarly to any
>> other bridge port.
>>
> This does not add proper MDB_RTR_TYPE_TEMP support for the bridge device
> but seems to abuse it to distinguish the timer state, and changes
> the meaning of MDB_RTR_TYPE_TEMP. Can't you just use the timer instead ?
> I think it will simplify the set and avoid all of this.
>
>> In order to not break the current kernel-user API, don't propagate the new
>> state to the user and use it only in the bridge internal state. Thus, if
>> the user reads (either vi

Re: [PATCH 1/2 net-next] mlxsw: spectrum: Fix check for IS_ERR() instead of NULL

2017-10-03 Thread Yotam Gigi

On 10/03/2017 01:53 PM, Dan Carpenter wrote:
> mlxsw_afa_block_create() doesn't return error pointers, it returns NULL
> on error.
>
> Fixes: 0e14cacb ("mlxsw: spectrum: Add the multicast routing hardware 
> logic")
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>

Acked-by: Yotam Gigi <yot...@mellanox.com>

Thanks!

>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> index cda9e9ad10e3..5e4ccbf17e3d 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> @@ -239,8 +239,8 @@ mlxsw_sp_mr_tcam_afa_block_create(struct mlxsw_sp 
> *mlxsw_sp,
>   int err;
>  
>   afa_block = mlxsw_afa_block_create(mlxsw_sp->afa);
> - if (IS_ERR(afa_block))
> - return afa_block;
> + if (!afa_block)
> + return ERR_PTR(-ENOMEM);
>  
>   err = mlxsw_afa_block_append_counter(afa_block, counter_index);
>   if (err)

Re: [PATCH 2/2 net-next] mlxsw: spectrum: Add missing error code on allocation failure

2017-10-03 Thread Yotam Gigi

On 10/03/2017 01:53 PM, Dan Carpenter wrote:
> We accidentally return success if the kmalloc_array() call fails.
>
> Fixes: 0e14cacb ("mlxsw: spectrum: Add the multicast routing hardware 
> logic")
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>

Acked-by: Yotam Gigi <yot...@mellanox.com>

>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> index 5e4ccbf17e3d..839eadf0765b 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
> @@ -763,8 +763,10 @@ mlxsw_sp_mr_tcam_region_init(struct mlxsw_sp *mlxsw_sp,
>  
>   parman_prios = kmalloc_array(MLXSW_SP_MR_ROUTE_PRIO_MAX + 1,
>sizeof(*parman_prios), GFP_KERNEL);
> - if (!parman_prios)
> + if (!parman_prios) {
> + err = -ENOMEM;
>   goto err_parman_prios_alloc;
> + }
>   mr_tcam_region->parman_prios = parman_prios;
>  
>   for (i = 0; i < MLXSW_SP_MR_ROUTE_PRIO_MAX + 1; i++)

Re: [PATCH][next] mlxsw: spectrum: fix uninitialized value in err

2017-10-01 Thread Yotam Gigi

On 10/01/2017 07:27 PM, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
>
> In the unlikely event that mfc->mfc_un.res.ttls[i] is 255 for all
> values of i from 0 to MAXIVS-1, the err is not set at all and hence
> has a garbage value on the error return at the end of the function,
> so initialize it to 0.  Also, the error return check on err and goto
> to err: inside the for loop makes it impossible for err to be zero
> at the end of the for loop, so we can remove the redundant err check
> at the end of the loop.
>
> Detected by CoverityScan CID#1457207 ("Unitialized scalar value")

Thanks for that!

Reviewed-by: Yotam Gigi <yot...@mellanox.com>


>
> Fixes: c011ec1bbfd6 ("mlxsw: spectrum: Add the multicast routing offloading 
> logic")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
> index 09120259a45d..4aaf6ca1be7c 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
> @@ -349,7 +349,7 @@ mlxsw_sp_mr_route4_create(struct mlxsw_sp_mr_table 
> *mr_table,
>  {
>   struct mlxsw_sp_mr_route_vif_entry *rve, *tmp;
>   struct mlxsw_sp_mr_route *mr_route;
> - int err;
> + int err = 0;
>   int i;
>  
>   /* Allocate and init a new route and fill it with parameters */
> @@ -376,8 +376,6 @@ mlxsw_sp_mr_route4_create(struct mlxsw_sp_mr_table 
> *mr_table,
>   }
>   }
>   mlxsw_sp_mr_route_ivif_link(mr_route, _table->vifs[mfc->mfc_parent]);
> - if (err)
> - goto err;
>  
>   mr_route->route_action = mlxsw_sp_mr_route_action(mr_route);
>   return mr_route;

Re: [patch net-next 3/7] ipv4: ipmr: Don't forward packets already forwarded by hardware

2017-10-01 Thread Yotam Gigi

On 09/28/2017 08:56 PM, Florian Fainelli wrote:
> On 09/28/2017 10:34 AM, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Change the ipmr module to not forward packets if:
>>  - The packet is marked with the offload_mr_fwd_mark, and
>>  - Both input interface and output interface share the same parent ID.
>>
>> This way, a packet can go through partial multicast forwarding in the
>> hardware, where it will be forwarded only to the devices that share the
>> same parent ID (AKA, reside inside the same hardware). The kernel will
>> forward the packet to all other interfaces.
>>
>> To do this, add the ipmr_offload_forward helper, which per skb, ingress VIF
>> and egress VIF, returns whether the forwarding was offloaded to hardware.
>> The ipmr_queue_xmit frees the skb and does not forward it if the result is
>> a true value.
>>
>> All the forwarding path code compiles out when the CONFIG_NET_SWITCHDEV is
>> not set.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>>  net/ipv4/ipmr.c | 37 -
>>  1 file changed, 32 insertions(+), 5 deletions(-)
>>
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 4566c54..deba569 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -1857,10 +1857,33 @@ static inline int ipmr_forward_finish(struct net 
>> *net, struct sock *sk,
>>  return dst_output(net, sk, skb);
>>  }
>>  
>> +#ifdef CONFIG_NET_SWITCHDEV
>> +static bool ipmr_forward_offloaded(struct sk_buff *skb, struct mr_table 
>> *mrt,
>> +   int in_vifi, int out_vifi)
>> +{
>> +struct vif_device *out_vif = >vif_table[out_vifi];
>> +struct vif_device *in_vif = >vif_table[in_vifi];
> Nit: in_vifi and out_vifi may be better named as in_vif_idx and
> out_vif_idx, oh well you are just replicating the existing naming
> conventions used down below, never mind then.


Yes, unfortunately, the acronym "vifi" is pretty common in the file. I would
also prefer something like vif_index, but vifi would better match the current
convention in the code.

Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct

2017-10-01 Thread Yotam Gigi

On 09/29/2017 12:50 PM, Nikolay Aleksandrov wrote:
> On 28/09/17 20:34, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> In order to allow the ipmr module to do partial multicast forwarding
>> according to the device parent ID, add the device parent ID field to the
>> VIF struct. This way, the forwarding path can use the parent ID field
>> without invoking switchdev calls, which requires the RTNL lock.
>>
>> When a new VIF is added, set the device parent ID field in it by invoking
>> the switchdev_port_attr_get call.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>>  include/linux/mroute.h | 2 ++
>>  net/ipv4/ipmr.c| 9 +
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index b072a84..a46577f 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule 
>> *rule)
>>  
>>  struct vif_device {
>>  struct net_device   *dev;   /* Device we are using 
>> */
>> +struct netdev_phys_item_id dev_parent_id;   /* Device parent ID
>> */
>> +booldev_parent_id_valid;
>>  unsigned long   bytes_in,bytes_out;
>>  unsigned long   pkt_in,pkt_out; /* Statistics   
>> */
>>  unsigned long   rate_limit; /* Traffic shaping (NI) 
>> */
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 292a8e8..4566c54 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -67,6 +67,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  struct ipmr_rule {
>>  struct fib_rule common;
>> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>> struct vifctl *vifc, int mrtsock)
>>  {
>>  int vifi = vifc->vifc_vifi;
>> +struct switchdev_attr attr = {
>> +.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>> +};
>>  struct vif_device *v = >vif_table[vifi];
>>  struct net_device *dev;
>>  struct in_device *in_dev;
>> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table 
>> *mrt,
>>  
>>  /* Fill in the VIF structures */
>>  
>> +attr.orig_dev = dev;
>> +if (!switchdev_port_attr_get(dev, )) {
>> +v->dev_parent_id_valid = true;
>> +memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
>> +}
>>  v->rate_limit = vifc->vifc_rate_limit;
>>  v->local = vifc->vifc_lcl_addr.s_addr;
>>  v->remote = vifc->vifc_rmt_addr.s_addr;
>>
> One more thing - what happens on vif delete, then add with the same vif index 
> of another
> device that doesn't have a parent id ? I think the vif will be stuck with its 
> parent_id
> when it gets set.
>

Right. I will set the len to 0 if the device has no parent.

Thanks!

Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct

2017-10-01 Thread Yotam Gigi

On 09/29/2017 12:45 PM, Nikolay Aleksandrov wrote:
> On 29/09/17 12:29, Nikolay Aleksandrov wrote:
>> On 28/09/17 20:34, Jiri Pirko wrote:
>>> From: Yotam Gigi <yot...@mellanox.com>
>>>
>>> In order to allow the ipmr module to do partial multicast forwarding
>>> according to the device parent ID, add the device parent ID field to the
>>> VIF struct. This way, the forwarding path can use the parent ID field
>>> without invoking switchdev calls, which requires the RTNL lock.
>>>
>>> When a new VIF is added, set the device parent ID field in it by invoking
>>> the switchdev_port_attr_get call.
>>>
>>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>>  include/linux/mroute.h | 2 ++
>>>  net/ipv4/ipmr.c| 9 +
>>>  2 files changed, 11 insertions(+)
>>>
>>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>>> index b072a84..a46577f 100644
>>> --- a/include/linux/mroute.h
>>> +++ b/include/linux/mroute.h
>>> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct 
>>> fib_rule *rule)
>>>  
>>>  struct vif_device {
>>> struct net_device   *dev;   /* Device we are using 
>>> */
>>> +   struct netdev_phys_item_id dev_parent_id;   /* Device parent ID
>>> */
>>> +   booldev_parent_id_valid;
>>> unsigned long   bytes_in,bytes_out;
>>> unsigned long   pkt_in,pkt_out; /* Statistics   
>>> */
>>> unsigned long   rate_limit; /* Traffic shaping (NI) 
>>> */
>>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>>> index 292a8e8..4566c54 100644
>>> --- a/net/ipv4/ipmr.c
>>> +++ b/net/ipv4/ipmr.c
>>> @@ -67,6 +67,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  struct ipmr_rule {
>>> struct fib_rule common;
>>> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table 
>>> *mrt,
>>>struct vifctl *vifc, int mrtsock)
>>>  {
>>> int vifi = vifc->vifc_vifi;
>>> +   struct switchdev_attr attr = {
>>> +   .id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>>> +   };
>>> struct vif_device *v = >vif_table[vifi];
>>> struct net_device *dev;
>>> struct in_device *in_dev;
>>> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table 
>>> *mrt,
>>>  
>>> /* Fill in the VIF structures */
>>>  
>>> +   attr.orig_dev = dev;
>>> +   if (!switchdev_port_attr_get(dev, )) {
>>> +   v->dev_parent_id_valid = true;
>>> +   memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
>> Hmm, shouldn't you set dev_parent_id.id_len too ? It would seem 
>> netdev_phys_item_id_same()
>> uses it in the comparison and without the len it would always look like 
>> they're the same
>> because memcmp will simply return 0 with count = 0.
> Also maybe we can use the non-zero id_len as a signal that it was set and 
> drop the dev_parent_id_valid
> field altogether, it would seem there's no valid reason to have id_len == 0 
> and yet expect a valid
> parent_id.


Yes, I agree to both. I will remove the parent_id_valid field and use the len to
indicate whether it is valid.

Thanks for spotting the bug - since we have only been testing it with a pimreg
device, the problem was not found in our tests. Multi-ASIC setups are a bit
hard to find these days :)


>>> +   }
>>> v->rate_limit = vifc->vifc_rate_limit;
>>> v->local = vifc->vifc_lcl_addr.s_addr;
>>> v->remote = vifc->vifc_rmt_addr.s_addr;
>>>

Re: [patch net-next v2 06/12] net: mroute: Check if rule is a default rule

2017-09-25 Thread Yotam Gigi

On 09/25/2017 01:02 PM, Nikolay Aleksandrov wrote:
> On 25/09/17 12:45, Jiri Pirko wrote:
>> Mon, Sep 25, 2017 at 03:28:21AM CEST, linyunsh...@huawei.com wrote:
>>> Hi, Jiri
>>>
>>> On 2017/9/25 1:22, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yot...@mellanox.com>
>>>>
>>>> When the ipmr starts, it adds one default FIB rule that matches all packets
>>>> and sends them to the DEFAULT (multicast) FIB table. A more complex rule
>>>> can be added by user to specify that for a specific interface, a packet
>>>> should be look up at either an arbitrary table or according to the l3mdev
>>>> of the interface.
>>>>
>>>> For drivers willing to offload the ipmr logic into a hardware but don't
>>>> want to offload all the FIB rules functionality, provide a function that
>>>> can indicate whether the FIB rule is the default multicast rule, thus only
>>>> one routing table is needed.
>>>>
>>>> This way, a driver can register to the FIB notification chain, get
>>>> notifications about FIB rules added and trigger some kind of an internal
>>>> abort mechanism when a non default rule is added by the user.
>>>>
>>>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>>>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>>> ---
>>>>  include/linux/mroute.h |  7 +++
>>>>  net/ipv4/ipmr.c| 10 ++
>>>>  2 files changed, 17 insertions(+)
>>>>
>>>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>>>> index 5566580..b072a84 100644
>>>> --- a/include/linux/mroute.h
>>>> +++ b/include/linux/mroute.h
>>>> @@ -5,6 +5,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  
>>>> @@ -19,6 +20,7 @@ int ip_mroute_getsockopt(struct sock *, int, char __user 
>>>> *, int __user *);
>>>>  int ipmr_ioctl(struct sock *sk, int cmd, void __user *arg);
>>>>  int ipmr_compat_ioctl(struct sock *sk, unsigned int cmd, void __user 
>>>> *arg);
>>>>  int ip_mr_init(void);
>>>> +bool ipmr_rule_default(const struct fib_rule *rule);
>>>>  #else
>>>>  static inline int ip_mroute_setsockopt(struct sock *sock, int optname,
>>>>   char __user *optval, unsigned int optlen)
>>>> @@ -46,6 +48,11 @@ static inline int ip_mroute_opt(int opt)
>>>>  {
>>>>return 0;
>>>>  }
>>>> +
>>>> +static inline bool ipmr_rule_default(const struct fib_rule *rule)
>>>> +{
>>>> +  return true;
>>>> +}
>>>>  #endif
>>>>  
>>>>  struct vif_device {
>>>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>>>> index 2a795d2..a714f55 100644
>>>> --- a/net/ipv4/ipmr.c
>>>> +++ b/net/ipv4/ipmr.c
>>>> @@ -320,6 +320,16 @@ static unsigned int ipmr_rules_seq_read(struct net 
>>>> *net)
>>>>  }
>>>>  #endif
>>>>  
>>>> +bool ipmr_rule_default(const struct fib_rule *rule)
>>>> +{
>>>> +#if IS_ENABLED(CONFIG_FIB_RULES)
>>>> +  return fib_rule_matchall(rule) && rule->table == RT_TABLE_DEFAULT;
>>>> +#else
>>>> +  return true;
>>>> +#endif
>>> In patch 02, You have the following, can you do the same for the above?
>>> +#ifdef CONFIG_IP_MROUTE
>>> +void ipmr_cache_free(struct mfc_cache *mfc_cache);
>>> +#else
>>> +static inline void ipmr_cache_free(struct mfc_cache *mfc_cache)
>>> +{
>>> +}
>>> +#endif
>> I don't believe this is necessary. The solution you described is often
>> used in headers. But here, I'm ok with the current code.
>>
> +1


Hmm, when re-looking at it, I think I will just use the already existing
#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES other than adding a new one. It selects
the CONFIG_FIB_RULES, and if CONFIG_IP_MROUTE_MULTIPLE_TABLES is not defined
than only default rules can exist for the IPMR family.

I will fix it for v3.



>
>>>> +}
>>>> +EXPORT_SYMBOL(ipmr_rule_default);
>>>> +
>>>>  static inline int ipmr_hash_cmp(struct rhashtable_compare_arg *arg,
>>>>const void *ptr)
>>>>  {
>>>>

Re: [patch net-next v2 05/12] net: ipmr: Add MFC offload indication

2017-09-25 Thread Yotam Gigi

On 09/25/2017 12:36 PM, Nikolay Aleksandrov wrote:
> On 24/09/17 20:22, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Allow drivers, registered to the fib notification chain indicate whether a
>> multicast MFC route is offloaded or not, similarly to unicast routes. The
>> indication of whether a route is offloaded is done using the mfc_flags
>> field on an mfc_cache struct, and the information is sent to the userspace
>> via the RTNetlink interface only.
>>
>> Currently, MFC routes are either offloaded or not, thus there is no need to
>> add per-VIF offload indication.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>> v1->v2:
>>  - Add comment for the MFC_OFFLOAD flag
>> ---
>>  include/linux/mroute.h | 2 ++
>>  net/ipv4/ipmr.c| 3 +++
>>  2 files changed, 5 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index 54c5cb8..5566580 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -90,9 +90,11 @@ struct mr_table {
>>  
>>  /* mfc_flags:
>>   * MFC_STATIC - the entry was added statically (not by a routing daemon)
>> + * MFC_OFFLOAD - the entry was offloaded to the hardware
>>   */
>>  enum {
>>  MFC_STATIC = BIT(0),
>> +MFC_OFFLOAD = BIT(1),
>>  };
>>  
>>  struct mfc_cache_cmp_arg {
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index ba71bc4..2a795d2 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -2268,6 +2268,9 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, 
>> struct sk_buff *skb,
>>  nla_put_u32(skb, RTA_IIF, 
>> mrt->vif_table[c->mfc_parent].dev->ifindex) < 0)
>>  return -EMSGSIZE;
>>  
>> +if (c->mfc_flags & MFC_OFFLOAD)
>> +rtm->rtm_flags |= RTNH_F_OFFLOAD;
>> +
>>  if (!(mp_attr = nla_nest_start(skb, RTA_MULTIPATH)))
>>  return -EMSGSIZE;
>>  
>>
> Thanks!

Thank you for reviewing :)

>
> Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>

Re: [patch net-next v2 07/12] mlxsw: spectrum: Add the multicast routing offloading logic

2017-09-25 Thread Yotam Gigi

On 09/25/2017 01:40 PM, Nikolay Aleksandrov wrote:
> On 24/09/17 20:22, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Add the multicast router offloading logic, which is in charge of handling
>> the VIF and MFC notifications and translating it to the hardware logic API.
>>
>> The offloading logic has to overcome several obstacles in order to safely
>> comply with the kernel multicast router user API:
>>  - It must keep track of the mapping between VIFs to netdevices. The user
>>can add an MFC cache entry pointing to a VIF, delete the VIF and add
>>re-add it with a different netdevice. The offloading logic has to handle
>>this in order to be compatible with the kernel logic.
>>  - It must keep track of the mapping between netdevices to spectrum RIFs,
>>as the current hardware implementation assume having a RIF for every
>>port in a multicast router.
>>  - It must handle routes pointing to pimreg device to be trapped to the
>>kernel, as the packet should be delivered to userspace.
>>  - It must handle routes pointing tunnel VIFs. The current implementation
>>does not support multicast forwarding to tunnels, thus routes that point
>>to a tunnel should be trapped to the kernel.
>>  - It must be aware of proxy multicast routes, which include both (*,*)
>>routes and duplicate routes. Currently proxy routes are not offloaded
>>and trigger the abort mechanism: removal of all routes from hardware and
>>triggering the traffic to go through the kernel.
>>
>> The multicast routing offloading logic also updates the counters of the
>> offloaded MFC routes in a periodic work.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>> v1->v2:
>>  - Update the lastuse MFC entry field too, in addition to packets an bytes.
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/Makefile  |3 +-
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h|1 +
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c | 1014 
>> +
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h |  133 +++
>>  4 files changed, 1150 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h
>>
> [snip]
>> +static void mlxsw_sp_mr_route_erase(struct mlxsw_sp_mr_table *mr_table,
>> +struct mlxsw_sp_mr_route *mr_route)
>> +{
>> +struct mlxsw_sp *mlxsw_sp = mr_table->mlxsw_sp;
>> +struct mlxsw_sp_mr *mr = mlxsw_sp->mr;
>> +
>> +mr->mr_ops->route_destroy(mlxsw_sp, mr->priv, mr_route->route_priv);
>> +kfree(mr_route->route_priv);
>> +}
>> +
>> +static struct mlxsw_sp_mr_route *
>> +mlxsw_sp_mr_route4_create(struct mlxsw_sp_mr_table *mr_table,
>> +  struct mfc_cache *mfc)
>> +{
>> +struct mlxsw_sp_mr_route_vif_entry *rve, *tmp;
>> +struct mlxsw_sp_mr_route *mr_route;
>> +int err;
>> +int i;
>> +
>> +/* Allocate and init a new route and fill it with parameters */
>> +mr_route = kzalloc(sizeof(*mr_table), GFP_KERNEL);
> sizeof(*mr_table) ? Shouldn't you allocate sizeof struct mlsw_sp_mr_route 
> (*mr_route) here ?
>

Seems like you are right. Because of the fact that sizeof(*mr_table) is much
bigger than sizeof(*mr_route), all our tests did not notice it.

Thanks for that!

>> +if (!mr_route)
>> +return ERR_PTR(-ENOMEM);
>> +INIT_LIST_HEAD(_route->evif_list);
>> +mlxsw_sp_mr_route4_key(mr_table, _route->key, mfc);
>> +
>> +/* Find min_mtu and link iVIF and eVIFs */
>> +mr_route->min_mtu = ETH_MAX_MTU;
>> +ipmr_cache_hold(mfc);
>> +mr_route->mfc4 = mfc;
>> +mr_route->mr_table = mr_table;
>> +for (i = 0; i < MAXVIFS; i++) {
>> +if (mfc->mfc_un.res.ttls[i] != 255) {
>> +err = mlxsw_sp_mr_route_evif_link(mr_route,
>> +  _table->vifs[i]);
>> +if (err)
>> +goto err;
>> +if (mr_table->vifs[i].dev &&
>> +mr_table->vifs[i].dev->mtu < mr_route->min_mtu)
>> +mr_route->min_mtu = mr_table->vifs[i].dev->mtu;
>> +

Re: [patch net-next v2 07/12] mlxsw: spectrum: Add the multicast routing offloading logic

2017-09-24 Thread Yotam Gigi

On 09/25/2017 04:48 AM, Yunsheng Lin wrote:
> Hi, Jiri
>
> On 2017/9/25 1:22, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Add the multicast router offloading logic, which is in charge of handling
>> the VIF and MFC notifications and translating it to the hardware logic API.
>>
>> The offloading logic has to overcome several obstacles in order to safely
>> comply with the kernel multicast router user API:
>>  - It must keep track of the mapping between VIFs to netdevices. The user
>>can add an MFC cache entry pointing to a VIF, delete the VIF and add
>>re-add it with a different netdevice. The offloading logic has to handle
>>this in order to be compatible with the kernel logic.
>>  - It must keep track of the mapping between netdevices to spectrum RIFs,
>>as the current hardware implementation assume having a RIF for every
>>port in a multicast router.
>>  - It must handle routes pointing to pimreg device to be trapped to the
>>kernel, as the packet should be delivered to userspace.
>>  - It must handle routes pointing tunnel VIFs. The current implementation
>>does not support multicast forwarding to tunnels, thus routes that point
>>to a tunnel should be trapped to the kernel.
>>  - It must be aware of proxy multicast routes, which include both (*,*)
>>routes and duplicate routes. Currently proxy routes are not offloaded
>>and trigger the abort mechanism: removal of all routes from hardware and
>>triggering the traffic to go through the kernel.
>>
>> The multicast routing offloading logic also updates the counters of the
>> offloaded MFC routes in a periodic work.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>> v1->v2:
>>  - Update the lastuse MFC entry field too, in addition to packets an bytes.
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/Makefile  |3 +-
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h|1 +
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c | 1014 
>> +
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h |  133 +++
>>  4 files changed, 1150 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile 
>> b/drivers/net/ethernet/mellanox/mlxsw/Makefile
>> index 4b88158..9b29764 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
>> @@ -17,7 +17,8 @@ mlxsw_spectrum-objs:= spectrum.o 
>> spectrum_buffers.o \
>> spectrum_kvdl.o spectrum_acl_tcam.o \
>> spectrum_acl.o spectrum_flower.o \
>> spectrum_cnt.o spectrum_fid.o \
>> -   spectrum_ipip.o spectrum_acl_flex_actions.o
>> +   spectrum_ipip.o spectrum_acl_flex_actions.o \
>> +   spectrum_mr.o
>>  mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB) += spectrum_dcb.o
>>  mlxsw_spectrum-$(CONFIG_NET_DEVLINK) += spectrum_dpipe.o
>>  obj-$(CONFIG_MLXSW_MINIMAL) += mlxsw_minimal.o
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>> index e907ec4..51d8b9f 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>> @@ -153,6 +153,7 @@ struct mlxsw_sp {
>>  struct mlxsw_sp_sb *sb;
>>  struct mlxsw_sp_bridge *bridge;
>>  struct mlxsw_sp_router *router;
>> +struct mlxsw_sp_mr *mr;
>>  struct mlxsw_afa *afa;
>>  struct mlxsw_sp_acl *acl;
>>  struct mlxsw_sp_fid_core *fid_core;
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c 
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
>> new file mode 100644
>> index 000..89b2e60
>> --- /dev/null
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
>> @@ -0,0 +1,1014 @@
>> +/*
>> + * drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
>> + * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
>> + * Copyright (c) 2017 Yotam Gigi <yot...@mellanox.com>
>> + *
>> + * Redistribution and use in source and binary forms, with or w

Re: [patch net-next v2 06/12] net: mroute: Check if rule is a default rule

2017-09-24 Thread Yotam Gigi

On 09/25/2017 04:28 AM, Yunsheng Lin wrote:
> Hi, Jiri
>
> On 2017/9/25 1:22, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> When the ipmr starts, it adds one default FIB rule that matches all packets
>> and sends them to the DEFAULT (multicast) FIB table. A more complex rule
>> can be added by user to specify that for a specific interface, a packet
>> should be look up at either an arbitrary table or according to the l3mdev
>> of the interface.
>>
>> For drivers willing to offload the ipmr logic into a hardware but don't
>> want to offload all the FIB rules functionality, provide a function that
>> can indicate whether the FIB rule is the default multicast rule, thus only
>> one routing table is needed.
>>
>> This way, a driver can register to the FIB notification chain, get
>> notifications about FIB rules added and trigger some kind of an internal
>> abort mechanism when a non default rule is added by the user.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>>  include/linux/mroute.h |  7 +++
>>  net/ipv4/ipmr.c| 10 ++
>>  2 files changed, 17 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index 5566580..b072a84 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -5,6 +5,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  
>> @@ -19,6 +20,7 @@ int ip_mroute_getsockopt(struct sock *, int, char __user 
>> *, int __user *);
>>  int ipmr_ioctl(struct sock *sk, int cmd, void __user *arg);
>>  int ipmr_compat_ioctl(struct sock *sk, unsigned int cmd, void __user *arg);
>>  int ip_mr_init(void);
>> +bool ipmr_rule_default(const struct fib_rule *rule);
>>  #else
>>  static inline int ip_mroute_setsockopt(struct sock *sock, int optname,
>> char __user *optval, unsigned int optlen)
>> @@ -46,6 +48,11 @@ static inline int ip_mroute_opt(int opt)
>>  {
>>  return 0;
>>  }
>> +
>> +static inline bool ipmr_rule_default(const struct fib_rule *rule)
>> +{
>> +return true;
>> +}
>>  #endif
>>  
>>  struct vif_device {
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 2a795d2..a714f55 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -320,6 +320,16 @@ static unsigned int ipmr_rules_seq_read(struct net *net)
>>  }
>>  #endif
>>  
>> +bool ipmr_rule_default(const struct fib_rule *rule)
>> +{
>> +#if IS_ENABLED(CONFIG_FIB_RULES)
>> +return fib_rule_matchall(rule) && rule->table == RT_TABLE_DEFAULT;
>> +#else
>> +return true;
>> +#endif
> In patch 02, You have the following, can you do the same for the above?
> +#ifdef CONFIG_IP_MROUTE
> +void ipmr_cache_free(struct mfc_cache *mfc_cache);
> +#else
> +static inline void ipmr_cache_free(struct mfc_cache *mfc_cache)
> +{
> +}
> +#endif

OK.

>> +}
>> +EXPORT_SYMBOL(ipmr_rule_default);
>> +
>>  static inline int ipmr_hash_cmp(struct rhashtable_compare_arg *arg,
>>  const void *ptr)
>>  {
>>

Re: [patch net-next v2 03/12] ipmr: Add FIB notification access functions

2017-09-24 Thread Yotam Gigi

On 09/25/2017 04:19 AM, Yunsheng Lin wrote:
> Hi, Jiri
>
> On 2017/9/25 1:22, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Make the ipmr module register as a FIB notifier. To do that, implement both
>> the ipmr_seq_read and ipmr_dump ops.
>>
>> The ipmr_seq_read op returns a sequence counter that is incremented on
>> every notification related operation done by the ipmr. To implement that,
>> add a sequence counter in the netns_ipv4 struct and increment it whenever a
>> new MFC route or VIF are added or deleted. The sequence operations are
>> protected by the RTNL lock.
>>
>> The ipmr_dump iterates the list of MFC routes and the list of VIF entries
>> and sends notifications about them. The entries dump is done under RCU
>> where the VIF dump uses the mrt_lock too, as the vif->dev field can change
>> under RCU.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>> v1->v2:
>>  - Take the mrt_lock when dumping VIF entries.
>> ---
>>  include/linux/mroute.h   |  15 ++
>>  include/net/netns/ipv4.h |   3 ++
>>  net/ipv4/ipmr.c  | 137 
>> ++-
>>  3 files changed, 153 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index 10028f2..54c5cb8 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -5,6 +5,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #ifdef CONFIG_IP_MROUTE
>> @@ -58,6 +59,14 @@ struct vif_device {
>>  int link;   /* Physical interface index 
>> */
>>  };
>>  
>> +struct vif_entry_notifier_info {
>> +struct fib_notifier_info info;
>> +struct net_device *dev;
>> +vifi_t vif_index;
>> +unsigned short vif_flags;
>> +u32 tb_id;
>> +};
>> +
>>  #define VIFF_STATIC 0x8000
>>  
>>  #define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL)
>> @@ -146,6 +155,12 @@ struct mfc_cache {
>>  struct rcu_head rcu;
>>  };
>>  
>> +struct mfc_entry_notifier_info {
>> +struct fib_notifier_info info;
>> +struct mfc_cache *mfc;
>> +u32 tb_id;
>> +};
>> +
>>  struct rtmsg;
>>  int ipmr_get_route(struct net *net, struct sk_buff *skb,
>> __be32 saddr, __be32 daddr,
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index 8387f09..abc84d9 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -163,6 +163,9 @@ struct netns_ipv4 {
>>  struct fib_notifier_ops *notifier_ops;
>>  unsigned intfib_seq;/* protected by rtnl_mutex */
>>  
>> +struct fib_notifier_ops *ipmr_notifier_ops;
> Can we add a const here?

It cannot be const as it get initialized it in ipmr_notifier_init.

>
>> +unsigned intipmr_seq;   /* protected by rtnl_mutex */
>> +
>>  atomic_trt_genid;
>>  };
>>  #endif
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 86dc5f9..49879c3 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -264,6 +264,16 @@ static void __net_exit ipmr_rules_exit(struct net *net)
>>  fib_rules_unregister(net->ipv4.mr_rules_ops);
>>  rtnl_unlock();
>>  }
>> +
>> +static int ipmr_rules_dump(struct net *net, struct notifier_block *nb)
>> +{
>> +return fib_rules_dump(net, nb, RTNL_FAMILY_IPMR);
>> +}
>> +
>> +static unsigned int ipmr_rules_seq_read(struct net *net)
>> +{
>> +return fib_rules_seq_read(net, RTNL_FAMILY_IPMR);
>> +}
>>  #else
>>  #define ipmr_for_each_table(mrt, net) \
>>  for (mrt = net->ipv4.mrt; mrt; mrt = NULL)
>> @@ -298,6 +308,16 @@ static void __net_exit ipmr_rules_exit(struct net *net)
>>  net->ipv4.mrt = NULL;
>>  rtnl_unlock();
>>  }
>> +
>> +static int ipmr_rules_dump(struct net *net, struct notifier_block *nb)
>> +{
>> +return 0;
>> +}
>> +
>> +static unsigned int ipmr_rules_seq_read(struct net *net)
>> +{
>> +return 0;
>> +}
>>  #endif
>>  
>>  static inline int ipmr_hash_cmp(struct rhashtable_compare_arg *arg,
>> @@ -587,6 +607,43 @@ static struct net_device *ipmr_reg_vif(struct net *net, 
>> s

Re: [patch net-next 07/12] mlxsw: spectrum: Add the multicast routing offloading logic

2017-09-23 Thread Yotam Gigi

On 09/22/2017 04:21 PM, Andrew Lunn wrote:
> On Fri, Sep 22, 2017 at 11:36:59AM +0300, Yotam Gigi wrote:
>> On 09/21/2017 06:26 PM, Andrew Lunn wrote:
>>>> +static void mlxsw_sp_mr_route_stats_update(struct mlxsw_sp *mlxsw_sp,
>>>> + struct mlxsw_sp_mr_route *mr_route)
>>>> +{
>>>> +  struct mlxsw_sp_mr *mr = mlxsw_sp->mr;
>>>> +  u64 packets, bytes;
>>>> +
>>>> +  if (mr_route->route_action == MLXSW_SP_MR_ROUTE_ACTION_TRAP)
>>>> +  return;
>>>> +
>>>> +  mr->mr_ops->route_stats(mlxsw_sp, mr_route->route_priv, ,
>>>> +  );
>>>> +
>>>> +  switch (mr_route->mr_table->proto) {
>>>> +  case MLXSW_SP_L3_PROTO_IPV4:
>>>> +  mr_route->mfc4->mfc_un.res.pkt = packets;
>>>> +  mr_route->mfc4->mfc_un.res.bytes = bytes;
>>> What about wrong_if and lastuse? 
> Hi Yotam
>
>> wronf_if is updated by ipmr as it is trapped to the CPU.
> Great.
>
>> We did not address lastuse currently, though it can be easily
>> addressed here.
> Please do. I've written multicast routing daemons, where i use it to
> flush out MFCs which are no longer in use. Having it always 0 is going
> to break daemons.

I will. Thanks for the feedback!

>  
>>> Is an mfc with iif on the host, not the switch, not offloaded?
>>
>> I am not sure I followed. What do you mean MFC with iif on the host? you mean
>> MFC with iif that is an external NIC which is not part of the spectrum ASIC?
> Yes. We probably have different perspectives on the world. To
> Mellanox, everything is a switch in a box. In the DSA world, we tend
> to think of having a general purpose machine which also has a switch
> connected. Think of a wireless access point, set top box, passenger
> entertainment system. We have applications on the general purpose
> computer, we have wifi interfaces, cable modems, etc. Think about all
> the different packages in LEDE. We might have a multicast video
> stream, coming from the cable modem being sent over ports of the
> switch to clients.
>
> So when i look at these patches, i try to make sure the general use
> cases works, not just the plain boring Ethernet switch box use cases
> :-)

So when doing it, we did think about multi-ASIC situations, so I think it should
fit :)

>
>> in this case, the route will not be offloaded and all traffic will
>> pass in slowpath.
> O.K. I was just thinking if those counters need to be +=, not =.  But
> either the iif is on the host, or it is in the switch. It cannot be
> both. So = is O.K.
>
> Thanks
>   Andrew

Re: [patch net-next 07/12] mlxsw: spectrum: Add the multicast routing offloading logic

2017-09-22 Thread Yotam Gigi

On 09/21/2017 06:26 PM, Andrew Lunn wrote:
>> +static void mlxsw_sp_mr_route_stats_update(struct mlxsw_sp *mlxsw_sp,
>> +   struct mlxsw_sp_mr_route *mr_route)
>> +{
>> +struct mlxsw_sp_mr *mr = mlxsw_sp->mr;
>> +u64 packets, bytes;
>> +
>> +if (mr_route->route_action == MLXSW_SP_MR_ROUTE_ACTION_TRAP)
>> +return;
>> +
>> +mr->mr_ops->route_stats(mlxsw_sp, mr_route->route_priv, ,
>> +);
>> +
>> +switch (mr_route->mr_table->proto) {
>> +case MLXSW_SP_L3_PROTO_IPV4:
>> +mr_route->mfc4->mfc_un.res.pkt = packets;
>> +mr_route->mfc4->mfc_un.res.bytes = bytes;
> What about wrong_if and lastuse? 

wronf_if is updated by ipmr as it is trapped to the CPU. We did not
address lastuse currently, though it can be easily addressed here.

>
> Is an mfc with iif on the host, not the switch, not offloaded?

I am not sure I followed. What do you mean MFC with iif on the host? you mean
MFC with iif that is an external NIC which is not part of the spectrum ASIC? in
this case, the route will not be offloaded and all traffic will pass in 
slowpath.

>
>Andrew

Re: [patch net-next 03/12] ipmr: Add FIB notification access functions

2017-09-22 Thread Yotam Gigi

On 09/21/2017 02:19 PM, Nikolay Aleksandrov wrote:
> On 21/09/17 09:43, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> Make the ipmr module register as a FIB notifier. To do that, implement both
>> the ipmr_seq_read and ipmr_dump ops.
>>
>> The ipmr_seq_read op returns a sequence counter that is incremented on
>> every notification related operation done by the ipmr. To implement that,
>> add a sequence counter in the netns_ipv4 struct and increment it whenever a
>> new MFC route or VIF are added or deleted. The sequence operations are
>> protected by the RTNL lock.
>>
>> The ipmr_dump iterates the list of MFC routes and the list of VIF entries
>> and sends notifications about them. The entries dump is done under RCU.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> ---
>>  include/linux/mroute.h   |  15 ++
>>  include/net/netns/ipv4.h |   3 ++
>>  net/ipv4/ipmr.c  | 135 
>> ++-
>>  3 files changed, 151 insertions(+), 2 deletions(-)
>>
> [snip]
>> +
>> +static int ipmr_dump(struct net *net, struct notifier_block *nb)
>> +{
>> +struct mr_table *mrt;
>> +int err;
>> +
>> +err = ipmr_rules_dump(net, nb);
>> +if (err)
>> +return err;
>> +
>> +ipmr_for_each_table(mrt, net) {
>> +struct vif_device *v = >vif_table[0];
>> +struct mfc_cache *mfc;
>> +int vifi;
>> +
>> +/* Notifiy on table VIF entries */
>> +for (vifi = 0; vifi < mrt->maxvif; vifi++, v++) {
>> +if (!v->dev)
>> +continue;
>> +
>> +call_ipmr_vif_entry_notifier(nb, net, FIB_EVENT_VIF_ADD,
>> + v, vifi, mrt->id);
>> +}
> The VIF table is protected by mrt_lock (rwlock), here with RCU only
> you're not guaranteed to keep v->dev. It can become NULL after the check 
> above.
> For details you can see vif_delete() in net/ipv4/ipmr.c. You need at least
> mrt_lock for reading.

Hmm, that's interesting. That situation would lead the dump to be inconsistent,
thus eventually discarded, right? So I can also just make sure that the driver
ignores dev=NULL notifications and that would be enough to solve it.

I have to think about it a bit more, but anyway, maybe taking the mrt_lock for
the VIF dump is the right solution here.

Anyway, thanks for the review!


>> +
>> +/* Notify on table MFC entries */
>> +list_for_each_entry_rcu(mfc, >mfc_cache_list, list)
>> +call_ipmr_mfc_entry_notifier(nb, net,
>> + FIB_EVENT_ENTRY_ADD, mfc,
>> + mrt->id);
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +static const struct fib_notifier_ops ipmr_notifier_ops_template = {
>> +.family = RTNL_FAMILY_IPMR,
>> +.fib_seq_read   = ipmr_seq_read,
>> +.fib_dump   = ipmr_dump,
>> +.owner  = THIS_MODULE,
>> +};
>> +
>> +int __net_init ipmr_notifier_init(struct net *net)
>> +{
>> +struct fib_notifier_ops *ops;
>> +
>> +net->ipv4.ipmr_seq = 0;
>> +
>> +ops = fib_notifier_ops_register(_notifier_ops_template, net);
>> +if (IS_ERR(ops))
>> +return PTR_ERR(ops);
>> +net->ipv4.ipmr_notifier_ops = ops;
>> +
>> +return 0;
>> +}
>> +
>> +static void __net_exit ipmr_notifier_exit(struct net *net)
>> +{
>> +fib_notifier_ops_unregister(net->ipv4.ipmr_notifier_ops);
>> +net->ipv4.ipmr_notifier_ops = NULL;
>> +}
>> +
>>  /* Setup for IP multicast routing */
>>  static int __net_init ipmr_net_init(struct net *net)
>>  {
>>  int err;
>>  
>> +err = ipmr_notifier_init(net);
>> +if (err)
>> +goto ipmr_notifier_fail;
>> +
>>  err = ipmr_rules_init(net);
>>  if (err < 0)
>> -goto fail;
>> +goto ipmr_rules_fail;
>>  
>>  #ifdef CONFIG_PROC_FS
>>  err = -ENOMEM;
>> @@ -3074,7 +3202,9 @@ static int __net_init ipmr_net_init(struct net *net)
>>  proc_vif_fail:
>>  ipmr_rules_exit(net);
>>  #endif
>> -fail:
>> +ipmr_rules_fail:
>> +ipmr_notifier_exit(net);
>> +ipmr_notifier_fail:
>>  return err;
>>  }
>>  
>> @@ -3084,6 +3214,7 @@ static void __net_exit ipmr_net_exit(struct net *net)
>>  remove_proc_entry("ip_mr_cache", net->proc_net);
>>  remove_proc_entry("ip_mr_vif", net->proc_net);
>>  #endif
>> +ipmr_notifier_exit(net);
>>  ipmr_rules_exit(net);
>>  }
>>  
>>

Re: [PATCH net] net/sched: cls_matchall: fix crash when used with classful qdisc

2017-09-17 Thread Yotam Gigi

On 09/16/2017 03:02 PM, Davide Caratti wrote:
> this script, edited from Linux Advanced Routing and Traffic Control guide
>
> tc q a dev en0 root handle 1: htb default a
> tc c a dev en0 parent 1:  classid 1:1 htb rate 6mbit burst 15k
> tc c a dev en0 parent 1:1 classid 1:a htb rate 5mbit ceil 6mbit burst 15k
> tc c a dev en0 parent 1:1 classid 1:b htb rate 1mbit ceil 6mbit burst 15k
> tc f a dev en0 parent 1:0 prio 1 $clsname $clsargs classid 1:b
> ping $address -c1
> tc -s c s dev en0
>
> classifies traffic to 1:b or 1:a, depending on whether the packet matches
> or not the pattern $clsargs of filter $clsname. However, when $clsname is
> 'matchall', a systematic crash can be observed in htb_classify(). HTB and
> classful qdiscs don't assign initial value to struct tcf_result, but then
> they expect it to contain valid values after filters have been run. Thus,
> current 'matchall' ignores the TCA_MATCHALL_CLASSID attribute, configured
> by user, and makes HTB (and classful qdiscs) dereference random pointers.
>
> By assigning head->res to *res in mall_classify(), before the actions are
> invoked, we fix this crash and enable TCA_MATCHALL_CLASSID functionality,
> that had no effect on 'matchall' classifier since its first introduction.
>
> BugLink: 
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1460213=02%7C01%7Cyotamg%40mellanox.com%7C399f6ff50cb148cbd0d408d4fcfad4c7%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636411601630363571=PSkkBrWNXkTxsvXrTmK6Dx9iKZMq61MAKlTcdVcPj8w%3D=0
> Reported-by: Jiri Benc <jb...@redhat.com>
> Fixes: b87f7936a932 ("net/sched: introduce Match-all classifier")
> Signed-off-by: Davide Caratti <dcara...@redhat.com>
> ---
>  net/sched/cls_matchall.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
> index 21cc45caf842..eeac606c95ab 100644
> --- a/net/sched/cls_matchall.c
> +++ b/net/sched/cls_matchall.c
> @@ -32,6 +32,7 @@ static int mall_classify(struct sk_buff *skb, const struct 
> tcf_proto *tp,
>   if (tc_skip_sw(head->flags))
>   return -1;
>  
> + *res = head->res;
>   return tcf_exts_exec(skb, >exts, res);
>  }
>  

Acked-by: Yotam Gigi <yot...@mellanox.com>

Re: IGMP snooping, switchdev and local multicast receiver on br interface

2017-07-16 Thread Yotam Gigi

On 07/14/2017 06:30 PM, Vivien Didelot wrote:
> Hi All,
>
> Andrew Lunn  writes:
>
>> I've been testing IGMP snooping support with DSA, putting MDB entries
>> into the switch so that traffic only goes out ports where there has
>> been an interest indicated via IGMP. It mostly works, but i've come
>> across one use case which does not.
>>
>> I have a multicast listener running on the host, performing a
>> setsockopt(IP_ADD_MEMBERSHIP) on the bridge interface. It is not an
>> unreasonable thing to want to do, e.g. a WiFi access point listening
>> to mDNS, or running other multicast protocols, a STB wanting to
>> receive a multicast video stream to display on the set, etc.
>>
>> I'm not seeing any switchdev operations when the IP_ADD_MEMBERSHIP is
>> called. So there is no indication that the switch should add an MDB
>> entry to forward traffic to the host.
>>
>> Im i missing something, or is this not implemented?

You are not missing, we did not add that support yet. Currently the hardware MDB
does not get updated with mcast_routers. The flood tables does get updated
though.

> I follow Andrew's question with another multicast issue I'm having:
>
> It seems like there is no way to add a multicast group via its MAC
> address. All iproute2 and kernel bridge code assumes IP multicast
> (0x0800 IPv4 and 0x86DD IPv6.)
>
> But there are valid cases where you might want to add an L2 multicast
> group on a specific VLAN ID, e.g. for 0x88F7 PTP, 0x88BA Multicast
> sampled values, one of the 802.1D reserved 01-80-C2-* addresses, or any
> proprietary protocol addresses.
>
> There is the ip-maddress VLAN-unaware tool using RTM_NEWADDR which isn't
> bound to switchdev, or bridge-mdb which only accepts a IPv4 or IPv6 grp.
>
> I tried to hack a PoC in iproute2 (http://ix.io/yuJ) but the kernel
> counterpart is not trivial at all. *br_mdb_entry only play with br_ip...
>
> Any thoughts on this?

We did not have this usecase yet, but what you say make perfect sense :)

It is weird for me too that one can not configure MDB with MAC address.

>
> Regards,
>
> Vivien

Re: [PATCH net-next] net/mlxfw: fix a NULL dereference

2017-06-14 Thread Yotam Gigi

On 06/14/2017 01:41 PM, Dan Carpenter wrote:
> If we hit this error path we end up returning ERR_PTR(0) which is NULL.
> The caller is not expecting that so it results in a NULL dereference.
>
> Fixes: 410ed13cae39 ("Add the mlxfw module for Mellanox firmware flash 
> process")
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
>
> diff --git a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c 
> b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> index 628150d28061..993cb5ba934e 100644
> --- a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> +++ b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> @@ -594,6 +594,7 @@ struct mlxfw_mfa2_component *
>   if (memcmp(comp_data->buff, mlxfw_mfa2_comp_magic,
>  mlxfw_mfa2_comp_magic_len) != 0) {
>   pr_err("Component has wrong magic\n");
> +     err = -EINVAL;
>   goto err_out;
>   }
>  


Acked-by: Yotam Gigi <yot...@mellanox.com>

ipmr: MFC routes when VIF deleted

2017-06-11 Thread Yotam Gigi

I have been looking into some weird behavior, and I am not sure whether it is
a bug or a feature.

When a VIF with index v gets deleted, the MFC routes does not get updated, which
means that there can be routes pointing to that VIF. On datapath, when packet
hits that route, the VIF validity will be checked and will not be sent to that
device (but still, the route does not get updated).  Now, if the user creates
another VIF with the same index v but different underlay device, the same route
will forward the traffic to that device.

It is relevant to mention that when user adds a MFC route, only the active VIFs
are used, so the flow of adding a route with dummy VIF indices and then
connecting those VIF indices to real device is not supported. The only way to
create a MFC route that has non existing VIFs is to create one with existing
VIFs and then delete them.

Do we really want to support that?  To me, it looks like a buggy flow and I
suggest that upon VIF deletion, the MFC routes will be updated to not point to
any non existing VIF indices.

Re: [PATCH net v3] net: ipmr: Fix some mroute forwarding issues in vrf's

2017-06-11 Thread Yotam Gigi

   if (vif >= 0) {
> - int err2 = ipmr_cache_unresolved(mrt, vif, skb);
> + int err2 = ipmr_cache_unresolved(mrt, vif, skb, dev);
>   read_unlock(_lock);
>  
>   return err2;
> @@ -2064,7 +2062,7 @@ int ip_mr_input(struct sk_buff *skb)
>   }
>  
>   read_lock(_lock);
> - ip_mr_forward(net, mrt, skb, cache, local);
> + ip_mr_forward(net, mrt, dev, skb, cache, local);
>   read_unlock(_lock);
>  
>   if (local)
> @@ -2238,7 +2236,7 @@ int ipmr_get_route(struct net *net, struct sk_buff *skb,
>   iph->saddr = saddr;
>   iph->daddr = daddr;
>   iph->version = 0;
> - err = ipmr_cache_unresolved(mrt, vif, skb2);
> + err = ipmr_cache_unresolved(mrt, vif, skb2, dev);
>   read_unlock(_lock);
>   rcu_read_unlock();
>   return err;

Thanks Donald!

Reviewed-by: Yotam Gigi <yot...@mellanox.com>

Re: [PATCH][net-next] net/mlxfw: remove redundant goto on error check

2017-06-06 Thread Yotam Gigi

On 06/06/2017 01:47 PM, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
>
> The check to see of err is set and the subsequent goto is extraneous
> as the next statement is where the goto is jumping to. Remove this
> redundant check and goto.
>
> Detected by CoverityScan, CID#1437734 ("Identical code for
> different branches")
>
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c 
> b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> index 7e9589061d30..628150d28061 100644
> --- a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> +++ b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
> @@ -492,8 +492,6 @@ static int mlxfw_mfa2_file_cb_offset_xz(const struct 
> mlxfw_mfa2_file *mfa2_file,
>   dec_buf.out_pos = 0;
>   dec_buf.out_size = size;
>   err = mlxfw_mfa2_xz_dec_run(xz_dec, _buf, );
> - if (err)
> -     goto out;
>  out:
>   xz_dec_end(xz_dec);
>   return err;


Thanks!

Acked-by: Yotam Gigi <yot...@mellanox.com>

[PATCH net-next] mlxsw: spectrum: Implement the ethtool flash_device callback

2017-06-01 Thread Yotam Gigi

Add callback to the ethtool flash_device op. This callback uses the mlxfw
module to flash the new firmware file to the device.

As the firmware flash process takes about 20 seconds and ethtool takes the
rtnl lock during the flash_device callback, release the rtnl lock at the
beginning of the flash process and take it again before leaving the
callback. This way, the rtnl is not held during the process. To make sure
the device does not get deleted during the flash process, take a reference
to it before releasing the rtnl lock.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
Reviewed-by: Ido Schimmel <ido...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 51 +-
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index e25c1fd..84b6f36 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -321,6 +321,21 @@ static const struct mlxfw_dev_ops mlxsw_sp_mlxfw_dev_ops = 
{
.fsm_release= mlxsw_sp_fsm_release
 };
 
+static int mlxsw_sp_firmware_flash(struct mlxsw_sp *mlxsw_sp,
+  const struct firmware *firmware)
+{
+   struct mlxsw_sp_mlxfw_dev mlxsw_sp_mlxfw_dev = {
+   .mlxfw_dev = {
+   .ops = _sp_mlxfw_dev_ops,
+   .psid = mlxsw_sp->bus_info->psid,
+   .psid_size = strlen(mlxsw_sp->bus_info->psid),
+   },
+   .mlxsw_sp = mlxsw_sp
+   };
+
+   return mlxfw_firmware_flash(_sp_mlxfw_dev.mlxfw_dev, firmware);
+}
+
 static bool mlxsw_sp_fw_rev_ge(const struct mlxsw_fw_rev *a,
   const struct mlxsw_fw_rev *b)
 {
@@ -334,14 +349,6 @@ static bool mlxsw_sp_fw_rev_ge(const struct mlxsw_fw_rev 
*a,
 static int mlxsw_sp_fw_rev_validate(struct mlxsw_sp *mlxsw_sp)
 {
const struct mlxsw_fw_rev *rev = _sp->bus_info->fw_rev;
-   struct mlxsw_sp_mlxfw_dev mlxsw_sp_mlxfw_dev = {
-   .mlxfw_dev = {
-   .ops = _sp_mlxfw_dev_ops,
-   .psid = mlxsw_sp->bus_info->psid,
-   .psid_size = strlen(mlxsw_sp->bus_info->psid),
-   },
-   .mlxsw_sp = mlxsw_sp
-   };
const struct firmware *firmware;
int err;
 
@@ -361,7 +368,7 @@ static int mlxsw_sp_fw_rev_validate(struct mlxsw_sp 
*mlxsw_sp)
return err;
}
 
-   err = mlxfw_firmware_flash(_sp_mlxfw_dev.mlxfw_dev, firmware);
+   err = mlxsw_sp_firmware_flash(mlxsw_sp, firmware);
release_firmware(firmware);
return err;
 }
@@ -2495,6 +2502,31 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
return 0;
 }
 
+static int mlxsw_sp_flash_device(struct net_device *dev,
+struct ethtool_flash *flash)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+   const struct firmware *firmware;
+   int err;
+
+   if (flash->region != ETHTOOL_FLASH_ALL_REGIONS)
+   return -EOPNOTSUPP;
+
+   dev_hold(dev);
+   rtnl_unlock();
+
+   err = request_firmware_direct(, flash->data, >dev);
+   if (err)
+   goto out;
+   err = mlxsw_sp_firmware_flash(mlxsw_sp, firmware);
+   release_firmware(firmware);
+out:
+   rtnl_lock();
+   dev_put(dev);
+   return err;
+}
+
 static const struct ethtool_ops mlxsw_sp_port_ethtool_ops = {
.get_drvinfo= mlxsw_sp_port_get_drvinfo,
.get_link   = ethtool_op_get_link,
@@ -2506,6 +2538,7 @@ static const struct ethtool_ops mlxsw_sp_port_ethtool_ops 
= {
.get_sset_count = mlxsw_sp_port_get_sset_count,
.get_link_ksettings = mlxsw_sp_port_get_link_ksettings,
.set_link_ksettings = mlxsw_sp_port_set_link_ksettings,
+   .flash_device   = mlxsw_sp_flash_device,
 };
 
 static int
-- 
2.8.4

Re: [PATCH linux-firmware] Mellanox: Add firmware for mlxsw_spectrum

2017-05-31 Thread Yotam Gigi

On 05/30/2017 10:22 PM, Kyle McMartin wrote:
> On Mon, May 29, 2017 at 01:42:28PM +0300, Yotam Gigi wrote:
>> Add first firmware for the Mellanox Spectrum switch, as a followup to the
>> recently added commit:
>> 6b7421992b8d ("mlxsw: spectrum: Validate firmware revision on init")
>>
>> The version of the firmware release is 13.1420.122
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> ---
>>  mlxsw_spectrum-13.1420.122.mfa2 | Bin 0 -> 860424 bytes
>>  1 file changed, 0 insertions(+), 0 deletions(-)
>>  create mode 100644 mlxsw_spectrum-13.1420.122.mfa2
>>
>> diff --git a/mlxsw_spectrum-13.1420.122.mfa2 
>> b/mlxsw_spectrum-13.1420.122.mfa2
>> new file mode 100644
> Can we please put this in a subdirectory if we're going to be adding
> more versions of this in the future?
>
> --Kyle


Ok. Will fix.
Thanks!

Re: [PATCH net-next] net/mlxfw: select CONFIG_XZ_DEC

2017-05-30 Thread Yotam Gigi

On 05/30/2017 12:26 PM, Arnd Bergmann wrote:
> The new mlxfw code fails to build without the xz library:
>
> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.o: In function 
> `mlxfw_mfa2_xz_dec_run':
> :(.text.mlxfw_mfa2_xz_dec_run+0x8): undefined reference to `xz_dec_run'
> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.o: In function 
> `mlxfw_mfa2_file_component_get':
> :(.text.mlxfw_mfa2_file_component_get+0x218): undefined reference to 
> `xz_dec_init'
> :(.text.mlxfw_mfa2_file_component_get+0x2c0): undefined reference to 
> `xz_dec_end'
>
> This adds a Kconfig 'select' statement for it, which is also what
> the other user of that library has.
>
> Fixes: 410ed13cae39 ("Add the mlxfw module for Mellanox firmware flash 
> process")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/net/ethernet/mellanox/mlxfw/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxfw/Kconfig 
> b/drivers/net/ethernet/mellanox/mlxfw/Kconfig
> index 56b60ac7bc34..2b21af8a2b1d 100644
> --- a/drivers/net/ethernet/mellanox/mlxfw/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlxfw/Kconfig
> @@ -4,3 +4,4 @@
>  
>  config MLXFW
>  tristate "mlxfw" if COMPILE_TEST
> + select XZ_DEC

Thanks!

Acked-by: Yotam Gigi <yot...@mellanox.com>

Re: [patch net-next 0/9] mlxsw: Support firmware flash

2017-05-29 Thread Yotam Gigi

On 05/29/2017 03:17 AM, Jakub Kicinski wrote:
> On Sun, 28 May 2017 10:26:49 +0300, Yotam Gigi wrote:
>> On 05/23/2017 06:38 PM, David Miller wrote:
>>> From: Yotam Gigi <yot...@mellanox.com>
>>> Date: Tue, 23 May 2017 18:14:15 +0300
>>>  
>>>> Sorry, I am not sure I understand. You think that drivers should not 
>>>> implement
>>>> ethtool's flash_device callback anymore? do you have an alternative for 
>>>> firmware
>>>> flash?  
>>> As stated, export an MTD device.  
>> So, after we have been going over MTD, it seems like it does not fit our 
>> needs
>> at all.
>>
>> MTD device provides (erasable-)block access to a flash storage, where in our
>> case the firmware burn process is just pouring a binary BLOB into the device.
>> The driver is not aware of the internal storage used for storing the 
>> firmware as
>> it is not defined in our driver-hardware API.
>>
>> Needless to say that block access has no meaning in our case, so any solution
>> that will involve MTD device to burn our firmware (if there is a solution at
>> all) will be a workaround and will not fit MTD purpose.
>>
>> Apart for boot time firmware flash, which we have already pushed we would 
>> really
>> like to allow the user to ask for a specific firmware version. Do you have 
>> any
>> other solution for us apart from "ethtool -f"?
> Could you elaborate on what the requirements are for "allowing users to
> ask for a specific firmware version"?  How do the FWs differ?  I'm
> asking because we are currently lacking ABI for selecting device
> "modes".  Netronome has this problem.  Cavium has recently posted a
> patch which used module parameter to flip between "OvS" and "basic NIC"
> firmwares.
>
> For Netronome we will definitely want a way to switch between at least
> three applications so far - basic, OvS and eBPF - but I also feel like
> we shouldn't limit that list, since anyone can write their own FW for
> programmable NICs.

Currently, both in the HCA and switch we only refer to FW images of different
versions which naturally include bug fixes and enhancements (new features), and
not configurations or hardware modes.
 
In the switch driver we use boot time firmware upgrade if the minimum required
version is not met, so it is true that the "ethtool -f" is not really required
for normal operation.

However, it is of much use for development, testing, QA, lab and PoC
environments when specific bug fixes or new features which are not yet present
in official FW releases and hence not yet posted to the upstream libfirmware.

I do agree, on the other hand, that the difference between firmware versions and
firmware configuration is semantic, and there is no code-enforcement on that,
but it will be of much convenience for the user to upgrade to a newer firmware
that includes more features.


>
> I think you were primarily concerned with writing persistent storage so
> far.  Does "allowing the user to ask..." means write flash and reboot or
> also a runtime switch?  I think we probably need both?


Currently the new firmware will be activated after reboot, but we may change it
in the future.


>
>> This problem is even more relevant in the Mellanox HCA driver team, which 
>> would
>> like to use that code in order to burn the HCA firmware, but not intend to
>> trigger it on boot time, which means that must have a way for the user to
>> trigger it.
> What would the requirements for the HCA team be?  Is it about loading
> different code or loading HW settings?

Re: [patch net-next 0/9] mlxsw: Support firmware flash

2017-05-28 Thread Yotam Gigi

On 05/23/2017 06:38 PM, David Miller wrote:
> From: Yotam Gigi <yot...@mellanox.com>
> Date: Tue, 23 May 2017 18:14:15 +0300
>
>> Sorry, I am not sure I understand. You think that drivers should not 
>> implement
>> ethtool's flash_device callback anymore? do you have an alternative for 
>> firmware
>> flash?
> As stated, export an MTD device.

So, after we have been going over MTD, it seems like it does not fit our needs
at all.

MTD device provides (erasable-)block access to a flash storage, where in our
case the firmware burn process is just pouring a binary BLOB into the device.
The driver is not aware of the internal storage used for storing the firmware as
it is not defined in our driver-hardware API.

Needless to say that block access has no meaning in our case, so any solution
that will involve MTD device to burn our firmware (if there is a solution at
all) will be a workaround and will not fit MTD purpose.

Apart for boot time firmware flash, which we have already pushed we would really
like to allow the user to ask for a specific firmware version. Do you have any
other solution for us apart from "ethtool -f"?

This problem is even more relevant in the Mellanox HCA driver team, which would
like to use that code in order to burn the HCA firmware, but not intend to
trigger it on boot time, which means that must have a way for the user to
trigger it.

Re: [patch net-next 0/9] mlxsw: Support firmware flash

2017-05-23 Thread Yotam Gigi

On 05/23/2017 06:04 PM, Mintz, Yuval wrote:
 Patches 6-8 add the "ethtool -f" and the boot-time firmware upgrade on
>> the
 mlxsw spectrum driver.
>>> When we tried using `ethtool -E' for qed we got burned for trying to use
>> the magic
>>> value [1]. When we suggested extending it to allow some private data
>> indications,
>>> Ben claimed that this entire ethtool infrastructure is a thing of the past,
>>> and we should try using MTD instead - a claim no one bothered to counter.
>>>
>>> Creating proprietary file-formats, filling them with metadata and
>>> complementary driver state-machines could easily overcome the original
>>> obstacle we faced. But it raises the question of whether these APIs are
>>> valid or obsolete.
>> The metadata in our format is needed to allow us to hold several firmware
>> images for several spectrum silicon variants in one file, hence the metadata
>> is used by the driver and does not get transferred to the device.
> Understood; Otherwise it would have been 'data' and not 'metadata'.
>
>> Our code can only be used to transfer firmware to the device, and cannot
>> be used to configure the device.
> Not sure what this refers to? The differences between '-f' and '-E'?
> Also, assuming you haven't started selling your adapters as persistent memory
> for storage, what's the difference?
>

Sorry, I am not sure I understand. You think that drivers should not implement
ethtool's flash_device callback anymore? do you have an alternative for firmware
flash?

Re: [patch net-next 0/9] mlxsw: Support firmware flash

2017-05-23 Thread Yotam Gigi

On 05/23/2017 04:18 PM, Mintz, Yuval wrote:
>> Add support for device firmware flash on mlxsw spectrum. The firmware files
>> are expected to be in the Mellanox Firmware Archive version 2 (MFA2)
>> format.
>>
>> The firmware flash can be triggered via "ethtool -f" or on driver 
>> initialization
>> time if the device firmware version does not meet the minimum firmware
>> version supported by the driver.
>>
>> Currently, to activate the newly flashed firmware, the user needs to reboot
>> his system.
>>
>> The first patch introduces the mlxfw module, which implements common
>> logic needed for the firmware flash process on Mellanox products, such as
>> the
>> MFA2 format parsing and the firmware flash state machine logic. As the
>> module implements common logic which will be needed by various different
>> Mellanox drivers, it defines a set of callbacks needed to interact with the
>> specific device.
>>
>> Patches 1-5 implement the needed mlxfw callbacks in the mlxsw spectrum
>> driver.
>>
>> Patches 6-8 add the "ethtool -f" and the boot-time firmware upgrade on the
>> mlxsw spectrum driver.
> When we tried using `ethtool -E' for qed we got burned for trying to use the 
> magic
> value [1]. When we suggested extending it to allow some private data 
> indications,
> Ben claimed that this entire ethtool infrastructure is a thing of the past,
> and we should try using MTD instead - a claim no one bothered to counter.
>
> Creating proprietary file-formats, filling them with metadata and
> complementary driver state-machines could easily overcome the original
> obstacle we faced. But it raises the question of whether these APIs are
> valid or obsolete.

The metadata in our format is needed to allow us to hold several firmware images
for several spectrum silicon variants in one file, hence the metadata is used by
the driver and does not get transferred to the device. Our code can only be used
to transfer firmware to the device, and cannot be used to configure the device.


>
> [1] http://marc.info/?l=linux-netdev=146093513926921=4
> [2] http://marc.info/?l=linux-netdev=146514461214421=2
>
>> Patch 9 adds a fix needed for new firmware versions.
>>
>> Ido Schimmel (1):
>>   mlxsw: spectrum_router: Adjust RIF configuration for new firmware
>> versions
>>
>> Yotam Gigi (8):
>>   Add the mlxfw module for Mellanox firmware flash process
>>   mlxsw: reg: Add Management Component Query Information register
>>   mlxsw: reg: Add Management Component Control register
>>   mlxsw: reg: Add Management Component Data Access register
>>   mlxsw: spectrum: Add the needed callbacks for mlxfw integration
>>   mlxsw: spectrum: Implement the ethtool flash_device callback
>>   mlxsw: core: Create the mlxsw_fw_rev struct
>>   mlxsw: spectrum: Validate firmware revision on init
>>
>>  MAINTAINERS|   8 +
>>  drivers/net/ethernet/mellanox/Kconfig  |   1 +
>>  drivers/net/ethernet/mellanox/Makefile |   1 +
>>  drivers/net/ethernet/mellanox/mlxfw/Kconfig|   6 +
>>  drivers/net/ethernet/mellanox/mlxfw/Makefile   |   2 +
>>  drivers/net/ethernet/mellanox/mlxfw/mlxfw.h| 102 
>>  drivers/net/ethernet/mellanox/mlxfw/mlxfw_fsm.c| 273 +
>>  drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c   | 620
>> +
>>  drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.h   |  66 +++
>>  .../net/ethernet/mellanox/mlxfw/mlxfw_mfa2_file.h  |  60 ++
>>  .../ethernet/mellanox/mlxfw/mlxfw_mfa2_format.h| 103 
>>  .../net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv.h   |  98 
>>  .../ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c | 126 +
>> .../ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.h |  71 +++
>>  drivers/net/ethernet/mellanox/mlxsw/Kconfig|   1 +
>>  drivers/net/ethernet/mellanox/mlxsw/core.h |  12 +-
>>  drivers/net/ethernet/mellanox/mlxsw/reg.h  | 219 
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 266 +
>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  22 +
>>  19 files changed, 2052 insertions(+), 5 deletions(-)  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/Kconfig
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxfw/Makefile
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxfw/mlxfw.h
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxfw/mlxfw_fsm.c
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.c
>>  create mode 100644 drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2.h
>>  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_file.h
>>  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_format.h
>>  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv.h
>>  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c
>>  create mode 100644
>> drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.h
>>
>> --
>> 2.9.3

Re: [PATCH net] rtnetlink: Fix the IFLA_PHYS_PORT_NAME TLV to include terminating NULL

2017-05-09 Thread Yotam Gigi

On 05/09/2017 03:31 PM, Tobias Klauser wrote:
> On 2017-05-09 at 14:12:02 +0200, Yotam Gigi <yot...@mellanox.com> wrote:
>> The IFLA_PHYS_PORT_NAME rtnetlink TLV length does not include the
>> terminating NULL character, which is different from other string typed
>> TLVs. Due to the fact that libnl checks for the terminating NULL in every
>> string typed attribute, it crashes on every RTM_GETLINK response on
>> drivers that implement ndo_get_phys_port_name.
>>
>> Make the fill_phys_port_name function include the terminating NULL in the
>> TLV size by using the nla_put_string helper function.
>>
>> Fixes: db24a9044ee1 ("net: add support for phys_port_name")
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Cc: David Ahern <d...@cumulusnetworks.com>
>> Reviewed-by: Ido Schimmel <ido...@mellanox.com>
>> Acked-by: Jiri Pirko <j...@mellanox.com>
>> ---
>> Please consider this for stable too. Thanks!
> This is already fixed in commit 77ef033b687c ("rtnetlink: NUL-terminate
> IFLA_PHYS_PORT_NAME string").


You are right. I forgot to rebase my net tree :)

[PATCH net] rtnetlink: Fix the IFLA_PHYS_PORT_NAME TLV to include terminating NULL

2017-05-09 Thread Yotam Gigi

The IFLA_PHYS_PORT_NAME rtnetlink TLV length does not include the
terminating NULL character, which is different from other string typed
TLVs. Due to the fact that libnl checks for the terminating NULL in every
string typed attribute, it crashes on every RTM_GETLINK response on
drivers that implement ndo_get_phys_port_name.

Make the fill_phys_port_name function include the terminating NULL in the
TLV size by using the nla_put_string helper function.

Fixes: db24a9044ee1 ("net: add support for phys_port_name")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
Cc: David Ahern <d...@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <ido...@mellanox.com>
Acked-by: Jiri Pirko <j...@mellanox.com>
---
Please consider this for stable too. Thanks!
---
 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index c4e84c5..69daf39 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1056,7 +1056,7 @@ static int rtnl_phys_port_name_fill(struct sk_buff *skb, 
struct net_device *dev)
return err;
}
 
-   if (nla_put(skb, IFLA_PHYS_PORT_NAME, strlen(name), name))
+   if (nla_put_string(skb, IFLA_PHYS_PORT_NAME, name))
return -EMSGSIZE;
 
return 0;
-- 
2.4.11

[PATCH net] bridge: Fix error path in nbp_vlan_init

2017-03-01 Thread Yotam Gigi

Fix error path order in nbp_vlan_init, so if switchdev_port_attr_set
call failes, the vlan_hash wouldn't be destroyed before inited.

Fixes: efa5356b0d97 ("bridge: per vlan dst_metadata netlink support")
CC: Roopa Prabhu <ro...@cumulusnetworks.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/bridge/br_vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 62e68c0..b838213 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -997,10 +997,10 @@ int nbp_vlan_init(struct net_bridge_port *p)
RCU_INIT_POINTER(p->vlgrp, NULL);
synchronize_rcu();
vlan_tunnel_deinit(vg);
-err_vlan_enabled:
 err_tunnel_init:
rhashtable_destroy(>vlan_hash);
 err_rhtbl:
+err_vlan_enabled:
kfree(vg);
 
goto out;
-- 
2.4.11

[PATCH net-next/iproute v2 1/5] tc: bash-completion: Add the _from variant to _tc_one* funcs

2017-02-07 Thread Yotam Gigi

The _tc_one_of_list and _tc_once_attr functions simplfy the bash
completion task by validating each attr exist only once on the command
line.

For example, for the command line:

$ a b c d e

and the call to _tc_once_attr with "a f g", the function will suggest
"f g" as "a" existed in the command line in args 0.

Add the _from variant to those functions, which allows having the command
line option once from a specified index. In the previous example, calling
_tc_once_attr with 4 and "a f g" will suggest "a f g".

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/bash-completion/tc b/bash-completion/tc
index ed2796d..04f969e 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -20,6 +20,26 @@ _tc_once_attr()
 done
 }
 
+# Takes a list of words in argument; each one of them is added to COMPREPLY if
+# it is not already present on the command line from the provided index. 
Returns
+# no value.
+_tc_once_attr_from()
+{
+local w subcword found from=$1
+shift
+for w in $*; do
+found=0
+for (( subcword=$from; subcword < ${#words[@]}-1; subcword++ )); do
+if [[ $w == ${words[subcword]} ]]; then
+found=1
+break
+fi
+done
+[[ $found -eq 0 ]] && \
+COMPREPLY+=( $( compgen -W "$w" -- "$cur" ) )
+done
+}
+
 # Takes a list of words in argument; adds them all to COMPREPLY if none of them
 # is already present on the command line. Returns no value.
 _tc_one_of_list()
@@ -33,6 +53,21 @@ _tc_one_of_list()
 COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) )
 }
 
+# Takes a list of words in argument; adds them all to COMPREPLY if none of them
+# is already present on the command line from the provided index. Returns no
+# value.
+_tc_one_of_list_from()
+{
+local w subcword from=$1
+shift
+for w in $*; do
+for (( subcword=$from; subcword < ${#words[@]}-1; subcword++ )); do
+[[ $w == ${words[subcword]} ]] && return 1
+done
+done
+COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) )
+}
+
 # Returns "$cur ${cur}arg1 ${cur}arg2 ..."
 _tc_expand_units()
 {
-- 
2.4.11

[PATCH net-next/iproute v2 5/5] tc: bash-completion: Add support for matchall

2017-02-07 Thread Yotam Gigi

Add support for the matchall classifier and its parameters.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index e4c6804..80d1297 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -5,7 +5,7 @@
 QDISC_KIND=' choke codel bfifo pfifo pfifo_head_drop fq fq_codel gred hhf \
 mqprio multiq netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
 dsmark hfsc htb prio qfq '
-FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 '
+FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 matchall '
 ACTION_KIND=' gact mirred bpf sample '
 
 # Takes a list of words in argument; each one of them is added to COMPREPLY if
@@ -449,6 +449,10 @@ _tc_filter_options()
 _tc_once_attr 'map hash divisor baseclass match action'
 return 0
 ;;
+matchall)
+_tc_once_attr 'action skip_sw skip_hw'
+return 0
+;;
 flower)
 _tc_once_attr 'action classid indev dst_mac src_mac eth_type \
 ip_proto dst_ip src_ip dst_port src_port'
-- 
2.4.11

[PATCH net-next/iproute v2 2/5] tc: bash-completion: Prepare action autocomplete to support several actions

2017-02-07 Thread Yotam Gigi

The action autocomplete routine (_tc_action_options) currently does not
support several actions statements in one tc command line as it uses the
_tc_once_attr and _tc_one_from_list.

For example, in that case:

$ tc filter add dev eth0 handle : u32 [...]  \
   action sample group 5 rate 12 \
   action sample 

the _tc_once_attr function, when invoked with "group rate" will not
suggest those as they already exist on the command line.

Fix the function to use the _from variant, thus allowing each action
autocomplete start from the action keyword, and not from the beginning of
the command line.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index 04f969e..c854dc0 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -454,26 +454,28 @@ _tc_filter_options()
 # Returns 0 is completion should stop after running this function, 1 otherwise.
 _tc_action_options()
 {
-case $1 in
+local from=$1
+local action=${words[from]}
+case $action in
 bpf)
 _tc_bpf_options
 return 0
 ;;
 mirred)
-_tc_one_of_list 'ingress egress'
-_tc_one_of_list 'mirror redirect'
-_tc_once_attr 'index dev'
+_tc_one_of_list_from $from 'ingress egress'
+_tc_one_of_list_from $from 'mirror redirect'
+_tc_once_attr_from $from 'index dev'
 return 0
 ;;
 sample)
-_tc_once_attr 'rate'
-_tc_once_attr 'trunc'
-_tc_once_attr 'group'
+_tc_once_attr_from $from 'rate'
+_tc_once_attr_from $from 'trunc'
+_tc_once_attr_from $from 'group'
 return 0
 ;;
 gact)
-_tc_one_of_list 'reclassify drop continue pass'
-_tc_once_attr 'random'
+_tc_one_of_list_from $from 'reclassify drop continue pass'
+_tc_once_attr_from $from 'random'
 return 0
 ;;
 esac
@@ -715,8 +717,7 @@ _tc()
 local action acwd ACTION_KIND=' gact mirred bpf sample '
 for ((acwd=$subcword; acwd < ${#words[@]}-1; acwd++)); do
 if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
-action=${words[acwd]}
-_tc_action_options $action && return 0
+_tc_action_options $acwd && return 0
 fi
 done
 _tc_one_of_list $ACTION_KIND
-- 
2.4.11

[PATCH net-next/iproute v2 4/5] tc: bash-completion: Add support for filter actions

2017-02-07 Thread Yotam Gigi

Previously, the autocomplete routine did not complete actions after a
filter keyword, for example:

$ tc filter add dev eth0 u32 [...] action 

did not suggest the actions list, and:

$ tc filter add dev eth0 u32 [...] action mirred 

did not suggest the specific mirred parameters. Add the support for this
kind of completion by adding the _tc_filter_action_options routine and
invoking it from inside _tc_filter_options.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index e23f69c..e4c6804 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -386,11 +386,44 @@ _tc_bpf_options()
 return 0
 }
 
+# Complete with options names for filter actions.
+# This function is recursive, thus allowing multiple actions statement to be
+# parsed.
+# Returns 0 is completion should stop after running this function, 1 otherwise.
+_tc_filter_action_options()
+{
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++));
+do
+if [[ action == ${words[acwd]} ]]; then
+_tc_filter_action_options $((acwd+1)) && return 0
+fi
+done
+
+local action acwd
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++)); do
+if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
+_tc_one_of_list_from $acwd action
+_tc_action_options $acwd && return 0
+fi
+done
+_tc_one_of_list_from $acwd $ACTION_KIND
+return 0
+}
+
 # Complete with options names for filters.
 # Returns 0 is completion should stop after running this function, 1 otherwise.
 _tc_filter_options()
 {
-case $1 in
+
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++));
+do
+if [[ action == ${words[acwd]} ]]; then
+_tc_filter_action_options $((acwd+1)) && return 0
+fi
+done
+
+filter=${words[$1]}
+case $filter in
 basic)
 _tc_once_attr 'match action classid'
 return 0
@@ -685,8 +718,7 @@ _tc()
 for ((fltwd=$subcword; fltwd < ${#words[@]}-1; fltwd++));
 do
 if [[ $FILTER_KIND =~ ' '${words[fltwd]}' ' ]]; then
-filter=${words[fltwd]}
-_tc_filter_options $filter && return 0
+_tc_filter_options $fltwd && return 0
 fi
 done
 _tc_one_of_list $FILTER_KIND
-- 
2.4.11

[PATCH net-next/iproute v2 3/5] tc: bash-completion: Make the *_KIND variables global

2017-02-07 Thread Yotam Gigi

The QDISC_KIND, FILTER_KIND, ACTION_KIND variables may be used by other
routines, thus make them global variables.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index c854dc0..e23f69c 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -2,6 +2,12 @@
 # Copyright 2016 6WIND S.A.
 # Copyright 2016 Quentin Monnet <quentin.mon...@6wind.com>
 
+QDISC_KIND=' choke codel bfifo pfifo pfifo_head_drop fq fq_codel gred hhf \
+mqprio multiq netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
+dsmark hfsc htb prio qfq '
+FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 '
+ACTION_KIND=' gact mirred bpf sample '
+
 # Takes a list of words in argument; each one of them is added to COMPREPLY if
 # it is not already present on the command line. Returns no value.
 _tc_once_attr()
@@ -605,10 +611,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local qdisc qdwd QDISC_KIND=' choke codel bfifo pfifo \
-pfifo_head_drop fq fq_codel gred hhf mqprio multiq \
-netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
-dsmark hfsc htb prio qfq '
+local qdisc qdwd
 for ((qdwd=$subcword; qdwd < ${#words[@]}-1; qdwd++)); do
 if [[ $QDISC_KIND =~ ' '${words[qdwd]}' ' ]]; then
 qdisc=${words[qdwd]}
@@ -643,10 +646,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local qdisc qdwd QDISC_KIND=' choke codel bfifo pfifo \
-pfifo_head_drop fq fq_codel gred hhf mqprio multiq \
-netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
-dsmark hfsc htb prio qfq '
+local qdisc qdwd
 for ((qdwd=$subcword; qdwd < ${#words[@]}-1; qdwd++)); do
 if [[ $QDISC_KIND =~ ' '${words[qdwd]}' ' ]]; then
 qdisc=${words[qdwd]}
@@ -681,8 +681,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local filter fltwd FILTER_KIND=' basic bpf cgroup flow \
-flower fw route rsvp tcindex u32 '
+local filter fltwd
 for ((fltwd=$subcword; fltwd < ${#words[@]}-1; fltwd++));
 do
 if [[ $FILTER_KIND =~ ' '${words[fltwd]}' ' ]]; then
@@ -714,7 +713,7 @@ _tc()
 action)
 case $subcmd in
 add|change|replace)
-local action acwd ACTION_KIND=' gact mirred bpf sample '
+local action acwd
 for ((acwd=$subcword; acwd < ${#words[@]}-1; acwd++)); do
 if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
 _tc_action_options $acwd && return 0
-- 
2.4.11

[PATCH net-next/iproute v2 0/5] tc: bash-completion: Add features

2017-02-07 Thread Yotam Gigi

The first 4 patches add support for autompletion of filter actions, thus
allowing the following tab completions:

$ tc filter add dev eth0 u32 [...] action 
bpf gactmirred  sample

$ tc filter add dev eth0 u32 [...] action sample 
action  group   ratetrunc

$ tc filter add dev eth0 u32 [...] \
action sample group 10 rate 10 action mirred 
actiondev   egressindex ingress   mirrorredirect

Finally, the last patch adds support in matchall autocompletion.

v1->v2:
 - Rebased on top of net-next tree

Yotam Gigi (5):
  tc: bash-completion: Add the _from variant to _tc_one* funcs
  tc: bash-completion: Prepare action autocomplete to support several
actions
  tc: bash-completion: Make the *_KIND variables global
  tc: bash-completion: Add support for filter actions
  tc: bash-completion: Add support for matchall

 bash-completion/tc | 121 ++---
 1 file changed, 96 insertions(+), 25 deletions(-)

-- 
2.4.11

RE: [PATCH net-next/iproute 2/5] tc: bash-completion: Prepare action autocomplete to support several actions

2017-02-06 Thread Yotam Gigi

>-Original Message-
>From: Stephen Hemminger [mailto:step...@networkplumber.org]
>Sent: Tuesday, February 07, 2017 12:11 AM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: netdev@vger.kernel.org; Elad Raz <el...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Jiri Pirko <j...@mellanox.com>; j...@mojatatu.com;
>m...@mojatatu.com
>Subject: Re: [PATCH net-next/iproute 2/5] tc: bash-completion: Prepare action
>autocomplete to support several actions
>
>On Mon,  6 Feb 2017 15:19:21 +0200
>Yotam Gigi <yot...@mellanox.com> wrote:
>
>> The action autocomplete routine (_tc_action_options) currently does not
>> support several actions statements in one tc command line as it uses the
>> _tc_once_attr and _tc_one_from_list.
>>
>> For example, in that case:
>>
>> $ tc filter add dev eth0 handle : u32 [...]  \
>> action sample group 5 rate 12 \
>> action sample 
>>
>> the _tc_once_attr function, when invoked with "group rate" will not
>> suggest those as they already exist on the command line.
>>
>> Fix the function to use the _from variant, thus allowing each action
>> autocomplete start from the action keyword, and not from the beginning of
>> the command line.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>
>This patch does not apply cleanly to current iproute2 net-next tree.
>Please rebase and resubmit the whole series again.

Sorry for that. I will send it again.

Thanks!

>
>
>$ cat bash-completion/tc.rej
>--- bash-completion/tc
>+++ bash-completion/tc
>@@ -454,26 +454,28 @@ _tc_filter_options()
> # Returns 0 is completion should stop after running this function, 1 
> otherwise.
> _tc_action_options()
> {
>-case $1 in
>+local from=$1
>+local action=${words[from]}
>+case $action in
> bpf)
> _tc_bpf_options
> return 0
> ;;
> mirred)
>-_tc_one_of_list 'ingress egress'
>-_tc_one_of_list 'mirror redirect'
>-_tc_once_attr 'index dev'
>+_tc_one_of_list_from $from 'ingress egress'
>+_tc_one_of_list_from $from 'mirror redirect'
>+_tc_once_attr_from $from 'index dev'
> return 0
> ;;
> sample)
>-_tc_once_attr 'rate'
>-_tc_once_attr 'trunc'
>-_tc_once_attr 'group'
>+_tc_once_attr_from $from 'rate'
>+_tc_once_attr_from $from 'trunc'
>+_tc_once_attr_from $from 'group'
> return 0
> ;;
> gact)
>-_tc_one_of_list 'reclassify drop continue pass'
>-_tc_once_attr 'random'
>+_tc_one_of_list_from $from 'reclassify drop continue pass'
>+_tc_once_attr_from $from 'random'
> return 0
> ;;
> esac

RE: [PATCH] [net-next] mlxsw: add psample dependency for spectrum

2017-02-06 Thread Yotam Gigi

>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
>Behalf Of Arnd Bergmann
>Sent: Monday, February 06, 2017 6:27 PM
>To: Jiri Pirko <j...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>
>Cc: Arnd Bergmann <a...@arndb.de>; David S. Miller <da...@davemloft.net>;
>Vadim Pasternak <vad...@mellanox.com>; Elad Raz <el...@mellanox.com>; Ivan
>Vecera <c...@cera.cz>; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
>Subject: [PATCH] [net-next] mlxsw: add psample dependency for spectrum
>
>When PSAMPLE is a loadable module, spectrum must not be built-in:
>
>drivers/net/built-in.o: In function `mlxsw_sp_rx_listener_sample_func':
>spectrum.c:(.text+0xe357e): undefined reference to `psample_sample_packet'
>
>This adds a Kconfig dependency to enforce usable configurations.
>
>Fixes: 98d0f7b9acda ("mlxsw: spectrum: Add packet sample offloading support")
>Signed-off-by: Arnd Bergmann <a...@arndb.de>

Acked-by: Yotam Gigi <yot...@mellanox.com>

>---
> drivers/net/ethernet/mellanox/mlxsw/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
>b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
>index 76a7574c3c7d..ef23eaedc2ff 100644
>--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
>+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
>@@ -73,6 +73,7 @@ config MLXSW_SWITCHX2
> config MLXSW_SPECTRUM
>   tristate "Mellanox Technologies Spectrum support"
>   depends on MLXSW_CORE && MLXSW_PCI && NET_SWITCHDEV &&
>VLAN_8021Q
>+  depends on PSAMPLE || PSAMPLE=n
>   select PARMAN
>   default m
>   ---help---
>--
>2.9.0

[PATCH net-next/iproute 3/5] tc: bash-completion: Make the *_KIND variables global

2017-02-06 Thread Yotam Gigi

The QDISC_KIND, FILTER_KIND, ACTION_KIND variables may be used by other
routines, thus make them global variables.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index c854dc0..e23f69c 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -2,6 +2,12 @@
 # Copyright 2016 6WIND S.A.
 # Copyright 2016 Quentin Monnet <quentin.mon...@6wind.com>
 
+QDISC_KIND=' choke codel bfifo pfifo pfifo_head_drop fq fq_codel gred hhf \
+mqprio multiq netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
+dsmark hfsc htb prio qfq '
+FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 '
+ACTION_KIND=' gact mirred bpf sample '
+
 # Takes a list of words in argument; each one of them is added to COMPREPLY if
 # it is not already present on the command line. Returns no value.
 _tc_once_attr()
@@ -605,10 +611,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local qdisc qdwd QDISC_KIND=' choke codel bfifo pfifo \
-pfifo_head_drop fq fq_codel gred hhf mqprio multiq \
-netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
-dsmark hfsc htb prio qfq '
+local qdisc qdwd
 for ((qdwd=$subcword; qdwd < ${#words[@]}-1; qdwd++)); do
 if [[ $QDISC_KIND =~ ' '${words[qdwd]}' ' ]]; then
 qdisc=${words[qdwd]}
@@ -643,10 +646,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local qdisc qdwd QDISC_KIND=' choke codel bfifo pfifo \
-pfifo_head_drop fq fq_codel gred hhf mqprio multiq \
-netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
-dsmark hfsc htb prio qfq '
+local qdisc qdwd
 for ((qdwd=$subcword; qdwd < ${#words[@]}-1; qdwd++)); do
 if [[ $QDISC_KIND =~ ' '${words[qdwd]}' ' ]]; then
 qdisc=${words[qdwd]}
@@ -681,8 +681,7 @@ _tc()
 COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
 return 0
 fi
-local filter fltwd FILTER_KIND=' basic bpf cgroup flow \
-flower fw route rsvp tcindex u32 '
+local filter fltwd
 for ((fltwd=$subcword; fltwd < ${#words[@]}-1; fltwd++));
 do
 if [[ $FILTER_KIND =~ ' '${words[fltwd]}' ' ]]; then
@@ -714,7 +713,7 @@ _tc()
 action)
 case $subcmd in
 add|change|replace)
-local action acwd ACTION_KIND=' gact mirred bpf sample '
+local action acwd
 for ((acwd=$subcword; acwd < ${#words[@]}-1; acwd++)); do
 if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
 _tc_action_options $acwd && return 0
-- 
2.4.11

[PATCH net-next/iproute 5/5] tc: bash-completion: Add support for matchall

2017-02-06 Thread Yotam Gigi

Add support for the matchall classifier and its parameters.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index e4c6804..80d1297 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -5,7 +5,7 @@
 QDISC_KIND=' choke codel bfifo pfifo pfifo_head_drop fq fq_codel gred hhf \
 mqprio multiq netem pfifo_fast pie red rr sfb sfq tbf atm cbq drr \
 dsmark hfsc htb prio qfq '
-FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 '
+FILTER_KIND=' basic bpf cgroup flow flower fw route rsvp tcindex u32 matchall '
 ACTION_KIND=' gact mirred bpf sample '
 
 # Takes a list of words in argument; each one of them is added to COMPREPLY if
@@ -449,6 +449,10 @@ _tc_filter_options()
 _tc_once_attr 'map hash divisor baseclass match action'
 return 0
 ;;
+matchall)
+_tc_once_attr 'action skip_sw skip_hw'
+return 0
+;;
 flower)
 _tc_once_attr 'action classid indev dst_mac src_mac eth_type \
 ip_proto dst_ip src_ip dst_port src_port'
-- 
2.4.11

[PATCH net-next/iproute 4/5] tc: bash-completion: Add support for filter actions

2017-02-06 Thread Yotam Gigi

Previously, the autocomplete routine did not complete actions after a
filter keyword, for example:

$ tc filter add dev eth0 u32 [...] action 

did not suggest the actions list, and:

$ tc filter add dev eth0 u32 [...] action mirred 

did not suggest the specific mirred parameters. Add the support for this
kind of completion by adding the _tc_filter_action_options routine and
invoking it from inside _tc_filter_options.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index e23f69c..e4c6804 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -386,11 +386,44 @@ _tc_bpf_options()
 return 0
 }
 
+# Complete with options names for filter actions.
+# This function is recursive, thus allowing multiple actions statement to be
+# parsed.
+# Returns 0 is completion should stop after running this function, 1 otherwise.
+_tc_filter_action_options()
+{
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++));
+do
+if [[ action == ${words[acwd]} ]]; then
+_tc_filter_action_options $((acwd+1)) && return 0
+fi
+done
+
+local action acwd
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++)); do
+if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
+_tc_one_of_list_from $acwd action
+_tc_action_options $acwd && return 0
+fi
+done
+_tc_one_of_list_from $acwd $ACTION_KIND
+return 0
+}
+
 # Complete with options names for filters.
 # Returns 0 is completion should stop after running this function, 1 otherwise.
 _tc_filter_options()
 {
-case $1 in
+
+for ((acwd=$1; acwd < ${#words[@]}-1; acwd++));
+do
+if [[ action == ${words[acwd]} ]]; then
+_tc_filter_action_options $((acwd+1)) && return 0
+fi
+done
+
+filter=${words[$1]}
+case $filter in
 basic)
 _tc_once_attr 'match action classid'
 return 0
@@ -685,8 +718,7 @@ _tc()
 for ((fltwd=$subcword; fltwd < ${#words[@]}-1; fltwd++));
 do
 if [[ $FILTER_KIND =~ ' '${words[fltwd]}' ' ]]; then
-filter=${words[fltwd]}
-_tc_filter_options $filter && return 0
+_tc_filter_options $fltwd && return 0
 fi
 done
 _tc_one_of_list $FILTER_KIND
-- 
2.4.11

[PATCH net-next/iproute 1/5] tc: bash-completion: Add the _from variant to _tc_one* funcs

2017-02-06 Thread Yotam Gigi

The _tc_one_of_list and _tc_once_attr functions simplfy the bash
completion task by validating each attr exist only once on the command
line.

For example, for the command line:

$ a b c d e

and the call to _tc_once_attr with "a f g", the function will suggest
"f g" as "a" existed in the command line in args 0.

Add the _from variant to those functions, which allows having the command
line option once from a specified index. In the previous example, calling
_tc_once_attr with 4 and "a f g" will suggest "a f g".

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/bash-completion/tc b/bash-completion/tc
index ed2796d..04f969e 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -20,6 +20,26 @@ _tc_once_attr()
 done
 }
 
+# Takes a list of words in argument; each one of them is added to COMPREPLY if
+# it is not already present on the command line from the provided index. 
Returns
+# no value.
+_tc_once_attr_from()
+{
+local w subcword found from=$1
+shift
+for w in $*; do
+found=0
+for (( subcword=$from; subcword < ${#words[@]}-1; subcword++ )); do
+if [[ $w == ${words[subcword]} ]]; then
+found=1
+break
+fi
+done
+[[ $found -eq 0 ]] && \
+COMPREPLY+=( $( compgen -W "$w" -- "$cur" ) )
+done
+}
+
 # Takes a list of words in argument; adds them all to COMPREPLY if none of them
 # is already present on the command line. Returns no value.
 _tc_one_of_list()
@@ -33,6 +53,21 @@ _tc_one_of_list()
 COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) )
 }
 
+# Takes a list of words in argument; adds them all to COMPREPLY if none of them
+# is already present on the command line from the provided index. Returns no
+# value.
+_tc_one_of_list_from()
+{
+local w subcword from=$1
+shift
+for w in $*; do
+for (( subcword=$from; subcword < ${#words[@]}-1; subcword++ )); do
+[[ $w == ${words[subcword]} ]] && return 1
+done
+done
+COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) )
+}
+
 # Returns "$cur ${cur}arg1 ${cur}arg2 ..."
 _tc_expand_units()
 {
-- 
2.4.11

[PATCH net-next/iproute 2/5] tc: bash-completion: Prepare action autocomplete to support several actions

2017-02-06 Thread Yotam Gigi

The action autocomplete routine (_tc_action_options) currently does not
support several actions statements in one tc command line as it uses the
_tc_once_attr and _tc_one_from_list.

For example, in that case:

$ tc filter add dev eth0 handle : u32 [...]  \
   action sample group 5 rate 12 \
   action sample 

the _tc_once_attr function, when invoked with "group rate" will not
suggest those as they already exist on the command line.

Fix the function to use the _from variant, thus allowing each action
autocomplete start from the action keyword, and not from the beginning of
the command line.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/bash-completion/tc b/bash-completion/tc
index 04f969e..c854dc0 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -454,26 +454,28 @@ _tc_filter_options()
 # Returns 0 is completion should stop after running this function, 1 otherwise.
 _tc_action_options()
 {
-case $1 in
+local from=$1
+local action=${words[from]}
+case $action in
 bpf)
 _tc_bpf_options
 return 0
 ;;
 mirred)
-_tc_one_of_list 'ingress egress'
-_tc_one_of_list 'mirror redirect'
-_tc_once_attr 'index dev'
+_tc_one_of_list_from $from 'ingress egress'
+_tc_one_of_list_from $from 'mirror redirect'
+_tc_once_attr_from $from 'index dev'
 return 0
 ;;
 sample)
-_tc_once_attr 'rate'
-_tc_once_attr 'trunc'
-_tc_once_attr 'group'
+_tc_once_attr_from $from 'rate'
+_tc_once_attr_from $from 'trunc'
+_tc_once_attr_from $from 'group'
 return 0
 ;;
 gact)
-_tc_one_of_list 'reclassify drop continue pass'
-_tc_once_attr 'random'
+_tc_one_of_list_from $from 'reclassify drop continue pass'
+_tc_once_attr_from $from 'random'
 return 0
 ;;
 esac
@@ -715,8 +717,7 @@ _tc()
 local action acwd ACTION_KIND=' gact mirred bpf sample '
 for ((acwd=$subcword; acwd < ${#words[@]}-1; acwd++)); do
 if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
-action=${words[acwd]}
-_tc_action_options $action && return 0
+_tc_action_options $acwd && return 0
 fi
 done
 _tc_one_of_list $ACTION_KIND
-- 
2.4.11

[PATCH net-next/iproute 0/5] tc: bash-completion: Add features

2017-02-06 Thread Yotam Gigi

The first 4 patches add support for autompletion of filter actions, thus
allowing the following tab completions:

$ tc filter add dev eth0 u32 [...] action 
bpf gactmirred  sample

$ tc filter add dev eth0 u32 [...] action sample 
action  group   ratetrunc

$ tc filter add dev eth0 u32 [...] \
action sample group 10 rate 10 action mirred 
actiondev   egressindex ingress   mirrorredirect

Finally, the last patch adds support in matchall autocompletion.

Yotam Gigi (5):
  tc: bash-completion: Add the _from variant to _tc_one* funcs
  tc: bash-completion: Prepare action autocomplete to support several
actions
  tc: bash-completion: Make the *_KIND variables global
  tc: bash-completion: Add support for filter actions
  tc: bash-completion: Add support for matchall

 bash-completion/tc | 121 ++---
 1 file changed, 96 insertions(+), 25 deletions(-)

-- 
2.4.11

RE: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Yotam Gigi

>-Original Message-
>From: Florian Fainelli [mailto:f.faine...@gmail.com]
>Sent: Sunday, February 05, 2017 10:55 PM
>To: Yotam Gigi <yot...@mellanox.com>; step...@networkplumber.org;
>netdev@vger.kernel.org; Jiri Pirko <j...@mellanox.com>; Elad Raz
><el...@mellanox.com>
>Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc 
>action
>
>Le 02/05/17 à 12:22, Yotam Gigi a écrit :
>>> -Original Message-
>>> From: Florian Fainelli [mailto:f.faine...@gmail.com]
>>> Sent: Sunday, February 05, 2017 8:37 PM
>>> To: Yotam Gigi <yot...@mellanox.com>; step...@networkplumber.org;
>>> netdev@vger.kernel.org; Jiri Pirko <j...@mellanox.com>; Elad Raz
>>> <el...@mellanox.com>
>>> Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc
>action
>>>
>>> On 02/04/2017 11:58 PM, Yotam Gigi wrote:
>>>> The sample tc action allows sampling packets matching a classifier. It
>>>> peeks randomly packets, and samples them using the psample netlink
>>>> channel. The user can specify the psample group, which the packet will be
>>>> sampled to, the sampling rate and the packet truncation (to save
>>>> kernel-user traffic).
>>>>
>>>> The sampled packets contain informative metadata, for example, the input
>>>> interface and the original packet length.
>>>>
>>>> The action syntax:
>>>> tc filter add [...] \
>>>>action sample rate  group  [trunc ]
>>>>[...]
>>>>
>>>> Where:
>>>>   RATE := The sampling rate which is the ratio of packets observed at the
>>>>  data source to the samples generated
>>>>   GROUP := the psample module sampling group
>>>>   SIZE := optional truncation size
>>>>
>>>> An example for a common usecase of the sample tc action: to sample ingress
>>>> traffic from interface eth1, one may use the commands:
>>>>
>>>> tc qdisc add dev eth1 handle : ingress
>>>>
>>>> tc filter add dev eth1 parent : \
>>>>matchall action sample rate 12 group 4
>>>>
>>>> Where the first command adds an ingress qdisc and the second starts
>>>> sampling randomly with an average of one sampled packet per 12 packets
>>>> on dev eth1 to psample group 4.
>>>
>>> The group argument seems to be mandatory from looking at the code, but
>>> what if just wanted to have a port mirroring between, say sw0p1 and
>>> sw0p2 with the sample rate specified instead (without using the psample
>>> netlink channel at all)? Could we make this group an optional argument
>>> instead?
>>
>> The kernel action currently don't support it, and I am not sure it should.
>>
>> If I understand you correctly, you want to make the sample action identical
>> to mirred-mirror, only with random behavior. This can be done using the
>> matchall and mirred action, plus the 'random' gact keyword.
>
>It sounds like we can indeed, with random determ and using the VAL
>argument we should be able to configure the capture divider; thanks!

You're welcome. It took me some time to find that keyword too :)

>
>>
>> The sample action attaches some metadata in addition to the original packet
>> data, and that cannot be achieved by mirroring the packets, thus making it
>> unusable for our usecase. In the former version we attached the metadata
>> using the IFE protocol, but we decided to use a dedicated netlink channel
>> instead.
>
>Yeah I see that now, thanks for the explanation!
>--
>Florian

RE: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Yotam Gigi

>-Original Message-
>From: Florian Fainelli [mailto:f.faine...@gmail.com]
>Sent: Sunday, February 05, 2017 8:37 PM
>To: Yotam Gigi <yot...@mellanox.com>; step...@networkplumber.org;
>netdev@vger.kernel.org; Jiri Pirko <j...@mellanox.com>; Elad Raz
><el...@mellanox.com>
>Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc 
>action
>
>On 02/04/2017 11:58 PM, Yotam Gigi wrote:
>> The sample tc action allows sampling packets matching a classifier. It
>> peeks randomly packets, and samples them using the psample netlink
>> channel. The user can specify the psample group, which the packet will be
>> sampled to, the sampling rate and the packet truncation (to save
>> kernel-user traffic).
>>
>> The sampled packets contain informative metadata, for example, the input
>> interface and the original packet length.
>>
>> The action syntax:
>> tc filter add [...] \
>>  action sample rate  group  [trunc ]
>>  [...]
>>
>> Where:
>>   RATE := The sampling rate which is the ratio of packets observed at the
>>data source to the samples generated
>>   GROUP := the psample module sampling group
>>   SIZE := optional truncation size
>>
>> An example for a common usecase of the sample tc action: to sample ingress
>> traffic from interface eth1, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>>matchall action sample rate 12 group 4
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling randomly with an average of one sampled packet per 12 packets
>> on dev eth1 to psample group 4.
>
>The group argument seems to be mandatory from looking at the code, but
>what if just wanted to have a port mirroring between, say sw0p1 and
>sw0p2 with the sample rate specified instead (without using the psample
>netlink channel at all)? Could we make this group an optional argument
>instead?

The kernel action currently don't support it, and I am not sure it should.

If I understand you correctly, you want to make the sample action identical
to mirred-mirror, only with random behavior. This can be done using the 
matchall and mirred action, plus the 'random' gact keyword. 

The sample action attaches some metadata in addition to the original packet
data, and that cannot be achieved by mirroring the packets, thus making it
unusable for our usecase. In the former version we attached the metadata
using the IFE protocol, but we decided to use a dedicated netlink channel 
instead.

>
>Thanks!
>--
>Florian

[PATCH iproute2/net-next 0/3] Add the tc-sample action

2017-02-04 Thread Yotam Gigi

This patchset adds the tc-sample action support and the corresponding man
page. More information about the action and its usage can be found in the
commit message.

Yotam Gigi (3):
  tc: Add support for the sample tc action
  tc: man: Add man entry for the tc-sample action
  tc: man: matchall: Update examples to include sample

 bash-completion/tc   |   8 +-
 include/linux/tc_act/tc_sample.h |  26 ++
 man/man8/Makefile|   2 +-
 man/man8/tc-matchall.8   |  10 +++
 man/man8/tc-sample.8 | 125 ++
 tc/Makefile  |   1 +
 tc/m_sample.c| 186 +++
 7 files changed, 356 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/tc_act/tc_sample.h
 create mode 100644 man/man8/tc-sample.8
 create mode 100644 tc/m_sample.c

-- 
2.4.11

[PATCH iproute2/net-next 3/3] tc: man: matchall: Update examples to include sample

2017-02-04 Thread Yotam Gigi

Add an example of packet sampling to the tc-matchall man page examples
section. The example uses the matchall classifier and the sample action to
create packet sampling on a port.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 man/man8/tc-matchall.8 | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/man/man8/tc-matchall.8 b/man/man8/tc-matchall.8
index f920922..799350a 100644
--- a/man/man8/tc-matchall.8
+++ b/man/man8/tc-matchall.8
@@ -70,6 +70,16 @@ that replaces the root qdisc on device
 where the second command attaches a matchall filters on it that mirrors the
 packets to device eth2.
 
+To sample one of every 100 packets flowing into interface eth0 to psample group
+12:
+.RS
+.EX
+
+tc qdisc add dev eth0 handle : ingress
+tc filter add dev eth0 parent : matchall \\
+ action sample rate 100 group 12
+.EE
+.RE
 
 .EE
 .SH SEE ALSO
-- 
2.4.11

[PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-04 Thread Yotam Gigi

The sample tc action allows sampling packets matching a classifier. It
peeks randomly packets, and samples them using the psample netlink
channel. The user can specify the psample group, which the packet will be
sampled to, the sampling rate and the packet truncation (to save
kernel-user traffic).

The sampled packets contain informative metadata, for example, the input
interface and the original packet length.

The action syntax:
tc filter add [...] \
action sample rate  group  [trunc ]
[...]

Where:
  RATE := The sampling rate which is the ratio of packets observed at the
  data source to the samples generated
  GROUP := the psample module sampling group
  SIZE := optional truncation size

An example for a common usecase of the sample tc action: to sample ingress
traffic from interface eth1, one may use the commands:

tc qdisc add dev eth1 handle : ingress

tc filter add dev eth1 parent : \
   matchall action sample rate 12 group 4

Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets
on dev eth1 to psample group 4.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 bash-completion/tc   |   8 +-
 include/linux/tc_act/tc_sample.h |  26 ++
 tc/Makefile  |   1 +
 tc/m_sample.c| 186 +++
 4 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/tc_act/tc_sample.h
 create mode 100644 tc/m_sample.c

diff --git a/bash-completion/tc b/bash-completion/tc
index 79dd5fc..ed2796d 100644
--- a/bash-completion/tc
+++ b/bash-completion/tc
@@ -430,6 +430,12 @@ _tc_action_options()
 _tc_once_attr 'index dev'
 return 0
 ;;
+sample)
+_tc_once_attr 'rate'
+_tc_once_attr 'trunc'
+_tc_once_attr 'group'
+return 0
+;;
 gact)
 _tc_one_of_list 'reclassify drop continue pass'
 _tc_once_attr 'random'
@@ -671,7 +677,7 @@ _tc()
 action)
 case $subcmd in
 add|change|replace)
-local action acwd ACTION_KIND=' gact mirred bpf '
+local action acwd ACTION_KIND=' gact mirred bpf sample '
 for ((acwd=$subcword; acwd < ${#words[@]}-1; acwd++)); do
 if [[ $ACTION_KIND =~ ' '${words[acwd]}' ' ]]; then
 action=${words[acwd]}
diff --git a/include/linux/tc_act/tc_sample.h b/include/linux/tc_act/tc_sample.h
new file mode 100644
index 000..edc9058
--- /dev/null
+++ b/include/linux/tc_act/tc_sample.h
@@ -0,0 +1,26 @@
+#ifndef __LINUX_TC_SAMPLE_H
+#define __LINUX_TC_SAMPLE_H
+
+#include 
+#include 
+#include 
+
+#define TCA_ACT_SAMPLE 26
+
+struct tc_sample {
+   tc_gen;
+};
+
+enum {
+   TCA_SAMPLE_UNSPEC,
+   TCA_SAMPLE_TM,
+   TCA_SAMPLE_PARMS,
+   TCA_SAMPLE_RATE,
+   TCA_SAMPLE_TRUNC_SIZE,
+   TCA_SAMPLE_PSAMPLE_GROUP,
+   TCA_SAMPLE_PAD,
+   __TCA_SAMPLE_MAX
+};
+#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
+
+#endif
diff --git a/tc/Makefile b/tc/Makefile
index 7fd0c4a..6dd984f 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -51,6 +51,7 @@ TCMODULES += m_vlan.o
 TCMODULES += m_connmark.o
 TCMODULES += m_bpf.o
 TCMODULES += m_tunnel_key.o
+TCMODULES += m_sample.o
 TCMODULES += p_ip.o
 TCMODULES += p_icmp.o
 TCMODULES += p_tcp.o
diff --git a/tc/m_sample.c b/tc/m_sample.c
new file mode 100644
index 000..9291109
--- /dev/null
+++ b/tc/m_sample.c
@@ -0,0 +1,186 @@
+/*
+ * m_sample.c  ingress/egress packet sampling module
+ *
+ * This program is free software; you can distribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Yotam Gigi <yot...@mellanox.com>
+ *
+ */
+
+#include 
+#include "utils.h"
+#include "tc_util.h"
+#include "tc_common.h"
+#include 
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: sample SAMPLE_CONF\n");
+   fprintf(stderr, "where:\n");
+   fprintf(stderr, "\tSAMPLE_CONF := SAMPLE_PARAMS | SAMPLE_INDEX\n");
+   fprintf(stderr, "\tSAMPLE_PARAMS := rate RATE group GROUP [trunc SIZE] 
[SAMPLE_INDEX]\n");
+   fprintf(stderr, "\tSAMPLE_INDEX := index INDEX\n");
+   fprintf(stderr, "\tRATE := The ratio of packets observed at the data 
source to the samples generated.\n");
+   fprintf(stderr, "\tGROUP := the psample sampling group\n");
+   fprintf(stderr, "\tSIZE := the truncation size\n");
+   fprintf(stderr, "\tINDEX := int

[PATCH iproute2/net-next 2/3] tc: man: Add man entry for the tc-sample action

2017-02-04 Thread Yotam Gigi

In addition to general information about the tc action, the man entry
contains common usage examples and information about the tlv fields packed
within each sampled packet.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 man/man8/Makefile|   2 +-
 man/man8/tc-sample.8 | 125 +++
 2 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/tc-sample.8

diff --git a/man/man8/Makefile b/man/man8/Makefile
index 77d347c..18e76f2 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -17,7 +17,7 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 
rtmon.8 rtpr.8 ss.
tc-tcindex.8 tc-u32.8 tc-matchall.8 \
tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8  tc-ife.8 \
-   tc-tunnel_key.8 \
+   tc-tunnel_key.8 tc-sample.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8
 
 all: $(TARGETS)
diff --git a/man/man8/tc-sample.8 b/man/man8/tc-sample.8
new file mode 100644
index 000..3e03eba
--- /dev/null
+++ b/man/man8/tc-sample.8
@@ -0,0 +1,125 @@
+.TH "Packet sample action in tc" 8 "31 Jan 2017" "iproute2" "Linux"
+
+.SH NAME
+sample - packet sampling tc action
+.SH SYNOPSIS
+.in +8
+.ti -8
+
+.BR tc " ... " "action sample rate"
+.I RATE
+.BR "group"
+.I GROUP
+.RB "[ " trunc
+.IR SIZE " ] "
+.RB "[ " index
+.IR INDEX " ] "
+.ti -8
+
+.BR tc " ... " "action sample index "
+.I INDEX
+.ti -8
+
+.SH DESCRIPTION
+The
+.B sample
+action allows sampling packets matching classifier.
+
+The packets are chosen randomly according to the
+.B rate
+parameter, and are sampled using the
+.B psample
+generic netlink channel. The user can also specify packet truncation to save
+user-kernel traffic. Each sample includes some informative metadata about the
+original packet, which is sent using netlink attributes, alongside the original
+packet data.
+
+The user can either specify the sample action parameters as presented in the
+first form above, or use an existing sample action using its index, as 
presented
+in the second form.
+
+.SH SAMPLED PACKETS METADATA FIELDS
+The metadata are delivered to userspace applications using the
+.B psample
+generic netlink channel, where each sample includes the following netlink
+attributes:
+.TP
+.BI PSAMPLE_ATTR_IIFINDEX
+The input interface index of the packet, if there is one.
+.TP
+.BI PSAMPLE_ATTR_OIFINDEX
+The output interface index of the packet. This field is not relevant on ingress
+sampling
+.TP
+.BI PSAMPLE_ATTR_ORIGSIZE
+The size of the original packet (before truncation)
+.TP
+.BI PSAMPLE_ATTR_SAMPLE_GROUP
+The
+.B psample
+group the packet was sent to
+.TP
+.BI PSAMPLE_ATTR_GROUP_SEQ
+A sequence number of the sampled packet. This number is incremented with each
+sampled packet of the current
+.B psample
+group
+.TP
+.BI PSAMPLE_ATTR_SAMPLE_RATE
+The rate the packet was sampled with
+.RE
+
+.SH OPTIONS
+.TP
+.BI rate " RATE"
+The packet sample rate.
+.I "RATE"
+is the expected ratio between observed packets and sampled packets. For 
example,
+.I "RATE"
+of 100 will lead to an average of one sampled packet out of every 100 observed.
+.TP
+.BI trunc " SIZE"
+Upon set, defines the maximum size of the sampled packets, and causes 
truncation
+if needed
+.TP
+.BI group " GROUP"
+The
+.B psample
+group the packet will be sent to. The
+.B psample
+module defines the concept of groups, which allows the user to match specific
+sampled packets in the case of multiple sampling rules, thus identify only the
+packets that came from a specific rule.
+.TP
+.BI index " INDEX"
+Is a unique ID for an action. When creating new action instance, this parameter
+allows to set the new action index. When using existing action, this parameter
+allows to specify the existing action index.  The index must 32bit unsigned
+integer greater than zero.
+.SH EXAMPLES
+Sample one of every 100 packets flowing into interface eth0 to psample group 
12:
+
+.RS
+.EX
+tc qdisc add dev eth0 handle : ingress
+tc filter add dev eth0 parent : matchall \\
+ action sample rate 100 group 12 index 19
+.EE
+.RE
+
+Use the same action instance to sample eth1 too:
+
+.RS
+.EX
+tc qdisc add dev eth1 handle : ingress
+tc filter add dev eth1 parent : matchall \\
+ action sample index 19
+.EE
+.RE
+
+.EE
+.RE
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-matchall (8)
+.BR psample (1)
-- 
2.4.11

RE: [PATCH net-next 0/2] Extract IFE logic to module

2017-02-02 Thread Yotam Gigi

>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: Thursday, February 02, 2017 12:17 PM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: j...@mojatatu.com; da...@davemloft.net; netdev@vger.kernel.org; Jiri Pirko
><j...@mellanox.com>; Elad Raz <el...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>
>Subject: Re: [PATCH net-next 0/2] Extract IFE logic to module
>
>Hi Yotam,
>
>On Tue, Jan 31, 2017 at 03:57:02PM +0200, Yotam Gigi wrote:
>> Extract ife logic from the tc_ife action into an independent module, and
>> make the tc_ife action use it. This way, the ife encapsulation can be used
>> by other modules other than tc_ife action.
>
>I have no objection to this modularisation but I am curious to know
>if you have a use-case in mind. My understanding is that earlier versions
>of the sample action used IFE but that is not the case in the version that
>was ultimately accepted.

Hi Simon.

You are right that the patches were done for the former version of the sample
classifier, and there are not required for the current version. We don't have
current use-case in mind, but I did send the patches because I think it can help
others, or us in the future.

RE: linux-next: manual merge of the net-next tree with Linus' tree

2017-02-02 Thread Yotam Gigi

>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
>Behalf Of Stephen Rothwell
>Sent: Thursday, February 02, 2017 3:50 AM
>To: David Miller <da...@davemloft.net>; Networking <netdev@vger.kernel.org>
>Cc: linux-n...@vger.kernel.org; linux-ker...@vger.kernel.org; Yotam Gigi
><yot...@mellanox.com>
>Subject: linux-next: manual merge of the net-next tree with Linus' tree
>
>Hi all,
>
>Today's linux-next merge of the net-next tree got a conflict in:
>
>  net/sched/cls_matchall.c
>
>between commit:
>
>  fd62d9f5c575 ("net/sched: matchall: Fix configuration race")
>
>from Linus' tree and commit:
>
>  ec2507d2a306 ("net/sched: cls_matchall: Fix error path")
>
>from the net-next tree.
>
>I fixed it up (see below) and can carry the fix as necessary. This
>is now fixed as far as linux-next is concerned, but any non trivial
>conflicts should be mentioned to your upstream maintainer when your tree
>is submitted for merging.  You may also want to consider cooperating
>with the maintainer of the conflicting tree to minimise any particularly
>complex conflicts.

Looks good. Thanks!

>
>--
>Cheers,
>Stephen Rothwell
>
>diff --cc net/sched/cls_matchall.c
>index b12bc2abea93,fcecf5aac666..
>--- a/net/sched/cls_matchall.c
>+++ b/net/sched/cls_matchall.c
>@@@ -118,19 -141,24 +118,24 @@@ static int mall_set_parms(struct net *n
>   struct tcf_exts e;
>   int err;
>
>-  tcf_exts_init(, TCA_MATCHALL_ACT, 0);
>+  err = tcf_exts_init(, TCA_MATCHALL_ACT, 0);
>+  if (err)
>+  return err;
>   err = tcf_exts_validate(net, tp, tb, est, , ovr);
>   if (err < 0)
>-  return err;
>+  goto errout;
>
>   if (tb[TCA_MATCHALL_CLASSID]) {
> - f->res.classid = nla_get_u32(tb[TCA_MATCHALL_CLASSID]);
> - tcf_bind_filter(tp, >res, base);
> + head->res.classid = nla_get_u32(tb[TCA_MATCHALL_CLASSID]);
> + tcf_bind_filter(tp, >res, base);
>   }
>
> - tcf_exts_change(tp, >exts, );
> + tcf_exts_change(tp, >exts, );
>
>   return 0;
>+ errout:
>+  tcf_exts_destroy();
>+  return err;
>  }
>
>  static int mall_change(struct net *net, struct sk_buff *in_skb,
>@@@ -162,39 -194,43 +167,44 @@@
>   return -EINVAL;
>   }
>
> - f = kzalloc(sizeof(*f), GFP_KERNEL);
> - if (!f)
> + new = kzalloc(sizeof(*new), GFP_KERNEL);
> + if (!new)
>   return -ENOBUFS;
>
>-  tcf_exts_init(>exts, TCA_MATCHALL_ACT, 0);
> - err = tcf_exts_init(>exts, TCA_MATCHALL_ACT, 0);
>++ err = tcf_exts_init(>exts, TCA_MATCHALL_ACT, 0);
>+  if (err)
>+  goto err_exts_init;
>
>   if (!handle)
>   handle = 1;
> - f->handle = handle;
> - f->flags = flags;
> + new->handle = handle;
> + new->flags = flags;
>
> - err = mall_set_parms(net, tp, f, base, tb, tca[TCA_RATE], ovr);
> + err = mall_set_parms(net, tp, new, base, tb, tca[TCA_RATE], ovr);
>   if (err)
>-  goto errout;
>+  goto err_set_parms;
>
>   if (tc_should_offload(dev, tp, flags)) {
> - err = mall_replace_hw_filter(tp, f, (unsigned long) f);
> + err = mall_replace_hw_filter(tp, new, (unsigned long) new);
>   if (err) {
>   if (tc_skip_sw(flags))
>-  goto errout;
>+  goto err_replace_hw_filter;
>   else
>   err = 0;
>   }
>   }
>
> - *arg = (unsigned long) f;
> - rcu_assign_pointer(head->filter, f);
> -
> + *arg = (unsigned long) head;
> + rcu_assign_pointer(tp->root, new);
> + if (head)
> + call_rcu(>rcu, mall_destroy_rcu);
>   return 0;
>
>- errout:
>+ err_replace_hw_filter:
>+ err_set_parms:
> - tcf_exts_destroy(>exts);
>++ tcf_exts_destroy(>exts);
>+ err_exts_init:
> - kfree(f);
> + kfree(new);
>   return err;
>  }
>

[PATCH 2/3] net: Introduce ife encapsulation module

2017-02-01 Thread Yotam Gigi

This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
   header space.
 - ife_tlv_meta_decode - decodes one tlv entry from the packet
 - ife_tlv_meta_next - advance to the next tlv

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 MAINTAINERS   |   7 +++
 include/net/ife.h |  51 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/ife.h  |  18 ++
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/ife/Kconfig   |  16 ++
 net/ife/Makefile  |   5 ++
 net/ife/ife.c | 142 ++
 9 files changed, 242 insertions(+)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cc106f7..3017257 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6250,6 +6250,13 @@ F:   include/net/cfg802154.h
 F: include/net/ieee802154_netdev.h
 F: Documentation/networking/ieee802154.txt
 
+IFE PROTOCOL
+M:     Yotam Gigi <yot...@mellanox.com>
+M: Jamal Hadi Salim <j...@mojatatu.com>
+F: net/ife
+F: include/net/ife.h
+F: include/uapi/linux/ife.h
+
 IGORPLUG-USB IR RECEIVER
 M: Sean Young <s...@mess.org>
 L: linux-me...@vger.kernel.org
diff --git a/include/net/ife.h b/include/net/ife.h
new file mode 100644
index 000..2d87d68
--- /dev/null
+++ b/include/net/ife.h
@@ -0,0 +1,51 @@
+#ifndef __NET_IFE_H
+#define __NET_IFE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#if IS_ENABLED(CONFIG_NET_IFE)
+
+void *ife_encode(struct sk_buff *skb, u16 metalen);
+void *ife_decode(struct sk_buff *skb, u16 *metalen);
+
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 
*totlen);
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+   const void *dval);
+
+void *ife_tlv_meta_next(void *skbdata);
+
+#else
+
+static inline void *ife_encode(struct sk_buff *skb, u16 metalen)
+{
+   return NULL;
+}
+
+static inline void *ife_decode(struct sk_buff *skb, u16 *metalen)
+{
+   return NULL;
+}
+
+static inline void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 
*dlen,
+   u16 *totlen)
+{
+   return NULL;
+}
+
+static inline int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+   const void *dval)
+{
+   return 0;
+}
+
+static inline void *ife_tlv_meta_next(void *skbdata)
+{
+   return NULL;
+}
+
+#endif
+
+#endif /* __NET_IFE_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 486e050..a2e9072 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -195,6 +195,7 @@ header-y += if_tun.h
 header-y += if_tunnel.h
 header-y += if_vlan.h
 header-y += if_x25.h
+header-y += ife.h
 header-y += igmp.h
 header-y += ila.h
 header-y += in6.h
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
new file mode 100644
index 000..2954da3
--- /dev/null
+++ b/include/uapi/linux/ife.h
@@ -0,0 +1,18 @@
+#ifndef __UAPI_IFE_H
+#define __UAPI_IFE_H
+
+#define IFE_METAHDRLEN 2
+
+enum {
+   IFE_META_SKBMARK = 1,
+   IFE_META_HASHID,
+   IFE_META_PRIO,
+   IFE_META_QMAP,
+   IFE_META_TCINDEX,
+   __IFE_META_MAX
+};
+
+/*Can be overridden at runtime by module option*/
+#define IFE_META_MAX (__IFE_META_MAX - 1)
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index ce4aee6..2f2842d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -391,6 +391,7 @@ source "net/caif/Kconfig"
 source "net/ceph/Kconfig"
 source "net/nfc/Kconfig"
 source "net/psample/Kconfig"
+source "net/ife/Kconfig"
 
 config LWTUNNEL
bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 7d41de4..9b68155 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_CEPH_LIB)+= ceph/
 obj-$(CONFIG_BATMAN_ADV)   += batman-adv/
 obj-$(CONFIG_NFC)  += nfc/
 obj-$(CONFIG_PSAMPLE)  += psample/
+obj-$(CONFIG_NET_IFE)  += ife/
 obj-$(CONFIG_OPENVSWITCH)  += openvswitch/
 obj-$(CONFIG_VSOCKETS) += vmw_vsock/
 obj-$(CONFIG_MPLS) += mpls/
diff --git a/net/ife/Kconfig b/net/ife/Kconfig
new file mode 100644
index 000..31e48b6
--- /dev/null
+++ b/net/ife/Kconfig
@@ -0,0 +1,16 @@
+#
+# IFE subsystem configuration
+#
+
+menuconfig NET_IFE
+   depends on NET
+tristate "Inter-FE based on IETF ForCES InterFE LFB"
+   default n
+   help
+ S

[PATCH 1/3] net/sched: act_ife: Unexport ife_tlv_meta_encode

2017-02-01 Thread Yotam Gigi

As the function ife_tlv_meta_encode is not used by any other module,
unexport it and make it static for the act_ife module.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 include/net/tc_act/tc_ife.h | 2 --
 net/sched/act_ife.c | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..f37e751 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -45,8 +45,6 @@ struct tcf_meta_ops {
 
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
 int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-   const void *dval);
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 921fb20..70148c1 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -48,7 +48,8 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
 
 /* Caller takes care of presenting data in network order
 */
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void 
*dval)
+static int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+  const void *dval)
 {
u32 *tlv = (u32 *)(skbdata);
u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
@@ -61,7 +62,6 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen, const void *dval)
 
return totlen;
 }
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
 
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
-- 
2.4.11

[PATCH v2 net-next 0/3] Extract IFE logic to module

2017-02-01 Thread Yotam Gigi

Extract ife logic from the tc_ife action into an independent module, and
make the tc_ife action use it. This way, the ife encapsulation can be used
by other modules other than tc_ife action.

v1->v2:
Fix duplicate symbol error by introducing a new patch that makes the
original symbol static.

The symbol ife_tlv_meta_extract is exported in act_ife, though not being
used by any other module. As the symbol is being moved to the new ife
module, introducing the new module creates duplicate symbol. To fix it,
add a new patch (1/3) that makes the ife_tlv_meta_extract symbol static in
act_ife, thus the symbol does not collide.

Yotam Gigi (3):
  net/sched: act_ife: Unexport ife_tlv_meta_encode
  net: Introduce ife encapsulation module
  net/sched: act_ife: Change to use ife module

 MAINTAINERS|   7 ++
 include/net/ife.h  |  51 +
 include/net/tc_act/tc_ife.h|   3 -
 include/uapi/linux/Kbuild  |   1 +
 include/uapi/linux/ife.h   |  18 +
 include/uapi/linux/tc_act/tc_ife.h |  10 +--
 net/Kconfig|   1 +
 net/Makefile   |   1 +
 net/ife/Kconfig|  16 +
 net/ife/Makefile   |   5 ++
 net/ife/ife.c  | 142 +
 net/sched/Kconfig  |   1 +
 net/sched/act_ife.c| 110 +---
 13 files changed, 276 insertions(+), 90 deletions(-)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

-- 
2.4.11

[PATCH 3/3] net/sched: act_ife: Change to use ife module

2017-02-01 Thread Yotam Gigi

Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 include/net/tc_act/tc_ife.h|   1 -
 include/uapi/linux/tc_act/tc_ife.h |  10 +---
 net/sched/Kconfig  |   1 +
 net/sched/act_ife.c| 110 +++--
 4 files changed, 34 insertions(+), 88 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index f37e751..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include 
 #include 
 
-#define IFE_METAHDRLEN 2
 struct tcf_ife_info {
struct tc_action common;
u8 eth_dst[ETH_ALEN];
diff --git a/include/uapi/linux/tc_act/tc_ife.h 
b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 #define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
 };
 #define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
 
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#defineIFE_META_PRIO 3
-#defineIFE_META_QMAP 4
-#defineIFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define__IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72cfa3a..403790c 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -776,6 +776,7 @@ config NET_ACT_SKBMOD
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
+select NET_IFE
 ---help---
  Say Y here to allow for sourcing and terminating metadata
  For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 70148c1..71e7ff22 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IFE_TAB_MASK 15
 
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = 
{
[TCA_IFE_TYPE] = { .type = NLA_U16},
 };
 
-/* Caller takes care of presenting data in network order
-*/
-static int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-  const void *dval)
-{
-   u32 *tlv = (u32 *)(skbdata);
-   u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
-   char *dptr = (char *)tlv + NLA_HDRLEN;
-   u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
-   *tlv = htonl(htlv);
-   memset(dptr, 0, totlen - NLA_HDRLEN);
-   memcpy(dptr, dval, dlen);
-
-   return totlen;
-}
-
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
u16 edata = 0;
@@ -637,69 +621,59 @@ int find_decode_metaid(struct sk_buff *skb, struct 
tcf_ife_info *ife,
return 0;
 }
 
-struct ifeheadr {
-   __be16 metalen;
-   u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
-   __be16 type;
-   __be16 len;
-};
-
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
  struct tcf_result *res)
 {
struct tcf_ife_info *ife = to_ife(a);
int action = ife->tcf_action;
-   struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-   int ifehdrln = (int)ifehdr->metalen;
-   struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+   u8 *ifehdr_end;
+   u8 *tlv_data;
+   u16 metalen;
 
spin_lock(>tcf_lock);
bstats_update(>tcf_bstats, skb);
tcf_lastuse_update(>tcf_tm);
spin_unlock(>tcf_lock);
 
-   ifehdrln = ntohs(ifehdrln);
-   if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+   if (skb_at_tc_ingress(skb))
+   skb_push(skb, skb->dev->hard_header_len);
+
+   tlv_data = ife_decode(skb, );
+   if (unlikely(!tlv_data)) {
spin_lock(>tcf_lock);
ife->tcf_qstats.drops++;
spin_unlock(>tcf_lock);
return TC_ACT_SHOT;
}
 
-   skb_set_mac_header(skb, ifehdrln);
-   __skb_pull(skb, ifehdrln);
-   skb->protocol = eth_type_trans(skb, skb->dev);
-   ifehdrln -= IFE_METAHDRLEN;
-
-   while (ifehdrln > 0) {
-   u8 *tlvdata = (u8 *)tlv;
-   u16 mtype = tlv->type;
-   u16 mlen = tlv->len;
-   u16 alen;
+   ifehdr_end = tlv_data + metalen;
+   for (; tlv_data < ifehdr_end; tlv_data = ife_tlv_meta_next(tlv_data)) {
+   u8 *curr_data;
+   u16 mtype;
+   u16 dlen;
 
-   mtype = ntohs(mtype);
-   mlen = ntohs(mlen);
-   alen

RE: [PATCH net-next 1/2] net: Introduce ife encapsulation module

2017-01-31 Thread Yotam Gigi

>-Original Message-
>From: kbuild test robot [mailto:l...@intel.com]
>Sent: Wednesday, February 01, 2017 1:58 AM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: kbuild-...@01.org; j...@mojatatu.com; da...@davemloft.net;
>netdev@vger.kernel.org; Jiri Pirko <j...@mellanox.com>; Elad Raz
><el...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Yotam Gigi
><yot...@mellanox.com>
>Subject: Re: [PATCH net-next 1/2] net: Introduce ife encapsulation module
>
>Hi Yotam,
>
>[auto build test ERROR on net-next/master]
>
>url:
>https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.co
>m%2F0day-ci%2Flinux%2Fcommits%2FYotam-Gigi%2FExtract-IFE-logic-to-
>module%2F20170131-
>222757=02%7C01%7Cyotamg%40mellanox.com%7C78ded484c30746e7ecba08
>d44a350e8e%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636215039177
>989111=ZjfwyvJCJWthy4VSx2Yyteu%2BU3VkVdSIqAPiaCUprqo%3D
>ed=0
>config: alpha-allyesconfig (attached as .config)
>compiler: alpha-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
>reproduce:
>wget
>https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel
>.org%2Fcgit%2Flinux%2Fkernel%2Fgit%2Fwfg%2Flkp-
>tests.git%2Fplain%2Fsbin%2Fmake.cross=02%7C01%7Cyotamg%40mellanox.
>com%7C78ded484c30746e7ecba08d44a350e8e%7Ca652971c7d2e4d9ba6a4d149256f
>461b%7C0%7C0%7C636215039177999119=PFLoGWvLTSWuY2e%2F%2F5C%2
>F%2BGlpNGHQgeX3UPXYxrgYqzc%3D=0 -O ~/bin/make.cross
>chmod +x ~/bin/make.cross
># save the attached .config to linux build tree
>make.cross ARCH=alpha
>
>Note: the linux-review/Yotam-Gigi/Extract-IFE-logic-to-module/20170131-222757
>HEAD 236a17de4759e7d4d2927d4ac50329ec788ec655 builds fine.
>  It only hurts bisectibility.
>
>All errors (new ones prefixed by >>):
>
>   net/ife/built-in.o: In function `ife_tlv_meta_encode':
>>> (.text+0x420): multiple definition of `ife_tlv_meta_encode'
>   net/sched/built-in.o:(.text+0x18800): first defined here
>
>---
>0-DAY kernel test infrastructureOpen Source Technology Center
>https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.or
>g%2Fpipermail%2Fkbuild-
>all=02%7C01%7Cyotamg%40mellanox.com%7C78ded484c30746e7ecba08d44a
>350e8e%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C63621503917799911
>9=m2yy4YGzOaxrLF%2FjqgxikQ8GELNBj%2F9IlSOlS%2F831tY%3D
>=0   Intel Corporation

Sorry for that. I will send v2 soon.

[PATCH net-next 0/2] Extract IFE logic to module

2017-01-31 Thread Yotam Gigi

Extract ife logic from the tc_ife action into an independent module, and
make the tc_ife action use it. This way, the ife encapsulation can be used
by other modules other than tc_ife action.

Yotam Gigi (2):
  net: Introduce ife encapsulation module
  net/sched: act_ife: Change to use ife module

 MAINTAINERS|   7 ++
 include/net/ife.h  |  51 +
 include/net/tc_act/tc_ife.h|   3 -
 include/uapi/linux/Kbuild  |   1 +
 include/uapi/linux/ife.h   |  18 +
 include/uapi/linux/tc_act/tc_ife.h |  10 +--
 net/Kconfig|   1 +
 net/Makefile   |   1 +
 net/ife/Kconfig|  16 +
 net/ife/Makefile   |   5 ++
 net/ife/ife.c  | 142 +
 net/sched/Kconfig  |   1 +
 net/sched/act_ife.c| 110 +---
 13 files changed, 276 insertions(+), 90 deletions(-)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

-- 
2.4.11

[PATCH net-next 2/2] net/sched: act_ife: Change to use ife module

2017-01-31 Thread Yotam Gigi

Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 include/net/tc_act/tc_ife.h|   3 -
 include/uapi/linux/tc_act/tc_ife.h |  10 +---
 net/sched/Kconfig  |   1 +
 net/sched/act_ife.c| 110 +++--
 4 files changed, 34 insertions(+), 90 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include 
 #include 
 
-#define IFE_METAHDRLEN 2
 struct tcf_ife_info {
struct tc_action common;
u8 eth_dst[ETH_ALEN];
@@ -45,8 +44,6 @@ struct tcf_meta_ops {
 
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
 int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-   const void *dval);
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/include/uapi/linux/tc_act/tc_ife.h 
b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 #define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
 };
 #define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
 
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#defineIFE_META_PRIO 3
-#defineIFE_META_QMAP 4
-#defineIFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define__IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72cfa3a..403790c 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -776,6 +776,7 @@ config NET_ACT_SKBMOD
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
+select NET_IFE
 ---help---
  Say Y here to allow for sourcing and terminating metadata
  For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 921fb20..71e7ff22 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IFE_TAB_MASK 15
 
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = 
{
[TCA_IFE_TYPE] = { .type = NLA_U16},
 };
 
-/* Caller takes care of presenting data in network order
-*/
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void 
*dval)
-{
-   u32 *tlv = (u32 *)(skbdata);
-   u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
-   char *dptr = (char *)tlv + NLA_HDRLEN;
-   u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
-   *tlv = htonl(htlv);
-   memset(dptr, 0, totlen - NLA_HDRLEN);
-   memcpy(dptr, dval, dlen);
-
-   return totlen;
-}
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
-
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
u16 edata = 0;
@@ -637,69 +621,59 @@ int find_decode_metaid(struct sk_buff *skb, struct 
tcf_ife_info *ife,
return 0;
 }
 
-struct ifeheadr {
-   __be16 metalen;
-   u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
-   __be16 type;
-   __be16 len;
-};
-
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
  struct tcf_result *res)
 {
struct tcf_ife_info *ife = to_ife(a);
int action = ife->tcf_action;
-   struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-   int ifehdrln = (int)ifehdr->metalen;
-   struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+   u8 *ifehdr_end;
+   u8 *tlv_data;
+   u16 metalen;
 
spin_lock(>tcf_lock);
bstats_update(>tcf_bstats, skb);
tcf_lastuse_update(>tcf_tm);
spin_unlock(>tcf_lock);
 
-   ifehdrln = ntohs(ifehdrln);
-   if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+   if (skb_at_tc_ingress(skb))
+   skb_push(skb, skb->dev->hard_header_len);
+
+   tlv_data = ife_decode(skb, );
+   if (unlikely(!tlv_data)) {
spin_lock(>tcf_lock);
ife->tcf_qstats.drops++;
spin_unlock(>tcf_lock);
return TC_ACT_SHOT;
}
 
-   skb_set_mac_header(skb, ifehdrln);
-   __skb_pull(skb, ifehdrln);
-   skb->protocol = eth_type_trans(skb, skb->dev);
-   ifehdrln -= IFE_METAH

[PATCH net-next 1/2] net: Introduce ife encapsulation module

2017-01-31 Thread Yotam Gigi

This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
   header space.
 - ife_tlv_meta_decode - decodes one tlv entry from the packet
 - ife_tlv_meta_next - advance to the next tlv

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 MAINTAINERS   |   7 +++
 include/net/ife.h |  51 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/ife.h  |  18 ++
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/ife/Kconfig   |  16 ++
 net/ife/Makefile  |   5 ++
 net/ife/ife.c | 142 ++
 9 files changed, 242 insertions(+)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cc106f7..3017257 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6250,6 +6250,13 @@ F:   include/net/cfg802154.h
 F: include/net/ieee802154_netdev.h
 F: Documentation/networking/ieee802154.txt
 
+IFE PROTOCOL
+M:     Yotam Gigi <yot...@mellanox.com>
+M: Jamal Hadi Salim <j...@mojatatu.com>
+F: net/ife
+F: include/net/ife.h
+F: include/uapi/linux/ife.h
+
 IGORPLUG-USB IR RECEIVER
 M: Sean Young <s...@mess.org>
 L: linux-me...@vger.kernel.org
diff --git a/include/net/ife.h b/include/net/ife.h
new file mode 100644
index 000..2d87d68
--- /dev/null
+++ b/include/net/ife.h
@@ -0,0 +1,51 @@
+#ifndef __NET_IFE_H
+#define __NET_IFE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#if IS_ENABLED(CONFIG_NET_IFE)
+
+void *ife_encode(struct sk_buff *skb, u16 metalen);
+void *ife_decode(struct sk_buff *skb, u16 *metalen);
+
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 
*totlen);
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+   const void *dval);
+
+void *ife_tlv_meta_next(void *skbdata);
+
+#else
+
+static inline void *ife_encode(struct sk_buff *skb, u16 metalen)
+{
+   return NULL;
+}
+
+static inline void *ife_decode(struct sk_buff *skb, u16 *metalen)
+{
+   return NULL;
+}
+
+static inline void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 
*dlen,
+   u16 *totlen)
+{
+   return NULL;
+}
+
+static inline int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+   const void *dval)
+{
+   return 0;
+}
+
+static inline void *ife_tlv_meta_next(void *skbdata)
+{
+   return NULL;
+}
+
+#endif
+
+#endif /* __NET_IFE_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 486e050..a2e9072 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -195,6 +195,7 @@ header-y += if_tun.h
 header-y += if_tunnel.h
 header-y += if_vlan.h
 header-y += if_x25.h
+header-y += ife.h
 header-y += igmp.h
 header-y += ila.h
 header-y += in6.h
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
new file mode 100644
index 000..2954da3
--- /dev/null
+++ b/include/uapi/linux/ife.h
@@ -0,0 +1,18 @@
+#ifndef __UAPI_IFE_H
+#define __UAPI_IFE_H
+
+#define IFE_METAHDRLEN 2
+
+enum {
+   IFE_META_SKBMARK = 1,
+   IFE_META_HASHID,
+   IFE_META_PRIO,
+   IFE_META_QMAP,
+   IFE_META_TCINDEX,
+   __IFE_META_MAX
+};
+
+/*Can be overridden at runtime by module option*/
+#define IFE_META_MAX (__IFE_META_MAX - 1)
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index ce4aee6..2f2842d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -391,6 +391,7 @@ source "net/caif/Kconfig"
 source "net/ceph/Kconfig"
 source "net/nfc/Kconfig"
 source "net/psample/Kconfig"
+source "net/ife/Kconfig"
 
 config LWTUNNEL
bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 7d41de4..9b68155 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_CEPH_LIB)+= ceph/
 obj-$(CONFIG_BATMAN_ADV)   += batman-adv/
 obj-$(CONFIG_NFC)  += nfc/
 obj-$(CONFIG_PSAMPLE)  += psample/
+obj-$(CONFIG_NET_IFE)  += ife/
 obj-$(CONFIG_OPENVSWITCH)  += openvswitch/
 obj-$(CONFIG_VSOCKETS) += vmw_vsock/
 obj-$(CONFIG_MPLS) += mpls/
diff --git a/net/ife/Kconfig b/net/ife/Kconfig
new file mode 100644
index 000..31e48b6
--- /dev/null
+++ b/net/ife/Kconfig
@@ -0,0 +1,16 @@
+#
+# IFE subsystem configuration
+#
+
+menuconfig NET_IFE
+   depends on NET
+tristate "Inter-FE based on IETF ForCES InterFE LFB"
+   default n
+   help
+ S

[PATCH net v2] net/sched: matchall: Fix configuration race

2017-01-31 Thread Yotam Gigi

In the current version, the matchall internal state is split into two
structs: cls_matchall_head and cls_matchall_filter. This makes little
sense, as matchall instance supports only one filter, and there is no
situation where one exists and the other does not. In addition, that led
to some races when filter was deleted while packet was processed.

Unify that two structs into one, thus simplifying the process of matchall
creation and deletion. As a result, the new, delete and get callbacks have
a dummy implementation where all the work is done in destroy and change
callbacks, as was done in cls_cgroup.

Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
Reported-by: Daniel Borkmann <dan...@iogearbox.net>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
Acked-by: Jiri Pirko <j...@mellanox.com>
---
v1->v2:
 - Rebased on net

When net is merged with net-next, this patch will probably conflict with
ec2507d2a306 ("net/sched: cls_matchall: Fix error path").
---
 net/sched/cls_matchall.c | 127 +--
 1 file changed, 45 insertions(+), 82 deletions(-)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index f935429..b12bc2a 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -16,16 +16,11 @@
 #include 
 #include 
 
-struct cls_mall_filter {
+struct cls_mall_head {
struct tcf_exts exts;
struct tcf_result res;
u32 handle;
-   struct rcu_head rcu;
u32 flags;
-};
-
-struct cls_mall_head {
-   struct cls_mall_filter *filter;
struct rcu_head rcu;
 };
 
@@ -33,38 +28,29 @@ static int mall_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
 struct tcf_result *res)
 {
struct cls_mall_head *head = rcu_dereference_bh(tp->root);
-   struct cls_mall_filter *f = head->filter;
 
-   if (tc_skip_sw(f->flags))
+   if (tc_skip_sw(head->flags))
return -1;
 
-   return tcf_exts_exec(skb, >exts, res);
+   return tcf_exts_exec(skb, >exts, res);
 }
 
 static int mall_init(struct tcf_proto *tp)
 {
-   struct cls_mall_head *head;
-
-   head = kzalloc(sizeof(*head), GFP_KERNEL);
-   if (!head)
-   return -ENOBUFS;
-
-   rcu_assign_pointer(tp->root, head);
-
return 0;
 }
 
-static void mall_destroy_filter(struct rcu_head *head)
+static void mall_destroy_rcu(struct rcu_head *rcu)
 {
-   struct cls_mall_filter *f = container_of(head, struct cls_mall_filter, 
rcu);
+   struct cls_mall_head *head = container_of(rcu, struct cls_mall_head,
+ rcu);
 
-   tcf_exts_destroy(>exts);
-
-   kfree(f);
+   tcf_exts_destroy(>exts);
+   kfree(head);
 }
 
 static int mall_replace_hw_filter(struct tcf_proto *tp,
- struct cls_mall_filter *f,
+ struct cls_mall_head *head,
  unsigned long cookie)
 {
struct net_device *dev = tp->q->dev_queue->dev;
@@ -74,7 +60,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
offload.type = TC_SETUP_MATCHALL;
offload.cls_mall = _offload;
offload.cls_mall->command = TC_CLSMATCHALL_REPLACE;
-   offload.cls_mall->exts = >exts;
+   offload.cls_mall->exts = >exts;
offload.cls_mall->cookie = cookie;
 
return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol,
@@ -82,7 +68,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 }
 
 static void mall_destroy_hw_filter(struct tcf_proto *tp,
-  struct cls_mall_filter *f,
+  struct cls_mall_head *head,
   unsigned long cookie)
 {
struct net_device *dev = tp->q->dev_queue->dev;
@@ -103,29 +89,20 @@ static bool mall_destroy(struct tcf_proto *tp, bool force)
 {
struct cls_mall_head *head = rtnl_dereference(tp->root);
struct net_device *dev = tp->q->dev_queue->dev;
-   struct cls_mall_filter *f = head->filter;
 
-   if (!force && f)
-   return false;
+   if (!head)
+   return true;
 
-   if (f) {
-   if (tc_should_offload(dev, tp, f->flags))
-   mall_destroy_hw_filter(tp, f, (unsigned long) f);
+   if (tc_should_offload(dev, tp, head->flags))
+   mall_destroy_hw_filter(tp, head, (unsigned long) head);
 
-   call_rcu(>rcu, mall_destroy_filter);
-   }
-   kfree_rcu(head, rcu);
+   call_rcu(>rcu, mall_destroy_rcu);
return true;
 }
 
 static unsigned long mall_get(struct tcf_proto *tp, u32 handle)
 {
-   struct cls_mall_head *head = rtnl_dereference(tp->root);
-   struct cls_mall_filter *f = head->filt

[PATCH iproute] tc: man: matchall: Fix example indentation

2017-01-31 Thread Yotam Gigi

The man page contains two examples, which have different indentation. Fix
the indentation of the two examples to match.

Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 man/man8/tc-matchall.8 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/man/man8/tc-matchall.8 b/man/man8/tc-matchall.8
index f920922..5aa11da 100644
--- a/man/man8/tc-matchall.8
+++ b/man/man8/tc-matchall.8
@@ -53,6 +53,7 @@ where the second command attaches a matchall filters on it 
that mirrors the
 packets to device eth2.
 
 To create egress mirroring from port eth1 to port eth2:
+.RS
 .EX
 
 tc qdisc add dev eth1 handle 1: root prio
-- 
2.4.11

[PATCH net-next 1/2] net/sched: act_sample: Fix error path in init

2017-01-31 Thread Yotam Gigi

Fix error path of in sample init, by releasing the tc hash in case of
failure in psample_group creation.

Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Reported-by: Cong Wang <xiyou.wangc...@gmail.com>
Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/act_sample.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 3922975..02b6749 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -81,8 +81,11 @@ static int tcf_sample_init(struct net *net, struct nlattr 
*nla,
s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
s->psample_group_num = nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
psample_group = psample_group_get(net, s->psample_group_num);
-   if (!psample_group)
+   if (!psample_group) {
+   if (ret == ACT_P_CREATED)
+   tcf_hash_release(*a, bind);
return -ENOMEM;
+   }
RCU_INIT_POINTER(s->psample_group, psample_group);
 
if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
-- 
2.4.11

[PATCH net-next 0/2] net/sched: act_sample: Little fixes

2017-01-31 Thread Yotam Gigi

Little fixes in sample tc action.

Yotam Gigi (2):
  net/sched: act_sample: Fix error path in init
  net/sched: act_psample: Remove unnecessary ASSERT_RTNL

 net/sched/act_sample.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
2.4.11

[PATCH net-next 2/2] net/sched: act_psample: Remove unnecessary ASSERT_RTNL

2017-01-31 Thread Yotam Gigi

The ASSERT_RTNL is not necessary in the init function, as it does not
touch any rtnl protected structures, as opposed to the mirred action which
does have to hold a net device.

Reported-by: Cong Wang <xiyou.wangc...@gmail.com>
Reviewed-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/act_sample.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 02b6749..0b8217b 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -76,7 +76,6 @@ static int tcf_sample_init(struct net *net, struct nlattr 
*nla,
}
s = to_sample(*a);
 
-   ASSERT_RTNL();
s->tcf_action = parm->action;
s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
s->psample_group_num = nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
-- 
2.4.11

RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action

2017-01-30 Thread Yotam Gigi

>-Original Message-
>From: Yotam Gigi
>Sent: Friday, January 27, 2017 8:09 AM
>To: 'Cong Wang' <xiyou.wangc...@gmail.com>; Jiri Pirko <j...@resnulli.us>
>Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; David Miller
><da...@davemloft.net>; Ido Schimmel <ido...@mellanox.com>; Elad Raz
><el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; Jamal Hadi Salim <j...@mojatatu.com>;
>geert+rene...@glider.be; Stephen Hemminger <step...@networkplumber.org>;
>Guenter Roeck <li...@roeck-us.net>; Roopa Prabhu
><ro...@cumulusnetworks.com>; John Fastabend <john.fastab...@gmail.com>;
>Simon Horman <simon.hor...@netronome.com>; Roman Mashak
><m...@mojatatu.com>
>Subject: RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>
>>-Original Message-
>>From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
>>Sent: Thursday, January 26, 2017 1:30 AM
>>To: Jiri Pirko <j...@resnulli.us>
>>Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; David Miller
>><da...@davemloft.net>; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
>><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
>><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>; Jamal Hadi Salim
>><j...@mojatatu.com>; geert+rene...@glider.be; Stephen Hemminger
>><step...@networkplumber.org>; Guenter Roeck <li...@roeck-us.net>; Roopa
>>Prabhu <ro...@cumulusnetworks.com>; John Fastabend
>><john.fastab...@gmail.com>; Simon Horman
><simon.hor...@netronome.com>;
>>Roman Mashak <m...@mojatatu.com>
>>Subject: Re: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>>
>>On Mon, Jan 23, 2017 at 2:07 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>> +
>>> +static int tcf_sample_init(struct net *net, struct nlattr *nla,
>>> +  struct nlattr *est, struct tc_action **a, int 
>>> ovr,
>>> +  int bind)
>>> +{
>>> +   struct tc_action_net *tn = net_generic(net, sample_net_id);
>>> +   struct nlattr *tb[TCA_SAMPLE_MAX + 1];
>>> +   struct psample_group *psample_group;
>>> +   struct tc_sample *parm;
>>> +   struct tcf_sample *s;
>>> +   bool exists = false;
>>> +   int ret;
>>> +
>>> +   if (!nla)
>>> +   return -EINVAL;
>>> +   ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
>>> +   if (ret < 0)
>>> +   return ret;
>>> +   if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
>>> +   !tb[TCA_SAMPLE_PSAMPLE_GROUP])
>>> +   return -EINVAL;
>>> +
>>> +   parm = nla_data(tb[TCA_SAMPLE_PARMS]);
>>> +
>>> +   exists = tcf_hash_check(tn, parm->index, a, bind);
>>> +   if (exists && bind)
>>> +   return 0;
>>> +
>>> +   if (!exists) {
>>> +   ret = tcf_hash_create(tn, parm->index, est, a,
>>> + _sample_ops, bind, false);
>>> +   if (ret)
>>> +   return ret;
>>> +   ret = ACT_P_CREATED;
>>> +   } else {
>>> +   tcf_hash_release(*a, bind);
>>> +   if (!ovr)
>>> +   return -EEXIST;
>>> +   }
>>> +   s = to_sample(*a);
>>> +
>>> +   ASSERT_RTNL();
>>
>>Copy-n-paste from mirred aciton? This is not needed for you, mirred
>>needs it because of target netdevice.
>
>Ho, you are right. I will remove it.
>
>>
>>
>>> +   s->tcf_action = parm->action;
>>> +   s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
>>> +   s->psample_group_num =
>>nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
>>> +   psample_group = psample_group_get(net, s->psample_group_num);
>>> +   if (!psample_group)
>>> +   return -ENOMEM;
>>
>>I don't think you can just return here, needs tcf_hash_cleanup() for create
>>case, right?
>
>Will fix.
>
>>
>>
>>> +   RCU_INIT_POINTER(s->psample_group, psample_group);
>>> +
>>> +   if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
>>> +   s->truncate = true;
>>> +   s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
>>> +   }
>>
>>
>>Do you need tcf_lock here if RCU only protects ->psample_group??
>>
>
>You are right. I need to protect in case of update.

Cong, after some thinking I think I don't really need the tcf_lock here. I
don't really care if the truncate, trunc_size, rate and tcf_action are
consistent among themselves - the only parameter that I care about is the
psample_group pointer, and it is protected via RCU. As far as I see, there is
no need to lock here.

I do need to take the tcf_lock to protect the statistics update in the 
tcf_sample_act code, as far as I see.

Am I missing something?

>
>I will send a fixup patch in the following days. Thanks!
>
>>
>>> +
>>> +   if (ret == ACT_P_CREATED)
>>> +   tcf_hash_insert(tn, *a);
>>> +   return ret;
>>> +}
>>> +
>>
>>
>>Thanks.

RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action

2017-01-26 Thread Yotam Gigi

>-Original Message-
>From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
>Sent: Thursday, January 26, 2017 1:30 AM
>To: Jiri Pirko <j...@resnulli.us>
>Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; David Miller
><da...@davemloft.net>; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>; Jamal Hadi Salim
><j...@mojatatu.com>; geert+rene...@glider.be; Stephen Hemminger
><step...@networkplumber.org>; Guenter Roeck <li...@roeck-us.net>; Roopa
>Prabhu <ro...@cumulusnetworks.com>; John Fastabend
><john.fastab...@gmail.com>; Simon Horman <simon.hor...@netronome.com>;
>Roman Mashak <m...@mojatatu.com>
>Subject: Re: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>
>On Mon, Jan 23, 2017 at 2:07 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> +
>> +static int tcf_sample_init(struct net *net, struct nlattr *nla,
>> +  struct nlattr *est, struct tc_action **a, int ovr,
>> +  int bind)
>> +{
>> +   struct tc_action_net *tn = net_generic(net, sample_net_id);
>> +   struct nlattr *tb[TCA_SAMPLE_MAX + 1];
>> +   struct psample_group *psample_group;
>> +   struct tc_sample *parm;
>> +   struct tcf_sample *s;
>> +   bool exists = false;
>> +   int ret;
>> +
>> +   if (!nla)
>> +   return -EINVAL;
>> +   ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
>> +   if (ret < 0)
>> +   return ret;
>> +   if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
>> +   !tb[TCA_SAMPLE_PSAMPLE_GROUP])
>> +   return -EINVAL;
>> +
>> +   parm = nla_data(tb[TCA_SAMPLE_PARMS]);
>> +
>> +   exists = tcf_hash_check(tn, parm->index, a, bind);
>> +   if (exists && bind)
>> +   return 0;
>> +
>> +   if (!exists) {
>> +   ret = tcf_hash_create(tn, parm->index, est, a,
>> + _sample_ops, bind, false);
>> +   if (ret)
>> +   return ret;
>> +   ret = ACT_P_CREATED;
>> +   } else {
>> +   tcf_hash_release(*a, bind);
>> +   if (!ovr)
>> +   return -EEXIST;
>> +   }
>> +   s = to_sample(*a);
>> +
>> +   ASSERT_RTNL();
>
>Copy-n-paste from mirred aciton? This is not needed for you, mirred
>needs it because of target netdevice.

Ho, you are right. I will remove it.

>
>
>> +   s->tcf_action = parm->action;
>> +   s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
>> +   s->psample_group_num =
>nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
>> +   psample_group = psample_group_get(net, s->psample_group_num);
>> +   if (!psample_group)
>> +   return -ENOMEM;
>
>I don't think you can just return here, needs tcf_hash_cleanup() for create
>case, right?

Will fix.

>
>
>> +   RCU_INIT_POINTER(s->psample_group, psample_group);
>> +
>> +   if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
>> +   s->truncate = true;
>> +   s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
>> +   }
>
>
>Do you need tcf_lock here if RCU only protects ->psample_group??
>

You are right. I need to protect in case of update.

I will send a fixup patch in the following days. Thanks!

>
>> +
>> +   if (ret == ACT_P_CREATED)
>> +   tcf_hash_insert(tn, *a);
>> +   return ret;
>> +}
>> +
>
>
>Thanks.

RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action

2017-01-24 Thread Yotam Gigi

>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: Tuesday, January 24, 2017 10:33 AM
>To: Jiri Pirko <j...@resnulli.us>
>Cc: netdev@vger.kernel.org; da...@davemloft.net; Yotam Gigi
><yot...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Elad Raz
><el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; j...@mojatatu.com; geert+rene...@glider.be;
>step...@networkplumber.org; xiyou.wangc...@gmail.com; li...@roeck-us.net;
>ro...@cumulusnetworks.com; john.fastab...@gmail.com; m...@mojatatu.com
>Subject: Re: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>
>On Mon, Jan 23, 2017 at 11:07:09AM +0100, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> This action allows the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly and sampling them using
>> the psample module. The user can configure the psample group number, the
>> sampling rate and the packet's truncation (to save kernel-user traffic).
>>
>> Example:
>> To sample ingress traffic from interface eth1, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>> matchall action sample rate 12 group 4
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling randomly with an average of one sampled packet per 12 packets on
>> dev eth1 to psample group 4.
>>
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>> Acked-by: Jamal Hadi Salim <j...@mojatatu.com>
>
>Reviewed-by: Simon Horman <simon.hor...@netronome.com>
>
>Is the tc user-space (iproute2) code available yet?

Yes it is. I thought about sending it the moment the kernel patches gets 
accepted.

Do you want it to send it directly to you?

[PATCH net] net/sched: matchall: Fix configuration race

2017-01-23 Thread Yotam Gigi

In the current version, the matchall internal state is split into two
structs: cls_matchall_head and cls_matchall_filter. This makes little
sense, as matchall instance supports only one filter, and there is no
situation where one exists and the other does not. In addition, that led
to some races when filter was deleted while packet was processed.

Unify that two structs into one, thus simplifying the process of matchall
creation and deletion. As a result, the new, delete and get callbacks have
a dummy implementation where all the work is done in destroy and change
callbacks, as was done in cls_cgroup.

Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
Acked-by: Jiri Pirko <j...@mellanox.com>
---
 net/sched/cls_matchall.c | 129 +--
 1 file changed, 46 insertions(+), 83 deletions(-)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index fcecf5a..f2141cb 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -16,16 +16,11 @@
 #include 
 #include 
 
-struct cls_mall_filter {
+struct cls_mall_head {
struct tcf_exts exts;
struct tcf_result res;
u32 handle;
-   struct rcu_head rcu;
u32 flags;
-};
-
-struct cls_mall_head {
-   struct cls_mall_filter *filter;
struct rcu_head rcu;
 };
 
@@ -33,38 +28,29 @@ static int mall_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
 struct tcf_result *res)
 {
struct cls_mall_head *head = rcu_dereference_bh(tp->root);
-   struct cls_mall_filter *f = head->filter;
 
-   if (tc_skip_sw(f->flags))
+   if (tc_skip_sw(head->flags))
return -1;
 
-   return tcf_exts_exec(skb, >exts, res);
+   return tcf_exts_exec(skb, >exts, res);
 }
 
 static int mall_init(struct tcf_proto *tp)
 {
-   struct cls_mall_head *head;
-
-   head = kzalloc(sizeof(*head), GFP_KERNEL);
-   if (!head)
-   return -ENOBUFS;
-
-   rcu_assign_pointer(tp->root, head);
-
return 0;
 }
 
-static void mall_destroy_filter(struct rcu_head *head)
+static void mall_destroy_rcu(struct rcu_head *rcu)
 {
-   struct cls_mall_filter *f = container_of(head, struct cls_mall_filter, 
rcu);
+   struct cls_mall_head *head = container_of(rcu, struct cls_mall_head,
+ rcu);
 
-   tcf_exts_destroy(>exts);
-
-   kfree(f);
+   tcf_exts_destroy(>exts);
+   kfree(head);
 }
 
 static int mall_replace_hw_filter(struct tcf_proto *tp,
- struct cls_mall_filter *f,
+ struct cls_mall_head *head,
  unsigned long cookie)
 {
struct net_device *dev = tp->q->dev_queue->dev;
@@ -74,7 +60,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
offload.type = TC_SETUP_MATCHALL;
offload.cls_mall = _offload;
offload.cls_mall->command = TC_CLSMATCHALL_REPLACE;
-   offload.cls_mall->exts = >exts;
+   offload.cls_mall->exts = >exts;
offload.cls_mall->cookie = cookie;
 
return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol,
@@ -82,7 +68,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 }
 
 static void mall_destroy_hw_filter(struct tcf_proto *tp,
-  struct cls_mall_filter *f,
+  struct cls_mall_head *head,
   unsigned long cookie)
 {
struct net_device *dev = tp->q->dev_queue->dev;
@@ -103,29 +89,20 @@ static bool mall_destroy(struct tcf_proto *tp, bool force)
 {
struct cls_mall_head *head = rtnl_dereference(tp->root);
struct net_device *dev = tp->q->dev_queue->dev;
-   struct cls_mall_filter *f = head->filter;
 
-   if (!force && f)
-   return false;
+   if (!head)
+   return true;
 
-   if (f) {
-   if (tc_should_offload(dev, tp, f->flags))
-   mall_destroy_hw_filter(tp, f, (unsigned long) f);
+   if (tc_should_offload(dev, tp, head->flags))
+   mall_destroy_hw_filter(tp, head, (unsigned long) head);
 
-   call_rcu(>rcu, mall_destroy_filter);
-   }
-   kfree_rcu(head, rcu);
+   call_rcu(>rcu, mall_destroy_rcu);
return true;
 }
 
 static unsigned long mall_get(struct tcf_proto *tp, u32 handle)
 {
-   struct cls_mall_head *head = rtnl_dereference(tp->root);
-   struct cls_mall_filter *f = head->filter;
-
-   if (f && f->handle == handle)
-   return (unsigned long) f;
-   return 0;
+   return 0UL;
 }
 
 static const struct nla_policy mall_policy[TCA_MATCHALL_MAX + 1] = {
@@ -134,

RE: [patch net-next 2/4] net/sched: Introduce sample tc action

2017-01-22 Thread Yotam Gigi

>-Original Message-
>From: Jamal Hadi Salim [mailto:j...@mojatatu.com]
>Sent: Sunday, January 22, 2017 3:17 PM
>To: Jiri Pirko <j...@resnulli.us>; netdev@vger.kernel.org
>Cc: da...@davemloft.net; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com;
>john.fastab...@gmail.com; simon.hor...@netronome.com; m...@mojatatu.com
>Subject: Re: [patch net-next 2/4] net/sched: Introduce sample tc action
>
>On 17-01-22 06:44 AM, Jiri Pirko wrote:
>
>> diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
>> new file mode 100644
>> index 000..24e20e4
>> --- /dev/null
>> +++ b/net/sched/act_sample.c
>> @@ -0,0 +1,274 @@
>> +/*
>> + * net/sched/act_sample.c - Packet samplig tc action
>
>typo: "Sampling"

It took me a while to see it. Will fix :)

>
>
>> +static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
>> +  struct tcf_result *res)
>
>
>Can you rename this function because it is also the name of the data
>structure? It makes it easier to grep.
>I know we have this all over the place in other actions (and i hope
>those are cleaned up at some point).

OK, makes sense.

>
>otherwise:
>Acked-by: Jamal Hadi Salim <j...@mojatatu.com>

Thanks! 

>
>cheers,
>jamal

RE: [patch net-next v2 5/8] Introduce sample tc action

2017-01-04 Thread Yotam Gigi

>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: Wednesday, January 04, 2017 12:43 PM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: Roopa Prabhu <ro...@cumulusnetworks.com>; Jiri Pirko <j...@resnulli.us>;
>netdev@vger.kernel.org; da...@davemloft.net; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <orl...@mellanox.com>;
>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; john.fastab...@gmail.com
>Subject: Re: [patch net-next v2 5/8] Introduce sample tc action
>
>On Wed, Nov 16, 2016 at 04:26:44PM +, Yotam Gigi wrote:
>> >-Original Message-
>> >From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>> >Sent: Wednesday, November 16, 2016 6:15 PM
>> >To: Jiri Pirko <j...@resnulli.us>
>> >Cc: netdev@vger.kernel.org; da...@davemloft.net; Yotam Gigi
>> ><yot...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Elad Raz
>> ><el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
>> ><ogerl...@mellanox.com>; j...@mojatatu.com; geert+rene...@glider.be;
>> >step...@networkplumber.org; xiyou.wangc...@gmail.com; linux@roeck-
>us.net;
>> >john.fastab...@gmail.com; simon.hor...@netronome.com
>> >Subject: Re: [patch net-next v2 5/8] Introduce sample tc action
>> >
>> >On 11/14/16, 7:00 AM, Jiri Pirko wrote:
>> >> From: Yotam Gigi <yot...@mellanox.com>
>> >>
>> >> This action allow the user to sample traffic matched by tc classifier.
>> >> The sampling consists of choosing packets randomly, truncating them,
>> >> adding some informative metadata regarding the interface and the original
>> >> packet size and mark them with specific mark, to allow further tc rules to
>> >> match and process. The marked sample packets are then injected into the
>> >> device ingress qdisc using netif_receive_skb.
>> >>
>> >> The packets metadata is packed using the ife encapsulation protocol, and
>> >> the outer packet's ethernet dest, source and eth_type, along with the
>> >> rate, mark and the optional truncation size can be configured from
>> >> userspace.
>> >>
>> >> Example:
>> >> To sample ingress traffic from interface eth1, and redirect the sampled
>> >> the sampled packets to interface dummy0, one may use the commands:
>> >>
>> >> tc qdisc add dev eth1 handle : ingress
>> >>
>> >> tc filter add dev eth1 parent : \
>> >>  matchall action sample rate 12 mark 17
>> >
>> >Yotham, I am guessing in the future if one does not want to use mark,
>> >the sample api is extensible to allow for other actions to be added.
>> >This is from the general concern we had on using mark: some may not want to
>use
>> >mark.
>> >As long as the api is extensible to allow an alternate way in the future,
>> > we should be good. (We would prefer to not go down the path of having to
>> >introduce
>> >a new  'action sample' if this limits us in some way).
>>
>> The code is extensible  - if one does want to add another action to sample, 
>> he
>> totally can :)
>>
>> By the way, one of the reasons we removed the patches for now is that we
>> consider adding mirroring functionality to it instead of heaving two 
>> tc-rules.
>
>Hi Yotham,
>
>I see that this action combines several discrete sub-actions, some of which
>duplicate functionality present in existing actions:
>
>* sample
>* truncate
>* IFE encapsulation with metadata
>* mark
>* output (proposed)
>
>I wonder if it would make sense to provided a mechanism whereby sampled
>packets can be passed on to other actions. Something similar to the
>existing pipe mechanism but allowing for the sampled and the original packets
>to continue to be processed separately.
>
>If so, it may be worth white-listing actions for which sampled packets are
>known to work well to limit the scope for unforseen side effects.

Hi Simon,

The patches were removed and we intend to propose a different implementation for
that, using a dedicated sampling netlink channel, thus removing the need to
mark, mirror and IFE-pack the packets.

We intend to send in the next few weeks, so I guess you can comment on the new
version.

Thanks!

[PATCH net-next v2] net/sched: cls_matchall: Fix error path

2017-01-03 Thread Yotam Gigi

Fix several error paths in matchall:
 - Release reference to actions in case the hardware fails offloading
   (relevant to skip_sw only)
 - Fix error path in case tcf_exts initialization/validation fail

Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
v1->v2:
 - Add check for tcf_exts_init return code and fix error path for it too
---
 net/sched/cls_matchall.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index f935429..fcecf5a 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -141,10 +141,12 @@ static int mall_set_parms(struct net *net, struct 
tcf_proto *tp,
struct tcf_exts e;
int err;
 
-   tcf_exts_init(, TCA_MATCHALL_ACT, 0);
+   err = tcf_exts_init(, TCA_MATCHALL_ACT, 0);
+   if (err)
+   return err;
err = tcf_exts_validate(net, tp, tb, est, , ovr);
if (err < 0)
-   return err;
+   goto errout;
 
if (tb[TCA_MATCHALL_CLASSID]) {
f->res.classid = nla_get_u32(tb[TCA_MATCHALL_CLASSID]);
@@ -154,6 +156,9 @@ static int mall_set_parms(struct net *net, struct tcf_proto 
*tp,
tcf_exts_change(tp, >exts, );
 
return 0;
+errout:
+   tcf_exts_destroy();
+   return err;
 }
 
 static int mall_change(struct net *net, struct sk_buff *in_skb,
@@ -193,7 +198,9 @@ static int mall_change(struct net *net, struct sk_buff 
*in_skb,
if (!f)
return -ENOBUFS;
 
-   tcf_exts_init(>exts, TCA_MATCHALL_ACT, 0);
+   err = tcf_exts_init(>exts, TCA_MATCHALL_ACT, 0);
+   if (err)
+   goto err_exts_init;
 
if (!handle)
handle = 1;
@@ -202,13 +209,13 @@ static int mall_change(struct net *net, struct sk_buff 
*in_skb,
 
err = mall_set_parms(net, tp, f, base, tb, tca[TCA_RATE], ovr);
if (err)
-   goto errout;
+   goto err_set_parms;
 
if (tc_should_offload(dev, tp, flags)) {
err = mall_replace_hw_filter(tp, f, (unsigned long) f);
if (err) {
if (tc_skip_sw(flags))
-   goto errout;
+   goto err_replace_hw_filter;
else
err = 0;
}
@@ -219,7 +226,10 @@ static int mall_change(struct net *net, struct sk_buff 
*in_skb,
 
return 0;
 
-errout:
+err_replace_hw_filter:
+err_set_parms:
+   tcf_exts_destroy(>exts);
+err_exts_init:
kfree(f);
return err;
 }
-- 
2.4.11

RE: [PATCH net-next] net/sched: cls_matchall: Fix error path

2017-01-03 Thread Yotam Gigi

>-Original Message-
>From: David Miller [mailto:da...@davemloft.net]
>Sent: Tuesday, January 03, 2017 6:08 PM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: j...@mojatatu.com; Elad Raz <el...@mellanox.com>; Jiri Pirko
><j...@mellanox.com>; netdev@vger.kernel.org
>Subject: Re: [PATCH net-next] net/sched: cls_matchall: Fix error path
>
>From: Yotam Gigi <yot...@mellanox.com>
>Date: Tue,  3 Jan 2017 17:47:02 +0200
>
>> Fix several error paths in matchall:
>>  - Release reference to actions in case the hardware fails offloading
>>(relevant to skip_sw only)
>>  - Fix error path in case tcf_exts initialization fails
>>
>> Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>
>Nothing is checking the tcf_exts_init() return value for errors either,
>and I think you should fix this alongside these release problems at the
>same time.

Ok. Will send v2 soon.

Thanks!

>
>Thanks.

[PATCH net-next] net/sched: cls_matchall: Fix error path

2017-01-03 Thread Yotam Gigi

Fix several error paths in matchall:
 - Release reference to actions in case the hardware fails offloading
   (relevant to skip_sw only)
 - Fix error path in case tcf_exts initialization fails

Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/cls_matchall.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index f935429..ce7d28b 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -144,7 +144,7 @@ static int mall_set_parms(struct net *net, struct tcf_proto 
*tp,
tcf_exts_init(, TCA_MATCHALL_ACT, 0);
err = tcf_exts_validate(net, tp, tb, est, , ovr);
if (err < 0)
-   return err;
+   goto errout;
 
if (tb[TCA_MATCHALL_CLASSID]) {
f->res.classid = nla_get_u32(tb[TCA_MATCHALL_CLASSID]);
@@ -154,6 +154,9 @@ static int mall_set_parms(struct net *net, struct tcf_proto 
*tp,
tcf_exts_change(tp, >exts, );
 
return 0;
+errout:
+   tcf_exts_destroy();
+   return err;
 }
 
 static int mall_change(struct net *net, struct sk_buff *in_skb,
@@ -220,6 +223,7 @@ static int mall_change(struct net *net, struct sk_buff 
*in_skb,
return 0;
 
 errout:
+   tcf_exts_destroy(>exts);
kfree(f);
return err;
 }
-- 
2.4.11

RE: [patch net-next v2 5/8] Introduce sample tc action

2016-11-16 Thread Yotam Gigi

>-Original Message-
>From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>Sent: Wednesday, November 16, 2016 6:15 PM
>To: Jiri Pirko <j...@resnulli.us>
>Cc: netdev@vger.kernel.org; da...@davemloft.net; Yotam Gigi
><yot...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Elad Raz
><el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; j...@mojatatu.com; geert+rene...@glider.be;
>step...@networkplumber.org; xiyou.wangc...@gmail.com; li...@roeck-us.net;
>john.fastab...@gmail.com; simon.hor...@netronome.com
>Subject: Re: [patch net-next v2 5/8] Introduce sample tc action
>
>On 11/14/16, 7:00 AM, Jiri Pirko wrote:
>> From: Yotam Gigi <yot...@mellanox.com>
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>> matchall action sample rate 12 mark 17
>
>Yotham, I am guessing in the future if one does not want to use mark,
>the sample api is extensible to allow for other actions to be added.
>This is from the general concern we had on using mark: some may not want to use
>mark.
>As long as the api is extensible to allow an alternate way in the future,
> we should be good. (We would prefer to not go down the path of having to
>introduce
>a new  'action sample' if this limits us in some way).

The code is extensible  - if one does want to add another action to sample, he 
totally can :)

By the way, one of the reasons we removed the patches for now is that we 
consider adding mirroring functionality to it instead of heaving two tc-rules.


>
>>
>> tc filter add parent : dev eth1 protocol all \
>> u32 match mark 17 0xff \
>> action mirred egress redirect dev dummy0
>>
>
>thanks.

RE: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread Yotam Gigi

>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: Friday, November 11, 2016 2:44 PM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: John Fastabend <john.fastab...@gmail.com>; Jiri Pirko <j...@resnulli.us>;
>netdev@vger.kernel.org; da...@davemloft.net; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com
>Subject: Re: [patch net-next 5/8] Introduce sample tc action
>
>On Fri, Nov 11, 2016 at 08:28:50AM +, Yotam Gigi wrote:
>
>...
>
>> John, as a result of your question I realized that our hardware does do
>> randomized sampling that I was not aware of. I will use the extensibility of
>> the API and implement a random keyword, that will be offloaded in our
>> hardware. Those changes will be sent on v2.
>>
>> Eventually, your question was very relevant :) Thanks!
>
>Perhaps I am missing the point but why not just make random the default and
>implement the inverse as an extension if it turns out to be needed in
>future?

It makes sense. It does seem to me that the average user does prefer random
sampling over deterministic one. 

We will consider that. Thanks for the comment!

RE: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread Yotam Gigi

>-Original Message-
>From: Yotam Gigi
>Sent: Thursday, November 10, 2016 9:59 PM
>To: 'John Fastabend' <john.fastab...@gmail.com>; Jiri Pirko <j...@resnulli.us>;
>netdev@vger.kernel.org
>Cc: da...@davemloft.net; Ido Schimmel <ido...@mellanox.com>; Elad Raz
><el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; j...@mojatatu.com; geert+rene...@glider.be;
>step...@networkplumber.org; xiyou.wangc...@gmail.com; li...@roeck-us.net;
>ro...@cumulusnetworks.com
>Subject: RE: [patch net-next 5/8] Introduce sample tc action
>
>
>
>>-Original Message-
>>From: John Fastabend [mailto:john.fastab...@gmail.com]
>>Sent: Thursday, November 10, 2016 9:38 PM
>>To: Jiri Pirko <j...@resnulli.us>; netdev@vger.kernel.org
>>Cc: da...@davemloft.net; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
>><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
>><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com
>>Subject: Re: [patch net-next 5/8] Introduce sample tc action
>>
>>On 16-11-10 11:35 AM, John Fastabend wrote:
>>> On 16-11-10 03:23 AM, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yot...@mellanox.com>
>>>>
>>>> This action allow the user to sample traffic matched by tc classifier.
>>>> The sampling consists of choosing packets randomly, truncating them,
>>>> adding some informative metadata regarding the interface and the original
>>>> packet size and mark them with specific mark, to allow further tc rules to
>>>> match and process. The marked sample packets are then injected into the
>>>> device ingress qdisc using netif_receive_skb.
>>>>
>>>> The packets metadata is packed using the ife encapsulation protocol, and
>>>> the outer packet's ethernet dest, source and eth_type, along with the
>>>> rate, mark and the optional truncation size can be configured from
>>>> userspace.
>>>>
>>>> Example:
>>>> To sample ingress traffic from interface eth1, and redirect the sampled
>>>> the sampled packets to interface dummy0, one may use the commands:
>>>>
>>>> tc qdisc add dev eth1 handle : ingress
>>>>
>>>> tc filter add dev eth1 parent : \
>>>>   matchall action sample rate 12 mark 17
>>>>
>>>> tc filter add parent : dev eth1 protocol all \
>>>>   u32 match mark 17 0xff \
>>>>   action mirred egress redirect dev dummy0
>>>>
>>>> Where the first command adds an ingress qdisc and the second starts
>>>> sampling every 12'th packet on dev eth1 and marks the sampled packets with
>>>> 17. The third command catches the sampled packets, which are marked with
>>>> 17, and redirects them to dev dummy0.
>>>
>>> The sampling algorithm was not randomized based on the above commit
>>> log? It really needs to be for all the reasons Roopa mentioned earlier.
>>> Did I miss some email on why it didn't get implemented?
>>>
>>> Also there was an indication the already is actually implemented
>>> correctly so don't we need the hw/sw to behave the same. The whole
>>> argument about sw/hw parity, etc.
>>
>>sorry bit of a typo there corrected 2nd paragraph here...
>>
>>Also there was an indication the hardware is already implemented \
>>correctly so don't we need the hw/sw to behave the same. The argument
>>about sw/hw parity, etc.
>
>Our hardware currently does not support sampling with random behavior, so
>we did implement it in software too.
>
>But, the API is extensible and it is possible to add a random keyword to
>the tc action to allow random sampling. In that case, the keyword will be
>implemented in sw only and our driver will fail offloading it.
>

John, as a result of your question I realized that our hardware does do 
randomized sampling that I was not aware of. I will use the extensibility of
the API and implement a random keyword, that will be offloaded in our 
hardware. Those changes will be sent on v2.

Eventually, your question was very relevant :) Thanks!

>>
>>>
>>>>
>>>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>>> ---
>>>

RE: [patch net-next 5/8] Introduce sample tc action

2016-11-10 Thread Yotam Gigi



>-Original Message-
>From: John Fastabend [mailto:john.fastab...@gmail.com]
>Sent: Thursday, November 10, 2016 9:38 PM
>To: Jiri Pirko <j...@resnulli.us>; netdev@vger.kernel.org
>Cc: da...@davemloft.net; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com
>Subject: Re: [patch net-next 5/8] Introduce sample tc action
>
>On 16-11-10 11:35 AM, John Fastabend wrote:
>> On 16-11-10 03:23 AM, Jiri Pirko wrote:
>>> From: Yotam Gigi <yot...@mellanox.com>
>>>
>>> This action allow the user to sample traffic matched by tc classifier.
>>> The sampling consists of choosing packets randomly, truncating them,
>>> adding some informative metadata regarding the interface and the original
>>> packet size and mark them with specific mark, to allow further tc rules to
>>> match and process. The marked sample packets are then injected into the
>>> device ingress qdisc using netif_receive_skb.
>>>
>>> The packets metadata is packed using the ife encapsulation protocol, and
>>> the outer packet's ethernet dest, source and eth_type, along with the
>>> rate, mark and the optional truncation size can be configured from
>>> userspace.
>>>
>>> Example:
>>> To sample ingress traffic from interface eth1, and redirect the sampled
>>> the sampled packets to interface dummy0, one may use the commands:
>>>
>>> tc qdisc add dev eth1 handle : ingress
>>>
>>> tc filter add dev eth1 parent : \
>>>matchall action sample rate 12 mark 17
>>>
>>> tc filter add parent : dev eth1 protocol all \
>>>u32 match mark 17 0xff \
>>>action mirred egress redirect dev dummy0
>>>
>>> Where the first command adds an ingress qdisc and the second starts
>>> sampling every 12'th packet on dev eth1 and marks the sampled packets with
>>> 17. The third command catches the sampled packets, which are marked with
>>> 17, and redirects them to dev dummy0.
>>
>> The sampling algorithm was not randomized based on the above commit
>> log? It really needs to be for all the reasons Roopa mentioned earlier.
>> Did I miss some email on why it didn't get implemented?
>>
>> Also there was an indication the already is actually implemented
>> correctly so don't we need the hw/sw to behave the same. The whole
>> argument about sw/hw parity, etc.
>
>sorry bit of a typo there corrected 2nd paragraph here...
>
>Also there was an indication the hardware is already implemented \
>correctly so don't we need the hw/sw to behave the same. The argument
>about sw/hw parity, etc.

Our hardware currently does not support sampling with random behavior, so 
we did implement it in software too. 

But, the API is extensible and it is possible to add a random keyword to 
the tc action to allow random sampling. In that case, the keyword will be
implemented in sw only and our driver will fail offloading it.

>
>>
>>>
>>> Signed-off-by: Yotam Gigi <yot...@mellanox.com>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>

RE: [patch net-next 1/8] Introduce ife encapsulation module

2016-11-10 Thread Yotam Gigi

>-Original Message-
>From: David Miller [mailto:da...@davemloft.net]
>Sent: Thursday, November 10, 2016 9:18 PM
>To: j...@resnulli.us
>Cc: netdev@vger.kernel.org; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com
>Subject: Re: [patch net-next 1/8] Introduce ife encapsulation module
>
>From: Jiri Pirko <j...@resnulli.us>
>Date: Thu, 10 Nov 2016 12:23:01 +0100
>
>> +void *ife_encode(struct sk_buff *skb, u16 metalen)
>
>metalen is u16
>
>> +metalen = htons(metalen);
>
>htons() returns be16.
>
>> +int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void
>*dval)
>> +{
>> +u32 *tlv = (u32 *) (skbdata);
> ...
>> +*tlv = htonl(htlv);
>
>Similar situation here.

I will fix those and we will send a v2.

[PATCH net-next] tc_act: Remove tcf_act macro

2016-11-08 Thread Yotam Gigi

tc_act macro addressed a non existing field, and was not used in the
kernel source.

Signed-off-by: Yotam Gigi <yot...@mellanox.com>
Reviewed-by: Jiri Pirko <j...@mellanox.com>
---
 include/net/act_api.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 82f3c91..d8eae87 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -42,7 +42,6 @@ struct tc_action {
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
 };
-#define tcf_actcommon.tcfa_act
 #define tcf_head   common.tcfa_head
 #define tcf_index  common.tcfa_index
 #define tcf_refcnt common.tcfa_refcnt
-- 
2.4.11

RE: nfs NULL-dereferencing in net-next

2016-10-27 Thread Yotam Gigi



>-Original Message-
>From: Anna Schumaker [mailto:anna.schuma...@netapp.com]
>Sent: Wednesday, October 26, 2016 9:17 PM
>To: Jakub Kicinski <kubak...@wp.pl>
>Cc: Yotam Gigi <yot...@mellanox.com>; Andy Adamson <and...@netapp.com>;
>linux-...@vger.kernel.org; netdev@vger.kernel.org; Trond Myklebust
><trond.mykleb...@netapp.com>; Yotam Gigi <yotam...@gmail.com>; mlxsw
><ml...@mellanox.com>
>Subject: Re: nfs NULL-dereferencing in net-next
>
>On 10/26/2016 02:08 PM, Jakub Kicinski wrote:
>> On Wed, 26 Oct 2016 16:15:24 +, Yotam Gigi wrote:
>>>> -Original Message-
>>>> From: Anna Schumaker [mailto:anna.schuma...@netapp.com]
>>>> Sent: Wednesday, October 26, 2016 5:40 PM
>>>> To: Yotam Gigi <yot...@mellanox.com>; Jakub Kicinski <kubak...@wp.pl>;
>Andy
>>>> Adamson <and...@netapp.com>; Anna Schumaker
>>>> <anna.schuma...@netapp.com>; linux-...@vger.kernel.org
>>>> Cc: netdev@vger.kernel.org; Trond Myklebust
><trond.mykleb...@netapp.com>;
>>>> Yotam Gigi <yotam...@gmail.com>; mlxsw <ml...@mellanox.com>
>>>> Subject: Re: nfs NULL-dereferencing in net-next
>>>>
>>>> On 10/25/2016 01:19 PM, Yotam Gigi wrote:
>>>>>
>>>>>> -Original Message-
>>>>>> From: netdev-ow...@vger.kernel.org [mailto:netdev-
>ow...@vger.kernel.org]
>>>> On
>>>>>> Behalf Of Jakub Kicinski
>>>>>> Sent: Monday, October 17, 2016 10:20 PM
>>>>>> To: Andy Adamson <and...@netapp.com>; Anna Schumaker
>>>>>> <anna.schuma...@netapp.com>; linux-...@vger.kernel.org
>>>>>> Cc: netdev@vger.kernel.org; Trond Myklebust
>>>> <trond.mykleb...@netapp.com>
>>>>>> Subject: nfs NULL-dereferencing in net-next
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I'm hitting this reliably on net-next, HEAD at 3f3177bb680f
>>>>>> ("fsl/fman: fix error return code in mac_probe()").
>>>>>
>>>>>
>>>>> I see the same thing. It happens constantly on some of my machines, making
>>>> them
>>>>> completely unusable.
>>>>>
>>>>> I bisected it and got to the commit:
>>>>>
>>>>> commit 04ea1b3e6d8ed4978bb608c1748530af3de8c274
>>>>> Author: Andy Adamson <and...@netapp.com>
>>>>> Date:   Fri Sep 9 09:22:27 2016 -0400
>>>>>
>>>>> NFS add xprt switch addrs test to match client
>>>>>
>>>>> Signed-off-by: Andy Adamson <and...@netapp.com>
>>>>> Signed-off-by: Anna Schumaker <anna.schuma...@netapp.com>
>>>>
>>>> Thanks for reporting on this everyone!  Does this patch help?
>>>
>>> Actually, I still see the same bug with the same trace.
>
>Well, it was worth a shot.  I'll keep poking at it.
>
>>
>> I rebuild the latest net-next and I'm not seeing the trace any more...
>> I'm only seeing this (with or without your patch):
>>
>> [   23.465877] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>> [   23.473784] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>> [   23.588890] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>> [   23.596746] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>> [   23.781574] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>> [   23.789599] NFS: set_pnfs_layoutdriver: cl_exchange_flags 0x0
>
>Interesting, I get that too when I try to use NFS v4.1.  It's weird that the 
>crash would
>stop happening like that, so maybe something is racy in this area.
>
>Thanks for testing, Yotam and Jakub!
>Anna

I just found out that it happens on any of my machines, once I put two nfs 
entries in
my fstab. If I put only one, I don't see the problem. 

I hope it might be helpful :)

>
>>
>> HTH
>>

RE: nfs NULL-dereferencing in net-next

2016-10-25 Thread Yotam Gigi


>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
>Behalf Of Jakub Kicinski
>Sent: Monday, October 17, 2016 10:20 PM
>To: Andy Adamson ; Anna Schumaker
>; linux-...@vger.kernel.org
>Cc: netdev@vger.kernel.org; Trond Myklebust 
>Subject: nfs NULL-dereferencing in net-next
>
>Hi!
>
>I'm hitting this reliably on net-next, HEAD at 3f3177bb680f
>("fsl/fman: fix error return code in mac_probe()").


I see the same thing. It happens constantly on some of my machines, making them
completely unusable.

I bisected it and got to the commit:

commit 04ea1b3e6d8ed4978bb608c1748530af3de8c274
Author: Andy Adamson 
Date:   Fri Sep 9 09:22:27 2016 -0400

NFS add xprt switch addrs test to match client

Signed-off-by: Andy Adamson 
Signed-off-by: Anna Schumaker 


>
>[   23.409633] BUG: unable to handle kernel NULL pointer dereference at
>0172
>[   23.418716] IP: [] rpc_clnt_xprt_switch_has_addr+0xc/0x40
>[sunrpc]
>[   23.427574] PGD 859020067 [   23.430472] PUD 858f2d067
>PMD 0 [   23.434311]
>[   23.436133] Oops:  [#1] PREEMPT SMP
>[   23.440506] Modules linked in: nfsv4 ip6table_filter ip6_tables 
>iptable_filter
>ip_tables ebtable_nat ebtables x_tables intel_ri
>[   23.505915] CPU: 1 PID: 1067 Comm: mount.nfs Not tainted 4.8.0-perf-13951-
>g3f3177bb680f #51
>[   23.515363] Hardware name: Dell Inc. PowerEdge T630/0W9WXC, BIOS 1.2.10
>03/10/2015
>[   23.523937] task: 983e9086ea00 task.stack: ac6c0a57c000
>[   23.530641] RIP: 0010:[]  []
>rpc_clnt_xprt_switch_has_addr+0xc/0x40 [sunrpc]
>[   23.542229] RSP: 0018:ac6c0a57fb28  EFLAGS: 00010a97
>[   23.548255] RAX: c80214ac RBX: 983e97c7b000 RCX: 
>983e9b3bc180
>[   23.556320] RDX: 0001 RSI: 983e9928ed28 RDI: 
>ffea
>[   23.564386] RBP: ac6c0a57fb38 R08: 983e97090630 R09: 
>983e9928ed30
>[   23.572452] R10: ac6c0a57fba0 R11: 0010 R12: 
>ac6c0a57fba0
>[   23.580517] R13: 983e9928ed28 R14:  R15: 
>983e91360560
>[   23.588585] FS:  7f4c348aa880() GS:983e9f24()
>knlGS:
>[   23.597742] CS:  0010 DS:  ES:  CR0: 80050033
>[   23.604251] CR2: 0172 CR3: 000850a5f000 CR4:
>001406e0
>[   23.612316] Stack:
>[   23.614648]  983e97c7b000 ac6c0a57fba0 ac6c0a57fb90 
>c04d38c3
>[   23.623331]  983e91360500 983e9928ed30 c0b9e560
>983e913605b8
>[   23.632016]  983e9882e800 983e9882e800 ac6c0a57fc30 
>ac6c0a57fdb8
>[   23.640706] Call Trace:
>[   23.643535]  [] nfs_get_client+0x123/0x340 [nfs]
>[   23.650542]  [] nfs4_set_client+0x80/0xb0 [nfsv4]
>[   23.657642]  [] nfs4_create_server+0x115/0x2a0 [nfsv4]
>[   23.665230]  [] nfs4_remote_mount+0x2e/0x60 [nfsv4]
>[   23.672519]  [] mount_fs+0x3a/0x160
>[   23.678254]  [] ? alloc_vfsmnt+0x19e/0x230
>[   23.684669]  [] vfs_kern_mount+0x67/0x110
>[   23.690990]  [] nfs_do_root_mount+0x84/0xc0 [nfsv4]
>[   23.698284]  [] nfs4_try_mount+0x37/0x50 [nfsv4]
>[   23.705287]  [] nfs_fs_mount+0x2d1/0xa70 [nfs]
>[   23.712092]  [] ? find_next_bit+0x18/0x20
>[   23.718413]  [] ? nfs_remount+0x3c0/0x3c0 [nfs]
>[   23.725316]  [] ? nfs_clone_super+0x130/0x130 [nfs]
>[   23.732606]  [] mount_fs+0x3a/0x160
>[   23.738340]  [] ? alloc_vfsmnt+0x19e/0x230
>[   23.744755]  [] vfs_kern_mount+0x67/0x110
>[   23.751071]  [] do_mount+0x1bf/0xc70
>[   23.756904]  [] ? copy_mount_options+0xbb/0x220
>[   23.763803]  [] SyS_mount+0x83/0xd0
>[   23.769538]  [] entry_SYSCALL_64_fastpath+0x17/0x98
>[   23.776817] Code: 01 00 48 8b 93 f8 04 00 00 44 89 e6 48 c7 c7 98 b2 43 c0 
>e8 9f 0d d4
>f9 eb c0 0f 1f 44 00 00 0f 1f 44 00 00
>[   23.802909] RIP  [] rpc_clnt_xprt_switch_has_addr+0xc/0x40
>[sunrpc]
>[   23.811857]  RSP 
>[   23.815839] CR2: 0172
>[   23.819629] ---[ end trace 9958eca92c9eeafe ]---
>[   23.827345] note: mount.nfs[1067] exited with preempt_count 1

RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-19 Thread Yotam Gigi



>-Original Message-
>From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>Sent: Wednesday, October 19, 2016 10:33 AM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: Jamal Hadi Salim <j...@mojatatu.com>; Jiri Pirko <j...@resnulli.us>;
>netdev@vger.kernel.org; da...@davemloft.net; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; Shrijeet Mukherjee
><s...@cumulusnetworks.com>; Yotam Gigi <yotam...@gmail.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/18/16, 3:58 AM, Yotam Gigi wrote:
>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>[snip]
>
>>> The OVS implementation is a good example, the metadata includes all the
>>> actions applied
>>>>> to the packet in the kernel data path.
>>>>>
>>>> Again not sure what the use case would be (and why waste such space
>>>> especially when you are sending over the wire with such details).
>>> All this is being used currently.., But, this can be other api's sflow uses
>>> for monitoring.
>>> http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>>>
>>> Does not have to be part of the main/basic sampling api...
>>> it was just an example.
>>>
>> I guess that making the API extensible solves this, isn't it?
>
>yes, that might help...
>
>Just wanted to bring up the question/clarification on using mark again
>
>tc qdisc add dev eth1 handle : ingress
>
>tc filter add dev eth1 parent : \
>   matchall action sample rate 12 mark 17
>
>tc filter add parent : dev eth1 protocol all \
>   u32 match mark 172 0xff
>   action mirred egress redirect dev dummy0
>
>Like we discussed @ netdev, mark can be used by other things in the system.
>A request to sample on an interface cannot be disruptive.
>Does this require mark to be not used elsewhere in the system when sampling is
>enabled on an interface ?

I think the we can spare the usage of mark, or at least make it optional, as 
the user
can match on the packets according to the eth_type (as part of the IFE, the 
user 
can set the sampled packet eth_type).

I will do that, and update the documentation as well.

RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-18 Thread Yotam Gigi



>-Original Message-
>From: Or Gerlitz [mailto:gerlitz...@gmail.com]
>Sent: Sunday, October 16, 2016 1:27 PM
>To: Jiri Pirko <j...@resnulli.us>
>Cc: Linux Netdev List <netdev@vger.kernel.org>; David Miller
><da...@davemloft.net>; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>; Jamal Hadi Salim
><j...@mojatatu.com>; geert+rene...@glider.be; Stephen Hemminger
><step...@networkplumber.org>; Cong Wang <xiyou.wangc...@gmail.com>;
>Guenter Roeck <li...@roeck-us.net>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On Wed, Oct 12, 2016 at 3:41 PM, Jiri Pirko <j...@resnulli.us> wrote:
>> From: Yotam Gigi <yotam...@gmail.com>
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>>matchall action sample rate 12 mark 17
>>
>> tc filter add parent : dev eth1 protocol all \
>>u32 match mark 172 0xff
>>action mirred egress redirect dev dummy0
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling every 12'th packet on dev eth0 and marks the sampled packets with
>> 17. The command third catches the sampled packets, which are marked with
>> 17, and redirects them to dev dummy0.
>
>eth0 --> eth1
>
>command third --> third command
>

Missed that. Thanks :)

>don't we need a re-classify directive for the u32 filter to apply
>after the marking done by the matchall rule + sample action
>or is that implicit?

No, as the packets are re-injected to the ingress qdisc (as described in the 
commit message). Reclassify won't work as the sampled packets, which are a copy
of the chosen packets are generated inside the sample action and are not part of
the device packet stream.

>
>
>> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
>> new file mode 100644
>> index 000..a2b445a
>> --- /dev/null
>> +++ b/include/net/tc_act/tc_sample.h
>> @@ -0,0 +1,88 @@
>> +#ifndef __NET_TC_SAMPLE_H
>> +#define __NET_TC_SAMPLE_H
>> +
>> +#include 
>> +#include 
>> +
>> +struct tcf_sample {
>> +   struct tc_actioncommon;
>> +   u32 rate;
>> +   u32 mark;
>> +   booltruncate;
>> +   u32 trunc_size;
>> +   u32 packet_counter;
>> +   u8  eth_dst[ETH_ALEN];
>> +   u8  eth_src[ETH_ALEN];
>> +   u16 eth_type;
>> +   booleth_type_set;
>> +   struct list_headtcfm_list;
>> +};
>
>> +++ b/include/uapi/linux/tc_act/tc_sample.h
>> @@ -0,0 +1,31 @@
>> +#ifndef __LINUX_TC_SAMPLE_H
>> +#define __LINUX_TC_SAMPLE_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define TCA_ACT_SAMPLE 26
>> +
>> +struct tc_sample {
>> +   tc_gen;
>> +   __u32   rate;   /* sample rate */
>> +   __u32   mark;   /* mark to put on the sampled 
>> packets */
>> +   booltruncate;   /* whether to truncate the packets */
>> +   __u32   trunc_size; /* truncation size */
>> +   __u8eth_dst[ETH_ALEN]; /* encapsulated mac destination */
>> +   __u8eth_src[ETH_ALEN]; /* encapsulated mac source */
>> +   booleth_type_set;  /* whether to overrid ethtype */
>> +   __u16   eth_type;  /*

RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-18 Thread Yotam Gigi

>-Original Message-
>From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>Sent: Tuesday, October 18, 2016 3:17 AM
>To: Jamal Hadi Salim <j...@mojatatu.com>
>Cc: Jiri Pirko <j...@resnulli.us>; netdev@vger.kernel.org; da...@davemloft.net;
>Yotam Gigi <yot...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Elad
>Raz <el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; geert+rene...@glider.be;
>step...@networkplumber.org; xiyou.wangc...@gmail.com; li...@roeck-us.net;
>Shrijeet Mukherjee <s...@cumulusnetworks.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>>
>> Some comments:
>> IIUC, the main struggle seems to be whether the redirect to dummy0
>> is useful or not? i.e instead of just letting the packets go up the
>> stack on eth1?
>
>yep, correct...given existing workflow for the non-offloaded case is
>to receive sample packets via bpf filter on socket or
>use netlink as a sample delivery mechanism (NFLOG eg)
>
>
>> It seems like sflowd needs to read off eth1 via packet socket?
>> To be backward compatible - supporting that approach seems sensible.

I am not sure whether using socket is backward compatible. hsflowd 
expects nflog packets, and ife packets will not be understood.

We planned on adding support on hsflowd in IFE encapsulated packets from
tuntap device. 

Did I understand you correctly?

>>
>> Note:
>> There is a clear efficiency benefit of both using IFE encoding and
>> redirecting to dummy0.
>> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
>> filter around every packet that comes off eth1.
>> I understand there are probably not millions of pps for this case;
>> but in a non-offloaded cases it could be millions pps.
>> And in case of sampling over many ethx devices, you can redirect
>> samples from many other ethx devices.
>> So making dummy0 the sflow device is a win.
>> 2) Encaping an IFE header implies a much more efficient bpf filter
>> (IFE ethertype is an excellent discriminator for bpf).
>>
>> Additional benefit is as mentioned before - redirecting to a device
>> means you can send it remotely over ethernet to a more powerful
>> machine without having to cross kernel-userspace. Redirecting instead
>> of mirroring to tuntap is also an interesting option.
>
>sure, this seems like a good option to have.
>generally you have one instance of the sampling agent on a hyper visor or 
>switch.
>But, if you have use-cases where monitoring agents run external, sure.
>would have preferred if it was optional or an addon and not the default.
>
>Regarding the device, yeah, agree there are pros and cons.
>An additional device just to sample packets seems like an overkill.
>But, if there is no other other option, and there are benefits to it, no 
>objections.
>Hopefully we can add another option on the existing api to skip the device in 
>the
>future.
>
>
>>
>>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yotam...@gmail.com>
>>
>>
>>>> +
>>>> +struct sample_packet_metadata {
>>>> +int sample_size;
>>>> +int orig_size;
>>>> +int ifindex;
>>>> +};
>>>> +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>>
>> Sure it can...

I will update the userspace API to be more generice: I will drop this struct and
let the user (iproute2 currently)  build the netlink packet himself (as Or 
Gerlitz suggested).

>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>>> and
>egress ports.
>>
>> What is the use case for both?
>
>I have heard that most monitoring tools have moved to ingress only sampling
>because of operational
>complexity (use case is sflow). I think hardware also supports ingress and 
>egress
>only sampling.
>better to have an option to reflect that in the api.

Agree. I will add both ingress and egress ports in the IFE. Both the hardware 
Implementation and kernel implementation don't support setting both, but 
It is good to have that option.

>
>>> Ideally you would also include a sequence number and a count of the total
>number of packets
>> > that were candidates for sampling.
>>
>> Sequence number may make sense (they will help show a gap if something
>> gets dropped). But i am not sure about the stats consuming such space.
>> Stats are something that can b

[PATCH net v3 2/2] act_ife: Fix false encoding

2016-09-26 Thread Yotam Gigi

On ife encode side, the action stores the different tlvs inside the ife
header, where each tlv length field should refer to the length of the
whole tlv (without additional padding) and not just the data length.

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the encoding encoded only the data length inside the tlv, which led
to false parsing of ife the header. In addition, due to the fact that the
loop counter was unsigned, it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the encoding to
take into account the tlv type and size.

Fixes: 28a10c426e81 ("net sched: fix encoding to use real length")
Acked-by: Jamal Hadi Salim <j...@mojatatu.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---

v2->v3
 - Fix the encode side instead of the decode side

---
 net/sched/act_ife.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index b949d97..95c463c 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -53,7 +53,7 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen, const void *dval)
u32 *tlv = (u32 *)(skbdata);
u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
char *dptr = (char *)tlv + NLA_HDRLEN;
-   u32 htlv = attrtype << 16 | dlen;
+   u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
 
*tlv = htonl(htlv);
memset(dptr, 0, totlen - NLA_HDRLEN);
@@ -653,7 +653,7 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct 
tc_action *a,
struct tcf_ife_info *ife = to_ife(a);
int action = ife->tcf_action;
struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-   u16 ifehdrln = ifehdr->metalen;
+   int ifehdrln = (int)ifehdr->metalen;
struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
 
spin_lock(>tcf_lock);
-- 
2.4.11

[PATCH net v3 1/2] act_ife: Fix external mac header on encode

2016-09-26 Thread Yotam Gigi

On ife encode side, external mac header is copied from the original packet
and may be overridden if the user requests. Before, the mac header copy
was done from memory region that might not be accessible anymore, as
skb_cow_head might free it and copy the packet. This led to random values
in the external mac header once the values were not set by user.

This fix takes the internal mac header from the packet, after the call to
skb_cow_head.

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Acked-by: Jamal Hadi Salim <j...@mojatatu.com>
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/act_ife.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index ccf7b4b..b949d97 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -766,8 +766,6 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct 
tc_action *a,
return TC_ACT_SHOT;
}
 
-   iethh = eth_hdr(skb);
-
err = skb_cow_head(skb, hdrm);
if (unlikely(err)) {
ife->tcf_qstats.drops++;
@@ -778,6 +776,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct 
tc_action *a,
if (!(at & AT_EGRESS))
skb_push(skb, skb->dev->hard_header_len);
 
+   iethh = (struct ethhdr *)skb->data;
__skb_push(skb, hdrm);
memcpy(skb->data, iethh, skb->mac_len);
skb_reset_mac_header(skb);
-- 
2.4.11

[PATCH net v3 0/2] Fix tc-ife bugs

2016-09-26 Thread Yotam Gigi

This patch-set contains two bugfixes in the tc-ife action, one fixing some
random behaviour in encode side, and one fixing the decode side packet
parsing logic.

v2->v3
 - Fix the encode side instead of the decode side

Yotam Gigi (2):
  act_ife: Fix external mac header on encode
  act_ife: Fix false encoding

 net/sched/act_ife.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

-- 
2.4.11

RE: [PATCH net v2 0/2] Fix tc-ife bugs

2016-09-26 Thread Yotam Gigi



>-Original Message-
>From: Jamal Hadi Salim [mailto:j...@mojatatu.com]
>Sent: Monday, September 26, 2016 3:47 AM
>To: Yotam Gigi <yot...@mellanox.com>; Yotam Gigi <yotam...@gmail.com>;
>da...@davemloft.net; netdev@vger.kernel.org
>Cc: Jiri Pirko <j...@mellanox.com>; Elad Raz <el...@mellanox.com>; mlxsw
><ml...@mellanox.com>; Roman Mashak <m...@mojatatu.com>
>Subject: Re: [PATCH net v2 0/2] Fix tc-ife bugs
>
>On 16-09-25 07:17 PM, Jamal Hadi Salim wrote:
>[..]
>>> Do you prefer that I will fix the encode side to encode the whole tlv
>>> header
>>> size instead of fixing the decode side?
>>
>> Yes please - Add NLA_HDRLEN to the dlen on the encode you showed above.

OK. Did it and tested it. Everything seems functional.

>>
>
>And the correct commit it fixes is:
>a823f93750e341bc0d6829a00019435b96830fd4

I can't find this commit. Where is it? 

Don't you mean that commit: 28a10c426e81afc88514bca8e73affccf850fdf6
with title: "net sched: fix encoding to use real length"?

>
>I removed the padding but also the headerlen when i did
>that. Tested with tc_index which is a u16 was the mistake i made.
>
>cheers,
>jamal

RE: [PATCH net v2 0/2] Fix tc-ife bugs

2016-09-25 Thread Yotam Gigi

>-Original Message-
>From: Jamal Hadi Salim [mailto:j...@mojatatu.com]
>Sent: Sunday, September 25, 2016 4:46 PM
>To: Yotam Gigi <yotam...@gmail.com>; da...@davemloft.net;
>netdev@vger.kernel.org; Yotam Gigi <yot...@mellanox.com>
>Subject: Re: [PATCH net v2 0/2] Fix tc-ife bugs
>
>On 16-09-25 08:31 AM, Yotam Gigi wrote:
>> This patch-set contains two bugfixes in the tc-ife action, one fixing some
>> random behaviour in encode side, and one fixing the decode side packet
>> parsing logic.
>>
>> Yotam Gigi (2):
>>   act_ife: Fix external mac header on encode
>>   act_ife: Fix false parsing on decode side
>>
>>  net/sched/act_ife.c | 9 -
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>>
>
>Yotam, did you run the test i proposed? I am worried about the TLV one.
>Note:
>Encoder includes the length of the TLV header length in the L part.
>If you are reaching a conclusion that you need this in the decoding:
>+  tlvdata += alen + sizeof(struct meta_tlvhdr);
>
>then very likely whoever is sending that packet is not encoding
>correctly.

The one encoding the packets I get is the ife action. You can look at the code:

  int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void 
*dval)
  {
u32 *tlv = (u32 *)(skbdata);
u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
char *dptr = (char *)tlv + NLA_HDRLEN;
u32 htlv = attrtype << 16 | dlen;

*tlv = htonl(htlv);
memset(dptr, 0, totlen - NLA_HDRLEN);
memcpy(dptr, dval, dlen);

return totlen;
  }

As you can see, the data that is actually written into the packet is the htlv
var, which has the 'dlen' in it, and not the totlen which corresponds the tlv
header size. You can see that in the following example:

I ran the tc command you proposed:
  $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit prio 33 \
action ife encode \
type 0xDEAD \
use mark 12 \
allow prio \
dst 02:15:15:15:15:15

And this is the packet I got:
  0x:  0215 1515 1515 da23 d209 8644 dead 0012
  0x0010:  0001 0004  000c 0003 0004  0033
  0x0020:  fa30 7418 593a da23 d209 8644 0800 4500
  0x0030:  0054 da3c 4000 4001 486a 0c00 0001 0c00
  0x0040:  0002 0800 9fa2 1562 0008 aeec e757 
  0x0050:   ecdb 0100   1011 1213 1415
  0x0060:  1617 1819 1a1b 1c1d 1e1f 2021 2223 2425
  0x0070:  2627 2829 2a2b 2c2d 2e2f 3031 3233 3435
  0x0080:  3637

You can see that there are two tlvs there, one for mark (with value 0xc=12) and
one for prio (with value 0x33). In the packets, you can see the on offsets 0x12
and 0x1a, that the size in the tlv is 4 which is the datalen and not 8 which is
the total tlv size.

When I ran the decode without the fix, my kernel went into infinite loop which
was caused by:
 - The loop stopping condition was checking that an unsigned variable is greater
   than zero.
 - The parsing assumes that in the tlv header, the size refers to the whole tlv
   size, but it refers to the size of the data only.

To fix those problems, I fixed the decode side to assume that the tlv->size
refers to the data len and not the whole tlv, and changed the variable to be
signed.

Do you prefer that I will fix the encode side to encode the whole tlv header
size instead of fixing the decode side?

Thanks :)

>
>cheers,
>jamal

[PATCH net v2 1/2] act_ife: Fix external mac header on encode

2016-09-25 Thread Yotam Gigi

On ife encode side, external mac header is copied from the original packet
and may be overridden if the user requests. Before, the mac header copy
was done from memory region that might not be accessible anymore, as
skb_cow_head might free it and copy the packet. This led to random values
in the external mac header once the values were not set by user.

This fix takes the internal mac header from the packet, after the call to
skb_cow_head.

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/act_ife.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index ccf7b4b..b949d97 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -766,8 +766,6 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct 
tc_action *a,
return TC_ACT_SHOT;
}
 
-   iethh = eth_hdr(skb);
-
err = skb_cow_head(skb, hdrm);
if (unlikely(err)) {
ife->tcf_qstats.drops++;
@@ -778,6 +776,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct 
tc_action *a,
if (!(at & AT_EGRESS))
skb_push(skb, skb->dev->hard_header_len);
 
+   iethh = (struct ethhdr *)skb->data;
__skb_push(skb, hdrm);
memcpy(skb->data, iethh, skb->mac_len);
skb_reset_mac_header(skb);
-- 
2.4.11

[PATCH net v2 2/2] act_ife: Fix false parsing on decode side

2016-09-25 Thread Yotam Gigi

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the pointer was advanced in a wrong way, as the tlv type and size
bytes were not taken into account. This led to false parsing of ife
header. In addition, due to the fact that the loop counter was unsigned,
it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the parsing to
take into account the tlv type and size.

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Signed-off-by: Yotam Gigi <yot...@mellanox.com>
---
 net/sched/act_ife.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index b949d97..1a055ea 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -653,7 +653,7 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct 
tc_action *a,
struct tcf_ife_info *ife = to_ife(a);
int action = ife->tcf_action;
struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-   u16 ifehdrln = ifehdr->metalen;
+   int ifehdrln = (int)ifehdr->metalen;
struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
 
spin_lock(>tcf_lock);
@@ -694,8 +694,8 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct 
tc_action *a,
ife->tcf_qstats.overlimits++;
}
 
-   tlvdata += alen;
-   ifehdrln -= alen;
+   tlvdata += alen + sizeof(struct meta_tlvhdr);
+   ifehdrln -= alen + sizeof(struct meta_tlvhdr);
tlv = (struct meta_tlvhdr *)tlvdata;
}
 
-- 
2.4.11

1 2 >

1 - 100 of 108 matches

Mail list logo