Re: [iproute2-next PATCH v6] tc: flower: Classify packets based port ranges

2018-12-03 Thread Nambiar, Amritha
On 12/3/2018 3:52 PM, David Ahern wrote:
> On 11/27/18 3:40 PM, Amritha Nambiar wrote:
>> Added support for filtering based on port ranges.
>> UAPI changes have been accepted into net-next.
>>
>> Example:
>> 1. Match on a port range:
>> -
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>   prio 1 flower ip_proto tcp dst_port 20-30 skip_hw\
>>   action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0
>> filter protocol ip pref 1 flower chain 0 handle 0x1
>>   eth_type ipv4
>>   ip_proto tcp
>>   dst_port 20-30
>>   skip_hw
>>   not_in_hw
>> action order 1: gact action drop
>>  random type none pass val 0
>>  index 1 ref 1 bind 1 installed 85 sec used 3 sec
>> Action statistics:
>> Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>>
>> 2. Match on IP address and port range:
>> --
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>   prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port 100-200\
>>   skip_hw action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0 handle 0x2
>>   eth_type ipv4
>>   ip_proto tcp
>>   dst_ip 192.168.1.1
>>   dst_port 100-200
>>   skip_hw
>>   not_in_hw
>> action order 1: gact action drop
>>  random type none pass val 0
>>  index 2 ref 1 bind 1 installed 58 sec used 2 sec
>> Action statistics:
>> Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>>
>> v6:
>> Changed the JSON output format to use an object for sport/dport.
>>
>>  "dst_port":{
>>"start":2000,
>>"end":6000
>>  },
>>  "src_port":{
>>"start":50,
>>"end":60
>>  }
>>
>> v5:
>> Simplified some code and used 'sscanf' for parsing. Removed
>> space in output format.
>>
>> v4:
>> Added man updates explaining filtering based on port ranges.
>> Removed 'range' keyword.
>>
>> v3:
>> Modified flower_port_range_attr_type calls.
>>
>> v2:
>> Addressed Jiri's comment to sync output format with input
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  man/man8/tc-flower.8 |   13 +---
>>  tc/f_flower.c|   85 +-
>>  2 files changed, 84 insertions(+), 14 deletions(-)
>>
> 
> sorry for the delay; fell through the cracks. It no longer applies to
> net-next. Please update. Thanks,
> 

A previous version v3 of this patch was already applied to iproute2-next.
https://patchwork.ozlabs.org/patch/998644/

I think that needs to be reverted for this v6 to apply clean.


Re: [iproute2-next PATCH v4] tc: flower: Classify packets based port ranges

2018-11-27 Thread Nambiar, Amritha
On 11/26/2018 8:02 PM, Stephen Hemminger wrote:
> On Mon, 26 Nov 2018 17:56:10 -0800
> "Nambiar, Amritha"  wrote:
> 
>> On 11/26/2018 4:43 PM, David Ahern wrote:
>>> On 11/26/18 5:23 PM, Nambiar, Amritha wrote:  
>>>>> Can tc flower use something similar to ip ru with single port or port
>>>>> range handled like this?
>>>>>
>>>>> },{
>>>>> "priority": 32764,
>>>>> "src": "172.16.1.0",
>>>>> "srclen": 24,
>>>>> "ipproto": "tcp",
>>>>> "sport": 1100,
>>>>> "table": "main"
>>>>> },{
>>>>> "priority": 32765,
>>>>> "src": "172.16.1.0",
>>>>> "srclen": 24,
>>>>> "ipproto": "tcp",
>>>>> "sport_start": 1000,
>>>>> "sport_end": 1010,
>>>>> "table": "main"
>>>>> },{
>>>>>
>>>>>  
>>>>
>>>> Does it have to be separate fields "sport_start" and "sport_end"?
>>>> Removing the space and the 'range' keyword will make the output format
>>>> consistent with the input format and print as "sport <port>" for a
>>>> single port and "sport <min>-<max>" for a range.
>>>> Example:
>>>>
>>>> ... flower ip_proto tcp src_port 12 skip_hw action will print as:
>>>>   ip_proto tcp
>>>>   src_port 12
>>>>   skip_hw
>>>>   not_in_hw
>>>> action
>>>>
>>>> ... flower ip_proto tcp src_port 100-200 skip_hw action :
>>>>   ip_proto tcp
>>>>   src_port 100-200
>>>>   skip_hw
>>>>   not_in_hw
>>>> action  
>>>
>>> non-json output needs to match what the user gives on the command line.
>>>
>>> My comment was about consistency with json output when possible. I am
>>> not a json expert by any means. Other commands have a single key value
>>> pair, so I suspect the json here needs to follow suit (ie., not
>>> "src_port": "1000-1010" but separate start and end entries).
>>>   
>> I'm not quite familiar with json. Maybe, Jiri can give feedback here.
> 
> JSON supports strings, numbers, and objects.
> The more common JSON way of expressing this would be either as an object
> for sport:
> 
> {
>"priority": 32765,
>"src": "172.16.1.0",
> "srclen": 24,
>"ipproto": "tcp",
>   "sport" : {
>   "start" : 1000,
>   "end" : 1010
>   },
>   "table": "main"
> }
> 
> or as an array:
> {
>"priority": 32765,
>"src": "172.16.1.0",
> "srclen": 24,
>"ipproto": "tcp",
>   "sport" : [ 1000, 1010 ],
>   "table": "main"
> }
> 
> My point is: don't build semantic meaning directly into the tag part of
> the syntax.
> 

Okay. Will fix this in v6. I'll keep the default (non-json) output
format similar to the input format, and for the json output, I'll use
the object for sport as in the first example above.


Re: [iproute2-next PATCH v4] tc: flower: Classify packets based port ranges

2018-11-26 Thread Nambiar, Amritha
On 11/26/2018 4:43 PM, David Ahern wrote:
> On 11/26/18 5:23 PM, Nambiar, Amritha wrote:
>>> Can tc flower use something similar to ip ru with single port or port
>>> range handled like this?
>>>
>>> },{
>>> "priority": 32764,
>>> "src": "172.16.1.0",
>>> "srclen": 24,
>>> "ipproto": "tcp",
>>> "sport": 1100,
>>> "table": "main"
>>> },{
>>> "priority": 32765,
>>> "src": "172.16.1.0",
>>> "srclen": 24,
>>> "ipproto": "tcp",
>>> "sport_start": 1000,
>>> "sport_end": 1010,
>>> "table": "main"
>>> },{
>>>
>>>
>>
>> Does it have to be separate fields "sport_start" and "sport_end"?
>> Removing the space and the 'range' keyword will make the output format
>> consistent with the input format and print as "sport <port>" for a
>> single port and "sport <min>-<max>" for a range.
>> Example:
>>
>> ... flower ip_proto tcp src_port 12 skip_hw action will print as:
>>   ip_proto tcp
>>   src_port 12
>>   skip_hw
>>   not_in_hw
>> action
>>
>> ... flower ip_proto tcp src_port 100-200 skip_hw action :
>>   ip_proto tcp
>>   src_port 100-200
>>   skip_hw
>>   not_in_hw
>> action
> 
> non-json output needs to match what the user gives on the command line.
> 
> My comment was about consistency with json output when possible. I am
> not a json expert by any means. Other commands have a single key value
> pair, so I suspect the json here needs to follow suit (ie., not
> "src_port": "1000-1010" but separate start and end entries).
> 
I'm not quite familiar with json. Maybe, Jiri can give feedback here.


Re: [iproute2-next PATCH v4] tc: flower: Classify packets based port ranges

2018-11-26 Thread Nambiar, Amritha
On 11/21/2018 1:42 PM, David Ahern wrote:
> On 11/20/18 11:17 PM, Amritha Nambiar wrote:
>> diff --git a/tc/f_flower.c b/tc/f_flower.c
>> index 65fca04..722647d 100644
>> --- a/tc/f_flower.c
>> +++ b/tc/f_flower.c
>> @@ -494,6 +494,68 @@ static int flower_parse_port(char *str, __u8 ip_proto,
>>  return 0;
>>  }
>>  
>> +static int flower_port_range_attr_type(__u8 ip_proto, enum flower_endpoint type,
>> +   __be16 *min_port_type,
>> +   __be16 *max_port_type)
>> +{
>> +if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP ||
>> +ip_proto == IPPROTO_SCTP) {
>> +if (type == FLOWER_ENDPOINT_SRC) {
>> +*min_port_type = TCA_FLOWER_KEY_PORT_SRC_MIN;
>> +*max_port_type = TCA_FLOWER_KEY_PORT_SRC_MAX;
>> +} else {
>> +*min_port_type = TCA_FLOWER_KEY_PORT_DST_MIN;
>> +*max_port_type = TCA_FLOWER_KEY_PORT_DST_MAX;
>> +}
>> +} else {
>> +return -1;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +static int flower_parse_port_range(__be16 *min, __be16 *max, __u8 ip_proto,
> 
> why not just min and max directly since they are not set here but only
> referenced by value. Also, you do not parse anything in this function so
> the helper is misnamed.
> 
> But I think this can be done simpler using what was done in ip/iprule.c ...
> 

Okay, will modify this.

> 
>> +   enum flower_endpoint endpoint,
>> +   struct nlmsghdr *n)
>> +{
>> +__be16 min_port_type, max_port_type;
>> +
>> +if (htons(*max) <= htons(*min)) {
>> +fprintf(stderr, "max value should be greater than min value\n");
>> +return -1;
>> +}
>> +
>> +if (flower_port_range_attr_type(ip_proto, endpoint, &min_port_type,
>> +&max_port_type))
>> +return -1;
>> +
>> +addattr16(n, MAX_MSG, min_port_type, *min);
>> +addattr16(n, MAX_MSG, max_port_type, *max);
>> +
>> +return 0;
>> +}
>> +
>> +static int get_range(__be16 *min, __be16 *max, char *argv)
>> +{
>> +char *r;
>> +
>> +r = strchr(argv, '-');
>> +if (r) {
>> +*r = '\0';
>> +if (get_be16(min, argv, 10)) {
>> +fprintf(stderr, "invalid min range\n");
>> +return -1;
>> +}
>> +if (get_be16(max, r + 1, 10)) {
>> +fprintf(stderr, "invalid max range\n");
>> +return -1;
>> +}
>> +} else {
>> +return -1;
>> +}
>> +return 0;
>> +}
>> +
>>  #define TCP_FLAGS_MAX_MASK 0xfff
>>  
>>  static int flower_parse_tcp_flags(char *str, int flags_type, int mask_type,
>> @@ -1061,20 +1123,47 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
>>  return -1;
>>  }
>>  } else if (matches(*argv, "dst_port") == 0) {
>> +__be16 min, max;
>> +
>>  NEXT_ARG();
>> -ret = flower_parse_port(*argv, ip_proto,
>> -FLOWER_ENDPOINT_DST, n);
>> -if (ret < 0) {
>> -fprintf(stderr, "Illegal \"dst_port\"\n");
>> -return -1;
>> +
>> +if (!get_range(&min, &max, *argv)) {
>> +ret = flower_parse_port_range(&min, &max,
>> +  ip_proto,
>> +  FLOWER_ENDPOINT_DST,
>> +  n);
>> +if (ret < 0) {
>> +fprintf(stderr, "Illegal \"dst_port range\"\n");
>> +return -1;
>> +}
>> +} else {
>> +ret = flower_parse_port(*argv, ip_proto,
>> +FLOWER_ENDPOINT_DST, n);
>> +if (ret < 0) {
>> +fprintf(stderr, "Illegal \"dst_port\"\n");
>> +return -1;
>> +}
> 
> Take a look at ip/iprule.c, line 921:
>   } else if (strcmp(*argv, "sport") == 0) {
>   ...
>   }
> 
> Using sscanf and handling the ret to be 1 or 2 should simplify the above.
> 

Okay, will simplify using sscanf.

>>  }
>>  } else if (matches(*argv, "src_port") == 0) {
>> +__be16 min, max;
>> +
>>  NEXT_ARG();
>> -ret = flower_parse_port(*argv, ip_proto,
>> -FLOWER_ENDPOINT_SRC, n);
>> -

Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 8:59 PM, David Ahern wrote:
> On 11/20/18 9:59 PM, Nambiar, Amritha wrote:
>> Oops, submitted the v2 patch for man changes too soon, without seeing
>> this. So, in this case, should I re-submit the iproute2-flower patch
>> that was accepted removing the 'range' keyword?
> 
> I think so. Consistency across commands is a good thing.
> 

Okay, will do. I'll also combine the 'man patch' into 'flower patch' and
make a single patch as Jiri recommended.


Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 8:46 PM, David Ahern wrote:
> On 11/20/18 9:44 PM, Nambiar, Amritha wrote:
>> On 11/20/2018 2:56 PM, David Ahern wrote:
>>> On 11/15/18 5:55 PM, Amritha Nambiar wrote:
>>>> Add details explaining filtering based on port ranges.
>>>>
>>>> Signed-off-by: Amritha Nambiar 
>>>> ---
>>>>  man/man8/tc-flower.8 |   12 ++--
>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>>>> index 8be8882..768bfa1 100644
>>>> --- a/man/man8/tc-flower.8
>>>> +++ b/man/man8/tc-flower.8
>>>> @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
>>>>  .IR MASKED_IP_TTL " | { "
>>>>  .BR dst_ip " | " src_ip " } "
>>>>  .IR PREFIX " | { "
>>>> -.BR dst_port " | " src_port " } "
>>>> -.IR port_number " } | "
>>>> +.BR dst_port " | " src_port " } { "
>>>> +.IR port_number " | "
>>>> +.B range
>>>> +.IR min_port_number-max_port_number " } | "
>>>>  .B tcp_flags
>>>>  .IR MASKED_TCP_FLAGS " | "
>>>>  .B type
>>>> @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port number. Only available for
>>>>  .BR ip_proto " values " udp ", " tcp  " and " sctp
>>>>  which have to be specified in beforehand.
>>>>  .TP
>>>> +.BI range " MIN_VALUE-MAX_VALUE"
>>>> +Match on a range of layer 4 protocol source or destination port number. 
>>>> Only
>>>> +available for
>>>> +.BR ip_proto " values " udp ", " tcp  " and " sctp
>>>> +which have to be specified in beforehand.
>>>> +.TP
>>>>  .BI tcp_flags " MASKED_TCP_FLAGS"
>>>>  Match on TCP flags represented as 12bit bitfield in hexadecimal format.
>>>>  A mask may be optionally provided to limit the bits which are matched. A mask
>>>>
>>>
>>> This prints as:
>>>
>>> dst_port NUMBER
>>> src_port NUMBER
>>>   Match  on  layer  4  protocol source or destination port number.
>>>   Only available for ip_proto values udp, tcp and sctp which  have
>>>   to be specified in beforehand.
>>>
>>> range MIN_VALUE-MAX_VALUE
>>>   Match  on a range of layer 4 protocol source or destination port
>>>   number. Only available for ip_proto values  udp,  tcp  and  sctp
>>>   which have to be specified in beforehand.
>>>
>>> ###
>>>
>>> That makes it look like range is a standalone option - independent of
>>> dst_port/src_port.
>>>
>>> It seems to me the dst_port / src_port should be updated to:
>>>
>>> dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}
>>>
>>> with the description updated for both options and indented under
>>> dst_port / src_port
>>>
>>
>> Okay, will do.
>>
> 
> Thinking about this perhaps the 'range' keyword can just be dropped. We
> do not use it in other places -- e.g., ip rule.
> 

Oops, submitted the v2 patch for man changes too soon, without seeing
this. So, in this case, should I re-submit the iproute2-flower patch
that was accepted removing the 'range' keyword?


Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 2:56 PM, David Ahern wrote:
> On 11/15/18 5:55 PM, Amritha Nambiar wrote:
>> Add details explaining filtering based on port ranges.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  man/man8/tc-flower.8 |   12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>> index 8be8882..768bfa1 100644
>> --- a/man/man8/tc-flower.8
>> +++ b/man/man8/tc-flower.8
>> @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
>>  .IR MASKED_IP_TTL " | { "
>>  .BR dst_ip " | " src_ip " } "
>>  .IR PREFIX " | { "
>> -.BR dst_port " | " src_port " } "
>> -.IR port_number " } | "
>> +.BR dst_port " | " src_port " } { "
>> +.IR port_number " | "
>> +.B range
>> +.IR min_port_number-max_port_number " } | "
>>  .B tcp_flags
>>  .IR MASKED_TCP_FLAGS " | "
>>  .B type
>> @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port number. Only available for
>>  .BR ip_proto " values " udp ", " tcp  " and " sctp
>>  which have to be specified in beforehand.
>>  .TP
>> +.BI range " MIN_VALUE-MAX_VALUE"
>> +Match on a range of layer 4 protocol source or destination port number. Only
>> +available for
>> +.BR ip_proto " values " udp ", " tcp  " and " sctp
>> +which have to be specified in beforehand.
>> +.TP
>>  .BI tcp_flags " MASKED_TCP_FLAGS"
>>  Match on TCP flags represented as 12bit bitfield in hexadecimal format.
>>  A mask may be optionally provided to limit the bits which are matched. A mask
>>
> 
> This prints as:
> 
> dst_port NUMBER
> src_port NUMBER
>   Match  on  layer  4  protocol source or destination port number.
>   Only available for ip_proto values udp, tcp and sctp which  have
>   to be specified in beforehand.
> 
> range MIN_VALUE-MAX_VALUE
>   Match  on a range of layer 4 protocol source or destination port
>   number. Only available for ip_proto values  udp,  tcp  and  sctp
>   which have to be specified in beforehand.
> 
> ###
> 
> That makes it look like range is a standalone option - independent of
> dst_port/src_port.
> 
> It seems to me the dst_port / src_port should be updated to:
> 
> dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}
> 
> with the description updated for both options and indented under
> dst_port / src_port
> 

Okay, will do.

- Amritha


Re: [net-next PATCH v3] net: sched: cls_flower: Classify packets using port ranges

2018-11-12 Thread Nambiar, Amritha
On 11/9/2018 11:14 PM, Jiri Pirko wrote:
> Sat, Nov 10, 2018 at 01:11:10AM CET, amritha.namb...@intel.com wrote:
> 
> [...]
> 
>> @@ -1026,8 +1122,7 @@ static void fl_init_dissector(struct flow_dissector *dissector,
>>   FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
>>  FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>>   FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
>> -FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>> - FLOW_DISSECTOR_KEY_PORTS, tp);
>> +FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_PORTS, tp);
> 
> You still need to set the key under a condition. Something like:
>   if (FL_KEY_IS_MASKED(mask, tp) ||
>   FL_KEY_IS_MASKED(mask, tp_min) ||
>   FL_KEY_IS_MASKED(mask, tp_max))
>   FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_PORTS, tp);
> 

Yes, will do. Thanks!

> 
>>  FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>>   FLOW_DISSECTOR_KEY_IP, ip);
>>  FL_KEY_SET_IF_MASKED(mask, keys, cnt,
> 
> [...]
> 


Re: [net-next PATCH v2] net: sched: cls_flower: Classify packets using port ranges

2018-11-09 Thread Nambiar, Amritha
On 11/9/2018 4:10 AM, Jiri Pirko wrote:
> Wed, Nov 07, 2018 at 10:22:42PM CET, amritha.namb...@intel.com wrote:
>> Added support in tc flower for filtering based on port ranges.
>>
>> Example:
>> 1. Match on a port range:
>> -
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
>>  action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0
>> filter protocol ip pref 1 flower chain 0 handle 0x1
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_port range 20-30
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 1 ref 1 bind 1 installed 85 sec used 3 sec
>>Action statistics:
>>Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> 2. Match on IP address and port range:
>> --
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
>>  skip_hw action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0 handle 0x2
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_ip 192.168.1.1
>>  dst_port range 100-200
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 2 ref 1 bind 1 installed 58 sec used 2 sec
>>Action statistics:
>>Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> v2:
>> Addressed Jiri's comments:
>> 1. Added separate functions for dst and src comparisons.
>> 2. Removed endpoint enum.
>> 3. Added new bit TCA_FLOWER_FLAGS_RANGE to decide normal/range
>>  lookup.
>> 4. Cleaned up fl_lookup function.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>> include/uapi/linux/pkt_cls.h |7 ++
>> net/sched/cls_flower.c   |  133 --
>> 2 files changed, 134 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> index 401d0c1..b63c3cf 100644
>> --- a/include/uapi/linux/pkt_cls.h
>> +++ b/include/uapi/linux/pkt_cls.h
>> @@ -405,6 +405,11 @@ enum {
>>  TCA_FLOWER_KEY_UDP_SRC, /* be16 */
>>  TCA_FLOWER_KEY_UDP_DST, /* be16 */
>>
>> +TCA_FLOWER_KEY_PORT_SRC_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_SRC_MAX,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MAX,/* be16 */
>> +
> 
> Please put it at the end of the enum, as David mentioned.

Will fix in v3.

> 
> 
>>  TCA_FLOWER_FLAGS,
>>  TCA_FLOWER_KEY_VLAN_ID, /* be16 */
>>  TCA_FLOWER_KEY_VLAN_PRIO,   /* u8   */
>> @@ -518,6 +523,8 @@ enum {
>>  TCA_FLOWER_KEY_FLAGS_FRAG_IS_FIRST = (1 << 1),
>> };
>>
>> +#define TCA_FLOWER_MASK_FLAGS_RANGE (1 << 0) /* Range-based match */
>> +
>> /* Match-all classifier */
>>
>> enum {
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index 9aada2d..9d2582d 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -55,6 +55,9 @@ struct fl_flow_key {
>>  struct flow_dissector_key_ip ip;
>>  struct flow_dissector_key_ip enc_ip;
>>  struct flow_dissector_key_enc_opts enc_opts;
>> +
> 
> No need for an empty line.

Will fix in v3.

> 
> 
>> +struct flow_dissector_key_ports tp_min;
>> +struct flow_dissector_key_ports tp_max;
>> } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
>>
>> struct fl_flow_mask_range {
>> @@ -65,6 +68,7 @@ struct fl_flow_mask_range {
>> struct fl_flow_mask {
>>  struct fl_flow_key key;
>>  struct fl_flow_mask_range range;
>> +u32 flags;
>>  struct rhash_head ht_node;
>>  struct rhashtable ht;
>>  struct rhashtable_params filter_ht_params;
>> @@ -179,13 +183,89 @@ static void fl_clear_masked_range(struct fl_flow_key *key,
>>  memset(fl_key_get_start(key, mask), 0, fl_mask_range(mask));
>> }
>>
>> -static struct cls_fl_filter *fl_lookup(struct fl_flow_mask *mask,
>> -   struct fl_flow_key *mkey)
>> +static bool fl_range_port_dst_cmp(struct cls_fl_filter *filter,
>> +  struct fl_flow_key *key,
>> +  struct fl_flow_key *mkey)
>> +{
>> +__be16 min_mask, max_mask, min_val, max_val;
>> +
>> +min_mask = htons(filter->mask->key.tp_min.dst);
>> +max_mask = htons(filter->mask->key.tp_max.dst);
>> +min_val = htons(filter->key.tp_min.dst);
>> +max_val = htons(filter->key.tp_max.dst);
>> +
>> +if (min_mask && max_mask) {
>> +if (htons(key->tp.dst) < min_val ||
>> +htons(key->tp.dst) > max_val)
>> +return false;
>> +
>> +/* skb does not have min and max values */
>> +

Re: [net-next PATCH v2] net: sched: cls_flower: Classify packets using port ranges

2018-11-09 Thread Nambiar, Amritha
On 11/8/2018 3:15 PM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 07 Nov 2018 13:22:42 -0800
> 
>> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> index 401d0c1..b63c3cf 100644
>> --- a/include/uapi/linux/pkt_cls.h
>> +++ b/include/uapi/linux/pkt_cls.h
>> @@ -405,6 +405,11 @@ enum {
>>  TCA_FLOWER_KEY_UDP_SRC, /* be16 */
>>  TCA_FLOWER_KEY_UDP_DST, /* be16 */
>>  
>> +TCA_FLOWER_KEY_PORT_SRC_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_SRC_MAX,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MAX,/* be16 */
>> +
>>  TCA_FLOWER_FLAGS,
>>  TCA_FLOWER_KEY_VLAN_ID, /* be16 */
>>  TCA_FLOWER_KEY_VLAN_PRIO,   /* u8   */
>> @@ -518,6 +523,8 @@ enum {
> 
> I don't think you can do this without breaking UAPI, this changes the
> value of TCA_FLOWER_FLAGS and all subsequent values in this
> enumeration.
> 

Will move the new fields to the bottom of the enum in v3.


Re: [iproute2 PATCH v2] tc: flower: Classify packets based port ranges

2018-11-09 Thread Nambiar, Amritha
On 11/9/2018 12:51 AM, Jiri Pirko wrote:
> Wed, Nov 07, 2018 at 10:22:50PM CET, amritha.namb...@intel.com wrote:
>> Added support for filtering based on port ranges.
>>
>> Example:
>> 1. Match on a port range:
>> -
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
>>  action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0
>> filter protocol ip pref 1 flower chain 0 handle 0x1
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_port range 20-30
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 1 ref 1 bind 1 installed 85 sec used 3 sec
>>Action statistics:
>>Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> 2. Match on IP address and port range:
>> --
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
>>  skip_hw action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0 handle 0x2
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_ip 192.168.1.1
>>  dst_port range 100-200
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 2 ref 1 bind 1 installed 58 sec used 2 sec
>>Action statistics:
>>Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> v2:
>> Addressed Jiri's comment to sync output format with input
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>> include/uapi/linux/pkt_cls.h |7 ++
>> tc/f_flower.c|  145 +++---
>> 2 files changed, 142 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> index 401d0c1..b63c3cf 100644
>> --- a/include/uapi/linux/pkt_cls.h
>> +++ b/include/uapi/linux/pkt_cls.h
>> @@ -405,6 +405,11 @@ enum {
>>  TCA_FLOWER_KEY_UDP_SRC, /* be16 */
>>  TCA_FLOWER_KEY_UDP_DST, /* be16 */
>>
>> +TCA_FLOWER_KEY_PORT_SRC_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_SRC_MAX,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MAX,/* be16 */
>> +
>>  TCA_FLOWER_FLAGS,
>>  TCA_FLOWER_KEY_VLAN_ID, /* be16 */
>>  TCA_FLOWER_KEY_VLAN_PRIO,   /* u8   */
>> @@ -518,6 +523,8 @@ enum {
>>  TCA_FLOWER_KEY_FLAGS_FRAG_IS_FIRST = (1 << 1),
>> };
>>
>> +#define TCA_FLOWER_MASK_FLAGS_RANGE (1 << 0) /* Range-based match */
>> +
>> /* Match-all classifier */
>>
>> enum {
>> diff --git a/tc/f_flower.c b/tc/f_flower.c
>> index 65fca04..7724a1d 100644
>> --- a/tc/f_flower.c
>> +++ b/tc/f_flower.c
>> @@ -494,6 +494,66 @@ static int flower_parse_port(char *str, __u8 ip_proto,
>>  return 0;
>> }
>>
>> +static int flower_port_range_attr_type(__u8 ip_proto, enum flower_endpoint type,
>> +   __be16 *min_port_type,
>> +   __be16 *max_port_type)
>> +{
>> +if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP ||
>> +ip_proto == IPPROTO_SCTP) {
>> +if (type == FLOWER_ENDPOINT_SRC) {
>> +*min_port_type = TCA_FLOWER_KEY_PORT_SRC_MIN;
>> +*max_port_type = TCA_FLOWER_KEY_PORT_SRC_MAX;
>> +} else {
>> +*min_port_type = TCA_FLOWER_KEY_PORT_DST_MIN;
>> +*max_port_type = TCA_FLOWER_KEY_PORT_DST_MAX;
>> +}
>> +} else {
>> +return -1;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +static int flower_parse_port_range(__be16 *min, __be16 *max, __u8 ip_proto,
>> +   enum flower_endpoint endpoint,
>> +   struct nlmsghdr *n)
>> +{
>> +__be16 min_port_type, max_port_type;
>> +
>> +flower_port_range_attr_type(ip_proto, endpoint, &min_port_type,
>> +&max_port_type);
>> +addattr16(n, MAX_MSG, min_port_type, *min);
>> +addattr16(n, MAX_MSG, max_port_type, *max);
>> +
>> +return 0;
>> +}
>> +
>> +static int get_range(__be16 *min, __be16 *max, char *argv)
>> +{
>> +char *r;
>> +
>> +r = strchr(argv, '-');
>> +if (r) {
>> +*r = '\0';
>> +if (get_be16(min, argv, 10)) {
>> +fprintf(stderr, "invalid min range\n");
>> +return -1;
>> +}
>> +if (get_be16(max, r + 1, 10)) {
>> +fprintf(stderr, "invalid max range\n");
>> +return -1;
>> +}
>> +if (htons(*max) <= htons(*min)) {
>> +fprintf(stderr, "max value should be greater than min value\n");
>> +return -1;
>> +  

Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-11-07 Thread Nambiar, Amritha
On 10/19/2018 1:52 AM, Jiri Pirko wrote:
> Thu, Oct 18, 2018 at 08:24:44PM CEST, amritha.namb...@intel.com wrote:
>> On 10/18/2018 5:17 AM, Jiri Pirko wrote:
>>> Fri, Oct 12, 2018 at 03:53:30PM CEST, amritha.namb...@intel.com wrote:
 Added support in tc flower for filtering based on port ranges.
 This is a rework of the RFC patch at:
 https://patchwork.ozlabs.org/patch/969595/

 Example:
 1. Match on a port range:
 -
 $ tc filter add dev enp4s0 protocol ip parent :\
  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
  action drop

 $ tc -s filter show dev enp4s0 parent :
 filter protocol ip pref 1 flower chain 0
 filter protocol ip pref 1 flower chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 20
  dst_port_max 30
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 181 sec used 5 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

 2. Match on IP address and port range:
 --
 $ tc filter add dev enp4s0 protocol ip parent :\
  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
  skip_hw action drop

 $ tc -s filter show dev enp4s0 parent :
 filter protocol ip pref 1 flower chain 0 handle 0x2
  eth_type ipv4
  ip_proto tcp
  dst_ip 192.168.1.1
  dst_port_min 100
  dst_port_max 200
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1 installed 28 sec used 6 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

 Signed-off-by: Amritha Nambiar 
 ---
 include/uapi/linux/pkt_cls.h |5 ++
 net/sched/cls_flower.c   |  134 --
 2 files changed, 132 insertions(+), 7 deletions(-)

 diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
 index 401d0c1..b569308 100644
 --- a/include/uapi/linux/pkt_cls.h
 +++ b/include/uapi/linux/pkt_cls.h
 @@ -405,6 +405,11 @@ enum {
TCA_FLOWER_KEY_UDP_SRC, /* be16 */
TCA_FLOWER_KEY_UDP_DST, /* be16 */

 +  TCA_FLOWER_KEY_PORT_SRC_MIN,/* be16 */
 +  TCA_FLOWER_KEY_PORT_SRC_MAX,/* be16 */
 +  TCA_FLOWER_KEY_PORT_DST_MIN,/* be16 */
 +  TCA_FLOWER_KEY_PORT_DST_MAX,/* be16 */
 +
TCA_FLOWER_FLAGS,
TCA_FLOWER_KEY_VLAN_ID, /* be16 */
TCA_FLOWER_KEY_VLAN_PRIO,   /* u8   */
 diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
 index 9aada2d..5f135f0 100644
 --- a/net/sched/cls_flower.c
 +++ b/net/sched/cls_flower.c
 @@ -55,6 +55,9 @@ struct fl_flow_key {
struct flow_dissector_key_ip ip;
struct flow_dissector_key_ip enc_ip;
struct flow_dissector_key_enc_opts enc_opts;
 +
 +  struct flow_dissector_key_ports tp_min;
 +  struct flow_dissector_key_ports tp_max;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */

 struct fl_flow_mask_range {
 @@ -103,6 +106,11 @@ struct cls_fl_filter {
struct net_device *hw_dev;
 };

 +enum fl_endpoint {
 +  FLOWER_ENDPOINT_DST,
 +  FLOWER_ENDPOINT_SRC
 +};
 +
 static const struct rhashtable_params mask_ht_params = {
.key_offset = offsetof(struct fl_flow_mask, key),
.key_len = sizeof(struct fl_flow_key),
 @@ -179,11 +187,86 @@ static void fl_clear_masked_range(struct fl_flow_key *key,
memset(fl_key_get_start(key, mask), 0, fl_mask_range(mask));
 }

 +static int fl_range_compare_params(struct cls_fl_filter *filter,
 + struct fl_flow_key *key,
 + struct fl_flow_key *mkey,
 + enum fl_endpoint endpoint)
 +{
 +  __be16 min_mask, max_mask, min_val, max_val;
 +
 +  if (endpoint == FLOWER_ENDPOINT_DST) {
 +  min_mask = htons(filter->mask->key.tp_min.dst);
 +  max_mask = htons(filter->mask->key.tp_max.dst);
 +  min_val = htons(filter->key.tp_min.dst);
 +  max_val = htons(filter->key.tp_max.dst);
 +
 +  if (min_mask && max_mask) {
 +  if (htons(key->tp.dst) < min_val ||
 +  htons(key->tp.dst) > max_val)
 +  return -1;
 +
 +  /* skb does not have min and max values */
 +  mkey->tp_min.dst = filter->mkey.tp_min.dst;
 +  

Re: [iproute2 PATCH] tc: flower: Classify packets based port ranges

2018-10-18 Thread Nambiar, Amritha
On 10/18/2018 5:21 AM, Jiri Pirko wrote:
> Fri, Oct 12, 2018 at 03:54:42PM CEST, amritha.namb...@intel.com wrote:
> 
> [...]
> 
>> @@ -1516,6 +1625,22 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
>>  if (nl_type >= 0)
>>  flower_print_port("src_port", tb[nl_type]);
>>
>> +if (flower_port_range_attr_type(ip_proto, FLOWER_ENDPOINT_DST, &range)
>> +== 0) {
>> +flower_print_port_range("dst_port_min",
>> +tb[range.min_port_type]);
>> +flower_print_port_range("dst_port_max",
>> +tb[range.max_port_type]);
> 
> The input and output of iproute2 utils, tc included should be in sync.
> So you need to print "range x-y" here.
> 

Agree, will fix in v2. Thanks!

> 
>> +}
>> +
>> +if (flower_port_range_attr_type(ip_proto, FLOWER_ENDPOINT_SRC, &range)
>> +== 0) {
>> +flower_print_port_range("src_port_min",
>> +tb[range.min_port_type]);
>> +flower_print_port_range("src_port_max",
>> +tb[range.max_port_type]);
>> +}
>> +
>>  flower_print_tcp_flags("tcp_flags", tb[TCA_FLOWER_KEY_TCP_FLAGS],
>> tb[TCA_FLOWER_KEY_TCP_FLAGS_MASK]);
>>
>>
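The v5 note in the cover letter says the final parsing was simplified to use sscanf on a "min-max" string. As a rough illustration of that parsing logic, here is a userspace Python sketch (function name and error messages are mine, not code from the patch):

```python
def parse_port_range(text):
    """Parse a flower-style port spec: a single port ("80") or an
    inclusive range ("20-30"); return an inclusive (min, max) tuple."""
    parts = text.split("-")
    if len(parts) == 1:
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError("invalid port spec: %r" % text)
    for p in (lo, hi):
        if not 0 <= p <= 0xFFFF:
            raise ValueError("port out of range: %d" % p)
    if lo > hi:
        raise ValueError("min port is greater than max port")
    return lo, hi
```

A single port parses as a degenerate range, which mirrors how the kernel side treats an exact match and a range uniformly.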


Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-10-18 Thread Nambiar, Amritha
On 10/18/2018 5:17 AM, Jiri Pirko wrote:
> Fri, Oct 12, 2018 at 03:53:30PM CEST, amritha.namb...@intel.com wrote:
>> Added support in tc flower for filtering based on port ranges.
>> This is a rework of the RFC patch at:
>> https://patchwork.ozlabs.org/patch/969595/
>>
>> Example:
>> 1. Match on a port range:
>> -
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
>>  action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0
>> filter protocol ip pref 1 flower chain 0 handle 0x1
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_port_min 20
>>  dst_port_max 30
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 1 ref 1 bind 1 installed 181 sec used 5 sec
>>Action statistics:
>>Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> 2. Match on IP address and port range:
>> --
>> $ tc filter add dev enp4s0 protocol ip parent :\
>>  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
>>  skip_hw action drop
>>
>> $ tc -s filter show dev enp4s0 parent :
>> filter protocol ip pref 1 flower chain 0 handle 0x2
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_ip 192.168.1.1
>>  dst_port_min 100
>>  dst_port_max 200
>>  skip_hw
>>  not_in_hw
>>action order 1: gact action drop
>> random type none pass val 0
>> index 2 ref 1 bind 1 installed 28 sec used 6 sec
>>Action statistics:
>>Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>backlog 0b 0p requeues 0
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>> include/uapi/linux/pkt_cls.h |5 ++
>> net/sched/cls_flower.c   |  134 
>> --
>> 2 files changed, 132 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> index 401d0c1..b569308 100644
>> --- a/include/uapi/linux/pkt_cls.h
>> +++ b/include/uapi/linux/pkt_cls.h
>> @@ -405,6 +405,11 @@ enum {
>>  TCA_FLOWER_KEY_UDP_SRC, /* be16 */
>>  TCA_FLOWER_KEY_UDP_DST, /* be16 */
>>
>> +TCA_FLOWER_KEY_PORT_SRC_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_SRC_MAX,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MIN,/* be16 */
>> +TCA_FLOWER_KEY_PORT_DST_MAX,/* be16 */
>> +
>>  TCA_FLOWER_FLAGS,
>>  TCA_FLOWER_KEY_VLAN_ID, /* be16 */
>>  TCA_FLOWER_KEY_VLAN_PRIO,   /* u8   */
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index 9aada2d..5f135f0 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -55,6 +55,9 @@ struct fl_flow_key {
>>  struct flow_dissector_key_ip ip;
>>  struct flow_dissector_key_ip enc_ip;
>>  struct flow_dissector_key_enc_opts enc_opts;
>> +
>> +struct flow_dissector_key_ports tp_min;
>> +struct flow_dissector_key_ports tp_max;
>> } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as 
>> longs. */
>>
>> struct fl_flow_mask_range {
>> @@ -103,6 +106,11 @@ struct cls_fl_filter {
>>  struct net_device *hw_dev;
>> };
>>
>> +enum fl_endpoint {
>> +FLOWER_ENDPOINT_DST,
>> +FLOWER_ENDPOINT_SRC
>> +};
>> +
>> static const struct rhashtable_params mask_ht_params = {
>>  .key_offset = offsetof(struct fl_flow_mask, key),
>>  .key_len = sizeof(struct fl_flow_key),
>> @@ -179,11 +187,86 @@ static void fl_clear_masked_range(struct fl_flow_key 
>> *key,
>>  memset(fl_key_get_start(key, mask), 0, fl_mask_range(mask));
>> }
>>
>> +static int fl_range_compare_params(struct cls_fl_filter *filter,
>> +   struct fl_flow_key *key,
>> +   struct fl_flow_key *mkey,
>> +   enum fl_endpoint endpoint)
>> +{
>> +__be16 min_mask, max_mask, min_val, max_val;
>> +
>> +if (endpoint == FLOWER_ENDPOINT_DST) {
>> +min_mask = htons(filter->mask->key.tp_min.dst);
>> +max_mask = htons(filter->mask->key.tp_max.dst);
>> +min_val = htons(filter->key.tp_min.dst);
>> +max_val = htons(filter->key.tp_max.dst);
>> +
>> +if (min_mask && max_mask) {
>> +if (htons(key->tp.dst) < min_val ||
>> +htons(key->tp.dst) > max_val)
>> +return -1;
>> +
>> +/* skb does not have min and max values */
>> +mkey->tp_min.dst = filter->mkey.tp_min.dst;
>> +mkey->tp_max.dst = filter->mkey.tp_max.dst;
>> +}
>> +} else {
>> +min_mask = htons(filter->mask->key.tp_min.src);
>> +max_mask = htons(filter->mask->key.tp_max.src);
>> +min_val = htons(filter->key.tp_min.src);
>> +max_val = 

Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-10-18 Thread Nambiar, Amritha
On 10/17/2018 10:41 PM, Cong Wang wrote:
> On Wed, Oct 17, 2018 at 9:42 PM David Miller  wrote:
>>
>> From: Amritha Nambiar 
>> Date: Fri, 12 Oct 2018 06:53:30 -0700
>>
>>> Added support in tc flower for filtering based on port ranges.
>>> This is a rework of the RFC patch at:
>>> https://patchwork.ozlabs.org/patch/969595/
>>
>> You never addressed Cong's feedback asking you to explain why this
>> can't be simply built using existing generic filtering facilities that
>> exist already.
>>
>> I appreciate that you addressed Jiri's feedback, but Cong's feedback is
>> just as, if not more, important.
>>
> 
> My objection is against introducing a new filter just for port range, now
> it is built on top of flower filter, so it is much better now.
> 
> u32 filter can do the nearly same, but requires a power-of-two, so it is
> not completely duplicated.
> 
> Therefore, I think the idea of building it on top of flower is fine. But I 
> don't
> read into any code, only the description.
> 
> Thanks!
> 

Sorry for not calling this out explicitly: the reworked patch addresses
both Jiri's and Cong's concerns. The previous RFC patch introduced a new
special-purpose classifier called 'range' for port-range based
filtering, which, as Cong pointed out, overlapped with other existing
classifiers. The reason I added a new classifier was that u32 does not
support ranges that are not a power of 2, and flower uses a masked-key
rhashtable lookup that is not suited to range-based keys. Based on the
feedback for the RFC, this patch adds port-range support to cls_flower
by separating the range comparison from the rhashtable lookup. Since
this extends cls_flower, overlap with other general-purpose classifiers
is avoided.

-Amritha
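The split described above — keep the masked-key hash lookup for exact filters and do the range comparison in a separate linear pass — can be modeled roughly as follows. This is an illustrative Python toy with invented names; real cls_flower orders filters by priority, masks many more fields, and keeps range filters on an rcu-protected list:

```python
# Toy model: exact-match filters live in a dict keyed by the masked
# key; range filters are kept on a list and compared linearly.
exact_filters = {}   # (proto, dport) -> action
range_filters = []   # (proto, dport_min, dport_max, action)

def add_exact(proto, dport, action):
    exact_filters[(proto, dport)] = action

def add_range(proto, lo, hi, action):
    range_filters.append((proto, lo, hi, action))

def classify(proto, dport):
    # Fast path: O(1) lookup on the masked key, as the rhashtable does.
    act = exact_filters.get((proto, dport))
    if act is not None:
        return act
    # Slow path: linear scan with an inclusive range comparison,
    # which is what the hashtable cannot express.
    for fproto, lo, hi, fact in range_filters:
        if proto == fproto and lo <= dport <= hi:
            return fact
    return "pass"
```

The point of the model is that only filters with a min/max mask pay the linear-scan cost; pure exact-match filters keep the existing hash-lookup behavior.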


Re: [net-next, RFC PATCH] net: sched: cls_range: Introduce Range classifier

2018-09-14 Thread Nambiar, Amritha
On 9/14/2018 2:58 AM, Jiri Pirko wrote:
> Thu, Sep 13, 2018 at 10:52:06PM CEST, amritha.namb...@intel.com wrote:
> 
> [...]
> 
>> +static struct cls_range_filter *range_lookup(struct cls_range_head *head,
>> + struct range_flow_key *key,
>> + struct range_flow_key *mkey,
>> + bool is_skb)
>> +{
>> +struct cls_range_filter *filter, *next_filter;
>> +struct range_params range;
>> +int ret;
>> +size_t cmp_size;
>> +
>> +list_for_each_entry_safe(filter, next_filter, >filters, flist) {
> 
> This really should be list_for_each_entry_rcu()
> 
> also, as I wrote in the previous email, this should be done in
> cls_flower. Look at fl_lookup() it looks-up hashtable. You just need to
> add linked list traversal and range comparison to that function for the
> hit in the hashtable.
> 

I see. Will integrate the range comparison into cls_flower.

> 
>> +if (!is_skb) {
>> +/* Existing filter comparison */
>> +cmp_size = sizeof(filter->mkey);
>> +} else {
>> +/* skb classification */
>> +ret = range_compare_params(, filter, key,
>> +   RANGE_PORT_DST);
>> +if (ret < 0)
>> +continue;
>> +
>> +ret = range_compare_params(, filter, key,
>> +   RANGE_PORT_SRC);
>> +if (ret < 0)
>> +continue;
>> +
>> +/* skb does not have min and max values */
>> +cmp_size = RANGE_KEY_MEMBER_OFFSET(tp_min);
>> +}
>> +if (!memcmp(mkey, >mkey, cmp_size))
>> +return filter;
>> +}
>> +return NULL;
> 
> [...]
> 


Re: [net-next,RFC PATCH] Introduce TC Range classifier

2018-09-14 Thread Nambiar, Amritha
On 9/14/2018 2:09 PM, Cong Wang wrote:
> On Fri, Sep 14, 2018 at 2:53 AM Jiri Pirko  wrote:
>>
>> Thu, Sep 13, 2018 at 10:52:01PM CEST, amritha.namb...@intel.com wrote:
>>> This patch introduces a TC range classifier to support filtering based
>>> on ranges. Only port-range filters are supported currently. This can
>>> be combined with flower classifier to support filters that are a
>>> combination of port-ranges and other parameters based on existing
>>> fields supported by cls_flower. The 'goto chain' action can be used to
>>> combine the flower and range filter.
>>> The filter precedence is decided based on the 'prio' value.
>>
>> For example Spectrum ASIC supports mask-based and range-based matching
>> in a single TCAM rule. No chains needed. Also, I don't really understand
>> why is this a separate cls. I believe that this functionality should be
>> put as an extension of existing cls_flower.
> 
> Exactly. u32 filters support range matching too with proper masks.
> 
Can u32 filters support ranges that are not a power of 2?
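To illustrate the limitation: a single value/mask match can only cover an aligned power-of-2 block of port numbers, so an arbitrary range has to be split into several such rules. A sketch of that decomposition (my own illustration, not from any of the patches):

```python
def range_to_masks(lo, hi):
    """Split an inclusive 16-bit port range into (value, mask) pairs,
    each covering an aligned power-of-2 block, the way a u32-style
    value/mask match would require."""
    out = []
    while lo <= hi:
        # Largest power-of-2 block that is aligned at lo...
        size = lo & -lo if lo else 1 << 16
        # ...and that still fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        out.append((lo, 0xFFFF & ~(size - 1)))
        lo += size
    return out
```

An aligned range like 16-31 collapses to one rule, while 20-30 needs four — which is why a native range match in flower is attractive.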


Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-07-10 Thread Nambiar, Amritha
On 7/4/2018 12:20 AM, Andrei Vagin wrote:
> Hello Amritha,
> 
> I see a following warning on 4.18.0-rc3-next-20180703.
> It looks like a problem is in this series.
> 
> [1.084722] ============================================
> [1.084797] WARNING: possible recursive locking detected
> [1.084872] 4.18.0-rc3-next-20180703+ #1 Not tainted
> [1.084949] --------------------------------------------
> [1.085024] swapper/0/1 is trying to acquire lock:
> [1.085100] cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
> static_key_slow_inc+0xe/0x20
> [1.085189] 
> [1.085189] but task is already holding lock:
> [1.085271] cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
> init_vqs+0x513/0x5a0
> [1.085357] 
> [1.085357] other info that might help us debug this:
> [1.085450]  Possible unsafe locking scenario:
> [1.085450] 
> [1.085531]CPU0
> [1.085605]
> [1.085679]   lock(cpu_hotplug_lock.rw_sem);
> [1.085753]   lock(cpu_hotplug_lock.rw_sem);
> [1.085828] 
> [1.085828]  *** DEADLOCK ***
> [1.085828] 
> [1.085916]  May be due to missing lock nesting notation
> [1.085916] 
> [1.085998] 3 locks held by swapper/0/1:
> [1.086074]  #0: 244bc7da (>mutex){}, at: 
> __driver_attach+0x5a/0x110
> [1.086164]  #1: cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
> init_vqs+0x513/0x5a0
> [1.086248]  #2: 5cd8463f (xps_map_mutex){+.+.}, at: 
> __netif_set_xps_queue+0x8d/0xc60
> [1.086336] 
> [1.086336] stack backtrace:
> [1.086419] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.18.0-rc3-next-20180703+ #1
> [1.086504] Hardware name: Google Google Compute Engine/Google Compute 
> Engine, BIOS Google 01/01/2011
> [1.086587] Call Trace:
> [1.086667]  dump_stack+0x85/0xcb
> [1.086744]  __lock_acquire+0x68a/0x1330
> [1.086821]  ? lock_acquire+0x9f/0x200
> [1.086900]  ? find_held_lock+0x2d/0x90
> [1.086976]  ? lock_acquire+0x9f/0x200
> [1.087051]  lock_acquire+0x9f/0x200
> [1.087126]  ? static_key_slow_inc+0xe/0x20
> [1.087205]  cpus_read_lock+0x3e/0x80
> [1.087280]  ? static_key_slow_inc+0xe/0x20
> [1.087355]  static_key_slow_inc+0xe/0x20
> [1.087435]  __netif_set_xps_queue+0x216/0xc60
> [1.087512]  virtnet_set_affinity+0xf0/0x130
> [1.087589]  init_vqs+0x51b/0x5a0
> [1.087665]  virtnet_probe+0x39f/0x870
> [1.087742]  virtio_dev_probe+0x170/0x220
> [1.087819]  driver_probe_device+0x30b/0x480
> [1.087897]  ? set_debug_rodata+0x11/0x11
> [1.087972]  __driver_attach+0xe0/0x110
> [1.088064]  ? driver_probe_device+0x480/0x480
> [1.088141]  bus_for_each_dev+0x79/0xc0
> [1.088221]  bus_add_driver+0x164/0x260
> [1.088302]  ? veth_init+0x11/0x11
> [1.088379]  driver_register+0x5b/0xe0
> [1.088402]  ? veth_init+0x11/0x11
> [1.088402]  virtio_net_driver_init+0x6d/0x90
> [1.088402]  do_one_initcall+0x5d/0x34c
> [1.088402]  ? set_debug_rodata+0x11/0x11
> [1.088402]  ? rcu_read_lock_sched_held+0x6b/0x80
> [1.088402]  kernel_init_freeable+0x1ea/0x27b
> [1.088402]  ? rest_init+0xd0/0xd0
> [1.088402]  kernel_init+0xa/0x110
> [1.088402]  ret_from_fork+0x3a/0x50
> [1.094190] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 
> 0x60,0x64 irq 1,12
> 
> 
> https://travis-ci.org/avagin/linux/jobs/399867744
> 

With this patch series, I introduced a static_key for the XPS maps
(xps_needed), so static_key_slow_inc() is used to switch branches. The
definition of static_key_slow_inc() takes cpus_read_lock internally. In
the virtio_net driver, XPS queues are initialized after setting the
queue:cpu affinity in virtnet_set_affinity(), which is already protected
by cpus_read_lock. Hence the warning here: cpus_read_lock is being
acquired while it is already held.

A quick fix for this would be to extract netif_set_xps_queue() out of
the lock by wrapping it with put_online_cpus()/get_online_cpus()
(unlock right before the call and re-take the lock right after). But
this may not be a clean solution. It would help if I could get
suggestions on a clean way to fix this without extensively changing the
code in virtio_net. Is it mandatory to protect the affinitization with
the read lock? I don't see a similar lock in other drivers while setting
the affinity. I understand this warning should go away, but isn't it
safe to have multiple readers?
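The pattern lockdep flags here — re-taking a lock the task already holds — can be shown in miniature with a non-reentrant lock. This is a loose analogy only: cpu_hotplug_lock is a percpu rwsem whose recursive read can deadlock against a queued writer, and the function names below are stand-ins for the kernel paths involved:

```python
import threading

lock = threading.Lock()  # non-reentrant: a second acquire by the
                         # same thread would block forever

def configure_xps():
    # Analogue of static_key_slow_inc() taking cpus_read_lock() again.
    # Blocking here would deadlock, so probe non-blockingly instead.
    acquired = lock.acquire(blocking=False)
    if acquired:
        lock.release()
    return acquired

def set_affinity():
    # Analogue of init_vqs()/virtnet_set_affinity() already holding
    # cpus_read_lock() when the XPS setup runs.
    with lock:
        return configure_xps()
```

The "quick fix" mentioned above corresponds to moving the configure_xps() call outside the `with lock:` block, so the inner acquire no longer nests inside the outer one.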

> On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:
>> Extend transmit queue sysfs attribute to configure Rx queue(s) map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  net/core/net-sysfs.c |   83 
>> ++
>>  1 file changed, 83 insertions(+)
>>
>> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
>> index b39987c..f25ac5f 100644
>> --- 

Re: [net-next PATCH v5 4/7] net: Record receive queue number for a connection

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 6:06 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:34 -0700
> 
>> @@ -1702,6 +1709,13 @@ static inline int sk_tx_queue_get(const struct sock 
>> *sk)
>>  return -1;
>>  }
>>  
>> +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff 
>> *skb)
>> +{
>> +#ifdef CONFIG_XPS
>> +sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>> +#endif
>> +}
> 
> Maybe add a >= NO_QUEUE_MAPPING check like you did for
> skc_tx_queue_mapping.
> 
Will fix.


Re: [net-next PATCH v5 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 6:05 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:28 -0700
> 
>> @@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, 
>> struct sk_buff *skb,
>>  
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> +/* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +if (WARN_ON_ONCE((unsigned short)tx_queue > USHRT_MAX))
>> +return;
>>  sk->sk_tx_queue_mapping = tx_queue;
>>  }
>>  
>> +#define NO_QUEUE_MAPPINGUSHRT_MAX
> 
> I think you need to check ">= USHRT_MAX" since USHRT_MAX is how you
> indicate no queue.
> 
Agree, I'll fix this.


Re: [net-next PATCH v5 1/7] net: Refactor XPS for CPUs and Rx queues

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 5:59 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:18 -0700
> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index c6b377a..3790ac9 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>  ...
>> +static inline bool attr_test_online(unsigned long j,
>> +const unsigned long *online_mask,
>> +unsigned int nr_bits)
> 
> This is a networking header file, so for names you are putting into
> the global namespace please give them some kind of prefix that
> indicates it is about networking.
> 
Sure, will prefix these with 'netif_'.


Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-27 Thread Nambiar, Amritha
On 6/27/2018 3:47 AM, Willem de Bruijn wrote:
 +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
  {
  #ifdef CONFIG_XPS
 struct xps_dev_maps *dev_maps;
 -   struct xps_map *map;
 +   struct sock *sk = skb->sk;
 int queue_index = -1;

 if (!static_key_false(_needed))
 return -1;

 rcu_read_lock();
 -   dev_maps = rcu_dereference(dev->xps_cpus_map);
 +   if (!static_key_false(_rxqs_needed))
 +   goto get_cpus_map;
 +
 +   dev_maps = rcu_dereference(dev->xps_rxqs_map);
 if (dev_maps) {
 -   unsigned int tci = skb->sender_cpu - 1;
 +   int tci = sk_rx_queue_get(sk);
>>>
>>> What if the rx device differs from the tx device?
>>>
>> I think I have 3 options here:
>> 1. Cache the ifindex in sock_common which will introduce a new
>> additional field in sock_common.
>> 2. Use dev_get_by_napi_id to get the device id. This could be expensive,
>> if the rxqs_map is set, this will be done on every packet and involves
>> walking through the hashlist for napi_id lookup.
> 
> The tx queue mapping is cached in the sk for connected sockets, but
> indeed this would be expensive for many workloads.
> 
>> 3. Remove validating device id, similar to how it is in skb_tx_hash
>> where rx_queue recorded is used and if not, fall through to flow hash
>> calculation.
>> What do you think is suitable here?
> 
> Alternatively, just accept the misprediction in this rare case. But do
> make the caveat explicit in the documentation.
> 
Okay, I will add this in the documentation.


Re: [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues

2018-06-27 Thread Nambiar, Amritha
On 6/26/2018 3:53 PM, Tom Herbert wrote:
> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>  wrote:
>> Refactor XPS code to support Tx queue selection based on
>> CPU(s) map or Rx queue(s) map.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  include/linux/cpumask.h   |   11 ++
>>  include/linux/netdevice.h |  100 +
>>  net/core/dev.c|  211 
>> ++---
>>  net/core/net-sysfs.c  |4 -
>>  4 files changed, 246 insertions(+), 80 deletions(-)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index bf53d89..57f20a0 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>>  #define cpu_active(cpu)((cpu) == 0)
>>  #endif
>>
>> -/* verify cpu argument to cpumask_* operators */
>> -static inline unsigned int cpumask_check(unsigned int cpu)
>> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>>  {
>>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> -   WARN_ON_ONCE(cpu >= nr_cpumask_bits);
>> +   WARN_ON_ONCE(cpu >= bits);
>>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +}
>> +
>> +/* verify cpu argument to cpumask_* operators */
>> +static inline unsigned int cpumask_check(unsigned int cpu)
>> +{
>> +   cpu_max_bits_warn(cpu, nr_cpumask_bits);
>> return cpu;
>>  }
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 3ec9850..c534f03 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -730,10 +730,15 @@ struct xps_map {
>>   */
>>  struct xps_dev_maps {
>> struct rcu_head rcu;
>> -   struct xps_map __rcu *cpu_map[0];
>> +   struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
>>  };
>> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> +
>> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
>> +   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
>> +
>>  #endif /* CONFIG_XPS */
>>
>>  #define TC_MAX_QUEUE   16
>> @@ -1909,7 +1914,8 @@ struct net_device {
>> int watchdog_timeo;
>>
>>  #ifdef CONFIG_XPS
>> -   struct xps_dev_maps __rcu *xps_maps;
>> +   struct xps_dev_maps __rcu *xps_cpus_map;
>> +   struct xps_dev_maps __rcu *xps_rxqs_map;
>>  #endif
>>  #ifdef CONFIG_NET_CLS_ACT
>> struct mini_Qdisc __rcu *miniq_egress;
>> @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct 
>> net_device *dev, u16 queue_index)
>>  #ifdef CONFIG_XPS
>>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>> u16 index);
>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>> + u16 index, bool is_rxqs_map);
>> +
>> +/**
>> + * attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues 
>> mask
>> + * @j: CPU/Rx queue index
>> + * @mask: bitmask of all cpus/rx queues
>> + * @nr_bits: number of bits in the bitmask
>> + *
>> + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
>> + */
>> +static inline bool attr_test_mask(unsigned long j, const unsigned long 
>> *mask,
>> + unsigned int nr_bits)
>> +{
>> +   cpu_max_bits_warn(j, nr_bits);
>> +   return test_bit(j, mask);
>> +}
>> +
>> +/**
>> + * attr_test_online - Test for online CPU/Rx queue
>> + * @j: CPU/Rx queue index
>> + * @online_mask: bitmask for CPUs/Rx queues that are online
>> + * @nr_bits: number of bits in the bitmask
>> + *
>> + * Returns true if a CPU/Rx queue is online.
>> + */
>> +static inline bool attr_test_online(unsigned long j,
>> +   const unsigned long *online_mask,
>> +   unsigned int nr_bits)
>> +{
>> +   cpu_max_bits_warn(j, nr_bits);
>> +
>> +   if (online_mask)
>> +   return test_bit(j, online_mask);
>> +
>> +   if (j >= 0 && j < nr_bits)
> 
> j is unsigned so j >= 0 is superfluous.
> 
>> +   return true;
>> +
>> +   return false;
> 
> Could just do:
> 
> return (j < nr_bits);

Will fix.

> 
>> +}
>> +
>> +/**
>> + * attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
>> + * @n: CPU/Rx queue index
>> + * @srcp: the cpumask/Rx queue mask pointer
>> + * @nr_bits: number of bits in the bitmask
>> + *
>> + * Returns >= nr_bits if no further CPUs/Rx queues set.
>> + */
>> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
>> +unsigned int nr_bits)
>> +{
>> +   /* -1 is a legal arg here. */
>> +   if (n != -1)
>> +   cpu_max_bits_warn(n, nr_bits);
>> +
>> +   if (srcp)

Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-26 Thread Nambiar, Amritha
On 6/26/2018 4:04 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
>  wrote:
>>
>> This patch adds support to pick Tx queue based on the Rx queue(s) map
>> configuration set by the admin through the sysfs attribute
>> for each Tx queue. If the user configuration for receive queue(s) map
>> does not apply, then the Tx queue selection falls back to CPU(s) map
>> based selection and finally to hashing.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
> 
>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>  {
>>  #ifdef CONFIG_XPS
>> struct xps_dev_maps *dev_maps;
>> -   struct xps_map *map;
>> +   struct sock *sk = skb->sk;
>> int queue_index = -1;
>>
>> if (!static_key_false(_needed))
>> return -1;
>>
>> rcu_read_lock();
>> -   dev_maps = rcu_dereference(dev->xps_cpus_map);
>> +   if (!static_key_false(_rxqs_needed))
>> +   goto get_cpus_map;
>> +
>> +   dev_maps = rcu_dereference(dev->xps_rxqs_map);
>> if (dev_maps) {
>> -   unsigned int tci = skb->sender_cpu - 1;
>> +   int tci = sk_rx_queue_get(sk);
> 
> What if the rx device differs from the tx device?
> 
I think I have 3 options here:
1. Cache the ifindex in sock_common, which will introduce a new
additional field in sock_common.
2. Use dev_get_by_napi_id() to get the device id. This could be
expensive: if the rxqs_map is set, the lookup will be done on every
packet and involves walking the hash list for the napi_id lookup.
3. Remove the device id validation, similar to skb_tx_hash(), where the
recorded rx_queue is used and, if not set, we fall through to flow hash
calculation.
What do you think is suitable here?
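For reference, the selection order the patch implements — rx-queue map first, then CPU map, then flow hash — can be modeled as below. The names are illustrative; the real get_xps_queue() also handles traffic classes and per-queue map entries:

```python
def pick_tx_queue(rxqs_map, cpus_map, sk_rx_queue, sender_cpu,
                  flow_hash, num_tx_queues):
    """Model of the Tx queue selection fallback chain: try the
    rx-queue map, then the CPU map, then flow-hash selection."""
    # 1. Admin configured a Tx queue for the socket's Rx queue?
    if rxqs_map is not None and sk_rx_queue is not None:
        txq = rxqs_map.get(sk_rx_queue)
        if txq is not None:
            return txq
    # 2. Fall back to the CPU-based XPS map.
    if cpus_map is not None:
        txq = cpus_map.get(sender_cpu)
        if txq is not None:
            return txq
    # 3. Final fallback: hash the flow over the available queues.
    return flow_hash % num_tx_queues
```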


Re: [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-06-26 Thread Nambiar, Amritha
On 6/26/2018 3:55 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
>  wrote:
>>
>> Extend transmit queue sysfs attribute to configure Rx queue(s) map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
> 
>> +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
>> +{
>> +   struct net_device *dev = queue->dev;
>> +   struct xps_dev_maps *dev_maps;
>> +   unsigned long *mask, index;
>> +   int j, len, num_tc = 1, tc = 0;
>> +
>> +   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
>> +  GFP_KERNEL);
>> +   if (!mask)
>> +   return -ENOMEM;
>> +
>> +   index = get_netdev_queue_index(queue);
>> +
>> +   if (dev->num_tc) {
>> +   num_tc = dev->num_tc;
>> +   tc = netdev_txq_to_tc(dev, index);
>> +   if (tc < 0)
>> +   return -EINVAL;
> 
> Must free mask
> 
>> +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
>> + size_t len)
>> +{
>> +   struct net_device *dev = queue->dev;
>> +   unsigned long *mask, index;
>> +   int err;
>> +
>> +   if (!capable(CAP_NET_ADMIN))
>> +   return -EPERM;
> 
> ns_capable?
> 
Will change this to ns_capable.


Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-26 Thread Nambiar, Amritha
On 6/26/2018 3:58 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
>  wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
> 
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -   sk->sk_tx_queue_mapping = tx_queue;
>> +   /* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>> +   sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
> 
> WARN_ON_ONCE to avoid flooding the kernel buffer.
> 
Will fix.


Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-26 Thread Nambiar, Amritha
On 6/25/2018 8:25 PM, Alexander Duyck wrote:
> On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert  wrote:
>>
>>
>> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>>  wrote:
>>>
>>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>>> 'int' to 'unsigned short' type with 0 indicating unset and
>>> a positive queue value being set. This way it is consistent with
>>> the queue_mapping field in the sk_buff. This will also accommodate
>>> adding a new 'unsigned short' field in sock_common in the next
>>> patch for rx_queue_mapping.
>>>
>>> Signed-off-by: Amritha Nambiar 
>>> ---
>>>  include/net/sock.h |   10 ++
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index b3b7541..009fd30 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -214,7 +214,7 @@ struct sock_common {
>>> struct hlist_node   skc_node;
>>> struct hlist_nulls_node skc_nulls_node;
>>> };
>>> -   int skc_tx_queue_mapping;
>>> +   unsigned short  skc_tx_queue_mapping;
>>> union {
>>> int skc_incoming_cpu;
>>> u32 skc_rcv_wnd;
>>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>>> struct sk_buff *skb,
>>>
>>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>>  {
>>> -   sk->sk_tx_queue_mapping = tx_queue;
>>> +   /* sk_tx_queue_mapping accept only upto a 16-bit value */
>>> +   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>>
>>
>> Shouldn't this be USHRT_MAX - 1 ?
> 
> Actually just a ">=" would probably do as well.

Ugh! Will definitely fix this.

> 
>>
>>> +   sk->sk_tx_queue_mapping = tx_queue + 1;
>>>  }
>>>
>>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>>  {
>>> -   sk->sk_tx_queue_mapping = -1;
>>>
>>> +   sk->sk_tx_queue_mapping = 0;
>>
>>
>> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
>> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
>> value is accessed.

The idea was to avoid having to make any changes to the callers of
these functions and to make this similar to queue_mapping in the skbuff,
with 0 indicating unset and a positive value for set. sk_tx_queue_get()
returns -1 on invalid, and the callers were validating negative values.
With sk_tx_queue_mapping initialized to USHRT_MAX, and an additional
check in sk_tx_queue_get() to return -1 when sk_tx_queue_mapping holds
USHRT_MAX, I think I can keep the changes minimal and avoid the
arithmetic, if that's a more acceptable solution.
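The two encodings being weighed here can be put side by side (a sketch of the semantics only, not kernel code):

```python
USHRT_MAX = 0xFFFF

# Encoding A (this patch): store tx_queue + 1; 0 means "unset".
# The getter pays a decrement on every access.
def set_a(store, q):
    store["txq"] = q + 1

def get_a(store):
    return store["txq"] - 1   # yields -1 when unset

# Encoding B (Tom's suggestion): store the queue directly, with
# USHRT_MAX as a NO_QUEUE_MAPPING sentinel; the getter hides it.
def set_b(store, q):
    store["txq"] = q

def get_b(store):
    q = store["txq"]
    return -1 if q >= USHRT_MAX else q
```

Both preserve the existing caller contract (negative return means "no queue"); encoding B just trades the per-access arithmetic for a sentinel comparison and a non-zero initializer.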

>>>
>>>  }
>>>
>>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>>  {
>>> -   return sk ? sk->sk_tx_queue_mapping : -1;
>>> +   return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>>
>>
>> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
>> for this?
> 
> This doesn't change the result. It was still -1 if the queue mapping
> is not set. It was just initialized to 0 instead of to -1 so we have
> to perform the operation to get there.
> 
> Also in regards to the comment above about needing an extra operation
> I am not sure it makes much difference.
> 
> In the case of us starting with 0 as a reserved value I think the
> instruction count should be about the same. We move the unsigned short
> into an unsigned in, then decrement, and if the value is non-negative
> we can assume it is valid. Although maybe I should double check the
> code to make certain it is doing what I thought it was supposed to be
> doing.
> 
>>
>>>
>>>
>>>
>>>  }
>>>
>>>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>>>
>>


Re: [net-next PATCH v3 3/5] net: Enable Tx queue selection based on Rx queues

2018-06-06 Thread Nambiar, Amritha
On 6/6/2018 12:13 PM, Willem de Bruijn wrote:
> On Wed, Jun 6, 2018 at 3:08 PM, Samudrala, Sridhar
>  wrote:
>>
>> On 6/6/2018 11:56 AM, Willem de Bruijn wrote:
>>>
>>> On Tue, Jun 5, 2018 at 4:38 AM, Amritha Nambiar
>>>  wrote:

 This patch adds support to pick Tx queue based on the Rx queue(s) map
 configuration set by the admin through the sysfs attribute
 for each Tx queue. If the user configuration for receive queue(s) map
 does not apply, then the Tx queue selection falls back to CPU(s) map
 based selection and finally to hashing.

 Signed-off-by: Amritha Nambiar 
 Signed-off-by: Sridhar Samudrala 
 ---
   int sysctl_tcp_max_orphans __read_mostly = NR_FILE;

 @@ -5574,6 +5575,7 @@ void tcp_finish_connect(struct sock *sk, struct
 sk_buff *skb)
  if (skb) {
  icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
  security_inet_conn_established(sk, skb);
 +   sk_mark_napi_id(sk, skb);
  }
>>>
>>> This and the call below should be in a standalone patch, as the mark
>>> changes are not rxq-xps specific. Is the additional earlier marking really
>>> required?
>>
>>
>> The additional earlier marking in tcp_finish_connect() allows a client app
>> to do
>> SO_INCOMING_NAPI_ID after a connect() call to get the right queue
>> association
>> for a socket.
>>
>> The marking in tcp_conn_request() allows syn-ack to go on the right tx-queue
>> associated with the queue on which syn is received.
> 
> I understand the intent. My question really is whether it is needed.
> Marking has been slightly lossy in this regard in the past, not
> necessarily as an oversight. I don't mean to make that call here,
> but it's worth discussion and its own patch.
> 
Will separate this out into a standalone patch in v4.


Re: [net-next PATCH v3 3/5] net: Enable Tx queue selection based on Rx queues

2018-06-06 Thread Nambiar, Amritha
On 6/5/2018 10:57 AM, Tom Herbert wrote:
> 
> 
> On Tue, Jun 5, 2018 at 1:38 AM, Amritha Nambiar
> mailto:amritha.namb...@intel.com>> wrote:
> 
> This patch adds support to pick Tx queue based on the Rx queue(s) map
> configuration set by the admin through the sysfs attribute
> for each Tx queue. If the user configuration for receive queue(s) map
> does not apply, then the Tx queue selection falls back to CPU(s) map
> based selection and finally to hashing.
> 
> Signed-off-by: Amritha Nambiar  >
> Signed-off-by: Sridhar Samudrala  >
> ---
>  include/net/busy_poll.h |    3 ++
>  include/net/sock.h      |   14 +++
>  net/core/dev.c          |   60
> ---
>  net/core/sock.c         |    4 +++
>  net/ipv4/tcp_input.c    |    3 ++
>  5 files changed, 65 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 71c72a9..fc4fb68 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -136,6 +136,9 @@ static inline void sk_mark_napi_id(struct sock
> *sk, const struct sk_buff *skb)
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>         sk->sk_napi_id = skb->napi_id;
>  #endif
> +#ifdef CONFIG_XPS
> +       sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
> +#endif
>  }
> 
>  /* variant used for unconnected sockets */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4f7c584..12313653 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
>   *     @skc_node: main hash linkage for various protocol lookup tables
>   *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   *     @skc_tx_queue_mapping: tx queue number for this connection
> + *     @skc_rx_queue_mapping: rx queue number for this connection
>   *     @skc_flags: place holder for sk_flags
>   *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>   *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
> @@ -215,6 +216,9 @@ struct sock_common {
>                 struct hlist_nulls_node skc_nulls_node;
>         };
>         int                     skc_tx_queue_mapping;
> +#ifdef CONFIG_XPS
> +       int                     skc_rx_queue_mapping;
> 
> 
> This is still expensive cost to be adding an int field into sock_common
> for a relatively rare use case. Maybe there should be a CONFIG_XPS_RQS?
> Or maybe skc_tx_queue_mapping and skc_rx_queue_mapping could be shorts
> (so maximum queue mapping would then be 2^16-2).

Thanks for the review, Tom. I will fix up the code incorporating all
your feedback in the next version (v4). I could have a new config option
CONFIG_XPS_RXQS that would default to off, in addition to the CONFIG_XPS
option that's already there. As for changing 'skc_tx_queue_mapping' to a
short, my concern is that the change would become extensive; there are a
lot of places where this field is assigned int or u32 values.

> 
> +#endif
>         union {
>                 int             skc_incoming_cpu;
>                 u32             skc_rcv_wnd;
> @@ -326,6 +330,9 @@ struct sock {
>  #define sk_nulls_node          __sk_common.skc_nulls_node
>  #define sk_refcnt              __sk_common.skc_refcnt
>  #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping
> +#ifdef CONFIG_XPS
> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping
> +#endif
> 
>  #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin
>  #define sk_dontcopy_end                __sk_common.skc_dontcopy_end
> @@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const
> struct sock *sk)
>         return sk ? sk->sk_tx_queue_mapping : -1;
>  }
> 
> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff
> *skb)
> +{
> +#ifdef CONFIG_XPS
> +       sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
> +#endif
> +}
> +
>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>  {
>         sk_tx_queue_clear(sk);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index bba755f..1880e6c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3479,36 +3479,58 @@ sch_handle_egress(struct sk_buff *skb, int
> *ret, struct net_device *dev)
>  }
>  #endif /* CONFIG_NET_EGRESS */
> 
> -static inline int get_xps_queue(struct net_device *dev, struct
> sk_buff *skb)
> +#ifdef CONFIG_XPS
> +static int __get_xps_queue_idx(struct net_device *dev, struct
> sk_buff *skb,
> +                              struct xps_dev_maps *dev_maps,
> unsigned int tci)
> +{
> +       struct xps_map 

Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues

2018-05-23 Thread Nambiar, Amritha
On 5/22/2018 7:09 AM, Tom Herbert wrote:
> On Mon, May 21, 2018 at 8:12 AM, Willem de Bruijn
>  wrote:
>> On Mon, May 21, 2018 at 10:51 AM, Tom Herbert  wrote:
>>> On Sat, May 19, 2018 at 1:27 PM, Willem de Bruijn
>>>  wrote:
 On Sat, May 19, 2018 at 4:13 PM, Willem de Bruijn
  wrote:
> On Fri, May 18, 2018 at 12:03 AM, Tom Herbert  
> wrote:
>> On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
>>  wrote:
>>> This patch adds support to pick Tx queue based on the Rx queue map
>>> configuration set by the admin through the sysfs attribute
>>> for each Tx queue. If the user configuration for receive
>>> queue map does not apply, then the Tx queue selection falls back
>>> to CPU map based selection and finally to hashing.
>>>
>>> Signed-off-by: Amritha Nambiar 
>>> Signed-off-by: Sridhar Samudrala 
>>> ---

>>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>> +{
>>> +#ifdef CONFIG_XPS
>>> +   enum xps_map_type i = XPS_MAP_RXQS;
>>> +   struct xps_dev_maps *dev_maps;
>>> +   struct sock *sk = skb->sk;
>>> +   int queue_index = -1;
>>> +   unsigned int tci = 0;
>>> +
>>> +   if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
>>> +   dev->ifindex == sk->sk_rx_ifindex)
>>> +   tci = sk->sk_rx_queue_mapping;
>>> +
>>> +   rcu_read_lock();
>>> +   while (queue_index < 0 && i < __XPS_MAP_MAX) {
>>> +   if (i == XPS_MAP_CPUS)
>>
>> This while loop typifies exactly why I don't think the XPS maps should
>> be an array.
>
> +1

 as a matter of fact, as enabling both cpu and rxqueue map at the same
 time makes no sense, only one map is needed at any one time. The
 only difference is in how it is indexed. It should probably not be possible
 to configure both at the same time. Keeping a single map probably also
 significantly simplifies patch 1/4.
>>>
>>> Willem,
>>>
>>> I think it might makes sense to have them both. Maybe one application
>>> is spin polling that needs this, where others might be happy with
>>> normal CPU mappings as default.
>>
>> Some entries in the rx_queue table have queue_pair affinity
>> configured, the others return -1 to fall through to the cpu
>> affinity table?
>>
> Right, that's the intent of the while loop.
> 

Yes, by default, rx queue maps are not configured for the tx queue.
These maps have to be explicitly configured, mapping rx queue(s) to tx
queue(s). If the rx queue map configuration does not apply, then the tx
queue is selected based on the CPU map.

>> I guess that implies flow steering to those special purpose
>> queues. I wonder whether this would be used this in practice.
>> I does make the code more complex by having to duplicate
>> the map lookup logic (mostly, patch 1/4).
> 
> That's a good pont. I think we need more information on how the
> feature is going to be used in practice. My assumption is that there
> are some number of "special" queues for which spin polling is being
> done.

Will submit v3 with testing hints and performance results.

> 
> Tom
> 


Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues

2018-05-23 Thread Nambiar, Amritha
On 5/19/2018 1:13 PM, Willem de Bruijn wrote:
> On Fri, May 18, 2018 at 12:03 AM, Tom Herbert  wrote:
>> On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
>>  wrote:
>>> This patch adds support to pick Tx queue based on the Rx queue map
>>> configuration set by the admin through the sysfs attribute
>>> for each Tx queue. If the user configuration for receive
>>> queue map does not apply, then the Tx queue selection falls back
>>> to CPU map based selection and finally to hashing.
>>>
>>> Signed-off-by: Amritha Nambiar 
>>> Signed-off-by: Sridhar Samudrala 
>>> ---
>>>  include/net/sock.h   |   18 ++
>>>  net/core/dev.c   |   36 +---
>>>  net/core/sock.c  |5 +
>>>  net/ipv4/tcp_input.c |7 +++
>>>  net/ipv4/tcp_ipv4.c  |1 +
>>>  net/ipv4/tcp_minisocks.c |1 +
>>>  6 files changed, 61 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 4f7c584..0613f63 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
>>>   * @skc_node: main hash linkage for various protocol lookup tables
>>>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>>>   * @skc_tx_queue_mapping: tx queue number for this connection
>>> + * @skc_rx_queue_mapping: rx queue number for this connection
>>> + * @skc_rx_ifindex: rx ifindex for this connection
>>>   * @skc_flags: place holder for sk_flags
>>>   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>>>   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>>> @@ -215,6 +217,10 @@ struct sock_common {
>>> struct hlist_nulls_node skc_nulls_node;
>>> };
>>> int skc_tx_queue_mapping;
>>> +#ifdef CONFIG_XPS
>>> +   int skc_rx_queue_mapping;
>>> +   int skc_rx_ifindex;
>>
>> Isn't this increasing size of sock_common for a narrow use case 
>> functionality?
> 
> You can get the device from the already recorded sk_napi_id.
> Sadly, not the queue number as far as I can see.
> 
I plan not to cache the ifindex in sock_common and to retain only the
rx_queue. This way, it'll look similar to skb_tx_hash, where the
recorded rx_queue is used if available and otherwise falls through to
flow hash calculation. Likewise, we use the mapped rx_queue and fall
through to the CPU map on failure.
> 
>>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
>>> +{
>>> +#ifdef CONFIG_XPS
>>> +   sk->sk_rx_ifindex = skb->skb_iif;
>>> +   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>>> +#endif
>>> +}
>>> +
> 
> Instead of adding this function and calls to it in many locations in
> the stack, you can expand sk_mark_napi_id.
> 
> Also, it is not clear why this should be called in locations where
> sk_mark_napi_id is not.
> 
Makes sense, I will add this as part of sk_mark_napi_id.

> 
>>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>> +{
>>> +#ifdef CONFIG_XPS
>>> +   enum xps_map_type i = XPS_MAP_RXQS;
>>> +   struct xps_dev_maps *dev_maps;
>>> +   struct sock *sk = skb->sk;
>>> +   int queue_index = -1;
>>> +   unsigned int tci = 0;
>>> +
>>> +   if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
>>> +   dev->ifindex == sk->sk_rx_ifindex)
>>> +   tci = sk->sk_rx_queue_mapping;
>>> +
>>> +   rcu_read_lock();
>>> +   while (queue_index < 0 && i < __XPS_MAP_MAX) {
>>> +   if (i == XPS_MAP_CPUS)
>>
>> This while loop typifies exactly why I don't think the XPS maps should
>> be an array.
> 
> +1
> 
Okay, I will change this to two maps with separate pointers.


Re: [net-next PATCH v2 1/4] net: Refactor XPS for CPUs and Rx queues

2018-05-23 Thread Nambiar, Amritha
On 5/17/2018 9:08 PM, Tom Herbert wrote:
> On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
>  wrote:
>> Refactor XPS code to support Tx queue selection based on
>> CPU map or Rx queue map.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  include/linux/cpumask.h   |   11 ++
>>  include/linux/netdevice.h |   72 +++-
>>  net/core/dev.c|  208 
>> +
>>  net/core/net-sysfs.c  |4 -
>>  4 files changed, 215 insertions(+), 80 deletions(-)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index bf53d89..57f20a0 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>>  #define cpu_active(cpu)((cpu) == 0)
>>  #endif
>>
>> -/* verify cpu argument to cpumask_* operators */
>> -static inline unsigned int cpumask_check(unsigned int cpu)
>> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>>  {
>>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> -   WARN_ON_ONCE(cpu >= nr_cpumask_bits);
>> +   WARN_ON_ONCE(cpu >= bits);
>>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +}
>> +
>> +/* verify cpu argument to cpumask_* operators */
>> +static inline unsigned int cpumask_check(unsigned int cpu)
>> +{
>> +   cpu_max_bits_warn(cpu, nr_cpumask_bits);
>> return cpu;
>>  }
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 03ed492..c2eeb36 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -730,10 +730,21 @@ struct xps_map {
>>   */
>>  struct xps_dev_maps {
>> struct rcu_head rcu;
>> -   struct xps_map __rcu *cpu_map[0];
>> +   struct xps_map __rcu *attr_map[0];
>>  };
>> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> +
>> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
>> +   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +enum xps_map_type {
>> +   XPS_MAP_RXQS,
>> +   XPS_MAP_CPUS,
>> +   __XPS_MAP_MAX
>> +};
>> +
>>  #endif /* CONFIG_XPS */
>>
>>  #define TC_MAX_QUEUE   16
>> @@ -1891,7 +1902,7 @@ struct net_device {
>> int watchdog_timeo;
>>
>>  #ifdef CONFIG_XPS
>> -   struct xps_dev_maps __rcu *xps_maps;
>> +   struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
>>  #endif
>>  #ifdef CONFIG_NET_CLS_ACT
>> struct mini_Qdisc __rcu *miniq_egress;
>> @@ -3229,6 +3240,61 @@ static inline void netif_wake_subqueue(struct 
>> net_device *dev, u16 queue_index)
>>  #ifdef CONFIG_XPS
>>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>> u16 index);
>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>> + u16 index, enum xps_map_type type);
>> +
>> +static inline bool attr_test_mask(unsigned long j, const unsigned long 
>> *mask,
>> + unsigned int nr_bits)
>> +{
>> +   cpu_max_bits_warn(j, nr_bits);
>> +   return test_bit(j, mask);
>> +}
>> +
>> +static inline bool attr_test_online(unsigned long j,
>> +   const unsigned long *online_mask,
>> +   unsigned int nr_bits)
>> +{
>> +   cpu_max_bits_warn(j, nr_bits);
>> +
>> +   if (online_mask)
>> +   return test_bit(j, online_mask);
>> +
>> +   if (j >= 0 && j < nr_bits)
>> +   return true;
>> +
>> +   return false;
>> +}
>> +
>> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
>> +unsigned int nr_bits)
>> +{
>> +   /* -1 is a legal arg here. */
>> +   if (n != -1)
>> +   cpu_max_bits_warn(n, nr_bits);
>> +
>> +   if (srcp)
>> +   return find_next_bit(srcp, nr_bits, n + 1);
>> +
>> +   return n + 1;
>> +}
>> +
>> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
>> +   const unsigned long *src2p,
>> +   unsigned int nr_bits)
>> +{
>> +   /* -1 is a legal arg here. */
>> +   if (n != -1)
>> +   cpu_max_bits_warn(n, nr_bits);
>> +
>> +   if (src1p && src2p)
>> +   return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
>> +   else if (src1p)
>> +   return find_next_bit(src1p, nr_bits, n + 1);
>> +   else if (src2p)
>> +   return find_next_bit(src2p, nr_bits, n + 1);
>> +
>> +   return n + 1;
>> +}
>>  #else
>>  static inline int netif_set_xps_queue(struct net_device *dev,
>>   const struct cpumask *mask,
>> diff --git 

Re: [net-next PATCH v2 3/4] net-sysfs: Add interface for Rx queue map per Tx queue

2018-05-17 Thread Nambiar, Amritha
On 5/17/2018 12:05 PM, Florian Fainelli wrote:
> On 05/15/2018 06:26 PM, Amritha Nambiar wrote:
>> Extend transmit queue sysfs attribute to configure Rx queue map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
> 
> Please include an update to Documentation/ABI/testing/sysfs-class-net
> with your new attribute.
> 
Will do in the next version.
Thanks.
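For reference, a sketch of what the requested entry could look like, following the format of the existing queue attributes in Documentation/ABI/testing. The date, kernel version, and wording here are placeholders, not the actual committed entry.

```text
What:		/sys/class/net/<iface>/queues/tx-<queue>/xps_rxqs
Date:		June 2018
KernelVersion:	4.18
Contact:	netdev@vger.kernel.org
Description:
		Mask of the receive queue(s) mapped to this transmit queue for
		Tx queue selection. By default the mask is empty and transmit
		queue selection falls back to the CPU map, then to hashing.
```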


Re: [net-next PATCH v2 1/4] net: Refactor XPS for CPUs and Rx queues

2018-05-17 Thread Nambiar, Amritha
On 5/17/2018 11:38 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Tue, 15 May 2018 18:26:43 -0700
> 
>> @@ -2125,7 +2125,7 @@ static bool remove_xps_queue_cpu(struct net_device 
>> *dev,
>>  int i, j;
>>  
>>  for (i = count, j = offset; i--; j++) {
>> -if (!remove_xps_queue(dev_maps, cpu, j))
>> +if (!remove_xps_queue(dev_maps, tci, j))
>>  break;
>>  }
>>  
> 
> This looks like a bug fix, completely unrelated to the feature being added
> by this patch set.
> 
> Please submit this targetting the 'net' tree, then when that fix propagates
> into 'net-next' you can rebase this series on top of that.
> 
> Thank you.
> 

Sure, will do that.
Thanks.


Re: [net-next PATCH 1/3] net: Refactor XPS for CPUs and Rx queues

2018-05-14 Thread Nambiar, Amritha
On 5/14/2018 9:53 AM, Tom Herbert wrote:
> On Wed, May 9, 2018 at 1:54 PM, Nambiar, Amritha
> <amritha.namb...@intel.com> wrote:
>> On 5/9/2018 1:31 PM, Tom Herbert wrote:
>>> On Thu, Apr 19, 2018 at 6:04 PM, Amritha Nambiar
>>> <amritha.namb...@intel.com> wrote:
>>>> Refactor XPS code to support Tx queue selection based on
>>>> CPU map or Rx queue map.
>>>>
>>>> Signed-off-by: Amritha Nambiar <amritha.namb...@intel.com>
>>>> ---
>>>>  include/linux/netdevice.h |   82 +-
>>>>  net/core/dev.c|  206 
>>>> +
>>>>  net/core/net-sysfs.c  |4 -
>>>>  3 files changed, 216 insertions(+), 76 deletions(-)
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index 14e0777..40a9171 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -730,10 +730,21 @@ struct xps_map {
>>>>   */
>>>>  struct xps_dev_maps {
>>>> struct rcu_head rcu;
>>>> -   struct xps_map __rcu *cpu_map[0];
>>>> +   struct xps_map __rcu *attr_map[0];
>>>
>>> This seems unnecessarily complicated to me. Why not just add another
>>> map called something like "rxq2txq_map". Then when selecting TXQ just
>>> check the new map first and then the normal cpu_map if there's not a
>>> hit.
>>>
>>
>> This is just a change in the name to something more generic ('attr')
>> since the maps can either be cpu based or rxq based. I have added two
>> map types, XPS_MAP_RXQS, XPS_MAP_CPUS and the TXQ selection (in patch
> 
> I think adding map types is overkill and we really don't want to turn
> this in to a generic but complex interface with a bunch of map types.
> Just have two pointers to the two different maps.
> 

Sorry if I wasn't clear enough; instead of having two different pointers:
struct xps_dev_maps __rcu *xps_cpu_maps;
struct xps_dev_maps __rcu *xps_rxq_maps;
in the struct netdevice below, I have a pointer array:
struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
and use the map type enum as an index to select which one we are updating.

The idea was that this might look cleaner when selecting the queue,
since the processing order would be XPS_MAP_RXQS first and then
XPS_MAP_CPUS.

>> 2/3) works how you described,  first based on the RXQ map and if there
>> is no hit, falls to the normal CPU map.
>>
>>>>  };
>>>> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>>>> +
>>>> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>>>> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
>>>> +
>>>> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
>>>> +   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
>>>> +
>>>> +enum xps_map_type {
>>>> +   XPS_MAP_RXQS,
>>>> +   XPS_MAP_CPUS,
>>>> +   __XPS_MAP_MAX
>>>> +};
>>>> +
>>>>  #endif /* CONFIG_XPS */
>>>>
>>>>  #define TC_MAX_QUEUE   16
>>>> @@ -1867,7 +1878,7 @@ struct net_device {
>>>> int watchdog_timeo;
>>>>
>>>>  #ifdef CONFIG_XPS
>>>> -   struct xps_dev_maps __rcu *xps_maps;
>>>> +   struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
>>>>  #endif
>>>>  #ifdef CONFIG_NET_CLS_ACT
>>>> struct mini_Qdisc __rcu *miniq_egress;
>>>> @@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct 
>>>> net_device *dev, u16 queue_index)
>>>>  #ifdef CONFIG_XPS
>>>>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask 
>>>> *mask,
>>>> u16 index);
>>>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long 
>>>> *mask,
>>>> + u16 index, enum xps_map_type type);
>>>> +
>>>> +static inline bool attr_test_mask(unsigned long j, const unsigned long 
>>>> *mask,
>>>> + unsigned int nr_bits)
>>>> +{
>>>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>>>> +   WARN_ON_ONCE(j >= nr_bits);
>>>> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>>>
>>> This #ifdef block appear

Re: [net-next PATCH 1/3] net: Refactor XPS for CPUs and Rx queues

2018-05-09 Thread Nambiar, Amritha
On 5/9/2018 1:31 PM, Tom Herbert wrote:
> On Thu, Apr 19, 2018 at 6:04 PM, Amritha Nambiar
>  wrote:
>> Refactor XPS code to support Tx queue selection based on
>> CPU map or Rx queue map.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  include/linux/netdevice.h |   82 +-
>>  net/core/dev.c|  206 
>> +
>>  net/core/net-sysfs.c  |4 -
>>  3 files changed, 216 insertions(+), 76 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 14e0777..40a9171 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -730,10 +730,21 @@ struct xps_map {
>>   */
>>  struct xps_dev_maps {
>> struct rcu_head rcu;
>> -   struct xps_map __rcu *cpu_map[0];
>> +   struct xps_map __rcu *attr_map[0];
> 
> This seems unnecessarily complicated to me. Why not just add another
> map called something like "rxq2txq_map". Then when selecting TXQ just
> check the new map first and then the normal cpu_map if there's not a
> hit.
> 

This is just a change of the name to something more generic ('attr')
since the maps can be either cpu based or rxq based. I have added two
map types, XPS_MAP_RXQS and XPS_MAP_CPUS, and the TXQ selection (in
patch 2/3) works as you described: first based on the RXQ map and, if
there is no hit, falling back to the normal CPU map.

>>  };
>> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> +
>> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
>> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
>> +   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +enum xps_map_type {
>> +   XPS_MAP_RXQS,
>> +   XPS_MAP_CPUS,
>> +   __XPS_MAP_MAX
>> +};
>> +
>>  #endif /* CONFIG_XPS */
>>
>>  #define TC_MAX_QUEUE   16
>> @@ -1867,7 +1878,7 @@ struct net_device {
>> int watchdog_timeo;
>>
>>  #ifdef CONFIG_XPS
>> -   struct xps_dev_maps __rcu *xps_maps;
>> +   struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
>>  #endif
>>  #ifdef CONFIG_NET_CLS_ACT
>> struct mini_Qdisc __rcu *miniq_egress;
>> @@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct 
>> net_device *dev, u16 queue_index)
>>  #ifdef CONFIG_XPS
>>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>> u16 index);
>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>> + u16 index, enum xps_map_type type);
>> +
>> +static inline bool attr_test_mask(unsigned long j, const unsigned long 
>> *mask,
>> + unsigned int nr_bits)
>> +{
>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> +   WARN_ON_ONCE(j >= nr_bits);
>> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> 
> This #ifdef block appears 3 times in the patch. Seems like it should
> be replace by simple macro.

Sure, will do in the next version.

> 
>> +   return test_bit(j, mask);
>> +}
>> +
>> +static inline bool attr_test_online(unsigned long j,
>> +   const unsigned long *online_mask,
>> +   unsigned int nr_bits)
>> +{
>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> +   WARN_ON_ONCE(j >= nr_bits);
>> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +
>> +   if (online_mask)
>> +   return test_bit(j, online_mask);
>> +
>> +   if (j >= 0 && j < nr_bits)
>> +   return true;
>> +
>> +   return false;
>> +}
>> +
>> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
>> +unsigned int nr_bits)
>> +{
>> +   /* -1 is a legal arg here. */
>> +   if (n != -1) {
>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> +   WARN_ON_ONCE(n >= nr_bits);
>> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +   }
>> +
>> +   if (srcp)
>> +   return find_next_bit(srcp, nr_bits, n + 1);
>> +
>> +   return n + 1;
>> +}
>> +
>> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
>> +   const unsigned long *src2p,
>> +   unsigned int nr_bits)
>> +{
>> +   /* -1 is a legal arg here. */
>> +   if (n != -1) {
>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> +   WARN_ON_ONCE(n >= nr_bits);
>> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +   }
>> +
>> +   if (src1p && src2p)
>> +   return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
>> +   else if (src1p)
>> +   return find_next_bit(src1p, nr_bits, n + 1);
>> +   else if (src2p)
>> +   return find_next_bit(src2p, nr_bits, n + 1);
>> +
>> +   return n + 1;
>> +}
>>  #else
>>  static inline int netif_set_xps_queue(struct 

Re: [net-next PATCH 0/3] Symmetric queue selection using XPS for Rx queues

2018-05-08 Thread Nambiar, Amritha
On 5/8/2018 10:19 AM, Alexander Duyck wrote:
> On Tue, May 8, 2018 at 10:07 AM, Eric Dumazet  wrote:
>>
>>
>> On 05/08/2018 09:02 AM, Alexander Duyck wrote:
>>> On Tue, May 8, 2018 at 8:15 AM, Tom Herbert  wrote:
 On Thu, Apr 19, 2018 at 7:41 PM, Eric Dumazet  wrote:
> On Thu, Apr 19, 2018 at 6:07 PM Amritha Nambiar wrote:
>
>> This patch series implements support for Tx queue selection based on
>> Rx queue map. This is done by configuring Rx queue map per Tx-queue
>> using sysfs attribute. If the user configuration for Rx queues does
>> not apply, then the Tx queue selection falls back to XPS using CPUs and
>> finally to hashing.
>
>> XPS is refactored to support Tx queue selection based on either the
>> CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
>> enabled. By default no receive queues are configured for the Tx queue.
>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>
>> This is to enable sending packets on the same Tx-Rx queue pair as this
>> is useful for busy polling multi-threaded workloads where it is not
>> possible to pin the threads to a CPU. This is a rework of Sridhar's
>> patch for symmetric queueing via socket option:
>> https://www.spinics.net/lists/netdev/msg453106.html
>
 I suspect this is an artifact of flow director which I believe
 required queue pairs to be able to work (i.e. receive queue chose
 hardware is determined send queue). But that was only required because
 of hardware design, I don't see the rationale for introducing queue
 pairs in the software stack. There's no need to correlate the transmit
 path with receive path, no need to enforce a 1-1 mapping between RX
 and TX queues, and the OOO mitigations should be sufficient when TX
 queue changes for a flow.

 Tom
>>>
>>> If I am not mistaken I think there are benefits to doing this sort of
>>> thing with polling as it keeps the Tx work locked into the same queue
>>> pair that a given application is polling on. So as a result you can
>>> keep the interrupts contained to the queue pair that is being busy
>>> polled on and if the application cleans up the packets during the busy
>>> poll it ends up being a net savings in terms of both latency and power
>>> since the Tx clean-up happens sooner, and it happens on the queue that
>>> is already busy polling instead of possibly triggering an interrupt on
>>> another CPU.
>>>
>>> So for example in the case of routing and bridging workloads we
>>> already had code that would take the Rx queue and associate it to a Tx
>>> queue. One of the ideas behind doing this is to try and keep the CPU
>>> overhead low by having a 1:1 mapping. In the case of this code we
>>> allow for a little more flexibility in that you could have
>>> many-to-many mappings but the general idea and common use case is the
>>> same which is a 1:1 mapping.
>>
>>
>> I thought we had everything in place to be able to have this already.
>>
>> Setting IRQ affinities and XPS is certainly something doable.
>>
>> This is why I wanted a proper documentation of yet another way to reach the
>> same behavior.
> 
> IRQ affinities and XPS work for pure NAPI setups, but the problem is
> you have to also do application affinity in the case of busy polling
> which can provide some additional challenges since then you have to
> add code in your application to associate a given queue/CPU to a given
> application thread. I believe this is a way of simplifying this.
> 
> I agree on the documentation aspect. The usage of this should be well
> documented as well as the why of using it.
> 
> - Alex
> 
I'll submit another version of the series with the documentation added.

- Amritha


Re: [PATCH net] i40e: flower: check if TC offload is enabled on a netdev

2018-01-23 Thread Nambiar, Amritha
On 1/23/2018 12:08 AM, Jakub Kicinski wrote:
> Since TC block changes drivers are required to check if
> the TC hw offload flag is set on the interface themselves.
> 
> Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
> Fixes: 44ae12a768b7 ("net: sched: move the can_offload check from binding 
> phase to rule insertion phase")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 

Acked-by: Amritha Nambiar 

> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 42dcaefc4c19..af792112a2d3 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -7505,6 +7505,8 @@ static int i40e_setup_tc_cls_flower(struct 
> i40e_netdev_priv *np,
>  {
>   struct i40e_vsi *vsi = np->vsi;
>  
> + if (!tc_can_offload(vsi->netdev))
> + return -EOPNOTSUPP;
>   if (cls_flower->common.chain_index)
>   return -EOPNOTSUPP;
>  
> 


Re: [iproute2 PATCH] tc/mqprio: Offload mode and shaper options in mqprio

2017-11-01 Thread Nambiar, Amritha
On 10/31/2017 10:20 AM, Stephen Hemminger wrote:
> On Thu, 26 Oct 2017 17:02:42 -0700
> Amritha Nambiar  wrote:
> 
>> This patch was previously submitted as RFC. Submitting this as
>> non-RFC now that the tc/mqprio changes are accepted in net-next.
>>
>> Adds new mqprio options for 'mode' and 'shaper'. The mode
>> option can take values for offload modes such as 'dcb' (default),
>> 'channel' with the 'hw' option set to 1. The new 'channel' mode
>> supports offloading TCs and other queue configurations. The
>> 'shaper' option is to support HW shapers ('dcb' default) and
>> takes the value 'bw_rlimit' for bandwidth rate limiting. The
>> parameters to the bw_rlimit shaper are minimum and maximum
>> bandwidth rates. New HW shapers in future can be supported
>> through the shaper attribute.
>>
>> # tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
>>   queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
>>   min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
>>
>> # tc qdisc show dev eth0
>>
>> qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
>>  queues:(0:3) (4:7)
>>  mode:channel
>>  shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
>>
>> Signed-off-by: Amritha Nambiar 
> 
> 
> Please build test your patch with compiler checks enabled
> 
> CC   q_mqprio.o
> In file included from /usr/include/string.h:635:0,
>  from q_mqprio.c:20:
> In function ‘memcpy’,
> inlined from ‘mqprio_print_opt’ at q_mqprio.c:237:2:
> /usr/include/x86_64-linux-gnu/bits/string3.h:53:10: warning: call to 
> __builtin___memcpy_chk will always overflow destination buffer
>return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
>   ^~
> 
I will send v2 fixing this.


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v4 6/6] i40e: Enable cloud filters via tc-flower

2017-10-26 Thread Nambiar, Amritha
On 10/26/2017 2:35 PM, Shannon Nelson wrote:
> On 10/26/2017 2:29 PM, Nambiar, Amritha wrote:
>> On 10/11/2017 4:30 PM, Shannon Nelson wrote:
>>> On 10/10/2017 5:24 PM, Amritha Nambiar wrote:
> 
> [...]
> 
>>>> +  /* For Geneve, the VNI should be placed in offset shifted by a
>>>> +   * byte than the offset for the Tenant ID for rest of the
>>>
>>> This comment isn't quite clear - maybe s/than/then/ ?
>>
>> How about, "For Geneve, the VNI should begin at an offset shifted by a
>> byte compared to the offset normally used for the Tenant ID for other
>> tunnel types" ?
> 
> Or maybe "Due to hardware eccentricities, the VNI for Geneve is shifted 
> one more byte further than normally used for Tenant ID in other tunnel 
> types"

Will do. Thanks.

> 
> [...]
> 
>>>> +  dev_err(&pf->pdev->dev, "Bad src port mask 0x%04x\n",
>>>
>>> Consider using %pI4
>>
>> For ports, I'll keep the 0x%04x. I'll change to %pI4 for the IPv4
>> address masks.
> 
> Oh, right, that one's a port.
> 
> sln
> 


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v4 6/6] i40e: Enable cloud filters via tc-flower

2017-10-26 Thread Nambiar, Amritha
On 10/11/2017 4:30 PM, Shannon Nelson wrote:
> On 10/10/2017 5:24 PM, Amritha Nambiar wrote:
>> This patch enables tc-flower based hardware offloads. tc flower
>> filter provided by the kernel is configured as driver specific
>> cloud filter. The patch implements functions and admin queue
>> commands needed to support cloud filters in the driver and
>> adds cloud filters to configure these tc-flower filters.
>>
>> The classification function of the filter is to direct matched
>> packets to a traffic class which is set based on the offloaded
>> tc-flower classid. The approach here is similar to the tc 'prio'
>> qdisc which uses the classid for band selection. The ingress qdisc
>> is called ffff:0, so traffic classes are ffff:1 to ffff:8 (i40e
>> has a max of 8 TCs). TC0 is minor number 1, TC1 is minor number 2, etc.
>>
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> Match Dst MAC and route to TC0:
>> # tc filter add dev eth0 protocol ip parent ffff:\
>>prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>classid ffff:1
>>
>> Match Dst IPv4,Dst Port and route to TC1:
>> # tc filter add dev eth0 protocol ip parent ffff:\
>>prio 2 flower dst_ip 192.168.3.5/32\
>>ip_proto udp dst_port 25 skip_sw\
>>classid ffff:2
>>
>> Match Dst IPv6,Dst Port and route to TC1:
>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>prio 3 flower dst_ip fe8::200:1\
>>ip_proto udp dst_port 66 skip_sw\
>>classid ffff:2
>>
>> Delete tc flower filter:
>> Example:
>>
>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>> # tc filter del dev eth0 parent ffff:
>>
>> Flow Director Sideband is disabled while configuring cloud filters
>> via tc-flower and until any cloud filter exists.
>>
>> Unsupported matches when cloud filters are added using enhanced
>> big buffer cloud filter mode of underlying switch include:
>> 1. source port and source IP
>> 2. Combined MAC address and IP fields.
>> 3. Not specifying L4 port
>>
>> These filter matches can however be used to redirect traffic to
>> the main VSI (tc 0) which does not require the enhanced big buffer
>> cloud filter support.
>>
>> v4: Use classid to set traffic class for matched packets. Do not
>> allow disabling hw-tc-offloads when offloaded tc filters are active.
>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>> __be32 array instead of u8 array. Used macro for IP version. Minor
>> formatting changes.
>> v2:
>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>> 2. Moved dev_info for add/deleting cloud filters in else condition
>> 3. Fixed some format specifier in dev_err logs
>> 4. Refactored i40e_get_capabilities to take an additional
>> list_type parameter and use it to query device and function
>> level capabilities.
>> 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>> to verify if redirect to a traffic class is supported.
>> 6. Added comments for Geneve fix in cloud filter big buffer AQ
>> function definitions.
>> 7. Cleaned up setup_tc interface to rebase and work with Jiri's
>> updates, separate function to process tc cls flower offloads.
>> 8. Changes to make Flow Director Sideband and Cloud filters mutually
>> exclusive.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> Signed-off-by: Anjali Singhai Jain 
>> Signed-off-by: Jingjing Wu 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e.h |   45 +
>>   drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |3
>>   drivers/net/ethernet/intel/i40e/i40e_common.c  |  189 
>>   drivers/net/ethernet/intel/i40e/i40e_main.c|  913 
>> +++-
>>   drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16
>>   drivers/net/ethernet/intel/i40e/i40e_type.h|1
>>   .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|3
>>   7 files changed, 1140 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index b938bb4a..c3f1312 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -55,6 +55,8 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> +#include 
>>   #include "i40e_type.h"
>>   #include "i40e_prototype.h"
>>   #include "i40e_client.h"
>> @@ -253,9 +255,48 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>>   };
>>   
>> +#define IPV4_VERSION 4
>> +#define IPV6_VERSION 6
> 
> Why bother with yet-another-ip-type name?  Just use the existing 
> ETH_P_IP and ETH_P_IPV6.
> 
>> +
>> +#define I40E_CLOUD_FIELD_OMAC   0x01
>> +#define I40E_CLOUD_FIELD_IMAC   0x02
>> +#define I40E_CLOUD_FIELD_IVLAN  0x04
>> +#define I40E_CLOUD_FIELD_TEN_ID 0x08
>> +#define I40E_CLOUD_FIELD_IIP0x10
>> +
>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC

Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v4 3/6] i40e: Cloud filter mode for set_switch_config command

2017-10-26 Thread Nambiar, Amritha
On 10/11/2017 4:30 PM, Shannon Nelson wrote:
> On 10/10/2017 5:24 PM, Amritha Nambiar wrote:
>> Add definitions for L4 filters and switch modes based on cloud filters
>> modes and extend the set switch config command to include the
>> additional cloud filter mode.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |   30 
>> -
>>   drivers/net/ethernet/intel/i40e/i40e_common.c |4 ++-
>>   drivers/net/ethernet/intel/i40e/i40e_ethtool.c|2 +
>>   drivers/net/ethernet/intel/i40e/i40e_main.c   |2 +
>>   drivers/net/ethernet/intel/i40e/i40e_prototype.h  |2 +
>>   drivers/net/ethernet/intel/i40e/i40e_type.h   |9 ++
>>   6 files changed, 44 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
>> b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> index 6a5db1b..729976b 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> @@ -790,7 +790,35 @@ struct i40e_aqc_set_switch_config {
>>   */
>>  __le16  first_tag;
>>  __le16  second_tag;
>> -u8  reserved[6];
>> +/* Next byte is split into following:
>> + * Bit 7 : 0: No action, 1: Switch to mode defined by bits 6:0
>> + * Bit 6: 0 : Destination Port, 1: source port
>> + * Bit 5..4: L4 type
> 
> Can you tweak the formatting on these comments to line up the first 
> couple of ':'s?
> 
>> + * 0: rsvd
>> + * 1: TCP
>> + * 2: UDP
>> + * 3: Both TCP and UDP
>> + * Bits 3:0 Mode
>> + * 0: default mode
>> + * 1: L4 port only mode
>> + * 2: non-tunneled mode
>> + * 3: tunneled mode
>> + */
>> +#define I40E_AQ_SET_SWITCH_BIT7_VALID   0x80
>> +
>> +#define I40E_AQ_SET_SWITCH_L4_SRC_PORT  0x40
>> +
>> +#define I40E_AQ_SET_SWITCH_L4_TYPE_RSVD 0x00
>> +#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP  0x10
>> +#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP  0x20
>> +#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH 0x30
>> +
>> +#define I40E_AQ_SET_SWITCH_MODE_DEFAULT 0x00
>> +#define I40E_AQ_SET_SWITCH_MODE_L4_PORT 0x01
>> +#define I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL  0x02
>> +#define I40E_AQ_SET_SWITCH_MODE_TUNNEL  0x03
>> +u8  mode;
>> +u8  rsvd5[5];
>>   };
>>   
>>   I40E_CHECK_CMD_LENGTH(i40e_aqc_set_switch_config);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_common.c
>> index 1b85eb3..0b3c5b7 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_common.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
>> @@ -2402,13 +2402,14 @@ i40e_status i40e_aq_get_switch_config(struct i40e_hw 
>> *hw,
>>* @hw: pointer to the hardware structure
>>* @flags: bit flag values to set
>>* @valid_flags: which bit flags to set
>> + * @mode: cloud filter mode
>>* @cmd_details: pointer to command details structure or NULL
>>*
>>* Set switch configuration bits
>>**/
>>   enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw,
>>  u16 flags,
>> -u16 valid_flags,
>> +u16 valid_flags, u8 mode,
>>  struct i40e_asq_cmd_details *cmd_details)
>>   {
>>  struct i40e_aq_desc desc;
>> @@ -2420,6 +2421,7 @@ enum i40e_status_code i40e_aq_set_switch_config(struct 
>> i40e_hw *hw,
>>i40e_aqc_opc_set_switch_config);
>>  scfg->flags = cpu_to_le16(flags);
>>  scfg->valid_flags = cpu_to_le16(valid_flags);
>> +scfg->mode = mode;
>>  if (hw->flags & I40E_HW_FLAG_802_1AD_CAPABLE) {
>>  scfg->switch_tag = cpu_to_le16(hw->switch_tag);
>>  scfg->first_tag = cpu_to_le16(hw->first_tag);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> index a760d75..37ca294 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> @@ -4341,7 +4341,7 @@ static int i40e_set_priv_flags(struct net_device *dev, 
>> u32 flags)
>>  sw_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
>>  valid_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
>>  ret = i40e_aq_set_switch_config(&pf->hw, sw_flags, valid_flags,
>> -NULL);
>> +0, NULL);
>>  if (ret && pf->hw.aq.asq_last_status != I40E_AQ_RC_ESRCH) {
>>  dev_info(&pf->pdev->dev,
>>   "couldn't set switch config bits, err %s 
>> aq_err %s\n",
>> diff --git 

Re: [patch net-next v2 00/20] net: sched: convert cls ndo_setup_tc offload calls to per-block callbacks

2017-10-26 Thread Nambiar, Amritha
On 10/25/2017 5:15 AM, Jiri Pirko wrote:
> Tue, Oct 24, 2017 at 06:01:34PM CEST, alexander.du...@gmail.com wrote:
> 
> [...]
> 
>> 1.   To offload filter into HW, the hw-tc-offload feature flag has
>> to be turned on before creating the ingress qdisc.
>>
>> Previously, this could also be turned on after the qdisc was created
>> and the filters could still be offloaded. Looks like this is because,
>> previously the offload flag was checked as a part of filter
>> integration in the classifier, and now it is checked as part of qdisc
>> creation (ingress_init). So, if no offload capability is advertised at
>> ingress qdisc creation time then hardware will not be asked to offload
>> filters later if the flag is enabled.
> 
> I have patchset that fixes this in my queue now. Will do some smoke
> testing and send later today.
> 
> 
>>
>> 2.   Deleting the ingress qdisc fails to remove filters added in
>> HW. Filters in SW gets deleted.
>>
>> We haven’t exactly root-caused this, the changes being extensive, but
>> our guess is again something wrong with the offload check or similar
>> while unregistering the block callback (tcf_block_cb_unregister) and
>> further to the classifier (CLS_U32/CLS_FLOWER etc.) with the
>> DESTROY/REMOVE command.
> 
> Hmm. How does this worked previously. I mean, do you see change of
> behaviour? I'm asking because I don't see how rules added only to HW
> could be removed, driver should care of it. Or are you talking about
> rules added to both SW and HW?

These are rules added to both SW and HW. Previously all cls_* had
ndo_setup_tc calls based on the offload capability.

commit 8d26d5636d "net: sched: avoid ndo_setup_tc calls for
TC_SETUP_CLS*" removed this bit to work with the new block callback. Is
there something similar in the block callback flow while acting on the
tcf_proto destroy call initiated when the qdisc is cleared?

> 
> Thanks!
> 


Re: [patch net-next v2 00/20] net: sched: convert cls ndo_setup_tc offload calls to per-block callbacks

2017-10-24 Thread Nambiar, Amritha
On 10/24/2017 10:02 AM, Or Gerlitz wrote:
> On Tue, Oct 24, 2017 at 7:24 PM, Alexander Duyck wrote:
>
> On Tue, Oct 24, 2017 at 9:01 AM, Alexander Duyck wrote:
>
> 
> > We are getting internal reports of regressions being seen with this
> > patch set applied. Specifically the issues below have been pointed out
> > to me. My understanding is that these issues are all being reported on
> > ixgbe:
> >
> > 1.       To offload filter into HW, the hw-tc-offload feature flag has
> > to be turned on before creating the ingress qdisc.
> 
>  
> 
> > 2.       Deleting the ingress qdisc fails to remove filters added in
> > HW. Filters in SW gets deleted.
> 
> 
> 
> FWIW, I don't think this is what you're hitting, but please note there
> are two fixes to the series in net-next
> 
> 9d452ce net/sched: Fix actions list corruption when adding offloaded tc
> flows

It looks like this patch just got applied, this fix wasn't included
during our testing.

> 0843c09 net/sched: Set the net-device for egress device instance

This fix was already in place during our testing. Also, the filters
added were matching on IPv4 addresses with action drop.

> I had to look at other stuff today, but I will keep stress-testing to
> see if/what is broken w.r.t. mlx5
> and help to get things to work later this and next week so by netdev
> we'll be fine
> 
> Or.
> 
> 


Re: [Intel-wired-lan] [PATCH] [v2] i40e: avoid 64-bit division where possible

2017-10-17 Thread Nambiar, Amritha
On 10/17/2017 10:33 AM, Alexander Duyck wrote:
> On Tue, Oct 17, 2017 at 8:49 AM, Arnd Bergmann  wrote:
>> The new bandwidth calculation caused a link error on 32-bit
>> architectures, like
>>
>> ERROR: "__aeabi_uldivmod" [drivers/net/ethernet/intel/i40e/i40e.ko] 
>> undefined!
>>
>> The problem is the max_tx_rate calculation that uses 64-bit integers.
>> This is not really necessary since the numbers are in MBit/s so
>> they won't be higher than 40000 for the highest supported rate, and
>> are guaranteed to not exceed 2^32 in future generations either.
>>
>> Another patch from Alan Brady fixed the link error by adding
>> many calls to do_div(), which makes the code less efficent and
>> less readable than necessary.
>>
>> This changes the representation to 'u32' when dealing with MBit/s
>> and uses div_u64() to convert from u64 numbers in byte/s, reverting
>> parts of Alan's earlier fix that have become obsolete now.
>>

This patch breaks the functionality when converting the rates in
bytes/s provided by the tc layer into Mbit/s in the driver.
I40E_BW_MBPS_DIVISOR defined in Alan's patch should be used for the
conversion, and not I40E_BW_CREDIT_DIVISOR which does the incorrect
math. I40E_BW_CREDIT_DIVISOR is in place because the device uses credit
rates in values of 50Mbps.

>> Fixes: 2027d4deacb1 ("i40e: Add support setting TC max bandwidth rates")
>> Fixes: 73983b5ae011 ("i40e: fix u64 division usage")
>> Cc: Alan Brady 
>> Signed-off-by: Arnd Bergmann 
> 
> So this patch looks good to me, we just need to test it to verify it
> doesn't break existing functionality. In the meantime if Alan's patch
> has gone through testing we should probably push "i40e: fix u64
> division usage" to Dave so that we can at least fix the linking issues
> on ARM and i386.
> 
> Reviewed-by: Alexander Duyck 
> 
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h  |  4 +-
>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 70 
>> +++--
>>  2 files changed, 27 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index c3f13120f3ce..c7aa0c982273 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -407,7 +407,7 @@ struct i40e_channel {
>> u8 enabled_tc;
>> struct i40e_aqc_vsi_properties_data info;
>>
>> -   u64 max_tx_rate;
>> +   u32 max_tx_rate; /* in Mbits/s */
>>
>> /* track this channel belongs to which VSI */
>> struct i40e_vsi *parent_vsi;
>> @@ -1101,5 +1101,5 @@ static inline bool i40e_enabled_xdp_vsi(struct 
>> i40e_vsi *vsi)
>>  }
>>
>>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel 
>> *ch);
>> -int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>> +int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u32 max_tx_rate);
>>  #endif /* _I40E_H_ */
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index 3ceda140170d..57682cc78508 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -5448,17 +5448,16 @@ int i40e_get_link_speed(struct i40e_vsi *vsi)
>>   *
>>   * Helper function to set BW limit for a given VSI
>>   **/
>> -int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate)
>> +int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u32 max_tx_rate)
>>  {
>> struct i40e_pf *pf = vsi->back;
>> -   u64 credits = 0;
>> int speed = 0;
>> int ret = 0;
>>
>> speed = i40e_get_link_speed(vsi);
>> if (max_tx_rate > speed) {
>> dev_err(&pf->pdev->dev,
>> -   "Invalid max tx rate %llu specified for VSI seid 
>> %d.",
>> +   "Invalid max tx rate %u specified for VSI seid %d.",
>> max_tx_rate, seid);
>> return -EINVAL;
>> }
>> @@ -5469,13 +5468,12 @@ int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 
>> seid, u64 max_tx_rate)
>> }
>>
>> /* Tx rate credits are in values of 50Mbps, 0 is disabled */
>> -   credits = max_tx_rate;
>> -   do_div(credits, I40E_BW_CREDIT_DIVISOR);
>> -   ret = i40e_aq_config_vsi_bw_limit(&pf->hw, seid, credits,
>> +   ret = i40e_aq_config_vsi_bw_limit(&pf->hw, seid,
>> + max_tx_rate / 
>> I40E_BW_CREDIT_DIVISOR,
>>   I40E_MAX_BW_INACTIVE_ACCUM, NULL);
>> if (ret)
>> dev_err(&pf->pdev->dev,
>> -   "Failed set tx rate (%llu Mbps) for vsi->seid %u, 
>> err %s aq_err %s\n",
>> +   "Failed set tx rate (%u Mbps) for vsi->seid %u, err 
>> %s aq_err %s\n",
>> max_tx_rate, seid, i40e_stat_str(&pf->hw, ret),
>>  

Re: [net-next 4/9] i40e: Enable 'channel' mode in mqprio for TC configs

2017-10-16 Thread Nambiar, Amritha
On 10/16/2017 1:53 AM, Yunsheng Lin wrote:
> Hi, Jeff
> 
> On 2017/10/14 5:52, Jeff Kirsher wrote:
>> From: Amritha Nambiar 
>>
>> The i40e driver is modified to enable the new mqprio hardware
>> offload mode and factor the TCs and queue configuration by
>> creating channel VSIs. In this mode, the priority to traffic
>> class mapping and the user specified queue ranges are used
>> to configure the traffic classes by setting the mode option to
>> 'channel'.
>>
>> Example:
>>   map 0 0 0 0 1 2 2 3 queues 2@0 2@2 1@4 1@5\
>>   hw 1 mode channel
>>
>> qdisc mqprio 8038: root  tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0
>>  queues:(0:1) (2:3) (4:4) (5:5)
>>  mode:channel
>>  shaper:dcb
>>
>> The HW channels created are removed and all the queue configuration
>> is set to default when the qdisc is detached from the root of the
>> device.
>>
>> This patch also disables setting up channels via ethtool (ethtool -L)
>> when the TCs are configured using mqprio scheduler.
>>
>> The patch also limits setting ethtool Rx flow hash indirection
>> (ethtool -X eth0 equal N) to max queues configured via mqprio.
>> The Rx flow hash indirection input through ethtool should be
>> validated so that it is within the queue range configured via
>> tc/mqprio. The bound checking is achieved by reporting the current
>> rss size to the kernel when queues are configured via mqprio.
>>
>> Example:
>>   map 0 0 0 1 0 2 3 0 queues 2@0 4@2 8@6 11@14\
>>   hw 1 mode channel
>>
>> Cannot set RX flow hash configuration: Invalid argument
>>
>> Signed-off-by: Amritha Nambiar 
>> Tested-by: Andrew Bowers 
>> Signed-off-by: Jeff Kirsher 
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   8 +-
>>  drivers/net/ethernet/intel/i40e/i40e_main.c| 457 
>> +++--
>>  3 files changed, 362 insertions(+), 106 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index bde982541772..024c88474951 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -54,6 +54,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include "i40e_type.h"
>>  #include "i40e_prototype.h"
>>  #include "i40e_client.h"
>> @@ -700,6 +701,7 @@ struct i40e_vsi {
>>  enum i40e_vsi_type type;  /* VSI type, e.g., LAN, FCoE, etc */
>>  s16 vf_id;  /* Virtual function ID for SRIOV VSIs */
>>  
>> +struct tc_mqprio_qopt_offload mqprio_qopt; /* queue parameters */
>>  struct i40e_tc_configuration tc_config;
>>  struct i40e_aqc_vsi_properties_data info;
>>  
>> @@ -725,6 +727,7 @@ struct i40e_vsi {
>>  u16 cnt_q_avail;/* num of queues available for channel usage */
>>  u16 orig_rss_size;
>>  u16 current_rss_size;
>> +bool reconfig_rss;
>>  
>>  u16 next_base_queue;/* next queue to be used for channel setup */
>>  
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> index afd3ca8d9851..72d5f2cdf419 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> @@ -2652,7 +2652,7 @@ static int i40e_get_rxnfc(struct net_device *netdev, 
>> struct ethtool_rxnfc *cmd,
>>  
>>  switch (cmd->cmd) {
>>  case ETHTOOL_GRXRINGS:
>> -cmd->data = vsi->num_queue_pairs;
>> +cmd->data = vsi->rss_size;
>>  ret = 0;
>>  break;
>>  case ETHTOOL_GRXFH:
>> @@ -3897,6 +3897,12 @@ static int i40e_set_channels(struct net_device *dev,
>>  if (vsi->type != I40E_VSI_MAIN)
>>  return -EINVAL;
>>  
>> +/* We do not support setting channels via ethtool when TCs are
>> + * configured through mqprio
>> + */
>> +if (pf->flags & I40E_FLAG_TC_MQPRIO)
>> +return -EINVAL;
>> +
>>  /* verify they are not requesting separate vectors */
>>  if (!count || ch->rx_count || ch->tx_count)
>>  return -EINVAL;
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index e23105bee6d1..e803aa1552c6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -1588,6 +1588,170 @@ static int i40e_set_mac(struct net_device *netdev, 
>> void *p)
>>  return 0;
>>  }
>>  
>> +/**
>> + * i40e_config_rss_aq - Prepare for RSS using AQ commands
>> + * @vsi: vsi structure
>> + * @seed: RSS hash seed
>> + **/
> 
> [...]
> 
>> + * i40e_vsi_set_default_tc_config - set default values for tc configuration
>> + * @vsi: the VSI being configured
>> + **/
>> +static void i40e_vsi_set_default_tc_config(struct i40e_vsi *vsi)
>> +{
>> +u16 qcount;
>> +int i;
>> +
>> +   

Re: [jkirsher/next-queue PATCH v4 0/6] tc-flower based cloud filters in i40e

2017-10-11 Thread Nambiar, Amritha
On 10/11/2017 5:42 AM, Jamal Hadi Salim wrote:
> On 17-10-10 08:24 PM, Amritha Nambiar wrote:
>> This patch series enables configuring cloud filters in i40e
>> using the tc-flower classifier. The classification function
>> of the filter is to match a packet to a class. cls_flower is
>> extended to offload classid to hardware. The offloaded classid
>> is used to direct matched packets to a traffic class on the device.
>> The approach here is similar to the tc 'prio' qdisc which uses
>> the classid for band selection. The ingress qdisc is called ffff:0,
>> so traffic classes are ffff:1 to ffff:8 (i40e has a max of 8 TCs).
>> TC0 is minor number 1, TC1 is minor number 2, etc.
>>
>> The cloud filters are added for a VSI and are cleaned up when
>> the VSI is deleted. The filters that match on L4 ports needs
>> enhanced admin queue functions with big buffer support for
>> extended fields in cloud filter commands.
>>
>> Example:
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> Match Dst IPv4,Dst Port and route to TC1:
>> # tc filter add dev eth0 protocol ip parent ffff: prio 1 flower\
>>dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>skip_sw classid ffff:2
>>
>> # tc filter show dev eth0 parent ffff:
>> filter pref 1 flower chain 0
>> filter pref 1 flower chain 0 handle 0x1 classid ffff:2
>>eth_type ipv4
>>ip_proto udp
>>dst_ip 192.168.1.1
>>dst_port 22
>>skip_sw
>>in_hw
>>
> 
> Much much better semantic. Thank you.
> Have you tested many filter mapping to the same classid?

Yes, I have tested mapping different filters to the same classID;
packets matching the flows were assigned the same classID and routed to
the same traffic class in HW.

filter pref 1 flower chain 0
filter pref 1 flower chain 0 handle 0x1 classid ffff:2
  dst_mac 3c:fd:fe:a0:d6:70
  eth_type ipv4
  ip_proto udp
  dst_port 12000
  in_hw
filter pref 5 flower chain 0
filter pref 5 flower chain 0 handle 0x1 classid ffff:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.1.1
  dst_port 12000
  in_hw

> 
> cheers,
> jamal
> 


Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower

2017-09-28 Thread Nambiar, Amritha
On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote:
>>> This patch enables tc-flower based hardware offloads. tc flower
>>> filter provided by the kernel is configured as driver specific
>>> cloud filter. The patch implements functions and admin queue
>>> commands needed to support cloud filters in the driver and
>>> adds cloud filters to configure these tc-flower filters.
>>>
>>> The only action supported is to redirect packets to a traffic class
>>> on the same device.
>>
>> So basically you are not doing redirect, you are just setting tclass for
>> matched packets, right? Why you use mirred for this? I think that
>> you might consider extending g_act for that:
>>
>> # tc filter add dev eth0 protocol ip ingress \
>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>   action tclass 0
>>
> Yes, this doesn't work like a typical egress redirect, but is aimed at
> forwarding the matched packets to a different queue-group/traffic class
> on the same device, so some sort-of ingress redirect in the hardware. I
> possibly may not need the mirred-redirect as you say, I'll look into the
> g_act way of doing this with a new gact tc action.
> 

I was looking at introducing a new gact tclass action to TC. In the HW
offload path, this sets a traffic class value for certain matched
packets so they will be processed in a queue belonging to the traffic class.

# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw\
  action tclass 2

But, I'm having trouble defining what this action means in the kernel
datapath. For ingress, this action could just take the default path and
do nothing and only have meaning in the HW offloaded path. For egress,
certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
for band selection, while the 'mqprio' qdisc selects the traffic class
based on the skb priority in netdev_pick_tx(), so what would this action
mean for the 'mqprio' qdisc?

It looks like the 'prio' qdisc uses band selection based on the
'classid', so I was thinking of using the 'classid' through the cls
flower filter and offloading it to HW as the traffic class index; this
way we would have the same behavior in HW offload and SW fallback, and
there would be no need for a separate tc action.

In HW:
# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw classid 1:2

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_sw
  in_hw

This will be used to route packets to traffic class 2.

In SW:
# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_hw classid 1:2

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_hw
  not_in_hw

>>
>>>
>>> # tc qdisc add dev eth0 ingress
>>> # ethtool -K eth0 hw-tc-offload on
>>>
>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 0
>>>
>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>  prio 2 flower dst_ip 192.168.3.5/32\
>>>  ip_proto udp dst_port 25 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>>  prio 3 flower dst_ip fe8::200:1\
>>>  ip_proto udp dst_port 66 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> Delete tc flower filter:
>>> Example:
>>>
>>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>>> # tc filter del dev eth0 parent ffff:
>>>
>>> Flow Director Sideband is disabled while configuring cloud filters
>>> via tc-flower and until any cloud filter exists.
>>>
>>> Unsupported matches when cloud filters are added using enhanced
>>> big buffer cloud filter mode of underlying switch include:
>>> 1. source port and source IP
>>> 2. Combined MAC address and IP fields.
>>> 3. Not specifying L4 port
>>>
>>> These filter matches can however be used to redirect traffic to
>>> the main VSI (tc 0) which does not require the enhanced big buffer
>>> cloud filter support.
>>>
>>> v3: Cleaned up some lengthy function names. Changed ipv

Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower

2017-09-14 Thread Nambiar, Amritha
On 9/13/2017 6:26 AM, Jiri Pirko wrote:
> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote:
>> This patch enables tc-flower based hardware offloads. tc flower
>> filter provided by the kernel is configured as driver specific
>> cloud filter. The patch implements functions and admin queue
>> commands needed to support cloud filters in the driver and
>> adds cloud filters to configure these tc-flower filters.
>>
>> The only action supported is to redirect packets to a traffic class
>> on the same device.
> 
> So basically you are not doing redirect, you are just setting tclass for
> matched packets, right? Why you use mirred for this? I think that
> you might consider extending g_act for that:
> 
> # tc filter add dev eth0 protocol ip ingress \
>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>   action tclass 0
> 
Yes, this doesn't work like a typical egress redirect, but is aimed at
forwarding the matched packets to a different queue-group/traffic class
on the same device, so some sort-of ingress redirect in the hardware. I
possibly may not need the mirred-redirect as you say, I'll look into the
g_act way of doing this with a new gact tc action.

> 
>>
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> # tc filter add dev eth0 protocol ip parent ffff:\
>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>  action mirred ingress redirect dev eth0 tclass 0
>>
>> # tc filter add dev eth0 protocol ip parent ffff:\
>>  prio 2 flower dst_ip 192.168.3.5/32\
>>  ip_proto udp dst_port 25 skip_sw\
>>  action mirred ingress redirect dev eth0 tclass 1
>>
>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>  prio 3 flower dst_ip fe8::200:1\
>>  ip_proto udp dst_port 66 skip_sw\
>>  action mirred ingress redirect dev eth0 tclass 1
>>
>> Delete tc flower filter:
>> Example:
>>
>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>> # tc filter del dev eth0 parent ffff:
>>
>> Flow Director Sideband is disabled while configuring cloud filters
>> via tc-flower and until any cloud filter exists.
>>
>> Unsupported matches when cloud filters are added using enhanced
>> big buffer cloud filter mode of underlying switch include:
>> 1. Source port and source IP.
>> 2. Combined MAC address and IP fields.
>> 3. Not specifying an L4 port.
>>
>> These filter matches can however be used to redirect traffic to
>> the main VSI (tc 0) which does not require the enhanced big buffer
>> cloud filter support.
>>
>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>> __be32 array instead of u8 array. Used macro for IP version. Minor
>> formatting changes.
>> v2:
>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>> 2. Moved dev_info for add/deleting cloud filters in else condition
>> 3. Fixed some format specifier in dev_err logs
>> 4. Refactored i40e_get_capabilities to take an additional
>>   list_type parameter and use it to query device and function
>>   level capabilities.
>> 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>>   to verify if redirect to a traffic class is supported.
>> 6. Added comments for Geneve fix in cloud filter big buffer AQ
>>   function definitions.
>> 7. Cleaned up setup_tc interface to rebase and work with Jiri's
>>   updates, separate function to process tc cls flower offloads.
>> 8. Changes to make Flow Director Sideband and Cloud filters mutually
>>   exclusive.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> Signed-off-by: Anjali Singhai Jain 
>> Signed-off-by: Jingjing Wu 
>> ---
>> drivers/net/ethernet/intel/i40e/i40e.h |   49 +
>> drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |3 
>> drivers/net/ethernet/intel/i40e/i40e_common.c  |  189 
>> drivers/net/ethernet/intel/i40e/i40e_main.c|  971 
>> +++-
>> drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16 
>> drivers/net/ethernet/intel/i40e/i40e_type.h|1 
>> .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|3 
>> 7 files changed, 1202 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 6018fb6..b110519 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -55,6 +55,8 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> +#include 
>> #include "i40e_type.h"
>> #include "i40e_prototype.h"
>> #include "i40e_client.h"
>> @@ -252,9 +254,52 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>> };
>>
>> +#define IPV4_VERSION 4
>> +#define IPV6_VERSION 6
>> +
>> +#define I40E_CLOUD_FIELD_OMAC   0x01
>> +#define I40E_CLOUD_FIELD_IMAC   0x02
>> +#define I40E_CLOUD_FIELD_IVLAN  0x04
>> +#define I40E_CLOUD_FIELD_TEN_ID 0x08
>> +#define I40E_CLOUD_FIELD_IIP0x10
>> +
>> +#define 

Re: [RFC PATCH v3 2/7] sched: act_mirred: Traffic class option for mirror/redirect action

2017-09-14 Thread Nambiar, Amritha
On 9/13/2017 6:18 AM, Jiri Pirko wrote:
> Wed, Sep 13, 2017 at 11:59:24AM CEST, amritha.namb...@intel.com wrote:
>> Adds optional traffic class parameter to the mirror/redirect action.
>> The mirror/redirect action is extended to forward to a traffic
>> class on the device if the traffic class index is provided in
>> addition to the device's ifindex.
> 
> Do I understand it correctly that you just abuse mirred to pass the tclass
> index down to the driver, without actually doing anything with the value
> inside the mirred code? That is a bit confusing for me.
> 

I think I get your point. I was looking at it more from a hardware
angle, and the 'redirect' action looked quite close to how this actually
works in the hardware. I agree the tclass value in the mirred code is
not useful for anything other than offloading it.


Re: [RFC PATCH v3 0/6] Configuring traffic classes via new hardware offload mechanism in tc/mqprio

2017-09-07 Thread Nambiar, Amritha
On 9/7/2017 11:34 AM, Florian Fainelli wrote:
> On 09/07/2017 04:00 AM, Amritha Nambiar wrote:
>> The following series introduces a new hardware offload mode in 
>> tc/mqprio where the TCs, the queue configurations and bandwidth 
>> rate limits are offloaded to the hardware. The existing mqprio 
>> framework is extended to configure the queue counts and layout and
>>  also added support for rate limiting. This is achieved through new
>>  netlink attributes for the 'mode' option which takes values such
>> as 'dcb' (default) and 'channel' and a 'shaper' option for QoS 
>> attributes such as bandwidth rate limits in hw mode 1.
> 
> So "dcb" defines a default priorities to queue mapping?

In the default offload implementation, only basic hw offload was
supported, factoring in only the 'number of TCs' value; the rest of the
queue configuration was ignored. This is the legacy behavior with
hw mode set to 1.
Example:
# tc qdisc add dev eth0 root mqprio num_tc 4  map 0 0 0 0 1 1 1 1 queues
4@0 4@4 hw 1
I just named this default behavior to 'dcb' while introducing a new
offload mechanism.

> 
>> Legacy devices can fall back to the existing setup supporting hw 
>> mode 1 without these additional options where only the TCs are 
>> offloaded and then the 'mode' and 'shaper' options defaults to DCB
>>  support.
> 
> That's the last part that confuses me, see below.

As I introduced new options for 'mode' and 'shaper', I set the defaults
for these options to 'dcb' so existing offloaders can continue to work
without supporting these new options. Patch 1 has a detailed description
on how this is done.

> 
>> The i40e driver enables the new mqprio hardware offload mechanism 
>> factoring the TCs, queue configuration and bandwidth rates by 
>> creating HW channel VSIs.
> 
> I am really confused by what you call hw_mode 1, as I understand it 
> there are really 3 different modes:

There are actually two modes now with the 'hw' option set to 1:
legacy/dcb and channel.
> 
> - legacy: you don't define any traffic class mapping, but you can 
> still chain this scheduler with a match + action (like what 
> Documentation/networking/multiqueue.txt) you can optionally also add 
> "shaper" arguments, but there should not be any default DCB queue 
> mapping either?
> 
> - dcb: a default mapping for traffic classes to queues is defined, 
> optional "shaper" arguments

The legacy mode now becomes the dcb mode. In this mode, although the TC
values, the queue configurations and the prio-to-TC mapping are all
offloaded to the device, the existing implementation in current drivers
supports only a basic hw offload factoring in only the TC values.

Examples:
# ... num_tc 2  map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1
# ... num_tc 2  map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1 mode dcb
# ... num_tc 2  map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1 mode dcb\
  shaper dcb

> 
> - channel: (maybe calling that "custom_tc_map" would be clearer?) 
> where you express the exact traffic classes to queue mapping and 
> optional "shaper" arguments

In the channel mode, a full hw offload is supported: the TC values, the
queue configurations and, optionally, QoS attributes are all used in the
new implementation.

Examples:
# ... num_tc 2  map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1 mode channel
# ... num_tc 2  map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1 mode channel\
shaper bw_rlimit max_rate 4Gbit 5Gbit

> 
> I think that's what you are doing, but I just got confused by the 
> cover letter.
> 
>> 
>> In this new mode, the priority to traffic class mapping and the 
>> user specified queue ranges are used to configure the traffic class
>> when the 'mode' option is set to 'channel'. This is achieved by
>> creating HW channels(VSI). A new channel is created for each of the
>> traffic class configuration offloaded via mqprio framework except
>> for the first TC (TC0) which is for the main VSI. TC0 for the main
>> VSI is also reconfigured as per user provided queue parameters.
>> Finally, bandwidth rate limits are set on these traffic classes
>> through the shaper attribute by sending these rates in addition to
>> the number of TCs and the queue configurations.
>> 
>> Example: # tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1
>> 1 1 1\ queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
> 
> Do you see a case where you can declare a different number of
> traffic classes say 4 and map them onto just 2 hardware queues? If
> not, it seems a tiny bit redundant to have to specify both the map
> and the queue mapping should be sufficient, right?

Such input will fail validation of the user input and be treated as an
invalid configuration. The 'map' specifies the mapping
between user priorities and traffic classes, while the queue mapping is
the queue layout specifying the queue counts and offsets.

> 
>> min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
>> 
>> To dump the bandwidth rates:
>> 
>> # tc qdisc show dev eth0
>> 
>> qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 

Re: [Intel-wired-lan] [RFC PATCH v3 0/6] Configuring traffic classes via new hardware offload mechanism in tc/mqprio

2017-09-07 Thread Nambiar, Amritha
On 9/7/2017 9:45 AM, Shannon Nelson wrote:
> On 9/7/2017 4:00 AM, Amritha Nambiar wrote:
>> The following series introduces a new hardware offload mode in
>> tc/mqprio where the TCs, the queue configurations and
>> bandwidth rate limits are offloaded to the hardware. The existing
>> mqprio framework is extended to configure the queue counts and
>> layout and also added support for rate limiting. This is achieved
>> through new netlink attributes for the 'mode' option which takes
>> values such as 'dcb' (default) and 'channel' and a 'shaper' option
>> for QoS attributes such as bandwidth rate limits in hw mode 1.
>> Legacy devices can fall back to the existing setup supporting hw mode
>> 1 without these additional options where only the TCs are offloaded
>> and then the 'mode' and 'shaper' options defaults to DCB support.
>> The i40e driver enables the new mqprio hardware offload mechanism
>> factoring the TCs, queue configuration and bandwidth rates by
>> creating HW channel VSIs.
>>
>> In this new mode, the priority to traffic class mapping and the
>> user specified queue ranges are used to configure the traffic
>> class when the 'mode' option is set to 'channel'. This is achieved by
>> creating HW channels(VSI). A new channel is created for each of the
>> traffic class configuration offloaded via mqprio framework except for
>> the first TC (TC0) which is for the main VSI. TC0 for the main VSI is
>> also reconfigured as per user provided queue parameters. Finally,
>> bandwidth rate limits are set on these traffic classes through the
>> shaper attribute by sending these rates in addition to the number of
>> TCs and the queue configurations.
>>
>> Example:
>>  # tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
>>queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
>>min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
>>
>>  To dump the bandwidth rates:
>>
>>  # tc qdisc show dev eth0
>>
>>  qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
>>   queues:(0:3) (4:7)
>>   mode:channel
>>   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 
>> 5Gbit
>>
>> ---
>>
>> Amritha Nambiar (6):
>>mqprio: Introduce new hardware offload mode and shaper in mqprio
>>i40e: Add macro for PF reset bit
>>i40e: Add infrastructure for queue channel support
>>i40e: Enable 'channel' mode in mqprio for TC configs
>>i40e: Refactor VF BW rate limiting
>>i40e: Add support setting TC max bandwidth rates
>>
> 
> It would be nice to know what has changed since the last review, either 
> summarized here or in the individual patch files.  This helps in knowing 
> how much attention should be given to this new set of patches, and 
> encourages further review.  I don't remember seeing any responses to my 
> previous comments, and it looks like not all of them were acted upon.

For all the patch files that have changed since the last revision, I
have captured the new changes in the section titled "v3:" of each
patch file. Also, I had replied to most of your comments, saying they
would be fixed in v3 (and, for the one in the mqprio patch, that error
handling was already being done); those responses can be verified in
patchwork. I did miss responding to your comment about supporting macvlan
offloads through these channels: that will not be part of this series and
will be looked at later in a subsequent patch series.

> 
> sln
> 
>>
>>   drivers/net/ethernet/intel/i40e/i40e.h |   44 +
>>   drivers/net/ethernet/intel/i40e/i40e_debugfs.c |3
>>   drivers/net/ethernet/intel/i40e/i40e_ethtool.c |8
>>   drivers/net/ethernet/intel/i40e/i40e_main.c| 1463 
>> +---
>>   drivers/net/ethernet/intel/i40e/i40e_txrx.h|2
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   50 -
>>   include/net/pkt_cls.h  |9
>>   include/uapi/linux/pkt_sched.h |   32
>>   net/sched/sch_mqprio.c |  183 ++-
>>   9 files changed, 1551 insertions(+), 243 deletions(-)
>>
>> --
>> ___
>> Intel-wired-lan mailing list
>> intel-wired-...@osuosl.org
>> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>


Re: [Intel-wired-lan] [PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:38 PM, Amritha Nambiar wrote:
>> This patch enables tc-flower based hardware offloads. tc/flower
>> filter provided by the kernel is configured as driver specific
>> cloud filter. The patch implements functions and admin queue
>> commands needed to support cloud filters in the driver and
>> adds cloud filters to configure these tc-flower filters.
>>
>> The only action supported is to redirect packets to a traffic class
>> on the same device.
>>
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> # tc filter add dev eth0 protocol ip parent :\
>>prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 0
>>
>> # tc filter add dev eth0 protocol ip parent :\
>>prio 2 flower dst_ip 192.168.3.5/32\
>>ip_proto udp dst_port 25 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 1
>>
>> # tc filter add dev eth0 protocol ipv6 parent :\
>>prio 3 flower dst_ip fe8::200:1\
>>ip_proto udp dst_port 66 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 2
>>
>> Delete tc flower filter:
>> Example:
>>
>> # tc filter del dev eth0 parent : prio 3 handle 0x1 flower
>> # tc filter del dev eth0 parent :
>>
>> Flow Director Sideband is disabled while configuring cloud filters
>> via tc-flower.
> 
> Only while configuring, or the whole time there is a cloud filter?  This 
> is unclear here.

The entire time any cloud filter exists. Will make the comment clearer in v2.

> 
>>
>> Unsupported matches when cloud filters are added using enhanced
>> big buffer cloud filter mode of underlying switch include:
>> 1. source port and source IP
>> 2. Combined MAC address and IP fields.
>> 3. Not specfying L4 port
> 
> s/specfying/specifying/

Will fix in v2.

> 
>>
>> These filter matches can however be used to redirect traffic to
>> the main VSI (tc 0) which does not require the enhanced big buffer
>> cloud filter support.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e.h   |   46 +
>>   drivers/net/ethernet/intel/i40e/i40e_common.c|  180 
>>   drivers/net/ethernet/intel/i40e/i40e_main.c  |  952 
>> ++
>>   drivers/net/ethernet/intel/i40e/i40e_prototype.h |   17
>>   4 files changed, 1193 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 5c0cad5..7288265 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -55,6 +55,8 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> +#include 
>>   #include "i40e_type.h"
>>   #include "i40e_prototype.h"
>>   #include "i40e_client.h"
>> @@ -252,10 +254,51 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>>   };
>>   
>> +#define I40E_CLOUD_FIELD_OMAC   0x01
>> +#define I40E_CLOUD_FIELD_IMAC   0x02
>> +#define I40E_CLOUD_FIELD_IVLAN  0x04
>> +#define I40E_CLOUD_FIELD_TEN_ID 0x08
>> +#define I40E_CLOUD_FIELD_IIP0x10
>> +
>> +#define I40E_CLOUD_FILTER_FLAGS_OMACI40E_CLOUD_FIELD_OMAC
>> +#define I40E_CLOUD_FILTER_FLAGS_IMACI40E_CLOUD_FIELD_IMAC
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN  (I40E_CLOUD_FIELD_IMAC | \
>> + I40E_CLOUD_FIELD_IVLAN)
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>> + I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>> +  I40E_CLOUD_FIELD_IMAC | \
>> +  I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>> +   I40E_CLOUD_FIELD_IVLAN | \
>> +   I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_IIP I40E_CLOUD_FIELD_IIP
>> +
>>   struct i40e_cloud_filter {
>>  struct hlist_node cloud_node;
>>  /* cloud filter input set follows */
>>  unsigned long cookie;
>> +u8 dst_mac[ETH_ALEN];
>> +u8 src_mac[ETH_ALEN];
>> +__be16 vlan_id;
>> +__be32 dst_ip[4];
>> +__be32 src_ip[4];
>> +u8 dst_ipv6[16];
>> +u8 src_ipv6[16];
>> +__be16 dst_port;
>> +__be16 src_port;
>> +/* matter only when IP based filtering is set */
>> +bool is_ipv6;
>> +/* IPPROTO value */
>> +u8 ip_proto;
>> +/* L4 port type: src or destination port */
>> +#define I40E_CLOUD_FILTER_PORT_SRC  0x01
>> +#define I40E_CLOUD_FILTER_PORT_DEST 0x02
>> +u8 port_type;
>> +u32 tenant_id;
>> +u8 flags;
>> +#define I40E_CLOUD_TNL_TYPE_NONE0xff
>> +u8 tunnel_type;
>>  /* filter 

Re: [Intel-wired-lan] [PATCH 5/6] [net-next]net: i40e: Clean up of cloud filters

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:38 PM, Amritha Nambiar wrote:
>> Introduce the cloud filter datastructure and cleanup of cloud
>> filters associated with the device.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e.h  |   11 +++
>>   drivers/net/ethernet/intel/i40e/i40e_main.c |   27 
>> +++
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 1391e5d..5c0cad5 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -252,6 +252,14 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>>   };
>>   
>> +struct i40e_cloud_filter {
>> +struct hlist_node cloud_node;
>> +/* cloud filter input set follows */
>> +unsigned long cookie;
>> +/* filter control */
>> +u16 seid;
>> +};
> 
> This would be cleaner and more readable with the field comments off to 
> the side rather than in line with the fields.

Will fix in the next version of the series.

> 
>> +
>>   #define I40E_ETH_P_LLDP0x88cc
>>   
>>   #define I40E_DCB_PRIO_TYPE_STRICT  0
>> @@ -419,6 +427,9 @@ struct i40e_pf {
>>  struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS];
>>  u16 pending_udp_bitmap;
>>   
>> +struct hlist_head cloud_filter_list;
>> +u16 num_cloud_filters;
>> +
>>  enum i40e_interrupt_policy int_policy;
>>  u16 rx_itr_default;
>>  u16 tx_itr_default;
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index f74..93f6fe2 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -6928,6 +6928,29 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
>>   }
>>   
>>   /**
>> + * i40e_cloud_filter_exit - Cleans up the Cloud Filters
>> + * @pf: Pointer to PF
>> + *
>> + * This function destroys the hlist where all the Cloud Filters
>> + * filters were saved.
>> + **/
>> +static void i40e_cloud_filter_exit(struct i40e_pf *pf)
>> +{
>> +struct i40e_cloud_filter *cfilter;
>> +struct hlist_node *node;
>> +
>> +if (hlist_empty(&pf->cloud_filter_list))
>> +return;
> 
> Is this check really necessary?  Doesn't hlist_for_each_entry_safe() 
> check for this?

That's right. Will fix in the next version of the series.

> 
>> +
>> +hlist_for_each_entry_safe(cfilter, node,
>> +  &pf->cloud_filter_list, cloud_node) {
>> +hlist_del(&cfilter->cloud_node);
>> +kfree(cfilter);
>> +}
>> +pf->num_cloud_filters = 0;
>> +}
>> +
>> +/**
>>* i40e_close - Disables a network interface
>>* @netdev: network interface device structure
>>*
>> @@ -12137,6 +12160,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, 
>> bool reinit)
>>  vsi = i40e_vsi_reinit_setup(pf->vsi[pf->lan_vsi]);
>>  if (!vsi) {
>>  dev_info(&pf->pdev->dev, "setup of MAIN VSI failed\n");
>> +i40e_cloud_filter_exit(pf);
>>  i40e_fdir_teardown(pf);
>>  return -EAGAIN;
>>  }
>> @@ -12961,6 +12985,8 @@ static void i40e_remove(struct pci_dev *pdev)
>>  if (pf->vsi[pf->lan_vsi])
>>  i40e_vsi_release(pf->vsi[pf->lan_vsi]);
>>   
>> +i40e_cloud_filter_exit(pf);
>> +
>>  /* remove attached clients */
>>  if (pf->flags & I40E_FLAG_IWARP_ENABLED) {
>>  ret_code = i40e_lan_del_device(pf);
>> @@ -13170,6 +13196,7 @@ static void i40e_shutdown(struct pci_dev *pdev)
>>   
>>  del_timer_sync(&pf->service_timer);
>>  cancel_work_sync(&pf->service_task);
>> +i40e_cloud_filter_exit(pf);
>>  i40e_fdir_teardown(pf);
>>   
>>  /* Client close must be called explicitly here because the timer
>>
>> ___
>> Intel-wired-lan mailing list
>> intel-wired-...@osuosl.org
>> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>


Re: [Intel-wired-lan] [PATCH 4/6] [net-next]net: i40e: Admin queue definitions for cloud filters

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:37 PM, Amritha Nambiar wrote:
>> Add new admin queue definitions and extended fields for cloud
>> filter support. Define big buffer for extended general fields
>> in Add/Remove Cloud filters command.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> Signed-off-by: Store Laura 
>> Signed-off-by: Iremonger Bernard 
>> Signed-off-by: Jingjing Wu 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |   98 
>> +
>>   1 file changed, 97 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
>> b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> index 8bba04c..9f14305 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
> 
> I see that you're changing the i40e version of this, but not the i40evf 
> version.  I understand that these changes are not useful for the VF, but 
> are you no longer trying to keep the AdminQ definitions consistent 
> between the two?

I will add these definitions to the VF as well for consistency in the
next version.

> 
>> @@ -1358,7 +1358,9 @@ struct i40e_aqc_add_remove_cloud_filters {
>>   #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT  0
>>   #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_MASK   (0x3FF << \
>>  I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT)
>> -u8  reserved2[4];
>> +u8  big_buffer_flag;
>> +#define I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER   1
>> +u8  reserved2[3];
>>  __le32  addr_high;
>>  __le32  addr_low;
>>   };
>> @@ -1395,6 +1397,13 @@ struct i40e_aqc_add_remove_cloud_filters_element_data 
>> {
>>   #define I40E_AQC_ADD_CLOUD_FILTER_IMAC 0x000A
>>   #define I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC 0x000B
>>   #define I40E_AQC_ADD_CLOUD_FILTER_IIP  0x000C
>> +/* 0x0010 to 0x0017 is for custom filters */
>> +/* flag to be used when adding cloud filter: IP + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_IP_PORT   0x0010
>> +/* flag to be used when adding cloud filter: Dest MAC + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT  0x0011
>> +/* flag to be used when adding cloud filter: Dest MAC + VLAN + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT 0x0012
>>   
>>   #define I40E_AQC_ADD_CLOUD_FLAGS_TO_QUEUE  0x0080
>>   #define I40E_AQC_ADD_CLOUD_VNK_SHIFT   6
>> @@ -1429,6 +1438,45 @@ struct i40e_aqc_add_remove_cloud_filters_element_data 
>> {
>>  u8  response_reserved[7];
>>   };
> 
> I know you didn't add this struct, but where's the I40E_CHECK_STRUCT_LEN 
> check?

Will add all the needed I40E_CHECK_STRUCT_LEN checks in the next version
of the series.

> 
>>   
>> +/* i40e_aqc_add_remove_cloud_filters_element_big_data is used when
>> + * I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER flag is set.
>> + */
>> +struct i40e_aqc_add_remove_cloud_filters_element_big_data {
>> +struct i40e_aqc_add_remove_cloud_filters_element_data element;
>> +u16 general_fields[32];
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD00
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD11
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD22
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD03
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD14
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD25
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD06
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD17
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD28
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD09
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD110
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD211
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD012
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD113
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD214
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD015
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD116
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD217
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD318
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD419
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD520
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD621
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD722
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD023
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD124
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD225
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD326
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD427
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD5  

Re: [PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter

2017-08-02 Thread Nambiar, Amritha
On 8/2/2017 11:41 AM, Roopa Prabhu wrote:
> On Mon, Jul 31, 2017 at 5:40 PM, Amritha Nambiar
>  wrote:
>> The Mirred/redirect action is extended to accept a traffic
>> class on the device in addition to the device's ifindex.
>>
>> Usage: mirred
>>
>> Example:
>> # tc qdisc add dev eth0 ingress
>>
>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>   dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>   indev eth0 action mirred ingress redirect dev eth0 tc 1
> 
> 
> Can we call the new parameter tclass or something else ?.
> with this 'tc' appears twice in the command :)
> 

Sounds right. I was already thinking of alternatives like 'tcqgroup',
'qgroup' and now we have 'tclass'.


Re: [PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter

2017-08-02 Thread Nambiar, Amritha
On 8/1/2017 8:12 AM, David Laight wrote:
> From: Stephen Hemminger
>> Sent: 01 August 2017 04:52
>> On Mon, 31 Jul 2017 17:40:50 -0700
>> Amritha Nambiar  wrote:
>> The concept is fine, but the code looks different from the rest, which
>> is never a good sign.
>>
>>
>>> +   if ((argc > 0) && (matches(*argv, "tc") == 0)) {
>>
>> Extra () are unnecessary in compound conditional.
>>
>>> +   tc = atoi(*argv);
>>
>> Prefer using strtoul since it has better error handling than atoi()
>>
>>> +   argc--;
>>> +   argv++;
>>> +   }
>>
>>
>> Use NEXT_ARG() construct like rest of the code.
> 
> Why bother faffing about with argc at all?
> The argument list terminates when *argv == NULL.
> 

I'll submit the next version with these fixes.

>   David
> 


Re: [PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

2017-08-02 Thread Nambiar, Amritha
On 8/2/2017 5:02 AM, Jamal Hadi Salim wrote:
> On 17-08-01 10:13 PM, Nambiar, Amritha wrote:
>>
>> On 8/1/2017 3:56 AM, Jamal Hadi Salim wrote:
>>> On 17-07-31 08:38 PM, Amritha Nambiar wrote:
>>>> This patch enables tc-flower based hardware offloads. tc/flower
>>>> filter provided by the kernel is configured as driver specific
>>>> cloud filter. The patch implements functions and admin queue
>>>> commands needed to support cloud filters in the driver and
>>>> adds cloud filters to configure these tc-flower filters.
>>>>
>>>> The only action supported is to redirect packets to a traffic class
>>>> on the same device.
>>>>
>>>> # tc qdisc add dev eth0 ingress
>>>> # ethtool -K eth0 hw-tc-offload on
>>>>
>>>> # tc filter add dev eth0 protocol ip parent :\
>>>> prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
>>>> action mirred ingress redirect dev eth0 tc 0
>>>>
>>>
>>> Out of curiosity - did you need to say "indev eth0" there?
>>
>> It looks like I don't need to specify "indev eth0". I will need to look
>> up how this part is offloaded and probably validate it in the driver when
>> it is specified.
>>
>>> Also: Is it possible to add an skbmark? Example something like
>>> these that directs two flows to the same queue but different
>>> skb marks:
>>>
>>> # tc filter add dev eth0 protocol ip parent : \
>>> prio 2 flower dst_ip 192.168.3.5/32 \
>>> ip_proto udp dst_port 2a skip_sw \
>>> action skbedit mark 11 \
>>> action mirred ingress redirect dev eth0 tcqueue 1
>>>
>>> # tc filter add dev eth0 protocol ip parent : \
>>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>>   action skbedit mark 12 \
>>>   action mirred ingress redirect dev eth0 tcqueue 1
>>>
>>
>> It is possible to support the skbedit mark action for the first rule
>> here (L3 and L4) which I can take up in a subsequent patch, but this
>> cannot be supported on our device for L2 based match in the second rule.
>>
> 
> Ok, thanks. So the issue is one of hardware limitation?
> 

Right. Our hardware does not have this support now.

> cheers,
> jamal
> 


Re: [PATCH net-next RFC 0/6] Configure cloud filters in i40e via tc/flower classifier

2017-08-02 Thread Nambiar, Amritha

On 8/2/2017 5:01 AM, Jamal Hadi Salim wrote:
> On 17-08-01 08:57 PM, Nambiar, Amritha wrote:
>>
>> On 8/1/2017 3:15 AM, Jamal Hadi Salim wrote:
>>> On 17-07-31 08:36 PM, Amritha Nambiar wrote:
> 
>>>>
>>>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>>> dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>>> skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1
>>>>
>>>
>>> I think "queue 1" sounds better than "tc 1".
>>> "tc" is  already a keyword in a few places (even within that declaration
>>> above).
>>
>> The idea is to redirect to a traffic class that has queues assigned to
>> it, not to a single queue, i.e. these are actually queue groups. So
>> maybe "qgroup 1" or "tcqgroup 1" fits better.
>>
> 
> Can you describe how this works? So does a specific member of a
> tc group show up on a specific rx DMA ring? If you only have
> 16 TCs and 512 RX DMA rings - how does that work?
>

The Rx rule here is to redirect packets to a specific traffic class. It is
the traffic class index (queue group index) that is offloaded to the
device. Queues were already configured for the traffic class by mapping
the queue counts and offsets and offloading this layout using the mqprio
framework. I had submitted a patch series for this which uses a new
hardware offload mode in mqprio to offload the TCs, the queue
configurations and the bandwidth rates for the TCs. So the 512 rings can
be mapped into 16 TCs using the mqprio offload mechanism, something like
this:
TC0: 0 – 15
TC1: 16 – 31
TC2: 32 – 33
TC3: 34 – 49
.
.
.
TC15: 500 - 511

Now, once the TC configuration is prepared, it is just a matter of
hooking up the Rx rules to route traffic to the traffic class/queue
group. Rx queue selection within the queue group happens based on RSS.

> cheers,
> jamal
> 


Re: [PATCH 1/6] [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class

2017-08-01 Thread Nambiar, Amritha

On 8/1/2017 4:12 AM, Jiri Pirko wrote:
> Tue, Aug 01, 2017 at 02:37:37AM CEST, amritha.namb...@intel.com wrote:
>> The Mirred/redirect action is extended to forward to a traffic
>> class on the device. The traffic class index needs to be
>> provided in addition to the device's ifindex.
>>
>> Example:
>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>  dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>  skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1
> 
> You need to make sure that the current offloaders will forbid adding
> this rule, not just silently ignore the tc value.

I will fix this in the next version, probably using the 'flags' field
I've defined.

> 
> 
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>> include/net/tc_act/tc_mirred.h        |    7 +++
>> include/uapi/linux/tc_act/tc_mirred.h |    5 +
>> net/sched/act_mirred.c                |   17 +
>> 3 files changed, 29 insertions(+)
>>
>> diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
>> index 604bc31..60058c4 100644
>> --- a/include/net/tc_act/tc_mirred.h
>> +++ b/include/net/tc_act/tc_mirred.h
>> @@ -9,6 +9,8 @@ struct tcf_mirred {
>>  int tcfm_eaction;
>>  int tcfm_ifindex;
>>  booltcfm_mac_header_xmit;
>> +u8  tcfm_tc;
>> +u32 flags;
>>  struct net_device __rcu *tcfm_dev;
>>  struct list_headtcfm_list;
>> };
>> @@ -37,4 +39,9 @@ static inline int tcf_mirred_ifindex(const struct 
>> tc_action *a)
>>  return to_mirred(a)->tcfm_ifindex;
>> }
>>
>> +static inline int tcf_mirred_tc(const struct tc_action *a)
>> +{
>> +return to_mirred(a)->tcfm_tc;
>> +}
>> +
>> #endif /* __NET_TC_MIR_H */
>> diff --git a/include/uapi/linux/tc_act/tc_mirred.h 
>> b/include/uapi/linux/tc_act/tc_mirred.h
>> index 3d7a2b3..8ff4d76 100644
>> --- a/include/uapi/linux/tc_act/tc_mirred.h
>> +++ b/include/uapi/linux/tc_act/tc_mirred.h
>> @@ -9,6 +9,10 @@
>> #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
>> #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
>> #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */
>> +
>> +#define MIRRED_F_TC_MAP 0x1
>> +#define MIRRED_TC_MAP_MAX   0x10
>> +#define MIRRED_TC_MAP_MASK  0xF
> 
> I'm completely lost. Why do you have these values here? and in fact one
> twice?

I'll fix this to remove the defines for the TC max range and its bitmap
here and reuse the existing ones defined in linux/netdevice.h

> 
> 
>>  
>>
>> struct tc_mirred {
>>  tc_gen;
>> @@ -21,6 +25,7 @@ enum {
>>  TCA_MIRRED_TM,
>>  TCA_MIRRED_PARMS,
>>  TCA_MIRRED_PAD,
>> +TCA_MIRRED_TC_MAP,
>>  __TCA_MIRRED_MAX
>> };
>> #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1)
>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>> index 1b5549a..f9801de 100644
>> --- a/net/sched/act_mirred.c
>> +++ b/net/sched/act_mirred.c
>> @@ -67,6 +67,7 @@ static void tcf_mirred_release(struct tc_action *a, int 
>> bind)
>>
>> static const struct nla_policy mirred_policy[TCA_MIRRED_MAX + 1] = {
>>  [TCA_MIRRED_PARMS]  = { .len = sizeof(struct tc_mirred) },
>> +[TCA_MIRRED_TC_MAP] = { .type = NLA_U8 },
>> };
>>
>> static unsigned int mirred_net_id;
>> @@ -83,6 +84,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
>> *nla,
>>  struct tcf_mirred *m;
>>  struct net_device *dev;
>>  bool exists = false;
>> +u8 *tc_map = NULL;
>> +u32 flags = 0;
>>  int ret;
>>
>>  if (nla == NULL)
>> @@ -92,6 +95,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
>> *nla,
>>  return ret;
>>  if (tb[TCA_MIRRED_PARMS] == NULL)
>>  return -EINVAL;
>> +
>> +if (tb[TCA_MIRRED_TC_MAP]) {
>> +tc_map = nla_data(tb[TCA_MIRRED_TC_MAP]);
>> +if (*tc_map >= MIRRED_TC_MAP_MAX)
>> +return -EINVAL;
>> +flags |= MIRRED_F_TC_MAP;
>> +}
>> +
>>  parm = nla_data(tb[TCA_MIRRED_PARMS]);
>>
>>  exists = tcf_hash_check(tn, parm->index, a, bind);
>> @@ -139,6 +150,7 @@ static int tcf_mirred_init(struct net *net, struct 
>> nlattr *nla,
>>  ASSERT_RTNL();
>>  m->tcf_action = parm->action;
>>  m->tcfm_eaction = parm->eaction;
>> +m->flags = flags;
>>  if (dev != NULL) {
>>  m->tcfm_ifindex = parm->ifindex;
>>  if (ret != ACT_P_CREATED)
>> @@ -146,6 +158,8 @@ static int tcf_mirred_init(struct net *net, struct 
>> nlattr *nla,
>>  dev_hold(dev);
>>  rcu_assign_pointer(m->tcfm_dev, dev);
>>  m->tcfm_mac_header_xmit = mac_header_xmit;
>> +if (flags & MIRRED_F_TC_MAP)
>> +m->tcfm_tc = *tc_map & MIRRED_TC_MAP_MASK;
>>  }
>>
>>  if (ret == 

Re: [PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

2017-08-01 Thread Nambiar, Amritha

On 8/1/2017 3:56 AM, Jamal Hadi Salim wrote:
> On 17-07-31 08:38 PM, Amritha Nambiar wrote:
>> This patch enables tc-flower based hardware offloads. tc/flower
>> filter provided by the kernel is configured as driver specific
>> cloud filter. The patch implements functions and admin queue
>> commands needed to support cloud filters in the driver and
>> adds cloud filters to configure these tc-flower filters.
>>
>> The only action supported is to redirect packets to a traffic class
>> on the same device.
>>
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> # tc filter add dev eth0 protocol ip parent :\
>>prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 0
>>
> 
> Out of curiosity - did you need to say "indev eth0" there?

It looks like I don't need to specify "indev eth0". I will need to look
up how this part is offloaded and probably validate in the driver when
this is specified.

> Also: Is it possible to add an skbmark? Example something like
> these that directs two flows to the same queue but different
> skb marks:
> 
> # tc filter add dev eth0 protocol ip parent : \
>prio 2 flower dst_ip 192.168.3.5/32 \
>ip_proto udp dst_port 2a skip_sw \
>action skbedit mark 11 \
>action mirred ingress redirect dev eth0 tcqueue 1
> 
> # tc filter add dev eth0 protocol ip parent : \
>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>  action skbedit mark 12 \
>  action mirred ingress redirect dev eth0 tcqueue 1
> 

It is possible to support the skbedit mark action for the first rule
here (L3 and L4), which I can take up in a subsequent patch, but this
cannot be supported on our device for the L2-based match in the second
rule.

> cheers,
> jamal
> 


Re: [PATCH 1/6] [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class

2017-08-01 Thread Nambiar, Amritha

On 8/1/2017 3:44 AM, Jamal Hadi Salim wrote:
> On 17-07-31 08:37 PM, Amritha Nambiar wrote:
>> The Mirred/redirect action is extended to forward to a traffic
>> class on the device. The traffic class index needs to be
>> provided in addition to the device's ifindex.
>>
>> Example:
>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>   include/net/tc_act/tc_mirred.h        |    7 +++
>>   include/uapi/linux/tc_act/tc_mirred.h |    5 +
>>   net/sched/act_mirred.c                |   17 +
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
>> index 604bc31..60058c4 100644
>> --- a/include/net/tc_act/tc_mirred.h
>> +++ b/include/net/tc_act/tc_mirred.h
>> @@ -9,6 +9,8 @@ struct tcf_mirred {
>>  int tcfm_eaction;
>>  int tcfm_ifindex;
>>  booltcfm_mac_header_xmit;
>> +u8  tcfm_tc;
>> +u32 flags;
>>  struct net_device __rcu *tcfm_dev;
>>  struct list_headtcfm_list;
>>   };
>> @@ -37,4 +39,9 @@ static inline int tcf_mirred_ifindex(const struct 
>> tc_action *a)
>>  return to_mirred(a)->tcfm_ifindex;
>>   }
>>   
>> +static inline int tcf_mirred_tc(const struct tc_action *a)
>> +{
>> +return to_mirred(a)->tcfm_tc;
>> +}
>> +
>>   #endif /* __NET_TC_MIR_H */
>> diff --git a/include/uapi/linux/tc_act/tc_mirred.h 
>> b/include/uapi/linux/tc_act/tc_mirred.h
>> index 3d7a2b3..8ff4d76 100644
>> --- a/include/uapi/linux/tc_act/tc_mirred.h
>> +++ b/include/uapi/linux/tc_act/tc_mirred.h
>> @@ -9,6 +9,10 @@
>>   #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
>>   #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
>>   #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */
>> +
>> +#define MIRRED_F_TC_MAP 0x1
>> +#define MIRRED_TC_MAP_MAX   0x10
>> +#define MIRRED_TC_MAP_MASK  0xF
>>  
>>  
>>   struct tc_mirred {
>>  tc_gen;
>> @@ -21,6 +25,7 @@ enum {
>>  TCA_MIRRED_TM,
>>  TCA_MIRRED_PARMS,
>>  TCA_MIRRED_PAD,
>> +TCA_MIRRED_TC_MAP,
>>  __TCA_MIRRED_MAX
>>   };
>>   #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1)
>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>> index 1b5549a..f9801de 100644
>> --- a/net/sched/act_mirred.c
>> +++ b/net/sched/act_mirred.c
>> @@ -67,6 +67,7 @@ static void tcf_mirred_release(struct tc_action *a, int 
>> bind)
>>   
>>   static const struct nla_policy mirred_policy[TCA_MIRRED_MAX + 1] = {
>>  [TCA_MIRRED_PARMS]  = { .len = sizeof(struct tc_mirred) },
>> +[TCA_MIRRED_TC_MAP] = { .type = NLA_U8 },
>>   };
>>   
>>   static unsigned int mirred_net_id;
>> @@ -83,6 +84,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
>> *nla,
>>  struct tcf_mirred *m;
>>  struct net_device *dev;
>>  bool exists = false;
>> +u8 *tc_map = NULL;
>> +u32 flags = 0;
>>  int ret;
>>   
>>  if (nla == NULL)
>> @@ -92,6 +95,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
>> *nla,
>>  return ret;
>>  if (tb[TCA_MIRRED_PARMS] == NULL)
>>  return -EINVAL;
>> +
>> +if (tb[TCA_MIRRED_TC_MAP]) {
>> +tc_map = nla_data(tb[TCA_MIRRED_TC_MAP]);
>> +if (*tc_map >= MIRRED_TC_MAP_MAX)
>> +return -EINVAL;
>> +flags |= MIRRED_F_TC_MAP;
> 
> 
>> +}
>> +
>>  parm = nla_data(tb[TCA_MIRRED_PARMS]);
>>   
>>  exists = tcf_hash_check(tn, parm->index, a, bind);
>> @@ -139,6 +150,7 @@ static int tcf_mirred_init(struct net *net, struct 
>> nlattr *nla,
>>  ASSERT_RTNL();
>>  m->tcf_action = parm->action;
>>  m->tcfm_eaction = parm->eaction;
>> +m->flags = flags;
>>  if (dev != NULL) {
>>  m->tcfm_ifindex = parm->ifindex;
>>  if (ret != ACT_P_CREATED)
>> @@ -146,6 +158,8 @@ static int tcf_mirred_init(struct net *net, struct 
>> nlattr *nla,
>>  dev_hold(dev);
>>  rcu_assign_pointer(m->tcfm_dev, dev);
>>  m->tcfm_mac_header_xmit = mac_header_xmit;
>> +if (flags & MIRRED_F_TC_MAP)
>> +m->tcfm_tc = *tc_map & MIRRED_TC_MAP_MASK;
>>  }
>>   
> Is the mask a hardware limit? I don't know how these queues are
> allocated - I am assuming each of these "tc queues" maps to a rx
> DMA ring?

This is the bitmask for the TC max range, again defined in linux/netdevice.h.
I'll fix this to remove the new definition I have (MIRRED_TC_MAP_MASK)
here and replace it with the existing TC_BITMASK. These are just the
traffic class indices that are offloaded to the device. I had submitted
another 

Re: [PATCH 1/6] [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class

2017-08-01 Thread Nambiar, Amritha

On 8/1/2017 3:22 AM, Jamal Hadi Salim wrote:
> On 17-07-31 08:37 PM, Amritha Nambiar wrote:
>> The Mirred/redirect action is extended to forward to a traffic
>> class on the device. The traffic class index needs to be
>> provided in addition to the device's ifindex.
>>
>> Example:
>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>   include/net/tc_act/tc_mirred.h        |    7 +++
>>   include/uapi/linux/tc_act/tc_mirred.h |    5 +
>>   net/sched/act_mirred.c                |   17 +
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
>> index 604bc31..60058c4 100644
>> --- a/include/net/tc_act/tc_mirred.h
>> +++ b/include/net/tc_act/tc_mirred.h
>> @@ -9,6 +9,8 @@ struct tcf_mirred {
>>  int tcfm_eaction;
>>  int tcfm_ifindex;
>>  booltcfm_mac_header_xmit;
>> +u8  tcfm_tc;
>> +u32 flags;
>>  struct net_device __rcu *tcfm_dev;
>>  struct list_headtcfm_list;
>>   };
>> @@ -37,4 +39,9 @@ static inline int tcf_mirred_ifindex(const struct 
>> tc_action *a)
>>  return to_mirred(a)->tcfm_ifindex;
>>   }
>>   
>> +static inline int tcf_mirred_tc(const struct tc_action *a)
>> +{
>> +return to_mirred(a)->tcfm_tc;
>> +}
>> +
>>   #endif /* __NET_TC_MIR_H */
>> diff --git a/include/uapi/linux/tc_act/tc_mirred.h 
>> b/include/uapi/linux/tc_act/tc_mirred.h
>> index 3d7a2b3..8ff4d76 100644
>> --- a/include/uapi/linux/tc_act/tc_mirred.h
>> +++ b/include/uapi/linux/tc_act/tc_mirred.h
>> @@ -9,6 +9,10 @@
>>   #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
>>   #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
>>   #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */
>> +
>> +#define MIRRED_F_TC_MAP 0x1
>> +#define MIRRED_TC_MAP_MAX   0x10
> 
> Assuming this is the max number of queues?
> Where does this upper bound come from? Is it a spec
> or an intel thing? If spec - mentioning which
> spec and section would be useful.

This is the max number of TCs. The Linux upper bound for this is defined
in linux/netdevice.h. I will fix this part to remove the definition here
and reuse the existing one.

> 
> cheers,
> jamal
> 


Re: [PATCH net-next RFC 0/6] Configure cloud filters in i40e via tc/flower classifier

2017-08-01 Thread Nambiar, Amritha

On 8/1/2017 3:15 AM, Jamal Hadi Salim wrote:
> On 17-07-31 08:36 PM, Amritha Nambiar wrote:
>> This patch series enables configuring cloud filters in i40e
>> using the tc/flower classifier. The only tc-filter action
>> supported is to redirect packets to a traffic class on the
>> same device. The tc/mirred:redirect action is extended to
>> accept a traffic class to achieve this.
>>
>> The cloud filters are added for a VSI and are cleaned up when
>> the VSI is deleted. The filters that match on L4 ports needs
>> enhanced admin queue functions with big buffer support for
>> extended general fields in Add/Remove Cloud filters command.
>>
>> Example:
>> # tc qdisc add dev eth0 ingress
>>
>> # ethtool -K eth0 hw-tc-offload on
>>
>> # tc filter add dev eth0 protocol ip parent : prio 1 flower\
>>dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
>>skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1
>>
> 
> I think "queue 1" sounds better than "tc 1".
> "tc" is already a keyword in a few places (even within that declaration
> above).

The idea is to redirect to a traffic class that has queues assigned to
it and not a single queue i.e. these are actually queue groups and not a
single queue. So maybe "qgroup 1" or "tcqgroup 1" fits better.

> 
> cheers,
> jamal
> 


Re: [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio

2017-07-26 Thread Nambiar, Amritha



On 7/21/2017 2:42 AM, Richard Cochran wrote:

On Mon, May 22, 2017 at 12:31:12PM -0700, Jeff Kirsher wrote:

On Fri, 2017-05-19 at 17:58 -0700, Amritha Nambiar wrote:

The following series introduces a new hardware offload mode in tc/mqprio
where the TCs, the queue configurations and bandwidth rate limits are
offloaded to the hardware.


...


This was meant to be sent out as an RFC, but apparently that did not get
conveyed when these were sent out Friday.


I am looking at tc/mqprio and dcb with the purpose of implementing

  - Forwarding and Queuing Enhancements for Time-Sensitive Streams (FQTSS)
802.1Q-2014 Clause 34

  - Scheduled Traffic (time based scheduling)
P802.1Qbv

using the HW capabilities of the i210.  This series looks like a
promising avenue for these features.

My question is, did this series go anywhere?

I didn't see any follow ups on netdev, but maybe I missed something.


I have submitted another series as non-RFC for next-queue maintained by 
Jeff Kirsher. This is going to come through the next-queue tree. You can 
follow the new series "[PATCH 0/6] Configuring traffic classes via new 
hardware offload mechanism in tc/mqprio" at 
https://www.mail-archive.com/netdev@vger.kernel.org/msg177390.html




Thanks,
Richard





Re: [PATCH 1/6] [next-queue]net: mqprio: Introduce new hardware offload mode in mqprio for offloading full TC configurations

2017-07-14 Thread Nambiar, Amritha

On 7/14/2017 1:36 AM, Jamal Hadi Salim wrote:

On 17-07-11 06:18 AM, Amritha Nambiar wrote:

This patch introduces a new hardware offload mode in mqprio
which makes full use of the mqprio options, the TCs, the
queue configurations and the bandwidth rates for the TCs.
This is achieved by setting the value 2 for the "hw" option.
This new offload mode supports new attributes for traffic
class such as minimum and maximum values for bandwidth rate limits.

Introduces a new datastructure 'tc_mqprio_qopt_offload' for offloading
mqprio queue options and use this to be shared between the kernel and
device driver. This contains a copy of the existing datastructure
for mqprio queue options. This new datastructure can be extended when
adding new attributes for traffic class such as bandwidth rate limits. The
existing datastructure for mqprio queue options will be shared between the
kernel and userspace.

This patch enables configuring additional attributes associated
with a traffic class such as minimum and maximum bandwidth
rates and can be offloaded to the hardware in the new offload mode.
The min and max limits for bandwidth rates are provided
by the user along with the TCs and the queue configurations
when creating the mqprio qdisc.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
queues 4@0 4@4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2


I know this has nothing to do with your patches - but that is very
unfriendly ;-> Most mortals will have a problem with the map (but you
can argue it has been there since the prio qdisc was introduced), let
alone the 4@4 syntax, and now min_rate where I have to type in obvious
defaults like "0Mbit".


The min_rate and max_rate are optional attributes for the traffic class,
and it is not mandatory to have both. It is also possible to have either
one of them: say, devices that do not support setting a min rate need to
specify only the max rate and need not type in the default 0Mbit. My bad,
I should probably have given a better example.

# tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
   queues 4@0 4@4 max_rate 55Mbit 60Mbit hw 2



You have some great features that not many people can use as a result.
Note: this is just a comment; maybe someone can be kind enough to fix it
(or it will get annoying enough that I will fix it), i.e. it should not
hold up your good work.


To dump the bandwidth rates:

# tc qdisc show dev eth0

qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
   queues:(0:3) (4:7)
   min rates:0bit 0bit
   max rates:55Mbit 60Mbit

Signed-off-by: Amritha Nambiar 
---
   include/linux/netdevice.h      |    2
   include/net/pkt_cls.h          |    7 ++
   include/uapi/linux/pkt_sched.h |   13 +++
   net/sched/sch_mqprio.c         |  170 +---
   4 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e48ee2e..12c6c3f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -779,6 +779,7 @@ enum {
TC_SETUP_CLSFLOWER,
TC_SETUP_MATCHALL,
TC_SETUP_CLSBPF,
+   TC_SETUP_MQPRIO_EXT,
   };

   struct tc_cls_u32_offload;
@@ -791,6 +792,7 @@ struct tc_to_netdev {
struct tc_cls_matchall_offload *cls_mall;
struct tc_cls_bpf_offload *cls_bpf;
struct tc_mqprio_qopt *mqprio;
+   struct tc_mqprio_qopt_offload *mqprio_qopt;
};
bool egress_dev;
   };



diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 537d0a0..9facda2 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -569,6 +569,13 @@ struct tc_cls_bpf_offload {
u32 gen_flags;
   };
   
+struct tc_mqprio_qopt_offload {
+   /* struct tc_mqprio_qopt must always be the first element */
+   struct tc_mqprio_qopt qopt;
+   u32 flags;
+   u64 min_rate[TC_QOPT_MAX_QUEUE];
+   u64 max_rate[TC_QOPT_MAX_QUEUE];
+};


Quickly scanned code.
My opinion is: struct tc_mqprio_qopt is messed up in terms of
alignments. And you just made it worse. Why not create a new struct
call it "tc_mqprio_qopt_hw" or something indicating it is for hw
offload. You can then fixup stuff. I think it will depend on whether
you can have both hw priority and rate in all network cards.
If some hw cannot support rate offload then I would suggest it becomes
optional via TLVs etc.
If you are willing to do that clean up I can say more.


The existing struct tc_mqprio_qopt does have alignment issues, but it is
shared with the userspace. The new struct tc_mqprio_qopt_offload is shared
with the device driver. This contains a copy of the existing tc_mqprio_qopt
for mqprio queue options for legacy users. The rates are optional attributes
obtained as TLVs from the userspace via additional netlink attributes. This
would be clear from the corresponding iproute2 RFC patch

Re: [PATCH 0/6] Configuring traffic classes via new hardware offload mechanism in tc/mqprio

2017-07-14 Thread Nambiar, Amritha

On 7/14/2017 12:56 AM, Jamal Hadi Salim wrote:

Hi Amritha,

On 17-07-11 06:18 AM, Amritha Nambiar wrote:

The following series introduces a new hardware offload mode in
tc/mqprio where the TCs, the queue configurations and
bandwidth rate limits are offloaded to the hardware. The existing
mqprio framework is extended to configure the queue counts and
layout and also added support for rate limiting. This is achieved
by setting the value 2 for the "hw" mode. Legacy devices can fall
back to the existing setup supporting hw mode 1 where only the
TCs are offloaded, however if hw mode 2 is specified, then this
type of offload fails to initialize if the requested mode is not
supported. The i40e driver enables the new mqprio hardware offload
mechanism factoring the TCs, queue configuration and bandwidth
rates by creating HW channel VSIs.

In this new mode, the priority to traffic class mapping and the
user specified queue ranges are used to configure the traffic
class when the 'hw' option is set to 2. This is achieved by
creating HW channels (VSI). A new channel is created for each
of the traffic class configuration offloaded via mqprio
framework except for the first TC (TC0) which is for the main
VSI. TC0 for the main VSI is also reconfigured as per user
provided queue parameters. Finally, bandwidth rate limits are
set on these traffic classes through the mqprio offload
framework by sending these rates in addition to the number of
TCs and the queue configurations.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
   queues 4@0 4@4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2

To dump the bandwidth rates:

# tc qdisc show dev eth0
   qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
   queues:(0:3) (4:7)
   min rates:0bit 0bit
   max rates:55Mbit 60Mbit


My only concern is usability. It was already cryptic with hw "1" and
now we introduce another magic number "2". Could we not have some
human-friendly nouns? I know 1 means offload qos and 2 means
offload rate (+qos?).


I agree this could be less friendly; I was trying to retain the existing
conventions. Yes, hw "2" offloads qos and rates.



I have some questions:
BTW, what would "tc class show dev eth0" display in this case?
The above seems to select queue groupings for specific rates, which
is a short-cut to, say, assigning each queue a rate via something like
TBF via "tc class" semantics (but way more cryptic than tc class ;->)


Dumping the classes under the mqprio qdisc will not display any of the
queue configurations or rates. MQPRIO is classless and does not support
a change_class operation like HTB does to add a class to the qdisc and
set a rate for the class.



cheers,
jamal