RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-19 Thread Yotam Gigi


>-Original Message-
>From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>Sent: Wednesday, October 19, 2016 10:33 AM
>To: Yotam Gigi <yot...@mellanox.com>
>Cc: Jamal Hadi Salim <j...@mojatatu.com>; Jiri Pirko <j...@resnulli.us>;
>netdev@vger.kernel.org; da...@davemloft.net; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>;
>geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; Shrijeet Mukherjee
><s...@cumulusnetworks.com>; Yotam Gigi <yotam...@gmail.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/18/16, 3:58 AM, Yotam Gigi wrote:
>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>[snip]
>
>>> The OVS implementation is a good example, the metadata includes all the
>>> actions applied
>>>>> to the packet in the kernel data path.
>>>>>
>>>> Again not sure what the use case would be (and why waste such space
>>>> especially when you are sending over the wire with such details).
>>> All this is being used currently.., But, this can be other api's sflow uses
>>> for monitoring.
>>> http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>>>
>>> Does not have to be part of the main/basic sampling api...
>>> it was just an example.
>>>
>> I guess that making the API extensible solves this, isn't it?
>
>yes, that might help...
>
>Just wanted to bring up the question/clarification on using mark again
>
>tc qdisc add dev eth1 handle : ingress
>
>tc filter add dev eth1 parent : \
>   matchall action sample rate 12 mark 17
>
>tc filter add parent : dev eth1 protocol all \
>   u32 match mark 172 0xff
>   action mirred egress redirect dev dummy0
>
>Like we discussed @ netdev, mark can be used by other things in the system.
>A request to sample on an interface cannot be disruptive.
>Does this require mark to be not used elsewhere in the system when sampling is
>enabled on an interface ?

I think the we can spare the usage of mark, or at least make it optional, as 
the user
can match on the packets according to the eth_type (as part of the IFE, the 
user 
can set the sampled packet eth_type).

I will do that, and update the documentation as well.


Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-19 Thread Roopa Prabhu
On 10/18/16, 3:58 AM, Yotam Gigi wrote:

> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
[snip]

>> The OVS implementation is a good example, the metadata includes all the
>> actions applied
 to the packet in the kernel data path.

>>> Again not sure what the use case would be (and why waste such space
>>> especially when you are sending over the wire with such details).
>> All this is being used currently.., But, this can be other api's sflow uses
>> for monitoring.
>> http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>>  
>> Does not have to be part of the main/basic sampling api...
>> it was just an example.
>>
> I guess that making the API extensible solves this, isn't it?

yes, that might help...

Just wanted to bring up the question/clarification on using mark again

tc qdisc add dev eth1 handle : ingress

tc filter add dev eth1 parent : \
   matchall action sample rate 12 mark 17

tc filter add parent : dev eth1 protocol all \
   u32 match mark 172 0xff
   action mirred egress redirect dev dummy0

Like we discussed @ netdev, mark can be used by other things in the system.
A request to sample on an interface cannot be disruptive.
Does this require mark to be not used elsewhere in the system when sampling is 
enabled on an interface ?



RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-18 Thread Yotam Gigi


>-Original Message-
>From: Or Gerlitz [mailto:gerlitz...@gmail.com]
>Sent: Sunday, October 16, 2016 1:27 PM
>To: Jiri Pirko <j...@resnulli.us>
>Cc: Linux Netdev List <netdev@vger.kernel.org>; David Miller
><da...@davemloft.net>; Yotam Gigi <yot...@mellanox.com>; Ido Schimmel
><ido...@mellanox.com>; Elad Raz <el...@mellanox.com>; Nogah Frankel
><nog...@mellanox.com>; Or Gerlitz <ogerl...@mellanox.com>; Jamal Hadi Salim
><j...@mojatatu.com>; geert+rene...@glider.be; Stephen Hemminger
><step...@networkplumber.org>; Cong Wang <xiyou.wangc...@gmail.com>;
>Guenter Roeck <li...@roeck-us.net>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On Wed, Oct 12, 2016 at 3:41 PM, Jiri Pirko <j...@resnulli.us> wrote:
>> From: Yotam Gigi <yotam...@gmail.com>
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>>matchall action sample rate 12 mark 17
>>
>> tc filter add parent : dev eth1 protocol all \
>>u32 match mark 172 0xff
>>action mirred egress redirect dev dummy0
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling every 12'th packet on dev eth0 and marks the sampled packets with
>> 17. The command third catches the sampled packets, which are marked with
>> 17, and redirects them to dev dummy0.
>
>eth0 --> eth1
>
>command third --> third command
>

Missed that. Thanks :)

>don't we need a re-classify directive for the u32 filter to apply
>after the marking done by the matchall rule + sample action
>or is that implicit?

No, as the packets are re-injected to the ingress qdisc (as described in the 
commit message). Reclassify won't work as the sampled packets, which are a copy
of the chosen packets are generated inside the sample action and are not part of
the device packet stream.

>
>
>> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
>> new file mode 100644
>> index 000..a2b445a
>> --- /dev/null
>> +++ b/include/net/tc_act/tc_sample.h
>> @@ -0,0 +1,88 @@
>> +#ifndef __NET_TC_SAMPLE_H
>> +#define __NET_TC_SAMPLE_H
>> +
>> +#include 
>> +#include 
>> +
>> +struct tcf_sample {
>> +   struct tc_actioncommon;
>> +   u32 rate;
>> +   u32 mark;
>> +   booltruncate;
>> +   u32 trunc_size;
>> +   u32 packet_counter;
>> +   u8  eth_dst[ETH_ALEN];
>> +   u8  eth_src[ETH_ALEN];
>> +   u16 eth_type;
>> +   booleth_type_set;
>> +   struct list_headtcfm_list;
>> +};
>
>> +++ b/include/uapi/linux/tc_act/tc_sample.h
>> @@ -0,0 +1,31 @@
>> +#ifndef __LINUX_TC_SAMPLE_H
>> +#define __LINUX_TC_SAMPLE_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define TCA_ACT_SAMPLE 26
>> +
>> +struct tc_sample {
>> +   tc_gen;
>> +   __u32   rate;   /* sample rate */
>> +   __u32   mark;   /* mark to put on the sampled 
>> packets */
>> +   booltruncate;   /* whether to truncate the packets */
>> +   __u32   trunc_size; /* truncation size */
>> +   __u8eth_dst[ETH_ALEN]; /* encapsulated mac destination */
>> +   __u8eth_src[ETH_ALEN]; /* encapsulated mac source */
>> +   booleth_type_set;  /* whether to overrid ethtype */
>> +   __u16   eth_type;  /* 

RE: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-18 Thread Yotam Gigi
>-Original Message-
>From: Roopa Prabhu [mailto:ro...@cumulusnetworks.com]
>Sent: Tuesday, October 18, 2016 3:17 AM
>To: Jamal Hadi Salim <j...@mojatatu.com>
>Cc: Jiri Pirko <j...@resnulli.us>; netdev@vger.kernel.org; da...@davemloft.net;
>Yotam Gigi <yot...@mellanox.com>; Ido Schimmel <ido...@mellanox.com>; Elad
>Raz <el...@mellanox.com>; Nogah Frankel <nog...@mellanox.com>; Or Gerlitz
><ogerl...@mellanox.com>; geert+rene...@glider.be;
>step...@networkplumber.org; xiyou.wangc...@gmail.com; li...@roeck-us.net;
>Shrijeet Mukherjee <s...@cumulusnetworks.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>>
>> Some comments:
>> IIUC, the main struggle seems to be whether the redirect to dummy0
>> is useful or not? i.e instead of just letting the packets go up the
>> stack on eth1?
>
>yep, correct...given existing workflow for the non-offloaded case is
>to receive sample packets via bpf filter on socket or
>use netlink as a sample delivery mechanism (NFLOG eg)
>
>
>> It seems like sflowd needs to read off eth1 via packet socket?
>> To be backward compatible - supporting that approach seems sensible.

I am not sure whether using socket is backward compatible. hsflowd 
expects nflog packets, and ife packets will not be understood.

We planned on adding support on hsflowd in IFE encapsulated packets from
tuntap device. 

Did I understand you correctly?

>>
>> Note:
>> There is a clear efficiency benefit of both using IFE encoding and
>> redirecting to dummy0.
>> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
>> filter around every packet that comes off eth1.
>> I understand there are probably not millions of pps for this case;
>> but in a non-offloaded cases it could be millions pps.
>> And in case of sampling over many ethx devices, you can redirect
>> samples from many other ethx devices.
>> So making dummy0 the sflow device is a win.
>> 2) Encaping an IFE header implies a much more efficient bpf filter
>> (IFE ethertype is an excellent discriminator for bpf).
>>
>> Additional benefit is as mentioned before - redirecting to a device
>> means you can send it remotely over ethernet to a more powerful
>> machine without having to cross kernel-userspace. Redirecting instead
>> of mirroring to tuntap is also an interesting option.
>
>sure, this seems like a good option to have.
>generally you have one instance of the sampling agent on a hyper visor or 
>switch.
>But, if you have use-cases where monitoring agents run external, sure.
>would have preferred if it was optional or an addon and not the default.
>
>Regarding the device, yeah, agree there are pros and cons.
>An additional device just to sample packets seems like an overkill.
>But, if there is no other other option, and there are benefits to it, no 
>objections.
>Hopefully we can add another option on the existing api to skip the device in 
>the
>future.
>
>
>>
>>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yotam...@gmail.com>
>>
>>
>>>> +
>>>> +struct sample_packet_metadata {
>>>> +int sample_size;
>>>> +int orig_size;
>>>> +int ifindex;
>>>> +};
>>>> +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>>
>> Sure it can...

I will update the userspace API to be more generice: I will drop this struct and
let the user (iproute2 currently)  build the netlink packet himself (as Or 
Gerlitz suggested).

>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>>> and
>egress ports.
>>
>> What is the use case for both?
>
>I have heard that most monitoring tools have moved to ingress only sampling
>because of operational
>complexity (use case is sflow). I think hardware also supports ingress and 
>egress
>only sampling.
>better to have an option to reflect that in the api.

Agree. I will add both ingress and egress ports in the IFE. Both the hardware 
Implementation and kernel implementation don't support setting both, but 
It is good to have that option.

>
>>> Ideally you would also include a sequence number and a count of the total
>number of packets
>> > that were candidates for sampling.
>>
>> Sequence number may make sense (they will help show a gap if something
>> gets dropped). But i am not sure about the stats consuming such space.
>> Stats are something that can b

Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Roopa Prabhu
On 10/17/16, 5:17 PM, Roopa Prabhu wrote:
> On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
[snip]

inline below more data/context..

 +
 +struct sample_packet_metadata {
 +int sample_size;
 +int orig_size;
 +int ifindex;
 +};
 +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>> Sure it can...
more sflow context here... [1]

An extensible metadata scheme is  highly desirable when passing data from the 
dataplane to 
the sampling agent in userspace. Looking forward, advanced instrumentation is 
being 
added to data planes and keeping the api future proof will help.



>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>>> and egress ports.
>> What is the use case for both?
> I have heard that most monitoring tools have moved to ingress only sampling 
> because of operational
> complexity (use case is sflow). I think hardware also supports ingress and 
> egress only sampling.
> better to have an option to reflect that in the api.

The reason for having two ifindex numbers is to record the ingress and egress 
ports (i.e. the path that the packet takes through the datapath/ASIC). You may 
actually have three ifindex numbers associated with a sample:
1. The data source that made the measurement (on a linux system each bridge has 
its own ifindex)
2. The ifindex associated with the ingress switch port
3. The ifindex associated with the egress switch port.

All three apply irrespective of sampling direction.


thanks,
Roopa

[1] Additional extended flow attributes have been defined to further extend 
sFlow packet samples:
http://sflow.org/sflow_tunnels.txt 
http://sflow.org/sflow_openflow.txt 




Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Roopa Prabhu
On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>
> Some comments:
> IIUC, the main struggle seems to be whether the redirect to dummy0
> is useful or not? i.e instead of just letting the packets go up the
> stack on eth1?

yep, correct...given existing workflow for the non-offloaded case is
to receive sample packets via bpf filter on socket or
use netlink as a sample delivery mechanism (NFLOG eg)


> It seems like sflowd needs to read off eth1 via packet socket?
> To be backward compatible - supporting that approach seems sensible.
>
> Note:
> There is a clear efficiency benefit of both using IFE encoding and
> redirecting to dummy0.
> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
> filter around every packet that comes off eth1.
> I understand there are probably not millions of pps for this case;
> but in a non-offloaded cases it could be millions pps.
> And in case of sampling over many ethx devices, you can redirect
> samples from many other ethx devices.
> So making dummy0 the sflow device is a win.
> 2) Encaping an IFE header implies a much more efficient bpf filter
> (IFE ethertype is an excellent discriminator for bpf).
>
> Additional benefit is as mentioned before - redirecting to a device
> means you can send it remotely over ethernet to a more powerful
> machine without having to cross kernel-userspace. Redirecting instead
> of mirroring to tuntap is also an interesting option.

sure, this seems like a good option to have.
generally you have one instance of the sampling agent on a hyper visor or 
switch.
But, if you have use-cases where monitoring agents run external, sure.
would have preferred if it was optional or an addon and not the default.

Regarding the device, yeah, agree there are pros and cons.
An additional device just to sample packets seems like an overkill.
But, if there is no other other option, and there are benefits to it, no 
objections.
Hopefully we can add another option on the existing api to skip the device in 
the future.


>
>
> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>> From: Yotam Gigi 
>
>
>>> +
>>> +struct sample_packet_metadata {
>>> +int sample_size;
>>> +int orig_size;
>>> +int ifindex;
>>> +};
>>> +
>> This metadata does not look extensible.. can it be made to ?
>>
>
> Sure it can...
>
>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>> and egress ports.
>
> What is the use case for both?

I have heard that most monitoring tools have moved to ingress only sampling 
because of operational
complexity (use case is sflow). I think hardware also supports ingress and 
egress only sampling.
better to have an option to reflect that in the api.

>> Ideally you would also include a sequence number and a count of the total 
>> number of packets
> > that were candidates for sampling.
>
> Sequence number may make sense (they will help show a gap if something
> gets dropped). But i am not sure about the stats consuming such space.
> Stats are something that can be queried (tc stats should have a record
> of how many bytes/packets )

sure, thats fine.
>
>> The OVS implementation is a good example, the metadata includes all the 
>> actions applied
>> to the packet in the kernel data path.
>>
>
> Again not sure what the use case would be (and why waste such space
> especially when you are sending over the wire with such details).

All this is being used currently.., But, this can be other api's sflow uses
for monitoring.
http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf

Does not have to be part of the main/basic sampling api...
it was just an example.

>
>>> +rcu_read_lock();
>>> +retval = READ_ONCE(s->tcf_action);
>>> +
>>> +if (++s->packet_counter % s->rate == 0) {
>>
>> The sampling function isn’t random
>>
>> if (++s->packet_counter % s->rate == 0) {
>>
>> This is unsuitable for sFlow, which is specific about the random sampling 
>> function required.
>> BPF, OVS, and the
>> ULOG statistics module include efficient kernel based random sampling 
>> functions that could be used instead.
>>
>
> If i understood correctly, the above is a fallback sampling algorithm.
> In the case of the spectrum it already does the sampling in the ASIC
> so there is no need to repeat it in software.
> Agreed that in that case the sampling approach is not sufficiently
> random.

yes. and since the same sampling api will be used for offloaded and 
non-offloaded case,
the sampling algo here for the non-offloaded case...can do better .. atleast 
match the existing
api efficiency. We would want people to use the same api for the offload and 
non-offloaded case.

thanks,
Roopa



Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Jamal Hadi Salim


Some comments:
IIUC, the main struggle seems to be whether the redirect to dummy0
is useful or not? i.e instead of just letting the packets go up the
stack on eth1?
It seems like sflowd needs to read off eth1 via packet socket?
To be backward compatible - supporting that approach seems sensible.

Note:
There is a clear efficiency benefit of both using IFE encoding and
redirecting to dummy0.
1) Redirecting to dummy0 implies you dont need to exercise a bpf
filter around every packet that comes off eth1.
I understand there are probably not millions of pps for this case;
but in a non-offloaded cases it could be millions pps.
And in case of sampling over many ethx devices, you can redirect
samples from many other ethx devices.
So making dummy0 the sflow device is a win.
2) Encaping an IFE header implies a much more efficient bpf filter
(IFE ethertype is an excellent discriminator for bpf).

Additional benefit is as mentioned before - redirecting to a device
means you can send it remotely over ethernet to a more powerful
machine without having to cross kernel-userspace. Redirecting instead
of mirroring to tuntap is also an interesting option.

More comments below (on the sflow person's comment - dont seem him
on the Cc):

On 16-10-15 12:34 PM, Roopa Prabhu wrote:

On 10/12/16, 5:41 AM, Jiri Pirko wrote:

From: Yotam Gigi 




+
+struct sample_packet_metadata {
+   int sample_size;
+   int orig_size;
+   int ifindex;
+};
+

This metadata does not look extensible.. can it be made to ?



Sure it can...


With sflow in context, you need a pair of ifindex numbers to encode ingress and 
egress ports.


What is the use case for both?

Ideally you would also include a sequence number and a count of the total 
number of packets

> that were candidates for sampling.

Sequence number may make sense (they will help show a gap if something
gets dropped). But i am not sure about the stats consuming such space.
Stats are something that can be queried (tc stats should have a record
of how many bytes/packets )


The OVS implementation is a good example, the metadata includes all the actions 
applied
to the packet in the kernel data path.



Again not sure what the use case would be (and why waste such space
especially when you are sending over the wire with such details).


+   rcu_read_lock();
+   retval = READ_ONCE(s->tcf_action);
+
+   if (++s->packet_counter % s->rate == 0) {


The sampling function isn’t random

if (++s->packet_counter % s->rate == 0) {

This is unsuitable for sFlow, which is specific about the random sampling 
function required.
BPF, OVS, and the
ULOG statistics module include efficient kernel based random sampling functions 
that could be used instead.



If i understood correctly, the above is a fallback sampling algorithm.
In the case of the spectrum it already does the sampling in the ASIC
so there is no need to repeat it in software.
Agreed that in that case the sampling approach is not sufficiently
random.

cheers,
jamal


Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-16 Thread Or Gerlitz
On Wed, Oct 12, 2016 at 3:41 PM, Jiri Pirko  wrote:
> From: Yotam Gigi 
>
> This action allow the user to sample traffic matched by tc classifier.
> The sampling consists of choosing packets randomly, truncating them,
> adding some informative metadata regarding the interface and the original
> packet size and mark them with specific mark, to allow further tc rules to
> match and process. The marked sample packets are then injected into the
> device ingress qdisc using netif_receive_skb.
>
> The packets metadata is packed using the ife encapsulation protocol, and
> the outer packet's ethernet dest, source and eth_type, along with the
> rate, mark and the optional truncation size can be configured from
> userspace.
>
> Example:
> To sample ingress traffic from interface eth1, and redirect the sampled
> the sampled packets to interface dummy0, one may use the commands:
>
> tc qdisc add dev eth1 handle : ingress
>
> tc filter add dev eth1 parent : \
>matchall action sample rate 12 mark 17
>
> tc filter add parent : dev eth1 protocol all \
>u32 match mark 172 0xff
>action mirred egress redirect dev dummy0
>
> Where the first command adds an ingress qdisc and the second starts
> sampling every 12'th packet on dev eth0 and marks the sampled packets with
> 17. The command third catches the sampled packets, which are marked with
> 17, and redirects them to dev dummy0.

eth0 --> eth1

command third --> third command

don't we need a re-classify directive for the u32 filter to apply
after the marking done by the matchall rule + sample action
or is that implicit?


> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
> new file mode 100644
> index 000..a2b445a
> --- /dev/null
> +++ b/include/net/tc_act/tc_sample.h
> @@ -0,0 +1,88 @@
> +#ifndef __NET_TC_SAMPLE_H
> +#define __NET_TC_SAMPLE_H
> +
> +#include 
> +#include 
> +
> +struct tcf_sample {
> +   struct tc_actioncommon;
> +   u32 rate;
> +   u32 mark;
> +   booltruncate;
> +   u32 trunc_size;
> +   u32 packet_counter;
> +   u8  eth_dst[ETH_ALEN];
> +   u8  eth_src[ETH_ALEN];
> +   u16 eth_type;
> +   booleth_type_set;
> +   struct list_headtcfm_list;
> +};

> +++ b/include/uapi/linux/tc_act/tc_sample.h
> @@ -0,0 +1,31 @@
> +#ifndef __LINUX_TC_SAMPLE_H
> +#define __LINUX_TC_SAMPLE_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define TCA_ACT_SAMPLE 26
> +
> +struct tc_sample {
> +   tc_gen;
> +   __u32   rate;   /* sample rate */
> +   __u32   mark;   /* mark to put on the sampled packets 
> */
> +   booltruncate;   /* whether to truncate the packets */
> +   __u32   trunc_size; /* truncation size */
> +   __u8eth_dst[ETH_ALEN]; /* encapsulated mac destination */
> +   __u8eth_src[ETH_ALEN]; /* encapsulated mac source */
> +   booleth_type_set;  /* whether to overrid ethtype */
> +   __u16   eth_type;  /* encapsulated mac ethtype */
> +};

overrid --> override

what do you mean by override here, to encapsulate?

consider using 0 as special value, e.g no truncation and no encapsulation

best if you just define the netlink attributes (document on the RHS
the type, see the uapi
for the new tunnel key action) and let the tc action in-kernel code to
decode them directly
into the non UAPI structure. This way you are extendable and also
avoid having two
structs which is sort of confusing.

> +
> +enum {
> +   TCA_SAMPLE_UNSPEC,
> +   TCA_SAMPLE_TM,
> +   TCA_SAMPLE_PARMS,
> +   TCA_SAMPLE_PAD,
> +   __TCA_SAMPLE_MAX
> +};
> +#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
> +
> +#endif


> +static bool dev_ok_push(struct net_device *dev)
> +{
> +   switch (dev->type) {
> +   case ARPHRD_TUNNEL:
> +   case ARPHRD_TUNNEL6:
> +   case ARPHRD_SIT:
> +   case ARPHRD_IPGRE:
> +   case ARPHRD_VOID:
> +   case ARPHRD_NONE:
> +   return false;
> +   default:
> +   return true;
> +   }
> +}
> +

> +static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
> + struct tcf_result *res)
> +{
> +   struct tcf_sample *s = to_sample(a);
> +   struct sample_packet_metadata metadata;
> +   static struct ethhdr *ethhdr;
> +   struct sk_buff *skb2;
> +   int retval;
> +   u32 at;
> +
> +   tcf_lastuse_update(>tcf_tm);
> +   bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
> +
> +   rcu_read_lock();
> +   retval = READ_ONCE(s->tcf_action);
> +
> +   if (++s->packet_counter % s->rate == 0) {
> +   skb2 = 

Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-15 Thread Roopa Prabhu
On 10/15/16, 9:34 AM, Roopa Prabhu wrote:
> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>> From: Yotam Gigi 
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>> matchall action sample rate 12 mark 17
>>
>> tc filter add parent : dev eth1 protocol all \
>> u32 match mark 172 0xff
>> action mirred egress redirect dev dummy0
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling every 12'th packet on dev eth0 and marks the sampled packets with
>> 17. The command third catches the sampled packets, which are marked with
>> 17, and redirects them to dev dummy0.
>>
>> Signed-off-by: Yotam Gigi 
>> Signed-off-by: Jiri Pirko 
> channeling some feedback from Peter Phaal @sflow inline below:
>
>
If it helps, one more thing that came up was using bpf.
They also use bpf filters for pkt sampling in the non-offloaded case:
http://blog.sflow.com/2016/05/berkeley-packet-filter-bpf.html

so, existing apps (like sflow) that care about packet sampling do prefer to use
a socket api for sample delivery: netlink nflog or bpf like socket filters

also, to keep the software and hardware models the same, wondering if ebpf 
attach
can be a viable option (have not thought about the offloaded case completely 
yet).
This would give apps more control on attaching sample headers (like sflow) if 
needed.

thanks,
Roopa






Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-15 Thread Roopa Prabhu
On 10/12/16, 5:41 AM, Jiri Pirko wrote:
> From: Yotam Gigi 
>
> This action allow the user to sample traffic matched by tc classifier.
> The sampling consists of choosing packets randomly, truncating them,
> adding some informative metadata regarding the interface and the original
> packet size and mark them with specific mark, to allow further tc rules to
> match and process. The marked sample packets are then injected into the
> device ingress qdisc using netif_receive_skb.
>
> The packets metadata is packed using the ife encapsulation protocol, and
> the outer packet's ethernet dest, source and eth_type, along with the
> rate, mark and the optional truncation size can be configured from
> userspace.
>
> Example:
> To sample ingress traffic from interface eth1, and redirect the sampled
> the sampled packets to interface dummy0, one may use the commands:
>
> tc qdisc add dev eth1 handle : ingress
>
> tc filter add dev eth1 parent : \
>  matchall action sample rate 12 mark 17
>
> tc filter add parent : dev eth1 protocol all \
>  u32 match mark 172 0xff
>  action mirred egress redirect dev dummy0
>
> Where the first command adds an ingress qdisc and the second starts
> sampling every 12'th packet on dev eth0 and marks the sampled packets with
> 17. The command third catches the sampled packets, which are marked with
> 17, and redirects them to dev dummy0.
>
> Signed-off-by: Yotam Gigi 
> Signed-off-by: Jiri Pirko 
channeling some feedback from Peter Phaal @sflow inline below:

> ---
>  
> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
> new file mode 100644
> index 000..a2b445a
> --- /dev/null
> +++ b/include/net/tc_act/tc_sample.h
> @@ -0,0 +1,88 @@
> +#ifndef __NET_TC_SAMPLE_H
> +#define __NET_TC_SAMPLE_H
> +
> +#include 
> +#include 
> +
> +struct tcf_sample {
> + struct tc_actioncommon;
> + u32 rate;
> + u32 mark;
> + booltruncate;
> + u32 trunc_size;
> + u32 packet_counter;
> + u8  eth_dst[ETH_ALEN];
> + u8  eth_src[ETH_ALEN];
> + u16 eth_type;
> + booleth_type_set;
> + struct list_headtcfm_list;
> +};
> +#define to_sample(a) ((struct tcf_sample *)a)
> +
> +struct sample_packet_metadata {
> + int sample_size;
> + int orig_size;
> + int ifindex;
> +};
> +
This metadata does not look extensible.. can it be made to ?

With sflow in context, you need a pair of ifindex numbers to encode ingress and 
egress ports. Ideally you would also include a sequence number and a count of 
the total number of packets that were candidates for sampling. The OVS 
implementation is a good example, the metadata includes all the actions applied 
to the packet in the kernel data path.



[snip]

> diff --git a/include/uapi/linux/tc_act/tc_sample.h 
> b/include/uapi/linux/tc_act/tc_sample.h
> new file mode 100644
> index 000..654945b
> --- /dev/null
> +++ b/include/uapi/linux/tc_act/tc_sample.h
> @@ -0,0 +1,31 @@
> +#ifndef __LINUX_TC_SAMPLE_H
> +#define __LINUX_TC_SAMPLE_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define TCA_ACT_SAMPLE 26
> +
> +struct tc_sample {
> + tc_gen;
> + __u32   rate;   /* sample rate */
> + __u32   mark;   /* mark to put on the sampled packets */
> + booltruncate;   /* whether to truncate the packets */
> + __u32   trunc_size; /* truncation size */
> + __u8eth_dst[ETH_ALEN]; /* encapsulated mac destination */
> + __u8eth_src[ETH_ALEN]; /* encapsulated mac source */
> + booleth_type_set;  /* whether to overrid ethtype */
> + __u16   eth_type;  /* encapsulated mac ethtype */
> +};
> +
this does not look extensible and is part of UAPI ..

Doing the minimum in the kernel and leaving the rest to the user space agent is 
much more flexible. The user space agent can attach additional metadata and 
offer more flexibility in forwarding (sFlow uses XDR encoding over UDP and is 
routable over IPv4/IPv6).



> +enum {
> + TCA_SAMPLE_UNSPEC,
> + TCA_SAMPLE_TM,
> + TCA_SAMPLE_PARMS,
> + TCA_SAMPLE_PAD,
> + __TCA_SAMPLE_MAX
> +};
> +#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
> +
> +#endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 24f7cac..c54ea6b 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -650,6 +650,19 @@ config NET_ACT_MIRRED
> To compile this code as a module, choose M here: the
> module will be called act_mirred.
>  
> +config NET_ACT_SAMPLE
> +tristate "Traffic Sampling"
> +depends on NET_CLS_ACT
> +select NET_IFE
> +---help---
> +   Say Y here to allow packet 

[patch net-next RFC 4/6] Introduce sample tc action

2016-10-12 Thread Jiri Pirko
From: Yotam Gigi 

This action allow the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly, truncating them,
adding some informative metadata regarding the interface and the original
packet size and mark them with specific mark, to allow further tc rules to
match and process. The marked sample packets are then injected into the
device ingress qdisc using netif_receive_skb.

The packets metadata is packed using the ife encapsulation protocol, and
the outer packet's ethernet dest, source and eth_type, along with the
rate, mark and the optional truncation size can be configured from
userspace.

Example:
To sample ingress traffic from interface eth1, and redirect the sampled
the sampled packets to interface dummy0, one may use the commands:

tc qdisc add dev eth1 handle : ingress

tc filter add dev eth1 parent : \
   matchall action sample rate 12 mark 17

tc filter add parent : dev eth1 protocol all \
   u32 match mark 172 0xff
   action mirred egress redirect dev dummy0

Where the first command adds an ingress qdisc and the second starts
sampling every 12'th packet on dev eth0 and marks the sampled packets with
17. The command third catches the sampled packets, which are marked with
17, and redirects them to dev dummy0.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 include/net/tc_act/tc_sample.h|  88 ++
 include/uapi/linux/tc_act/Kbuild  |   1 +
 include/uapi/linux/tc_act/tc_sample.h |  31 
 net/sched/Kconfig |  13 ++
 net/sched/Makefile|   1 +
 net/sched/act_sample.c| 300 ++
 6 files changed, 434 insertions(+)
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/sched/act_sample.c

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
new file mode 100644
index 000..a2b445a
--- /dev/null
+++ b/include/net/tc_act/tc_sample.h
@@ -0,0 +1,88 @@
+#ifndef __NET_TC_SAMPLE_H
+#define __NET_TC_SAMPLE_H
+
+#include 
+#include 
+
+struct tcf_sample {
+   struct tc_actioncommon;
+   u32 rate;
+   u32 mark;
+   booltruncate;
+   u32 trunc_size;
+   u32 packet_counter;
+   u8  eth_dst[ETH_ALEN];
+   u8  eth_src[ETH_ALEN];
+   u16 eth_type;
+   booleth_type_set;
+   struct list_headtcfm_list;
+};
+#define to_sample(a) ((struct tcf_sample *)a)
+
+struct sample_packet_metadata {
+   int sample_size;
+   int orig_size;
+   int ifindex;
+};
+
+#if IS_ENABLED(NET_ACT_SAMPLE)
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+ struct sample_packet_metadata *metadata);
+#else
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+ struct sample_packet_metadata *metadata)
+{
+   return NULL;
+}
+#endif
+
+static inline bool is_tcf_sample(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+   return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+#else
+   return false;
+#endif
+}
+
+static inline __u32 tcf_sample_mark(const struct tc_action *a)
+{
+   return to_sample(a)->mark;
+}
+
+static inline __u32 tcf_sample_rate(const struct tc_action *a)
+{
+   return to_sample(a)->rate;
+}
+
+static inline bool tcf_sample_truncate(const struct tc_action *a)
+{
+   return to_sample(a)->truncate;
+}
+
+static inline int tcf_sample_trunc_size(const struct tc_action *a)
+{
+   return to_sample(a)->trunc_size;
+}
+
+static inline u16 tcf_sample_eth_type(const struct tc_action *a)
+{
+   return to_sample(a)->eth_type;
+}
+
+static inline bool tcf_sample_eth_type_set(const struct tc_action *a)
+{
+   return to_sample(a)->eth_type_set;
+}
+
+static inline void tcf_sample_eth_dst_addr(const struct tc_action *a, u8 *dst)
+{
+   ether_addr_copy(dst, to_sample(a)->eth_dst);
+}
+
+static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
+{
+   ether_addr_copy(src, to_sample(a)->eth_src);
+}
+
+#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index e3969bd..6c6b8d6 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -4,6 +4,7 @@ header-y += tc_defact.h
 header-y += tc_gact.h
 header-y += tc_ipt.h
 header-y += tc_mirred.h
+header-y += tc_sample.h
 header-y += tc_nat.h
 header-y += tc_pedit.h
 header-y += tc_skbedit.h
diff --git a/include/uapi/linux/tc_act/tc_sample.h 
b/include/uapi/linux/tc_act/tc_sample.h
new file mode 100644
index 000..654945b
--- /dev/null
+++