Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Alexei Starovoitov via iovisor-dev
On Thu, Jul 07, 2016 at 09:05:29PM -0700, John Fastabend wrote:
> On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 07, 2016 at 03:18:11PM +, Fastabend, John R wrote:
> >> Hi Jesper,
> >>
> >> I have done some previous work on proprietary systems where we
> >> used hardware to do the classification/parsing, then passed a cookie to
> >> the software, which used the cookie to look up a program to run on the
> >> packet. When your programs are structured as a bunch of parsing followed
> >> by some actions, this can provide real performance benefits. Also, a lot
> >> of existing hardware supports this today, assuming you use headers the
> >> hardware "knows" about. It's a natural model for hardware that uses a
> >> parser followed by tcam/cam/sram/etc lookup tables.
> 
> > Looking at bpf programs written at PLUMgrid, Facebook and Cisco,
> > I can assure you with full certainty that a parse/action split doesn't exist.
> > Parsing is always interleaved with lookups and actions.
> > The CPU spends a tiny fraction of its time doing parsing. Lookups are the heaviest.
> 
> What is heavy about a lookup? Is it the key generation? What I was
> really alluding to is that the key generation can be provided by the
> hardware. If your data structures are eBPF maps, though, it's probably
> a hash or array table, and the benefit of leveraging hardware would
> likely be much greater if/when there are software structures for LPM
> or wildcard lookups.

There is only a hash map in the SW, and the main cost of it was doing the
jhash math and the occasional miss in the hashtable.
'key generation' is only copying bytes, so it's mostly free.
Just like parsing, which is a few branches that tend to be predicted
by the CPU quite well.
In the case of our L4 load balancer we need to do a consistent hash, which
fixed HW probably won't be able to provide.
Unless the HW is programmable :)
In general, when we developed and benchmarked the programs,
redesigning a program to remove an extra hash lookup gave a performance
improvement, whereas simplifying the parsing logic (like removing vlan
handling or ip options) showed no difference in performance.
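
For illustration, a minimal XDP sketch of the shape being described here,
assuming present-day libbpf map conventions (which postdate this thread); the
flow_key layout and map name are made up, not the production program. Parsing
is a handful of bounds checks and branches, while the hash-map lookup is where
the jhash math and the potential miss live:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
	__u32 saddr;
	__u32 daddr;
	__u16 sport;
	__u16 dport;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);   /* jhash-based lookup under the hood */
	__type(key, struct flow_key);
	__type(value, __u64);              /* e.g. per-flow packet counter */
	__uint(max_entries, 65536);
} flow_map SEC(".maps");

SEC("xdp")
int xdp_flow_count(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* "Parsing": a few well-predicted branches and bounds checks. */
	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	struct iphdr *iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (iph->protocol != IPPROTO_UDP)
		return XDP_PASS;

	struct udphdr *udp = (void *)iph + iph->ihl * 4;
	if ((void *)(udp + 1) > data_end)
		return XDP_PASS;

	/* "Key generation": just copying bytes into the key. */
	struct flow_key key = {
		.saddr = iph->saddr,
		.daddr = iph->daddr,
		.sport = udp->source,
		.dport = udp->dest,
	};

	/* The heavy part: hash computation plus a possible map/cache miss. */
	__u64 *pkts = bpf_map_lookup_elem(&flow_map, &key);
	if (!pkts)
		return XDP_DROP;

	__sync_fetch_and_add(pkts, 1);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";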

> > Trying to split a single logical program into parsing/after_parse stages
> > has no practical benefit.
> > 
> >> If the goal is just to separate XDP traffic from non-XDP traffic,
> >> you could accomplish this with a combination of SR-IOV/macvlan to separate
> >> the device queues into multiple netdevs and then run XDP on just one of
> >> the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
> >> steer traffic to the netdev. This is how we support multiple networking
> >> stacks on one device, by the way; it is called the bifurcated driver. It's
> >> not too far of a stretch to think we could offload some simple XDP
> >> programs to program the splitting of traffic instead of
> >> cls_u32/flower/flow_director, and then you would have a stack of XDP
> >> programs: one running in hardware and a set running on the queues in
> >> software.
> > 
> > The above sounds like a much better approach than Jesper's/my prog_per_ring
> > stuff.
> > If we can split the NIC via SR-IOV and have a dedicated netdev via a VF just
> > for XDP, that's a way cleaner approach.
> > I guess we won't need to do xdp_rxqmask after all.
> > 
> 
> Right, and this works today, so all it would require is adding the XDP
> engine code to the VF drivers, which should be relatively
> straightforward if you have the PF driver working.

Good point. I think the next step should be to enable XDP in the VF drivers
and measure performance.



Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread John Fastabend via iovisor-dev
On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
> On Thu, Jul 07, 2016 at 03:18:11PM +, Fastabend, John R wrote:
>> Hi Jesper,
>>
>> I have done some previous work on proprietary systems where we
>> used hardware to do the classification/parsing, then passed a cookie to
>> the software, which used the cookie to look up a program to run on the
>> packet. When your programs are structured as a bunch of parsing followed
>> by some actions, this can provide real performance benefits. Also, a lot
>> of existing hardware supports this today, assuming you use headers the
>> hardware "knows" about. It's a natural model for hardware that uses a
>> parser followed by tcam/cam/sram/etc lookup tables.

> Looking at bpf programs written at PLUMgrid, Facebook and Cisco,
> I can assure you with full certainty that a parse/action split doesn't exist.
> Parsing is always interleaved with lookups and actions.
> The CPU spends a tiny fraction of its time doing parsing. Lookups are the heaviest.

What is heavy about a lookup? Is it the key generation? What I was
really alluding to is that the key generation can be provided by the
hardware. If your data structures are eBPF maps, though, it's probably
a hash or array table, and the benefit of leveraging hardware would
likely be much greater if/when there are software structures for LPM
or wildcard lookups.
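
As an aside, a sketch of the kind of software LPM lookup being alluded to
here. Note that BPF_MAP_TYPE_LPM_TRIE only landed in the kernel after this
thread, so for July 2016 this is hypothetical; the key layout and map name
are purely illustrative:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct lpm_v4_key {
	__u32 prefixlen;	/* LPM trie keys must start with the prefix length */
	__u32 addr;		/* IPv4 address, network byte order */
};

struct {
	__uint(type, BPF_MAP_TYPE_LPM_TRIE);
	__type(key, struct lpm_v4_key);
	__type(value, __u32);			/* e.g. next-hop or action id */
	__uint(max_entries, 16384);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required for LPM tries */
} route_map SEC(".maps");

SEC("xdp")
int xdp_lpm_demo(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	struct iphdr *iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Key generation is just byte copies; the trie walk is the work a
	 * parser + TCAM/SRAM pipeline could take off the CPU. */
	struct lpm_v4_key key = {
		.prefixlen = 32,
		.addr = iph->daddr,
	};

	__u32 *action = bpf_map_lookup_elem(&route_map, &key);
	return action ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";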

> Trying to split a single logical program into parsing/after_parse stages
> has no practical benefit.
> 
>> If the goal is just to separate XDP traffic from non-XDP traffic,
>> you could accomplish this with a combination of SR-IOV/macvlan to separate
>> the device queues into multiple netdevs and then run XDP on just one of
>> the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
>> steer traffic to the netdev. This is how we support multiple networking
>> stacks on one device, by the way; it is called the bifurcated driver. It's
>> not too far of a stretch to think we could offload some simple XDP
>> programs to program the splitting of traffic instead of
>> cls_u32/flower/flow_director, and then you would have a stack of XDP
>> programs: one running in hardware and a set running on the queues in
>> software.
> 
> The above sounds like a much better approach than Jesper's/my prog_per_ring
> stuff.
> If we can split the NIC via SR-IOV and have a dedicated netdev via a VF just
> for XDP, that's a way cleaner approach.
> I guess we won't need to do xdp_rxqmask after all.
> 

Right, and this works today, so all it would require is adding the XDP
engine code to the VF drivers, which should be relatively
straightforward if you have the PF driver working.

.John


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Alexei Starovoitov via iovisor-dev
On Thu, Jul 07, 2016 at 03:18:11PM +, Fastabend, John R wrote:
> Hi Jesper,
> 
> I have done some previous work on proprietary systems where we used hardware 
> to do the classification/parsing, then passed a cookie to the software, which 
> used the cookie to look up a program to run on the packet. When your programs 
> are structured as a bunch of parsing followed by some actions, this can 
> provide real performance benefits. Also, a lot of existing hardware supports 
> this today, assuming you use headers the hardware "knows" about. It's a 
> natural model for hardware that uses a parser followed by tcam/cam/sram/etc 
> lookup tables.

Looking at bpf programs written at PLUMgrid, Facebook and Cisco,
I can assure you with full certainty that a parse/action split doesn't exist.
Parsing is always interleaved with lookups and actions.
The CPU spends a tiny fraction of its time doing parsing. Lookups are the heaviest.
Trying to split a single logical program into parsing/after_parse stages
has no practical benefit.

> If the goal is just to separate XDP traffic from non-XDP traffic you could 
> accomplish this with a combination of SR-IOV/macvlan to separate the device 
> queues into multiple netdevs and then run XDP on just one of the netdevs. 
> Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to 
> the netdev. This is how we support multiple networking stacks on one device, 
> by the way; it is called the bifurcated driver. It's not too far of a stretch 
> to think we could offload some simple XDP programs to program the splitting 
> of traffic instead of cls_u32/flower/flow_director, and then you would have a 
> stack of XDP programs: one running in hardware and a set running on the 
> queues in software.

The above sounds like a much better approach than Jesper's/my prog_per_ring stuff.
If we can split the NIC via SR-IOV and have a dedicated netdev via a VF just for
XDP, that's a way cleaner approach.
I guess we won't need to do xdp_rxqmask after all.
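
A user-space sketch of the "dedicated VF netdev just for XDP" idea: load an
XDP object and attach it to the VF's netdev. This uses the later libbpf API
(bpf_xdp_attach), which did not exist at the time of this thread, and the
interface name, object path and program name are made up for illustration:

#include <stdio.h>
#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
	const char *vf_netdev = "eth2v0";	/* hypothetical VF netdev name */
	int ifindex = if_nametoindex(vf_netdev);

	if (!ifindex) {
		perror("if_nametoindex");
		return 1;
	}

	struct bpf_object *obj = bpf_object__open_file("xdp_prog.o", NULL);
	if (!obj || bpf_object__load(obj)) {
		fprintf(stderr, "failed to open/load xdp_prog.o\n");
		return 1;
	}

	struct bpf_program *prog =
		bpf_object__find_program_by_name(obj, "xdp_flow_count");
	if (!prog) {
		fprintf(stderr, "XDP program not found in object\n");
		return 1;
	}

	/* Native/driver mode, so the VF driver's own XDP path is exercised. */
	if (bpf_xdp_attach(ifindex, bpf_program__fd(prog),
			   XDP_FLAGS_DRV_MODE, NULL)) {
		fprintf(stderr, "XDP attach to %s failed\n", vf_netdev);
		return 1;
	}

	printf("XDP attached to %s (ifindex %d)\n", vf_netdev, ifindex);
	return 0;
}

Steering traffic to that VF (flow director or tc cls_u32/flower, as John
describes) would still be configured separately on the PF side.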



Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread John Fastabend via iovisor-dev
On 16-07-07 10:53 AM, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
>  wrote:
>> On Thu, 7 Jul 2016 15:18:11 +, Fastabend, John R wrote:
>>> The other interesting thing would be to do more than just packet
>>> steering and actually run a more complete XDP program. Netronome
>>> supports this, right? The question I have, though, is whether this is a
>>> stack of XDP programs, with one or more designated for hardware and some
>>> running in software, perhaps with some annotation in the program so the
>>> hardware JIT knows where to place programs, or whether we expect the JIT
>>> itself to try and decide what is best to offload. I think the easiest
>>> way to start is to annotate the programs.
>>>
>>> Also, as far as I know, a lot of hardware can stick extra data on the
>>> front or end of a packet, so you could push metadata calculated by the
>>> program here in a generic way without having to extend the XDP-defined
>>> metadata structures. Another option is to DMA the metadata to a
>>> specified address. With this metadata the consumer/producer XDP
>>> programs have to agree on the format, but no one else does.
>>
>> Yes!
>>
>> At the XDP summit we were discussing pipelining XDP programs in
>> general, with different stages of the pipeline potentially using
>> specific hardware capabilities or even being directly mappable onto
>> fixed HW functions.
>>
>> Designating parsing as one of the specialized blocks makes sense in the
>> long run, probably as the first stage, with recirculation possible.  We
>> also have some parsing HW we could utilize at some point.  However, I'm
>> worried that it's too early to impose constraints and APIs.  I agree
>> that we should first set a standard way to pass metadata across tail
>> calls to facilitate any form of pipelining, regardless of which parts
>> of the pipeline the HW is able to offload.
> 
> +1
> 
> I don't see any reason why XDP programs can't be turned into a pipeline,
> but this is an implementation choice, based on the output of one program
> being the input of the next.  While XDP may work with a pipeline, it does
> not require it or define it. This makes XDP different from P4 and the
> match-action paradigm.
> 
> Tom
> 

Sounds like we all agree. Just a note: XDP is a reasonable target
for P4; in fact, we have a P4-to-eBPF target already working. We may end
up with a set of DSLs running on top of XDP, where P4 is one of them.

.John


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Tom Herbert via iovisor-dev
On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
 wrote:
> On Thu, 7 Jul 2016 15:18:11 +, Fastabend, John R wrote:
>> The other interesting thing would be to do more than just packet
>> steering and actually run a more complete XDP program. Netronome
>> supports this, right? The question I have, though, is whether this is a
>> stack of XDP programs, with one or more designated for hardware and some
>> running in software, perhaps with some annotation in the program so the
>> hardware JIT knows where to place programs, or whether we expect the JIT
>> itself to try and decide what is best to offload. I think the easiest
>> way to start is to annotate the programs.
>>
>> Also, as far as I know, a lot of hardware can stick extra data on the
>> front or end of a packet, so you could push metadata calculated by the
>> program here in a generic way without having to extend the XDP-defined
>> metadata structures. Another option is to DMA the metadata to a
>> specified address. With this metadata the consumer/producer XDP
>> programs have to agree on the format, but no one else does.
>
> Yes!
>
> At the XDP summit we were discussing pipelining XDP programs in
> general, with different stages of the pipeline potentially using
> specific hardware capabilities or even being directly mappable onto
> fixed HW functions.
>
> Designating parsing as one of the specialized blocks makes sense in the
> long run, probably as the first stage, with recirculation possible.  We
> also have some parsing HW we could utilize at some point.  However, I'm
> worried that it's too early to impose constraints and APIs.  I agree
> that we should first set a standard way to pass metadata across tail
> calls to facilitate any form of pipelining, regardless of which parts
> of the pipeline the HW is able to offload.

+1

I don't see any reason why XDP programs can't be turned into a pipeline,
but this is an implementation choice, based on the output of one program
being the input of the next.  While XDP may work with a pipeline, it does
not require it or define it. This makes XDP different from P4 and the
match-action paradigm.

Tom


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Jakub Kicinski via iovisor-dev
On Thu, 7 Jul 2016 15:18:11 +, Fastabend, John R wrote:
> The other interesting thing would be to do more than just packet
> steering and actually run a more complete XDP program. Netronome
> supports this, right? The question I have, though, is whether this is a
> stack of XDP programs, with one or more designated for hardware and some
> running in software, perhaps with some annotation in the program so the
> hardware JIT knows where to place programs, or whether we expect the JIT
> itself to try and decide what is best to offload. I think the easiest
> way to start is to annotate the programs.
> 
> Also, as far as I know, a lot of hardware can stick extra data on the
> front or end of a packet, so you could push metadata calculated by the
> program here in a generic way without having to extend the XDP-defined
> metadata structures. Another option is to DMA the metadata to a
> specified address. With this metadata the consumer/producer XDP
> programs have to agree on the format, but no one else does.

Yes!

At the XDP summit we were discussing pipelining XDP programs in
general, with different stages of the pipeline potentially using
specific hardware capabilities or even being directly mappable onto
fixed HW functions.

Designating parsing as one of the specialized blocks makes sense in the
long run, probably as the first stage, with recirculation possible.  We
also have some parsing HW we could utilize at some point.  However, I'm
worried that it's too early to impose constraints and APIs.  I agree
that we should first set a standard way to pass metadata across tail
calls to facilitate any form of pipelining, regardless of which parts
of the pipeline the HW is able to offload.
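
One way the "metadata across tail calls" idea can already be prototyped with
existing eBPF pieces is a per-CPU scratch map written by the parser stage and
read by the next stage after bpf_tail_call(). The struct layout and program
names below are illustrative only, not a proposed standard:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct stage_meta {
	__u32 l3_off;	/* offset of the L3 header found by the parser stage */
	__u32 proto;	/* protocol id the two stages agree on */
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__type(key, __u32);
	__type(value, struct stage_meta);
	__uint(max_entries, 1);
} scratch SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__type(key, __u32);
	__type(value, __u32);
	__uint(max_entries, 4);
} stages SEC(".maps");

SEC("xdp")
int xdp_stage1_parse(struct xdp_md *ctx)
{
	__u32 zero = 0;
	struct stage_meta *m = bpf_map_lookup_elem(&scratch, &zero);

	if (!m)
		return XDP_ABORTED;

	/* A real stage 1 would fill these from the packet. */
	m->l3_off = sizeof(struct ethhdr);
	m->proto = 0x0800;

	/* Jump to stage 2; falls through (and drops) if no program is
	 * installed at index 1 of the prog array. */
	bpf_tail_call(ctx, &stages, 1);
	return XDP_DROP;
}

SEC("xdp")
int xdp_stage2_act(struct xdp_md *ctx)
{
	__u32 zero = 0;
	struct stage_meta *m = bpf_map_lookup_elem(&scratch, &zero);

	if (!m)
		return XDP_ABORTED;

	/* Act on what stage 1 recorded (same CPU, same XDP run). */
	return m->proto == 0x0800 ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";

The per-CPU map works because a tail call stays on the same CPU within the
same run; stashing the metadata in front of the packet data, as discussed
above, would be the other obvious place for it.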


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Fastabend, John R via iovisor-dev
Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to 
do the classification/parsing, then passed a cookie to the software, which used 
the cookie to look up a program to run on the packet. When your programs are 
structured as a bunch of parsing followed by some actions, this can provide real 
performance benefits. Also, a lot of existing hardware supports this today, 
assuming you use headers the hardware "knows" about. It's a natural model for 
hardware that uses a parser followed by tcam/cam/sram/etc lookup tables.
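
Translated to XDP terms, the cookie model could look roughly like the sketch
below, assuming (hypothetically) that the NIC prepends a 32-bit classification
cookie in front of the frame and that the cookie indexes a program array; the
prepend layout and map name are made up:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__type(key, __u32);
	__type(value, __u32);
	__uint(max_entries, 64);
} progs_by_cookie SEC(".maps");

SEC("xdp")
int xdp_cookie_dispatch(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* Hypothetical: HW prepends a 32-bit cookie ahead of the Ethernet
	 * header with its classification/parse result. */
	__u32 *cookie = data;
	if ((void *)(cookie + 1) > data_end)
		return XDP_PASS;

	/* Use the HW result to pick the program to run on this packet.
	 * If no program is registered for the cookie, the tail call falls
	 * through and the packet goes to the normal stack. */
	bpf_tail_call(ctx, &progs_by_cookie, *cookie);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";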

If the goal is just to separate XDP traffic from non-XDP traffic, you could 
accomplish this with a combination of SR-IOV/macvlan to separate the device 
queues into multiple netdevs and then run XDP on just one of the netdevs. Then 
use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the 
netdev. This is how we support multiple networking stacks on one device, by the 
way; it is called the bifurcated driver. It's not too far of a stretch to think 
we could offload some simple XDP programs to program the splitting of traffic 
instead of cls_u32/flower/flow_director, and then you would have a stack of XDP 
programs: one running in hardware and a set running on the queues in software.

The other interesting thing would be to do more than just packet steering and 
actually run a more complete XDP program. Netronome supports this, right? The 
question I have, though, is whether this is a stack of XDP programs, with one or 
more designated for hardware and some running in software, perhaps with some 
annotation in the program so the hardware JIT knows where to place programs, or 
whether we expect the JIT itself to try and decide what is best to offload. I 
think the easiest way to start is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data on the front or 
end of a packet, so you could push metadata calculated by the program here in a 
generic way without having to extend the XDP-defined metadata structures. Another 
option is to DMA the metadata to a specified address. With this metadata the 
consumer/producer XDP programs have to agree on the format, but no one else does.
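
A possible shape for such HW-provided metadata stuck on the front of the
packet: only the producing and consuming XDP programs need to agree on the
layout. Everything in this sketch (the struct fields, the assumption that the
NIC prepends it) is illustrative:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Layout known only to the producing and consuming XDP programs. */
struct hw_meta {
	__u16 proto_id;		/* protocol identified by the HW parser */
	__u16 l4_offset;	/* offset of the L4 header within the frame */
	__u32 rss_hash;		/* flow hash computed by the NIC */
} __attribute__((packed));

SEC("xdp")
int xdp_consume_hw_meta(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	struct hw_meta *meta = data;
	if ((void *)(meta + 1) > data_end)
		return XDP_PASS;

	/* No XDP metadata ABI change needed: this program simply knows the
	 * frame starts with struct hw_meta followed by the real packet. */
	if (meta->proto_id == 0x0800 && meta->l4_offset != 0)
		return XDP_PASS;	/* e.g. hand known IPv4 traffic on */

	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";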

FWIW, I was hoping to get some data showing the performance overhead vs. how 
deep we parse into the packets. I just won't have time to get to it for a while, 
but that could tell us how much perf gain the hardware could provide.

Thanks,
John

-Original Message-
From: Jesper Dangaard Brouer [mailto:bro...@redhat.com] 
Sent: Thursday, July 7, 2016 3:43 AM
To: iovisor-dev@lists.iovisor.org
Cc: bro...@redhat.com; Brenden Blanco ; Alexei 
Starovoitov ; Rana Shahout ; 
Ari Saha ; Tariq Toukan ; Or Gerlitz 
; net...@vger.kernel.org; Simon Horman 
; Simon Horman ; Jakub Kicinski 
; Edward Cree ; Fastabend, 
John R 
Subject: XDP seeking input from NIC hardware vendors


Would it make sense, from a hardware point of view, to split the XDP eBPF 
program into two stages?

 Stage-1: Filter (restricted eBPF / no-helper calls)
 Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the likely more 
advanced stage-2 on the kernel side.  Do HW vendors see a benefit of this 
approach?


The generic problem I'm trying to solve is parsing. E.g. the first step in 
every XDP program will be to parse the packet data, in order to determine if 
this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that makes 
sense for the stage-2 program, e.g. the proto id and/or data offset.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer



[iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Jesper Dangaard Brouer via iovisor-dev

Would it make sense, from a hardware point of view, to split the XDP
eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no-helper calls)
 Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the
likely more advanced stage-2 on the kernel side.  Do HW vendors see a
benefit of this approach?


The generic problem I'm trying to solve is parsing. E.g. the first
step in every XDP program will be to parse the packet data, in order
to determine if this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that
makes sense for the stage-2 program, e.g. the proto id and/or data offset.
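
A sketch of what such a stage-1 "filter" could look like as restricted eBPF
with no helper calls. XDP_DROP and XDP_PASS exist today; the MATCH action and
the opaque return value handed to stage-2 are hypothetical and only illustrate
the proposal:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical MATCH action: carry an opaque code for stage 2 in the
 * upper bits of the return value (here, the IP protocol id). */
#define XDP_FILTER_MATCH(code)	(0x100u | ((__u32)(code) << 16))

SEC("xdp")
int xdp_stage1_filter(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;		/* early drop of runt frames */

	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;		/* not ours: normal netstack */

	struct iphdr *iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_DROP;

	/* MATCH: hand the packet to stage 2, carrying the protocol id it
	 * should continue with. */
	return XDP_FILTER_MATCH(iph->protocol);
}

char _license[] SEC("license") = "GPL";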

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer