Re: Explaining RX-stages for XDP

2016-09-28 Thread Alexei Starovoitov
On Wed, Sep 28, 2016 at 12:44:31PM +0200, Jesper Dangaard Brouer wrote:
> 
> The idea is quite different.  It has nothing to do with Edward's
> proposal[3].  The RX packet-vector is simply an array, either of pointers
> or index numbers (into the RX-ring).  The needed changes are completely
> contained inside the driver.
> 
> > As far as intermixed XDP vs stack traffic, I think for the DoS case the
> > traffic patterns are binary. Either all of it is good, or under attack
> > most of the traffic is bad, so it makes sense to optimize for these two.
> > The 50/50 case, I think, is artificial and not worth optimizing for.
> 
> Sorry, but I feel you have completely misunderstood the concept of my
> idea.  It does not matter what traffic pattern you believe or don't
> believe in; that is irrelevant.  The fact is that intermixed traffic is
> possible with the current solution.  The core of my idea is to remove
> the possibility for this intermixed traffic to occur, simply by seeing
> XDP as an RX-stage before the stack.

Is the idea to add two extra 'to_stack' and 'not_to_stack' arrays
after the XDP program has made a decision?
Sounds interesting, but I cannot evaluate it properly, since I fail
to see what problem it solves and how it can be benchmarked.
Since you're saying the percentage of good vs. bad traffic is irrelevant,
this idea should help all types of traffic in all cases?
That would be truly awesome.

I think the discussion around performance optimizations
should start from:
1. describing the problem: What is the use case?
   What is the traffic pattern?
After the problem is understood and accepted as a valid problem,
and not some research subject, then
2. define a benchmark that simulates it.
Once the benchmarking methodology is agreed upon,
we can go to the next step:
3. come up with multiple ideas/proposals to solve problem 1,
and finally discuss
4. different implementations of the best proposal from step 3.

Without 1, it's easy to come up with a fake benchmark in step 2
that can justify the addition of any code in step 4.
Without 2, the comparison of different proposals will be
based on subjective opinions instead of hard data.
Without 3, the proposed patch will be a hard sell, since
other alternatives were not on the table.

For new features the steps are different, of course.

When we started on XDP, among everything else, the first two problems
it was supposed to solve were DoS and load balancer/ila_router.
For the former, the attack traffic is mostly dropped after parsing.
That behavior is simulated by samples/bpf/xdp1_kern.c.
For the load balancer, most of the traffic is parsed and transmitted
out on the same device after a packet rewrite. That scenario
is simulated by samples/bpf/xdp2_kern.c.
Therefore the benchmarking and performance numbers were
centered around these two programs, and I believe XDP itself
is a good solution for DoS and the ila_router based on
the performance numbers from these two benchmarks.
I think XDP is _not_ a good solution for many other use cases.
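
For reference, the drop-after-parse behavior looks roughly like this
(a sketch in the style of xdp1_kern.c with made-up names, not the
actual sample code):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"	/* SEC() macro, as in samples/bpf */

SEC("xdp_drop_sketch")
int xdp_drop_sketch(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	/* the verifier requires this bounds check before reading the header */
	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;

	/* the real xdp1_kern.c parses further and counts packets per IP
	 * protocol in a percpu map; xdp2_kern.c instead rewrites the MAC
	 * addresses and returns XDP_TX */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";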

The DoS-protection feature when not under attack was
not benchmarked, so this is an area to work on in the future
(the appropriate benchmark is xdp_pass+tcp+netperf).

Another area that was not benchmarked is how XDP drop/tx
interacts with normal control-plane traffic.
We assumed that different hw tx queues should provide
sufficient separation. That certainly remains to be tested.
A lot of work to do, no doubt.

> > Optimizing xdp for 'mostly good' traffic is indeed a challenge.
> > We'd need all the tricks to make it as good as normal skb-based traffic.
> >
> > I haven't seen any tests yet comparing xdp with a 'return XDP_PASS' program
> > vs no xdp at all, running netperf tcp/udp in user space. It shouldn't
> > be too far off.
> 
> Well, I did post numbers to the list with a 'return XDP_PASS' program[4]:
>  https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html
> 
> Wake up and smell the coffee; please revise your assumptions:
>  * It showed that the performance reduction is 25.98%!!!
>    (A/B comparison dropping packets in iptables raw)

Sure. iptables drop is slow with xdp_pass vs iptables without xdp.
This is not a benchmark I was interested in, since I don't
understand what use case it simulates.

> Conclusion: These measurements confirm that we need a page recycle
> facility for the drivers before switching to order-0 allocations.
...
> page_pool work
>  - iptables-raw-drop: driver mlx5
>    * 4,487,518 pps - baseline-before =>  100.0%
>    * 3,624,237 pps - mlx5 order0-patch   => - 19.2% (slower)
>    * 4,806,142 pps - PoC page_pool patch =>   +7.1% (faster)

I agree that a generic page_pool is a useful facility, but
I don't think it's the right approach to design it based on
the iptables_drop+xdp_pass benchmark.
You're saying the prototype gives a 7.1% improvement. Ok,
but what is the problem being solved?
If the use case is DoS and dropping as fast as possible, then XDP_DROP
is a better alternative. Why design/benchmark page_pool
based on iptables_drop?
If page_pool replaces custo

Re: Explaining RX-stages for XDP

2016-09-28 Thread Jesper Dangaard Brouer

On Tue, 27 Sep 2016 19:12:44 -0700 Alexei Starovoitov wrote:

> On Tue, Sep 27, 2016 at 11:32:37AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > Let me try in a calm way (not like [1]) to explain how I imagine that
> > the XDP processing RX-stage should be implemented. As I've pointed out
> > before[2], I'm proposing splitting up the driver into RX-stages.  This
> > is a mental-model change; I hope you can follow my "inception" attempt.
> >
> > The basic concept behind this idea is: if the RX-ring contains
> > multiple "ready" packets, then the kernel was too slow at processing
> > incoming packets. Thus, switch into a more efficient mode, a
> > "packet-vector" mode.
> >
> > Today, our XDP micro-benchmarks look amazing, and they are!  But once
> > real-life intermixed traffic is used, we lose the XDP I-cache
> > benefit.  XDP is meant for DoS protection, and an attacker can easily
> > construct intermixed traffic.  Why not fix this architecturally?
> >
> > Most important concept: if XDP returns XDP_PASS, do NOT pass the
> > packet up the network stack immediately (that would flush the I-cache).
> > Instead store the packet for the next RX-stage.  This basically splits
> > the packet-vector into two packet-vectors, one for the network stack and
> > one for XDP.  Thus, intermixed XDP vs. netstack traffic no longer has an
> > effect on XDP performance.
> >
> > The reason for also creating an XDP packet-vector is to move the
> > XDP_TX transmit code out of the XDP processing stage (and future
> > features).  This maximizes I-cache availability to the eBPF program,
> > and makes eBPF performance more uniform across drivers.
> >
> >
> > Inception:
> >  * Instead of individual packets, see it as an RX packet-vector.
> >  * XDP should be seen as a stage *before* the network stack gets called.
> >
> > If your mind can handle it: I'm NOT proposing an RX-vector of 64 packets.
> > I actually want N packets per vector (8-16).  The NIC HW RX process
> > runs concurrently, and in the time it takes to process N packets, more
> > packets have had a chance to arrive in the RX-ring queue.
> 
> Sounds like what Edward was proposing earlier: building a
> linked list of skbs and passing it further into the stack?
> Or is the idea different?

The idea is quite different.  It has nothing to do with Edward's
proposal[3].  The RX packet-vector is simply an array, either of pointers
or index numbers (into the RX-ring).  The needed changes are completely
contained inside the driver.
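
In driver pseudo-code, the XDP RX-stage with its two result vectors
could look roughly like this (all names are hypothetical; this is a
sketch of the idea, not a patch):

#include <linux/types.h>
#include <linux/bpf.h>			/* XDP_PASS / XDP_TX / XDP_DROP */

#define RX_VEC_SZ 16			/* N packets per vector (8-16) */

struct rx_vector {
	u16 idx[RX_VEC_SZ];		/* index numbers into the RX-ring */
	u8  cnt;
};

struct my_rx_ring;					/* hypothetical */
int  run_xdp_prog(struct my_rx_ring *ring, u16 idx);	/* hypothetical */
void recycle_rx_page(struct my_rx_ring *ring, u16 idx);	/* hypothetical */

/* XDP RX-stage: run the eBPF program over the whole vector while its
 * I-cache is hot, and partition the result into two new vectors. */
static void rx_stage_xdp(struct my_rx_ring *ring, struct rx_vector *rx,
			 struct rx_vector *to_stack, struct rx_vector *xdp_tx)
{
	int i;

	for (i = 0; i < rx->cnt; i++) {
		u16 idx = rx->idx[i];

		switch (run_xdp_prog(ring, idx)) {
		case XDP_PASS:
			to_stack->idx[to_stack->cnt++] = idx;
			break;
		case XDP_TX:
			xdp_tx->idx[xdp_tx->cnt++] = idx;
			break;
		default:		/* XDP_DROP / XDP_ABORTED */
			recycle_rx_page(ring, idx);
			break;
		}
	}
}

Netstack code is only touched in a later stage, and only if to_stack
is non-empty, so the eBPF loop never intermixes with it.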


> As far as intermixed XDP vs stack traffic, I think for the DoS case the
> traffic patterns are binary. Either all of it is good, or under attack
> most of the traffic is bad, so it makes sense to optimize for these two.
> The 50/50 case, I think, is artificial and not worth optimizing for.

Sorry, but I feel you have completely misunderstood the concept of my
idea.  It does not matter what traffic pattern you believe or don't
believe in; that is irrelevant.  The fact is that intermixed traffic is
possible with the current solution.  The core of my idea is to remove
the possibility for this intermixed traffic to occur, simply by seeing
XDP as an RX-stage before the stack.


> For all-good traffic, whether xdp is there or not shouldn't matter
> for this N-vector optimization. Whether it's a batch of 8, 16 or 64,
> either via linked list or array, it should probably be a generic
> mechanism independent of any xdp stuff.

I also feel you have misunderstood the N-vector "optimization".
But, yes, this introduction of RX-stages is independent of XDP.

The RX-stages idea is a generic change to the drivers' programming model.


[...]
> I think existing mlx4+xdp is already optimized for 'mostly attack' traffic
> and performs pretty well, since imo the 'all drop' benchmark is accurate.

The "all drop" benchmark is as artificial as it gets.  It think Eric
agrees.

My idea confines the XDP_DROP part to an RX-stage _before_ the
netstack. Then the "all drop" benchmark numbers will become a little
more trustworthy.


> Optimizing xdp for 'mostly good' traffic is indeed a challenge.
> We'd need all the tricks to make it as good as normal skb-based traffic.
>
> I haven't seen any tests yet comparing xdp with a 'return XDP_PASS' program
> vs no xdp at all, running netperf tcp/udp in user space. It shouldn't
> be too far off.

Well, I did post numbers to the list with a 'return XDP_PASS' program[4]:
 https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html
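
The program really is trivial; roughly (a sketch with made-up names):

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_pass_sketch")
int xdp_pass_sketch(struct xdp_md *ctx)
{
	return XDP_PASS;  /* touch nothing; every packet continues to the stack */
}

char _license[] SEC("license") = "GPL";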

Wake up and smell the coffee; please revise your assumptions:
 * It showed that the performance reduction is 25.98%!!!
   (A/B comparison dropping packets in iptables raw)

Conclusion: These measurements confirm that we need a page recycle
facility for the drivers before switching to order-0 allocations.

I did the same kind of experiment with mlx5, where I changed the memory
model to order-0 pages and then implemented page_pool on top. (The
numbers below are from before I implemented the DMA part of page_pool,
which does work now.)
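
The core recycle idea is small. A toy sketch (hypothetical API and
names, not my PoC code; assumes a single consumer per RX-ring, so no
locking, and ignores the DMA-mapping part):

#include <linux/mm.h>		/* alloc_page(), put_page() */

#define POOL_SZ 256		/* must be a power of two */

struct page_pool {		/* hypothetical, not an existing API */
	u32 head, tail;
	struct page *ring[POOL_SZ];
};

static struct page *pool_alloc(struct page_pool *pp, gfp_t gfp)
{
	if (pp->tail != pp->head) {	/* ring non-empty: recycle */
		struct page *page = pp->ring[pp->tail];

		pp->tail = (pp->tail + 1) & (POOL_SZ - 1);
		return page;
	}
	return alloc_page(gfp);		/* slow path: page allocator */
}

static void pool_put(struct page_pool *pp, struct page *page)
{
	u32 next = (pp->head + 1) & (POOL_SZ - 1);

	if (next == pp->tail) {		/* ring full: really free */
		put_page(page);
		return;
	}
	pp->ring[pp->head] = page;
	pp->head = next;
}

The point is that the RX fast path almost always hits the small ring
instead of the page allocator, which is what makes order-0 pages
affordable.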

page_pool work
 - 

Re: Explaining RX-stages for XDP

2016-09-27 Thread Alexei Starovoitov
On Tue, Sep 27, 2016 at 11:32:37AM +0200, Jesper Dangaard Brouer wrote:
> 
> Let me try in a calm way (not like [1]) to explain how I imagine that
> the XDP processing RX-stage should be implemented. As I've pointed out
> before[2], I'm proposing splitting up the driver into RX-stages.  This
> is a mental-model change; I hope you can follow my "inception" attempt.
>
> The basic concept behind this idea is: if the RX-ring contains
> multiple "ready" packets, then the kernel was too slow at processing
> incoming packets. Thus, switch into a more efficient mode, a
> "packet-vector" mode.
>
> Today, our XDP micro-benchmarks look amazing, and they are!  But once
> real-life intermixed traffic is used, we lose the XDP I-cache
> benefit.  XDP is meant for DoS protection, and an attacker can easily
> construct intermixed traffic.  Why not fix this architecturally?
>
> Most important concept: if XDP returns XDP_PASS, do NOT pass the
> packet up the network stack immediately (that would flush the I-cache).
> Instead store the packet for the next RX-stage.  This basically splits
> the packet-vector into two packet-vectors, one for the network stack and
> one for XDP.  Thus, intermixed XDP vs. netstack traffic no longer has an
> effect on XDP performance.
>
> The reason for also creating an XDP packet-vector is to move the
> XDP_TX transmit code out of the XDP processing stage (and future
> features).  This maximizes I-cache availability to the eBPF program,
> and makes eBPF performance more uniform across drivers.
>
>
> Inception:
>  * Instead of individual packets, see it as an RX packet-vector.
>  * XDP should be seen as a stage *before* the network stack gets called.
>
> If your mind can handle it: I'm NOT proposing an RX-vector of 64 packets.
> I actually want N packets per vector (8-16).  The NIC HW RX process
> runs concurrently, and in the time it takes to process N packets, more
> packets have had a chance to arrive in the RX-ring queue.

Sounds like what Edward was proposing earlier: building a
linked list of skbs and passing it further into the stack?
Or is the idea different?

As far as intermixed XDP vs stack traffic, I think for the DoS case the
traffic patterns are binary. Either all of it is good, or under attack
most of the traffic is bad, so it makes sense to optimize for these two.
The 50/50 case, I think, is artificial and not worth optimizing for.
For all-good traffic, whether xdp is there or not shouldn't matter
for this N-vector optimization. Whether it's a batch of 8, 16 or 64,
either via linked list or array, it should probably be a generic
mechanism independent of any xdp stuff.
For under-attack traffic, the most important thing is to optimize for
line-rate parsing of the traffic inside bpf and the quickest possible
drop on the driver side. The few good packets that are passed to the
stack make no difference to overall system performance.
I think existing mlx4+xdp is already optimized for 'mostly attack' traffic
and performs pretty well, since imo the 'all drop' benchmark is accurate.
Optimizing xdp for 'mostly good' traffic is indeed a challenge.
We'd need all the tricks to make it as good as normal skb-based traffic.
I haven't seen any tests yet comparing xdp with a 'return XDP_PASS' program
vs no xdp at all, running netperf tcp/udp in user space. It shouldn't
be too far off. Benchmarking this on mlx4 also won't necessarily
speak for ixgbe, since with a large mtu that driver is already
packet-per-page, so whenever ixgbe supports xdp, I think
ixgbe+xdp+'return XDP_PASS' should show the same tcp/udp performance
as ixgbe+large_mtu.
No doubt, it would be interesting to see mlx numbers.



Explaining RX-stages for XDP

2016-09-27 Thread Jesper Dangaard Brouer

Let me try in a calm way (not like [1]) to explain how I imagine that
the XDP processing RX-stage should be implemented. As I've pointed out
before[2], I'm proposing splitting up the driver into RX-stages.  This
is a mental-model change; I hope you can follow my "inception" attempt.

The basic concept behind this idea is: if the RX-ring contains
multiple "ready" packets, then the kernel was too slow at processing
incoming packets. Thus, switch into a more efficient mode, a
"packet-vector" mode.

Today, our XDP micro-benchmarks look amazing, and they are!  But once
real-life intermixed traffic is used, we lose the XDP I-cache
benefit.  XDP is meant for DoS protection, and an attacker can easily
construct intermixed traffic.  Why not fix this architecturally?

Most important concept: if XDP returns XDP_PASS, do NOT pass the
packet up the network stack immediately (that would flush the I-cache).
Instead store the packet for the next RX-stage.  This basically splits
the packet-vector into two packet-vectors, one for the network stack and
one for XDP.  Thus, intermixed XDP vs. netstack traffic no longer has an
effect on XDP performance.

The reason for also creating an XDP packet-vector is to move the
XDP_TX transmit code out of the XDP processing stage (and future
features).  This maximizes I-cache availability to the eBPF program,
and makes eBPF performance more uniform across drivers.


Inception:
 * Instead of individual packets, see it as an RX packet-vector.
 * XDP should be seen as a stage *before* the network stack gets called.

If your mind can handle it: I'm NOT proposing an RX-vector of 64 packets.
I actually want N packets per vector (8-16).  The NIC HW RX process
runs concurrently, and in the time it takes to process N packets, more
packets have had a chance to arrive in the RX-ring queue.
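
In driver pseudo-code, the staged napi_poll could look roughly like
this (every name below is hypothetical; it sketches the mental model,
not a patch):

#include <linux/netdevice.h>	/* struct napi_struct */

#define RX_VEC_SZ 16

struct rx_vector { u16 idx[RX_VEC_SZ]; u8 cnt; };	/* hypothetical */
struct my_rx_ring;					/* hypothetical */

/* Hypothetical per-stage helpers, each running over a whole vector */
bool rx_stage_fill(struct my_rx_ring *r, struct rx_vector *rx);
void rx_stage_xdp(struct my_rx_ring *r, struct rx_vector *rx,
		  struct rx_vector *to_stack, struct rx_vector *xdp_tx);
void rx_stage_xmit(struct my_rx_ring *r, struct rx_vector *xdp_tx);
void rx_stage_netstack(struct my_rx_ring *r, struct rx_vector *to_stack);
struct my_rx_ring *my_ring_from_napi(struct napi_struct *napi);

static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_rx_ring *ring = my_ring_from_napi(napi);
	int done = 0;

	while (done < budget) {
		struct rx_vector rx = {}, to_stack = {}, xdp_tx = {};

		/* Stage 1: pull up to N (8-16) "ready" descriptors into
		 * a vector of RX-ring indexes; stop if the ring is empty */
		if (!rx_stage_fill(ring, &rx))
			break;

		/* Stage 2: run the eBPF program over the whole vector and
		 * split it into a netstack vector and an XDP_TX vector
		 * (XDP_DROP pages are recycled inside this stage) */
		rx_stage_xdp(ring, &rx, &to_stack, &xdp_tx);

		/* Stage 3: transmit the XDP_TX vector as one batch */
		rx_stage_xmit(ring, &xdp_tx);

		/* Stage 4: only now call into the network stack, once,
		 * for the whole XDP_PASS vector */
		rx_stage_netstack(ring, &to_stack);

		done += rx.cnt;
	}
	return done;
}

Each stage runs to completion over the whole vector before the next one
starts, so the eBPF program, the XDP_TX path and the netstack path each
get the I-cache to themselves.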

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://mid.mail-archive.com/netdev@vger.kernel.org/msg127043.html

[2] http://lists.openwall.net/netdev/2016/01/15/51  

[3] http://lists.openwall.net/netdev/2016/04/19/89