Re: [RFC] net: remove busylock

2016-05-24 Thread Eric Dumazet
On Tue, 2016-05-24 at 13:50 +, David Laight wrote:
> From: Jesper Dangaard Brouer
> > Sent: 20 May 2016 18:50
> ...
> > It would be cool if you could run a test with busylock removed and
> > allow HTB to bulk dequeue.
> 
> (Without having looked.)
> Could you have two queues with separate enqueue and dequeue locks?
> 
> The enqueue code would acquire the enqueue lock and add the packet
> to the first queue.
> 
> The dequeue code would acquire the dequeue lock and try to remove
> a packet from the 2nd queue; if nothing is present, it would acquire
> the enqueue lock and move the entire 1st queue to the 2nd queue.
> 
> The obvious downside is two lock/unlocks for a single packet dequeue.
> If you can guarantee a single dequeue thread, the 2nd lock isn't needed.
> 

Not with HTB or any 'complex' qdisc hierarchy, where we have many
'queues' and strict limits (packets per qdisc).







RE: [RFC] net: remove busylock

2016-05-24 Thread David Laight
From: Jesper Dangaard Brouer
> Sent: 20 May 2016 18:50
...
> It would be cool if you could run a test with busylock removed and
> allow HTB to bulk dequeue.

(Without having looked.)
Could you have two queues with separate enqueue and dequeue locks?

The enqueue code would acquire the enqueue lock and add the packet
to the first queue.

The dequeue code would acquire the dequeue lock and try to remove
a packet from the 2nd queue; if nothing is present, it would acquire
the enqueue lock and move the entire 1st queue to the 2nd queue.

The obvious downside is two lock/unlocks for a single packet dequeue.
If you can guarantee a single dequeue thread, the 2nd lock isn't needed.
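
As a rough illustration of that scheme (a user-space sketch with invented
names, pthread spinlocks standing in for the kernel ones; not a proposed
patch), the dequeue side only pays for the second lock when its own queue
runs dry:

#include <pthread.h>
#include <stddef.h>

struct pkt { struct pkt *next; /* payload omitted */ };

struct two_q {
	pthread_spinlock_t enq_lock;
	pthread_spinlock_t deq_lock;
	struct pkt *enq_head, **enq_tail;	/* 1st queue: filled by producers */
	struct pkt *deq_head;			/* 2nd queue: drained by the consumer */
};

static void tq_init(struct two_q *q)
{
	pthread_spin_init(&q->enq_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&q->deq_lock, PTHREAD_PROCESS_PRIVATE);
	q->enq_head = NULL;
	q->enq_tail = &q->enq_head;
	q->deq_head = NULL;
}

static void tq_enqueue(struct two_q *q, struct pkt *p)
{
	p->next = NULL;
	pthread_spin_lock(&q->enq_lock);
	*q->enq_tail = p;			/* append under the enqueue lock only */
	q->enq_tail = &p->next;
	pthread_spin_unlock(&q->enq_lock);
}

static struct pkt *tq_dequeue(struct two_q *q)
{
	struct pkt *p;

	pthread_spin_lock(&q->deq_lock);
	if (!q->deq_head) {
		/* 2nd queue empty: splice the whole 1st queue over */
		pthread_spin_lock(&q->enq_lock);
		q->deq_head = q->enq_head;
		q->enq_head = NULL;
		q->enq_tail = &q->enq_head;
		pthread_spin_unlock(&q->enq_lock);
	}
	p = q->deq_head;
	if (p)
		q->deq_head = p->next;
	pthread_spin_unlock(&q->deq_lock);
	return p;
}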

David



Re: [RFC] net: remove busylock

2016-05-23 Thread Jesper Dangaard Brouer
On Fri, 20 May 2016 14:32:40 -0700
Eric Dumazet  wrote:

> On Fri, 2016-05-20 at 19:49 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 20 May 2016 07:16:55 -0700
> > Eric Dumazet  wrote:
> >   
> > > Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> > > on sch->flags when HTB is installed at the bonding device root.  
> > 
> > It would be cool if you could run a test with busylock removed and
> > allow HTB to bulk dequeue.
> 
> I added a /sys/class/net/eth0/tx_bulk_limit to tune the number of
> packets that we could bulk dequeue from a virtual device 
> (no BQL on them)
> 
> 200 TCP_RR through HTB on eth0 (bonding device)
> 
> 1) busylock enabled
> 
> 
> With tx_bulk_limit set to 8, we get a 12.7% increase.

That is actually quite a good performance increase for this workload.


> lpaa23:~# for f in `seq 1 16`; do echo $f|tee 
> /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
> 1
> Average: eth0 868625.67 868487.44  57055.69  56826.37  0.00  
> 0.00  0.56
> 2
> Average: eth0 927081.67 926920.78  60923.83  60649.90  0.00  
> 0.00  0.44
> 3
> Average: eth0 957678.11 957554.00  62877.04  62653.89  0.00  
> 0.00  0.56
> 4
> Average: eth0 966912.44 966747.33  63532.72  63255.51  0.00  
> 0.00  0.56
> 5
> Average: eth0 973137.56 972950.44  63958.31  63661.39  0.00  
> 0.00  0.44
> 6
> Average: eth0 958589.22 958449.44  62961.79  62712.56  0.00  
> 0.00  0.67
> 7
> Average: eth0 960737.67 960672.22  62968.34  62857.97  0.00  
> 0.00  0.44
> 8
> Average: eth0 979271.78 979201.67  64199.47  64070.84  0.00  
> 0.00  0.56
> 9
> Average: eth0 982464.33 982390.33  64418.42  64278.93  0.00  
> 0.00  0.56
> 10
> Average: eth0 982698.00 982578.22  64495.25  64291.28  0.00  
> 0.00  0.44
> 11
> Average: eth0 981862.22 981746.00  64438.16  64236.31  0.00  
> 0.00  0.56
> 12
> Average: eth0 983277.44 983096.33  64632.79  64327.79  0.00  
> 0.00  0.44
> 13
> Average: eth0 981221.11 981018.00  64521.82  64189.26  0.00  
> 0.00  0.67
> 14
> Average: eth0 981754.11 981555.89  64553.19  64224.39  0.00  
> 0.00  0.44
> 15
> Average: eth0 982484.33 982301.67  64572.00  64273.38  0.00  
> 0.00  0.44
> 16
> Average: eth0 978529.56 978326.67  64350.89  64013.39  0.00  
> 0.00  0.67
> 
> 
> 2) busylock disabled
> 
> 
> Well, bulk dequeue helps, but does not close the gap. 
> (busylock is needed)

Interesting observation that the busylock trick is still needed.  I would
have liked to see this get cleaned up, but I guess it is not that
trivial to "fix"/remove.


> lpaa23:~# for f in `seq 1 16`; do echo $f|tee 
> /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
> 1
> Average: eth0 795408.44 795407.67  52044.66  52045.93  0.00  
> 0.00  0.56
> 2
> Average: eth0 843411.78 843415.11  55185.23  55184.51  0.00  
> 0.00  0.56
> 3
> Average: eth0 876175.89 876175.00  57329.50  57327.98  0.00  
> 0.00  0.44
> 4
> Average: eth0 890631.22 890629.44  58274.58  58274.25  0.00  
> 0.00  0.67
> 5
> Average: eth0 900672.00 900668.89  58931.29  58930.54  0.00  
> 0.00  0.44
> 6
> Average: eth0 908325.78 908328.22  59432.97  59431.76  0.00  
> 0.00  0.56
> 7
> Average: eth0 913895.33 913885.11  59796.89  59795.46  0.00  
> 0.00  0.56
> 8
> Average: eth0 914429.11 914433.56  59832.26  59831.23  0.00  
> 0.00  0.67
> 9
> Average: eth0 918701.11 918699.67  60110.68  60110.36  0.00  
> 0.00  0.55
> 10
> Average: eth0 920382.33 920376.56  60223.31  60220.54  0.00  
> 0.00  0.67
> 11
> Average: eth0 914341.67 914344.67  59826.25  59825.90  0.00  
> 0.00  0.67
> 12
> Average: eth0 912697.00 912693.78  59718.77  59717.44  0.00  
> 0.00  0.44
> 13
> Average: eth0 917392.56 917385.00  60025.79  60024.34  0.00  
> 0.00  0.44
> 14
> Average: eth0 918232.89 918233.78  60081.04  60079.94  0.00  
> 0.00  0.67
> 15
> Average: eth0 918377.11 918381.00  60091.14  60089.79  0.00  
> 0.00  0.44
> 16
> Average: eth0 913817.56 913812.33  59792.09  59790.66  0.00  
> 0.00  0.56
> 

Thanks for doing the experiment.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] net: remove busylock

2016-05-20 Thread Eric Dumazet
On Fri, 2016-05-20 at 19:49 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 20 May 2016 07:16:55 -0700
> Eric Dumazet  wrote:
> 
> > Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> > on sch->flags when HTB is installed at the bonding device root.
> 
> It would be cool if you could run a test with busylock removed and
> allow HTB to bulk dequeue.

I added a /sys/class/net/eth0/tx_bulk_limit to tune the number of
packets that we could bulk dequeue from a virtual device 
(no BQL on them)

200 TCP_RR through HTB on eth0 (bonding device)

1) busylock enabled


With tx_bulk_limit set to 8, we get a 12.7% increase.

lpaa23:~# for f in `seq 1 16`; do echo $f|tee 
/sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
1
Average: eth0 868625.67 868487.44  57055.69  56826.37  0.00  
0.00  0.56
2
Average: eth0 927081.67 926920.78  60923.83  60649.90  0.00  
0.00  0.44
3
Average: eth0 957678.11 957554.00  62877.04  62653.89  0.00  
0.00  0.56
4
Average: eth0 966912.44 966747.33  63532.72  63255.51  0.00  
0.00  0.56
5
Average: eth0 973137.56 972950.44  63958.31  63661.39  0.00  
0.00  0.44
6
Average: eth0 958589.22 958449.44  62961.79  62712.56  0.00  
0.00  0.67
7
Average: eth0 960737.67 960672.22  62968.34  62857.97  0.00  
0.00  0.44
8
Average: eth0 979271.78 979201.67  64199.47  64070.84  0.00  
0.00  0.56
9
Average: eth0 982464.33 982390.33  64418.42  64278.93  0.00  
0.00  0.56
10
Average: eth0 982698.00 982578.22  64495.25  64291.28  0.00  
0.00  0.44
11
Average: eth0 981862.22 981746.00  64438.16  64236.31  0.00  
0.00  0.56
12
Average: eth0 983277.44 983096.33  64632.79  64327.79  0.00  
0.00  0.44
13
Average: eth0 981221.11 981018.00  64521.82  64189.26  0.00  
0.00  0.67
14
Average: eth0 981754.11 981555.89  64553.19  64224.39  0.00  
0.00  0.44
15
Average: eth0 982484.33 982301.67  64572.00  64273.38  0.00  
0.00  0.44
16
Average: eth0 978529.56 978326.67  64350.89  64013.39  0.00  
0.00  0.67


2) busylock disabled


Well, bulk dequeue helps, but does not close the gap. 
(busylock is needed)

lpaa23:~# for f in `seq 1 16`; do echo $f|tee 
/sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
1
Average: eth0 795408.44 795407.67  52044.66  52045.93  0.00  
0.00  0.56
2
Average: eth0 843411.78 843415.11  55185.23  55184.51  0.00  
0.00  0.56
3
Average: eth0 876175.89 876175.00  57329.50  57327.98  0.00  
0.00  0.44
4
Average: eth0 890631.22 890629.44  58274.58  58274.25  0.00  
0.00  0.67
5
Average: eth0 900672.00 900668.89  58931.29  58930.54  0.00  
0.00  0.44
6
Average: eth0 908325.78 908328.22  59432.97  59431.76  0.00  
0.00  0.56
7
Average: eth0 913895.33 913885.11  59796.89  59795.46  0.00  
0.00  0.56
8
Average: eth0 914429.11 914433.56  59832.26  59831.23  0.00  
0.00  0.67
9
Average: eth0 918701.11 918699.67  60110.68  60110.36  0.00  
0.00  0.55
10
Average: eth0 920382.33 920376.56  60223.31  60220.54  0.00  
0.00  0.67
11
Average: eth0 914341.67 914344.67  59826.25  59825.90  0.00  
0.00  0.67
12
Average: eth0 912697.00 912693.78  59718.77  59717.44  0.00  
0.00  0.44
13
Average: eth0 917392.56 917385.00  60025.79  60024.34  0.00  
0.00  0.44
14
Average: eth0 918232.89 918233.78  60081.04  60079.94  0.00  
0.00  0.67
15
Average: eth0 918377.11 918381.00  60091.14  60089.79  0.00  
0.00  0.44
16
Average: eth0 913817.56 913812.33  59792.09  59790.66  0.00  
0.00  0.56




Re: [RFC] net: remove busylock

2016-05-20 Thread Jesper Dangaard Brouer
On Fri, 20 May 2016 07:16:55 -0700
Eric Dumazet  wrote:

> Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> on sch->flags when HTB is installed at the bonding device root.

It would be cool if you could run a test with busylock removed and
allow HTB to bulk dequeue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] net: remove busylock

2016-05-20 Thread John Fastabend
On 16-05-20 06:11 AM, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> 
> 
>> The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
>> by allowing dequeue to do more work while holding the lock.
>>
>> You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
>> Have you tried to enable/allow HTB to bulk dequeue?
>>
> 
> Well, __QDISC___STATE_RUNNING means exactly that: one CPU is dequeueing
> many packets from the qdisc and transmitting them to the device.
> 
> It is generic for any kind of qdisc.
> 
> HTB bulk dequeue would have to call ->dequeue() multiple times. If you do
> this while holding the qdisc spinlock, you block other CPUs from doing
> concurrent ->enqueue(), adding latencies (always the same trade-off...)
> 
> HTB won't anytime soon have separate protections for the ->enqueue()
> and the ->dequeue(). Have you looked at this monster? I did, many
> times...
> 

I came to the conclusion that we just need to write a new, modern
version of HTB at some point. Easier said than done, however.

> Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> into a seqcount so that we can grab stats without holding the qdisc lock.
> 
> 
> 
> 



Re: [RFC] net: remove busylock

2016-05-20 Thread Eric Dumazet
On Fri, 2016-05-20 at 06:47 -0700, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 06:11 -0700, Eric Dumazet wrote:
> > On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> > 
> > 
> > > The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
> > > by allowing dequeue to do more work while holding the lock.
> > > 
> > > You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> > > Have you tried to enable/allow HTB to bulk dequeue?
> > > 
> > 
> > Well, __QDISC___STATE_RUNNING means exactly that: one CPU is dequeueing
> > many packets from the qdisc and transmitting them to the device.
> > 
> > It is generic for any kind of qdisc.
> > 
> > HTB bulk dequeue would have to call ->dequeue() multiple times. If you do
> > this while holding the qdisc spinlock, you block other CPUs from doing
> > concurrent ->enqueue(), adding latencies (always the same trade-off...)
> > 
> > HTB won't anytime soon have separate protections for the ->enqueue()
> > and the ->dequeue(). Have you looked at this monster? I did, many
> > times...
> > 
> > Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> > into a seqcount so that we can grab stats without holding the qdisc lock.
> 
> Side note: __qdisc_run() could probably avoid a __netif_schedule()
> when it breaks the loop, if another CPU is busy spinning on the qdisc lock.
> 
> -> Fewer (spurious) TX softirq invocations, so less chance of triggering the
> infamous ksoftirqd bug we discussed lately.

Also note that in our case, we have HTB on a bonding device, and
FQ/pacing on slaves.

Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
on sch->flags when HTB is installed at the bonding device root.

We might add a flag to tell the qdisc layer that a device is virtual and
could benefit from bulk dequeue, since the ultimate TX queue is located
on another (physical) netdev, possibly MQ-enabled.






Re: [RFC] net: remove busylock

2016-05-20 Thread Eric Dumazet
On Fri, 2016-05-20 at 06:11 -0700, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> 
> 
> > The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
> > by allowing dequeue to do more work while holding the lock.
> > 
> > You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> > Have you tried to enable/allow HTB to bulk dequeue?
> > 
> 
> Well, __QDISC___STATE_RUNNING means exactly that: one CPU is dequeueing
> many packets from the qdisc and transmitting them to the device.
> 
> It is generic for any kind of qdisc.
> 
> HTB bulk dequeue would have to call ->dequeue() multiple times. If you do
> this while holding the qdisc spinlock, you block other CPUs from doing
> concurrent ->enqueue(), adding latencies (always the same trade-off...)
> 
> HTB won't anytime soon have separate protections for the ->enqueue()
> and the ->dequeue(). Have you looked at this monster? I did, many
> times...
> 
> Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> into a seqcount so that we can grab stats without holding the qdisc lock.

Side note: __qdisc_run() could probably avoid a __netif_schedule()
when it breaks the loop, if another CPU is busy spinning on the qdisc lock.

-> Fewer (spurious) TX softirq invocations, so less chance of triggering the
infamous ksoftirqd bug we discussed lately.





Re: [RFC] net: remove busylock

2016-05-20 Thread Eric Dumazet
On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:


> The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
> by allowing dequeue to do more work while holding the lock.
> 
> You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> Have you tried to enable/allow HTB to bulk dequeue?
> 

Well, __QDISC___STATE_RUNNING means exactly that: one CPU is dequeueing
many packets from the qdisc and transmitting them to the device.

It is generic for any kind of qdisc.

HTB bulk dequeue would have to call ->dequeue() multiple times. If you do
this while holding the qdisc spinlock, you block other CPUs from doing
concurrent ->enqueue(), adding latencies (always the same trade-off...)

HTB won't anytime soon have separate protections for the ->enqueue()
and the ->dequeue(). Have you looked at this monster? I did, many
times...

Note that I am working on a patch to transform __QDISC___STATE_RUNNING
into a seqcount so that we can grab stats without holding the qdisc lock.
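
(For readers unfamiliar with the pattern, a schematic sketch of the seqcount
idea; this is illustrative, not the actual patch, and the 'running' field is
assumed to be the new seqcount added to struct Qdisc:)

/* Writer side runs in the dequeue path, which is already exclusive: */
static void qdisc_update_stats(struct Qdisc *q, u64 bytes, u32 packets)
{
	write_seqcount_begin(&q->running);
	q->bstats.bytes   += bytes;
	q->bstats.packets += packets;
	write_seqcount_end(&q->running);
}

/* Reader side (stats dump) retries instead of taking the qdisc lock: */
static void qdisc_read_stats(const struct Qdisc *q, u64 *bytes, u32 *packets)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&q->running);
		*bytes   = q->bstats.bytes;
		*packets = q->bstats.packets;
	} while (read_seqcount_retry(&q->running, seq));
}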






Re: [RFC] net: remove busylock

2016-05-20 Thread Jesper Dangaard Brouer
On Thu, 19 May 2016 11:03:32 -0700
Alexander Duyck  wrote:

> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet  wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> >
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput"))
> >
> > Now that kernel spinlocks are MCS, this busylock thing is no longer
> > relevant. It is slowing things down a bit.
> >
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host 
> > with 48 hyperthreads.
> >
[...]
> 
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.

Yes, exactly.
 
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system and the result was that the performance would drop off a cliff
> and you would drop almost all the packets for almost all the streams
> because the qdisc never had a chance to drain because it would be CPU
> - 1 enqueues, followed by 1 dequeue.

Notice that 1 enqueue does not guarantee 1 dequeue.  If one CPU has entered
the dequeue/xmit state (setting __QDISC___STATE_RUNNING), then other CPUs
will just enqueue.  Thus, N CPUs will enqueue (a packet each), and 1 CPU
will dequeue a packet.

The problem arises due to the fairness of the ticket spinlock, and AFAIK
the MCS lock is even more fair.  All enqueues get their turn before
the dequeue can move forward.  And the qdisc dequeue CPU has an
unfortunate pattern of releasing the qdisc root_lock and acquiring it
again (qdisc_restart + sch_direct_xmit). Thus, while dequeue is waiting to
re-acquire the root_lock, N CPUs will enqueue packets.
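
(Condensed shape of that pattern, paraphrased from qdisc_restart()/
sch_direct_xmit() of that era, with error handling omitted; the window
between the unlock and the re-lock is where the enqueuers get served:)

	skb = dequeue_skb(q, &validate, &packets);	/* root_lock held */
	spin_unlock(root_lock);				/* drop qdisc root lock */

	HARD_TX_LOCK(dev, txq, smp_processor_id());	/* per-device TX lock */
	if (!netif_xmit_frozen_or_stopped(txq))
		skb = dev_hard_start_xmit(skb, dev, txq, &ret);
	HARD_TX_UNLOCK(dev, txq);

	spin_lock(root_lock);				/* re-acquire; a fair lock
							 * serves waiting enqueuers
							 * first */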

The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
by allowing dequeue to do more work while holding the lock.

You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
Have you tried to enable/allow HTB to bulk dequeue?


> What we need if we are going to get rid of busy lock would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion because the qdiscs just overflow without the busy lock in
> place.

Exactly. The qdisc locking scheme is designed to only allow one
dequeuing CPU to run (via state __QDISC___STATE_RUNNING).  Jamal told
me this was an optimization.  Maybe this optimization "broke" when
locking got fair?

I don't want to offend the original qdisc designers, but when I look at
the qdisc locking code, I keep thinking this scheme is broken.

Wouldn't it be better to have a separate enqueue lock and a dequeue
lock (with a producer/consumer queue)?

Once we get John's lockless qdisc scheme, then the individual qdiscs
can implement a locking scheme like this, and we can keep the old
locking scheme intact for legacy qdiscs. Or we can clean up this locking
scheme and update the legacy qdiscs to use an MPMC queue?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] net: remove busylock

2016-05-19 Thread Eric Dumazet
On Thu, 2016-05-19 at 21:49 -0700, John Fastabend wrote:

> I plan to start looking at this again in June when I have some
> more time FWIW. The last set of RFCs I sent out bypassed both the
> qdisc lock and the busy poll lock. I remember thinking this was a
> net win at the time but I only did very basic testing e.g. firing
> up n sessions of pktgen.
> 
> And because we are talking about cruft I always thought the gso_skb
> requeue logic could be done away with as well. As far as I can tell
> it must be there from some historic code that has been refactored
> or deleted in pre-git days. It would be much better I think (no data)
> to use byte queue limits or some other way to ensure the driver can
> consume the packet vs popping and pushing skbs around.

Problem is: byte queue limits can tell the qdisc to send one packet, which
happens to be a GSO packet needing software segmentation.

(BQL does not know the size of the next packet to be dequeued from the
qdisc.)

Let's say this GSO packet had 45 segs.

Then the driver has limited TX ring space and only accepts 10 segs,
or the BQL budget is simply consumed after 10 segs.

You need to requeue the remaining 35 segs.

So, for example, the following patch does not even help the requeue syndrome.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 269dd71b3828..a440c059fbcf 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -60,7 +60,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 const struct netdev_queue *txq,
 int *packets)
 {
-   int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
+   int bytelimit = qdisc_avail_bulklimit(txq) - qdisc_skb_cb(skb)->pkt_len;
 
while (bytelimit > 0) {
struct sk_buff *nskb = q->dequeue(q);
@@ -68,7 +68,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
if (!nskb)
break;
 
-   bytelimit -= nskb->len; /* covers GSO len */
+   bytelimit -= qdisc_skb_cb(nskb)->pkt_len;
skb->next = nskb;
skb = nskb;
(*packets)++; /* GSO counts as one pkt */




Re: [RFC] net: remove busylock

2016-05-19 Thread John Fastabend
On 16-05-19 01:39 PM, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet  wrote:
>> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>>
>>> Removing busylock helped in all cases I tested. (at least on x86 as
>>> David pointed out)
>>>
>>> As I said, we need to revisit busylock now that spinlocks are different.
>>>
>>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>>
>>> With busylock :
>>>
>>> lpaa5:~# sar -n DEV 4 4|grep eth0
>>> Average: eth0 12.19 112797.12  1.95  37672.28  0.00 
>>>  0.00  0.69
>>>
>>
>>
>> Hmpf, my sysctl logic was inverted. Really these results made little
>> sense.
>>
>> Sorry for the noise. At least we have 8% confirmed gain with this
>> stuff ;)
> 
> Unfortunately I see a 21% regression when you place the qdisc under
> load from multiple CPUs/nodes.  In my testing I dropped from around
> 1.05 Mpps to around 827 Kpps when I removed the busylock.
> 
> My test setup is pretty straight forward and the results I have seen
> are consistent between runs.  I have a system with 32 threads / 16
> cores spread over 2 NUMA nodes.  I reduce the number of queues on a
> 10Gb/s NIC to 1.  I kill irqbalance and disable CPU power states.  I
> then start a quick "for" loop that will schedule one UDP_STREAM
> netperf session on each CPU using a small payload size like 32.
> 
> On a side note, if I move everything onto one node I can push about
> 2.4 Mpps and the busylock doesn't seem to really impact things, but if
> I go to both nodes then I see the performance regression as described
> above.  I was thinking about it and I don't think the MCS type locks
> would have that much of an impact.  If anything I think that xmit_more
> probably has a bigger effect given that it allows us to grab multiple
> frames with each fetch and thereby reduce the lock contention on the
> dequeue side.
> 
>>> Presumably it would tremendously help if the actual kfree_skb()
>>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>>> method.
>>>
>>
>> This part is still valid.
>>
>> We could have a per cpu storage of one skb pointer, so that we do not
>> have to change all ->enqueue() prototypes.
> 
> I fully agree with that.
> 
> One thought I had is if we could move to a lockless dequeue path for
> the qdisc then we could also get rid of the busy lock.  I know there
> has been talk about doing away with qdisc locks or changing the inner
> mechanics of the qdisc itself in the past, I'm CCing Jesper and John
> for that reason.  Maybe it is time for us to start looking into that
> so we can start cleaning out some of the old cruft we have in this
> path.
> 
> - Alex
> 

I plan to start looking at this again in June when I have some
more time FWIW. The last set of RFCs I sent out bypassed both the
qdisc lock and the busy poll lock. I remember thinking this was a
net win at the time but I only did very basic testing e.g. firing
up n sessions of pktgen.

And because we are talking about cruft I always thought the gso_skb
requeue logic could be done away with as well. As far as I can tell
it must be there from some historic code that has been refactored
or deleted in pre-git days. It would be much better I think (no data)
to use byte queue limits or some other way to ensure the driver can
consume the packet vs popping and pushing skbs around.

.John


Re: [RFC] net: remove busylock

2016-05-19 Thread Alexander Duyck
On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet  wrote:
> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>
>> Removing busylock helped in all cases I tested. (at least on x86 as
>> David pointed out)
>>
>> As I said, we need to revisit busylock now that spinlocks are different.
>>
>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>
>> With busylock :
>>
>> lpaa5:~# sar -n DEV 4 4|grep eth0
>> Average: eth0 12.19 112797.12  1.95  37672.28  0.00  
>> 0.00  0.69
>>
>
>
> Hmpf, my sysctl logic was inverted. Really these results made little
> sense.
>
> Sorry for the noise. At least we have 8% confirmed gain with this
> stuff ;)

Unfortunately I see a 21% regression when you place the qdisc under
load from multiple CPUs/nodes.  In my testing I dropped from around
1.05 Mpps to around 827 Kpps when I removed the busylock.

My test setup is pretty straightforward and the results I have seen
are consistent between runs.  I have a system with 32 threads / 16
cores spread over 2 NUMA nodes.  I reduce the number of queues on a
10Gb/s NIC to 1.  I kill irqbalance and disable CPU power states.  I
then start a quick "for" loop that will schedule one UDP_STREAM
netperf session on each CPU using a small payload size like 32.

On a side note, if I move everything onto one node I can push about
2.4 Mpps and the busylock doesn't seem to really impact things, but if
I go to both nodes then I see the performance regression as described
above.  I was thinking about it and I don't think the MCS type locks
would have that much of an impact.  If anything I think that xmit_more
probably has a bigger effect given that it allows us to grab multiple
frames with each fetch and thereby reduce the lock contention on the
dequeue side.

>> Presumably it would tremendously help if the actual kfree_skb()
>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>> method.
>>
>
> This part is still valid.
>
> We could have a per cpu storage of one skb pointer, so that we do not
> have to change all ->enqueue() prototypes.

I fully agree with that.

One thought I had is if we could move to a lockless dequeue path for
the qdisc then we could also get rid of the busy lock.  I know there
has been talk about doing away with qdisc locks or changing the inner
mechanics of the qdisc itself in the past; I'm CCing Jesper and John
for that reason.  Maybe it is time for us to start looking into that
so we can start cleaning out some of the old cruft we have in this
path.

- Alex


Re: [RFC] net: remove busylock

2016-05-19 Thread Eric Dumazet
On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:

> Removing busylock helped in all cases I tested. (at least on x86 as
> David pointed out)
> 
> As I said, we need to revisit busylock now that spinlocks are different.
> 
> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
> 
> With busylock :
> 
> lpaa5:~# sar -n DEV 4 4|grep eth0
> Average: eth0 12.19 112797.12  1.95  37672.28  0.00  
> 0.00  0.69
> 


Hmpf, my sysctl logic was inverted. Really, these results made little
sense.

Sorry for the noise. At least we have 8% confirmed gain with this
stuff ;)

> Presumably it would tremendously help if the actual kfree_skb()
> was done after qdisc lock is released, ie not from the qdisc->enqueue()
> method.
> 

This part is still valid.

We could have a per-CPU storage of one skb pointer, so that we do not
have to change all ->enqueue() prototypes.
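
(A hypothetical sketch of that idea, with invented names and no claim to be
an actual patch: ->enqueue() parks the dropped skb in a per-cpu slot instead
of freeing it under the root lock, and the caller frees it after unlocking.
The enqueue path runs with the root lock held and BH disabled, so the slot
cannot migrate to another CPU before it is flushed.)

static DEFINE_PER_CPU(struct sk_buff *, deferred_drop_skb);

/* called from ->enqueue() in place of kfree_skb(skb) */
static void qdisc_defer_drop(struct sk_buff *skb)
{
	__this_cpu_write(deferred_drop_skb, skb);
}

/* called by __dev_xmit_skb() after spin_unlock(root_lock) */
static void qdisc_flush_deferred_drop(void)
{
	struct sk_buff *skb = __this_cpu_read(deferred_drop_skb);

	if (skb) {
		__this_cpu_write(deferred_drop_skb, NULL);
		kfree_skb(skb);
	}
}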





Re: [RFC] net: remove busylock

2016-05-19 Thread Eric Dumazet
On Thu, 2016-05-19 at 11:03 -0700, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet  wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> >
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput"))
> >
> > Now that kernel spinlocks are MCS, this busylock thing is no longer
> > relevant. It is slowing things down a bit.
> >
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host 
> > with 48 hyperthreads.
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:44 eth0 798951.25 798951.75  52276.22  52275.26  0.00 
> >  0.00  0.50
> > 10:05:48 eth0 798576.00 798572.75  52251.24  52250.39  0.00 
> >  0.00  0.75
> > 10:05:52 eth0 798746.00 798748.75  52262.89  52262.13  0.00 
> >  0.00  0.50
> > 10:05:56 eth0 798303.25 798291.50  52235.22  52233.10  0.00 
> >  0.00  0.50
> > Average: eth0 798644.12 798641.19  52256.39  52255.22  0.00 
> >  0.00  0.56
> >
> > Disabling busylock (by using a local sysctl)
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:14 eth0 864085.75 864091.50  56538.09  56537.46  0.00 
> >  0.00  0.50
> > 10:05:18 eth0 864734.75 864729.25  56580.35  56579.05  0.00 
> >  0.00  0.75
> > 10:05:22 eth0 864366.00 864361.50  56556.74  56555.00  0.00 
> >  0.00  0.50
> > 10:05:26 eth0 864246.50 864248.75  56549.19  56547.65  0.00 
> >  0.00  0.50
> > Average: eth0 864358.25 864357.75  56556.09  56554.79  0.00 
> >  0.00  0.56
> >
> > That would be an 8% increase.
> 
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.



> 
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system and the result was that the performance would drop off a cliff
> and you would drop almost all the packets for almost all the streams
> because the qdisc never had a chance to drain because it would be CPU
> - 1 enqueues, followed by 1 dequeue.
> 
> What we need if we are going to get rid of busy lock would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion because the qdiscs just overflow without the busy lock in
> place.


Removing busylock helped in all cases I tested. (at least on x86 as
David pointed out)

As I said, we need to revisit busylock now that spinlocks are different.

In one case (20 concurrent UDP netperf), I even got a 500 % increase.

With busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:33:34 eth0  9.00 115057.00  1.60  38426.92  0.00  
0.00  0.50
11:33:38 eth0 13.50 113237.75  2.04  37819.69  0.00  
0.00  0.75
11:33:42 eth0 13.50 111492.25  1.76  37236.58  0.00  
0.00  0.75
11:33:46 eth0 12.75 111401.50  2.40  37205.93  0.00  
0.00  0.75
Average: eth0 12.19 112797.12  1.95  37672.28  0.00  
0.00  0.69

Packets are dropped in HTB because we hit a limit of 1000 packets there:

- 100.00%  netperf  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 100.00% htb_enqueue
           __dev_queue_xmit
           dev_queue_xmit
           ip_finish_output2
           ip_finish_output
           ip_output
           ip_local_out
         + ip_send_skb


Presumably it would tremendously help if the actual kfree_skb()
was done after the qdisc lock is released, i.e. not from the qdisc->enqueue()
method.


Without busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:41:12 eth0 11.00 669053.50  1.99 223452.30  0.00  
0.00  0.75
11:41:16 eth0  8.50 669513.25  2.27 223605.55  0.00  
0.00  0.75
11:41:20 eth0  3.50 669426.50  0.90 223577.19  0.00  
0.00  0.50
11:41:24 eth0  8.25 669284.00  1.42 223529.79  0.00  
0.00  0.50
Average: eth0  7.81 669319.31  1.65 223541.21  0.00  
0.00  0.62





Re: [RFC] net: remove busylock

2016-05-19 Thread Eric Dumazet
On Thu, 2016-05-19 at 11:12 -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Thu, 19 May 2016 10:08:36 -0700
> 
> > busylock was added at the time we had expensive ticket spinlocks
> > 
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput"))
> >
> > Now that kernel spinlocks are MCS, this busylock thing is no longer
> > relevant. It is slowing things down a bit.
>  ...
> > That would be an 8% increase.
> 
> Presumably only for x86-64 and other platforms that actually are using
> the ticket spinlocks.
> 
> For others it would be a regression.

Ticket spinlocks are actually gone; these are now MCS.

But yes, point taken ;)





Re: [RFC] net: remove busylock

2016-05-19 Thread Rick Jones

On 05/19/2016 11:03 AM, Alexander Duyck wrote:

On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet  wrote:

With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 
48 hyperthreads.

...


That would be an 8% increase.


The main point of the busy lock is to deal with the bulk throughput
case, not the latency case which would be relatively well behaved.
The problem wasn't really related to lock bouncing slowing things
down.  It was the fairness between the threads that was killing us
because the dequeue needs to have priority.


Quibbledrift... While the origins of the netperf TCP_RR test center on 
measuring latency, I'm not sure I'd call 200 of them running 
concurrently a latency test.  Indeed it may be neither fish nor fowl, 
but it will certainly be exercising the basic packet send/receive path 
rather fully and is likely a reasonable proxy for aggregate small packet 
performance.


happy benchmarking,

rick jones


Re: [RFC] net: remove busylock

2016-05-19 Thread David Miller
From: Eric Dumazet 
Date: Thu, 19 May 2016 10:08:36 -0700

> busylock was added at the time we had expensive ticket spinlocks
> 
> (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> lock to qdisc to increase throughput"))
> 
> Now that kernel spinlocks are MCS, this busylock thing is no longer
> relevant. It is slowing things down a bit.
 ...
> That would be an 8% increase.

Presumably only for x86-64 and other platforms that actually are using
the ticket spinlocks.

For others it would be a regression.


Re: [RFC] net: remove busylock

2016-05-19 Thread Alexander Duyck
On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet  wrote:
> busylock was added at the time we had expensive ticket spinlocks
>
> (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> lock to qdisc to increase throughput"))
>
> Now that kernel spinlocks are MCS, this busylock thing is no longer
> relevant. It is slowing things down a bit.
>
>
> With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host 
> with 48 hyperthreads.
>
> lpaa5:~# sar -n DEV 4 4 |grep eth0
> 10:05:44 eth0 798951.25 798951.75  52276.22  52275.26  0.00  
> 0.00  0.50
> 10:05:48 eth0 798576.00 798572.75  52251.24  52250.39  0.00  
> 0.00  0.75
> 10:05:52 eth0 798746.00 798748.75  52262.89  52262.13  0.00  
> 0.00  0.50
> 10:05:56 eth0 798303.25 798291.50  52235.22  52233.10  0.00  
> 0.00  0.50
> Average: eth0 798644.12 798641.19  52256.39  52255.22  0.00  
> 0.00  0.56
>
> Disabling busylock (by using a local sysctl)
>
> lpaa5:~# sar -n DEV 4 4 |grep eth0
> 10:05:14 eth0 864085.75 864091.50  56538.09  56537.46  0.00  
> 0.00  0.50
> 10:05:18 eth0 864734.75 864729.25  56580.35  56579.05  0.00  
> 0.00  0.75
> 10:05:22 eth0 864366.00 864361.50  56556.74  56555.00  0.00  
> 0.00  0.50
> 10:05:26 eth0 864246.50 864248.75  56549.19  56547.65  0.00  
> 0.00  0.50
> Average: eth0 864358.25 864357.75  56556.09  56554.79  0.00  
> 0.00  0.56
>
> That would be an 8% increase.

The main point of the busy lock is to deal with the bulk throughput
case, not the latency case which would be relatively well behaved.
The problem wasn't really related to lock bouncing slowing things
down.  It was the fairness between the threads that was killing us
because the dequeue needs to have priority.

The main problem that the busy lock solved was the fact that you could
start a number of stream tests equal to the number of CPUs in a given
system and the result was that the performance would drop off a cliff
and you would drop almost all the packets for almost all the streams
because the qdisc never had a chance to drain because it would be CPU
- 1 enqueues, followed by 1 dequeue.

What we need if we are going to get rid of busy lock would be some
sort of priority locking mechanism that would allow the dequeue thread
to jump to the head of the line if it is attempting to take the lock.
Otherwise you end up spending all your time enqueuing packets into
oblivion because the qdiscs just overflow without the busy lock in
place.
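
(For context, the busylock pattern under discussion, paraphrased and
simplified from __dev_xmit_skb() in net/core/dev.c around that time; not an
exact copy. Contended enqueuers first serialize on q->busylock, so the CPU
that owns __QDISC___STATE_RUNNING can re-grab the root lock quickly and keep
dequeueing:)

	spinlock_t *root_lock = qdisc_lock(q);
	bool contended = qdisc_is_running(q);

	if (unlikely(contended))
		spin_lock(&q->busylock);	/* wait behind the other enqueuers */

	spin_lock(root_lock);
	/* ... q->enqueue(skb, q), possibly __qdisc_run(q) ... */
	spin_unlock(root_lock);

	if (unlikely(contended))
		spin_unlock(&q->busylock);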

- Alex