Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-02 Thread Paolo Abeni
On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 23:17:48 +0100
> Paolo Abeni  wrote:
> 
> > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > > small socket queues)
> > > 
> > > On Wed, 30 Nov 2016 16:35:20 +
> > > Mel Gorman  wrote:
> > >   
> > > > > I don't quite get why you are setting the socket recv size
> > > > > (with -- -s and -S) to such a small number, size + 256.
> > > > > 
> > > > 
> > > > Maybe I missed something at the time I wrote that but why would it
> > > > need to be larger?  
> > > 
> > > Well, to me it is quite obvious that we need some queue to avoid packet
> > > drops.  We have two processes, netperf and netserver, that are sending
> > > packets between each other (UDP_STREAM, mostly netperf -> netserver).
> > > These PIDs are getting scheduled and migrated between CPUs, and thus
> > > do not get executed equally fast, so a queue is needed to absorb the
> > > fluctuations.
> > > 
> > > The network stack is even partly catching your config "mistake" and
> > > increasing the socket queue size, so we can at minimum handle one max
> > > frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > > 
> > > For localhost testing a small queue should hopefully not
> > > result in packet drops.  Testing... oops, this does result in packet
> > > drops.
> > > 
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > > 
> > >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
> > >-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > > 
> > >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > >  Socket  Message  Elapsed      Messages
> > >  Size    Size     Time         Okay Errors   Throughput
> > >  bytes   bytes    secs            #      #   10^6bits/sec
> > > 
> > >    4608    1024   60.00     50024301      0    6829.98
> > >    2560           60.00     46133211           6298.72
> > > 
> > >  Dropped packets: 50024301-46133211=3891090
> > > 
> > > To get a better drop indication, I run a command during the test to get
> > > system-wide network counters from the last second, so the numbers below
> > > are per second.
> > > 
> > >  $ nstat > /dev/null && sleep 1  && nstat
> > >  #kernel
> > >  IpInReceives                    885162             0.0
> > >  IpInDelivers                    885161             0.0
> > >  IpOutRequests                   885162             0.0
> > >  UdpInDatagrams                  776105             0.0
> > >  UdpInErrors                     109056             0.0
> > >  UdpOutDatagrams                 885160             0.0
> > >  UdpRcvbufErrors                 109056             0.0
> > >  IpExtInOctets                   931190476          0.0
> > >  IpExtOutOctets                  931189564          0.0
> > >  IpExtInNoECTPkts                885162             0.0
> > > 
> > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. Note that
> > > UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > > process didn't empty its queue fast enough, see [2].
> > > 
> > > Although upstream changes are coming in this area, [2] is replaced with
> > > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > > 
> > > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > > To Paolo, you might want to look into this.  And it could also explain why
> > > I've not seen the mentioned speedup from the mm change, as I've been testing
> > > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.  
> > 
> > Thank you for reporting this.
> > 
> > It seems that commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict while checking the rcvbuf.
> > 
> > For very small values of rcvbuf, it allows only a single skb to be
> > enqueued, while previously we allowed 2 of them to enter the queue, even
> > if the first one's truesize exceeded rcvbuf, as in your test case.
> > 
> > Can you please try the following patch?
> 
> Sure, it looks much better with this patch.

Thank you for testing. I'll send a formal patch to David soon.

BTW I see a nice performance improvement compared to 4.7...

Cheers,

Paolo



Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-02 Thread Jesper Dangaard Brouer
On Thu, 01 Dec 2016 23:17:48 +0100
Paolo Abeni  wrote:

> On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > small socket queues)
> > 
> > On Wed, 30 Nov 2016 16:35:20 +
> > Mel Gorman  wrote:
> >   
> > > > I don't quite get why you are setting the socket recv size
> > > > (with -- -s and -S) to such a small number, size + 256.
> > > > 
> > > 
> > > Maybe I missed something at the time I wrote that but why would it
> > > need to be larger?  
> > 
> > Well, to me it is quite obvious that we need some queue to avoid packet
> > drops.  We have two processes, netperf and netserver, that are sending
> > packets between each other (UDP_STREAM, mostly netperf -> netserver).
> > These PIDs are getting scheduled and migrated between CPUs, and thus
> > do not get executed equally fast, so a queue is needed to absorb the
> > fluctuations.
> > 
> > The network stack is even partly catching your config "mistake" and
> > increasing the socket queue size, so we can at minimum handle one max
> > frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > 
> > For localhost testing a small queue should hopefully not
> > result in packet drops.  Testing... oops, this does result in packet
> > drops.
> > 
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> > 
> >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
> >-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > 
> >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> >  Socket  Message  Elapsed      Messages
> >  Size    Size     Time         Okay Errors   Throughput
> >  bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >    4608    1024   60.00     50024301      0    6829.98
> >    2560           60.00     46133211           6298.72
> > 
> >  Dropped packets: 50024301-46133211=3891090
> > 
> > To get a better drop indication, I run a command during the test to get
> > system-wide network counters from the last second, so the numbers below
> > are per second.
> > 
> >  $ nstat > /dev/null && sleep 1  && nstat
> >  #kernel
> >  IpInReceives                    885162             0.0
> >  IpInDelivers                    885161             0.0
> >  IpOutRequests                   885162             0.0
> >  UdpInDatagrams                  776105             0.0
> >  UdpInErrors                     109056             0.0
> >  UdpOutDatagrams                 885160             0.0
> >  UdpRcvbufErrors                 109056             0.0
> >  IpExtInOctets                   931190476          0.0
> >  IpExtOutOctets                  931189564          0.0
> >  IpExtInNoECTPkts                885162             0.0
> > 
> > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. Note that
> > UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > process didn't empty its queue fast enough, see [2].
> > 
> > Although upstream changes are coming in this area, [2] is replaced with
> > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > 
> > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > To Paolo, you might want to look into this.  And it could also explain why
> > I've not seen the mentioned speedup from the mm change, as I've been testing
> > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.  
> 
> Thank you for reporting this.
> 
> It seems that commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict while checking the rcvbuf.
> 
> For very small values of rcvbuf, it allows only a single skb to be
> enqueued, while previously we allowed 2 of them to enter the queue, even
> if the first one's truesize exceeded rcvbuf, as in your test case.
> 
> Can you please try the following patch?

Sure, it looks much better with this patch.


$ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf \
    -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to 
127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  4608    1024   60.00     50191555      0    6852.82
  2560           60.00     50189872           6852.59

Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec

$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    885417             0.0
IpInDelivers                    885416             0.0
IpOutRequests                   885417             0.0
UdpInDatagrams                  885382             0.0
UdpInErrors                     29                 0.0
UdpOutDatagrams

Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-01 Thread Paolo Abeni
On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
> 
> On Wed, 30 Nov 2016 16:35:20 +
> Mel Gorman  wrote:
> 
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >   
> > 
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
> 
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops.  We have two processes, netperf and netserver, that are sending
> packets between each other (UDP_STREAM, mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> do not get executed equally fast, so a queue is needed to absorb the
> fluctuations.
> 
> The network stack is even partly catching your config "mistake" and
> increasing the socket queue size, so we can at minimum handle one max
> frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> 
> For localhost testing a small queue should hopefully not
> result in packet drops.  Testing... oops, this does result in packet
> drops.
> 
> Test command extracted from mmtests, UDP_STREAM size 1024:
> 
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
>-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> 
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
>   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>    4608    1024   60.00     50024301      0    6829.98
>    2560           60.00     46133211           6298.72
> 
>  Dropped packets: 50024301-46133211=3891090
> 
> To get a better drop indication, I run a command during the test to get
> system-wide network counters from the last second, so the numbers below
> are per second.
> 
>  $ nstat > /dev/null && sleep 1  && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
> 
> So, 885Kpps but only 776Kpps delivered and 109Kpps drops. Note that
> UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> process didn't empty its queue fast enough, see [2].
> 
> Although upstream changes are coming in this area, [2] is replaced with
> __udp_enqueue_schedule_skb, which I actually tested with... hmm
> 
> Retesting with kernel 4.7.0-baseline+ ... shows something else.
> To Paolo, you might want to look into this.  And it could also explain why
> I've not seen the mentioned speedup from the mm change, as I've been testing
> this patch on top of net-next (at 93ba550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict while checking the rcvbuf.

For very small values of rcvbuf, it allows only a single skb to be
enqueued, while previously we allowed 2 of them to enter the queue, even
if the first one's truesize exceeded rcvbuf, as in your test case.

Can you please try the following patch?

Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
        struct sk_buff_head *list = &sk->sk_receive_queue;
        int rmem, delta, amt, err = -ENOMEM;
        int size = skb->truesize;
+       int limit;
 
        /* try to avoid the costly atomic add/sub pair when the receive
         * queue is full; always allow at least a packet
         */
        rmem = atomic_read(&sk->sk_rmem_alloc);
-       if (rmem && (rmem + size > sk->sk_rcvbuf))
+       limit = size + sk->sk_rcvbuf;
+       if (rmem > limit)
                goto drop;
 
        /* we drop only if the receive buf is full and the receive
         * queue contains some other skb
         */
        rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-       if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+       if (rmem > limit)
                goto uncharge_drop;
 
        spin_lock(&list->lock);
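
For reference, a stand-alone sketch (not kernel code) of the two admission
checks above, plugging in the 2560-byte receive buffer from Jesper's run and
an assumed skb truesize of ~4352 bytes (roughly PAGE_SIZE plus overhead, per
Jesper's estimate).  With those numbers the strict pre-patch condition admits
a single skb and the patched condition admits two, matching the description
above:

#include <stdio.h>

/* returns how many back-to-back skbs fit before the first drop */
static int count_queued(int rcvbuf, int truesize, int use_patched)
{
        int rmem = 0, queued = 0;

        for (;;) {
                int drop;

                if (use_patched)
                        drop = rmem > truesize + rcvbuf;
                else
                        drop = rmem && (rmem + truesize > rcvbuf);
                if (drop)
                        break;
                rmem += truesize;   /* charge the queued skb */
                queued++;
        }
        return queued;
}

int main(void)
{
        const int rcvbuf = 2560;    /* netperf -S 1280, doubled by the kernel */
        const int truesize = 4352;  /* assumption: ~PAGE_SIZE + overhead */

        printf("strict check (pre-patch): %d skb queued\n",
               count_queued(rcvbuf, truesize, 0));
        printf("patched check:            %d skb queued\n",
               count_queued(rcvbuf, truesize, 1));
        return 0;
}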







Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-01 Thread Jesper Dangaard Brouer
(Cc. netdev, we might have an issue with Paolo's UDP accounting and
small socket queues)

On Wed, 30 Nov 2016 16:35:20 +
Mel Gorman  wrote:

> > I don't quite get why you are setting the socket recv size
> > (with -- -s and -S) to such a small number, size + 256.
> >   
> 
> Maybe I missed something at the time I wrote that but why would it
> need to be larger?

Well, to me it is quite obvious that we need some queue to avoid packet
drops.  We have two processes, netperf and netserver, that are sending
packets between each other (UDP_STREAM, mostly netperf -> netserver).
These PIDs are getting scheduled and migrated between CPUs, and thus
do not get executed equally fast, so a queue is needed to absorb the
fluctuations.

The network stack is even partly catching your config "mistake" and
increasing the socket queue size, so we can at minimum handle one max
frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
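
As an aside, the 4608/2560 socket sizes shown in the netperf banners below
are what the 1280-byte request actually becomes: the kernel doubles requested
SO_RCVBUF/SO_SNDBUF values and enforces minimum floors (see socket(7)).
A minimal user-space sketch of this, assuming the behaviour of the 4.x
kernels used in this thread:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int req = 1280, rcv = 0, snd = 0;
        socklen_t len = sizeof(rcv);

        /* same socket buffer options netperf sets via -s/-S */
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &req, sizeof(req));

        getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
        len = sizeof(snd);
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &snd, &len);

        /* expected here: SO_RCVBUF=2560 (the request doubled) and
         * SO_SNDBUF=4608 (raised to the kernel's minimum send buffer) */
        printf("SO_RCVBUF=%d SO_SNDBUF=%d\n", rcv, snd);
        return 0;
}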

For localhost testing a small queue should hopefully not
result in packet drops.  Testing... oops, this does result in packet
drops.

Test command extracted from mmtests, UDP_STREAM size 1024:

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
  port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00     50024301      0    6829.98
   2560           60.00     46133211           6298.72

 Dropped packets: 50024301-46133211=3891090

To get a better drop indication, I run a command during the test to get
system-wide network counters from the last second, so the numbers below
are per second.

 $ nstat > /dev/null && sleep 1  && nstat
 #kernel
 IpInReceives                    885162             0.0
 IpInDelivers                    885161             0.0
 IpOutRequests                   885162             0.0
 UdpInDatagrams                  776105             0.0
 UdpInErrors                     109056             0.0
 UdpOutDatagrams                 885160             0.0
 UdpRcvbufErrors                 109056             0.0
 IpExtInOctets                   931190476          0.0
 IpExtOutOctets                  931189564          0.0
 IpExtInNoECTPkts                885162             0.0

So, 885Kpps but only 776Kpps delivered and 109Kpps drops. Note that
UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
happens kernel side in __udp_queue_rcv_skb[1], because the receiving
process didn't empty its queue fast enough, see [2].
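
To make the queueing argument concrete, a stand-alone toy simulation (all
numbers invented for illustration): a sender enqueues one packet per tick
while the receiver is periodically descheduled, and a one-packet queue turns
every scheduling hiccup into drops while a deeper queue absorbs them:

#include <stdio.h>

static long simulate(int capacity)
{
        long drops = 0;
        int queued = 0;

        for (long tick = 0; tick < 1000000; tick++) {
                /* sender: one packet per tick */
                if (queued < capacity)
                        queued++;
                else
                        drops++;

                /* receiver: descheduled for 4 of every 16 ticks, then
                 * drains up to 2 packets per tick to catch up */
                if (tick % 16 >= 4)
                        queued -= (queued >= 2) ? 2 : queued;
        }
        return drops;
}

int main(void)
{
        for (int cap = 1; cap <= 8; cap *= 2)
                printf("queue capacity %d -> %ld drops\n", cap, simulate(cap));
        return 0;
}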

Although upstream changes are coming in this area, [2] is replaced with
__udp_enqueue_schedule_skb, which I actually tested with... hmm

Retesting with kernel 4.7.0-baseline+ ... shows something else.
To Paolo, you might want to look into this.  And it could also explain why
I've not seen the mentioned speedup from the mm change, as I've been testing
this patch on top of net-next (at 93ba550) with Paolo's UDP changes.

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895
  AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00     47248301      0    6450.97
   2560           60.00     47245030           6450.52

Only dropped 47248301-47245030=3271

$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    810566             0.0
IpInDelivers                    810566             0.0
IpOutRequests                   810566             0.0
UdpInDatagrams                  810468             0.0
UdpInErrors                     99                 0.0
UdpOutDatagrams                 810566             0.0
UdpRcvbufErrors                 99                 0.0
IpExtInOctets                   852713328          0.0
IpExtOutOctets                  852713328          0.0
IpExtInNoECTPkts                810563             0.0

And nstat is also much better with only 99 drop/sec.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.8#L1454
[2] http://lxr.free-electrons.com/source/net/core/sock.c?v=4.8#L413


Extra: with net-next at 93ba550

If I use the netperf default socket queue, then there is not a single
packet drop:

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1  
   -- -m 1024 -M 1024 -P 15895

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) 
 port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket  Message  Elapsed  Messages  

Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Mel Gorman
On Wed, Nov 30, 2016 at 04:06:12PM +0100, Jesper Dangaard Brouer wrote:
> > > [...]  
> > > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > > selected on the basis that it is slab-intensive and has been the subject
> > > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > > testing between two physical hosts.  
> > > 
> > > I do like that you are using a networking test to benchmark this. Looking at
> > > the results, my initial response is that the improvements are basically
> > > too good to be true.
> > >   
> > 
> > FWIW, LKP independently measured the boost to be 23% so it's expected
> > there will be different results depending on exact configuration and CPU.
> 
> Yes, noticed that, nice (which was a SCTP test) 
>  https://lists.01.org/pipermail/lkp/2016-November/005210.html
> 
> It is of course great. It is just strange I cannot reproduce it on my
> high-end box, with manual testing. I'll try your test suite and try to
> figure out what is wrong with my setup.
> 

That would be great. I had seen the boost on multiple machines and LKP
verifying it is helpful. 

> 
> > > Can you share how you tested this with netperf and the specific netperf
> > > parameters?   
> > 
> > The mmtests config file used is
> > configs/config-global-dhp__network-netperf-unbound so all details can be
> > extrapolated or reproduced from that.
> 
> I didn't know of mmtests: https://github.com/gormanm/mmtests
> 
> It looks nice and quite comprehensive! :-)
> 

Thanks.

> > > e.g.
> > >  How do you configure the send/recv sizes?  
> > 
> > Static range of sizes specified in the config file.
> 
> I'll figure it out... reading your shell code :-)
> 
> export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
>  
> https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72
> 
> I see you are using netperf 2.4.5 and setting both the send and recv
> size (-- -m and -M) which is fine.
> 

Ok.

> I don't quite get why you are setting the socket recv size (with -- -s
> and -S) to such a small number, size + 256.
> 

Maybe I missed something at the time I wrote that but why would it need
to be larger?

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Jesper Dangaard Brouer
On Wed, 30 Nov 2016 14:06:15 +
Mel Gorman  wrote:

> On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Sun, 27 Nov 2016 13:19:54 + Mel Gorman  
> > wrote:
> > 
> > [...]  
> > > SLUB has been the default small kernel object allocator for quite some 
> > > time
> > > but it is not universally used due to performance concerns and a reliance
> > > on high-order pages. The high-order concerns have two major components --
> > > high-order pages are not always available and high-order page allocations
> > > potentially contend on the zone->lock. This patch addresses some concerns
> > > about the zone lock contention by extending the per-cpu page allocator to
> > > cache high-order pages. The patch makes the following modifications
> > > 
> > > o New per-cpu lists are added to cache the high-order pages. This 
> > > increases
> > >   the cache footprint of the per-cpu allocator and overall usage but for
> > >   some workloads, this will be offset by reduced contention on 
> > > zone->lock.  
> > 
> > This will also help performance of NIC drivers that allocate
> > higher-order pages for their RX-ring queue (and chop them up for MTU).
> > I do like this patch, even though I'm working on moving drivers away
> > from allocating these high-order pages.
> > 
> > Acked-by: Jesper Dangaard Brouer 
> >   
> 
> Thanks.
> 
> > [...]  
> > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > selected on the basis that it is slab-intensive and has been the subject
> > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > testing between two physical hosts.  
> > 
> > I do like that you are using a networking test to benchmark this. Looking at
> > the results, my initial response is that the improvements are basically
> > too good to be true.
> >   
> 
> FWIW, LKP independently measured the boost to be 23% so it's expected
> there will be different results depending on exact configuration and CPU.

Yes, noticed that, nice (which was a SCTP test) 
 https://lists.01.org/pipermail/lkp/2016-November/005210.html

It is of course great. It is just strange I cannot reproduce it on my
high-end box, with manual testing. I'll try your test suite and try to
figure out what is wrong with my setup.


> > Can you share how you tested this with netperf and the specific netperf
> > parameters?   
> 
> The mmtests config file used is
> configs/config-global-dhp__network-netperf-unbound so all details can be
> extrapolated or reproduced from that.

I didn't know of mmtests: https://github.com/gormanm/mmtests

It looks nice and quite comprehensive! :-)


> > e.g.
> >  How do you configure the send/recv sizes?  
> 
> Static range of sizes specified in the config file.

I'll figure it out... reading your shell code :-)

export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
 
https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72

I see you are using netperf 2.4.5 and setting both the send and recv
size (-- -m and -M) which is fine.

I don't quite get why you are setting the socket recv size (with -- -s
and -S) to such a small number, size + 256.

 SOCKETSIZE_OPT="-s $((SIZE+256)) -S $((SIZE+256))"

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 320 -S 320 -m 64 -M 64 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 384 -S 384 -m 128 -M 128 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
 
> >  Have you pinned netperf and netserver on different CPUs?
> >   
> 
> No. While it's possible to do a pinned test which helps stability, it
> also tends to be less reflective of what happens in a variety of
> workloads so I took the "harder" option.

Agree.
 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Michal Hocko
On Wed 30-11-16 14:16:13, Mel Gorman wrote:
> On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
[...]
> > But...  Unless I am missing something this effectively means that we do
> > not exercise high order atomic reserves. Shouldn't we fallback to
> > the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> > order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> > path which I am not seeing?
> > 
> 
> Good spot, would this be acceptable to you?

It's not a beauty queen but it works. A more elegant solution would
require more surgery I guess which is probably not worth it at this
stage.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91dc68c2a717..94808f565f74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>   int nr_pages = rmqueue_bulk(zone, order,
>   pcp->batch, list,
>   migratetype, cold);
> - pcp->count += (nr_pages << order);
> - if (unlikely(list_empty(list)))
> + if (unlikely(list_empty(list))) {
> + /*
> +  * Retry high-order atomic allocs
> +  * from the buddy list which may
> +  * use MIGRATE_HIGHATOMIC.
> +  */
> + if (order && (alloc_flags & ALLOC_HARDER))
> + goto try_buddylist;
> +
>   goto failed;
> + }
> + pcp->count += (nr_pages << order);
>   }
>  
>   if (cold)
> @@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  
>   } while (check_new_pcp(page));
>   } else {
> +try_buddylist:
>   /*
>* We most definitely don't want callers attempting to
>* allocate greater than order-1 page units with __GFP_NOFAIL.
> -- 
> Mel Gorman
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Mel Gorman
On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
> On Sun 27-11-16 13:19:54, Mel Gorman wrote:
> [...]
> > @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > struct page *page;
> > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >  
> > -   if (likely(order == 0)) {
> > +   if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > struct per_cpu_pages *pcp;
> > struct list_head *list;
> >  
> > local_irq_save(flags);
> > do {
> > +   unsigned int pindex;
> > +
> > +   pindex = order_to_pindex(migratetype, order);
> > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -   list = &pcp->lists[migratetype];
> > +   list = &pcp->lists[pindex];
> > if (list_empty(list)) {
> > -   pcp->count += rmqueue_bulk(zone, 0,
> > +   int nr_pages = rmqueue_bulk(zone, order,
> > pcp->batch, list,
> > migratetype, cold);
> > +   pcp->count += (nr_pages << order);
> > if (unlikely(list_empty(list)))
> > goto failed;
> 
> just a nit, we can reorder the check and the count update because nobody
> could have stolen pages allocated by rmqueue_bulk.

Ok, it's minor but I can do that.

> I would also consider
> nr_pages a bit misleading because we get a number of allocated elements.
> Nothing to lose sleep over...
> 

I didn't think of a clearer name because in this sort of context, I consider
a high-order page to be a single page.

> > }
> 
> But...  Unless I am missing something this effectively means that we do
> not exercise high order atomic reserves. Shouldn't we fallback to
> the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> path which I am not seeing?
> 

Good spot, would this be acceptable to you?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91dc68c2a717..94808f565f74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
int nr_pages = rmqueue_bulk(zone, order,
pcp->batch, list,
migratetype, cold);
-   pcp->count += (nr_pages << order);
-   if (unlikely(list_empty(list)))
+   if (unlikely(list_empty(list))) {
+   /*
+* Retry high-order atomic allocs
+* from the buddy list which may
+* use MIGRATE_HIGHATOMIC.
+*/
+   if (order && (alloc_flags & ALLOC_HARDER))
+   goto try_buddylist;
+
goto failed;
+   }
+   pcp->count += (nr_pages << order);
}
 
if (cold)
@@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 
} while (check_new_pcp(page));
} else {
+try_buddylist:
/*
 * We most definitely don't want callers attempting to
 * allocate greater than order-1 page units with __GFP_NOFAIL.
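
A toy model of the decision this hunk adds (not kernel code; every name
below is illustrative): the request is served from the per-cpu cache when
possible, and only a high-order ALLOC_HARDER request whose bulk refill came
back empty falls through to the locked buddy path, where the
MIGRATE_HIGHATOMIC reserve lives:

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_HARDER 0x01

static bool pcp_has_pages;  /* stand-in for "the bulk refill found pages" */

static bool alloc_from_pcp(unsigned int order)
{
        (void)order;
        return pcp_has_pages;
}

static bool alloc_from_buddy(unsigned int order)
{
        (void)order;
        return true;        /* may dip into the high-order atomic reserve */
}

static const char *toy_rmqueue(unsigned int order, unsigned int alloc_flags)
{
        if (alloc_from_pcp(order))
                return "served from per-cpu cache";
        if (order && (alloc_flags & ALLOC_HARDER))
                return alloc_from_buddy(order) ? "served from buddy list"
                                               : "failed";
        return "failed (takes the normal slow path)";
}

int main(void)
{
        pcp_has_pages = false;  /* simulate an empty per-cpu list */
        printf("order-3, atomic:  %s\n", toy_rmqueue(3, ALLOC_HARDER));
        printf("order-3, regular: %s\n", toy_rmqueue(3, 0));
        return 0;
}
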
-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Mel Gorman
On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
> 
> On Sun, 27 Nov 2016 13:19:54 + Mel Gorman wrote:
> 
> [...]
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
> > 
> > o New per-cpu lists are added to cache the high-order pages. This increases
> >   the cache footprint of the per-cpu allocator and overall usage but for
> >   some workloads, this will be offset by reduced contention on zone->lock.
> 
> This will also help performance of NIC driver that allocator
> higher-order pages for their RX-ring queue (and chop it up for MTU).
> I do like this patch, even-though I'm working on moving drivers away
> from allocation these high-order pages.
> 
> Acked-by: Jesper Dangaard Brouer 
> 

Thanks.

> [...]
> > This is the result from netperf running UDP_STREAM on localhost. It was
> > selected on the basis that it is slab-intensive and has been the subject
> > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > testing between two physical hosts.
> 
> I do like you are using a networking test to benchmark this. Looking at
> the results, my initial response is that the improvements are basically
> too good to be true.
> 

FWIW, LKP independently measured the boost to be 23% so it's expected
there will be different results depending on exact configuration and CPU.

> Can you share how you tested this with netperf and the specific netperf
> parameters? 

The mmtests config file used is
configs/config-global-dhp__network-netperf-unbound so all details can be
extrapolated or reproduced from that.

> e.g.
>  How do you configure the send/recv sizes?

Static range of sizes specified in the config file.

>  Have you pinned netperf and netserver on different CPUs?
> 

No. While it's possible to do a pinned test which helps stability, it
also tends to be less reflective of what happens in a variety of
workloads so I took the "harder" option.

> For localhost testing, when netperf and netserver run on the same CPU,
> you observer half the performance, very intuitively.  When pinning
> netperf and netserver (via e.g. option -T 1,2) you observe the most
> stable results.  When allowing netperf and netserver to migrate between
> CPUs (default setting), the real fun starts and unstable results,
> because now the CPU scheduler is also being tested, and my experience
> is also more "fun" memory situations occurs, as I guess we are hopping
> between more per CPU alloc caches (also affecting the SLUB per CPU usage
> pattern).
> 

Yes, which is another reason why I used an unbound configuration. I didn't
want to get an artificial boost from pinned server/client using the same
per-cpu caches. As a side-effect, it may mean that machines with fewer
CPUs get a greater boost as there are fewer per-cpu caches being used.

> > 2-socket modern machine
> > 4.9.0-rc5 4.9.0-rc5
> >   vanilla hopcpu-v3
> 
> The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains
> this single change right?

Yes.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Michal Hocko
On Sun 27-11-16 13:19:54, Mel Gorman wrote:
[...]
> @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  	struct page *page;
>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>  
> -	if (likely(order == 0)) {
> +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
>  
>  		local_irq_save(flags);
>  		do {
> +			unsigned int pindex;
> +
> +			pindex = order_to_pindex(migratetype, order);
>  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -			list = &pcp->lists[migratetype];
> +			list = &pcp->lists[pindex];
>  			if (list_empty(list)) {
> -				pcp->count += rmqueue_bulk(zone, 0,
> +				int nr_pages = rmqueue_bulk(zone, order,
>  						pcp->batch, list,
>  						migratetype, cold);
> +				pcp->count += (nr_pages << order);
>  				if (unlikely(list_empty(list)))
>  					goto failed;

just a nit, we can reorder the check and the count update because nobody
could have stolen pages allocated by rmqueue_bulk. I would also consider
nr_pages a bit misleading because we get a number of allocated elements.
Nothing to lose sleep over...

>   }

But...  Unless I am missing something this effectively means that we do
not exercise the high-order atomic reserves. Shouldn't we fall back to
the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
order > 0 && ALLOC_HARDER? Or is this just hidden in some other code
path which I am not seeing?

Other than that the patch looks reasonable to me. Keeping some portion
of !costly pages on pcp lists sounds useful from the fragmentation
point of view as well AFAICS, because they would normally be dissolved
for order-0 requests while right now we push more on the reclaim.

-- 
Michal Hocko
SUSE Labs
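
As an aside on the accounting discussed in the nit above: pcp->count tracks
base pages, so a bulk refill of high-order pages advances it by the number of
allocated pages shifted left by the order. A toy, userspace-only illustration
(the values are invented for the example, not taken from this thread):

#include <stdio.h>

int main(void)
{
	int count = 0;			/* stands in for pcp->count (base pages) */
	int nr_pages = 8;		/* hypothetical return value of a bulk refill */
	unsigned int order = 2;		/* each order-2 page covers 4 base pages */

	count += nr_pages << order;	/* 8 << 2 == 32 base pages cached */
	printf("count after refill: %d\n", count);
	return 0;
}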


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Jesper Dangaard Brouer

On Sun, 27 Nov 2016 13:19:54 + Mel Gorman wrote:

[...]
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.

This will also help the performance of NIC drivers that allocate
higher-order pages for their RX-ring queue (and chop them up for MTU).
I do like this patch, even though I'm working on moving drivers away
from allocating these high-order pages.

Acked-by: Jesper Dangaard Brouer 

[...]
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.

I do like that you are using a networking test to benchmark this. Looking at
the results, my initial response is that the improvements are basically
too good to be true.

Can you share how you tested this with netperf and the specific netperf
parameters? 
e.g.
 How do you configure the send/recv sizes?
 Have you pinned netperf and netserver on different CPUs?

For localhost testing, when netperf and netserver run on the same CPU,
you observe half the performance, quite intuitively.  When pinning
netperf and netserver (via e.g. option -T 1,2) you observe the most
stable results.  When allowing netperf and netserver to migrate between
CPUs (the default setting), the real fun starts and the results become
unstable, because now the CPU scheduler is also being tested, and in my
experience more "fun" memory situations occur, as I guess we are hopping
between more per-CPU alloc caches (also affecting the SLUB per-CPU usage
pattern).

> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
>   vanilla hopcpu-v3

The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains
this single change right?
Netdev/Paolo recently (in net-next) optimized the UDP code path
significantly, and I just want to make sure your results are not
affected by these changes.


> Hmean    send-64          178.38 (  0.00%)      256.74 ( 43.93%)
> Hmean    send-128         351.49 (  0.00%)      507.52 ( 44.39%)
> Hmean    send-256         671.23 (  0.00%)     1004.19 ( 49.60%)
> Hmean    send-1024       2663.60 (  0.00%)     3910.42 ( 46.81%)
> Hmean    send-2048       5126.53 (  0.00%)     7562.13 ( 47.51%)
> Hmean    send-3312       7949.99 (  0.00%)    11565.98 ( 45.48%)
> Hmean    send-4096       9433.56 (  0.00%)    12929.67 ( 37.06%)
> Hmean    send-8192      15940.64 (  0.00%)    21587.63 ( 35.43%)
> Hmean    send-16384     26699.54 (  0.00%)    32013.79 ( 19.90%)
> Hmean    recv-64          178.38 (  0.00%)      256.72 ( 43.92%)
> Hmean    recv-128         351.49 (  0.00%)      507.47 ( 44.38%)
> Hmean    recv-256         671.20 (  0.00%)     1003.95 ( 49.57%)
> Hmean    recv-1024       2663.45 (  0.00%)     3909.70 ( 46.79%)
> Hmean    recv-2048       5126.26 (  0.00%)     7560.67 ( 47.49%)
> Hmean    recv-3312       7949.50 (  0.00%)    11564.63 ( 45.48%)
> Hmean    recv-4096       9433.04 (  0.00%)    12927.48 ( 37.04%)
> Hmean    recv-8192      15939.64 (  0.00%)    21584.59 ( 35.41%)
> Hmean    recv-16384     26698.44 (  0.00%)    32009.77 ( 19.89%)
> 
> 1-socket 6 year old machine
>                              4.9.0-rc5             4.9.0-rc5
>                                vanilla             hopcpu-v3
> Hmean    send-64           87.47 (  0.00%)      127.14 ( 45.36%)
> Hmean    send-128         174.36 (  0.00%)      256.42 ( 47.06%)
> Hmean    send-256         347.52 (  0.00%)      509.41 ( 46.59%)
> Hmean    send-1024       1363.03 (  0.00%)     1991.54 ( 46.11%)
> Hmean    send-2048       2632.68 (  0.00%)     3759.51 ( 42.80%)
> Hmean    send-3312       4123.19 (  0.00%)     5873.28 ( 42.45%)
> Hmean    send-4096       5056.48 (  0.00%)     7072.81 ( 39.88%)
> Hmean    send-8192       8784.22 (  0.00%)    12143.92 ( 38.25%)
> Hmean    send-16384     15081.60 (  0.00%)    19812.71 ( 31.37%)
> Hmean    recv-64           86.19 (  0.00%)      126.59 ( 46.87%)
> Hmean    recv-128         173.93 (  0.00%)      255.21 ( 46.73%)
> Hmean    recv-256         346.19 (  0.00%)      506.72 ( 46.37%)
> Hmean    recv-1024       1358.28 (  0.00%)     1980.03 ( 45.77%)
> Hmean    recv-2048       2623.45 (  0.00%)     3729.35 ( 42.15%)

Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-30 Thread Mel Gorman
On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> > 1-socket 6 year old machine
> >                              4.9.0-rc5             4.9.0-rc5
> >                                vanilla             hopcpu-v3
> > Hmean    send-64           87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128         174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256         347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024       1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048       2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312       4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096       5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192       8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384     15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64           86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128         173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256         346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024       1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048       2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312       4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096       5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192       8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384     15042.36 (  0.00%)    19690.14 ( 30.90%)
> 
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
> 

FYI, the LKP test robot reported the following so there is some
independent basis for picking this up.

---8<---

FYI, we noticed a +23.0% improvement of netperf.Throughput_Mbps due to
commit:

commit 79404c5a5c66481aa55c0cae685e49e0f44a0479 ("mm: page_alloc: High-order per-cpu page allocator")
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pagealloc-highorder-percpu-v3r1


-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Vlastimil Babka
On 11/28/2016 07:54 PM, Christoph Lameter wrote:
> On Mon, 28 Nov 2016, Mel Gorman wrote:
> 
>> If you have a series aimed at parts of the fragmentation problem or how
>> subsystems can avoid tracking 4K pages in some important cases then by
>> all means post them.
> 
> I designed SLUB with defrag methods in mind. We could warm up some old
> patchsets that where never merged:
> 
> https://lkml.org/lkml/2010/1/29/332

Note that some other solutions to the dentry cache problem (perhaps of a
more low-hanging fruit kind) were also discussed at KS/LPC MM panel
session: https://lwn.net/Articles/705758/


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Johannes Weiner
On Sun, Nov 27, 2016 at 01:19:54PM +, Mel Gorman wrote:
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
> 
> Signed-off-by: Mel Gorman 

This seems like a net gain to me, and the patch looks good too.

Acked-by: Johannes Weiner 

> @@ -255,6 +255,24 @@ enum zone_watermarks {
>   NR_WMARK
>  };
>  
> +/*
> + * One per migratetype for order-0 pages and one per high-order up to
> + * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
> + * allocations to contaminate reclaimable pageblocks if high-order
> + * pages are heavily used.

I think that should be fine. Higher order allocations rely on being
able to compact movable blocks, not on reclaim freeing contiguous
blocks, so poisoning reclaimable blocks is much less of a concern than
poisoning movable blocks. And I'm not aware of any 0 < order < COSTLY
movable allocations that would put movable blocks into an HO cache.
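
To make the layout in the quoted comment concrete: it implies MIGRATE_PCPTYPES
order-0 lists (one per migratetype) plus one extra list per order up to and
including PAGE_ALLOC_COSTLY_ORDER. A compilable sketch of that sizing follows;
only the two constants and the count/high/batch/lists field names correspond
to existing kernel code, everything else here is a placeholder for illustration.

#include <stdio.h>

/* Stand-ins so the sketch builds outside the kernel tree. */
struct list_head { struct list_head *next, *prev; };
#define MIGRATE_PCPTYPES	3	/* order-0 migratetype lists */
#define PAGE_ALLOC_COSTLY_ORDER	3	/* highest cached order */

/* One list per migratetype for order-0, plus one list per high order. */
#define NR_PCP_LISTS	(MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

struct per_cpu_pages_sketch {
	int count;		/* base pages cached across all lists */
	int high;		/* high watermark, drain above this */
	int batch;		/* chunk size for buddy add/remove */
	struct list_head lists[NR_PCP_LISTS];
};

int main(void)
{
	printf("pcp lists per CPU: %d\n", NR_PCP_LISTS);	/* 3 + 3 = 6 */
	printf("sketch size: %zu bytes\n", sizeof(struct per_cpu_pages_sketch));
	return 0;
}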


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Christoph Lameter
On Mon, 28 Nov 2016, Mel Gorman wrote:

> If you have a series aimed at parts of the fragmentation problem or how
> subsystems can avoid tracking 4K pages in some important cases then by
> all means post them.

I designed SLUB with defrag methods in mind. We could warm up some old
patchsets that were never merged:

https://lkml.org/lkml/2010/1/29/332



Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Mel Gorman
On Mon, Nov 28, 2016 at 10:38:58AM -0600, Christoph Lameter wrote:
> > > that only insiders know how to tune and an overall fragile solution.
> > While I agree with all of this, it's also a problem independent of this
> > patch.
> 
> It is related. The fundamental issue with fragmentation remain and IMHO we
> really need to tackle this.
> 

Fragmentation is one issue. Allocation scalability is a separate issue.
This patch is about scaling parallel allocations of small contiguous
ranges. Even if there were fragmentation-related patches up for discussion,
they would not be directly affected by this patch.

If you have a series aimed at parts of the fragmentation problem or how
subsystems can avoid tracking 4K pages in some important cases then by
all means post them.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Christoph Lameter
On Mon, 28 Nov 2016, Mel Gorman wrote:

> Yes, that's a problem for SLUB with or without this patch. It's always
> been the case that SLUB relying on high-order pages for performance is
> problematic.

This is a general issue in the kernel. Performance often requires larger
contiguous ranges of memory.


> > that only insiders know how to tune and an overall fragile solution.
> While I agree with all of this, it's also a problem independent of this
> patch.

It is related. The fundamental issue with fragmentation remains and IMHO we
really need to tackle this.



Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Mel Gorman
On Mon, Nov 28, 2016 at 09:39:19AM -0600, Christoph Lameter wrote:
> On Sun, 27 Nov 2016, Mel Gorman wrote:
> 
> >
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
> 
> Note that SLUB will only use high order pages when available and fall back
> to order 0 if memory is fragmented. This means that the effect of this
> patch is going to gradually vanish as memory becomes more and more
> fragmented.
> 

Yes, that's a problem for SLUB with or without this patch. It's always
been the case that SLUB relying on high-order pages for performance is
problematic.

> I think this patch is beneficial but we need to address long term the
> issue of memory fragmentation. That is not only a SLUB issue but an
> overall problem since we keep on having to maintain lists of 4k memory
> blocks in variuos subsystems. And as memory increases these lists are
> becoming larger and larger and more difficult to manage. Code complexity
> increases and fragility too (look at transparent hugepages). Ultimately we
> will need a clean way to manage the allocation and freeing of large
> physically contiguous pages. Reserving memory at booting (CMA, giant
> pages) is some sort of solution but this all devolves into lots of knobs
> that only insiders know how to tune and an overall fragile solution.
> 

While I agree with all of this, it's also a problem independent of this
patch.


-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Christoph Lameter
On Sun, 27 Nov 2016, Mel Gorman wrote:

>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications

Note that SLUB will only use high order pages when available and fall back
to order 0 if memory is fragmented. This means that the effect of this
patch is going to gradually vanish as memory becomes more and more
fragmented.

I think this patch is beneficial but we need to address the issue of
memory fragmentation in the long term. That is not only a SLUB issue but an
overall problem, since we keep having to maintain lists of 4k memory
blocks in various subsystems. And as memory grows, these lists become
larger and larger and more difficult to manage. Code complexity
increases, and fragility too (look at transparent hugepages). Ultimately we
will need a clean way to manage the allocation and freeing of large
physically contiguous pages. Reserving memory at boot (CMA, giant
pages) is some sort of solution, but it all devolves into lots of knobs
that only insiders know how to tune and an overall fragile solution.



Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Mel Gorman
On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> On 11/27/2016 02:19 PM, Mel Gorman wrote:
> > 
> > 2-socket modern machine
> >                              4.9.0-rc5             4.9.0-rc5
> >                                vanilla             hopcpu-v3
> > Hmean    send-64          178.38 (  0.00%)      256.74 ( 43.93%)
> > Hmean    send-128         351.49 (  0.00%)      507.52 ( 44.39%)
> > Hmean    send-256         671.23 (  0.00%)     1004.19 ( 49.60%)
> > Hmean    send-1024       2663.60 (  0.00%)     3910.42 ( 46.81%)
> > Hmean    send-2048       5126.53 (  0.00%)     7562.13 ( 47.51%)
> > Hmean    send-3312       7949.99 (  0.00%)    11565.98 ( 45.48%)
> > Hmean    send-4096       9433.56 (  0.00%)    12929.67 ( 37.06%)
> > Hmean    send-8192      15940.64 (  0.00%)    21587.63 ( 35.43%)
> > Hmean    send-16384     26699.54 (  0.00%)    32013.79 ( 19.90%)
> > Hmean    recv-64          178.38 (  0.00%)      256.72 ( 43.92%)
> > Hmean    recv-128         351.49 (  0.00%)      507.47 ( 44.38%)
> > Hmean    recv-256         671.20 (  0.00%)     1003.95 ( 49.57%)
> > Hmean    recv-1024       2663.45 (  0.00%)     3909.70 ( 46.79%)
> > Hmean    recv-2048       5126.26 (  0.00%)     7560.67 ( 47.49%)
> > Hmean    recv-3312       7949.50 (  0.00%)    11564.63 ( 45.48%)
> > Hmean    recv-4096       9433.04 (  0.00%)    12927.48 ( 37.04%)
> > Hmean    recv-8192      15939.64 (  0.00%)    21584.59 ( 35.41%)
> > Hmean    recv-16384     26698.44 (  0.00%)    32009.77 ( 19.89%)
> > 
> > 1-socket 6 year old machine
> >                              4.9.0-rc5             4.9.0-rc5
> >                                vanilla             hopcpu-v3
> > Hmean    send-64           87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128         174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256         347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024       1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048       2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312       4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096       5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192       8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384     15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64           86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128         173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256         346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024       1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048       2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312       4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096       5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192       8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384     15042.36 (  0.00%)    19690.14 ( 30.90%)
> 
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
> 

The array is sized correctly, which avoids one useless check. The order-0
lists are always drained first, so in some rare cases only the fast
paths are used. There was a subtle correction in detecting when all of
one list should be drained. In combination, it happened to boost
performance a lot on the two machines I reported on. While 6 other
machines were tested, not all of them saw such a dramatic boost, and if
these machines are rebooted and retested every time, the high
performance is not always consistent; it all depends on how often the
fast paths are used.

> > Signed-off-by: Mel Gorman 
> 
> Acked-by: Vlastimil Babka 
> 

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-28 Thread Vlastimil Babka

On 11/27/2016 02:19 PM, Mel Gorman wrote:


2-socket modern machine
                             4.9.0-rc5             4.9.0-rc5
                               vanilla             hopcpu-v3
Hmean    send-64          178.38 (  0.00%)      256.74 ( 43.93%)
Hmean    send-128         351.49 (  0.00%)      507.52 ( 44.39%)
Hmean    send-256         671.23 (  0.00%)     1004.19 ( 49.60%)
Hmean    send-1024       2663.60 (  0.00%)     3910.42 ( 46.81%)
Hmean    send-2048       5126.53 (  0.00%)     7562.13 ( 47.51%)
Hmean    send-3312       7949.99 (  0.00%)    11565.98 ( 45.48%)
Hmean    send-4096       9433.56 (  0.00%)    12929.67 ( 37.06%)
Hmean    send-8192      15940.64 (  0.00%)    21587.63 ( 35.43%)
Hmean    send-16384     26699.54 (  0.00%)    32013.79 ( 19.90%)
Hmean    recv-64          178.38 (  0.00%)      256.72 ( 43.92%)
Hmean    recv-128         351.49 (  0.00%)      507.47 ( 44.38%)
Hmean    recv-256         671.20 (  0.00%)     1003.95 ( 49.57%)
Hmean    recv-1024       2663.45 (  0.00%)     3909.70 ( 46.79%)
Hmean    recv-2048       5126.26 (  0.00%)     7560.67 ( 47.49%)
Hmean    recv-3312       7949.50 (  0.00%)    11564.63 ( 45.48%)
Hmean    recv-4096       9433.04 (  0.00%)    12927.48 ( 37.04%)
Hmean    recv-8192      15939.64 (  0.00%)    21584.59 ( 35.41%)
Hmean    recv-16384     26698.44 (  0.00%)    32009.77 ( 19.89%)

1-socket 6 year old machine
                             4.9.0-rc5             4.9.0-rc5
                               vanilla             hopcpu-v3
Hmean    send-64           87.47 (  0.00%)      127.14 ( 45.36%)
Hmean    send-128         174.36 (  0.00%)      256.42 ( 47.06%)
Hmean    send-256         347.52 (  0.00%)      509.41 ( 46.59%)
Hmean    send-1024       1363.03 (  0.00%)     1991.54 ( 46.11%)
Hmean    send-2048       2632.68 (  0.00%)     3759.51 ( 42.80%)
Hmean    send-3312       4123.19 (  0.00%)     5873.28 ( 42.45%)
Hmean    send-4096       5056.48 (  0.00%)     7072.81 ( 39.88%)
Hmean    send-8192       8784.22 (  0.00%)    12143.92 ( 38.25%)
Hmean    send-16384     15081.60 (  0.00%)    19812.71 ( 31.37%)
Hmean    recv-64           86.19 (  0.00%)      126.59 ( 46.87%)
Hmean    recv-128         173.93 (  0.00%)      255.21 ( 46.73%)
Hmean    recv-256         346.19 (  0.00%)      506.72 ( 46.37%)
Hmean    recv-1024       1358.28 (  0.00%)     1980.03 ( 45.77%)
Hmean    recv-2048       2623.45 (  0.00%)     3729.35 ( 42.15%)
Hmean    recv-3312       4108.63 (  0.00%)     5831.47 ( 41.93%)
Hmean    recv-4096       5037.25 (  0.00%)     7021.59 ( 39.39%)
Hmean    recv-8192       8762.32 (  0.00%)    12072.44 ( 37.78%)
Hmean    recv-16384     15042.36 (  0.00%)    19690.14 ( 30.90%)


That looks much better than the "v1" RFC posting. Was it just
because you stopped doing the "at first iteration, use migratetype as
index", and initializing pindex to UINT_MAX hits so much quicker, or was
there something more subtle that I missed? There was no changelog
between "v1" and "v2".




Signed-off-by: Mel Gorman 


Acked-by: Vlastimil Babka 




[PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-11-27 Thread Mel Gorman
Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concerns have two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some concerns
about the zone lock contention by extending the per-cpu page allocator to
cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage but for
  some workloads, this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
  remaining entries are high-order caches up to and including
  PAGE_ALLOC_COSTLY_ORDER (a sketch of the implied index mapping follows
  this list).

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed.
  Due to the high-order caches, the number of pages drained for a request
  is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
  that a single refill causes a drain on the next free.
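
As referenced in the first bullet above, a minimal sketch of the index mapping
that layout implies: order-0 requests index by migratetype, while each higher
order up to PAGE_ALLOC_COSTLY_ORDER gets one shared list. The function and
constants below are illustrative stand-ins, not the patch's actual
order_to_pindex() implementation.

#include <stdio.h>

/* Stand-ins for the kernel constants (both are 3 in the kernels discussed). */
#define MIGRATE_PCPTYPES	3
#define PAGE_ALLOC_COSTLY_ORDER	3

static unsigned int order_to_pindex_sketch(int migratetype, unsigned int order)
{
	if (order == 0)
		return migratetype;		/* slots 0 .. MIGRATE_PCPTYPES - 1 */

	return MIGRATE_PCPTYPES + order - 1;	/* one slot per order 1 .. COSTLY */
}

int main(void)
{
	printf("order 0, migratetype 1 -> pindex %u\n", order_to_pindex_sketch(1, 0));
	printf("order 2, any migratetype -> pindex %u\n", order_to_pindex_sketch(0, 2));
	printf("order %d, any migratetype -> pindex %u\n",
	       PAGE_ALLOC_COSTLY_ORDER, order_to_pindex_sketch(0, PAGE_ALLOC_COSTLY_ORDER));
	return 0;
}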

The benefit depends on both the workload and the machine as ultimately the
determining factor is whether cache line bounces on zone->lock or contention
is a problem. The patch was tested on a variety of workloads and machines,
some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
                             4.9.0-rc5             4.9.0-rc5
                               vanilla             hopcpu-v3
Hmean    send-64          178.38 (  0.00%)      256.74 ( 43.93%)
Hmean    send-128         351.49 (  0.00%)      507.52 ( 44.39%)
Hmean    send-256         671.23 (  0.00%)     1004.19 ( 49.60%)
Hmean    send-1024       2663.60 (  0.00%)     3910.42 ( 46.81%)
Hmean    send-2048       5126.53 (  0.00%)     7562.13 ( 47.51%)
Hmean    send-3312       7949.99 (  0.00%)    11565.98 ( 45.48%)
Hmean    send-4096       9433.56 (  0.00%)    12929.67 ( 37.06%)
Hmean    send-8192      15940.64 (  0.00%)    21587.63 ( 35.43%)
Hmean    send-16384     26699.54 (  0.00%)    32013.79 ( 19.90%)
Hmean    recv-64          178.38 (  0.00%)      256.72 ( 43.92%)
Hmean    recv-128         351.49 (  0.00%)      507.47 ( 44.38%)
Hmean    recv-256         671.20 (  0.00%)     1003.95 ( 49.57%)
Hmean    recv-1024       2663.45 (  0.00%)     3909.70 ( 46.79%)
Hmean    recv-2048       5126.26 (  0.00%)     7560.67 ( 47.49%)
Hmean    recv-3312       7949.50 (  0.00%)    11564.63 ( 45.48%)
Hmean    recv-4096       9433.04 (  0.00%)    12927.48 ( 37.04%)
Hmean    recv-8192      15939.64 (  0.00%)    21584.59 ( 35.41%)
Hmean    recv-16384     26698.44 (  0.00%)    32009.77 ( 19.89%)

1-socket 6 year old machine
                             4.9.0-rc5             4.9.0-rc5
                               vanilla             hopcpu-v3
Hmean    send-64           87.47 (  0.00%)      127.14 ( 45.36%)
Hmean    send-128         174.36 (  0.00%)      256.42 ( 47.06%)
Hmean    send-256         347.52 (  0.00%)      509.41 ( 46.59%)
Hmean    send-1024       1363.03 (  0.00%)     1991.54 ( 46.11%)
Hmean    send-2048       2632.68 (  0.00%)     3759.51 ( 42.80%)
Hmean    send-3312       4123.19 (  0.00%)     5873.28 ( 42.45%)
Hmean    send-4096       5056.48 (  0.00%)     7072.81 ( 39.88%)
Hmean    send-8192       8784.22 (  0.00%)    12143.92 ( 38.25%)
Hmean    send-16384     15081.60 (  0.00%)    19812.71 ( 31.37%)
Hmean    recv-64           86.19 (  0.00%)      126.59 ( 46.87%)
Hmean    recv-128         173.93 (  0.00%)      255.21 ( 46.73%)
Hmean    recv-256         346.19 (  0.00%)      506.72 ( 46.37%)
Hmean    recv-1024       1358.28 (  0.00%)     1980.03 ( 45.77%)
Hmean    recv-2048       2623.45 (  0.00%)     3729.35 ( 42.15%)
Hmean    recv-3312       4108.63 (  0.00%)     5831.47 ( 41.93%)
Hmean    recv-4096       5037.25 (  0.00%)     7021.59 ( 39.39%)
Hmean    recv-8192       8762.32 (  0.00%)    12072.44 ( 37.78%)
Hmean    recv-16384     15042.36 (  0.00%)    19690.14 ( 30.90%)

This is somewhat dramatic but it's also not universal. For example, it was
observed on an older HP machine using pcc-cpufreq that there was almost
no difference but pcc-cpufreq is also a known performance hazard.

These are quite different results but illustrate that the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
 4.9.0-rc5 4.9.0-rc5
   
