Re: Testing memory performance

2018-11-21 Thread Sad Clouds
On Wed, Nov 21, 2018 at 4:18 AM Eric Hawicz  wrote:

> That still sounds to me like the test is a bit off.  If you've already
> recorded the start time of each thread, then the time that the threads
> are blocked from running would be included in the per-thread rate, thus
> causing it to appear much slower.
>
>
No, because start/end times are taken around specific operations like
pre-faulting or memcpy. They don't tell you what a thread is doing in
relation to the other threads: a thread can be blocked for some time,
then get scheduled and only then take its start time. How would that
latency be accounted for in the per-thread rate if it occurred before the
start time was taken?

Think of a simple example: let's say the memory bus has a maximum
bandwidth of 10 GiB/sec and you have two threads, A and B, each doing a
memcpy of 10 GiB.

Scenario 1 - both threads run in parallel and share the memory bus bandwidth:
--> time in seconds
AA  thread A runs for 2 seconds and does memcpy at 5 GiB/sec
BB  thread B runs for 2 seconds and does memcpy at 5 GiB/sec
Aggregate throughput = (2 threads * 10 GiB) / 2 seconds = 10 GiB/sec

Scenario 2 - each thread runs in sequence and uses the full memory bus bandwidth:
--> time in seconds
A   thread A runs for 1 second and does memcpy at 10 GiB/sec
 L  lock contention causes a latency of 1 second
  B thread B runs for 1 second and does memcpy at 10 GiB/sec
Aggregate throughput = (2 threads * 10 GiB) / 3 seconds = 6.7 GiB/sec
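
To make the two calculations concrete, here is a minimal sketch in C of
how a per-thread rate and the aggregate rate could be computed. This is
illustrative only, not the actual sv_mem code; now_sec(), the thread
count and the buffer size are all assumptions:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 2
#define NBYTES   (256UL * 1024 * 1024)   /* 256 MiB per thread, for illustration */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

struct work { char *src, *dst; double mib_per_sec; };

static void *worker(void *arg)
{
    struct work *w = arg;
    /* The per-thread clock starts here, so any time the thread spent
       blocked BEFORE this point (e.g. waiting while pages were being
       allocated) is invisible to the per-thread rate. */
    double t0 = now_sec();
    memcpy(w->dst, w->src, NBYTES);
    double t1 = now_sec();
    w->mib_per_sec = (NBYTES / (1024.0 * 1024.0)) / (t1 - t0);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct work w[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        w[i].src = malloc(NBYTES);
        w[i].dst = malloc(NBYTES);
    }

    double wall0 = now_sec();            /* aggregate clock covers the whole run */
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &w[i]);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    double wall1 = now_sec();

    for (i = 0; i < NTHREADS; i++)
        printf("thread %d: %.2f MiB/sec\n", i + 1, w[i].mib_per_sec);

    /* The aggregate rate divides total bytes by total wall-clock time,
       so it DOES include any gaps where a thread was not yet running. */
    printf("aggregate: %.2f MiB/sec\n",
           NTHREADS * (NBYTES / (1024.0 * 1024.0)) / (wall1 - wall0));
    return 0;
}

If the threads end up running one after another, each per-thread rate
stays high while the aggregate rate drops, which is exactly the effect in
scenario 2.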


Re: Testing memory performance

2018-11-20 Thread Michael van Elst
On Tue, Nov 20, 2018 at 11:44:47AM -0500, Greg Troxel wrote:
> I thought we were using a pool allocator that had per-cpu freelists,
> derived from Solaris and
>   https://www.usenix.org/legacy/event/usenix01/bonwick.html

We are talking about a lower-level free list. Even if you could reuse the pool
allocator code at that level, it wouldn't be sufficient.

But yes, the methods used by the pool allocator need to be applied here too.


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Testing memory performance

2018-11-20 Thread Greg Troxel
Michael van Elst  writes:

> On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>> 
>> Michael van Elst  writes:
>> > There is a global lock for the page freelist.
>> 
>> I wonder if using a pool-type structure would be feasible.  That might
>> fix almost all of the slowness.
>
> You need a per-cpu freelist and some mechanism to steal from other
> freelists. Ideally that also includes something to optimize for NUMA.

I thought we were using a pool allocator that had per-cpu freelists,
derived from Solaris and

  https://www.usenix.org/legacy/event/usenix01/bonwick.html

but maybe I am off on that.


Re: Testing memory performance

2018-11-20 Thread Michael van Elst
On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
> 
> Michael van Elst  writes:
> > There is a global lock for the page freelist.
> 
> I wonder if using a pool-type structure would be feasible.  That might
> fix almost all of the slowness.

You need a per-cpu freelist and some mechanism to steal from other
freelists. Ideally that also includes something to optimize for NUMA.
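
Purely as an illustration of the idea (user-level C, not the actual
NetBSD uvm code; the stealing policy and NCPU are arbitrary):

#include <pthread.h>
#include <stddef.h>

#define NCPU 4

struct page { struct page *next; };

/* One freelist (and one lock) per CPU instead of a single global lock. */
struct cpu_freelist {
    pthread_mutex_t lock;
    struct page *head;
} freelist[NCPU] = {
    { PTHREAD_MUTEX_INITIALIZER, NULL },
    { PTHREAD_MUTEX_INITIALIZER, NULL },
    { PTHREAD_MUTEX_INITIALIZER, NULL },
    { PTHREAD_MUTEX_INITIALIZER, NULL },
};

static struct page *take(struct cpu_freelist *fl)
{
    struct page *pg = NULL;
    pthread_mutex_lock(&fl->lock);
    if (fl->head != NULL) {
        pg = fl->head;
        fl->head = pg->next;
    }
    pthread_mutex_unlock(&fl->lock);
    return pg;
}

/* Allocate from the local CPU's list; only touch other CPUs' lists
   (and their locks) when the local list is empty. */
struct page *page_alloc(int mycpu)
{
    struct page *pg = take(&freelist[mycpu]);
    for (int i = 0; pg == NULL && i < NCPU; i++)
        if (i != mycpu)
            pg = take(&freelist[i]);     /* steal from a remote list */
    return pg;
}

A NUMA-aware version would prefer stealing from freelists that belong to
the same node before going further away.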

Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Testing memory performance

2018-11-20 Thread Greg Troxel


Michael van Elst  writes:

>> Maybe there is a global lock in the NetBSD VM subsystem that slows things
>> down with a higher number of threads.
>
> There is a global lock for the page freelist.

I wonder if using a pool-type structure would be feasible.  That might
fix almost all of the slowness.



Re: Testing memory performance

2018-11-20 Thread Sad Clouds
On Tue, 20 Nov 2018 00:27:22 +0100
Michael van Elst  wrote:

> There is a global lock for the page freelist.

OK, I've changed my bench tool to synchronize all threads before each
stage, so threads now wait for all other threads to finish pre-faulting
pages before they all start memcpy at the same time. This makes it
clearer where the time is lost.
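
The synchronization itself is just a barrier between stages; a rough
sketch of the idea (using pthread_barrier_wait, illustrative rather than
the actual sv_mem source):

#include <pthread.h>
#include <string.h>

/* Shared barrier, initialised once in main() with the thread count:
   pthread_barrier_init(&stage_barrier, NULL, nthreads); */
static pthread_barrier_t stage_barrier;

static void run_stages(char *dst, const char *src, size_t len)
{
    size_t off;

    /* Stage 1: pre-fault - touch one byte in every 4 KiB page. */
    for (off = 0; off < len; off += 4096)
        dst[off] = 0;

    /* Wait until every thread has finished pre-faulting, so the memcpy
       timings are not polluted by other threads' page faults. */
    pthread_barrier_wait(&stage_barrier);

    /* Stage 2: memcpy - all threads now start at roughly the same time. */
    memcpy(dst, src, len);
}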

I did some more tests on Solaris, Linux and NetBSD. It looks like NetBSD's
memcpy is actually a bit faster than Linux's, but NetBSD is quite slow at
servicing page faults: the latency when pre-faulting those pages is about
18 times longer on NetBSD, which results in a longer overall execution
time.

Anyway, this has been an interesting exercise.



Solaris 11.3, x1 UltraSPARC-T2 1415 MHz, 8 cores per CPU, 8 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
  T 16 mlock 0.00 msec,  preflt 1880.88 msec,  memcpy 1521.74 msec (672.91 MiB/sec)
  T 14 mlock 0.00 msec,  preflt 1896.63 msec,  memcpy 1522.38 msec (672.63 MiB/sec)
  T 10 mlock 0.00 msec,  preflt 1872.01 msec,  memcpy 1522.73 msec (672.48 MiB/sec)
  T 2  mlock 0.00 msec,  preflt 1889.55 msec,  memcpy 1522.43 msec (672.61 MiB/sec)
  T 8  mlock 0.00 msec,  preflt 1862.79 msec,  memcpy 1523.32 msec (672.22 MiB/sec)
  T 6  mlock 0.00 msec,  preflt 1875.76 msec,  memcpy 1523.68 msec (672.06 MiB/sec)
  T 5  mlock 0.00 msec,  preflt 1869.91 msec,  memcpy 1524.26 msec (671.80 MiB/sec)
  T 12 mlock 0.00 msec,  preflt 1880.11 msec,  memcpy 1525.13 msec (671.42 MiB/sec)
  T 4  mlock 0.00 msec,  preflt 1884.96 msec,  memcpy 1525.37 msec (671.31 MiB/sec)
  T 1  mlock 0.00 msec,  preflt 1885.92 msec,  memcpy 1525.54 msec (671.24 MiB/sec)
  T 9  mlock 0.00 msec,  preflt 1875.25 msec,  memcpy 1526.15 msec (670.97 MiB/sec)
  T 13 mlock 0.00 msec,  preflt 1869.48 msec,  memcpy 1526.74 msec (670.71 MiB/sec)
  T 15 mlock 0.00 msec,  preflt 1869.14 msec,  memcpy 1527.30 msec (670.46 MiB/sec)
  T 7  mlock 0.00 msec,  preflt 1889.29 msec,  memcpy 1527.45 msec (670.40 MiB/sec)
  T 3  mlock 0.00 msec,  preflt 1880.53 msec,  memcpy 1529.22 msec (669.62 MiB/sec)
  T 11 mlock 0.00 msec,  preflt 1876.53 msec,  memcpy 1530.20 msec (669.19 MiB/sec)

Aggregate metrics, 16 threads, 16384.00 MiB:
  mlock  0.00 msec
  preflt 1897.69 msec
  memcpy 1530.59 msec (10704.36 MiB/sec)




Linux 4.9.0, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
  T 5  mlock 0.00 msec,  preflt 1192.80 msec,  memcpy 1141.42 msec (897.13 MiB/sec)
  T 7  mlock 0.00 msec,  preflt 1211.61 msec,  memcpy 1144.62 msec (894.62 MiB/sec)
  T 16 mlock 0.00 msec,  preflt 1211.59 msec,  memcpy 1145.37 msec (894.04 MiB/sec)
  T 3  mlock 0.00 msec,  preflt 1207.33 msec,  memcpy 1146.42 msec (893.21 MiB/sec)
  T 2  mlock 0.00 msec,  preflt 1211.02 msec,  memcpy 1146.36 msec (893.26 MiB/sec)
  T 1  mlock 0.00 msec,  preflt 1210.36 msec,  memcpy 1146.57 msec (893.10 MiB/sec)
  T 13 mlock 0.00 msec,  preflt 1208.53 msec,  memcpy 1146.67 msec (893.02 MiB/sec)
  T 9  mlock 0.00 msec,  preflt 1209.00 msec,  memcpy 1146.33 msec (893.28 MiB/sec)
  T 15 mlock 0.00 msec,  preflt 1210.63 msec,  memcpy 1147.20 msec (892.61 MiB/sec)
  T 14 mlock 0.00 msec,  preflt 1190.98 msec,  memcpy 1147.90 msec (892.06 MiB/sec)
  T 4  mlock 0.00 msec,  preflt 1193.98 msec,  memcpy 1147.89 msec (892.07 MiB/sec)
  T 6  mlock 0.00 msec,  preflt 1194.16 msec,  memcpy 1148.72 msec (891.43 MiB/sec)
  T 12 mlock 0.00 msec,  preflt 1191.37 msec,  memcpy 1149.35 msec (890.94 MiB/sec)
  T 8  mlock 0.00 msec,  preflt 1196.99 msec,  memcpy 1149.30 msec (890.98 MiB/sec)
  T 10 mlock 0.00 msec,  preflt 1197.32 msec,  memcpy 1149.37 msec (890.92 MiB/sec)
  T 11 mlock 0.00 msec,  preflt 1197.75 msec,  memcpy 1152.12 msec (888.79 MiB/sec)

Aggregate metrics, 16 threads, 16384.00 MiB:
  mlock  0.00 msec
  preflt 1211.96 msec
  memcpy 1152.58 msec (14215.02 MiB/sec)




NetBSD-8, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
  T 16 mlock 0.00 msec,  preflt 18116.24 msec,  memcpy 945.99 msec (1082.46 MiB/sec)
  T 9  mlock 0.00 msec,  preflt 18112.29 msec,  memcpy 949.79 msec (1078.13 MiB/sec)
  T 10 mlock 0.00 msec,  preflt 18131.93 msec,  memcpy 955.33 msec (1071.88 MiB/sec)
  T 8  mlock 0.00 msec,  preflt 17868.22 msec,  memcpy 959.28 msec (1067.46 MiB/sec)
  T 4  mlock 0.00 msec,  preflt 17437.47 msec,  memcpy 958.71 msec (1068.11 MiB/sec)
  T 6  mlock 0.00 msec,  preflt 16743.15 msec,  memcpy 958.53 msec (1068.31 MiB/sec)
  T 3  mlock 0.00 msec,  preflt 18130.67 msec,  memcpy 944.33 msec (1084.36 MiB/sec)
  T 2  mlock 0.00 msec,  preflt 18060.20 msec,  memcpy 958.34 msec (1068.51 MiB/sec)
  T 11

Re: Testing memory performance

2018-11-20 Thread Sad Clouds
On Mon, 19 Nov 2018 22:10:41 -0500
Eric Hawicz  wrote:

> The only way I can see that you'd end up with a total transfer rate 
> around 5GB/s is if you didn't actually manage to get the threads
> running in parallel, but instead have perhaps 2-3 running at a time,
> then the next 2-3 don't even start until those first few finish.
> 
> Eric
> 

That is exactly what happens: the other threads are blocked from running
because the NetBSD VM subsystem that allocates pages is hitting a single
lock, causing contention.



Re: Testing memory performance

2018-11-19 Thread Sad Clouds
OK, I disabled NUMA in the BIOS; there is a slight performance hit, but
NetBSD is still much slower than Linux. This time I did a single-thread
test, but the disparity grows with the number of threads.

NetBSD:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=11285.07 msec, memcpy=3056.22 MiB/sec
Total transfer rate: 3056.22 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=7319.33 msec, memcpy=5089.21 MiB/sec
Total transfer rate: 5089.21 MiB/sec

Note that to pre-fault 16 GiB of pages (touching 1 byte in every 4 KiB
page) NetBSD took around 11 seconds while Linux took 7 seconds. With 16
concurrent threads, the NetBSD pre-fault is 18 times longer. Maybe there
is a global lock in the NetBSD VM subsystem that slows things down with a
higher number of threads.
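
For reference, "pre-fault" here just means touching one byte in each page
of an anonymous mapping, so the kernel has to service a fault and hand
out a physical page for every touch. A minimal stand-alone sketch (sizes
are illustrative, not the actual sv_mem code):

#include <sys/mman.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024 * 1024;   /* 16 GiB of anonymous memory */
    long pagesz = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    struct timespec t0, t1;
    size_t off;

    if (p == MAP_FAILED)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* Each write hits a page with no physical backing yet, so the kernel
       takes a fault and must grab a page from its freelist - the path
       that appears to serialize on NetBSD. */
    for (off = 0; off < len; off += (size_t)pagesz)
        p[off] = 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("preflt %.2f msec\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}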

So the average memcpy throughput on NetBSD drops with a higher number of
threads: the threads can't make progress until their pages are allocated,
and the global lock causes contention, so they sit idle waiting.

Note below how the NetBSD memcpy rate for individual threads is faster,
but the overall throughput is almost half that of Linux, because the
NetBSD VM subsystem acts like a barrier and causes those threads to stall
until pages are allocated.


NetBSD:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 5  preflt=16400.12 msec, memcpy=3130.44 MiB/sec
Thread 11 preflt=16931.65 msec, memcpy=3154.73 MiB/sec
Thread 9  preflt=17169.03 msec, memcpy=2514.06 MiB/sec
Thread 4  preflt=17632.37 msec, memcpy=2928.74 MiB/sec
Thread 14 preflt=17696.83 msec, memcpy=2146.89 MiB/sec
Thread 7  preflt=17885.63 msec, memcpy=2926.97 MiB/sec
Thread 1  preflt=17918.38 msec, memcpy=1338.85 MiB/sec
Thread 10 preflt=18316.65 msec, memcpy=2082.36 MiB/sec
Thread 15 preflt=18323.43 msec, memcpy=1338.62 MiB/sec
Thread 12 preflt=18310.89 msec, memcpy=1322.38 MiB/sec
Thread 6  preflt=18363.57 msec, memcpy=1507.58 MiB/sec
Thread 16 preflt=18360.23 msec, memcpy=1909.12 MiB/sec
Thread 8  preflt=18155.39 msec, memcpy=1478.17 MiB/sec
Thread 13 preflt=18236.67 msec, memcpy=1849.76 MiB/sec
Thread 3  preflt=18303.09 msec, memcpy=2116.50 MiB/sec
Thread 2  preflt=17960.70 msec, memcpy=1325.43 MiB/sec
Total transfer rate: 6087.94 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 13 preflt=1182.27 msec, memcpy=902.88 MiB/sec
Thread 9  preflt=1183.55 msec, memcpy=903.02 MiB/sec
Thread 5  preflt=1191.65 msec, memcpy=899.32 MiB/sec
Thread 11 preflt=1186.96 msec, memcpy=897.64 MiB/sec
Thread 7  preflt=1195.46 msec, memcpy=898.71 MiB/sec
Thread 6  preflt=1207.12 msec, memcpy=904.71 MiB/sec
Thread 15 preflt=1194.18 msec, memcpy=896.05 MiB/sec
Thread 4  preflt=1216.37 msec, memcpy=909.09 MiB/sec
Thread 3  preflt=1210.41 msec, memcpy=897.77 MiB/sec
Thread 2  preflt=1210.36 msec, memcpy=896.36 MiB/sec
Thread 12 preflt=1210.59 msec, memcpy=898.79 MiB/sec
Thread 14 preflt=1209.41 msec, memcpy=898.01 MiB/sec
Thread 10 preflt=1210.00 msec, memcpy=896.88 MiB/sec
Thread 1  preflt=1216.32 msec, memcpy=899.56 MiB/sec
Thread 16 preflt=1209.18 msec, memcpy=899.34 MiB/sec
Thread 8  preflt=1231.36 msec, memcpy=910.00 MiB/sec
Total transfer rate: 13978.88 MiB/sec



Re: Testing memory performance

2018-11-19 Thread Sad Clouds
On Sun, 18 Nov 2018 16:30:32 -0500
Eric Hawicz  wrote:

> > NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> > Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> > ...
> > Total transfer rate: 5817.56 MiB/sec
> 
> What?  I think your measurements are a bit off here.  There may be a 
> problem with the speed, but if you're measuring the per-thread rate 
> properly then the sum of those should equal your total transfer
> rate. Are the periods during which each thread calculates its rate
> very different from the period of the overall test?

The sum of all the per-thread rates should not equal the total transfer
rate, because the threads could be running at different times. Instead
of all threads running in parallel you could have something like: T1
runs, pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you
have, the longer it takes for all threads to complete. Have a think
about it, it makes sense.



Re: Testing memory performance

2018-11-19 Thread Eric Hawicz

On 11/19/2018 4:38 PM, Sad Clouds wrote:

> On Sun, 18 Nov 2018 16:30:32 -0500
> Eric Hawicz  wrote:
>
> > > NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> > > Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> > > ...
> > > Total transfer rate: 5817.56 MiB/sec
> >
> > What?  I think your measurements are a bit off here.  There may be a
> > problem with the speed, but if you're measuring the per-thread rate
> > properly then the sum of those should equal your total transfer
> > rate. Are the periods during which each thread calculates its rate
> > very different from the period of the overall test?
>
> The sum of all the per-thread rates should not equal the total transfer
> rate, because the threads could be running at different times. Instead
> of all threads running in parallel you could have something like: T1
> runs, pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you
> have, the longer it takes for all threads to complete. Have a think
> about it, it makes sense.


Sure the threads pause, but so what?  Unless you have dramatically 
different start and end times for all of the threads, the numbers are 
way off.  It doesn't matter whether a thread pauses, since that pause 
will be within the start & end times for that thread, and thus included 
in the rate calculation.


Say each thread is around for 10 seconds, and in that time it transfers
25GB of data, so that's 2.5GB/s.


If your overall test is also roughly 10 seconds long, then the total
transfer rate must be roughly 2.5GB/s * # of threads.


The only way I can see that you'd end up with a total transfer rate 
around 5GB/s is if you didn't actually manage to get the threads running 
in parallel, but instead have perhaps 2-3 running at a time, then the 
next 2-3 don't even start until those first few finish.


Eric



Re: Testing memory performance

2018-11-19 Thread Michael van Elst
On Mon, Nov 19, 2018 at 09:25:31PM +, Sad Clouds wrote:
> OK, I disabled NUMA in the BIOS; there is a slight performance hit, but
> NetBSD is still much slower than Linux. This time I did a single-thread
> test, but the disparity grows with the number of threads.

You cannot disable NUMA, that's how the machine is built. You may change
how memory is physically mapped (usually done by hashing address bits).


> Maybe there is a global lock in the NetBSD VM subsystem that slows things
> down with a higher number of threads.

There is a global lock for the page freelist.



-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Testing memory performance

2018-11-18 Thread Eric Hawicz

On 11/18/2018 7:00 AM, Sad Clouds wrote:

> I'm developing a small tool that tests memory performance/throughput
> across different environments. I'm noticing performance issues on
> NetBSD-8, below are the details:
>
> ...
>
> NetBSD and Linux have different versions of GCC, but I was hoping the
> following flags would keep optimization differences to a minimum:


If you want to rule that out, you could always build the same version of
gcc on both.  Or even run the Linux binary (and libs) on NetBSD.




> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> ...
> Total transfer rate: 5817.56 MiB/sec


What?  I think your measurements are a bit off here.  There may be a 
problem with the speed, but if you're measuring the per-thread rate 
properly then the sum of those should equal your total transfer rate.  
Are the periods during which each thread calculates its rate very 
different from the period of the overall test?



Also, your subsequent email about the memcpy disassembly does not list the
full code for the Linux version (the jumps at the start refer to
instruction addresses that you don't include), so you can't really
compare them.  I expect that both implementations have a variety of code
blocks to handle different alignments, different supported instructions,
etc.



Eric



Re: Testing memory performance

2018-11-18 Thread Rhialto
On Sun 18 Nov 2018 at 19:04:02 +, Sad Clouds wrote:
> Linux (gcc 6.3.0):

It looks to me like this fragment is not the whole function:

> Dump of assembler code for function memcpy:
> => 0x778a0e90 <+0>:   mov%rdi,%rax
>0x778a0e93 <+3>:   cmp$0x10,%rdx
>0x778a0e97 <+7>:   jb 0x778a0f77

0x778a0f77 isn't in the disassembly

>0x778a0e9d <+13>:  cmp$0x20,%rdx
>0x778a0ea1 <+17>:  ja 0x778a0fc6

Neither is 0x778a0fc6.

>0x778a0ea7 <+23>:  movups (%rsi),%xmm0
>0x778a0eaa <+26>:  movups -0x10(%rsi,%rdx,1),%xmm1
>0x778a0eaf <+31>:  movups %xmm0,(%rdi)
>0x778a0eb2 <+34>:  movups %xmm1,-0x10(%rdi,%rdx,1)
>0x778a0eb7 <+39>:  retq   
> End of assembler dump.

It looks like both functions check for some initial conditions to see
which optimized loop they can use, but they use very different
optimizations.
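
For what it's worth, here is a simplified C sketch of the kind of size
dispatch the Linux fragment suggests (not the actual glibc code, which
has many more branches for alignment, larger copies and different
instruction sets):

#include <stddef.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2 intrinsics, matching the movups in the dump */

void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (n < 16) {
        /* small-copy path (the "jb" branch in the dump) */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
    } else if (n <= 32) {
        /* 16..32 bytes: two overlapping unaligned 16-byte loads/stores,
           which is what the four movups instructions in the dump do */
        __m128i lo = _mm_loadu_si128((const __m128i *)src);
        __m128i hi = _mm_loadu_si128(
            (const __m128i *)((const char *)src + n - 16));
        _mm_storeu_si128((__m128i *)dst, lo);
        _mm_storeu_si128((__m128i *)((char *)dst + n - 16), hi);
    } else {
        /* larger sizes: hand off to a loop (the "ja" branch in the dump) */
        memcpy(dst, src, n);
    }
    return dst;
}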

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl  -- if you're unable...to Speak." - Agent Elrond

