Re: Testing memory performance
On Wed, Nov 21, 2018 at 4:18 AM Eric Hawicz wrote:
> That still sounds to me like the test is a bit off. If you've already
> recorded the start time of each thread, then the time that the threads
> are blocked from running would be included in the per-thread rate, thus
> causing it to appear much slower.

No, because start/end times are taken for specific operations like pre-faulting or memcpy. That doesn't tell you what a thread is doing in relation to the other threads: a thread can be blocked for some time and only then be scheduled to run, at which point its start time is taken. How would that latency be accounted for if it occurred before the start time was taken?

Think of it as a simple example. Say the memory bus has a maximum bandwidth of 10 GiB/sec and you have two threads A and B, each doing a memcpy of 10 GiB.

Scenario 1 - both threads run in parallel and share memory bus bandwidth:

  --> time in seconds
  AA      thread A runs for 2 seconds and does memcpy at 5 GiB/sec
  BB      thread B runs for 2 seconds and does memcpy at 5 GiB/sec

  Aggregate throughput = (2 threads * 10 GiB) / 2 seconds = 10 GiB/sec

Scenario 2 - each thread runs in sequence and uses the full memory bus bandwidth:

  --> time in seconds
  A       thread A runs for 1 second and does memcpy at 10 GiB/sec
   L      lock contention causes latency of 1 second
    B     thread B runs for 1 second and does memcpy at 10 GiB/sec

  Aggregate throughput = (2 threads * 10 GiB) / 3 seconds = ~6.7 GiB/sec
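For concreteness, here is a small self-contained C program (not part of sv_mem; the struct and variable names are made up) that computes both figures for scenario 2. Each per-thread rate uses only that thread's own start/end timestamps, while the aggregate divides total bytes by the wall-clock span from the earliest start to the latest end, so the gap where neither thread runs lowers only the aggregate.

#include <stdio.h>

/* Hypothetical per-thread sample: start/end of the memcpy stage only. */
struct sample { double start, end, gib; };

int main(void)
{
    /* Scenario 2: A runs 0..1s, B runs 2..3s, 1s of lock latency between. */
    struct sample s[2] = {
        { 0.0, 1.0, 10.0 },   /* thread A: 10 GiB in 1 sec */
        { 2.0, 3.0, 10.0 },   /* thread B: 10 GiB in 1 sec */
    };
    double first = s[0].start, last = s[0].end, total = 0.0;

    for (int i = 0; i < 2; i++) {
        printf("thread %c rate = %.1f GiB/sec\n",
               'A' + i, s[i].gib / (s[i].end - s[i].start));
        if (s[i].start < first) first = s[i].start;
        if (s[i].end > last)    last = s[i].end;
        total += s[i].gib;
    }
    /* Wall-clock aggregate includes the gap where neither thread ran:
     * prints 6.7 GiB/sec even though each thread reports 10.0. */
    printf("aggregate rate = %.1f GiB/sec\n", total / (last - first));
    return 0;
}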
Re: Testing memory performance
On 11/20/2018 4:54 AM, Sad Clouds wrote:
> On Mon, 19 Nov 2018 22:10:41 -0500 Eric Hawicz wrote:
>> The only way I can see that you'd end up with a total transfer rate
>> around 5GB/s is if you didn't actually manage to get the threads
>> running in parallel, but instead have perhaps 2-3 running at a time,
>> then the next 2-3 don't even start until those first few finish.
>
> That is exactly what happens: other threads are blocked from running,
> because the NetBSD VM subsystem that allocates pages is hitting a
> single lock and causing contention.

That still sounds to me like the test is a bit off. If you've already recorded the start time of each thread, then the time that the threads are blocked from running would be included in the per-thread rate, thus causing it to appear much slower.

Originally, you said: "The tool creates a number of concurrent threads, each thread allocates a 1 GiB memory segment and a 1 KiB transfer block. It pre-faults every page by writing a single byte at every 4 KiB offset. It then calls memcpy() in a loop, copying the 1 KiB block until the 1 GiB memory segment is filled."

So, I'm imagining each thread has code that does the following sequence of operations (sketched in code below):

* Allocate 1GB memory
* Pre-fault each page
* Notify that we're ready to start and wait until all threads are ready
* Record this thread's start time
* Perform memcpy
* Record this thread's end time

Is that what you're doing?
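A minimal sketch of that sequence in C, assuming a pthread barrier for the ready/start step; the names and structure here are illustrative, since the actual sv_mem source isn't posted in this thread:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SEG_SIZE (1UL << 30)   /* 1 GiB segment per thread */
#define BLK_SIZE (1UL << 10)   /* 1 KiB transfer block */

pthread_barrier_t ready;       /* initialized elsewhere with the thread count */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void *bench_thread(void *arg)
{
    (void)arg;
    /* 1. Allocate the 1 GiB segment and the 1 KiB source block. */
    char *seg = malloc(SEG_SIZE);
    char *blk = malloc(BLK_SIZE);

    /* 2. Pre-fault each page: write one byte at every 4 KiB offset. */
    for (size_t off = 0; off < SEG_SIZE; off += 4096)
        seg[off] = 1;

    /* 3. Wait until all threads are ready. */
    pthread_barrier_wait(&ready);

    /* 4. Record this thread's start time. */
    double start = now_sec();

    /* 5. Copy the 1 KiB block until the 1 GiB segment is filled. */
    for (size_t off = 0; off < SEG_SIZE; off += BLK_SIZE)
        memcpy(seg + off, blk, BLK_SIZE);

    /* 6. Record this thread's end time. Any pause while blocked between
     *    start and end is included in this thread's rate; a pause before
     *    the barrier is not. */
    double mib_per_sec = (SEG_SIZE / (1024.0 * 1024.0)) / (now_sec() - start);
    (void)mib_per_sec;          /* reported by the main thread */

    free(blk);
    free(seg);
    return NULL;
}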
Re: Testing memory performance
On Tue, Nov 20, 2018 at 11:44:47AM -0500, Greg Troxel wrote:
> I thought we were using a pool allocator that had per-cpu freelists,
> derived from Solaris and
> https://www.usenix.org/legacy/event/usenix01/bonwick.html

We are talking about a lower-level free list. Even if you could reuse the pool allocator code at that level, it wouldn't be sufficient. But yes, the methods used by the pool allocator need to be applied here too.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
Michael van Elst writes:

> On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>>
>> Michael van Elst writes:
>> > There is a global lock for the page freelist.
>>
>> I wonder if using a pool-type structure would be feasible. That might
>> fix almost all of the slowness.
>
> You need a per-cpu freelist and some mechanism to steal from other
> freelists. Ideally that also includes something to optimize for NUMA.

I thought we were using a pool allocator that had per-cpu freelists, derived from Solaris and https://www.usenix.org/legacy/event/usenix01/bonwick.html but maybe I am off on that.
Re: Testing memory performance
On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>
> Michael van Elst writes:
> > There is a global lock for the page freelist.
>
> I wonder if using a pool-type structure would be feasible. That might
> fix almost all of the slowness.

You need a per-cpu freelist and some mechanism to steal from other freelists. Ideally that also includes something to optimize for NUMA.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
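As a rough illustration of the idea (this is not NetBSD's actual uvm code; the types, per-list mutexes and round-robin stealing order are all invented for the sketch): each CPU allocates from its own freelist, so the common path never touches a lock another CPU holds, and only an empty local list forces a steal from a neighbour.

#include <pthread.h>
#include <stddef.h>

#define NCPU 16

struct page { struct page *next; };

/* One freelist per CPU; a per-list lock is only contended when stealing. */
struct cpu_freelist {
    pthread_mutex_t lock;
    struct page *head;
} freelist[NCPU];

void freelist_init(void)
{
    for (int i = 0; i < NCPU; i++) {
        pthread_mutex_init(&freelist[i].lock, NULL);
        freelist[i].head = NULL;
    }
}

static struct page *freelist_pop(struct cpu_freelist *fl)
{
    pthread_mutex_lock(&fl->lock);
    struct page *pg = fl->head;
    if (pg != NULL)
        fl->head = pg->next;
    pthread_mutex_unlock(&fl->lock);
    return pg;
}

/* Fast path: this CPU's own list. Slow path: steal from the others. */
struct page *page_alloc(int cpu)
{
    struct page *pg = freelist_pop(&freelist[cpu]);
    for (int i = 1; pg == NULL && i < NCPU; i++)
        pg = freelist_pop(&freelist[(cpu + i) % NCPU]);
    return pg;   /* NULL: every freelist is empty */
}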
Re: Testing memory performance
Michael van Elst writes:

>> Maybe there is a global lock in the NetBSD VM subsystem that slows
>> things down with a higher number of threads.
>
> There is a global lock for the page freelist.

I wonder if using a pool-type structure would be feasible. That might fix almost all of the slowness.
Re: Testing memory performance
On Tue, 20 Nov 2018 00:27:22 +0100 Michael van Elst wrote:
> There is a global lock for the page freelist.

OK, I've made changes to my bench tool to synchronize all threads before each stage, so threads now wait for all other threads to finish pre-faulting pages before they all start memcpy at the same time. This makes it clearer where time is lost.

I did some more tests on Solaris, Linux and NetBSD. It looks like NetBSD memcpy is actually a bit faster than Linux, but NetBSD is quite slow at servicing page faults: the latency when pre-faulting those pages is about 18 times longer on NetBSD, which results in a longer overall execution time. Anyway, this has been an interesting exercise.

Solaris 11.3, x1 UltraSPARC-T2 1415 MHz, 8 cores per CPU, 8 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16  mlock 0.00 msec, preflt 1880.88 msec, memcpy 1521.74 msec (672.91 MiB/sec)
T 14  mlock 0.00 msec, preflt 1896.63 msec, memcpy 1522.38 msec (672.63 MiB/sec)
T 10  mlock 0.00 msec, preflt 1872.01 msec, memcpy 1522.73 msec (672.48 MiB/sec)
T  2  mlock 0.00 msec, preflt 1889.55 msec, memcpy 1522.43 msec (672.61 MiB/sec)
T  8  mlock 0.00 msec, preflt 1862.79 msec, memcpy 1523.32 msec (672.22 MiB/sec)
T  6  mlock 0.00 msec, preflt 1875.76 msec, memcpy 1523.68 msec (672.06 MiB/sec)
T  5  mlock 0.00 msec, preflt 1869.91 msec, memcpy 1524.26 msec (671.80 MiB/sec)
T 12  mlock 0.00 msec, preflt 1880.11 msec, memcpy 1525.13 msec (671.42 MiB/sec)
T  4  mlock 0.00 msec, preflt 1884.96 msec, memcpy 1525.37 msec (671.31 MiB/sec)
T  1  mlock 0.00 msec, preflt 1885.92 msec, memcpy 1525.54 msec (671.24 MiB/sec)
T  9  mlock 0.00 msec, preflt 1875.25 msec, memcpy 1526.15 msec (670.97 MiB/sec)
T 13  mlock 0.00 msec, preflt 1869.48 msec, memcpy 1526.74 msec (670.71 MiB/sec)
T 15  mlock 0.00 msec, preflt 1869.14 msec, memcpy 1527.30 msec (670.46 MiB/sec)
T  7  mlock 0.00 msec, preflt 1889.29 msec, memcpy 1527.45 msec (670.40 MiB/sec)
T  3  mlock 0.00 msec, preflt 1880.53 msec, memcpy 1529.22 msec (669.62 MiB/sec)
T 11  mlock 0.00 msec, preflt 1876.53 msec, memcpy 1530.20 msec (669.19 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec  preflt 1897.69 msec  memcpy 1530.59 msec (10704.36 MiB/sec)

Linux 4.9.0, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T  5  mlock 0.00 msec, preflt 1192.80 msec, memcpy 1141.42 msec (897.13 MiB/sec)
T  7  mlock 0.00 msec, preflt 1211.61 msec, memcpy 1144.62 msec (894.62 MiB/sec)
T 16  mlock 0.00 msec, preflt 1211.59 msec, memcpy 1145.37 msec (894.04 MiB/sec)
T  3  mlock 0.00 msec, preflt 1207.33 msec, memcpy 1146.42 msec (893.21 MiB/sec)
T  2  mlock 0.00 msec, preflt 1211.02 msec, memcpy 1146.36 msec (893.26 MiB/sec)
T  1  mlock 0.00 msec, preflt 1210.36 msec, memcpy 1146.57 msec (893.10 MiB/sec)
T 13  mlock 0.00 msec, preflt 1208.53 msec, memcpy 1146.67 msec (893.02 MiB/sec)
T  9  mlock 0.00 msec, preflt 1209.00 msec, memcpy 1146.33 msec (893.28 MiB/sec)
T 15  mlock 0.00 msec, preflt 1210.63 msec, memcpy 1147.20 msec (892.61 MiB/sec)
T 14  mlock 0.00 msec, preflt 1190.98 msec, memcpy 1147.90 msec (892.06 MiB/sec)
T  4  mlock 0.00 msec, preflt 1193.98 msec, memcpy 1147.89 msec (892.07 MiB/sec)
T  6  mlock 0.00 msec, preflt 1194.16 msec, memcpy 1148.72 msec (891.43 MiB/sec)
T 12  mlock 0.00 msec, preflt 1191.37 msec, memcpy 1149.35 msec (890.94 MiB/sec)
T  8  mlock 0.00 msec, preflt 1196.99 msec, memcpy 1149.30 msec (890.98 MiB/sec)
T 10  mlock 0.00 msec, preflt 1197.32 msec, memcpy 1149.37 msec (890.92 MiB/sec)
T 11  mlock 0.00 msec, preflt 1197.75 msec, memcpy 1152.12 msec (888.79 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec  preflt 1211.96 msec  memcpy 1152.58 msec (14215.02 MiB/sec)

NetBSD-8, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16  mlock 0.00 msec, preflt 18116.24 msec, memcpy 945.99 msec (1082.46 MiB/sec)
T  9  mlock 0.00 msec, preflt 18112.29 msec, memcpy 949.79 msec (1078.13 MiB/sec)
T 10  mlock 0.00 msec, preflt 18131.93 msec, memcpy 955.33 msec (1071.88 MiB/sec)
T  8  mlock 0.00 msec, preflt 17868.22 msec, memcpy 959.28 msec (1067.46 MiB/sec)
T  4  mlock 0.00 msec, preflt 17437.47 msec, memcpy 958.71 msec (1068.11 MiB/sec)
T  6  mlock 0.00 msec, preflt 16743.15 msec, memcpy 958.53 msec (1068.31 MiB/sec)
T  3  mlock 0.00 msec, preflt 18130.67 msec, memcpy 944.33 msec (1084.36 MiB/sec)
T  2  mlock 0.00 msec, preflt 18060.20 msec, memcpy 958.34 msec (1068.51 MiB/sec)
T 11
Re: Testing memory performance
On Mon, 19 Nov 2018 22:10:41 -0500 Eric Hawicz wrote:
> The only way I can see that you'd end up with a total transfer rate
> around 5GB/s is if you didn't actually manage to get the threads
> running in parallel, but instead have perhaps 2-3 running at a time,
> then the next 2-3 don't even start until those first few finish.
>
> Eric

That is exactly what happens: other threads are blocked from running, because the NetBSD VM subsystem that allocates pages is hitting a single lock and causing contention.
Re: Testing memory performance
OK, I disabled NUMA in the BIOS. There is a slight performance hit, but NetBSD is still much slower than Linux. This time I did a single-thread test, but the disparity grows with the number of threads.

NetBSD:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1   preflt=11285.07 msec, memcpy=3056.22 MiB/sec
Total transfer rate: 3056.22 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1   preflt=7319.33 msec, memcpy=5089.21 MiB/sec
Total transfer rate: 5089.21 MiB/sec

Note that pre-faulting 16 GiB of pages (touching 1 byte at every 4 KiB page) took NetBSD around 11 seconds; Linux took 7 seconds. With 16 concurrent threads, the NetBSD pre-fault takes 18 times longer. Maybe there is a global lock in the NetBSD VM subsystem that slows things down with a higher number of threads.

So the average memcpy throughput on NetBSD drops with a higher number of threads because they can't make progress until pages are allocated, and a global lock causes contention, so they sit waiting idle. Note below how NetBSD memcpy for individual threads is faster, yet the overall throughput is almost half of Linux's, because the NetBSD VM subsystem acts like a barrier and causes those threads to stall until pages are allocated.

NetBSD:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 5   preflt=16400.12 msec, memcpy=3130.44 MiB/sec
Thread 11  preflt=16931.65 msec, memcpy=3154.73 MiB/sec
Thread 9   preflt=17169.03 msec, memcpy=2514.06 MiB/sec
Thread 4   preflt=17632.37 msec, memcpy=2928.74 MiB/sec
Thread 14  preflt=17696.83 msec, memcpy=2146.89 MiB/sec
Thread 7   preflt=17885.63 msec, memcpy=2926.97 MiB/sec
Thread 1   preflt=17918.38 msec, memcpy=1338.85 MiB/sec
Thread 10  preflt=18316.65 msec, memcpy=2082.36 MiB/sec
Thread 15  preflt=18323.43 msec, memcpy=1338.62 MiB/sec
Thread 12  preflt=18310.89 msec, memcpy=1322.38 MiB/sec
Thread 6   preflt=18363.57 msec, memcpy=1507.58 MiB/sec
Thread 16  preflt=18360.23 msec, memcpy=1909.12 MiB/sec
Thread 8   preflt=18155.39 msec, memcpy=1478.17 MiB/sec
Thread 13  preflt=18236.67 msec, memcpy=1849.76 MiB/sec
Thread 3   preflt=18303.09 msec, memcpy=2116.50 MiB/sec
Thread 2   preflt=17960.70 msec, memcpy=1325.43 MiB/sec
Total transfer rate: 6087.94 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 13  preflt=1182.27 msec, memcpy=902.88 MiB/sec
Thread 9   preflt=1183.55 msec, memcpy=903.02 MiB/sec
Thread 5   preflt=1191.65 msec, memcpy=899.32 MiB/sec
Thread 11  preflt=1186.96 msec, memcpy=897.64 MiB/sec
Thread 7   preflt=1195.46 msec, memcpy=898.71 MiB/sec
Thread 6   preflt=1207.12 msec, memcpy=904.71 MiB/sec
Thread 15  preflt=1194.18 msec, memcpy=896.05 MiB/sec
Thread 4   preflt=1216.37 msec, memcpy=909.09 MiB/sec
Thread 3   preflt=1210.41 msec, memcpy=897.77 MiB/sec
Thread 2   preflt=1210.36 msec, memcpy=896.36 MiB/sec
Thread 12  preflt=1210.59 msec, memcpy=898.79 MiB/sec
Thread 14  preflt=1209.41 msec, memcpy=898.01 MiB/sec
Thread 10  preflt=1210.00 msec, memcpy=896.88 MiB/sec
Thread 1   preflt=1216.32 msec, memcpy=899.56 MiB/sec
Thread 16  preflt=1209.18 msec, memcpy=899.34 MiB/sec
Thread 8   preflt=1231.36 msec, memcpy=910.00 MiB/sec
Total transfer rate: 13978.88 MiB/sec
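A back-of-the-envelope reading of those preflt numbers (my arithmetic, not output from the tool): touching one byte per 4 KiB page of a 16 GiB segment triggers 16 GiB / 4 KiB = 4,194,304 page faults, so NetBSD is servicing a fault in roughly 2.7 usec against Linux's 1.7 usec:

#include <stdio.h>

int main(void)
{
    /* 16 GiB segment, one fault per 4 KiB page. */
    double faults = (16.0 * 1024 * 1024 * 1024) / 4096;   /* 4,194,304 */

    /* preflt times from the single-thread runs above, in microseconds. */
    printf("NetBSD: %.2f usec/fault\n", 11285.07e3 / faults);  /* ~2.69 */
    printf("Linux:  %.2f usec/fault\n",  7319.33e3 / faults);  /* ~1.75 */
    return 0;
}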
Re: Testing memory performance
On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
>> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
>> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
>> ...
>> Total transfer rate: 5817.56 MiB/sec
>
> What? I think your measurements are a bit off here. There may be a
> problem with the speed, but if you're measuring the per-thread rate
> properly then the sum of those should equal your total transfer
> rate. Are the periods during which each thread calculates its rate
> very different from the period of the overall test?

The sum of all threads should not equal the total transfer rate, because the threads could be running at different times. So instead of all threads running in parallel you could have something like: T1 runs, pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you have, the longer it will take for all threads to complete. Have a think about it, it makes sense.
Re: Testing memory performance
On 11/19/2018 4:38 PM, Sad Clouds wrote:
> On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
>>> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
>>> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
>>> ...
>>> Total transfer rate: 5817.56 MiB/sec
>>
>> What? I think your measurements are a bit off here. There may be a
>> problem with the speed, but if you're measuring the per-thread rate
>> properly then the sum of those should equal your total transfer
>> rate. Are the periods during which each thread calculates its rate
>> very different from the period of the overall test?
>
> The sum of all threads should not equal the total transfer rate,
> because the threads could be running at different times. So instead of
> all threads running in parallel you could have something like: T1 runs,
> pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you have,
> the longer it will take for all threads to complete. Have a think about
> it, it makes sense.

Sure the threads pause, but so what? Unless you have dramatically different start and end times for all of the threads, the numbers are way off. It doesn't matter whether a thread pauses, since that pause will be within the start and end times for that thread, and thus included in the rate calculation.

Say each thread is around for 10 seconds, and in that time it transfers 25GB of data, so that's 2.5GB/s. If your overall test is also roughly 10 seconds long, then the total transfer rate must be roughly 2.5GB/s * # of threads.

The only way I can see that you'd end up with a total transfer rate around 5GB/s is if you didn't actually manage to get the threads running in parallel, but instead have perhaps 2-3 running at a time, then the next 2-3 don't even start until those first few finish.

Eric
Re: Testing memory performance
On Mon, Nov 19, 2018 at 09:25:31PM +0000, Sad Clouds wrote:
> OK, I disabled NUMA in the BIOS. There is a slight performance hit,
> but NetBSD is still much slower than Linux. This time I did a
> single-thread test, but the disparity grows with the number of threads.

You cannot disable NUMA; that's how the machine is built. You may change how memory is physically mapped (usually done by hashing address bits).

> Maybe there is a global lock in the NetBSD VM subsystem that slows
> things down with a higher number of threads.

There is a global lock for the page freelist.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
On Mon, 19 Nov 2018 01:06:45 -0000 (UTC) mlel...@serpens.de (Michael van Elst) wrote:
> munlock fails when not the whole range has been locked. Since the
> range is rounded to page boundaries, there could be some overlap.

Are you referring to a virtual or physical range of addresses? As far as I remember, all memory ranges were powers of 2 and much greater than 4 KiB. Maybe the alignment has to be on a page boundary; I'll see if it helps to change malloc to posix_memalign.

> Another effect on your system is NUMA. Linux will allocate memory
> on the CPU that requests it when possible. NetBSD has no idea about
> NUMA. On your system that can easily have a 20-30% impact on memcpy
> speed.
>
> If a thread sleeps, it is either doing a system call, or the scheduler
> doesn't allocate a CPU for it. The latter shouldn't happen in netbsd-8
> for CPU-bound user threads.
>
> But without seeing your code, it's difficult to tell what happens.

The speed difference is about 2.5 times, so way bigger than the 30% you mentioned. Also, there is a simple loop that calls memcpy, no syscalls of any kind, but for some reason the threads are idle 60% of the time. I'll run some more tests and provide more details.
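A minimal sketch of that change (buffer size illustrative, not from the actual tool): posix_memalign starts the allocation on a page boundary, so the range handed to mlock()/munlock() covers exactly whole pages, and the page-boundary rounding described above can't make the locked ranges of neighbouring allocations overlap.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 1UL << 30;    /* 1 GiB: already a multiple of the page size */
    void *seg;

    /* Page-aligned allocation instead of plain malloc(len). */
    int rc = posix_memalign(&seg, page, len);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign: %s\n", strerror(rc));
        return 1;
    }

    if (mlock(seg, len) != 0)
        perror("mlock");

    /* ... benchmark would run here ... */

    /* munlock now sees exactly the pages that were locked. */
    if (munlock(seg, len) != 0)
        perror("munlock");

    free(seg);
    return 0;
}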
Re: Testing memory performance
On Sun, 18 Nov 2018 22:50:17 +0100 Rhialto wrote:
> On Sun 18 Nov 2018 at 19:04:02 +0000, Sad Clouds wrote:
> > Linux (gcc 6.3.0):
>
> It looks to me like this fragment is not the whole function:
>
> > Dump of assembler code for function memcpy:
> > => 0x778a0e90 <+0>:  mov    %rdi,%rax
> >    0x778a0e93 <+3>:  cmp    $0x10,%rdx
> >    0x778a0e97 <+7>:  jb     0x778a0f77
>
> 0x778a0f77 isn't in the disassembly
>
> >    0x778a0e9d <+13>: cmp    $0x20,%rdx
> >    0x778a0ea1 <+17>: ja     0x778a0fc6
>
> 0x778a0fc6 neither.
>
> >    0x778a0ea7 <+23>: movups (%rsi),%xmm0
> >    0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
> >    0x778a0eaf <+31>: movups %xmm0,(%rdi)
> >    0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
> >    0x778a0eb7 <+39>: retq
> > End of assembler dump.

That's what GDB printed out, not sure why some parts may be missing.
Re: Testing memory performance
cryintotheblue...@gmail.com (Sad Clouds) writes:

> Looked at disassembly of memcpy() and NetBSD version looks way more
> complicated. I don't know anything about x86 assembly, but maybe the
> clue is somewhere here:

The Linux code shown is incomplete. But that can't be relevant to your problem.

munlock fails when not the whole range has been locked. Since the range is rounded to page boundaries, there could be some overlap.

The memcpy speed is obviously influenced by the caches. Multiple threads can easily cause thrashing, and the memory allocator may make a difference.

Another effect on your system is NUMA. Linux will allocate memory on the CPU that requests it when possible. NetBSD has no idea about NUMA. On your system that can easily have a 20-30% impact on memcpy speed.

If a thread sleeps, it is either doing a system call, or the scheduler doesn't allocate a CPU for it. The latter shouldn't happen in netbsd-8 for CPU-bound user threads.

But without seeing your code, it's difficult to tell what happens.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
On 11/18/2018 7:00 AM, Sad Clouds wrote:
> I'm developing a small tool that tests memory performance/throughput
> across different environments. I'm noticing performance issues on
> NetBSD-8; below are the details.
> ...
> NetBSD and Linux have different versions of GCC, but I was hoping the
> following flags would keep optimization differences to a minimum:

If you want to rule that out, you could always build the same version of gcc on both. Or even run the linux binary (and libs) on NetBSD.

> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> ...
> Total transfer rate: 5817.56 MiB/sec

What? I think your measurements are a bit off here. There may be a problem with the speed, but if you're measuring the per-thread rate properly then the sum of those should equal your total transfer rate. Are the periods during which each thread calculates its rate very different from the period of the overall test?

Also, your subsequent email about memcpy disassembly does not list the full code for the linux version (the jumps at the start refer to instruction addresses that you don't include), so you can't really compare them. I expect that both implementations have a variety of code blocks to handle different alignments, different supported instructions, etc.

Eric
Re: Testing memory performance
On Sun 18 Nov 2018 at 19:04:02 +0000, Sad Clouds wrote:
> Linux (gcc 6.3.0):

It looks to me like this fragment is not the whole function:

> Dump of assembler code for function memcpy:
> => 0x778a0e90 <+0>:  mov    %rdi,%rax
>    0x778a0e93 <+3>:  cmp    $0x10,%rdx
>    0x778a0e97 <+7>:  jb     0x778a0f77

0x778a0f77 isn't in the disassembly

>    0x778a0e9d <+13>: cmp    $0x20,%rdx
>    0x778a0ea1 <+17>: ja     0x778a0fc6

0x778a0fc6 neither.

>    0x778a0ea7 <+23>: movups (%rsi),%xmm0
>    0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
>    0x778a0eaf <+31>: movups %xmm0,(%rdi)
>    0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
>    0x778a0eb7 <+39>: retq
> End of assembler dump.

It looks like both functions check for some initial conditions to see which optimized loop they can use, but they use very different optimizations.

-Olaf.
--
___ Olaf 'Rhialto' Seibert -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl      -- if you're unable...to Speak." - Agent Elrond
Re: Testing memory performance
Looked at the disassembly of memcpy(), and the NetBSD version looks way more complicated. I don't know anything about x86 assembly, but maybe the clue is somewhere here:

NetBSD (gcc 5.5.0):

Dump of assembler code for function memcpy:
=> 0x7f7e5940b980 <+0>:  mov    %rdx,%rcx
   0x7f7e5940b983 <+3>:  mov    %rdi,%rax
   0x7f7e5940b986 <+6>:  mov    %rdi,%r11
   0x7f7e5940b989 <+9>:  shr    $0x3,%rcx
   0x7f7e5940b98d <+13>: je     0x7f7e5940b9cc
   0x7f7e5940b98f <+15>: lea    -0x8(%rdi,%rdx,1),%r9
   0x7f7e5940b994 <+20>: mov    -0x8(%rsi,%rdx,1),%r10
   0x7f7e5940b999 <+25>: and    $0x7,%r11
   0x7f7e5940b99d <+29>: jne    0x7f7e5940b9a6
   0x7f7e5940b99f <+31>: rep movsq %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9a2 <+34>: mov    %r10,(%r9)
   0x7f7e5940b9a5 <+37>: retq
   0x7f7e5940b9a6 <+38>: lea    -0x9(%r11,%rdx,1),%rcx
   0x7f7e5940b9ab <+43>: neg    %r11
   0x7f7e5940b9ae <+46>: mov    (%rsi),%rdx
   0x7f7e5940b9b1 <+49>: mov    %rdi,%r8
   0x7f7e5940b9b4 <+52>: lea    0x8(%rsi,%r11,1),%rsi
   0x7f7e5940b9b9 <+57>: lea    0x8(%rdi,%r11,1),%rdi
   0x7f7e5940b9be <+62>: shr    $0x3,%rcx
   0x7f7e5940b9c2 <+66>: rep movsq %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9c5 <+69>: mov    %rdx,(%r8)
   0x7f7e5940b9c8 <+72>: mov    %r10,(%r9)
   0x7f7e5940b9cb <+75>: retq
   0x7f7e5940b9cc <+76>: mov    %rdx,%rcx
   0x7f7e5940b9cf <+79>: rep movsb %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9d1 <+81>: retq
End of assembler dump.

Linux (gcc 6.3.0):

Dump of assembler code for function memcpy:
=> 0x778a0e90 <+0>:  mov    %rdi,%rax
   0x778a0e93 <+3>:  cmp    $0x10,%rdx
   0x778a0e97 <+7>:  jb     0x778a0f77
   0x778a0e9d <+13>: cmp    $0x20,%rdx
   0x778a0ea1 <+17>: ja     0x778a0fc6
   0x778a0ea7 <+23>: movups (%rsi),%xmm0
   0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
   0x778a0eaf <+31>: movups %xmm0,(%rdi)
   0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
   0x778a0eb7 <+39>: retq
End of assembler dump.
Testing memory performance
I'm developing a small tool that tests memory performance/throughput across different environments. I'm noticing performance issues on NetBSD-8; below are the details.

The tool creates a number of concurrent threads. Each thread allocates a 1 GiB memory segment and a 1 KiB transfer block. It pre-faults every page by writing a single byte at every 4 KiB offset. It then calls memcpy() in a loop, copying the 1 KiB block until the 1 GiB memory segment is filled.

NetBSD and Linux have different versions of GCC, but I was hoping the following flags would keep optimization differences to a minimum:

gcc -O1 -fno-builtin -march=westmere -Wall -pedantic -std=c11 \
    -D_FILE_OFFSET_BITS=64 -D_XOPEN_SOURCE=700 -D_DEFAULT_SOURCE

The hardware has 48 GiB of RAM; for this test I'm using 16 threads x 1 GiB = 16 GiB total. I'm seeing several issues on NetBSD:

1. When each thread calls mlock() to lock its pages, munlock() sometimes fails with ENOMEM when unlocking them. It doesn't happen every time, but frequently enough, and I don't know specifically why munlock() fails. The same code works correctly on Linux.

2. Performance with 16 concurrent threads is rather bad. Most threads are idle 60% of the time (on Linux they are 100% busy), which suggests some sort of contention somewhere (see the sketch at the end of this message for one way to measure this). On NetBSD the average throughput with 16 threads is around 5.8 GiB/sec; on Linux it is around 15.3 GiB/sec.

3. This issue affects both NetBSD and Linux. When using mlock() to lock memory pages before issuing memcpy(), overall throughput drops significantly. Threads seem to be serialized: while a few threads are running, others are blocked for some reason. I don't know why mlock() has this effect.

If anyone has any thoughts on this, please let me know. Below are details of the SMP architecture and test results.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Stepping:              2
CPU MHz:               1596.000
CPU max MHz:           2395.0000
CPU min MHz:           1596.0000
BogoMIPS:              4787.71
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
Thread 7   preflt=14277.53 msec, memcpy=2891.39 MiB/sec
Thread 3   preflt=14765.99 msec, memcpy=2553.72 MiB/sec
Thread 14  preflt=15036.90 msec, memcpy=2288.19 MiB/sec
Thread 1   preflt=15126.01 msec, memcpy=2315.53 MiB/sec
Thread 12  preflt=15333.82 msec, memcpy=2071.52 MiB/sec
Thread 5   preflt=15603.25 msec, memcpy=1880.64 MiB/sec
Thread 6   preflt=15704.05 msec, memcpy=1662.66 MiB/sec
Thread 10  preflt=15693.48 msec, memcpy=1642.44 MiB/sec
Thread 4   preflt=15571.64 msec, memcpy=1557.73 MiB/sec
Thread 15  preflt=15574.60 msec, memcpy=1571.76 MiB/sec
Thread 9   preflt=15750.08 msec, memcpy=2170.44 MiB/sec
Thread 13  preflt=15588.69 msec, memcpy=1900.24 MiB/sec
Thread 8   preflt=15587.50 msec, memcpy=2043.66 MiB/sec
Thread 16  preflt=15265.48 msec, memcpy=1884.74 MiB/sec
Thread 11  preflt=15294.87 msec, memcpy=2272.75 MiB/sec
Total transfer rate: 5817.56 MiB/sec

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:
Thread 2   preflt=5.27 msec, memcpy=2595.67 MiB/sec
Thread 3   preflt=5.37 msec, memcpy=2550.90 MiB/sec
Thread 16  preflt=5.02 msec, memcpy=2770.11 MiB/sec
Thread 4   preflt=4.12 msec, memcpy=3209.06 MiB/sec
Thread 15  preflt=5.31 msec, memcpy=2496.82 MiB/sec
Thread 13  preflt=7.46 msec, memcpy=3083.72 MiB/sec
Thread 5   preflt=5.49 msec, memcpy=2766.81 MiB/sec
Thread 14  preflt=6.94 msec, memcpy=2574.98 MiB/sec
Thread 8   preflt=6.53 msec, memcpy=2201.47 MiB/sec
Thread 12  preflt=4.90 msec, memcpy=2814.79 MiB/sec
Thread 10  preflt=4.41 msec, memcpy=2615.27 MiB/sec
Thread 6   preflt=6.18 msec, memcpy=2844.57 MiB/sec
Thread 9   preflt=5.38 msec, memcpy=2976.05 MiB/sec
Thread 7   preflt=4.81 msec, memcpy=2828.54 MiB/sec
Thread 11  preflt=5.10 msec, memcpy=2778.69 MiB/sec
Thread 1   preflt=3.84 msec, memcpy=3229.88 MiB/sec
Total transfer rate: 3789.33 MiB/sec

Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 5   preflt=1122.06 msec, memcpy=990.24 MiB/sec
Thread 2   preflt=1137.94 msec, memcpy=990.41 MiB/sec
Thread 15  preflt=1125.65 msec, memcpy=982.23 MiB/sec
Thread 4   preflt=1130.02 msec, memcpy=981.37 MiB/sec
Thread 9   preflt=1130.47 msec, memcpy=982.23 MiB/sec
Thread 13  preflt=1127.70 msec, memcpy=982.00 MiB/sec
Thread 3   preflt=
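One way to put a number on "threads are idle 60% of the time" from inside the tool itself (a sketch, not part of sv_mem as posted; it assumes CLOCK_THREAD_CPUTIME_ID is available on the target OS): compare a thread's accumulated CPU time with the wall-clock time of the same interval. A thread sleeping on a contended kernel lock accrues wall time but almost no CPU time, so the ratio approximates how busy it was.

#include <stdio.h>
#include <string.h>
#include <time.h>

static double ts_diff(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Fraction of wall-clock time the calling thread spent on a CPU while
 * running work(): ~1.0 means fully busy, ~0.4 matches "idle 60%". */
static double busy_fraction(void (*work)(void))
{
    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
    work();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);
    return ts_diff(c0, c1) / ts_diff(w0, w1);
}

static char buf[1 << 20];

static void work(void)
{
    /* Stand-in workload; the real tool would wrap its memcpy stage. */
    for (int i = 0; i < 1000; i++)
        memset(buf, i, sizeof buf);
}

int main(void)
{
    printf("busy fraction: %.2f\n", busy_fraction(work));
    return 0;
}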