Re: Testing memory performance
On Wed, Nov 21, 2018 at 4:18 AM Eric Hawicz wrote:
> That still sounds to me like the test is a bit off. If you've already
> recorded the start time of each thread, then the time that the threads
> are blocked from running would be included in the per-thread rate, thus
> causing it to appear much slower.

No, because start/end times are taken for specific operations like pre-faulting or memcpy. That doesn't tell you what a thread is doing in relation to the other threads: a thread can be blocked for some time and only then be scheduled to run, at which point its start time is taken. How would that latency be accounted for if it occurred before the start time was taken?

Think of it as a simple example. Say the memory bus has a maximum bandwidth of 10 GiB/sec and you have two threads A and B, each doing a memcpy of 10 GiB.

Scenario 1 - both threads run in parallel and share memory bus bandwidth:

  --> time in seconds
  AA      thread A runs for 2 seconds and does memcpy at 5 GiB/sec
  BB      thread B runs for 2 seconds and does memcpy at 5 GiB/sec

  Aggregate throughput = (2 threads * 10 GiB) / 2 seconds = 10 GiB/sec

Scenario 2 - each thread runs in sequence and uses the full memory bus bandwidth:

  --> time in seconds
  A       thread A runs for 1 second and does memcpy at 10 GiB/sec
   L      lock contention causes latency of 1 second
    B     thread B runs for 1 second and does memcpy at 10 GiB/sec

  Aggregate throughput = (2 threads * 10 GiB) / 3 seconds = ~6.7 GiB/sec
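For concreteness, here is a small self-contained C program (not part of sv_mem; the struct and variable names are made up) that computes both figures for scenario 2. Each per-thread rate uses only that thread's own start/end timestamps, while the aggregate divides total bytes by the wall-clock span from the earliest start to the latest end, so the gap where neither thread runs lowers only the aggregate.

#include <stdio.h>

/* Hypothetical per-thread sample: start/end of the memcpy stage only. */
struct sample { double start, end, gib; };

int main(void)
{
    /* Scenario 2: A runs 0..1s, B runs 2..3s, 1s of lock latency between. */
    struct sample s[2] = {
        { 0.0, 1.0, 10.0 },   /* thread A: 10 GiB in 1 sec */
        { 2.0, 3.0, 10.0 },   /* thread B: 10 GiB in 1 sec */
    };
    double first = s[0].start, last = s[0].end, total = 0.0;

    for (int i = 0; i < 2; i++) {
        printf("thread %c rate = %.1f GiB/sec\n",
               'A' + i, s[i].gib / (s[i].end - s[i].start));
        if (s[i].start < first) first = s[i].start;
        if (s[i].end > last)    last = s[i].end;
        total += s[i].gib;
    }
    /* Wall-clock aggregate includes the gap where neither thread ran:
     * prints 6.7 GiB/sec even though each thread reports 10.0. */
    printf("aggregate rate = %.1f GiB/sec\n", total / (last - first));
    return 0;
}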
Re: Testing memory performance
On 11/20/2018 4:54 AM, Sad Clouds wrote:
> On Mon, 19 Nov 2018 22:10:41 -0500 Eric Hawicz wrote:
>> The only way I can see that you'd end up with a total transfer rate
>> around 5GB/s is if you didn't actually manage to get the threads
>> running in parallel, but instead have perhaps 2-3 running at a time,
>> then the next 2-3 don't even start until those first few finish.
>
> That is exactly what happens: other threads are blocked from running,
> because the NetBSD VM subsystem that allocates pages is hitting a
> single lock and causing contention.

That still sounds to me like the test is a bit off. If you've already recorded the start time of each thread, then the time that the threads are blocked from running would be included in the per-thread rate, thus causing it to appear much slower.

Originally, you said: "The tool creates a number of concurrent threads, each thread allocates a 1 GiB memory segment and a 1 KiB transfer block. It pre-faults every page by writing a single byte at every 4 KiB offset. It then calls memcpy() in a loop, copying the 1 KiB block until the 1 GiB memory segment is filled."

So, I'm imagining each thread has code that does the following sequence of operations (sketched in code below):

* Allocate 1GB memory
* Pre-fault each page
* Notify that we're ready to start and wait until all threads are ready
* Record this thread's start time
* Perform memcpy
* Record this thread's end time

Is that what you're doing?
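A minimal sketch of that sequence in C, assuming a pthread barrier for the ready/start step; the names and structure here are illustrative, since the actual sv_mem source isn't posted in this thread:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SEG_SIZE (1UL << 30)   /* 1 GiB segment per thread */
#define BLK_SIZE (1UL << 10)   /* 1 KiB transfer block */

pthread_barrier_t ready;       /* initialized elsewhere with the thread count */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void *bench_thread(void *arg)
{
    (void)arg;
    /* 1. Allocate the 1 GiB segment and the 1 KiB source block. */
    char *seg = malloc(SEG_SIZE);
    char *blk = malloc(BLK_SIZE);

    /* 2. Pre-fault each page: write one byte at every 4 KiB offset. */
    for (size_t off = 0; off < SEG_SIZE; off += 4096)
        seg[off] = 1;

    /* 3. Wait until all threads are ready. */
    pthread_barrier_wait(&ready);

    /* 4. Record this thread's start time. */
    double start = now_sec();

    /* 5. Copy the 1 KiB block until the 1 GiB segment is filled. */
    for (size_t off = 0; off < SEG_SIZE; off += BLK_SIZE)
        memcpy(seg + off, blk, BLK_SIZE);

    /* 6. Record this thread's end time. Any pause while blocked between
     *    start and end is included in this thread's rate; a pause before
     *    the barrier is not. */
    double mib_per_sec = (SEG_SIZE / (1024.0 * 1024.0)) / (now_sec() - start);
    (void)mib_per_sec;          /* reported by the main thread */

    free(blk);
    free(seg);
    return NULL;
}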
Re: Testing memory performance
On Tue, Nov 20, 2018 at 11:44:47AM -0500, Greg Troxel wrote:
> I thought we were using a pool allocator that had per-cpu freelists,
> derived from Solaris and
> https://www.usenix.org/legacy/event/usenix01/bonwick.html

We are talking about a lower-level free list. Even if you could reuse the pool allocator code at that level, it wouldn't be sufficient. But yes, the methods used by the pool allocator need to be applied here too.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
Michael van Elst writes:

> On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>>
>> Michael van Elst writes:
>> > There is a global lock for the page freelist.
>>
>> I wonder if using a pool-type structure would be feasible. That might
>> fix almost all of the slowness.
>
> You need a per-cpu freelist and some mechanism to steal from other
> freelists. Ideally that also includes something to optimize for NUMA.

I thought we were using a pool allocator that had per-cpu freelists, derived from Solaris and https://www.usenix.org/legacy/event/usenix01/bonwick.html but maybe I am off on that.
Re: Testing memory performance
On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>
> Michael van Elst writes:
> > There is a global lock for the page freelist.
>
> I wonder if using a pool-type structure would be feasible. That might
> fix almost all of the slowness.

You need a per-cpu freelist and some mechanism to steal from other freelists. Ideally that also includes something to optimize for NUMA.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
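As a rough illustration of the idea (this is not NetBSD's actual uvm code; the types, per-list mutexes and round-robin stealing order are all invented for the sketch): each CPU allocates from its own freelist, so the common path never touches a lock another CPU holds, and only an empty local list forces a steal from a neighbour.

#include <pthread.h>
#include <stddef.h>

#define NCPU 16

struct page { struct page *next; };

/* One freelist per CPU; a per-list lock is only contended when stealing. */
struct cpu_freelist {
    pthread_mutex_t lock;
    struct page *head;
} freelist[NCPU];

void freelist_init(void)
{
    for (int i = 0; i < NCPU; i++) {
        pthread_mutex_init(&freelist[i].lock, NULL);
        freelist[i].head = NULL;
    }
}

static struct page *freelist_pop(struct cpu_freelist *fl)
{
    pthread_mutex_lock(&fl->lock);
    struct page *pg = fl->head;
    if (pg != NULL)
        fl->head = pg->next;
    pthread_mutex_unlock(&fl->lock);
    return pg;
}

/* Fast path: this CPU's own list. Slow path: steal from the others. */
struct page *page_alloc(int cpu)
{
    struct page *pg = freelist_pop(&freelist[cpu]);
    for (int i = 1; pg == NULL && i < NCPU; i++)
        pg = freelist_pop(&freelist[(cpu + i) % NCPU]);
    return pg;   /* NULL: every freelist is empty */
}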
Re: Testing memory performance
Michael van Elst writes:

>> Maybe there is a global lock in the NetBSD VM subsystem that slows
>> things down with a higher number of threads.
>
> There is a global lock for the page freelist.

I wonder if using a pool-type structure would be feasible. That might fix almost all of the slowness.
Re: Testing memory performance
On Tue, 20 Nov 2018 00:27:22 +0100 Michael van Elst wrote:
> There is a global lock for the page freelist.

OK, I've made changes to my bench tool to synchronize all threads before each stage, so threads now wait for all other threads to finish pre-faulting pages before they all start memcpy at the same time. This makes it clearer where time is lost.

I did some more tests on Solaris, Linux and NetBSD. It looks like NetBSD memcpy is actually a bit faster than Linux, but NetBSD is quite slow at servicing page faults: the latency when pre-faulting those pages is about 18 times longer on NetBSD, which results in a longer overall execution time. Anyway, this has been an interesting exercise.

Solaris 11.3, x1 UltraSPARC-T2 1415 MHz, 8 cores per CPU, 8 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16  mlock 0.00 msec, preflt 1880.88 msec, memcpy 1521.74 msec (672.91 MiB/sec)
T 14  mlock 0.00 msec, preflt 1896.63 msec, memcpy 1522.38 msec (672.63 MiB/sec)
T 10  mlock 0.00 msec, preflt 1872.01 msec, memcpy 1522.73 msec (672.48 MiB/sec)
T  2  mlock 0.00 msec, preflt 1889.55 msec, memcpy 1522.43 msec (672.61 MiB/sec)
T  8  mlock 0.00 msec, preflt 1862.79 msec, memcpy 1523.32 msec (672.22 MiB/sec)
T  6  mlock 0.00 msec, preflt 1875.76 msec, memcpy 1523.68 msec (672.06 MiB/sec)
T  5  mlock 0.00 msec, preflt 1869.91 msec, memcpy 1524.26 msec (671.80 MiB/sec)
T 12  mlock 0.00 msec, preflt 1880.11 msec, memcpy 1525.13 msec (671.42 MiB/sec)
T  4  mlock 0.00 msec, preflt 1884.96 msec, memcpy 1525.37 msec (671.31 MiB/sec)
T  1  mlock 0.00 msec, preflt 1885.92 msec, memcpy 1525.54 msec (671.24 MiB/sec)
T  9  mlock 0.00 msec, preflt 1875.25 msec, memcpy 1526.15 msec (670.97 MiB/sec)
T 13  mlock 0.00 msec, preflt 1869.48 msec, memcpy 1526.74 msec (670.71 MiB/sec)
T 15  mlock 0.00 msec, preflt 1869.14 msec, memcpy 1527.30 msec (670.46 MiB/sec)
T  7  mlock 0.00 msec, preflt 1889.29 msec, memcpy 1527.45 msec (670.40 MiB/sec)
T  3  mlock 0.00 msec, preflt 1880.53 msec, memcpy 1529.22 msec (669.62 MiB/sec)
T 11  mlock 0.00 msec, preflt 1876.53 msec, memcpy 1530.20 msec (669.19 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec  preflt 1897.69 msec  memcpy 1530.59 msec (10704.36 MiB/sec)

Linux 4.9.0, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T  5  mlock 0.00 msec, preflt 1192.80 msec, memcpy 1141.42 msec (897.13 MiB/sec)
T  7  mlock 0.00 msec, preflt 1211.61 msec, memcpy 1144.62 msec (894.62 MiB/sec)
T 16  mlock 0.00 msec, preflt 1211.59 msec, memcpy 1145.37 msec (894.04 MiB/sec)
T  3  mlock 0.00 msec, preflt 1207.33 msec, memcpy 1146.42 msec (893.21 MiB/sec)
T  2  mlock 0.00 msec, preflt 1211.02 msec, memcpy 1146.36 msec (893.26 MiB/sec)
T  1  mlock 0.00 msec, preflt 1210.36 msec, memcpy 1146.57 msec (893.10 MiB/sec)
T 13  mlock 0.00 msec, preflt 1208.53 msec, memcpy 1146.67 msec (893.02 MiB/sec)
T  9  mlock 0.00 msec, preflt 1209.00 msec, memcpy 1146.33 msec (893.28 MiB/sec)
T 15  mlock 0.00 msec, preflt 1210.63 msec, memcpy 1147.20 msec (892.61 MiB/sec)
T 14  mlock 0.00 msec, preflt 1190.98 msec, memcpy 1147.90 msec (892.06 MiB/sec)
T  4  mlock 0.00 msec, preflt 1193.98 msec, memcpy 1147.89 msec (892.07 MiB/sec)
T  6  mlock 0.00 msec, preflt 1194.16 msec, memcpy 1148.72 msec (891.43 MiB/sec)
T 12  mlock 0.00 msec, preflt 1191.37 msec, memcpy 1149.35 msec (890.94 MiB/sec)
T  8  mlock 0.00 msec, preflt 1196.99 msec, memcpy 1149.30 msec (890.98 MiB/sec)
T 10  mlock 0.00 msec, preflt 1197.32 msec, memcpy 1149.37 msec (890.92 MiB/sec)
T 11  mlock 0.00 msec, preflt 1197.75 msec, memcpy 1152.12 msec (888.79 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec  preflt 1211.96 msec  memcpy 1152.58 msec (14215.02 MiB/sec)

NetBSD-8, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core

$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16  mlock 0.00 msec, preflt 18116.24 msec, memcpy 945.99 msec (1082.46 MiB/sec)
T  9  mlock 0.00 msec, preflt 18112.29 msec, memcpy 949.79 msec (1078.13 MiB/sec)
T 10  mlock 0.00 msec, preflt 18131.93 msec, memcpy 955.33 msec (1071.88 MiB/sec)
T  8  mlock 0.00 msec, preflt 17868.22 msec, memcpy 959.28 msec (1067.46 MiB/sec)
T  4  mlock 0.00 msec, preflt 17437.47 msec, memcpy 958.71 msec (1068.11 MiB/sec)
T  6  mlock 0.00 msec, preflt 16743.15 msec, memcpy 958.53 msec (1068.31 MiB/sec)
T  3  mlock 0.00 msec, preflt 18130.67 msec, memcpy 944.33 msec (1084.36 MiB/sec)
T  2  mlock 0.00 msec, preflt 18060.20 msec, memcpy 958.34 msec (1068.51 MiB/sec)
T 11
Re: Testing memory performance
On Mon, 19 Nov 2018 22:10:41 -0500 Eric Hawicz wrote:
> The only way I can see that you'd end up with a total transfer rate
> around 5GB/s is if you didn't actually manage to get the threads
> running in parallel, but instead have perhaps 2-3 running at a time,
> then the next 2-3 don't even start until those first few finish.
>
> Eric

That is exactly what happens: other threads are blocked from running, because the NetBSD VM subsystem that allocates pages is hitting a single lock and causing contention.
Re: Testing memory performance
OK, I disabled NUMA in the BIOS. There is a slight performance hit, but NetBSD is still much slower than Linux. This time I did a single-thread test, but the disparity grows with the number of threads.

NetBSD:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1   preflt=11285.07 msec, memcpy=3056.22 MiB/sec
Total transfer rate: 3056.22 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1   preflt=7319.33 msec, memcpy=5089.21 MiB/sec
Total transfer rate: 5089.21 MiB/sec

Note that pre-faulting 16 GiB of pages (touching 1 byte at every 4 KiB page) took NetBSD around 11 seconds; Linux took 7 seconds. With 16 concurrent threads, the NetBSD pre-fault takes 18 times longer. Maybe there is a global lock in the NetBSD VM subsystem that slows things down with a higher number of threads.

So the average memcpy throughput on NetBSD drops with a higher number of threads because they can't make progress until pages are allocated, and a global lock causes contention, so they sit waiting idle. Note below how NetBSD memcpy for individual threads is faster, yet the overall throughput is almost half of Linux's, because the NetBSD VM subsystem acts like a barrier and causes those threads to stall until pages are allocated.

NetBSD:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 5   preflt=16400.12 msec, memcpy=3130.44 MiB/sec
Thread 11  preflt=16931.65 msec, memcpy=3154.73 MiB/sec
Thread 9   preflt=17169.03 msec, memcpy=2514.06 MiB/sec
Thread 4   preflt=17632.37 msec, memcpy=2928.74 MiB/sec
Thread 14  preflt=17696.83 msec, memcpy=2146.89 MiB/sec
Thread 7   preflt=17885.63 msec, memcpy=2926.97 MiB/sec
Thread 1   preflt=17918.38 msec, memcpy=1338.85 MiB/sec
Thread 10  preflt=18316.65 msec, memcpy=2082.36 MiB/sec
Thread 15  preflt=18323.43 msec, memcpy=1338.62 MiB/sec
Thread 12  preflt=18310.89 msec, memcpy=1322.38 MiB/sec
Thread 6   preflt=18363.57 msec, memcpy=1507.58 MiB/sec
Thread 16  preflt=18360.23 msec, memcpy=1909.12 MiB/sec
Thread 8   preflt=18155.39 msec, memcpy=1478.17 MiB/sec
Thread 13  preflt=18236.67 msec, memcpy=1849.76 MiB/sec
Thread 3   preflt=18303.09 msec, memcpy=2116.50 MiB/sec
Thread 2   preflt=17960.70 msec, memcpy=1325.43 MiB/sec
Total transfer rate: 6087.94 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 13  preflt=1182.27 msec, memcpy=902.88 MiB/sec
Thread 9   preflt=1183.55 msec, memcpy=903.02 MiB/sec
Thread 5   preflt=1191.65 msec, memcpy=899.32 MiB/sec
Thread 11  preflt=1186.96 msec, memcpy=897.64 MiB/sec
Thread 7   preflt=1195.46 msec, memcpy=898.71 MiB/sec
Thread 6   preflt=1207.12 msec, memcpy=904.71 MiB/sec
Thread 15  preflt=1194.18 msec, memcpy=896.05 MiB/sec
Thread 4   preflt=1216.37 msec, memcpy=909.09 MiB/sec
Thread 3   preflt=1210.41 msec, memcpy=897.77 MiB/sec
Thread 2   preflt=1210.36 msec, memcpy=896.36 MiB/sec
Thread 12  preflt=1210.59 msec, memcpy=898.79 MiB/sec
Thread 14  preflt=1209.41 msec, memcpy=898.01 MiB/sec
Thread 10  preflt=1210.00 msec, memcpy=896.88 MiB/sec
Thread 1   preflt=1216.32 msec, memcpy=899.56 MiB/sec
Thread 16  preflt=1209.18 msec, memcpy=899.34 MiB/sec
Thread 8   preflt=1231.36 msec, memcpy=910.00 MiB/sec
Total transfer rate: 13978.88 MiB/sec
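A back-of-the-envelope reading of those preflt numbers (my arithmetic, not output from the tool): touching one byte per 4 KiB page of a 16 GiB segment triggers 16 GiB / 4 KiB = 4,194,304 page faults, so NetBSD is servicing a fault in roughly 2.7 usec against Linux's 1.7 usec:

#include <stdio.h>

int main(void)
{
    /* 16 GiB segment, one fault per 4 KiB page. */
    double faults = (16.0 * 1024 * 1024 * 1024) / 4096;   /* 4,194,304 */

    /* preflt times from the single-thread runs above, in microseconds. */
    printf("NetBSD: %.2f usec/fault\n", 11285.07e3 / faults);  /* ~2.69 */
    printf("Linux:  %.2f usec/fault\n",  7319.33e3 / faults);  /* ~1.75 */
    return 0;
}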
Re: Testing memory performance
On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
>> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
>> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
>> ...
>> Total transfer rate: 5817.56 MiB/sec
>
> What? I think your measurements are a bit off here. There may be a
> problem with the speed, but if you're measuring the per-thread rate
> properly then the sum of those should equal your total transfer
> rate. Are the periods during which each thread calculates its rate
> very different from the period of the overall test?

The sum of all threads should not equal the total transfer rate, because the threads could be running at different times. So instead of all threads running in parallel you could have something like: T1 runs, pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you have, the longer it will take for all threads to complete. Have a think about it, it makes sense.
Re: Testing memory performance
On 11/19/2018 4:38 PM, Sad Clouds wrote:
> On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
>>> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
>>> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
>>> ...
>>> Total transfer rate: 5817.56 MiB/sec
>>
>> What? I think your measurements are a bit off here. There may be a
>> problem with the speed, but if you're measuring the per-thread rate
>> properly then the sum of those should equal your total transfer
>> rate. Are the periods during which each thread calculates its rate
>> very different from the period of the overall test?
>
> The sum of all threads should not equal the total transfer rate,
> because the threads could be running at different times. So instead of
> all threads running in parallel you could have something like: T1 runs,
> pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you have,
> the longer it will take for all threads to complete. Have a think about
> it, it makes sense.

Sure the threads pause, but so what? Unless you have dramatically different start and end times for all of the threads, the numbers are way off. It doesn't matter whether a thread pauses, since that pause will be within the start and end times for that thread, and thus included in the rate calculation.

Say each thread is around for 10 seconds, and in that time it transfers 25GB of data, so that's 2.5GB/s. If your overall test is also roughly 10 seconds long, then the total transfer rate must be roughly 2.5GB/s * # of threads.

The only way I can see that you'd end up with a total transfer rate around 5GB/s is if you didn't actually manage to get the threads running in parallel, but instead have perhaps 2-3 running at a time, then the next 2-3 don't even start until those first few finish.

Eric
Re: Testing memory performance
On Mon, Nov 19, 2018 at 09:25:31PM +0000, Sad Clouds wrote:
> OK, I disabled NUMA in the BIOS. There is a slight performance hit,
> but NetBSD is still much slower than Linux. This time I did a
> single-thread test, but the disparity grows with the number of threads.

You cannot disable NUMA; that's how the machine is built. You may change how memory is physically mapped (usually done by hashing address bits).

> Maybe there is a global lock in the NetBSD VM subsystem that slows
> things down with a higher number of threads.

There is a global lock for the page freelist.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
On Mon, 19 Nov 2018 01:06:45 -0000 (UTC) mlel...@serpens.de (Michael van Elst) wrote:
> munlock fails when not the whole range has been locked. Since the
> range is rounded to page boundaries, there could be some overlap.

Are you referring to a virtual or physical range of addresses? As far as I remember, all memory ranges were powers of 2 and much greater than 4 KiB. Maybe the alignment has to be on a page boundary; I'll see if it helps to change malloc to posix_memalign.

> Another effect on your system is NUMA. Linux will allocate memory
> on the CPU that requests it when possible. NetBSD has no idea about
> NUMA. On your system that can easily have a 20-30% impact on memcpy
> speed.
>
> If a thread sleeps, it is either doing a system call, or the scheduler
> doesn't allocate a CPU for it. The latter shouldn't happen in netbsd-8
> for CPU-bound user threads.
>
> But without seeing your code, it's difficult to tell what happens.

The speed difference is about 2.5 times, so way bigger than the 30% you mentioned. Also, there is a simple loop that calls memcpy, no syscalls of any kind, but for some reason the threads are idle 60% of the time. I'll run some more tests and provide more details.
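A minimal sketch of that change (buffer size illustrative, not from the actual tool): posix_memalign starts the allocation on a page boundary, so the range handed to mlock()/munlock() covers exactly whole pages, and the page-boundary rounding described above can't make the locked ranges of neighbouring allocations overlap.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 1UL << 30;    /* 1 GiB: already a multiple of the page size */
    void *seg;

    /* Page-aligned allocation instead of plain malloc(len). */
    int rc = posix_memalign(&seg, page, len);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign: %s\n", strerror(rc));
        return 1;
    }

    if (mlock(seg, len) != 0)
        perror("mlock");

    /* ... benchmark would run here ... */

    /* munlock now sees exactly the pages that were locked. */
    if (munlock(seg, len) != 0)
        perror("munlock");

    free(seg);
    return 0;
}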
Re: Testing memory performance
On Sun, 18 Nov 2018 22:50:17 +0100 Rhialto wrote:
> On Sun 18 Nov 2018 at 19:04:02 +0000, Sad Clouds wrote:
> > Linux (gcc 6.3.0):
>
> It looks to me like this fragment is not the whole function:
>
> > Dump of assembler code for function memcpy:
> > => 0x778a0e90 <+0>:  mov    %rdi,%rax
> >    0x778a0e93 <+3>:  cmp    $0x10,%rdx
> >    0x778a0e97 <+7>:  jb     0x778a0f77
>
> 0x778a0f77 isn't in the disassembly
>
> >    0x778a0e9d <+13>: cmp    $0x20,%rdx
> >    0x778a0ea1 <+17>: ja     0x778a0fc6
>
> 0x778a0fc6 neither.
>
> >    0x778a0ea7 <+23>: movups (%rsi),%xmm0
> >    0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
> >    0x778a0eaf <+31>: movups %xmm0,(%rdi)
> >    0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
> >    0x778a0eb7 <+39>: retq
> > End of assembler dump.

That's what GDB printed out, not sure why some parts may be missing.
Re: Testing memory performance
cryintotheblue...@gmail.com (Sad Clouds) writes:

> Looked at disassembly of memcpy() and NetBSD version looks way more
> complicated. I don't know anything about x86 assembly, but maybe the
> clue is somewhere here:

The Linux code shown is incomplete. But that can't be relevant to your problem.

munlock fails when not the whole range has been locked. Since the range is rounded to page boundaries, there could be some overlap.

The memcpy speed is obviously influenced by the caches. Multiple threads can easily cause thrashing, and the memory allocator may make a difference.

Another effect on your system is NUMA. Linux will allocate memory on the CPU that requests it when possible. NetBSD has no idea about NUMA. On your system that can easily have a 20-30% impact on memcpy speed.

If a thread sleeps, it is either doing a system call, or the scheduler doesn't allocate a CPU for it. The latter shouldn't happen in netbsd-8 for CPU-bound user threads.

But without seeing your code, it's difficult to tell what happens.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
On 11/18/2018 7:00 AM, Sad Clouds wrote:
> I'm developing a small tool that tests memory performance/throughput
> across different environments. I'm noticing performance issues on
> NetBSD-8; below are the details.
> ...
> NetBSD and Linux have different versions of GCC, but I was hoping the
> following flags would keep optimization differences to a minimum:

If you want to rule that out, you could always build the same version of gcc on both. Or even run the linux binary (and libs) on NetBSD.

> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> ...
> Total transfer rate: 5817.56 MiB/sec

What? I think your measurements are a bit off here. There may be a problem with the speed, but if you're measuring the per-thread rate properly then the sum of those should equal your total transfer rate. Are the periods during which each thread calculates its rate very different from the period of the overall test?

Also, your subsequent email about memcpy disassembly does not list the full code for the linux version (the jumps at the start refer to instruction addresses that you don't include), so you can't really compare them. I expect that both implementations have a variety of code blocks to handle different alignments, different supported instructions, etc.

Eric
Re: Testing memory performance
On Sun 18 Nov 2018 at 19:04:02 +0000, Sad Clouds wrote:
> Linux (gcc 6.3.0):

It looks to me like this fragment is not the whole function:

> Dump of assembler code for function memcpy:
> => 0x778a0e90 <+0>:  mov    %rdi,%rax
>    0x778a0e93 <+3>:  cmp    $0x10,%rdx
>    0x778a0e97 <+7>:  jb     0x778a0f77

0x778a0f77 isn't in the disassembly

>    0x778a0e9d <+13>: cmp    $0x20,%rdx
>    0x778a0ea1 <+17>: ja     0x778a0fc6

0x778a0fc6 neither.

>    0x778a0ea7 <+23>: movups (%rsi),%xmm0
>    0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
>    0x778a0eaf <+31>: movups %xmm0,(%rdi)
>    0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
>    0x778a0eb7 <+39>: retq
> End of assembler dump.

It looks like both functions check for some initial conditions to see which optimized loop they can use, but they use very different optimizations.

-Olaf.
--
___ Olaf 'Rhialto' Seibert -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl      -- if you're unable...to Speak." - Agent Elrond
Re: Testing memory performance
Looked at the disassembly of memcpy(), and the NetBSD version looks way more complicated. I don't know anything about x86 assembly, but maybe the clue is somewhere here:

NetBSD (gcc 5.5.0):

Dump of assembler code for function memcpy:
=> 0x7f7e5940b980 <+0>:  mov    %rdx,%rcx
   0x7f7e5940b983 <+3>:  mov    %rdi,%rax
   0x7f7e5940b986 <+6>:  mov    %rdi,%r11
   0x7f7e5940b989 <+9>:  shr    $0x3,%rcx
   0x7f7e5940b98d <+13>: je     0x7f7e5940b9cc
   0x7f7e5940b98f <+15>: lea    -0x8(%rdi,%rdx,1),%r9
   0x7f7e5940b994 <+20>: mov    -0x8(%rsi,%rdx,1),%r10
   0x7f7e5940b999 <+25>: and    $0x7,%r11
   0x7f7e5940b99d <+29>: jne    0x7f7e5940b9a6
   0x7f7e5940b99f <+31>: rep movsq %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9a2 <+34>: mov    %r10,(%r9)
   0x7f7e5940b9a5 <+37>: retq
   0x7f7e5940b9a6 <+38>: lea    -0x9(%r11,%rdx,1),%rcx
   0x7f7e5940b9ab <+43>: neg    %r11
   0x7f7e5940b9ae <+46>: mov    (%rsi),%rdx
   0x7f7e5940b9b1 <+49>: mov    %rdi,%r8
   0x7f7e5940b9b4 <+52>: lea    0x8(%rsi,%r11,1),%rsi
   0x7f7e5940b9b9 <+57>: lea    0x8(%rdi,%r11,1),%rdi
   0x7f7e5940b9be <+62>: shr    $0x3,%rcx
   0x7f7e5940b9c2 <+66>: rep movsq %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9c5 <+69>: mov    %rdx,(%r8)
   0x7f7e5940b9c8 <+72>: mov    %r10,(%r9)
   0x7f7e5940b9cb <+75>: retq
   0x7f7e5940b9cc <+76>: mov    %rdx,%rcx
   0x7f7e5940b9cf <+79>: rep movsb %ds:(%rsi),%es:(%rdi)
   0x7f7e5940b9d1 <+81>: retq
End of assembler dump.

Linux (gcc 6.3.0):

Dump of assembler code for function memcpy:
=> 0x778a0e90 <+0>:  mov    %rdi,%rax
   0x778a0e93 <+3>:  cmp    $0x10,%rdx
   0x778a0e97 <+7>:  jb     0x778a0f77
   0x778a0e9d <+13>: cmp    $0x20,%rdx
   0x778a0ea1 <+17>: ja     0x778a0fc6
   0x778a0ea7 <+23>: movups (%rsi),%xmm0
   0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
   0x778a0eaf <+31>: movups %xmm0,(%rdi)
   0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
   0x778a0eb7 <+39>: retq
End of assembler dump.
Testing memory performance
I'm developing a small tool that tests memory performance/throughput across different environments. I'm noticing performance issues on NetBSD-8; below are the details.

The tool creates a number of concurrent threads. Each thread allocates a 1 GiB memory segment and a 1 KiB transfer block. It pre-faults every page by writing a single byte at every 4 KiB offset. It then calls memcpy() in a loop, copying the 1 KiB block until the 1 GiB memory segment is filled.

NetBSD and Linux have different versions of GCC, but I was hoping the following flags would keep optimization differences to a minimum:

gcc -O1 -fno-builtin -march=westmere -Wall -pedantic -std=c11 \
    -D_FILE_OFFSET_BITS=64 -D_XOPEN_SOURCE=700 -D_DEFAULT_SOURCE

The hardware has 48 GiB of RAM; for this test I'm using 16 threads x 1 GiB = 16 GiB total. I'm seeing several issues on NetBSD:

1. When each thread calls mlock() to lock its pages, munlock() sometimes fails with ENOMEM when unlocking them. It doesn't happen every time, but frequently enough, and I don't know specifically why munlock() fails. The same code works correctly on Linux.

2. Performance with 16 concurrent threads is rather bad. Most threads are idle 60% of the time (on Linux they are 100% busy), which suggests some sort of contention somewhere (see the sketch at the end of this message for one way to measure this). On NetBSD the average throughput with 16 threads is around 5.8 GiB/sec; on Linux it is around 15.3 GiB/sec.

3. This issue affects both NetBSD and Linux. When using mlock() to lock memory pages before issuing memcpy(), overall throughput drops significantly. Threads seem to be serialized: while a few threads are running, others are blocked for some reason. I don't know why mlock() has this effect.

If anyone has any thoughts on this, please let me know. Below are details of the SMP architecture and test results.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Stepping:              2
CPU MHz:               1596.000
CPU max MHz:           2395.0000
CPU min MHz:           1596.0000
BogoMIPS:              4787.71
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 2   preflt=13504.86 msec, memcpy=2874.69 MiB/sec
Thread 7   preflt=14277.53 msec, memcpy=2891.39 MiB/sec
Thread 3   preflt=14765.99 msec, memcpy=2553.72 MiB/sec
Thread 14  preflt=15036.90 msec, memcpy=2288.19 MiB/sec
Thread 1   preflt=15126.01 msec, memcpy=2315.53 MiB/sec
Thread 12  preflt=15333.82 msec, memcpy=2071.52 MiB/sec
Thread 5   preflt=15603.25 msec, memcpy=1880.64 MiB/sec
Thread 6   preflt=15704.05 msec, memcpy=1662.66 MiB/sec
Thread 10  preflt=15693.48 msec, memcpy=1642.44 MiB/sec
Thread 4   preflt=15571.64 msec, memcpy=1557.73 MiB/sec
Thread 15  preflt=15574.60 msec, memcpy=1571.76 MiB/sec
Thread 9   preflt=15750.08 msec, memcpy=2170.44 MiB/sec
Thread 13  preflt=15588.69 msec, memcpy=1900.24 MiB/sec
Thread 8   preflt=15587.50 msec, memcpy=2043.66 MiB/sec
Thread 16  preflt=15265.48 msec, memcpy=1884.74 MiB/sec
Thread 11  preflt=15294.87 msec, memcpy=2272.75 MiB/sec
Total transfer rate: 5817.56 MiB/sec

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:
Thread 2   preflt=5.27 msec, memcpy=2595.67 MiB/sec
Thread 3   preflt=5.37 msec, memcpy=2550.90 MiB/sec
Thread 16  preflt=5.02 msec, memcpy=2770.11 MiB/sec
Thread 4   preflt=4.12 msec, memcpy=3209.06 MiB/sec
Thread 15  preflt=5.31 msec, memcpy=2496.82 MiB/sec
Thread 13  preflt=7.46 msec, memcpy=3083.72 MiB/sec
Thread 5   preflt=5.49 msec, memcpy=2766.81 MiB/sec
Thread 14  preflt=6.94 msec, memcpy=2574.98 MiB/sec
Thread 8   preflt=6.53 msec, memcpy=2201.47 MiB/sec
Thread 12  preflt=4.90 msec, memcpy=2814.79 MiB/sec
Thread 10  preflt=4.41 msec, memcpy=2615.27 MiB/sec
Thread 6   preflt=6.18 msec, memcpy=2844.57 MiB/sec
Thread 9   preflt=5.38 msec, memcpy=2976.05 MiB/sec
Thread 7   preflt=4.81 msec, memcpy=2828.54 MiB/sec
Thread 11  preflt=5.10 msec, memcpy=2778.69 MiB/sec
Thread 1   preflt=3.84 msec, memcpy=3229.88 MiB/sec
Total transfer rate: 3789.33 MiB/sec

Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 5   preflt=1122.06 msec, memcpy=990.24 MiB/sec
Thread 2   preflt=1137.94 msec, memcpy=990.41 MiB/sec
Thread 15  preflt=1125.65 msec, memcpy=982.23 MiB/sec
Thread 4   preflt=1130.02 msec, memcpy=981.37 MiB/sec
Thread 9   preflt=1130.47 msec, memcpy=982.23 MiB/sec
Thread 13  preflt=1127.70 msec, memcpy=982.00 MiB/sec
Thread 3   preflt=
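One way to put a number on "threads are idle 60% of the time" from inside the tool itself (a sketch, not part of sv_mem as posted; it assumes CLOCK_THREAD_CPUTIME_ID is available on the target OS): compare a thread's accumulated CPU time with the wall-clock time of the same interval. A thread sleeping on a contended kernel lock accrues wall time but almost no CPU time, so the ratio approximates how busy it was.

#include <stdio.h>
#include <string.h>
#include <time.h>

static double ts_diff(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Fraction of wall-clock time the calling thread spent on a CPU while
 * running work(): ~1.0 means fully busy, ~0.4 matches "idle 60%". */
static double busy_fraction(void (*work)(void))
{
    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
    work();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);
    return ts_diff(c0, c1) / ts_diff(w0, w1);
}

static char buf[1 << 20];

static void work(void)
{
    /* Stand-in workload; the real tool would wrap its memcpy stage. */
    for (int i = 0; i < 1000; i++)
        memset(buf, i, sizeof buf);
}

int main(void)
{
    printf("busy fraction: %.2f\n", busy_fraction(work));
    return 0;
}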