Re: Testing memory performance
On Wed, Nov 21, 2018 at 4:18 AM Eric Hawicz wrote:
> That still sounds to me like the test is a bit off. If you've already
> recorded the start time of each thread, then the time that the threads
> are blocked from running would be included in the per-thread rate, thus
> causing it to appear much slower.

No, because start/end times are taken for specific operations like
pre-faulting or memcpy. That doesn't tell you what a thread is doing in
relation to other threads. A thread could be blocked for some time and
only then be scheduled to run, at which point its start time is taken;
how would that latency be accounted for if it occurred before the start
time was taken?

Think of it as a simple example. Say the memory bus has a maximum
bandwidth of 10 GiB/sec and you have two threads A and B, each doing a
memcpy of 10 GiB.

Scenario 1 - both threads run in parallel and share memory bus bandwidth:

--> time in seconds
AA    thread runs for 2 seconds and does memcpy at 5 GiB/sec
BB    thread runs for 2 seconds and does memcpy at 5 GiB/sec

Aggregate throughput = (2 threads * 10 GiB) / 2 seconds = 10 GiB/sec

Scenario 2 - each thread runs in sequence and uses full memory bus bandwidth:

--> time in seconds
A     thread runs for 1 second and does memcpy at 10 GiB/sec
 L    lock contention causes latency of 1 second
  B   thread runs for 1 second and does memcpy at 10 GiB/sec

Aggregate throughput = (2 threads * 10 GiB) / 3 seconds = 6.6 GiB/sec
Re: Testing memory performance
On Tue, Nov 20, 2018 at 11:44:47AM -0500, Greg Troxel wrote:
> I thought we were using a pool allocator that had per-cpu freelists,
> derived from Solaris and
> https://www.usenix.org/legacy/event/usenix01/bonwick.html

We are talking about a lower-level free list. Even if you could reuse
the pool allocator code at that level, it wouldn't be sufficient. But
yes, the methods used by the pool allocator need to be applied here too.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
Michael van Elst writes:

> On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
>> Michael van Elst writes:
>> > There is a global lock for the page freelist.
>>
>> I wonder if using a pool-type structure would be feasible. That might
>> fix almost all of the slowness.
>
> You need a per-cpu freelist and some mechanism to steal from other
> freelists. Ideally that also includes something to optimize for NUMA.

I thought we were using a pool allocator that had per-cpu freelists,
derived from Solaris and
https://www.usenix.org/legacy/event/usenix01/bonwick.html
but maybe I am off on that.
Re: Testing memory performance
On Tue, Nov 20, 2018 at 10:50:13AM -0500, Greg Troxel wrote:
> Michael van Elst writes:
> > There is a global lock for the page freelist.
>
> I wonder if using a pool-type structure would be feasible. That might
> fix almost all of the slowness.

You need a per-cpu freelist and some mechanism to steal from other
freelists. Ideally that also includes something to optimize for NUMA.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
Michael van Elst writes:

>> Maybe there is a global lock in NetBSD VM subsystem that slows things
>> down with higher number of threads.
>
> There is a global lock for the page freelist.

I wonder if using a pool-type structure would be feasible. That might
fix almost all of the slowness.
Re: Testing memory performance
On Tue, 20 Nov 2018 00:27:22 +0100 Michael van Elst wrote:
> There is a global lock for the page freelist.

OK, I've made changes to my bench tool to synchronize all threads
before each stage. Threads now wait for all other threads to finish
pre-faulting pages before they all start memcpy at the same time. This
makes it clearer where time is lost.

I did some more tests on Solaris, Linux and NetBSD. It looks like
NetBSD memcpy is actually a bit faster than Linux, but NetBSD is quite
slow at servicing page faults. The latency when pre-faulting those
pages is about 18 times longer on NetBSD, which results in longer
overall execution time. Anyway, this has been an interesting exercise.

Solaris 11.3, x1 UltraSPARC-T2 1415 MHz, 8 cores per CPU, 8 hw threads per core
$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16 mlock 0.00 msec, preflt 1880.88 msec, memcpy 1521.74 msec (672.91 MiB/sec)
T 14 mlock 0.00 msec, preflt 1896.63 msec, memcpy 1522.38 msec (672.63 MiB/sec)
T 10 mlock 0.00 msec, preflt 1872.01 msec, memcpy 1522.73 msec (672.48 MiB/sec)
T 2 mlock 0.00 msec, preflt 1889.55 msec, memcpy 1522.43 msec (672.61 MiB/sec)
T 8 mlock 0.00 msec, preflt 1862.79 msec, memcpy 1523.32 msec (672.22 MiB/sec)
T 6 mlock 0.00 msec, preflt 1875.76 msec, memcpy 1523.68 msec (672.06 MiB/sec)
T 5 mlock 0.00 msec, preflt 1869.91 msec, memcpy 1524.26 msec (671.80 MiB/sec)
T 12 mlock 0.00 msec, preflt 1880.11 msec, memcpy 1525.13 msec (671.42 MiB/sec)
T 4 mlock 0.00 msec, preflt 1884.96 msec, memcpy 1525.37 msec (671.31 MiB/sec)
T 1 mlock 0.00 msec, preflt 1885.92 msec, memcpy 1525.54 msec (671.24 MiB/sec)
T 9 mlock 0.00 msec, preflt 1875.25 msec, memcpy 1526.15 msec (670.97 MiB/sec)
T 13 mlock 0.00 msec, preflt 1869.48 msec, memcpy 1526.74 msec (670.71 MiB/sec)
T 15 mlock 0.00 msec, preflt 1869.14 msec, memcpy 1527.30 msec (670.46 MiB/sec)
T 7 mlock 0.00 msec, preflt 1889.29 msec, memcpy 1527.45 msec (670.40 MiB/sec)
T 3 mlock 0.00 msec, preflt 1880.53 msec, memcpy 1529.22 msec (669.62 MiB/sec)
T 11 mlock 0.00 msec, preflt 1876.53 msec, memcpy 1530.20 msec (669.19 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec
preflt 1897.69 msec
memcpy 1530.59 msec (10704.36 MiB/sec)

Linux 4.9.0, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core
$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 5 mlock 0.00 msec, preflt 1192.80 msec, memcpy 1141.42 msec (897.13 MiB/sec)
T 7 mlock 0.00 msec, preflt 1211.61 msec, memcpy 1144.62 msec (894.62 MiB/sec)
T 16 mlock 0.00 msec, preflt 1211.59 msec, memcpy 1145.37 msec (894.04 MiB/sec)
T 3 mlock 0.00 msec, preflt 1207.33 msec, memcpy 1146.42 msec (893.21 MiB/sec)
T 2 mlock 0.00 msec, preflt 1211.02 msec, memcpy 1146.36 msec (893.26 MiB/sec)
T 1 mlock 0.00 msec, preflt 1210.36 msec, memcpy 1146.57 msec (893.10 MiB/sec)
T 13 mlock 0.00 msec, preflt 1208.53 msec, memcpy 1146.67 msec (893.02 MiB/sec)
T 9 mlock 0.00 msec, preflt 1209.00 msec, memcpy 1146.33 msec (893.28 MiB/sec)
T 15 mlock 0.00 msec, preflt 1210.63 msec, memcpy 1147.20 msec (892.61 MiB/sec)
T 14 mlock 0.00 msec, preflt 1190.98 msec, memcpy 1147.90 msec (892.06 MiB/sec)
T 4 mlock 0.00 msec, preflt 1193.98 msec, memcpy 1147.89 msec (892.07 MiB/sec)
T 6 mlock 0.00 msec, preflt 1194.16 msec, memcpy 1148.72 msec (891.43 MiB/sec)
T 12 mlock 0.00 msec, preflt 1191.37 msec, memcpy 1149.35 msec (890.94 MiB/sec)
T 8 mlock 0.00 msec, preflt 1196.99 msec, memcpy 1149.30 msec (890.98 MiB/sec)
T 10 mlock 0.00 msec, preflt 1197.32 msec, memcpy 1149.37 msec (890.92 MiB/sec)
T 11 mlock 0.00 msec, preflt 1197.75 msec, memcpy 1152.12 msec (888.79 MiB/sec)
Aggregate metrics, 16 threads, 16384.00 MiB:
mlock 0.00 msec
preflt 1211.96 msec
memcpy 1152.58 msec (14215.02 MiB/sec)

NetBSD-8, x2 Intel Xeon E5620 2395 MHz, 4 cores per CPU, 2 hw threads per core
$ ./sv_mem -mode=wr -size=1g -block=1K -threads=16
Per-thread metrics:
T 16 mlock 0.00 msec, preflt 18116.24 msec, memcpy 945.99 msec (1082.46 MiB/sec)
T 9 mlock 0.00 msec, preflt 18112.29 msec, memcpy 949.79 msec (1078.13 MiB/sec)
T 10 mlock 0.00 msec, preflt 18131.93 msec, memcpy 955.33 msec (1071.88 MiB/sec)
T 8 mlock 0.00 msec, preflt 17868.22 msec, memcpy 959.28 msec (1067.46 MiB/sec)
T 4 mlock 0.00 msec, preflt 17437.47 msec, memcpy 958.71 msec (1068.11 MiB/sec)
T 6 mlock 0.00 msec, preflt 16743.15 msec, memcpy 958.53 msec (1068.31 MiB/sec)
T 3 mlock 0.00 msec, preflt 18130.67 msec, memcpy 944.33 msec (1084.36 MiB/sec)
T 2 mlock 0.00 msec, preflt 18060.20 msec, memcpy 958.34 msec (1068.51 MiB/sec)
T 11
Re: Testing memory performance
On Mon, 19 Nov 2018 22:10:41 -0500 Eric Hawicz wrote:
> The only way I can see that you'd end up with a total transfer rate
> around 5GB/s is if you didn't actually manage to get the threads
> running in parallel, but instead have perhaps 2-3 running at a time,
> then the next 2-3 don't even start until those first few finish.
>
> Eric

That is exactly what happens: other threads are blocked from running,
because the NetBSD VM subsystem that allocates pages is hitting a
single lock and causing contention.
Re: Testing memory performance
OK, I disabled NUMA in the BIOS. There is a slight performance hit, but
NetBSD is still much slower than Linux. This time I did a single-thread
test, but the disparity grows with the number of threads.

NetBSD:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=11285.07 msec, memcpy=3056.22 MiB/sec
Total transfer rate: 3056.22 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=7319.33 msec, memcpy=5089.21 MiB/sec
Total transfer rate: 5089.21 MiB/sec

Note that pre-faulting (touching 1 byte at every 4 KiB page) 16 GiB of
pages took NetBSD around 11 seconds; Linux took 7 seconds. With 16
concurrent threads, the NetBSD pre-fault is 18 times longer. Maybe
there is a global lock in the NetBSD VM subsystem that slows things
down with a higher number of threads.

So the average memcpy throughput is slower on NetBSD with a higher
number of threads because threads can't make progress until pages are
allocated, and a global lock causes contention, so they sit waiting
idle. Note below how NetBSD memcpy for individual threads is faster,
but the overall throughput is almost half of Linux, because the NetBSD
VM subsystem acts like a barrier and causes those threads to stall
until pages are allocated.
NetBSD:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 5  preflt=16400.12 msec, memcpy=3130.44 MiB/sec
Thread 11 preflt=16931.65 msec, memcpy=3154.73 MiB/sec
Thread 9  preflt=17169.03 msec, memcpy=2514.06 MiB/sec
Thread 4  preflt=17632.37 msec, memcpy=2928.74 MiB/sec
Thread 14 preflt=17696.83 msec, memcpy=2146.89 MiB/sec
Thread 7  preflt=17885.63 msec, memcpy=2926.97 MiB/sec
Thread 1  preflt=17918.38 msec, memcpy=1338.85 MiB/sec
Thread 10 preflt=18316.65 msec, memcpy=2082.36 MiB/sec
Thread 15 preflt=18323.43 msec, memcpy=1338.62 MiB/sec
Thread 12 preflt=18310.89 msec, memcpy=1322.38 MiB/sec
Thread 6  preflt=18363.57 msec, memcpy=1507.58 MiB/sec
Thread 16 preflt=18360.23 msec, memcpy=1909.12 MiB/sec
Thread 8  preflt=18155.39 msec, memcpy=1478.17 MiB/sec
Thread 13 preflt=18236.67 msec, memcpy=1849.76 MiB/sec
Thread 3  preflt=18303.09 msec, memcpy=2116.50 MiB/sec
Thread 2  preflt=17960.70 msec, memcpy=1325.43 MiB/sec
Total transfer rate: 6087.94 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 13 preflt=1182.27 msec, memcpy=902.88 MiB/sec
Thread 9  preflt=1183.55 msec, memcpy=903.02 MiB/sec
Thread 5  preflt=1191.65 msec, memcpy=899.32 MiB/sec
Thread 11 preflt=1186.96 msec, memcpy=897.64 MiB/sec
Thread 7  preflt=1195.46 msec, memcpy=898.71 MiB/sec
Thread 6  preflt=1207.12 msec, memcpy=904.71 MiB/sec
Thread 15 preflt=1194.18 msec, memcpy=896.05 MiB/sec
Thread 4  preflt=1216.37 msec, memcpy=909.09 MiB/sec
Thread 3  preflt=1210.41 msec, memcpy=897.77 MiB/sec
Thread 2  preflt=1210.36 msec, memcpy=896.36 MiB/sec
Thread 12 preflt=1210.59 msec, memcpy=898.79 MiB/sec
Thread 14 preflt=1209.41 msec, memcpy=898.01 MiB/sec
Thread 10 preflt=1210.00 msec, memcpy=896.88 MiB/sec
Thread 1  preflt=1216.32 msec, memcpy=899.56 MiB/sec
Thread 16 preflt=1209.18 msec, memcpy=899.34 MiB/sec
Thread 8  preflt=1231.36 msec, memcpy=910.00 MiB/sec
Total transfer rate: 13978.88 MiB/sec
Re: Testing memory performance
On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
> > NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> > Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> > ...
> > Total transfer rate: 5817.56 MiB/sec
>
> What? I think your measurements are a bit off here. There may be a
> problem with the speed, but if you're measuring the per-thread rate
> properly then the sum of those should equal your total transfer
> rate. Are the periods during which each thread calculates its rate
> very different from the period of the overall test?

The sum of all threads should not equal the total transfer rate,
because the threads could be running at different times. So instead of
all threads running in parallel you could have something like: T1 runs,
pause, T2 runs, pause, T3 runs, pause, etc. The more pauses you have,
the longer it will take for all threads to complete. Have a think about
it, it makes sense.
Re: Testing memory performance
On 11/19/2018 4:38 PM, Sad Clouds wrote:
> On Sun, 18 Nov 2018 16:30:32 -0500 Eric Hawicz wrote:
>> > NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
>> > Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
>> > ...
>> > Total transfer rate: 5817.56 MiB/sec
>>
>> What? I think your measurements are a bit off here. There may be a
>> problem with the speed, but if you're measuring the per-thread rate
>> properly then the sum of those should equal your total transfer
>> rate. Are the periods during which each thread calculates its rate
>> very different from the period of the overall test?
>
> The sum of all threads should not equal total transfer rate, because
> threads could be running at different times. So instead of all threads
> running in parallel you could have something like - T1 runs, pause,
> T2 runs, pause, T3 runs, pause, etc, the more pauses you have the
> longer it will take for all threads to complete. Have a think about
> it, it makes sense.

Sure the threads pause, but so what? Unless you have dramatically
different start and end times for all of the threads, the numbers are
way off. It doesn't matter whether a thread pauses, since that pause
will be within the start & end times for that thread, and thus
included in the rate calculation.

Say each thread is around for 10 seconds, and in that time it
transfers 25GB of data, so that's 2.5GB/s. If your overall test is
also roughly 10 seconds long, then the total transfer rate must be
roughly 2.5GB/s * # of threads.

The only way I can see that you'd end up with a total transfer rate
around 5GB/s is if you didn't actually manage to get the threads
running in parallel, but instead have perhaps 2-3 running at a time,
then the next 2-3 don't even start until those first few finish.

Eric
Re: Testing memory performance
On Mon, Nov 19, 2018 at 09:25:31PM +, Sad Clouds wrote:
> OK I disabled NUMA in BIOS, there is a slight performance hit, but
> NetBSD is still much slower than Linux. This time I did single thread
> test, but disparity grows with number of threads.

You cannot disable NUMA, that's how the machine is built. You may
change how memory is physically mapped (usually done by hashing
address bits).

> Maybe there is a global lock in NetBSD VM subsystem that slows things
> down with higher number of threads.

There is a global lock for the page freelist.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Testing memory performance
On 11/18/2018 7:00 AM, Sad Clouds wrote:
> I'm developing a small tool that tests memory performance/throughput
> across different environments. I'm noticing performance issues on
> NetBSD-8, below are the details:
> ...
> NetBSD and Linux have different versions of GCC, but I was hoping the
> following flags would keep optimization differences to a minimum:

If you want to rule that out, you could always build the same version
of gcc on both. Or even run the linux binary (and libs) on NetBSD.

> NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
> Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
> ...
> Total transfer rate: 5817.56 MiB/sec

What? I think your measurements are a bit off here. There may be a
problem with the speed, but if you're measuring the per-thread rate
properly then the sum of those should equal your total transfer rate.
Are the periods during which each thread calculates its rate very
different from the period of the overall test?

Also, your subsequent email about memcpy disassembly does not list the
full code for the linux version (the jumps at the start refer to
instruction addresses that you don't include), so you can't really
compare them. I expect that both implementations have a variety of
code blocks to handle different alignments, different supported
instructions, etc.

Eric
Re: Testing memory performance
On Sun 18 Nov 2018 at 19:04:02 +, Sad Clouds wrote:
> Linux (gcc 6.3.0):

It looks to me like this fragment is not the whole function:

> Dump of assembler code for function memcpy:
> => 0x778a0e90 <+0>:  mov    %rdi,%rax
>    0x778a0e93 <+3>:  cmp    $0x10,%rdx
>    0x778a0e97 <+7>:  jb     0x778a0f77

0x778a0f77 isn't in the disassembly

>    0x778a0e9d <+13>: cmp    $0x20,%rdx
>    0x778a0ea1 <+17>: ja     0x778a0fc6

0x778a0fc6 neither.

>    0x778a0ea7 <+23>: movups (%rsi),%xmm0
>    0x778a0eaa <+26>: movups -0x10(%rsi,%rdx,1),%xmm1
>    0x778a0eaf <+31>: movups %xmm0,(%rdi)
>    0x778a0eb2 <+34>: movups %xmm1,-0x10(%rdi,%rdx,1)
>    0x778a0eb7 <+39>: retq
> End of assembler dump.

It looks like both functions check for some initial conditions to see
which optimized loop they can use, but they use very different
optimizations.

-Olaf.
--
___ Olaf 'Rhialto' Seibert -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl     -- if you're unable...to Speak." - Agent Elrond