On Sun, Jun 26 2022, Jeremie Courreges-Anglas <[email protected]> wrote:
> I noticed that our MI __mp_lock (kernel lock, sched lock)
> implementation is based on a ticket lock without any backoff.
> The downside of this algorithm is that it results in increased bus traffic
> because all the actors are writing (lock releaser) and reading (lock
> waiters) the same memory region from different cpus. Effectively
> reducing that contention with a proportional backoff doesn't look
> easy, since we have to target the minimum backoff time even though we
> don't know how long the lock will actually be held.
>
> Some algorithms (MCS, CLH, M-lock) avoid the sharing by building a queue
> and letting waiters spin on a mostly private memory region. I initially
> tried the MCS lock and it gave good results on amd64 and riscv64, until
> it broke on arm64. Which kinda confirms what I read in various papers:
> it looks simple but it's hard to get right.
>
> The M-lock implementation below is easy to understand and seems to give
> better (or similar) results than both ticket lock and MCS on amd64,
> arm64 and riscv64 machines. False sharing is partly avoided at the
> expense of growing the data structure, which should be fine since the
> __mp_lock is only used for the kernel and sched lock.
>
> I'm showing this off right now to see if there is interest. Benchmark
> results would help confirm that. I don't own big machines where it
> would really shine.
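
For anyone who wants the contention problem spelled out, here is a rough
sketch of a plain ticket lock without backoff. It is only an illustration
written with C11 atomics; the names (struct ticket_lock, ticket_lock_enter,
ticket_lock_leave) are made up for this mail and this is not the actual
__mp_lock code, which has more to it.

```c
#include <stdatomic.h>

/* Hypothetical illustration, not the kernel's struct __mp_lock. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void
ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static void
ticket_lock_enter(struct ticket_lock *l)
{
	/* Take a ticket; waiters are served in FIFO order. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
	    memory_order_relaxed);

	/*
	 * Every waiter polls the same word that the releaser writes,
	 * so each release disturbs the cache line on all waiting cpus.
	 */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void
ticket_lock_leave(struct ticket_lock *l)
{
	/* Only the holder writes owner, so a plain load + store is enough. */
	unsigned int cur = atomic_load_explicit(&l->owner,
	    memory_order_relaxed);

	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

The queue-based locks mentioned above aim to replace that single shared
"owner" word with a per-waiter location to spin on.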
My test case for contention is building libc with max number of jobs:

  while sleep 1; do doas -u build make clean >/dev/null 2>&1; time doas -u build make -j4 >/dev/null 2>&1; done

Here are some benchmarks, with help from tb@. Unfortunately I don't
have the numbers for my riscv64 unmatched (currently offline, result
was a slight improvement). robert@ showed some rather impressive results
on a machine of his, but they consisted of only one timing each, before and
after. Obviously such timings should be done on a mostly idle machine.
4 cores amd64 vm:
cpu0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 783.36 MHz, 06-5e-03
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,UMIP,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu0: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: smt 0, core 0, package 0
-current:
2m01.57s real 2m41.80s user 4m27.86s system
2m00.40s real 2m41.01s user 4m27.05s system
2m00.67s real 2m41.05s user 4m27.61s system
m-lock:
1m59.73s real 2m40.26s user 4m23.76s system
2m00.94s real 2m42.52s user 4m24.46s system
1m59.97s real 2m41.88s user 4m22.46s system
16 cores arm64 m1:
mainbus0 at root: Apple Mac mini (M1, 2020)
cpu0 at mainbus0 mpidr 0: Apple Icestorm r1p1
cpu0: 128KB 64b/line 8-way L1 VIPT I-cache, 64KB 64b/line 8-way L1 D-cache
cpu0: 4096KB 128b/line 16-way L2 cache
cpu0: TLBIOS+IRANGE,TS+AXFLAG,FHM,DP,SHA3,RDM,Atomic,CRC32,SHA2+SHA512,SHA1,AES+PMULL,SPECRES,SB,FRINTTS,GPI,LRCPC+LDAPUR,FCMA,JSCVT,API+PAC,DPB,SpecSEI,PAN+ATS1E1,LO,HPDS,CSV3,CSV2
-current:
1m55.93s real 2m05.46s user 8m27.08s system
1m56.90s real 2m06.11s user 8m27.11s system
1m56.17s real 2m06.42s user 8m24.11s system
m-lock1.diff:
1m48.43s real 2m05.45s user 8m07.63s system
1m48.65s real 2m06.27s user 8m07.15s system
1m49.41s real 2m04.13s user 8m10.92s system
16 cores amd64 server:
bios0: Dell Inc. Precision 3640 Tower
cpu0: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz, 3691.40 MHz, 06-a5-05
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 256KB 64b/line 8-way L2 cache
-current:
0m55.83s real 1m38.42s user 4m30.81s system
0m55.07s real 1m39.35s user 4m27.74s system
0m55.44s real 1m37.57s user 4m30.68s system
m-lock1.diff:
0m53.06s real 1m39.42s user 4m12.92s system
0m52.83s real 1m37.94s user 4m12.33s system
0m53.03s real 1m39.21s user 4m13.67s system
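
To compare the runs above at a glance, the relative change in mean "real"
time can be computed along these lines. This is just a convenience sketch:
the helper names (mean_secs, reduction_pct) are made up, the arrays are the
"real" columns above converted to seconds, and three samples per
configuration are not enough for any serious statistics.

```c
#include <stddef.h>

/* Mean of n wall-clock samples, in seconds. */
static double
mean_secs(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative reduction of the mean, in percent (positive means faster). */
static double
reduction_pct(const double *before, const double *after, size_t n)
{
	double b = mean_secs(before, n);

	return 100.0 * (b - mean_secs(after, n)) / b;
}

/* "real" times from the runs above, converted to seconds. */
static const double vm_current[] = { 121.57, 120.40, 120.67 };
static const double vm_mlock[]   = { 119.73, 120.94, 119.97 };
static const double m1_current[] = { 115.93, 116.90, 116.17 };
static const double m1_mlock[]   = { 108.43, 108.65, 109.41 };
static const double i7_current[] = { 55.83, 55.07, 55.44 };
static const double i7_mlock[]   = { 53.06, 52.83, 53.03 };
```

Calling reduction_pct(m1_current, m1_mlock, 3) and friends gives a
reduction in mean real time of roughly 0.6% on the 4-core vm, 6.4% on the
M1 and 4.5% on the i7-10700K.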
--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE