On Sun, Jun 26 2022, Jeremie Courreges-Anglas <[email protected]> wrote:
> I noticed that our MI __mp_lock (kernel lock, sched lock)
> implementation is based on a ticket lock without any backoff.
> The downside of this algorithm is that it increases bus traffic,
> because all the actors are writing (lock releaser) and reading (lock
> waiters) the same memory region from different cpus.  Effectively
> reducing that contention with a proportional backoff doesn't look
> easy, since we have to target the minimum backoff time even though we
> don't know how long the lock will actually be held.
>
> Some algorithms (MCS, CLH, M-lock) avoid the sharing by building a queue
> and letting waiters spin on a mostly private memory region.  I initially
> tried the MCS lock and it gave good results on amd64 and riscv64, until
> it broke on arm64.  Which kinda confirms what I read in various papers:
> it looks simple but it's hard to get right.
>
> The M-lock implementation below is easy to understand and seems to give
> better (or similar) results than both ticket lock and MCS on amd64,
> arm64 and riscv64 machines.  False sharing is partly avoided at the
> expense of growing the data structure, which should be fine since the
> __mp_lock is only used for the kernel and sched lock.
>
> I'm showing this off right now to see if there is interest.  Benchmark
> results would help confirm that.  I don't own big machines where it
> would really shine.
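
For readers who want the quoted description in concrete form, here is a
minimal sketch of a ticket lock without backoff.  It is not the kernel's
__mp_lock code: the struct and function names are hypothetical, and it
uses C11 atomics instead of the kernel's own primitives.  Every waiter
polls the same `owner` word that the releaser writes, which is the
shared traffic the queue-based locks try to avoid.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical illustration only; not struct __mp_lock. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void
ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static void
ticket_lock_enter(struct ticket_lock *l)
{
	/* Take a ticket; tickets are served in FIFO order. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
	    memory_order_relaxed);

	/* All waiters spin on the one word the releaser writes. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void
ticket_lock_leave(struct ticket_lock *l)
{
	unsigned int cur = atomic_load_explicit(&l->owner,
	    memory_order_relaxed);

	/* The lock must be held: at least one ticket is outstanding. */
	assert(cur != atomic_load_explicit(&l->next, memory_order_relaxed));

	/* Hand the lock to the next ticket holder. */
	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

A queue-based lock differs mainly in the spin loop: each waiter polls a
location that, ideally, only its predecessor writes.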

My test case for contention is building libc with max number of jobs:

  while sleep 1; do doas -u build make clean >/dev/null 2>&1; time doas -u build make -j4 >/dev/null 2>&1; done

Here are some benchmarks, with help from tb@.  Unfortunately I don't
have the numbers for my riscv64 Unmatched (currently offline; the result
was a slight improvement).  robert@ showed some rather impressive results
on a machine of his, but they consisted of only one timing before and
one after.  Obviously such timings should be done on a mostly idle machine.



4 cores amd64 vm:
cpu0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 783.36 MHz, 06-5e-03
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,UMIP,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu0: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: smt 0, core 0, package 0

-current:

    2m01.57s real     2m41.80s user     4m27.86s system
    2m00.40s real     2m41.01s user     4m27.05s system
    2m00.67s real     2m41.05s user     4m27.61s system

m-lock:

    1m59.73s real     2m40.26s user     4m23.76s system
    2m00.94s real     2m42.52s user     4m24.46s system
    1m59.97s real     2m41.88s user     4m22.46s system


16 cores arm64 m1:
mainbus0 at root: Apple Mac mini (M1, 2020)
cpu0 at mainbus0 mpidr 0: Apple Icestorm r1p1
cpu0: 128KB 64b/line 8-way L1 VIPT I-cache, 64KB 64b/line 8-way L1 D-cache
cpu0: 4096KB 128b/line 16-way L2 cache
cpu0: TLBIOS+IRANGE,TS+AXFLAG,FHM,DP,SHA3,RDM,Atomic,CRC32,SHA2+SHA512,SHA1,AES+PMULL,SPECRES,SB,FRINTTS,GPI,LRCPC+LDAPUR,FCMA,JSCVT,API+PAC,DPB,SpecSEI,PAN+ATS1E1,LO,HPDS,CSV3,CSV2

-current:

    1m55.93s real     2m05.46s user     8m27.08s system
    1m56.90s real     2m06.11s user     8m27.11s system
    1m56.17s real     2m06.42s user     8m24.11s system

m-lock1.diff:

    1m48.43s real     2m05.45s user     8m07.63s system
    1m48.65s real     2m06.27s user     8m07.15s system
    1m49.41s real     2m04.13s user     8m10.92s system


16 cores amd64 server:
bios0: Dell Inc. Precision 3640 Tower

cpu0: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz, 3691.40 MHz, 06-a5-05
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 256KB 64b/line 8-way L2 cache

-current:

    0m55.83s real     1m38.42s user     4m30.81s system
    0m55.07s real     1m39.35s user     4m27.74s system
    0m55.44s real     1m37.57s user     4m30.68s system

m-lock1.diff:

    0m53.06s real     1m39.42s user     4m12.92s system
    0m52.83s real     1m37.94s user     4m12.33s system
    0m53.03s real     1m39.21s user     4m13.67s system
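
To compare the runs at a glance, the real times above can be averaged
per machine and expressed as a relative change.  The sketch below does
only that arithmetic; the array and function names are hypothetical, and
the values are the "real" columns above converted to seconds.

```c
#include <assert.h>

/* Wall-clock ("real") seconds copied from the timings in this mail. */
static const double vm_current[3]     = { 121.57, 120.40, 120.67 };
static const double vm_mlock[3]       = { 119.73, 120.94, 119.97 };
static const double m1_current[3]     = { 115.93, 116.90, 116.17 };
static const double m1_mlock[3]       = { 108.43, 108.65, 109.41 };
static const double server_current[3] = {  55.83,  55.07,  55.44 };
static const double server_mlock[3]   = {  53.06,  52.83,  53.03 };

/* Arithmetic mean of three runs. */
static double
mean3(const double t[3])
{
	return (t[0] + t[1] + t[2]) / 3.0;
}

/* Relative change in percent; negative means the patched kernel is faster. */
static double
pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, pct_change(mean3(m1_current), mean3(m1_mlock)) gives the
change for the arm64 machine.  With these inputs the mean real time
changes by roughly -0.6% on the amd64 vm, -6.4% on the m1 and -4.5% on
the amd64 server.  With only three runs per configuration, these are
indications rather than statistically solid numbers.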

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
