Hi,

Anyone interested in taking a look?  This solves the problem we have with
uvm_fpageqlock.  Here's the code, and the blurb is below:

        http://www.netbsd.org/~ad/2019/allocator.diff

Cheers,
Andrew


General overview

        The page allocator currently works in 4 dimensions, with freelist
        being the most important consideration, page colour the next, then
        the zeroed/unknown queue, and finally the global versus per-CPU
        list.  It has a global counter tracking the number of free pages.

        uvm -> freelist[f] -> colour[c] -> queue[q] +-> global list
                                                    \-> local CPU list

        uvm -> uvmexp.free

        It turns out that this allocator has a "lock contention problem",
        which is actually more of a false sharing problem, because
        allocating a page involves touching and writing back to many cache
        lines.

        The new allocator is reduced to working in 3 dimensions.  The
        separate lists for zeroed/unknown pages have been deleted (but the
        ability to track whether an individual page is zeroed is retained),
        the CPU free lists have been deleted, and a new concept of "bucket"
        introduced.  Again freelist is the most important consideration,
        bucket is the next, and colour the last (* unless the experimental
        NUMA stuff is enabled, but that's disabled for now).

        uvm -> freelist[f] -> bucket[b] -> colour[c]
                                 \
                                  -> pgb_nfree

        The data structures to manage this are carefully arranged so that
        there's minimal false sharing, fewer write backs need to occur, and
        empty freelists (of which there are often many) can be skipped
        quickly without having to scan everything or take locks.
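
        As a rough sketch of how those three dimensions might nest (the
        names below are illustrative rather than lifted from the diff,
        apart from pgb_nfree and PGFL_MAX_BUCKETS, which are mentioned in
        this mail):

                #include <sys/queue.h>
                #include <stdint.h>

                struct vm_page;                 /* pages defined elsewhere */
                TAILQ_HEAD(pglist, vm_page);

                #define PGFL_MAX_BUCKETS        8
                #define NCOLORS                 8       /* illustrative */

                /*
                 * One bucket: per-colour queues plus the bucket's own free
                 * page count, so an empty bucket can be skipped by reading
                 * one field, without taking any lock.
                 */
                struct pgflbucket {
                        uintptr_t       pgb_nfree;
                        struct pglist   pgb_colors[NCOLORS];
                };

                /*
                 * One free list: up to PGFL_MAX_BUCKETS buckets, sized at
                 * runtime.  The locks (uvm_freelist_locks in the lockstat
                 * output below) presumably sit in a separate, padded
                 * array, which helps keep false sharing down.
                 */
                struct pgfreelist {
                        struct pgflbucket *pgfl_buckets[PGFL_MAX_BUCKETS];
                };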

Buckets

        In both old & new allocators we have the ability to dynamically
        rearrange the free lists to support an arbitrary number of page
        colours.  Buckets work this way too, but there is a maximum number
        (PGFL_MAX_BUCKETS, currently 8).

        The number of buckets is sized at runtime according to the number of
        physical CPU packages in the system (ci->ci_package_id), and each
        logical CPU is assigned a preferred bucket to allocate from
        (ucpu->pgflbucket).
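
        A minimal sketch of that sizing step, using toy structures in place
        of the kernel's cpu_info and uvm_cpu and assuming package IDs are
        small and dense; the field names are the ones mentioned above, but
        the code in the diff will differ:

                #define PGFL_MAX_BUCKETS        8

                struct cpu_info { unsigned ci_package_id; };
                struct uvm_cpu  { unsigned pgflbucket; };

                static unsigned
                size_buckets(const struct cpu_info *ci, struct uvm_cpu *ucpu,
                    unsigned ncpu)
                {
                        unsigned nbuckets = 0, i;

                        /* One bucket per physical package, capped at the
                         * maximum the free lists can be split into. */
                        for (i = 0; i < ncpu; i++) {
                                if (ci[i].ci_package_id + 1 > nbuckets)
                                        nbuckets = ci[i].ci_package_id + 1;
                        }
                        if (nbuckets > PGFL_MAX_BUCKETS)
                                nbuckets = PGFL_MAX_BUCKETS;

                        /* Each logical CPU prefers its package's bucket. */
                        for (i = 0; i < ncpu; i++)
                                ucpu[i].pgflbucket =
                                    ci[i].ci_package_id % nbuckets;

                        return nbuckets;
                }

        On a two-package machine this yields two buckets, with the logical
        CPUs on each package preferring their own; on a single-package
        desktop it collapses to one bucket.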

        This does a few things for us (when the system is not very low on
        memory):

        - It exponentially reduces lock contention by reducing the number of
          CPUs competing for access to each bucket.

        - The lock and data structures for each bucket typically remain in
          cache somewhere on the chip, and thereby don't need to go over the
          bus, further reducing contention.

        - It has a positive effect on L2/L3 cache locality, because if we
          free pages back to a bucket corresponding to the CPU where they
          were last used, the pages held in the free list are very likely to
          be
          in cache on the chip they will next be allocated from.  The
          current allocator has a problem where logical CPUs "hoard" L2/L3
          cache by having their own free lists.

Experimental rudimentary NUMA support

        I thought it would be interesting to try this, just for fun.

        This changes the configuration of the buckets to correspond to NUMA
        nodes, and:

        - assigns pages to buckets according to which node the hardware
          tells us they live on;

        - changes the strategy for freeing pages, so that pages ALWAYS go
          back to their assigned bucket instead of moving about as they do
          with the L2/L3 strategy;

        - changes the strategy for allocating, so that allocating from the
          correct bucket is considered more important than allocating from
          the correct free list (sketched below).
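
        The allocation change can be thought of as swapping the outer two
        loops of the search.  A toy illustration of the two orders (the
        real allocator walks real page queues and handles colours as well):

                #include <stdio.h>

                #define NFREELIST       2
                #define NBUCKETS        2

                /* Current strategy: freelist is most important; try other
                 * buckets before giving up on the best freelist. */
                static void
                order_freelist_first(unsigned mybucket)
                {
                        unsigned f, i, b;

                        for (f = 0; f < NFREELIST; f++) {
                                for (i = 0; i < NBUCKETS; i++) {
                                        b = (mybucket + i) % NBUCKETS;
                                        printf("freelist %u, bucket %u\n",
                                            f, b);
                                }
                        }
                }

                /* NUMA strategy: the local bucket is most important; only
                 * fall back to a remote bucket once every freelist in the
                 * local one has been tried. */
                static void
                order_bucket_first(unsigned mybucket)
                {
                        unsigned f, i, b;

                        for (i = 0; i < NBUCKETS; i++) {
                                b = (mybucket + i) % NBUCKETS;
                                for (f = 0; f < NFREELIST; f++)
                                        printf("bucket %u, freelist %u\n",
                                            b, f);
                        }
                }

                int
                main(void)
                {
                        order_freelist_first(0);        /* current */
                        order_bucket_first(0);          /* NUMA */
                        return 0;
                }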

        I have disabled this for now, even when NUMA is enabled on the
        system, because it's not quite there yet and needs more testing.
        It does seem to work, though (for a compile job).

Per-cpu caching

        There is a tiny per-CPU caching layer that can be paused and resumed
        at runtime, and it's only enabled when CPUs are sharing buckets. 
        This functions as a buffer, where pages are shuffled to and from
        buckets in batch.  On a typical Intel system, about 224kB worth of
        pages will be cached.  That's not much, but is enough to further
        reduce lock contention by an order of magnitude without exerting too
        much pressure on the L2/L3 cache by hoarding.
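
        A rough model of the batching, with illustrative names and sizes
        (only the ~224kB figure comes from above); the point is that the
        bucket lock is taken once per batch rather than once per page:

                #include <stddef.h>

                #define PAGE_SIZE       4096
                #define CACHE_PAGES     (224 * 1024 / PAGE_SIZE)   /* 56 */

                struct pgcache {
                        size_t  npages;
                        void   *pages[CACHE_PAGES];
                };

                /* Stand-ins for locking a bucket and moving a whole batch
                 * of pages at once. */
                size_t  bucket_alloc_batch(void **, size_t);
                void    bucket_free_batch(void **, size_t);

                /* Allocate from the per-CPU buffer; refill it from the
                 * preferred bucket, in batch, only when it runs dry. */
                void *
                cache_alloc(struct pgcache *pc)
                {
                        if (pc->npages == 0) {
                                pc->npages = bucket_alloc_batch(pc->pages,
                                    CACHE_PAGES / 2);
                                if (pc->npages == 0)
                                        return NULL;  /* buckets are empty */
                        }
                        return pc->pages[--pc->npages];
                }

                /* Free into the per-CPU buffer; spill half of it back to
                 * the bucket, in batch, when it fills up. */
                void
                cache_free(struct pgcache *pc, void *pg)
                {
                        if (pc->npages == CACHE_PAGES) {
                                bucket_free_batch(pc->pages + CACHE_PAGES / 2,
                                    CACHE_PAGES / 2);
                                pc->npages = CACHE_PAGES / 2;
                        }
                        pc->pages[pc->npages++] = pg;
                }

        Refilling and spilling only half the buffer at a time stops an
        alternating allocate/free workload from pushing a full batch across
        the bucket lock on every call.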

Idlezero

        I have removed the guts of this, but the hooks to support it are
        still in place.  I think it may be useful to experiment with zeroing
        the per-CPU cached pages, or try some other method, but what we have
        now doesn't work well on modern systems.

How this looks in practice

        Here is what the freelists look like on my system, and the dmesg
        from said system.

        http://www.netbsd.org/~ad/2019/freelist.txt
        http://www.netbsd.org/~ad/2019/dmesg.boot

Results

        This is from a "make -j96" kernel build, with a !DIAGNOSTIC, GENERIC
        kernel on the system mentioned above.  System time is the most
        interesting here.  With NUMA disabled in the BIOS:

               74.55 real      1635.13 user       725.04 sys    before
               72.66 real      1653.86 user       593.19 sys    after

        With NUMA enabled in the BIOS & the allocator:

               76.81 real      1690.27 user       797.56 sys    before
               71.10 real      1632.42 user       603.41 sys    after

        Lock contention before (no NUMA):

        Total%  Count   Time/ms          Lock          
        ------ ------- --------- ----------------------
         99.80 36756212 182656.88 uvm_fpageqlock       

        Lock contention after (no NUMA):

        Total%  Count   Time/ms          Lock
        ------ ------- --------- ----------------------
         20.21  196928    132.50 uvm_freelist_locks+40  <all>
         18.72  180522    122.74 uvm_freelist_locks     <all>
