commit e6b81333e1c342741087c579a9407da5bf4eca62 Author: Matthew Dillon <dil...@apollo.backplane.com> Date: Sat Aug 12 10:26:17 2017 -0700
kernel - Fix bottlenecks that develop when many processes are running * When a large number of processes or threads are running (in the tens of thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop. These bottlenecks do not develop when only a few thousand threads are present. By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured or manually set high enough, DFly can now handle hundreds of thousands of active processes running, polling, sleeping, whatever. Tested to around 400,000 discrete processes (no shared VM pages) on a 32-thread dual-socket Xeon system. Each process is placed in a 1/10 second sleep loop using umtx timeouts: baseline - (before changes), system bottlenecked starting at around the 30,000 process mark, eating all available cpu, high IPI rate from hash collisions, and other unrelated user processes bogged down due to the scheduling overhead. 200,000 processes - System settles down to 45% idle, and low IPI rate. 220,000 processes - System 30% idle and low IPI rate 250,000 processes - System 0% idle and low IPI rate 300,000 processes - System 0% idle and low IPI rate. 400,000 processes - Scheduler begins to bottleneck again after the 350,000 while the process test is still in its fork/exec loop. Once all 400,000 processes are settled down, system behaves fairly well. 0% idle, modest IPI rate averaging 300 IPI/sec/cpu (due to hash collisions in the wakeup code). * More work will be needed to better handle processes with massively shared VM pages. It should also be noted that the system does a *VERY* good job allocating and releasing kernel resources during this test using discrete processes. It can kill 400,000 processes in a few seconds when I ^C the test. * Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan. This bottleneck does not arise when large numbers of processes are running in usermode, because typically only one user process per cpu will be scheduled to LWKT. However, this bottleneck does arise when large numbers of threads are woken up in-kernel. While in-kernel, a thread schedules directly to LWKT. Round-robin operation tends to result in appends to the tail of the queue, so this optimization saves an enormous amount of cpu time when large numbers of threads are present. * Limit ncallout to ~5 minutes worth of ring. The calculation code is primarily designed to allocate less space on low-memory machines, but will also cause an excessively-sized ring to be allocated on large-memory machines. 512MB was observed on a 32-way box. * Remove vm_map->hint, which had basically stopped functioning in a useful manner. Add a new vm_map hinting mechanism that caches up to four (size, align) start addresses for vm_map_findspace(). This cache is used to quickly index into the linear vm_map_entry list before entering the linear search phase. This fixes a serious bottleneck that arises due to vm_map_findspace()'s linear scan if the vm_map_entry list when the kernel_map becomes fragmented, typically when the machine is managing a large number of processes or threads (in the tens of thousands or more). This will also reduce overheads for processes with highly fragmented vm_maps. * Dynamically size the action_hash[] array in vm/vm_page.c. This array is used to record blocked umtx operations. The limited size of the array could result in an excessive number of hash entries when a large number of processes/threads are present in the system. Again, the effect is noticed as the number of threads exceeds a few tens of thousands. Summary of changes: sys/kern/kern_synch.c | 6 +- sys/kern/lwkt_thread.c | 25 ++++++- sys/kern/subr_param.c | 7 ++ sys/vm/vm_glue.c | 10 +-- sys/vm/vm_map.c | 188 ++++++++++++++++++++++++++----------------------- sys/vm/vm_map.h | 30 +++++++- sys/vm/vm_page.c | 70 ++++++++++++------ 7 files changed, 213 insertions(+), 123 deletions(-) http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/e6b81333e1c342741087c579a9407da5bf4eca62 -- DragonFly BSD source repository