[Cc'ing yamt@ and para@, in case they're not reading tech-kern@ right now, since they know far more about allocators than I do.]

> Date: Sun, 22 Oct 2017 22:32:40 +0200
> From: Manuel Bouyer <[email protected]>
>
> With a pullup of kern_exec.c 1.448-1.449, to netbsd-6, we're still seeing
> hangs on vmem.

Welp.  At least it's not an execargs hang!

I hypothesize that this may be an instance of a general problem with
chaining sleeping allocators:  To allocate a foo, first allocate a
block of foos; then allocate a foo within the block.  (Repeat
recursively for a few iterations: 1KB foos, 4KB pages of foos, 128KB
blocks of pages, &c.)

- Suppose thread A tries to allocate a foo, and every foo in every
  block allocated so far is currently in use.  Thread A will proceed
  to try to allocate a block of foos.  If there's not enough KVA to
  allocate a block of foos, thread A will sleep until there is.

- Suppose thread B comes along and frees a foo.  That doesn't wake
  thread A, because there's still not enough KVA to allocate a block
  of foos.  So thread A continues to hang -- forever, if KVA is too
  fragmented.

Even if thread A eventually makes progress, every time this happens,
it will allocate a new block of foos instead of reusing a foo from an
existing block.

And if there's no bound on the number of threads waiting to allocate a
block of foos (as is the case, I think, with pools), then under bursts
of heavy load there may be lots of nearly empty foo blocks allocated
simultaneously, which makes fragmentation even worse.

Thread A _should_ make progress if a foo is freed up, but it doesn't:
we have no mechanism by which multiple different signals can cause a
thread to wake, short of sharing the condition variables for them and
restarting every cascade of blocking allocations from the top.
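
To make the scenario above concrete, here is a toy single-threaded
model of a two-layer sleeping allocator.  It is only a sketch: every
name in it (struct model, foo_alloc, kva_free, and so on) is made up
for illustration and none of it is NetBSD code.  The "sleep" is just a
recorded wait channel, standing in for a cv_wait in the lower layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in NetBSD. */
enum wait_chan { WAIT_NONE, WAIT_FOO, WAIT_KVA };

struct model {
	int free_foos;		/* free foos inside already-allocated blocks */
	int free_kva;		/* contiguous KVA available for a new block */
	int block_size;		/* KVA needed for one block of foos */
	int foos_per_block;
	int nblocks;		/* blocks allocated so far */
};

struct waiter {
	enum wait_chan chan;	/* what the thread sleeps on, if anything */
};

/* Lower layer: carve a new block out of KVA, or sleep waiting for KVA. */
static bool
block_alloc(struct model *m, struct waiter *w)
{
	if (m->free_kva < m->block_size) {
		w->chan = WAIT_KVA;	/* stands in for cv_wait down here */
		return false;
	}
	m->free_kva -= m->block_size;
	m->nblocks++;
	m->free_foos += m->foos_per_block;
	w->chan = WAIT_NONE;
	return true;
}

/* Upper layer: take a free foo if there is one, else grow by a block. */
static bool
foo_alloc(struct model *m, struct waiter *w)
{
	if (m->free_foos == 0 && !block_alloc(m, w))
		return false;		/* asleep in the lower layer */
	assert(m->free_foos > 0);
	m->free_foos--;
	return true;
}

/*
 * Resume after a KVA wakeup.  The thread is still inside the lower
 * layer, so it grows a new block even if a foo was freed meanwhile.
 */
static bool
foo_alloc_resume(struct model *m, struct waiter *w)
{
	if (!block_alloc(m, w))
		return false;
	m->free_foos--;
	return true;
}

/* Free a foo; wakes only a waiter sleeping on the foo layer. */
static bool
foo_free(struct model *m, struct waiter *w)
{
	m->free_foos++;
	if (w->chan != WAIT_FOO)
		return false;		/* a KVA waiter stays asleep */
	w->chan = WAIT_NONE;
	return true;
}

/* Free KVA; wakes only a waiter sleeping on the KVA layer. */
static bool
kva_free(struct model *m, int bytes, struct waiter *w)
{
	m->free_kva += bytes;
	if (w->chan != WAIT_KVA)
		return false;
	w->chan = WAIT_NONE;
	return true;
}
```

Starting with no free foos and no free KVA, foo_alloc() leaves the
waiter on WAIT_KVA; a subsequent foo_free() bumps free_foos to 1 but
does not wake it.  Once kva_free() does wake it, foo_alloc_resume()
allocates a whole new block while the freed foo sits unused, which is
the nearly-empty-block effect described above.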

This won't always happen:

- In the case of execargs buffers, this _won't_ happen (now) because
  each execargs buffer is uvm_km_allocated one at a time, not in
  blocks, so as long as the page daemon runs and there is an unused
  execargs buffer, shrinking exec_pool will free enough KVA in
  exec_map to allow a blocked uvm_km_alloc to continue and thereby
  allow a blocked pool_get to continue.

- But in the case of pathbufs, they're 1024 bytes apiece, allocated in
  4KB pages from kmem_va_arena, which in turn are allocated from
  qcached chunks of 128KB blocks from kmem_va_arena, which in turn are
  allocated from kmem_arena.

  And there are no 128KB regions left in kmem_arena, according to your
  `show vmem', which (weakly) supports this hypothesis.

To really test this hypothesis, you also need to check either

(a) for pages of pathbufs with free pathbufs in pnbuf_cache, or
(b) for blocks with free pages in kmem_va_arena's qcache.

I'm a little puzzled about the call stack.  By code inspection, it
seems the call stack should look like:

  cv_wait(&kmem_arena->vm_cv, &kmem_arena->vm_lock)
  vmem_xalloc(kmem_arena, #x20000, ...)
  vmem_alloc(kmem_arena, #x20000, ...)
  vmem_xalloc(kmem_va_arena, #x20000, ...)
  vmem_alloc(kmem_va_arena, #x20000, ...)
  qc_poolpage_alloc(...qc...)
  pool_grow(...qc...)
* pool_get(...qc...)
* pool_cache_get_paddr(...qc...)
* vmem_alloc(kmem_va_arena, #x1000, ...)
* uvm_km_kmem_alloc(kmem_va_arena, #x1000, ...)
* pool_page_alloc(&pnbuf_cache->pc_pool, ...)
* pool_allocator_alloc(&pnbuf_cache->pc_pool, ...)
* pool_grow(&pnbuf_cache->pc_pool, ...)
  pool_get(&pnbuf_cache->pc_pool, ...)
  pool_cache_get_slow(pnbuf_cache->pc_cpus[curcpu()->ci_index], ...)
  pool_cache_get_paddr(pnbuf_cache, ...)
  pathbuf_create_raw

The starred lines do not seem to appear in your stack trace.  Note
that immediately above pool_get in your stack trace, which presumably
passes &pnbuf_cache->pc_pool, is a call to pool_grow for a _different_
pool, presumably the one inside kmem_arena's qcache.
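
For a sense of scale in the pathbuf case, the sizes quoted above
(1024-byte pathbufs, 4KB pages, 128KB blocks) can be turned into
rough numbers.  This is a back-of-the-envelope sketch only: the
constant and function names are made up, and it assumes no per-page
or per-block metadata takes space away from the items.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the discussion above; names are hypothetical. */
enum {
	PATHBUF_SIZE	= 1024,		/* bytes per pathbuf */
	POOL_PAGE_SIZE	= 4096,		/* bytes per page of pathbufs */
	QC_BLOCK_SIZE	= 128 * 1024,	/* bytes per block of pages */
};

static_assert(POOL_PAGE_SIZE % PATHBUF_SIZE == 0, "pathbufs tile a page");
static_assert(QC_BLOCK_SIZE % POOL_PAGE_SIZE == 0, "pages tile a block");

/* How many pathbufs fit in one block, ignoring any metadata. */
static size_t
pathbufs_per_block(void)
{
	return (QC_BLOCK_SIZE / POOL_PAGE_SIZE) *
	    (POOL_PAGE_SIZE / PATHBUF_SIZE);
}

/*
 * Worst case: each of nlive pathbufs sits alone in its own block, so
 * every one of them keeps a whole block of KVA from being released.
 */
static size_t
worst_case_pinned_kva(size_t nlive)
{
	return nlive * (size_t)QC_BLOCK_SIZE;
}
```

Under those assumptions a block holds 128 pathbufs, so in the worst
case a single live 1KB pathbuf keeps 128KB of KVA pinned, and eight
stragglers in eight different blocks pin a full megabyte.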
