Re: evbarm hang
[redirected to port-arm]

On Fri, Apr 19, 2019 at 12:47:11PM +0200, Manuel Bouyer wrote:
> On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool
> > > spin mutex; cpu 1 holds the spin mutex and wants the kernel lock.
> >
> > AFAICT, cpu 1 is at fault here. this locking order is backwards.
> >
> > not sure why arm32 pmap operations on the kernel map are done with the
> > kernel lock instead of another lock. the arm64 version of this code
> > doesn't take any lock.
>
> Yes, this looks like it can cause lots of problems.
>
> Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
> aarch64 does. I will try this.

Here's what I found. I think a per-pmap lock is needed even for the kernel
pmap, because the l2_dtable (and related) structures contain statistics that
need to be kept coherent. For user pmaps, pm_lock protects them. For the
kernel pmap we can't use pm_lock: as pmap_kenter/pmap_kremove can be called
from interrupt context, it needs a mutex at IPL_VM. So I added a kpm_lock
kmutex, and use it in pmap_acquire_pmap_lock() for the kernel pmap.

Then we need to be careful not to call a path that could sleep with this
mutex held. This needs some adjustments in pmap_enter() (don't call
pool_get()/pool_put() with the pmap locked) and pmap_remove() (defer
pool_put() until it is safe to release the pmap lock).

While there, also use splvm() instead of splhigh() in pmap_growkernel(), as
x86 does. Also, rename pmap_lock to pmap_pg_lock. As this can be used with
the kernel pmap lock held, this also needs to be an IPL_VM kmutex (in all
cases).

It also includes the patch I posted earlier today to work around deadlocks
in ddb.

I've been running a DIAGNOSTIC+LOCKDEBUG kernel on my lime2 allwinner A20
board, and it did build a few packages (using distcc, and a SATA disk for
local storage) without problems, so at this point it looks like an
improvement.
I'll keep it running over the week-end and unless I get negative feedback,
I'll commit it early next week.

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference

Index: arm32/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/arm/arm32/pmap.c,v
retrieving revision 1.371
diff -u -p -u -r1.371 pmap.c
--- arm32/pmap.c	28 Oct 2018 14:59:17 -0000	1.371
+++ arm32/pmap.c	19 Apr 2019 15:55:30 -0000
@@ -217,6 +217,10 @@
 #include

+#ifdef DDB
+#include
+#endif
+
 __KERNEL_RCSID(0, "$NetBSD: pmap.c,v 1.371 2018/10/28 14:59:17 skrll Exp $");

 //#define PMAP_DEBUG
@@ -516,7 +520,8 @@ static size_t cnptes;
 #endif
 vaddr_t memhook;			/* used by mem.c & others */
 kmutex_t memlock __cacheline_aligned;	/* used by mem.c & others */
-kmutex_t pmap_lock __cacheline_aligned;
+kmutex_t pmap_pg_lock __cacheline_aligned;
+kmutex_t kpm_lock __cacheline_aligned;
 extern void *msgbufaddr;
 int pmap_kmpages;
 /*
@@ -538,9 +543,14 @@ vaddr_t pmap_directlimit;
 static inline void
 pmap_acquire_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+	if (__predict_false(db_onproc != NULL))
+		return;
+#endif
+
 	if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-		KERNEL_LOCK(1, NULL);
+		mutex_enter(&kpm_lock);
 #endif
 	} else {
 		mutex_enter(pm->pm_lock);
@@ -550,9 +560,13 @@ pmap_acquire_pmap_lock(pmap_t pm)
 static inline void
 pmap_release_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+	if (__predict_false(db_onproc != NULL))
+		return;
+#endif
 	if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-		KERNEL_UNLOCK_ONE(NULL);
+		mutex_exit(&kpm_lock);
 #endif
 	} else {
 		mutex_exit(pm->pm_lock);
@@ -562,20 +576,20 @@ pmap_release_pmap_lock(pmap_t pm)
 static inline void
 pmap_acquire_page_lock(struct vm_page_md *md)
 {
-	mutex_enter(&pmap_lock);
+	mutex_enter(&pmap_pg_lock);
 }

 static inline void
 pmap_release_page_lock(struct vm_page_md *md)
 {
-	mutex_exit(&pmap_lock);
+	mutex_exit(&pmap_pg_lock);
 }

 #ifdef DIAGNOSTIC
 static inline int
 pmap_page_locked_p(struct vm_page_md *md)
 {
-	return mutex_owned(&pmap_lock);
+	return mutex_owned(&pmap_pg_lock);
 }
 #endif

@@ -3057,6 +3071,10 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 #else
 	const bool vector_page_p = (va == vector_page);
 #endif
+	struct pmap_page *pp = pmap_pv_tracked(pa);
+	struct pv_entry *new_pv = NULL;
+	struct pv_entry *old_pv = NULL;
+	int error = 0;

 	UVMHIST_FUNC(__func__); UVMHIST_CALLED(maphist);

@@ -3072,6 +3090,12 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 	 * test for a managed page by checking pg != NULL.
 	 */
 	pg = pmap_initialized ? PHYS_TO_VM_PAGE(pa) : NULL;
+	/*
+	 * if we may need a new pv entry
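The pmap_remove() change described in the mail above (defer pool_put() until
it is safe to release the pmap lock) is the classic defer-the-free pattern:
while a spin mutex is held you must not call anything that can sleep, so
objects are only unlinked under the lock and handed back to the allocator
afterwards. A userspace sketch of the idea; all names here are illustrative
stand-ins, not the actual pmap.c code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Toy model of the pmap spin lock: a flag, plus a counter that records
 * the bug we are avoiding (calling a sleepable routine while spinning).
 */
static int pm_locked;
static int freed_while_locked;

struct pv_entry {
	struct pv_entry *pv_next;
};

static void
pv_free(struct pv_entry *pv)
{
	if (pm_locked)
		freed_while_locked++;	/* would be sleep-while-spinning */
	free(pv);
}

/*
 * Unlink entries under the lock, chain them on a local list, and only
 * give them back to the allocator once the lock is dropped: the pattern
 * the patch applies to pmap_remove()/pool_put().
 */
static int
remove_and_defer_free(struct pv_entry **head)
{
	struct pv_entry *oldpvs = NULL, *pv, *next;
	int freed = 0;

	pm_locked = 1;			/* pmap_acquire_pmap_lock() */
	for (pv = *head; pv != NULL; pv = next) {
		next = pv->pv_next;
		pv->pv_next = oldpvs;	/* defer; do not pv_free() here */
		oldpvs = pv;
	}
	*head = NULL;
	pm_locked = 0;			/* pmap_release_pmap_lock() */

	/* Now it is safe to release the entries. */
	for (pv = oldpvs; pv != NULL; pv = next) {
		next = pv->pv_next;
		pv_free(pv);
		freed++;
	}
	return freed;
}

static struct pv_entry *
make_list(int n)
{
	struct pv_entry *head = NULL;

	while (n-- > 0) {
		struct pv_entry *pv = malloc(sizeof(*pv));
		pv->pv_next = head;
		head = pv;
	}
	return head;
}
```

The same shape works for any resource whose release path may sleep: collect
under the spin lock, release after the unlock.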
Re: evbarm hang
On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> > mutex; cpu 1 holds the spin mutex and wants the kernel lock.
>
> AFAICT, cpu 1 is at fault here. this locking order is backwards.
>
> not sure why arm32 pmap operations on the kernel map are done with the
> kernel lock instead of another lock. the arm64 version of this code
> doesn't take any lock.

Yes, this looks like it can cause lots of problems.

Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
aarch64 does. I will try this.

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
re: evbarm hang
> So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> mutex; cpu 1 holds the spin mutex and wants the kernel lock.

AFAICT, cpu 1 is at fault here. this locking order is backwards.

not sure why arm32 pmap operations on the kernel map are done with the
kernel lock instead of another lock. the arm64 version of this code
doesn't take any lock.


.mrg.
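The rule being violated here is that every CPU must take these two locks in
the same global order. A minimal sketch of that ordering check, with ranks
and names invented for illustration (NetBSD's LOCKDEBUG does this kind of
bookkeeping for real):

```c
#include <assert.h>

/*
 * Hypothetical ranks for this example: the kernel lock must always be
 * taken before any pool spin mutex.
 */
enum lock_rank {
	RANK_KERNEL_LOCK = 0,
	RANK_POOL_MUTEX  = 1
};

/*
 * Return 1 if acquiring 'wanted' while already holding 'held' respects
 * the ordering (strictly increasing rank), 0 if it is backwards and can
 * deadlock against a CPU acquiring in the opposite order.
 */
static int
lock_order_ok(enum lock_rank held, enum lock_rank wanted)
{
	return wanted > held;
}
```

In the trace above, cpu 0 goes kernel lock then pool mutex (increasing
rank), while cpu 1 goes pool mutex then kernel lock: the backwards order
that makes the ABBA deadlock possible.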
Re: evbarm hang
On Fri, Apr 19, 2019 at 11:13:40AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> > Overnight lockdebug did find something:
> > login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
> >
> > [ 1908.3939406] lock address : 0x90b79074 type     :     spin
> > [ 1908.3939406] initialized  : 0x8041601c
> > [ 1908.3939406] shared holds :      0 exclusive:      1
> > [ 1908.3939406] shares wanted:      0 exclusive:      1
> > [ 1908.3939406] current cpu  :      0 last held:      1
> > [ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
> > [ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
> > [ 1908.3939406] owner field  : 0x00010500 wait/spin:      0/1
> >
> > [ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
> > [ 1908.4626458] cpu0: Begin traceback...
> > [ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
> > [ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
> > [ 1908.4626458] 0x9e4a195c: netbsd:snprintf
> > [ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
> > [ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
> > [ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
> > [ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
> > [ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
> > [ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
> > [ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
> > [ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
> > [ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
> > [ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
> > [ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
> > [ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
> > [ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
> > [ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
> > [ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
> > [ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
> > [ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
> > [ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
> > [ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
> > [ 1908.4626458] cpu0: End traceback...
>
> Does "show event" tell you if dwc_gmac interrupts are being distributed to
> both cpus?

Looks like they are

> I think we need to prevent or protect against this.

but the IRQ handler is not registered as MPSAFE, so it should run under
the kernel lock

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
Re: evbarm hang
On Fri, Apr 19, 2019 at 11:10:07AM +0200, Manuel Bouyer wrote:
> [...]
> So cpu 1 is indeed running the LWP holding the spin lock, and it looks
> like it's itself waiting for a mutex.
> Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

I found the reason for the hang (will send a separate mail to port-arm@
about it). So now I have stack traces for both CPUs:

[ 1553.4798668] cpu0: Begin traceback...
[ 1553.4798668] 0x9cf6b674: netbsd:db_panic+0x14
[ 1553.4798668] 0x9cf6b68c: netbsd:vpanic+0x194
[ 1553.4798668] 0x9cf6b6a4: netbsd:snprintf
[ 1553.4798668] 0x9cf6b6e4: netbsd:lockdebug_more
[ 1553.4798668] 0x9cf6b71c: netbsd:lockdebug_abort+0xc0
[ 1553.4798668] 0x9cf6b73c: netbsd:mutex_abort+0x34
[ 1553.4798668] 0x9cf6b7ac: netbsd:mutex_enter+0x580
[ 1553.4798668] 0x9cf6b804: netbsd:pool_get+0x70
[ 1553.4798668] 0x9cf6b854: netbsd:pool_cache_get_slow+0x1f4
[ 1553.4798668] 0x9cf6b8a4: netbsd:pool_cache_get_paddr+0x288
[ 1553.4798668] 0x9cf6b8c4: netbsd:m_clget+0x34
[ 1553.4798668] 0x9cf6b924: netbsd:dwc_gmac_intr+0x194
[ 1553.4798668] 0x9cf6b93c: netbsd:gic_fdt_intr+0x2c
[ 1553.4798668] 0x9cf6b964: netbsd:pic_dispatch+0x110
[ 1553.4798668] 0x9cf6b9c4: netbsd:armgic_irq_handler+0xf4
[ 1553.4798668] 0x9cf6ba44: netbsd:irq_entry+0x68
[ 1553.4798668] 0x9cf6bacc: netbsd:rw_enter+0x44c
[ 1553.4798668] 0x9cf6bc1c: netbsd:uvm_fault_internal+0x124
[ 1553.4798668] 0x9cf6bca4: netbsd:data_abort_handler+0x1b0
[ 1553.4798668] cpu0: End traceback...
db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 27357.1 (gcc) at netbsd:nullop: mov r0, #0 ; #0x0
db{1}> tr
0x9e365ba4: netbsd:_kernel_lock+0xc
0x9e365be4: netbsd:pmap_extract_coherency+0x200
0x9e365c4c: netbsd:uvm_km_kmem_alloc+0x110
0x9e365c64: netbsd:pool_page_alloc+0x3c
0x9e365cbc: netbsd:pool_grow+0x80
0x9e365cd4: netbsd:pool_catchup+0x30
0x9e365d2c: netbsd:pool_get+0x620
0x9e365d7c: netbsd:pool_cache_get_slow+0x1f4
0x9e365dcc: netbsd:pool_cache_get_paddr+0x288
0x9e365dec: netbsd:m_clget+0x34
0x9e365e84: netbsd:sosend+0x38c
0x9e365eac: netbsd:soo_write+0x3c
0x9e365f04: netbsd:dofilewrite+0x7c
0x9e365f34: netbsd:sys_write+0x5c
0x9e365fac: netbsd:syscall+0x12c

So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
mutex; cpu 1 holds the spin mutex and wants the kernel lock.

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
Re: evbarm hang
On 19/04/2019 10:10, Manuel Bouyer wrote:
> Overnight lockdebug did find something:
> login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
>
> [ 1908.3939406] lock address : 0x90b79074 type     :     spin
> [ 1908.3939406] initialized  : 0x8041601c
> [ 1908.3939406] shared holds :      0 exclusive:      1
> [ 1908.3939406] shares wanted:      0 exclusive:      1
> [ 1908.3939406] current cpu  :      0 last held:      1
> [ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
> [ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
> [ 1908.3939406] owner field  : 0x00010500 wait/spin:      0/1
>
> [ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
> [ 1908.4626458] cpu0: Begin traceback...
> [ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
> [ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
> [ 1908.4626458] 0x9e4a195c: netbsd:snprintf
> [ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
> [ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
> [ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
> [ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
> [ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
> [ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
> [ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
> [ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
> [ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
> [ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
> [ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
> [ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
> [ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
> [ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
> [ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
> [ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
> [ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
> [ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
> [ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
> [ 1908.4626458] cpu0: End traceback...

Does "show event" tell you if dwc_gmac interrupts are being distributed to
both cpus?
I think we need to prevent or protect against this.

Nick
Re: evbarm hang
On Fri, Apr 19, 2019 at 10:34:22AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> [snip]
>
> Did you see my suggestion for getting the backtrace from the lwp on the
> "other" cpu?

Yes, it didn't give anything interesting. I guess it is because this lwp is
on CPU, so its register copy in its struct lwp is not up to date.

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
Re: evbarm hang
On 19/04/2019 10:10, Manuel Bouyer wrote:
[snip]

Did you see my suggestion for getting the backtrace from the lwp on the
"other" cpu?

> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:
>
> So cpu 1 is indeed running the LWP holding the spin lock, and it looks
> like it's itself waiting for a mutex.
> Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

The kdb_trap code is racy and needs to look more like the mips version.
Feel free to beat me to fixing it.

Nick
Re: evbarm hang
Overnight lockdebug did find something:

login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout

[ 1908.3939406] lock address : 0x90b79074 type     :     spin
[ 1908.3939406] initialized  : 0x8041601c
[ 1908.3939406] shared holds :      0 exclusive:      1
[ 1908.3939406] shares wanted:      0 exclusive:      1
[ 1908.3939406] current cpu  :      0 last held:      1
[ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
[ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
[ 1908.3939406] owner field  : 0x00010500 wait/spin:      0/1

[ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
[ 1908.4626458] cpu0: Begin traceback...
[ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
[ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
[ 1908.4626458] 0x9e4a195c: netbsd:snprintf
[ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
[ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
[ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
[ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
[ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
[ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
[ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
[ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
[ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
[ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
[ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
[ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
[ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
[ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
[ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
[ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
[ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
[ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
[ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
[ 1908.4626458] cpu0: End traceback...
db{0}> tr/a 0x91fc3760
trace: pid 21381 lid 1 at 0x9e4a1dbc
0x9e4a1dbc: 914a1730
Bad frame pointer: 0x801f884c
db{0}> tr/a 0x91fc26e0
trace: pid 21532 lid 1 at 0x9cf97dbc
0x9cf97dbc: 91dfcc18

0x80416668 is mutex_enter(&pp->pr_lock); in pool_get().

PID   LID S CPU FLAGS STRUCT LWP * NAME WAIT
21562   1 2   0     0 92d177c0     gcc
21381 > 1 7   0     0 91fc3760     gcc
21532 > 1 7   1     0 91fc26e0     gcc
23218   1 2   1     0 92968f60     gcc

owner field  : 0x00010500

if I decode this properly, mtx_ipl is 5 and mtx_lock is 1. 5 would be
IPL_VM on evbarm.

db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:

So cpu 1 is indeed running the LWP holding the spin lock, and it looks like
it's itself waiting for a mutex.
Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
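A quick sanity check of that owner-field decoding. The bit layout below is
an assumption chosen to match the logged value (the authoritative packing is
arm's kmutex definition in sys/arch/arm/include/mutex.h): the simple-lock
byte in bits 16..23 of the owner word and the IPL in bits 8..15.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed field packing for this example only; consult the real arm
 * mutex header before relying on it.
 */
static unsigned
mtx_owner_lock(uint32_t owner)
{
	return (owner >> 16) & 0xff;	/* mtx_lock: 1 = held */
}

static unsigned
mtx_owner_ipl(uint32_t owner)
{
	return (owner >> 8) & 0xff;	/* mtx_ipl */
}
```

With owner field 0x00010500 this gives mtx_lock 1 and mtx_ipl 5, matching
the reading above (IPL 5 being IPL_VM on evbarm).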
Re: evbarm hang
On Thu, Apr 18, 2019 at 04:32:56PM +0200, Manuel Bouyer wrote:
> [...]
>
> (the common point is the dwc_gmac_intr() call, which ends up in pool_get().
> pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
> just before startover:.
> Unfortunately I can't get a stack trace for CPU 1:
> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:
>
> Any idea how to proceed from here ?

Here's another instance. This one was with LOCKDEBUG, but LOCKDEBUG didn't
trigger.

0x9e06d8cc: netbsd:comintr+0xc
0x9e06d8e4: netbsd:gic_fdt_intr+0x2c
0x9e06d90c: netbsd:pic_dispatch+0x38
0x9e06d96c: netbsd:armgic_irq_handler+0xf4
0x9e06da2c: netbsd:irq_entry+0x68
0x9e06da84: netbsd:pool_get+0x70
0x9e06dad4: netbsd:pool_cache_get_slow+0x1f4
0x9e06db24: netbsd:pool_cache_get_paddr+0x288
0x9e06db44: netbsd:m_clget+0x34
0x9e06dba4: netbsd:dwc_gmac_intr+0x194
0x9e06dbbc: netbsd:gic_fdt_intr+0x2c
0x9e06dbe4: netbsd:pic_dispatch+0x110
0x9e06dc44: netbsd:armgic_irq_handler+0xf4
0x9e06dcf4: netbsd:irq_entry+0x68
0x9e06dd2c: netbsd:pmap_extract_coherency+0x16c
0x9e06dd7c: netbsd:pool_cache_get_slow+0x254
0x9e06ddcc: netbsd:pool_cache_get_paddr+0x288
0x9e06ddec: netbsd:m_clget+0x34
0x9e06de84: netbsd:sosend+0x38c
0x9e06deac: netbsd:soo_write+0x3c
0x9e06df04: netbsd:dofilewrite+0x7c
0x9e06df34: netbsd:sys_write+0x5c
0x9e06dfac: netbsd:syscall+0x12c

PID   LID S CPU FLAGS STRUCT LWP * NAME    WAIT
13250   1 2   1     0 92ba1d20     gcc
25370   1 2   0     0 929d8c80     gcc
16746 > 1 7   0     0 91fc1a20     gcc
15636   1 2   0     4 91b5d1c0     cc1plus
15822 > 1 7   1     0 92db9d40     gcc

db{0}> tr/a 92db9d40
trace: pid 15822 lid 1 at 0x99feddbc
0x99feddbc: 91af4828
db{0}> tr/a 91fc1a20
trace: pid 16746 lid 1 at 0x9e06ddbc
0x9e06ddbc: 922c8ec0

Maybe more interesting:
0      21 3   1   204 90bf1640     xcall/1 xcall
0      20 1   1   200 90bf1380     softser/1
0      19 1   1   200 90bf10c0     softclk/1
0      18 1   1   200 90bf0e00     softbio/1
0      17 1   1   200 90bf0b40     softnet/1
0      16 1   1   201 90bf0880     idle/1

db{0}> tr/a 90bf0b40
trace: pid 0 lid 17 at 0x99e83e24
0x99e83e24: netbsd:pic_do_pending_ints+0x648
0x99e83eb4: netbsd:pic_do_pending_ints+0x614
0x99e83ee4: netbsd:splx+0x64
0x99e83f44: netbsd:canintr+0x290
0x99e83fac: netbsd:softint_dispatch+0x22c
0x9e1f9dbc: netbsd:softint_switch+0x58
0x9e1f9de4: netbsd:_splraise+0x3c
--- tf 0x99e83fb0 ---

canintr+0x290 is: mutex_exit(softnet_lock);

(gdb) l *(splx+0x64)
0x80004270 is in splx (/dsk/l1/misc/bouyer/HEAD/clean/src/sys/arch/arm/pic/pic_splfuncs.c:96).
91		    "splx(%d) to a higher ipl than %d", savedipl, ci->ci_cpl);
92
93		ci->ci_intr_depth++;
94		pic_do_pending_ints(psw, savedipl, NULL);
95		ci->ci_intr_depth--;
96		KASSERTMSG(ci->ci_cpl == savedipl, "cpl %d savedipl %d",
97		    ci->ci_cpl, savedipl);

(gdb) l *(pic_do_pending_ints+0x614)
0x800031f4 is in pic_do_pending_ints (/dsk/l1/misc/bouyer/HEAD/clean/src/sys/arch/arm/pic/pic.c:583).
578		if (newipl == IPL_NONE && (ci->ci_astpending & __BIT(1))) {
579			pic_set_priority(ci, IPL_SCHED);
580			kpreempt(0);
581		}
582	#endif
583		if (ci->ci_cpl != newipl)
584			pic_set_priority(ci, newipl);
585	}
586
587	static void

my guess is that we did call kpreempt() here and switched to another lwp.
Is it possible that we did so with a spin lock held ?

pic_do_pending_ints+0x648 is in pic_set_priority() (gdb points to
ci->ci_cpl = newipl; but it may be the call to
pic_list[0]->pic_ops->pic_set_priority() just before).

From "show all locks /t", this one could be interesting:

Locks held on CPU 1:
Lock 0 (initialized at pool_init)
lock address : 0x90b79074 type     :     spin
initialized  : 0x8041582c
shared holds :      0 exclusive:      1
shares wanted:      0 exclusive:      1
current cpu  :      0 last held:      1
current lwp  : 0x91fc1a20 last held: 0x92db9d40
last locked* : 0x80415e78 unlocked : 0x804161f8
[ 1447.2013328] owner field  : 0x00010500 wait/spin:      0/1
trace: pid 15822 lid 1 at 0x99feddbc
0x99feddbc: 91af4828

0x80415e78 is also the mutex_enter(&pp->pr_lock); line in pool_get().

trying to switch to cpu 1 did hang as usual.
--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
Re: evbarm hang
On 18/04/2019 15:32, Manuel Bouyer wrote:
> Hello,
> I got several hard hangs while building packages on an allwinner a20 (lime2)
> I use distcc so there is moderate network traffic in addition to
> local CPU load.
>
> most of the time I can't enter ddb from serial console, I just get
> Stopped in pid 25136.1 (cc1) at netbsd:cpu_Debugger+0x4:
>
> On 2 occasions, I could enter ddb, and both times the stack trace on CPU 0
> was similar:
>
> 0x9cea7c0c: netbsd:pool_get+0x70 <-- hang here
> 0x9cea7c5c: netbsd:pool_cache_get_slow+0x1f4
> 0x9cea7cac: netbsd:pool_cache_get_paddr+0x288
> 0x9cea7ccc: netbsd:m_clget+0x34
> 0x9cea7d2c: netbsd:dwc_gmac_intr+0x194
> 0x9cea7d44: netbsd:gic_fdt_intr+0x2c
> 0x9cea7d6c: netbsd:pic_dispatch+0x110
> 0x9cea7dcc: netbsd:armgic_irq_handler+0xf4
> 0x9cea7e3c: netbsd:irq_entry+0x68
> 0x9cea7e7c: netbsd:uvm_unmap_detach+0x80
> 0x9cea7eac: netbsd:uvmspace_free+0x100
> 0x9cea7f14: netbsd:exit1+0x190
> 0x9cea7f34: netbsd:sys_exit+0x3c
> 0x9cea7fac: netbsd:syscall+0x12c
>
> (the common point is the dwc_gmac_intr() call, which ends up in pool_get().
> pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
> just before startover:.
> Unfortunately I can't get a stack trace for CPU 1:
> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:

Maybe you can get it by looking for the lwp on the other cpu using ps/l
and using bt/a?

Nick
Re: evbarm hang
On Thu, Apr 18, 2019 at 04:32:56PM +0200, Manuel Bouyer wrote:
> Hello,
> I got several hard hangs while building packages on an allwinner a20 (lime2)
> I use distcc so there is moderate network traffic in addition to
> local CPU load.
>
> most of the time I can't enter ddb from serial console, I just get
> Stopped in pid 25136.1 (cc1) at netbsd:cpu_Debugger+0x4:
>
> On 2 occasions, I could enter ddb, and both times the stack trace on CPU 0
> was similar:
>
> 0x9cea7c0c: netbsd:pool_get+0x70 <-- hang here
> 0x9cea7c5c: netbsd:pool_cache_get_slow+0x1f4
> 0x9cea7cac: netbsd:pool_cache_get_paddr+0x288
> 0x9cea7ccc: netbsd:m_clget+0x34
> 0x9cea7d2c: netbsd:dwc_gmac_intr+0x194
> 0x9cea7d44: netbsd:gic_fdt_intr+0x2c
> 0x9cea7d6c: netbsd:pic_dispatch+0x110
> 0x9cea7dcc: netbsd:armgic_irq_handler+0xf4
> 0x9cea7e3c: netbsd:irq_entry+0x68
> 0x9cea7e7c: netbsd:uvm_unmap_detach+0x80
> 0x9cea7eac: netbsd:uvmspace_free+0x100
> 0x9cea7f14: netbsd:exit1+0x190
> 0x9cea7f34: netbsd:sys_exit+0x3c
> 0x9cea7fac: netbsd:syscall+0x12c
>
> (the common point is the dwc_gmac_intr() call, which ends up in pool_get().
> pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
> just before startover:.
> Unfortunately I can't get a stack trace for CPU 1:
> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:
>
> Any idea how to proceed from here ?

I forgot to mention: I tried a LOCKDEBUG kernel, but the problem didn't
show up ...

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference