Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
[redirected to port-arm]

On Fri, Apr 19, 2019 at 12:47:11PM +0200, Manuel Bouyer wrote:
> On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool
> > > spin mutex; cpu 1 holds the spin mutex and wants the kernel lock.
> > 
> > AFAICT, cpu 1 is at fault here.  this locking order is
> > backwards.
> > 
> > not sure why arm32 pmap operations on the kernel map are
> > with the kernel lock instead of another lock.  the arm64
> > version of this code doesn't take any lock.
> 
> Yes, this looks like it can cause lots of problems.
> 
> Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
> aarch64 does. I will try this.

Here's what I found.
I think a per-pmap lock is needed even for the kernel pmap, because the
l2_dtable (and related) structures contain statistics that need to be kept
coherent. For user pmaps, pm_lock protects them.
For the kernel pmap we can't use pm_lock: because pmap_kenter()/pmap_kremove()
can be called from interrupt context, the lock needs to be a mutex at IPL_VM.
So I added a kpm_lock kmutex, and use it in pmap_acquire_pmap_lock() for the
kernel pmap.
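For reference, this is roughly what the initialisation looks like (a sketch
only; the exact place, presumably pmap_bootstrap(), is an assumption and is
not shown in the diff below):

    /* Sketch: kpm_lock as a spin mutex at IPL_VM, so that
     * pmap_kenter()/pmap_kremove() can take it from interrupt context. */
    mutex_init(&kpm_lock, MUTEX_DEFAULT, IPL_VM);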

Then we need to be careful not to call into a path that could sleep with this
mutex held. This needs some adjustments in pmap_enter() (don't call
pool_get()/pool_put() with the pmap locked) and pmap_remove()
(defer pool_put() until it is safe to release the pmap lock).
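The deferral pattern is roughly the following sketch (pmap_pv_pool, pv_next
and the shape of the loop are illustrative assumptions, not the exact code in
the patch):

    /* Sketch: collect freed pv entries instead of calling pool_put()
     * while the (possibly spin) pmap lock is held. */
    struct pv_entry *opv_list = NULL, *opv;

    pmap_acquire_pmap_lock(pm);
    /* ... unmap the range, chaining each freed pv entry on opv_list ... */
    pmap_release_pmap_lock(pm);

    /* Safe point: no pmap lock held, the pool code may take its own locks. */
    while ((opv = opv_list) != NULL) {
            opv_list = opv->pv_next;
            pool_put(&pmap_pv_pool, opv);
    }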

While there, also use splvm() instead of splhigh() in pmap_growkernel(),
as x86 does.
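In sketch form (only the IPL changes; the rest of pmap_growkernel() is
unchanged):

    int s = splvm();        /* was: splhigh() */
    /* ... grow the kernel page tables ... */
    splx(s);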
Also, rename pmap_lock to pmap_pg_lock. As this lock can be taken with the
kernel pmap lock held, it also needs to be an IPL_VM kmutex (in all cases).
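Again as a sketch only (the diff below only shows the declaration rename, so
the initialisation site is an assumption):

    /* Sketch: pmap_pg_lock as a spin mutex at IPL_VM in all configurations,
     * so it can be taken with the kernel pmap lock held. */
    mutex_init(&pmap_pg_lock, MUTEX_DEFAULT, IPL_VM);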

It also includes the patch I posted earlier today to work around deadlocks
in ddb.

I've been running a DIAGNOSTIC+LOCKDEBUG kernel on my Lime2 Allwinner A20
board, and it did build a few packages (using distcc, and a SATA disk for
local storage) without problems, so at this point it looks like
an improvement. I'll keep it running over the weekend and, unless I get
negative feedback, I'll commit it early next week.


-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--
Index: arm32/pmap.c
===
RCS file: /cvsroot/src/sys/arch/arm/arm32/pmap.c,v
retrieving revision 1.371
diff -u -p -u -r1.371 pmap.c
--- arm32/pmap.c	28 Oct 2018 14:59:17 -0000	1.371
+++ arm32/pmap.c	19 Apr 2019 15:55:30 -0000
@@ -217,6 +217,10 @@
 
 #include 
 
+#ifdef DDB
+#include 
+#endif
+
 __KERNEL_RCSID(0, "$NetBSD: pmap.c,v 1.371 2018/10/28 14:59:17 skrll Exp $");
 
 //#define PMAP_DEBUG
@@ -516,7 +520,8 @@ static size_t cnptes;
 #endif
 vaddr_t memhook;   /* used by mem.c & others */
 kmutex_t memlock __cacheline_aligned;  /* used by mem.c & others */
-kmutex_t pmap_lock __cacheline_aligned;
+kmutex_t pmap_pg_lock __cacheline_aligned;
+kmutex_t kpm_lock __cacheline_aligned;
 extern void *msgbufaddr;
 int pmap_kmpages;
 /*
@@ -538,9 +543,14 @@ vaddr_t pmap_directlimit;
 static inline void
 pmap_acquire_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+   if (__predict_false(db_onproc != NULL))
+   return;
+#endif
+   
if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-   KERNEL_LOCK(1, NULL);
+   mutex_enter(&kpm_lock);
 #endif
} else {
mutex_enter(pm->pm_lock);
@@ -550,9 +560,13 @@ pmap_acquire_pmap_lock(pmap_t pm)
 static inline void
 pmap_release_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+   if (__predict_false(db_onproc != NULL))
+   return;
+#endif
if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-   KERNEL_UNLOCK_ONE(NULL);
+   mutex_exit(&kpm_lock);
 #endif
} else {
mutex_exit(pm->pm_lock);
@@ -562,20 +576,20 @@ pmap_release_pmap_lock(pmap_t pm)
 static inline void
 pmap_acquire_page_lock(struct vm_page_md *md)
 {
-   mutex_enter(&pmap_lock);
+   mutex_enter(&pmap_pg_lock);
 }
 
 static inline void
 pmap_release_page_lock(struct vm_page_md *md)
 {
-   mutex_exit(&pmap_lock);
+   mutex_exit(&pmap_pg_lock);
 }
 
 #ifdef DIAGNOSTIC
 static inline int
 pmap_page_locked_p(struct vm_page_md *md)
 {
-   return mutex_owned(&pmap_lock);
+   return mutex_owned(&pmap_pg_lock);
 }
 #endif
 
@@ -3057,6 +3071,10 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 #else
const bool vector_page_p = (va == vector_page);
 #endif
+   struct pmap_page *pp = pmap_pv_tracked(pa);
+   struct pv_entry *new_pv = NULL;
+   struct pv_entry *old_pv = NULL;
+   int error = 0;
 
UVMHIST_FUNC(__func__); UVMHIST_CALLED(maphist);
 
@@ -3072,6 +3090,12 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 * test for a managed page by checking pg != NULL.
 */
pg = pmap_initialized ? PHYS_TO_VM_PAGE(pa) : NULL;
+   /*
+* if we may need a new pv entry 

of_getnode_byname: infinite loop

2019-04-19 Thread yarl-baudig
Hello,

of_getnode_byname(start, target) goes into an infinite loop if no node named
'target' exists.

Moreover, I think the function should not go out of the subtree whose root is
'start'.

Also, both the tests and the documentation can be improved.

I propose the attached patch.

Thank you.


commit 92e06cb016b3a4792f6b6140a7b5055d794420ed
Author: Yarl Baudig 
Date:   Fri Apr 19 09:24:18 2019 +0200

Upgrade of_getnode_byname()

- no more infinite loop when no node named 'target' exists
- no longer leaves the subtree being searched
- improved (more standard) tests on return values
- improved documentation

diff --git a/sys/dev/ofw/ofw_subr.c b/sys/dev/ofw/ofw_subr.c
index 91d4b7a004d3..6529c831ef28 100644
--- a/sys/dev/ofw/ofw_subr.c
+++ b/sys/dev/ofw/ofw_subr.c
@@ -337,9 +337,21 @@ of_find_firstchild_byname(int node, const char *name)
 }
 
 /*
- * Find a give node by name.  Recurses, and seems to walk upwards too.
+ * int of_getnode_byname(start, target)
+ *
+ * Find a node by name.
+ *
+ * Arguments:
+ *	start		OFW phandle of the root node of the subtree
+ *			to search in. You can give 0 for the root of the full tree.
+ *	target		Property "name" to search for.
+ *
+ * Return Value:
+ *	-1 error, 0 if no node with such a name exists or the OFW phandle of the first
+ *	node with that name.
+ * Side Effects:
+ *	None.
  */
-
 int
 of_getnode_byname(int start, const char *target)
 {
@@ -351,21 +363,23 @@ of_getnode_byname(int start, const char *target)
 
 	for (node = start; node; node = next) {
 		memset(name, 0, sizeof name);
-		OF_getprop(node, "name", name, sizeof name - 1);
-		if (strcmp(name, target) == 0)
+		if ((OF_getprop(node, "name", name, sizeof name - 1) > 0) &&
+			(strcmp(name, target) == 0))
 			break;
 
-		if ((next = OF_child(node)) != 0)
+		if ((next = OF_child(node)) > 0)
 			continue;
 
 		while (node) {
-			if ((next = OF_peer(node)) != 0)
+			if ((next = OF_peer(node)) > 0)
 break;
-			node = OF_parent(node);
+			if ((node = OF_parent(node)) < 1)
+return node;
+			if (node == start)
+return 0;
 		}
 	}
 
-	/* XXX is this correct? */
 	return node;
 }
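As a usage illustration (a hypothetical caller, not part of the patch), with
these changes only a positive return value is a phandle:

	/* Hypothetical caller: look for an "rtc" node below "bus". */
	int node = of_getnode_byname(bus, "rtc");
	if (node > 0) {
		/* found: phandle of the first matching node */
	} else if (node == 0) {
		/* no such node in the subtree rooted at "bus" */
	} else {
		/* OFW error while walking the tree */
	}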
 


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> > mutex; cpu 1 holds the spin mutex and wants the kernel lock.
> 
> AFAICT, cpu 1 is at fault here.  this locking order is
> backwards.
> 
> not sure why arm32 pmap operations on the kernel map are
> with the kernel lock instead of another lock.  the arm64
> version of this code doesn't take any lock.

Yes, this looks like it can cause lots of problems.

Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
aarch64 does. I will try this.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


re: evbarm hang

2019-04-19 Thread matthew green
> So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> mutex; cpu 1 holds the spin mutex and wants the kernel lock.

AFAICT, cpu 1 is at fault here.  this locking order is
backwards.

not sure why arm32 pmap operations on the kernel map are
with the kernel lock instead of another lock.  the arm64
version of this code doesn't take any lock.


.mrg.


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 11:13:40AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> > Overnight lockdebug did find something:
> > login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
> > 
> > [ 1908.3939406] lock address : 0x90b79074 type :   
> > spin
> > [ 1908.3939406] initialized  : 0x8041601c
> > [ 1908.3939406] shared holds :  0 exclusive:
> >   1
> > [ 1908.3939406] shares wanted:  0 exclusive:
> >   1
> > [ 1908.3939406] current cpu  :  0 last held:
> >   1
> > [ 1908.3939406] current lwp  : 0x91fc3760 last held: 
> > 0x91fc26e0
> > [ 1908.3939406] last locked* : 0x80416668 unlocked : 
> > 0x804169e8
> > [ 1908.3939406] owner field  : 0x00010500 wait/spin:
> > 0/1
> > 
> > [ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: 
> > spinout
> > [ 1908.4626458] cpu0: Begin traceback...
> > [ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
> > [ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
> > [ 1908.4626458] 0x9e4a195c: netbsd:snprintf
> > [ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
> > [ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
> > [ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
> > [ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
> > [ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
> > [ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
> > [ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
> > [ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
> > [ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
> > [ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
> > [ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
> > [ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
> > [ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
> > [ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
> > [ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
> > [ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
> > [ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
> > [ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
> > [ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
> > [ 1908.4626458] cpu0: End traceback...
> > 
> 
> Does show event tell you if dwc_gmac interrupts are being distributed to
> both cpus?

Looks like they are

> I think we need to prevent or protect against this.

but the IRQ handler is not registered as MPSAFE, so it should run under the
kernel lock

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 11:10:07AM +0200, Manuel Bouyer wrote:
> [...]
> So cpu 1 is indeed running the LWP holding the spin lock, and it looks
> like it's itself waiting for a mutex.
> Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

I found the reason for the hang (will send a separate mail to port-arm@
about it).

So now I have stack traces for both CPUs:
[ 1553.4798668] cpu0: Begin traceback...   
[ 1553.4798668] 0x9cf6b674: netbsd:db_panic+0x14  
[ 1553.4798668] 0x9cf6b68c: netbsd:vpanic+0x194  
[ 1553.4798668] 0x9cf6b6a4: netbsd:snprintf
[ 1553.4798668] 0x9cf6b6e4: netbsd:lockdebug_more
[ 1553.4798668] 0x9cf6b71c: netbsd:lockdebug_abort+0xc0  
[ 1553.4798668] 0x9cf6b73c: netbsd:mutex_abort+0x34  
[ 1553.4798668] 0x9cf6b7ac: netbsd:mutex_enter+0x580 
[ 1553.4798668] 0x9cf6b804: netbsd:pool_get+0x70 
[ 1553.4798668] 0x9cf6b854: netbsd:pool_cache_get_slow+0x1f4 
[ 1553.4798668] 0x9cf6b8a4: netbsd:pool_cache_get_paddr+0x288
[ 1553.4798668] 0x9cf6b8c4: netbsd:m_clget+0x34 
[ 1553.4798668] 0x9cf6b924: netbsd:dwc_gmac_intr+0x194 
[ 1553.4798668] 0x9cf6b93c: netbsd:gic_fdt_intr+0x2c
[ 1553.4798668] 0x9cf6b964: netbsd:pic_dispatch+0x110   
[ 1553.4798668] 0x9cf6b9c4: netbsd:armgic_irq_handler+0xf4   
[ 1553.4798668] 0x9cf6ba44: netbsd:irq_entry+0x68   
[ 1553.4798668] 0x9cf6bacc: netbsd:rw_enter+0x44c
[ 1553.4798668] 0x9cf6bc1c: netbsd:uvm_fault_internal+0x124  
[ 1553.4798668] 0x9cf6bca4: netbsd:data_abort_handler+0x1b0
[ 1553.4798668] cpu0: End traceback...

db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 27357.1 (gcc) at netbsd:nullop:  mov r0, #0  ; #0x0
db{1}> tr
0x9e365ba4: netbsd:_kernel_lock+0xc  
0x9e365be4: netbsd:pmap_extract_coherency+0x200  
0x9e365c4c: netbsd:uvm_km_kmem_alloc+0x110   
0x9e365c64: netbsd:pool_page_alloc+0x3c  
0x9e365cbc: netbsd:pool_grow+0x80
0x9e365cd4: netbsd:pool_catchup+0x30
0x9e365d2c: netbsd:pool_get+0x620  
0x9e365d7c: netbsd:pool_cache_get_slow+0x1f4
0x9e365dcc: netbsd:pool_cache_get_paddr+0x288   
0x9e365dec: netbsd:m_clget+0x34  
0x9e365e84: netbsd:sosend+0x38c 
0x9e365eac: netbsd:soo_write+0x3c
0x9e365f04: netbsd:dofilewrite+0x7c  
0x9e365f34: netbsd:sys_write+0x5c
0x9e365fac: netbsd:syscall+0x12c 

So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
mutex; cpu 1 holds the spin mutex and wants the kernel lock.
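In outline (a schematic sketch of the two code paths above, not actual
source):

	/* cpu0: non-MPSAFE dwc_gmac interrupt -> m_clget() -> pool_get() */
	KERNEL_LOCK(1, NULL);		/* taken by the interrupt dispatch */
	mutex_enter(&pp->pr_lock);	/* wants the pool's spin mutex */

	/* cpu1: pool_get() -> pool_grow() -> uvm_km_kmem_alloc() ->
	 *       pmap_extract_coherency() */
	mutex_enter(&pp->pr_lock);	/* already holds the pool mutex */
	KERNEL_LOCK(1, NULL);		/* wants the kernel lock (arm32
					 * kernel-pmap locking) -> deadlock */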

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Nick Hudson

On 19/04/2019 10:10, Manuel Bouyer wrote:

Overnight lockdebug did find something:
login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout

[ 1908.3939406] lock address : 0x90b79074 type :   spin
[ 1908.3939406] initialized  : 0x8041601c
[ 1908.3939406] shared holds :  0 exclusive:  1
[ 1908.3939406] shares wanted:  0 exclusive:  1
[ 1908.3939406] current cpu  :  0 last held:  1
[ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
[ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
[ 1908.3939406] owner field  : 0x00010500 wait/spin:0/1

[ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
[ 1908.4626458] cpu0: Begin traceback...
[ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
[ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
[ 1908.4626458] 0x9e4a195c: netbsd:snprintf
[ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
[ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
[ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
[ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
[ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
[ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
[ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
[ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
[ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
[ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
[ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
[ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
[ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
[ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
[ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
[ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
[ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
[ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
[ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
[ 1908.4626458] cpu0: End traceback...



Does show event tell you if dwc_gmac interrupts are being distributed to
both cpus? I think we need to prevent or protect against this.

Nick


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 10:34:22AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> [snip]
> 
> 
> Did you see my suggestion for getting the backtrace from the lwp on the
> "other" cpu?

Yes, it didn't give anything interesting. I guess it is because this lwp
is on CPU, so the register copy in its struct lwp is not up to date.
-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Nick Hudson

On 19/04/2019 10:10, Manuel Bouyer wrote:
[snip]


Did you see my suggestion for getting the backtrace from the lwp on the
"other" cpu?


db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:

So cpu 1 is indeed running the LWP holding the spin lock, and it looks
like it's itself waiting for a mutex.
Now I have to find why "mach cpu 1" hangs, and how to avoid it ...



The kdb_trap code is racy and needs to look more like the mips version.
Feel free to beat me to fixing it.

Nick


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
Overnight lockdebug did find something:
login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
 
[ 1908.3939406] lock address : 0x90b79074 type :   spin
[ 1908.3939406] initialized  : 0x8041601c 
[ 1908.3939406] shared holds :  0 exclusive:  1
[ 1908.3939406] shares wanted:  0 exclusive:  1
[ 1908.3939406] current cpu  :  0 last held:  1
[ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
[ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
[ 1908.3939406] owner field  : 0x00010500 wait/spin:0/1
 
[ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
[ 1908.4626458] cpu0: Begin traceback... 
[ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
[ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
[ 1908.4626458] 0x9e4a195c: netbsd:snprintf
[ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
[ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
[ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
[ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
[ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
[ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
[ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
[ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
[ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
[ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
[ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
[ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
[ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
[ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
[ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
[ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
[ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
[ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
[ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
[ 1908.4626458] cpu0: End traceback...

db{0}> tr/a 0x91fc3760
trace: pid 21381 lid 1 at 0x9e4a1dbc
0x9e4a1dbc: 914a1730
Bad frame pointer: 0x801f884c
db{0}> tr/a 0x91fc26e0
trace: pid 21532 lid 1 at 0x9cf97dbc
0x9cf97dbc: 91dfcc18

0x80416668 is
mutex_enter(&pp->pr_lock);
in pool_get().

PID     LID S CPU FLAGS   STRUCT LWP *   NAME WAIT
21562     1 2   0     0     92d177c0     gcc
21381>    1 7   0     0     91fc3760     gcc
21532>    1 7   1     0     91fc26e0     gcc
23218     1 2   1     0     92968f60     gcc


owner field  : 0x00010500
if I decode this properly, mtx_ipl is 5 and mtx_lock is 1.
5 would be IPL_VM on evbarm.

db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:

So cpu 1 is indeed running the LWP holding the spin lock, and it looks
like it's itself waiting for a mutex.
Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--