Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
[redirected to port-arm]

On Fri, Apr 19, 2019 at 12:47:11PM +0200, Manuel Bouyer wrote:
> On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool
> > > spin mutex; cpu 1 holds the spin mutex and wants the kernel lock.
> > 
> > AFAICT, cpu 1 is at fault here.  this locking order is
> > backwards.
> > 
> > not sure why arm32 pmap operations on the kernel map are
> > with the kernel lock instead of another lock.  the arm64
> > version of this code doesn't take any lock.
> 
> Yes, this looks like it can cause lots of problems.
> 
> Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
> aarch64 does. I will try this.

Here's what I found.
I think a per-pmap lock is needed even for the kernel pmap, because the
l2_dtable (and related) structures contain statistics that need to be
kept coherent. For user pmaps, pm_lock protects them.
For the kernel pmap we can't use pm_lock: since pmap_kenter/pmap_kremove
can be called from interrupt context, the lock needs to be a mutex at IPL_VM.
So I added a kpm_lock kmutex, and use it in pmap_acquire_pmap_lock() for
the kernel pmap.
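
For reference, this is roughly what the new lock looks like (a sketch only;
the real initialization lives in pmap_bootstrap(), and the helper name below
is made up):

#include <sys/param.h>
#include <sys/mutex.h>

/* new IPL_VM mutex protecting the kernel pmap (sketch) */
kmutex_t kpm_lock __cacheline_aligned;

static void
pmap_kpm_lock_init(void)	/* hypothetical helper, for illustration only */
{
	/*
	 * MUTEX_DEFAULT at IPL_VM gives a spin mutex that blocks the
	 * interrupts from which pmap_kenter/pmap_kremove can be called.
	 */
	mutex_init(&kpm_lock, MUTEX_DEFAULT, IPL_VM);
}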

Then we need to be careful not to call any path that could sleep with this
mutex held. This needs some adjustments in pmap_enter() (don't call
pool_get()/pool_put() with the pmap locked) and pmap_remove()
(defer pool_put() until after the pmap lock is released).
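
The pmap_remove() change follows roughly this pattern (a sketch with
illustrative names; the pv list handling and the pmap_pv_pool name are
assumptions, see the actual diff below):

#include <sys/param.h>
#include <sys/pool.h>
#include <uvm/uvm_extern.h>

static void
pmap_remove_sketch(pmap_t pm, vaddr_t sva, vaddr_t eva)
{
	struct pv_entry *opv = NULL;	/* pv entries released while locked */

	pmap_acquire_pmap_lock(pm);
	/*
	 * ... unmap the range; instead of pool_put()ing each released
	 * pv entry here, chain it onto opv ...
	 */
	pmap_release_pmap_lock(pm);

	/* only now go back to the pool, which may take other locks */
	while (opv != NULL) {
		struct pv_entry *npv = opv->pv_next;
		pool_put(&pmap_pv_pool, opv);
		opv = npv;
	}
}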

While there, also use splvm() instead of splhigh() in pmap_growkernel(),
as x86 does.
Also, rename pmap_lock to pmap_pg_lock. As it can be taken with the
kernel pmap lock held, it also needs to be an IPL_VM kmutex (in all cases).
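
i.e. something like this (a sketch of the two small changes; the fragment
name is made up and not the real function):

#include <sys/param.h>
#include <sys/intr.h>

/*
 * pmap_pg_lock is now a spin mutex initialized at IPL_VM:
 *	mutex_init(&pmap_pg_lock, MUTEX_DEFAULT, IPL_VM);
 */
static void
pmap_growkernel_fragment(void)	/* hypothetical fragment, for illustration */
{
	int s;

	s = splvm();		/* was splhigh() */
	/* ... allocate and map the new kernel L2 tables ... */
	splx(s);
}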

It also includes the patch I posted earlier today to work around deadlocks
in ddb.

I've been running a DIAGNOSTIC+LOCKDEBUG kernel on my lime2 allwinner A20
board, and it did build a few packages (using distcc, and a SATA disk for
local storage) without problems, so at this point it looks like
an improvement. I'll keep it running over the weekend and, unless I get
negative feedback, I'll commit it early next week.


-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--
Index: arm32/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/arm/arm32/pmap.c,v
retrieving revision 1.371
diff -u -p -u -r1.371 pmap.c
--- arm32/pmap.c	28 Oct 2018 14:59:17 -0000	1.371
+++ arm32/pmap.c	19 Apr 2019 15:55:30 -0000
@@ -217,6 +217,10 @@
 
 #include 
 
+#ifdef DDB
+#include 
+#endif
+
 __KERNEL_RCSID(0, "$NetBSD: pmap.c,v 1.371 2018/10/28 14:59:17 skrll Exp $");
 
 //#define PMAP_DEBUG
@@ -516,7 +520,8 @@ static size_t cnptes;
 #endif
 vaddr_t memhook;   /* used by mem.c & others */
 kmutex_t memlock __cacheline_aligned;  /* used by mem.c & others */
-kmutex_t pmap_lock __cacheline_aligned;
+kmutex_t pmap_pg_lock __cacheline_aligned;
+kmutex_t kpm_lock __cacheline_aligned;
 extern void *msgbufaddr;
 int pmap_kmpages;
 /*
@@ -538,9 +543,14 @@ vaddr_t pmap_directlimit;
 static inline void
 pmap_acquire_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+   if (__predict_false(db_onproc != NULL))
+   return;
+#endif
+   
if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-   KERNEL_LOCK(1, NULL);
+   mutex_enter(&kpm_lock);
 #endif
} else {
mutex_enter(pm->pm_lock);
@@ -550,9 +560,13 @@ pmap_acquire_pmap_lock(pmap_t pm)
 static inline void
 pmap_release_pmap_lock(pmap_t pm)
 {
+#if defined(MULTIPROCESSOR) && defined(DDB)
+   if (__predict_false(db_onproc != NULL))
+   return;
+#endif
if (pm == pmap_kernel()) {
 #ifdef MULTIPROCESSOR
-   KERNEL_UNLOCK_ONE(NULL);
+   mutex_exit(&kpm_lock);
 #endif
} else {
mutex_exit(pm->pm_lock);
@@ -562,20 +576,20 @@ pmap_release_pmap_lock(pmap_t pm)
 static inline void
 pmap_acquire_page_lock(struct vm_page_md *md)
 {
-   mutex_enter(&pmap_lock);
+   mutex_enter(&pmap_pg_lock);
 }
 
 static inline void
 pmap_release_page_lock(struct vm_page_md *md)
 {
-   mutex_exit(&pmap_lock);
+   mutex_exit(&pmap_pg_lock);
 }
 
 #ifdef DIAGNOSTIC
 static inline int
 pmap_page_locked_p(struct vm_page_md *md)
 {
-   return mutex_owned(&pmap_lock);
+   return mutex_owned(&pmap_pg_lock);
 }
 #endif
 
@@ -3057,6 +3071,10 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 #else
const bool vector_page_p = (va == vector_page);
 #endif
+   struct pmap_page *pp = pmap_pv_tracked(pa);
+   struct pv_entry *new_pv = NULL;
+   struct pv_entry *old_pv = NULL;
+   int error = 0;
 
UVMHIST_FUNC(__func__); UVMHIST_CALLED(maphist);
 
@@ -3072,6 +3090,12 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
 * test for a managed page by checking pg != NULL.
 */
pg = pmap_initialized ? PHYS_TO_VM_PAGE(pa) : NULL;
+   /*
+* if we may need a new pv entry 

Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 08:33:00PM +1000, matthew green wrote:
> > So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> > mutex; cpu 1 holds the spin mutex and wants the kernel lock.
> 
> AFAICT, cpu 1 is at fault here.  this locking order is
> backwards.
> 
> not sure why arm32 pmap operations on the kernel map are
> with the kernel lock instead of another lock.  the arm64
> version of this code doesn't take any lock.

Yes, this looks like it can cause lots of problems.

Indeed, I suspect the kernel_lock here could be replaced with pm_lock as
aarch64 does. I will try this.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


re: evbarm hang

2019-04-19 Thread matthew green
> So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
> mutex; cpu 1 holds the spin mutex and wants the kernel lock.

AFAICT, cpu 1 is at fault here.  this locking order is
backwards.

not sure why arm32 pmap operations on the kernel map are
with the kernel lock instead of another lock.  the arm64
version of this code doesn't take any lock.


.mrg.


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 11:13:40AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> > Overnight lockdebug did find something:
> > login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
> > 
> > [ 1908.3939406] lock address : 0x90b79074 type :   
> > spin
> > [ 1908.3939406] initialized  : 0x8041601c
> > [ 1908.3939406] shared holds :  0 exclusive:
> >   1
> > [ 1908.3939406] shares wanted:  0 exclusive:
> >   1
> > [ 1908.3939406] current cpu  :  0 last held:
> >   1
> > [ 1908.3939406] current lwp  : 0x91fc3760 last held: 
> > 0x91fc26e0
> > [ 1908.3939406] last locked* : 0x80416668 unlocked : 
> > 0x804169e8
> > [ 1908.3939406] owner field  : 0x00010500 wait/spin:
> > 0/1
> > 
> > [ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: 
> > spinout
> > [ 1908.4626458] cpu0: Begin traceback...
> > [ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
> > [ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
> > [ 1908.4626458] 0x9e4a195c: netbsd:snprintf
> > [ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
> > [ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
> > [ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
> > [ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
> > [ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
> > [ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
> > [ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
> > [ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
> > [ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
> > [ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
> > [ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
> > [ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
> > [ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
> > [ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
> > [ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
> > [ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
> > [ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
> > [ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
> > [ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
> > [ 1908.4626458] cpu0: End traceback...
> > 
> 
> Does show event tell you if dwc_gmac interrupts are being distributed to
> both cpus?

Looks like they are

> I think we need to prevent or protect against this.

but the IRQ handler is not registered as MPSAFE, so it should run under the
kernel lock
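
For reference, whether the handler runs under the kernel lock is decided when
the interrupt is established; roughly (a sketch assuming the fdt interrupt
API, with made-up function names, not the actual dwc_gmac attachment code):

#include <sys/param.h>
#include <dev/fdt/fdtvar.h>

static int
gmac_intr_sketch(void *arg)	/* stand-in for the real handler */
{
	/* an MPSAFE handler would have to do all of its own locking */
	return 1;
}

static void
gmac_establish_sketch(int phandle, void *sc)
{
	/*
	 * Today: flags == 0, so the dispatch code wraps the handler in
	 * KERNEL_LOCK()/KERNEL_UNLOCK_ONE() and it is serialized against
	 * other kernel-locked code even if it fires on either CPU.
	 */
	(void)fdtbus_intr_establish(phandle, 0, IPL_NET, 0,
	    gmac_intr_sketch, sc);

	/* MPSAFE variant: pass FDT_INTR_MPSAFE and drop the kernel lock
	(void)fdtbus_intr_establish(phandle, 0, IPL_NET, FDT_INTR_MPSAFE,
	    gmac_intr_sketch, sc);
	*/
}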

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 11:10:07AM +0200, Manuel Bouyer wrote:
> [...]
> So cpu 1 is indeed running the LWP holding the spin lock, and it looks
> like it's itself waiting for a mutex.
> Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

I found the reason for the hang (I will send a separate mail to port-arm@
about it).

So now I have stack traces for both CPUs:
[ 1553.4798668] cpu0: Begin traceback...   
[ 1553.4798668] 0x9cf6b674: netbsd:db_panic+0x14  
[ 1553.4798668] 0x9cf6b68c: netbsd:vpanic+0x194  
[ 1553.4798668] 0x9cf6b6a4: netbsd:snprintf
[ 1553.4798668] 0x9cf6b6e4: netbsd:lockdebug_more
[ 1553.4798668] 0x9cf6b71c: netbsd:lockdebug_abort+0xc0  
[ 1553.4798668] 0x9cf6b73c: netbsd:mutex_abort+0x34  
[ 1553.4798668] 0x9cf6b7ac: netbsd:mutex_enter+0x580 
[ 1553.4798668] 0x9cf6b804: netbsd:pool_get+0x70 
[ 1553.4798668] 0x9cf6b854: netbsd:pool_cache_get_slow+0x1f4 
[ 1553.4798668] 0x9cf6b8a4: netbsd:pool_cache_get_paddr+0x288
[ 1553.4798668] 0x9cf6b8c4: netbsd:m_clget+0x34 
[ 1553.4798668] 0x9cf6b924: netbsd:dwc_gmac_intr+0x194 
[ 1553.4798668] 0x9cf6b93c: netbsd:gic_fdt_intr+0x2c
[ 1553.4798668] 0x9cf6b964: netbsd:pic_dispatch+0x110   
[ 1553.4798668] 0x9cf6b9c4: netbsd:armgic_irq_handler+0xf4   
[ 1553.4798668] 0x9cf6ba44: netbsd:irq_entry+0x68   
[ 1553.4798668] 0x9cf6bacc: netbsd:rw_enter+0x44c
[ 1553.4798668] 0x9cf6bc1c: netbsd:uvm_fault_internal+0x124  
[ 1553.4798668] 0x9cf6bca4: netbsd:data_abort_handler+0x1b0
[ 1553.4798668] cpu0: End traceback...

db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 27357.1 (gcc) at netbsd:nullop:  mov r0, #0  ; #0x0
db{1}> tr
0x9e365ba4: netbsd:_kernel_lock+0xc  
0x9e365be4: netbsd:pmap_extract_coherency+0x200  
0x9e365c4c: netbsd:uvm_km_kmem_alloc+0x110   
0x9e365c64: netbsd:pool_page_alloc+0x3c  
0x9e365cbc: netbsd:pool_grow+0x80
0x9e365cd4: netbsd:pool_catchup+0x30
0x9e365d2c: netbsd:pool_get+0x620  
0x9e365d7c: netbsd:pool_cache_get_slow+0x1f4
0x9e365dcc: netbsd:pool_cache_get_paddr+0x288   
0x9e365dec: netbsd:m_clget+0x34  
0x9e365e84: netbsd:sosend+0x38c 
0x9e365eac: netbsd:soo_write+0x3c
0x9e365f04: netbsd:dofilewrite+0x7c  
0x9e365f34: netbsd:sys_write+0x5c
0x9e365fac: netbsd:syscall+0x12c 

So here's our deadlock: cpu 0 holds the kernel lock and wants the pool spin
mutex; cpu 1 holds the spin mutex and wants the kernel lock.
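
The same AB/BA ordering can be reproduced in a few lines of userland C with
two threads, purely as an illustration of the inversion (nothing
NetBSD-specific in it):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;	/* stands in for KERNEL_LOCK */
static pthread_mutex_t pr_lock = PTHREAD_MUTEX_INITIALIZER;		/* stands in for the pool's pr_lock */

static void *
cpu0_path(void *arg)	/* interrupt path: kernel lock first, then pr_lock */
{
	pthread_mutex_lock(&kernel_lock);
	sleep(1);				/* widen the race window */
	pthread_mutex_lock(&pr_lock);		/* blocks forever */
	pthread_mutex_unlock(&pr_lock);
	pthread_mutex_unlock(&kernel_lock);
	return NULL;
}

static void *
cpu1_path(void *arg)	/* syscall path: pr_lock first, then kernel lock */
{
	pthread_mutex_lock(&pr_lock);
	sleep(1);
	pthread_mutex_lock(&kernel_lock);	/* blocks forever */
	pthread_mutex_unlock(&kernel_lock);
	pthread_mutex_unlock(&pr_lock);
	return NULL;
}

int
main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0_path, NULL);
	pthread_create(&t1, NULL, cpu1_path, NULL);
	pthread_join(t0, NULL);		/* never returns: classic lock-order inversion */
	pthread_join(t1, NULL);
	printf("not reached\n");
	return 0;
}

Run it and each thread blocks on its second lock; that is exactly the state
the two CPUs are in above.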

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Nick Hudson

On 19/04/2019 10:10, Manuel Bouyer wrote:

Overnight lockdebug did find something:
login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout

[ 1908.3939406] lock address : 0x90b79074 type :   spin
[ 1908.3939406] initialized  : 0x8041601c
[ 1908.3939406] shared holds :  0 exclusive:  1
[ 1908.3939406] shares wanted:  0 exclusive:  1
[ 1908.3939406] current cpu  :  0 last held:  1
[ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
[ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
[ 1908.3939406] owner field  : 0x00010500 wait/spin:0/1

[ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
[ 1908.4626458] cpu0: Begin traceback...
[ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
[ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
[ 1908.4626458] 0x9e4a195c: netbsd:snprintf
[ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
[ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
[ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
[ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
[ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
[ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
[ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
[ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
[ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
[ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
[ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
[ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
[ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
[ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
[ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
[ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
[ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
[ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
[ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
[ 1908.4626458] cpu0: End traceback...



Does show event tell you if dwc_gmac interrupts are being distributed to
both cpus? I think we need to prevent or protect against this.

Nick


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
On Fri, Apr 19, 2019 at 10:34:22AM +0100, Nick Hudson wrote:
> On 19/04/2019 10:10, Manuel Bouyer wrote:
> [snip]
> 
> 
> Did you see my suggestion for getting the backtrace from the lwp on the
> "other" cpu?

Yes, it didn't give anything interesting. I guess that's because this lwp
is on a CPU, so the register copy in its struct lwp is not up to date.
-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-19 Thread Nick Hudson

On 19/04/2019 10:10, Manuel Bouyer wrote:
[snip]


Did you see my suggestion for getting the backtrace from the lwp on the
"other" cpu?


db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:

So cpu 1 is indeed running the LWP holding the spin lock, and it looks
like it's itself waiting for a mutex.
Now I have to find why "mach cpu 1" hangs, and how to avoid it ...



The kdb_trap code is racy and needs to look more like the mips version.
Feel free to beat me to fixing it.

Nick


Re: evbarm hang

2019-04-19 Thread Manuel Bouyer
Overnight lockdebug did find something:
login: [ 1908.3939406] Mutex error: mutex_vector_enter,504: spinout
 
[ 1908.3939406] lock address : 0x90b79074 type :   spin
[ 1908.3939406] initialized  : 0x8041601c 
[ 1908.3939406] shared holds :  0 exclusive:  1
[ 1908.3939406] shares wanted:  0 exclusive:  1
[ 1908.3939406] current cpu  :  0 last held:  1
[ 1908.3939406] current lwp  : 0x91fc3760 last held: 0x91fc26e0
[ 1908.3939406] last locked* : 0x80416668 unlocked : 0x804169e8
[ 1908.3939406] owner field  : 0x00010500 wait/spin:0/1
 
[ 1908.4626458] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,504: spinout
[ 1908.4626458] cpu0: Begin traceback... 
[ 1908.4626458] 0x9e4a192c: netbsd:db_panic+0x14
[ 1908.4626458] 0x9e4a1944: netbsd:vpanic+0x194
[ 1908.4626458] 0x9e4a195c: netbsd:snprintf
[ 1908.4626458] 0x9e4a199c: netbsd:lockdebug_more
[ 1908.4626458] 0x9e4a19d4: netbsd:lockdebug_abort+0xc0
[ 1908.4626458] 0x9e4a19f4: netbsd:mutex_abort+0x34
[ 1908.4626458] 0x9e4a1a64: netbsd:mutex_enter+0x580
[ 1908.4626458] 0x9e4a1abc: netbsd:pool_get+0x70
[ 1908.4626458] 0x9e4a1b0c: netbsd:pool_cache_get_slow+0x1f4
[ 1908.4626458] 0x9e4a1b5c: netbsd:pool_cache_get_paddr+0x288
[ 1908.4626458] 0x9e4a1b7c: netbsd:m_clget+0x34
[ 1908.4626458] 0x9e4a1bdc: netbsd:dwc_gmac_intr+0x194
[ 1908.4626458] 0x9e4a1bf4: netbsd:gic_fdt_intr+0x2c
[ 1908.4626458] 0x9e4a1c1c: netbsd:pic_dispatch+0x110
[ 1908.4626458] 0x9e4a1c7c: netbsd:armgic_irq_handler+0xf4
[ 1908.4626458] 0x9e4a1db4: netbsd:irq_entry+0x68
[ 1908.4626458] 0x9e4a1dec: netbsd:tcp_send_wrapper+0x9c
[ 1908.4626458] 0x9e4a1e84: netbsd:sosend+0x6fc
[ 1908.4626458] 0x9e4a1eac: netbsd:soo_write+0x3c
[ 1908.4626458] 0x9e4a1f04: netbsd:dofilewrite+0x7c
[ 1908.4626458] 0x9e4a1f34: netbsd:sys_write+0x5c
[ 1908.4626458] 0x9e4a1fac: netbsd:syscall+0x12c
[ 1908.4626458] cpu0: End traceback...

db{0}> tr/a 0x91fc3760
trace: pid 21381 lid 1 at 0x9e4a1dbc
0x9e4a1dbc: 914a1730
Bad frame pointer: 0x801f884c
db{0}> tr/a 0x91fc26e0
trace: pid 21532 lid 1 at 0x9cf97dbc
0x9cf97dbc: 91dfcc18

0x80416668 is
mutex_enter(&pp->pr_lock);
in pool_get().

PID    LID S CPU FLAGS   STRUCT LWP *   NAME    WAIT
21562    1 2   0     0   92d177c0       gcc
21381>   1 7   0     0   91fc3760       gcc
21532>   1 7   1     0   91fc26e0       gcc
23218    1 2   1     0   92968f60       gcc


owner field  : 0x00010500
if I decode this properly, mtx_ipl is 5 and mtx_lock is 1.
5 would be IPL_VM on evbarm.
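
(That decoding assumes the kmutex union is laid out as (mtxs_dummy,
mtxs_ipl, mtxs_lock, unused) packed into one little-endian word on arm32;
a quick userland check of the arithmetic:)

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t owner = 0x00010500;		/* value reported by LOCKDEBUG */
	uint8_t dummy = owner & 0xff;		/* 0x00 */
	uint8_t ipl   = (owner >> 8) & 0xff;	/* 0x05 -> IPL_VM on evbarm */
	uint8_t lock  = (owner >> 16) & 0xff;	/* 0x01 -> locked */

	printf("dummy=%u ipl=%u lock=%u\n", dummy, ipl, lock);
	return 0;
}

which prints dummy=0 ipl=5 lock=1, matching the decode above.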

db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 21532.1 (gcc) at netbsd:_kernel_lock+0x19c:

So cpu 1 is indeed running the LWP holding the spin lock, and it looks
like it's itself waiting for a mutex.
Now I have to find why "mach cpu 1" hangs, and how to avoid it ...

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: evbarm hang

2019-04-18 Thread Manuel Bouyer
On Thu, Apr 18, 2019 at 04:32:56PM +0200, Manuel Bouyer wrote:
> [...]
> 
> (The common point is the dwc_gmac_intr() call, which ends up in pool_get().
> pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
> just before the startover: label.)
> Unfortunately I can't get a stack trace for CPU 1:
> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:
> 
> Any idea how to proceed from here?

Here's another instance. This one was with LOCKDEBUG, but LOCKDEBUG didn't
trigger.
0x9e06d8cc: netbsd:comintr+0xc   
0x9e06d8e4: netbsd:gic_fdt_intr+0x2c
0x9e06d90c: netbsd:pic_dispatch+0x38
0x9e06d96c: netbsd:armgic_irq_handler+0xf4
0x9e06da2c: netbsd:irq_entry+0x68
0x9e06da84: netbsd:pool_get+0x70
0x9e06dad4: netbsd:pool_cache_get_slow+0x1f4
0x9e06db24: netbsd:pool_cache_get_paddr+0x288   
0x9e06db44: netbsd:m_clget+0x34
0x9e06dba4: netbsd:dwc_gmac_intr+0x194
0x9e06dbbc: netbsd:gic_fdt_intr+0x2c
0x9e06dbe4: netbsd:pic_dispatch+0x110  
0x9e06dc44: netbsd:armgic_irq_handler+0xf4
0x9e06dcf4: netbsd:irq_entry+0x68
0x9e06dd2c: netbsd:pmap_extract_coherency+0x16c
0x9e06dd7c: netbsd:pool_cache_get_slow+0x254
0x9e06ddcc: netbsd:pool_cache_get_paddr+0x288
0x9e06ddec: netbsd:m_clget+0x34
0x9e06de84: netbsd:sosend+0x38c
0x9e06deac: netbsd:soo_write+0x3c   
0x9e06df04: netbsd:dofilewrite+0x7c
0x9e06df34: netbsd:sys_write+0x5c
0x9e06dfac: netbsd:syscall+0x12c

PID    LID S CPU FLAGS   STRUCT LWP *   NAME    WAIT
13250    1 2   1     0   92ba1d20       gcc
25370    1 2   0     0   929d8c80       gcc
16746>   1 7   0     0   91fc1a20       gcc
15636    1 2   0     4   91b5d1c0       cc1plus
15822>   1 7   1     0   92db9d40       gcc

db{0}> tr/a 92db9d40
trace: pid 15822 lid 1 at 0x99feddbc
0x99feddbc: 91af4828
db{0}> tr/a 91fc1a20
trace: pid 16746 lid 1 at 0x9e06ddbc
0x9e06ddbc: 922c8ec0

Maybe more interesting:
0   21 3   1   204   90bf1640   xcall/1    xcall
0   20 1   1   200   90bf1380   softser/1
0   19 1   1   200   90bf10c0   softclk/1
0   18 1   1   200   90bf0e00   softbio/1
0   17 1   1   200   90bf0b40   softnet/1
0   16 1   1   201   90bf0880   idle/1
db{0}> tr/a 90bf0b40
trace: pid 0 lid 17 at 0x99e83e24
0x99e83e24: netbsd:pic_do_pending_ints+0x648
0x99e83eb4: netbsd:pic_do_pending_ints+0x614
0x99e83ee4: netbsd:splx+0x64
0x99e83f44: netbsd:canintr+0x290
0x99e83fac: netbsd:softint_dispatch+0x22c
0x9e1f9dbc: netbsd:softint_switch+0x58
0x9e1f9de4: netbsd:_splraise+0x3c
--- tf 0x99e83fb0 ---

canintr+0x290 is:
mutex_exit(softnet_lock);

(gdb) l *(splx+0x64)
0x80004270 is in splx 
(/dsk/l1/misc/bouyer/HEAD/clean/src/sys/arch/arm/pic/pic_splfuncs.c:96).
91  "splx(%d) to a higher ipl than %d", savedipl, ci->ci_cpl);
92
93  ci->ci_intr_depth++;
94  pic_do_pending_ints(psw, savedipl, NULL);
95  ci->ci_intr_depth--;
96  KASSERTMSG(ci->ci_cpl == savedipl, "cpl %d savedipl %d",
97  ci->ci_cpl, savedipl);
(gdb) l *(pic_do_pending_ints+0x614)
0x800031f4 is in pic_do_pending_ints 
(/dsk/l1/misc/bouyer/HEAD/clean/src/sys/arch/arm/pic/pic.c:583).
578 if (newipl == IPL_NONE && (ci->ci_astpending & __BIT(1))) {
579 pic_set_priority(ci, IPL_SCHED);
580 kpreempt(0);
581 }
582 #endif
583 if (ci->ci_cpl != newipl)
584 pic_set_priority(ci, newipl);
585 }
586
587 static void

My guess is that we did call kpreempt() here and switched to another lwp.
Is it possible that we did so with a spin lock held?

pic_do_pending_ints+0x648 is in pic_set_priority() (gdb points to
ci->ci_cpl = newipl;
but it may be the call to pic_list[0]->pic_ops->pic_set_priority() just
before).

From "show all locks /t", this one could be interesting:
Locks held on CPU 1:
Lock 0 (initialized at pool_init)
lock address : 0x90b79074 type :   spin
initialized  : 0x8041582c
shared holds :  0 exclusive:  1
shares wanted:  0 exclusive:  1
current cpu  :  0 last held:  1
current lwp  : 0x91fc1a20 last held: 0x92db9d40
last locked* : 0x80415e78 unlocked : 0x804161f8
[ 1447.2013328] owner field  : 0x00010500 wait/spin:0/1
trace: pid 15822 lid 1 at 0x99feddbc
0x99feddbc: 91af4828

0x80415e78 is also the mutex_enter(&pp->pr_lock); line in pool_get().

Trying to switch to cpu 1 hung, as usual.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--

Re: evbarm hang

2019-04-18 Thread Nick Hudson

On 18/04/2019 15:32, Manuel Bouyer wrote:

Hello,
I got several hard hangs while building packages on an allwinner a20 (lime2).
I use distcc, so there is moderate network traffic in addition to the
local CPU load.

Most of the time I can't enter ddb from the serial console; I just get
Stopped in pid 25136.1 (cc1) at netbsd:cpu_Debugger+0x4:

On 2 occasions I could enter ddb, and both times the stack trace on CPU 0
was similar:

0x9cea7c0c: netbsd:pool_get+0x70 <-- hang here
0x9cea7c5c: netbsd:pool_cache_get_slow+0x1f4
0x9cea7cac: netbsd:pool_cache_get_paddr+0x288
0x9cea7ccc: netbsd:m_clget+0x34
0x9cea7d2c: netbsd:dwc_gmac_intr+0x194
0x9cea7d44: netbsd:gic_fdt_intr+0x2c
0x9cea7d6c: netbsd:pic_dispatch+0x110
0x9cea7dcc: netbsd:armgic_irq_handler+0xf4
0x9cea7e3c: netbsd:irq_entry+0x68
0x9cea7e7c: netbsd:uvm_unmap_detach+0x80
0x9cea7eac: netbsd:uvmspace_free+0x100
0x9cea7f14: netbsd:exit1+0x190
0x9cea7f34: netbsd:sys_exit+0x3c
0x9cea7fac: netbsd:syscall+0x12c


(The common point is the dwc_gmac_intr() call, which ends up in pool_get().
pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
just before the startover: label.)
Unfortunately I can't get a stack trace for CPU 1:
db{0}> mach cpu 1
kdb_trap: switching to cpu1
Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:


Maybe you can get it by looking for the lwp on the other cpu using ps/l
and using bt/a?


Nick


Re: evbarm hang

2019-04-18 Thread Manuel Bouyer
On Thu, Apr 18, 2019 at 04:32:56PM +0200, Manuel Bouyer wrote:
> Hello,
> I got several hard hangs while building packages on an allwinner a20 (lime2).
> I use distcc, so there is moderate network traffic in addition to the
> local CPU load.
> 
> Most of the time I can't enter ddb from the serial console; I just get
> Stopped in pid 25136.1 (cc1) at netbsd:cpu_Debugger+0x4:
> 
> On 2 occasions I could enter ddb, and both times the stack trace on CPU 0
> was similar:
> 
> 0x9cea7c0c: netbsd:pool_get+0x70 <-- hang here
> 0x9cea7c5c: netbsd:pool_cache_get_slow+0x1f4
> 0x9cea7cac: netbsd:pool_cache_get_paddr+0x288
> 0x9cea7ccc: netbsd:m_clget+0x34
> 0x9cea7d2c: netbsd:dwc_gmac_intr+0x194
> 0x9cea7d44: netbsd:gic_fdt_intr+0x2c
> 0x9cea7d6c: netbsd:pic_dispatch+0x110
> 0x9cea7dcc: netbsd:armgic_irq_handler+0xf4
> 0x9cea7e3c: netbsd:irq_entry+0x68
> 0x9cea7e7c: netbsd:uvm_unmap_detach+0x80
> 0x9cea7eac: netbsd:uvmspace_free+0x100
> 0x9cea7f14: netbsd:exit1+0x190
> 0x9cea7f34: netbsd:sys_exit+0x3c
> 0x9cea7fac: netbsd:syscall+0x12c
> 
> 
> (The common point is the dwc_gmac_intr() call, which ends up in pool_get().
> pool_get() seems to spin on the mutex_enter(&pp->pr_lock); call
> just before the startover: label.)
> Unfortunately I can't get a stack trace for CPU 1:
> db{0}> mach cpu 1
> kdb_trap: switching to cpu1
> Stopped in pid 26110.1 (gcc) at netbsd:_kernel_lock+0xc4:
> 
> Any idea how to proceed from here?

I forgot to mention: I tried a LOCKDEBUG kernel, but the problem didn't
show up ...

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--