On Tue, Jul 26, 2016 at 06:43:18PM +0200, Patrick Wildt wrote:
> Hi,
> 
> I've been trying to debug a pmap issue on ARM.  I have no solution, but
> I would like to share my findings so far.
> 
> First of all, the panic/uvm_fault:
> 
> 
> login:
> uvm_fault(0xca06c858, 0, 1, 0) -> e
> Fatal kernel mode data abort: 'Translation Fault (P)'
> trapframe: 0xc986bd98
> DFSR=00000007, DFAR=0000000c, spsr=60000113
> r0 =00000000, r1 =ccaa7148, r2 =00136000, r3 =00000000
> r4 =00000012, r5 =ccaa7148, r6 =32130000, r7 =00000005
> r8 =3213000e, r9 =00000021, r10=cc634698, r11=c986be34
> r12=cd43a9a0, ssp=c986bde8, slr=c054c1d0, pc =c054c1d0
> 
> Stopped at      pmap_enter+0x178:       ldr     r6, [r0, #0x00c]
> ddb> trace
> pmap_enter+0xc
>         scp=0xc054c064 rlv=0xc04f60c0 (uvm_fault+0xa28)
>         rsp=0xc986be38 rfp=0xc986bf54
>         r10=0x00000000 r9=0x00000001 r8=0x00000000 r7=0xc050fe80
>         r6=0x00000000 r5=0xc4a5aad0 r4=0x00000001
> uvm_fault+0xc
>         scp=0xc04f56a4 rlv=0xc0547e68 (data_abort_handler+0x264)
>         rsp=0xc986bf58 rfp=0xc986bfac
>         r10=0xc986bfb0 r9=0x00000007 r8=0xc986a000 r7=0x00000001
>         r6=0xca06c858 r5=0x00000001 r4=0x00136000
> data_abort_handler+0xc
>         scp=0xc0547c10 rlv=0xc05477bc (exception_exit)
>         rsp=0xc986bfb0 rfp=0xbffc2430
>         r10=0x00da776c r9=0x00000001 r8=0x60165dbc r7=0x00000000
>         r6=0x00000200 r5=0x00d0c920 r4=0x4188e9f4
> 
> The faulting instruction is in pmap_enter() of pamp7.c:
> 
>                         if (opg) {
>                                 /*
>                                  * Replacing an existing mapping with a new 
> one.
>                                  * It is part of our managed memory so we
>                                  * must remove it from the PV list
>                                  */
>                                 pve = pmap_remove_pv(opg, pm, va);
>     pve is NULL ->              oflags = pve->pv_flags;
>                         } else
> 
> How did we come here?  pmap_enter() was called to enter mapping for a
> given virtual and physical address.  The VA is 0x136000.  Then it looked
> if there is already a mapping at the given address.  Turns out there is!
> opte (old pagetable entry) is set to 0x429f202c.  This is an "invalid"
> (means not active) page entry pointing to the physical page 0x429f2000.
> Using the vm_pages struct for that physical address, we can have a look
> at its virtual addresses.
> 
> vm_pages:
> 0xc4f86754:     429f1000
> 0xc4f86758:     0
> 0xc4f8675c:     0
> 0xc4f86760:     3
> 0xc4f86764:     0
> 0xc4f86768:     0
> 0xc4f8676c:     0
> 0xc4f86770:     c47263f0
> 0xc4f86774:     c4f86590
> 0xc4f86778:     c4a721d0
> 0xc4f8677c:     c520d200
> 0xc4f86780:     0
> 0xc4f86784:     0
> 0xc4f86788:     ca32b230
> 0xc4f8678c:     0
> 0xc4f86790:     0
> 0xc4f86794:     0
> 0xc4f86798:     140000
> 0xc4f8679c:     41
> 0xc4f867a0:     0
> 0xc4f867a4:     429f2000  <-- vm_page.phys_addr
> 0xc4f867a8:     0
> 0xc4f867ac:     cd43a9a0  <-- vm_page.mdpage.pvh_list
> 0xc4f867b0:     3
> 0xc4f867b4:     0
> 0xc4f867b8:     0
> 0xc4f867bc:     0
> 0xc4f867c0:     ffffffff
> 0xc4f867c4:     ffffffff
> 0xc4f867c8:     c4f50760
> 0xc4f867cc:     c4c67990
> 0xc4f867d0:     c50374d0
> 0xc4f867d4:     0
> 0xc4f867d8:     0
> 0xc4f867dc:     c0714248
> 0xc4f867e0:     cf2c6000
> 0xc4f867e4:     0
> 0xc4f867e8:     4d
> 0xc4f867ec:     385
> 0xc4f867f0:     1
> 0xc4f867f4:     429f3000
> 
> vm_page.mdpage.pvh_list:
> 0xcd43a9a0:     0 <-- next ptr in list
> 0xcd43a9a4:     ccaa7148 <- pmap
> 0xcd43a9a8:     717c7000 <- va
> 0xcd43a9ac:     b <- flags
> 
> This means, the physical page already has a VA.  Oh, and it belongs
> to the same pmap.  Have a look at register r5, that's the pmap we want
> to enter a mapping for.  But it's not the kernel pmap:
> 
> ddb> print kernel_pmap_store
> c073b8f4
> 
> So wait, the only VA for that physical address is 0x717c7000; but the
> VA I used to look the physical page up is 0x136000.  That is weird.
> 
> So now we got a pmap_enter() for a physical address and a virtual
> address, where the virtual address is already used by the same pmap
> for another physical address.  But that physical address thinks a
> completely different VA is using it.
> 
> The phys address behind the mapping is 0x429f2000.  The phys address
> it wanted to map the VA to is in R6: 0x32130000.
> 
> That's probably not enough to find the exact fault, but it's a writeup
> of what I have found so far.
> 
> Patrick

So, I found a suspicious place, added printfs, triggered the bug again
and voila, I think I got it.

I was looking for places where the PTE for a given VA were to be zeroed
and where that same VA would be removed from the reverse link in the
vm pages.

The following part of the code goes through each mapped virtual page
for a given physical page, and removes the mappings.  It looks at the
PTE pointer.  If there is no valid mapping, then it just removes the
VA from the list and does not zero the PTE.

But invalid and zeroed is a difference.  Zeroed means there is no
mapping.  Invalid means either "no mapping" or "mapped but not
enabled for reference emulation".

So the VA was mapped, no process accessed the page, so it never got
valid.  And in that case, the code in pmap_page_remove() removes the
VA from the vm pages, but does not change the PTE.  Because it assumed
there is no mapping anway.

There are more places in that pmap where we explicitly check for zero
and not for being valid.  Unfortunately this place was missed.

Patrick

diff --git a/sys/arch/arm/arm/pmap7.c b/sys/arch/arm/arm/pmap7.c
index 0d32bf9..e965c48 100644
--- a/sys/arch/arm/arm/pmap7.c
+++ b/sys/arch/arm/arm/pmap7.c
@@ -1155,7 +1155,7 @@ pmap_page_remove(struct vm_page *pg)
                KDASSERT(l2b != NULL);
 
                ptep = &l2b->l2b_kva[l2pte_index(pv->pv_va)];
-               if (l2pte_valid(*ptep)) {
+               if (*ptep != 0) {
                        pte = *ptep;
 
                        /* inline pmap_is_current(pm) */

Reply via email to