Re: Consistent Kernel Panic-Hardware-Related?

Mark Kettenis Wed, 24 Jul 2013 02:54:06 -0700

> Date: Mon, 8 Jul 2013 11:06:51 +0200
> From: Christian Ehrhardt <[email protected]>
> 
> Hi,


Hi Christian,

Taking this to tech@ in the hope some more people will look into this.

> On Thu, Jul 04, 2013 at 09:56:56AM -0700, Scott Vanderbilt wrote:
> > I've been trying to build userland repeatedly over the past few days
> > on a particular machine and consistently get kernel panics, though
> > never at exactly the same point in the process. The latest occurred
> > midway through 'make obj'. Attempts to build userland on another
> > i386 machine from code pulled via cvs at more or less the same time
> > works fine, so it seems the issue is isolated to this hardware.
> >
> > I initially suspected my SSD had gone bad, so I replaced it with a
> > brand new drive. However, the issue persists, so I no longer suspect
> > the drive.
> >
> > A ps, trace, and dmesg are provided below. This is my first
> > reporting a bug of this nature. I hope I've followed procedure. If
> > not, please do let me know. I'm trying to be useful. :-)
> >
> > -------------------------------------------------------------------------
> >
> > panic: pmap_remove_ptes: managed page without PG_PVLIST for 0x3c001000
> > Stopped at      Debugger+0x4:   popl    %ebp
> > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> >
> > ddb> show panic
> > pmap_remove_ptes: managed page without PG_PVLIST for 0x3c001000
> >
> > ddb> trace
> > Debugger(d0963718,f6269e38,d0966be4,f6269e38,d1cf1040) at Debugger+0x4
> > panic(d0966be4,3c001000,d1ceb16c,f6269e4c,0) at panic+0x5d
> > pmap_remove_ptes(d9e39798,d1cf1040,ffcf0000,3c000000,3c003000) at pmap_remove_ptes+0x142
> > pmap_do_remove(d9e39798,3c000000,3c003000,0,d0ad7820) at pmap_do_remove+0xeb
> > pmap_remove(d9e39798,3c000000,3c003000,d056c4e9,d9c68e1c) at pmap_remove+0x27
> > uvm_unmap_kill_entry(d9e3ad80,d9c68e1c,f6269f2c,d043a597,0) at uvm_unmap_kill_entry+0xf8
> > uvm_map_teardown(d9e3ad80,1,4,d093e66e,d9cc2700) at uvm_map_teardown+0xac
> > uvmspace_free(d9e3ad80,1,1,f6269f6c,d0203009) at uvmspace_free+0x2e
> > uvm_exit(d9cc3ba4,d0a4e0a8,4,d093e66e,0) at uvm_exit+0x15
> > reaper(d9e33004) at reaper+0x8a
> > Bad frame pointer: 0xd0c3ce68
> 
> Can you try to see if the following patch helps? It did for me, when
> I was debugging a similar panic back in December. However, my
> explanation why the patch would fix this bug, turned out to be invalid.
> Still the bug went away. If the same happens for you, some more people
> should have a look at the patch:
> 
> --- /mount/blink/aegis/project/gg/history/os/src/sys/arch/i386/i386/pmap.c   2012/10/16 18:31:28   1.117
> +++ /mount/blink/aegis/project/gg/history/os/src/sys/arch/i386/i386/pmap.c   2013/01/24 17:20:06   1.118
> @@ -495,7 +495,7 @@ pmap_map_ptes(struct pmap *pmap)
> 
>       /* need to load a new alternate pt space into curpmap? */
>       opde = *APDP_PDE;
> -#if defined(MULTIPROCESSOR) && defined(DIAGNOSTIC)
> +#if defined(DIAGNOSTIC)
>       if (pmap_valid_entry(opde))
>               panic("pmap_map_ptes: APTE valid");
>  #endif
> @@ -521,10 +521,8 @@ pmap_unmap_ptes(struct pmap *pmap)
>       if (pmap_is_curpmap(pmap)) {
>               simple_unlock(&pmap->pm_obj.vmobjlock);
>       } else {
> -#if defined(MULTIPROCESSOR)
>               *APDP_PDE = 0;
>               pmap_apte_flush();
> -#endif
>               simple_unlock(&pmap->pm_obj.vmobjlock);
>               simple_unlock(&curpcb->pcb_pmap->pm_obj.vmobjlock);
>       }

Wish somebody with more in-depth knowledge about the i386 pmap
implementation would respond :(.

Your diff basically disables an optimization where the alternate pmap
is kept around in case we need it again.  Not sure how important this
optimization is.  I guess the primary user of the alternate pmap is
the reaper, and keeping the alternate pmap around there could be
beneficial if the address space of the process we're reaping is
heavily fragmented.

There is something fishy with this optimization.  Without
MULTIPROCESSOR, *APDP_PDE is never cleared, which means that it
becomes stale after the process exits.
Presumably we'd notice the next time we try to map an alternate pmap,
but if the physical pages for the pmap get recycled, we might not.
Not quite seeing how this leads to that panic, but perhaps we should
clear *APDP_PDE in pmap_switch()?
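
To make the suspected hazard concrete, here is a toy model in C.  It
is not the i386 pmap code: every toy_* name is hypothetical, and the
reuse check (slot valid and frame equal to the new page directory) is
an assumption about how the cached alternate mapping is recognised.
It only shows how a recycled page directory frame could be mistaken
for the cached one, and how clearing the slot, either on unmap as in
the diff or on a switch as suggested above, would avoid that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: every name below is hypothetical, not kernel code. */
#define TOY_PG_V     0x00000001u  /* stand-in for the PDE valid bit */
#define TOY_PG_FRAME 0xfffff000u  /* stand-in for the frame mask */

enum toy_policy { TOY_LAZY, TOY_CLEAR_ON_UNMAP, TOY_CLEAR_ON_SWITCH };

static uint32_t toy_apdp_pde;     /* stand-in for *APDP_PDE */

/*
 * Map a page directory through the alternate slot.
 * Returns true when the cached slot is reused as-is (assumed check).
 */
static bool toy_map_alternate(uint32_t pdir_pa)
{
    uint32_t opde = toy_apdp_pde;

    assert((pdir_pa & ~TOY_PG_FRAME) == 0);  /* must be page aligned */
    if ((opde & TOY_PG_V) && (opde & TOY_PG_FRAME) == pdir_pa)
        return true;              /* looks cached: nothing is reloaded */
    toy_apdp_pde = pdir_pa | TOY_PG_V;
    return false;
}

/* Unmapping clears the slot only under the clear-on-unmap policy. */
static void toy_unmap_alternate(enum toy_policy p)
{
    if (p == TOY_CLEAR_ON_UNMAP)
        toy_apdp_pde = 0;
}

/* A context switch clears the slot only under the clear-on-switch policy. */
static void toy_switch(enum toy_policy p)
{
    if (p == TOY_CLEAR_ON_SWITCH)
        toy_apdp_pde = 0;
}

/*
 * Scenario: pmap A is mapped and torn down, then its page directory
 * frame is recycled for pmap B.  Returns true if mapping B is
 * mistaken for the stale cached mapping of A.
 */
static bool toy_stale_hit(enum toy_policy p)
{
    const uint32_t frame = 0x00123000u;

    toy_apdp_pde = 0;
    (void)toy_map_alternate(frame);  /* reaper maps dying pmap A */
    toy_unmap_alternate(p);
    toy_switch(p);                   /* a context switch happens */
    /* pmap A is freed; its frame becomes pmap B's page directory */
    return toy_map_alternate(frame);
}
```

In this model toy_stale_hit(TOY_LAZY) reports a stale hit, while both
clearing policies force a fresh load.  Whether a stale hit of this
kind is what actually leads to the pmap_remove_ptes panic is still an
open question.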
