Re: Consistent Kernel Panic-Hardware-Related?

Christian Ehrhardt Wed, 24 Jul 2013 04:41:52 -0700

Hi,

On Wed, Jul 24, 2013 at 11:52:38AM +0200, Mark Kettenis wrote:
> Taking this to tech@ in the hope some more people will look into this.


Ok. And thanks for picking this up.

> > On Thu, Jul 04, 2013 at 09:56:56AM -0700, Scott Vanderbilt wrote:
> > > I've been trying to build userland repeatedly over the past few days
> > > on a particular machine and consistently get kernel panics, though
> > > never at exactly the same point in the process. The latest occurred
> > > midway through 'make obj'. Attempts to build userland on another
> > > i386 machine from code pulled via cvs at more or less the same time
> > > works fine, so it seems the issue is isolated to this hardware.
> > >
> > > I initially suspected my SSD had gone bad, so I replaced it with a
> > > brand new drive. However, the issue persists, so I no longer suspect
> > > the drive.
> > >
> > > A ps, trace, and dmesg are provided below. This is my first
> > > reporting a bug of this nature. I hope I've followed procedure. If
> > > not, please do let me know. I'm trying to be useful. :-)
> > >
> > > -------------------------------------------------------------------------
> > >
> > > panic: pmap_remove_ptes: managed page without PG_PVLIST for 0x3c001000
> > > Stopped at      Debugger+0x4:   popl    %ebp
> > > > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > > > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> > >
> > > ddb> show panic
> > > pmap_remove_ptes: managed page without PG_PVLIST for 0x3c001000
> > >
> > > ddb> trace
> > > Debugger(d0963718,f6269e38,d0966be4,f6269e38,d1cf1040) at Debugger+0x4
> > > panic(d0966be4,3c001000,d1ceb16c,f6269e4c,0) at panic+0x5d
> > > > pmap_remove_ptes(d9e39798,d1cf1040,ffcf0000,3c000000,3c003000) at pmap_remove_ptes+0x142
> > > > pmap_do_remove(d9e39798,3c000000,3c003000,0,d0ad7820) at pmap_do_remove+0xeb
> > > > pmap_remove(d9e39798,3c000000,3c003000,d056c4e9,d9c68e1c) at pmap_remove+0x27
> > > > uvm_unmap_kill_entry(d9e3ad80,d9c68e1c,f6269f2c,d043a597,0) at uvm_unmap_kill_entry+0xf8
> > > uvm_map_teardown(d9e3ad80,1,4,d093e66e,d9cc2700) at uvm_map_teardown+0xac
> > > uvmspace_free(d9e3ad80,1,1,f6269f6c,d0203009) at uvmspace_free+0x2e
> > > uvm_exit(d9cc3ba4,d0a4e0a8,4,d093e66e,0) at uvm_exit+0x15
> > > reaper(d9e33004) at reaper+0x8a
> > > Bad frame pointer: 0xd0c3ce68
> > 
> > Can you try to see if the following patch helps? It did for me, when
> > I was debugging a similar panic back in December. However, my
> > explanation why the patch would fix this bug, turned out to be invalid.
> > Still the bug went away. If the same happens for you, some more people
> > should have a look at the patch:
> > 
> > > --- /mount/blink/aegis/project/gg/history/os/src/sys/arch/i386/i386/pmap.c 2012/10/16 18:31:28 1.117
> > > +++ /mount/blink/aegis/project/gg/history/os/src/sys/arch/i386/i386/pmap.c 2013/01/24 17:20:06 1.118
> > @@ -495,7 +495,7 @@ pmap_map_ptes(struct pmap *pmap)
> > 
> >     /* need to load a new alternate pt space into curpmap? */
> >     opde = *APDP_PDE;
> > -#if defined(MULTIPROCESSOR) && defined(DIAGNOSTIC)
> > +#if defined(DIAGNOSTIC)
> >     if (pmap_valid_entry(opde))
> >             panic("pmap_map_ptes: APTE valid");
> >  #endif
> > @@ -521,10 +521,8 @@ pmap_unmap_ptes(struct pmap *pmap)
> >     if (pmap_is_curpmap(pmap)) {
> >             simple_unlock(&pmap->pm_obj.vmobjlock);
> >     } else {
> > -#if defined(MULTIPROCESSOR)
> >             *APDP_PDE = 0;
> >             pmap_apte_flush();
> > -#endif
> >             simple_unlock(&pmap->pm_obj.vmobjlock);
> >             simple_unlock(&curpcb->pcb_pmap->pm_obj.vmobjlock);
> >     }
> 
> Wish somebody with more in-depth knowledge about the i386 pmap
> implementation would respond :(.

I think it is some kind of caching issue. IIRC I've seen at least
one case where the condition checked by the assertion turned out
to be _not true_ when re-evaluated manually in the debugger.

> Your diff basically disables an optimization where the alternate pmap
> is kept around in case we need it again.  Not sure how important this
> optimization is.  I guess the primary user of the alternate pmap is
> the reaper, and keeping the alternate pmap around there could be
> beneficial if the address space of the process we're reaping is
> heavily fragmented.

fork() also makes use of the alternate pmap, and so does ptrace-based
memory access, I guess.

> There is something fishy with this optimization.  *APDP_PDE is never
> cleared, which means that it becomes stale after the process exits.
> Presumably we'd notice the next time we try to map an alternate pmap,
> but if the physical pages for the pmap get recycled, we might not.

Yes. Additionally, the reaper is particularly problematic as
it is a kernel thread and switching to/from kernel threads omits
the TLB-flush that is inherent to a normal process switch.

> Not quite seeing how this leads to that panic,

Exactly. This is the problem with the patch. It most likely fixes a bug.
However, the bug being fixed is not sufficient to explain the symptoms
seen by the original poster.

> but perhaps we should clear *APDP_PDE in pmap_switch()?

Hm, I need to think about that. But at first glance it looks
sufficient.

    regards   Christian
