Re: (another) Intel driver change needs testing.

J.C. Roberts Wed, 19 May 2010 09:52:05 -0700

On Wed, 19 May 2010 09:36:14 +0200 David Coppa wrote:
> > > The occasional/intermittent screen corruption bug seems to have
> > > disappeared for me with latest xenocara -current + the
> > > "experimental" Mesa 7.8 update.
> > 
> > That's good news! I don't have the "experimental" mesa patch, but
> > if you want me to test it, mail it to me off list.
> 
> I've spoken too soon!
> The problem now has changed, but it's still here.
> 
> Now the screen goes blank but not turn off (a black 
> screen): no crashes or artifacts, but it's totally 
> unresponsive and I need to switch to text console and 
> kill X.
> 
> The sad is that this error is unrecoverable: when I try 
> to restart X, X starts with a blank screen again, with 
> the message:
>


When testing code controlling video/graphics hardware, you should
power off the system after hitting an error. The reason is simple: the
hardware/memory is no longer in a known state, so you no longer know
what you are testing, and hence you may not be able to isolate
*repeatable* errors.

Given that the Intel video chips are a shared memory design, you may
also want to waste the time to do a full memory clear/check at boot.
Typically, you would want to disable the "Fast Boot" or similarly named
option in your BIOS, as well as the "Show Logo" option, then let it
ever so slowly count/test/clear memory on boot.

With video/graphics chipsets using dedicated memory (e.g. RAM on the
graphics card), the safe bet is to unplug the power (or main battery in
the case of a laptop), and then hold down the power button for 20-30
seconds. This will turn off the residual power to the PCI bus as well
as discharge the caps.

Sure, the above is a pain in the ass and takes more time, but it's as
close as you'll get to clearing everything without removing the
BIOS/CMOS battery. I expect some novice to say that (corrupted) memory
cannot survive a normal cold boot, but in reality, it actually can.

Yes, it is good to attempt restarting X just to test if a crash is
recoverable, but don't be surprised when it doesn't work, and also
don't expect the core resulting from the second crash to be very
useful for finding the real problem, since the restart may have failed
because things were *already* in an unknown, corrupted state.

> assertion "(gttentry & PTE_VALID) != 0" failed: file
> "/usr/xenocara/driver/xf86-video-intel/src/i830_memory.c", line 520,
> function "i830_get_gtt_physical" giving up.
> 
> I need to reboot the system to have a working X...
> 
> One time, it was also accompanied by the following 
> messages from the kernel:
>  
> render error detected, EIR: 0x00000010
> page table error
>   PGTBL_ER: 0x00000010
> render error detected, EIR: 0x00000010
> page table error
>   PGTBL_ER: 0x00000010
> no reset function for chipset.
> no reset function for chipset.
> error: [drm:pid24688:inteldrm_lastclose] *ERROR* failed to idle
> hardware: 5
> 
> kernel and xenocara are -current from yesterday.
> 
> I have purchased an UltraBase for my X41 on eBay (for the 
> serial port). What can I do to further debug this problem?
> Building a kernel with DRMDEBUG can help?
> 
> ciao,
> david


I suck badly at debugging cores in UNIX, and I suck even worse at live
debugging in UNIX, but if I can capture a core file and all relevant
logs, then I have something potentially useful to others.

The following may not be the best way to do it (or even perfectly
correct), but it is how I'm testing over here.

$ grep console /etc/fbtab
/dev/ttyC0      0666    /dev/console

$ grep console /etc/syslog.conf
# console: be aware that this could create lots of output.
*.err;auth.notice;authpriv.none;kern.debug;mail.crit    /dev/console

$ grep 'xterm -C' ~/.xinitrc
xterm -C -fg '#ff0040' -geometry 106x42+0+0 &

$ grep xdbg ~/.kshrc
        alias xdbg='startx -- /usr/X11R6/bin/X -keepPriv 2>&1 | tee -a .xdbg.txt > /dev/console'

$ grep nosuid /etc/sysctl.conf 
kern.nosuidcoredump=2           # 2=Put suid coredumps in /var/crash

$ cat /etc/X11/xorg.conf
   Section "ServerFlags"
        Option  "NoTrapSignals" "true"
   EndSection

$ grep 'DEBUG=' /usr/src/sys/conf/GENERIC 
makeoptions     DEBUG="-g"      # compile full symbol table

$ cat /usr/src/sys/arch/i386/conf/GENERIC_DRMDEBUG
# GENERIC with INTELDRM_GEM
include         "arch/i386/conf/GENERIC"
option          DRMDEBUG
option          DRMLOCKDEBUG
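
If it helps, the userland settings listed above can be checked in one go
before starting a test run. What follows is only a rough sketch in plain
sh, not anything shipped with xenocara: check_line and xdbg_preflight
are made-up names, the patterns simply mirror the grep commands above,
and the optional root prefix argument exists only so the function can be
tried against a scratch directory. It does not look at the kernel config.

```shell
# Hypothetical helper: succeed if FILE has a line matching the extended
# regular expression PATTERN, and print a one-line report either way.
check_line() {
    label=$1
    pattern=$2
    file=$3
    if [ -r "$file" ] && grep -Eq -- "$pattern" "$file"; then
        echo "ok       $label"
    else
        echo "MISSING  $label ($file)"
        return 1
    fi
}

# Hypothetical wrapper: run all checks, return non-zero if any is missing.
# $1 = root prefix (default: empty, i.e. the live system)
# $2 = home directory holding .xinitrc (default: $HOME)
xdbg_preflight() {
    root=${1:-}
    home=${2:-$HOME}
    rc=0
    check_line "fbtab console perms" \
        '^/dev/ttyC0[[:space:]]+0666[[:space:]]+/dev/console' \
        "$root/etc/fbtab" || rc=1
    check_line "syslog to console" \
        '^[^#].*/dev/console' "$root/etc/syslog.conf" || rc=1
    check_line "xterm -C in .xinitrc" \
        'xterm -C' "$home/.xinitrc" || rc=1
    check_line "nosuidcoredump=2" \
        '^kern\.nosuidcoredump=2' "$root/etc/sysctl.conf" || rc=1
    check_line "NoTrapSignals" \
        'NoTrapSignals' "$root/etc/X11/xorg.conf" || rc=1
    return "$rc"
}
```

Running xdbg_preflight with no arguments checks the live system and
prints one "ok" or "MISSING" line per setting.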


NOTE_TO_SELF: Check if DRMLOCKDEBUG is still relevant.


As mentioned in xenocara/README, you need to run your handy new 'xdbg'
alias as root to get a core. I don't know if you have your /etc/sudoers
configured to keep the environment, but since capturing a core of X
requires running as root, this can be really convenient. The slightly
excessive permissions on /dev/console help to get around the fact
you'll be running as two different users, starting as root, and also
probably using su(1) within X to access particular applications as your
normal user (email, browser, ...).

Though you will capture a good deal of the console output to
your .xdbg.txt file as well as be able to see it from within X without
VT switching, you will not capture everything. Some relevant messages
like the infamous "GPU Hung!" will still go to /var/log/messages.
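
Since the output ends up split between .xdbg.txt and /var/log/messages,
it can be handy to copy both somewhere safe right after each crash, so
one run's logs don't get mixed up with the next. A minimal sketch,
assuming plain sh; snapshot_logs and the xdbg-<timestamp> directory
naming are invented for illustration:

```shell
# Hypothetical helper: copy every readable file named after DEST into a
# new timestamped directory under DEST, then print that directory's path.
# Files that do not exist (or are unreadable) are silently skipped.
snapshot_logs() {
    dest=$1
    shift
    dir="$dest/xdbg-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    for f in "$@"; do
        if [ -r "$f" ]; then
            cp -p "$f" "$dir/"    # -p keeps the original timestamps
        fi
    done
    echo "$dir"
}
```

For example: snapshot_logs ~/xdbg-runs ~/.xdbg.txt /var/log/messages
/var/log/Xorg.0.log (the Xorg.0.log path is an addition not mentioned
above; drop it if you don't want it).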

If luck is on your side, you'll get a core file in /var/crash if X
crashes. The trouble is, often you'll only get screen corruption but X
will not actually crash, so there's no core to inspect. With the i845G,
I've found that once I get screen corruption (often with "GPU Hung!"), a
VT switch will sometimes make the X server actually crash, but then the
trouble is knowing whether or not the crash is relevant to the corruption.
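
One small aid here is to pick out the most recent core and compare its
timestamp with when the corruption showed up. A rough sketch in plain
sh; newest_core is a made-up name, and it only assumes that cores land
in the given directory with a .core suffix:

```shell
# Hypothetical helper: print the most recently modified *.core file in
# DIR (default /var/crash); return non-zero if there is none.
newest_core() {
    dir=${1:-/var/crash}
    # ls -t lists newest first; errors from an empty glob are discarded.
    newest=$(ls -t "$dir"/*.core 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    echo "$newest"
}
```

The result can then be handed to gdb along with the server binary, e.g.
gdb /usr/X11R6/bin/X "$(newest_core)", followed by "bt" at the prompt.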

The above may not be perfect, but it works well for me.

        jcr

-- 
The OpenBSD Journal - http://www.undeadly.org
