>> With the patch from Chris Wilson, it should be sufficient to capture
>> only the file i915_error_state, but I guess we have to get the timing
>> right. The udev rule is only triggered when the kernel notices that
>> the GPU is hung, right? At that time the GPU is reset and this is
>> probably also the time that i915_error_state shows up. So I'm
>> wondering if we currently end up with recording a GPU dump of a
>> reinitialized GPU, which is not very useful. Maybe this would have
>> been obvious to me if I knew how to read the output of
>> intel_gpu_dump...

I have read up a bit on intel_gpu_dump. Apparently there was a
rationale for doing it the way it is currently done. I found this in
xserver-xorg-video-intel_2:2.9.0-1ubuntu2_2:2.9.0-1ubuntu4.diff.gz:

--- xserver-xorg-video-intel-2.9.0.orig/debian/xserver-xorg-video-intel.udev
+++ xserver-xorg-video-intel-2.9.0/debian/xserver-xorg-video-intel.udev
@@ -0,0 +1,10 @@
+# do not edit this file, it will be overwritten on update
+
+# Jesse Barnes on [email protected]:
+# You'll get three events, one when the error is detected, one before the
+# reset and one after. Each has a different environment variable set; the
+# initial error has ERROR=1, the pre-reset event has RESET=1 and the
+# post-reset event has ERROR=0.
+
+
+DRIVER=="i915", ACTION=="change", ENV{ERROR}==1, PROGRAM="/usr/share/apport/apport-gpu-error-intel.py"

So the rule is indeed triggered before the reset happens: it matches
the initial ERROR=1 event. At that point intel_gpu_dump should give a
useful dump, and i915_error_state will contain nothing useful yet.

At some point, certainly once the capture-error-state patch is in the
Ubuntu kernel, we should trigger at ERROR=0 instead and capture
i915_error_state (which can be decoded with intel_error_decode from
the newest intel-gpu-tools git).

> Jesse, here are a few examples of the dumps we're collecting now. Mind
> doublechecking that this are actually useful dumps?

I'm not an expert, but it looks like they have some potentially useful
information.

> https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529702

This one is maybe not so useful. The ringbuffer isn't shown.

> https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529410

The ringbuffer is all zero (MI_NOOP), but PGTBL_ER: 0x00000010
indicates that the hardware has detected an error. According to the
i965 PRM [1], that bit (bit 4) means "Invalid GTT Entry during
Display A Fetch".

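In case it helps with reading PGTBL_ER values in other reports, here is a minimal sketch of the bit decoding done by hand above. The table contains only the two bit meanings quoted in this thread; the names PGTBL_ER_BITS, set_bits and decode_pgtbl_er are made up for illustration and are not part of intel-gpu-tools.

```python
# Bit meanings quoted in this thread for the i965 PGTBL_ER register.
# Only the bits discussed here are listed; the PRM defines the rest.
PGTBL_ER_BITS = {
    0: "Invalid GTT Entry during Fetch on behalf of the Host",
    4: "Invalid GTT Entry during Display A Fetch",
}


def set_bits(value):
    """Return the positions of all set bits in a 32-bit value, lowest first."""
    return [bit for bit in range(32) if value & (1 << bit)]


def decode_pgtbl_er(value):
    """Pair each set bit with its description, or a placeholder if unknown here."""
    return [
        (bit, PGTBL_ER_BITS.get(bit, "not in this table (check the PRM)"))
        for bit in set_bits(value)
    ]


if __name__ == "__main__":
    for bit, meaning in decode_pgtbl_er(0x00000010):
        print(f"bit {bit}: {meaning}")
```

Bits missing from the table are reported with a placeholder rather than guessed, so a reserved or undocumented bit still shows up in the output.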
> https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/528795

PGTBL_ER: 0x00000029

This one also has something in the Page Table Error register. Here,
bits 5, 3 and 0 are set. On i965, bits 5 and 3 are reserved, but bit 0
means "Invalid GTT Entry during Fetch on behalf of the Host".

It will be interesting once we get the first bug reports from
xserver-xorg-video-intel 2.9.1-1ubuntu8, since those should also have
hardware information attached :-)

[1]: http://intellinuxgraphics.org/VOL_1_graphics_core.pdf
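For reference, the three-event sequence Jesse describes in the udev rule comment could be told apart in a hook script roughly like this. It is only a sketch: it assumes the script sees the event's ERROR and RESET variables in its environment, and classify_event, CAPTURE_PLAN and what_to_capture are hypothetical names, not anything in the existing apport hook.

```python
import os


def classify_event(env):
    """Classify an i915 'change' event from its environment variables.

    Follows the description in the udev rule comment: ERROR=1 on detection,
    RESET=1 just before the reset, ERROR=0 after the reset.
    """
    if env.get("ERROR") == "1":
        return "error-detected"
    if env.get("RESET") == "1":
        return "pre-reset"
    if env.get("ERROR") == "0":
        return "post-reset"
    return "unknown"


# Hypothetical plan: which artifact to collect on which event.
CAPTURE_PLAN = {
    "error-detected": "intel_gpu_dump",  # what the current rule does
    "post-reset": "i915_error_state",    # once the kernel captures error state
}


def what_to_capture(env):
    """Return the artifact name to collect for this event, or None."""
    return CAPTURE_PLAN.get(classify_event(env))


if __name__ == "__main__":
    print(classify_event(os.environ), what_to_capture(os.environ))
```

Nothing is collected on the pre-reset event here, matching the idea of either dumping at ERROR=1 or reading i915_error_state at ERROR=0.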
