A little incremental update on the apport GPU lockup reports...

On how GPU reset works:

I have looked a little at the code, and the first thing that stands out is that only chipsets from i965 and GM45 upward are reset. i945, G33, and below are not reset. This matches what I see in the bug reports.

On the chipsets where the GPU is not reset, the attached IntelGpuDump.txt is compatible with the (limited) information in i915_error_state. For the bug reports where I have a manual dump of i915_error_state from a drm-intel-next kernel, which dumps all relevant information there, the information is compatible with IntelGpuDump.txt, although more complete (i.e. it includes all the relevant buffers; IntelGpuDump.txt often lacks some important ones).

On chipsets where the GPU is reset, IntelGpuDump.txt is a dump of a freshly initialized GPU. The best sign is that HEAD is right at the beginning of the ringbuffer, i.e. it just got started. The other sign is that ACTHD and IPEHR differ from the values recorded in i915_error_state. With drm.debug=0x02 as a kernel parameter, we can also see in the dmesg output that the GPU is being reset (see [1] for an example from LP #516909).

The code that triggers the reset is i915_error_work_func in drivers/gpu/drm/i915/i915_irq.c [2]. The actual reset happens in i965_reset in i915_drv.c [3].

[1]: https://bugs.freedesktop.org/attachment.cgi?id=34126&action=edit
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_irq.c;h=5388354da0d176df4ff2a3b7c33de069abff12da;hb=HEAD
[3]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_drv.c;h=1b2e95455c05d0cce04d17483c7bd4ff9f218fe0;hb=HEAD

On how the udev events are triggered:

The udev events are sent from i915_error_work_func mentioned above. When a GPU reset happens, three events are sent: one at the beginning of the function, when we know that an error has been detected, one right before the reset, and one after.
The last two only happen on i965 and above, so we don't want to listen for them. The first happens whether the GPU is wedged or not (as defined by dev_priv->mm.wedged). There is no uevent that is triggered on all chipsets but only when the GPU is wedged, which may be what we would want.

i915_error_work_func is called from the end of i915_handle_error (also in i915_irq.c), which first takes care of recording the error state to i915_error_state in debugfs, so it is fine to grab this file on the first udev event, also in the cases where the GPU will be reset (I was worried about this in previous emails).

i915_handle_error is called from two places. One is when a bit in the error register EIR gets set, which triggers an interrupt. The other is when the hangcheck timer elapses, i.e. EIR is not set, but the GPU makes no progress. In the latter case "Hangcheck timer elapsed... GPU hung\n" is logged. In both cases i915_handle_error prints "render error detected, EIR: 0x%08x\n" (i.e. the EIR register is printed), but this will probably change in drm-intel-next soon, so that it is only printed when a bit in EIR is set [4].

[4]: http://lists.freedesktop.org/archives/intel-gfx/2010-March/006150.html

On what upstream wants:

Chris Wilson says that they would prefer dumps from kernels with the i915_error_state dumping patch [5]. IntelGpuDump.txt usually lacks some important information.

[5]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e

On what we can do:

1. Differentiate between "GPU hung" and other GPU errors. I think I got this part right in my previous email:
   - If there is "Hangcheck timer elapsed... GPU hung" in dmesg, give title "GPU hung ++",
   - If there is "page table error" in dmesg, give title "GPU page table error ++"
   - If none of the above, simply let the title be "GPU error ++" for now.
2. Include error registers in the right priority in the title
   - If PGTBL_ER is non-zero, use that.
   - Otherwise, if EIR is non-zero, use that.
   - Ignore ESR, it's useless.
3. If possible, carry the record-batch-buffer-following-GPU-error patch [5] (above) in the kernel. Possibly drop it before release. This will make the dumps for pre-i965 better, and will make the post-i965 dumps useful.
4. Possibly add a message in the apport script saying that while we are recording the logs of the incident, they don't tell us how the reporter experienced the problem. We get a lot of descriptions that only say things like "problem happened", and we don't know whether the computer hung and needed a reboot, or whether it recovered all by itself and the only thing the user noticed was apport asking them to report a problem they were unaware of.
5. Fix whatever caused https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/539533 . This seems to have happened for a lot of people since yesterday. It seems to be related to trying to add the MachineType to the title.

Open question:
- Is wedged the same as hung, or is there a subtle difference?

Geir Ove
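
The title rules in points 1 and 2 above could be sketched in Python roughly as follows. This is only an illustration, not the actual apport hook: the function name gpu_error_title, its parameters, and the way the register value is appended to the title are all assumptions, and the "++" suffix from the rules is left out because its expansion is not specified here.

```python
def gpu_error_title(dmesg, pgtbl_er=0, eir=0):
    """Build a bug title from dmesg text and error register values.

    Hypothetical helper: the name, the arguments, and the register
    suffix format are illustrative assumptions.
    """
    # Point 1: classify the error from the kernel log.
    if "Hangcheck timer elapsed... GPU hung" in dmesg:
        title = "GPU hung"
    elif "page table error" in dmesg:
        title = "GPU page table error"
    else:
        title = "GPU error"

    # Point 2: PGTBL_ER takes priority over EIR; ESR is ignored.
    if pgtbl_er:
        title += " PGTBL_ER: 0x%08x" % pgtbl_er
    elif eir:
        title += " EIR: 0x%08x" % eir
    return title
```

The caller would be expected to pass the dmesg text and the register values it has already parsed out of the dump; how those are obtained is outside this sketch.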
