> Yes, the userspace notification is asynchronous and the kernel does not
> wait before starting the reset procedure (if supported). Hence there is a
> race to capture the accurate data.
>
> The current i915_error_state gets around this by performing the capture in
> the error handler and aims to collect all the data that is strictly relevant
> to the crash. I would strongly recommend that this is used, and I want to
> deprecate the ringbuffer_info and batchbuffers debug files in the future -
> hence killing intel_gpu_dump.

I have noticed in the GPU-lockup bug reports that we have been receiving
(https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
that the attached IntelGpuDump.txt is usually incomplete. It can still be
useful for gathering statistics, since dumps from the same chipset often
have similar characteristics. The incompleteness may be due to the race
condition that Chris mentions.

One thing that I see a lot is that only the ringbuffer is captured, while
the GPU is executing a batchbuffer (see
https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high-level
description of ringbuffers and batchbuffers). One example is the
IntelGpuDump.txt from
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477 .
The first line captures the memory address of the active head, i.e. where
the GPU is currently executing (ACTHD: 0x0e366d50). From the ringbuffer
dump we see

    0x00012500:      0x18800080: MI_BATCH_BUFFER_START
    0x00012504:      0x0e363001: dword 1
    0x00012508: HEAD 0x02000004: MI_FLUSH

which means that the last executed command in the ringbuffer was to start a
batchbuffer at memory address 0x0e363001. ACTHD is a little way past that
address, so we can assume that the GPU is executing in that batchbuffer.
However, the batchbuffer is not part of the dump, which makes it hard to
say what the GPU is up to. The only thing we can see is that the last
executed instruction is 0x15000000 (from the IPEHR register, which is
loaded with every instruction that is processed).

I'm also wondering if there are many false positives, since I don't always
see signs of a GPU error in the dmesg output. Even when there are GPU hung
messages, dmesg may contain further messages for a long time after them,
which means that it cannot have been that GPU hang that triggered the udev
rule. I'm not sure how to interpret this.

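For triaging many dumps, the ACTHD comparison from the ringbuffer example
above could be scripted. The following is only a rough Python sketch, not
part of intel_gpu_dump. It assumes that each ringbuffer entry sits on its
own line in the "offset: [HEAD] value: description" form quoted above, and
that the low two bits of the address dword are flag bits rather than part
of the address. Both are assumptions on my side, and the function names are
made up for illustration.

```python
import re

# Assumed line format, taken from the dump excerpt quoted in this mail:
#   0x00012500: 0x18800080: MI_BATCH_BUFFER_START
#   0x00012508: HEAD 0x02000004: MI_FLUSH
ENTRY = re.compile(r"^(0x[0-9a-fA-F]+):\s+(HEAD\s+)?(0x[0-9a-fA-F]+):\s+(.*)$")


def last_batch_start(ring_lines):
    """Return the dword following the last MI_BATCH_BUFFER_START before HEAD.

    Returns None if no batchbuffer start is found before the HEAD marker.
    """
    addr_dword = None
    expect_addr = False
    for line in ring_lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue
        _offset, head, value, desc = match.groups()
        if head:
            break  # the entry at HEAD has not been executed yet
        if expect_addr:
            addr_dword = int(value, 16)
            expect_addr = False
        elif desc.startswith("MI_BATCH_BUFFER_START"):
            expect_addr = True
    return addr_dword


def offset_into_batch(acthd, addr_dword, flag_mask=0x3):
    """Distance in bytes from the batchbuffer start to ACTHD.

    flag_mask is an assumption: the low bits of the address dword are
    treated as flags and cleared before the comparison.
    """
    return acthd - (addr_dword & ~flag_mask)


ring = [
    "0x00012500: 0x18800080: MI_BATCH_BUFFER_START",
    "0x00012504: 0x0e363001: dword 1",
    "0x00012508: HEAD 0x02000004: MI_FLUSH",
]
start = last_batch_start(ring)
print(hex(start), hex(offset_into_batch(0x0E366D50, start)))
```

A small positive offset suggests that the GPU was inside that batchbuffer
when the dump was taken. A negative or very large offset would hint that
the ringbuffer and ACTHD were captured at different times, which is the
race discussed above.
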
Since the number of bug reports is quite overwhelming, I think a suitable
thing to do would be to lump similar automatic reports together by marking
them as duplicates of a master bug report. Most likely, the i8xx reports
are mostly this issue: http://bugs.freedesktop.org/show_bug.cgi?id=26345 .
The bugs on i945 also seem similar to one another. Then we can coordinate
some testing from the master bug report, but ask people to comment on their
findings in their own reports. That way the master bug report will not be
flooded with comments, and we can easily detach individual bug reports
later.

Geir Ove
