On Wed, Mar 10, 2010 at 09:12:33AM +0100, Geir Ove Myhr wrote:
> I have noticed in the GPU-lockup bug reports that we have been
> receiving
> (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
> that the IntelGpuDump.txt that is attached usually is incomplete, but
> can be useful for gathering statistics since dumps on the same chipset
> often have similar characteristics. This may be due to the race
> condition that Chris mentions.

Incomplete in what sense?

Btw, you've noticed the random number strings that are included in
titles. That is basically a checksum hex of the dump report, which I'm
calling the 'dump sign'. If two bug reports have exactly the same GPU
dump (character-for-character) then they'll have identical dump signs
and thus are almost assuredly dupes. Looking through our existing bug
reports I found half a dozen with the same hex, and sure enough they
were all against 915gm, so I marked them all dupes.

Ideally, when apport tries filing a bug report with the same dump sign
as one already filed, it should automatically set it as a dupe. I don't
know that this is working yet; the dupe detection stuff is still magical
to me.

However, I recognize these hex strings are nigh-unreadable for triagers,
and notice you've been replacing them with the PGTBL_ER or ESR values in
some cases. To save you some typing I've updated the report to append
these to the title, if the values are non-zero. I did not include the
EIR, but notice this is discussed in your other email - let me know if
that would be worth including and if it should be used preferentially to
ESR and/or PGTBL_ER.

> One thing that I see a lot is that only the ringbuffer is captured,
> while the GPU is executing a batchbuffer (see
> https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
> description of ringbuffers and batchbuffers) . One example is
> IntelGpuDump.txt from
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
> . The first line captures the memory address of the active head, i.e.
> where the GPU is currently executing (ACTHD: 0x0e366d50). From the
> ringbuffer dump we see
> 0x00012500:      0x18800080: MI_BATCH_BUFFER_START
> 0x00012504:      0x0e363001:    dword 1
> 0x00012508: HEAD 0x02000004: MI_FLUSH
> which means that the last executed command in the ringbuffer was to
> start a batch buffer at memory address 0x0e363001.
> This is a little bit ahead of ACTHD, so we can assume that the GPU is
> executing in that batchbuffer, but the batchbuffer is not part of the
> dump, which makes it hard to say what the GPU is up to. The only thing
> we can see is that the last executed instruction is 0x15000000 (from
> the IPEHR register which is loaded with every instruction that is
> processed).

Can you propose a mechanism for how we can solve this? I only half grok
the freeze dumping stuff, and unfortunately some other X projects are
demanding my time. But if you can propose some specific changes I can at
least supply some time to update the apport hook and/or get the bits
into the archive. I would love patches or even just bash snippets that
can be put into the apport hook, udev hook, or whatever.

> I'm also wondering if there are many false positives, since I don't
> always see signs of a GPU error in the dmesg output. Even when there
> are GPU hung messages, there may be messages in dmesg for a long time
> after that, which means that it couldn't have been that GPU hang that
> triggered the udev rule.

I'm not sure how to interpret this. Can you propose a string to look
for in the dmesg output? It would be straightforward to have the apport
hook scan for that string and refuse to file a bug report unless it sees
it.

> Since the number of bug reports is quite overwhelming, I think a
> suitable thing to do would be to lump similar automatic reports
> together by duplicating them to a master bug report. Most likely, the
> i8xx reports are mostly this issue:
> http://bugs.freedesktop.org/show_bug.cgi?id=26345 .

It could be. Are we sufficiently confident that we could just dupe all
the bug reports in launchpad? Or if we're not sure, we could go ahead
and start forwarding the bug reports and let upstream dupe them there.
The former is probably less total work, and like you mention we can
always undupe them ourselves as we learn more.

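On the dmesg question above, here is a minimal sketch of the kind of
check the apport hook could run, written in Python since apport hooks
are Python. The marker string "GPU hung" is only borrowed from the
wording in the quoted text; the exact message the kernel prints is an
assumption that needs confirming. The function names and the
max_trailing cutoff are hypothetical, not existing apport API.

```python
def find_gpu_hang(dmesg_text, marker="GPU hung"):
    """Locate the last dmesg line containing marker.

    Returns (found, trailing), where trailing is the number of log lines
    after that line, or None when the marker never appears.
    """
    lines = dmesg_text.splitlines()
    last = None
    for index, line in enumerate(lines):
        if marker in line:  # assumed marker, see note above
            last = index
    if last is None:
        return False, None
    return True, len(lines) - 1 - last


def should_file_report(dmesg_text, max_trailing=50):
    """Hypothetical policy: file only if a hang message exists and is
    close to the end of the log, i.e. not followed by a long run of
    unrelated messages."""
    found, trailing = find_gpu_hang(dmesg_text)
    return found and trailing <= max_trailing
```

The trailing-line count is there because of the second concern in the
quoted text: a hang message followed by a long stretch of later
messages probably did not trigger the udev rule. The cutoff of 50 is
arbitrary and would need tuning against real reports.
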
With 8xx, another option we could pursue would be to blacklist KMS in
the kernel and force them to use UMS instead. Do you know if there has
been testing to verify that the freezes experienced by 8xx are specific
to KMS? I'd hate to blacklist 845, for example, only to find it still
doesn't work.

I've removed the --kms-only flag on -intel, so it should now be possible
for 8xx users to switch off KMS via modeset=0, I think. If we can get
some verification that this helps eliminate the freezes, let me know and
we can proceed with blacklisting 8xx chips.

> The bugs on i945
> also seem similar to one another. Then we can coordinate some testing
> from the master bug report, but ask people to comment on their
> findings on their own reports. That way the master bug report will not
> be overcommented and we can easily detach bug reports later.

Sounds good.

Bryce
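
For the duping to a master bug, grouping reports by dump sign could be
sketched roughly as below. The checksum algorithm and sign length used
by the real hook are not stated in this thread, so MD5 truncated to
eight hex digits is an assumption, as are the names dump_sign and
group_by_sign.

```python
import hashlib
from collections import defaultdict


def dump_sign(dump_text, length=8):
    """Short hex checksum of a dump (assumed: MD5, first `length` digits)."""
    digest = hashlib.md5(dump_text.encode("utf-8")).hexdigest()
    return digest[:length]


def group_by_sign(reports):
    """Group bug ids whose dumps are character-for-character identical.

    reports is an iterable of (bug_id, dump_text) pairs. Only signs
    shared by more than one bug are returned, as {sign: [bug_ids]}.
    """
    groups = defaultdict(list)
    for bug_id, dump_text in reports:
        groups[dump_sign(dump_text)].append(bug_id)
    return {sign: ids for sign, ids in groups.items() if len(ids) > 1}
```

Each returned group is a candidate set of dupes; the first id in a
group could serve as the master, with the rest duped to it by hand or by
script.
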
