It's been a while, but I've found the time to dig much deeper into this
and familiarize myself with the kernel code some. Actually, I feel
comfortable with the idea of directly contacting the appropriate mailing
list now, so this is more to keep the record up-to-date than a request
for more triage.

Anyways, after walking through the kernel code, I realized that the
first sign of the bug (the 30ms gap) was occurring somewhere within the
function pci_scan_child_bus (in drivers/pci/probe.c), between its calls
to pci_scan_slot (also in drivers/pci/probe.c) and pcibios_fixup_bus
(in my case, under arch/x86/pci/common.c).

From there, I began adding dev_info statements around the function calls
executed in between, then checked which two messages the gap fell
between to narrow down the problem further. After a few rounds of this,
I found the delay consistently appearing within the function
pcie_aspm_configure_common_clock (in drivers/pci/pcie/aspm.c). After a
little research into what the PCIe common clock is, it turns out to
explain several aspects of this bug.
Booting the computer from battery power would influence the power state
of the device, which is what ASPM is all about. And it turns out the
discrepancy of 24ms between a good boot and a bad boot is precisely the
length of time the PCIe standard defines as a timeout for link training.
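
The gap-hunting step described above (finding which two consecutive log
messages the delay falls between) can be sketched as a small script. This
is only an illustration: the sample log lines and the "marker" messages
are invented, not output from the affected machine, and the helper names
parse and largest_gap are hypothetical. The only thing assumed about the
real log is the usual "[seconds.micros]" timestamp prefix.

```python
import re

# Matches the "[   12.345678] message" prefix that kernel log timestamps use.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse(lines):
    """Return (seconds, message) pairs for lines that carry a timestamp."""
    entries = []
    for line in lines:
        m = STAMP.match(line)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def largest_gap(entries):
    """Return (gap_seconds, message_before, message_after) for the widest gap.

    Returns None when there are fewer than two timestamped entries.
    """
    if len(entries) < 2:
        return None
    best = None
    for (t0, msg0), (t1, msg1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


# Invented sample log (hypothetical markers), standing in for dmesg output.
SAMPLE = """\
[    0.512000] pci 0000:00:01.0: marker A (hypothetical)
[    0.512400] pci 0000:00:01.0: marker B (hypothetical)
[    0.542400] pci 0000:00:01.0: marker C (hypothetical)
[    0.542900] pci 0000:00:01.0: marker D (hypothetical)
""".splitlines()

if __name__ == "__main__":
    gap, before, after = largest_gap(parse(SAMPLE))
    print(f"{gap * 1000:.1f} ms between:")
    print(f"  {before}")
    print(f"  {after}")
```

Passing the lines of a real kernel log to parse() in place of SAMPLE
would point at the pair of messages to instrument more finely on the
next round.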

Unfortunately, I don't know how, or even if, the two commits I found
earlier directly tie into this. It seems there's a really weird race
condition or resource fight going on. I'm also not sure how to fix the
problem cleanly, because just adding the overhead of dev_info statements
to the function makes the bug go away (so I can technically "fix" the
bug, but that's just a total hack). The one other little clue I found
was that the delay went away completely when I put dev_info statements
in every possible branch of the function's logic. When I only added
dev_info to the ifs corresponding to a problem, though, a slight delay
appeared (bumping the total time in the function to around 10ms), but
still not enough for link training to time out (so my GPU always
loaded).

I plan on mailing the list for the PCI subsystem of the kernel soon, but
I'm stumped about how exactly to proceed, so if you have any debugging
suggestions, I'd be happy to hear them. Thanks again.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1009312

Title:
  10de:0426 GPU loads unreliably, possible kernel timeout

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1009312/+subscriptions
