On Tue, 2012-10-23 at 21:45 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2012-10-23 at 18:54 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2012-10-23 at 18:42 +1100, Benjamin Herrenschmidt wrote:
> > >
> > > As you can see, it's not doing much before the failure:
> >
> > Allright, that debug output is bad, it's missing a bunch of stuff,
> > due to a bad log level (the prink(KERN_DEBUG) in the atom debug
> > stuff doesn't work anymore new kernel btw)
>
> More data: I've done a bit of AtomDis under Dave instructions and
> improved my tracing, and what it looks like is we run those 3 tables in
> that order:

And more :-)

 .../...

> I don't know (yet) whether anything happens in between that doesn't go
> via ATOM, in which case that wouldn't be traced. That's the next thing
> to check (including interrupts though we shouldn't be getting any at
> this stage afaik).

So I think it's in between. From what I can tell, the error happens
somewhere inside the call to drm_vblank_pre_modeset() from
atombios_crtc_dpms().

This is actually a bit nasty, but basically that pre_modeset() calls
drm_vblank_get(), which enables vblanks.

Now, it doesn't look like it's actually taking any interrupt ... Well,
assuming this is indeed using the irq handler in evergreen.c, which
doesn't appear to be called, but I might have confused my ASICs and
missed something specific to CEDAR here.

Here's what I've traced so far:

  evergreen_irq_set: vblank 0
  evergreen_irq_set: hpd 1
  evergreen_irq_set: hpd 2
  evergreen_irq_set: hpd 3
  evergreen_irq_set: hpd 4
  0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9 msg: irqset 2

What that output means is that it called evergreen_irq_set(), which
enabled vblank0 (and various hpd's, but those were already enabled), and
the freeze was detected at the tracepoint "irqset 2" that I added in
there. This point is basically right at the end of evergreen_irq_set(),
where I do a 500ms delay and check for a freeze. A previous tracepoint
right before writing to CP_INT_CNTL didn't show any freeze.

Now, the interrupt being an MSI, it's a memory store ... I had a vague
memory of one of you guys mentioning address limitations to 40-bit or so
in the radeon, though I thought that shouldn't affect MSIs, right?

Well ... Our 64-bit MSIs are actually using all 64 address bits. If the
radeon doesn't do that properly and crops the address bits, the MSIs are
going to hit the wrong address, right in the middle of nowhere, possibly
some DMA space.

So I hacked my platform code to force it to only hand out 32-bit MSI
addresses and guess what? ... the problem seems to be gone.

Ouch. That's really nasty.
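
To make the suspected failure mode concrete, here is a small userspace
sketch in C. It is not driver or platform code. The 40-bit width is only
an assumption taken from the vague memory mentioned above, and the helper
names (device_visible_addr, msi_addr_survives) are made up for
illustration. It just shows that an address using bits above the
device's limit no longer matches after cropping, while a 32-bit address
does.

```c
#include <stdint.h>

/* Assumption: the device only drives this many low address bits. */
#define ASSUMED_DEV_ADDR_BITS 40

/*
 * Hypothetical helper: the address the device would actually emit if it
 * silently dropped every bit at or above position "bits".
 */
static uint64_t device_visible_addr(uint64_t addr, unsigned bits)
{
    if (bits >= 64)
        return addr;            /* full 64-bit capable, nothing cropped */
    return addr & ((UINT64_C(1) << bits) - 1);
}

/*
 * Hypothetical helper: non-zero when the MSI store would still land on
 * the address the platform programmed, zero when cropping changed it.
 */
static int msi_addr_survives(uint64_t addr, unsigned bits)
{
    return device_visible_addr(addr, bits) == addr;
}
```

With these, msi_addr_survives(0xffff0000, ASSUMED_DEV_ADDR_BITS) is true,
while any address with a bit set at position 40 or above fails the check
and the store ends up at the cropped address instead.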

Supporting only a subset of the PCI address space for DMA was already
fairly nasty to begin with, but not doing the full MSI addresses looks
like a clear violation of the PCIe spec :-(

I'll do some more tests tomorrow to confirm whether that is the problem
or not, at which point, if it is, we'll need some kind of quirk to
indicate that it supports only MSI32 and not MSI64 or something along
those lines.

Guys, go shoot your HW engineers please.

Cheers,
Ben.
