Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Hi Egbert, Am 23.07.2013 13:26, schrieb Egbert Eich: I've looked at the logs a bit more. Here's a patch adding some more debug information. Would you please apply this to your 3.10 kernel and generate a log file the same way as you did before. The driver will be even more chatty - but I don't expect any problems from this. diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index e43d809..46bb77c 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -919,6 +919,11 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, spin_lock(dev_priv-irq_lock); for (i = 1; i HPD_NUM_PINS; i++) { + if (IS_G4X(dev) (hpd[i] hotplug_trigger) + dev_priv-hpd_stats[i].hpd_mark != HPD_ENABLED) + DRM_DEBUG_KMS(Received HPD intterupt although disabled\n, + I915_READ(PORT_HOTPLUG_EN)); + if (!(hpd[i] hotplug_trigger) || dev_priv-hpd_stats[i].hpd_mark != HPD_ENABLED) continue; @@ -929,6 +934,8 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, + msecs_to_jiffies(HPD_STORM_DETECT_PERIOD))) { dev_priv-hpd_stats[i].hpd_last_jiffies = jiffies; dev_priv-hpd_stats[i].hpd_cnt = 0; + DRM_DEBUG_KMS(Received HPD interrupt on PIN %d - jiffies: %d\n, i, + dev_priv-hpd_stats[i].hpd_last_jiffies); } else if (dev_priv-hpd_stats[i].hpd_cnt HPD_STORM_THRESHOLD) { dev_priv-hpd_stats[i].hpd_mark = HPD_MARK_DISABLED; dev_priv-hpd_event_bits = ~(1 i); @@ -936,6 +943,8 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, storm_detected = true; } else { dev_priv-hpd_stats[i].hpd_cnt++; + DRM_DEBUG_KMS(Received HPD interrupt on PIN %d - cnt: %d\n, i, + dev_priv-hpd_stats[i].hpd_cnt); } } I applied this patch (but not the one you sent on Monday), recompiled and logged: uncompressed (8,2M) http://files.hz6.de/kern_20130724.log bz2 (288 KB) http://files.hz6.de/kern_20130724.log.bz2 Hope this helps, Regards jan ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Hi Daniel, Am 25.07.2013 11:58, schrieb Daniel Vetter: Can you pls try the below patch (on top of Egbert's debug stuff)? diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h index 6caa748..2d4c884 100644 --- a/drivers/gpu/drm/i915/i915_reg.h +++ b/drivers/gpu/drm/i915/i915_reg.h @@ -1925,9 +1925,9 @@ #define PORT_HOTPLUG_STAT (dev_priv-info-display_mmio_offset + 0x61114) /* HDMI/DP bits are gen4+ */ -#define PORTB_HOTPLUG_LIVE_STATUS (1 29) +#define PORTD_HOTPLUG_LIVE_STATUS (1 29) #define PORTC_HOTPLUG_LIVE_STATUS (1 28) -#define PORTD_HOTPLUG_LIVE_STATUS (1 27) +#define PORTB_HOTPLUG_LIVE_STATUS (1 27) #define PORTD_HOTPLUG_INT_STATUS (3 21) #define PORTC_HOTPLUG_INT_STATUS (3 19) #define PORTB_HOTPLUG_INT_STATUS (3 17) I did, here are the logs: 364K http://files.hz6.de/kern_20130724_2.log I don't understand what exactly this patch does, but I noticed: - much less drm debug info, resulting in a much smaller log - no more noticeable lag on my system (even though a storm was detected). I double checked the latter and the lag seems indeed to be gone... Jan ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Hi Jan! Jan Niggemann writes: Hi Egbert, [...] Thanks for generating the logs! Hope you had a nice birthday dinner :) I applied this patch (but not the one you sent on Monday), recompiled and logged: uncompressed (8,2M) http://files.hz6.de/kern_20130724.log bz2 (288 KB) http://files.hz6.de/kern_20130724.log.bz2 These logs show clearly that we are seeing interrupts which should be disabled. Could it be that we we have either the status or enable bits mixed up? Unfortunately the publically available docs for GEN4 don't show the HOTPLUG_EN and HOTPLUG_STAT registers. Daniel, could you please help me out here? Thanks a lot! Cheers, Egbert. ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Daniel Vetter writes: I've checked the docs again for gm45 and we seem to have the right values. But on early gen4 (i.e. i965g/gm) the Bspec has been wrong about these, so I wouldn't be surprised at all if they're wrong for the digital ports on gm45, too. Iirc we've already had a case like that, but there was no real conclusion (and atm I can't find the bug). Can you pls try the below patch (on top of Egbert's debug stuff)? If this doesn't help, lets disable all the digital ports together for testing. Egbert I think your debug patch has fairly useful information for debugging the storm code in general, can you please submit a patch against drm-intel-nightly? Will do :) Cheers, Egbert. ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Jan Niggemann writes: Hi Daniel, Am 25.07.2013 11:58, schrieb Daniel Vetter: Can you pls try the below patch (on top of Egbert's debug stuff)? diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h index 6caa748..2d4c884 100644 --- a/drivers/gpu/drm/i915/i915_reg.h +++ b/drivers/gpu/drm/i915/i915_reg.h @@ -1925,9 +1925,9 @@ #define PORT_HOTPLUG_STAT (dev_priv-info-display_mmio_offset + 0x61114) /* HDMI/DP bits are gen4+ */ -#define PORTB_HOTPLUG_LIVE_STATUS (1 29) +#define PORTD_HOTPLUG_LIVE_STATUS (1 29) #define PORTC_HOTPLUG_LIVE_STATUS (1 28) -#define PORTD_HOTPLUG_LIVE_STATUS (1 27) +#define PORTB_HOTPLUG_LIVE_STATUS (1 27) #define PORTD_HOTPLUG_INT_STATUS (3 21) #define PORTC_HOTPLUG_INT_STATUS (3 19) #define PORTB_HOTPLUG_INT_STATUS (3 17) I did, here are the logs: 364K http://files.hz6.de/kern_20130724_2.log I get a 'forbidden' when I try to access this. I don't understand what exactly this patch does, but I noticed: - much less drm debug info, resulting in a much smaller log - no more noticeable lag on my system (even though a storm was detected). Ok, so this indicates that the storm detection works now :) I double checked the latter and the lag seems indeed to be gone... Before we actually masked out individual events it was irrelevant if the bits were mixed up. Now we need to be correct on these. Jan, thanks for your help! Cheers, Egbert. ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Hi Jan - Jan Niggemann writes: As to the log: I messed up the kernel parameters this morning... was out of coffee this morning and my 1,5y daughter played around me :-) Here's my kernel log with drm.debug and printk.time enabled: Uncompressed (22M): http://files.hz6.de/kern_20130722.log bzip2'd (some 600 KB): http://files.hz6.de/kern_20130722.log.bz2 I've looked at the logs a bit more. Here's a patch adding some more debug information. Would you please apply this to your 3.10 kernel and generate a log file the same way as you did before. The driver will be even more chatty - but I don't expect any problems from this. Cheers, Egbert. diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index e43d809..46bb77c 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -919,6 +919,11 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, spin_lock(dev_priv-irq_lock); for (i = 1; i HPD_NUM_PINS; i++) { + if (IS_G4X(dev) (hpd[i] hotplug_trigger) + dev_priv-hpd_stats[i].hpd_mark != HPD_ENABLED) + DRM_DEBUG_KMS(Received HPD intterupt although disabled\n, + I915_READ(PORT_HOTPLUG_EN)); + if (!(hpd[i] hotplug_trigger) || dev_priv-hpd_stats[i].hpd_mark != HPD_ENABLED) continue; @@ -929,6 +934,8 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, + msecs_to_jiffies(HPD_STORM_DETECT_PERIOD))) { dev_priv-hpd_stats[i].hpd_last_jiffies = jiffies; dev_priv-hpd_stats[i].hpd_cnt = 0; + DRM_DEBUG_KMS(Received HPD interrupt on PIN %d - jiffies: %d\n, i, + dev_priv-hpd_stats[i].hpd_last_jiffies); } else if (dev_priv-hpd_stats[i].hpd_cnt HPD_STORM_THRESHOLD) { dev_priv-hpd_stats[i].hpd_mark = HPD_MARK_DISABLED; dev_priv-hpd_event_bits = ~(1 i); @@ -936,6 +943,8 @@ static inline void intel_hpd_irq_handler(struct drm_device *dev, storm_detected = true; } else { dev_priv-hpd_stats[i].hpd_cnt++; + DRM_DEBUG_KMS(Received HPD interrupt on PIN %d - cnt: %d\n, i, + dev_priv-hpd_stats[i].hpd_cnt); } } ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Egbert, Daniel, others, Am 22.07.2013 10:04, schrieb Egbert Eich: Daniel Vetter writes: On Sun, Jul 21, 2013 at 10:23 PM, Jan Niggemann j...@hz6.de wrote: But every time this happens we only let through a few interrupts, so this shouldn't affect you badly. Can you please check whether those slowdowns line up with 2 minute intervalls? I observed these slowdowns for a couple of weeks now. On my machine, they only happen once, some minutes after a cold boot. They last for a minute or two, and then they are gone. I'd have guessed that the storm detection kicks in pretty quickly after a storm is detected and that it would go unnoticed. Hm, that sounds like something doesn't quite work as expected. We should kill things once we get 5 interrupts or so in 1 second. So if it's bad enough that it slows your machine down it really should only be barely noticeable. The logs show that the disable mechanism got triggered, so there was a storm that got detected. The respective message is generated by the worker, everything up to there (detection and marking disabled) seems to be fine. I bet we are still getting interrupts but the respective bit in hpd_event_bits doesn't get set any more. Since we unconditionally queue the worker on interrupt there is surprise it is so busy. Then this points to the call to hpd_irq_setup() in intel_hpd_irq_handler() not doing what is expected, ie masking out the stormy interrupt. Could it be that we can't mask/disable an interrupt before ACKing it? @Jan, could you also specify what hardware you are using (ie give us an output of lspci -n)? It's a Lenovo ThinkPad T400, the model is 7434-AG2. root@muretop:~# lspci -n 00:00.0 0600: 8086:2a40 (rev 07) 00:02.0 0300: 8086:2a42 (rev 07) 00:02.1 0380: 8086:2a43 (rev 07) 00:03.0 0780: 8086:2a44 (rev 07) 00:19.0 0200: 8086:10f5 (rev 03) 00:1a.0 0c03: 8086:2937 (rev 03) 00:1a.1 0c03: 8086:2938 (rev 03) 00:1a.2 0c03: 8086:2939 (rev 03) 00:1a.7 0c03: 8086:293c (rev 03) 00:1b.0 0403: 8086:293e (rev 03) 00:1c.0 0604: 8086:2940 (rev 03) 00:1c.1 0604: 8086:2942 (rev 03) 00:1c.3 0604: 8086:2946 (rev 03) 00:1c.4 0604: 8086:2948 (rev 03) 00:1d.0 0c03: 8086:2934 (rev 03) 00:1d.1 0c03: 8086:2935 (rev 03) 00:1d.2 0c03: 8086:2936 (rev 03) 00:1d.7 0c03: 8086:293a (rev 03) 00:1e.0 0604: 8086:2448 (rev 93) 00:1f.0 0601: 8086:2917 (rev 03) 00:1f.2 0106: 8086:2929 (rev 03) 00:1f.3 0c05: 8086:2930 (rev 03) 03:00.0 0280: 8086:4237 15:00.0 0607: 1180:0476 (rev ba) As to the log: I messed up the kernel parameters this morning... was out of coffee this morning and my 1,5y daughter played around me :-) Here's my kernel log with drm.debug and printk.time enabled: Uncompressed (22M): http://files.hz6.de/kern_20130722.log bzip2'd (some 600 KB): http://files.hz6.de/kern_20130722.log.bz2 Regards jan ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
On Mon, Jul 22, 2013 at 8:14 AM, Jan Niggemann j...@hz6.de wrote: Am 21.07.2013 22:43, schrieb Daniel Vetter: Can you please boot with drm.debug=0xe added to your kernel cmdline, do a cold boot (so that we get a storm) and then once everything is quiet grab the full kernel log (you probably have to grab it from the disk, dmesg has likely overflown) and upload it somewhere (since it's probably too big for the mailing lists)? Please also enable printk timestamps for your kernel so that we can create a timeline of everything going on. I hope I did it correctly, apending both printk.time=1 and drm.debug=0xe to my kernel command line: http://files.hz6.de/kern_20130722.log That log should give us a starting point, but I guess we need to run some debug patches past you to figure out what's going on in your system exactly. Did the drm.debug option work, it's not very chatty? Nope, that didn't work out somehow, it should be _much_ more chatty, even without a hpd irq storm. Usually the kernel prints the commandline in the log, but I didn't see that here either ... Can you please try to figure out what's going on here? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Daniel Vetter writes: On Sun, Jul 21, 2013 at 10:23 PM, Jan Niggemann j...@hz6.de wrote: But every time this happens we only let through a few interrupts, so this shouldn't affect you badly. Can you please check whether those slowdowns line up with 2 minute intervalls? I observed these slowdowns for a couple of weeks now. On my machine, they only happen once, some minutes after a cold boot. They last for a minute or two, and then they are gone. I'd have guessed that the storm detection kicks in pretty quickly after a storm is detected and that it would go unnoticed. Hm, that sounds like something doesn't quite work as expected. We should kill things once we get 5 interrupts or so in 1 second. So if it's bad enough that it slows your machine down it really should only be barely noticeable. The logs show that the disable mechanism got triggered, so there was a storm that got detected. The respective message is generated by the worker, everything up to there (detection and marking disabled) seems to be fine. I bet we are still getting interrupts but the respective bit in hpd_event_bits doesn't get set any more. Since we unconditionally queue the worker on interrupt there is surprise it is so busy. Then this points to the call to hpd_irq_setup() in intel_hpd_irq_handler() not doing what is expected, ie masking out the stormy interrupt. Could it be that we can't mask/disable an interrupt before ACKing it? @Jan, could you also specify what hardware you are using (ie give us an output of lspci -n)? Cheers, Egbert. ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
Hi all, Am 21.07.2013 22:43, schrieb Daniel Vetter: Can you please boot with drm.debug=0xe added to your kernel cmdline, do a cold boot (so that we get a storm) and then once everything is quiet grab the full kernel log (you probably have to grab it from the disk, dmesg has likely overflown) and upload it somewhere (since it's probably too big for the mailing lists)? Please also enable printk timestamps for your kernel so that we can create a timeline of everything going on. I hope I did it correctly, apending both printk.time=1 and drm.debug=0xe to my kernel command line: http://files.hz6.de/kern_20130722.log That log should give us a starting point, but I guess we need to run some debug patches past you to figure out what's going on in your system exactly. Did the drm.debug option work, it's not very chatty? Another thing I have noticed: All storm-related messages are about DP-3 and HDMI-A-1, but my T400 doesn't have either a HDMI out nor a DisplayPort connector... We'll only enable ports (and interrupts for them) if we think they're there. Could it be that there's a docking station for your machine available which would have these ports? Hmm, yes, I think so. Thank you all for helping diagnosing this issue! Regards jan ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
On Sun, Jul 21, 2013 at 10:23 PM, Jan Niggemann j...@hz6.de wrote: But every time this happens we only let through a few interrupts, so this shouldn't affect you badly. Can you please check whether those slowdowns line up with 2 minute intervalls? I observed these slowdowns for a couple of weeks now. On my machine, they only happen once, some minutes after a cold boot. They last for a minute or two, and then they are gone. I'd have guessed that the storm detection kicks in pretty quickly after a storm is detected and that it would go unnoticed. Hm, that sounds like something doesn't quite work as expected. We should kill things once we get 5 interrupts or so in 1 second. So if it's bad enough that it slows your machine down it really should only be barely noticeable. Can you please boot with drm.debug=0xe added to your kernel cmdline, do a cold boot (so that we get a storm) and then once everything is quiet grab the full kernel log (you probably have to grab it from the disk, dmesg has likely overflown) and upload it somewhere (since it's probably too big for the mailing lists)? Please also enable printk timestamps for your kernel so that we can create a timeline of everything going on. That log should give us a starting point, but I guess we need to run some debug patches past you to figure out what's going on in your system exactly. Another thing I have noticed: All storm-related messages are about DP-3 and HDMI-A-1, but my T400 doesn't have either a HDMI out nor a DisplayPort connector... We'll only enable ports (and interrupts for them) if we think they're there. Could it be that there's a docking station for your machine available which would have these ports? I returned to 3.2.0 with drm_kms_helper.poll=0 for the time being, so this doesn't affect me badly. Please tell me if and how I can further diagnose this. That's not good, we need to fix this. Also adding Egbert, maybe he has seen something like this in the wild already, after all he's implemented the storm handling code. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] i915 irq storm mitigation in 3.10
On Sat, Jul 06, 2013 at 11:41:07PM +0200, Jan Niggemann wrote: Hi list, I'm not sure, but I think the recent patch to keep i915 systems from locking up / slowing down in case of irq storms didn't fully help. I compiled 3.10, but every now and then, my system has the same symptoms as before. grep storm /var/log/messages Jul 5 21:55:43 muretop kernel: [ 324.767375] [drm] HPD interrupt storm detected on connector DP-3: switching from hotplug detection to polling Jul 6 23:05:57 muretop kernel: [ 200.372208] [drm] HPD interrupt storm detected on connector HDMI-A-1: switching from hotplug detection to polling Jul 6 23:07:51 muretop kernel: [ 313.752039] [drm] HPD interrupt storm detected on connector DP-3: switching from hotplug detection to pol I notice up to 50% CPU from kworker/u8:0 when my system slows down. I'm new to kernel development, but if you tell me what to do, I'm very willing to help diagnosing this. Seems to work as advertised, once we detect a storm we kill the hotplug interrupt. Note that since on many systems these storms only happen in certain conditions (e.g. specific battery charger state) we'll re-enable the interrupts every 2 minutes to check whether the storm has passed (there's not really any other way). But every time this happens we only let through a few interrupts, so this shouldn't affect you badly. Can you please check whether those slowdowns line up with 2 minute intervalls? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
[Intel-gfx] i915 irq storm mitigation in 3.10
Hi list, I'm not sure, but I think the recent patch to keep i915 systems from locking up / slowing down in case of irq storms didn't fully help. I compiled 3.10, but every now and then, my system has the same symptoms as before. grep storm /var/log/messages Jul 5 21:55:43 muretop kernel: [ 324.767375] [drm] HPD interrupt storm detected on connector DP-3: switching from hotplug detection to polling Jul 6 23:05:57 muretop kernel: [ 200.372208] [drm] HPD interrupt storm detected on connector HDMI-A-1: switching from hotplug detection to polling Jul 6 23:07:51 muretop kernel: [ 313.752039] [drm] HPD interrupt storm detected on connector DP-3: switching from hotplug detection to pol I notice up to 50% CPU from kworker/u8:0 when my system slows down. I'm new to kernel development, but if you tell me what to do, I'm very willing to help diagnosing this. Regards jan ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx