Hi,

My system (kernel / mcelog) frequently (numerous times a day) spits out warnings such as these:

Feb 20 21:10:07 lila kernel: [ 7808.093821] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.093832] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.093837] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093838] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093840] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093841] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093852] mce: [Hardware Error]: Machine check events logged
Feb 20 21:10:07 lila kernel: [ 7808.101828] CPU1: Core temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101829] CPU0: Core temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101830] CPU3: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101831] CPU2: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101832] CPU0: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101834] CPU1: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101834] mce: [Hardware Error]: Machine check events logged
Feb 20 21:10:07 lila mcelog: Processor 0 heated above trip temperature. Throttling enabled.
Feb 20 21:10:07 lila mcelog: Please check your system cooling. Performance will be impacted
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Feb 20 21:10:07 lila mcelog: Please check your system cooling. Performance will be impacted
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: warning: 16 bytes ignored in each record
Feb 20 21:10:07 lila mcelog: consider an update
Feb 20 21:10:07 lila mcelog: CPU 0 on socket 0 received unknown error
Feb 20 21:10:07 lila mcelog: Location: CPU 0 on socket 0
Feb 20 21:10:07 lila mcelog: CPU 1 on socket 0 received unknown error
Feb 20 21:10:07 lila mcelog: Location: CPU 1 on socket 0
Feb 20 21:10:07 lila mcelog: Processor 0 below trip temperature. Throttling disabled
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Too many trigger children running already
Feb 20 21:10:07 lila mcelog: Processor 1 below trip temperature. Throttling disabled
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Too many trigger children running already
Feb 20 21:10:07 lila mcelog: warning: 16 bytes ignored in each record
Feb 20 21:10:07 lila mcelog: consider an update

Sometimes I see just the kernel warnings, without the MCE stuff:

Feb 20 21:25:07 lila kernel: [ 8708.134901] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134903] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134904] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134906] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.141929] CPU1: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141931] CPU2: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141932] CPU0: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141933] CPU3: Package temperature/speed normal

I suspect that these warnings are spurious, possibly a kernel bug. They do not seem to correlate with the times the system is actually under stress: they often occur when the system is under no particular load, and conversely, I can stress the system without a whimper. [E.g., "sysbench --num-threads=1 --test=cpu --cpu-max-prime=35000 run", which revs the CPU frequencies to their maximum of 3 GHz and raises the core temperatures to ~71 C, or "sysbench --num-threads=4 --test=cpu --cpu-max-prime=50000 run", which runs at only ~2.65 GHz but raises the temperatures to 80 C.] Moreover, the warnings always seem to come in pairs, with the temperatures / speeds reported as returning to normal almost immediately (within roughly 7-8 ms, going by the kernel timestamps in the logs above).

This is a ThinkPad W550s with a dual-core, hyperthreaded i7-5500U, with a base speed of 2.4 GHz and Turbo Boost to 3 GHz. It's a fairly new (manufacturer refurbished) machine. I'm running mostly stable, with some backports and the occasional bit from unstable, when installable without ripping out half the basic stable installation. Recent kernels have been self-built from vanilla sources, in the 4.7.x-4.9.10 range.

Searching the internet, reactions to similar problems fall into two categories:

1) You're frying your system! Your fan or your thermal interface needs to be cleaned / replaced immediately!

2) These are just spurious artifacts.

Some discussions (the best seems to be the first one):

https://bugzilla.redhat.com/show_bug.cgi?id=924570
https://bbs.archlinux.org/viewtopic.php?id=191347
https://www.centos.org/forums/viewtopic.php?t=24420

Any thoughts?

Celejar
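
P.S. To put a number on how quickly things return to normal, a rough Python sketch along these lines pairs each "above threshold" line with the next "normal" line for the same CPU and reports the gap between their kernel timestamps. The regex is only a guess based on the message formats in the excerpts above, and `throttle_durations` and `SAMPLE` are just illustrative names; the sample is trimmed to the Core lines from the first excerpt.

```python
import re

# Matches kernel thermal lines such as
# "... kernel: [ 7808.093821] CPU1: Core temperature above threshold, ..."
# The pattern is an assumption derived from the log excerpts only.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+):\s+"
    r"(?P<scope>Core|Package) temperature"
    r"(?P<kind> above threshold|/speed normal)"
)


def throttle_durations(lines):
    """Pair each 'above threshold' line with the next 'normal' line for the
    same CPU and scope; return a list of (cpu, scope, seconds) tuples."""
    started = {}      # (cpu, scope) -> timestamp of the pending throttle event
    durations = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # mcelog lines and anything else are skipped
        key = (int(m["cpu"]), m["scope"])
        ts = float(m["ts"])
        if m["kind"] == " above threshold":
            started[key] = ts
        elif key in started:
            durations.append((key[0], key[1], ts - started.pop(key)))
    return durations


# Core lines copied from the first log excerpt.
SAMPLE = """\
Feb 20 21:10:07 lila kernel: [ 7808.093821] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.093832] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.101828] CPU1: Core temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101829] CPU0: Core temperature/speed normal
"""

if __name__ == "__main__":
    for cpu, scope, secs in throttle_durations(SAMPLE.splitlines()):
        print(f"CPU{cpu} {scope}: throttled for {secs * 1000:.1f} ms")
```

Fed the full kernel log instead of the sample, something like this should show whether every event is as short-lived as the ones quoted here, which is the pattern that makes me doubt the readings.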