On Tue, 31 Jan 2017 09:07:35 -0500 Giles Orr via talk <[email protected]> wrote:
> My primary machine is crashing with increasing frequency.  The
> commonest error I'm seeing in the log looks like this:
>

my 1c observation (with limited data) - check your drive - had similar
soft locks just before head crash (close to the start of the part)

as i said, ymmv :)

hth
Andre

> Jan 29 18:29:39 toshi7 kernel: nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kscreenlocker_g:19647]
> Jan 29 18:30:00 toshi7 kernel: Modules linked in: fuse uas usb_storage rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat nf_conntrack ...
>
> I realize that I'm probably not giving enough information, but pasting
> large chunks of log files would be just as counterproductive in its
> own way.  I've seen this one A LOT - and sometimes I get it and the
> machine goes hours (but not days) before crashing.  So ... is
> kscreenlocker likely to be the problem here?  When I searched for "BUG
> soft lockup CPU stuck for" on Google, the top result had exactly the
> same number of seconds, and said that replacing the power supply fixed
> the problem.  Which is a step I'd probably be willing to take, but
> this isn't a desktop, it's a laptop.  So I'd want to be very sure as
> the power supply is unique to this machine (if it's available at all)
> and probably quite expensive.
>
> The processor:
>
> Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (4594 bogomips)
> current speed: 1274MHz, 4 cores, 8 threads
>
> While it's not a current gen processor, this is still a good machine
> and I'd rather fix it than toss it.
>
> Got an immediate crash this morning, and to my surprise the error was
> very different:
>
> Jan 31 07:56:35 toshi7 kernel: ------------[ cut here ]------------
> Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!
> Jan 31 07:56:35 toshi7 kernel: invalid opcode: 0000 [#1] SMP
> Jan 31 07:56:35 toshi7 kernel: Modules linked in: uas usb_storage rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 ...
>
> Finally, I'm also getting this periodically:
>
> Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: mce: [Hardware Error]: Machine check events logged
> Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: mce: [Hardware Error]: Machine check events logged
> Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU4: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU5: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU1: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU3: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU7: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature/speed normal
>
> This suggests that it's overheating, throttling, and recovering pretty
> much instantaneously: my thought is that it's probably not a problem,
> but I thought I should check.
>
> How should I proceed from here:
> - the processor is going funny, replace it
> - junk the laptop, it's toast
> - debug further (how?)
> - replace the power supply
> - uninstall kscreenlocker and see what happens
>

---
Talk Mailing List
[email protected]
https://gtalug.org/mailman/listinfo/talk
