On 31 January 2017 at 10:37, Dhaval Giani <[email protected]> wrote:
>
> On Tue, Jan 31, 2017 at 10:28 AM Giles Orr via talk <[email protected]> wrote:
>>
>> On 31 January 2017 at 10:03, Alvin Starr via talk <[email protected]> wrote:
>> > On 01/31/2017 09:07 AM, Giles Orr via talk wrote:
>> >> My primary machine is crashing with increasing frequency.  The
>> >> commonest error I'm seeing in the log looks like this:
>> >>
>> >> Jan 29 18:29:39 toshi7 kernel: nouveau 0000:01:00.0: DRM: suspending
>> >> kernel object tree...
>> >> Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3
>> >> stuck for 23s! [kscreenlocker_g:19647]
>> >> Jan 29 18:30:00 toshi7 kernel: Modules linked in: fuse uas usb_storage
>> >> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
>> >> nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat
>> >> nf_conntrack ...
>> >>
>> >> I realize that I'm probably not giving enough information, but pasting
>> >> large chunks of log files would be just as counterproductive in its
>> >> own way.  I've seen this one A LOT - and sometimes I get it and the
>> >> machine goes hours (but not days) before crashing.  So ... is
>> >> kscreenlocker likely to be the problem here?  When I searched for "BUG
>> >> soft lockup CPU stuck for" on Google, the top result had exactly the
>> >> same number of seconds, and said that replacing the power supply fixed
>> >> the problem.  Which is a step I'd probably be willing to take, but
>> >> this isn't a desktop, it's a laptop.  So I'd want to be very sure as
>> >> the power supply is unique to this machine (if it's available at all)
>> >> and probably quite expensive.
>> >>
>> >> The processor:
>> >>
>> >> Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (4594 bogomips)
>> >> current speed: 1274MHz, 4 cores, 8 threads
>> >>
>> >> While it's not a current gen processor, this is still a good machine
>> >> and I'd rather fix it than toss it.
>> >>
>> >> Got an immediate crash this morning, and to my surprise the error was
>> >> very different:
>> >>
>> >> Jan 31 07:56:35 toshi7 kernel: ------------[ cut here ]------------
>> >> Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!
>> >> Jan 31 07:56:35 toshi7 kernel: invalid opcode: 0000 [#1] SMP
>> >> Jan 31 07:56:35 toshi7 kernel: Modules linked in: uas usb_storage
>> >> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
>> >> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
>> >> nf_conntrack_ipv6 ...
>> >>
>> >> Finally, I'm also getting this periodically:
>> >>
>> >> Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold,
>> >> cpu clock throttled (total events = 1)
>> >> Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold,
>> >> cpu clock throttled (total events = 1)
>> > [snip]
>> >> Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal
>> >> Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature/speed normal
>> >> Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature/speed normal
>> >>
>> >> This suggests that it's overheating, throttling, and recovering pretty
>> >> much instantaneously: my thought is that it's probably not a problem,
>> >> but I thought I should check.
>> >>
>> >> How should I proceed from here:
>> >> - the processor is going funny, replace it
>> >> - junk the laptop, it's toast
>> >> - debug further (how?)
>> >> - replace the power supply
>> >> - uninstall kscreenlocker and see what happens
>> >>
>> >
>> > If the CPU is going over temp then it could start acting unpredictably.
>> >
>> > If you have lm_sensors installed then it would be worthwhile checking
>> > the temp of the CPU during normal operation.
>> > I would also check the fans because most fans out there are
>> > "inexpensive" and will start to seize up over time, slowing down till
>> > things start getting hot.
>> > Another thing that has bitten me in the past was pushing a computer with
>> > a side vent up against a wall, which kept the still-good fans from
>> > working almost at all.
>> >
>> > Another thing that will cause random problems is memory so if the
>> > cooling is not the issue then try running a memory test.
>> > Unless you have ECC and there are no errors being logged.
>>
>> I should add that I ran memtest86(+?) for a couple hours a month ago,
>> and it came up error-free.  And I ran the smartctl long test on the
>> hard drive quite recently, again without error.  I should run the
>> memory test again (and possibly even the HD one), but it makes me
>> think that these aren't the problem.  I think the fans are functioning
>> okay, but that's worth looking at and I'll get lmsensors installed
>> again.
>
> A good starting point would be knowing what you are running. Also, update
> to the latest packages for your distro, as it might already be fixed.

Fair point ...  Fedora Core 24 or 25 (sorry, not at home - can't tell
you for sure which) KDE spin.  I do keep it up-to-date: all packages
should be current as of approximately the last three days.

--
Giles
http://www.gilesorr.com/
[email protected]
---
Talk Mailing List
[email protected]
https://gtalug.org/mailman/listinfo/talk
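
Since the thread turns on how often the thermal-throttle and soft-lockup messages show up, a quick tally of the kernel log may help before swapping any hardware. What follows is only a minimal Python sketch: it assumes the log has been exported to plain text with one kernel message per line (not wrapped the way the email excerpts are), and the function name `summarize` and both regular expressions are illustrative, modelled solely on the messages quoted in this thread.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread (an assumption:
# other kernel versions may word these messages differently).
THROTTLE = re.compile(r"kernel: CPU(\d+): Core temperature above threshold")
LOCKUP = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)\]"
)


def summarize(lines):
    """Count throttle events per CPU and collect soft-lockup reports.

    Returns (throttled, lockups) where throttled is a Counter keyed by CPU
    number and lockups is a list of (cpu, seconds, process_name) tuples.
    """
    throttled = Counter()
    lockups = []
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            throttled[int(match.group(1))] += 1
            continue
        match = LOCKUP.search(line)
        if match:
            cpu, seconds, process = match.group(1), match.group(2), match.group(3)
            lockups.append((int(cpu), int(seconds), process))
    return throttled, lockups


# The messages from this thread, rejoined onto single lines.
SAMPLE = [
    "Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold, "
    "cpu clock throttled (total events = 1)",
    "Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold, "
    "cpu clock throttled (total events = 1)",
    "Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal",
    "Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3 "
    "stuck for 23s! [kscreenlocker_g:19647]",
]

if __name__ == "__main__":
    throttled, lockups = summarize(SAMPLE)
    print("throttle events per CPU:", dict(throttled))
    print("soft lockups:", lockups)
```

If the throttle counts cluster just before the lockups, cooling is the likelier suspect; if the lockups always name the same process with no throttling nearby, the software side is worth chasing first.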
