Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-13 Thread David Rosenstrauch



On 03/13/2018 03:34 AM, Carsten Mattner via arch-general wrote:

On 3/13/18, David Rosenstrauch  wrote >> I have an old PCI 
card in the machine that's powers an old parallel port

printer I used to use with it.  Perhaps that's failing.  I don't need
the card anymore, so I might as well try taking it out and see if that
makes things better.  If not, I'll be back.  :-)


Let us know in either case. Please tell me it's a huge plotter
you use to make construction plans for your rocket ships.



Just an old printer that's been good to me that I've been reluctant to 
let go of:


https://www.cnet.com/products/hp-laserjet-6l/specs/

DR


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-13 Thread Carsten Mattner via arch-general
On 3/13/18, David Rosenstrauch  wrote:
>
>
> On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:
>> Any BIOS updates or kernel updates recently (4.15.8)?
>>
>> Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
>> just for testing (not production).
>>
>> It's more likely that the kernel regressed rather than IRQ issues popping
>> up suddenly. It's possible but less likely.
>
> No BIOS updates.  As far as kernel updates, I do those all the time, so
> not sure that would be the cause.
>
> After doing some digging, though, I did "cat /proc/interrupts", and this
> line stood out, for having an astronomically high number:
>
> CPU0   CPU1   CPU2   CPU3
> ...
>   16: 2424156658  0  0  0   IO-APIC  16-fasteoi
>   uhci_hcd:usb5, parport1
>
>
> I have an old PCI card in the machine that's powers an old parallel port
> printer I used to use with it.  Perhaps that's failing.  I don't need
> the card anymore, so I might as well try taking it out and see if that
> makes things better.  If not, I'll be back.  :-)

Let us know in either case. Please tell me it's a huge plotter
you use to make construction plans for your rocket ships.


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-12 Thread David Rosenstrauch



On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:

Any BIOS updates or kernel updates recently (4.15.8)?

Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
just for testing (not production).

It's more likely that the kernel regressed rather than IRQ issues popping
up suddenly. It's possible but less likely.


No BIOS updates.  As far as kernel updates, I do those all the time, so 
not sure that would be the cause.



After doing some digging, though, I did "cat /proc/interrupts", and this 
line stood out, for having an astronomically high number:


   CPU0   CPU1   CPU2   CPU3
...
 16: 2424156658  0  0  0   IO-APIC  16-fasteoi 
 uhci_hcd:usb5, parport1



I have an old PCI card in the machine that's powers an old parallel port 
printer I used to use with it.  Perhaps that's failing.  I don't need 
the card anymore, so I might as well try taking it out and see if that

makes things better.  If not, I'll be back.  :-)

Thanks,

DR


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-12 Thread Carsten Mattner via arch-general
On 3/13/18, David Rosenstrauch  wrote:

> So if this issue is irq-based, I guess that means some piece of hardware
> is faulty or failing.  Any idea how I might go about pinning down which
> one?  Would there be info in the kernel log about this?  Or something
> that I can look at in /proc?

Any BIOS updates or kernel updates recently (4.15.8)?

Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
just for testing (not production).

It's more likely that the kernel regressed rather than IRQ issues popping
up suddenly. It's possible but less likely.


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-12 Thread David Rosenstrauch



On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:



On 2018年03月12日 11:19, David Rosenstrauch wrote:

My server's been exhibiting some very strange behavior lately.  Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU.



Can you check whether you have enabled "Detailed CPU time" option in
htop's setup (F2 -> Display options -> "Detailed CPU time")?
 From my experience and understanging, htop's CPU meter is accounting
IO-wait/IRQ-response time by default but not showing them differently
unless you enabled the "Detailed CPU time" option.
And these waiting time is not accounted on each process or kernel
thread. Enabling that said option will revail more detailed CPU usage info.
High IO-wait or IRQ time is itself an indication of some misbehaving
hardware, but at least you can be sure that it is not by more
"dangerous" malwares or attacks.


Thanks for the suggestion.  So this issue happened again tonight, and I 
switched to "Detailed CPU time" to try to research it further. 
According to htop, the cpu usage is from "irq" (orange color).  I guess 
this would explain why I'm not seeing any process responsible too.


And it also might be related that I'm seeing these messages in my dmesg:

[  871.317377] perf: interrupt took too long (2506 > 2500), lowering 
kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering 
kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering 
kernel.perf_event_max_sample_rate to 5



So if this issue is irq-based, I guess that means some piece of hardware 
is faulty or failing.  Any idea how I might go about pinning down which 
one?  Would there be info in the kernel log about this?  Or something 
that I can look at in /proc?


Thanks,

DR


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-12 Thread David C. Rankin
On 03/11/2018 09:19 PM, David Rosenstrauch wrote:
> My server's been exhibiting some very strange behavior lately.  Every couple
> of days I run into a situation where one core (core #0) on the quad core CPU
> starts continuously using around 34% of CPU, but I'm not able to see (using
> htop) any process that's responsible for using all that CPU.  Even when I tell
> htop to show me kernel threads too, I still am not able to see the offending
> process.  Every process remains under 1% CPU usage (except for occasional,
> small, short-lived spikes up) yet the CPU usage on that core remains
> permanently hovering at around 34%.  The problem goes away when I reboot, but
> then comes back with a day or so.
> 
> I'm rather stumped as to how to fix this.  The server is a bit old, running an
> up-to-date installation of Arch on a Intel Core 2 Quad Q6600 CPU.  Any
> suggestions anyone might have as to either what might be going on here, or how
> to go about debugging it would be greatly appreciated.
> 
> Thanks!
> 
> DR

DR,

  You are on to something. In the past 24hr. I've experienced two hardlocks on
two separate multi-cpu Arch servers. Both AMD based boxes. One the
dual-quad-core opteron box that was the subject of the 4.15.8 update failure
that left the box unbootable. Then today on a quad-quad-core opteron box that
hardlocked doing a simple rsync from another server on the LAN. Other than the
rsync process, the box is nearly idle, e.g.

top - 19:49:45 up 8 min,  2 users,  load average: 2.93, 2.23, 1.11
Tasks: 271 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  :   0.0/0.0 0[]
%Cpu1  :   0.0/0.0 0[]
%Cpu2  :   0.0/0.0 0[]
%Cpu3  :   0.0/0.0 0[]
%Cpu4  :   0.0/0.0 0[]
%Cpu5  :   0.0/0.0 0[]
%Cpu6  :   0.0/0.0 0[]
%Cpu7  :   0.0/0.0 0[]
%Cpu8  :   0.0/0.0 0[]
%Cpu9  :   0.0/0.0 0[]
%Cpu10 :   0.0/0.0 0[]
%Cpu11 :   0.0/0.7 1[|   ]
%Cpu12 :   0.0/0.0 0[]
%Cpu13 :   0.0/0.0 0[]
%Cpu14 :   0.0/0.0 0[]
%Cpu15 :   0.0/0.0 0[]
GiB Mem :  1.7/62.915   []
GiB Swap:  0.0/1.998[]


  I could not get a top during the lockup, because when the lockup occurred,
while I could connect to the server from another box via ssh, I could not log
in. E.g.,

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

  Both boxes have exhibited strange behavior regarding the linux-raid array
(all disks are fine), but I receive spurious errors like (Out of IOMMU space),
Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for
user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv :00:05.0: PCI-DMA: Out of IOMMU
space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4
sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0
last_issue_tag 0x1
   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr
0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd
61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 ou
  res
40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv :00:05.1: PCI-DMA: Out of IOMMU
space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4
sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0
last_issue_tag 0x1

Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-12 Thread Jiachen Yang via arch-general


On 2018年03月12日 11:19, David Rosenstrauch wrote:
> My server's been exhibiting some very strange behavior lately.  Every
> couple of days I run into a situation where one core (core #0) on the
> quad core CPU starts continuously using around 34% of CPU, but I'm not
> able to see (using htop) any process that's responsible for using all
> that CPU.  Even when I tell htop to show me kernel threads too, I
> still am not able to see the offending process.  Every process remains
> under 1% CPU usage (except for occasional, small, short-lived spikes
> up) yet the CPU usage on that core remains permanently hovering at
> around 34%.  The problem goes away when I reboot, but then comes back
> with a day or so.
>
> I'm rather stumped as to how to fix this.  The server is a bit old,
> running an up-to-date installation of Arch on a Intel Core 2 Quad
> Q6600 CPU.  Any suggestions anyone might have as to either what might
> be going on here, or how to go about debugging it would be greatly
> appreciated.
>
> Thanks!
>
> DR

Can you check whether you have enabled "Detailed CPU time" option in
htop's setup (F2 -> Display options -> "Detailed CPU time")?
From my experience and understanging, htop's CPU meter is accounting
IO-wait/IRQ-response time by default but not showing them differently
unless you enabled the "Detailed CPU time" option.
And these waiting time is not accounted on each process or kernel
thread. Enabling that said option will revail more detailed CPU usage info.
High IO-wait or IRQ time is itself an indication of some misbehaving
hardware, but at least you can be sure that it is not by more
"dangerous" malwares or attacks.



signature.asc
Description: OpenPGP digital signature


Re: [arch-general] High CPU on one core, but unable to find process responsible

2018-03-11 Thread Carsten Mattner via arch-general
On 3/12/18, David Rosenstrauch  wrote:
> My server's been exhibiting some very strange behavior lately.  Every
> couple of days I run into a situation where one core (core #0) on the
> quad core CPU starts continuously using around 34% of CPU, but I'm not
> able to see (using htop) any process that's responsible for using all
> that CPU.  Even when I tell htop to show me kernel threads too, I still
> am not able to see the offending process.  Every process remains under
> 1% CPU usage (except for occasional, small, short-lived spikes up) yet
> the CPU usage on that core remains permanently hovering at around 34%.
> The problem goes away when I reboot, but then comes back with a day or
> so.

My gut feeling is that one of the kernel worker threads hangs.
So that would be 25% overall and 100% of the affected core.
But you say there's no load to be found in the kernel threads,
which is odd.

Or if the server is accessible from the Internet, is it possible
it's rooted and someone's running a hidden process? To confirm
this isn't the case, cut off Internet access and let it run for
two days.

I don't think there are any official hidden processes that do not
show up in htop or top since that would make them seem like rootkits.
That means if the guilty process is really invisible, then it's
definitely unusual.

It's scary to consider a rootkit, but if that's the case, then
it's best to be aware as soon as possible. I hope this is not
case for you, wouldn't wish it on your worst enemy.

Another idea. Can you limit the cores to 1 or maybe two and see
if it becomes easier to pinpoint?

This might work in the booted system:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

But on the kernel command line maxcpus=1 should work.