Bug#888465: CPU usage reporting issues

2018-01-25 Thread Ryan Thoryk

(cleaned the post up for readability, sorry about that)

On 01/25/2018 05:40 PM, Hans van Kranenburg wrote:

> This means that your vcpus want to execute work but are not being
> scheduled on a physical cpu core. Either the physical machine gets too
> much work from all the virtual machines that are requesting cpu time, or
> other things are going on, like your virtual machine getting paused
> (e.g. when doing live migration there's a handover moment when it's
> shortly paused and then resumed, this is also visible as a short 100%
> steal spike).


After going over log files, it appears that the issue started when 
Amazon did a live migration of the VM, probably for the Meltdown patching.



> A patch to fix that cpu accounting breakage (picked from linux 4.15) was
> included in 4.9.65-3. So only for the 4.9.0-3 (which actual version?)
> you could be seeing that one happening.


The versions were both 4.9.30-2+deb9u2 and the latest, 4.9.65-3+deb9u2.  
So basically the kernel never recovered properly after being paused 
during a live migration.



> Because of the mentioned steal time fix that was included in a version
> in between the 2 versions you mention, my first suggestion would be to
> see if the symptoms on the old and new kernel are exactly the same, or
> if they are only similar but different.
>
> Hans


I already rebooted the system running the 4.9.65 kernel, and beforehand, 
the symptoms were the same.  The CPU usage stats went back to normal 
after the reboot.


--
Ryan Thoryk
r...@thoryk.com
r...@tliquest.net



Bug#888465: CPU usage reporting issues

2018-01-25 Thread Ryan Thoryk

On 01/25/2018 05:40 PM, Hans van Kranenburg wrote:
> Hi Ryan,
>
> On 01/26/2018 12:20 AM, Ryan Thoryk wrote:
> > Package: linux-image-4.9.0-5-amd64
> > Version: 4.9.65-3+deb9u2
> > Severity: normal
> >
> > I'm having an issue with CPU usage reporting, tested on kernels 4.9.0-3
> > and 4.9.0-5.  The machines are running on Amazon EC2, which could be
> > related.  With the "sar" utility, after some time, the system's "steal"
> > value periodically is 100%,
>
> This means that your vcpus want to execute work but are not being
> scheduled on a physical cpu core. Either the physical machine gets too
> much work from all the virtual machines that are requesting cpu time, or
> other things are going on, like your virtual machine getting paused
> (e.g. when doing live migration there's a handover moment when it's
> shortly paused and then resumed, this is also visible as a short 100%
> steal spike).

After going over log files, it appears that the issue started when
Amazon did a live migration of the VM, probably for the Meltdown patching.

> > and the normal CPU user/system values,
> > including idle, are always 0.  When running a cpu-intensive app and
> > using the "top" utility, the user and system values are always 0, the
> > "idle" field stays at 100%, and only the "wait" field increases.
>
> Sounds a lot like this one:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=871608
>
> A patch to fix that cpu accounting breakage (picked from linux 4.15) was
> included in 4.9.65-3. So only for the 4.9.0-3 (which actual version?)
> you could be seeing that one happening.

The versions were both 4.9.30-2+deb9u2 and the latest, 4.9.65-3+deb9u2.
So basically the kernel never recovered properly after being paused
during a live migration.

> > The attached file shows the "sar" output around the time the issue
> > started.  This has happened on 2 separate machines (started at different
> > times on each), and a reboot appears to (temporarily) fix the issue.
> > I'm wondering if anyone else has this issue, and if it could be
> > something to do with the hypervisor.
>
> Because of the mentioned steal time fix that was included in a version
> in between the 2 versions you mention, my first suggestion would be to
> see if the symptoms on the old and new kernel are exactly the same, or
> if they are only similar but different.
>
> Hans

I already rebooted the system running the 4.9.65 kernel, and beforehand,
the symptoms were the same.  The CPU usage stats went back to normal
after the reboot.


--
Ryan Thoryk
r...@thoryk.com
r...@tliquest.net



Bug#888465: CPU usage reporting issues

2018-01-25 Thread Hans van Kranenburg

Hi Ryan,

On 01/26/2018 12:20 AM, Ryan Thoryk wrote:
> Package: linux-image-4.9.0-5-amd64
> Version: 4.9.65-3+deb9u2
> Severity: normal
> 
> I'm having an issue with CPU usage reporting, tested on kernels 4.9.0-3
> and 4.9.0-5.  The machines are running on Amazon EC2, which could be
> related.  With the "sar" utility, after some time, the system's "steal"
> value periodically is 100%,

This means that your vcpus want to execute work but are not being
scheduled on a physical cpu core. Either the physical machine gets too
much work from all the virtual machines that are requesting cpu time, or
other things are going on, like your virtual machine getting paused
(e.g. when doing live migration there's a handover moment when it's
shortly paused and then resumed, this is also visible as a short 100%
steal spike).

> and the normal CPU user/system values,
> including idle, are always 0.  When running a cpu-intensive app and
> using the "top" utility, the user and system values are always 0, the
> "idle" field stays at 100%, and only the "wait" field increases.

Sounds a lot like this one:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=871608

A patch to fix that cpu accounting breakage (picked from linux 4.15) was
included in 4.9.65-3. So only for the 4.9.0-3 (which actual version?)
you could be seeing that one happening.

> The attached file shows the "sar" output around the time the issue
> started.  This has happened on 2 separate machines (started at different
> times on each), and a reboot appears to (temporarily) fix the issue. 
> I'm wondering if anyone else has this issue, and if it could be
> something to do with the hypervisor.

Because of the mentioned steal time fix that was included in a version
in between the 2 versions you mention, my first suggestion would be to
see if the symptoms on the old and new kernel are exactly the same, or
if they are only similar but different.

Hans
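
For readers following the steal-time discussion above, here is a minimal sketch of how a tool could turn two snapshots of the aggregate "cpu" line in /proc/stat into percentages. The field order follows proc(5); the helper names and the sample counter values are made up for illustration and are not taken from the bug report, and this is not a claim about how sar or top compute their numbers internally.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of jiffy counters from a /proc/stat 'cpu' line."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def percentages(before, after):
    """Convert two counter snapshots into per-field percentages."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        # No counter advanced (or they went backwards): report zeros.
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Hypothetical snapshots, invented for this example.
t0 = parse_cpu_line("cpu 100 0 50 1000 10 0 0 5")
t1 = parse_cpu_line("cpu 110 0 55 1080 10 0 0 10")
pct = percentages(t0, t1)
print(pct["user"], pct["idle"], pct["steal"])
```

In this sketch, an interval in which none of the counters advance yields zeros in every column, including idle. Whether that is what happened on the affected machines is not established in this thread.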



Bug#888465: CPU usage reporting issues

2018-01-25 Thread Ryan Thoryk

Package: linux-image-4.9.0-5-amd64
Version: 4.9.65-3+deb9u2
Severity: normal

Hi,

I'm having an issue with CPU usage reporting, tested on kernels 4.9.0-3 
and 4.9.0-5.  The machines are running on Amazon EC2, which could be 
related.  With the "sar" utility, after some time, the system's "steal" 
value periodically is 100%, and the normal CPU user/system values, 
including idle, are always 0.  When running a cpu-intensive app and 
using the "top" utility, the user and system values are always 0, the 
"idle" field stays at 100%, and only the "wait" field increases.


The attached file shows the "sar" output around the time the issue 
started.  This has happened on 2 separate machines (started at different 
times on each), and a reboot appears to (temporarily) fix the issue.  
I'm wondering if anyone else has this issue, and if it could be 
something to do with the hypervisor.


--
Ryan Thoryk
r...@thoryk.com
r...@tliquest.net

11:05:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
11:15:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
11:25:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
11:35:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
11:45:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
11:55:01 AM  all   0.01   0.00    0.02     0.00    0.00   99.97
12:05:01 PM  all   0.01   0.00    0.01     0.00    0.00   99.97
12:15:01 PM  all   0.01   0.00    0.01     0.00    0.00   99.97
12:25:01 PM  all   0.01   0.00    0.02     0.00    0.00   99.97

12:25:01 PM  CPU  %user  %nice %system  %iowait  %steal   %idle
12:35:01 PM  all   0.01   0.00    0.02     0.00    0.00   99.97
12:45:01 PM  all   0.01   0.00    0.02     0.00    0.00   99.97
12:55:01 PM  all   0.02   0.00    0.02     0.00    0.00   99.96
01:05:01 PM  all   0.01   0.00    0.02     0.00    0.00   99.97
01:15:01 PM  all   0.00   0.00    0.00     0.00  100.00    0.00
01:25:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
01:35:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
01:45:01 PM  all   0.00   0.00    0.00     0.00  100.00    0.00
01:55:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
02:05:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
02:15:01 PM  all   0.00   0.00    0.00     0.00  100.00    0.00
02:25:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
02:35:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
02:45:01 PM  all   0.00   0.00    0.00     0.00  100.00    0.00
02:55:01 PM  all   0.00   0.00    0.00     0.00    0.00    0.00
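
To spot the same pattern on other machines, the rows of a log like the one above could be classified with a short script. This is only a sketch: it assumes whitespace-separated columns in the order %user %nice %system %iowait %steal %idle with a 12-hour timestamp, the function names are hypothetical, and the 99.0 cutoff for "steal-saturated" is an arbitrary choice. The three sample rows are copied from the attachment.

```python
NAMES = ("user", "nice", "system", "iowait", "steal", "idle")


def parse_sar_row(line):
    """Split a 12-hour-clock sar CPU row into (timestamp, values)."""
    parts = line.split()
    # parts: HH:MM:SS, AM/PM, CPU id, then six percentages
    timestamp = " ".join(parts[:2])
    values = dict(zip(NAMES, (float(x) for x in parts[3:9])))
    return timestamp, values


def classify(values):
    """Label a row as normal, steal-saturated, or all-zero."""
    if values["steal"] >= 99.0:  # assumed threshold
        return "steal-saturated"
    if sum(values.values()) == 0.0:
        # Even %idle is zero, so the row does not add up to 100%.
        return "all-zero"
    return "normal"


rows = [
    "01:05:01 PM all 0.01 0.00 0.02 0.00 0.00 99.97",
    "01:15:01 PM all 0.00 0.00 0.00 0.00 100.00 0.00",
    "01:25:01 PM all 0.00 0.00 0.00 0.00 0.00 0.00",
]
labels = [classify(parse_sar_row(r)[1]) for r in rows]
for row, label in zip(rows, labels):
    print(parse_sar_row(row)[0], label)
```

A healthy interval should have its six columns sum to roughly 100, so an "all-zero" row is a sign of broken accounting rather than of an idle machine.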