Re: turbostat-17.06.23 floating point exception

2018-10-18 Thread Solio Sarabia
On Fri, Oct 12, 2018 at 07:03:41PM -0400, Len Brown wrote:
> > Why would the cpu topology report 0 cpus?  I added a debug entry to
> > cpu_usage_stat and /proc/stat showed it as an extra column.  Then
> > fscanf parsing in for_all_proc_cpus() failed, causing the SIGFPE.
> >
> > This is not an issue. Thanks.
> 
> Yes, it is true that turbostat doesn't check for systems with 0 cpus.
> I'm curious how you provoked the kernel to claim that.  If it is
> something others might do, we can add a check for it and gracefully
> exit.

source/tools/power/x86/turbostat/turbostat.c
int for_all_proc_cpus(int (func)(int))
{
retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n");
^
This fscanf fails due to an extra debug entry in /proc/stat (11
columns in total instead of the usual 10).  I was measuring time spent
in a hot function and decided to record it in an extra cpu_usage_stat
field.  This was only an experiment, though.
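For anyone hitting something similar, below is a minimal standalone
sketch (my own, not the upstream code) that parses /proc/stat with the
same format strings and shows where an explicit guard would avoid the
divide-by-zero:

#include <err.h>
#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/proc/stat", "r");
	int cpu_num, cpus = 0;

	if (!fp)
		err(1, "/proc/stat");

	/* Summary line: every conversion is suppressed, so fscanf()
	 * returns 0 on a match and an 11th column is left unread. */
	if (fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n") != 0)
		errx(1, "failed to parse summary cpu line");

	/* Per-cpu lines: a leftover 11th column from the line above makes
	 * the very first "cpu%d" match fail, so zero cpus are counted. */
	while (fscanf(fp, "cpu%d %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n",
		      &cpu_num) == 1)
		cpus++;
	fclose(fp);

	/* The guard turbostat 17.06.23 lacks: with 0 cpus the later
	 * "average.threads.tsc /= topo.num_cpus" raises SIGFPE. */
	if (cpus == 0)
		errx(1, "0 cpus parsed from /proc/stat");
	printf("%d cpus\n", cpus);
	return 0;
}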

Thanks,
-S.


Re: turbostat-17.06.23 floating point exception

2018-10-12 Thread Solio Sarabia
On Fri, Oct 12, 2018 at 11:26:30AM -0700, Solio Sarabia wrote:
> Hi --
> 
> turbostat 17.06.23 is throwing an exception on a custom linux-4.16.12
> kernel, on Xeon E5-2699 v4 Broadwell EP, 2S, 22C/S, 44C total, HT off,
> VTx off.
> 
> Initially the system had 4.4.0-137. Then I built and installed
> linux-4.16.12-default.  turbostat works fine for these two versions.
> After building linux-4.16.12 a second time, the older kernel image is
> renamed, and `ls -l /boot/` now shows (I'm using the version without
> the .old suffix):
> 
>   vmlinuz-4.16.12-default+
>   vmlinuz-4.16.12-default+.old
> 
> grep -i 'turbostat' /var/log/kern.log
> 
> kernel: [  159.140836] capability: warning: `turbostat' uses 32-bit
>   capabilities (legacy support in use)
> kernel: [  164.149264] traps: turbostat[1801] trap divide error
>   ip:407625 sp:7ffe4b0df000 error:0 in turbostat[40+17000]
> 
> (gdb)
> cpu22: MSR_PKGC3_IRTL: 0x (NOTvalid, 0 ns)
> cpu22: MSR_PKGC6_IRTL: 0x (NOTvalid, 0 ns)
> cpu22: MSR_PKGC7_IRTL: 0x (NOTvalid, 0 ns)
> 
> Program received signal SIGFPE, Arithmetic exception.
> 0x00407625 in compute_average (t=0x61a3b0, c=0x61a3d0, p=0x61a480)
>     at turbostat.c:1378
> 1378            average.threads.tsc /= topo.num_cpus;
> 
Why would the cpu topology report 0 cpus?  I added a debug entry to
cpu_usage_stat and /proc/stat showed it as an extra column.  Then
fscanf parsing in for_all_proc_cpus() failed, causing the SIGFPE.

This is not an issue. Thanks.
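For completeness, this is the shape of the kernel-side experiment
(CPUTIME_DEBUG is an illustrative name, not my actual patch; the hook
in the hot function is omitted). Any extra enum entry that show_stat()
prints becomes an 11th column, which breaks consumers assuming exactly
10:

/* include/linux/kernel_stat.h */
enum cpu_usage_stat {
	CPUTIME_USER,
	CPUTIME_NICE,
	CPUTIME_SYSTEM,
	CPUTIME_SOFTIRQ,
	CPUTIME_IRQ,
	CPUTIME_IDLE,
	CPUTIME_IOWAIT,
	CPUTIME_STEAL,
	CPUTIME_GUEST,
	CPUTIME_GUEST_NICE,
	CPUTIME_DEBUG,		/* illustrative: time spent in the hot function */
	NR_STATS,
};

/* fs/proc/stat.c: show_stat(), emitting the value as an extra column */
seq_put_decimal_ull(p, " ", nsec_to_clock_t(kcpustat_cpu(i).cpustat[CPUTIME_DEBUG]));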

> Let me know if you need more details.
> 
> Thanks,
> -SS


Time accounting difference under high IO interrupts

2018-08-14 Thread Solio Sarabia
Under high IO activity (storage or network), the kernel is not
accounting for some cpu cycles, based on comparing sar vs emon (a tool
that reads the hw PMU directly). The difference is larger on cores that
spend most of their time in idle states and constantly wake up to
handle interrupts. It happens even with fine-grained IRQ time
accounting enabled (CONFIG_IRQ_TIME_ACCOUNTING).
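For reference, this is roughly the mechanism involved (a simplified
sketch of irqtime_account_irq() from kernel/sched/cputime.c; the
sched_clock_irqtime enable check and per-cpu declarations are omitted):

/* On every hardirq/softirq entry and exit the kernel samples the
 * sched clock and credits the elapsed delta to CPUTIME_IRQ or
 * CPUTIME_SOFTIRQ, so interrupt work between ticks is accounted. */
void irqtime_account_irq(struct task_struct *curr)
{
	struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
	s64 delta;

	delta = sched_clock_cpu(smp_processor_id()) - irqtime->irq_start_time;
	irqtime->irq_start_time += delta;

	if (hardirq_count())
		irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
		irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}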

After playing with timer subsystem options (periodic ticks, idle
tickless, full tickless), time and stats accounting options, and
jiffies (HZ) values, the issue persists. The lost cycles are not
accounted on other cores as 'extra' utilization either. Example with
Linux 4.15.18, bare metal, Xeon v4 Broadwell, driving network traffic:

          sar      emon     emon-sar   intrs/sec
core12    5.00     11.70    6.70       29,302
core17    19.07    23.16    4.09       17,345
core20    19.41    23.11    3.70       16,578

Based on how the kernel accounts time: do you have an idea why a high
number of interrupts affects time accounting?

Thanks,
-Solio


Re: Differences in cpu utilization reported by sar, emon

2018-07-10 Thread Solio Sarabia
Further analysis shows that even with CONFIG_IRQ_TIME_ACCOUNTING or
DYNTICKS (CONFIG_VIRT_CPU_ACCOUNTING_GEN) some CPU cycles are still
lost. The difference correlates with the number of interrupts/sec
handled by the core: as the interrupt rate increases, so does the
difference.

Network example (linux baremetal):
          sar     emon    emon-sar   interrupts/sec
core18    19.2    20.9    1.7        11057
core21    25.1    30.5    5.4        31472
core27    18.3    20.1    1.8        10384
core30    5.6     11.4    5.8        35841
core34    17.8    20.5    2.7        10973

Storage example (fio, device attached directly to vm):
          sar     perfmon   perfmon-sar   interrupts/sec
core10    7.4     19.7      12.3          100481
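As a back-of-envelope check (my own estimate): 12.3% of one core
spread over 100481 interrupts/sec is 0.123 / 100481 ~= 1.2 us of
unaccounted work per interrupt, which is a plausible per-interrupt
service time.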

In the storage case, up to 12% of irq cycles were not accounted for.
As users adopt more capable SSDs, for instance, the issue will become
more evident. I would like to understand why this happens:

What could be the reason for this issue?
Any pointers to the kernel subsystem/code performing time accounting?

Thanks,
-Solio


On Wed, Jun 20, 2018 at 04:41:40PM -0700, Solio Sarabia wrote:
> Thanks Andi, Stephen, for your help/insights.
> 
> TICK_CPU_ACCOUNTING (the default option) does not account for cpu
> utilization on cores handling irqs and softirqs.
> 
> IRQ_TIME_ACCOUNTING or VIRT_CPU_ACCOUNTING_GEN helps to reduce the
> utilization gap. With either option there is still a difference, up
> to 8% in terms of the sar/emon ratio (sar shows less utilization),
> but it is an improvement over the default case.
> 
> 
> This is a brief description of the Kbuild options:
> 
> -> General setup
>   -> CPU/Task time and stats accounting
> -> Cputime accounting
> TICK_CPU_ACCOUNTING
> Simple/basic tick-based cpu accounting--maintains statistics about
> user, system, and idle time at per-jiffy granularity.
> VIRT_CPU_ACCOUNTING_NATIVE (not available on my kernel)
> Deterministic task and cpu time accounting--more accurate task and
> cpu time accounting. The kernel reads a cpu counter on each kernel
> entry and exit, and on transitions within the kernel between
> system, softirq, and hardirq state, so there is a small performance
> impact.
> VIRT_CPU_ACCOUNTING_GEN
> Full dynticks cpu time accounting--enables task and cpu time
> accounting on full dynticks systems. The kernel watches every
> kernel-user boundary using the context tracking subsystem. This
> has significant overhead. For now it is only useful if you are
> working on full dynticks subsystem development.
> IRQ_TIME_ACCOUNTING
> Fine-granularity task-level irq time accounting--the kernel reads
> a timestamp on each transition between softirq and hardirq state,
> so there can be a performance impact.
> 
> -Solio
> 
> 
> On Thu, Jun 14, 2018 at 08:41:33PM -0700, Solio Sarabia wrote:
> > Hello --
> > 
> > I'm running into an issue where sar, mpstat, top, and other tools show
> > less cpu utilization compared to emon [1]. Sar uses /proc/stat as its
> > source and was configured to collect at 1 s intervals. Emon reads
> > hardware counter MSRs in the PMU at timer intervals, 0.1 s for this
> > scenario.
> > 
> > The platform is based on a Xeon E5-2699 v3 (Haswell) 2.3 GHz, 2 sockets,
> > 18 cores/socket, 36 cores in total, running Ubuntu 16.04, Linux
> > 4.4.0-128-generic. A network micro-workload, ntttcp-for-linux [2],
> > sends packets from client to server through a 40GbE direct link.
> > Numbers below are from the server side.
> > 
> >                 total %util
> >           CPU11    CPU21    CPU22    CPU25
> > emon      99.99    15.90    36.22    36.82
> > sar       99.99     0.06     0.36     0.35
> > 
> >                 interrupts/sec
> >           CPU11    CPU21    CPU22    CPU25
> > intrs/sec 8462     8923     12844    6304
> > Contributors to /proc/interrupts:
> > CPU11: Local timer interrupts and Rescheduling interrupts
> > CPU21-CPU25: PCI MSI vector from network driver
> > 
> >                 softirqs/sec
> >           CPU11    CPU21    CPU22    CPU25
> > TIMER     198      1        2        1
> > NET_RX    12       8889     23553    18546
> > TASKLET   0        28889    11676    6249
> > 
> > 
> > Somehow hardware irqs and softirqs do not have an effect on the core's
> > utilization. Another observation is that as more cores are used to
> > process packets, the emon/sar gap increases.
> > 
> > Kernels used the default HZ=250. I also tried HZ=1000, which helped
> > improve throughput, but the difference in utilization is still there.
> > Same for newer kernels 4.13 and 4.15. I would appreciate pointers to
> > debug this, or insights as to what could cause this behavior.
> > 
> > [1] https://software.intel.com/en-us/download/emon-users-guide
> > [2] https://github.com/simonxiaoss/ntttcp-for-linux
> > 
> > Thanks,
> > -Solio


Re: Differences in cpu utilization reported by sar, emon

2018-06-20 Thread Solio Sarabia
Thanks Andi, Stephen, for your help/insights.

TICK_CPU_ACCOUNTING (the default option) does not account for cpu
utilization on cores handling irqs and softirqs.

IRQ_TIME_ACCOUNTING or VIRT_CPU_ACCOUNTING_GEN helps to reduce the
utilization gap. With either option there is still a difference, up
to 8% in terms of the sar/emon ratio (sar shows less utilization),
but it is an improvement over the default case.


This is a brief description of the Kbuild options:

-> General setup
  -> CPU/Task time and stats accounting
-> Cputime accounting
TICK_CPU_ACCOUNTING
Simple/basic tick-based cpu accounting--maintains statistics about
user, system, and idle time at per-jiffy granularity.
VIRT_CPU_ACCOUNTING_NATIVE (not available on my kernel)
Deterministic task and cpu time accounting--more accurate task and
cpu time accounting. The kernel reads a cpu counter on each kernel
entry and exit, and on transitions within the kernel between
system, softirq, and hardirq state, so there is a small performance
impact.
VIRT_CPU_ACCOUNTING_GEN
Full dynticks cpu time accounting--enables task and cpu time
accounting on full dynticks systems. The kernel watches every
kernel-user boundary using the context tracking subsystem. This
has significant overhead. For now it is only useful if you are
working on full dynticks subsystem development.
IRQ_TIME_ACCOUNTING
Fine-granularity task-level irq time accounting--the kernel reads
a timestamp on each transition between softirq and hardirq state,
so there can be a performance impact.
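As a concrete example, the fragment I end up with when enabling fine
irq time accounting on top of the default tick accounting looks like
this (4.15-era option names; TICK_CPU_ACCOUNTING and
VIRT_CPU_ACCOUNTING_GEN are a Kconfig choice, so only one of them can
be set):

CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_IRQ_TIME_ACCOUNTING=y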

-Solio


On Thu, Jun 14, 2018 at 08:41:33PM -0700, Solio Sarabia wrote:
> Hello --
> 
> I'm running into an issue where sar, mpstat, top, and other tools show
> less cpu utilization compared to emon [1]. Sar uses /proc/stat as its
> source and was configured to collect at 1 s intervals. Emon reads
> hardware counter MSRs in the PMU at timer intervals, 0.1 s for this
> scenario.
> 
> The platform is based on a Xeon E5-2699 v3 (Haswell) 2.3 GHz, 2 sockets,
> 18 cores/socket, 36 cores in total, running Ubuntu 16.04, Linux
> 4.4.0-128-generic. A network micro-workload, ntttcp-for-linux [2],
> sends packets from client to server through a 40GbE direct link.
> Numbers below are from the server side.
> 
>                 total %util
>           CPU11    CPU21    CPU22    CPU25
> emon      99.99    15.90    36.22    36.82
> sar       99.99     0.06     0.36     0.35
> 
>                 interrupts/sec
>           CPU11    CPU21    CPU22    CPU25
> intrs/sec 8462     8923     12844    6304
> Contributors to /proc/interrupts:
> CPU11: Local timer interrupts and Rescheduling interrupts
> CPU21-CPU25: PCI MSI vector from network driver
> 
>                 softirqs/sec
>           CPU11    CPU21    CPU22    CPU25
> TIMER     198      1        2        1
> NET_RX    12       8889     23553    18546
> TASKLET   0        28889    11676    6249
> 
> 
> Somehow hardware irqs and softirqs do not have an effect on the core's
> utilization. Another observation is that as more cores are used to
> process packets, the emon/sar gap increases.
> 
> Kernels used the default HZ=250. I also tried HZ=1000, which helped
> improve throughput, but the difference in utilization is still there.
> Same for newer kernels 4.13 and 4.15. I would appreciate pointers to
> debug this, or insights as to what could cause this behavior.
> 
> [1] https://software.intel.com/en-us/download/emon-users-guide
> [2] https://github.com/simonxiaoss/ntttcp-for-linux
> 
> Thanks,
> -Solio


Differences in cpu utilization reported by sar, emon

2018-06-14 Thread Solio Sarabia
Hello --

I'm running into an issue where sar, mpstat, top, and other tools show
less cpu utilization compared to emon [1]. Sar uses /proc/stat as its
source and was configured to collect at 1 s intervals. Emon reads
hardware counter MSRs in the PMU at timer intervals, 0.1 s for this
scenario.

The platform is based on a Xeon E5-2699 v3 (Haswell) 2.3 GHz, 2 sockets,
18 cores/socket, 36 cores in total, running Ubuntu 16.04, Linux
4.4.0-128-generic. A network micro-workload, ntttcp-for-linux [2],
sends packets from client to server through a 40GbE direct link.
Numbers below are from the server side.

                total %util
          CPU11    CPU21    CPU22    CPU25
emon      99.99    15.90    36.22    36.82
sar       99.99     0.06     0.36     0.35

                interrupts/sec
          CPU11    CPU21    CPU22    CPU25
intrs/sec 8462     8923     12844    6304
Contributors to /proc/interrupts:
CPU11: Local timer interrupts and Rescheduling interrupts
CPU21-CPU25: PCI MSI vector from network driver

                softirqs/sec
          CPU11    CPU21    CPU22    CPU25
TIMER     198      1        2        1
NET_RX    12       8889     23553    18546
TASKLET   0        28889    11676    6249


Somehow hardware irqs and softirqs do not have an effect on the core's
utilization. Another observation is that as more cores are used to
process packets, the emon/sar gap increases.
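My working theory, based on reading the accounting code: with plain
tick accounting the whole previous jiffy is attributed to whatever
context the tick interrupt lands in, so a core that looks idle at every
tick edge is accounted 100% idle no matter how many short irqs it
served in between. A simplified sketch of account_process_tick() from
kernel/sched/cputime.c (steal time and the irqtime path omitted):

void account_process_tick(struct task_struct *p, int user_tick)
{
	u64 cputime = TICK_NSEC;	/* one whole jiffy, the only quantum */
	struct rq *rq = this_rq();

	if (user_tick)
		account_user_time(p, cputime);
	else if (p != rq->idle)
		account_system_time(p, HARDIRQ_OFFSET, cputime);
	else
		account_idle_time(cputime);	/* short irqs between ticks end up here */
}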

Kernels used the default HZ=250. I also tried HZ=1000, which helped
improve throughput, but the difference in utilization is still there.
Same for newer kernels 4.13 and 4.15. I would appreciate pointers to
debug this, or insights as to what could cause this behavior.

[1] https://software.intel.com/en-us/download/emon-users-guide
[2] https://github.com/simonxiaoss/ntttcp-for-linux

Thanks,
-Solio


Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-27 Thread Solio Sarabia
On Fri, Nov 24, 2017 at 10:32:49AM -0800, Eric Dumazet wrote:
> On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> > 
> > This should be added to rtnetlink rather than sysfs.
> 
> This is already exposed by rtnetlink [1]
> 
> Please lets not add yet another net-sysfs knob.
> 
> [1] c70ce028e834f8e51306217dbdbd441d851c64d3 net/rtnetlink: add 
> IFLA_GSO_MAX_SEGS and IFLA_GSO_MAX_SIZE attributes

It's useful that `ip -d a` reports these values. Thanks.

I had an old version (iproute2-ss151103). Based on the changelog, this
is available since iproute2-ss161009.
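For reference, a hedged example of where they show up (interface name
and values illustrative):

$ ip -d link show dev eth0
... gso_max_size 65536 gso_max_segs 65535 ...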


[PATCH RFC] veth: make veth aware of gso buffer size

2017-11-25 Thread Solio Sarabia
GSO buffer size supported by underlying devices is not propagated to
veth. In high-speed connections with hw TSO enabled, veth sends buffers
bigger than lower device's maximum GSO, forcing sw TSO and increasing
system CPU usage.

Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
---
Exposing gso_max_size via sysfs is not advised [0]. This patch queries
the available interfaces to get this value. Reading dev_list is O(n)
and the list can be large (e.g. hundreds of containers), so only a
subset of interfaces is inspected.  _Please_ advise on how to make veth
aware of the lower device's GSO value.

In a test scenario with Hyper-V, an Ubuntu VM, Docker inside the VM,
and an NTttcp micro-workload sending 40 Gbps from one container, this
fix reduces sender-host CPU overhead by about 3x, since all TSO is now
done on the physical NIC. The saved CPU cycles benefit other use cases
where veth is used and the GSO buffer size is properly set.

[0] https://lkml.org/lkml/2017/11/24/512

 drivers/net/veth.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0..e255b51 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -298,6 +298,34 @@ static const struct net_device_ops veth_netdev_ops = {
   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
   NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
 
+static void veth_set_gso(struct net_device *dev)
+{
+	struct net_device *nd;
+	unsigned int size = GSO_MAX_SIZE;
+	u16 segs = GSO_MAX_SEGS;
+	unsigned int count = 0;
+	const unsigned int limit = 10;
+
+	/* Set default gso based on available physical/synthetic devices,
+	 * ignore virtual interfaces, and limit looping through dev_list
+	 * as the total number of interfaces can be large.
+	 */
+	read_lock(&dev_base_lock);
+	for_each_netdev(&init_net, nd) {
+		if (count >= limit)
+			break;
+		if (nd->dev.parent && nd->flags & IFF_UP) {
+			size = min(size, nd->gso_max_size);
+			segs = min(segs, nd->gso_max_segs);
+		}
+		count++;
+	}
+
+	read_unlock(&dev_base_lock);
+	netif_set_gso_max_size(dev, size);
+	dev->gso_max_segs = segs;
+}
+
 static void veth_setup(struct net_device *dev)
 {
ether_setup(dev);
@@ -323,6 +351,8 @@ static void veth_setup(struct net_device *dev)
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
+
+   veth_set_gso(dev);
 }
 
 /*
-- 
2.7.4



Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-23 Thread Solio Sarabia
On Wed, Nov 22, 2017 at 04:30:41PM -0800, Solio Sarabia wrote:
> The netdevice gso_max_size is exposed to allow users fine control on
> systems with multiple NICs with different GSO buffer sizes, and where
> virtual devices like bridge and veth need to be aware of the GSO size
> of the underlying devices.
> 
> In a virtualized environment, setting the right GSO sizes for physical
> and virtual devices keeps all TSO work on the physical NIC, improving
> throughput and reducing CPU utilization. If virtual devices send
> buffers greater than what the NIC supports, the host is forced to do
> software TSO for buffers exceeding the limit, increasing host CPU
> utilization.
> 
> Suggested-by: Shiny Sebastian <shiny.sebast...@intel.com>
> Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
> ---
> In one test scenario with a Hyper-V host, an Ubuntu 16.04 VM, Docker
> inside the VM, and NTttcp sending 40 Gbps from one container, setting
> the right gso_max_size values for all network devices in the chain
> reduces CPU overhead by about 3x (for the sender), since all TSO work
> is done by the physical NIC.
> 
>  net/core/net-sysfs.c | 30 ++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 799b752..7314bc8 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -376,6 +376,35 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>  
> +static int change_gso_max_size(struct net_device *dev, unsigned long new_size)
> +{
> + unsigned int orig_size = dev->gso_max_size;
> +
> + if (new_size != (unsigned int)new_size)
> + return -ERANGE;
> +
> + if (new_size == orig_size)
> + return 0;
> +
> + if (new_size <= 0 || new_size > GSO_MAX_SIZE)
> + return -ERANGE;
> +
> + dev->gso_max_size = new_size;
> + return 0;
> +}
In hindsight, we need to re-evaluate the valid range. As it is now, in
a virtualized environment users could set the gso to a value greater
than what the NICs expose, which would re-introduce the original issue:
overhead in the host OS due to a configuration value set in the VM.
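Something along these lines is what I have in mind (a sketch only;
lowest_phys_gso_max_size() is a hypothetical helper returning the
smallest gso_max_size among underlying physical NICs):

static int change_gso_max_size(struct net_device *dev, unsigned long new_size)
{
	/* Clamp to what physical devices can actually offload, instead
	 * of the global GSO_MAX_SIZE ceiling. */
	unsigned int limit = lowest_phys_gso_max_size(dev); /* hypothetical */

	if (new_size == 0 || new_size > limit)
		return -ERANGE;

	dev->gso_max_size = new_size;
	return 0;
}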

> +
> +static ssize_t gso_max_size_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> +
> + return netdev_store(dev, attr, buf, len, change_gso_max_size);
> +}
> +
> +NETDEVICE_SHOW_RW(gso_max_size, fmt_dec);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>  			     const char *buf, size_t len)
>  {
> @@ -543,6 +572,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
>  	&dev_attr_flags.attr,
>  	&dev_attr_tx_queue_len.attr,
>  	&dev_attr_gro_flush_timeout.attr,
> +	&dev_attr_gso_max_size.attr,
>  	&dev_attr_phys_port_id.attr,
>  	&dev_attr_phys_port_name.attr,
>  	&dev_attr_phys_switch_id.attr,
> -- 
> 2.7.4
> 


[PATCH] net-sysfs: export gso_max_size attribute

2017-11-22 Thread Solio Sarabia
The netdevice gso_max_size is exposed to allow users fine control on
systems with multiple NICs with different GSO buffer sizes, and where
virtual devices like bridge and veth need to be aware of the GSO size
of the underlying devices.

In a virtualized environment, setting the right GSO sizes for physical
and virtual devices keeps all TSO work on the physical NIC, improving
throughput and reducing CPU utilization. If virtual devices send
buffers greater than what the NIC supports, the host is forced to do
software TSO for buffers exceeding the limit, increasing host CPU
utilization.

Suggested-by: Shiny Sebastian <shiny.sebast...@intel.com>
Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
---
In one test scenario with a Hyper-V host, an Ubuntu 16.04 VM, Docker
inside the VM, and NTttcp sending 40 Gbps from one container, setting
the right gso_max_size values for all network devices in the chain
reduces CPU overhead by about 3x (for the sender), since all TSO work
is done by the physical NIC.
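If merged, usage would look like this (interface name and value are
illustrative; writing requires CAP_NET_ADMIN per the store hook below):

# echo 65536 > /sys/class/net/eth0/gso_max_size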

 net/core/net-sysfs.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 799b752..7314bc8 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -376,6 +376,35 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
+static int change_gso_max_size(struct net_device *dev, unsigned long new_size)
+{
+   unsigned int orig_size = dev->gso_max_size;
+
+   if (new_size != (unsigned int)new_size)
+   return -ERANGE;
+
+   if (new_size == orig_size)
+   return 0;
+
+   if (new_size <= 0 || new_size > GSO_MAX_SIZE)
+   return -ERANGE;
+
+   dev->gso_max_size = new_size;
+   return 0;
+}
+
+static ssize_t gso_max_size_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   return netdev_store(dev, attr, buf, len, change_gso_max_size);
+}
+
+NETDEVICE_SHOW_RW(gso_max_size, fmt_dec);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 const char *buf, size_t len)
 {
@@ -543,6 +572,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_gro_flush_timeout.attr,
+	&dev_attr_gso_max_size.attr,
 	&dev_attr_phys_port_id.attr,
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
-- 
2.7.4


