Re: turbostat-17.06.23 floating point exception
On Fri, Oct 12, 2018 at 07:03:41PM -0400, Len Brown wrote:
> > Why would the cpu topology report 0 cpus? I added a debug entry to
> > cpu_usage_stat and /proc/stat showed it as an extra column. Then
> > fscanf parsing in for_all_cpus() failed, causing the SIGFPE.
> >
> > This is not an issue. Thanks.
>
> Yes, it is true that turbostat doesn't check for systems with 0 cpus.
> I'm curious how you provoked the kernel to claim that. If it is
> something others might do, we can add a check for it and gracefully
> exit.

source/tools/power/x86/turbostat/turbostat.c

int for_all_proc_cpus(int (func)(int))
{
	retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n");
	^
This fails due to an extra debug entry in /proc/stat (a total of 11
columns). I was measuring time spent in a hot function and decided to
report that time in an extra cpu_usage_stat field. It was only an
experiment, though.

Thanks,
-S.
Re: turbostat-17.06.23 floating point exception
On Fri, Oct 12, 2018 at 11:26:30AM -0700, Solio Sarabia wrote:
> Hi --
>
> turbostat 17.06.23 is throwing an exception on a custom linux-4.16.12
> kernel, on Xeon E5-2699 v4 Broadwell EP, 2S, 22C/S, 44C total, HT off,
> VTx off.
>
> Initially the system had 4.4.0-137. Then I built and installed
> linux-4.16.12-default. turbostat works fine for these two versions.
> After building linux-4.16.12 a second time, the older kernel is
> renamed, and `ls -l /boot/` now shows (I'm using the version without
> the .old suffix):
>
> vmlinuz-4.16.12-default+
> vmlinuz-4.16.12-default+.old
>
> grep -i 'turbostat' /var/log/kern.log
>
> kernel: [  159.140836] capability: warning: `turbostat' uses 32-bit
> capabilities (legacy support in use)
> kernel: [  164.149264] traps: turbostat[1801] trap divide error
> ip:407625 sp:7ffe4b0df000 error:0 in turbostat[40+17000]
>
> (gdb)
> cpu22: MSR_PKGC3_IRTL: 0x (NOTvalid, 0 ns)
> cpu22: MSR_PKGC6_IRTL: 0x (NOTvalid, 0 ns)
> cpu22: MSR_PKGC7_IRTL: 0x (NOTvalid, 0 ns)
>
> Program received signal SIGFPE, Arithmetic exception.
> 0x00407625 in compute_average (t=0x61a3b0, c=0x61a3d0, p=0x61a480) at
> turbostat.c:1378
> 1378		average.threads.tsc /= topo.num_cpus;

Why would the cpu topology report 0 cpus? I added a debug entry to
cpu_usage_stat and /proc/stat showed it as an extra column. Then fscanf
parsing in for_all_cpus() failed, causing the SIGFPE.

This is not an issue. Thanks.

> Let me know if you need more details.
>
> Thanks,
> -SS
Time accounting difference under high IO interrupts
Under high IO activity (storage or network), the kernel is not
accounting some cpu cycles, comparing sar vs emon (a tool that accesses
the hw PMU directly). The difference is higher on cores that spend most
of their time in an idle state and are constantly waking up to handle
interrupts. It happens even with fine-grained IRQ time accounting
enabled (CONFIG_IRQ_TIME_ACCOUNTING).

After playing with timer subsystem options (periodic ticks, idle
tickless, full tickless), time and stats accounting, and jiffy values,
the issue persists. The lost cycles are not accounted on other cores as
'extra' util.

Example with linux 4.15.18 baremetal, Xeon v4 Broadwell, driving
network traffic:

            sar    emon   emon-sar  intrs/sec
    core12   5.00  11.70    6.70     29,302
    core17  19.07  23.16    4.09     17,345
    core20  19.41  23.11    3.70     16,578

Based on how the kernel accounts time: do you have an idea why a high
number of interrupts affects time accounting?

Thanks,
-Solio
Re: Differences in cpu utilization reported by sar, emon
Further analysis shows that even with CONFIG_IRQ_TIME_ACCOUNTING or
DYNTICKS (CONFIG_VIRT_CPU_ACCOUNTING_GEN) some CPU cycles are lost.
The difference correlates with the number of interrupts/sec handled on
the core: as that number increases, so does the difference.

Network example (linux baremetal):

            sar   emon  emon-sar  interrupts/sec
    core18  19.2  20.9    1.7         11057
    core21  25.1  30.5    5.4         31472
    core27  18.3  20.1    1.8         10384
    core30   5.6  11.4    5.8         35841
    core34  17.8  20.5    2.7         10973

Storage example (fio, device attached directly to the vm):

            sar   perfmon  perfmon-sar  interrupts/sec
    core10  7.4    19.7       12.3         100481

In the storage case, up to 12% of irq cycles were not accounted. As
users adopt more capable SSDs, for instance, the issue will become more
evident. I would like to understand why this happens:

What could be the reason for this issue? Any pointers to the kernel
subsystem/code performing time accounting?

Thanks,
-Solio

On Wed, Jun 20, 2018 at 04:41:40PM -0700, Solio Sarabia wrote:
> Thanks Andi, Stephen, for your help/insights.
>
> TICK_CPU_ACCOUNTING (the default option) does not account for cpu
> util on cores handling irqs and softirqs.
>
> IRQ_TIME_ACCOUNTING or VIRT_CPU_ACCOUNTING_GEN helps to reduce the
> util gap. With either option there is still a difference, for example
> up to 8% in terms of the sar/emon ratio (sar shows less util). This
> is an improvement over the default case, though.
>
> This is a brief description of the Kbuild options:
>
> -> General setup
>   -> CPU/Task time and stats accounting
>     -> Cputime accounting
>
> TICK_CPU_ACCOUNTING
>   Simple/basic tick based cpu accounting--maintains statistics about
>   user, system and idle time spent at per-jiffy granularity.
> VIRT_CPU_ACCOUNTING_NATIVE (not available on my kernel)
>   Deterministic task and cpu time accounting--more accurate task and
>   cpu time accounting. The kernel reads a cpu counter on each kernel
>   entry and exit, and on transitions within the kernel between
>   system, softirq, and hardirq state, so there is a small performance
>   impact.
> VIRT_CPU_ACCOUNTING_GEN
>   Full dynticks cpu time accounting--enables task and cpu time
>   accounting on full dynticks systems. The kernel watches every
>   kernel-user boundary using the context tracking subsystem. There is
>   significant overhead. For now only useful if you are working on
>   full dynticks subsystem development.
> IRQ_TIME_ACCOUNTING
>   Fine granularity task level irq time accounting--the kernel reads a
>   timestamp on each transition between softirq and hardirq state, so
>   there can be a performance impact.
>
> -Solio
>
> On Thu, Jun 14, 2018 at 08:41:33PM -0700, Solio Sarabia wrote:
> > Hello --
> >
> > I'm running into an issue where sar, mpstat, top, and other tools
> > show less cpu utilization compared to emon [1]. Sar uses /proc/stat
> > as its source, and was configured to collect in 1s intervals. Emon
> > reads hardware counter MSRs in the PMU in timer intervals, 0.1s for
> > this scenario.
> >
> > The platform is based on Xeon E5-2699 v3 (Haswell) 2.3GHz,
> > 2 sockets, 18 cores/socket, 36 cores in total, running Ubuntu 16.04,
> > Linux 4.4.0-128-generic. A network micro workload,
> > ntttcp-for-linux [2], sends packets from client to server through a
> > 40GbE direct link. Numbers below are from the server side.
> >
> > total %util
> >             CPU11  CPU21  CPU22  CPU25
> > emon        99.99  15.90  36.22  36.82
> > sar         99.99   0.06   0.36   0.35
> >
> > interrupts/sec
> >             CPU11  CPU21  CPU22  CPU25
> > intrs/sec     846  28923  12844   6304
> > Contributors to /proc/interrupts:
> > CPU11: Local timer interrupts and Rescheduling interrupts
> > CPU21-CPU25: PCI MSI vector from network driver
> >
> > softirqs/sec
> >             CPU11  CPU21  CPU22  CPU25
> > TIMER         198      1      2      1
> > NET_RX          1  28889  23553  18546
> > TASKLET         0  28889  11676   6249
> >
> > Somehow hardware irqs and softirqs do not have an effect on the
> > core's utilization. Another observation is that as more cores are
> > used to process packets, the emon/sar gap increases.
> >
> > Kernels used the default HZ=250. I also tried HZ=1000, which helped
> > improve throughput, but the difference in util is still there. Same
> > for newer kernels 4.13, 4.15. I would appreciate pointers to debug
> > this, or insights as to what could cause this behavior.
> >
> > [1] https://software.intel.com/en-us/download/emon-users-guide
> > [2] https://github.com/simonxiaoss/ntttcp-for-linux
> >
> > Thanks,
> > -Solio
Re: Differences in cpu utilization reported by sar, emon
Thanks Andi, Stephen, for your help/insights.

TICK_CPU_ACCOUNTING (the default option) does not account for cpu util
on cores handling irqs and softirqs.

IRQ_TIME_ACCOUNTING or VIRT_CPU_ACCOUNTING_GEN helps to reduce the util
gap. With either option there is still a difference, for example up to
8% in terms of the sar/emon ratio (sar shows less util). This is an
improvement over the default case, though.

This is a brief description of the Kbuild options:

-> General setup
  -> CPU/Task time and stats accounting
    -> Cputime accounting

TICK_CPU_ACCOUNTING
  Simple/basic tick based cpu accounting--maintains statistics about
  user, system and idle time spent at per-jiffy granularity.
VIRT_CPU_ACCOUNTING_NATIVE (not available on my kernel)
  Deterministic task and cpu time accounting--more accurate task and
  cpu time accounting. The kernel reads a cpu counter on each kernel
  entry and exit, and on transitions within the kernel between system,
  softirq, and hardirq state, so there is a small performance impact.
VIRT_CPU_ACCOUNTING_GEN
  Full dynticks cpu time accounting--enables task and cpu time
  accounting on full dynticks systems. The kernel watches every
  kernel-user boundary using the context tracking subsystem. There is
  significant overhead. For now only useful if you are working on full
  dynticks subsystem development.
IRQ_TIME_ACCOUNTING
  Fine granularity task level irq time accounting--the kernel reads a
  timestamp on each transition between softirq and hardirq state, so
  there can be a performance impact.

-Solio

On Thu, Jun 14, 2018 at 08:41:33PM -0700, Solio Sarabia wrote:
> Hello --
>
> I'm running into an issue where sar, mpstat, top, and other tools
> show less cpu utilization compared to emon [1]. Sar uses /proc/stat
> as its source, and was configured to collect in 1s intervals. Emon
> reads hardware counter MSRs in the PMU in timer intervals, 0.1s for
> this scenario.
>
> The platform is based on Xeon E5-2699 v3 (Haswell) 2.3GHz, 2 sockets,
> 18 cores/socket, 36 cores in total, running Ubuntu 16.04, Linux
> 4.4.0-128-generic. A network micro workload, ntttcp-for-linux [2],
> sends packets from client to server through a 40GbE direct link.
> Numbers below are from the server side.
>
> total %util
>             CPU11  CPU21  CPU22  CPU25
> emon        99.99  15.90  36.22  36.82
> sar         99.99   0.06   0.36   0.35
>
> interrupts/sec
>             CPU11  CPU21  CPU22  CPU25
> intrs/sec     846  28923  12844   6304
> Contributors to /proc/interrupts:
> CPU11: Local timer interrupts and Rescheduling interrupts
> CPU21-CPU25: PCI MSI vector from network driver
>
> softirqs/sec
>             CPU11  CPU21  CPU22  CPU25
> TIMER         198      1      2      1
> NET_RX          1  28889  23553  18546
> TASKLET         0  28889  11676   6249
>
> Somehow hardware irqs and softirqs do not have an effect on the
> core's utilization. Another observation is that as more cores are
> used to process packets, the emon/sar gap increases.
>
> Kernels used the default HZ=250. I also tried HZ=1000, which helped
> improve throughput, but the difference in util is still there. Same
> for newer kernels 4.13, 4.15. I would appreciate pointers to debug
> this, or insights as to what could cause this behavior.
>
> [1] https://software.intel.com/en-us/download/emon-users-guide
> [2] https://github.com/simonxiaoss/ntttcp-for-linux
>
> Thanks,
> -Solio
Differences in cpu utilization reported by sar, emon
Hello --

I'm running into an issue where sar, mpstat, top, and other tools show
less cpu utilization compared to emon [1]. Sar uses /proc/stat as its
source, and was configured to collect in 1s intervals. Emon reads
hardware counter MSRs in the PMU in timer intervals, 0.1s for this
scenario.

The platform is based on Xeon E5-2699 v3 (Haswell) 2.3GHz, 2 sockets,
18 cores/socket, 36 cores in total, running Ubuntu 16.04, Linux
4.4.0-128-generic. A network micro workload, ntttcp-for-linux [2],
sends packets from client to server through a 40GbE direct link.
Numbers below are from the server side.

total %util
            CPU11  CPU21  CPU22  CPU25
emon        99.99  15.90  36.22  36.82
sar         99.99   0.06   0.36   0.35

interrupts/sec
            CPU11  CPU21  CPU22  CPU25
intrs/sec     846  28923  12844   6304
Contributors to /proc/interrupts:
CPU11: Local timer interrupts and Rescheduling interrupts
CPU21-CPU25: PCI MSI vector from network driver

softirqs/sec
            CPU11  CPU21  CPU22  CPU25
TIMER         198      1      2      1
NET_RX          1  28889  23553  18546
TASKLET         0  28889  11676   6249

Somehow hardware irqs and softirqs do not have an effect on the core's
utilization. Another observation is that as more cores are used to
process packets, the emon/sar gap increases.

Kernels used the default HZ=250. I also tried HZ=1000, which helped
improve throughput, but the difference in util is still there. Same for
newer kernels 4.13, 4.15. I would appreciate pointers to debug this, or
insights as to what could cause this behavior.

[1] https://software.intel.com/en-us/download/emon-users-guide
[2] https://github.com/simonxiaoss/ntttcp-for-linux

Thanks,
-Solio
Re: [PATCH] net-sysfs: export gso_max_size attribute
On Fri, Nov 24, 2017 at 10:32:49AM -0800, Eric Dumazet wrote:
> On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> >
> > This should be added to rtnetlink rather than sysfs.
>
> This is already exposed by rtnetlink [1]
>
> Please let's not add yet another net-sysfs knob.
>
> [1] c70ce028e834f8e51306217dbdbd441d851c64d3 net/rtnetlink: add
> IFLA_GSO_MAX_SEGS and IFLA_GSO_MAX_SIZE attributes

It's useful that `ip -d a` reports these values. Thanks. I had an old
version (iproute2-ss151103). Based on the changelog, it has been
available since iproute2-ss161009.
[PATCH RFC] veth: make veth aware of gso buffer size
GSO buffer size supported by underlying devices is not propagated to
veth. In high-speed connections with hw TSO enabled, veth sends buffers
bigger than the lower device's maximum GSO, forcing sw TSO and
increasing system CPU usage.

Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
---
Exposing gso_max_size via sysfs is not advised [0]. This patch queries
the available interfaces to get this value. Since reading dev_list is
O(n) and the list can be large (e.g. hundreds of containers), only a
subset of interfaces is inspected. _Please_ advise pointers on how to
make veth aware of the lower device's GSO value.

In a test scenario with Hyper-V, an Ubuntu VM, Docker inside the VM,
and an NTttcp microworkload sending 40 Gbps from one container, this
fix reduces sender host CPU overhead 3x, since now all TSO is done on
the physical NIC. Savings in CPU cycles benefit other use cases where
veth is used and the GSO buffer size is properly set.

[0] https://lkml.org/lkml/2017/11/24/512

 drivers/net/veth.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0..e255b51 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -298,6 +298,34 @@ static const struct net_device_ops veth_netdev_ops = {
 			       NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
 			       NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
 
+static void veth_set_gso(struct net_device *dev)
+{
+	struct net_device *nd;
+	unsigned int size = GSO_MAX_SIZE;
+	u16 segs = GSO_MAX_SEGS;
+	unsigned int count = 0;
+	const unsigned int limit = 10;
+
+	/* Set default gso based on available physical/synthetic devices,
+	 * ignore virtual interfaces, and limit looping through dev_list
+	 * as the total number of interfaces can be large.
+	 */
+	read_lock(&dev_base_lock);
+	for_each_netdev(&init_net, nd) {
+		if (count >= limit)
+			break;
+		if (nd->dev.parent && nd->flags & IFF_UP) {
+			size = min(size, nd->gso_max_size);
+			segs = min(segs, nd->gso_max_segs);
+		}
+		count++;
+	}
+	read_unlock(&dev_base_lock);
+
+	netif_set_gso_max_size(dev, size);
+	dev->gso_max_segs = segs;
+}
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -323,6 +351,8 @@ static void veth_setup(struct net_device *dev)
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
+
+	veth_set_gso(dev);
 }
 
 /*
-- 
2.7.4
Re: [PATCH] net-sysfs: export gso_max_size attribute
On Wed, Nov 22, 2017 at 04:30:41PM -0800, Solio Sarabia wrote:
> The netdevice gso_max_size is exposed to allow users fine-control on
> systems with multiple NICs with different GSO buffer sizes, and where
> the virtual devices like bridge and veth need to be aware of the GSO
> size of the underlying devices.
>
> In a virtualized environment, setting the right GSO sizes for
> physical and virtual devices makes all TSO work happen on the physical
> NIC, improving throughput and reducing CPU util. If virtual devices
> send buffers greater than what the NIC supports, it forces the host to
> do TSO for buffers exceeding the limit, increasing CPU utilization in
> the host.
>
> Suggested-by: Shiny Sebastian <shiny.sebast...@intel.com>
> Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
> ---
> In one test scenario with a Hyper-V host, an Ubuntu 16.04 VM, Docker
> inside the VM, and NTttcp sending 40 Gbps from one container, setting
> the right gso_max_size values for all network devices in the chain
> reduces CPU overhead about 3x (for the sender), since all TSO work is
> done by the physical NIC.
>
>  net/core/net-sysfs.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 799b752..7314bc8 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -376,6 +376,35 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>  
> +static int change_gso_max_size(struct net_device *dev, unsigned long new_size)
> +{
> +	unsigned int orig_size = dev->gso_max_size;
> +
> +	if (new_size != (unsigned int)new_size)
> +		return -ERANGE;
> +
> +	if (new_size == orig_size)
> +		return 0;
> +
> +	if (new_size <= 0 || new_size > GSO_MAX_SIZE)
> +		return -ERANGE;
> +
> +	dev->gso_max_size = new_size;
> +	return 0;
> +}

In hindsight, we need to re-evaluate the valid range. As it is now, in
a virtualized environment users could set the gso to a value greater
than what the NICs expose, which would re-inflict the original issue:
overhead in the host OS due to a configuration value in the vm.

> +
> +static ssize_t gso_max_size_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t len)
> +{
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	return netdev_store(dev, attr, buf, len, change_gso_max_size);
> +}
> +
> +NETDEVICE_SHOW_RW(gso_max_size, fmt_dec);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>  			     const char *buf, size_t len)
>  {
> @@ -543,6 +572,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
>  	&dev_attr_flags.attr,
>  	&dev_attr_tx_queue_len.attr,
>  	&dev_attr_gro_flush_timeout.attr,
> +	&dev_attr_gso_max_size.attr,
>  	&dev_attr_phys_port_id.attr,
>  	&dev_attr_phys_port_name.attr,
>  	&dev_attr_phys_switch_id.attr,
> --
> 2.7.4
>
[PATCH] net-sysfs: export gso_max_size attribute
The netdevice gso_max_size is exposed to allow users fine-grained
control on systems with multiple NICs with different GSO buffer sizes,
and where virtual devices like bridge and veth need to be aware of the
GSO size of the underlying devices.

In a virtualized environment, setting the right GSO sizes for physical
and virtual devices makes all TSO work be done on the physical NIC,
improving throughput and reducing CPU utilization. If virtual devices
send buffers greater than what the NIC supports, the host is forced to
do TSO for buffers exceeding the limit, increasing CPU utilization in
the host.

Suggested-by: Shiny Sebastian <shiny.sebast...@intel.com>
Signed-off-by: Solio Sarabia <solio.sara...@intel.com>
---
In one test scenario with a Hyper-V host, an Ubuntu 16.04 VM with
Docker inside the VM, and NTttcp sending 40 Gbps from one container,
setting the right gso_max_size values for all network devices in the
chain reduces CPU overhead about 3x (for the sender), since all TSO
work is done by the physical NIC.

 net/core/net-sysfs.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 799b752..7314bc8 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -376,6 +376,35 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
+static int change_gso_max_size(struct net_device *dev, unsigned long new_size)
+{
+	unsigned int orig_size = dev->gso_max_size;
+
+	if (new_size != (unsigned int)new_size)
+		return -ERANGE;
+
+	if (new_size == orig_size)
+		return 0;
+
+	if (new_size <= 0 || new_size > GSO_MAX_SIZE)
+		return -ERANGE;
+
+	dev->gso_max_size = new_size;
+	return 0;
+}
+
+static ssize_t gso_max_size_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_gso_max_size);
+}
+
+NETDEVICE_SHOW_RW(gso_max_size, fmt_dec);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -543,6 +572,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_gro_flush_timeout.attr,
+	&dev_attr_gso_max_size.attr,
 	&dev_attr_phys_port_id.attr,
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
--
2.7.4