On Fri, Aug 28, 2020 at 6:04 PM Artem Bityutskiy wrote:
>
> On Thu, 2020-08-27 at 22:25 +0530, Subhashini Rao Beerisetty wrote:
> > I have an application which finds the data rate over the PCIe
> > interface. I’m getting a lower data rate on one of my Linux x86
> > systems.
>
> Some more description, maybe? Do you have a PCIe device reading one
> RAM buffer and then writing to another RAM buffer? Or does it generate
> some data and write it to a RAM buffer? Presumably it uses DMA? How
> much is the CPU involved in the process? Are we talking about
> transferring a few kilobytes or gigabytes?
Thanks a lot for your help and reply.
Regarding the hardware setup, a Xilinx PCIe FPGA endpoint is connected to
the host CPU via the PCIe bus.
The Xilinx PCIe FPGA endpoint has a DMA-REF block, which provides a
mechanism to DMA data at the maximum rate between host CPU
memory and a FIFO in the DMA-REF block.
The host software sets up some data in its memory, transfers
the data to the DMA-REF’s FIFO, and then reads it back into a different
location in host memory. This is repeated in a loop. There is a
register in the DMA-REF block that gives an indication of transfer
speed.
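To cross-check the DMA-REF speed register from the host side, the loop described above can also be timed in software. The following is only a minimal sketch: `transfer` is a hypothetical stand-in for whatever driver call pushes a buffer to the FIFO and reads it back, and `measure_loopback_rate` is a name made up for this example.

```python
import time
from typing import Callable


def measure_loopback_rate(
    transfer: Callable[[bytes], bytes],
    payload: bytes,
    iterations: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the average loopback throughput in bytes per second.

    `transfer` is a hypothetical callable: it sends `payload` to the
    device FIFO and returns the data read back into host memory.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")

    start = clock()
    for _ in range(iterations):
        echoed = transfer(payload)
        if echoed != payload:
            raise RuntimeError("loopback data mismatch")
    elapsed = clock() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")

    # Each iteration moves the payload twice: host -> FIFO and FIFO -> host.
    return 2 * len(payload) * iterations / elapsed
```

The clock is injectable so the calculation can be checked without hardware. On a real system the measured value includes driver and interrupt overhead, so it will usually come out lower than what the hardware register reports.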
>
> > When I change the scaling_governor from "powersave" to "performance"
> > mode for each CPU, then there is slight improvement in the PCIe data
> > rate.
>
> Definitely this makes your CPU(s) run at max speed, but depending on
> platform and settings, this may also affect C-states. Are the CPU(s)
> generally idle while you measure, or busy (involved in the test)? You
> could run 'turbostat' while measuring the bandwidth, to get some CPU
> statistics (e.g., do C-states happen during the PCI test, how busy are
> the CPUs).
>
> > In parallel, I started profiling the workload with perf. Whenever I start
> > running the profile command “perf stat -a -d -p ”, surprisingly
> > the application achieves an excellent data rate over PCIe, but when I
> > kill the perf command the PCIe data rate drops again. I am really confused
> > about this behavior. Any clues?
>
> Well, one possible reason that comes to mind: you get rid of C-states
> when you run perf, and this increases the PCI bandwidth. You can just
> try disabling C-states (there are sysfs knobs) and check it out.
> Turbostat could be useful to check for this: with and without perf, run
> 'turbostat sleep 10' or something like this (it measures for 10 seconds in
> this example) while running your PCI test.
Disabling the C-states improved the throughput a lot; thanks for
pointing this out. Could you please explain in more detail how
disabling C-states improves the throughput?
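For reference, the sysfs knobs in question are the per-CPU cpuidle `disable` files. The sketch below assumes the standard layout `/sys/devices/system/cpu/cpuN/cpuidle/stateM/{name,disable}` and that everything from index 2 upward counts as a "deep" state (on this system that would be C6N and C6S, going by the turbostat listing). The function names and the `min_index` default are choices made for this example, and writing to the real files requires root.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def _state_index(state_dir: Path) -> int:
    # Directory names look like "state0", "state1", ...
    return int(state_dir.name[len("state"):])


def list_idle_states(cpu: int, root: Path = CPU_ROOT):
    """Return (index, name, disabled) for every cpuidle state of one CPU."""
    base = root / f"cpu{cpu}" / "cpuidle"
    states = []
    for state_dir in sorted(base.glob("state[0-9]*"), key=_state_index):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((_state_index(state_dir), name, disabled))
    return states


def set_deep_states_disabled(cpu: int, disabled: bool, min_index: int = 2,
                             root: Path = CPU_ROOT):
    """Write 1 or 0 to 'disable' for every state with index >= min_index.

    Returns the names of the states that were touched.
    """
    touched = []
    for index, name, _ in list_idle_states(cpu, root):
        if index >= min_index:
            knob = root / f"cpu{cpu}" / "cpuidle" / f"state{index}" / "disable"
            knob.write_text("1" if disabled else "0")
            touched.append(name)
    return touched
```

Calling `set_deep_states_disabled(cpu, True)` for each CPU before the PCIe test and `set_deep_states_disabled(cpu, False)` afterwards makes it possible to compare the two cases without rebooting. The usual reasoning is that deeper states have longer exit latencies, so a CPU that keeps entering them is slower to react to each DMA completion.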
As you suggested I collected and attached the turbostat log with and
without perf while running the PCIe test.
On my system, only 'performance' and 'powersave' are listed in
scaling_available_governors. The other governors
('userspace', 'ondemand', 'schedutil') are not listed there.
What might be the reason for this?
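One likely explanation, not confirmed in this thread: the turbostat output below reports the cpufreq driver as intel_pstate, and when that driver runs in its active mode it exposes only 'performance' and 'powersave'. The generic governors normally appear only with acpi-cpufreq or with intel_pstate in passive mode (where the driver shows up as `intel_cpufreq`). The sketch below just collects the relevant sysfs values. `read_cpufreq_info` is a name made up for this example, and the `intel_pstate/status` file is assumed to exist only on reasonably recent kernels.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq_info(cpu: int = 0, root: Path = CPU_ROOT) -> dict:
    """Collect the cpufreq driver, current governor and available governors."""
    cpufreq = root / f"cpu{cpu}" / "cpufreq"
    info = {
        "driver": (cpufreq / "scaling_driver").read_text().strip(),
        "governor": (cpufreq / "scaling_governor").read_text().strip(),
        "available": (cpufreq / "scaling_available_governors").read_text().split(),
    }
    # Present only when the intel_pstate driver is built in and the kernel
    # is new enough; typically reads "active", "passive" or "off".
    status_file = root / "intel_pstate" / "status"
    info["pstate_status"] = (
        status_file.read_text().strip() if status_file.exists() else None
    )
    return info
```

If `driver` comes back as `intel_pstate`, the short governor list is expected behaviour rather than a misconfiguration.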
>
> But I am really just guessing here, I do not know enough about your
> test and the system (e.g., "a Linux x86" system can be so many things,
> like Intel or AMD server or a mobile device)…
It's an Intel Atom processor.
>
>
turbostat version 17.06.23 - Len Brown
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:9 (6:55:9)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, No-TURBO, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
cpu2: MSR_CC6_DEMOTION_POLICY_CONFIG: 0x (DISable-CC6-Demotion)
cpu2: MSR_MC6_DEMOTION_POLICY_CONFIG: 0x (DISable-MC6-Demotion)
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x6001700
6 * 83.3 = 499.8 MHz max efficiency frequency
23 * 83.3 = 1915.9 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x (C1E auto-promotion: DISabled)
cpu2: MSR_ATOM_CORE_RATIOS: 0x00170602
2 * 83.3 = 166.6 MHz minimum operating frequency
6 * 83.3 = 499.8 MHz low frequency mode (LFM)
23 * 83.3 = 1915.9 MHz base frequency
cpu2: MSR_ATOM_CORE_TURBO_RATIOS: 0x17171717
23 * 83.3 = 1915.9 MHz max turbo 4 active cores
23 * 83.3 = 1915.9 MHz max turbo 3 active cores
23 * 83.3 = 1915.9 MHz max turbo 2 active cores
23 * 83.3 = 1915.9 MHz max turbo 1 active core
cpu2: MSR_PKG_CST_CONFIG_CONTROL: 0x0017000f (UNlocked: pkg-cstate-limit=15: pc7)
cpu2: POLL: CPUIDLE CORE POLL IDLE
cpu2: C1: MWAIT 0x00
cpu2: C6N: MWAIT 0x58
cpu2: C6S: MWAIT 0x52
cpu2: cpufreq driver: intel_pstate
cpu2: cpufreq governor: performance
cpufreq intel_pstate no_turbo: 1
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x0006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x0505 (0.031250 Watts, 0.32 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.00 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.00