Wonderful tool, now it is completely clear: my bottleneck is not the DDR itself but the core-to-DDR interface.

Single core results:

Command line parameters: mlc --max_bandwidth -H -k3

ALL Reads        :  9239.05
3:1 Reads-Writes : 13348.68
2:1 Reads-Writes : 14360.44
1:1 Reads-Writes : 13792.73

Two cores:

Command line parameters: mlc --max_bandwidth -H -k3-4

ALL Reads        : 24666.55
3:1 Reads-Writes : 30905.30
2:1 Reads-Writes : 32256.26
1:1 Reads-Writes : 37349.44

Eight cores:

Command line parameters: mlc --max_bandwidth -H -k3-10

ALL Reads        : 78109.94
3:1 Reads-Writes : 62105.06
2:1 Reads-Writes : 59628.81
1:1 Reads-Writes : 55320.34

On Wed, May 25, 2022 at 12:55 PM Kinsella, Ray <[email protected]> wrote:
>
> Hi Antonio,
>
> If it is an Intel Platform you are using.
> You can take a look at the Intel Memory Latency Checker.
> https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
>
> (don't be fooled by the name, it does measure bandwidth).
>
> Ray K
>
> -----Original Message-----
> From: Antonio Di Bacco <[email protected]>
> Sent: Wednesday 25 May 2022 08:30
> To: Stephen Hemminger <[email protected]>
> Cc: [email protected]
> Subject: Re: Optimizing memory access with DPDK allocated memory
>
> Just to add some more info that could possibly be useful to someone.
> Even if a processor has many memory channels; there is also another parameter
> to take into consideration, a given "core" cannot exploit all the memory
> bandwidth available.
> For example for a DDR4 2933 MT/s with 4 channels:
> the memory bandwidth is 2933 X 8 (# of bytes of width) X 4 (# of
> channels) = 93,866.88 MB/s bandwidth, or 94 GB/s but a single core (according
> to my tests with DPDK process writing a 1GB hugepage) is about 12 GB/s (with
> a block size exceeding the L3 cache size).
>
> Can anyone confirm that ?
>
> On Mon, May 23, 2022 at 3:16 PM Antonio Di Bacco <[email protected]>
> wrote:
> >
> > Got feedback from a guy working on HPC with DPDK and he told me that
> > with dpdk mem-test (don't know where to find it) I should be doing
> > 16GB/s with DDR4 (2666) per channel. In my case with 6 channels I
> > should be doing 90GB/s .... that would be amazing!
> >
> > On Sat, May 21, 2022 at 11:42 AM Antonio Di Bacco
> > <[email protected]> wrote:
> > >
> > > I read a couple of articles
> > > (https://www.thomas-krenn.com/en/wiki/Optimize_memory_performance_of_Intel_Xeon_Scalable_systems?xtxsearchselecthit=1
> > > and this
> > > https://www.exxactcorp.com/blog/HPC/balance-memory-guidelines-for-intel-xeon-scalable-family-processors)
> > > and I understood a little bit more.
> > >
> > > If the XEON memory controller is able to spread contiguous memory
> > > accesses onto different channels in hardware (as Stephen correctly
> > > stated), then how can DPDK with option -n benefit an application?
> > > I also coded a test application to write a 1GB hugepage and
> > > calculate the time needed but, after adding two DIMMs on two
> > > unused channels of my six-channel motherboard (X11DPi-NT),
> > > I didn't observe any improvement. This is strange because adding
> > > two channels to the 4 already populated should make a noticeable
> > > difference.
> > >
> > > For reference this is the small program for allocating and writing memory.
> > > https://github.com/adibacco/simple_mp_mem_2
> > > and the results with 4 memory channels:
> > > https://docs.google.com/spreadsheets/d/1mDoKYLMhMMKDaOS3RuGEnpPgRNKuZOy4lMIhG-1N7B8/edit?usp=sharing
> > >
> > >
> > > On Fri, May 20, 2022 at 5:48 PM Stephen Hemminger
> > > <[email protected]> wrote:
> > > >
> > > > On Fri, 20 May 2022 10:34:46 +0200 Antonio Di Bacco
> > > > <[email protected]> wrote:
> > > >
> > > > > Let us say I have two memory channels, each one with its own 16GB
> > > > > memory module. I suppose the first memory channel will be used
> > > > > when addressing physical memory in the range 0 to 0x4 0000 0000
> > > > > and the second when addressing physical memory in the range 0x4 0000
> > > > > 0000 to 0x7 ffff ffff.
> > > > > Correct?
> > > > > Now, I need to have a 2GB buffer with one "writer" and one
> > > > > "reader". The writer writes to one half of the buffer (call it A)
> > > > > and, in the meantime, the reader reads from the other half (B).
> > > > > When the writer finishes writing its half of the buffer (A), it
> > > > > signals the reader and they swap: the reader starts to read from A
> > > > > and the writer starts to write to B.
> > > > > If I allocate the whole buffer (on two 1GB hugepages) across the
> > > > > two memory channels, so that one half of the buffer is allocated at
> > > > > the end of the first channel while the other half is allocated at
> > > > > the start of the second memory channel, would this increase
> > > > > performance compared to the whole buffer allocated within the same
> > > > > memory channel?
> > > >
> > > > Most systems just interleave memory chips based on number of filled
> > > > slots.
> > > > This is handled by BIOS before kernel even starts.
> > > > The DPDK has a number of memory channels parameter and what it
> > > > does is try and optimize memory allocation by spreading.
> > > >
> > > > Looks like you are inventing your own limited version of what memif
> > > > does.
