> On 21 Mar 2019, at 06:51, Nitin Saxena <nsax...@marvell.com> wrote:
> 
> Hi,
> 
> First of all, sorry for responding late to this mail chain. Please see my 
> answers inline in blue.
> 
> Thanks,
> Nitin
> 
> 
> From: Damjan Marion <dmar...@me.com>
> Sent: Monday, March 18, 2019 4:48 PM
> To: Honnappa Nagarahalli
> Cc: vpp-dev; Nitin Saxena
> Subject: [EXT] Re: [vpp-dev] 128 byte cache line support
>  
> External Email
> 
> 
>> On 15 Mar 2019, at 04:52, Honnappa Nagarahalli 
>> <honnappa.nagaraha...@arm.com> wrote:
>> 
>>  
>> 
>> Related to change 18278[1], I was wondering if there is really a benefit in 
>> dealing with 128-byte cachelines like we do today.
>> Compiling VPP with the cacheline size set to 128 will basically just add 64 
>> bytes of unused space at the end of each cacheline, so vlib_buffer_t, for 
>> example, will grow from 128 bytes to 256 bytes, but we will still need to 
>> prefetch 2 cachelines like we do by default.
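To illustrate the padding effect: with a cacheline-align marker between the two
metadata blocks, a 128-byte structure becomes 256 bytes when the build-time
cacheline size is 128. A minimal, self-contained sketch (illustrative macro and
field names, not the exact vlib definitions):

/* Sketch of how a cacheline-align marker pads a structure.
   CACHE_LINE_BYTES stands in for the build-time cacheline setting. */
#include <stdio.h>

#define CACHE_LINE_BYTES 128  /* rebuild with 64 to compare */

typedef struct
{
  /* first "cacheline" of metadata, 64 bytes actually used */
  unsigned char meta0[64] __attribute__ ((aligned (CACHE_LINE_BYTES)));
  /* second "cacheline", also 64 bytes actually used */
  unsigned char meta1[64] __attribute__ ((aligned (CACHE_LINE_BYTES)));
} example_buffer_t;

int
main (void)
{
  /* prints 128 when CACHE_LINE_BYTES is 64, 256 when it is 128 */
  printf ("sizeof (example_buffer_t) = %zu\n", sizeof (example_buffer_t));
  return 0;
}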
>> 
>> [Nitin]: This is the existing model. In the forwarding case, mainly the 
>> first vlib cacheline is used. We are utilising the existing hole (in the 
>> first vlib cacheline) to store packet parsing info (size == 64B). This has 
>> many benefits; one of them is avoiding the ipv4-input-no-chksum() software 
>> checks. It gives us a ~20 cycle benefit on our platform, so I do not want 
>> to lose that gain.
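For context, roughly the kind of layout being described (hypothetical field
names; assuming the hardware parse results fit in the 64 bytes that would
otherwise be padding on a 128B-cacheline build):

/* Hypothetical sketch of the layout described above: on a 128B-cacheline
   build the 64 bytes of padding after the first metadata block are reused
   for parse results instead of being left as a hole. */
typedef struct
{
  unsigned char vlib_meta0[64]; /* standard first-cacheline metadata */
  unsigned char parse_info[64]; /* e.g. L3/L4 offsets and checksum-ok flags
                                   used to skip the ipv4-input-no-chksum
                                   software checks */
  unsigned char vlib_meta1[64] __attribute__ ((aligned (128)));
} padded_buffer_layout_t;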

That sounds like a terribly bad idea, and it will likely never be upstreamed.
vlib_buffer_t is a 128-byte data structure, and it is a perfect fit for systems 
with a 128-byte cacheline. I don't see a point in growing it to 256 bytes.
If you need more space, you can always grow the headroom space by an additional 
cacheline and store whatever you want there.
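A minimal sketch of that alternative, with generic names (this is not the exact
vlib buffer layout or API): keep the buffer header at 128 bytes and reserve one
extra cacheline of headroom in front of the packet data for whatever extra
per-packet state is needed.

/* Sketch only: generic buffer with one extra cacheline of headroom. */
enum { CACHE_LINE = 128, HEADROOM = 2 * CACHE_LINE };

typedef struct
{
  unsigned char metadata[128];      /* buffer header stays 128 bytes */
  unsigned char headroom[HEADROOM]; /* rewrite space + 1 extra cacheline */
  unsigned char data[2048];         /* packet bytes */
} buffer_t;

/* per-packet scratch area: the extra cacheline at the start of the headroom */
static inline void *
buffer_extra_cacheline (buffer_t * b)
{
  return b->headroom;
}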


>> What will happen if we just leave that at 64?
>> 
>> [Nitin]: This will create L1D holes on 128B targets, right?
>> 
>> Unutilized holes are not acceptable, as they waste L1D space and thereby 
>> affect performance. On the contrary, we want to pack structures from 2x64B 
>> to 1x128B cachelines to reduce the number of pending prefetches in the core 
>> pipeline. VPP heavily issues LOAD/STORE prefetches of 64B, and our effort 
>> is to reduce them for our target.
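For reference, the kind of prefetch pattern being discussed (an illustration
using the GCC builtin; CACHE_LINE_BYTES stands in for the build-time setting
and this is not VPP's actual prefetch macro):

/* Two prefetches are needed to cover 128 bytes of buffer metadata on a
   64B-cacheline build; on a 128B-cacheline build the first prefetch
   already covers both halves, so the second one disappears. */
#ifndef CACHE_LINE_BYTES
#define CACHE_LINE_BYTES 64
#endif

static inline void
prefetch_buffer_header (void *b)
{
  __builtin_prefetch (b, 0 /* read */, 3 /* high temporal locality */);
#if CACHE_LINE_BYTES < 128
  __builtin_prefetch ((char *) b + 64, 0, 3);
#endif
}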
>> 

Not sure what you mean by L1D holes. My proposal is that we align all 
per-thread data structures to 128 bytes, not grow anything.
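Concretely, something along these lines (a sketch with illustrative names, not
a patch): pad and align each per-thread structure to 128 bytes regardless of
the build-time cacheline size, so two threads never share a 128-byte block.

#define PER_THREAD_ALIGN 128

typedef struct
{
  unsigned long packets;
  unsigned long bytes;
  /* ... other per-thread state ... */
} __attribute__ ((aligned (PER_THREAD_ALIGN))) per_thread_data_t;

/* one element per worker thread; each element lands on its own 128B block,
   whether the underlying cacheline is 64 or 128 bytes */
per_thread_data_t per_thread_data[32];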

>> [Honnappa] Currently, ThunderX1 and Octeon TX have a 128B cache line. What 
>> I have heard from Marvell folks is that the 64B cache line setting in DPDK 
>> does not work. I have not gone into details on what exactly does not work. 
>> Maybe Nitin can elaborate.
>> 
> I’m curious to hear details…
>> 
>> 1. sometimes (and not very frequently) we will issue 2 prefetch 
>> instructions for the same cacheline, but I hope the hardware is smart 
>> enough to just ignore the 2nd one
>> 
>> 2. we may face false-sharing issues if the first 64 bytes are touched by 
>> one thread and the other 64 bytes are touched by another
>> 
>> The second one sounds to me like a real problem, but it can be solved by 
>> aligning all per-thread data structures to 2x the cacheline size.
>> 
>> [Honnappa] Sorry, I don’t understand you here. Even if the data structure is 
>> aligned on 128B (2 X 64B), 2 contiguous blocks of 64B data would be on a 
>> single cache line.
>> 
> I wanted to say that we can align all per-thread data structures to 128 
> bytes, even on systems which have a 64-byte cacheline size.
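To make the false-sharing case concrete (hypothetical counters, a sketch only):
with 64-byte alignment, two threads' counters can end up in the same 128-byte
cacheline on a 128B-cacheline machine; with 128-byte alignment they cannot.

typedef struct
{
  unsigned long counter;
  unsigned char pad[56];
} __attribute__ ((aligned (64))) counter_64_t;    /* may false-share */

typedef struct
{
  unsigned long counter;
  unsigned char pad[120];
} __attribute__ ((aligned (128))) counter_128_t;  /* never false-shares */

counter_64_t c64[2];    /* c64[0] and c64[1] can share one 128B line */
counter_128_t c128[2];  /* c128[0] and c128[1] are always on separate lines */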
>> 
>> Actually, if I remember correctly, even on x86 some of the hardware 
>> prefetchers deal with blocks of 2 cachelines.
>> 
>> So unless I missed something, my proposal here is: instead of maintaining 
>> special 128-byte images for some ARM64 machines,
>> let’s just align all per-thread data structures to 128 bytes and have just 
>> one ARM image.
>> 
>> [Honnappa] When we run VPP compiled with 128B cache line size on platforms 
>> with 64B cache line size, there is a performance degradation.
>> 
> Yeah, sure, what I’m suggesting here is how to address that perf degradation.
> [Nitin]: Is this proposal for Intel as well? If yes, then I am fine with the 
> proposal, but I think it will decrease performance on 64B architectures with 
> the existing code.

I’m curious to hear why you think so…

