Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-15 Thread Benoit Ganne (bganne) via lists.fd.io
Looks good to me, thanks!

ben

> -----Original Message-----
> From: vpp-dev@lists.fd.io  On Behalf Of Damjan Marion
> via lists.fd.io
> Sent: mercredi 14 juillet 2021 19:48
> To: vpp-dev 
> Cc: Honnappa Nagarahalli ; Govindarajan
> Mohandoss ; Tianyu Li ;
> Nitin Saxena ; Lijian Zhang 
> Subject: Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

Re: [EXT] Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-15 Thread Nitin Saxena
Hi Damjan,

+1. This proposal aligns completely with our thought process, i.e.:

-  data-structure alignment to a 128B cache line, and a separate macro for 
prefetching which can be 64B or 128B depending on the underlying SoC

It also allows us to ship a single AArch64 binary in distros for any cloud 
deployment. We can provide feedback on the changes regarding any performance 
impact on our SoC.

Thanks,
Nitin

> -----Original Message-----
> From: Damjan Marion 
> Sent: Wednesday, July 14, 2021 11:18 PM
> To: vpp-dev 
> Cc: Honnappa Nagarahalli ; Govindarajan
> Mohandoss ; Tianyu Li
> ; Nitin Saxena ; Lijian Zhang
> 
> Subject: [EXT] Re: [vpp-dev] Prefetches improvement for VPP Arm generic
> image

Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-14 Thread Lijian Zhang
It looks good.
Thanks.
From: Florin Coras 
Subject: Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

+1

Florin



Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-14 Thread Florin Coras
+1

Florin


Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-14 Thread Damjan Marion via lists.fd.io

I spent a bit of time looking at this and came up with what I think is a reasonable 
solution.

First, the 128-byte cacheline is not dead: the recently announced Marvell Octeon 10 
has a 128-byte cacheline.

In the current code, the cacheline size defines both the number of bytes a prefetch 
instruction fetches and the alignment of data structures needed to avoid false 
sharing.

So I think ideally we should have the following:

- on x86:
  - number of bytes a prefetch instruction prefetches set to 64
  - data structures aligned to 64 bytes
  - because x86 has an adjacent-cacheline prefetcher, it may be worth
    investigating whether aligning to 128 bytes brings some value

- on AArch64:
  - number of bytes a prefetch instruction prefetches set to 64 or 128, based on
    the multiarch variant running
  - data structures aligned to 128 bytes, as that value prevents false
    sharing on both 64- and 128-byte cacheline systems

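The alignment half of this can be sketched in a few lines. This is a hypothetical 
per-thread counter struct, not VPP code; the 64/128 selection mirrors the proposal 
above, and CLIB_CACHE_LINE_BYTES is redefined locally so the sketch is 
self-contained:

```c
#include <assert.h>
#include <stddef.h>

/* Proposed alignment constant: 64 on x86, 128 on AArch64, so a structure
 * padded to this size cannot straddle another core's line on either 64- or
 * 128-byte-cacheline systems. */
#if defined (__aarch64__)
#define CLIB_CACHE_LINE_BYTES 128
#else
#define CLIB_CACHE_LINE_BYTES 64
#endif

/* Hypothetical per-thread counters: each instance owns its own cache
 * line(s), so writers on different cores do not false-share. */
typedef struct
{
  unsigned long packets;
  unsigned long bytes;
} __attribute__ ((aligned (CLIB_CACHE_LINE_BYTES))) thread_counter_t;
```

Because the `aligned` attribute also pads `sizeof`, an array of these structures 
places each element on its own line boundary.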
The main problem is the abuse of the CLIB_PREFETCH() macro in our codebase.
The original idea was good: somebody wanted to provide a macro which transparently 
emits 1-4 prefetch instructions based on the data size, recognising that there may 
be systems with different cacheline sizes.

Like:
  CLIB_PREFETCH (p, sizeof (ip6_header_t), LOAD);

But reality is, most of the time we have:
  CLIB_PREFETCH (p, CLIB_CACHE_LINE_BYTES, LOAD);

Here it is assumed that the cacheline size is 64, which just wastes resources on a 
system with a 128-byte cacheline.

Also, most places in our codebase are perfectly fine with whatever the cacheline 
size is, so I’m thinking about the following:

1. set CLIB_CACHE_LINE_BYTES to 64 on x86 and 128 on ARM; that will make sure 
false sharing does not happen

2. introduce CLIB_CACHE_PREFETCH_BYTES, which can be set to a different value for 
each multiarch variant (64 for N1, 128 for ThunderX2)

3. modify the CLIB_PREFETCH macro to use CLIB_CACHE_PREFETCH_BYTES to emit the 
proper number of prefetch instructions for cases where the data size is specified

4. take the stub and replace all `CLIB_PREFETCH (p, CLIB_CACHE_LINE_BYTES, 
LOAD);` with `clib_prefetch_load (p);`.
   There may be exceptions, but those lines typically mean: “I want to prefetch a 
few (<=64) bytes at this address and I really don’t care what the cache line 
size is.”

5. analyse the remaining few cases where CLIB_PREFETCH() is used with a size 
specified via CLIB_CACHE_LINE_BYTES.

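Items 2-3 could look roughly like the sketch below. CLIB_CACHE_PREFETCH_BYTES is 
the name proposed in this mail, not an existing symbol; the helper returns a count 
purely for illustration, where the real macro would only emit the prefetches:

```c
#include <assert.h>
#include <stddef.h>

/* Proposed per-multiarch-variant prefetch granularity, decoupled from the
 * alignment constant CLIB_CACHE_LINE_BYTES (64 for N1, 128 for ThunderX2).
 * Hardcoded to 64 here for the sketch. */
#define CLIB_CACHE_PREFETCH_BYTES 64

static inline void
clib_prefetch_load (void const *p)
{
  /* prefetch for read, high temporal locality */
  __builtin_prefetch (p, 0, 3);
}

/* Emit one prefetch per CLIB_CACHE_PREFETCH_BYTES chunk covering `size'
 * bytes at `p'; returns the number of prefetches emitted (for illustration). */
static inline int
clib_prefetch_load_sized (void const *p, size_t size)
{
  int n = 0;
  for (size_t off = 0; off < size; off += CLIB_CACHE_PREFETCH_BYTES, n++)
    clib_prefetch_load ((char const *) p + off);
  return n;
}
```

With a 40-byte ip6_header_t this emits a single prefetch; a 192-byte request would 
emit three on a 64-byte variant and, were the constant set to 128, two on a 
128-byte one.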
Thoughts?

— 
Damjan

Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-05 Thread Lijian Zhang
Thanks, Damjan, for your comments. Some replies inline.

Hi Lijian,

It will be good to know whether a 128-byte cacheline is something Arm platforms 
will be using in the future or whether it is just historical.
[Lijian] Existing ThunderX1 and OcteonTX2 CPUs have 128-byte cache lines. To my 
knowledge, there may be more CPUs with 128-byte cache lines in the future.

The cacheline size problem is not just about prefetching; an even bigger issue is 
false sharing, so we need to address both.
[Lijian] Yes, there may be false-sharing issues when running a VPP image with the 
64B definition on 128B cache-line CPUs. We will do some scalability testing with 
that case and check the multi-core performance.

Probably the best solution is to have two VPP images, one for a 128- and one for a 
64-byte cacheline size.
[Lijian] For a natively built image, that’s fine. But I’m not sure it’s possible 
for cloud binaries installed via “apt-get install”.

Going across the whole codebase and replacing prefetch macros is something we 
should definitely avoid.
[Lijian] I understand your concerns about a large-scope replacement. My concern is 
that when CLIB_PREFETCH() is used to prefetch packet content into cache, as in the 
example below, the cache line size (CLIB_CACHE_LINE_BYTES) always seems to be 
assumed to be 64 bytes.
CLIB_PREFETCH (p2->data, 3 * CLIB_CACHE_LINE_BYTES, LOAD);

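To make the concern concrete with plain arithmetic (not VPP code): the call site 
above was written meaning "192 bytes of packet data", but the span the expression 
requests depends on how the constant was set at build time:

```c
#include <assert.h>

/* Bytes requested by `3 * CLIB_CACHE_LINE_BYTES' under a given build-time
 * definition of the constant. The call site assumed the value 64. */
static inline unsigned
requested_span (unsigned clib_cache_line_bytes)
{
  return 3 * clib_cache_line_bytes;
}
```

So the same source line asks for 192 bytes in a 64B build and 384 bytes in a 128B 
build, touching twice as much memory as the author intended.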
—
Damjan


On 05.07.2021., at 07:28, Lijian Zhang <lijian.zh...@arm.com> wrote:

Hi Damjan,
I committed several patches to address some issues around cache-line 
definitions in VPP.

Patch [1.1] resolves the build error [2] on 64-byte cache line Arm CPUs, e.g., 
ThunderX2 and NeoverseN1, caused by the commit 
(https://gerrit.fd.io/r/c/vpp/+/32996, build: remove unused files and sections).
It also supports building the Arm generic image (with the command “make 
build-release”) with the 128-byte cache line definition, and building a native 
image with the 64-byte cache line definition on some Arm CPUs, e.g., ThunderX2 and 
NeoverseN1 (with the command “make build-release TARGET_PLATFORM=native”).

Patch [1.5] changes the default cache line definition in the Arm generic image 
from 128 bytes to 64 bytes.
Setting the cache line definition to 128 bytes for the Arm generic image is 
required for ThunderX1 (with a 128-byte physical cache line), which is also the 
build machine in the FD.io lab. I’m thinking of setting the 64-byte cache line 
definition in VPP for the Arm image, which will affect ThunderX1 and OcteonTX2 
CPUs, so it requires confirmation from Marvell.

Arm architecture CPUs have 128-byte or 64-byte physical cache line designs. So no 
matter whether the cache line definition in the VPP source code is 128 bytes or 64 
bytes, the prefetch functions in the generic image will not work properly on all 
Arm CPUs. Patches [1.2], [1.3], and [1.4] resolve this issue.

Take, for example, running the Arm generic image (cache-line size defined as 128B 
in the Makefile for all Arm architectures) on 64-byte cache-line CPUs, e.g., 
Neoverse-N1, Ampere Altra, ThunderX2.

[3] shows the prefetch macro definitions in VPP. Using CLIB_PREFETCH(), you can 
prefetch data residing in multiple cache lines.
[4] shows some usage examples of the prefetch macros in VPP. When running the Arm 
generic image (128B cache-line definition) on 64B cache-line CPUs (N1SDP for 
example), cases 4.2, 4.3, and 4.4 have issues.

  *   For 4.2, the input for the size parameter is 68. On N1SDP with a 64B 
cache-line size, two prefetch instructions should be executed, but because 68 is 
less than CLIB_CACHE_LINE_BYTES (defined as 128 bytes in VPP), only the first 
prefetch instruction is executed.
  *   For 4.3, if sizeof (ip0[0]) equals 68, or any other value larger than 64B, 
there is the same issue as in 4.2.
  *   For 4.4, the code is trying to prefetch the first 128B of packet content. It 
assumes CLIB_CACHE_LINE_BYTES is always 64B. In the Arm generic image, the input 
for the size parameter is 256B, which executes prefetches on unexpected cache 
lines (expected prefetches on 64B-0 and 64B-1, but actually on 64B-0 and 64B-2).
Packet content: [64B-0][64B-1][64B-2][64B-3]

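The 4.4 arithmetic can be sketched as follows, under the stated assumptions (one 
prefetch instruction per CLIB_CACHE_LINE_BYTES step, each pulling in exactly one 
hardware line); the helper returns a bitmask of which 64B lines of the packet 
actually get touched:

```c
#include <assert.h>

/* Which 64-byte lines get prefetched when the macro steps its addresses
 * `macro_line' bytes apart but the hardware fetches `hw_line' bytes per
 * prefetch. Bit i set means line 64B-i was touched. */
static inline unsigned
lines_touched (unsigned span, unsigned macro_line, unsigned hw_line)
{
  unsigned mask = 0;
  for (unsigned off = 0; off < span; off += macro_line)
    mask |= 1u << (off / hw_line);
  return mask;
}
```

On a 64B-line CPU, the generic image’s 2*128 = 256B request touches 64B-0 and 
64B-2 (mask 0b101) instead of the intended 64B-0 and 64B-1 (mask 0b011).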
Our proposal is to introduce a macro CLIB_N_CACHELINE_BYTES via the VPP multi-arch 
feature (see patch [1.2]) to reflect the runtime CPU cache-line size in the Arm 
generic image, so that the prefetch instructions can be executed correctly.
Then for 4.4, we will need to change the size parameter from 
2*CLIB_CACHE_LINE_BYTES to 128B, to reflect the actual intention.

Some additional macros [1.3] can be added so users can prefetch based on a number 
of cache lines, in addition to a number of bytes.

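A by-lines helper could be sketched as below. The names follow this proposal 
rather than existing VPP code, and the per-variant CLIB_N_CACHELINE_BYTES is 
modeled as a plain variable so the idea can be exercised; in a real multiarch 
build each variant would carry its own constant:

```c
#include <assert.h>

/* Stand-in for the proposed per-multiarch-variant CLIB_N_CACHELINE_BYTES:
 * each variant would compile with the real line size of the CPUs it targets. */
static unsigned clib_n_cacheline_bytes = 64;

static inline void
clib_prefetch_load (void const *p)
{
  __builtin_prefetch (p, 0, 3);
}

/* Prefetch `n' whole cache lines starting at `p', independent of the
 * compile-time CLIB_CACHE_LINE_BYTES; returns the bytes covered, for
 * illustration only. */
static inline unsigned
clib_prefetch_load_lines (void const *p, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    clib_prefetch_load ((char const *) p + i * clib_n_cacheline_bytes);
  return n * clib_n_cacheline_bytes;
}
```

A call site that means "the first two lines of the packet" then stays correct on 
both 64B and 128B hardware, because the stride tracks the running CPU.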
Could you please advise on the issue and the proposal?

[1]. Patches
[1.1] build: support 128B/64B cache line size in Arm image, 
https://gerrit.fd.io/r/c/vpp/+/32968/2
[1.2] vppinfra: refactor prefetch macro, https://gerrit.fd.io/r/c/vpp/+/32969/3
[1.3] vppinfra: fix functions to prefetch single line, 
https://gerrit.fd.io/r/c/vpp/+/32970/2
[1.4] misc: correct prefetch macro usage, https://gerrit.fd.io/r/c/vpp/+/32971/3
[1.5] build: set 64B cache line size in Arm image, 

Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-05 Thread Florin Coras
+1

Florin

> On Jul 5, 2021, at 4:27 AM, Dave Barach  wrote:
> 
> “Going across the whole codebase and replacing prefetch macros is something 
> we should definitely avoid.”
>
> +1, for sure... FWIW... Dave
>
> From: vpp-dev@lists.fd.io  On Behalf Of Damjan Marion via lists.fd.io
> Sent: Monday, July 5, 2021 4:50 AM
> To: Lijian Zhang 
> Cc: vpp-dev ; nd ; Honnappa Nagarahalli ; Govindarajan 
> Mohandoss ; Tianyu Li ; Nitin Saxena 
> Subject: Re: [vpp-dev] Prefetches improvement for VPP Arm generic image
>
> Hi Lijian,
>
> It will be good to know whether a 128-byte cacheline is something ARM platforms 
> will be using in the future or whether it is just historical.
>
> The cacheline size problem is not just about prefetching; an even bigger issue 
> is false sharing, so we need to address both.
> Probably the best solution is to have two VPP images, one for a 128- and one 
> for a 64-byte cacheline size.
>
> Going across the whole codebase and replacing prefetch macros is something we 
> should definitely avoid.
>
> — 
> Damjan
> correctly.
> Then for 4.4, we will need to modify the parameter for size, from 
> 2*CLIB_CACHE_LINE_BYTES to 128B, to reflect the actual intention.
>
> Some additional macros [1.3] can be added for users to do prefetch based on 
> number of cach

Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-05 Thread Dave Barach
“Going across the whole codebase and replacing prefetch macros is something we 
should definitely avoid.”

 

+1, for sure... FWIW... Dave

 

Re: [vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-05 Thread Damjan Marion via lists.fd.io
Hi Lijian,

It would be good to know whether the 128-byte cacheline is something Arm platforms 
will be using in the future, or whether it is just historical.

The cacheline size problem is not just about prefetching; an even bigger issue is 
false sharing, so we need to address both.
Probably the best solution is to have two VPP images, one for 128- and one for 
64-byte cacheline size.
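To illustrate the false-sharing half of the problem, here is a generic sketch (not VPP's actual alignment macros, just the idea): padding per-thread state to the worst-case cache line size keeps writers on different cores from invalidating each other's lines. Aligning to 128 bytes is safe on both 64-byte and 128-byte cacheline systems, at the cost of some memory on the 64-byte ones.

'''c
#include <stdalign.h>
#include <stdio.h>

/* Illustrative sketch (not VPP code): each per-thread counter gets its
 * own 128-byte slot, so two cores updating adjacent array elements never
 * share a cache line, on either 64B or 128B cacheline hardware. */
typedef struct
{
  alignas (128) unsigned long counter;
} per_thread_counter_t;

int
main (void)
{
  per_thread_counter_t c[2];
  (void) c;
  /* adjacent array elements now start 128 bytes apart */
  printf ("%zu\n", sizeof (per_thread_counter_t));
  return 0;
}
'''

The trade-off Damjan describes is exactly this: pick 64-byte alignment and the struct false-shares on 128-byte parts; pick 128-byte and it wastes space on 64-byte parts.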

Going across the whole codebase and replacing prefetch macros is something we 
should definitely avoid.

— 
Damjan

[vpp-dev] Prefetches improvement for VPP Arm generic image

2021-07-04 Thread Lijian Zhang
Hi Damjan,
I committed several patches to address some issues around cache-line 
definitions in VPP.

Patch [1.1] resolves the build error [2] on 64-byte cache line Arm CPUs, e.g., 
ThunderX2 and Neoverse N1, caused by the commit 
(https://gerrit.fd.io/r/c/vpp/+/32996, build: remove unused files and sections).
It also supports building the Arm generic image (with "make build-release") with a 
128-byte cache line definition, and building a native image with a 64-byte cache 
line definition on some Arm CPUs, e.g., ThunderX2 and Neoverse N1 (with "make 
build-release TARGET_PLATFORM=native").

Patch [1.5] changes the default cache line definition in the Arm generic image 
from 128 bytes to 64 bytes.
The 128-byte cache line definition in the Arm generic image is required for 
ThunderX1 (which has a 128-byte physical cache line), which is also the build 
machine in the FD.io lab. I'm considering setting a 64-byte cache line definition 
in VPP for the Arm image, which would affect ThunderX1 and OcteonTX2 CPUs, so it 
requires confirmation from Marvell.

Arm CPUs have either 64-byte or 128-byte physical cache lines, so whichever of the 
two values the VPP source code defines, the prefetch functions in a single generic 
image will not work properly on all Arm CPUs. Patches [1.2], [1.3] and [1.4] 
resolve this issue.

Consider, for example, running the Arm generic image (cache-line-size defined as 
128B in the Makefile for all Arm architectures) on 64B cache-line-size CPUs, e.g., 
Neoverse N1, Ampere Altra, ThunderX2.

[3] shows the prefetch macro definitions in VPP. Using CLIB_PREFETCH(), you can 
prefetch data residing in multiple cache lines.
[4] shows some usage examples of the prefetch macros in VPP. When running the Arm 
generic image (128B cache-line-size definition) on 64B cache-line CPUs (N1SDP for 
example), 4.2, 4.3 and 4.4 have issues.

  *   For 4.2, the size argument is 68. On N1SDP with a 64B cache line, two 
prefetch instructions should be executed, but because 68 is less than 
CLIB_CACHE_LINE_BYTES (defined as 128 in VPP), only the first prefetch 
instruction is executed.
  *   For 4.3, if sizeof (ip0[0]) equals 68, or any other value larger than 64B, 
the same issue as 4.2 occurs.
  *   For 4.4, the code is trying to prefetch the first 128B of packet content, 
assuming CLIB_CACHE_LINE_BYTES is always 64B. In the Arm generic image the size 
argument expands to 256B, which executes prefetches on unexpected cache lines 
(the expected prefetches are on 64B-0 and 64B-1, but they actually land on 64B-0 
and 64B-2).
Packet content: [64B-0][64B-1][64B-2][64B-3]
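The miscount in cases 4.2 and 4.4 can be reproduced with a small standalone sketch (not VPP code) that models the line-selection arithmetic of the CLIB_PREFETCH macro in [3]: one prefetch per compile-time cache line step, compared against what the running hardware actually needs.

'''c
#include <stdio.h>

/* Sketch: model CLIB_PREFETCH's line-selection arithmetic.  `line_def`
 * plays the role of the compile-time CLIB_CACHE_LINE_BYTES baked into
 * the image; `hw_line` is the actual hardware cache line size. */

/* Prefetch instructions CLIB_PREFETCH emits for `size` bytes: one per
 * line_def-sized step n = 0..3 while (size) > n*line_def. */
static int
prefetches_issued (int size, int line_def)
{
  int n, count = 0;
  for (n = 0; n < 4; n++)
    if (size > n * line_def)
      count++;
  return count;
}

/* Prefetches the hardware actually needs to cover `size` bytes. */
static int
prefetches_needed (int size, int hw_line)
{
  return (size + hw_line - 1) / hw_line;
}

int
main (void)
{
  /* Case 4.2: 68 bytes, image built with 128B lines, running on 64B HW. */
  printf ("issued=%d needed=%d\n",
	  prefetches_issued (68, 128), prefetches_needed (68, 64));
  return 0;
}
'''

With a 128B definition on 64B hardware, 68 bytes gets a single prefetch where two are needed, matching the 4.2 description above; building with a 64B definition restores the expected count.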

Our proposal is to introduce a macro, CLIB_N_CACHELINE_BYTES, via the VPP 
multi-arch feature (see patch [1.2]) to reflect the runtime CPU cache-line size 
in the Arm generic image, so that the prefetch instructions are executed 
correctly.
Then, for 4.4, we would change the size argument from 2*CLIB_CACHE_LINE_BYTES to 
128B, to reflect the actual intention.
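A minimal sketch of the idea behind patch [1.2] (macro and helper names here are illustrative, not the actual patch): each multiarch variant of the generic image is compiled with its own CLIB_N_CACHELINE_BYTES, so the prefetch stride matches the CPU the variant runs on, while structure alignment stays at the worst-case 128 bytes.

'''c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch, not the actual VPP implementation.  A multiarch
 * variant targeting 64B-line CPUs (e.g. Neoverse N1) would be built
 * with CLIB_N_CACHELINE_BYTES = 64; a 128B variant with 128. */
#ifndef CLIB_N_CACHELINE_BYTES
#define CLIB_N_CACHELINE_BYTES 64
#endif

#define MY_PREFETCH_LOAD(addr, size)                                  \
  do                                                                  \
    {                                                                 \
      char *_a = (char *) (addr);                                     \
      size_t _i;                                                      \
      for (_i = 0; _i < (size); _i += CLIB_N_CACHELINE_BYTES)         \
	__builtin_prefetch (_a + _i, /* read */ 0, /* locality */ 3); \
    }                                                                 \
  while (0)

/* Number of prefetch instructions the macro above issues for `size`. */
static size_t
lines_touched (size_t size)
{
  return (size + CLIB_N_CACHELINE_BYTES - 1) / CLIB_N_CACHELINE_BYTES;
}

int
main (void)
{
  char pkt[256];
  MY_PREFETCH_LOAD (pkt, 128); /* first 128B: exactly 2 lines on 64B HW */
  printf ("%zu\n", lines_touched (128));
  return 0;
}
'''

With the stride following the variant, a size of 128B covers exactly the intended two lines on 64B hardware, which is the fix for case 4.4 described above.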

Some additional macros [1.3] can be added so that users can prefetch based on a 
number of cache lines, rather than a number of bytes.
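Helpers along the lines of patch [1.3] might look like the following sketch (names illustrative): the caller states "prefetch this line" directly, so the byte-count-versus-cacheline-size question never arises at the call site.

'''c
#include <stdio.h>

/* Illustrative per-line helpers (not the actual patch): prefetch
 * exactly one cache line for load or for store. */
static inline void
my_prefetch_line_load (void *p)
{
  __builtin_prefetch (p, /* read */ 0, /* locality */ 3);
}

static inline void
my_prefetch_line_store (void *p)
{
  __builtin_prefetch (p, /* write */ 1, /* locality */ 3);
}

int
main (void)
{
  char buf[256];
  my_prefetch_line_load (buf);	      /* hints only; no functional effect */
  my_prefetch_line_store (buf + 64);
  printf ("ok\n");
  return 0;
}
'''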

Could you please comment on the issue and the proposal?

[1]. Patches
[1.1] build: support 128B/64B cache line size in Arm image, 
https://gerrit.fd.io/r/c/vpp/+/32968/2
[1.2] vppinfra: refactor prefetch macro, https://gerrit.fd.io/r/c/vpp/+/32969/3
[1.3] vppinfra: fix functions to prefetch single line, 
https://gerrit.fd.io/r/c/vpp/+/32970/2
[1.4] misc: correct prefetch macro usage, https://gerrit.fd.io/r/c/vpp/+/32971/3
[1.5] build: set 64B cache line size in Arm image, 
https://gerrit.fd.io/r/c/vpp/+/32972/2

[2]. Error message
'''
src/plugins/dpdk/device/init.c:1916:3: error: static_assert failed due to
requirement '128 == 1 << 6' "DPDK RTE CACHE LINE SIZE does not match with
1<<CLIB_LOG2_CACHE_LINE_BYTES"
  STATIC_ASSERT (RTE_CACHE_LINE_SIZE == 1 << CLIB_LOG2_CACHE_LINE_BYTES,
  ^~
/home/lijian/tasks/plsremove/src/vppinfra/error_bootstrap.h:111:34: note:
expanded from macro 'STATIC_ASSERT'
'''

[3]. Prefetch macro definitions
'''
#define _CLIB_PREFETCH(n,size,type)                            \
  if ((size) > (n)*CLIB_CACHE_LINE_BYTES)                      \
    __builtin_prefetch (_addr + (n)*CLIB_CACHE_LINE_BYTES,     \
                        CLIB_PREFETCH_##type,                  \
                        /* locality */ 3);

#define CLIB_PREFETCH(addr,size,type)           \
do {                                            \
  void * _addr = (addr);                        \
                                                \
  ASSERT ((size) <= 4*CLIB_CACHE_LINE_BYTES);   \
  _CLIB_PREFETCH (0, size, type);               \
  _CLIB_PREFETCH (1, size, type);               \
  _CLIB_PREFETCH (2, size, type);               \
  _CLIB_PREFETCH (3, size, type);               \
} while (0)
'''

[4]
4.1 CLIB_PREFETCH (p2->pre_data, 32, STORE);
4.2 CLIB_PREFETCH (p2->pre_data, 68, STORE);
4.3 CLIB_PREFETCH (b[4]->data, sizeof (ip0[0]), LOAD);
4.4 CLIB_PREFETCH (p2->data, 2*CLIB_CACHE_LINE_BYTES, LOAD);

Thanks.
