Re: [PATCH v2 2/3] arm64: memset-arm64: Use simple memset when cache is disabled

2021-08-10 Thread Stefan Roese

On 10.08.21 11:27, Rasmus Villemoes wrote:

On 10/08/2021 09.13, Stefan Roese wrote:

The optimized memset uses the dc opcode, which causes problems when the
cache is disabled. This patch adds a check if the cache is disabled and
uses a very simple memset implementation in this case. Otherwise the
optimized version is used.

Signed-off-by: Stefan Roese 

---

Changes in v2:
- New patch

  arch/arm/lib/memset-arm64.S | 30 ++
  1 file changed, 30 insertions(+)

diff --git a/arch/arm/lib/memset-arm64.S b/arch/arm/lib/memset-arm64.S
index 710f6f582cad..a474dcb53c83 100644
--- a/arch/arm/lib/memset-arm64.S
+++ b/arch/arm/lib/memset-arm64.S
@@ -11,6 +11,7 @@
   *
   */
  
+#include 

  #include "asmdefs.h"
  
  #define dstin	x0

@@ -25,6 +26,35 @@ ENTRY (memset)
PTR_ARG (0)
SIZE_ARG (2)
  
+	/*

+* The optimized memset uses the dc opcode, which causes problems
+* when the cache is disabled. Let's check if the cache is disabled
+* and use a very simple memset implementation in this case. Otherwise
+* jump to the optimized version.
+*/
+   switch_el x6, 3f, 2f, 1f
+3: mrs x6, sctlr_el3
+   b   0f
+2: mrs x6, sctlr_el2
+   b   0f
+1: mrs x6, sctlr_el1
+0:
+   tst x6, #CR_C
+   bne 9f


How costly is this switch_el and access to a control register? For a
"big" memset of several 100s of bytes I'm sure it's a net win
regardless. But smaller memsets are much more common, and no individual
memset would show up in any boot time "profiling", but it is possible
that the extra cost added to those smaller memsets adds up to some
significant penalty.


I can do some further analysis on how costly this additional check is.

Another good test would perhaps be, if you (and any other interested
developers with ARM64 boards) apply this patchset and test booting to
the U-Boot prompt or even better into the OS with and without this
patchset. To check if it has a positive effect.

In my current case, clearing 256MiB of RAM for the NXP MC firmware
memory is roughly 4 times as fast as using the original implementation.
Saving a few 100 milliseconds in boot-time is quite crucial in this
project.


+   /*
+* A very "simple" memset implementation without the use of the
+* dc opcode. Can be run with caches disabled.
+*/
+   mov x3, #0x0
+4: strbw1, [x0, x3]
+   add x3, x3, #0x1
+   cmp x2, x3
+   bne 4b
+   ret
+9:
+


So I can hardly claim to be fluent in aarch64, but AFAICT this does
return the destination pointer as it must (because it leaves x0
untouched throughout). However, it fails (by writing over all of memory)
when the size is 0, since it is essentially a do{}while and not a while{}.


Thanks for catching. I'll add a size = 0 check to the next version.

Thanks,
Stefan


Re: [PATCH v2 2/3] arm64: memset-arm64: Use simple memset when cache is disabled

2021-08-10 Thread Rasmus Villemoes
On 10/08/2021 09.13, Stefan Roese wrote:
> The optimized memset uses the dc opcode, which causes problems when the
> cache is disabled. This patch adds a check if the cache is disabled and
> uses a very simple memset implementation in this case. Otherwise the
> optimized version is used.
> 
> Signed-off-by: Stefan Roese 
> 
> ---
> 
> Changes in v2:
> - New patch
> 
>  arch/arm/lib/memset-arm64.S | 30 ++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/arm/lib/memset-arm64.S b/arch/arm/lib/memset-arm64.S
> index 710f6f582cad..a474dcb53c83 100644
> --- a/arch/arm/lib/memset-arm64.S
> +++ b/arch/arm/lib/memset-arm64.S
> @@ -11,6 +11,7 @@
>   *
>   */
>  
> +#include 
>  #include "asmdefs.h"
>  
>  #define dstinx0
> @@ -25,6 +26,35 @@ ENTRY (memset)
>   PTR_ARG (0)
>   SIZE_ARG (2)
>  
> + /*
> +  * The optimized memset uses the dc opcode, which causes problems
> +  * when the cache is disabled. Let's check if the cache is disabled
> +  * and use a very simple memset implementation in this case. Otherwise
> +  * jump to the optimized version.
> +  */
> + switch_el x6, 3f, 2f, 1f
> +3:   mrs x6, sctlr_el3
> + b   0f
> +2:   mrs x6, sctlr_el2
> + b   0f
> +1:   mrs x6, sctlr_el1
> +0:
> + tst x6, #CR_C
> + bne 9f

How costly is this switch_el and access to a control register? For a
"big" memset of several 100s of bytes I'm sure it's a net win
regardless. But smaller memsets are much more common, and no individual
memset would show up in any boot time "profiling", but it is possible
that the extra cost added to those smaller memsets adds up to some
significant penalty.

> + /*
> +  * A very "simple" memset implementation without the use of the
> +  * dc opcode. Can be run with caches disabled.
> +  */
> + mov x3, #0x0
> +4:   strbw1, [x0, x3]
> + add x3, x3, #0x1
> + cmp x2, x3
> + bne 4b
> + ret
> +9:
> +

So I can hardly claim to be fluent in aarch64, but AFAICT this does
return the destination pointer as it must (because it leaves x0
untouched throughout). However, it fails (by writing over all of memory)
when the size is 0, since it is essentially a do{}while and not a while{}.

Rasmus


[PATCH v2 2/3] arm64: memset-arm64: Use simple memset when cache is disabled

2021-08-10 Thread Stefan Roese
The optimized memset uses the dc opcode, which causes problems when the
cache is disabled. This patch adds a check if the cache is disabled and
uses a very simple memset implementation in this case. Otherwise the
optimized version is used.

Signed-off-by: Stefan Roese 

---

Changes in v2:
- New patch

 arch/arm/lib/memset-arm64.S | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/arm/lib/memset-arm64.S b/arch/arm/lib/memset-arm64.S
index 710f6f582cad..a474dcb53c83 100644
--- a/arch/arm/lib/memset-arm64.S
+++ b/arch/arm/lib/memset-arm64.S
@@ -11,6 +11,7 @@
  *
  */
 
+#include 
 #include "asmdefs.h"
 
 #define dstin  x0
@@ -25,6 +26,35 @@ ENTRY (memset)
PTR_ARG (0)
SIZE_ARG (2)
 
+   /*
+* The optimized memset uses the dc opcode, which causes problems
+* when the cache is disabled. Let's check if the cache is disabled
+* and use a very simple memset implementation in this case. Otherwise
+* jump to the optimized version.
+*/
+   switch_el x6, 3f, 2f, 1f
+3: mrs x6, sctlr_el3
+   b   0f
+2: mrs x6, sctlr_el2
+   b   0f
+1: mrs x6, sctlr_el1
+0:
+   tst x6, #CR_C
+   bne 9f
+
+   /*
+* A very "simple" memset implementation without the use of the
+* dc opcode. Can be run with caches disabled.
+*/
+   mov x3, #0x0
+4: strbw1, [x0, x3]
+   add x3, x3, #0x1
+   cmp x2, x3
+   bne 4b
+   ret
+9:
+
+   /* Here the optimized memset version starts */
dup v0.16B, valw
add dstend, dstin, count
 
-- 
2.32.0