On 25/11/2024 2:28 pm, Jan Beulich wrote:
> Move the function to its own assembly file. Having it in C just for the
> entire body to be an asm() isn't really helpful. Then have two flavors:
> A "basic" version using qword steps for the bulk of the operation, and an
> ERMS version for modern hardware, to be substituted in via alternatives
> patching.
>
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

This is far nicer than previous versions with nested alternatives.

> ---
> We may want to consider branching over the REP STOSQ as well, if the
> number of qwords turns out to be zero.

Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
ERMS, passing 0 into any REP instruction is expensive.

I wonder how often we memset with a size less than 8.

> We may also want to consider using non-REP STOS{L,W,B} for the tail.

Probably, yes.  We use this form in non-ERMS cases, where we're advised
to stay away from STOSB entirely.

Interestingly, Linux doesn't have a STOSQ case at all.  Or rather, it
was deleted by Linus in 20f3337d350c last year.  It was also identified
as causing a performance regression. 
https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kk+sb2zsshiffz7qkpldpatkj8v4wu...@mail.gmail.com/T/#u
although the memset() path was not reverted as part of the fix
(47ee3f1dd93bcb eventually).

Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
CPUs.

I expect we probably do want some non-rep forms in here.

Do you have any benchmarks with this series?

> ---
> v3: Re-base.
>
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
>  obj-$(CONFIG_PV) += ioport_emulate.o
>  obj-y += irq.o
>  obj-$(CONFIG_KEXEC) += machine_kexec.o
> +obj-y += memset.o
>  obj-y += mm.o x86_64/mm.o
>  obj-$(CONFIG_HVM) += monitor.o
>  obj-y += mpparse.o
> --- /dev/null
> +++ b/xen/arch/x86/memset.S
> @@ -0,0 +1,30 @@
> +#include <asm/asm_defns.h>
> +
> +.macro memset
> +        and     $7, %edx
> +        shr     $3, %rcx
> +        movzbl  %sil, %esi
> +        mov     $0x0101010101010101, %rax
> +        imul    %rsi, %rax
> +        mov     %rdi, %rsi
> +        rep stosq
> +        or      %edx, %ecx
> +        jz      0f
> +        rep stosb
> +0:
> +        mov     %rsi, %rax

Could you use %r8/9/etc instead of %rsi please?  This is deceptively
close to looking like a bug, and it took me a while to figure out it's
only correct because STOSB only edits %rdi.

Otherwise, I suspect this can go in.  It should be an improvement on
plain REP STOSB on non-ERMS systems, even if there are other
improvements to come.  I specifically wouldn't suggest blocking it until
patch 1 is resolved.

~Andrew

Reply via email to