Re: X86-64 uses generic string functions (strlen, strchr, memcmp, ...)

2018-10-03 Thread Ingo Molnar


* Jann Horn wrote:

> Hi!
> 
> I noticed that X86-64 is using the generic string functions from
> lib/string.c for things like strlen(), strchr(), memcmp() and so on.
> Is that an intentional omission, because they're not considered worth
> optimizing, or is this an oversight? The kernel doesn't use string
> functions much, but if you e.g. run readlinkat() in a loop on a
> symlink with a 1000-byte target, roughly 25%-50% of the time is
> spent in strlen(). But that's a microbenchmark that people probably
> don't care about a lot?
> 
> One notable in-kernel user of memcmp() is BPF, which uses it for its
> hash table implementations when walking the linked list of a hash
> bucket. But I don't know whether anyone uses BPF hash tables with keys
> that are sufficiently large to make this noticeable?

One reason we've been resisting this is how hard it is to determine whether a 
micro-optimization truly helps application workloads.

But there's a way:

 - Write a 'perf bench vfs ...' kind of scalability microbenchmark that
   runs in less than 60 seconds, provides stable numeric output, can be
   meaningfully measured via 'perf', etc., and which does multi-threaded
   or multi-tasked, CPU-bound VFS operations intentionally designed
   to hit these string ops.

 - Use this benchmark to demonstrate that the performance of any of the
   string ops matters.

 - Implement nice assembly speedups.

 - If the functions are out of line then add a kernel patching based method
   to run either the generic string function or the assembly version -
   a static-key based approach would be fine I think. This makes the two
   versions runtime switchable.

 - Use the benchmark again to prove that it indeed helped this particular
   workload. It can be a small speedup but has to be a larger signal than the
   "perf stat --null --repeat 10 ..." stddev.
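
The "runtime switchable" step can be illustrated with a rough userspace
analogy. This is only a sketch: the names strlen_generic(), strlen_alt(),
set_alt_strlen() and bench_strlen() are hypothetical, the "optimized"
variant is just an unrolled C loop standing in for real assembly, and a
plain bool is used where the kernel would use a static key
(DEFINE_STATIC_KEY_FALSE() / static_branch_unlikely() from
<linux/jump_label.h>), which patches the branch in place instead of
testing a variable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Byte-at-a-time loop, in the spirit of the generic C version. */
size_t strlen_generic(const char *s)
{
    const char *p = s;

    while (*p)
        p++;
    return (size_t)(p - s);
}

/*
 * Hypothetical stand-in for an optimized variant: a 4x unrolled loop.
 * Each byte is read only after all earlier bytes were found non-zero,
 * so it never reads past the terminating NUL.
 */
size_t strlen_alt(const char *s)
{
    const char *p = s;

    for (;;) {
        if (!p[0]) return (size_t)(p - s);
        if (!p[1]) return (size_t)(p - s) + 1;
        if (!p[2]) return (size_t)(p - s) + 2;
        if (!p[3]) return (size_t)(p - s) + 3;
        p += 4;
    }
}

/* Hypothetical debug knob; in the kernel this would be a static key. */
static bool use_alt_strlen;

void set_alt_strlen(bool on)
{
    use_alt_strlen = on;
}

/* Single entry point whose implementation can be flipped at runtime. */
size_t bench_strlen(const char *s)
{
    return use_alt_strlen ? strlen_alt(s) : strlen_generic(s);
}
```

With a knob like this the same benchmark binary can be run twice, once
per setting, and the two results compared against the measured stddev.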

Then that offers a maintainable way to implement such speedups:

 - The 'perf bench vfs ...' testcase and the kernel-patching debug knobs
   allow others to replicate the results and check other hardware. Does
   the assembly function written on contemporary Intel hardware work
   equally well on AMD hardware? People can help out by running those
   tests.

 - We can go back and check the difference anytime in the future, once
   new CPUs arrive, or a new variant of the benchmark is written, or a
   workload is hurting.

If you do it systematically like that then I'd be *very* interested in
merging both the tooling (benchmarking) and any eventual assembly speedups.

But it's quite some work - much harder than just writing a random assembly
variant and using it instead of the generic version.

Thanks,

Ingo


X86-64 uses generic string functions (strlen, strchr, memcmp, ...)

2018-10-03 Thread Jann Horn
Hi!

I noticed that X86-64 is using the generic string functions from
lib/string.c for things like strlen(), strchr(), memcmp() and so on.
Is that an intentional omission, because they're not considered worth
optimizing, or is this an oversight? The kernel doesn't use string
functions much, but if you e.g. run readlinkat() in a loop on a
symlink with a 1000-byte target, roughly 25%-50% of the time is
spent in strlen(). But that's a microbenchmark that people probably
don't care about a lot?
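
For reference, a microbenchmark along these lines might look roughly like
the following userspace sketch. It is not the program the numbers above
came from; the helper names make_long_symlink() and time_readlinkat() are
made up for illustration, and it only reports wall-clock time per call,
so attributing time to strlen() would still need a profiler such as perf.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Create a symlink at linkpath whose target is target_len bytes of 'a'.
 * The target does not need to exist: readlinkat() only returns the
 * stored link body and never resolves it.
 */
int make_long_symlink(const char *linkpath, size_t target_len)
{
    char target[4096];

    if (target_len == 0 || target_len >= sizeof(target))
        return -1;
    memset(target, 'a', target_len);
    target[target_len] = '\0';
    unlink(linkpath);            /* ignore failure: link may not exist yet */
    return symlink(target, linkpath);
}

/*
 * Call readlinkat() iters times on linkpath and return the average
 * wall-clock cost per call in nanoseconds, or -1.0 if a call fails.
 */
double time_readlinkat(const char *linkpath, long iters)
{
    char buf[4096];
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        if (readlinkat(AT_FDCWD, linkpath, buf, sizeof(buf)) < 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Calling make_long_symlink() with a length of 1000 and then
time_readlinkat() with a few million iterations, while profiling the
process, should be enough to see where the kernel-side time goes.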

One notable in-kernel user of memcmp() is BPF, which uses it for its
hash table implementations when walking the linked list of a hash
bucket. But I don't know whether anyone uses BPF hash tables with keys
that are sufficiently large to make this noticeable?
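
To make the memcmp() pattern concrete, here is a simplified sketch of a
bucket walk. It is not the BPF implementation; struct bucket_elem and
bucket_lookup() are hypothetical names, and a fixed key_size per table is
assumed. The point is only that every element visited in the chain costs
one memcmp() over the full key, so large keys and long chains multiply.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical element of one hash bucket's singly linked list. */
struct bucket_elem {
    struct bucket_elem *next;
    const void *key;     /* points to key_size bytes */
    int value;
};

/*
 * Walk the bucket's list and return the first element whose key matches,
 * or NULL if there is none. One memcmp() is done per element visited.
 */
struct bucket_elem *bucket_lookup(struct bucket_elem *head,
                                  const void *key, size_t key_size)
{
    for (struct bucket_elem *e = head; e; e = e->next) {
        if (memcmp(e->key, key, key_size) == 0)
            return e;
    }
    return NULL;
}
```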

