Re: Gecko performance with newer x86_64 levels

2021-02-10 Thread Gabriele Svelto
On 10/02/21 10:21, Henri Sivonen wrote:
> Chrome is moving to SSE3 as the unconditional baseline, which I
> personally find surprising:
> https://docs.google.com/document/d/1QUzL4MGNqX4wiLvukUwBf6FdCL35kCDoEJTm2wMkahw/edit#
> 
> A quick and very unscientific look at Searchfox suggests that
> unconditional SSE3 would mainly eliminate conditional/dynamic dispatch
> on YUV conversion code paths when it comes to explicit SSE3 usage. No
> idea how LLVM would insert SSE3 usage on its own.

SSE3 instructions were very specialized, mostly for video processing, so
I doubt that LLVM can make use of them in regular code. It's unclear to
me why the Chrome devs decided to jump to SSE3 given it should give very
little benefit over SSE2. It would have made more sense if they jumped
to SSSE3 instead, but that would have cut out all the users still on
early Athlon 64 processors.

 Gabriele



___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Gecko performance with newer x86_64 levels

2021-02-10 Thread Henri Sivonen
On Tue, Feb 9, 2021 at 5:35 PM Gian-Carlo Pascutto wrote:
>
> On 3/02/2021 10:51, Henri Sivonen wrote:
> > I came across 
> > https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level/
> > . Previously, when microbenchmarking Rust code that used count_ones()
> > in an inner loop (can't recall what code this was), I noticed a 4x
> > runtime speedup when compiling for target_cpu=nehalem and running on a
> > much later CPU.
>
> That's an extreme edge case though.

It is an extreme edge case, but it's also a case where run-time
dispatch doesn't make sense. The interesting question is how much cases
like this, plus LLVM using newer instructions on its own, would add up
across the code base.

> > I'm wondering:
> >
> > Have we done benchmark comparisons with libxul compiled for the
> > newly-defined x86_64 levels?
>
> No. Should be easy to do

In that case, it seems worth trying.

> but I don't expect much to come of it. The
> main change (that is broadly applicable, unlike POPCNT) in recent years
> would be AVX. Do we have much floating point code in critical paths? I
> was wondering about the JS engine's usage of double for value storage -
> but it's what comes out of the JIT that matters, right?

AVX is much more recent than most of what became available after SSE2
(SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT), and SSE2 is our current baseline.

Chrome is moving to SSE3 as the unconditional baseline, which I
personally find surprising:
https://docs.google.com/document/d/1QUzL4MGNqX4wiLvukUwBf6FdCL35kCDoEJTm2wMkahw/edit#

A quick and very unscientific look at Searchfox suggests that
unconditional SSE3 would mainly eliminate conditional/dynamic dispatch
on YUV conversion code paths when it comes to explicit SSE3 usage. No
idea how LLVM would insert SSE3 usage on its own.

> Media codecs don't count - they should detect at runtime. Same applies
> to crypto code, that - I really hope - would be using runtime detection
> for their SIMD implementations or even hardware AES/SHA routines.
>
> > For macOS and Android, do we actively track the baseline CPU age that
> > Firefox-compatible OS versions run on and adjust the compiler options
> > accordingly when we drop compatibility for older OS versions?
>
> Android only recently added 64-bit builds, and 32-bit would be limited
> to ARMv7-A. There used to be people on non-NEON devices, but those are
> probably gone by now. Google says "For NDK r21 and newer Neon is enabled
> by default for all API levels." - note that should be the NDK used for
> 64-bit builds.
>
> So it's possible Android could now assume NEON even on 32-bit, if it
> isn't already. Most of the code that cares (i.e. media) will already be
> doing runtime detection though.

I meant tracking baseline CPU age on the x86/x86_64 Android side. We
have required NEON on Android ARMv7 for quite a while already.

> For macOS, Apple Silicon is a hard break. For macOS on x86, I guess AVX
> is also a breaking point. There was an open question if any non-AVX
> hardware is still supported on Big Sur because Rosetta doesn't support
> AVX code, but given that we support (much) older macOS releases I don't
> think we can assume AVX presence regardless. We support back to macOS
> 10.12, which runs on "MacBook Late 2009", which was a Core 2 Duo. Guess
> we could assume SSSE3 but nothing more.

That's older than I expected, but it still seems worthwhile to make
our compiler settings for Mac reflect that if they don't already.
Also, doesn't the whole Core 2 Duo family have SSE 4.1?


--
Henri Sivonen
hsivo...@mozilla.com


Re: Gecko performance with newer x86_64 levels

2021-02-09 Thread Gian-Carlo Pascutto
On 3/02/2021 10:51, Henri Sivonen wrote:
> I came across 
> https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level/
> . Previously, when microbenchmarking Rust code that used count_ones()
> in an inner loop (can't recall what code this was), I noticed a 4x
> runtime speedup when compiling for target_cpu=nehalem and running on a
> much later CPU.

That's an extreme edge case though. count_ones() literally maps to POPCNT.
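
To make the POPCNT point concrete, here is a minimal sketch (not the code from the original microbenchmark, which nobody in the thread could recall). The function name `total_set_bits` is made up for illustration. With the default x86_64 target, which only assumes SSE2, `count_ones()` has to be lowered to a bit-manipulation sequence; built with `-C target-cpu=nehalem`, the same source can be lowered to a POPCNT instruction per word.

```rust
/// Counts the set bits in every word of `words`.
///
/// The source is identical for every target; only the compiler flags decide
/// whether `count_ones()` becomes a software bit-counting sequence (SSE2-only
/// baseline) or a hardware POPCNT (e.g. `-C target-cpu=nehalem`).
pub fn total_set_bits(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}
```

Comparing the output of `rustc -O --emit asm` with and without `-C target-cpu=nehalem` should show the difference in the inner loop.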

> I'm wondering:
> 
> Have we done benchmark comparisons with libxul compiled for the
> newly-defined x86_64 levels?

No. Should be easy to do, but I don't expect much to come of it. The
main change (that is broadly applicable, unlike POPCNT) in recent years
would be AVX. Do we have much floating point code in critical paths? I
was wondering about the JS engine's usage of double for value storage -
but it's what comes out of the JIT that matters, right?

Media codecs don't count - they should detect at runtime. Same applies
to crypto code, that - I really hope - would be using runtime detection
for their SIMD implementations or even hardware AES/SHA routines.
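
For readers unfamiliar with the pattern, run-time detection in Rust usually looks roughly like the sketch below. This is an illustration only, not code from any Gecko codec or crypto library: the function names are hypothetical, the byte-summing workload is a stand-in, and AVX2 is an arbitrary choice of feature. The check happens once per call, which is cheap next to a codec kernel but not next to a single instruction like POPCNT.

```rust
/// Portable fallback that runs on any CPU the binary supports.
fn sum_bytes_scalar(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// Same logic, but compiled with AVX2 enabled for this function only, so the
/// compiler is allowed (not required) to use AVX2 when vectorizing the loop.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_bytes_avx2(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// Public entry point: picks an implementation based on what the CPU reports
/// at run time, so the rest of the binary can keep the SSE2 baseline.
pub fn sum_bytes(data: &[u8]) -> u64 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the AVX2 check above succeeded on this CPU.
            return unsafe { sum_bytes_avx2(data) };
        }
    }
    sum_bytes_scalar(data)
}
```

Callers only ever see `sum_bytes`; both paths return the same result, so the dispatch is invisible apart from speed.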

> For macOS and Android, do we actively track the baseline CPU age that
> Firefox-compatible OS versions run on and adjust the compiler options
> accordingly when we drop compatibility for older OS versions?

Android only recently added 64-bit builds, and 32-bit would be limited
to ARMv7-A. There used to be people on non-NEON devices, but those are
probably gone by now. Google says "For NDK r21 and newer Neon is enabled
by default for all API levels." - note that should be the NDK used for
64-bit builds.

So it's possible Android could now assume NEON even on 32-bit, if it
isn't already. Most of the code that cares (i.e. media) will already be
doing runtime detection though.

For macOS, Apple Silicon is a hard break. For macOS on x86, I guess AVX
is also a breaking point. There was an open question if any non-AVX
hardware is still supported on Big Sur because Rosetta doesn't support
AVX code, but given that we support (much) older macOS releases I don't
think we can assume AVX presence regardless. We support back to macOS
10.12, which runs on "MacBook Late 2009", which was a Core 2 Duo. Guess
we could assume SSSE3 but nothing more.

-- 
GCP


Gecko performance with newer x86_64 levels

2021-02-03 Thread Henri Sivonen
I came across 
https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level/
. Previously, when microbenchmarking Rust code that used count_ones()
in an inner loop (can't recall what code this was), I noticed a 4x
runtime speedup when compiling for target_cpu=nehalem and running on a
much later CPU.

I'm wondering:

Have we done benchmark comparisons with libxul compiled for the
newly-defined x86_64 levels?

How feasible would it be, considering CI cost, to compile for multiple
x86_64 levels and make the Windows installer / updater pick the right
one and to use the new glibc-hwcaps mechanism on Linux?
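
As a rough sketch of what "pick the right one" could mean on the installer / updater side: the code below classifies the running CPU into one of the x86_64 levels and maps it to a build name. Everything here is hypothetical (the `X86Level` enum, `detect_level`, `variant_name`), and the feature lists are only a representative subset of what the level definitions require, so a real implementation would need to check the complete lists.

```rust
/// Hypothetical classification of the running CPU; ordered so that
/// `V3 > V2 > Baseline`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum X86Level {
    Baseline,
    V2,
    V3,
}

/// Checks a subset of the features each level requires (assumption: the full
/// definitions list more features than are tested here).
#[cfg(target_arch = "x86_64")]
pub fn detect_level() -> X86Level {
    let v2 = is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2")
        && is_x86_feature_detected!("popcnt");
    let v3 = v2
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt");
    if v3 {
        X86Level::V3
    } else if v2 {
        X86Level::V2
    } else {
        X86Level::Baseline
    }
}

/// On other architectures there is nothing to choose between, so report the
/// lowest level.
#[cfg(not(target_arch = "x86_64"))]
pub fn detect_level() -> X86Level {
    X86Level::Baseline
}

/// Maps a level to the name of the build an installer would fetch. The
/// strings follow the level names used in the Red Hat post.
pub fn variant_name(level: X86Level) -> &'static str {
    match level {
        X86Level::Baseline => "x86-64",
        X86Level::V2 => "x86-64-v2",
        X86Level::V3 => "x86-64-v3",
    }
}
```

An updater could call `variant_name(detect_level())` once and fall back to the baseline build whenever a higher-level build is not published.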

For macOS and Android, do we actively track the baseline CPU age that
Firefox-compatible OS versions run on and adjust the compiler options
accordingly when we drop compatibility for older OS versions?

-- 
Henri Sivonen
hsivo...@mozilla.com