Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-07 Thread Aurelien Jarno
On 2020-05-07 13:04, Noah Meyerhans wrote:
> On Wed, May 06, 2020 at 04:15:09PM +0200, Aurelien Jarno wrote:
> > > >One solution for this would be to ship the optimized library in the same
> > > >package as the default library. Now this is not acceptable for embedded
> > > >systems as they might not need that library and can't remove it. This is
> > > >even more problematic if we need to add more optimized libraries. I guess
> > > >this might be the case for arm64 as there are many new extensions in the
> > > >pipe.
> > > 
> > > ACK. It's a problem to ship the different things in separate
> > > packages. If it's really a problem for smaller systems to have all the
> > > variants because of size, is there maybe another way to do things? How
> > > about keeping the existing libc and have an extra package
> > > ("libc-optimised") with all the optimised versions *and* the basic
> > > version, and have it provide/replace/conflict libc6?
> > > 
> > > (/me prepares to be embarrassed as you point out the obvious flaw I'm
> > > missing...)
> > 
> > I guess that the provide/replace/conflict libc6 will just prevent
> > installation of foreign libc6 packages, basically making this optimized
> > package useless in the multiarch context.
> > 
> > OTOH, what is the drawback of having GCC default to -moutline-atomics?
> > It will improve performance in many more packages than just glibc, and
> > is way easier to implement overall. It also means users have to do
> > nothing to get the additional performance.
> 
> For the current issue, defaulting to -moutline-atomics might be a sane
> approach.  As you said earlier, though, it seems that there are many new
> extensions in the pipe for ARM.  There may not be an equivalent solution
> for all of them, and even if there is, at some point the runtime
> overhead of all this conditional code is going to add up to something
> meaningful.

If we are talking about future extensions, another option for some of
them is to use ifunc. That is how the various SSE and AVX extensions are
supported on x86, and how NEON is supported on armv7.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-07 Thread Noah Meyerhans
On Wed, May 06, 2020 at 04:15:09PM +0200, Aurelien Jarno wrote:
> > >One solution for this would be to ship the optimized library in the same
> > >package as the default library. Now this is not acceptable for embedded
> > >systems as they might not need that library and can't remove it. This is
> > >even more problematic if we need to add more optimized libraries. I guess
> > >this might be the case for arm64 as there are many new extensions in the
> > >pipe.
> > 
> > ACK. It's a problem to ship the different things in separate
> > packages. If it's really a problem for smaller systems to have all the
> > variants because of size, is there maybe another way to do things? How
> > about keeping the existing libc and have an extra package
> > ("libc-optimised") with all the optimised versions *and* the basic
> > version, and have it provide/replace/conflict libc6?
> > 
> > > (/me prepares to be embarrassed as you point out the obvious flaw I'm
> > missing...)
> 
> I guess that the provide/replace/conflict libc6 will just prevent
> installation of foreign libc6 packages, basically making this optimized
> package useless in the multiarch context.
> 
> OTOH, what is the drawback of having GCC default to -moutline-atomics?
> It will improve performance in many more packages than just glibc, and
> is way easier to implement overall. It also means users have to do
> nothing to get the additional performance.

For the current issue, defaulting to -moutline-atomics might be a sane
approach.  As you said earlier, though, it seems that there are many new
extensions in the pipe for ARM.  There may not be an equivalent solution
for all of them, and even if there is, at some point the runtime
overhead of all this conditional code is going to add up to something
meaningful.

noah



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-07 Thread Adrian Bunk
On Wed, May 06, 2020 at 01:56:24PM +0100, Steve McIntyre wrote:
>...
> On Sun, May 03, 2020 at 11:53:35PM +0200, Aurelien Jarno wrote:
> >
> >One solution for this would be to ship the optimized library in the same
> >package as the default library. Now this is not acceptable for embedded
> >systems as they might not need that library and can't remove it. This is
> >even more problematic if we need to add more optimized libraries. I guess
> >this might be the case for arm64 as there are many new extensions in the
> >pipe.
> 
> ACK. It's a problem to ship the different things in separate
> packages. If it's really a problem for smaller systems to have all the
> variants because of size, is there maybe another way to do things? How
> about keeping the existing libc and have an extra package
> ("libc-optimised") with all the optimised versions *and* the basic
> version, and have it provide/replace/conflict libc6?
>...

What Noah mentioned for a similar proposal also applies here:

On Mon, May 04, 2020 at 02:45:41PM -0400, Noah Meyerhans wrote:
>...
> I don't know how well dpkg would cope with transitioning
> between providers, which seems like the riskiest side of this kind of
> thing.

I'd guess you could make this an installation-only change with
a few hacks here and there, but once you think it through,
with all the follow-up hacks required, it doesn't sound like
a good idea.

cu
Adrian



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-06 Thread Aurelien Jarno
On 2020-05-06 13:56, Steve McIntyre wrote:
> Hey Aurelien,
> 
> On Sun, May 03, 2020 at 11:53:35PM +0200, Aurelien Jarno wrote:
> >
> >One solution for this would be to ship the optimized library in the same
> >package as the default library. Now this is not acceptable for embedded
> >systems as they might not need that library and can't remove it. This is
> >even more problematic if we need to add more optimized libraries. I guess
> >this might be the case for arm64 as there are many new extensions in the
> >pipe.
> 
> ACK. It's a problem to ship the different things in separate
> packages. If it's really a problem for smaller systems to have all the
> variants because of size, is there maybe another way to do things? How
> about keeping the existing libc and have an extra package
> ("libc-optimised") with all the optimised versions *and* the basic
> version, and have it provide/replace/conflict libc6?
> 
> > (/me prepares to be embarrassed as you point out the obvious flaw I'm
> missing...)

I guess that the provide/replace/conflict libc6 will just prevent
installation of foreign libc6 packages, basically making this optimized
package useless in the multiarch context.

OTOH, what is the drawback of having GCC default to -moutline-atomics?
It will improve performance in many more packages than just glibc, and
is way easier to implement overall. It also means users have to do
nothing to get the additional performance.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-06 Thread Steve McIntyre
Hey Aurelien,

On Sun, May 03, 2020 at 11:53:35PM +0200, Aurelien Jarno wrote:
>
>One solution for this would be to ship the optimized library in the same
>package as the default library. Now this is not acceptable for embedded
>systems as they might not need that library and can't remove it. This is
>even more problematic if we need to add more optimized libraries. I guess
>this might be the case for arm64 as there are many new extensions in the
>pipe.

ACK. It's a problem to ship the different things in separate
packages. If it's really a problem for smaller systems to have all the
variants because of size, is there maybe another way to do things? How
about keeping the existing libc and have an extra package
("libc-optimised") with all the optimised versions *and* the basic
version, and have it provide/replace/conflict libc6?

(/me prepares to be embarrassed as you point out the obvious flaw I'm
missing...)

-- 
Steve McIntyre, Cambridge, UK.    st...@einval.com
"... the premise [is] that privacy is about hiding a wrong. It's not.
 Privacy is an inherent human right, and a requirement for maintaining
 the human condition with dignity and respect."
  -- Bruce Schneier



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-06 Thread Adrian Bunk
On Mon, May 04, 2020 at 02:45:41PM -0400, Noah Meyerhans wrote:
>...
> I wonder if it'd make sense for libc to be a virtual package, with
> functionality provided by optimized builds and dependencies satisfied
> via Provides.  I don't know how well dpkg would cope with transitioning
> between providers, which seems like the riskiest side of this kind of
> thing.

What would happen if apt finds a dependency solution, when installing or
updating packages, that involves switching to a libc package that does
not run on your device?
There are situations where changing the libc package would be the only
possible solution to the dependencies.

IMHO there are far too many ways such a virtual package solution
could brick devices.

> noah

cu
Adrian



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-04 Thread Noah Meyerhans
On Sun, May 03, 2020 at 11:53:35PM +0200, Aurelien Jarno wrote:
> The hardware capabilities system works fine upstream, but doesn't work
> for us because:
> 1) we want to be able to upgrade across major upstream versions online
> (as opposed to Fedora, for example)
> 2) we ship the optimized libraries in a different package
> 
> The various libc libraries need to have the same version at all times;
> this is especially true for ld.so vs libc.so. As we do not upgrade the
> default libc and the optimized one at exactly the same time (they are in
> different packages), we upgrade the default libc first and then rely on
> the Debian-specific nohwcap mechanism to prevent use of the optimized
> library until it has also been upgraded.
> 
> One solution for this would be to ship the optimized library in the same
> package as the default library. Now this is not acceptable for embedded
> systems as they might not need that library and can't remove it. This is
> even more problematic if we need to add more optimized libraries. I guess
> this might be the case for arm64 as there are many new extensions in the
> pipe.

Thanks for taking the time to explain that!

I wonder if it'd make sense for libc to be a virtual package, with
functionality provided by optimized builds and dependencies satisfied
via Provides.  I don't know how well dpkg would cope with transitioning
between providers, which seems like the riskiest side of this kind of
thing.

noah



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-05-03 Thread Aurelien Jarno
On 2020-04-21 18:37, Noah Meyerhans wrote:
> > To be honest from a glibc maintenance point of view it's something I
> > would like to avoid. We haven't been actively trying to remove the
> > remaining optimized libraries (on i386, hurd and alpha), but we have
> > tried to avoid adding new ones. The problem is not building a second
> > optimized glibc, but rather providing a safe upgrade as the optimized
> > and the non-optimized package have to be at the same version or one of
> > them has to be disabled. This has caused many system breakages overall.
> 
> Understood, that makes sense.  I wonder if it's worth it to investigate
> techniques to improve the situation around optimized libraries.  Do you
> have any thoughts on what such an improvement might look like?

The hardware capabilities system works fine upstream, but doesn't work
for us because:
1) we want to be able to upgrade across major upstream versions online
(as opposed to Fedora, for example)
2) we ship the optimized libraries in a different package

The various libc libraries need to have the same version at all times;
this is especially true for ld.so vs libc.so. As we do not upgrade the
default libc and the optimized one at exactly the same time (they are in
different packages), we upgrade the default libc first and then rely on
the Debian-specific nohwcap mechanism to prevent use of the optimized
library until it has also been upgraded.

One solution for this would be to ship the optimized library in the same
package as the default library. Now this is not acceptable for embedded
systems as they might not need that library and can't remove it. This is
even more problematic if we need to add more optimized libraries. I guess
this might be the case for arm64 as there are many new extensions in the
pipe.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-30 Thread Florian Weimer
* Florian Weimer:

> I raised the matter of compiler defaults on the GCC list:
>
>   

The link is now: 



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-29 Thread Florian Weimer
I raised the matter of compiler defaults on the GCC list:

  

Thanks,
Florian



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-22 Thread Florian Weimer
* Noah Meyerhans:

> On Sun, Apr 12, 2020 at 12:18:35PM +0200, Aurelien Jarno wrote:
>> > Significant performance impact has also been observed in less contrived
>> > cases (MariaDB and Postgres), but I don't have a repro to share.
>> 
>> But indeed what counts are numbers on real workloads. It would be nice to
>> get numbers when those programs are run against a rebuilt glibc. As
>> they use a lot of atomics directly, it would also be
>> interesting to have numbers with them rebuilt to use
>> the new instructions.
>
> Agreed.  I don't have specific examples of real world impact at the
> moment.  AIUI, the most significant impact comes in the usage of atomics
> in pthread_mutex_lock().  When there are multiple threads contending for
> a lock, one thread will (approximately) always obtain the lock, while
> the others will starve.  With atomics support in place, the probability
> of obtaining the lock is roughly evenly distributed among all the
> threads.  So any workload in which multiple threads may contend for a
> lock should be a candidate to demonstrate this problem in the real
> world.

Does this behavior affect just one implementation with LSE, or also
implementations without LSE?

If the latter, we might need a different mutex implementation for
AArch64. 8-(



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-22 Thread Steve Capper
On Wed, Apr 22, 2020 at 05:48:27PM +0100, Steve McIntyre wrote:
> Hi folks!

Hiya,

> 
> I'm adding a CC to Steve Capper, a colleague in Arm who's our expert
> here for this kind of question. He's also a DM in Debian... :-)

Now I feel guilty about not doing enough Debian :-).

> 
> On Tue, Apr 21, 2020 at 06:37:07PM -0400, Noah Meyerhans wrote:
> >On Sun, Apr 12, 2020 at 12:18:35PM +0200, Aurelien Jarno wrote:
> >
> >> It would also be nice to have numbers showing the impact on non-ARMv8.1
> >> CPUs on real workloads. As pointed out by Florian, if the impact is
> >> negligible it might be a good idea to enable -moutline-atomics
> >> globally at the GCC level, so that all software can benefit from it
> >> instead of only glibc. That could be either upstream or only in Debian;
> >> that's probably a separate discussion. Otherwise we will likely end up
> >> using this non-default GCC option on all packages that run faster with
> >> it.
> >
> >Agreed.
> 
> I think -moutline-atomics is probably good to enable by default
> once we've got it (gcc 10). That's the suggestion I've heard from gcc
> folks in Arm.
> 
> >> Also note that the mechanism allowing a safe upgrade *does* incur a 
> >> runtime overhead as every binary now has to test for the presence of
> >> /etc/ld.so.nohwcap to detect a possible upgrade of the glibc in
> >> progress. That's why we have disabled it on architectures not providing
> >> an optimized library [1].
> 
> Oh, ick. :-/
> 
> >Thanks for the pointer, it's interesting to see data on that.  This also
> >suggests that it might be worthwhile to investigate a better mechanism
> >for identifying the availability of hardware features.
> >
> >> > I've tested both options and found them to be acceptable on v8.1a
> >> > (Neoverse N1) and v8a (Cortex A72) CPUs.  I can provide bulk test run
> >> > data of the various different configuration permutations if you'd like
> >> > to see additional data.

That's good to hear!

> >> 
> >> As said above, I think we would need more numbers on real workloads to
> >> make a decision. Don't get me wrong, I do not oppose improving atomics
> >> on ARMv8.1, but I would like us to choose the best option. Also, if we
> >> go with the -moutline-atomics option, I believe it has to be an ARM
> >> porters' decision rather than a glibc maintainers' decision (hence the Cc:).
> >
> >I'll see what I can come up with.
> >
> >Do the arm porters have any opinions on this matter?
> 
> It's a good question, and thanks for asking! I definitely think it's
> worth doing -moutline-atomics, and I'm hoping Steve can share some
> performance numbers to help convince. :-)
> 

We ran -moutline-atomics on a mixture of development hardware, running,
IIRC, some DPDK lock tests that employed C11-style atomics. As expected
there was a performance penalty, but it was on the order of 1%.
The perf boost from moving to LSE was a lot larger (and we noticed the
variance dropping a lot with LSE too).

FWIW, I'd recommend -moutline-atomics for the general case. (I used
to be a fan of the multi-lib approach, but the way the runtime selection
is implemented in gcc with a direct branch changed my mind :-).)

Cheers,
-- 
Steve



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-22 Thread Steve McIntyre
On Wed, Apr 22, 2020 at 01:08:46PM -0400, Noah Meyerhans wrote:
>On Wed, Apr 22, 2020 at 05:48:27PM +0100, Steve McIntyre wrote:
>> I think -moutline-atomics is probably good to enable by default
>> once we've got it (gcc 10). That's the suggestion I've heard from gcc
>> folks in Arm.
>
>JFTR, it's been backported to gcc 9 and is available in Debian's gcc-9
>as of 9.3.0-9. See
>https://salsa.debian.org/toolchain-team/gcc/-/blob/gcc-9-debian/debian/patches/git-updates.diff

Ah, cool. I knew it *was* being backported, but I wasn't aware it was
already with us. Woot!

-- 
Steve McIntyre, Cambridge, UK.    st...@einval.com
  Getting a SCSI chain working is perfectly simple if you remember that there
  must be exactly three terminations: one on one end of the cable, one on the
  far end, and the goat, terminated over the SCSI chain with a silver-handled
  knife whilst burning *black* candles. --- Anthony DeBoer



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-22 Thread Noah Meyerhans
On Wed, Apr 22, 2020 at 05:48:27PM +0100, Steve McIntyre wrote:
> I think -moutline-atomics is probably good to enable by default
> once we've got it (gcc 10). That's the suggestion I've heard from gcc
> folks in Arm.

JFTR, it's been backported to gcc 9 and is available in Debian's gcc-9
as of 9.3.0-9. See
https://salsa.debian.org/toolchain-team/gcc/-/blob/gcc-9-debian/debian/patches/git-updates.diff

noah



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-22 Thread Steve McIntyre
Hi folks!

I'm adding a CC to Steve Capper, a colleague in Arm who's our expert
here for this kind of question. He's also a DM in Debian... :-)

On Tue, Apr 21, 2020 at 06:37:07PM -0400, Noah Meyerhans wrote:
>On Sun, Apr 12, 2020 at 12:18:35PM +0200, Aurelien Jarno wrote:
>
>> It would also be nice to have numbers showing the impact on non-ARMv8.1
>> CPUs on real workloads. As pointed out by Florian, if the impact is
>> negligible it might be a good idea to enable -moutline-atomics
>> globally at the GCC level, so that all software can benefit from it
>> instead of only glibc. That could be either upstream or only in Debian;
>> that's probably a separate discussion. Otherwise we will likely end up
>> using this non-default GCC option on all packages that run faster with
>> it.
>
>Agreed.

I think -moutline-atomics is probably good to enable by default
once we've got it (gcc 10). That's the suggestion I've heard from gcc
folks in Arm.

>> Also note that the mechanism allowing a safe upgrade *does* incur a 
>> runtime overhead as every binary now has to test for the presence of
>> /etc/ld.so.nohwcap to detect a possible upgrade of the glibc in
>> progress. That's why we have disabled it on architectures not providing
>> an optimized library [1].

Oh, ick. :-/

>Thanks for the pointer, it's interesting to see data on that.  This also
>suggests that it might be worthwhile to investigate a better mechanism
>for identifying the availability of hardware features.
>
>> > I've tested both options and found them to be acceptable on v8.1a
>> > (Neoverse N1) and v8a (Cortex A72) CPUs.  I can provide bulk test run
>> > data of the various different configuration permutations if you'd like
>> > to see additional data.
>> 
>> As said above, I think we would need more numbers on real workloads to
>> make a decision. Don't get me wrong, I do not oppose improving atomics
>> on ARMv8.1, but I would like us to choose the best option. Also, if we
>> go with the -moutline-atomics option, I believe it has to be an ARM
>> porters' decision rather than a glibc maintainers' decision (hence the Cc:).
>
>I'll see what I can come up with.
>
>Do the arm porters have any opinions on this matter?

It's a good question, and thanks for asking! I definitely think it's
worth doing -moutline-atomics, and I'm hoping Steve can share some
performance numbers to help convince. :-)

-- 
Steve McIntyre, Cambridge, UK.    st...@einval.com
Who needs computer imagery when you've got Brian Blessed?



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-12 Thread Aurelien Jarno
Hi,

On 2020-04-10 13:16, Noah Meyerhans wrote:
> Package: src:glibc
> Version: 2.30-4
> Severity: wishlist
> X-Debbugs-CC: debian-...@lists.debian.org
> 
> The ARMv8.1 spec, as implemented by the ARM Neoverse N1 processor,
> introduces a set of instructions [1] that result in significant performance
> improvements for multithreaded applications.  Sample code demonstrating the
> performance improvements is attached.  When run on a 16-core Neoverse N1
> host with glibc 2.30-4, runtimes vary significantly, ranging from lows
> around 250ms to highs around 15 seconds.  When linked against glibc rebuilt
> with support for these instructions, runtimes are consistently <50ms.

This is an impressive improvement!

> Significant performance impact has also been observed in less contrived
> cases (MariaDB and Postgres), but I don't have a repro to share.

But indeed what counts are numbers on real workloads. It would be nice to
get numbers when those programs are run against a rebuilt glibc. As
they use a lot of atomics directly, it would also be
interesting to have numbers with them rebuilt to use
the new instructions.

> Gcc provides two ways to enable support for these instructions at build
> time.  The simplest, and least disruptive, is to enable -moutline-atomics
> globally in the arm64 glibc build.  As described at [2], this option enables
> runtime checks for the availability of the atomic instructions.  If found,
> they are used, otherwise ARMv8.0 compatible code is used.  The drawback of
> this option is that the check happens at runtime, thus introducing some
> overhead on all arm64 installations.

It would also be nice to have numbers showing the impact on non-ARMv8.1
CPUs on real workloads. As pointed out by Florian, if the impact is
negligible it might be a good idea to enable -moutline-atomics
globally at the GCC level, so that all software can benefit from it
instead of only glibc. That could be either upstream or only in Debian;
that's probably a separate discussion. Otherwise we will likely end up
using this non-default GCC option on all packages that run faster with
it.

> The second option is to provide libraries built with explicit support for
> the ARM v8.1a spec via the -march=armv8.1-a flag.  This option is also
> described at [2].  This build would be incompatible with earlier versions of
> the spec, so it would need to be provided in a location where the linker
> will automatically discover it if it is usable (e.g.
> /lib/aarch64-linux-gnu/atomics/).  This does not incur any runtime overhead,
> but obviously involves an additional libc build, and the corresponding
> complexity and disk space utilization.  I'm not sure if this is an option
> that the glibc maintainers are interested in pursuing.

To be honest, from a glibc maintenance point of view it's something I
would like to avoid. We haven't been actively trying to remove the
remaining optimized libraries (on i386, hurd and alpha), but we have
tried to avoid adding new ones. The problem is not building a second
optimized glibc, but rather providing a safe upgrade path, as the
optimized and the non-optimized package have to be at the same version
or one of them has to be disabled. This has caused many system
breakages overall.

Also note that the mechanism allowing a safe upgrade *does* incur a
runtime overhead, as every binary now has to test for the presence of
/etc/ld.so.nohwcap to detect a possible upgrade of the glibc in
progress. That's why we have disabled it on architectures not providing
an optimized library [1].

> I've tested both options and found them to be acceptable on v8.1a (Neoverse
> N1) and v8a (Cortex A72) CPUs.  I can provide bulk test run data of the
> various different configuration permutations if you'd like to see additional
> data.

As said above, I think we would need more numbers on real workloads to
make a decision. Don't get me wrong, I do not oppose improving atomics
on ARMv8.1, but I would like us to choose the best option. Also, if we
go with the -moutline-atomics option, I believe it has to be an ARM
porters' decision rather than a glibc maintainers' decision (hence the Cc:).

> I can provide patches or merge requests implementing either option, at least
> for a starting point, if you'd like to see them.

Thanks for this offer, but I don't think that's the most difficult part;
it's fairly straightforward to go with either of those options once a
decision is taken.

Regards,
Aurelien

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908928

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Noah Meyerhans
On Sat, Apr 11, 2020 at 10:23:54PM +0200, Florian Weimer wrote:
> Or put differently: If upstream doesn't want to default to
> -moutline-atomics, why should Debian?

Well, ultimately we own our build configurations and the optimizations
we enable therein.  If we don't want to enable -moutline-atomics
globally, then a second, optimized library is also an option.  IMO,
timing data like these should be enough to show that it's worth making a
change somewhere:

# 100 serial invocations of the "a.c" program attached to the bug
# report, linked against libc with -moutline-atomics
real    0m1.902s
user    0m3.488s
sys     0m25.498s

# 100 invocations of the same program linked against glibc with
# -march=armv8.1-a
real    0m1.844s
user    0m3.137s
sys     0m24.275s

# 100 invocations of the same program against our current libc build:
real    8m15.452s
user    130m33.139s
sys     0m1.162s

noah



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Florian Weimer
* Noah Meyerhans:

> On Sat, Apr 11, 2020 at 09:14:11PM +0200, Florian Weimer wrote:
>> > At least if I'm reading the code right (which I may very well not be
>> > doing, being generally unfamiliar with gcc internals), -mtune=generic
>> > enables the equivalent of ARMv8 support:
>> >
>> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/common/config/aarch64/aarch64-common.c;h=0bddcc8c3e9282a957c5479b4df7f68058093bab;hb=HEAD#l176
>> >
>> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64-cores.def;h=ea9b98b4b0ad2a578755561bba5b6d5c56115994;hb=HEAD
>> >
>> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.h;h=8f08bad3562c4cbe8acdf5891e84f89d23ea6784;hb=HEAD#l226
>> 
>> Hmm.  I don't see anything that sets TARGET_OUTLINE_ATOMICS by
>> default.
>
> Only -moutline-atomics enables that.  Otherwise, unconditional support
> for atomics is enabled by TARGET_LSE, which itself is enabled by a
> number of options, e.g. -march=armv8-a+lse, -march=armv8.1-a, etc.
>
> See
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.c;h=4af562a81ea760891fac3cf7101b8bf887fe7a0d;hb=HEAD#l18961

Sorry, I have a feeling that we are discussing different matters.

I believe that ideally, Debian (and Fedora etc.) should follow
upstream GCC defaults.  I don't think we are in this state
(code_for_aarch64_compare_and_swap uses the atomics.md patterns to
call aarch64_split_compare_and_swap, as far as I can see).

Or put differently: If upstream doesn't want to default to
-moutline-atomics, why should Debian?



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Noah Meyerhans
On Sat, Apr 11, 2020 at 09:14:11PM +0200, Florian Weimer wrote:
> > At least if I'm reading the code right (which I may very well not be
> > doing, being generally unfamiliar with gcc internals), -mtune=generic
> > enables the equivalent of ARMv8 support:
> >
> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/common/config/aarch64/aarch64-common.c;h=0bddcc8c3e9282a957c5479b4df7f68058093bab;hb=HEAD#l176
> >
> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64-cores.def;h=ea9b98b4b0ad2a578755561bba5b6d5c56115994;hb=HEAD
> >
> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.h;h=8f08bad3562c4cbe8acdf5891e84f89d23ea6784;hb=HEAD#l226
> 
> Hmm.  I don't see anything that sets TARGET_OUTLINE_ATOMICS by
> default.

Only -moutline-atomics enables that.  Otherwise, unconditional support
for atomics is enabled by TARGET_LSE, which itself is enabled by a
number of options, e.g. -march=armv8-a+lse, -march=armv8.1-a, etc.

See
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.c;h=4af562a81ea760891fac3cf7101b8bf887fe7a0d;hb=HEAD#l18961



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Florian Weimer
* Noah Meyerhans:

> On Sat, Apr 11, 2020 at 08:44:29AM +0200, Florian Weimer wrote:
>> > Gcc provides two ways to enable support for these instructions at build
>> > time.  The simplest, and least disruptive, is to enable -moutline-atomics
>> > globally in the arm64 glibc build.
>> 
>> Shouldn't GCC do this by default, at least for -mtune=generic?
>
> Maybe.  Would you rather pursue that avenue first?

My hope is that GCC upstream defaults reflect current practices for
the architecture. It doesn't make sense if every distribution ends up
patching in the same GCC defaults that are not upstream.

Sure, there might be bare-metal targets which do not want this, but is
this really the primary audience nowadays?

> At least if I'm reading the code right (which I may very well not be
> doing, being generally unfamiliar with gcc internals), -mtune=generic
> enables the equivalent of ARMv8 support:
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/common/config/aarch64/aarch64-common.c;h=0bddcc8c3e9282a957c5479b4df7f68058093bab;hb=HEAD#l176
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64-cores.def;h=ea9b98b4b0ad2a578755561bba5b6d5c56115994;hb=HEAD
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.h;h=8f08bad3562c4cbe8acdf5891e84f89d23ea6784;hb=HEAD#l226

Hmm.  I don't see anything that sets TARGET_OUTLINE_ATOMICS by
default.



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Noah Meyerhans
On Sat, Apr 11, 2020 at 08:44:29AM +0200, Florian Weimer wrote:
> > Gcc provides two ways to enable support for these instructions at build
> > time.  The simplest, and least disruptive, is to enable -moutline-atomics
> > globally in the arm64 glibc build.
> 
> Shouldn't GCC do this by default, at least for -mtune=generic?

Maybe.  Would you rather pursue that avenue first?

At least if I'm reading the code right (which I may very well not be
doing, being generally unfamiliar with gcc internals), -mtune=generic
enables the equivalent of ARMv8 support:

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/common/config/aarch64/aarch64-common.c;h=0bddcc8c3e9282a957c5479b4df7f68058093bab;hb=HEAD#l176

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64-cores.def;h=ea9b98b4b0ad2a578755561bba5b6d5c56115994;hb=HEAD

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/aarch64/aarch64.h;h=8f08bad3562c4cbe8acdf5891e84f89d23ea6784;hb=HEAD#l226

noah



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-11 Thread Florian Weimer
* Noah Meyerhans:

> Gcc provides two ways to enable support for these instructions at build
> time.  The simplest, and least disruptive, is to enable -moutline-atomics
> globally in the arm64 glibc build.

Shouldn't GCC do this by default, at least for -mtune=generic?



Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

2020-04-10 Thread Noah Meyerhans
Package: src:glibc
Version: 2.30-4
Severity: wishlist
X-Debbugs-CC: debian-...@lists.debian.org

The ARMv8.1 spec, as implemented by the ARM Neoverse N1 processor,
introduces a set of instructions [1] that result in significant performance
improvements for multithreaded applications.  Sample code demonstrating the
performance improvements is attached.  When run on a 16-core Neoverse N1
host with glibc 2.30-4, runtimes vary significantly, ranging from lows
around 250ms to highs around 15 seconds.  When linked against glibc rebuilt
with support for these instructions, runtimes are consistently <50ms.
Significant performance impact has also been observed in less contrived
cases (MariaDB and Postgres), but I don't have a repro to share.

Gcc provides two ways to enable support for these instructions at build
time.  The simplest, and least disruptive, is to enable -moutline-atomics
globally in the arm64 glibc build.  As described at [2], this option enables
runtime checks for the availability of the atomic instructions.  If found,
they are used, otherwise ARMv8.0 compatible code is used.  The drawback of
this option is that the check happens at runtime, thus introducing some
overhead on all arm64 installations.

The second option is to provide libraries built with explicit support for
the ARM v8.1a spec via the -march=armv8.1-a flag.  This option is also
described at [2].  This build would be incompatible with earlier versions of
the spec, so it would need to be provided in a location where the linker
will automatically discover it if it is usable (e.g.
/lib/aarch64-linux-gnu/atomics/).  This does not incur any runtime overhead,
but obviously involves an additional libc build, and the corresponding
complexity and disk space utilization.  I'm not sure if this is an option
that the glibc maintainers are interested in pursuing.

I've tested both options and found them to be acceptable on v8.1a (Neoverse
N1) and v8a (Cortex A72) CPUs.  I can provide bulk test run data of the
various configuration permutations if you'd like to see additional
data.

I can provide patches or merge requests implementing either option, at least
for a starting point, if you'd like to see them.

Thanks!
noah

1. https://static.docs.arm.com/ddi0557/a/DDI0557A_b_armv8_1_supplement.pdf
   Section B1
2. https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html
/*
 * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"). You may
 * not use this file except in compliance with the License. A copy of the
 * License is located at
 *
 *  http://aws.amazon.com/apache2.0/
 *
 * or in the "license" file accompanying this file. This file is distributed
 * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 * express or implied. See the License for the specific language governing
 * permissions and limitations under the License.
*/

/* Build with:
 * gcc -O2 -o a.out a.c -lpthread -DITER=1000 -DTHREADS=64
*/

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#ifndef ITER
# define ITER 1000
#endif
#ifndef THREADS
# define THREADS 3
#endif

#if THREADS < 1
# error "THREADS is supposed to be at least 1"
#endif

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_ptr = 0;

typedef struct stats_s {
  uint64_t min, max;
  int times;
  uint64_t total;
  uint64_t flips;
} stats_t;

stats_t stats[THREADS + 1];
pthread_t threads[THREADS];

#ifdef __aarch64__
static uint64_t cpu_shift() {
  uint64_t shift = 0;
  __asm__ __volatile__ ("mrs %0,cntfrq_el0; clz %w0, %w0":"=r"(shift));
  return shift;
}
#endif

static uint64_t gettime() {
#ifdef __aarch64__
  uint64_t ret = 0;
  __asm__ __volatile__ ("isb; mrs %0,cntvct_el0":"=r"(ret));
  return ret << cpu_shift();

#elif defined __x86_64__
  uint64_t a, d;
  __asm__ __volatile__ ("rdtsc" : "=a" (a), "=d" (d));
  return ((uint64_t)a + ((uint64_t)d << 32));
#endif

  return 0;
}

static void init_stats() {
  int i;
  for (i = 0; i <= THREADS; i++) {
    stats_t *s = &stats[i];
    s->min = UINT64_MAX;  /* start high so the first sample sets the floor */
    s->max = 0;
    s->times = 0;
    s->total = 0;
    s->flips = 0;
  }
}

static void print_stat(int i) {
  stats_t *s = &stats[i];
  float average = (float) s->total / s->times;
  if (i == THREADS)
    fprintf(stdout, "server: min=%lu, max=%lu, average=%f, mutexes_locked=%d, flips=%lu\n", s->min, s->max, average, s->times, s->flips);
  else
    fprintf(stdout, "thread %d: min=%lu, max=%lu, average=%f, mutexes_locked=%d, flips=%lu\n", i, s->min, s->max, average, s->times, s->flips);
}

static void print_stats() {
  int i;
  for (i = 0; i <= THREADS; i++)
    print_stat(i);
}

static void update_stats(stats_t *s, uint64_t time) {
  ++s->times;
  if (time < s->min)
    s->min = time;
  if (time > s->max)
    s->max = time;
  s->total += time;
}

static void fun(int check, int set, stats_t *stat) {
  int loop = 1;
  while (loop) {
    uint64_t start = gettime();
    pthread_mutex_lock(&lock);
if (shared_ptr