Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-04-09 Thread Lisandro Damián Nicanor Pérez Meyer
On Saturday, 6 April 2019 17:55:35 -03, Guillem Jover wrote:
[snip]
> If what you are interested in though is just a small subset of the
> archive, another option that would benefit everyone and is perhaps
> less cumbersome than having to juggle multiple archives
> and package rebuilds/variants, is to make use of libc's hwcaps [H]
> support, which means the dynamic linker will automatically load the
> best optimized shared object for the current hardware. This of course
> can complicate the packaging a bit, and bloat it, but if the performance
> improvement is substantial, it might be a very good trade-off.


FWIW: we did this with qtbase-opensource-src for some specific packages, 
namely sse2.

So I fully agree with Guillem here: if you *really* think the effort is worth
it, then going this way might be the best thing to do.


-- 
Do not underestimate the bandwidth of a truck
loaded with tapes.
  Andrew S. Tanenbaum

Lisandro Damián Nicanor Pérez Meyer
http://perezmeyer.com.ar/
http://perezmeyer.blogspot.com/




Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-04-09 Thread Guillem Jover
On Tue, 2019-04-09 at 06:48:59 +, Mo Zhou wrote:
> On Sat, Apr 06, 2019 at 10:55:35PM +0200, Guillem Jover wrote:
> > If what you are interested in though is just a small subset of the
> > archive, another option that would benefit everyone and is perhaps
> > less cumbersome than having to juggle multiple archives
> > and package rebuilds/variants, is to make use of libc's hwcaps [H]
> > support, which means the dynamic linker will automatically load the
> > best optimized shared object for the current hardware. This of course
> > can complicate the packaging a bit, and bloat it, but if the performance
> > improvement is substantial, it might be a very good trade-off.
> >   [H] man ld.so "NOTES" / "Hardware capabilities"
> 
> This sounds like a nice feature. However, unfortunately, the "avx2" and
> "avx512" features I wanted didn't show up in the list... IIRC in my
> original post I presented a C++ example with Eigen (a header-only
> library). Reverse deps such as TensorFlow would benefit from this HWCAPS
> feature if ld.so supported amd64's avx2 and avx512.

I guess the man page is just not exhaustive. ld.so should support those
hwcaps, as it handles them mostly as opaque ASCII strings sent by the
kernel. On a system with a CPU supporting those, just check the output
for something like:

  $ LD_SHOW_AUXV=1 /lib/ld-linux.so.2 2>/dev/null|grep AT_HWCAP:
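
For illustration only (this is not part of the original message): a program
can read the same auxiliary-vector entries itself. A minimal sketch, assuming
glibc 2.16 or later for getauxval(); the helper names hwcap_mask(),
platform_name() and describe_hwcap() are hypothetical, and the meaning of the
individual AT_HWCAP bits depends on the architecture.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_PLATFORM */

/* Raw AT_HWCAP bitmask from the auxiliary vector. The meaning of each
 * bit is architecture-specific; getauxval() returns 0 if the entry is
 * missing. */
unsigned long hwcap_mask(void)
{
    return getauxval(AT_HWCAP);
}

/* Platform string from AT_PLATFORM, or "(unknown)" if the kernel did
 * not supply one. */
const char *platform_name(void)
{
    const char *p = (const char *)getauxval(AT_PLATFORM);
    return p != NULL ? p : "(unknown)";
}

/* Writes a one-line summary into buf; returns the snprintf() result. */
int describe_hwcap(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "AT_PLATFORM: %s, AT_HWCAP: 0x%lx",
                    platform_name(), hwcap_mask());
}
```

Calling describe_hwcap() with a small stack buffer and printing the result
gives a quick view of what the kernel reported to the process.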

Thanks,
Guillem



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-04-09 Thread Mo Zhou
Hi Guillem,

Thanks for your helpful pointers.

On Sat, Apr 06, 2019 at 10:55:35PM +0200, Guillem Jover wrote:
> If what you are interested in though is just a small subset of the
> archive, another option that would benefit everyone and is perhaps
> less cumbersome than having to juggle multiple archives
> and package rebuilds/variants, is to make use of libc's hwcaps [H]
> support, which means the dynamic linker will automatically load the
> best optimized shared object for the current hardware. This of course
> can complicate the packaging a bit, and bloat it, but if the performance
> improvement is substantial, it might be a very good trade-off.
>   [H] man ld.so "NOTES" / "Hardware capabilities"

This sounds like a nice feature. However, unfortunately, the "avx2" and
"avx512" features I wanted didn't show up in the list... IIRC in my
original post I presented a C++ example with Eigen (a header-only
library). Reverse deps such as TensorFlow would benefit from this HWCAPS
feature if ld.so supported amd64's avx2 and avx512.
 
> Another option which requires upstream code changes (and ideally
> upstream's cooperation) is to add run-time selection of the most suitable
> optimized functions, for example via the __target__ and __ifunc__ [I]
> function __attribute__ (and __builtin_cpu_supports or __builtin_cpu_is),
> or the __target_clones__ function __attribute__. Perhaps also of
> interest is the __simd__ function __attribute__.
> 
>   [I] info gcc "Function Attributes";
>   

This compiler feature (which has been considered in the past) is quite a
good solution for small projects.  However, it is not easy to enforce in
projects like TensorFlow ...



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-04-06 Thread Guillem Jover
Hi!

On Fri, 2019-02-08 at 16:25:41 +, Mo Zhou wrote:
> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However for some scientific applications
> this proposition doesn't hold. When I was creating the tensorflow debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].

> Having seen such interesting results, I immediately created a Debian partial
> fork named SIMDebian (SIMD + Debian)[0]. It makes great sense for some
> applications due to the significant performance gain brought by SIMD code.
> Currently this partial fork is still in the very early stage, and it needs
> 
>   * More experience with software that benefits a lot from SIMD code
> (e.g. Which packages would potentially benefit from SIMD code?)
>   * Suggestions and comments
> (e.g. Is such a partial fork really useful and valuable?)
>   * More people interested in this
> 
> SIMDebian is only a PARTIAL fork, which means that it only takes care of
> packages that would obviously benefit from SIMD code, because no performance
> gain is expected for the majority of packages in the Debian archive.

There's been talk about this in the past; AFAIR the most recent discussion
before this one was about the various MIPS ISAs (?). We covered this
in the Debian Bootstrap sprint in 2014 (see §2):

  

There's not been much progress there, as it seemed like interest had
waned.

If what you are interested in though is just a small subset of the
archive, another option that would benefit everyone and is perhaps
less cumbersome than having to juggle multiple archives
and package rebuilds/variants, is to make use of libc's hwcaps [H]
support, which means the dynamic linker will automatically load the
best optimized shared object for the current hardware. This of course
can complicate the packaging a bit, and bloat it, but if the performance
improvement is substantial, it might be a very good trade-off.

  [H] man ld.so "NOTES" / "Hardware capabilities"

Another option which requires upstream code changes (and ideally
upstream's cooperation) is to add run-time selection of the most suitable
optimized functions, for example via the __target__ and __ifunc__ [I]
function __attribute__ (and __builtin_cpu_supports or __builtin_cpu_is),
or the __target_clones__ function __attribute__. Perhaps also of
interest is the __simd__ function __attribute__.

  [I] info gcc "Function Attributes";
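
As a rough illustration of the target_clones approach (a sketch only, assuming
GCC 6 or later on x86-64 with glibc ifunc support; dot() and the MULTIVERSION
macro are made-up names, not taken from any package discussed here):

```c
#include <assert.h>
#include <stddef.h>

/* Only request clones where the attribute and the "avx2" target are
 * expected to be available; elsewhere build a single plain version. */
#if defined(__GNUC__) && defined(__x86_64__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

/* With the attribute active, the compiler emits one clone per listed
 * target plus a resolver that selects a clone when the program is
 * loaded. The source itself is written only once. */
MULTIVERSION
float dot(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Callers simply call dot(); on a CPU without AVX2 the "default" clone is
selected, so the binary still runs on baseline hardware.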
  

Thanks,
Guillem



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-02-08 Thread Mo Zhou
On Sat, Feb 09, 2019 at 01:14:43PM +0800, Benda Xu wrote:
> Hi Mo,
> 
> Very interesting initiative.  I understand it will be Intel-specific for
> the moment.  What is your vision for optimizing for AMD CPUs?

Thanks for your interest. This project doesn't mention AMD CPUs because I
have no experience with them[1]. Maybe we could select a better set of
compiler options from [2]?

There are currently three branches (although badly synchronized) in the
SIMDebian archive[3]. With some slight modifications to the forked dpkg
we can also set up specific archives for e.g. NEON (arm),
VSX (ppc64el), and advanced AMD CPU microarchitectures.

SIMDebian only cares about packages that would benefit from a bumped ISA
baseline, so the archive will be small even if we support tens of
baselines over different architectures at the same time.

[1] I've never had an AMD CPU in my life.
[2] https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/x86-Options.html#x86-Options
[3] https://sim.debiancn.org/simdebian/



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-02-08 Thread Mo Zhou
Hi Drew,

On Sat, Feb 09, 2019 at 01:07:47PM +1100, Drew Parsons wrote:
> On 2019-02-09 03:25, Mo Zhou wrote:
> 
> I think it would be more constructive to provide arch-specific packages for
> eigen/blas etc on amd64 which Conflict/Replace/Provide the standard
> packages.
>
> That way a local administrator can choose to install them if they help
> improve performance.

Elegant in theory, but this actually makes packaging much harder and
more complex, because the packages must avoid SIGILL on CPUs that lack
the required instructions. In a partial fork, however, we could directly
assume a haswell baseline and provide packages rebuilt from identical
source. This approach looks unconstructive but it's economical.
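
To illustrate the SIGILL concern (a sketch only, assuming GCC or Clang on
x86-64; scale(), scale_avx2() and scale_generic() are hypothetical names):
code compiled for a higher baseline must never be reached on a CPU that lacks
those instructions, so a package serving both kinds of machines needs a guard
of roughly this shape somewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Portable fallback: compiled for the baseline ISA, safe everywhere. */
static void scale_generic(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same loop, but the compiler may use AVX2 instructions here. Calling
 * this on a CPU without AVX2 could raise SIGILL. */
__attribute__((target("avx2")))
static void scale_avx2(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}
#endif

/* Public entry point: checks the CPU before taking the AVX2 path. */
void scale(float *x, size_t n, float k)
{
    assert(x != NULL || n == 0);
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) {
        scale_avx2(x, n, k);
        return;
    }
#endif
    scale_generic(x, n, k);
}
```

A rebuilt package that assumes a haswell baseline skips this guard entirely,
which is what makes the partial-fork route cheaper to maintain.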

https://tracker.debian.org/pkg/isa-support has been available since Aug 2017,
but its highest ISA baseline (SSE4.2) is still too conservative to be
really useful.

> Hacking dpkg itself for this purpose and forking an entire linux
> distribution is way too invasive and distracts from solving the actual
> problem.

I must emphasize that SIMDebian only cares about packages that would
definitely benefit from a bumped ISA baseline, and hence I call it a
**partial** fork.  The modified dpkg is not used for rebuilding the
whole archive by brute force, but for (automatically) picking
low-hanging fruit. If a package benefits a lot from a bumped ISA baseline,
the modified dpkg helps you quickly rebuild it without any code change.

Apart from that, reminded by jrtc27 on IRC, I dug a bit into the compiler's
FMV (function multi-versioning) feature[1]. Julia[2] and Clear Linux[3]
are taking advantage of this feature. And most importantly, packages
using FMV can enter the Debian archive. So FMV is another
feasible solution for packages that need to be accelerated.

After glancing at some benchmarks[4][5] of Clear Linux I started to
wonder if we could borrow some ideas or even patches from Clear Linux
and introduce them to Debian...

[1] https://lwn.net/Articles/691932/
[2] https://github.com/JuliaLang/julia/pull/21849
[3] https://clearlinux.org/documentation/clear-linux/tutorials/fmv
[4] https://www.phoronix.com/scan.php?page=article=arch-antergos-clear=1
[5] https://www.phoronix.com/scan.php?page=article=ubuntu-clear-tweaks=1



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-02-08 Thread Benda Xu
Hi Mo,

Mo Zhou writes:

> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However for some scientific applications
> this proposition doesn't hold. When I was creating the tensorflow debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].

> ...

Very interesting initiative.  I understand it will be Intel-specific for
the moment.  What is your vision for optimizing for AMD CPUs?

Yours,
Benda



Re: SIMDebian: Debian Partial Fork with Radical ISA Baseline

2019-02-08 Thread Drew Parsons

On 2019-02-09 03:25, Mo Zhou wrote:

> Hi folks,
>
> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However for some scientific applications
> this proposition doesn't hold. When I was creating the tensorflow debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].
>
> ...
>
> Generally speaking, in order to bump the ISA baseline for a given package, one
> could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules.
> However SIMDebian employs a more economical approach to this end: forking
> dpkg[5] and injecting the -march=xxx flag into the system default flag list.
> With the resulting dpkg package, most debian packages could be rebuilt with a
> bumped ISA baseline without any code modification.




I think it would be more constructive to provide arch-specific packages 
for eigen/blas etc on amd64 which Conflict/Replace/Provide the standard 
packages.


That way a local administrator can choose to install them if they help 
improve performance.


Hacking dpkg itself for this purpose and forking an entire linux 
distribution is way too invasive and distracts from solving the actual 
problem.


Drew