Re: [hwloc-users] memory binding on Knights Landing

2016-10-16 Thread Christopher Samuel
On 12/10/16 02:03, Dave Love wrote:

> For what it's worth, I was misled when I investigated originally by
> counting calls of open and not openat, which is what hwloc uses.

strace -c can be very handy to give you a quick insight into which
system calls are taking the most time (either in a process run from
strace or attaching to a running one).
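
For example, a sketch (any hwloc command works as the target; <pid> is a
placeholder for an already-running process):

  # summarise syscall counts and time for a command run under strace
  strace -c -f -e trace=open,openat hwloc-bind core:0 -- true

  # or attach to a running process, then Ctrl-C to print the summary
  strace -c -p <pid>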

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-users] memory binding on Knights Landing

2016-10-11 Thread Dave Love
Brice Goglin  writes:

> I ran more benchmarks. What's really slow is the reading of all sysfs
> files. About 90% of the topology building time is spent there on KNL.
> We're reading more than 7000 files (most of them being the six files read
> for each hardware thread and the six for each cache).

For what it's worth, I was misled when I investigated originally by
counting calls of open and not openat, which is what hwloc uses.  It's
especially confusing that numactl uses open, not openat (doubtless less
correctly).  A useful lesson.


Re: [hwloc-users] memory binding on Knights Landing

2016-10-04 Thread Brice Goglin


On 12/09/2016 04:20, Brice Goglin wrote:
> So what's really slow is reading sysfs and/or inserting all hwloc
> objects in the tree. I need to do some profiling. And I am moving the
> item "parallelize the discovery" higher in the TODO list :)

Hello

I ran more benchmarks. What's really slow is the reading of all sysfs
files. About 90% of the topology building time is spent there on KNL.
We're reading more than 7000 files (most of them being the six files read
for each hardware thread and the six for each cache).
Reading from sysfs is significantly slower than reading normal files
that are cached (not surprising since the kernel doesn't cache sysfs
file contents).
And reading on KNL is about 30 times slower than on my laptop (70us per
sysfs file). I don't know why.
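
A rough way to see the scale of this from the shell (a sketch; the exact
file sets vary by kernel version, and the timing is only illustrative):

  # the per-thread topology files and per-cache attribute files involved
  ls /sys/devices/system/cpu/cpu0/topology/ /sys/devices/system/cpu/cpu0/cache/index0/

  # count them across all hardware threads
  ls /sys/devices/system/cpu/cpu*/topology/* \
     /sys/devices/system/cpu/cpu*/cache/index*/* 2>/dev/null | wc -l

  # time reading the per-thread topology files
  time cat /sys/devices/system/cpu/cpu*/topology/* > /dev/null 2>&1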

And if you have one process doing this on each core simultaneously,
things become up to 30x slower.

Looks like XML is really the way to go on these platforms. One thing
that XML currently misses is cgroup support. You need to export the XML
inside the same cgroup or the topology will be wrong. I am adding an
option to read the current cgroup restrictions from the OS and apply them
to an XML-imported topology (which must then be created outside of all cgroups).
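
In the meantime, a minimal sketch of the current workaround (file name
hypothetical); the key point is that lstopo runs inside the same cgroup as
the processes that will later load the XML:

  # from inside the cgroup, e.g. at the start of the batch job itself:
  lstopo /tmp/topology.xml
  export HWLOC_XMLFILE=/tmp/topology.xml
  export HWLOC_THISSYSTEM=1
  # subsequent hwloc users in this cgroup now load the XML instead of sysfs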

Brice



Re: [hwloc-users] memory binding on Knights Landing

2016-09-12 Thread Dave Love
Brice Goglin  writes:

> So what's really slow is reading sysfs and/or inserting all hwloc
> objects in the tree. I need to do some profiling. And I am moving the
> item "parallelize the discovery" higher in the TODO list :)

It didn't seem to scale between systems with the number of sysfs opens,
accounting for differences in processor speed, but I didn't actually
profile.


Re: [hwloc-users] memory binding on Knights Landing

2016-09-12 Thread Dave Love
Brice Goglin  writes:

> I am not sure where that hwloc for RHEL on KNL is available from. It
> might be in Intel's "XPPSL" software suite.

I didn't know about that, but it only has hwloc 1.11.2, as in RHEL7
beta, in case the more recent changes for KNL are relevant.

[Because of the nasty packaging, it's not clear what version of the
other things it includes, and at least memkind is incompatible with
what's in EPEL.]


Re: [hwloc-users] memory binding on Knights Landing

2016-09-12 Thread Brice Goglin
On 08/09/2016 19:17, Brice Goglin wrote:
>
>> By the way, is it expected that binding will be slow on it?  hwloc-bind
>> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
>> slower than on a 128-core, 16-socket system.
> Binding itself shouldn't be slower. But hwloc's topology discovery
> (which is performed by hwloc-bind before actual binding) is slower on
> KNL than on "normal" nodes. The overhead is basically linear with the
> number of hyperthreads, and KNL sequential perf is lower than your other
> nodes.
>
> The easy fix is to export the topology to XML with lstopo foo.xml and
> then tell all hwloc users to load from XML:
> export HWLOC_XMLFILE=foo.xml
> export HWLOC_THISSYSTEM=1
> https://www.open-mpi.org/projects/hwloc/doc/v1.11.4/a00030.php#faq_xml
>
> For hwloc 2.0, I am trying to make sure we don't perform useless
> discovery steps. hwloc-bind (and many applications) don't require all
> topology details. v1.x gathers everything and filters things out later.
> For 2.0, the plan is rather to directly just gather what we need. What
> you can try for fun is:
> export HWLOC_COMPONENTS=-x86 (without the above XML env vars)
> It disables the x86-specific discovery which is useless for most cases
> on Linux.
>

Interesting, this last idea doesn't help. XML is much faster (0.14s),
but normal discovery is still 1s without the x86-specific code.

So what's really slow is reading sysfs and/or inserting all hwloc
objects in the tree. I need to do some profiling. And I am moving the
item "parallelize the discovery" higher in the TODO list :)

Brice



Re: [hwloc-users] memory binding on Knights Landing

2016-09-09 Thread Brice Goglin
On 09/09/2016 12:49, Dave Love wrote:
>
>> Intel people are carefully
>> working with RedHat so that hwloc is properly packaged for RHEL. I can
>> report bugs if needed.
> I can't see a recent hwloc for RHEL (e.g. in RHEL7 beta), but don't get
> me started on RHEL and HPC...
>

I am not sure where that hwloc for RHEL on KNL is available from. It
might be in Intel's "XPPSL" software suite.

Brice



Re: [hwloc-users] memory binding on Knights Landing

2016-09-09 Thread Dave Love
Brice Goglin  writes:

> Is there anything to fix on the RPM side?

Nothing significant, I think.  The updated Fedora version needed slight
adjustment for hwloc-dump-hwdata, at least.

> Intel people are carefully
> working with RedHat so that hwloc is properly packaged for RHEL. I can
> report bugs if needed.

I can't see a recent hwloc for RHEL (e.g. in RHEL7 beta), but don't get
me started on RHEL and HPC...

I'd originally just rebuilt an updated RHEL6 package, but that had a
load of compatibility stuff in it.  In case it's useful, a clean build
of the updated Fedora package for EPEL7 is under
https://copr.fedorainfracloud.org/coprs/loveshack/livhpc/package/hwloc/

[Off-topic, but there's a little KNL-specific stuff in a sibling repo
(el7-knl), and one or two things with avx512 support in the livhpc one.]

>
>> By the way, is it expected that binding will be slow on it?  hwloc-bind
>> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
>> slower than on a 128-core, 16-socket system.
>
> Binding itself shouldn't be slower. But hwloc's topology discovery
> (which is performed by hwloc-bind before actual binding) is slower on
> KNL than on "normal" nodes. The overhead is basically linear with the
> number of hyperthreads, and KNL sequential perf is lower than your other
> nodes.

Right, thanks.  I shouldn't have confused them, especially as I was
looking at things concerned with discovery...  I don't remember how that
works, if I ever did in detail, but I was expecting it to scale with the
number of kernel file opens.

>
> The easy fix is to export the topology to XML with lstopo foo.xml and
> then tell all hwloc users to load from XML:

Yes; I think I even noted that somewhere in the SGE source...

By the way, in case this attracts the attention of anyone who might do
what I did:  hwloc-distances produces no output, as a feature.


Re: [hwloc-users] memory binding on Knights Landing

2016-09-09 Thread Dave Love
Jeff Hammond  writes:

>> By the way, is it expected that binding will be slow on it?  hwloc-bind
>> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
>> slower than on a 128-core, 16-socket system.
>>
> Is this a bottleneck in any application?  Are there codes binding memory
> frequently?

As Brice pointed out, I was stupidly confusing discovery and binding,
but there are cases where a second to do the discovery could be
significant, even if I might not consider them too sensible.

> Because most things inside the kernel are limited by single-threaded
> performance, it is reasonable for them to be slower than on a Xeon
> processor, but I've not seen slowdowns that high.
>
> Jeff

Yes.  Actually I was originally comparing with a slower 64-core
Interlagos, but that was with an older hwloc.


Re: [hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Brice Goglin



On 08/09/2016 17:59, Dave Love wrote:
> Brice Goglin  writes:
>
>> Hello
>> It's not a feature. This should work fine.
>> Random guess: do you have NUMA headers on your build machine? (package
>> libnuma-dev or numactl-devel)
>> (hwloc-info --support also reports whether membinding is supported or not)
>> Brice
> Oops, you're right!  Thanks.  I thought what I'm using elsewhere was
> built from the same srpm, but the rpm on the KNL box doesn't actually
> require libnuma.  After a rebuild, it's OK and I'm suitably embarrassed.

Is there anything to fix on the RPM side? Intel people are carefully
working with RedHat so that hwloc is properly packaged for RHEL. I can
report bugs if needed.

> By the way, is it expected that binding will be slow on it?  hwloc-bind
> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
> slower than on a 128-core, 16-socket system.

Binding itself shouldn't be slower. But hwloc's topology discovery
(which is performed by hwloc-bind before actual binding) is slower on
KNL than on "normal" nodes. The overhead is basically linear with the
number of hyperthreads, and KNL sequential perf is lower than your other
nodes.

The easy fix is to export the topology to XML with lstopo foo.xml and
then tell all hwloc users to load from XML:
export HWLOC_XMLFILE=foo.xml
export HWLOC_THISSYSTEM=1
https://www.open-mpi.org/projects/hwloc/doc/v1.11.4/a00030.php#faq_xml
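
For instance, a sketch (the file name and the node:0 target are arbitrary)
comparing hwloc-bind with and without the exported XML:

  lstopo /tmp/knl.xml
  time hwloc-bind --cpubind node:0 --membind node:0 -- true   # full discovery
  export HWLOC_XMLFILE=/tmp/knl.xml
  export HWLOC_THISSYSTEM=1
  time hwloc-bind --cpubind node:0 --membind node:0 -- true   # loads the XML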

For hwloc 2.0, I am trying to make sure we don't perform useless
discovery steps. hwloc-bind (and many applications) don't require all
topology details. v1.x gathers everything and filters things out later.
For 2.0, the plan is rather to directly just gather what we need. What
you can try for fun is:
export HWLOC_COMPONENTS=-x86 (without the above XML env vars)
It disables the x86-specific discovery which is useless for most cases
on Linux.
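
A quick way to try that experiment (a sketch; true is just a no-op payload,
so the time measured is essentially the discovery itself):

  # without the XML environment variables set:
  time env HWLOC_COMPONENTS=-x86 hwloc-bind core:0 -- true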

I'll do some performance testing tomorrow too.

Regards
Brice



Re: [hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Jeff Hammond
On Thu, Sep 8, 2016 at 8:59 AM, Dave Love  wrote:

> Brice Goglin  writes:
>
> > Hello
> > It's not a feature. This should work fine.
> > Random guess: do you have NUMA headers on your build machine? (package
> > libnuma-dev or numactl-devel)
> > (hwloc-info --support also reports whether membinding is supported or not)
> > Brice
>
> Oops, you're right!  Thanks.  I thought what I'm using elsewhere was
> built from the same srpm, but the rpm on the KNL box doesn't actually
> require libnuma.  After a rebuild, it's OK and I'm suitably embarrassed.
>
> By the way, is it expected that binding will be slow on it?  hwloc-bind
> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
> slower than on a 128-core, 16-socket system.
>

Is this a bottleneck in any application?  Are there codes binding memory
frequently?

Because most things inside the kernel are limited by single-threaded
performance, it is reasonable for them to be slower than on a Xeon
processor, but I've not seen slowdowns that high.

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Dave Love
Brice Goglin  writes:

> Hello
> It's not a feature. This should work fine.
> Random guess: do you have NUMA headers on your build machine? (package
> libnuma-dev or numactl-devel)
> (hwloc-info --support also reports whether membinding is supported or not)
> Brice

Oops, you're right!  Thanks.  I thought what I'm using elsewhere was
built from the same srpm, but the rpm on the KNL box doesn't actually
require libnuma.  After a rebuild, it's OK and I'm suitably embarrassed.

By the way, is it expected that binding will be slow on it?  hwloc-bind
is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
slower than on a 128-core, 16-socket system.


Re: [hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Brice Goglin
Hello
It's not a feature. This should work fine.
Random guess: do you have NUMA headers on your build machine? (package
libnuma-dev or numactl-devel)
(hwloc-info --support also reports whether membinding is supported or not)
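
For example, a sketch (the grep just narrows the output; the membind
capabilities should show 1 when memory binding is available):

  hwloc-info --support | grep -i membind
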
Brice



On 08/09/2016 16:34, Dave Love wrote:
> I'm somewhat confused by binding on Knights Landing -- which is probably
> a feature.
>
> I'm looking at a KNL box configured as "Cluster Mode: SNC4 Memory Mode:
> Cache" with hwloc 1.11.4; I've read the KNL hwloc FAQ entries.  I ran
> openmpi and it reported failure to bind memory (but binding to cores was
> OK).  So I tried hwloc-bind --membind and that seems to fail no matter
> what I do, reporting
>
>   hwloc_set_membind 0x0002 (policy 2 flags 0) failed (errno 38 Function 
> not implemented)
>
> Is that expected, and is there a recommendation on how to do binding in
> that configuration with things that use hwloc?  I'm particularly
> interested in OMPI, but I guess this is a better place to ask.  Thanks.



[hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Dave Love
I'm somewhat confused by binding on Knights Landing -- which is probably
a feature.

I'm looking at a KNL box configured as "Cluster Mode: SNC4 Memory Mode:
Cache" with hwloc 1.11.4; I've read the KNL hwloc FAQ entries.  I ran
openmpi and it reported failure to bind memory (but binding to cores was
OK).  So I tried hwloc-bind --membind and that seems to fail no matter
what I do, reporting

  hwloc_set_membind 0x0002 (policy 2 flags 0) failed (errno 38 Function not implemented)

Is that expected, and is there a recommendation on how to do binding in
that configuration with things that use hwloc?  I'm particularly
interested in OMPI, but I guess this is a better place to ask.  Thanks.