Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread fabricio

On 30-06-2017 17:28, Brice Goglin wrote:

On 30/06/2017 22:08, fabricio wrote:

On 30-06-2017 16:21, Brice Goglin wrote:

Yes, it's possible but not very easy. Before we go that way:
Can you also pass HWLOC_COMPONENTS_VERBOSE=1 in the environment and send
the verbose output?


///
Registered cpu discovery component `no_os' with priority 40
(statically build)
Registered global discovery component `xml' with priority 30
(statically build)
Registered global discovery component `synthetic' with priority 30
(statically build)
Registered global discovery component `custom' with priority 30
(statically build)
Registered cpu discovery component `linux' with priority 50
(statically build)
Registered misc discovery component `linuxpci' with priority 19
(statically build)
Registered misc discovery component `pci' with priority 20 (statically
build)
Registered cpu discovery component `x86' with priority 45 (statically
build)
Enabling cpu discovery component `linux'
Enabling cpu discovery component `x86'
Enabling cpu discovery component `no_os'
Excluding global discovery component `xml', conflicts with excludes 0x2
Excluding global discovery component `synthetic', conflicts with
excludes 0x2
Excluding global discovery component `custom', conflicts with excludes
0x2
Enabling misc discovery component `pci'
Enabling misc discovery component `linuxpci'
Final list of enabled discovery components: linux,x86,no_os,pci,linuxpci


* hwloc has encountered what looks like an error from the operating
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 942
*
* The following FAQ entry in a recent hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's
mailing list,
* along with the output+tarball generated by the hwloc-gather-topology
script.


Enabling global discovery component `xml'
Excluding cpu discovery component `linux', conflicts with excludes
0x
Excluding cpu discovery component `x86', conflicts with excludes
0x
Excluding cpu discovery component `no_os', conflicts with excludes
0x
Excluding global discovery component `xml', conflicts with excludes
0x
Excluding global discovery component `synthetic', conflicts with
excludes 0x
Excluding global discovery component `custom', conflicts with excludes
0x
Excluding misc discovery component `pci', conflicts with excludes
0x
Excluding misc discovery component `linuxpci', conflicts with excludes
0x
Final list of enabled discovery components: xml
///
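To check quickly whether the x86 backend ended up enabled, the "Final list" lines of the verbose output above can be pulled out with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the verbose output keeps the exact "Final list of enabled discovery components:" wording shown in the log.

```python
import re

def enabled_component_lists(verbose_output):
    """Return one list of component names per 'Final list' line found."""
    pattern = r"Final list of enabled discovery components:\s*(\S+)"
    return [names.split(",") for names in re.findall(pattern, verbose_output)]

# Two lines copied from the verbose log in this thread.
log = """Final list of enabled discovery components: linux,x86,no_os,pci,linuxpci
Final list of enabled discovery components: xml
"""

lists = enabled_component_lists(log)
print(lists)
# The first topology load has x86 enabled, the second (XML) one does not.
print(["x86" in names for names in lists])
```

In the log above the first load lists x86 next to linux, so the backend was built in and enabled; the second load is an XML topology, which excludes all the other components.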


I am wondering if the x86 backend was disabled somehow.
Please also send your config.log


I'm using the embedded hwloc in openmpi 1.10.7, whose version seems to
be 1.9.1. I could not find a config.log file.


I thought you were using hwloc 1.11.5? HWLOC_COMPONENTS=x86 can help
there, but not in 1.9.1 from OMPI. Which one did you try?




Setting HWLOC_COMPONENTS=-linux could also work: it totally disables the
Linux backend. If the x86 backend is disabled as well, you would get an
almost empty topology.


Will this leave the process allocation to the kernel, potentially
diminishing performance?


This would basically ignore all topology information.
But it's not needed anymore here since the x86 backend is enabled above.

What you can do is one of these:
* tell OMPI to use an external hwloc >= 1.11.2
* use a more recent OMPI :)
* use an XML file generated with hwloc >= 1.11.2 with HWLOC_COMPONENTS=x86,
and pass it to OMPI and/or hwloc with HWLOC_XMLFILE=/path/to/foo.xml and
HWLOC_THISSYSTEM=1 in the environment. If it doesn't work, I'll generate
the XML.


Updating hwloc to 1.11.7, recompiling openmpi, and setting
HWLOC_COMPONENTS=x86 made the error message disappear.
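
For anyone scripting this check across nodes, here is a small sketch of the version comparison. The 1.11.2 threshold is the one quoted in this thread for the x86 workaround; the function names are hypothetical and nothing here queries hwloc itself.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.11.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Threshold mentioned in this thread for the x86 workaround (assumption:
# taken from the "hwloc >= 1.11.2" advice, not from hwloc documentation).
MIN_VERSION = parse_version("1.11.2")

def x86_workaround_expected(hwloc_version):
    """True if the given hwloc version is at least the threshold above."""
    return parse_version(hwloc_version) >= MIN_VERSION

for version in ("1.9.1", "1.10.0", "1.11.5", "1.11.7"):
    print(version, x86_workaround_expected(version))
```

Comparing tuples of integers matters here: as plain strings, "1.9.1" sorts after "1.11.2", which would give the wrong answer for the hwloc embedded in openmpi 1.10.7.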


Thanks for the attention!
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Brice Goglin
Hello

We have seen _many_ reports like these. But there are different kinds of
errors. As far as I understand:

* Julio's error is caused by the Linux kernel improperly reporting L3
cache affinities. It's specific to multi-socket 12-core processors
because the kernel makes invalid assumptions about core APIC IDs in
these processors (because only 12 out of 16 cores are enabled).
HWLOC_COMPONENTS=x86 was designed to solve this issue until AMD fixed
the kernel, but it looks like they didn't.

* Your error looks like another issue where the BIOS reports invalid
NUMA affinity (likely in the SRAT table). A BIOS upgrade may help.
Fortunately, the x86 backend can also read NUMA affinity from CPUID
instructions on AMD. I didn't know/remember HWLOC_COMPONENTS=x86 could
help for this bug too.
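
To make the "intersects without inclusion" wording concrete, here is a short sketch that treats the two cpusets from Julio's report as plain bitmasks (one bit per CPU). The function name is made up for illustration and this is not hwloc's own code, only the set logic that the message describes.

```python
def intersects_without_inclusion(a, b):
    """True if masks a and b share bits but neither contains the other."""
    common = a & b
    return common != 0 and common != a and common != b

l3_cpuset = 0x03f0    # L3 cpuset from the error message
numa_cpuset = 0x003f  # NUMANode P#0 cpuset from the error message

print(hex(l3_cpuset & numa_cpuset))
print(intersects_without_inclusion(l3_cpuset, numa_cpuset))
```

With these values the two masks share CPUs 4 and 5, but the L3 also covers CPUs 6-9 and the NUMA node also covers CPUs 0-3, so neither object can be placed inside the other in the topology tree.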


I am going to add this workaround to the FAQ about these errors (this
FAQ is listed in the error below since 1.11).


By the way, you should upgrade. 1.10 is very old :)

Brice




On 30/06/2017 21:59, Belgin, Mehmet wrote:
> We (Georgia Tech) too have been observing this on 16-core AMD AbuDhabi
> machines (6378). We weren’t aware of the HWLOC_COMPONENTS workaround,
> which seems to mitigate the issue.
>
> *Before:*
>
> # ./lstopo
> 
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3
> cpuset 0xff00,0xff00) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
> Machine (128GB total)
>   Group0 L#0
> NUMANode L#0 (P#1 32GB)
> ...
>
> *After:*
>
> # export HWLOC_COMPONENTS=x86
> # ./lstopo
> Machine
>   Socket L#0
> NUMANode L#0 (P#0) + L3 L#0 (6144KB)
>   L2 L#0 (2048KB) + L1i L#0 (64KB)
> ...
>
> These nodes are the only ones in our entire cluster to cause zombie
> processes using torque/moab. I have a feeling that they are related.
> We use hwloc/1.10.0.
>
> Not sure if this helps at all, but you are definitely not alone :)
>
> Thanks,
> -Mehmet
>
>
>
>> On Jun 29, 2017, at 1:24 AM, Brice Goglin wrote:
>>
>> Hello
>>
>> We've seen this issue many times (it's specific to 12-core Opterons),
>> but I am surprised it still occurs with such a recent kernel. AMD was
>> supposed to fix the kernel in early 2016 but I forgot to check
>> whether something was actually pushed.
>>
>> Anyway, you can likely ignore the issue as documented in the FAQ
>> https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless
>> you care about L3 affinity for binding. Otherwise, you can work around
>> the issue by passing HWLOC_COMPONENTS=x86 in the environment so that
>> hwloc uses cpuid instead of Linux sysfs files to discover the topology.
>>
>> Brice
>>
>>
>>
>>
>> On 29/06/2017 02:17, Julio Figueroa wrote:
>>> Hi
>>>
>>> I am experiencing the following issues when using pnetcdf version 1.8.1
>>> The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
>>> (patch_level=0x0600063d)
>>> The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
>>> OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1
>>> (2017-06-18) x86_64 GNU/Linux
>>> 
>>> * hwloc 1.11.5 has encountered what looks like an error from the
>>> operating system.
>>> *
>>> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
>>> 0x003f) without inclusion!
>>> * Error occurred in topology.c line 1074
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> *   What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's
>>> mailing list,
>>> * along with the output+tarball generated by the
>>> hwloc-gather-topology script.
>>> 
>>>
>>> As suggested by the error message, here is the hwloc-gather-topology
>>> attached.
>>>
>>> Please let me know if you need more information.
>>>
>>> Julio Figueroa
>>> Oceanographer
>>>
>>>
>>>
>>
>
>
>


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Brice Goglin
On 30/06/2017 22:08, fabricio wrote:
> On 30-06-2017 16:21, Brice Goglin wrote:
>> Yes, it's possible but not very easy. Before we go that way:
>> Can you also pass HWLOC_COMPONENTS_VERBOSE=1 in the environment and send
>> the verbose output?
>
> ///
> Registered cpu discovery component `no_os' with priority 40
> (statically build)
> Registered global discovery component `xml' with priority 30
> (statically build)
> Registered global discovery component `synthetic' with priority 30
> (statically build)
> Registered global discovery component `custom' with priority 30
> (statically build)
> Registered cpu discovery component `linux' with priority 50
> (statically build)
> Registered misc discovery component `linuxpci' with priority 19
> (statically build)
> Registered misc discovery component `pci' with priority 20 (statically
> build)
> Registered cpu discovery component `x86' with priority 45 (statically
> build)
> Enabling cpu discovery component `linux'
> Enabling cpu discovery component `x86'
> Enabling cpu discovery component `no_os'
> Excluding global discovery component `xml', conflicts with excludes 0x2
> Excluding global discovery component `synthetic', conflicts with
> excludes 0x2
> Excluding global discovery component `custom', conflicts with excludes
> 0x2
> Enabling misc discovery component `pci'
> Enabling misc discovery component `linuxpci'
> Final list of enabled discovery components: linux,x86,no_os,pci,linuxpci
> 
>
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 942
> *
> * The following FAQ entry in a recent hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's
> mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
>
> Enabling global discovery component `xml'
> Excluding cpu discovery component `linux', conflicts with excludes
> 0x
> Excluding cpu discovery component `x86', conflicts with excludes
> 0x
> Excluding cpu discovery component `no_os', conflicts with excludes
> 0x
> Excluding global discovery component `xml', conflicts with excludes
> 0x
> Excluding global discovery component `synthetic', conflicts with
> excludes 0x
> Excluding global discovery component `custom', conflicts with excludes
> 0x
> Excluding misc discovery component `pci', conflicts with excludes
> 0x
> Excluding misc discovery component `linuxpci', conflicts with excludes
> 0x
> Final list of enabled discovery components: xml
> ///
>
>> I am wondering if the x86 backend was disabled somehow.
>> Please also send your config.log
>
> I'm using the embedded hwloc in openmpi 1.10.7, whose version seems to
> be 1.9.1. I could not find a config.log file.

I thought you were using hwloc 1.11.5? HWLOC_COMPONENTS=x86 can help
there, but not in 1.9.1 from OMPI. Which one did you try?

>
>> Setting HWLOC_COMPONENTS=-linux could also work: it totally disables the
>> Linux backend. If the x86 backend is disabled as well, you would get an
>> almost empty topology.
>
> Will this leave the process allocation to the kernel, potentially
> diminishing performance?

This would basically ignore all topology information.
But it's not needed anymore here since the x86 backend is enabled above.

What you can do is one of these:
* tell OMPI to use an external hwloc >= 1.11.2
* use a more recent OMPI :)
* use an XML file generated with hwloc >= 1.11.2 with HWLOC_COMPONENTS=x86,
and pass it to OMPI and/or hwloc with HWLOC_XMLFILE=/path/to/foo.xml and
HWLOC_THISSYSTEM=1 in the environment. If it doesn't work, I'll generate
the XML.
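
The XML option can be sketched as two steps: export once with the x86 backend, then point later runs at the file. The Python below only builds the command and environments and does not launch anything. The helper names are hypothetical, /path/to/foo.xml is the same placeholder as above, and it assumes that an lstopo from hwloc >= 1.11.2 is the first one on the PATH and that it writes XML when given a file name ending in .xml.

```python
import os

def xml_export_step(xml_path):
    """Command and environment to export the topology using the x86 backend."""
    env = dict(os.environ, HWLOC_COMPONENTS="x86")
    return ["lstopo", xml_path], env

def xml_load_env(xml_path):
    """Environment that makes hwloc (or OMPI) load the exported XML file."""
    return dict(os.environ, HWLOC_XMLFILE=xml_path, HWLOC_THISSYSTEM="1")

command, export_env = xml_export_step("/path/to/foo.xml")
run_env = xml_load_env("/path/to/foo.xml")
print(command, export_env["HWLOC_COMPONENTS"])
print(run_env["HWLOC_XMLFILE"], run_env["HWLOC_THISSYSTEM"])
```

To actually run the steps, pass the pairs to subprocess.run(command, env=export_env, check=True) and start the MPI job with env=run_env. HWLOC_THISSYSTEM=1 is what tells hwloc that the XML describes the machine it is running on, so binding stays possible.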

Brice



Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Belgin, Mehmet
We (Georgia Tech) too have been observing this on 16-core AMD AbuDhabi machines
(6378). We weren’t aware of the HWLOC_COMPONENTS workaround, which seems to
mitigate the issue.

Before:

# ./lstopo

* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 
0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.

Machine (128GB total)
  Group0 L#0
NUMANode L#0 (P#1 32GB)
...

After:

# export HWLOC_COMPONENTS=x86
# ./lstopo
Machine
  Socket L#0
NUMANode L#0 (P#0) + L3 L#0 (6144KB)
  L2 L#0 (2048KB) + L1i L#0 (64KB)
...

These nodes are the only ones in our entire cluster to cause zombie processes
using torque/moab. I have a feeling that they are related. We use hwloc/1.10.0.

Not sure if this helps at all, but you are definitely not alone :)

Thanks,
-Mehmet



On Jun 29, 2017, at 1:24 AM, Brice Goglin wrote:

Hello

We've seen this issue many times (it's specific to 12-core Opterons), but I am
surprised it still occurs with such a recent kernel. AMD was supposed to fix
the kernel in early 2016 but I forgot to check whether something was actually
pushed.

Anyway, you can likely ignore the issue as documented in the FAQ
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless you care
about L3 affinity for binding. Otherwise, you can work around the issue by
passing HWLOC_COMPONENTS=x86 in the environment so that hwloc uses cpuid instead
of Linux sysfs files to discover the topology.

Brice




On 29/06/2017 02:17, Julio Figueroa wrote:
Hi

I am experiencing the following issues when using pnetcdf version 1.8.1
The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
(patch_level=0x0600063d)
The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) 
x86_64 GNU/Linux

* hwloc 1.11.5 has encountered what looks like an error from the operating 
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
without inclusion!
* Error occurred in topology.c line 1074
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.


As suggested by the error message, here is the hwloc-gather-topology
attached.

Please let me know if you need more information.

Julio Figueroa
Oceanographer





Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread fabricio

On 29-06-2017 02:24, Brice Goglin wrote:

Hello Brice

I'm still seeing this error message even when passing the 
HWLOC_COMPONENTS=x86 variable.

Is it possible to generate an XML file that can silence this error?


TIA,
Fabricio


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-28 Thread Brice Goglin
Hello

We've seen this issue many times (it's specific to 12-core Opterons),
but I am surprised it still occurs with such a recent kernel. AMD was
supposed to fix the kernel in early 2016 but I forgot to check whether
something was actually pushed.

Anyway, you can likely ignore the issue as documented in the FAQ
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless
you care about L3 affinity for binding. Otherwise, you can work around
the issue by passing HWLOC_COMPONENTS=x86 in the environment so that
hwloc uses cpuid instead of Linux sysfs files to discover the topology.

Brice




On 29/06/2017 02:17, Julio Figueroa wrote:
> Hi
>
> I am experiencing the following issues when using pnetcdf version 1.8.1
> The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
> (patch_level=0x0600063d)
> The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
> OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1
> (2017-06-18) x86_64 GNU/Linux
> 
> * hwloc 1.11.5 has encountered what looks like an error from the
> operating system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 1074
> *
> * The following FAQ entry in the hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's
> mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
>
> As suggested by the error message, here is the hwloc-gather-topology
> attached.
>
> Please let me know if you need more information.
>
> Julio Figueroa
> Oceanographer
>
>
>
