Hi Brice and Ralph

Many thanks for helping out with this!

Yes, you are right about node15 being OK.
Node15 was a red herring: it was part of the same failed job as node14.
However, after a closer look, I noticed that the failure reported
by hwloc was indeed on node14.

I attach both diagnostic files generated by hwloc-gather-topology on
node14.
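
In case it is useful, I believe the tarball can be replayed offline with
something like the following (a sketch, assuming the hwloc build honors
the HWLOC_FSROOT environment variable, as the 1.x series does):

   # the tarball should extract to a node14/ directory holding the saved
   # /proc and /sys files
   tar xjf node14.tar.bz2
   # make hwloc read the saved files instead of the live system
   HWLOC_FSROOT=$PWD/node14 lstopo -v

If the problem is captured in the gathered files, lstopo should then print
the same "intersection without inclusion" warning, along with a "Topology
not from this system" note.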

I will open the node and see if there is anything unusual with the
hardware, and perhaps reinstall the OS, as Ralph suggested.
It is odd that the other node that had its motherboard replaced
passes the hwloc-gather-topology test.
After the motherboard replacement I reinstalled the OS on both nodes,
but it doesn't hurt to do it again.

Gus Correa




On 02/28/2014 03:26 AM, Brice Goglin wrote:
Hello Gus,
I'll need the tarball generated by gather-topology on node14 to debug
this. node15 doesn't have any issue.
We've seen issues on AMD machines because of buggy BIOS reporting
incompatible Socket and NUMA info. If node14 doesn't have the same BIOS
version as other nodes, that could explain things.
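A quick way to compare them, assuming dmidecode is installed on the nodes,
would be something like:

   # compare BIOS version and date on the suspect node and a known-good one
   for h in node14 node15; do
       echo "== $h =="
       ssh $h 'dmidecode -s bios-version; dmidecode -s bios-release-date'
   done

(The DMIBIOSVersion field in the lstopo output further down shows the same
information.)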
Brice




On 02/28/2014 01:39 AM, Gus Correa wrote:
Thank you, Ralph!

I did a bit more homework, and found out that all jobs that had
the hwloc error involved one specific node (node14).

The "report bindings" output in those jobs' stderr shows
that node14 systematically failed to bind the processes to the cores,
while other nodes in the same jobs didn't fail.
Interestingly, the jobs continued to run, although they
eventually failed much later.
So the hwloc error doesn't seem to stop the job in its tracks.
As a matter of policy, should it perhaps shut down the job instead?

In addition, when I try the hwloc-gather-topology diagnostic on node14
I get the same error, a bit more verbose (see below).
So, now my guess is that this may be a hardware problem on that node.

I replaced two nodes' motherboards last week, including node14's,
and something may have gone wrong on that one.
The other node that had the motherboard replaced
doesn't show the hwloc-gather-topology error, though.

Does the error message below (Socket P#0 ...)
suggest anything that I should be looking for on the hardware side?
(Thermal compound on the heatsink, memory modules, etc.)

Thank you,
Gus Correa



[root@node14 ~]# /usr/bin/hwloc-gather-topology /tmp/$(uname -n)
Hierarchy gathered in /tmp/node14.tar.bz2 and kept in
/tmp/tmp.D46Sdhcnru/node14/
****************************************************************************

* Hwloc has encountered what looks like an error from the operating
system.
*
* object (Socket P#0 cpuset 0x0000ffff) intersection without inclusion!
* Error occurred in topology.c line 718
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************

Expected topology output stored in /tmp/node14.output
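
For what it is worth, the cpusets involved can be printed directly with
hwloc-calc from the same hwloc package (a sketch; object names as in the
1.x series):

   hwloc-calc socket:0    # cpuset of the first socket
   hwloc-calc node:0      # cpuset of the first NUMA node

A socket should either contain, match, or be contained in every NUMA node
it intersects. Here the reported socket cpuset 0x0000ffff covers PUs 0-15,
while in the listing further down NUMA node P#0 covers PUs 0-7 and 16-23,
so the two sets overlap without one containing the other; that is what
"intersection without inclusion" means. (The listing also shows no Socket
level at all, presumably because hwloc dropped the offending objects.)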


On 02/27/2014 06:39 PM, Ralph Castain wrote:
The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having
trouble with those data/instruction cache breakdowns.
I don't know why it wouldn't have shown up before,
however, as this looks to be happening when we first try to
assemble the topology. To check that, what happens if you just run
"mpiexec hostname" on the local node?

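For instance (a sketch; flag spellings as in the 1.6 series), running this
directly on node14 should exercise the same binding path:

   mpiexec --bind-to-core --report-bindings -np 4 hostname

If the failure happens while assembling the topology, the hwloc warning
should show up even for a trivial run like this.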

On Feb 27, 2014, at 3:04 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Dear OMPI pros

This seems to be a question in the no man's land between OMPI and hwloc.
However, it appeared as an OMPI error, so it may be OK to ask the
question on this list.

***

A user here got this error (or warning?) message today:

+ mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
****************************************************************************

* Hwloc has encountered what looks like an error from the operating
system.
*
* object intersection without inclusion!
* Error occurred in topology.c line 594
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************


Additional info:

1) We have OMPI 1.6.5. This user is using the one built
with Intel compilers 2011.13.367.

2) I set these MCA parameters in $OMPI/etc/openmpi-mca-params.conf
(includes binding to core):

btl = ^tcp
orte_tag_output = 1
rmaps_base_schedule_policy = core
orte_process_binding = core
orte_report_bindings = 1
opal_paffinity_alone = 1


3) The machines have dual-socket, 16-core AMD Opteron 6376 (Abu Dhabi)
processors, which have one FPU for each pair of cores, a hierarchy of
caches serving sub-groups of cores, etc.
The OS is Linux CentOS 6.4 with stock CentOS OFED.
The interconnect is InfiniBand QDR (Mellanox HW).

4) We have Torque 4.2.5, built with cpuset support.
OMPI is built with Torque (tm) support.

5) In case it helps, I attach the output of
hwloc-gather-topology, which I ran on the node that threw the error,
although not immediately after the job failure.
I used the hwloc-gather-topology script that comes with
the hwloc (version 1.5) provided by CentOS.
As far as I can tell, the hwloc bits built into OMPI
do not include the hwloc-gather-topology script (although they may be from
a newer hwloc version, 1.8 perhaps?).
Hopefully the mail servers won't chop off the attachments.

6) I am a bit surprised by this error message, because I haven't
seen it before, although we have used OMPI 1.6.5 on
this machine with several other programs without problems.
Alas, it happened now.

**

- Is this a known hwloc problem with this processor architecture?

- Is this a known issue with this combination of HW and SW?

- Would it perhaps help to not bind the MPI processes (to core or
socket)?

- Any workarounds or suggestions?

**

Thank you,
Gus Correa
<node15.output><node15.tar.bz2>



Machine (P#0 total=67090536KB DMIProductName=H8DGU DMIProductVersion=1234567890 
DMIProductSerial=1234567890 DMIProductUUID=534D4349-0002-F190-2500-F1902500637D 
DMIBoardVendor=Supermicro DMIBoardName=H8DGU DMIBoardVersion=1234567890 
DMIBoardSerial=NM141S600018 DMIBoardAssetTag="To Be Filled By O.E.M." 
DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=1234567890 
DMIChassisSerial=1234567890 DMIChassisAssetTag="To Be Filled By O.E.M." 
DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion="3.5       " 
DMIBIOSDate=11/25/2013 DMISysVendor=Supermicro Backend=Linux LinuxCgroup=/)
  NUMANode L#0 (P#0 local=33552488KB total=33552488KB)
    L3Cache L#0 (size=6144KB linesize=64 ways=64)
      L2Cache L#0 (size=2048KB linesize=64 ways=16)
        L1iCache L#0 (size=64KB linesize=64 ways=2)
          L1dCache L#0 (size=16KB linesize=64 ways=4)
            Core L#0 (P#0)
              PU L#0 (P#0)
          L1dCache L#1 (size=16KB linesize=64 ways=4)
            Core L#1 (P#1)
              PU L#1 (P#1)
      L2Cache L#1 (size=2048KB linesize=64 ways=16)
        L1iCache L#1 (size=64KB linesize=64 ways=2)
          L1dCache L#2 (size=16KB linesize=64 ways=4)
            Core L#2 (P#2)
              PU L#2 (P#2)
          L1dCache L#3 (size=16KB linesize=64 ways=4)
            Core L#3 (P#3)
              PU L#3 (P#3)
      L2Cache L#2 (size=2048KB linesize=64 ways=16)
        L1iCache L#2 (size=64KB linesize=64 ways=2)
          L1dCache L#4 (size=16KB linesize=64 ways=4)
            Core L#4 (P#4)
              PU L#4 (P#4)
          L1dCache L#5 (size=16KB linesize=64 ways=4)
            Core L#5 (P#5)
              PU L#5 (P#5)
      L2Cache L#3 (size=2048KB linesize=64 ways=16)
        L1iCache L#3 (size=64KB linesize=64 ways=2)
          L1dCache L#6 (size=16KB linesize=64 ways=4)
            Core L#6 (P#6)
              PU L#6 (P#6)
          L1dCache L#7 (size=16KB linesize=64 ways=4)
            Core L#7 (P#7)
              PU L#7 (P#7)
    L3Cache L#1 (size=6144KB linesize=64 ways=64)
      L2Cache L#4 (size=2048KB linesize=64 ways=16)
        L1iCache L#4 (size=64KB linesize=64 ways=2)
          L1dCache L#8 (size=16KB linesize=64 ways=4)
            Core L#8 (P#0)
              PU L#8 (P#16)
          L1dCache L#9 (size=16KB linesize=64 ways=4)
            Core L#9 (P#1)
              PU L#9 (P#17)
      L2Cache L#5 (size=2048KB linesize=64 ways=16)
        L1iCache L#5 (size=64KB linesize=64 ways=2)
          L1dCache L#10 (size=16KB linesize=64 ways=4)
            Core L#10 (P#2)
              PU L#10 (P#18)
          L1dCache L#11 (size=16KB linesize=64 ways=4)
            Core L#11 (P#3)
              PU L#11 (P#19)
      L2Cache L#6 (size=2048KB linesize=64 ways=16)
        L1iCache L#6 (size=64KB linesize=64 ways=2)
          L1dCache L#12 (size=16KB linesize=64 ways=4)
            Core L#12 (P#4)
              PU L#12 (P#20)
          L1dCache L#13 (size=16KB linesize=64 ways=4)
            Core L#13 (P#5)
              PU L#13 (P#21)
      L2Cache L#7 (size=2048KB linesize=64 ways=16)
        L1iCache L#7 (size=64KB linesize=64 ways=2)
          L1dCache L#14 (size=16KB linesize=64 ways=4)
            Core L#14 (P#6)
              PU L#14 (P#22)
          L1dCache L#15 (size=16KB linesize=64 ways=4)
            Core L#15 (P#7)
              PU L#15 (P#23)
  NUMANode L#1 (P#1 local=33538048KB total=33538048KB)
    L3Cache L#2 (size=6144KB linesize=64 ways=64)
      L2Cache L#8 (size=2048KB linesize=64 ways=16)
        L1iCache L#8 (size=64KB linesize=64 ways=2)
          L1dCache L#16 (size=16KB linesize=64 ways=4)
            Core L#16 (P#0)
              PU L#16 (P#8)
          L1dCache L#17 (size=16KB linesize=64 ways=4)
            Core L#17 (P#1)
              PU L#17 (P#9)
      L2Cache L#9 (size=2048KB linesize=64 ways=16)
        L1iCache L#9 (size=64KB linesize=64 ways=2)
          L1dCache L#18 (size=16KB linesize=64 ways=4)
            Core L#18 (P#2)
              PU L#18 (P#10)
          L1dCache L#19 (size=16KB linesize=64 ways=4)
            Core L#19 (P#3)
              PU L#19 (P#11)
      L2Cache L#10 (size=2048KB linesize=64 ways=16)
        L1iCache L#10 (size=64KB linesize=64 ways=2)
          L1dCache L#20 (size=16KB linesize=64 ways=4)
            Core L#20 (P#4)
              PU L#20 (P#12)
          L1dCache L#21 (size=16KB linesize=64 ways=4)
            Core L#21 (P#5)
              PU L#21 (P#13)
      L2Cache L#11 (size=2048KB linesize=64 ways=16)
        L1iCache L#11 (size=64KB linesize=64 ways=2)
          L1dCache L#22 (size=16KB linesize=64 ways=4)
            Core L#22 (P#6)
              PU L#22 (P#14)
          L1dCache L#23 (size=16KB linesize=64 ways=4)
            Core L#23 (P#7)
              PU L#23 (P#15)
    L3Cache L#3 (size=6144KB linesize=64 ways=64)
      L2Cache L#12 (size=2048KB linesize=64 ways=16)
        L1iCache L#12 (size=64KB linesize=64 ways=2)
          L1dCache L#24 (size=16KB linesize=64 ways=4)
            Core L#24 (P#0)
              PU L#24 (P#24)
          L1dCache L#25 (size=16KB linesize=64 ways=4)
            Core L#25 (P#1)
              PU L#25 (P#25)
      L2Cache L#13 (size=2048KB linesize=64 ways=16)
        L1iCache L#13 (size=64KB linesize=64 ways=2)
          L1dCache L#26 (size=16KB linesize=64 ways=4)
            Core L#26 (P#2)
              PU L#26 (P#26)
          L1dCache L#27 (size=16KB linesize=64 ways=4)
            Core L#27 (P#3)
              PU L#27 (P#27)
      L2Cache L#14 (size=2048KB linesize=64 ways=16)
        L1iCache L#14 (size=64KB linesize=64 ways=2)
          L1dCache L#28 (size=16KB linesize=64 ways=4)
            Core L#28 (P#4)
              PU L#28 (P#28)
          L1dCache L#29 (size=16KB linesize=64 ways=4)
            Core L#29 (P#5)
              PU L#29 (P#29)
      L2Cache L#15 (size=2048KB linesize=64 ways=16)
        L1iCache L#15 (size=64KB linesize=64 ways=2)
          L1dCache L#30 (size=16KB linesize=64 ways=4)
            Core L#30 (P#6)
              PU L#30 (P#30)
          L1dCache L#31 (size=16KB linesize=64 ways=4)
            Core L#31 (P#7)
              PU L#31 (P#31)
depth 0:        1 Machine (type #1)
 depth 1:       2 NUMANode (type #2)
  depth 2:      4 L3Cache (type #4)
   depth 3:     16 L2Cache (type #4)
    depth 4:    16 L1iCache (type #4)
     depth 5:   32 L1dCache (type #4)
      depth 6:  32 Core (type #5)
       depth 7: 32 PU (type #6)
latency matrix between NUMANodes (depth 1) by logical indexes:
  index     0     1
      0 1.000 1.600
      1 1.600 1.000
Topology not from this system

Attachment: node14.tar.bz2
