Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-04 Thread Gus Correa
On 03/03/2014 05:06 PM, Brice Goglin wrote: On 03/03/2014 23:02, Gus Correa wrote: I rebooted the node and ran hwloc-gather-topology again. This time it didn't throw any errors on the terminal window, which may be a good sign. [root@node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-03 Thread Brice Goglin
On 03/03/2014 23:02, Gus Correa wrote:
> I rebooted the node and ran hwloc-gather-topology again.
> This time it didn't throw any errors on the terminal window,
> which may be a good sign.
>
> [root@node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.$(uname -n)
> Hierarchy gathered in

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-03 Thread Gus Correa
Hi Brice

Here are answers to your questions, and my latest attempt to solve the problem:

1) Kernel version: The nodes with new motherboards (node14 and node16) have the same kernel as the nodes with original motherboards (e.g. node15), as they were cloned from the same node image:

[root@node14

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
On 28/02/2014 21:30, Gus Correa wrote:
> Hi Brice
>
> The (pdf) output of lstopo shows one L1d (16k) for each core,
> and one L1i (64k) for each *pair* of cores.
> Is this wrong?

It's correct. AMD uses this "dual-core compute unit" where L2 and L1i are shared but L1d isn't.

> BTW, if there are
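
The cache layout Brice confirms here (one L1d per core, one L1i per pair of cores) can be pictured with a small sketch. It is illustrative only: the function name is hypothetical, and it assumes the two cores of a compute unit carry consecutive indices, which the thread does not state. The figure of 16 cores per socket comes from the other messages in this thread.

```python
def cache_groups(num_cores, cores_per_cache):
    """Group core indices that share one cache instance.

    Assumption (not from the thread): cores sharing a cache are
    numbered consecutively, e.g. cores 0 and 1 form one compute unit.
    """
    return [
        list(range(start, min(start + cores_per_cache, num_cores)))
        for start in range(0, num_cores, cores_per_cache)
    ]


cores_per_socket = 16                        # reported elsewhere in this thread
l1d = cache_groups(cores_per_socket, 1)      # private L1d: one per core
l1i = cache_groups(cores_per_socket, 2)      # shared L1i: one per pair of cores

print(len(l1d), "L1d caches,", len(l1i), "L1i caches per socket")
```

Under these assumptions lstopo would draw twice as many L1d boxes as L1i boxes per socket, which matches what Gus describes.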

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Reuti
On 28.02.2014 at 21:23, Brice Goglin wrote:
> OK, the problem is that node14's BIOS reports invalid NUMA info. It properly
> detects 2 sockets with 16 cores each. But it reports 2 NUMA nodes total,
> instead of 2 per socket (4 total). And hwloc warns because the cores
> contained in these NUMA

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Gus Correa
On 02/28/2014 03:32 AM, Brice Goglin wrote: On 28/02/2014 02:48, Ralph Castain wrote: Remember, hwloc doesn't actually "sense" hardware - it just parses files in the /proc area. So if something is garbled in those files, hwloc will report errors. Doesn't mean anything is wrong with the hard

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
OK, the problem is that node14's BIOS reports invalid NUMA info. It properly detects 2 sockets with 16 cores each. But it reports 2 NUMA nodes total, instead of 2 per socket (4 total). And hwloc warns because the cores contained in these NUMA nodes are incompatible with sockets: socket0 contains 0-
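
The kind of inconsistency described in this message can be sketched as a nesting check: a socket and a NUMA node are compatible when they share no cores or when one fully contains the other. The sketch below is not hwloc's code, and the CPU lists are hypothetical values chosen only to show the check; node14's real core lists are cut off in this snippet and are not reproduced here. The helper names are likewise made up for illustration.

```python
def parse_cpulist(text):
    """Parse a Linux-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def find_conflicts(sockets, numa_nodes):
    """Return (socket, node) pairs whose core sets overlap without nesting."""
    conflicts = []
    for s_name, s_text in sockets.items():
        s_cpus = parse_cpulist(s_text)
        for n_name, n_text in numa_nodes.items():
            n_cpus = parse_cpulist(n_text)
            overlap = s_cpus & n_cpus
            nested = n_cpus <= s_cpus or s_cpus <= n_cpus
            if overlap and not nested:
                conflicts.append((s_name, n_name))
    return conflicts


# Hypothetical layouts, NOT the values reported by node14.
sockets = {"socket0": "0-15", "socket1": "16-31"}
consistent = {"node0": "0-7", "node1": "8-15", "node2": "16-23", "node3": "24-31"}
inconsistent = {"node0": "0-7,16-23", "node1": "8-15,24-31"}

print(find_conflicts(sockets, consistent))
print(find_conflicts(sockets, inconsistent))
```

With the consistent layout every NUMA node sits inside exactly one socket, so no pair is reported. With the inconsistent layout each NUMA node straddles both sockets, which is the kind of overlap that would make a hierarchical topology impossible to build.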

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Ralph Castain
You might also want to check the BIOS rev level on node14, Gus - as Brice suggested, it could be that the board came with the wrong firmware.

On Feb 28, 2014, at 11:55 AM, Gus Correa wrote:
> Hi Brice and Ralph
>
> Many thanks for helping out with this!
>
> Yes, you are right about node15 bei

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Gus Correa
Hi Brice and Ralph

Many thanks for helping out with this!

Yes, you are right about node15 being OK. Node15 was a red herring, as along with node14 it was part of the same job that failed. However, after a closer look, I noticed that the failure reported by hwloc was indeed on node14. I attach both

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Ralph Castain
On Feb 28, 2014, at 12:32 AM, Brice Goglin wrote:
> On 28/02/2014 02:48, Ralph Castain wrote:
>> Remember, hwloc doesn't actually "sense" hardware - it just parses files in
>> the /proc area. So if something is garbled in those files, hwloc will report
>> errors. Doesn't mean anything is wr

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
On 28/02/2014 02:48, Ralph Castain wrote:
> Remember, hwloc doesn't actually "sense" hardware - it just parses files in
> the /proc area. So if something is garbled in those files, hwloc will report
> errors. Doesn't mean anything is wrong with the hardware at all.

For the record, that's not

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Hello Gus, I'll need the tarball generated by gather-topology on node14 to debug this. node15 doesn't have any issue. We've seen issues on AMD machines because of buggy BIOS reporting incompatible Socket and NUMA info. If node14 doesn't have the same BIOS version as other nodes, that could explain

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Ralph Castain
On Feb 27, 2014, at 4:39 PM, Gus Correa wrote:
> Thank you, Ralph!
>
> I did a bit more homework, and found out that all jobs that had
> the hwloc error involved one specific node (node14).
>
> The "report bindings" output in those jobs' stderr shows
> that node14 systematically failed to bi

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Gus Correa
Thank you, Ralph!

I did a bit more homework, and found out that all jobs that had the hwloc error involved one specific node (node14).

The "report bindings" output in those jobs' stderr shows that node14 systematically failed to bind the processes to the cores, while other nodes on the same jo

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Ralph Castain
The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having trouble with those data/instruction cache breakdowns. I don't know why it wouldn't have shown up before, however, as this looks to be happening when we first try to assemble the topology. To check that, what happens if you ju

[OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Gus Correa
Dear OMPI pros

This seems to be a question in the no man's land between OMPI and hwloc. However, it appeared as an OMPI error, hence it may be OK to ask the question on this list.

***

A user here got this error (or warning?) message today:

+ mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/ec