Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-12 Thread Pim Schellart
Dear All, we have now recompiled both openmpi (1.8.3) and SLURM against an externally compiled and installed hwloc (1.10.0). With these changes the out-of-order topology discovery warning disappears. By now we also believe the problem was probably somewhere in SLURM rather than in openmpi but w

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Gilles Gouaillardet
Ralph, you are right; please disregard my previous post, it was irrelevant. I just noticed that unlike OMPI v1.8 (hwloc 1.7.2 based => no warning), master has this warning (hwloc 1.9.1). I will build SLURM against a recent hwloc and see what happens (FWIW RHEL6 comes with hwloc 1.5, RHEL7 comes with h
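A minimal sketch of what building SLURM against a recent, separately installed hwloc could look like; the /opt/hwloc-1.10.0 prefix and SLURM's --with-hwloc configure option are assumptions for illustration, not taken from the thread:

    # build and install a recent hwloc into its own prefix
    tar xf hwloc-1.10.0.tar.gz && cd hwloc-1.10.0
    ./configure --prefix=/opt/hwloc-1.10.0 && make && make install

    # point SLURM's configure at that hwloc instead of the system copy
    cd ../slurm-14.11.0
    ./configure --with-hwloc=/opt/hwloc-1.10.0 && make && make install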

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Ralph Castain
Per his prior notes, he is using mpirun to launch his jobs. Brice has confirmed that OMPI doesn’t have that hwloc warning in it. So either he has inadvertently linked against the Ubuntu system version of hwloc, or the message must be coming from Slurm. > On Dec 10, 2014, at 6:14 PM, Gilles Gou

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Gilles Gouaillardet
Pim, at this stage, all I can do is acknowledge that your SLURM is configured to use cgroups. Based on your previous comment (i.e. the problem only occurs with several jobs on the same node), that *could* be a bug in OpenMPI (or hwloc). By the way, how do you start your MPI application? - do you use

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Ralph Castain
I think you actually already answered this - if that warning message isn’t in OMPI’s internal code, and the user gets it when building with either internal or external hwloc support, then it must be coming from Slurm. This assumes that ldd libopen-pal.so doesn’t show OMPI to actually be linked
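A quick way to run that check (the library path is only an example; adjust it to wherever OMPI installed libopen-pal.so):

    # if this prints a libhwloc.so line, OMPI is linked against an external hwloc;
    # a build using the embedded copy should show no separate libhwloc dependency
    ldd /usr/local/openmpi/lib/libopen-pal.so | grep hwloc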

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Brice Goglin
Unfortunately, I don't think we have any way, so far, to know which process and hwloc version generated an XML. I am currently looking at adding this to hwloc 1.10.1 because of this thread. One thing that could help would be to dump the XML file that OMPI receives. Just write the entire buffer to a fi
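Short of patching OMPI to write out the buffer it receives, a rough way to compare what hwloc sees on each node is to export the topology to XML with lstopo and diff the resulting files; the filename below is just an example:

    # run on every node in the allocation (e.g. via srun) and compare the results
    lstopo --of xml /tmp/topo-$(hostname).xml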

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Ralph Castain
Brice: is there any way to tell if these are coming from Slurm vs OMPI? Given this data, I’m suspicious that this might have something to do with Slurm and not us. > On Dec 10, 2014, at 9:45 AM, Pim Schellart wrote: > > Dear Gilles et al., > > we tested with openmpi compiled from source (ver

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Brice Goglin
The warning does not exist in the hwloc code inside OMPI 1.8, so there's something strange happening in your first test. I would assume it's using the external hwloc in both cases for some reason. Running ldd on libopen-pal.so could be a way to check whether it depends on an external libhwloc.so or

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-10 Thread Pim Schellart
Dear Gilles et al., we tested with openmpi compiled from source (version 1.8.3) both with: ./configure --prefix=/usr/local/openmpi --disable-silent-rules --with-libltdl=external --with-devel-headers --with-slurm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi and ./configure --p

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-09 Thread Gilles Gouaillardet
Pim, if you configure OpenMPI with --with-hwloc=external (or something like --with-hwloc=/usr), it is very likely OpenMPI will use the same hwloc library (i.e. the "system" library) that is used by SLURM /* I do not know how Ubuntu packages OpenMPI ... */ The default (i.e. no --with-hwloc parame
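For reference, the build modes mentioned above look roughly like this (the prefix path is only an example):

    # default: use the hwloc copy embedded in the OpenMPI tarball
    ./configure --prefix=/usr/local/openmpi

    # external: link OpenMPI against the system hwloc
    ./configure --prefix=/usr/local/openmpi --with-hwloc=external

    # or point at a specific hwloc installation
    ./configure --prefix=/usr/local/openmpi --with-hwloc=/usr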

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-09 Thread Pim Schellart
Ah, OK, so that was where the confusion came from; I did see hwloc in the SLURM sources but couldn’t immediately figure out where exactly it was used. We will try compiling openmpi with the embedded hwloc. Any particular flags I should set? > On 09 Dec 2014, at 09:30, Ralph Castain wrote: > >

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-09 Thread Ralph Castain
There is no linkage between slurm and ompi when it comes to hwloc. If you directly launch your app using srun, then slurm will use its version of hwloc to do the binding. If you use mpirun to launch the app, then we’ll use our internal version to do it. The two are completely isolated from each
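The two launch paths described here look like this in a job script (the binary name and task count are placeholders):

    # direct launch: SLURM and its hwloc do the binding
    srun -n 4 ./my_mpi_app

    # mpirun launch: OMPI's own (embedded or linked) hwloc does the binding
    mpirun -np 4 ./my_mpi_app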

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-09 Thread Pim Schellart
The version that “lstopo --version” reports is the same (1.8) on all nodes, but we may indeed be hitting the second issue. We can try to compile a new version of openmpi, but how do we ensure that the external programs (e.g. SLURM) are using the same hwloc version as the one embedded in openmpi?

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
Hmmm…they probably linked that to the external, system hwloc version, so it sounds like one or more of your nodes has a different hwloc rpm on it. I couldn’t leaf thru your output well enough to see all the lstopo versions, but you might check to ensure they are the same. Looking at the code ba
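One way to compare the lstopo/hwloc versions across the allocated nodes (the node count is just an example):

    # run lstopo once per node and compare the reported versions
    srun -N 2 --ntasks-per-node=1 lstopo --version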

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
It is the default openmpi that comes with Ubuntu 14.04. > On 08 Dec 2014, at 17:17, Ralph Castain wrote: > > Pim: is this an OMPI you built, or one you were given somehow? If you built > it, how did you configure it? > >> On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote: >> >> It likely depend

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it? > On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote: > > It likely depends on how SLURM allocates the cpuset/cgroup inside the > nodes. The XML warning is related to these restrictions inside

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Brice Goglin
It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside the node. Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere. How do we check after install whether OMPI uses the embedded or the system-wide hwl
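One possible post-install check: ompi_info lists the hwloc component OMPI was built with. The exact component names are an assumption here; embedded builds typically report an internal component such as hwloc172, while an external build reports an external one:

    ompi_info | grep -i hwloc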

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
Dear Ralph, the nodes are called coma## and as you can see in the logs the nodes of the broken example are the same as the nodes of the working one, so that doesn’t seem to be the cause. Unless (very likely) I’m missing something. Anything else I can check? Regards, Pim > On 08 Dec 2014, at

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
As Brice said, OMPI has its own embedded version of hwloc that we use, so there is no Slurm interaction to be considered. The most likely cause is that one or more of your nodes is picking up a different version of OMPI. So things “work” if you happen to get nodes where all the versions match, a

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
Dear Brice, I am not sure why this is happening since all code seems to be using the same hwloc library version (1.8) but it does :) An MPI program is started through SLURM on two nodes with four CPU cores total (divided over the nodes) using the following script: #! /bin/bash #SBATCH -N 2 -n
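The submission script quoted above is truncated; a hedged reconstruction of its general shape (binary name and exact options are illustrative only, matching the two nodes / four tasks described) might be:

    #! /bin/bash
    #SBATCH -N 2 -n 4
    # placeholder application name; the original launch line is not visible here
    mpirun ./my_mpi_app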

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-07 Thread Brice Goglin
Hello, the GitHub issue you're referring to was closed 18 months ago. The warning (it's not an error) is only supposed to appear if you're importing into a recent hwloc an XML that was exported from an old hwloc. I don't see how that could happen when using Open MPI, since the hwloc versions on both sides

[OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-07 Thread Pim Schellart
Dear OpenMPI developers, this might be a bit off topic, but when using the SLURM scheduler (with cpuset support) on Ubuntu 14.04 (openmpi 1.6), hwloc sometimes gives an "out-of-order topology discovery" error. According to issue #103 on GitHub (https://github.com/open-mpi/hwloc/issues/103) this er