Ralph,

For 1.8.2rc4 I get:

(1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
[borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
[borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
[borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
[borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
[borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
[borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],7]
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 8
MPIR_proctable:
  (i, host, exe, pid) = (0, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
  (i, host, exe, pid) = (1, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
  (i, host, exe, pid) = (2, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
  (i, host, exe, pid) = (3, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
  (i, host, exe, pid) = (4, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
  (i, host, exe, pid) = (5, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
  (i, host, exe, pid) = (6, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
  (i, host, exe, pid) = (7, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
Process 2 of 8 is on borg01x142
Process 5 of 8 is on borg01x142
Process 4 of 8 is on borg01x142
Process 1 of 8 is on borg01x142
Process 0 of 8 is on borg01x142
Process 3 of 8 is on borg01x142
Process 6 of 8 is on borg01x142
Process 7 of 8 is on borg01x142
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],7]
[borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd
[borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd
[borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children gone - exiting
[borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd
[borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children gone - exiting
[borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd
[borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd
[borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children gone - exiting
[borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd
[borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children gone - exiting

Using the 1.8.2 mpirun:

(1004) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.4
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Killed
srun.slurm: error: borg01x145: task 2: Killed
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file or directory

On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I'm unaware of any changes to the Slurm integration between rc4 and final
> release. It sounds like this might be something else going on - try adding
> "--leave-session-attached --debug-daemons" to your 1.8.2 command line and
> let's see if any errors get reported.
>
>
> On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> Open MPI List,
>
> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
> cluster (reported on this list), and decided to try it with 1.8.2. However,
> we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
> Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout
> with Open MPI 1.8.2. That is, HelloWorld doesn't work.
>
> To wit, our sysadmin has two tarballs:
>
> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
> 7e7496913c949451f546f22a1a159df25f8bb683 openmpi-1.8.2rc4.tar.bz2
> (1442) $ sha1sum openmpi-1.8.2.tar.gz
> cf2b1e45575896f63367406c6c50574699d8b2e1 openmpi-1.8.2.tar.gz
>
> I then build each with a script in the method our sysadmin usually does:
>
>> #!/bin/sh
>> set -x
>> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
>> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
>> build() {
>> echo `pwd`
>> ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
>> CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>> CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>> F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>> LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>> --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>> make 2>&1 | tee make.1.8.2.log
>> make check 2>&1 | tee makecheck.1.8.2.log
>> make install 2>&1 | tee makeinstall.1.8.2.log
>> }
>> echo "calling build"
>> build
>> echo "exiting"
>
>
> The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX
> and log file tees. Now, let us test. First, I grab some nodes with slurm:
>
>> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN
>
>
> Once I get my nodes, I run with 1.8.2rc4:
>
>> (1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
>> (1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
>> Process 0 of 8 is on borg01w044
>> Process 5 of 8 is on borg01w044
>> Process 3 of 8 is on borg01w044
>> Process 7 of 8 is on borg01w044
>> Process 1 of 8 is on borg01w044
>> Process 2 of 8 is on borg01w044
>> Process 4 of 8 is on borg01w044
>> Process 6 of 8 is on borg01w044
>
>
> Now 1.8.2:
>
>> (1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
>> (1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
>> (1146) $
>
>
> No output at all. But, if I take the helloWorld.x from 1.8.2 and run it
> with 1.8.2rc4's mpirun:
>
>> (1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
>> Process 5 of 8 is on borg01w044
>> Process 7 of 8 is on borg01w044
>> Process 2 of 8 is on borg01w044
>> Process 4 of 8 is on borg01w044
>> Process 1 of 8 is on borg01w044
>> Process 3 of 8 is on borg01w044
>> Process 6 of 8 is on borg01w044
>> Process 0 of 8 is on borg01w044
>
>
> So...any idea what is happening here? There did seem to be a few SLURM
> related changes between the two tarballs involving /dev/null but it's a bit
> above me to decipher.
>
> You can find the ompi_info, build, make, config, etc logs at these links
> (they are ~300kB which is over the mailing list limit according to the Open
> MPI web page):
>
> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
>
> Thank you for any help and please let me know if you need more information,
> Matt
>
> --
> "And, isn't sanity really just a one-trick pony anyway? I mean all you
> get is one trick: rational thinking. But when you're good and crazy,
> oooh, oooh, oooh, the sky is the limit!" -- The Tick
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25182.php
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25184.php
>

--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
get is one trick: rational thinking. But when you're good and crazy,
oooh, oooh, oooh, the sky is the limit!" -- The Tick
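
P.S. In case it helps anyone compare the two runs, here is a minimal sketch of how the daemon errors could be pulled out of a saved log. It assumes the mpirun output (stdout and stderr together) was captured to a file. The helper names count_orte_errors and failing_hosts, and the file name run.182.log in the note below, are made up for illustration and are not part of Open MPI or SLURM.

```shell
#!/bin/sh
# Sketch only: the helper names and log file names are hypothetical.

# Print the number of ORTE_ERROR_LOG lines in the log file given as $1.
count_orte_errors() {
    grep -c 'ORTE_ERROR_LOG' "$1"
}

# Print the unique host names whose daemons reported ORTE_ERROR_LOG,
# taken from the leading "[host:pid]" tag on each matching line.
failing_hosts() {
    sed -n 's/^\[\([^]:]*\):[0-9]*] .*ORTE_ERROR_LOG.*/\1/p' "$1" | sort -u
}
```

For example, after saving the 1.8.2 run with "mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x > run.182.log 2>&1", calling "count_orte_errors run.182.log" and "failing_hosts run.182.log" would summarize which daemons failed during startup. This only greps the text of the log; it does not explain why the daemons failed.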