libimf.so is present on all nodes, by design. However, sometimes the simulation runs and sometimes it does not. I suspect that the filesystem (GPFS) where the Intel library is located may become temporarily unavailable in the failure cases. I do not suspect any problem with OpenMPI, but I am hopeful that it can produce diagnostics that point to the root cause of the problem.
I have followed Ralph's advice to build with --enable-debug and am now waiting for the problem to happen again so I can see the ssh command used to launch the orted.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reuti
Sent: Tuesday, December 18, 2012 4:14 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs

On 17.12.2012 at 16:42, Blosch, Edwin L wrote:

> Ralph,
>
> Unfortunately I didn't see the ssh output. The output I got was pretty much as before.
>
> You know, the fact that the error message is not prefixed with a host name makes me think it could be happening on the host where the job is placed by PBS. If there is something wrong in the user environment prior to mpirun, that is not an OpenMPI problem. And yet, in one of the jobs that failed, I have also printed out the results of 'ldd' on the mpirun executable just prior to executing the command, and all the shared libraries were resolved:

You checked the mpirun, but not the orted, which is missing a "libimf.so" from Intel. Is the Intel libimf.so from the redistributable archive present on all nodes?
-- Reuti

> ldd /release/cfd/openmpi-intel/bin/mpirun
>     linux-vdso.so.1 => (0x00007fffbbb39000)
>     libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x00002abdf75d2000)
>     libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x00002abdf7887000)
>     libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
>     libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
>     libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
>     libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
>     libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
>     libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x00002abdf8b42000)
>     libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x00002abdf8ed7000)
>     libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x00002abdf90ed000)
>     /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)
>
> Hence my initial assumption that the shared-library problem was happening with one of the child processes on a remote node.
>
> So at this point I have more questions than answers. I still don't know if this message comes from the main mpirun process or one of the child processes, although it seems that it should not be the main process because of the output of ldd above.
>
> Any more suggestions are welcomed of course.
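Reuti's point above, that mpirun resolving its libraries says nothing about orted on the remote nodes, suggests running ldd against the orted binary on every node in the machinefile. A minimal sketch of the filtering step, run here against a hypothetical captured ldd transcript rather than live ssh output (the transcript contents are illustrative, not from the actual cluster):

```shell
# Hypothetical ldd transcript from a failing node; on a live system this
# would come from: ssh "$host" "ldd /release/cfd/openmpi-intel/bin/orted"
ldd_out='libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x2abd0000)
libimf.so => not found
libc.so.6 => /lib64/libc.so.6 (0x2abd1000)'

# Print only the unresolved libraries, one per line
printf '%s\n' "$ldd_out" | awk '/not found/ { print $1 }'
```

Looping that ssh command over the unique hosts in the PBS machinefile (e.g. `sort -u /var/spool/PBS/aux/20804.maruhpc4-mgt`) just before each run could show whether a particular node, or the GPFS mount on it, is intermittently failing to resolve libimf.so.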
>
> Thanks
>
>
> /release/cfd/openmpi-intel/bin/mpirun --machinefile /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 -ro /tmp/fv420804.maruhpc4-mgt/restart.5000
>
> [c6n38:16219] mca:base:select:( plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [c6n38:16219] mca:base:select:( plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 16227) died unexpectedly with status 127 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> c6n39 - daemon did not report back when launched
> c6n40 - daemon did not report back when launched
> c6n41 - daemon did not report back when launched
> c6n42 - daemon did not report back when launched
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, December 14, 2012 2:25 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while launching jobs
>
> Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that will show the ssh command being used to start each orted.
>
> On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:
>
> I am having a weird problem launching cases with OpenMPI 1.4.3. It is most likely a problem with a particular node of our cluster, as the jobs run fine on some submissions but not on others; it seems to depend on the node list. I am having trouble diagnosing which node is at fault and what the nature of its problem is.
>
> One or perhaps more of the orted are indicating they cannot find an Intel Math library.
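The "died unexpectedly with status 127" line in the log above is itself a clue: 127 is the conventional exit status when a command, or a shared library it needs, cannot be loaded, so it points at the dynamic loader failing on the remote node rather than at a crash inside orted's own code. A quick local illustration (the path is deliberately nonexistent):

```shell
# Running a nonexistent binary through a shell yields exit status 127,
# the same status the dynamic loader produces when a library is missing.
/bin/sh -c '/no/such/path/orted' 2>/dev/null
echo "exit status: $?"
# prints: exit status: 127
```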
> The error is:
>
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
>
> I've checked the environment just before launching mpirun, and LD_LIBRARY_PATH includes the necessary component to point to where the Intel shared libraries are located. Furthermore, my mpirun command line says to export the LD_LIBRARY_PATH variable:
>
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', '-cycles', '10000', '-ri', 'restart.1', '-ro', '/tmp/fv420761.maruhpc4-mgt/restart.1']
>
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH. OpenMPI is built explicitly --without-torque and should be using ssh to launch the orted.
>
> What options can I add to get more debugging of problems launching orted?
>
> Thanks,
>
> Ed
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
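One plausible mechanism, consistent with Ed's report that LD_LIBRARY_PATH is correct before mpirun runs: `-x LD_LIBRARY_PATH` exports the variable to the MPI application processes, but the orted daemons themselves are started over ssh, in a non-interactive remote shell whose startup files may not set the Intel library path (or may run before the GPFS mount is available). The difference is easy to demonstrate locally; this is a sketch in which `env -i` stands in for whatever the remote shell fails to set, and the Intel path is the one from this thread:

```shell
# The launching shell has LD_LIBRARY_PATH set and passes it to children...
export LD_LIBRARY_PATH=/appserv/intel/Compiler/11.1/072/lib/intel64
/bin/sh -c 'echo "local shell: [$LD_LIBRARY_PATH]"'

# ...but a command started with a scrubbed environment, as a remote
# non-interactive shell effectively may be, does not inherit it:
env -i /bin/sh -c 'echo "remote-like shell: [$LD_LIBRARY_PATH]"'
```

If that is the failure mode, it may be worth trying `mpirun --prefix /release/cfd/openmpi-intel ...` (or rebuilding Open MPI with `--enable-orterun-prefix-by-default`), which makes mpirun set the remote PATH and LD_LIBRARY_PATH for orted itself; alternatively, linking with the Intel compiler's `-static-intel` flag removes the runtime dependency on libimf.so altogether.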