libimf.so is present on all nodes, by design. However, sometimes the 
simulation runs and sometimes it does not. My suspicion is that the filesystem 
(GPFS) where the Intel library is located may become temporarily unavailable 
in the failure cases. I do not suspect any problem with Open MPI itself, but I 
am hopeful that it can produce diagnostics that point to the root cause of the 
problem.
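
Something like the following, run from the job script immediately before 
mpirun, could confirm or rule that out. It is only a sketch: it assumes 
passwordless ssh and the Intel 11.1 path shown in the ldd output quoted 
below.

    #!/bin/bash
    # Probe every node of the job for the Intel math library just before
    # launch, to catch a transient GPFS outage on a specific host.
    LIB=/appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so
    for node in $(sort -u "$PBS_NODEFILE"); do
        if ! ssh "$node" test -r "$LIB"; then
            echo "$(date '+%F %T') $node: cannot read $LIB" >&2
        fi
    done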

I have followed Ralph's advice to build with --enable-debug and am now waiting 
for the problem to happen again so I can see the ssh command used to launch the 
orted.
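
In the meantime, because the rsh launcher starts each orted through a 
non-interactive ssh shell, that environment can be approximated by hand. A 
minimal sketch (the node name is taken from the failed run quoted below):

    # Show the LD_LIBRARY_PATH a non-interactive shell on the remote node
    # actually sees, and whether orted resolves all of its libraries there.
    ssh c6n39 'echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'
    ssh c6n39 ldd /release/cfd/openmpi-intel/bin/orted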

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Reuti
Sent: Tuesday, December 18, 2012 4:14 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while 
launching jobs

On 17.12.2012 at 16:42, Blosch, Edwin L wrote:

> Ralph,
>  
> Unfortunately I didn't see the ssh output.  The output I got was pretty much 
> as before.
>  
> You know, the fact that the error message is not prefixed with a host name 
> makes me think it could be happening on the host where the job is placed by 
> PBS. If there is something wrong in the user environment prior to mpirun, 
> that is not an Open MPI problem. And yet, in one of the jobs that failed, I 
> have also printed out the results of 'ldd' on the mpirun executable just 
> prior to executing the command, and all the shared libraries were resolved:

You checked mpirun, but not the orted, which is the one missing "libimf.so" 
from Intel. Is the Intel libimf.so from the redistributable archive present 
on all nodes?

-- Reuti


>  
> ldd /release/cfd/openmpi-intel/bin/mpirun
>         linux-vdso.so.1 =>  (0x00007fffbbb39000)
>         libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 
> (0x00002abdf75d2000)
>         libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 
> (0x00002abdf7887000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
>         libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
>         libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so 
> (0x00002abdf8b42000)
>         libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
> (0x00002abdf8ed7000)
>         libintlc.so.5 => 
> /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 
> (0x00002abdf90ed000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)
>  
> Hence my initial assumption that the shared-library problem was happening 
> with one of the child processes on a remote node.
>  
> So at this point I have more questions than answers.  I still don't know if 
> this message comes from the main mpirun process or one of the child 
> processes, although it seems that it should not be the main process because 
> of the output of ldd above.
>  
> Any more suggestions are welcome, of course.
>  
> Thanks
>  
>  
> /release/cfd/openmpi-intel/bin/mpirun --machinefile 
> /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x 
> MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached 
> /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 
> -ro /tmp/fv420804.maruhpc4-mgt/restart.5000
>  
> [c6n38:16219] mca:base:select:(  plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [c6n38:16219] mca:base:select:(  plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.
> /release/cfd/openmpi-intel/bin/orted: error while loading shared 
> libraries: libimf.so: cannot open shared object file: No such file or 
> directory
> --------------------------------------------------------------------------
> A daemon (pid 16227) died unexpectedly with status 127 while attempting to 
> launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared 
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the 
> location of the shared libraries on the remote nodes and this will 
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that 
> caused that situation.
> --------------------------------------------------------------------------
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown below. 
> Additional manual cleanup may be required - please refer to the "orte-clean" 
> tool for assistance.
> --------------------------------------------------------------------------
>         c6n39 - daemon did not report back when launched
>         c6n40 - daemon did not report back when launched
>         c6n41 - daemon did not report back when launched
>         c6n42 - daemon did not report back when launched
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
> On Behalf Of Ralph Castain
> Sent: Friday, December 14, 2012 2:25 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries 
> while launching jobs
>  
> Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that 
> will show the ssh command being used to start each orted.
>  
> On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> 
> wrote:
> 
> 
> I am having a weird problem launching cases with Open MPI 1.4.3.  It is most 
> likely a problem with a particular node of our cluster, as the jobs run fine 
> on some submissions but not on others.  It seems to depend on the node list.  
> I am just having trouble diagnosing which node it is and what the nature of 
> its problem is.
>  
> One or perhaps more of the orted processes are indicating that they cannot 
> find an Intel math library.  The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared 
> libraries: libimf.so: cannot open shared object file: No such file or 
> directory
>  
> I've checked the environment just before launching mpirun, and 
> LD_LIBRARY_PATH includes the necessary component to point to where the Intel 
> shared libraries are located.  Furthermore, my mpirun command line says to 
> export the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile 
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x 
> LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1', 
> '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', '-cycles', 
> '10000', '-ri', 'restart.1', '-ro', 
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
>  
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.  
> Open MPI is built explicitly --without-torque and should be using ssh to 
> launch the orted.
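
A possible way to take LD_LIBRARY_PATH out of the picture for orted entirely 
is to embed the Intel library directory in the Open MPI build as an rpath (or 
to link the Intel runtime statically). A rough, untested sketch; only the 
paths come from this thread, everything else is an assumption:

    # Rebuild Open MPI with the Intel library directory baked in as an
    # rpath, so orted no longer needs LD_LIBRARY_PATH to find libimf.so.
    ./configure --prefix=/release/cfd/openmpi-intel \
        CC=icc CXX=icpc F77=ifort FC=ifort \
        LDFLAGS="-Wl,-rpath,/appserv/intel/Compiler/11.1/072/lib/intel64"
    # (Alternatively, add -static-intel to LDFLAGS so the Intel runtime is
    # linked statically into the Open MPI binaries.)
    make && make install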
>  
> What options can I add to get more debugging of problems launching orted?
>  
> Thanks,
>  
> Ed


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
