Ralph,
I really appreciate all of your help and guidance on this.
Ralph H Castain wrote:
> Of more interest would be understanding why your build isn't working in
> bproc. Could you send me the error you are getting? I'm betting that the
> problem lies in determining the node allocation as that is the usual place
> we hit problems - not much is "standard" about how allocations are
> communicated in the bproc world, though we did try to support a few of the
> more common methods.
Alright, I've been playing around a bit more, and I think I understand
what is going on. It seems that, for whatever reason, the ORTE daemon is
failing to launch on the remote node, and I'm left with:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
--------------------------------------------------------------------------
A daemon (pid 4208) launched by the bproc PLS component on node 0 died
unexpectedly so we are aborting.
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 717
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 1164
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file rmgr_urm.c at line 462
[goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1
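In other words, the help text amounts to exporting the library location
before launching, something along these lines (the install prefix here is
only an example, not necessarily where our libraries actually live):

export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH   # example prefix only
mpirun --mca btl ^openib,udapl -np 1 ./cpi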
So, I took the advice suggested in the note and double-checked that our
library caching is working. It nicely picks up the libraries once they are
staged on the compute nodes, but now mpirun just dies:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
[ats@goldstar mpi]$
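For reference, the way I convinced myself the libraries really are visible
out on a compute node was roughly the following, just bpsh'ing out and
running ldd against the binaries (the node number and orted path are only
examples):

bpsh 0 ldd ./cpi                    # node 0 is just an example
bpsh 0 ldd /opt/openmpi/bin/orted   # path depends on your install prefix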
I thought maybe it was actually working and only I/O forwarding wasn't set
up properly, but checking the exit code shows that it in fact crashed:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[ats@goldstar mpi]$ echo $?
1
Any ideas here?
If I use the NODES envar, though, I can run a job on the head node:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
Process 0 on goldstar.penguincomputing.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000097
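For completeness, that run was just a matter of pointing NODES at the node
I wanted before calling mpirun, something like the following (the value is
illustrative; -1 is the bproc master in our environment):

export NODES=-1   # illustrative; -1 is the bproc master here
mpirun --mca btl ^openib,udapl -np 1 ./cpi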
What is also interesting, and you suspected this correctly, is that only
the NODES envar is being honored; things like BEOWULF_JOB_MAP are not.
This is probably correct, as I imagine the BEOWULF_JOB_MAP envar is Scyld
specific and likely not implemented. This isn't a big issue though; it's
something I'll likely add later on.
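For context, my understanding of the Scyld convention is that
BEOWULF_JOB_MAP is a colon-separated list giving the node number for each
rank, so honoring it would mean parsing something like this (the mapping
shown is purely illustrative):

export BEOWULF_JOB_MAP=0:0:1:1   # illustrative: ranks 0-1 on node 0, ranks 2-3 on node 1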
-Joshua Bernstein
Software Engineer
Penguin Computing