Ralph,
I really appreciate all of your help and guidance on this.
Ralph H Castain wrote:
> Of more interest would be understanding why your build isn't working in
> bproc. Could you send me the error you are getting? I'm betting that the
> problem lies in determining the node allocation as that is the usual place
> we hit problems - not much is "standard" about how allocations are
> communicated in the bproc world, though we did try to support a few of the
> more common methods.
Alright, I've been playing around a bit more, and I think I understand
what is going on. It seems that, for whatever reason, the ORTE daemon is
failing to launch on the remote node, and I'm left with:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
--------------------------------------------------------------------------
A daemon (pid 4208) launched by the bproc PLS component on node 0 died
unexpectedly so we are aborting.
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 717
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 1164
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file rmgr_urm.c at line 462
[goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1
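In other words, the help text amounts to exporting the library location
before launching, something along these lines (the install prefix here is
only an example, not necessarily where our libraries actually live):

export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH   # example prefix only
mpirun --mca btl ^openib,udapl -np 1 ./cpi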
So, I took the advice suggested in the note and double-checked that our
library caching is working. It nicely picks up the libraries once they are
staged on the compute nodes, but now mpirun just dies:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
[ats@goldstar mpi]$
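For reference, the way I convinced myself the libraries really are visible
out on a compute node was roughly the following, just bpsh'ing out and
running ldd against the binaries (the node number and orted path are only
examples):

bpsh 0 ldd ./cpi                    # node 0 is just an example
bpsh 0 ldd /opt/openmpi/bin/orted   # path depends on your install prefix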
I thought maybe it was actually working and only I/O forwarding wasn't set
up properly, but checking the exit code shows that it in fact crashed:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[ats@goldstar mpi]$ echo $?
1
Any ideas here?
If I use the NODES envar, though, I can run a job on the head node:
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
Process 0 on goldstar.penguincomputing.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000097
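For completeness, that run was just a matter of pointing NODES at the node
I wanted before calling mpirun, something like the following (the value is
illustrative; -1 is the bproc master in our environment):

export NODES=-1   # illustrative; -1 is the bproc master here
mpirun --mca btl ^openib,udapl -np 1 ./cpi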
What is also interesting, and you suspected this correctly, is that only
the NODES envar is being honored; things like BEOWULF_JOB_MAP are not.
This is probably correct, as I imagine the BEOWULF_JOB_MAP envar is Scyld
specific and likely not implemented. This isn't a big issue though; it's
something I'll likely add later on.
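For context, my understanding of the Scyld convention is that
BEOWULF_JOB_MAP is a colon-separated list giving the node number for each
rank, so honoring it would mean parsing something like this (the mapping
shown is purely illustrative):

export BEOWULF_JOB_MAP=0:0:1:1   # illustrative: ranks 0-1 on node 0, ranks 2-3 on node 1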
-Joshua Bernstein
Software Engineer
Penguin Computing