Hmmm... well, the problem is as I suspected. The system doesn't see any allocation of nodes to your job, and so it aborts with a crummy error message that doesn't really tell you the problem. We are working on improving those messages.
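As a first check, it is worth dumping whatever allocation-related variables actually make it into the environment that mpirun sees - something along these lines (just a generic shell sketch; which variable names to grep for depends on your scheduler):

[ats@goldstar mpi]$ env | egrep 'NODES|BEOWULF'   # hypothetical check - adjust the pattern for your setup

If nothing comes back, the RAS components have nothing to read, which would be consistent with the abort you are seeing.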
How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain info on the nodes that are to be used?

One of the biggest headaches with bproc is that there is no adhered-to standard for describing the node allocation. What we implemented will support LSF+Bproc (since that is what was being used here) and BJS. It sounds like you are using something different - true? If so, we can work around it by just mapping environment variables to what the system is seeking. Or, IIRC, we could use the hostfile option (have to check on that one).

Ralph


On 6/24/08 6:11 PM, "Joshua Bernstein" <jbernst...@penguincomputing.com> wrote:

> Ralph,
>
> I really appreciate all of your help and guidance on this.
>
> Ralph H Castain wrote:
>> Of more interest would be understanding why your build isn't working in
>> bproc. Could you send me the error you are getting? I'm betting that the
>> problem lies in determining the node allocation as that is the usual place
>> we hit problems - not much is "standard" about how allocations are
>> communicated in the bproc world, though we did try to support a few of the
>> more common methods.
>
> Alright, I've been playing around a bit more, and I think I understand
> what is going on. It seems that for whatever reason the ORTE daemon is
> failing to launch on a remote node, and I'm left with:
>
> [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
> [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not
> available in file ras_bjs.c at line 247
> --------------------------------------------------------------------------
> A daemon (pid 4208) launched by the bproc PLS component on node 0 died
> unexpectedly so we are aborting.
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
> file pls_bproc.c at line 717
> [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
> file pls_bproc.c at line 1164
> [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
> file rmgr_urm.c at line 462
> [goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1
>
> So, I took the advice suggested in the note and double-checked to make
> sure our library caching is working. It picks up the libraries nicely
> once they are staged on the compute nodes, but now mpirun just dies:
>
> [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
> [goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not
> available in file ras_bjs.c at line 247
> [ats@goldstar mpi]$
>
> I thought maybe it was actually working and only I/O forwarding wasn't
> set up properly, but checking the exit code shows that it in fact crashed:
>
> [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
> [ats@goldstar mpi]$ echo $?
> 1
>
> Any ideas here?
>
> If I use the NODES envar, I can run a job on the head node though:
>
> [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
> Process 0 on goldstar.penguincomputing.com
> pi is approximately 3.1416009869231254, Error is 0.0000083333333323
> wall clock time = 0.000097
>
> What is also interesting, and you suspected this correctly, is that only
> the NODES envar is being honored; things like BEOWULF_JOB_MAP are not.
> This is probably correct, as I imagine the BEOWULF_JOB_MAP envar is
> Scyld-specific and likely not implemented. This isn't a big issue though;
> it's something I'll likely add later on.
>
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
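For reference, the hostfile route mentioned above would look roughly like the following. This is only a sketch, assuming the bproc build accepts --hostfile at all (that is the part that still has to be checked), and the node names are made up:

[ats@goldstar mpi]$ cat ~/myhosts                 # hypothetical file; n0/n1 are placeholder node names
n0
n1
[ats@goldstar mpi]$ mpirun --hostfile ~/myhosts --mca btl ^openib,udapl -np 2 ./cpi

Mapping whatever BEOWULF_JOB_MAP contains into the NODES variable that the BJS support reads would be the other route, but that depends on knowing the exact format Scyld uses for it.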