Ick is the proper response. :-)

The old 1.2 series would attempt to spawn a local orted on each of those nodes, and that is what is failing. Best guess is that pbsdsh doesn't fully replicate a key part of the environment that mpirun expects.

One thing you could try is doing this with 1.3.1. It will just fork/exec the local application instead of trying to start a daemon, so the odds are much better that it will work.
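For instance, something like this should work with the command itself unchanged -- the /opt/openmpi-1.3.1 prefix below is just a placeholder, adjust it to wherever 1.3.1 lives on your cluster:

# hypothetical install prefix -- point these at your 1.3.1 build
pbsdsh bash -c 'export PATH=/opt/openmpi-1.3.1/bin:$PATH LD_LIBRARY_PATH=/opt/openmpi-1.3.1/lib; cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1 application'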

I don't know of any native way to get mpirun to launch a farm - it will always set the comm_size to the total #procs. I suppose we could add that option, if people want it - wouldn't be very hard to implement.
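For what it's worth, here is a sketch of why a wrapper script doesn't get you a farm either: even if each rank cds into its own directory, the processes still share one MPI_COMM_WORLD. (OMPI_COMM_WORLD_RANK below is from the 1.3 series and is an assumption here -- 1.2 doesn't set it.)

#!/bin/sh
# rankdir.sh -- hypothetical wrapper: each rank moves into its own directory
cd "$PBS_O_WORKDIR/$OMPI_COMM_WORLD_RANK"
exec application     # the user's MPI binary, assumed to be on the PATH

# Still one 4-rank job (MPI_Comm_size == 4), not four independent runs:
mpirun -np 4 ./rankdir.sh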

Ralph
On Apr 1, 2009, at 8:49 AM, Brock Palen wrote:

Ok, this is weird, and the correct answer is probably "don't do that".
Anyway:

A user wants to run many, many small jobs, faster than our scheduler + Torque can start them, so he uses pbsdsh to start them in parallel under tm.

pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1 application'

This is kinda silly, because the code, while MPI based, does not require mpirun to start when run on a single rank, and runs just fine if you leave off mpirun.
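For example, on each vnode the task can just be (a sketch, using the same 'application' binary as above):

cd $PBS_O_WORKDIR/$PBS_VNODENUM
application     # one rank, started as an MPI singleton -- no mpirun needed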

What happens, though, if you do leave it on (this is with ompi-1.2.x) is that you get errors like:

[nyx428.engin.umich.edu:01929] pls:tm: failed to poll for a spawned proc, return status = 17002
[nyx428.engin.umich.edu:01929] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462


Kinda makes sense: pbsdsh has already started mpirun under tm, and now mpirun is trying to start a process under tm as well. In fact, with older versions (1.2.0), the above will work fine only for the first TM node; any second node will hang at poll() if you strace it.
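For reference, the hang can be seen with something like this, where <pid> is the process ID of the stuck mpirun on that node:

strace -f -p <pid>
# ...the trace ends in a poll() call that never returns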

We can solve the above by not using mpirun to start single processes under tm that were spawned by tm in the first place. Just thought you would like to know.
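Concretely, that workaround is just the original pbsdsh line with mpirun dropped:

pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; application'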

Is there a way to have mpirun spawn all the processes like pbsdsh does? The problem is that the code is MPI based, so if you say 'run 4' it's going to do the normal COMM_SIZE=4, only read the first input, etc. Also, we have to change the CWD of each rank. So, can you make mpirun farm?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



