The difference you are seeing here indicates that the "direct" run is using the rsh launcher, while the other run is using the Torque launcher.

So I gather that by "direct" you mean that you don't get an allocation from Maui before running the job, but for the other you do? Otherwise, OMPI should detect that it is running under Torque and automatically use the Torque launcher unless directed to do otherwise.

The --set-sid option causes the orteds to separate from mpirun's process group. We do this for the rsh launcher to keep signals from propagating directly to the local processes, so that mpirun can handle them itself.

The --no-daemonize option on the Torque launch keeps the daemons in the PBS job so that Torque can properly terminate them all when you reach your time limit. We let the rsh-launched daemons daemonize so that they terminate the ssh session as there are system limits to the number of ssh sessions you can have concurrently open.

Once the daemon is running on a node, the way it starts a process does not depend on how the daemon itself was started. The environment seen by the processes will be the same either way, with the exception of the process group. Is there something about that application which is sensitive to the process group?
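If you want to check, one quick way (a generic shell sketch, not OMPI-specific; run it from inside your job script, or point -p at the orted PID from your ps output) is to compare the PGID and SID columns:

```shell
# Print PID, parent, process group, and session for the current process.
# With the rsh launcher (--set-sid) the orted and its children sit in their
# own session; under the Torque launcher they stay in the PBS job's group.
ps -o pid,ppid,pgid,sid,comm -p $$
```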

If so, what you could do is simply add -mca pls rsh to your command line when launching it. This will direct OMPI to use the rsh launcher, which will give you the same behavior as your "direct" scenario (we will still read the PBS_NODEFILE to get the allocation).
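For example (the process count and application name here are just placeholders):

```shell
# Force the rsh launcher even inside a Torque/PBS allocation (OMPI 1.2.x).
# OMPI still reads $PBS_NODEFILE for the node list.
mpirun -mca pls rsh -np 16 ./my_app
```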

You might also want to upgrade to the 1.3 series - the launch system there is simpler and scales better. If your application cares about process group, you might still need to specify the rsh launcher (in 1.3, you would use -mca plm rsh to do this - slight syntax change), but it would be interesting to see if it has any impact...and would definitely run better either way.
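The 1.3 equivalent would look like this (same placeholder app name):

```shell
# OMPI 1.3.x: the launcher framework is named "plm" rather than "pls"
mpirun -mca plm rsh -np 16 ./my_app
```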

Ralph



On Mar 31, 2009, at 8:36 PM, Rahul Nabar wrote:

2009/3/31 Ralph Castain <r...@lanl.gov>:
It is very hard to debug the problem with so little information. We
regularly run OMPI jobs on Torque without issue.

Another small thing that I noticed. Not sure if it is relevant.

When the job starts running there is an orte process. The args to this
process are slightly different depending on whether the job was
submitted with Torque or directly on a node. Could this be an issue?
Just a thought.

The essential difference seems to be that the Torque run has the
--no-daemonize option whereas the direct run has a --set-sid option. I
got these via ps after I submitted an interactive Torque job.

Do these matter at all? Full ps output snippets reproduced below. Some
other numbers also seem different on closer inspection but that might
be by design.

###############via Torque; segfaults. ##################
rpnabar  11287  0.1  0.0  24680  1828 ?        Ss   21:04   0:00 orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename node17 --universe rpnabar@node17:default-universe-11286
--nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica
"0.0.0;tcp://10.0.0.17:45839"
######################################################


##############direct MPI run; this works OK################
rpnabar  11026  0.0  0.0  24676  1712 ?        Ss   20:52   0:00 orted
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
node17 --universe rpnabar@node17:default-universe-11024 --nsreplica
"0.0.0;tcp://10.0.0.17:34716" --gprreplica
"0.0.0;tcp://10.0.0.17:34716" --set-sid
##########################################################
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
