2009/3/31 Ralph Castain <r...@lanl.gov>:
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.

Another small thing that I noticed. Not sure if it is relevant.

When the job starts running there is an orted process. The args to this
process are slightly different depending on whether the job was
submitted via Torque or launched directly on a node. Could this be an issue?
Just a thought.

The essential difference seems to be that the Torque run has the
--no-daemonize option whereas the direct run has a --set-sid option. I
got these via ps after I submitted an interactive Torque job.

Do these matter at all? Full ps output snippets are reproduced below. Some
other numbers also differ on closer inspection, but that might be by
design.

###############via Torque; segfaults. ##################
rpnabar  11287  0.1  0.0  24680  1828 ?        Ss   21:04   0:00 orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename node17 --universe rpnabar@node17:default-universe-11286
--nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica
"0.0.0;tcp://10.0.0.17:45839"
######################################################


##############direct MPI run; this works OK################
rpnabar  11026  0.0  0.0  24676  1712 ?        Ss   20:52   0:00 orted
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
node17 --universe rpnabar@node17:default-universe-11024 --nsreplica
"0.0.0;tcp://10.0.0.17:34716" --gprreplica
"0.0.0;tcp://10.0.0.17:34716" --set-sid
##########################################################
