2009/3/31 Ralph Castain <r...@lanl.gov>:
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.
Another small thing I noticed; not sure whether it is relevant. When the job starts running there is an orted process, and its arguments differ slightly depending on whether the job was submitted through Torque or launched directly on a node. Could this be an issue? Just a thought. The essential difference seems to be that the Torque run has a --no-daemonize option whereas the direct run has a --set-sid option. I captured these via ps after submitting an interactive Torque job. Do these options matter at all? Full ps output snippets are reproduced below; a few other numbers also differ on closer inspection, but that may be by design.

############## via Torque; segfaults ##############
rpnabar 11287 0.1 0.0 24680 1828 ? Ss 21:04 0:00 orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node17 --universe rpnabar@node17:default-universe-11286 --nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica "0.0.0;tcp://10.0.0.17:45839"
###################################################

############## direct MPI run; works OK ##############
rpnabar 11026 0.0 0.0 24676 1712 ? Ss 20:52 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node17 --universe rpnabar@node17:default-universe-11024 --nsreplica "0.0.0;tcp://10.0.0.17:34716" --gprreplica "0.0.0;tcp://10.0.0.17:34716" --set-sid
######################################################
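In case it helps anyone reproduce this: a minimal sketch of how such full command lines can be captured with procps ps (`-ww` prevents truncation of long argument lists, `-o args=` prints only the command line). Demonstrated here on a throwaway `sleep` process, since orted may not be running on the machine where you try it; substitute `-C orted` on a node where the job is running.

```shell
# Start a throwaway process to inspect (stand-in for orted).
sleep 30 &

# Print the full, untruncated command line of every matching process.
# On a compute node you would use:  ps -ww -o args= -C orted
ps -ww -o args= -C sleep

# Clean up the background process.
kill $!
```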