Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Ralph Castain
Hmmm...yes, I guess we did get off-track then. This solution is exactly what I proposed in the first response to your thread, and it was repeated by others later on. :-/ So long as mpirun is executed on the node where the "sister mom" is located, and as long as your script "B" does -not- include an

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Jeff Squyres
On Apr 4, 2011, at 10:38 AM, Laurence Marks wrote: > Thanks, I think we may have a miscommunication here; I assume > that on the computer where they have disabled rsh and ssh, they have > "something" to communicate with, so we don't need to use pbsdsh. Clarification in terminology: -

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Laurence Marks
Thanks, I think we may have a miscommunication here; I assume that on the computer where they have disabled rsh and ssh, they have "something" to communicate with, so we don't need to use pbsdsh. If they don't, there is not much a lowly user like me can do. I think we can close this, since like

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Ralph Castain
I apologize - I realized late last night that I had a typo in my recommended command. It should read:
    mpirun -mca plm rsh -mca plm_rsh_agent pbsdsh -mca ras ^tm --machinefile m1
Also, if you know that #procs <= #cores on your nodes,

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 5:25 PM, Laurence Marks wrote: > Thanks. I will test this tomorrow. > > Many people run Wien2k with openmpi, as you say; I only became aware a few > days ago of the issue of Wien2k (and perhaps other codes) leaving orphaned > processes running. I also know someone who

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
And, before someone wonders, while Wien2k is a commercial code it is about 500 Eu for a lifetime licence so this is not the same as Vasp or Gaussian which cost $. And, I have no financial interest in the code, but like many others help make it better (semi gnu). On Sun, Apr 3, 2011 at 6:25

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
Thanks. I will test this tomorrow. Many people run Wien2k with openmpi, as you say; I only became aware a few days ago of the issue of Wien2k (and perhaps other codes) leaving orphaned processes running. I also know someone who wants to run Wien2k on a system where both rsh and ssh are

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 4:37 PM, Laurence Marks wrote: > On Sun, Apr 3, 2011 at 5:08 PM, Reuti wrote: >> On 03.04.2011 at 23:59, David Singleton wrote: >> >>> On 04/04/2011 12:56 AM, Ralph Castain wrote: What I still don't understand is why you are trying to

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
On Sun, Apr 3, 2011 at 5:08 PM, Reuti wrote: > On 03.04.2011 at 23:59, David Singleton wrote: > >> On 04/04/2011 12:56 AM, Ralph Castain wrote: >>> >>> What I still don't understand is why you are trying to do it this way. Why >>> not just run >>> >>> time mpirun -v

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 4:08 PM, Reuti wrote: > On 03.04.2011 at 23:59, David Singleton wrote: > >> On 04/04/2011 12:56 AM, Ralph Castain wrote: >>> >>> What I still don't understand is why you are trying to do it this way. Why >>> not just run >>> >>> time mpirun -v -x LD_LIBRARY_PATH -x PATH

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 3:22 PM, Reuti wrote: > On 03.04.2011 at 22:57, Ralph Castain wrote: > >> On Apr 3, 2011, at 2:00 PM, Laurence Marks wrote: >> > > I am not using that computer. A scenario that I have come across is > that when an msub job is killed because it has exceeded its

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
Works great for me...sleep is dead every time. On Apr 3, 2011, at 3:13 PM, David Singleton wrote: > >> You can prove this to yourself rather easily. Just ssh to a remote node and >> execute any command that lingers for a while - say something simple like >> "sleep". Then kill the ssh and do a

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Reuti
On 03.04.2011 at 23:59, David Singleton wrote: > On 04/04/2011 12:56 AM, Ralph Castain wrote: >> >> What I still don't understand is why you are trying to do it this way. Why >> not just run >> >> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN >>

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread David Singleton
On 04/04/2011 12:56 AM, Ralph Castain wrote: What I still don't understand is why you are trying to do it this way. Why not just run time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def where machineN contains the names
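A minimal sketch of how a machinefile like the .machineN passed to -machinefile above could be written inside a Torque job, assuming the allocated nodes are listed in the file named by PBS_NODEFILE (one line per slot). The helper name make_machinefile and its arguments are hypothetical; they are not part of Wien2k or Open MPI, and the thread does not say how .machineN is actually produced.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not taken from the thread):
# copy the first COUNT entries of a Torque node file into a machinefile
# that can be handed to "mpirun -machinefile".
make_machinefile() {
    nodefile=$1   # e.g. "$PBS_NODEFILE", set by Torque inside a job
    count=$2      # number of slots (lines) wanted for this mpirun
    outfile=$3    # e.g. .machine1
    head -n "$count" "$nodefile" > "$outfile"
}

# Possible usage inside a job script (names are illustrative):
#   make_machinefile "$PBS_NODEFILE" 2 .machine1
#   mpirun -np 2 -machinefile .machine1 ./my_mpi_program
```

Taking the first lines of the node file is only one possible policy; a job that runs several mpirun instances side by side would need to hand each one a different slice of the file.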

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Reuti
On 03.04.2011 at 22:57, Ralph Castain wrote: > On Apr 3, 2011, at 2:00 PM, Laurence Marks wrote: > I am not using that computer. A scenario that I have come across is that when an msub job is killed because it has exceeded its Walltime, mpi tasks spawned by ssh may not be

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
> > It most certainly will! That mpirun on nodeB is executing under the ssh from > nodeA, so when that ssh session is killed, it automatically kills everything > run underneath it. And when mpirun dies, so does the job it was running, as > per above. > You can prove this to yourself rather easily.

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread David Singleton
You can prove this to yourself rather easily. Just ssh to a remote node and execute any command that lingers for a while - say something simple like "sleep". Then kill the ssh and do a "ps" on the remote node. I guarantee that the command will have died. H ... vayu1:~ > ssh v37 sleep

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 2:00 PM, Laurence Marks wrote: >>> >>> I am not using that computer. A scenario that I have come across is >>> that when an msub job is killed because it has exceeded its Walltime, >>> mpi tasks spawned by ssh may not be terminated because (so I am told) >>> Torque does not

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
On Sun, Apr 3, 2011 at 11:41 AM, Ralph Castain wrote: > > On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote: > >> On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain wrote: >>> >>> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: >>> Let me expand on this

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote: > On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain wrote: >> >> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: >> >>> Let me expand on this slightly (in response to Ralph Castain's posting >>> -- I had digest mode set). As

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain wrote: > > On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: > >> Let me expand on this slightly (in response to Ralph Castain's posting >> -- I had digest mode set). As currently constructed a shellscript in >> Wien2k

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 9:12 AM, Reuti wrote: > On 03.04.2011 at 16:56, Ralph Castain wrote: > >> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: >> >>> Let me expand on this slightly (in response to Ralph Castain's posting >>> -- I had digest mode set). As currently constructed a shellscript in

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Reuti
On 03.04.2011 at 16:56, Ralph Castain wrote: > On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: > >> Let me expand on this slightly (in response to Ralph Castain's posting >> -- I had digest mode set). As currently constructed a shellscript in >> Wien2k (www.wien2k.at) launches a series of

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: > Let me expand on this slightly (in response to Ralph Castain's posting > -- I had digest mode set). As currently constructed a shellscript in > Wien2k (www.wien2k.at) launches a series of tasks using > > ($remote $remotemachine "cd $PWD;$t

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Laurence Marks
Let me expand on this slightly (in response to Ralph Castain's posting -- I had digest mode set). As currently constructed a shellscript in Wien2k (www.wien2k.at) launches a series of tasks using ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >>.time1_$loop & where the

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
I'm afraid I have no idea what you are talking about. Are you saying you are launching OMPI processes via mpirun, but with "pbsdsh" as the plm_rsh_agent??? That would be a very bad idea. If you are running under Torque, then let mpirun "do the right thing" and use its Torque-based launcher. On

[OMPI users] openmpi/pbsdsh/Torque problem

2011-04-02 Thread Laurence Marks
I have a problem which may or may not be an openmpi issue, but since this list was useful before with a race condition, I am posting it here. I am trying to use pbsdsh as an ssh replacement, pushed by sysadmins because Torque does not know about ssh tasks launched from a task. In a simple case, a script launches three