So that's it. I changed export MPI_REMSH=rsh to export MPI_REMSH=ssh,
and now it works as it should.
Reuti, many thanks for the very professional support.
And now I want to draw some conclusions on what helped, for the people who
have the same problem.
1. add:
export PATH=/export/apps/platform_mpi/bin:$PATH
export MPI_REMSH=ssh
export MPI_TMPDIR=$TMPDIR
to the submit script.
2. use:
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
to start MPI (a complete example submit script is shown after this list).
3. use this PE configuration:
[root@rocks mpi]# qconf -sp pmpi
pe_name pmpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
$pe_hostfile
stop_proc_args /opt/gridengine/mpi/pmpi/stoppmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
4. edit /opt/gridengine/mpi/pmpi/rsh so that the qrsh calls include -V
(only the relevant part is shown, the rest of the script stays unchanged):
if [ x$just_wrap = x ]; then
    if [ $minus_n -eq 1 ]; then
        echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
    else
        echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
    fi
5. make sure that /opt/gridengine/mpi/pmpi/startpmpi.sh is also available
on the nodes.
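For completeness, here is a sketch of a full submit script that puts steps 1
and 2 together. It is only an example based on the test.sh from the thread
below; the LS-DYNA binary path and the input file name have to match your
own installation:

#!/bin/bash
#$ -N lsdyna
#$ -S /bin/bash
#$ -pe pmpi 16
#$ -cwd
#$ -o lsdyna.out
#$ -e lsdyna.err
#$ -q test.q

# Platform MPI environment
export MPI_ROOT=/export/apps/platform_mpi
export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
export PATH=/export/apps/platform_mpi/bin:$PATH

# tight integration: use the ssh wrapper link that startpmpi.sh puts into
# the job's tmp dir (the job's $TMPDIR is already in the PATH, as seen in
# the thread below) and keep the MPI scratch files in $TMPDIR as well
export MPI_REMSH=ssh
export MPI_TMPDIR=$TMPDIR

# adjust to your installation
BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
ARGS="i=test.k"

mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS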
Have fun!
Once again, many thanks to Reuti, who made all this happen!
Issue closed!
On 03/07/2014 05:34 PM, Reuti wrote:
> On 07.03.2014 at 17:19, Petar Penchev wrote:
>
>> I have this in /opt/gridengine/mpi/pmpi/rsh:
>> if [ x$just_wrap = x ]; then
>> if [ $minus_n -eq 1 ]; then
>> echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>> exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>> else
>> echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>> exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>> fi
>> else
>>
>> and this in /opt/gridengine/mpi/pmpi/startpmpi.sh
>> #
>> # Make script wrapper for 'rsh' available in jobs tmp dir
>> #
>> if [ $catch_rsh = 1 ]; then
>> rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
>>
>>
>>
>> These are the files created when I submit a job:
>> -rw-r--r-- 1 petar users 128 Mar 7 16:54 machines
>> -rw------- 1 petar users 2865 Mar 7 16:54 mpiafuT79Rl
>> -rw-r--r-- 1 petar users 0 Mar 7 16:54 mpijob_petar_29112
>> lrwxrwxrwx 1 petar users 28 Mar 7 16:54 ssh ->
>> /opt/gridengine/mpi/pmpi/rsh
> Looks almost perfect. But the link is named `ssh`. Then the:
>
> export MPI_REMSH=rsh
>
> is either not necessary or should also be defined as "ssh". As said: you
> could name the link "foo" and set MPI_REMSH to "foo" - it's just a name.
>
> -- Reuti
>
>
>> [petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> Okay, this file was already assembled by Platform MPI from the
> $TMPDIR/machines.
>
>
>> What exactly do you mean by "...; resp. hostname"? Do I have to add
>> something else?
> It's only a precaution so that the creation of the `hostname` wrapper also
> points to the correct location in the pmpi directory - in case you switch
> it on later.
>
>
>> And now, as you suggested, I changed the tmpdir to be local for all nodes,
>> but I still get this error.
> About "command not found"?
>
> -- Reuti
>
>
>> Cheers,
>> Petar
>>
>>
>>
>> On 03/07/2014 04:20 PM, Reuti wrote:
>>> On 07.03.2014 at 15:57, Petar Penchev wrote:
>>>
>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation.
>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>>
>>>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>>>> HP-MPI) and I am controlling which one to use through modules.
>>>> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>>>>
>>>> (You copied rsh/hostname to pmpi too?)
>>>>
>>>> Yes, both are there.
>>>>
>>>> control_slaves TRUE
>>>> now this is also set
>>> Good.
>>>
>>>
>>>> so it should be accessible when your job starts.
>>>>
>>>>
>>>> As you suggested, I have added 'export
>>>> PATH=/export/apps/platform_mpi/bin:$PATH' to my submit script and now the
>>>> rsh error has disappeared. Adding only the job tmp dir didn't work (export
>>>> PATH=/export/apps/platform_mpi/bin:$TMPDIR).
>>>> The output is now
>>>>
>>>> echo $PATH
>>>>
>>>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
>>> Okay, here we have the /home/tmp/33108.1.test.q, which looks like the
>>> scratch space on the node. But: this is in /home and so on an NFS space? It
>>> would be better if it were local on each node.
>>>
>>> OTOH: in the queue definition you posted I see "tmpdir /tmp"
>>> - is /tmp a symbolic link to /home/tmp?
>>>
>>>
>>>> But I have another problem. After I submit a simulation, in the log file
>>>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>>>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>>>> shell commands exited with non-zero status, which may indicate a remote
>>>> access problem."
>>>>
>>>> Which protocol does mpirun use to communicate between the nodes?
>>> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit
>>> ...`. To clarify: there is no `rsh` in the game. We could tell Platform MPI
>>> to use "foo" to access a node and in the startmpi.sh we create a symbolic
>>> link "foo" to point to a routine "baz" which calls `qrsh -inherit ...` in
>>> the end.
>>>
>>>
>>>> I checked,
>>>> and I can log in via ssh without a password from the head node to the
>>>> nodes and between the nodes.
>>> rsh or ssh is not necessary if you use a tight integration. In my clusters
>>> it's always disabled. The idea is: we tell Platform MPI to use rsh, which
>>> will in reality start the rsh wrapper in /opt/gridengine/mpi/pmpi/rsh,
>>> which is pointed to by the symbolic link created in
>>> /home/tmp/33108.1.test.q. The part:
>>>
>>> #
>>> # Make script wrapper for 'rsh' available in jobs tmp dir
>>> #
>>> if [ $catch_rsh = 1 ]; then
>>> rsh_wrapper=$SGE_ROOT/mpi/rsh
>>>
>>> in /opt/gridengine/mpi/pmpi/startmpi.sh points to
>>> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Petar
>>>>
>>>> On 03/07/2014 02:39 PM, Reuti wrote:
>>>>> On 07.03.2014 at 13:20, Petar Penchev wrote:
>>>>>
>>>>>> I have added the -catch_rsh to the PE and now when I start a sim
>>>>> Good.
>>>>>
>>>>>
>>>>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
>>>>>> Command not found' in the lsdyna.err.
>>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation.
>>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>>>
>>>>> The path to `rsh` is set up by the wrapper, so it should be accessible
>>>>> when your job starts. Can you please add to your jobscript:
>>>>>
>>>>> echo $PATH
>>>>>
>>>>> The $TMPDIR of the job on the node should be included there, and therein
>>>>> the `rsh` should exist.
>>>>>
>>>>> BTW: I'm not sure about your application, but several need all
>>>>> environment variables from the master node of the parallel job to also be
>>>>> set for the slaves. This can be achieved by including "-V" for `qrsh
>>>>> -inherit ...` near the end of /opt/gridengine/mpi/pmpi/rsh
>>>>>
>>>>> (You copied rsh/hostname to pmpi too?)
>>>>>
>>>>>
>>>>>> Petar
>>>>>>
>>>>>> [petar@rocks test]$ cat lsdyna.err
>>>>>> mpirun: rsh: Command not found
>>>>>>
>>>>>> [petar@rocks test]$ cat lsdyna.out
>>>>>> -catch_rsh
>>>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> Error: Unknown option -np
>>>>>>
>>>>>> [root@rocks test]# qconf -mp pmpi
>>>>>> pe_name pmpi
>>>>>> slots 9999
>>>>>> user_lists NONE
>>>>>> xuser_lists NONE
>>>>>> start_proc_args /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>>>> $pe_hostfile
>>>>>> stop_proc_args /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>>>> allocation_rule $fill_up
>>>>>> control_slaves FALSE
>>>>> control_slaves TRUE
>>>>>
>>>>> Otherwise the `qrsh -inherit ...` will fail.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> job_is_first_task TRUE
>>>>>> urgency_slots min
>>>>>> accounting_summary TRUE
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 07.03.2014 at 12:28, Petar Penchev wrote:
>>>>>>>
>>>>>>>> I have a rocks-cluster 6.1 using OGS2011.11p1 and I am trying to use
>>>>>>>> the PlatformMPI parallel libraries. My problem is that when I submit a
>>>>>>>> job using qsub test.sh, the job starts only on one node with 16
>>>>>>>> processes and not on both nodes. The -pe pmpi, which I am using for
>>>>>>>> now, is only a copy of mpi.
>>>>>>> Does the definition of the PE pmpi also include the -catch_rsh? The
>>>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1)
>>>>>>> format, which is created by /usr/sge/mpi/startmpi.sh
>>>>>>>
>>>>>>> In addition you need the following settings for a tight integration.
>>>>>>> Please try:
>>>>>>>
>>>>>>> ...
>>>>>>> export MPI_REMSH=rsh
>>>>>>> export MPI_TMPDIR=$TMPDIR
>>>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>> What am I missing? Does anyone have a working -pe submit script, or
>>>>>>>> some hints on how to make this work?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Petar
>>>>>>>>
>>>>>>>> [root@rocks mpi]# test.sh
>>>>>>>> #!/bin/bash
>>>>>>>> #$ -N lsdyna
>>>>>>>> #$ -S /bin/bash
>>>>>>>> #$ -pe pmpi 16
>>>>>>>> #$ -cwd
>>>>>>>> #$ -o lsdyna.out
>>>>>>>> #$ -e lsdyna.err
>>>>>>>> ###
>>>>>>>> #$ -q test.q
>>>>>>>> ### -notify
>>>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>>>> ARGS="i=test.k"
>>>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>>>
>>>>>>>>
>>>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>>>> qname test.q
>>>>>>>> hostlist mnode01 mnode02
>>>>>>>> seq_no 0
>>>>>>>> load_thresholds np_load_avg=1.75
>>>>>>>> suspend_thresholds NONE
>>>>>>>> nsuspend 1
>>>>>>>> suspend_interval 00:05:00
>>>>>>>> priority 0
>>>>>>>> min_cpu_interval 00:05:00
>>>>>>>> processors UNDEFINED
>>>>>>>> qtype BATCH INTERACTIVE
>>>>>>>> ckpt_list NONE
>>>>>>>> pe_list pmpi
>>>>>>>> rerun FALSE
>>>>>>>> slots 8
>>>>>>>> tmpdir /tmp
>>>>>>>> shell /bin/bash
>>>>>>>> prolog NONE
>>>>>>>> epilog NONE
>>>>>>>> shell_start_mode unix_behavior
>>>>>>>> starter_method NONE
>>>>>>>> suspend_method NONE
>>>>>>>> resume_method NONE
>>>>>>>> terminate_method NONE
>>>>>>>> notify 00:00:60
>>>>>>>> owner_list NONE
>>>>>>>> user_lists NONE
>>>>>>>> xuser_lists NONE
>>>>>>>> subordinate_list NONE
>>>>>>>> complex_values NONE
>>>>>>>> projects NONE
>>>>>>>> xprojects NONE
>>>>>>>> calendar NONE
>>>>>>>> initial_state default
>>>>>>>> s_rt INFINITY
>>>>>>>> h_rt INFINITY
>>>>>>>> s_cpu INFINITY
>>>>>>>> h_cpu INFINITY
>>>>>>>> s_fsize INFINITY
>>>>>>>> h_fsize INFINITY
>>>>>>>> s_data INFINITY
>>>>>>>> h_data INFINITY
>>>>>>>> s_stack INFINITY
>>>>>>>> h_stack INFINITY
>>>>>>>> s_core INFINITY
>>>>>>>> h_core INFINITY
>>>>>>>> s_rss INFINITY
>>>>>>>> h_rss INFINITY
>>>>>>>> s_vmem INFINITY
>>>>>>>> h_vmem INFINITY
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users