So that's it. I changed export MPI_REMSH=rsh to export MPI_REMSH=ssh,
and now it works as it should.

Reuti, many thanks for the very professional support.

And now I want to sum up what helped, for people who have the same
problem.

1. add:
export PATH=/export/apps/platform_mpi/bin:$PATH
export MPI_REMSH=ssh
export MPI_TMPDIR=$TMPDIR

to the submit script.

2. use:
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS

to start MPI (a complete example submit script is sketched after this list)

3. use this PE (parallel environment) configuration:
[root@rocks mpi]# qconf -sp pmpi
pe_name            pmpi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
$pe_hostfile
stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE

4. edit /opt/gridengine/mpi/pmpi/rsh so that the relevant part looks like
this (the -V was added so the environment variables of the master task are
also set for the slaves; the rest of the wrapper stays unchanged):
if [ x$just_wrap = x ]; then
   if [ $minus_n -eq 1 ]; then
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
   else
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
   fi
5. make sure that /opt/gridengine/mpi/pmpi/startpmpi.sh is also available
on the nodes.
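
For reference, a minimal submit script putting steps 1 and 2 together could
look like the sketch below. It is based on my test.sh from earlier in the
thread, so the binary path, input file and queue name are just examples from
my cluster and will differ elsewhere:

#!/bin/bash
#$ -N lsdyna
#$ -S /bin/bash
#$ -pe pmpi 16
#$ -cwd
#$ -o lsdyna.out
#$ -e lsdyna.err
#$ -q test.q

# Platform MPI environment (paths as on my cluster)
export MPI_ROOT=/export/apps/platform_mpi
export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
export PATH=/export/apps/platform_mpi/bin:$PATH

# Tight integration: "ssh" resolves to the wrapper symlink that
# startpmpi.sh creates in the job's $TMPDIR (which SGE puts in $PATH);
# that wrapper calls qrsh -inherit on the slave nodes
export MPI_REMSH=ssh
export MPI_TMPDIR=$TMPDIR

BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
ARGS="i=test.k"

mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS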

Have fun!

Once again, many thanks to Reuti, who made all this happen!

Issue closed!

On 03/07/2014 05:34 PM, Reuti wrote:
> On 07.03.2014 at 17:19, Petar Penchev wrote:
>
>> I have this in /opt/gridengine/mpi/pmpi/rsh:
>> if [ x$just_wrap = x ]; then
>>   if [ $minus_n -eq 1 ]; then
>>      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>>      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>>   else
>>      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>>      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>>   fi
>> else
>>
>> and this in /opt/gridengine/mpi/pmpi/startpmpi.sh
>> #
>> # Make script wrapper for 'rsh' available in jobs tmp dir
>> #
>> if [ $catch_rsh = 1 ]; then
>>   rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
>>
>>
>>
>> These are the files created when I submit a job:
>> -rw-r--r-- 1 petar users  128 Mar  7 16:54 machines
>> -rw------- 1 petar users 2865 Mar  7 16:54 mpiafuT79Rl
>> -rw-r--r-- 1 petar users    0 Mar  7 16:54 mpijob_petar_29112
>> lrwxrwxrwx 1 petar users   28 Mar  7 16:54 ssh ->
>> /opt/gridengine/mpi/pmpi/rsh
> Looks almost perfect. But the link is named `ssh`. Then the:
>
> export MPI_REMSH=rsh
>
> is either not necessary or should also be defined as "ssh". As said: you 
> could name the link "foo" and set MPI_REMSH to "foo" - it's just a name.
>
> -- Reuti
>
>
>> [petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
>> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
>> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> Okay, this file was already assembled by Platform MPI out of the
> $TMPDIR/machines.
>
>
>> What do you mean exactly by "...; resp. hostname"? Do I have to add
>> something else?
> It's only a precaution to have the creation of the `hostname` wrapper also 
> pointing to the correct location in the pmpi directory - in case you switch 
> it on later.
>
>
>> And now, as you suggested, I changed the tmpdir to be local for all nodes,
>> but I still get this error.
> About "command not found"?
>
> -- Reuti
>
>
>> Cheers,
>> Petar
>>
>>
>>
>> On 03/07/2014 04:20 PM, Reuti wrote:
>>> On 07.03.2014 at 15:57, Petar Penchev wrote:
>>>
>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. 
>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>>
>>>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>>>> HP-MPI) and I am controlling which one to use through modules.
>>>> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>>>>
>>>> (You copied rsh/hostname to pmpi too?)
>>>>
>>>> Yes, both are there.
>>>>
>>>> control_slaves TRUE
>>>> now this is also set
>>> Good.
>>>
>>>
>>>> so it should be accessible when your job starts.
>>>>
>>>>
>>>> As you suggested, I have added 'export
>>>> PATH=/export/apps/platform_mpi/bin:$PATH' to my submit script and now
>>>> the rsh error disappeared. Adding only the job tmp dir didn't work (export
>>>> PATH=/export/apps/platform_mpi/bin:$TMPDIR).
>>>> The output is now
>>>>
>>>> echo $PATH
>>>>
>>>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
>>> Okay, here we have the /home/tmp/33108.1.test.q which looks like the
>>> scratch space on the node. But: this is in /home and hence on NFS? It
>>> would be better if it were local on each node.
>>>
>>> OTOH: in the queue definition you posted I see "tmpdir                /tmp" 
>>> - is /tmp a symbolic link to /home/tmp?
>>>
>>>
>>>> But I have another problem. After I submit a simulation, in the log file
>>>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>>>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>>>> shell commands exited with non-zero status, which may indicate a remote
>>>> access problem."
>>>>
>>>> Which protocol is mpirun using to communicate between nodes?
>>> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit 
>>> ...`. To clarify: there is no `rsh` in the game. We could tell Platform MPI 
>>> to use "foo" to access a node and in the startmpi.sh we create a symbolic 
>>> link "foo" to point to a routine "baz" which calls `qrsh -inherit ...` in 
>>> the end.
>>>
>>>
>>>> I checked
>>>> and I can log in via ssh without a password from the head node to the
>>>> nodes and between the nodes.
>>> rsh or ssh is not necessary if you use a tight integration. In my clusters
>>> it's always disabled. The idea is: we tell Platform MPI to use rsh, which
>>> will in reality start the rsh wrapper in /opt/gridengine/mpi/pmpi/rsh,
>>> which is pointed to by the created symbolic link in /home/tmp/33108.1.test.q.
>>> The part:
>>>
>>> #
>>> # Make script wrapper for 'rsh' available in jobs tmp dir
>>> #
>>> if [ $catch_rsh = 1 ]; then
>>>   rsh_wrapper=$SGE_ROOT/mpi/rsh
>>>
>>> in /opt/gridengine/mpi/pmpi/startmpi.sh points to 
>>> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Petar
>>>>
>>>> On 03/07/2014 02:39 PM, Reuti wrote:
>>>>> On 07.03.2014 at 13:20, Petar Penchev wrote:
>>>>>
>>>>>> I have added the -catch_rsh to the PE and now when I start a simulation
>>>>> Good.
>>>>>
>>>>>
>>>>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see this 'mpirun: rsh:
>>>>>> Command not found' in the lsdyna.err.
>>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. 
>>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>>>
>>>>> The path to `rsh` is set up by the wrapper, so it should be accessible
>>>>> when your job starts. Can you please add to your jobscript:
>>>>>
>>>>> echo $PATH
>>>>>
>>>>> The $TMPDIR of the job on the node should be included there, and therein 
>>>>> the `rsh` should exist.
>>>>>
>>>>> BTW: I'm not sure about your application, but several need all
>>>>> environment variables from the master node of the parallel job to also
>>>>> be set for the slaves. This can be achieved by including "-V" for
>>>>> `qrsh -inherit ...` near the end of /opt/gridengine/mpi/pmpi/rsh
>>>>>
>>>>> (You copied rsh/hostname to pmpi too?)
>>>>>
>>>>>
>>>>>> Petar
>>>>>>
>>>>>> [petar@rocks test]$ cat lsdyna.err
>>>>>> mpirun: rsh: Command not found
>>>>>>
>>>>>> [petar@rocks test]$ cat lsdyna.out
>>>>>> -catch_rsh
>>>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode01
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> mnode02
>>>>>> Error: Unknown option -np
>>>>>>
>>>>>> [root@rocks test]# qconf -mp pmpi
>>>>>> pe_name            pmpi
>>>>>> slots              9999
>>>>>> user_lists         NONE
>>>>>> xuser_lists        NONE
>>>>>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>>>> $pe_hostfile
>>>>>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>>>> allocation_rule    $fill_up
>>>>>> control_slaves     FALSE
>>>>> control_slaves TRUE
>>>>>
>>>>> Otherwise the `qrsh -inherit ...` will fail.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> job_is_first_task  TRUE
>>>>>> urgency_slots      min
>>>>>> accounting_summary TRUE
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 07.03.2014 at 12:28, Petar Penchev wrote:
>>>>>>>
>>>>>>>> I have a rocks-cluster 6.1 using OGS2011.11p1 and I am trying to use
>>>>>>>> the PlatformMPI parallel libraries. My problem is that when I submit
>>>>>>>> a job using qsub test.sh, the job starts only on one node with 16
>>>>>>>> processes and not on both nodes. The -pe pmpi, which I am using for
>>>>>>>> now, is only a copy of mpi.
>>>>>>> The definition of the PE pmpi does also include the -catch_rsh? The 
>>>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1) 
>>>>>>> format, which is created by the /usr/sge/mpi/startmpi.sh
>>>>>>>
>>>>>>> In addition you need the following settings for a tight integration. 
>>>>>>> Please try:
>>>>>>>
>>>>>>> ...
>>>>>>> export MPI_REMSH=rsh
>>>>>>> export MPI_TMPDIR=$TMPDIR
>>>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>> What am I missing? Does anyone have a working -pe submit script, or
>>>>>>>> some hints on how to make this work?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Petar
>>>>>>>>
>>>>>>>> [root@rocks mpi]# test.sh
>>>>>>>> #!/bin/bash
>>>>>>>> #$ -N lsdyna
>>>>>>>> #$ -S /bin/bash
>>>>>>>> #$ -pe pmpi 16
>>>>>>>> #$ -cwd
>>>>>>>> #$ -o lsdyna.out
>>>>>>>> #$ -e lsdyna.err
>>>>>>>> ###
>>>>>>>> #$ -q test.q
>>>>>>>> ### -notify
>>>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>>>> ARGS="i=test.k"
>>>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>>>
>>>>>>>>
>>>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>>>> qname                 test.q
>>>>>>>> hostlist              mnode01 mnode02
>>>>>>>> seq_no                0
>>>>>>>> load_thresholds       np_load_avg=1.75
>>>>>>>> suspend_thresholds    NONE
>>>>>>>> nsuspend              1
>>>>>>>> suspend_interval      00:05:00
>>>>>>>> priority              0
>>>>>>>> min_cpu_interval      00:05:00
>>>>>>>> processors            UNDEFINED
>>>>>>>> qtype                 BATCH INTERACTIVE
>>>>>>>> ckpt_list             NONE
>>>>>>>> pe_list               pmpi
>>>>>>>> rerun                 FALSE
>>>>>>>> slots                 8
>>>>>>>> tmpdir                /tmp
>>>>>>>> shell                 /bin/bash
>>>>>>>> prolog                NONE
>>>>>>>> epilog                NONE
>>>>>>>> shell_start_mode      unix_behavior
>>>>>>>> starter_method        NONE
>>>>>>>> suspend_method        NONE
>>>>>>>> resume_method         NONE
>>>>>>>> terminate_method      NONE
>>>>>>>> notify                00:00:60
>>>>>>>> owner_list            NONE
>>>>>>>> user_lists            NONE
>>>>>>>> xuser_lists           NONE
>>>>>>>> subordinate_list      NONE
>>>>>>>> complex_values        NONE
>>>>>>>> projects              NONE
>>>>>>>> xprojects             NONE
>>>>>>>> calendar              NONE
>>>>>>>> initial_state         default
>>>>>>>> s_rt                  INFINITY
>>>>>>>> h_rt                  INFINITY
>>>>>>>> s_cpu                 INFINITY
>>>>>>>> h_cpu                 INFINITY
>>>>>>>> s_fsize               INFINITY
>>>>>>>> h_fsize               INFINITY
>>>>>>>> s_data                INFINITY
>>>>>>>> h_data                INFINITY
>>>>>>>> s_stack               INFINITY
>>>>>>>> h_stack               INFINITY
>>>>>>>> s_core                INFINITY
>>>>>>>> h_core                INFINITY
>>>>>>>> s_rss                 INFINITY
>>>>>>>> h_rss                 INFINITY
>>>>>>>> s_vmem                INFINITY
>>>>>>>> h_vmem                INFINITY

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
