I have this in /opt/gridengine/mpi/pmpi/rsh:
if [ x$just_wrap = x ]; then
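   # the -V here was added per Reuti's suggestion so the slave tasks inherit the job's environment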
   if [ $minus_n -eq 1 ]; then
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
   else
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
   fi
else
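
Just to recap how I understand this wrapper is supposed to be reached from the job
script, combining your earlier suggestions ($BIN/$ARGS abbreviated, so please treat
this only as a sketch and correct me if I got something wrong):

export MPI_ROOT=/export/apps/platform_mpi
export PATH=/export/apps/platform_mpi/bin:$PATH   # SGE already puts the job's $TMPDIR in PATH
export MPI_REMSH=rsh        # "rsh" should then resolve to the wrapper linked into $TMPDIR
export MPI_TMPDIR=$TMPDIR
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS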

and this in /opt/gridengine/mpi/pmpi/startpmpi.sh:
#
# Make script wrapper for 'rsh' available in jobs tmp dir
#
if [ $catch_rsh = 1 ]; then
   rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
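
For context, in the stock $SGE_ROOT/mpi/startmpi.sh that block continues (as far as
I can tell) by checking the wrapper and then symlinking it into the job's tmp dir,
roughly like this:

   if [ ! -x $rsh_wrapper ]; then
      echo "$me: can't execute $rsh_wrapper" >&2
      exit 1
   fi
   # make the wrapper reachable via the job's $TMPDIR (which is in the PATH)
   ln -s $rsh_wrapper $TMPDIR/rsh
fi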



These are the files created when I submit a job:
-rw-r--r-- 1 petar users  128 Mar  7 16:54 machines
-rw------- 1 petar users 2865 Mar  7 16:54 mpiafuT79Rl
-rw-r--r-- 1 petar users    0 Mar  7 16:54 mpijob_petar_29112
lrwxrwxrwx 1 petar users   28 Mar  7 16:54 ssh -> /opt/gridengine/mpi/pmpi/rsh


[petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k


What do you mean exactly by "...; resp. hostname"? Do I have to add
something else?

And now, as you suggested, I changed the tmpdir to be local on all nodes,
but I still get this error.
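
For the next run I can add a few plain checks to the job script, right before the
mpirun line, to see what actually gets picked up, e.g.:

echo $PATH
ls -l $TMPDIR              # the wrapper link and the machines file should show up here
which rsh                  # should point into the job's tmp dir rather than a system rsh
cat $TMPDIR/machines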

Cheers,
Petar



On 03/07/2014 04:20 PM, Reuti wrote:
> Am 07.03.2014 um 15:57 schrieb Petar Penchev:
>
>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But 
>> I wonder: do you have a second MPI library installed: `which mpiexec`?
>>
>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>> HP-MPI) and I am controlling which one to use through modules.
>> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>>
>> (You copied rsh/hostname to pmpi too?)
>>
>> Yes, both are there.
>>
>> control_slaves TRUE
>> now this is also set
> Good.
>
>
>> so it should be accessible when your job starts.
>>
>>
>> As you suggested I have added 'export PATH=/export/apps/platform_mpi/bin:$PATH'
>> to my submit script and now the rsh error has disappeared. Adding only the
>> job tmp dir didn't work (export PATH=/export/apps/platform_mpi/bin:$TMPDIR).
>> The output is now
>>
>> echo $PATH
>>
>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
> Okay, here we have /home/tmp/33108.1.test.q, which looks like the scratch
> space on the node. But: this is in /home and thus on NFS? It would be
> better if it were local on each node.
>
> OTOH: in the queue definition you posted I see "tmpdir                /tmp" - 
> is /tmp a symbolic link to /home/tmp?
>
>
>> But I have another problem. After I submit a simulation, in the log file
>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>> shell commands exited with non-zero status, which may indicate a remote
>> access problem."
>>
>> Which protocol is mpirun using to communicate between the nodes?
> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit ...`. 
> To clarify: there is no real `rsh` involved. We could tell Platform MPI to use
> "foo" to access a node, and in startmpi.sh we would create a symbolic link "foo"
> pointing to a routine "baz" which calls `qrsh -inherit ...` in the end.
>
>
>> I checked
>> and i can ssh-log without password from the head on the nodes and
>> between the nodes.
> rsh or ssh is not necessary if you use a tight integration. In my clusters
> it's always disabled. The idea is: we tell Platform MPI to use rsh, but this
> will in reality start the rsh-wrapper in /opt/gridengine/mpi/pmpi/rsh, which
> is pointed to by the symbolic link created in /home/tmp/33108.1.test.q. The part:
>
> #
> # Make script wrapper for 'rsh' available in jobs tmp dir
> #
> if [ $catch_rsh = 1 ]; then
>    rsh_wrapper=$SGE_ROOT/mpi/rsh
>
> in /opt/gridengine/mpi/pmpi/startmpi.sh points to 
> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
>
> -- Reuti
>
>
>> Thanks,
>> Petar
>>
>> On 03/07/2014 02:39 PM, Reuti wrote:
>>> Am 07.03.2014 um 13:20 schrieb Petar Penchev:
>>>
>>>> I have added the -catch_rsh to the PE and now when I start a sim
>>> Good.
>>>
>>>
>>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
>>>> Command not found' in the lsdyna.err.
>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. 
>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>
>>> The path to `rsh` is set up by the wrapper, so it should be accessible when 
>>> your job starts. Can you please add to your jobscript:
>>>
>>> echo $PATH
>>>
>>> The $TMPDIR of the job on the node should be included there, and therein 
>>> the `rsh` should exist.
>>>
>>> BTW: I'm not sure about your application, but several applications need all 
>>> environment variables from the master node of the parallel job to also be set 
>>> for the slaves. This can be achieved by including "-V" for `qrsh -inherit 
>>> ...` near the end of /opt/gridengine/mpi/pmpi/rsh
>>>
>>> (You copied rsh/hostname to pmpi too?)
>>>
>>>
>>>> Petar
>>>>
>>>> [petar@rocks test]$ cat lsdyna.err
>>>> mpirun: rsh: Command not found
>>>>
>>>> [petar@rocks test]$ cat lsdyna.out
>>>> -catch_rsh
>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> Error: Unknown option -np
>>>>
>>>> [root@rocks test]# qconf -mp pmpi
>>>> pe_name            pmpi
>>>> slots              9999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>> $pe_hostfile
>>>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>> allocation_rule    $fill_up
>>>> control_slaves     FALSE
>>> control_slaves TRUE
>>>
>>> Otherwise the `qrsh -inherit ...` will fail.
>>>
>>> -- Reuti
>>>
>>>
>>>> job_is_first_task  TRUE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>>
>>>>
>>>>
>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> Am 07.03.2014 um 12:28 schrieb Petar Penchev:
>>>>>
>>>>>> I have a Rocks cluster 6.1 using OGS2011.11p1 and I am trying to use the
>>>>>> PlatformMPI parallel libraries. My problem is that when I submit a job
>>>>>> using qsub test.sh, the job starts only on one node with 16 processes
>>>>>> and not on both nodes. The PE pmpi, which I am using for now, is only a
>>>>>> copy of mpi.
>>>>> Does the definition of the PE pmpi also include -catch_rsh? The 
>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1) 
>>>>> format, which is created by /usr/sge/mpi/startmpi.sh
>>>>>
>>>>> In addition you need the following settings for a tight integration. 
>>>>> Please try:
>>>>>
>>>>> ...
>>>>> export MPI_REMSH=rsh
>>>>> export MPI_TMPDIR=$TMPDIR
>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> What am I missing? Does anyone have a working PE/submit script, or some
>>>>>> hints on how to make this work?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Petar
>>>>>>
>>>>>> [root@rocks mpi]# cat test.sh
>>>>>> #!/bin/bash
>>>>>> #$ -N lsdyna
>>>>>> #$ -S /bin/bash
>>>>>> #$ -pe pmpi 16
>>>>>> #$ -cwd
>>>>>> #$ -o lsdyna.out
>>>>>> #$ -e lsdyna.err
>>>>>> ###
>>>>>> #$ -q test.q
>>>>>> ### -notify
>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>> ARGS="i=test.k"
>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>
>>>>>>
>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>> qname                 test.q
>>>>>> hostlist              mnode01 mnode02
>>>>>> seq_no                0
>>>>>> load_thresholds       np_load_avg=1.75
>>>>>> suspend_thresholds    NONE
>>>>>> nsuspend              1
>>>>>> suspend_interval      00:05:00
>>>>>> priority              0
>>>>>> min_cpu_interval      00:05:00
>>>>>> processors            UNDEFINED
>>>>>> qtype                 BATCH INTERACTIVE
>>>>>> ckpt_list             NONE
>>>>>> pe_list               pmpi
>>>>>> rerun                 FALSE
>>>>>> slots                 8
>>>>>> tmpdir                /tmp
>>>>>> shell                 /bin/bash
>>>>>> prolog                NONE
>>>>>> epilog                NONE
>>>>>> shell_start_mode      unix_behavior
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>> notify                00:00:60
>>>>>> owner_list            NONE
>>>>>> user_lists            NONE
>>>>>> xuser_lists           NONE
>>>>>> subordinate_list      NONE
>>>>>> complex_values        NONE
>>>>>> projects              NONE
>>>>>> xprojects             NONE
>>>>>> calendar              NONE
>>>>>> initial_state         default
>>>>>> s_rt                  INFINITY
>>>>>> h_rt                  INFINITY
>>>>>> s_cpu                 INFINITY
>>>>>> h_cpu                 INFINITY
>>>>>> s_fsize               INFINITY
>>>>>> h_fsize               INFINITY
>>>>>> s_data                INFINITY
>>>>>> h_data                INFINITY
>>>>>> s_stack               INFINITY
>>>>>> h_stack               INFINITY
>>>>>> s_core                INFINITY
>>>>>> h_core                INFINITY
>>>>>> s_rss                 INFINITY
>>>>>> h_rss                 INFINITY
>>>>>> s_vmem                INFINITY
>>>>>> h_vmem                INFINITY

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
