On 07.03.2014, at 17:19, Petar Penchev wrote:

> I have this in /opt/gridengine/mpi/pmpi/rsh:
> if [ x$just_wrap = x ]; then
>   if [ $minus_n -eq 1 ]; then
>      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>   else
>      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>   fi
> else
> 
> and this in /opt/gridengine/mpi/pmpi/startpmpi.sh
> #
> # Make script wrapper for 'rsh' available in jobs tmp dir
> #
> if [ $catch_rsh = 1 ]; then
>   rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
> 
> 
> 
> These are the files created when I submit a job:
> -rw-r--r-- 1 petar users  128 Mar  7 16:54 machines
> -rw------- 1 petar users 2865 Mar  7 16:54 mpiafuT79Rl
> -rw-r--r-- 1 petar users    0 Mar  7 16:54 mpijob_petar_29112
> lrwxrwxrwx 1 petar users   28 Mar  7 16:54 ssh ->
> /opt/gridengine/mpi/pmpi/rsh

Looks almost perfect. But the link is named `ssh`. Then the:

export MPI_REMSH=rsh

is either not necessary or should also be defined as "ssh". As said: you could 
name the link "foo" and set MPI_REMSH to "foo" - it's just a name.

-- Reuti


> [petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode01
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test  -np 1 -h mnode02
> "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k

Okay, this file was already assembled by Platform MPI out of
$TMPDIR/machines.
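
A quick cross-check (a sketch): putting

cat $TMPDIR/machines

into the jobscript should show the same hosts - with 16 granted slots and
$fill_up you would expect eight mnode01 lines and eight mnode02 lines,
mirroring the appfile above.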


> What do you mean exactly by "...; resp. hostname"? Do I have to add
> something else?

It's only a precaution: the creation of the `hostname` wrapper should also 
point to the correct location in the pmpi directory - in case you switch it 
on later.
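
A minimal sketch of the analogous block, assuming your startpmpi.sh keeps the
naming of the stock startmpi.sh (catch_hostname / hostname_wrapper):

#
# Make script wrapper for 'hostname' available in jobs tmp dir
#
if [ $catch_hostname = 1 ]; then
  # assumption: the link step mirrors the one for the rsh wrapper
  hostname_wrapper=$SGE_ROOT/mpi/pmpi/hostname
  ln -s $hostname_wrapper $TMPDIR/hostname
fi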


> And now, as you suggested, I changed the tmpdir to be local on all nodes,
> but I still get this error.

About "command not found"?

-- Reuti


> Cheers,
> Petar
> 
> 
> 
> On 03/07/2014 04:20 PM, Reuti wrote:
>> On 07.03.2014, at 15:57, Petar Penchev wrote:
>> 
>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. 
>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>> 
>>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>>> HP-MPI), and I am controlling which one to use through modules.
>>> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>>> 
>>> (You copied rsh/hostname to pmpi too?)
>>> 
>>> Yes, both are there.
>>> 
>>> control_slaves TRUE
>>> Now this is also set.
>> Good.
>> 
>> 
>>> so it should be accessible when your job starts.
>>> 
>>> 
>>> As you suggested I have added in my submit script 'export
>>> PATH=/export/apps/platform_mpi/bin:$PATH' and now the rsh error
>>> disappeared. Adding only the job tmp dir didn't work (export
>>> PATH=/export/apps/platform_mpi/bin:$TMPDIR).
>>> The output is now
>>> 
>>> echo $PATH
>>> 
>>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
>> Okay, here we have /home/tmp/33108.1.test.q, which looks like the scratch 
>> space on the node. But: this is in /home and therefore on NFS? It would be 
>> better if it were local on each node.
>> 
>> OTOH: in the queue definition you posted I see "tmpdir                /tmp" 
>> - is /tmp a symbolic link to /home/tmp?
>> 
>> 
>>> But I have another problem. After I submit a simulation, in the log file
>>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>>> shell commands exited with non-zero status, which may indicate a remote
>>> access problem."
>>> 
>>> Which protocol does mpirun use to communicate between the nodes?
>> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit 
>> ...`. To clarify: there is no `rsh` in the game. We could tell Platform MPI 
>> to use "foo" to access a node and in the startmpi.sh we create a symbolic 
>> link "foo" to point to a routine "baz" which calls `qrsh -inherit ...` in 
>> the end.
>> 
>> 
>>> I checked,
>>> and I can log in via ssh without a password from the head node to the
>>> nodes and between the nodes.
>> rsh or ssh is not necessary if you use a tight integration. In my clusters 
>> it's always disabled. The idea is: we tell Platform MPI to use rsh, but this 
>> will in reality start the rsh wrapper /opt/gridengine/mpi/pmpi/rsh, which is 
>> pointed to by the symbolic link created in /home/tmp/33108.1.test.q. The part:
>> 
>> #
>> # Make script wrapper for 'rsh' available in jobs tmp dir
>> #
>> if [ $catch_rsh = 1 ]; then
>>   rsh_wrapper=$SGE_ROOT/mpi/rsh
>> 
>> in /opt/gridengine/mpi/pmpi/startmpi.sh points to 
>> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
>> 
>> -- Reuti
>> 
>> 
>>> Thanks,
>>> Petar
>>> 
>>> On 03/07/2014 02:39 PM, Reuti wrote:
>>>> On 07.03.2014, at 13:20, Petar Penchev wrote:
>>>> 
>>>>> I have added the -catch_rsh to the PE and now when I start a sim
>>>> Good.
>>>> 
>>>> 
>>>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
>>>>> Command not found' in the lsdyna.err.
>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. 
>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>> 
>>>> The path to `rsh` is set up by the wrapper, so it should be accessible 
>>>> when your job starts. Can you please add to your jobscript:
>>>> 
>>>> echo $PATH
>>>> 
>>>> The $TMPDIR of the job on the node should be included there, and therein 
>>>> the `rsh` should exist.
>>>> 
>>>> BTW: I'm not sure about your application, but several need all 
>>>> environment variables from the master node of the parallel job to also be 
>>>> set for the slaves. This can be achieved by including "-V" in the `qrsh 
>>>> -inherit ...` calls near the end of /opt/gridengine/mpi/pmpi/rsh
>>>> 
>>>> (You copied rsh/hostname to pmpi too?)
>>>> 
>>>> 
>>>>> Petar
>>>>> 
>>>>> [petar@rocks test]$ cat lsdyna.err
>>>>> mpirun: rsh: Command not found
>>>>> 
>>>>> [petar@rocks test]$ cat lsdyna.out
>>>>> -catch_rsh
>>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> Error: Unknown option -np
>>>>> 
>>>>> [root@rocks test]# qconf -mp pmpi
>>>>> pe_name            pmpi
>>>>> slots              9999
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>>> $pe_hostfile
>>>>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>>> allocation_rule    $fill_up
>>>>> control_slaves     FALSE
>>>> control_slaves TRUE
>>>> 
>>>> Otherwise the `qrsh -inherit ...` will fail.
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> job_is_first_task  TRUE
>>>>> urgency_slots      min
>>>>> accounting_summary TRUE
>>>>> 
>>>>> 
>>>>> 
>>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On 07.03.2014, at 12:28, Petar Penchev wrote:
>>>>>> 
>>>>>>> I have a rocks-cluster 6.1 using OGS2011.11p1 and I am trying to use the
>>>>>>> PlatformMPI parallel libraries. My problem is that when I submit a job
>>>>>>> using qsub test.sh, the job starts only on one node with 16 processes
>>>>>>> and not on both nodes. The -pe pmpi, which I am using for now, is only a
>>>>>>> copy of mpi.
>>>>>> Does the definition of the PE pmpi also include the -catch_rsh? The 
>>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1) 
>>>>>> format, which is created by /usr/sge/mpi/startmpi.sh
>>>>>> 
>>>>>> In addition you need the following settings for a tight integration. 
>>>>>> Please try:
>>>>>> 
>>>>>> ...
>>>>>> export MPI_REMSH=rsh
>>>>>> export MPI_TMPDIR=$TMPDIR
>>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> What am I missing? Does anyone have a working -pe submit script, or some
>>>>>>> hints on how to make this work?
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> Petar
>>>>>>> 
>>>>>>> [root@rocks mpi]# test.sh
>>>>>>> #!/bin/bash
>>>>>>> #$ -N lsdyna
>>>>>>> #$ -S /bin/bash
>>>>>>> #$ -pe pmpi 16
>>>>>>> #$ -cwd
>>>>>>> #$ -o lsdyna.out
>>>>>>> #$ -e lsdyna.err
>>>>>>> ###
>>>>>>> #$ -q test.q
>>>>>>> ### -notify
>>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>>> ARGS="i=test.k"
>>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>> 
>>>>>>> 
>>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>>> qname                 test.q
>>>>>>> hostlist              mnode01 mnode02
>>>>>>> seq_no                0
>>>>>>> load_thresholds       np_load_avg=1.75
>>>>>>> suspend_thresholds    NONE
>>>>>>> nsuspend              1
>>>>>>> suspend_interval      00:05:00
>>>>>>> priority              0
>>>>>>> min_cpu_interval      00:05:00
>>>>>>> processors            UNDEFINED
>>>>>>> qtype                 BATCH INTERACTIVE
>>>>>>> ckpt_list             NONE
>>>>>>> pe_list               pmpi
>>>>>>> rerun                 FALSE
>>>>>>> slots                 8
>>>>>>> tmpdir                /tmp
>>>>>>> shell                 /bin/bash
>>>>>>> prolog                NONE
>>>>>>> epilog                NONE
>>>>>>> shell_start_mode      unix_behavior
>>>>>>> starter_method        NONE
>>>>>>> suspend_method        NONE
>>>>>>> resume_method         NONE
>>>>>>> terminate_method      NONE
>>>>>>> notify                00:00:60
>>>>>>> owner_list            NONE
>>>>>>> user_lists            NONE
>>>>>>> xuser_lists           NONE
>>>>>>> subordinate_list      NONE
>>>>>>> complex_values        NONE
>>>>>>> projects              NONE
>>>>>>> xprojects             NONE
>>>>>>> calendar              NONE
>>>>>>> initial_state         default
>>>>>>> s_rt                  INFINITY
>>>>>>> h_rt                  INFINITY
>>>>>>> s_cpu                 INFINITY
>>>>>>> h_cpu                 INFINITY
>>>>>>> s_fsize               INFINITY
>>>>>>> h_fsize               INFINITY
>>>>>>> s_data                INFINITY
>>>>>>> h_data                INFINITY
>>>>>>> s_stack               INFINITY
>>>>>>> h_stack               INFINITY
>>>>>>> s_core                INFINITY
>>>>>>> h_core                INFINITY
>>>>>>> s_rss                 INFINITY
>>>>>>> h_rss                 INFINITY
>>>>>>> s_vmem                INFINITY
>>>>>>> h_vmem                INFINITY
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
