I have this in /opt/gridengine/mpi/pmpi/rsh:
if [ x$just_wrap = x ]; then
   if [ $minus_n -eq 1 ]; then
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
   else
      echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
   fi
else
and this in /opt/gridengine/mpi/pmpi/startpmpi.sh:
#
# Make script wrapper for 'rsh' available in jobs tmp dir
#
if [ $catch_rsh = 1 ]; then
   rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
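(My understanding of what follows that block, as a rough sketch - the lines
below are modelled on the stock $SGE_ROOT/mpi/startmpi.sh and are NOT copied
from my startpmpi.sh, so treat the exact wording as an assumption:)

   # assumed continuation, modelled on the stock startmpi.sh:
   # refuse to start if the wrapper is not executable
   if [ ! -x $rsh_wrapper ]; then
      echo "startpmpi.sh: can't execute $rsh_wrapper" >&2
      exit 1
   fi
   # link the wrapper into the job's scratch dir so it is found via $PATH;
   # judging from the listing below, my copy creates the link under the name "ssh"
   ln -s $rsh_wrapper $TMPDIR/ssh
fi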
These are the files created when I submit a job:
-rw-r--r-- 1 petar users 128 Mar 7 16:54 machines
-rw------- 1 petar users 2865 Mar 7 16:54 mpiafuT79Rl
-rw-r--r-- 1 petar users 0 Mar 7 16:54 mpijob_petar_29112
lrwxrwxrwx 1 petar users 28 Mar 7 16:54 ssh -> /opt/gridengine/mpi/pmpi/rsh
[petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
-e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
What exactly do you mean by "...; resp. hostname"? Do I have to add
something else?
And now, as you suggested, I changed the tmpdir to be local on all nodes,
but I still get this error.
Cheers,
Petar
On 03/07/2014 04:20 PM, Reuti wrote:
> Am 07.03.2014 um 15:57 schrieb Petar Penchev:
>
>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But
>> I wonder: do you have a second MPI library installed: `which mpiexec`?
>>
>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>> HP-MPI) and I am controlling which one to use through modules.
>> `which mpiexec` returns: '/export/apps/platform_mpi/bin/mpiexec'
>>
>> (You copied rsh/hostname to pmpi too?)
>>
>> Yes, both are there.
>>
>> control_slaves TRUE
>> Now this is also set.
> Good.
>
>
>> so it should be accessible when your job starts.
>>
>>
>> As you suggested, I have added 'export PATH=/export/apps/platform_mpi/bin:$PATH'
>> to my submit script and now the rsh error has disappeared. Adding only the
>> job tmp dir didn't work (export PATH=/export/apps/platform_mpi/bin:$TMPDIR).
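>> (For the record, the relevant line in my submit script now reads roughly like
>> the sketch below; adding :$TMPDIR explicitly is my own assumption, to make sure
>> the rsh/ssh wrapper link stays reachable:)
>>
>>    # submit script: mpirun and the wrapper link both need to be on the PATH
>>    export PATH=/export/apps/platform_mpi/bin:$TMPDIR:$PATH
>>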
>> The output is now
>>
>> echo $PATH
>>
>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
> Okay, here we have /home/tmp/33108.1.test.q, which looks like the scratch
> space on the node. But: this is in /home and hence on NFS? It would be
> better if it were local on each node.
>
> OTOH: in the queue definition you posted I see "tmpdir /tmp" -
> is /tmp a symbolic link to /home/tmp?
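>
> (A quick way to check that on a node, as a sketch - plain shell, nothing
> SGE-specific assumed:)
>
>    # on mnode01/mnode02: does /tmp resolve to a local disk or to /home/tmp?
>    readlink -f /tmp
>    df -h /tmp
>    # and what the queue actually hands out as scratch space:
>    qconf -sq test.q | grep tmpdir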
>
>
>> But I have another problem. After I submit a simulation, in the log file
>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>> shell commands exited with non-zero status, which may indicate a remote
>> access problem."
>>
>> Which protocol does mpirun use to communicate between nodes?
> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit ...`.
> To clarify: there is no `rsh` in the game. We could tell Platform MPI to use
> "foo" to access a node and in the startmpi.sh we create a symbolic link "foo"
> to point to a routine "baz" which calls `qrsh -inherit ...` in the end.
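>
> (As a sketch of that idea - "foo" is of course just a placeholder name:)
>
>    # start script: make the name the MPI library will call point at the wrapper
>    ln -s $SGE_ROOT/mpi/pmpi/rsh $TMPDIR/foo
>    # job script: tell Platform MPI to use that name for remote starts
>    export MPI_REMSH=foo
>
> (the wrapper then ends up exec'ing `qrsh -inherit ...` towards the chosen host.)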
>
>
>> I checked, and I can log in via ssh without a password from the head node
>> to the compute nodes and between the nodes.
> rsh or ssh is not necessary if you use a tight integration. In my clusters
> it's always disabled. The idea is: we tell Platform MPI to use rsh; this will
> in reality start the rsh wrapper in /opt/gridengine/mpi/pmpi/rsh, which is
> pointed to by the symbolic link created in /home/tmp/33108.1.test.q. The part:
>
> #
> # Make script wrapper for 'rsh' available in jobs tmp dir
> #
> if [ $catch_rsh = 1 ]; then
> rsh_wrapper=$SGE_ROOT/mpi/rsh
>
> in /opt/gridengine/mpi/pmpi/startmpi.sh points to
> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
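>
> (One way to verify this at run time, only a sketch - the exact names to list
> are an assumption, in your case the link seems to be called "ssh":)
>
>    # show where the wrapper link(s) in the job's scratch dir really point
>    ls -l $TMPDIR/rsh $TMPDIR/ssh
>    # and confirm the wrapper itself now carries the -V
>    grep -n "qrsh" $SGE_ROOT/mpi/pmpi/rsh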
>
> -- Reuti
>
>
>> Thanks,
>> Petar
>>
>> On 03/07/2014 02:39 PM, Reuti wrote:
>>> Am 07.03.2014 um 13:20 schrieb Petar Penchev:
>>>
>>>> I have added the -catch_rsh to the PE and now when I start a sim
>>> Good.
>>>
>>>
>>>> (mpiexec -np $NSLOTS...), in the lsdyna.out file I see 'Error: Unknown
>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
>>>> Command not found' in lsdyna.err.
>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation.
>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>
>>> The path to `rsh` is set up by the wrapper, so it should be accessible when
>>> your job starts. Can you please add to your job script:
>>>
>>> echo $PATH
>>>
>>> The $TMPDIR of the job on the node should be included there, and therein
>>> the `rsh` should exist.
>>>
>>> BTW: I'm not sure about your application, but several need all
>>> environment variables from the master node of the parallel job to also be
>>> set for the slaves. This can be achieved by including "-V" in the `qrsh
>>> -inherit ...` calls near the end of /opt/gridengine/mpi/pmpi/rsh
>>>
>>> (You copied rsh/hostname to pmpi too?)
>>>
>>>
>>>> Petar
>>>>
>>>> [petar@rocks test]$ cat lsdyna.err
>>>> mpirun: rsh: Command not found
>>>>
>>>> [petar@rocks test]$ cat lsdyna.out
>>>> -catch_rsh
>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode01
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> mnode02
>>>> Error: Unknown option -np
>>>>
>>>> [root@rocks test]# qconf -mp pmpi
>>>> pe_name pmpi
>>>> slots 9999
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>> $pe_hostfile
>>>> stop_proc_args /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>> allocation_rule $fill_up
>>>> control_slaves FALSE
>>> control_slaves TRUE
>>>
>>> Otherwise the `qrsh -inherit ...` will fail.
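>>>
>>> (To check resp. change it, as a sketch:)
>>>
>>>    qconf -sp pmpi | grep control_slaves   # show the current setting
>>>    qconf -mp pmpi                         # edit the PE and set: control_slaves TRUE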
>>>
>>> -- Reuti
>>>
>>>
>>>> job_is_first_task TRUE
>>>> urgency_slots min
>>>> accounting_summary TRUE
>>>>
>>>>
>>>>
>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> Am 07.03.2014 um 12:28 schrieb Petar Penchev:
>>>>>
>>>>>> I have a Rocks cluster 6.1 using OGS 2011.11p1 and I am trying to use the
>>>>>> Platform MPI parallel libraries. My problem is that when I submit a job
>>>>>> using qsub test.sh, the job starts only on one node with 16 processes
>>>>>> and not on both nodes. The -pe pmpi which I am using is, for now, only a
>>>>>> copy of mpi.
>>>>> Does the definition of the PE pmpi also include the -catch_rsh? The
>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1)
>>>>> format, which is created by /usr/sge/mpi/startmpi.sh
>>>>>
>>>>> In addition you need the following settings for a tight integration.
>>>>> Please try:
>>>>>
>>>>> ...
>>>>> export MPI_REMSH=rsh
>>>>> export MPI_TMPDIR=$TMPDIR
>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> What am I missing? Does anyone have a working -pe submit script, or some
>>>>>> hints on how to make this work?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Petar
>>>>>>
>>>>>> [root@rocks mpi]# test.sh
>>>>>> #!/bin/bash
>>>>>> #$ -N lsdyna
>>>>>> #$ -S /bin/bash
>>>>>> #$ -pe pmpi 16
>>>>>> #$ -cwd
>>>>>> #$ -o lsdyna.out
>>>>>> #$ -e lsdyna.err
>>>>>> ###
>>>>>> #$ -q test.q
>>>>>> ### -notify
>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>> ARGS="i=test.k"
>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>
>>>>>>
>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>> qname test.q
>>>>>> hostlist mnode01 mnode02
>>>>>> seq_no 0
>>>>>> load_thresholds np_load_avg=1.75
>>>>>> suspend_thresholds NONE
>>>>>> nsuspend 1
>>>>>> suspend_interval 00:05:00
>>>>>> priority 0
>>>>>> min_cpu_interval 00:05:00
>>>>>> processors UNDEFINED
>>>>>> qtype BATCH INTERACTIVE
>>>>>> ckpt_list NONE
>>>>>> pe_list pmpi
>>>>>> rerun FALSE
>>>>>> slots 8
>>>>>> tmpdir /tmp
>>>>>> shell /bin/bash
>>>>>> prolog NONE
>>>>>> epilog NONE
>>>>>> shell_start_mode unix_behavior
>>>>>> starter_method NONE
>>>>>> suspend_method NONE
>>>>>> resume_method NONE
>>>>>> terminate_method NONE
>>>>>> notify 00:00:60
>>>>>> owner_list NONE
>>>>>> user_lists NONE
>>>>>> xuser_lists NONE
>>>>>> subordinate_list NONE
>>>>>> complex_values NONE
>>>>>> projects NONE
>>>>>> xprojects NONE
>>>>>> calendar NONE
>>>>>> initial_state default
>>>>>> s_rt INFINITY
>>>>>> h_rt INFINITY
>>>>>> s_cpu INFINITY
>>>>>> h_cpu INFINITY
>>>>>> s_fsize INFINITY
>>>>>> h_fsize INFINITY
>>>>>> s_data INFINITY
>>>>>> h_data INFINITY
>>>>>> s_stack INFINITY
>>>>>> h_stack INFINITY
>>>>>> s_core INFINITY
>>>>>> h_core INFINITY
>>>>>> s_rss INFINITY
>>>>>> h_rss INFINITY
>>>>>> s_vmem INFINITY
>>>>>> h_vmem INFINITY
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users