> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But I
> wonder: do you have a second MPI library installed: `which mpiexec`?

In fact I also have other MPI libraries installed (Open MPI, Platform MPI and
HP-MPI), and I control which one is used through modules.
'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
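
For completeness, a quick sanity check (just a sketch, not something from the
earlier mails) to confirm that both commands resolve to the same Platform MPI
installation after loading the module:

# both should point into /export/apps/platform_mpi/bin
which mpirun
which mpiexec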

> (You copied rsh/hostname to pmpi too?)

Yes, both are there.

> control_slaves TRUE

Now this is also set.
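
As a side note (a small check, not from the earlier mails), the setting can be
verified from the command line without opening the editor:

# show the current PE configuration and its control_slaves value
qconf -sp pmpi | grep control_slaves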

> so it should be accessible when your job starts.


As you suggested, I have added 'export
PATH=/export/apps/platform_mpi/bin:$PATH' to my submit script, and now the
rsh error has disappeared. Setting PATH to only the Platform MPI bin
directory plus the job tmp dir (export
PATH=/export/apps/platform_mpi/bin:$TMPDIR) didn't work.
The output of 'echo $PATH' is now:

/export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
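
So the MPI-related part of the job script now looks roughly like this (a
sketch only; the MPI_REMSH and MPI_TMPDIR exports are the ones Reuti suggested
below and are not yet confirmed in my runs, and $BIN/$ARGS are as in my
original test.sh):

export MPI_ROOT=/export/apps/platform_mpi
export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
# prepend instead of replacing PATH, so the SGE-provided $TMPDIR
# (holding the rsh wrapper) and the system directories stay reachable
export PATH=/export/apps/platform_mpi/bin:$PATH
# suggested by Reuti below; not yet verified here
export MPI_REMSH=rsh
export MPI_TMPDIR=$TMPDIR
mpirun -np $NSLOTS $BIN $ARGS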


But I have another problem. After I submit a simulation, in the log file
I get this error: "10.197.9.32: Connection refused" (this is the IP of
mnode02), and in the error log this: "mpirun: Warning one or more remote
shell commands exited with non-zero status, which may indicate a remote
access problem."

Which protocol does mpirun use to communicate between the nodes? I checked,
and I can log in via ssh without a password from the head node to the
compute nodes and between the nodes.
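
In case it helps to narrow this down, a rough diagnostic I could add to the
job script (a sketch; MPI_REMSH is the variable Reuti mentioned below for
choosing the remote shell):

# which remote shell is Platform MPI told to use (empty means its built-in default)
echo "MPI_REMSH=$MPI_REMSH"
# under the tight integration, the first rsh in PATH should be the wrapper
# that startpmpi.sh copied into $TMPDIR
which rsh
ls -l $TMPDIR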

Thanks,
Petar

On 03/07/2014 02:39 PM, Reuti wrote:
> On 07.03.2014 at 13:20, Petar Penchev wrote:
>
>> I have added the -catch_rsh to the PE and now when I start a sim
> Good.
>
>
>> (mpiexec -np $NSLOTS...), in the lsdyna.out file I see 'Error: Unknown
>> option -np'. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
>> Command not found' in the lsdyna.err.
> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But 
> I wonder: do you have a second MPI library installed: `which mpiexec`?
>
> The path to `rsh` is set up by the wrapper, so it should be accessible when
> your job starts. Can you please add to your jobscript:
>
> echo $PATH
>
> The $TMPDIR of the job on the node should be included there, and therein the 
> `rsh` should exist.
>
> BTW: I'm not sure about your application, but several need all environment
> variables from the master node of the parallel job to also be set for the
> slaves. This can be achieved by including "-V" for `qrsh -inherit ...`
> near the end of /opt/gridengine/mpi/pmpi/rsh
>
> (You copied rsh/hostname to pmpi too?)
>
>
>> Petar
>>
>> [petar@rocks test]$ cat lsdyna.err
>> mpirun: rsh: Command not found
>>
>> [petar@rocks test]$ cat lsdyna.out
>> -catch_rsh
>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>> mnode01
>> mnode01
>> mnode01
>> mnode01
>> mnode01
>> mnode01
>> mnode01
>> mnode01
>> mnode02
>> mnode02
>> mnode02
>> mnode02
>> mnode02
>> mnode02
>> mnode02
>> mnode02
>> Error: Unknown option -np
>>
>> [root@rocks test]# qconf -mp pmpi
>> pe_name            pmpi
>> slots              9999
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>> $pe_hostfile
>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>> allocation_rule    $fill_up
>> control_slaves     FALSE
> control_slaves TRUE
>
> Otherwise the `qrsh -inherit ...` will fail.
>
> -- Reuti
>
>
>> job_is_first_task  TRUE
>> urgency_slots      min
>> accounting_summary TRUE
>>
>>
>>
>> On 03/07/2014 12:49 PM, Reuti wrote:
>>> Hi,
>>>
>>> On 07.03.2014 at 12:28, Petar Penchev wrote:
>>>
>>>> I have a Rocks cluster 6.1 using OGS2011.11p1 and I am trying to use the
>>>> Platform MPI parallel libraries. My problem is that when I submit a job
>>>> using qsub test.sh, the job starts only on one node with 16 processes
>>>> and not on both nodes. The -pe pmpi, which I am using for now, is only a
>>>> copy of mpi.
>>> Does the definition of the PE pmpi also include the -catch_rsh? The recent
>>> IBM/Platform-MPI can cope with a machine file in the MPICH(1) format, which
>>> is created by /usr/sge/mpi/startmpi.sh
>>>
>>> In addition you need the following settings for a tight integration. Please 
>>> try:
>>>
>>> ...
>>> export MPI_REMSH=rsh
>>> export MPI_TMPDIR=$TMPDIR
>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>
>>> -- Reuti
>>>
>>>
>>>> What am I missing? Does anyone have a working -pe submit script, or some
>>>> hints on how to make this work?
>>>>
>>>> Thanks in advance,
>>>> Petar
>>>>
>>>> [root@rocks mpi]# test.sh
>>>> #!/bin/bash
>>>> #$ -N lsdyna
>>>> #$ -S /bin/bash
>>>> #$ -pe pmpi 16
>>>> #$ -cwd
>>>> #$ -o lsdyna.out
>>>> #$ -e lsdyna.err
>>>> ###
>>>> #$ -q test.q
>>>> ### -notify
>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>> export PATH=/export/apps/platform_mpi/bin
>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>> ARGS="i=test.k"
>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>
>>>>
>>>> [root@rocks mpi]# qconf -sq test.q
>>>> qname                 test.q
>>>> hostlist              mnode01 mnode02
>>>> seq_no                0
>>>> load_thresholds       np_load_avg=1.75
>>>> suspend_thresholds    NONE
>>>> nsuspend              1
>>>> suspend_interval      00:05:00
>>>> priority              0
>>>> min_cpu_interval      00:05:00
>>>> processors            UNDEFINED
>>>> qtype                 BATCH INTERACTIVE
>>>> ckpt_list             NONE
>>>> pe_list               pmpi
>>>> rerun                 FALSE
>>>> slots                 8
>>>> tmpdir                /tmp
>>>> shell                 /bin/bash
>>>> prolog                NONE
>>>> epilog                NONE
>>>> shell_start_mode      unix_behavior
>>>> starter_method        NONE
>>>> suspend_method        NONE
>>>> resume_method         NONE
>>>> terminate_method      NONE
>>>> notify                00:00:60
>>>> owner_list            NONE
>>>> user_lists            NONE
>>>> xuser_lists           NONE
>>>> subordinate_list      NONE
>>>> complex_values        NONE
>>>> projects              NONE
>>>> xprojects             NONE
>>>> calendar              NONE
>>>> initial_state         default
>>>> s_rt                  INFINITY
>>>> h_rt                  INFINITY
>>>> s_cpu                 INFINITY
>>>> h_cpu                 INFINITY
>>>> s_fsize               INFINITY
>>>> h_fsize               INFINITY
>>>> s_data                INFINITY
>>>> h_data                INFINITY
>>>> s_stack               INFINITY
>>>> h_stack               INFINITY
>>>> s_core                INFINITY
>>>> h_core                INFINITY
>>>> s_rss                 INFINITY
>>>> h_rss                 INFINITY
>>>> s_vmem                INFINITY
>>>> h_vmem                INFINITY

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
