I saw that the right options for running mpiexec are -n $NSLOTS and -host
'file'.
But now when I submit a simulation, the slots get allocated by SGE (I
suppose), but the simulation doesn't start. When I log onto mnode01,
I see only a single 'mpirun' process, whereas I expected to see the
ls-dyna_mpp* processes.

[petar@rocks test]$ ssh mnode01 ps -e f -o pid,ppid,pgrp,command
21989  1777 21989  \_ sge_shepherd-32746 -bg
22026 21989 22026      \_ /bin/bash
/opt/gridengine/default/spool/mnode01/job_scripts/32746
22027 22026 22026          \_ /bin/sh
/export/apps/platform_mpi/bin/mpiexec -n 16 -host
/home/tmp/32746.1.test.q/machines
/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe
i=/home/petar/mmew/test/main.k
22030 22027 22026              \_ /export/apps/platform_mpi/bin/mpirun
-f /home/tmp/32746.1.test.q/mpiexec.22027

[petar@rocks test]$ ssh mnode01 cat /home/tmp/32746.1.test.q/mpiexec.22027
 -np 16 -h /home/tmp/32746.1.test.q/machines
/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe
i=/home/petar/mmew/test/main.k
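
For reference, the machines file mentioned above is in MPICH(1) format, i.e.
simply one hostname per granted slot. A minimal sketch of how such a list can
be derived from an SGE pe_hostfile (the hostnames and slot counts below are
made-up placeholders, not taken from this job):

```shell
# Expand an SGE pe_hostfile (host, slots, queue, processor-range per line)
# into an MPICH(1)-style machines list: one hostname line per slot.
# The pe_hostfile content here is a made-up placeholder.
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
mnode01 2 test.q@mnode01 UNDEFINED
mnode02 2 test.q@mnode02 UNDEFINED
EOF
awk '{ for (i = 0; i < $2; i++) print $1 }' "$pe_hostfile"
rm -f "$pe_hostfile"
```

This is roughly what the -catch_rsh start script does for you; the sketch is
only meant to show the expected file format.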


Does anyone know why the job doesn't run?
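
A side note on the earlier 'mpirun: rsh: Command not found': if the PE start
script places an rsh wrapper into $TMPDIR, the job script has to prepend to
PATH rather than replace it, or mpirun cannot find the wrapper. A minimal
sketch (the MPI_ROOT path matches the one used in the job script; that $TMPDIR
holds an rsh wrapper is an assumption about the tight-integration setup):

```shell
# Sketch: extend PATH instead of overwriting it, so the Platform MPI
# binaries, any rsh wrapper placed into $TMPDIR by the PE, and the system
# directories all stay visible to mpirun.
export MPI_ROOT=/export/apps/platform_mpi
export PATH="$TMPDIR:$MPI_ROOT/bin:$PATH"
```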

Thanks,
Petar


On 03/07/2014 01:20 PM, Petar Penchev wrote:
> Hi Reuti,
>
> thanks for the quick reply.
>
> I have added the -catch_rsh to the PE, and now when I start a simulation
> (mpiexec -np $NSLOTS...) I see 'Error: Unknown option -np' in the
> lsdyna.out file. When I use 'mpirun -np $NSLOTS...' I see 'mpirun: rsh:
> Command not found' in lsdyna.err.
>
> Petar
>
> [petar@rocks test]$ cat lsdyna.err
> mpirun: rsh: Command not found
>
> [petar@rocks test]$ cat lsdyna.out
> -catch_rsh
> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
> mnode01
> mnode01
> mnode01
> mnode01
> mnode01
> mnode01
> mnode01
> mnode01
> mnode02
> mnode02
> mnode02
> mnode02
> mnode02
> mnode02
> mnode02
> mnode02
> Error: Unknown option -np
>
> [root@rocks test]# qconf -mp pmpi
> pe_name            pmpi
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
> $pe_hostfile
> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
> allocation_rule    $fill_up
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary TRUE
>
>
>
> On 03/07/2014 12:49 PM, Reuti wrote:
>> Hi,
>>
>> Am 07.03.2014 um 12:28 schrieb Petar Penchev:
>>
>>> I have a Rocks cluster 6.1 running OGS2011.11p1, and I am trying to use
>>> the Platform MPI parallel libraries. My problem is that when I submit a
>>> job using qsub test.sh, the job starts with 16 processes on one node
>>> only, not on both nodes. The PE pmpi, which I am using for now, is just
>>> a copy of mpi.
>> Does the definition of the PE pmpi also include the -catch_rsh? Recent
>> IBM/Platform MPI can cope with a machine file in MPICH(1) format, which
>> is created by /usr/sge/mpi/startmpi.sh.
>>
>> In addition you need the following settings for a tight integration. Please 
>> try:
>>
>> ...
>> export MPI_REMSH=rsh
>> export MPI_TMPDIR=$TMPDIR
>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>
>> -- Reuti
>>
>>
>>> What am I missing? Does anyone have a working -pe submit script, or
>>> some hints on how to make this work?
>>>
>>> Thanks in advance,
>>> Petar
>>>
>>> [root@rocks mpi]# test.sh
>>> #!/bin/bash
>>> #$ -N lsdyna
>>> #$ -S /bin/bash
>>> #$ -pe pmpi 16
>>> #$ -cwd
>>> #$ -o lsdyna.out
>>> #$ -e lsdyna.err
>>> ###
>>> #$ -q test.q
>>> ### -notify
>>> export MPI_ROOT=/export/apps/platform_mpi
>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>> export PATH=/export/apps/platform_mpi/bin
>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>> ARGS="i=test.k"
>>> mpirun -np $NSLOTS $BIN $ARGS
>>>
>>>
>>> [root@rocks mpi]# qconf -sq test.q
>>> qname                 test.q
>>> hostlist              mnode01 mnode02
>>> seq_no                0
>>> load_thresholds       np_load_avg=1.75
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              0
>>> min_cpu_interval      00:05:00
>>> processors            UNDEFINED
>>> qtype                 BATCH INTERACTIVE
>>> ckpt_list             NONE
>>> pe_list               pmpi
>>> rerun                 FALSE
>>> slots                 8
>>> tmpdir                /tmp
>>> shell                 /bin/bash
>>> prolog                NONE
>>> epilog                NONE
>>> shell_start_mode      unix_behavior
>>> starter_method        NONE
>>> suspend_method        NONE
>>> resume_method         NONE
>>> terminate_method      NONE
>>> notify                00:00:60
>>> owner_list            NONE
>>> user_lists            NONE
>>> xuser_lists           NONE
>>> subordinate_list      NONE
>>> complex_values        NONE
>>> projects              NONE
>>> xprojects             NONE
>>> calendar              NONE
>>> initial_state         default
>>> s_rt                  INFINITY
>>> h_rt                  INFINITY
>>> s_cpu                 INFINITY
>>> h_cpu                 INFINITY
>>> s_fsize               INFINITY
>>> h_fsize               INFINITY
>>> s_data                INFINITY
>>> h_data                INFINITY
>>> s_stack               INFINITY
>>> h_stack               INFINITY
>>> s_core                INFINITY
>>> h_core                INFINITY
>>> s_rss                 INFINITY
>>> h_rss                 INFINITY
>>> s_vmem                INFINITY
>>> h_vmem                INFINITY
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
