On 07.03.2014 at 15:57, Petar Penchev wrote:

> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But
> I wonder: do you have a second MPI library installed: `which mpiexec`?
>
> In fact I also have other MPI libraries (openMPI, PlatformMPI and
> HP-MPI) and I am controlling which one to use through modules.
> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>
> (You copied rsh/hostname to pmpi too?)
>
> Yes, both are there.
>
> control_slaves TRUE
> now this is also set
Good.

> so it should be accessible when your job starts.
>
> As you suggested I have added in my submit script 'export
> PATH=/export/apps/platform_mpi/bin:$PATH' and now the rsh error
> disappeared. Adding only the job tmp dir didn't work (export
> PATH=/export/apps/platform_mpi/bin:$TMPDIR).
> The output is now
>
> echo $PATH
>
> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin

Okay, here we have the /home/tmp/33108.1.test.q which looks like the scratch
space on the node. But: this is in /home and hence on NFS? It would be better
if it were local on each node. OTOH: in the queue definition you posted I see
"tmpdir /tmp" - is /tmp a symbolic link to /home/tmp?

> But I have another problem. After I submit a simulation, in the log file
> I have this error: "10.197.9.32: Connection refused" (this is the IP of
> mnode02) and in the error log this: "mpirun: Warning one or more remote
> shell commands exited with non-zero status, which may indicate a remote
> access problem."
>
> Which protocol is mpirun using to communicate between the nodes?

By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit ...`.
To clarify: there is no real `rsh` in the game. We could tell Platform MPI to
use "foo" to access a node, and in the startmpi.sh we create a symbolic link
"foo" pointing to a routine "baz" which calls `qrsh -inherit ...` in the end.

> I checked
> and I can log in via ssh without a password from the head node to the nodes
> and between the nodes.

rsh or ssh is not necessary if you use a tight integration. In my clusters
it's always disabled.

The idea is: we tell Platform MPI to use rsh, which will in fact start the
rsh wrapper in /opt/gridengine/mpi/pmpi/rsh, pointed to by the symbolic link
created in /home/tmp/33108.1.test.q

The part:

#
# Make script wrapper for 'rsh' available in jobs tmp dir
#
if [ $catch_rsh = 1 ]; then
   rsh_wrapper=$SGE_ROOT/mpi/rsh

in /opt/gridengine/mpi/pmpi/startmpi.sh points to /opt/gridengine/mpi/pmpi/rsh,
where you added the -V (and likewise for the hostname wrapper)?

-- Reuti

> Thanks,
> Petar
>
> On 03/07/2014 02:39 PM, Reuti wrote:
>> On 07.03.2014 at 13:20, Petar Penchev wrote:
>>
>>> I have added the -catch_rsh to the PE and now when I start a sim
>> Good.
>>
>>
>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see this 'mpirun: rsh:
>>> Command not found' in the lsdyna.err.
>> Aha, indeed. This MPI variant provides only `mpirun` in my installation. But
>> I wonder: do you have a second MPI library installed: `which mpiexec`?
>>
>> The path to `rsh` is set up by the wrapper, so it should be accessible when
>> your job starts. Can you please add to your job script:
>>
>> echo $PATH
>>
>> The $TMPDIR of the job on the node should be included there, and therein the
>> `rsh` should exist.
>>
>> BTW: I'm not sure about your application, but several applications need all
>> environment variables from the master node of the parallel job to also be
>> set for the slaves. This can be achieved by including "-V" for `qrsh
>> -inherit ...` near the end of /opt/gridengine/mpi/pmpi/rsh
>>
>> (You copied rsh/hostname to pmpi too?)
>>
>>
>>> Petar
>>>
>>> [petar@rocks test]$ cat lsdyna.err
>>> mpirun: rsh: Command not found
>>>
>>> [petar@rocks test]$ cat lsdyna.out
>>> -catch_rsh
>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode01
>>> mnode02
>>> mnode02
>>> mnode02
>>> mnode02
>>> mnode02
>>> mnode02
>>> mnode02
>>> mnode02
>>> Error: Unknown option -np
>>>
>>> [root@rocks test]# qconf -mp pmpi
>>> pe_name            pmpi
>>> slots              9999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh
>>>                    $pe_hostfile
>>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>> allocation_rule    $fill_up
>>> control_slaves     FALSE
>>
>> control_slaves TRUE
>>
>> Otherwise the `qrsh -inherit ...` will fail.
>>
>> -- Reuti
>>
>>
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>>
>>>
>>>
>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>> Hi,
>>>>
>>>> On 07.03.2014 at 12:28, Petar Penchev wrote:
>>>>
>>>>> I have a rocks-cluster 6.1 using OGS2011.11p1 and I am trying to use the
>>>>> PlatformMPI parallel libraries. My problem is that when I submit a job
>>>>> using qsub test.sh, the job starts only on one node with 16 processes
>>>>> and not on both nodes. The -pe pmpi, which I am using for now, is only a
>>>>> copy of mpi.
>>>>
>>>> Does the definition of the PE pmpi also include the -catch_rsh? The recent
>>>> IBM/Platform-MPI can cope with a machine file in the MPICH(1) format,
>>>> which is created by the /usr/sge/mpi/startmpi.sh
>>>>
>>>> In addition you need the following settings for a tight integration.
>>>> Please try:
>>>>
>>>> ...
>>>> export MPI_REMSH=rsh
>>>> export MPI_TMPDIR=$TMPDIR
>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> What am I missing? Does anyone have a working -pe submit script, or some
>>>>> hints on how to make this work?
>>>>>
>>>>> Thanks in advance,
>>>>> Petar
>>>>>
>>>>> [root@rocks mpi]# test.sh
>>>>> #!/bin/bash
>>>>> #$ -N lsdyna
>>>>> #$ -S /bin/bash
>>>>> #$ -pe pmpi 16
>>>>> #$ -cwd
>>>>> #$ -o lsdyna.out
>>>>> #$ -e lsdyna.err
>>>>> ###
>>>>> #$ -q test.q
>>>>> ### -notify
>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>> ARGS="i=test.k"
>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>
>>>>>
>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>> qname                 test.q
>>>>> hostlist              mnode01 mnode02
>>>>> seq_no                0
>>>>> load_thresholds       np_load_avg=1.75
>>>>> suspend_thresholds    NONE
>>>>> nsuspend              1
>>>>> suspend_interval      00:05:00
>>>>> priority              0
>>>>> min_cpu_interval      00:05:00
>>>>> processors            UNDEFINED
>>>>> qtype                 BATCH INTERACTIVE
>>>>> ckpt_list             NONE
>>>>> pe_list               pmpi
>>>>> rerun                 FALSE
>>>>> slots                 8
>>>>> tmpdir                /tmp
>>>>> shell                 /bin/bash
>>>>> prolog                NONE
>>>>> epilog                NONE
>>>>> shell_start_mode      unix_behavior
>>>>> starter_method        NONE
>>>>> suspend_method        NONE
>>>>> resume_method         NONE
>>>>> terminate_method      NONE
>>>>> notify                00:00:60
>>>>> owner_list            NONE
>>>>> user_lists            NONE
>>>>> xuser_lists           NONE
>>>>> subordinate_list      NONE
>>>>> complex_values        NONE
>>>>> projects              NONE
>>>>> xprojects             NONE
>>>>> calendar              NONE
>>>>> initial_state         default
>>>>> s_rt                  INFINITY
>>>>> h_rt                  INFINITY
>>>>> s_cpu                 INFINITY
>>>>> h_cpu                 INFINITY
>>>>> s_fsize               INFINITY
>>>>> h_fsize               INFINITY
>>>>> s_data                INFINITY
>>>>> h_data                INFINITY
>>>>> s_stack               INFINITY
>>>>> h_stack               INFINITY
>>>>> s_core                INFINITY
>>>>> h_core                INFINITY
>>>>> s_rss                 INFINITY
>>>>> h_rss                 INFINITY
>>>>> s_vmem                INFINITY
>>>>> h_vmem                INFINITY
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
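
For reference, this is the pmpi PE definition posted in the thread with the
one change Reuti asked for applied (control_slaves switched to TRUE so that
`qrsh -inherit ...` is allowed); all other values are exactly as Petar
posted them:

pe_name            pmpi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE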
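
The "-V" Reuti mentions is added to the `qrsh -inherit ...` calls near the
end of /opt/gridengine/mpi/pmpi/rsh. The exact variable names differ between
Grid Engine releases, so the following is only a sketch of where the option
goes; check the actual wrapper for the real lines:

# near the end of /opt/gridengine/mpi/pmpi/rsh: add -V so the slave tasks
# inherit the environment of the job's master task (variable names below
# are illustrative and may differ in your copy of the wrapper)
if [ $minus_n -ne 0 ]; then
   exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
else
   exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
fi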
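
Putting the suggestions from the thread together (prepend to PATH instead of
overwriting it, MPI_REMSH/MPI_TMPDIR for the tight integration, and the
machine file written by the PE start script, as Reuti proposed), a revised
test.sh could look roughly like the sketch below. It is untested and simply
keeps Petar's paths, binary name and the -machinefile option from the thread:

#!/bin/bash
#$ -N lsdyna
#$ -S /bin/bash
#$ -pe pmpi 16
#$ -cwd
#$ -o lsdyna.out
#$ -e lsdyna.err
#$ -q test.q

export MPI_ROOT=/export/apps/platform_mpi
export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64

# prepend rather than overwrite, so the job's $TMPDIR (which holds the
# rsh wrapper link created by the start script) stays on the PATH
export PATH=/export/apps/platform_mpi/bin:$PATH

# tight integration: route Platform MPI's remote shell to the rsh wrapper
# and keep its scratch files in the job's tmp directory
export MPI_REMSH=rsh
export MPI_TMPDIR=$TMPDIR

BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
ARGS="i=test.k"

# this Platform MPI installation provides only mpirun (no mpiexec);
# the machines file is written into $TMPDIR by the PE start script
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS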
