On 07.03.2014 at 17:19, Petar Penchev wrote:

> I have this in /opt/gridengine/mpi/pmpi/rsh:
>
>     if [ x$just_wrap = x ]; then
>        if [ $minus_n -eq 1 ]; then
>           echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>           exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
>        else
>           echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>           exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
>        fi
>     else
>
> and this in /opt/gridengine/mpi/pmpi/startpmpi.sh:
>
>     #
>     # Make script wrapper for 'rsh' available in jobs tmp dir
>     #
>     if [ $catch_rsh = 1 ]; then
>        rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh
>
> These are the files created when I submit a job:
>
> -rw-r--r-- 1 petar users  128 Mar 7 16:54 machines
> -rw------- 1 petar users 2865 Mar 7 16:54 mpiafuT79Rl
> -rw-r--r-- 1 petar users    0 Mar 7 16:54 mpijob_petar_29112
> lrwxrwxrwx 1 petar users   28 Mar 7 16:54 ssh -> /opt/gridengine/mpi/pmpi/rsh
Looks almost perfect. But the link is named `ssh`. Then the:

    export MPI_REMSH=rsh

is either not necessary, or it should also be defined as "ssh". As said: you
could name the link "foo" and set MPI_REMSH to "foo" - it's just a name.

-- Reuti


> [petar@mnode01 33318.1.test.q]$ cat /tmp/33318.1.test.q/mpiafuT79Rl
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode01 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k
> -e MPI_WORKDIR=/home/petar/test -np 1 -h mnode02 "/export/apps/lsdyna/ls-dyna_mpp" i=/home/petar/test/main.k

Okay, this file was already assembled by Platform MPI from the
$TMPDIR/machines file.


> What do you mean exactly with "...; resp. hostname"? Do I have to add
> something else?

It's only a precaution so that the creation of the `hostname` wrapper also
points to the correct location in the pmpi directory - in case you switch it
on later.


> And now, as you suggested, I changed the tmpdir to be local on all nodes,
> but I still get this error.

About "command not found"?

-- Reuti


> Cheers,
> Petar
>
> On 03/07/2014 04:20 PM, Reuti wrote:
>> On 07.03.2014 at 15:57, Petar Penchev wrote:
>>
>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation.
>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>
>>> In fact I also have other MPI libraries (openMPI, PlatformMPI and
>>> HP-MPI), and I am controlling which one to use through modules.
>>> 'which mpiexec' returns: '/export/apps/platform_mpi/bin/mpiexec'
>>>
>>> (You copied rsh/hostname to pmpi too?)
>>>
>>> Yes, both are there.
>>>
>>> control_slaves TRUE
>>> now this is also set
>> Good.
>>
>>
>>> so it should be accessible when your job starts.
>>>
>>>
>>> As you suggested I have added in my submit script 'export
>>> PATH=/export/apps/platform_mpi/bin:$PATH' and now the rsh error
>>> disappeared. Adding only the job tmp dir didn't work (export
>>> PATH=/export/apps/platform_mpi/bin:$TMPDIR).
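To make the naming point above concrete, here is a minimal sketch of the two
ends that have to agree - it is not the stock scripts themselves, and the link
name `ssh` is simply taken from the directory listing in the first message:

    # what startpmpi.sh effectively sets up: a link in the job's $TMPDIR
    # pointing to the qrsh wrapper - the name of the link is arbitrary
    ln -s $SGE_ROOT/mpi/pmpi/rsh $TMPDIR/ssh

    # what the job script then has to do: tell Platform MPI to call exactly
    # that name, so the wrapper (and thus `qrsh -inherit ...`) gets used
    export MPI_REMSH=ssh

If the link were created as `rsh` instead, then MPI_REMSH=rsh would be the
matching setting - it's just a name, as noted above.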
>>> The output is now
>>>
>>> echo $PATH
>>>
>>> /export/apps/platform_mpi/bin:/home/tmp/33108.1.test.q:/usr/local/bin:/bin:/usr/bin
>> Okay, here we have the /home/tmp/33108.1.test.q which looks like the scratch
>> space on the node. But: this is in /home and so on an NFS space? It would be
>> better if it were local on each node.
>>
>> OTOH: in the queue definition you posted I see "tmpdir /tmp"
>> - is /tmp a symbolic link to /home/tmp?
>>
>>
>>> But I have another problem. After I submit a simulation, in the log file
>>> I have this error: "10.197.9.32: Connection refused" (this is the IP of
>>> mnode02) and in the error log this: "mpirun: Warning one or more remote
>>> shell commands exited with non-zero status, which may indicate a remote
>>> access problem."
>>>
>>> Which protocol is mpirun using to communicate between the nodes?
>> By default `ssh`, but we routed it to `rsh` to map it to `qrsh -inherit
>> ...`. To clarify: there is no `rsh` in the game. We could tell Platform MPI
>> to use "foo" to access a node, and in the startmpi.sh we create a symbolic
>> link "foo" pointing to a routine "baz" which calls `qrsh -inherit ...` in
>> the end.
>>
>>
>>> I checked and I can log in via ssh without a password from the head node
>>> to the nodes and between the nodes.
>> rsh or ssh is not necessary if you use a tight integration. In my clusters
>> it's always disabled. The idea is: we tell Platform MPI to use rsh, this
>> will in reality start the rsh wrapper in /opt/gridengine/mpi/pmpi/rsh,
>> which is pointed to by the created symbolic link in
>> /home/tmp/33108.1.test.q. The part:
>>
>> #
>> # Make script wrapper for 'rsh' available in jobs tmp dir
>> #
>> if [ $catch_rsh = 1 ]; then
>>    rsh_wrapper=$SGE_ROOT/mpi/rsh
>>
>> in /opt/gridengine/mpi/pmpi/startmpi.sh points to
>> /opt/gridengine/mpi/pmpi/rsh where you added the -V; resp. hostname?
>>
>> -- Reuti
>>
>>
>>> Thanks,
>>> Petar
>>>
>>> On 03/07/2014 02:39 PM, Reuti wrote:
>>>> On 07.03.2014 at 13:20, Petar Penchev wrote:
>>>>
>>>>> I have added the -catch_rsh to the PE and now when I start a sim
>>>> Good.
>>>>
>>>>
>>>>> (mpiexec -np $NSLOTS...) in the lsdyna.out file I see 'Error: Unknown
>>>>> option -np'. When I use 'mpirun -np $NSLOTS...' I see this 'mpirun: rsh:
>>>>> Command not found' in the lsdyna.err.
>>>> Aha, indeed. This MPI variant provides only `mpirun` in my installation.
>>>> But I wonder: do you have a second MPI library installed: `which mpiexec`?
>>>>
>>>> The path to `rsh` is set up by the wrapper, so it should be accessible
>>>> when your job starts. Can you please add to your jobscript:
>>>>
>>>> echo $PATH
>>>>
>>>> The $TMPDIR of the job on the node should be included there, and therein
>>>> the `rsh` should exist.
>>>>
>>>> BTW: I'm not sure about your application, but several applications need
>>>> all environment variables from the master node of the parallel job also
>>>> to be set for the slaves. This can be achieved by including "-V" for
>>>> `qrsh -inherit ...` near the end in /opt/gridengine/mpi/pmpi/rsh
>>>>
>>>> (You copied rsh/hostname to pmpi too?)
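As a quick way to check the chain described above from inside a job, something
like the following could be added to the job script (purely illustrative; it
assumes MPI_REMSH has already been exported and uses the paths from this
thread):

    echo "PATH=$PATH"   # the job's $TMPDIR should show up in here
    ls -l $TMPDIR       # should list the machines file and the wrapper link
    which $MPI_REMSH    # should resolve to the link inside $TMPDIR, which in
                        # turn points to $SGE_ROOT/mpi/pmpi/rsh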
>>>>
>>>>
>>>>> Petar
>>>>>
>>>>> [petar@rocks test]$ cat lsdyna.err
>>>>> mpirun: rsh: Command not found
>>>>>
>>>>> [petar@rocks test]$ cat lsdyna.out
>>>>> -catch_rsh
>>>>> /opt/gridengine/default/spool/mnode01/active_jobs/32738.1/pe_hostfile
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode01
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> mnode02
>>>>> Error: Unknown option -np
>>>>>
>>>>> [root@rocks test]# qconf -mp pmpi
>>>>> pe_name            pmpi
>>>>> slots              9999
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /opt/gridengine/mpi/pmpi/startpmpi.sh -catch_rsh $pe_hostfile
>>>>> stop_proc_args     /opt/gridengine/mpi/pmpi/stoppmpi.sh
>>>>> allocation_rule    $fill_up
>>>>> control_slaves     FALSE
>>>> control_slaves TRUE
>>>>
>>>> Otherwise the `qrsh -inherit ...` will fail.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> job_is_first_task  TRUE
>>>>> urgency_slots      min
>>>>> accounting_summary TRUE
>>>>>
>>>>>
>>>>> On 03/07/2014 12:49 PM, Reuti wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 07.03.2014 at 12:28, Petar Penchev wrote:
>>>>>>
>>>>>>> I have a Rocks cluster 6.1 using OGS2011.11p1 and I am trying to use
>>>>>>> the PlatformMPI parallel libraries. My problem is that when I submit
>>>>>>> a job using qsub test.sh, the job starts only on one node with 16
>>>>>>> processes and not on both nodes. The PE pmpi, which I am using for
>>>>>>> now, is only a copy of mpi.
>>>>>> Does the definition of the PE pmpi also include the -catch_rsh? The
>>>>>> recent IBM/Platform-MPI can cope with a machine file in the MPICH(1)
>>>>>> format, which is created by the /usr/sge/mpi/startmpi.sh
>>>>>>
>>>>>> In addition you need the following settings for a tight integration.
>>>>>> Please try:
>>>>>>
>>>>>> ...
>>>>>> export MPI_REMSH=rsh
>>>>>> export MPI_TMPDIR=$TMPDIR
>>>>>> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> What am I missing? Does anyone have a working -pe submit script, or
>>>>>>> some hints on how to make this work?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Petar
>>>>>>>
>>>>>>> [root@rocks mpi]# test.sh
>>>>>>> #!/bin/bash
>>>>>>> #$ -N lsdyna
>>>>>>> #$ -S /bin/bash
>>>>>>> #$ -pe pmpi 16
>>>>>>> #$ -cwd
>>>>>>> #$ -o lsdyna.out
>>>>>>> #$ -e lsdyna.err
>>>>>>> ###
>>>>>>> #$ -q test.q
>>>>>>> ### -notify
>>>>>>> export MPI_ROOT=/export/apps/platform_mpi
>>>>>>> export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
>>>>>>> export PATH=/export/apps/platform_mpi/bin
>>>>>>> BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
>>>>>>> ARGS="i=test.k"
>>>>>>> mpirun -np $NSLOTS $BIN $ARGS
>>>>>>>
>>>>>>>
>>>>>>> [root@rocks mpi]# qconf -sq test.q
>>>>>>> qname                 test.q
>>>>>>> hostlist              mnode01 mnode02
>>>>>>> seq_no                0
>>>>>>> load_thresholds       np_load_avg=1.75
>>>>>>> suspend_thresholds    NONE
>>>>>>> nsuspend              1
>>>>>>> suspend_interval      00:05:00
>>>>>>> priority              0
>>>>>>> min_cpu_interval      00:05:00
>>>>>>> processors            UNDEFINED
>>>>>>> qtype                 BATCH INTERACTIVE
>>>>>>> ckpt_list             NONE
>>>>>>> pe_list               pmpi
>>>>>>> rerun                 FALSE
>>>>>>> slots                 8
>>>>>>> tmpdir                /tmp
>>>>>>> shell                 /bin/bash
>>>>>>> prolog                NONE
>>>>>>> epilog                NONE
>>>>>>> shell_start_mode      unix_behavior
>>>>>>> starter_method        NONE
>>>>>>> suspend_method        NONE
>>>>>>> resume_method         NONE
>>>>>>> terminate_method      NONE
>>>>>>> notify                00:00:60
>>>>>>> owner_list            NONE
>>>>>>> user_lists            NONE
>>>>>>> xuser_lists           NONE
>>>>>>> subordinate_list      NONE
>>>>>>> complex_values        NONE
>>>>>>> projects              NONE
>>>>>>> xprojects             NONE
>>>>>>> calendar              NONE
>>>>>>> initial_state         default
>>>>>>> s_rt                  INFINITY
>>>>>>> h_rt                  INFINITY
>>>>>>> s_cpu                 INFINITY
>>>>>>> h_cpu                 INFINITY
>>>>>>> s_fsize               INFINITY
>>>>>>> h_fsize               INFINITY
>>>>>>> s_data                INFINITY
>>>>>>> h_data                INFINITY
>>>>>>> s_stack               INFINITY
>>>>>>> h_stack               INFINITY
>>>>>>> s_core                INFINITY
>>>>>>> h_core                INFINITY
>>>>>>> s_rss                 INFINITY
>>>>>>> h_rss                 INFINITY
>>>>>>> s_vmem                INFINITY
>>>>>>> h_vmem                INFINITY

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
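Putting the pieces of this thread together, here is a sketch of how the submit
script might look at this stage. Paths, PE and queue names are the ones posted
above; it assumes the PE was switched to control_slaves TRUE as Reuti noted,
and that startpmpi.sh creates the wrapper link named `ssh` in $TMPDIR (hence
MPI_REMSH=ssh). This is not a verified final version:

    #!/bin/bash
    #$ -N lsdyna
    #$ -S /bin/bash
    #$ -pe pmpi 16
    #$ -cwd
    #$ -o lsdyna.out
    #$ -e lsdyna.err
    #$ -q test.q

    export MPI_ROOT=/export/apps/platform_mpi
    export LD_LIBRARY_PATH=/export/apps/platform_mpi/lib/linux_amd64
    # prepend instead of replace, so the job's $TMPDIR entry (holding the
    # wrapper link) stays visible in PATH
    export PATH=/export/apps/platform_mpi/bin:$PATH

    # must match the name of the wrapper link created by startpmpi.sh in $TMPDIR
    export MPI_REMSH=ssh
    export MPI_TMPDIR=$TMPDIR

    BIN="/export/apps/lsdyna/ls-dyna_mpp_s_r6_1_2_85274_x64_redhat54_ifort120_sse2_platformmpi.exe"
    ARGS="i=test.k"

    # this Platform MPI installation provides mpirun; its mpiexec rejected -np
    mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN $ARGS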
