On Apr 3, 2011, at 4:08 PM, Reuti wrote:

> On 03.04.2011 at 23:59, David Singleton wrote:
> 
>> On 04/04/2011 12:56 AM, Ralph Castain wrote:
>>> 
>>> What I still don't understand is why you are trying to do it this way. Why 
>>> not just run
>>> 
>>> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN 
>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>> 
>>> where machineN contains the names of the nodes where you want the MPI apps 
>>> to execute? mpirun will only execute apps on those nodes, so this 
>>> accomplishes the same thing as your script - only with a lot less pain.
>>> 
>>> Your script would just contain a sequence of these commands, each with its 
>>> number of procs and machinefile as required.
>>> 
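For illustration, a script along those lines might look roughly like this (the
process counts, machine-file names and the second .def file are only examples,
not taken from an actual Wien2k run):

# split the Torque-provided node list into one machine file per mpirun call
head -4 $PBS_NODEFILE > .machine1
tail -4 $PBS_NODEFILE > .machine2

# each mpirun starts its processes only on the nodes listed in its machine file
time mpirun -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine1 \
    /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
time mpirun -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2 \
    /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def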
>> 
>> Maybe I missed why this suggestion of forgetting about the ssh/pbsdsh 
>> altogether
>> was not feasible?  Just use mpirun (with its great tm support!) to distribute
>> MPI jobs.
> 
> Wien2k has a two-stage startup, e.g. for 16 slots:
> 
> a) start `ssh` four times in the background to reach some of the granted nodes
> b) on each of those nodes, use `mpirun` to start processes on the remaining
> nodes, 3 for each call
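
Just to make those two stages concrete, that startup amounts to something like
the following (node names, per-call process counts and file names are made up
for illustration):

# stage a): background ssh to 4 of the granted nodes
# stage b): each remote shell runs its own mpirun, placing one process locally
#           and 3 on the remaining nodes of its partial machine file
ssh node01 "cd $PBS_O_WORKDIR && mpirun -np 4 -machinefile .machine1 lapw1Q_mpi lapw1Q_1.def" &
ssh node05 "cd $PBS_O_WORKDIR && mpirun -np 4 -machinefile .machine2 lapw1Q_mpi lapw1Q_2.def" &
ssh node09 "cd $PBS_O_WORKDIR && mpirun -np 4 -machinefile .machine3 lapw1Q_mpi lapw1Q_3.def" &
ssh node13 "cd $PBS_O_WORKDIR && mpirun -np 4 -machinefile .machine4 lapw1Q_mpi lapw1Q_4.def" &
wait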

Sounds to me like someone should fix wien2k... :-)

> 
> Problems:
> 
> 1) control `ssh` under Torque
> 2) provide a partial hostlist to `mpirun`, maybe by disabling the default
> tight integration

Enough for me - this all appears to be caused by a poorly implemented 
application, frankly.


> 
> -- Reuti
> 
> 
>> A simple example:
>> 
>> vayu1:~/MPI > qsub -lncpus=24,vmem=24gb,walltime=10:00 -wd -I
>> qsub: waiting for job 574900.vu-pbs to start
>> qsub: job 574900.vu-pbs ready
>> 
>> [dbs900@v250 ~/MPI]$ wc -l $PBS_NODEFILE
>> 24
>> [dbs900@v250 ~/MPI]$ head -12 $PBS_NODEFILE > m1
>> [dbs900@v250 ~/MPI]$ tail -12 $PBS_NODEFILE > m2
>> [dbs900@v250 ~/MPI]$ mpirun --machinefile m1 ./a2a143 120000 30 & mpirun 
>> --machinefile m2 ./pp143
>> 
>> 
>> Check how the processes are distributed ...
>> 
>> vayu1:~ > qps 574900.vu-pbs
>> Node 0: v250:
>> PID S   RSS    VSZ %MEM     TIME %CPU COMMAND
>> 11420 S  2104  10396  0.0 00:00:00  0.0 -tcsh
>> 11421 S   620  10552  0.0 00:00:00  0.0 pbs_demux
>> 12471 S  2208  49324  0.0 00:00:00  0.9 /apps/openmpi/1.4.3/bin/mpirun 
>> --machinefile m1 ./a2a143 120000 30
>> 12472 S  2116  49312  0.0 00:00:00  0.0 /apps/openmpi/1.4.3/bin/mpirun 
>> --machinefile m2 ./pp143
>> 12535 R 270160 565668  1.0 00:00:02 82.4 ./a2a143 120000 30
>> 12536 R 270032 565536  1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12537 R 270012 565528  1.0 00:00:02 87.3 ./a2a143 120000 30
>> 12538 R 269992 565532  1.0 00:00:02 93.3 ./a2a143 120000 30
>> 12539 R 269980 565516  1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12540 R 270008 565516  1.0 00:00:02 86.3 ./a2a143 120000 30
>> 12541 R 270008 565516  1.0 00:00:02 96.3 ./a2a143 120000 30
>> 12542 R 272064 567568  1.0 00:00:02 91.3 ./a2a143 120000 30
>> Node 1: v251:
>> PID S   RSS    VSZ %MEM     TIME %CPU COMMAND
>> 10367 S  1872  40648  0.0 00:00:00  0.0 orted -mca ess env -mca 
>> orte_ess_jobid 1444413440 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 
>> --hnp-uri "1444413440.0;tcp://10.1.3.58:37339"
>> 10368 S  1868  40648  0.0 00:00:00  0.0 orted -mca ess env -mca 
>> orte_ess_jobid 1444347904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 
>> --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10372 R 271112 567556  1.0 00:00:04 74.5 ./a2a143 120000 30
>> 10373 R 271036 567564  1.0 00:00:04 71.5 ./a2a143 120000 30
>> 10374 R 271032 567560  1.0 00:00:04 66.5 ./a2a143 120000 30
>> 10375 R 273112 569612  1.1 00:00:04 68.5 ./a2a143 120000 30
>> 10378 R 552280 840712  2.2 00:00:04 100 ./pp143
>> 10379 R 552280 840708  2.2 00:00:04 100 ./pp143
>> 10380 R 552328 841576  2.2 00:00:04 100 ./pp143
>> 10381 R 552788 841216  2.2 00:00:04 99.3 ./pp143
>> Node 2: v252:
>> PID S   RSS    VSZ %MEM     TIME %CPU COMMAND
>> 10152 S  1908  40780  0.0 00:00:00  0.0 orted -mca ess env -mca 
>> orte_ess_jobid 1444347904 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 
>> --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10156 R 552384 840200  2.2 00:00:07 99.3 ./pp143
>> 10157 R 551868 839692  2.2 00:00:06 99.3 ./pp143
>> 10158 R 551400 839184  2.2 00:00:07 100 ./pp143
>> 10159 R 551436 839184  2.2 00:00:06 98.3 ./pp143
>> 10160 R 551760 839692  2.2 00:00:07 100 ./pp143
>> 10161 R 551788 839824  2.2 00:00:07 97.3 ./pp143
>> 10162 R 552256 840332  2.2 00:00:07 100 ./pp143
>> 10163 R 552216 840340  2.2 00:00:07 99.3 ./pp143
>> 
>> 
>> You would have to do something smarter to get correct process binding etc.
>> 
>> 
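
For the binding part, one option with this Open MPI release would be to give
each mpirun its own rankfile, since two independent mpiruns that both bind
"from core 0 up" could otherwise pin their processes onto the same cores of a
shared node (host names, slot numbers and -np values below are only
illustrative):

cat > rank1 <<EOF
rank 0=v250 slot=0
rank 1=v250 slot=1
EOF
cat > rank2 <<EOF
rank 0=v251 slot=4
rank 1=v251 slot=5
EOF
mpirun -np 2 --rankfile rank1 ./a2a143 120000 30 &
mpirun -np 2 --rankfile rank2 ./pp143

With per-job rankfiles, each run gets a disjoint set of cores even where the
two machine files overlap on a node.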