On Apr 3, 2011, at 4:08 PM, Reuti wrote:

> Am 03.04.2011 um 23:59 schrieb David Singleton:
>
>> On 04/04/2011 12:56 AM, Ralph Castain wrote:
>>>
>>> What I still don't understand is why you are trying to do it this way. Why not just run
>>>
>>>   time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>>
>>> where machineN contains the names of the nodes where you want the MPI apps to execute? mpirun will only execute apps on those nodes, so this accomplishes the same thing as your script - only with a lot less pain.
>>>
>>> Your script would just contain a sequence of these commands, each with its number of procs and machinefile as required.
>>
>> Maybe I missed why this suggestion of forgetting about the ssh/pbsdsh altogether was not feasible? Just use mpirun (with its great tm support!) to distribute MPI jobs.
>
> Wien2k has a two-stage startup, e.g. for 16 slots:
>
> a) start 4 `ssh` sessions in the background to reach some of the granted nodes
> b) on each of those nodes, use `mpirun` to start processes on the remaining nodes, 3 for each call
Sounds to me like someone should fix wien2k... :-)

> Problems:
>
> 1) control `ssh` under Torque
> 2) provide a partial hostlist to `mpirun`, maybe by disabling the default tight integration

Enough for me - this appears all caused by a poorly-executed application, frankly.

> -- Reuti
>
>> A simple example:
>>
>> vayu1:~/MPI > qsub -lncpus=24,vmem=24gb,walltime=10:00 -wd -I
>> qsub: waiting for job 574900.vu-pbs to start
>> qsub: job 574900.vu-pbs ready
>>
>> [dbs900@v250 ~/MPI]$ wc -l $PBS_NODEFILE
>> 24
>> [dbs900@v250 ~/MPI]$ head -12 $PBS_NODEFILE > m1
>> [dbs900@v250 ~/MPI]$ tail -12 $PBS_NODEFILE > m2
>> [dbs900@v250 ~/MPI]$ mpirun --machinefile m1 ./a2a143 120000 30 & mpirun --machinefile m2 ./pp143
>>
>> Check how the processes are distributed ...
>>
>> vayu1:~ > qps 574900.vu-pbs
>> Node 0: v250:
>>   PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
>> 11420 S   2104  10396  0.0 00:00:00  0.0 -tcsh
>> 11421 S    620  10552  0.0 00:00:00  0.0 pbs_demux
>> 12471 S   2208  49324  0.0 00:00:00  0.9 /apps/openmpi/1.4.3/bin/mpirun --machinefile m1 ./a2a143 120000 30
>> 12472 S   2116  49312  0.0 00:00:00  0.0 /apps/openmpi/1.4.3/bin/mpirun --machinefile m2 ./pp143
>> 12535 R 270160 565668  1.0 00:00:02 82.4 ./a2a143 120000 30
>> 12536 R 270032 565536  1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12537 R 270012 565528  1.0 00:00:02 87.3 ./a2a143 120000 30
>> 12538 R 269992 565532  1.0 00:00:02 93.3 ./a2a143 120000 30
>> 12539 R 269980 565516  1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12540 R 270008 565516  1.0 00:00:02 86.3 ./a2a143 120000 30
>> 12541 R 270008 565516  1.0 00:00:02 96.3 ./a2a143 120000 30
>> 12542 R 272064 567568  1.0 00:00:02 91.3 ./a2a143 120000 30
>> Node 1: v251:
>>   PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
>> 10367 S   1872  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444413440 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "1444413440.0;tcp://10.1.3.58:37339"
>> 10368 S   1868  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10372 R 271112 567556  1.0 00:00:04 74.5 ./a2a143 120000 30
>> 10373 R 271036 567564  1.0 00:00:04 71.5 ./a2a143 120000 30
>> 10374 R 271032 567560  1.0 00:00:04 66.5 ./a2a143 120000 30
>> 10375 R 273112 569612  1.1 00:00:04 68.5 ./a2a143 120000 30
>> 10378 R 552280 840712  2.2 00:00:04  100 ./pp143
>> 10379 R 552280 840708  2.2 00:00:04  100 ./pp143
>> 10380 R 552328 841576  2.2 00:00:04  100 ./pp143
>> 10381 R 552788 841216  2.2 00:00:04 99.3 ./pp143
>> Node 2: v252:
>>   PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
>> 10152 S   1908  40780  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10156 R 552384 840200  2.2 00:00:07 99.3 ./pp143
>> 10157 R 551868 839692  2.2 00:00:06 99.3 ./pp143
>> 10158 R 551400 839184  2.2 00:00:07  100 ./pp143
>> 10159 R 551436 839184  2.2 00:00:06 98.3 ./pp143
>> 10160 R 551760 839692  2.2 00:00:07  100 ./pp143
>> 10161 R 551788 839824  2.2 00:00:07 97.3 ./pp143
>> 10162 R 552256 840332  2.2 00:00:07  100 ./pp143
>> 10163 R 552216 840340  2.2 00:00:07 99.3 ./pp143
>>
>> You would have to do something smarter to get correct process binding etc.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
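[Editor's note: the node-splitting step from David's example can be sketched as a small job script. This is a sketch under assumptions, not Wien2k's actual startup: the node names (v250-v252) and binaries (./a2a143, ./pp143) are taken from the listing above, and inside a real Torque job $PBS_NODEFILE would be set by the batch system rather than fabricated as it is here.]

```shell
#!/bin/sh
# Sketch: split the granted node list in half and hand each half to its
# own mpirun, so two MPI programs share one Torque allocation.
# A real job would use the $PBS_NODEFILE Torque provides; we fabricate a
# 24-slot file (3 nodes x 8 slots) so the script is self-contained.
PBS_NODEFILE=$(mktemp)
for node in v250 v251 v252; do
    i=0
    while [ "$i" -lt 8 ]; do    # one line per slot, 8 slots per node
        echo "$node"
        i=$((i + 1))
    done
done > "$PBS_NODEFILE"

head -12 "$PBS_NODEFILE" > m1   # first 12 slots for the first program
tail -12 "$PBS_NODEFILE" > m2   # last 12 slots for the second program

wc -l < m1                      # 12
wc -l < m2                      # 12

# In the real job, launch both programs concurrently and wait for both:
#   mpirun --machinefile m1 ./a2a143 120000 30 &
#   mpirun --machinefile m2 ./pp143 &
#   wait
```

As the thread notes, this only partitions hosts; it does nothing about process binding, which would need something smarter.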