Re: [SLUG] Clustering weirdness
Steven == Steven Tucker <tux...@gmail.com> writes:

Steven> Hi all, got a problem with my cluster using OpenMPI + Torque +
Steven> Maui. I can submit 50 different jobs (single process) and the
Steven> batching system will run all 50 in parallel, but I can't get an
Steven> MPI job to run on more than 1 node. I assumed it must be my PBS
Steven> script, but I have tried just about every config I can
Steven> find/think of and still no luck.

I haven't used Torque, but if it's anything like NQS, you need a
different batch queue that's configured with the nodes you want to be
able to use. Also, there's typically a different prologue and epilogue
(differently named files) for parallel as opposed to single-node jobs.

We used to have to do something like

    qmgr -c "set queue batch16 resources_max.nodect = 16"

to allow jobs submitted to the queue `batch16' to use up to 16 nodes,
for instance. It's been fifteen years since I used NQS, so my memory may
be faulty. And of course, Torque has its own command set (although I
believe it's based on NQS).

-- 
Dr Peter Chubb                        http://www.gelato.unsw.edu.au
peterc AT gelato.unsw.edu.au          http://www.ertos.nicta.com.au
ERTOS within National ICT Australia

-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
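[Archive note: translated to Torque's own qmgr, the NQS-era incantation above might look like the sketch below. The queue name `batch` is an assumption for illustration; run this on the pbs_server host and check the attribute names against your Torque version.]

```shell
# Inspect the queue's current limits and defaults
qmgr -c "print queue batch"

# Raise the per-job node-count ceiling so multi-node jobs are accepted
qmgr -c "set queue batch resources_max.nodect = 16"

# Optionally give single-node submissions a sane default
qmgr -c "set queue batch resources_default.nodect = 1"
```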
[SLUG] Clustering weirdness
Hi all, got a problem with my cluster using OpenMPI + Torque + Maui.

I can submit 50 different jobs (single process) and the batching system
will run all 50 in parallel, but I can't get an MPI job to run on more
than 1 node. I assumed it must be my PBS script, but I have tried just
about every config I can find/think of and still no luck.

pbsnodes -a produces the following, all the way up to 16 nodes (the
second half are quad cores); just showing the first 2 for brevity.

    tuxta@WPCluster:~$ pbsnodes -a
    WPCluster.workstation.griffith.edu.au
         state = free
         np = 4
         ntype = cluster
         status = rectime=1321274238,varattr=,jobs=,state=free,netload=87819550163,gres=,loadave=0.00,ncpus=4,physmem=4047980kb,availmem=11466240kb,totmem=11860068kb,idletime=1574,nusers=1,nsessions=1,sessions=27627,uname=Linux WPCluster 2.6.32-34-server #77-Ubuntu SMP Tue Sep 13 20:54:38 UTC 2011 x86_64,opsys=linux

    node02
         state = free
         np = 2
         ntype = cluster
         status = rectime=1321274239,varattr=,jobs=,state=free,netload=191116409,gres=,loadave=0.00,ncpus=2,physmem=1021584kb,availmem=8642904kb,totmem=8832648kb,idletime=2258,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux node02 2.6.32-34-server #77-Ubuntu SMP Tue Sep 13 20:54:38 UTC 2011 x86_64,opsys=linux

The following file works fine when nodes=1, but when I make nodes=2 it
never runs, just keeps a state of E rather than R:

    #!/bin/bash
    #PBS -N Hello_Test
    #PBS -l nodes=2:ppn=4
    cd $PBS_O_WORKDIR
    mpiexec -np 8 hello

Not sure what other info is helpful, so I don't want to put heaps of
stuff here; if any more info is needed just let me know. Does anyone
have any ideas?

Regards
Tuxta
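[Archive note: one thing worth knowing about the script above is that when Torque does grant a job multiple nodes, it writes one line per allocated slot into the file named by $PBS_NODEFILE, and mpiexec can be sized from that instead of hard-coding -np 8. A minimal sketch; the nodefile contents here are simulated, since inside a real job you would read "$PBS_NODEFILE" directly:]

```shell
#!/bin/bash
# Simulate what Torque provides inside a job: $PBS_NODEFILE lists one
# line per allocated slot, so e.g. nodes=2:ppn=2 yields four lines.
PBS_NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$PBS_NODEFILE"

# Size the MPI run from the allocation rather than hard-coding -np
NP=$(wc -l < "$PBS_NODEFILE")
echo "would run: mpiexec -np $NP hello"

rm -f "$PBS_NODEFILE"
```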
Re: [SLUG] Clustering weirdness
On 14 November 2011 23:53, Steven Tucker <tux...@gmail.com> wrote:
> Hi all, got a problem with my cluster using OpenMPI + Torque + Maui.

I don't think OpenMPI is so common that you'll find many people with
experience in this forum. You might have better luck in OpenMPI-,
Torque-, or Maui-specific forums, e.g.
http://www.open-mpi.org/community/lists/users/

--Amos