Hi,

> > If, indeed, it is not possible currently to implement this type of 
> > core-binding in tightly integrated OpenMPI/GE, then a solution might lie in 
> > a custom script run in the parallel environment's 'start proc args'. This 
> > script would have to find out which slots are allocated where on the 
> > cluster, and write an OpenMPI rankfile. 
> 
> Exactly this should work. 
> 
> If you use "binding_instance" "pe" and reformat the information in the 
> $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. 
> Maybe you can share the script with this list once you got it working. 


As far as I can see, that's not going to work.  Exactly as with 
"binding_instance" "set", -binding pe linear:n binds n cores per node, 
regardless of how many slots the job has on that node.  This is easy to verify 
by submitting a long-running job and examining its pe_hostfile.  For example, 
I submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1

Notice that, because I specified -binding pe linear:1, each execution node 
binds the job's processes to a single core (the 0,1 entry in the last column), 
even exec6, which has two slots.  If I use -binding pe linear:2 instead, I get:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't give an accurate per-slot representation of 
the binding allocation for use by OpenMPI.  Question: is there a system file or 
command that I could use to check which processors are already "occupied"?
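
To make the mismatch concrete, here is a minimal Python sketch of the kind of 
conversion script discussed above.  It rests on assumptions that are not 
confirmed in this thread: that the fourth pe_hostfile column is a 
colon-separated list of socket,core pairs, and that OpenMPI accepts rankfile 
lines of the form "rank N=host slot=socket:core".  The function names are 
made up for illustration.

```python
def parse_pe_hostfile(text):
    """Parse pe_hostfile text into (host, slots, cores) tuples.

    Assumption: column 4 holds colon-separated "socket,core" pairs,
    e.g. "0,1:0,2".  Anything else in that column is ignored.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        host, slots, binding = fields[0], int(fields[1]), fields[3]
        cores = []
        for pair in binding.split(":"):
            parts = pair.split(",")
            if len(parts) == 2 and all(p.isdigit() for p in parts):
                cores.append((int(parts[0]), int(parts[1])))
        entries.append((host, slots, cores))
    return entries


def binding_mismatches(entries):
    """Return (host, slots, bound_core_count) for hosts where the two differ."""
    return [(host, slots, len(cores))
            for host, slots, cores in entries
            if slots != len(cores)]


def rankfile_lines(entries):
    """Build rankfile lines, giving each slot one of the host's bound cores.

    Raises ValueError when a host has more slots than bound cores, which is
    the situation the linear:1 listing above shows for exec6.
    """
    lines = []
    rank = 0
    for host, slots, cores in entries:
        if slots > len(cores):
            raise ValueError(
                f"{host}: {slots} slots but only {len(cores)} bound core(s)")
        for socket, core in cores[:slots]:
            # Assumed rankfile syntax: rank <N>=<host> slot=<socket>:<core>
            lines.append(f"rank {rank}={host} slot={socket}:{core}")
            rank += 1
    return lines
```

Fed the linear:1 listing, binding_mismatches() reports exec6 (2 slots, 1 bound 
core) and rankfile_lines() refuses to proceed.  Fed the linear:2 listing, a 
rankfile can be written, but every single-slot node is reported as having one 
more bound core than it has slots.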

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778





