Hi Dave, Reuti,

Sorry for kicking off this thread and then disappearing.  I've been away for a 
bit.  Anyway, Dave, I'm glad to hear you hit the same issue I had with my 
installation of SGE 6.2u5 and OpenMPI with core binding -- namely that with 
'qsub -pe openmpi 8 -binding set linear:1 <myscript.com>', if two or more of 
the parallel processes get scheduled to the same execution node, then those 
processes end up bound to the same core.  Not good!

I've been playing around quite a bit trying to understand this issue, and ended 
up on the GE dev list:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39&dsMessageId=285878

It seems that most people expect calls to 'qrsh -inherit' (which I assume 
OpenMPI uses to bind parallel processes to reserved GE slots) to activate a 
separate binding.  This does not appear to be the case.  I *was* hoping that 
using '-binding pe linear:1' might let me write a script that reads the 
pe_hostfile and creates a machine file for OpenMPI, but this fails, as GE does 
not appear to report which cores are unbound, only the number required.
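
To make the limitation concrete, here is a rough sketch of the conversion
step (not the actual script).  It assumes the usual pe_hostfile layout of
one line per host with host name, slot count, queue instance and processor
range, and writes OpenMPI hostfile lines of the form 'host slots=N'.  The
host and queue names in the sample are made up; in a real job the text would
come from the file named by $PE_HOSTFILE.

```python
def parse_pe_hostfile(text):
    """Return a list of (host, slots) pairs from pe_hostfile contents.

    Assumed layout per line: host, slots, queue instance, processor range.
    Only the first two columns are used here.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        entries.append((fields[0], int(fields[1])))
    return entries


def to_openmpi_hostfile(entries):
    """Format (host, slots) pairs as OpenMPI hostfile lines."""
    return "".join(f"{host} slots={slots}\n" for host, slots in entries)


# Hypothetical sample: 8 slots split across two made-up nodes.
sample = (
    "node01 2 all.q@node01 UNDEFINED\n"
    "node02 6 all.q@node02 UNDEFINED\n"
)

machinefile = to_openmpi_hostfile(parse_pe_hostfile(sample))
print(machinefile, end="")
```

Note that all this carries over is a slot count per host.  Nothing in it
says which cores on node01 or node02 are still free, which is exactly the
piece that would be needed to bind each MPI process to its own core.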

So, for now, my solution has been to use a JSV to remove core binding for the 
MPI jobs (but retain it for serial and SMP jobs).  Any more ideas??
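
The decision the JSV makes can be sketched roughly as below.  This is only
the logic, written against a plain dict standing in for the job's submit
parameters; it is not the JSV itself and does not use the real JSV script
interface.  The parameter names ('pe_name', 'binding_*') are assumptions
about how the request is exposed and should be checked against the jsv man
page, and the 'smp' PE name is hypothetical.

```python
# PEs treated as MPI; 'openmpi' is the PE from the qsub line above.
MPI_PES = {"openmpi"}

# Assumed common prefix of the binding-related submit parameters.
BINDING_PREFIX = "binding_"


def strip_binding_for_mpi(params, mpi_pes=MPI_PES):
    """Return a copy of params with binding settings dropped for MPI jobs.

    Jobs with no PE (serial) or with a non-MPI PE (e.g. SMP) are returned
    unchanged, so they keep their core binding.
    """
    if params.get("pe_name") not in mpi_pes:
        return dict(params)
    return {k: v for k, v in params.items()
            if not k.startswith(BINDING_PREFIX)}


# Hypothetical job requests.
mpi_job = {"pe_name": "openmpi", "pe_min": "8",
           "binding_strategy": "linear", "binding_amount": "1"}
smp_job = {"pe_name": "smp", "pe_min": "4",
           "binding_strategy": "linear", "binding_amount": "4"}
serial_job = {"binding_strategy": "linear", "binding_amount": "1"}

for job in (mpi_job, smp_job, serial_job):
    print(strip_binding_for_mpi(job))
```

Keying on the PE name rather than on "has a PE at all" is what lets SMP
jobs keep their binding while the MPI jobs lose it.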

Cheers,

Chris

(PS. Dave: how is my alma mater these days??)
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778





