On 11/16/2010 08:24 PM, Ralph Castain wrote:


On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje <terry.don...@oracle.com <mailto:terry.don...@oracle.com>> wrote:

    On 11/16/2010 01:31 PM, Reuti wrote:
    Hi Ralph,

    Am 16.11.2010 um 15:40 schrieb Ralph Castain:

    2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
does this automatically to constrain the procs to running on only those cores.
    This is another "bug/feature" in SGE: it's a matter of discussion, whether the 
shepherd should get exactly one core (in case you use more than one `qrsh`per node) for each call, 
or *all* cores assigned (which we need right now, as the processes in Open MPI will be forks of 
orte daemon). About such a situtation I filled an issue a long time ago and 
"limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this setting should 
then also change the core allocation of the master process):

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

    I believe this is indeed the crux of the issue
    fantastic to share the same view.

    FWIW, I think I agree too.

    3. tell OMPI to --bind-to-core.

    In other words, tell SGE to allocate a certain number of cores on each 
node, but to bind each proc to all of them (i.e., don't bind a proc to a 
specific core). I'm pretty sure that is a standard SGE option today (at least, 
I know it used to be). I don't believe any patch or devel work is required (to 
either SGE or OMPI).
    When you use a fixed allocation_rule and a matching -binding request it 
will work today. But any other case won't be distributed in the correct way.

    Is it possible to not include the -binding request? If SGE is told to use a 
fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
the orted see
    itself bound to two specific cores on each node?
    When you leave out the -binding, all jobs are allowed to run on any core.


    We would then be okay as the spawned children of orted would inherit its 
binding. Just don't tell mpirun to bind the processes and the threads of those 
MPI procs will be able to operate across the provided cores.

    Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
-binding given), but doesn't bind the orted to any two specific cores? If so, 
then that would be a problem as the orted would think itself unconstrained. If 
I understand the thread correctly, you're saying that this is what happens 
today - true?
    Exactly. It won't apply any binding at all and orted would think of being 
unlimited. I.e. limited only by the number of slots it should use thereon.

    So I guess the question I have for Ralph.  I thought, and this
    might be mixing some of the ideas Jeff and I've been talking
    about, that when a RM executes the orted with a bound set of
    resources (ie cores) that orted would bind the individual
    processes on a subset of the bounded resources.  Is this not
    really the case for 1.4.X branch?  I believe it is the case for
    the trunk based on Jeff's refactoring.


You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
I do not think you are right here. Chris sent the following which looks like OGE (fka SGE) actually did bind the hnp to multiple cores. However that message I believe is not coming from the processes themselves and actually is only shown by the hnp. I wonder if Chris adds a "-bind-to-core" option we'll see more output from the a.out's before they exec unterm?

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':
[exec4:17384] System has detected external process binding to cores 0022
 [exec4:17384] ras:gridengine: JOB_ID: 59352
 [exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
 [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
 [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1



--td
What I had suggested to Reuti was to not include the -binding flag to SGE in the hopes that SGE would then bind the orted to all the allocated cores. However, as I feared, SGE in that case doesn't bind the orted at all - and so we assume the entire node is available for our use.

This is an SGE issue. We need them to bind the orted to -all- the allocated cores (and only those cores) in order for us to operate correctly.




-- Oracle
    Terry D. Dontje | Principal Software Engineer
    Developer Tools Engineering | +1.781.442.2631
    Oracle *- Performance Technologies*
    95 Network Drive, Burlington, MA 01803
    Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>




    _______________________________________________
    users mailing list
    us...@open-mpi.org <mailto:us...@open-mpi.org>
    http://www.open-mpi.org/mailman/listinfo.cgi/users



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>



Reply via email to