Re: [O-MPI devel] processor affinity

2005-08-25 Thread Jeff Squyres

On Aug 24, 2005, at 10:27 PM, Troy Benjegerdes wrote:

Processor affinity is now implemented.  You must ask for it via the MCA
param "mpi_paffinity_alone".  If this parameter is set to a nonzero
value, OMPI will assume that its job is alone on the nodes that it is
running on, and, if you have not oversubscribed the node, will bind MPI
processes to processors, starting with processor ID 0 (i.e.,
effectively binding MPI processes to the processor number equivalent
to their relative VPID on that node).
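[On Linux, the binding described above comes down to a single system
call.  Below is a minimal, hedged sketch of that logic; "local_rank"
stands in for the process's relative VPID on the node and is an
assumption of the example, not an actual OMPI variable.]

/* Hedged sketch of the binding scheme described above, assuming a
 * Linux system with sched_setaffinity().  Not the actual OMPI code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int bind_to_local_rank(int local_rank)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(local_rank, &mask);   /* local rank 0 -> CPU 0, 1 -> CPU 1, ... */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

[In OMPI this would happen once, early in startup, when
mpi_paffinity_alone is set.]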


Any thoughts on how to support NUMA with something like this? On the
dual opteron w/DDR IB systems I've got, I'm seeing a big performance
difference that primarily depends on which node the memory is on.


I take it from this that you have activated the processor affinity
stuff?  I'm not well-versed on how opterons work, but don't they
allocate memory on a first-processor-usage basis?  I.e., malloc() will
return memory local to the processor that invoked it?  If so, the
processor affinity stuff is called way at the beginning of time, before
99% of the malloc's in OMPI are invoked, so that *should* be taken care
of naturally...


Are you seeing something different?
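[One quick way to check where the pages actually land is to ask the
kernel.  A hedged sketch, assuming Linux with libnuma's numaif.h header
and linking with -lnuma; this is an illustration, not OMPI code.]

#define _GNU_SOURCE
#include <sched.h>
#include <numaif.h>     /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                       /* pin to CPU 0 first...       */
    sched_setaffinity(0, sizeof(mask), &mask);

    char *buf = malloc(1 << 20);
    if (buf == NULL) return 1;
    memset(buf, 0, 1 << 20);                 /* ...then touch the pages     */

    int node = -1;
    /* With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() reports the NUMA
     * node backing the page that contains buf. */
    if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0) {
        printf("pages were placed on NUMA node %d\n", node);
    }
    free(buf);
    return 0;
}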

I'm also working on a memory affinity framework, but that's really for
explicit shared memory operations on NUMA machines (e.g., shared memory
collectives, where we want to control the physical location of pages in
an mmap'ed chunk of memory that is shared between multiple processes).
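[To make that concrete, here is a hedged sketch of the kind of page
placement such a framework would do, using Linux's mbind() from
numaif.h (link with -lnuma).  The names and layout are illustrative,
not the planned framework API.]

#define _GNU_SOURCE
#include <sys/mman.h>
#include <numaif.h>            /* mbind(), MPOL_BIND */
#include <stdio.h>

#define REGION_LEN (2 * 1024 * 1024)

static void *map_and_place(void)
{
    void *region = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    /* Node masks are bitmaps: bit 0 = node 0, bit 1 = node 1, ... */
    unsigned long node0 = 1UL << 0;
    unsigned long node1 = 1UL << 1;

    /* First half of the region on node 0, second half on node 1. */
    mbind(region, REGION_LEN / 2, MPOL_BIND, &node0, sizeof(node0) * 8, 0);
    mbind((char *)region + REGION_LEN / 2, REGION_LEN / 2,
          MPOL_BIND, &node1, sizeof(node1) * 8, 0);

    return region;
}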


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI devel] processor affinity

2005-08-24 Thread Troy Benjegerdes
On Tue, Aug 16, 2005 at 12:25:32PM -0400, Jeff Squyres wrote:
> Processor affinity is now implemented.  You must ask for it via the MCA 
> param "mpi_paffinity_alone".  If this parameter is set to a nonzero 
> value, OMPI will assume that its job is alone on the nodes that it is 
> running on, and, if you have not oversubscribed the node, will bind MPI 
> processes to processors, starting with processor ID 0 (i.e., 
> effectively binding MPI processes to the processor number equivalent 
> to their relative VPID on that node).
> 
> Please let me know how this works out for everyone; thanks.

Any thoughts on how to support NUMA with something like this? On the
dual opteron w/DDR IB systems I've got, I'm seeing a big performance
difference that primarily depends on which node the memory is on.


Re: [O-MPI devel] processor affinity

2005-07-20 Thread Jeff Squyres

On Jul 20, 2005, at 2:26 AM, Matt Leininger wrote:

Any advice here from the OpenMP community would also be
appreciated...

  Please keep in mind we need this to work for both MPI+OpenMP and
MPI+pthread hybrid apps.


Yes, I think what we loosely concluded here is:

1. We'll have a framework for process affinity (should be pretty easy 
to do) to accommodate the different available APIs.
2. We'll discuss next week where this framework will fit in the 
opal/orte/ompi stack.  I imagine it'll do some kind of default 
processor affinity with user-overridable options.
3. We'll probably have a second framework for memory affinity (it's a 
fundamentally different beast than process affinity).  Haven't really 
discussed this much yet -- we'll start those discussions next week.
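[As a strawman for item 1, the interface could be very small.  This is
a hedged sketch with hypothetical names, not the actual opal/orte
component interface.]

/* Strawman sketch of a tiny processor-affinity framework interface.
 * All names here are hypothetical -- not the real OMPI MCA interface,
 * just an illustration of how small it could be. */
typedef struct paffinity_module {
    /* Bind the calling process to the given (logical) processor id. */
    int (*set)(int processor_id);

    /* Report which processor the calling process is currently bound
     * to, or -1 if it is unbound. */
    int (*get)(void);

    /* Number of processors the OS reports on this node. */
    int (*num_processors)(void);
} paffinity_module_t;

/* A component would fill this in with, e.g., Linux sched_setaffinity()
 * calls, Solaris processor_bind(), or a no-op stub when the OS has no
 * support. */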


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI devel] processor affinity

2005-07-18 Thread Matt Leininger
On Mon, 2005-07-18 at 08:28 -0400, Jeff Squyres wrote:
> On Jul 18, 2005, at 2:50 AM, Matt Leininger wrote:
> 
> >> Generally speaking, if you launch <=N processes in a job on a node
> >> (where N == number of CPUs on that node), then we set processor
> >> affinity.  We set each process's affinity to the CPU number according
> >> to the VPID ordering of the procs in that job on that node.  So if you
> >> launch VPIDs 5, 6, 7, 8 on a node, 5 would go to processor 0, 6 would
> >> go to processor 1, etc. (it's an easy, locally-determined ordering).
> >
> >You'd need to be careful with dual-core cpus.  Say you launch a 4
> > task MPI job on a 4-socket dual core Opteron.  You'd want to schedule
> > the tasks on nodes 0, 2, 4, 6 - not 0, 1, 2, 3 to get maximum memory
> > bandwidth to each MPI task.
> 
> With the potential for non-trivial logic like this, perhaps the extra 
> work for a real framework would be justified, then.

   I agree.
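
[The socket-spreading rule itself is tiny.  A hedged sketch:
cores_per_socket and num_sockets are assumed to come from the
resource/sysinfo side; they are not existing OMPI variables.]

/* Hedged sketch of the "spread across sockets" mapping Matt describes:
 * with 4 tasks on a 4-socket dual-core node, local ranks 0..3 land on
 * CPUs 0, 2, 4, 6. */
static int cpu_for_local_rank(int local_rank, int cores_per_socket,
                              int num_sockets)
{
    int socket = local_rank % num_sockets;    /* round-robin sockets */
    int core   = local_rank / num_sockets;    /* then fill cores     */
    return socket * cores_per_socket + core;
}
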
> 
> >    Also, how would this work with hybrid MPI+threading (either
> > pthreads or OpenMP) applications?  Let's say you have an 8 or 16 cpu
> > node and you start up 2 MPI tasks with 4 compute threads in each
> > task.  The optimum layout may not be running the MPI tasks on cpu's
> > 0 and 1.  Several hybrid applications that ran on ASC White and now
> > Purple will have these requirements.
> 
> Hum.  Good question.  The MPI API doesn't really address this -- the 
> MPI API is not aware of additional threads that are created until you 
> call an MPI function (and even then, we're not currently checking which 
> thread is calling -- that would just add latency).
> 
> What do these applications do right now?  Do they set their own 
> processor / memory affinity?  This might actually be outside the scope 
> of MPI...?  (I'm not trying to shrug off responsibility, but this 
> might be a case where the MPI simply doesn't have enough information, 
> and to get that information [e.g., via MPI attributes or MPI info 
> arguments] would be more hassle than the user just setting the affinity 
> themselves...?)

  We played around with setting processor affinity in our app a few
years ago.  It got a little ugly, but things have improved since then.
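
[For reference, the per-thread half of that on Linux looks roughly like
the sketch below.  pthread_setaffinity_np() is GNU-specific;
base_cpu and THREADS_PER_TASK are assumptions of the example, not
anything MPI provides today.]

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define THREADS_PER_TASK 4

static void *compute(void *arg)
{
    /* ... numerical kernel ... */
    return arg;
}

/* Hedged sketch: a hybrid MPI+pthreads app pinning its own compute
 * threads.  base_cpu is whatever CPU range this MPI task was given. */
static void spawn_pinned_threads(int base_cpu)
{
    pthread_t tid[THREADS_PER_TASK];

    for (int i = 0; i < THREADS_PER_TASK; ++i) {
        pthread_create(&tid[i], NULL, compute, NULL);

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(base_cpu + i, &mask);     /* thread i -> CPU base_cpu+i */
        pthread_setaffinity_np(tid[i], sizeof(mask), &mask);
    }

    for (int i = 0; i < THREADS_PER_TASK; ++i) {
        pthread_join(tid[i], NULL);
    }
}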

  I was thinking of having the app pass threading info to MPI (via info
or attributes).  This might be outside the scope of MPI now, but it
should be the responsibility of the parallel programming
language/method.  Making it the app's responsibility to set processor
affinity seems a bit too much of a low-level worry to put on
application developers.


   Some discussion around what a memory/processor affinity framework
should look like and be doing would be a good starting point.

  - Matt




Re: [O-MPI devel] processor affinity

2005-07-18 Thread Ralph Castain
Did a little digging into this last night, and finally figured out what
you were getting at in your comments here. Yeah, I think an "affinity"
framework would definitely be the best approach - it can handle both cpu
and memory, I imagine. It isn't clear how pressing this is, as it is
mostly an optimization issue, but you're welcome to create the framework
if you like.


On Sun, 2005-07-17 at 09:13, Jeff Squyres wrote:

> It needs to be done in the launched process itself.  So we'd either 
> have to extend rmaps (from my understanding of rmaps, that doesn't seem 
> like a good idea), or do something different.
> 
> Perhaps the easiest thing to do is to add this to the LANL meeting 
> agenda...?  Then we can have a whiteboard to discuss.  :-)
> 
> 
> 
> On Jul 17, 2005, at 10:26 AM, Ralph Castain wrote:
> 
> > Wouldn't it belong in the rmaps framework? That's where we tell the
> > launcher where to put each process - seems like a natural fit.
> >
> >
> > On Jul 17, 2005, at 6:45 AM, Jeff Squyres wrote:
> >
> >> I'm thinking that we should add some processor affinity code to
> >> OMPI -- possibly in the orte layer (ORTE is the interface to the
> >> back-end launcher, after all).  This will really help on systems
> >> like opterons (and others) to prevent processes from bouncing
> >> between processors, and potentially getting located far from
> >> "their" RAM.
> >>
> >> This has the potential to help even micro-benchmark results (e.g.,
> >> ping-pong).  It's going to be quite relevant for my shared memory
> >> collective work on mauve.
> >>
> >>
> >> General scheme:
> >> ---
> >>
> >> I think that somewhere in ORTE, we should actively set processor
> >> affinity when:
> >>- Supported by the OS
> >>- Not disabled by the user (via MCA param)
> >>- The node is not over-subscribed with processes from this job
> >>
> >> Generally speaking, if you launch <=N processes in a job on a node
> >> (where N == number of CPUs on that node), then we set processor
> >> affinity.  We set each process's affinity to the CPU number according
> >> to the VPID ordering of the procs in that job on that node.  So if you
> >> launch VPIDs 5, 6, 7, 8 on a node, 5 would go to processor 0, 6 would
> >> go to processor 1, etc. (it's an easy, locally-determined ordering).
> >>
> >> Someday, we might want to make this scheme universe-aware (i.e., see
> >> if any other ORTE jobs are running on that node, and not schedule on
> >> any processors that are already claimed by the processes on
> >> that(those) job(s)), but I think single-job awareness is sufficient
> >> for the moment.
> >>
> >>
> >> Implementation:
> >> ---
> >>
> >> We'll need relevant configure tests to figure out if the target system
> >> has CPU affinity system calls.  Those are simple to add.
> >>
> >> We could simply use #if statements for the affinity stuff or make
> >> it a real framework.  Since it's only 1 function call to set the
> >> affinity, I tend to lean towards the [simpler] #if solution, but
> >> could probably be pretty easily convinced that a framework is the
> >> Right solution.  I'm on the fence (and if someone convinces me, I'd
> >> volunteer for the extra work to set up the framework).
> >>
> >> I'm not super-familiar with the processor-affinity stuff (e.g.,
> >> for best effect, should it be done after the fork and before the
> >> exec?), so I'm not sure exactly where this would go in ORTE.
> >> Potentially either before new processes are exec'd (where we only
> >> have control of that in some kinds of systems, like rsh/ssh) or
> >> right up very very near the top of orte_init().
> >>
> >> Comments?
> >>
> >> -- 
> >> {+} Jeff Squyres
> >> {+} The Open MPI Project
> >> {+} http://www.open-mpi.org/
> >>
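
[For the record, the plain-#if route floated in the quoted message
above would amount to something like this hedged sketch.
HAVE_SCHED_SETAFFINITY stands for whatever symbol the configure test
would define; it is hypothetical here.]

/* Hedged sketch of the "plain #if" alternative to a framework. */
#ifdef HAVE_SCHED_SETAFFINITY
#define _GNU_SOURCE
#include <sched.h>
#endif

static int maybe_bind(int cpu)
{
#ifdef HAVE_SCHED_SETAFFINITY
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
#else
    (void)cpu;        /* no affinity support detected at configure time */
    return 0;
#endif
}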


Re: [O-MPI devel] processor affinity

2005-07-18 Thread Rich L. Graham


On Jul 18, 2005, at 6:28 AM, Jeff Squyres wrote:


On Jul 18, 2005, at 2:50 AM, Matt Leininger wrote:


Generally speaking, if you launch <=N processes in a job on a node
(where N == number of CPUs on that node), then we set processor
affinity.  We set each process's affinity to the CPU number according
to the VPID ordering of the procs in that job on that node.  So if you
launch VPIDs 5, 6, 7, 8 on a node, 5 would go to processor 0, 6 would
go to processor 1, etc. (it's an easy, locally-determined ordering).


   You'd need to be careful with dual-core cpus.  Say you launch a 4
task MPI job on a 4-socket dual core Opteron.  You'd want to schedule
the tasks on nodes 0, 2, 4, 6 - not 0, 1, 2, 3 to get maximum memory
bandwidth to each MPI task.


With the potential for non-trivial logic like this, perhaps the extra
work for a real framework would be justified, then.

   Also, how would this work with hybrid MPI+threading (either pthreads
or OpenMP) applications?  Let's say you have an 8 or 16 cpu node and
you start up 2 MPI tasks with 4 compute threads in each task.  The
optimum layout may not be running the MPI tasks on cpu's 0 and 1.
Several hybrid applications that ran on ASC White and now Purple will
have these requirements.


Hum.  Good question.  The MPI API doesn't really address this -- the
MPI API is not aware of additional threads that are created until you
call an MPI function (and even then, we're not currently checking which
thread is calling -- that would just add latency).

What do these applications do right now?  Do they set their own
processor / memory affinity?  This might actually be outside the scope
of MPI...?  (I'm not trying to shrug off responsibility, but this
might be a case where the MPI simply doesn't have enough information,
and to get that information [e.g., via MPI attributes or MPI info
arguments] would be more hassle than the user just setting the affinity
themselves...?)

Comments?


If you set things up such that you can specify input parameters on
where to put each process, you have the flexibility you want.  The
locality APIs I have seen all mimicked the IRIX API, which had these
capabilities.  If you want some ideas, look at LA-MPI, which does this -
the implementation is pretty strange (just the coding), but it is there.

Rich



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
