Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Reuti
Am 27.03.2014 um 23:59 schrieb Dave Love:

> Reuti  writes:
> 
>> Do all of them have an internal bookkeeping of granted cores to slots
>> - i.e. not only the number of scheduled slots per job per node, but
>> also which core was granted to which job? Does Open MPI read this
>> information would be the next question then.
> 
> OMPI works with the bindings it's handed via orted (if the processes are
> started that way).
> 
>>> My understanding is that Torque delegates to OpenMPI the process placement 
>>> and binding (beyond the list of nodes/cpus available for
>>> the job).
> 
> Can't/doesn't torque start the MPI processes itself?  Otherwise, yes,
> since orted gets the binding.
> 
>>> My guess is that OpenPBS behaves the same as Torque.
>>> 
>>> SLURM and SGE/OGE *probably* have pretty much the same behavior.
>> 
>> SGE/OGE: no, any binding request is only a soft request.
> 
> I don't understand that.  Does it mean the system-specific "strict" and
> "non-strict" binding in hwloc, in which case I don't see how UGE can do
> anything different?
> 
>> UGE: here you can request a hard binding. But I have no clue whether this 
>> information is read by Open MPI too.
>> 
>> If in doubt: use only complete nodes for each job (which is often done
>> for massively parallel jobs anyway).
> 
> There's no need with a recent SGE.  All our jobs get core bindings --
> unless they use all the cores, since binding them all is equivalent to
> binding none -- and OMPI inherits them.  See
>  for the
> SGE+OMPI configuration.

To avoid any misunderstanding I first discuss this last paragraph. I read
http://www.slideshare.net/jsquyres/open-mpi-explorations-in-process-affinity-eurompi13-presentation
which was posted on this list yesterday. And so I would phrase it: mapping to
all is like mapping to none. And as they are only mapped, the kernel scheduler
is free to move them around inside this set of (granted) cores.

But maybe I got it wrong.

-- Reuti



> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

On 03/27/2014 05:58 PM, Jeff Squyres (jsquyres) wrote:

On Mar 27, 2014, at 4:06 PM, "Sasso, John (GE Power & Water, Non-GE)" wrote:



Yes, I noticed that I could not find --display-map in any of the man pages.
Intentional?


Oops; nope.  I'll ask Ralph to add it...



Nah ...
John: As far as I can tell,
good features are intentional in OMPI.
The bad ones seem to be just minor lapses.
And minor lapses are benign.
They help keep us alert. :)

So, in the spirit of pitching in two-cent contributions
to build an even more perfect OMPI documentation than we already have:
man pages, README file, FAQ, and this rocking mailing list rhythm,
(Who can ask for anything more?),
I think I found what seems to be the corresponding mca parameter:

rmaps_base_display_map

which defaults to 0, but should be set to 1
to produce the same effect as
mpiexec --display-map.

Right?
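If that is the right knob, one way to try it out (just a sketch, assuming the parameter name above is correct and that your Open MPI reads the usual per-user parameter file) would be:

```
# $HOME/.openmpi/mca-params.conf
# Should make every run print the process map, as if --display-map
# were passed to mpiexec.
rmaps_base_display_map = 1
```

Or per run: mpiexec --mca rmaps_base_display_map 1 ...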

Cheers,
Gus Correa







Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Ralph Castain
Oooh...it's Jeff's fault!

Fwiw you can get even more detailed mapping info with --display-devel-map

Sent from my iPhone

> On Mar 27, 2014, at 2:58 PM, "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> On Mar 27, 2014, at 4:06 PM, "Sasso, John (GE Power & Water, Non-GE)" 
>>  wrote:
>> 
>> Yes, I noticed that I could not find --display-map in any of the man pages.  
>> Intentional?
> 
> Oops; nope.  I'll ask Ralph to add it...
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Dave Love
Reuti  writes:

> Do all of them have an internal bookkeeping of granted cores to slots
> - i.e. not only the number of scheduled slots per job per node, but
> also which core was granted to which job? Does Open MPI read this
> information would be the next question then.

OMPI works with the bindings it's handed via orted (if the processes are
started that way).

>> My understanding is that Torque delegates to OpenMPI the process placement 
>> and binding (beyond the list of nodes/cpus available for
>> the job).

Can't/doesn't torque start the MPI processes itself?  Otherwise, yes,
since orted gets the binding.

>> My guess is that OpenPBS behaves the same as Torque.
>> 
>> SLURM and SGE/OGE *probably* have pretty much the same behavior.
>
> SGE/OGE: no, any binding request is only a soft request.

I don't understand that.  Does it mean the system-specific "strict" and
"non-strict" binding in hwloc, in which case I don't see how UGE can do
anything different?

> UGE: here you can request a hard binding. But I have no clue whether this 
> information is read by Open MPI too.
>
> If in doubt: use only complete nodes for each job (which is often done
> for massively parallel jobs anyway).

There's no need with a recent SGE.  All our jobs get core bindings --
unless they use all the cores, since binding them all is equivalent to
binding none -- and OMPI inherits them.  See
 for the
SGE+OMPI configuration.
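A generic way to check what binding a process actually inherited on Linux (independent of SGE or OMPI, and just a /proc sketch rather than anything from this thread) is to read its allowed-CPU list:

```shell
# Print the set of cores the current process may run on.  A task bound
# by the resource manager sees a restricted list here; an unbound one
# sees every core on the node.
grep Cpus_allowed_list /proc/self/status
```

Running the same command under mpiexec (one instance per rank) shows the per-rank masks.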

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/



Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Lloyd Brown
I don't know about your users, but experience has, unfortunately, taught
us to assume that users' jobs are very, very badly-behaved.

I choose to assume that it's incompetence on the part of programmers and
users, rather than malice, though. :-)

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 03/27/2014 04:49 PM, Dave Love wrote:
> Actually there's no need for cpusets unless jobs are badly-behaved and
> escape their bindings.


Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Dave Love
Gus Correa  writes:

> On 03/27/2014 05:05 AM, Andreas Schäfer wrote:
>>>> Queue systems won't allow resources to be oversubscribed.

[Maybe that meant that resource managers can, and typically do, prevent
resources being oversubscribed.]

>> I'm fairly confident that you can configure Slurm to oversubscribe
>> nodes: just specify more cores for a node than are actually present.
>>
>
> That is true.
> If you lie to the queue system about your resources,
> it will believe you and oversubscribe.

For what it's worth, oversubscription might be overall or limited.  We
just had a user running some crazy Java program he refuses to explain,
submitted as a serial job but running ~150 threads.  The over-subscription
was confined to the core it used, and the effect on the 127 others was
mostly due to the small overhead of the node daemon reading the crazy
/proc smaps file to track the memory usage.  The other cores were
normally subscribed.

Ob-OMPI:  the other jobs may have been OMPI ones!

> Torque has this same feature.
> I don't know about SGE.
> You may choose to set some or all nodes with more cores than they
> actually have, if that is a good choice for the codes you run.
> However, for our applications oversubscribing is bad, hence my mindset.

Right.  I don't think there's any question that it's a bad idea on a
general purpose cluster running some OMPI jobs, for instance.




Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Dave Love
Gus Correa  writes:

> Torque+Maui, SGE/OGE, and Slurm are free.

[OGE certainly wasn't free, but it apparently no longer exists --
another thing Oracle screwed up and eventually dumped.]

> If you build the queue system with cpuset control, a node can be
> shared among several jobs, but the cpus/cores will be assigned
> specifically
> to each job's processes, so that nobody steps on each other toes.

Actually there's no need for cpusets unless jobs are badly-behaved and
escape their bindings.  Core binding by the resource manager, inherited
by OMPI, is typically enough.  (Note that, as far as I know, cpusets are
Linux-specific now Irix is dead along with its better support for
resource management.)

Anyhow, yes you should use a resource manager even with only trivial
scheduling.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/



Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-27 Thread Dave Love
Edgar Gabriel  writes:

> not sure honestly. Basically, as suggested in this email chain earlier,
> I had to disable the PVFS2_IreadContig and PVFS2_IwriteContig routines
> in ad_pvfs2.c to make the tests pass. Otherwise the tests ran but
> produced wrong data. However, I did not have the time to figure out what
> actually goes wrong under the hood.

[I can't get into trac to comment on the issue (hangs on login), so I'm
following up here.]

In case it's not clear, the changes for 1.6 and 1.7 are different, and
probably shouldn't be.  The patch I took from 1.7 looked similar to
what's in mpich, but hard-wired rather than autoconfiscated, whereas the
patch for 1.6 on the tracker sets the entries to NULL instead.

> Edgar
>
> On 3/25/2014 9:21 AM, Rob Latham wrote:
>> 
>> 
>> On 03/25/2014 07:32 AM, Dave Love wrote:
>>> Edgar Gabriel  writes:
>>>
 I am still looking into the PVFS2 with ROMIO problem with the 1.6
 series, where (as I mentioned yesterday) the problem I am having right
 now is that the data is wrong. Not sure what causes it, but since I have
 teach this afternoon again, it might be Friday until I can dig into
 that.
>>>
>>> Was there any progress with this?  Otherwise, what version of PVFS2 is
>>> known to work with OMPI 1.6?  Thanks.
>> 
>> Edgar, should I pick this up for MPICH, or was this fix specific to
>> OpenMPI ?
>> 
>> ==rob
>> 


Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Jeff Squyres (jsquyres)
On Mar 27, 2014, at 4:06 PM, "Sasso, John (GE Power & Water, Non-GE)" wrote:

> Yes, I noticed that I could not find --display-map in any of the man pages.  
> Intentional?

Oops; nope.  I'll ask Ralph to add it...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

On 03/27/2014 04:10 PM, Reuti wrote:

Hi,

Am 27.03.2014 um 20:15 schrieb Gus Correa:




Awesome, but now here is my concern.

If we have OpenMPI-based applications launched as batch jobs
via a batch scheduler like SLURM, PBS, LSF, etc.
(which decides the placement of the app and dispatches it to the compute hosts),
then will including "--report-bindings --bind-to-core" cause problems?


Do all of them have an internal bookkeeping of granted cores to slots -
i.e. not only the number of scheduled slots per job per node, but also
which core was granted to which job? Does Open MPI read this information
would be the next question then.




I don't know all resource managers and schedulers.

I use Torque+Maui here.
OpenMPI is built with Torque support,

and will use the nodes and cpus/cores provided by Torque.


Same question here.



Hi Reuti

On Torque the answer is "it depends".
If you configure it with cpuset enabled (which is *not* the default)
then the job can run only on those cpus/cores listed under
/dev/cpuset/bla/bla/job_number/bla/bla.
Otherwise, processes and threads are free to run on any cores inside
the nodes Torque assigned to the job.
However, process placement and binding is deferred to MPI.
What I like about this is that they (Torque and OMPI)
coexist without interfering with each other.
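As an illustration of that cpuset layout (the exact mount point and path are configuration-dependent, so the names below are hypothetical):

```
# Hypothetical: list the cores Torque's cpuset grants to one job.
# The real path depends on how and where the cpuset FS is mounted.
cat /dev/cpuset/torque/<job_number>/cpus
```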

My quick reading of some Slurm documents suggested that
it is configured by default with cpuset enabled,
and if I understood right "srun" does core binding by default as well
(which you can override with other types of binding).
However, I don't clearly understand the interplay
between srun and mpirun.
Does srun perhaps replace mpirun,
and take over process placement and binding?
Or do they coexist in harmony?
I am not a Slurm user, though, so what I wrote
above are just wild guesses, and may be completely wrong.
In any case, this discouraged me a bit from trying Slurm.

IMHO, the resource manager has no business in
enforcing process/thread placement and binding,
and minimally should have an option to
turn it off at the user's request, and let MPI and other tools do it.
As you certainly know, besides MPI (OMPI in particular),
OpenMP has its own mechanisms for thread binding as well,
and so do hwloc, taskset, numactl, etc.
I think these are the natural baby-sitters of processes,
threads, cpus, cores, NUMA, etc.
The resource manager should keep baby-sitting the jobs and the
coarse-grained resources, as it always did.
Otherwise those children will be spoiled by too much attention.
One tool for each small task, keep it simple:
aren't these the principles that made Unix's success and longevity?
However, this baby-sitting job may be well paid,
hence there are more and more people applying for it.

Gus Correa




My understanding is that Torque delegates to OpenMPI
the process placement and binding
(beyond the list of nodes/cpus available for the job).

My guess is that OpenPBS behaves the same as Torque.

SLURM and SGE/OGE *probably* have pretty much the same behavior.


SGE/OGE: no, any binding request is only a soft request.
UGE: here you can request a hard binding. But I have no clue whether
this information is read by Open MPI too.


If in doubt: use only complete nodes for each job
(which is often done for massively parallel jobs anyway).


-- Reuti



A cursory reading of the SLURM web page suggested to me that it
does core binding by default, but don't quote me on that.

I don't know what LSF does, but I would guess there is a
way to do the appropriate bindings, either at the resource
manager level, or at the OpenMPI level (or a combination of both).



Certainly I can test this, but I am concerned there may be a case where
inclusion of --bind-to-core would cause an unexpected problem I did not
account for.


--john



Well, testing and failing is part of this game!
Would the GE manager buy that? :)

I hope this helps,
Gus Correa



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 27, 2014 2:06 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set,
which would reveal the mapping of ranks to the hosts they are on?

Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Reuti
Hi,

Am 27.03.2014 um 20:15 schrieb Gus Correa:

> 
>> Awesome, but now here is my concern.
> If we have OpenMPI-based applications launched as batch jobs
> via a batch scheduler like SLURM, PBS, LSF, etc.
> (which decides the placement of the app and dispatches it to the compute 
> hosts),
> then will including "--report-bindings --bind-to-core" cause problems?

Do all of them have an internal bookkeeping of granted cores to slots - i.e. 
not only the number of scheduled slots per job per node, but also which core 
was granted to which job? Does Open MPI read this information would be the next 
question then.


> I don't know all resource managers and schedulers.
> 
> I use Torque+Maui here.
> OpenMPI is built with Torque support, and will use the nodes and cpus/cores 
> provided by Torque.

Same question here.


> My understanding is that Torque delegates to OpenMPI the process placement 
> and binding (beyond the list of nodes/cpus available for
> the job).
> 
> My guess is that OpenPBS behaves the same as Torque.
> 
> SLURM and SGE/OGE *probably* have pretty much the same behavior.

SGE/OGE: no, any binding request is only a soft request.
UGE: here you can request a hard binding. But I have no clue whether this 
information is read by Open MPI too.

If in doubt: use only complete nodes for each job (which is often done for 
massively parallel jobs anyway).

-- Reuti


> A cursory reading of the SLURM web page suggested to me that it
> does core binding by default, but don't quote me on that.
> 
> I don't know what LSF does, but I would guess there is a
> way to do the appropriate bindings, either at the resource manager level, or 
> at the OpenMPI level (or a combination of both).
> 
> 
> Certainly I can test this, but concerned there may be a case where inclusion 
> of
> --bind-to-core would cause an unexpected problem I did not account for.
>> 
>> --john
>> 
> 
> Well, testing and failing is part of this game!
> Would the GE manager buy that? :)
> 
> I hope this helps,
> Gus Correa
> 
>> 
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>> Sent: Thursday, March 27, 2014 2:06 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
>> 
>> Hi John
>> 
>> Take a look at the mpiexec/mpirun options:
>> 
>> -report-bindings (this one should report what you want)
>> 
>> and maybe also also:
>> 
>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>> 
>> and similar, if you want more control on where your MPI processes run.
>> 
>> "man mpiexec" is your friend!
>> 
>> I hope this helps,
>> Gus Correa
>> 
>> On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
>>> When a piece of software built against OpenMPI fails, I will see an
>>> error referring to the rank of the MPI task which incurred the failure.
>>> For example:
>>> 
>>> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>>> 
>>> with errorcode 1.
>>> 
>>> Unfortunately, I do not have access to the software code, just the
>>> installation directory tree for OpenMPI.  My question is:  Is there a
>>> flag that can be passed to mpirun, or an environment variable set,
>>> which would reveal the mapping of ranks to the hosts they are on?
>>> 
>>> I do understand that one could have multiple MPI ranks running on the
>>> same host, but finding a way to determine which rank ran on what host
>>> would go a long way in helping to troubleshoot problems which may be
>>> central to the host.  Thanks!
>>> 
>>>--john
>>> 
>>> 
>>> 



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Sasso, John (GE Power & Water, Non-GE)
Yes, I noticed that I could not find --display-map in any of the man pages.  
Intentional?


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 27, 2014 3:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

On 03/27/2014 03:02 PM, Ralph Castain wrote:
> Or use --display-map to see the process to node assignments
>

Aha!
That one was not on my radar.
Maybe because somehow I can't find it in the OMPI 1.6.5 mpiexec man page.
However, it seems to work with that version also, which is great.
(--display-map goes to stdout, whereas -report-bindings goes to stderr,
right?)
Thanks, Ralph!

Gus Correa

> Sent from my iPhone
>
>> On Mar 27, 2014, at 11:47 AM, Gus Correa  wrote:
>>
>> PS - The (OMPI 1.6.5) mpiexec default is -bind-to-none, in which case 
>> -report-bindings won't report anything.
>>
>> So, if you are using the default,
>> you can apply Joe Landman's suggestion (or alternatively use the 
>> MPI_Get_processor_name function, in lieu of uname(); cpu_name = 
>> uts.nodename; ).
>>
>> However, many MPI applications benefit from some type of hardware 
>> binding, maybe yours will do also, and as a bonus -report-bindings will tell 
>> you where each rank ran.
>> mpiexec's -tag-output is also helpful for debugging, but won't tell 
>> you the node name, just the MPI rank.
>>
>> You can setup a lot of these things as your preferred defaults, via 
>> mca parameters, and omit them from the mpiexec command line.
>> The trick is to match each mpiexec option to the appropriate mca 
>> parameter, as the names are not exactly the same.
>> "ompi_info --all" may help in that regard.
>> See this FAQ:
>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>
>> Again, the OMPI FAQ page is your friend!  :) 
>> http://www.open-mpi.org/faq/
>>
>> I hope this helps,
>> Gus Correa
>>
>>> On 03/27/2014 02:06 PM, Gus Correa wrote:
>>> Hi John
>>>
>>> Take a look at the mpiexec/mpirun options:
>>>
>>> -report-bindings (this one should report what you want)
>>>
>>> and maybe also also:
>>>
>>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>>>
>>> and similar, if you want more control on where your MPI processes run.
>>>
>>> "man mpiexec" is your friend!
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
 On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
 When a piece of software built against OpenMPI fails, I will see an 
 error referring to the rank of the MPI task which incurred the failure.
 For example:

 MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

 with errorcode 1.

 Unfortunately, I do not have access to the software code, just the 
 installation directory tree for OpenMPI.  My question is:  Is there 
 a flag that can be passed to mpirun, or an environment variable 
 set, which would reveal the mapping of ranks to the hosts they are on?

 I do understand that one could have multiple MPI ranks running on 
 the same host, but finding a way to determine which rank ran on 
 what host would go a long way in helping to troubleshoot problems 
 which may be central to the host.  Thanks!

--john





Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

On 03/27/2014 03:02 PM, Ralph Castain wrote:

Or use --display-map to see the process to node assignments



Aha!
That one was not on my radar.
Maybe because somehow I can't find it in the
OMPI 1.6.5 mpiexec man page.
However, it seems to work with that version also, which is great.
(--display-map goes to stdout, whereas -report-bindings goes to stderr, 
right?)

Thanks, Ralph!

Gus Correa


Sent from my iPhone


On Mar 27, 2014, at 11:47 AM, Gus Correa  wrote:

PS - The (OMPI 1.6.5) mpiexec default is -bind-to-none,
in which case -report-bindings won't report anything.

So, if you are using the default,
you can apply Joe Landman's suggestion
(or alternatively use the MPI_Get_processor_name function,
in lieu of uname(); cpu_name = uts.nodename; ).
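The MPI_Get_processor_name approach mentioned here could look roughly like this (a generic sketch, not code from the thread; build with mpicc and run under mpiexec):

```
/* Print each rank's host name, so messages like "MPI_ABORT was invoked
 * on rank 1236" can be mapped back to a node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("rank %d runs on host %s\n", rank, name);
    MPI_Finalize();
    return 0;
}
```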

However, many MPI applications benefit from some type of hardware binding, 
maybe yours will do also, and as a bonus
-report-bindings will tell you where each rank ran.
mpiexec's -tag-output is also helpful for debugging,
but won't tell you the node name, just the MPI rank.

You can setup a lot of these things as your preferred defaults,
via mca parameters, and omit them from the mpiexec command line.
The trick is to match each mpiexec option to
the appropriate mca parameter, as the names are not exactly the same.
"ompi_info --all" may help in that regard.
See this FAQ:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Again, the OMPI FAQ page is your friend!  :)
http://www.open-mpi.org/faq/

I hope this helps,
Gus Correa


On 03/27/2014 02:06 PM, Gus Correa wrote:
Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa


On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set, which
would reveal the mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the
same host, but finding a way to determine which rank ran on what host
would go a long way in helping to troubleshoot problems which may be
central to the host.  Thanks!

   --john








Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Sasso, John (GE Power & Water, Non-GE)
Thank you!  That also works and is very helpful.

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, March 27, 2014 3:03 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

Or use --display-map to see the process to node assignments

Sent from my iPhone

> On Mar 27, 2014, at 11:47 AM, Gus Correa  wrote:
> 
> PS - The (OMPI 1.6.5) mpiexec default is -bind-to-none, in which case 
> -report-bindings won't report anything.
> 
> So, if you are using the default,
> you can apply Joe Landman's suggestion (or alternatively use the 
> MPI_Get_processor_name function, in lieu of uname(); cpu_name = 
> uts.nodename; ).
> 
> However, many MPI applications benefit from some type of hardware 
> binding, maybe yours will do also, and as a bonus -report-bindings will tell 
> you where each rank ran.
> mpiexec's -tag-output is also helpful for debugging, but won't tell 
> you the node name, just the MPI rank.
> 
> You can setup a lot of these things as your preferred defaults, via 
> mca parameters, and omit them from the mpiexec command line.
> The trick is to match each mpiexec option to the appropriate mca 
> parameter, as the names are not exactly the same.
> "ompi_info --all" may help in that regard.
> See this FAQ:
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
> 
> Again, the OMPI FAQ page is your friend!  :) 
> http://www.open-mpi.org/faq/
> 
> I hope this helps,
> Gus Correa
> 
>> On 03/27/2014 02:06 PM, Gus Correa wrote:
>> Hi John
>> 
>> Take a look at the mpiexec/mpirun options:
>> 
>> -report-bindings (this one should report what you want)
>> 
>> and maybe also also:
>> 
>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>> 
>> and similar, if you want more control on where your MPI processes run.
>> 
>> "man mpiexec" is your friend!
>> 
>> I hope this helps,
>> Gus Correa
>> 
>>> On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
>>> When a piece of software built against OpenMPI fails, I will see an 
>>> error referring to the rank of the MPI task which incurred the failure.
>>> For example:
>>> 
>>> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>>> 
>>> with errorcode 1.
>>> 
>>> Unfortunately, I do not have access to the software code, just the 
>>> installation directory tree for OpenMPI.  My question is:  Is there 
>>> a flag that can be passed to mpirun, or an environment variable set, 
>>> which would reveal the mapping of ranks to the hosts they are on?
>>> 
>>> I do understand that one could have multiple MPI ranks running on 
>>> the same host, but finding a way to determine which rank ran on what 
>>> host would go a long way in helping to troubleshoot problems which may 
>>> be central to the host.  Thanks!
>>> 
>>>   --john
>>> 
>>> 
>>> 


Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

Hi John

I just sent a PS message ...

On 03/27/2014 02:41 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

Thank you, Gus!  I did go through the mpiexec/mpirun man pages but
wasn't quite clear that -report-bindings was what I was looking for.
So what I did was rerun a program w/ --report-bindings, but no bindings
were reported.


Scratching my head, I decided to include --bind-to-core as well.

Voila, the bindings are reported!

The OMPI runtime environment is great.
It adds a lot of information and flexibility to what MPI alone provides.

I don't know your code, so it is hard to tell whether
-bycore and -bind-to-core are good choices, though.

Here we use those two options for pure MPI jobs.
Minimally you need to make sure there is enough memory per core for each
task; otherwise you may need to skip some cores (say, with
-cpus-per-proc) to leave enough RAM for each process.

If the code is an MPI+OpenMP hybrid you may perhaps use -bysocket and
-bind-to-socket, and set
OMP_NUM_THREADS=<number of cores per socket>
(assuming there are no nested OpenMP regions, which would complicate
matters).
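On, say, a dual-socket node with 8 cores per socket, that advice might translate to something like the following (numbers purely illustrative, OMPI 1.6-era option names, and the application name is a placeholder):

```
# Hypothetical hybrid launch: one MPI rank per socket, and one OpenMP
# thread per core of that socket.
export OMP_NUM_THREADS=8
mpiexec -np 16 -bysocket -bind-to-socket ./hybrid_app
```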


You can get finer control with the -rankfile option.

Apparently all or most of this syntax is changing in
the latest OMPI 1.7.X, though.



Awesome, but now here is my concern.

If we have OpenMPI-based applications launched as batch jobs
via a batch scheduler like SLURM, PBS, LSF, etc.
(which decides the placement of the app and dispatches it to the compute
hosts), then will including "--report-bindings --bind-to-core" cause
problems?

I don't know all resource managers and schedulers.

I use Torque+Maui here.
OpenMPI is built with Torque support, and will use the nodes and 
cpus/cores provided by Torque.
My understanding is that Torque delegates to OpenMPI the process
placement and binding (beyond the list of nodes/cpus available for
the job).

My guess is that OpenPBS behaves the same as Torque.

SLURM and SGE/OGE *probably* have pretty much the same behavior.
A cursory reading of the SLURM web page suggested to me that it
does core binding by default, but don't quote me on that.

I don't know what LSF does, but I would guess there is a
way to do the appropriate bindings, either at the resource manager 
level, or at the OpenMPI level (or a combination of both).



Certainly I can test this, but I am concerned there may be a case where
inclusion of --bind-to-core would cause an unexpected problem I did not
account for.


--john



Well, testing and failing is part of this game!
Would the GE manager buy that? :)

I hope this helps,
Gus Correa



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 27, 2014 2:06 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set,
which would reveal the mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the
same host, but finding a way to determine which rank ran on what host
would go a long way in helping troubleshoot problems which may be
central to the host.  Thanks!

--john








Re: [OMPI users] How to replace --cpus-per-proc by --map-by

2014-03-27 Thread Ralph Castain
Agreed - Jeff and I discussed this just this morning. I will be updating the
FAQ soon

Sent from my iPhone

> On Mar 27, 2014, at 9:24 AM, Gus Correa  wrote:
> 
> <\begin hijacking this thread>
> 
> I second Saliya's thanks to Tetsuya.
> I've been following this thread, to learn a bit more about
> how to use hardware locality with OpenMPI effectively.
> [I am still using "--bycore"+"--bind-to-core" in most cases,
> and "--cpus-per-proc" occasionally when in hybrid MPI+OpenMP mode.]
> 
> When it comes to hardware locality,
> the syntax and the functionality have changed fast and significantly
> in the recent past.
> Hence, it would be great if the OpenMPI web page could provide pointers
> for the type of external documentation that Tetsuya just sent.
> Perhaps also some additional guidelines and comments
> on what is available on each release/series of OpenMPI,
> and how to use these options.
> 
> There is some material about hwloc,
> but I can't see much about lama ( which means "mud" in my
> first language :) ).
> We can hardly learn things like that from the mpiexec man page
> alone, although it has very good examples.
> 
> Thank you,
> Gus Correa
> 
> <\end hijacking of this thread>
> 
>> On 03/27/2014 11:38 AM, Saliya Ekanayake wrote:
>> Thank you, this is really helpful.
>> 
>> Saliya
>> 
>> 
>> On Thu, Mar 27, 2014 at 5:11 AM, > > wrote:
>> 
>> 
>> 
>>Mapping and binding is related to so called process affinity.
>>It's a bit difficult for me to explain ...
>> 
>>So please see this URL below(especially the first half part
>>of it - from 1 to 20 pages):
>>
>> http://www.slideshare.net/jsquyres/open-mpi-explorations-in-process-affinity-eurompi13-presentation
>> 
>>Although these slides by Jeff are the explanation for LAMA,
>>which is another mapping system installed in the openmpi-1.7
>>series, I guess you can easily understand what is mapping and
>>binding in general terms.
>> 
>>Tetsuya
>> 
>> > Thank you Tetsuya - it worked.
>> >
>> > Btw. what's the difference between mapping and binding? I think I
>>am bit
>>confused here.
>> >
>> > Thank you,
>> > Saliya
>> >
>> >
>> > On Thu, Mar 27, 2014 at 4:19 AM,  >>wrote:
>> >
>> >
>> > Hi Saliya,
>> >
>> > What you want to do is map-by node. So please try below:
>> >
>> > -np 2 --map-by node:pe=4 --bind-to core
>> >
>> > You might not need to add --bind-to core, because it's default
>>binding.
>> >
>> > Tetsuya
>> >
>> > > Hi,
>> > >
>> > > I see in v.1.7.5rc5 --cpus-per-proc is deprecated and is advised to
>> > replace by --map-by :PE=N.
>> > > I've tried this but I couldn't get the expected allocation of
>>procs.
>> > >
>> > > For example I was running 2 procs on 2 nodes each with 2
>>sockets where
>>a
>> > socket has 4 cores. I wanted 1 proc per node and bound to all
>>cores in
>>one
>> > of the sockets. I could get this by using
>> > >
>> > > --bind-to core: --map-by ppr:1:node --cpus-per-proc 4 -np 2
>> > >
>> > > Then it'll show bindings as
>> > >
>> > > [i51:32274] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
>>0[core
>>1
>> > [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>> > [B/B/B/B][./././.]
>> > > [i52:31765] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket
>>0[core
>>1
>> > [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>> > [B/B/B/B][./././.]
>> > >
>> > >
>> > > Is there a better way without using -cpus-per-proc as suggested
>>to get
>> > the same effect?
>> > >
>> > > Thank you,
>> > > Saliya
>> > >
>> > >
>> > >
>> > > --
>> > > Saliya Ekanayake esal...@gmail.com 
>> > > Cell 812-391-4914  Home 812-961-6383
>>
>> > > http://saliya.org___
>> > > users mailing list
>> > >
>>users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org 
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake esal...@gmail.com 
>> > Cell 812-391-4914  Home 812-961-6383
>>
>> > http://saliya.org___
>> > users mailing list
>> >
>>users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> 

Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

PS - The (OMPI 1.6.5) mpiexec default is -bind-to-none,
in which case -report-bindings won't report anything.

So, if you are using the default,
you can apply Joe Landman's suggestion
(or alternatively use the MPI_Get_processor_name function,
in lieu of uname(&uts); cpu_name = uts.nodename; ).

However, many MPI applications benefit from some type of hardware
binding; maybe yours will too, and as a bonus
-report-bindings will tell you where each rank ran.
mpiexec's -tag-output is also helpful for debugging,
but won't tell you the node name, just the MPI rank.

You can set up a lot of these things as your preferred defaults,
via mca parameters, and omit them from the mpiexec command line.
The trick is to match each mpiexec option to
the appropriate mca parameter, as the names are not exactly the same.
"ompi-info --all" may help in that regard.
See this FAQ:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
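
For example, the two options discussed in this thread could be made
defaults via an MCA parameter file (a sketch; the parameter names below
are what I recall for the 1.6 series, so verify them against your build
with "ompi_info --param all all"):

```ini
# $HOME/.openmpi/mca-params.conf
# assumed equivalent of -report-bindings
orte_report_bindings = 1
# assumed equivalent of -bind-to-core
orte_process_binding = core
```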

Again, the OMPI FAQ page is your friend!  :)
http://www.open-mpi.org/faq/

I hope this helps,
Gus Correa

On 03/27/2014 02:06 PM, Gus Correa wrote:

Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set, which
would reveal the mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the
same host, but finding a way to determine which rank ran on what host
would go a long way in helping troubleshoot problems which may be
central to the host.  Thanks!

   --john







Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Sasso, John (GE Power & Water, Non-GE)
Thank you, Gus!  I did go through the mpiexec/mpirun man pages but wasn't quite 
clear that -report-bindings was what I was looking for.   So what I did is 
rerun a program w/ --report-bindings but no bindings were reported.

Scratching my head, I decided to include --bind-to-core as well.  Voila, the 
bindings are reported!  

Awesome, but now here is my concern.  If we have OpenMPI-based applications 
launched as batch jobs via a batch scheduler like SLURM, PBS, LSF, etc. (which 
decides the placement of the app and dispatches it to the compute hosts), then 
will including "--report-bindings --bind-to-core" cause problems?   Certainly I 
can test this, but I am concerned there may be a case where inclusion of 
--bind-to-core would cause an unexpected problem I did not account for.

--john


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 27, 2014 2:06 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
> When a piece of software built against OpenMPI fails, I will see an 
> error referring to the rank of the MPI task which incurred the failure.
> For example:
>
> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>
> with errorcode 1.
>
> Unfortunately, I do not have access to the software code, just the 
> installation directory tree for OpenMPI.  My question is:  Is there a 
> flag that can be passed to mpirun, or an environment variable set, 
> which would reveal the mapping of ranks to the hosts they are on?
>
> I do understand that one could have multiple MPI ranks running on the 
> same host, but finding a way to determine which rank ran on what host 
> would go a long way in helping troubleshoot problems which may be 
> central to the host.  Thanks!
>
>--john
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Gus Correa

Hi John

Take a look at the mpiexec/mpirun options:

-report-bindings (this one should report what you want)

and maybe also:

-bycore, -bysocket, -bind-to-core, -bind-to-socket, ...

and similar, if you want more control on where your MPI processes run.

"man mpiexec" is your friend!

I hope this helps,
Gus Correa

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set, which
would reveal the mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the
same host, but finding a way to determine which rank ran on what host
would go a long way in helping troubleshoot problems which may be
central to the host.  Thanks!

   --john








Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Joe Landman

On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:

When a piece of software built against OpenMPI fails, I will see an
error referring to the rank of the MPI task which incurred the failure.
For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD

with errorcode 1.

Unfortunately, I do not have access to the software code, just the
installation directory tree for OpenMPI.  My question is:  Is there a
flag that can be passed to mpirun, or an environment variable set, which
would reveal the mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the
same host, but finding a way to determine which rank ran on what host
would go a long way in helping troubleshoot problems which may be
central to the host.  Thanks!


In the past, I've done something like this (in C, though a similar thing 
would work well in Fortran/others)


#include <stdio.h>           /* printf() */
#include <sys/utsname.h>     /* struct utsname, uname() */
/* ... */
int debug = 1;
char *cpu_name;
struct utsname  uts;

/* ... later, after MPI_Init/MPI_Comm_rank/MPI_Comm_size,
   with "rank" filled in by MPI_Comm_rank ... */

uname(&uts);                 /* uname() fills the struct via a pointer */
cpu_name = uts.nodename;

if (debug==1) {
printf("hostname=%s, I am rank %d\n", cpu_name, rank);
}




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615


[OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Sasso, John (GE Power & Water, Non-GE)
When a piece of software built against OpenMPI fails, I will see an error 
referring to the rank of the MPI task which incurred the failure.  For example:

MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
with errorcode 1.

Unfortunately, I do not have access to the software code, just the installation 
directory tree for OpenMPI.  My question is:  Is there a flag that can be 
passed to mpirun, or an environment variable set, which would reveal the 
mapping of ranks to the hosts they are on?

I do understand that one could have multiple MPI ranks running on the same 
host, but finding a way to determine which rank ran on what host would go a 
long way in helping troubleshoot problems which may be central to the host.
Thanks!

  --john


Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Reuti
Am 27.03.2014 um 16:31 schrieb Gus Correa:

> On 03/27/2014 05:05 AM, Andreas Schäfer wrote:
>>> >Queue systems won't allow resources to be oversubscribed.
>> I'm fairly confident that you can configure Slurm to oversubscribe
>> nodes: just specify more cores for a node than are actually present.
>> 
> 
> That is true.
> If you lie to the queue system about your resources,
> it will believe you and oversubscribe.
> Torque has this same feature.
> I don't know about SGE.

It's possible too.

-- Reuti


> You may choose to set some or all nodes with more cores than they actually 
> have, if that is a good choice for the codes you run.
> However, for our applications oversubscribing is bad, hence my mindset.
> 
> Gus Correa
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Gus Correa

On 03/27/2014 05:05 AM, Andreas Schäfer wrote:

>Queue systems won't allow resources to be oversubscribed.

I'm fairly confident that you can configure Slurm to oversubscribe
nodes: just specify more cores for a node than are actually present.



That is true.
If you lie to the queue system about your resources,
it will believe you and oversubscribe.
Torque has this same feature.
I don't know about SGE.
You may choose to set some or all nodes with more cores than they 
actually have, if that is a good choice for the codes you run.

However, for our applications oversubscribing is bad, hence my mindset.

Gus Correa


[OMPI users] Hamster

2014-03-27 Thread madhurima madhunapanthula
Hi,

I came across Hamster while reading an article on Hadoop + OpenMPI.
Please let me know if the sources of Hamster are available for build and
testing.

-- 
Lokah samasta sukhinobhavanthu

Thanks,
Madhurima


Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Thomas Heller

On 03/27/2014 10:19 AM, Andreas Schäfer wrote:

On 14:26 Wed 26 Mar , Ross Boylan wrote:

[Main part is at the bottom]
On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:

If you have a complex workflow with varying computational loads, then
you might want to take a look at runtime systems which allow you to
express this directly through their API, e.g. HPX[1]. HPX has proven to
run with high efficiency on a wide range of architectures, and with a
multitude of different workloads.

Thanks for the pointer.


I might add that HPX can run on top of MPI, so you could gradually
migrate code towards it.


Another note which is relevant to this discussion:
In HPX we actually do oversubscribe the nodes. There are worker threads 
which are supposed to do the actual computations; those are usually 
pinned to the actual CPU cores (or hardware threads, depending on your 
machine and the way you want to do your thread pinning). On those worker 
threads, we then schedule (very lightweight) user-level tasks which run 
the actual user code. You can have on the order of several million 
concurrent HPX-threads (the user-level tasks) running in an application 
per node.
In addition to those worker threads, we have dedicated operating threads 
(only pinned to a certain socket or NUMA domain), which are responsible 
for doing the actual communication (this is, however, completely hidden 
behind our API, which supports truly asynchronous communication). In 
case you have communication running over MPI or directly on top of 
(native) ibverbs, those threads do a busy wait on the actual sends and 
receives. The impact on performance is negligible here. But keep in mind 
that we put quite some effort in there in order to achieve that.


Cheers,
Thomas




Cheers
-Andreas


--
Thomas Heller
Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik - Lehrstuhl Rechnerarchitektur
Martensstr. 3
91058 Erlangen
Tel.: 09131/85-27018
Fax:  09131/85-27912
Email: thomas.hel...@cs.fau.de


Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Andreas Schäfer
On 14:26 Wed 26 Mar , Ross Boylan wrote:
> [Main part is at the bottom]
> On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:
> > If you have a complex workflow with varying computational loads, then
> > you might want to take a look at runtime systems which allow you to
> > express this directly through their API, e.g. HPX[1]. HPX has proven to
> > run with high efficiency on a wide range of architectures, and with a
> > multitude of different workloads.
> Thanks for the pointer.

I might add that HPX can run on top of MPI, so you could gradually
migrate code towards it.

Cheers
-Andreas


-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!




Re: [OMPI users] How to replace --cpus-per-proc by --map-by

2014-03-27 Thread tmishima


Mapping and binding are related to so-called process affinity.
It's a bit difficult for me to explain ...

So please see this URL below (especially the first half part
of it - from 1 to 20 pages):
http://www.slideshare.net/jsquyres/open-mpi-explorations-in-process-affinity-eurompi13-presentation

Although these slides by Jeff are the explanation for LAMA,
which is another mapping system installed in the openmpi-1.7
series, I guess you can easily understand what mapping and
binding are in general terms.

Tetsuya
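
To make the distinction concrete on the command line (a sketch with
assumed option values for the 1.7 series):

```shell
# Mapping decides where each rank is placed; binding pins it there.
mpirun -np 4 --map-by socket --bind-to core --report-bindings ./app
# --map-by socket : ranks are assigned round-robin across sockets (mapping)
# --bind-to core  : each rank is then constrained to its core (binding)
# With mapping but no binding, the kernel scheduler remains free to
# migrate a rank among the cores of the object it was mapped to.
```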

> Thank you Tetsuya - it worked.
>
> Btw. what's the difference between mapping and binding? I think I am bit
confused here.
>
> Thank you,
> Saliya
>
>
> On Thu, Mar 27, 2014 at 4:19 AM,  wrote:
>
>
> Hi Saliya,
>
> What you want to do is map-by node. So please try below:
>
> -np 2 --map-by node:pe=4 --bind-to core
>
> You might not need to add --bind-to core, because it's default binding.
>
> Tetsuya
>
> > Hi,
> >
> > I see in v.1.7.5rc5 --cpus-per-proc is deprecated and is advised to
> replace by --map-by :PE=N.
> > I've tried this but I couldn't get the expected allocation of procs.
> >
> > For example I was running 2 procs on 2 nodes each with 2 sockets where
a
> socket has 4 cores. I wanted 1 proc per node and bound to all cores in
one
> of the sockets. I could get this by using
> >
> > --bind-to core: --map-by ppr:1:node --cpus-per-proc 4 -np 2
> >
> > Then it'll show bindings as
> >
> > [i51:32274] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
1
> [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B][./././.]
> > [i52:31765] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core
1
> [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B][./././.]
> >
> >
> > Is there a better way without using -cpus-per-proc as suggested to get
> the same effect?
> >
> > Thank you,
> > Saliya
> >
> >
> >
> > --
> > Saliya Ekanayake esal...@gmail.com
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org___
> > users mailing list
> > users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Saliya Ekanayake esal...@gmail.com
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org___
> users mailing list
> users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Andreas Schäfer
Heya,

On 19:21 Wed 26 Mar , Gus Correa wrote:
> On 03/26/2014 05:26 PM, Ross Boylan wrote:
> > [Main part is at the bottom]
> > On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:
> >> On 09:08 Wed 26 Mar , Ross Boylan wrote:
> >>> Second, we do not operate in a batch queuing environment
> >> Why not fix that?
> > I'm not the sysadmin, though I'm involved in the group that sets policy.
> > At one point we were using Sun's grid engine, but I don't think it's
> > installed now.  I'm not sure why.
> >
> > We have discussed putting in a batch queuing system and nobody was
> > really pushing for it.  My impression was (and probably still is) that
> > it was more pain than gain.  There is hassle not only for the sysadmin
> > to set it up (and, I suppose, monitor it), but for users.  Personally I
> > run a lot of interactive parallel jobs (the interaction is on rank 0
> > only).  I have the impression that won't work under a batch system,
> > though I could be wrong.  I also had the impression we'd need to have an
> > estimate of how long the job would run when we submit, and we don't
> > always know.
> 
> But I've never really used such a system, and may not appreciate what it
> would get us.  The other reason we haven't bothered is that the load on
> the cluster was relatively light and contention was low.  That is less
> and less true, which probably starts tipping the balance toward a
> queuing system.
> 
> This is wandering off topic, but if you or anyone else could say more
> about why you regard the absence of a queuing system as a problem that
> should be fixed, I'd love to hear it.
> 
> Ross
> 
> Hi Ross
> 
> Some pros:
> (I don't know of any cons.)

I second Gus' statement that there are no real downsides to a
queueing system. These systems actually relieve both users and
admins of a lot of tedious fiddling and debugging. If you're doing a
fresh install, then I'd suggest you use Slurm[1]. It's a breeze to
install and easy to maintain. It also integrates well with all major
MPI implementations. Yes, the admin and users need to invest time
to learn the ropes, but the payoff is almost instant. Source: I'm the
sysadmin for our research clusters.

> Queue systems won't allow resources to be oversubscribed.

I'm fairly confident that you can configure Slurm to oversubscribe
nodes: just specify more cores for a node than are actually present.
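
For instance, something like this slurm.conf fragment (illustrative
host and partition names; I have not verified the exact semantics, so
treat it as a sketch):

```ini
# A 16-core node advertised with 32 CPUs, so Slurm will happily
# schedule twice as many tasks as there are physical cores
NodeName=node01 CPUs=32 State=UNKNOWN
PartitionName=batch Nodes=node01 Default=YES State=UP
```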

> Queue systems do support interactive jobs (even with X-windows GUIs, if 
> needed).

Right, actually we've just moved a couple of systems, which are
primarily running interactive jobs, to Slurm to ease arbitration of
resources. Previously users were frequently stepping on each other's
toes (Who's pinning jobs to which core? Who's using which GPU? How
much RAM do you consume?). These problems are gone now.

Cheers
-Andreas

[1] https://computing.llnl.gov/linux/slurm/


-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!




Re: [OMPI users] How to replace --cpus-per-proc by --map-by

2014-03-27 Thread Saliya Ekanayake
Thank you Tetsuya - it worked.

Btw. what's the difference between mapping and binding? I think I am a bit
confused here.

Thank you,
Saliya


On Thu, Mar 27, 2014 at 4:19 AM,  wrote:

>
>
> Hi Saliya,
>
> What you want to do is map-by node. So please try below:
>
> -np 2 --map-by node:pe=4 --bind-to core
>
> You might not need to add --bind-to core, because it's default binding.
>
> Tetsuya
>
> > Hi,
> >
> > I see in v.1.7.5rc5 --cpus-per-proc is deprecated and is advised to
> replace by --map-by :PE=N.
> > I've tried this but I couldn't get the expected allocation of procs.
> >
> > For example I was running 2 procs on 2 nodes each with 2 sockets where a
> socket has 4 cores. I wanted 1 proc per node and bound to all cores in one
> of the sockets. I could get this by using
> >
> > --bind-to core: --map-by ppr:1:node --cpus-per-proc 4 -np 2
> >
> > Then it'll show bindings as
> >
> > [i51:32274] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1
> [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B][./././.]
> > [i52:31765] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1
> [hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B][./././.]
> >
> >
> > Is there a better way without using -cpus-per-proc as suggested to get
> the same effect?
> >
> > Thank you,
> > Saliya
> >
> >
> >
> > --
> > Saliya Ekanayake esal...@gmail.com
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org___
> > users mailing list
> > users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Saliya Ekanayake esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org


Re: [OMPI users] How to replace --cpus-per-proc by --map-by

2014-03-27 Thread tmishima


Hi Saliya,

What you want to do is map-by node. So please try below:

-np 2 --map-by node:pe=4 --bind-to core

You might not need to add --bind-to core, because it's default binding.

Tetsuya
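
Spelled out side by side for the example in this thread (a sketch; the
program name is assumed):

```shell
# Deprecated --cpus-per-proc form (OMPI 1.7.5)
mpirun -np 2 --bind-to core --map-by ppr:1:node --cpus-per-proc 4 ./app

# Replacement using the pe= modifier
mpirun -np 2 --map-by node:pe=4 --bind-to core ./app
```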

> Hi,
>
> I see in v.1.7.5rc5 --cpus-per-proc is deprecated and is advised to
replace by --map-by :PE=N.
> I've tried this but I couldn't get the expected allocation of procs.
>
> For example I was running 2 procs on 2 nodes each with 2 sockets where a
socket has 4 cores. I wanted 1 proc per node and bound to all cores in one
of the sockets. I could get this by using
>
> --bind-to core: --map-by ppr:1:node --cpus-per-proc 4 -np 2
>
> Then it'll show bindings as
>
> [i51:32274] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1
[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
[B/B/B/B][./././.]
> [i52:31765] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1
[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
[B/B/B/B][./././.]
>
>
> Is there a better way without using -cpus-per-proc as suggested to get
the same effect?
>
> Thank you,
> Saliya
>
>
>
> --
> Saliya Ekanayake esal...@gmail.com
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org___
> users mailing list
> users@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] How to replace --cpus-per-proc by --map-by

2014-03-27 Thread Saliya Ekanayake
Hi,

I see in v.1.7.5rc5 that --cpus-per-proc is deprecated and it is advised to
replace it by --map-by :PE=N.
I've tried this but I couldn't get the expected allocation of procs.

For example I was running 2 procs on 2 nodes each with 2 sockets where a
socket has 4 cores. I wanted 1 proc per node and bound to all cores in one
of the sockets. I could get this by using

--bind-to core: --map-by ppr:1:node --cpus-per-proc 4 -np 2

Then it'll show bindings as

[i51:32274] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
[B/B/B/B][./././.]
[i52:31765] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
[B/B/B/B][./././.]


Is there a better way without using -cpus-per-proc as suggested to get the
same effect?

Thank you,
Saliya



-- 
Saliya Ekanayake esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org