Re: [OMPI users] null characters in output

2008-06-23 Thread Sacerdoti, Federico
Ralph,

I'm working on a test-case for that now, hopefully I can nail it down to
a particular openmpi version.

I have another small issue, which is somewhat bothersome: orterun 1.2.6
exits with return code zero if the executable cannot be found. Should
this be non-zero?

E.g.
$ orterun /asdf

--
Failed to find or execute the following executable:

Host:   drdblogin2.en.desres.deshaw.com
Executable: /asdf

Cannot continue.

--
$ echo $?
0
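
Until that is fixed, a workflow script can guard itself. A minimal
sketch, assuming the launch target is a plain path as in the example:

$ test -x /asdf || { echo "executable not found" >&2; exit 1; }
$ orterun /asdf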

Thanks
Federico

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] null characters in output

No, I haven't seen that - if you can provide an example, we can take
a look at it.

Thanks
Ralph



On 6/19/08 8:15 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph, another issue perhaps you can shed some light on.
> 
> When launching with orterun, we sometimes see null characters in the
> stdout output. These do not show up on a terminal, but when piped to a
> file they are visible in an editor. They also can show up in the
> middle of a line, and so can interfere with greps on the output, etc.
> 
> Have you seen this before? I am working on a simple test case, but
> unfortunately have not found one that is deterministic so far.
> 
> Thanks,
> Federico 
> 
> -Original Message-
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
> 
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2.
> Part of the problem was that we weren't using the --kill-on-bad-exit
> flag when we called srun internally, which has been fixed for 1.3.
> 
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we
> use orterun to setup the cmd line and then tell the daemons what they
> need to do. The issues you are seeing relate to our ability to detect
> that srun has failed, and/or that one or more daemons have failed to
> launch or do something they were supposed to do. The 1.2 system has
> problems in that regard, which was one motivation for the 1.3
> overhaul.
> 
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said,
> we use srun to launch the daemons - the only reason we hang is that
> srun is not returning with an error. I've seen this on other systems
> as well, but have no real answer - if slurm doesn't indicate an error
> has occurred, I'm not sure what I can do about it.
> 
> We are unlikely to use srun to directly launch jobs (i.e., to have
> slurm directly launch the job from an srun cmd line without mpirun)
> anytime soon. It isn't clear there is enough benefit to justify the
> rather large effort, especially considering what would be required to
> maintain scalability. Decisions on all that are still pending,
> though, which means any significant change in that regard wouldn't be
> released until sometime next year.
> 
> Ralph
> 
> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
> <federico.sacerd...@deshawresearch.com> wrote:
> 
>> Ralph,
>> 
>> I was wondering what the status of this feature was (using srun to
>> launch orted daemons)? I have two new bug reports to add from our
>> experience using orterun from 1.2.6 on our 4000 CPU infiniband
>> cluster.
>> 
>> 1. Orterun will happily hang if it is asked to run on an invalid
>> slurm job, e.g. if the job has exceeded its timelimit. This would
>> be trivially fixed if you used srun to launch, as srun would fail
>> with non-zero exit codes.
>> 
>> 2. A very simple orterun invocation hangs instead of exiting with
>> an error. In this case the executable does not exist, and we would
>> expect orterun to exit non-zero. This has caused headaches with
>> some workflow management scripts that automatically start jobs.
>> 
>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>> [hang]
>> 
>> orterun dummy-binary-I-dont-exist

Re: [OMPI users] SLURM and OpenMPI

2008-06-23 Thread Sacerdoti, Federico
Ralph, 

Thanks for your reply. Let me know if I can help in any way.

fds 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] SLURM and OpenMPI

Well, if the only system I cared about was slurm, there are some
things I could possibly do to make things better, but at the expense
of our support for other environments - which is unacceptable.

There are a few technical barriers to doing this without the orteds
on slurm, and a major licensing issue that prohibits us from calling
any slurm APIs. How all that gets resolved is unclear.

Frankly, one reason we don't put more emphasis on it is that we don't
see a significant launch time difference between the two modes, and
we truly do want to retain the ability to utilize different error
response strategies (which slurm will not allow - you can only follow
theirs).

So I would say we simply have different objectives than what you
stated, and different concerns that make a deeper slurm integration
less favorable. May still happen, but not anytime soon.

Ralph



On 6/19/08 8:08 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph thanks for your quick response.
> 
> Regarding your fourth paragraph, slurm will not let you run on a
> no-longer-valid allocation, and srun will correctly exit non-zero
> with a useful failure reason. So perhaps openmpi 1.3 with your
> changes will just work; I look forward to testing it.
> 
> E.g.
> $ srun hostname
> srun: error: Unable to confirm allocation for job 745346: Invalid
> job id specified
> srun: Check SLURM_JOBID environment variable for expired or invalid
> job.
> 
> 
> Regarding srun to launch the jobs directly (no orteds), I am sad to
> hear the idea is not in favor. We have found srun to be extremely
> scalable (tested up to 4096 MPI processes) and very good at cleaning
> up after an error or node failure. It seems you could simplify
> orterun quite a bit by relying on slurm (or whatever resource
> manager) to handle job cleanup after failures; it is their
> responsibility after all, and they have better knowledge about the
> health and availability of nodes than any launcher can hope for.
> 
> I helped write an mvapich launcher used internally called mvrun,
> which was used for several years. I wrote a lot of logic to run down
> and stop all processes when one had failed, which I understand you
> have as well. We came to the conclusion that slurm was in a better
> position to handle such failures, and in fact did it more
> effectively. For example if slurm detects a node has failed, it will
> stop the job, allocate an additional free node to make up the
> deficit, then relaunch. It is more difficult (to put it mildly) for
> a job launcher to do that.
> 
> Thanks again,
> Federico
> 
> -Original Message-
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
> 
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2.
> Part of the problem was that we weren't using the --kill-on-bad-exit
> flag when we called srun internally, which has been fixed for 1.3.
> 
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we
> use orterun to setup the cmd line and then tell the daemons what
> they need to do. The issues you are seeing relate to our ability to
> detect that srun has failed, and/or that one or more daemons have
> failed to launch or do something they were supposed to do. The 1.2
> system has problems in that regard, which was one motivation for the
> 1.3 overhaul.
> 
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said,
> we use srun to launch the daemons - the only reason we hang is that
> srun is not returning with an error. I've seen this on other systems
> as well, but have no real answer - if slurm doesn't indicate an error
> has occurred, I'm not sure what I can do about it.
> 
> We are unlikely to use srun to directly launch jobs (i.e., to have
> slurm directly launch the job from an srun cmd line without mpirun)
> anytime soon. It isn't clear there is enough benefit to justify the
> rather large

Re: [OMPI users] SLURM and OpenMPI

2008-03-21 Thread Sacerdoti, Federico

Ralph wrote:
"I don't know if I would say we "interfere" with SLURM - I would say
that we
are only lightly integrated with SLURM at this time. We use SLURM as a
resource manager to assign nodes, and then map processes onto those
nodes
according to the user's wishes. We chose to do this because srun applies
its
own load balancing algorithms if you launch processes directly with it,
which leaves the user with little flexibility to specify their desired
rank/slot mapping. We chose to support the greater flexibility."

Ralph, we wrote a launcher for mvapich that uses srun to launch but
keeps tight control of where processes are started. The way we did it
was to force srun to launch a single process on a particular node. 

The launcher calls many of these:
 srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
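
Spelled out, the launch loop looks roughly like this (a sketch; the
real launcher is more involved, and the host list here is
hypothetical):

 for host in host001 host002 host003; do
   srun --jobid $JOBID -N 1 -n 1 -w $host CMD ARGS &
 done
 wait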

Hope this helps (and we are looking forward to a tighter orterun/slurm
integration as you know).

Regards,
Federico

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Thursday, March 20, 2008 6:41 PM
To: Open MPI Users 
Cc: Ralph Castain
Subject: Re: [OMPI users] SLURM and OpenMPI

Hi there

I am no slurm expert. However, it is our understanding that
SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
not the number of tasks to be executed on each node. So the 4(x2)
tells us that we have 4 slots on each of two nodes to work with. You
got 4 slots on each node because you used the -N option, which told
slurm to assign all slots on that node to this job - I assume you
have 4 processors on your nodes. OpenMPI parses that string to get
the allocation, then maps the number of specified processes against
it.

It is possible that the interpretation of SLURM_TASKS_PER_NODE is
different when used to allocate as opposed to directly launch
processes. Our typical usage is for someone to do:

srun -N 2 -A
mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun
separately within it.


I am therefore unsure what the "-n 2" will do here. If I believe the
documentation, it would seem to imply that srun will attempt to
launch two copies of "mpirun -np 2 helloworld", yet your output
doesn't seem to support that interpretation. It would appear that the
"-n 2" is being ignored and only one copy of mpirun is being
launched. I'm no slurm expert, so perhaps that interpretation is
incorrect.

Assuming that the -n 2 is ignored in this situation, your command
line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped byslot against the
slurm allocation of two nodes, each having 4 slots. Thus, both
processes will be launched on the first node, which is what you
observed.
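
(A hedged aside: mpirun's --bynode option switches the mapping policy
from the default byslot, so a command like

 srun -N 2 -n 2 -b mpirun -np 2 --bynode helloworld

should place one process on each of the two nodes.)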

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will
launch a process on every available slot in the allocation. Given
this command, that means 4 procs will be launched on each of the 2
nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first
node, ranks 4-7 on the second. Again, this is what you observed.

I don't know if I would say we "interfere" with SLURM - I would say
that we are only lightly integrated with SLURM at this time. We use
SLURM as a resource manager to assign nodes, and then map processes
onto those nodes according to the user's wishes. We chose to do this
because srun applies its own load balancing algorithms if you launch
processes directly with it, which leaves the user with little
flexibility to specify their desired rank/slot mapping. We chose to
support the greater flexibility.

Using the SLURM-defined mapping will require launching without our
mpirun. This capability is still under development, and there are
issues with doing that in slurm environments which need to be
addressed. It is at a lower priority than providing such support for
TM right now, so I wouldn't expect it to become available for several
months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined
mapping and use it to launch the processes. If we can get it somehow,
there is no problem launching it as specified - the problem is how to
get the map! Unfortunately, slurm's licensing prevents us from using
its internal APIs, so obtaining the map is not an easy thing to do.

Anyone who wants to help accelerate that timetable is welcome to
contact me. We know the technical issues - this is mostly a problem
of (a) priorities versus my available time, and (b) similar
considerations on the part of the slurm folks to do the work
themselves.

Ralph


On 3/20/08 3:48 PM, "Tim Prins"  wrote:

> Hi Werner,
> 
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=slurm
> for general information about running with Open MPI on SLURM.
> 
> After trying the commands you 

[OMPI users] FW: slurm and all-srun orterun

2008-03-06 Thread Sacerdoti, Federico
Ralph, here is Moe's response. The srun options he mentions look
promising: they can signal an otherwise happy orted daemon (sitting on a
waitpid) that something is amiss elsewhere in the job. Do orteds change
their session ID?

Thanks Moe,
Federico

-Original Message-
From: jet...@llnl.gov [mailto:jet...@llnl.gov] 
Sent: Wednesday, March 05, 2008 2:21 PM
To: Sacerdoti, Federico; Open MPI Users
Subject: RE: [OMPI users] slurm and all-srun orterun

Slurm and its APIs are available under the GPL license.
Since Open MPI is not available under the GPL license, it
cannot link with the Slurm APIs; however, virtually all
of that API functionality is available through existing
Slurm commands. The commands are clearly not as simple to
use as the APIs, but if you encounter any problems using
the commands we can certainly make changes to facilitate
their use. For example, Slurm communicates with the Maui
and Moab schedulers using an interface that loosely
resembles XML. We are also prepared to provide additional
functionality as needed by OpenMPI.

Regarding premature termination of processes that Slurm
spawns, the srun command has a couple of options that may
prove useful:

-K, --kill-on-bad-exit
  Terminate a job if any task exits with a non-zero exit code.

-W, --wait=seconds
  Specify how long to wait after the first task terminates before
  terminating all remaining tasks. A value of 0 indicates an
  unlimited wait (a warning will be issued after 60 seconds). The
  default value is set by the WaitTime parameter in the slurm
  configuration file (see slurm.conf(5)). This option can be useful
  to insure that a job is terminated in a timely fashion in the
  event that one or more tasks terminate prematurely.
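
For example, both options can be combined on a single launch (a
sketch with hypothetical node and task counts):

  srun --kill-on-bad-exit --wait=30 -N 2 -n 8 a.out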

Any tasks launched outside of Slurm's control (e.g. rsh) are not
purged on job termination. Slurm locates spawned tasks and any of
their children using the configured ProcTrack plugin, of which
several are available. If you use the SID (session ID) plugin
and spawned tasks change their SID, Slurm will no longer track
them. Several reliable process tracking mechanisms are available,
but some do require kernel changes. See "man slurm.conf" for more
information.

Moe



At 11:16 AM -0500 3/5/08, Sacerdoti, Federico wrote:
>Thanks Ralph,
>
>First, we would be happy to test the slurm direct launch capability.
>Regarding the failure case, I realize that the IB errors do not
>directly affect the orted daemons. This is what we observed:
>
>1. Parallel job started
>2. IB errors caused some processes to fail (but not all)
>3. slurm tears down entire job, attempting to kill all orted and
>their children
>
>We want this behavior: if any process of a parallel job dies, all
>processes should be stopped. The orted daemons in charge of processes
>that did not fail are the problem, as slurm was not able to kill
>them. Sounds like this is a known issue in openmpi 1.2.x.
>
>In any case, the new direct launching methods sound promising. I am
>surprised there are licensing issues with Slurm; is this a
>GPL-and-BSD issue? I am CC'ing slurm author Moe; he may be able to
>help.
>
>Thanks again and I look forward to testing the direct launch,
>Federico
>
>
>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>Behalf Of Ralph Castain
>Sent: Monday, March 03, 2008 8:19 PM
>To: Open MPI Users <us...@open-mpi.org>
>Cc: Ralph Castain
>Subject: Re: [OMPI users] slurm and all-srun orterun
>
>Hello
>
>I don't monitor the user list any more, but a friendly elf sent this
>along to me.
>
>I'm not entirely sure what problem might be causing the behavior you
>are seeing. Neither mpirun nor any orted should be impacted by IB
>problems as they aren't MPI processes and thus never interact with
>IB. Only application procs touch the IB subsystem - if an application
>proc fails, the orted should see that and correctly order the
>shutdown of the job. So if you are having IB problems, that wouldn't
>explain daemons failing.
>
>If a daemon is aborting, that will cause problems in 1.2.x. We have
>noted that SLURM (even though the daemons are launched via srun)
>doesn't always tell us when this happens, leaving Open MPI vulnerable
>to "hangs" as it attempts to cleanup and finds it can't do it. I'm
>not sure why you would see a daemon die, though - the fact that an
>application process failed shouldn't cause that to happen. Likewise,
>it would seem strange that the application process would fail and the
>daemon not notice - this has nothing to do with slurm, but is just a
>standard Linux "waitpid" method.
>
>The most likely reason for the behavior you describe is that an
>application process encounters an IB problem which blocks
>communication - but the process doesn't actually abort or terminate,
>it just hangs there.

Re: [OMPI users] slurm and all-srun orterun

2008-03-05 Thread Sacerdoti, Federico
Thanks Ralph,

First, we would be happy to test the slurm direct launch capability.
Regarding the failure case, I realize that the IB errors do not directly
affect the orted daemons. This is what we observed:

1. Parallel job started
2. IB errors caused some processes to fail (but not all)
3. slurm tears down entire job, attempting to kill all orted and their
children

We want this behavior: if any process of a parallel job dies, all
processes should be stopped. The orted daemons in charge of processes
that did not fail are the problem, as slurm was not able to kill them.
Sounds like this is a known issue in openmpi 1.2.x.

In any case, the new direct launching methods sound promising. I am
surprised there are licensing issues with Slurm; is this a GPL-and-BSD
issue? I am CC'ing slurm author Moe; he may be able to help.

Thanks again and I look forward to testing the direct launch,
Federico


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Monday, March 03, 2008 8:19 PM
To: Open MPI Users <us...@open-mpi.org>
Cc: Ralph Castain
Subject: Re: [OMPI users] slurm and all-srun orterun

Hello

I don't monitor the user list any more, but a friendly elf sent this
along to me.

I'm not entirely sure what problem might be causing the behavior you
are seeing. Neither mpirun nor any orted should be impacted by IB
problems as they aren't MPI processes and thus never interact with
IB. Only application procs touch the IB subsystem - if an application
proc fails, the orted should see that and correctly order the
shutdown of the job. So if you are having IB problems, that wouldn't
explain daemons failing.

If a daemon is aborting, that will cause problems in 1.2.x. We have
noted that SLURM (even though the daemons are launched via srun)
doesn't always tell us when this happens, leaving Open MPI vulnerable
to "hangs" as it attempts to cleanup and finds it can't do it. I'm
not sure why you would see a daemon die, though - the fact that an
application process failed shouldn't cause that to happen. Likewise,
it would seem strange that the application process would fail and the
daemon not notice - this has nothing to do with slurm, but is just a
standard Linux "waitpid" method.

The most likely reason for the behavior you describe is that an
application process encounters an IB problem which blocks
communication - but the process doesn't actually abort or terminate,
it just hangs there. In this case, the orted doesn't see the process
exit, so the system doesn't know it should take any action.

That said, we know that 1.2.x has problems with clean shutdown in
abnormal situations. Release 1.3 (when it comes out) addresses these
issues and appears (from our testing, at least) to be much more
reliable about cleanup. You should see a definite improvement in the
detection of process failures and subsequent cleanup.

As for your question, I am working as we speak on two new launch
modes for Open MPI:

1. "direct" - this uses mpirun to directly launch the application
processes without use of the intermediate daemons.

2. "standalone" - this uses the native launch command to simply
launch the application processes, without use of mpirun or the
intermediate daemons.

The initial target environments for these capabilities are TM and
SLURM. The latter poses a bit of a challenge as we cannot use their
API due to licensing issues, so it will come a little later. We have
a design for getting around the problem - the ordering is more driven
by priorities than anything technical.

The direct launch capability -may- be included in 1.3 assuming it can
be completed in time for the release. If not, it will almost
certainly be in 1.3.1. I'm expecting to complete the TM version in
the next few days, and perhaps get the SLURM version working sometime
this month - but they will need validation before being included in
an official release.

I can keep you posted if you like - once this gets into our
repository, you are certainly welcome to try it out. I would welcome
feedback on it.

Hope that helps
Ralph


>> From: "Sacerdoti, Federico" <federico.sacerd...@deshaw.com>
>> Date: March 3, 2008 12:44:39 PM EST
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Subject: [OMPI users] slurm and all-srun orterun
>> Reply-To: Open MPI Users <us...@open-mpi.org>
>> 
>> Hi,
>> 
>> We are migrating to openmpi on our large (~1000 node) cluster, and
>> plan to use it exclusively on a multi-thousand core infiniband
>> cluster in the near future. We had extensive problems with parallel
>> processes not dying after a job crash, which was largely solved by
>> switching to the slurm resource manager.
>> 
>> While orterun supports slurm, it only uses the srun facility to
>> launch the

[OMPI users] slurm and all-srun orterun

2008-03-03 Thread Sacerdoti, Federico
Hi,

We are migrating to openmpi on our large (~1000 node) cluster, and plan
to use it exclusively on a multi-thousand core infiniband cluster in the
near future. We had extensive problems with parallel processes not dying
after a job crash, which was largely solved by switching to the slurm
resource manager.

While orterun supports slurm, it only uses the srun facility to launch
the "orted" daemons, which then start the actual user processes
themselves. In our recent migration to openmpi, we have noticed
occasions where orted did not correctly clean up after a parallel job
crash. In most cases the crash was due to an infiniband error. Most
worryingly, slurm was not able to clean up the orted, which, along
with the user processes, was left running.

At SC07 I was told that there is some talk of using srun to launch both
orted and user processes, or alternatively to use srun only. Either
would solve the cleanup problem, in our experience. Is Ralph Castain on
this list?

Thanks,
Federico

P.S.
We use the proctrack/linuxproc slurm process tracking plugin. As noted
in the config man page, this may fail to find certain processes, which
may explain why slurm could not clean up orted effectively.

 man slurm.conf(5), version 1.2.22:
NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to identify
all processes associated with a job since processes can become a child
of the init process (when the parent process terminates) or change their
process group. To reliably track all processes, one of the other
mechanisms utilizing kernel modifications is preferable.
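
For reference, the tracking plugin is selected in slurm.conf; a
minimal excerpt with our setting:

 ProctrackType=proctrack/linuxproc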



Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Sacerdoti, Federico
To keep this out of the weeds, I have attached a program called "bug3"
that illustrates this problem on openmpi 1.2.5 using the openib BTL. In
bug3, the process with rank 0 uses all available memory buffering
"unexpected" messages from its neighbors.

Bug3 is a test-case derived from a real, scalable application (desmond
for molecular dynamics) that several experienced MPI developers have
worked on. Note the MPI_Send calls of processes with rank N>0 are
*blocking*; openmpi silently sends them in the background and
overwhelms process 0 due to lack of flow control.

It may not be hard to change desmond to work around openmpi's small
message semantics, but a programmer should reasonably be allowed to
think a blocking send will block if the receiver cannot handle it yet.
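
For readers without the attachment, here is a minimal sketch of the
bug3 pattern; the message size, count, and delay are illustrative
assumptions, not bug3's actual values:

/* Sketch of the N-to-1 unexpected-message flood: every rank > 0 issues
 * many small blocking MPI_Send calls while rank 0 delays its receives.
 * With eager sends and no user-level flow control, rank 0 must buffer
 * each early arrival as an "unexpected" message. */
#include <mpi.h>
#include <string.h>
#include <unistd.h>

#define MSG_SIZE 1024     /* small enough to be sent eagerly (assumed) */
#define NUM_MSGS 1000000  /* enough to exhaust the receiver's memory */

int main(int argc, char **argv)
{
    int rank, size, src, i;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof(buf));

    if (rank == 0) {
        sleep(60);  /* a busy receiver: its receives are posted late */
        for (src = 1; src < size; src++)
            for (i = 0; i < NUM_MSGS; i++)
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        for (i = 0; i < NUM_MSGS; i++)
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}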

Federico

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brightwell, Ronald
Sent: Monday, February 04, 2008 3:30 PM
To: Patrick Geoffray
Cc: Open MPI Users
Subject: Re: [OMPI users] openmpi credits for eager messages

> > I'm looking at a network where the number of endpoints is large
> > enough that everybody can't have a credit to start with, and the
> > "offender" isn't any single process, but rather a combination of
> > processes doing N-to-1 where N is sufficiently large. I can't just
> > tell one process to slow down. I have to tell them all to slow down
> > and do it quickly...
> 
> When you have N->1 patterns, then the hardware flow-control will
> throttle the senders, or drop packets if there is no hardware
> flow-control. If you don't have HOL blocking but the receiver does
> not consume for any reason (busy, sleeping, dead, whatever), then you
> can still drop packets on the receiver (NIC, driver, thread) as a
> last resort; this is what TCP does. The key is to have exponential
> backoff (or a reasonably large resend timeout) to not continue the
> hammering.
> 
> It costs nothing in the common case (unlike the credits approach),
> but it does handle corner cases without affecting other nodes too
> much (unlike hardware flow-control).

Right. For a sufficiently large number of endpoints, flow control has
to get pushed out of MPI and down into the network, which is why I
don't necessarily want an MPI that does flow control at the user
level.

> 
> But you know all that. You are just being mean to your users because
> you can :-) The sick part is that I think I envy you...

You know it :)

-Ron




[Attachment: bug3.c]


[OMPI users] openmpi credits for eager messages

2008-02-01 Thread Sacerdoti, Federico
Hi,

I am readying an openmpi 1.2.5 software stack for use with a
many-thousand core cluster. I have a question about sending small
messages that I hope can be answered on this list. 

I was under the impression that if node A wants to send a small MPI
message to node B, it must have a credit to do so. The credit assures A
that B has enough buffer space to accept the message. Credits are
required by the mpi layer regardless of the BTL transport layer used.

I have been told by a Voltaire tech that this is not so: the credits
are used by the infiniband transport layer to reliably send a message,
and are not an openmpi feature.

Thanks,
Federico