Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Galen Shipman




I would recommend reading the following tech report; it should shed
some light on how these things work:
http://www.cs.unm.edu/research/search_technical_reports_by_keyword/?string=infiniband




1 - It does not seem that mvapich does RDMA for small messages. It will
do RDMA for any message that is too big to send eagerly, but the
threshold is not that low and cannot be lowered to apply to 0-byte messages
anyway (nothing lower than 128 bytes or so will work).



mvapich does do RDMA for small messages: they preallocate a buffer for
each peer and then poll each of these buffers for completion. Take a
look at the paper "High Performance RDMA-Based MPI Implementations over
InfiniBand" by Jiuxing Liu. Also try compiling mvapich without
-D RDMA_FAST_PATH; I am pretty sure this is the flag that tells mvapich
to compile with small message RDMA. Removing this flag will force
mvapich to fall back to send/recv.
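
Roughly, the mechanism looks like this (a minimal sketch only, not the
actual MVAPICH code; the names, sizes, and thresholds are invented for
the example):

/*
 * Illustrative sketch only -- not MVAPICH's actual code.  Each remote
 * peer RDMA-writes small messages into its own preallocated slot on
 * the receiver; the receiver spins over the slots checking a flag byte.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPEERS     16     /* peers given a small-message RDMA channel */
#define SLOT_BYTES 1024   /* preallocated, registered buffer per peer */

typedef struct {
    volatile uint8_t data[SLOT_BYTES - 1];
    volatile uint8_t ready;   /* written last by the remote RDMA write */
} rdma_slot_t;

/* In real code these buffers would be pinned and their address/r-key
 * exchanged with each peer at startup. */
static rdma_slot_t slots[NPEERS];

/* Poll every peer's slot; return a rank with a pending message, or -1. */
static int poll_rdma_slots(void)
{
    for (int peer = 0; peer < NPEERS; peer++)
        if (slots[peer].ready)
            return peer;
    return -1;
}

/* Copy the message out and re-arm the slot for the next RDMA write. */
static void consume_slot(int peer, void *out, size_t len)
{
    memcpy(out, (const void *)slots[peer].data, len);
    slots[peer].ready = 0;
}

The flag sits at the end of the slot because the scheme relies on the
payload of an RDMA write becoming visible in increasing address order,
so the flag only shows up after the data is in place.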




2 - I do not see that there is any raw performance benefit in insisting
on doing rdma for small messages anyway, so it does not seem to be a
tradeoff between scalability and optimal latency. In fact, if I force
ompi or mvapich to go rdma for smaller messages (at least as far as it
seems it will go) the latency for these sizes will actually go up, which
does not hurt my intuition. In mvapich I saw an incompressible 13 us
penalty for doing RDMA.


What you are seeing is a general RDMA protocol, which requires that the
initiator obtain the target's memory address and r-key prior to the RDMA
operation; additionally, the initiator must inform the target of
completion of the RDMA operation. This requires the overhead of control
messages, using either send/receive or small-message RDMA.
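
For concreteness, here is a minimal sketch of that control exchange in
terms of the libibverbs API. This is illustrative only -- it is not
Open MPI's or mvapich's protocol code, the struct and function names
are made up, and setup/completion handling is omitted; only the verbs
call itself is real:

/*
 * Step 1: the target tells the initiator where to write and which
 * r-key to use.  This travels as an ordinary control message.
 */
#include <stdint.h>
#include <infiniband/verbs.h>

struct rts_msg {
    uint64_t remote_addr;
    uint32_t rkey;
};

/* Step 2: with that information in hand, the initiator can post the
 * actual RDMA write. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           const struct rts_msg *rts)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;

    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = rts->remote_addr;
    wr.wr.rdma.rkey        = rts->rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Step 3: a completion/FIN control message still has to go back to the
 * target, since an RDMA write completes invisibly to the target side. */

The extra latency being discussed is exactly the rendezvous round trip
of step 1 plus the completion notification of step 3, neither of which
is needed when a small message lands directly in a buffer the peer set
up in advance.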




So far, the best latency I got from ompi is 5.24 us, and the best I  
got from mvapich is 3.15.

I am perfectly ready to accept that ompi scales better and that may be
more important (except to the marketing dept :-) ), but I do not
understand your explanation based on small-message RDMA. Either I
misunderstood something badly (my best guess), or the 2 us are lost to
something else than an RDMA-size tradeoff.

Again, this is small-message RDMA with polling versus send/receive
semantics. We will be adding small-message RDMA and should then have
performance equal to that of mvapich for small messages, but it is only
relevant for a small working set of peers / micro-benchmarks.
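
To put a rough number on that caveat (purely illustrative figures, not
our actual defaults): if small-message RDMA needs, say, 32 preallocated
8 KB slots per peer, that is 256 KB of pinned, polled memory per peer --
negligible for a handful of neighbours, but roughly 256 MB per process
on a 1024-process job, on top of the cost of polling every slot on
every receive attempt.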


Thanks,

Galen








Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly

> 
> > So far, the best latency I got from ompi is 5.24 us, and the best I  
> > got from mvapich is 3.15.
> > I am perfectly ready to accept that ompi scales better and that may be
> > more important (except to the marketing dept :-) ), but I do not
> > understand your explanation based on small-message RDMA. Either I
> > misunderstood something badly (my best guess), or the 2 us are lost to
> > something else than an RDMA-size tradeoff.
> >
> Again this is small message RDMA with polling versus send/receive  
> semantics, we will be adding small message RDMA and should have  
> performance equal to that of mvapich for small messages, but it is only  
> relevant for a small working set of peers / micro benchmarks.

Thanks a lot. I was being fooled by various levels of size thresholds in
the mvapich code. It was indeed doing rdma for small messages. After
turning that off, I get numbers comparable to yours. Well, mvapich still
beats ompi by a hair on my configuration, 5.11 vs. 5.25, but that's in
the near-irrelevant range compared to other benefits.

From an adoption perspective, though, the ability to shine in
micro-benchmarks is important, even if it means using an ad-hoc tuning.
There is some justification for it after all. There are small clusters
out there (many more than big ones, in fact) so taking maximum advantage
of a small scale is relevant.

When do you plan on having the small-msg rdma option available?

J-C


-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Ron Brightwell
> [...]
> 
> From an adoption perspective, though, the ability to shine in
> micro-benchmarks is important, even if it means using an ad-hoc tuning.
> There is some justification for it after all. There are small clusters
> out there (many more than big ones, in fact) so taking maximum advantage
> of a small scale is relevant.

I'm obliged to point out that you jumped to a conclusion -- possibly true
in some cases, but not always.

You assumed that a performance increase for a two-node micro-benchmark
would result in an application performance increase for a small cluster.
Using RDMA for short messages is the default on small clusters *because*
of the two-node micro-benchmark, not because the cluster is small.

I've seen plenty of cases where doing the scalable thing, rather than the
thing optimized for micro-benchmarks, leads to increases in application
performance even at a small scale.

-Ron




Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Galen Shipman


When do you plan on having the small-msg rdma option available?

I would expect this in the very near future; we will be discussing
schedules next week.


Thanks,
Galen



J-C


--
Jean-Christophe Hugly 
PANTA





Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly
On Thu, 2006-02-09 at 14:05 -0700, Ron Brightwell wrote:
> > [...]
> > 
> > From an adoption perspective, though, the ability to shine in
> > micro-benchmarks is important, even if it means using an ad-hoc tuning.
> > There is some justification for it after all. There are small clusters
> > out there (many more than big ones, in fact) so taking maximum advantage
> > of a small scale is relevant.
> 
> I'm obliged to point out that you jumped to a conclusion -- possibly true
> in some cases, but not always.
> 
> You assumed that a performance increase for a two-node micro-benchmark
> would result in an application performance increase for a small cluster.
> Using RDMA for short messages is the default on small clusters *because*
> of the two-node micro-benchmark, not because the cluster is small.

No, I assumed it based on comparisons between doing and not doing small
msg rdma at various scales, from a paper Galen pointed out to me.
http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf

Benchmarks are what they are. In the above paper, the tests place the
cross-over at around 64 nodes, and that confirms a number of anecdotal
reports I got. It may well be that in some situations, small-msg rdma is
better only for 2 nodes, but that's not such a likely scenario; reality
is sometimes linear (at least at our scale :-) ) after all.

The scale threshold could be tunable, couldn't it?

-- 
Jean-Christophe Hugly 
PANTA

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Galen Shipman


On Feb 9, 2006, at 3:03 PM, Jean-Christophe Hugly wrote:


On Thu, 2006-02-09 at 14:05 -0700, Ron Brightwell wrote:

[...]

From an adoption perspective, though, the ability to shine in
micro-benchmarks is important, even if it means using an ad-hoc tuning.
There is some justification for it after all. There are small clusters
out there (many more than big ones, in fact) so taking maximum advantage
of a small scale is relevant.


I'm obliged to point out that you jumped to a conclusion -- possibly true
in some cases, but not always.

You assumed that a performance increase for a two-node micro-benchmark
would result in an application performance increase for a small cluster.
Using RDMA for short messages is the default on small clusters *because*
of the two-node micro-benchmark, not because the cluster is small.


No, I assumed it based on comparisons between doing and not doing small
msg rdma at various scales, from a paper Galen pointed out to me.
http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf

Hmm, this is not what I would conclude from my results. In fact, if you
look at the NPB results in my paper you will see that Open MPI
outperforms in the CG and FT benchmarks at both 32 and 64 nodes without
SRQ. The crossover point you are referring to must be the pairwise
ping-pong benchmark. So I would have to conclude that it is totally
application dependent.


- Galen




Benchmarks are what they are. In the above paper, the tests place the
cross-over at around 64 nodes and that confirms a number of anecdotal
reports I got. It may well be that in some situations, small-msg rdma is
better only for 2 nodes, but that's not such a likely scenario; reality
is sometimes linear (at least at our scale :-) ) after all.

The scale threshold could be tunable, couldn't it?

--
Jean-Christophe Hugly 
PANTA





[O-MPI users] Job fails with mpiP

2006-02-09 Thread Aniruddha Shet
Hi,

I am trying to profile an Open MPI job using the mpiP profiling library.
Running the job without the library completes successfully. When I link the
profiling library into the executable, the job fails to run. I am able to
build the job with mpiP, but the execution fails. Please see the attached
tar file for details.

Thanks,
Aniruddha

-
Aniruddha G. Shet   | Project webpage:
http://forge-fre.ornl.gov/molar/index.html
Graduate Research Associate | Project webpage: http://www.cs.unm.edu/~fastos
Dept. of Comp. Sci. & Engg  | Personal webpage:
http://www.cse.ohio-state.edu/~shet
The Ohio State University   | Office: DL 474
2015 Neil Avenue| Phone: +1 (614) 292 7036
Columbus OH 43210-1277  | Cell: +1 (614) 446 1630

-


ompi_output.tar.gz
Description: Binary data


Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Brightwell, Ronald
> 
> No, I assumed it based on comparisions between doing and not doing small
> msg rdma at various scales, from a paper Galen pointed out to me.
> http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf
> 

Actually, I wasn't so much concerned with how you jumped to your conclusion.
I just wanted to point out that you did.  Most people who focus on ping-pong
latency like you have don't realize that they're jumping to a conclusion.
You suggested that optimizing for a latency micro-benchmark would benefit
small clusters, and that's just not (uniformly) true.

> Benchmarks are what they are. In the above paper, the tests place the
> cross-over at around 64 nodes and that confirms a number of anecdotal
> reports I got. It may well be that in some situations, small-msg rdma is
> better only for 2 nodes, but that's not such a likely scenario; reality
> is sometimes linear (at least at our scale :-) ) after all.

Indeed.

Well, if you didn't like me pointing out that jump, then I'll try a different
one.  It's fairly straightforward to correlate the latency performance of
the micro-benchmark directly to RDMA versus send/recv.  You can't really
do the same for the NPB results, since things like collective communication
performance can play a big part.  So, assuming that RDMA is the reason that
MVAPICH wins where it does may not hold.

I apologize if it seems like I'm picking on you.  I'm hypersensitive to
people trying to make judgements based on micro-benchmark performance.
I've been trying to make an argument that two-node ping-pong latency
comparisons really only have meaning in the context of a whole system.
The answer to the question of why the latency performance of my 10,000-node
machine is worse than someone else's 128-node cluster has a lot to do with
meeting the scaling requirements of a 10,000-node machine. (To some extent
it has to do with the vendor as well, but that's a different story...)

-Ron




[O-MPI users] Firewall ports and Mac OS X 10.4.4

2006-02-09 Thread James Conway
I couldn't find any information on firewall ports to open up for  
using OpenMPI. I have compiled and successfully run simple commands  
(eg mpirun with "uname -n") on the localhost, but including remote  
hosts caused a hang. Statements in the remote .cshrc to echo would be  
returned, but nothing would come back from the "uname" command - the  
process hung until I hit control-C. I looked in the firewall log
(ipfw.log) on the remotehost but found no messages. However, the  
localhost log showed that a return connection up in the 51000's was  
being blocked, and when I turned off the localhost's firewall, the  
mpirun command would complete correctly. (The remotehost firewall  
remained on).


However, I cannot find a range of ports to open. I am not really  
familiar with the ipfw syntax, and hope to rely on the very simple  
interface provided by Mac OS X 10.4.4 (i.e., define a range of ports,
TCP and/or UDP). Since this is clearly critical, I suspect that I
must have overlooked some information on the OpenMPI web-site - if  
so, please direct me to it. If I haven't, it might be worth a word or  
two in the FAQ.


Thanks for any help.

James Conway
--
James Conway, PhD.,
Department of Structural Biology
University of Pittsburgh School of Medicine
Biomedical Science Tower 3, Room 2047
3501 5th Ave
Pittsburgh, PA 15260
U.S.A.
Phone: +1-412-383-9847
Fax:   +1-412-648-8998
Email: jxc...@pitt.edu
Web:    (under construction)
--





Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly
On Thu, 2006-02-09 at 16:37 -0700, Brightwell, Ronald wrote:

> I apologize if it seems like I'm picking on you.
No offense taken.

>   I'm hypersensitive to
> people trying to make judgements based on micro-benchmark performance.
> I've been trying to make an argument that two-node ping-pong latency
> comparisons really only have meaning in the context of a whole system.

It's very clear to me that micro-benchmarks do not tell you very much
about real application behaviour; that's not the question. They are
nevertheless relevant to me because, right or wrong, people who buy
stuff look at them. And I work for a commercial outfit.

I may sound silly saying that, but they might be right to look at it,
they just need to look at the rest too. A micro-benchmark tells you how
much you have of a given currency that you can trade for another. It
tells you something about the implementation; how efficient the code is,
how well the hardware is utilized, etc. Not in every respect, but some.

It also tells you how far you can emphasize a given feature at the
expense of all others, if it happens that at some point in time it is
what you most need.

By making the argument that a particular characteristic is irrelevant,
you are essentially making a hard coded tradeoff, rather than letting
the users do it.

Back to the specific issue of latency vs. scale. Okay, for CG and FT the
cross-over may be <32, but that's not true for all the cases, and the
difference visible at 32 is pretty small. So, it is application
dependent, no question about it, but small-msg rdma is beneficial below
a given (application-dependent) cluster size.

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] Firewall ports and Mac OS X 10.4.4

2006-02-09 Thread Brian Barrett

On Feb 9, 2006, at 6:50 PM, James Conway wrote:


I couldn't find any information on firewall ports to open up for
using OpenMPI. I have compiled and successfully run simple commands
(eg mpirun with "uname -n") on the localhost, but including remote
hosts caused a hang. Statements in the remote .cshrc to echo would be
returned, but nothing would come back from the "uname" command - the
process hung until I control-c. I looked in the firewall log
(ipfw.log) on the remotehost but found no messages. However, the
localhost log showed that a return connection up in the 51000's was
being blocked, and when I turned off the localhost's firewall, the
mpirun command would complete correctly. (The remotehost firewall
remained on).

However, I cannot find a range of ports to open. I am not really
familiar with the ipfw syntax, and hope to rely on the very simple
interface provided by Mac OSX 10.4.4 (ie, define a range of ports,
TCP and/or UDP). Since this is clearly critical, I suspect that I
must have overlooked some information on the OpenMPI web-site - if
so, please direct me to it. If I haven't, it might be worth a word or
two in the FAQ.


Open MPI uses random port numbers for all its communication.  We've
currently been focusing on the tightly integrated cluster
environment, which generally does not have port-blocking issues.  It
would probably not be difficult to implement a port-range scheme, but
that is not an issue scheduled to be addressed in the
short term.  For now, your best option is to open the firewall on
your machine to the other machines you wish to use with Open MPI.  A
quick search on Google for "OS X ipfw" should turn up a couple of
references on configuring the OS X firewall to do this
(unfortunately, you cannot configure the firewall this way using the
System Preferences GUI).


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Sayantan Sur
Galen Shipman wrote:

>Hi Jean,
>
>You probably are not seeing overhead costs so much as you are seeing
>the difference between using send/recv for small messages, which Open
>MPI uses, and RDMA for small messages. If you are comparing against
>another implementation that uses RDMA for small messages then yes, you
>will see lower latencies, but there are issues with using small message
>RDMA. I have written a paper that addresses these issues which will be
>presented at IPDPS.

I've been working for the MVAPICH project for around three years. Since
this thread is discussing MVAPICH, I thought I should post to this
thread. Galen's description of MVAPICH is not accurate. MVAPICH uses
RDMA for short messages to deliver performance benefits to the
applications. However, it needs to be designed properly to handle
scalability while delivering the best performance. Since MVAPICH-0.9.6
(released on 6th December, 2005), MVAPICH has supported a new mode
of operation called ADAPTIVE_RDMA_FAST_PATH (the basic
RDMA_FAST_PATH is also supported).

This new design uses RDMA for short message transfer in an intelligent
and adaptive manner.  Using this mode, the memory allocation of MVAPICH
is no longer static; instead it is dynamic. It is an implementation of
short-message RDMA for a limited set of peers (user controllable),
which is what Galen is suggesting. MVAPICH already supports this
feature. This also means that in the paper Galen mentions, the
comparison results in Figures 4 through 7 have to be re-evaluated to
make the paper and the results accurate.
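
In rough terms the adaptive scheme amounts to something like the
following. This is an illustrative sketch only, not the MVAPICH
implementation; the names, cap, and threshold are invented:

/*
 * Illustrative sketch -- not the MVAPICH code.  A peer is promoted to
 * the small-message RDMA fast path only after it has exchanged enough
 * messages, and the total number of RDMA channels is capped.
 */
#include <stdbool.h>

#define MAX_RANKS         4096
#define RDMA_CHANNEL_CAP  32   /* user-controllable working-set size  */
#define PROMOTE_THRESHOLD 16   /* messages before buffers are set up  */

static unsigned msg_count[MAX_RANKS];
static bool     has_rdma_channel[MAX_RANKS];
static int      channels_in_use;

/* Per message: use the RDMA fast path for this peer, or send/recv? */
static bool use_rdma_fast_path(int peer)
{
    if (has_rdma_channel[peer])
        return true;

    if (++msg_count[peer] >= PROMOTE_THRESHOLD &&
        channels_in_use < RDMA_CHANNEL_CAP) {
        /* Here a real implementation would allocate and register the
         * per-peer buffers and exchange their address/r-key. */
        has_rdma_channel[peer] = true;
        channels_in_use++;
        return true;
    }
    return false;   /* stay on send/receive for this peer */
}

The point of the cap is that pinned memory and polling work stay bounded
no matter how many ranks the job has, while frequent communication
partners still get the low-latency path.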

Hope this helps.

Thanks,
Sayantan.


-- 
http://www.cse.ohio-state.edu/~surs