Re: [O-MPI users] mpirun --prefix

2006-01-04 Thread Jeff Squyres

On Jan 4, 2006, at 7:24 PM, Anthony Chan wrote:


How about this -- an ISV asked me for a similar feature a little
while ago: if mpirun is invoked with an absolute pathname, then use
that base directory (minus the difference from $bindir) as an option
to an implicit --prefix.

(your suggestion may actually be parsed as exactly that, but I wasn't
entirely sure)


Yes, that is what I meant. The change should make things easier for
typical MPI users.


Ok, I've added it to the to-do list for the v1.1 series (we're really  
only doing bug fixes to the v1.0 series).


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] mpirun --prefix

2006-01-04 Thread Anthony Chan

Hi Jeff,

On Wed, 4 Jan 2006, Jeff Squyres wrote:

> Anthony --
>
> I'm really sorry; we just noticed this message today -- it got lost
> in the post-SC recovery/holiday craziness.  :-(

I understand. :)
>
> Your request is fairly reasonable, but I wouldn't want to make it the
> default behavior.  Specifically, I can envision some scenarios where
> it might be problematic (e.g., heterogeneous situations -- which we
> don't yet support, but definitely will someday).
>
> How about this -- an ISV asked me for a similar feature a little
> while ago: if mpirun is invoked with an absolute pathname, then use
> that base directory (minus the difference from $bindir) as an option
> to an implicit --prefix.
>
> (your suggestion may actually be parsed as exactly that, but I wasn't
> entirely sure)

Yes, that is what I meant. The change should make things easier for
typical MPI users.

Thanks,
A.Chan
>
>
> On Nov 22, 2005, at 12:20 PM, Anthony Chan wrote:
>
> >
> > This is not a bug, just wondering whether this can be improved.  I have been
> > running an openmpi-linked program with the command
> >
> > /bin/mpirun --prefix  \
> >  --host A  -np N a.out
> >
> > My understanding is that --prefix adds an extra search path in addition to
> > PATH and LD_LIBRARY_PATH; correct me if I am wrong.  Assuming that
> > openmpi's install directory structure is fixed, would it be possible for
> > mpirun to search  automatically for libmpi.so &
> > friends so as to avoid the redundant --prefix  to
> > mpirun ?
> >
> > Thanks,
> > A.Chan
> >
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>
>
>


Re: [O-MPI users] LAM vs OPENMPI performance

2006-01-04 Thread Jeff Squyres

On Jan 4, 2006, at 5:05 PM, Tom Rosmond wrote:


Thanks for the quick reply.  I ran my tests with a hostfile with
cedar.reachone.com slots=4

I clearly misunderstood the role of the 'slots' parameter, because
when I removed it, OPENMPI slightly outperformed LAM, which I
assume it should.  Thanks for the help.


Not entirely your fault -- I just went back and re-read the FAQ  
entries and can easily see how the wording would lead you to that  
conclusion.  I have touched up the wording to make it more clear, and  
added an FAQ item about oversubscription:


http://www.open-mpi.org/faq/?category=running#oversubscribing

Here's the text (it looks a bit prettier on the web page):

--
Can I oversubscribe nodes (run more processes than processors)?


Yes.

However, it is critical that Open MPI knows that you are
oversubscribing the node, or severe performance degradation can result.


The short explanation is as follows: never specify a number of slots
that is more than the available number of processors. For example, if
you want to run 4 processes on a uniprocessor, then indicate that you
have only 1 slot but want to run 4 processes, like this:




shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out


Specifically: do NOT have a hostfile that contains "slots = 4"
(because there is only one available processor).


Here's the full explanation:

Open MPI basically runs its message passing progression engine in two  
modes: aggressive and degraded.




Degraded: When Open MPI thinks that it is in an oversubscribed mode
(i.e., more processes are running than there are processors
available), MPI processes will automatically run in degraded mode and
frequently yield the processor to their peers, thereby allowing all
processes to make progress.


Aggressive: When Open MPI thinks that it is in an exactly- or
under-subscribed mode (i.e., the number of running processes is equal
to or less than the number of available processors), MPI processes
will automatically run in aggressive mode, meaning that they will
never voluntarily give up the processor to other processes. With some
network transports, this means that Open MPI will spin in tight loops
attempting to make message passing progress, effectively causing
other processes to not get any CPU cycles (and therefore never make
any progress).

For example, on a uniprocessor node:



shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out


This would cause all 4 MPI processes to run in aggressive mode  
because Open MPI thinks that there are 4 available processors to use.  
This is actually a lie (there is only 1 processor -- not 4), and can  
cause extremely bad performance.


-



Hope that clears up the issue.  Sorry about that!
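
One more note, with the caveat that you should double-check the exact
parameter name against "ompi_info --param mpi all" on your installed
version: I believe the degraded (yielding) behavior can also be forced
explicitly with the mpi_yield_when_idle MCA parameter, regardless of
what the hostfile says, e.g.:

shell$ mpirun --mca mpi_yield_when_idle 1 -np 4 --hostfile my-hostfile a.out

That is only a workaround sketch, though -- the clean fix is still to
make the slot counts in the hostfile match the real number of processors.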


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] mpirun --prefix

2006-01-04 Thread Jeff Squyres

Anthony --

I'm really sorry; we just noticed this message today -- it got lost  
in the post-SC recovery/holiday craziness.  :-(


Your request is fairly reasonable, but I wouldn't want to make it the  
default behavior.  Specifically, I can envision some scenarios where  
it might be problematic (e.g., heterogeneous situations -- which we  
don't yet support, but definitely will someday).


How about this -- an ISV asked me for a similar feature a little  
while ago: if mpirun is invoked with an absolute pathname, then use  
that base directory (minus the difference from $bindir) as an option  
to an implicit --prefix.


(your suggestion may actually be parsed as exactly that, but I wasn't  
entirely sure)
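
To sketch what I mean (using /opt/openmpi purely as a hypothetical
install prefix -- substitute whatever your real prefix is):

shell$ /opt/openmpi/bin/mpirun --host A -np N a.out

would behave as if you had typed

shell$ mpirun --prefix /opt/openmpi --host A -np N a.out

i.e., mpirun would strip off the $bindir part ("/bin" in a default
layout) of the absolute path it was invoked with and use the remainder
as an implicit --prefix.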



On Nov 22, 2005, at 12:20 PM, Anthony Chan wrote:



This is not a bug, just wondering whether this can be improved.  I have been
running an openmpi-linked program with the command

/bin/mpirun --prefix  \
 --host A  -np N a.out

My understanding is that --prefix adds an extra search path in addition to
PATH and LD_LIBRARY_PATH; correct me if I am wrong.  Assuming that
openmpi's install directory structure is fixed, would it be possible for
mpirun to search  automatically for libmpi.so &
friends so as to avoid the redundant --prefix  to
mpirun ?

Thanks,
A.Chan




--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] LAM vs OPENMPI performance

2006-01-04 Thread Tom Rosmond

Thanks for the quick reply. I ran my tests with a hostfile with
cedar.reachone.com slots=4

I clearly misunderstood the role of the 'slots' parameter, because
when I removed it, OPENMPI slightly outperformed LAM, which I
assume it should. Thanks for the help.

Tom



Brian Barrett wrote:


On Jan 4, 2006, at 4:24 PM, Tom Rosmond wrote:

 


I have been using LAM-MPI for many years on PC/Linux systems and have been
quite pleased with its performance.  However, at the urging of the LAM-MPI
website, I have decided to switch to OPENMPI.  For much of my preliminary
testing I work on a single-processor workstation (see the attached
'config.log' and ompi_info.log files for some of the specifics of my
system).  I frequently run with more than one virtual MPI processor (i.e.,
oversubscribe the real processor) to test my code.  With LAM the runtime
penalty for this is usually insignificant for 2-4 virtual processors, but
with OPENMPI it has been prohibitive.  Below is a matrix of runtimes for a
simple MPI matrix transpose code using mpi_sendrecv (I tried other
variations of blocking/non-blocking, synchronous/non-synchronous send/recv
with similar results).


 message size = 262144 bytes

                  LAM            OPENMPI
   1 proc:   .02575 secs    .02513 secs
   2 proc:   .04603 secs    10.069 secs
   4 proc:   .04903 secs    35.422 secs

I am pretty sure that LAM exploits the fact that the virtual processors are
all sharing the same memory, so communication is via memory and/or the PCI
bus of the system, while my OPENMPI configuration doesn't exploit this.  Is
this a reasonable diagnosis of the dramatic difference in performance?  More
importantly, how do I reconfigure OPENMPI to match the LAM performance?
   



Based on the output of ompi_info, you should be using shared memory  
with Open MPI (as you are with LAM/MPI).  What RPI are you using with  
LAM/MPI (just so we have some idea what you are comparing to)?  And  
how are you running Open MPI (what command are you passing to mpirun,  
and if you include a hostfile, what is in that host file)?


If you tell Open MPI via a hostfile that a machine has 2 cpus when it  
only has 1 and try to run 2 processes on it, you will run into severe  
performance issues.  In that case, Open MPI will poll very quickly on  
the CPUs, not giving up the CPU when there is nothing to do.  If Open  
MPI is told that there is only 1 cpu and you run 2 procs of the same  
job on that node, then it will be much better about giving up the  
CPU.  That would be where I would start looking.


If you have some test code you could share, I'd love to see it - it  
would help in duplicating your results and finding a solution...


Brian


 



Re: [O-MPI users] LAM vs OPENMPI performance

2006-01-04 Thread Patrick Geoffray

Hi Tom,

users-requ...@open-mpi.org wrote:
I am pretty sure that LAM exploits the fact that the virtual processors are
all sharing the same memory, so communication is via memory and/or the PCI
bus of the system, while my OPENMPI configuration doesn't exploit this.  Is
this a reasonable diagnosis of the dramatic difference in performance?


It would be more likely that OpenMPI is using shared memory and polling 
on it whereas LAM is using sockets, or at least blocking on something.


Polling is a bad thing when oversubscribing a processor. When you block on
a socket (or any OS handle), the process immediately yields the CPU and is
removed from the scheduler's run queue. When you poll waiting for a send or
receive to complete, you are burning cycles on the CPU, and the scheduler
will wait for the next quantum of time before running another process.


So, if you send a message between 2 processes sharing the same processor,
the latency will be on the order of half of the scheduler quantum (10 ms on
Linux) if they are both polling. Things are much faster when processes are
polling on different CPUs (1-2 us), but the blocking socket overhead
(~20 us) is still far better than the scheduler quantum when you don't have
several processors.



More importantly, how do I reconfigure OPENMPI to match the LAM performance?


Try disabling the shared memory device in OpenMPI. Unfortunately, I have 
no clue how to do it.
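
A guess, in case it helps (treat the component name as an assumption and
verify it with "ompi_info --param btl all"): if the shared memory transport
is the "sm" BTL component, excluding it at run time should look something
like

shell$ mpirun --mca btl ^sm -np 2 a.out

or, equivalently, listing only the transports you do want:

shell$ mpirun --mca btl tcp,self -np 2 a.out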


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-04 Thread Jeff Squyres

On Jan 4, 2006, at 2:08 PM, Anthony Chan wrote:


Either my program quits without writing the logfile (and without
complaining) or it crashes in MPI_Finalize. I get the message
"33 additional processes aborted (not shown)".


This is not an MPE error message.  If the logging crashes in MPI_Finalize,
it usually means the merging of logging data from child nodes fails.
Since you didn't get any MPE error messages, it means the cause of
the crash isn't something MPE expected.  Does anyone know if "33 additional
processes aborted (not shown)" is from OpenMPI ?


Yes, it is.  It is from mpirun, telling you that 33 processes -- in
addition to the process whose error message it must have shown above --
aborted.  So I'm guessing that 34 total processes aborted.


Are you getting corefiles for these processes?  (might need to check  
the limit of your coredumpsize)
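
(Generic shell advice, nothing Open MPI-specific: under bash/sh that would
be

shell$ ulimit -c unlimited

before running mpirun; under csh/tcsh the equivalent is "limit coredumpsize
unlimited".)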


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-04 Thread Anthony Chan


On Wed, 4 Jan 2006, Carsten Kutzner wrote:

> On Tue, 3 Jan 2006, Anthony Chan wrote:
>
> > MPE/MPE2 logging (or clog/clog2) does not impose any limitation on the
> > number of processes.  Could you explain what difficulty or error
> > message you encountered when using >32 processes ?
>
> Either my program quits without writing the logfile (and without
> complaining) or it crashes in MPI_Finalize. I get the message
> "33 additional processes aborted (not shown)".

This is not an MPE error message.  If the logging crashes in MPI_Finalize,
it usually means the merging of logging data from child nodes fails.
Since you didn't get any MPE error messages, it means the cause of
the crash isn't something MPE expected.  Does anyone know if "33 additional
processes aborted (not shown)" is from OpenMPI ?

Since I don't know the real cause of the crash, this is what I would do:

1, Set MPE_TMPDIR or TMPDIR to a bigger local filesystem to make sure that
   disk space is not an issue here.

2, Run /share/examples_logging/cpilog with >32 processes
   to see if you get the same error message.  If the same error occurs,
   it could be that there is some other, more fundamental issue, e.g. a
   networking problem...
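
For example (the /scratch directory and the /opt/mpe2 install prefix below
are hypothetical placeholders -- substitute your own paths; "export"
assumes a Bourne-style shell):

shell$ export MPE_TMPDIR=/scratch/$USER
shell$ mpirun -np 34 --hostfile my-hostfile \
           /opt/mpe2/share/examples_logging/cpilog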

A.Chan

> Since this looks weird I think I will recompile with the newer MPE
> version you suggested. (When I do not link with MPE libraries my program
> runs fine.)
>
> Thanks,
>   Carsten
>
>
>


Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-04 Thread Carsten Kutzner
Hi Graham,

here are the all-to-all test results with the modification to the decision
routine you suggested yesterday. Now the routine behaves nicely for 128
and 256 float messages on 128 CPUs! For the other sizes one probably wants
to keep the original algorithm, since it is faster there. However, I have
the feeling that for messages >= 4096 floats the old problem still exists,
since the execution times are so variable there (note that the standard
deviation rises by more than a factor of 10 when going from 2048 to 4096
floats). If you need additional test results to tune the decision
functions, please let me know.
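
In case it is useful for your testing, something along the following lines
should let you force a particular alltoall algorithm by hand -- but please
treat the parameter names (coll_tuned_use_dynamic_rules,
coll_tuned_alltoall_algorithm) as assumptions on my part and check them
against "ompi_info --param coll tuned" for the tree you are running:

shell$ mpirun -np 128 --hostfile my-hostfile \
           --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_alltoall_algorithm 2 \
           ./my_alltoall_benchmark

(./my_alltoall_benchmark is just a placeholder for the test program.)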

Carsten


OMPI tuned all-to-all with modification:

        mesg size  time in seconds
 #CPUs     floats   average   std.dev.  min.      max.
   128          1   0.001253  0.74      0.001141  0.001470
   128          2   0.023507  0.000563  0.022562  0.024761
   128          4   0.023435  0.000426  0.022582  0.024166
   128          8   0.023438  0.000359  0.022904  0.024104
   128         16   0.023664  0.000438  0.022844  0.024670
   128         32   0.024136  0.000463  0.023297  0.025117
   128         64   0.024704  0.000535  0.023727  0.026030
   128        128   0.025750  0.000525  0.024592  0.026799 *
   128        256   0.028862  0.000683  0.027389  0.030168 *
   128        512   0.035869  0.001214  0.034067  0.038655
   128       1024   0.046528  0.001722  0.043549  0.050432
   128       2048   0.072388  0.007032  0.066708  0.104358
   128       4096   0.217678  0.097312  0.135113  0.409431
   128       8192   0.378586  0.090267  0.297878  0.51
   128      16384   0.567473  0.105083  0.483573  0.735509
   128      32768   1.151343  0.146547  0.937150  1.404478
   128      65536   2.298998  0.169669  1.983286  2.572027
   128     131072   4.070989  0.159958  3.691039  4.373587



> > OMPI tuned all-to-all:
> > ======================
> >        mesg size  time in seconds
> > #CPUs     floats   average   std.dev.  min.      max.
> >   128          1   0.001288  0.000102  0.001077  0.001512
> >   128          2   0.008391  0.000400  0.007861  0.009958
> >   128          4   0.008403  0.000237  0.008095  0.009018
> >   128          8   0.008228  0.000942  0.003801  0.008810
> >   128         16   0.008503  0.000191  0.008233  0.008839
> >   128         32   0.008656  0.000271  0.008084  0.009177
> >   128         64   0.009085  0.000209  0.008757  0.009603
> >   128        128   0.251414  0.073069  0.011547  0.506703 !
> >   128        256   0.385515  0.127661  0.251431  0.578955 !
> >   128        512   0.035111  0.000872  0.033358  0.036262
> >   128       1024   0.046028  0.002116  0.043381  0.052602
> >   128       2048   0.073392  0.007745  0.066432  0.104531
> >   128       4096   0.165052  0.072889  0.124589  0.404213
> >   128       8192   0.341377  0.041815  0.309457  0.530409
> >   128      16384   0.507200  0.050872  0.492307  0.750956
> >   128      32768   1.050291  0.132867  0.954496  1.344978
> >   128      65536   2.213977  0.154987  1.962907  2.492560
> >   128     131072   4.026107  0.147103  3.800191  4.336205
> >
> > alternative all-to-all:
> > =======================
> >   128          1   0.012584  0.000724  0.011073  0.015331
> >   128          2   0.012506  0.000444  0.011707  0.013461
> >   128          4   0.012412  0.000511  0.011157  0.013413
> >   128          8   0.012488  0.000455  0.011767  0.013746
> >   128         16   0.012664  0.000416  0.011745  0.013362
> >   128         32   0.012878  0.000410  0.012157  0.013609
> >   128         64   0.013138  0.000417  0.012452  0.013826
> >   128        128   0.014016  0.000505  0.013195  0.014942 +
> >   128        256   0.015843  0.000521  0.015107  0.016725 +
> >   128        512   0.052240  0.079323  0.027019  0.320653 !
> >   128       1024   0.123884  0.121560  0.038062  0.308929 !
> >   128       2048   0.176877  0.125229  0.074457  0.387276 !
> >   128       4096   0.305030  0.121716  0.176640  0.496375 !
> >   128       8192   0.546405  0.108007  0.415272  0.899858 !
> >   128      16384   0.604844  0.056576  0.558657  0.843943 !
> >   128      32768   1.235298  0.097969  1.094720  1.451241 !
> >   128      65536   2.926902  0.312733  2.458742  3.895563 !
> >   128     131072   6.208087  0.472115  5.354304  7.317153 !


---
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne