Re: [OMPI users] Receiving MPI messages of unknown size

2009-06-03 Thread Gus Correa

Hi Lars

I wonder if you could always use blocking message passing on the
preliminary send/receive pair that transmits the message size/header,
then use non-blocking mode for the actual message.
If the "message size/header" part transmits a small buffer,
the preliminary send/recv pair will use the "eager" communication mode,
return quickly, and may not reduce performance, I would guess.

For a group of several messages the preliminary
send/recv pair could transmit a small (to ensure "eager mode")
array of message sizes,
maybe along with the message tags and sender ranks,
instead of only one size.
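
In code, the receiving side might look something like this minimal
sketch (the tag values and buffer handling are hypothetical):

    /* Sketch only: blocking receive of a small size header (eager path,
     * returns quickly), then a non-blocking receive of the payload with
     * the exact size.  Tag values are hypothetical. */
    #include <mpi.h>
    #include <stdlib.h>

    #define HDR_TAG  100
    #define DATA_TAG 101

    void recv_variable_sized(int src, MPI_Comm comm)
    {
        int nbytes = 0;
        MPI_Request req;

        /* Preliminary blocking recv of the header (just the payload size). */
        MPI_Recv(&nbytes, 1, MPI_INT, src, HDR_TAG, comm, MPI_STATUS_IGNORE);

        char *buf = malloc(nbytes);
        /* Now the payload can be posted non-blocking with the right count. */
        MPI_Irecv(buf, nbytes, MPI_BYTE, src, DATA_TAG, comm, &req);

        /* ... overlap other work here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* ... deserialize buf ... */
        free(buf);
    }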

Just a thought.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Lars Andersson wrote:

Hi,

I'm trying to solve a problem of passing serializable, arbitrarily
sized objects around using MPI and non-blocking communication. The
problem I'm facing is what to do at the receiving end when expecting
an object of unknown size while not blocking to wait for it.

When using blocking message passing, I have simply solved the problem
by first sending a small, fixed-size header containing the size of the
rest of the data, which is sent in a following MPI message. When using
non-blocking message passing, this doesn't seem to be such a good
idea, since we can't post the main data transfer until we have received
the message header... It seems to take away most of the advantages of
non-blocking I/O in the first place.


I've been thinking about solving this using MPI_Probe / MPI_Iprobe,
but I'm worried about performance.


Question 1:

Will MPI_Probe or the underlying MPI implementation actually receive
the full message data (assuming reasonably sized message, like less
than 10MB) before MPI_Probe returns? Or will there be a significant
data transfer delay (for large messages) when calling MPI_Recv after a
successful MPI_Probe?



What I want is something like this:

 1) post one or several non-blocking, variable-sized message receives

 2) do other, non-MPI work, while any incoming messages will be fully
    received into buffers on the local machine.

 3) perform completion of the receives posted in 1). I don't want to
    unnecessarily wait here for data transfers that could have taken
    place during 2).


Problems:

I can't post non-blocking MPI_Irecv() calls in 1, because I don't know
the sizes of incoming messages.

If I simply do nothing in 1, and call MPI_Probe in 3, I'm worried that
I won't get nice compute/transfer overlap, because the messages won't
actually be received locally until I post a Probe or Recv in 3.


Question 2:

How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?


Question 3:

A temporary kludge solution to the problem above might be to allocate
a temporary receive buffer of some arbitrary, constant maximum size
BUFSIZE in 1 for each non-blocking receive operation, make sure
messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
actually correct and portable to receive less data than specified in
the count argument to MPI_Irecv.

What if the message sent on the other end is 10 bytes and
BUFSIZE = count = 20? Would that be OK?


If anyone can shed any light on this, I'd be grateful. FYI, we're
using a cluster of 2- to 8-core x86-64 machines running Linux and
connected using ordinary 1 Gbit Ethernet.


Best regards,

Lars Andersson




[OMPI users] Receiving MPI messages of unknown size

2009-06-03 Thread Lars Andersson
Hi,

I'm trying to solve a problem of passing serializable, arbitrarily
sized objects around using MPI and non-blocking communication. The
problem I'm facing is what to do at the receiving end when expecting
an object of unknown size while not blocking to wait for it.

When using blocking message passing, I have simply solved the problem
by first sending a small, fixed-size header containing the size of the
rest of the data, which is sent in a following MPI message. When using
non-blocking message passing, this doesn't seem to be such a good
idea, since we can't post the main data transfer until we have received
the message header... It seems to take away most of the advantages of
non-blocking I/O in the first place.


I've been thinking about solving this using MPI_Probe / MPI_Iprobe,
but I'm worried about performance.
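
For concreteness, the probe-based version I have in mind would look
roughly like this minimal sketch; whether the payload is already
buffered locally by the time the probe succeeds is exactly what I am
unsure about:

    /* Sketch only: poll with MPI_Iprobe, size the buffer from the
     * status, then receive the message that was probed. */
    #include <mpi.h>
    #include <stdlib.h>

    void poll_and_recv(MPI_Comm comm)
    {
        int flag = 0;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (flag) {
            int nbytes;
            MPI_Get_count(&status, MPI_BYTE, &nbytes);
            char *buf = malloc(nbytes);
            MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE,
                     status.MPI_TAG, comm, MPI_STATUS_IGNORE);
            /* ... deserialize buf ... */
            free(buf);
        }
    }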


Question 1:

Will MPI_Probe or the underlying MPI implementation actually receive
the full message data (assuming reasonably sized message, like less
than 10MB) before MPI_Probe returns? Or will there be a significant
data transfer delay (for large messages) when calling MPI_Recv after a
successful MPI_Probe?



What I want is something like this:

 1) post one or several non-blocking, variable-sized message receives

 2) do other, non-MPI work, while any incoming messages will be fully
    received into buffers on the local machine.

 3) perform completion of the receives posted in 1). I don't want to
    unnecessarily wait here for data transfers that could have taken
    place during 2).


Problems:

I can't post non-blocking MPI_Irecv() calls in 1, because I don't know
the sizes of incoming messages.

If I simply do nothing in 1, and call MPI_Probe in 3, I'm worried that
I won't get nice compute/transfer overlap, because the messages won't
actually be received locally until I post a Probe or Recv in 3.


Question 2:

How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?


Question 3:

A temporary kludge solution to the problem above might be to allocate
a temporary receive buffer of some arbitrary, constant maximum size
BUFSIZE in 1 for each non-blocking receive operation, make sure
messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
actually correct and portable to receive less data than specified in
the count argument to MPI_Irecv.

What if the message sent on the other end is 10 bytes and
BUFSIZE = count = 20? Would that be OK?
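
For concreteness, the kludge would look roughly like this minimal
sketch, with BUFSIZE as a hypothetical upper bound agreed on by both
sides, and MPI_Get_count used to find out how much actually arrived:

    /* Sketch only: post a receive with an upper-bound count; after
     * completion, MPI_Get_count on the status should report the actual
     * number of bytes that arrived (e.g. 10, not 20). */
    #include <mpi.h>

    #define BUFSIZE 20   /* hypothetical maximum message size */

    int recv_up_to_bufsize(int src, int tag, MPI_Comm comm, char *buf)
    {
        MPI_Request req;
        MPI_Status  status;
        int received = 0;

        MPI_Irecv(buf, BUFSIZE, MPI_BYTE, src, tag, comm, &req);
        /* ... compute while the (<= BUFSIZE) message arrives ... */
        MPI_Wait(&req, &status);

        MPI_Get_count(&status, MPI_BYTE, &received);
        return received;
    }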


If anyone can shed any light on this, I'd be grateful. FYI, we're
using a cluster of 2- to 8-core x86-64 machines running Linux and
connected using ordinary 1 Gbit Ethernet.


Best regards,

Lars Andersson


Re: [OMPI users] top question

2009-06-03 Thread George Bosilca
Simon, it is a lot more difficult than it appears. You're right,
select/poll can do it for any file descriptor, and shared
mutexes/conditions (despite the performance impact) can do it for shared
memory. However, in the case where you have to support both
simultaneously, what is the right approach, i.e. the one that doesn't
impact the current performance? We're open to smart solutions ...


  george.

On Jun 3, 2009, at 11:49 , Number Cruncher wrote:


Jeff Squyres wrote:
We get this question so much that I really need to add it to the  
FAQ.  :-\
Open MPI currently always spins for completion for exactly the  
reason that Scott cites: lower latency.
Arguably, when using TCP, we could probably get a bit better  
performance by blocking and allowing the kernel to make more  
progress than a single quick pass through the sockets progress  
engine, but that involves some other difficulties such as  
simultaneously allowing shared memory progress.  We have ideas how  
to make this work, but it has unfortunately remained at a lower  
priority: the performance difference isn't that great, and we've  
been focusing on the other, lower latency interconnects (shmem, MX,  
verbs, etc.).


Whilst I understand that you have other priorities, and I am grateful
for the leverage I get by using OpenMPI, I would like to offer an
alternative use case, which I believe may become more common.


We're developing parallel software which is designed to be used  
*interactively* as well as in batch mode. We want the same SIMD code  
running on a user's quad-core workstation as on a 1,000-node cluster.


For the former case (single workstation), it would be *much* more
user-friendly and interactive for the back-end MPI code not to be
spinning at 100% when it's just waiting for the next front-end
command. The GUI thread doesn't get a look-in.


I can't imagine the difficulties involved, but if the POSIX calls  
select() and pthread_cond_wait() can do it for TCP and shared-memory  
threads respectively, it can't be impossible!


Just my .2c,
Simon




Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread DEVEL Michel
Dear Reiner, Jeff, Gus and list,

Thanks for your suggestions, I will test them tomorrow.

I did not check your mails before because I was busy trying the
gcc/gfortran way.
I have other problems:
- for static linking I am missing plenty of ibv_* routines. I saw on the
net that they should be in a libibverbs library, but I cannot find it.
- dynamic linking is OK, but when I run a simple test program on my
machine (i7 920) with an mpd-hosts file containing a single line with the
name of the machine and slots=4, the program only executes if I
give my password, although I do have a .rhosts file with the name of my
machine in my home directory.
-- 

Sincerely yours,

Michel DEVEL



Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread Gus Correa

Hi Michel, Jeff, Rainer, list

I have AMD Opteron Shanghai, and Intel 10.1017.
I had trouble with the Intel -fast flag also.

According to the ifort man page/help:
-fast  = -xT -O3 -ipo -no-prec-div -static
(Each compiler vendor has a different -fast, PGI is another thing.)

Intel doesn't allow SSE-type optimization beyond W (SSE+SSE2)
for AMD processors (an old and contentious issue,
Google it for more info).
So, I changed -xT to -xW (the highest level allowed,
also recommended by AMD).

I had trouble with ipo before (missing symbols during link),
so I reduced it to ip.

Moreover, -static definitely cannot work with the InfiniBand
and tons of other shared libraries, of course,
hence I simply removed it.
However, as suggested by Rainer,
-static-intel may be OK,
if all you want is to avoid sending the Intel LD_LIBRARY_PATH with your
mpiexec command.
(I haven't tried it, though.)

The flags became: -xW -O3 -ip -no-prec-div

I used the same flags for ifort (FFLAGS, FCFLAGS), icc (CFLAGS)
and icpc (CXXFLAGS), to build OpenMPI 1.3.2, and it works.
For "Genuine Intel" processors you can upgrade -xW to whatever is
appropriate.

My $0.02.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jeff Squyres wrote:
Rainer and I are still iterating on the trunk solution (we moved to an 
hg branch just for convenience for the moment).


Note that the Fortran flags aren't too important to OMPI.  We *only* use 
them in configure.  OMPI doesn't contain any Fortran 77 code at all, and 
the F90 module is extremely minimalistic (generally one-line subroutines 
to call the C counterpart).  So a workaround for the moment -- until we 
can figure out the problem -- might be to remove the -fast from the 
FFLAGS and FCFLAGS.


On Jun 3, 2009, at 11:34 AM, Rainer Keller wrote:


Dear Michel,
per the naming convention test in configure:
   ifort -fast
will turn on -xHOST -O3 -ipo -no-prec-div -static,
of which -ipo turns on interprocedural optimizations for multiple files.
Here the compiled object file does not contain the symbols searched
for in the configure tests.

Looking into the simple test case in configure, and the options that
one has to use to figure out the naming convention via compilation (-c),
I don't see any option other than disabling -fast & -ipo for Intel
Fortran compilers.

Please check trunk in commit r21363.


On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote:
> In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from
> '-O3 -C'). There is still an error (many opal_*_* subroutines not found
> during the ipo step) at the same place, coming from the fact that
> "ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> although I put --enable-static in the configure step...

> Any idea of how to make the static libraries ?

In order to statically link at least the intel-libraries, please add
  -static-intel   (in previous intel compilers called -i-static)
to LDFLAGS

With best regards,
Rainer
--

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink










Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread Ralph Castain
The changes Jeff mentioned are not in the 1.3 branch - not sure if they will
come over there or not.

I'm a little concerned in this thread that someone is reporting the process
affinity binding changing - that shouldn't be happening, and my guess is
that something outside of our control may be changing it.

One other thing to consider has been an issue around here, and will be an
even bigger issue with the change to bind at app start: if your app is
threaded, we will bind *all* threads to the same processor, thus potentially
hampering performance. We have found that multi-threaded apps often provide
better performance if users do *not* set processor affinity via MPI, but
instead embed binding calls inside the individual threads so they can be
placed on separate processors.
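
On Linux, for example, each thread can bind itself with sched_setaffinity;
a minimal sketch (the core-numbering policy is the application's choice):

    /* Sketch only: each worker thread binds itself to a core chosen by
     * the application, instead of relying on a process-wide MPI binding.
     * Linux-specific; the core-numbering policy is an assumption. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void bind_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        /* With pid 0, sched_setaffinity applies to the calling thread. */
        sched_setaffinity(0, sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = *(int *)arg;   /* e.g. local rank * threads_per_rank + tid */
        bind_self_to_core(core);
        /* ... this thread's share of the computation ... */
        return NULL;
    }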

All depends on the exact nature of the application, of course!

HTH
Ralph


On Wed, Jun 3, 2009 at 10:02 AM, Jeff Squyres  wrote:

> On Jun 3, 2009, at 11:40 AM, Ashley Pittman wrote:
>
>  Wasn't there a discussion about this recently on the list, OMPI binds
>> during MPI_Init() so it's possible for memory to be allocated on the
>> wrong quad, the discussion was about moving the binding to the orte
>> process as I recall?
>>
>>
> Yes.  It's been fixed in OMPI devel trunk.  I'm not sure it made it to the
> v1.3 branch, but it's definitely not in a released version yet.
>
> I *thought* that HPL did all allocation after MPI_INIT.  But I could be
> wrong.  If so, then using numactl to bind before the MPI app starts will
> likely give better results -- you're right (until we get our fixes in such
> that we bind pre-main).
>
> Regardless, if something is *changing* the affinity after MPI_INIT, then
> there's little OMPI can do about that.
>
>> From my testing of process affinity you tend to get much more consistent
>> results with it on and much more unpredictable results with it off. I'd
>> question whether it's working properly if you are seeing an 88-93% range in
>> the results.
>>
>> Ashley Pittman.
>>
>>
>>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>


Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread Jeff Squyres
Rainer and I are still iterating on the trunk solution (we moved to an  
hg branch just for convenience for the moment).


Note that the Fortran flags aren't too important to OMPI.  We *only*  
use them in configure.  OMPI doesn't contain any Fortran 77 code at  
all, and the F90 module is extremely minimalistic (generally one-line  
subroutines to call the C counterpart).  So a workaround for the  
moment -- until we can figure out the problem -- might be to remove  
the -fast from the FFLAGS and FCFLAGS.


On Jun 3, 2009, at 11:34 AM, Rainer Keller wrote:


Dear Michel,
per the naming convention test in configure:
   ifort -fast
will turn on -xHOST -O3 -ipo -no-prec-div -static,
of which -ipo turns on interprocedural optimizations for multiple  
files.
Here the compiled object file does not contain the symbols searched
for in the configure tests.

Looking into the simple test case in configure, and the options that
one has to use to figure out the naming convention via compilation (-c),
I don't see any option other than disabling -fast & -ipo for Intel
Fortran compilers.

Please check trunk in commit r21363.


On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote:
> In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from
> '-O3 -C'). There is still an error (many opal_*_* subroutines not found
> during the ipo step) at the same place, coming from the fact that
> "ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> although I put --enable-static in the configure step...

> Any idea of how to make the static libraries ?

In order to statically link at least the intel-libraries, please add
  -static-intel   (in previous intel compilers called -i-static)
to LDFLAGS

With best regards,
Rainer
--

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink






--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread Jeff Squyres

On Jun 3, 2009, at 11:40 AM, Ashley Pittman wrote:


Wasn't there a discussion about this recently on the list, OMPI binds
during MPI_Init() so it's possible for memory to be allocated on the
wrong quad, the discussion was about moving the binding to the orte
process as I recall?



Yes.  It's been fixed in OMPI devel trunk.  I'm not sure it made it to  
the v1.3 branch, but it's definitely not in a released version yet.


I *thought* that HPL did all allocation after MPI_INIT.  But I could  
be wrong.  If so, then using numactl to bind before the MPI app starts  
will likely give better results -- you're right (until we get our  
fixes in such that we bind pre-main).


Regardless, if something is *changing* the affinity after MPI_INIT,  
then there's little OMPI can do about that.


From my testing of process affinity you tend to get much more consistent
results with it on and much more unpredictable results with it off. I'd
question whether it's working properly if you are seeing an 88-93% range in
the results.

Ashley Pittman.





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] top question

2009-06-03 Thread Number Cruncher

Jeff Squyres wrote:

We get this question so much that I really need to add it to the FAQ.  :-\

Open MPI currently always spins for completion for exactly the reason 
that Scott cites: lower latency.


Arguably, when using TCP, we could probably get a bit better performance 
by blocking and allowing the kernel to make more progress than a single 
quick pass through the sockets progress engine, but that involves some 
other difficulties such as simultaneously allowing shared memory 
progress.  We have ideas how to make this work, but it has unfortunately 
remained at a lower priority: the performance difference isn't that 
great, and we've been focusing on the other, lower latency interconnects 
(shmem, MX, verbs, etc.).


Whilst I understand that you have other priorities, and I am grateful for
the leverage I get by using OpenMPI, I would like to offer an
alternative use case, which I believe may become more common.


We're developing parallel software which is designed to be used 
*interactively* as well as in batch mode. We want the same SIMD code 
running on a user's quad-core workstation as on a 1,000-node cluster.


For the former case (single workstation), it would be *much* more
user-friendly and interactive for the back-end MPI code not to be spinning
at 100% when it's just waiting for the next front-end command. The GUI
thread doesn't get a look-in.


I can't imagine the difficulties involved, but if the POSIX calls 
select() and pthread_cond_wait() can do it for TCP and shared-memory 
threads respectively, it can't be impossible!


Just my .2c,
Simon
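
In the meantime, the best workaround I can see at the application level
is to poll with MPI_Iprobe and sleep between polls while waiting for the
next front-end command, along these lines (minimal sketch; the 10 ms
interval and the rank-0 front end are just placeholders):

    /* Sketch only: avoid spinning while idle by polling with MPI_Iprobe
     * and sleeping between polls, at the cost of some added latency. */
    #include <mpi.h>
    #include <unistd.h>

    void wait_for_command(int *cmd, MPI_Comm comm)
    {
        int flag = 0;
        MPI_Status status;

        while (!flag) {
            MPI_Iprobe(0 /* assumed front-end rank */, MPI_ANY_TAG,
                       comm, &flag, &status);
            if (!flag)
                usleep(10000);   /* keep the core idle between polls */
        }
        MPI_Recv(cmd, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 comm, MPI_STATUS_IGNORE);
    }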


Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread Ashley Pittman
On Wed, 2009-06-03 at 11:27 -0400, Jeff Squyres wrote:
> On Jun 3, 2009, at 10:48 AM,  wrote:
> 
> > For HPL, try writing a bash script that pins processes to their  
> > local memory controllers using numactl before kicking off HPL.  This  
> > is particularly helpful when spawning more than 1 thread per  
> > process.  The last line of your script should look like "numactl -c
> > $cpu_bind -m $mem_bind $*".
> >
> > Believe it or not, I hit 94.5% HPL efficiency using this tactic on a  
> > 16 node cluster. Using processor affinity (various MPIs) my results  
> > were inconsistent and ranged between 88-93%
> >
> 
> If you're using multi-threaded HPL, that might be useful.  But if  
> you're not, I'd be surprised if you got any different results than  
> Open MPI binding itself.  If there really is a difference, we should  
> figure out why.  More specifically, calling numactl yourself should be  
> pretty much exactly what we do in OMPI (via API, not via calling  
> numactl).

Wasn't there a discussion about this recently on the list? OMPI binds
during MPI_Init(), so it's possible for memory to be allocated on the
wrong quad; the discussion was about moving the binding to the orte
process, as I recall.

From my testing of process affinity you tend to get much more consistent
results with it on and much more unpredictable results with it off. I'd
question whether it's working properly if you are seeing an 88-93% range in
the results.

Ashley Pittman.



Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread Rainer Keller
Dear Michel,
per the naming convention test in configure:
   ifort -fast 
will turn on -xHOST -O3 -ipo -no-prec-div -static,
of which -ipo turns on interprocedural optimizations for multiple files.
Here the compiled object file does not contain the symbols searched for in the 
configure-tests.

Looking into the simple test case in configure, and the options that one has to
use to figure out the naming convention via compilation (-c), I don't see any
option other than disabling -fast & -ipo for Intel Fortran compilers.

Please check trunk in commit r21363.


On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote:
> In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from
> '-O3 -C'). There is still an error (many opal_*_* subroutines not found
> during the ipo step) at the same place, coming from the fact that
> "ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> although I put --enable-static in the configure step...

> Any idea of how to make the static libraries ?

In order to statically link at least the intel-libraries, please add
  -static-intel   (in previous intel compilers called -i-static)
to LDFLAGS

With best regards,
Rainer
-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink




Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread JACOB_LIBERMAN
Hi Jeff,

Yes, this technique is particularly helpful for multi-threaded and works 
consistently across the various MPIs I test. 

Thanks, jacob

> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Wednesday, June 03, 2009 10:27 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Openmpi and processor affinity
> 
> On Jun 3, 2009, at 10:48 AM,  wrote:
> 
> > For HPL, try writing a bash script that pins processes to their
> > local memory controllers using numactl before kicking off HPL.  This
> > is particularly helpful when spawning more than 1 thread per
> > process.  The last line of your script should look like "numactl -c
> > $cpu_bind -m $mem_bind $*".
> >
> > Believe it or not, I hit 94.5% HPL efficiency using this tactic on a
> > 16 node cluster. Using processor affinity (various MPIs) my results
> > were inconsistent and ranged between 88-93%
> >
> 
> If you're using multi-threaded HPL, that might be useful.  But if
> you're not, I'd be surprised if you got any different results than
> Open MPI binding itself.  If there really is a difference, we should
> figure out why.  More specifically, calling numactl yourself should be
> pretty much exactly what we do in OMPI (via API, not via calling
> numactl).
> 
> --
> Jeff Squyres
> Cisco Systems
> 



Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread Jeff Squyres

On Jun 3, 2009, at 10:48 AM,  wrote:

For HPL, try writing a bash script that pins processes to their  
local memory controllers using numactl before kicking off HPL.  This  
is particularly helpful when spawning more than 1 thread per  
process.  The last line of your script should look like "numactl -c
$cpu_bind -m $mem_bind $*".


Believe it or not, I hit 94.5% HPL efficiency using this tactic on a  
16 node cluster. Using processor affinity (various MPIs) my results  
were inconsistent and ranged between 88-93%




If you're using multi-threaded HPL, that might be useful.  But if  
you're not, I'd be surprised if you got any different results than  
Open MPI binding itself.  If there really is a difference, we should  
figure out why.  More specifically, calling numactl yourself should be  
pretty much exactly what we do in OMPI (via API, not via calling  
numactl).


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Hypre

2009-06-03 Thread Jeff Squyres

I'm afraid I have no experience with Hypre -- sorry!  :-(

Do they have a support web site / mailing list somewhere?  You might  
have better luck contacting them about their software.



On Jun 3, 2009, at 11:05 AM, naveed wrote:


Hi,
I wanted to know if anyone has used the Hypre library for the solution of
Ax = b systems of equations.
I have problems reading in the matrix file. I went through the user manual,
but couldn't get much out of it. I wanted to know what would be the
best file format for reading large sparse matrices with Hypre.

Looking forward to any kind of help related to Hypre.
Best regards,
Ahnav.



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] top question

2009-06-03 Thread Eugene Loh

tsi...@coas.oregonstate.edu wrote:

Thanks for the explanation. I am using GigEth + Open MPI and the
buffered MPI_Bsend. I had already noticed that top behaved
differently on another cluster with InfiniBand + MPICH.


So the only option to find out how much time each process is waiting  
around seems to be to profile the code. Will gprof show me anything  
useful or will I have to use a more sophisticated (any free ones?)  
parallel profiler?


Another frequently asked question!  I can try to add a FAQ 
entry/category.  There are a number of free options including


TAU http://www.cs.uoregon.edu/research/tau/home.php
mpiP http://mpip.sourceforge.net/
FPMPI http://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html
IPM http://ipm-hpc.sourceforge.net/
Sun Studio http://developers.sun.com/sunstudio/

The only one I've really used is Sun Studio.

Jumpshot *might* work with Open MPI, I forget.  Or, it might be more of
an MPICH tool.
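
If a full profiler is overkill, you can also just time the blocking
calls yourself with MPI_Wtime, along these lines (minimal sketch):

    /* Sketch only: accumulate the time spent inside blocking MPI calls
     * with MPI_Wtime and print a per-rank total at the end of the run. */
    #include <mpi.h>
    #include <stdio.h>

    static double wait_time = 0.0;

    static void timed_barrier(MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(comm);
        wait_time += MPI_Wtime() - t0;
    }

    static void report_wait_time(MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        printf("rank %d spent %.3f s waiting\n", rank, wait_time);
    }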


[OMPI users] Hypre

2009-06-03 Thread naveed
Hi,
I wanted to know if anyone has used the Hypre library for the solution of Ax = b
systems of equations.
I have problems reading in the matrix file. I went through the user manual, but
couldn't get much out of it. I wanted to know what would be the best file
format for reading large sparse matrices with Hypre.
Looking forward to any kind of help related to Hypre.
Best regards,
Ahnav.


Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread JACOB_LIBERMAN
Hi Iftikhar,

For HPL, try writing a bash script that pins processes to their local memory
controllers using numactl before kicking off HPL.  This is particularly helpful
when spawning more than 1 thread per process.  The last line of your script
should look like "numactl -c $cpu_bind -m $mem_bind $*".

Believe it or not, I hit 94.5% HPL efficiency using this tactic on a 16-node
cluster. Using processor affinity (various MPIs) my results were inconsistent
and ranged between 88% and 93%.

Thanks, jacob

> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Iftikhar Rathore
> Sent: Tuesday, June 02, 2009 10:25 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Openmpi and processor affinity
> 
> Gus,
> Thanks for the reply, and it was a typo (I'm sick). I have updated to
> 1.3.2 since my last post and have tried checking CPU affinity by using
> f and j; it shows processes spread on all 8 cores in the beginning, but
> it does eventually show all processes running on core 0.
> 
> My P's and Q's are made for an 890 run; I have done this run with another
> MPI implementation before. I have made sure that I am using the right
> mpirun, but as Jeff pointed out I may have a mixed build, and I am
> investigating it more and will post my findings.
> 
> Regards
> 
> 
> On Tue, 2009-06-02 at 20:58 -0400, Gus Correa wrote:
> > Hi Iftikhar
> >
> > Iftikhar Rathore wrote:
> > > Hi
> > > We are using openmpi version 1.2.8 (packaged with ofed-1.4). I am
> trying
> > > to run hpl-2.0 (linpak). We have two intel quad core CPU's in all
> our
> > > server (8 total cores)  and all hosts in the hostfile have lines
> that
> > > look like "10.100.0.227 slots=8max_slots=8".
> >
> > Is this a typo on your email or on your hostfile?
> >
> >  > look like "10.100.0.227 slots=8max_slots=8".
> >
> > There should be blank space between the number of slots and max_slots:
> >
> > 10.100.0.227 slots=8 max_slots=8
> >
> > Another possibility is that you may be inadvertently using another
> > mpirun on the system.
> >
> > A third possibility:
> > Does your HPL.dat file require 896 processors?
> > The product P x Q on each (P,Q) pair should match 896.
> > If it is less, HPL will run on less processors, i.e., on P x Q only.
> > (If it is more, HPL will issue an error message and stop.)
> > Is this what is happening?
> >
> > A fourth one ...:
> > Are you sure processor affinity is not correct?
> > Do the processes drift across the cores?
> > Typing 1 on top is not enough to clarify this.
> > To see the process-to-core map on top,
> > type "f" (for fields),
> > then "j" (to display the CPU/core number),
> > and wait for several minutes to see if processor/core (header "P")
> > and the process ID (header "PID"),
> > drift or not.
> >
> > Even when I launch less processes than the available/requested cores
> > "--mca mpi_paffinity_alone 1" works right here,
> > as I just checked, with P=4 and Q=1 on HPL.dat
> > and with -np 8 on mpiexec.
> >
> > **
> >
> > I recently ran a bunch of HPL tests with --mca mpi_paffinity_alone 1
> > and OpenMPI 1.3.2, built from source, and the processor affinity
> seems
> > to work (i.e., the processes stick to the cores).
> > Building from source is quite simple, and would give you the latest
> > OpenMPI.
> >
> > I don't know if 1.2.8 (which you are using)
> > has a problem with mpi_paffinity_alone,
> > but the OpenMPI developers may clarify this.
> >
> >
> > I hope this helps,
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> >
> > >
> > > Now when I use mpirun (even with --mca mpi_paffinity_alone 1) it
> does
> > > not keep the affinity, the processes seem to gravitate towards
> first
> > > four cores (using top and hitting 1). I know I do have MCA
> paffinity
> > > available.
> > >
> > > [root@devi DLR_WB_88]# ompi_info | grep paffinity
> > > [devi.cisco.com:26178] mca: base: component_find: unable to open
> btl openib: file not found (ignored)
> > >MCA paffinity: linux (MCA v1.0, API v1.0, Component
> v1.2.8)
> > >
> > > The command line I am using is:
> > >
> > > # mpirun -nolocal -np 896 -v  --mca mpi_paffinity_alone 1 -hostfile
> /mnt/apps/hosts/896_8slots /mnt/apps/bin/xhpl
> > >
> > > Am I doing something wrong and is there a way to confirm cpu
> affinity besides hitting "1" on top.
> > >
> > >
> > > [root@devi DLR_WB_88]# mpirun -V
> > > mpirun (Open MPI) 1.2.8
> > >
> > > Report bugs to http://www.open-mpi.org/community/help/
> > >
> >
> --
> Iftikhar Rathore
> Technical Marketing Engineer
> Server Access Virtualization BU.
> Cisco Systems, Inc.
> 
> Phone:  +1 408 853 5322

Re: [OMPI users] top question

2009-06-03 Thread tsilva


Thanks for the explanation. I am using GigEth + Open MPI and the
buffered MPI_Bsend. I had already noticed that top behaved differently
on another cluster with InfiniBand + MPICH.


So the only option to find out how much time each process is waiting  
around seems to be to profile the code. Will gprof show me anything  
useful or will I have to use a more sophisticated (any free ones?)  
parallel profiler?


Cheers,
Tiago





Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread DEVEL Michel
Ralph Castain a écrit :
> I assume you re-did the ./configure command?
Thanks for your answer.
Yes.
> Did you also remember to "make clean" before doing your "make all
> install"?
No, but now that I have done it, the result is the same: "ld: attempted
static link of dynamic object `../../../opal/.libs/libopen-pal.so'
>
> Also, I note that your prefix looks really strange - it looks like you
> are trying to install OMPI where the Intel compiler is located? Are
> you sure you want to do that?
Well yes, but maybe it is a silly thing. I wanted to do that out of
laziness, to avoid having to make a script to add the directories to
$PATH, $LD_LIBRARY_PATH, and so on. Furthermore, I would like to keep a
version compiled with gcc and gfortran in /usr/local.
-- 

Sincerely yours,

Michel DEVEL



Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread Ralph Castain
I assume you re-did the ./configure command? Did you also remember to "make
clean" before doing your "make all install"?

Also, I note that your prefix looks really strange - it looks like you are
trying to install OMPI where the Intel compiler is located? Are you sure you
want to do that?


On Wed, Jun 3, 2009 at 7:29 AM, DEVEL Michel  wrote:

>  Hi again,
>
> In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from '-O3
> -C'). There is still an error (many opal_*_* subroutines not found during
> the ipo step) at the same place, coming from the fact that
> "ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> although I put --enable-static in the configure step...
>
> Any idea of how to make the static libraries ?
> --
>
> Sincerely yours,
>
> Michel DEVEL
>
>
>


Re: [OMPI users] top question

2009-06-03 Thread Jeff Squyres
We get this question so much that I really need to add it to the  
FAQ.  :-\


Open MPI currently always spins for completion for exactly the reason  
that Scott cites: lower latency.


Arguably, when using TCP, we could probably get a bit better  
performance by blocking and allowing the kernel to make more progress  
than a single quick pass through the sockets progress engine, but that  
involves some other difficulties such as simultaneously allowing  
shared memory progress.  We have ideas how to make this work, but it  
has unfortunately remained at a lower priority: the performance  
difference isn't that great, and we've been focusing on the other,  
lower latency interconnects (shmem, MX, verbs, etc.).




On Jun 3, 2009, at 8:37 AM, Scott Atchley wrote:


On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:

> Top always shows all the parallel processes at 100% in the %CPU
> field, although some of the time these must be waiting for a
> communication to complete. How can I see actual processing as
> opposed to waiting at a barrier?
>
> Thanks,
> Tiago

Using what interconnect?

For performance reasons (lower latency), the app and/or OMPI may be
polling on the completion. Are you using blocking or non-blocking
communication?

Scott




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread DEVEL Michel
Hi again,

In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from
'-O3 -C'). There is still an error (many opal_*_* subroutines not found
during the ipo step) at the same place, coming from the fact that
"ld: attempted static link of dynamic object
`../../../opal/.libs/libopen-pal.so'
although I put --enable-static in the configure step...

Any idea of how to make the static libraries ?
-- 

Sincerely yours,

Michel DEVEL



[OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper

2009-06-03 Thread DEVEL Michel
Dear openMPI users and developers,

I have just tried installing Open MPI by compiling it rather than just
using an RPM, because I want to use it with the ifort compiler.
I have noticed a problem in the configure script (present at least in
versions 1.3.1 and 1.3.2) in the determination of the Fortran naming
convention:
I tried to use
./configure --prefix=/opt/intel/Compiler/11.0/074/ --with-sge
--enable-static CC='icc' CFLAGS=' -fast -C' LDFLAGS='-fast -C' AR='ar'
F77='ifort' FC='ifort' FFLAGS=' -fast -C' FCFLAGS=' -fast -C' CXX='icpc'
but the test to determine Fortran naming convention (single underscore
in ifort case) fails because of the -fast flag.
If I do "ifort -c -C -fast conftest.f" then "nm -B conftest.o" gives
" w __ildata_included "
whereas it correctly gives " T foo_bar_ " if I use
"ifort -c -C -O3 conftest.f"

I inserted "ompi_cv_f77_external_symbol="single underscore" at line
35244 of configure script (as if this variable had been cached) to get
around this bug, which is not clean at all but works in my case.
With this change, the configure script completes successfully.
However "make all" then fails at the linking of opal_wrapper with
following messages :

/bin/sh ../../../libtool --tag=CC   --mode=link icc  -DNDEBUG -fast -C
-finline-functions -fno-strict-aliasing -restrict -pthread
-fvisibility=hidden  -export-dynamic -fast -C  -o opal_wrapper
opal_wrapper.o ../../../opal/libopen-pal.la -lnsl
-lutil
libtool: link: icc -DNDEBUG -fast -C -finline-functions
-fno-strict-aliasing -restrict -pthread -fvisibility=hidden -fast -C -o
.libs/opal_wrapper opal_wrapper.o -Wl,--export-dynamic 
../../../opal/.libs/libopen-pal.so -lm -lnsl -lutil -pthread -Wl,-rpath
-Wl,/opt/intel/Compiler/11.0/074/lib 
*** glibc detected *** /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom:
double free or corruption (!prev): 0x02d06c70 ***
=== Backtrace: =
/lib64/libc.so.6[0x2b8a83f7d118]
/lib64/libc.so.6(cfree+0x76)[0x2b8a83f7ec76]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10b43e7]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x1104a68]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11145ae]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11172d5]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11168b7]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x110f181]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10ffe06]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10ade6b]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0xfe7960]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x847c06]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x452935]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x2b8a83f27586]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom(regcomp+0x3a)[0x40557a]
=== Memory map: =
0040-01dff000 r-xp  08:07 402335 /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom
01efe000-0202f000 rwxp 019fe000 08:07 402335 /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom
0202f000-02e6b000 rwxp 0202f000 00:00 0 [heap]
2b8a83599000-2b8a835b7000 r-xp  08:07 1262134

Re: [OMPI users] top question

2009-06-03 Thread Scott Atchley

On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:

Top always shows all the parallel processes at 100% in the %CPU
field, although some of the time these must be waiting for a  
communication to complete. How can I see actual processing as  
opposed to waiting at a barrier?


Thanks,
Tiago


Using what interconnect?

For performance reasons (lower latency), the app and/or OMPI may be  
polling on the completion. Are you using blocking or non-blocking  
communication?


Scott


[OMPI users] top question

2009-06-03 Thread tsilva


Top always shows all the parallel processes at 100% in the %CPU field,
although some of the time these must be waiting for a communication to  
complete. How can I see actual processing as opposed to waiting at a  
barrier?


Thanks,
Tiago





Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

2009-06-03 Thread Ralph Castain
I'm afraid there is no way to do this in 1.3.2 (or any OMPI  
distributed release) with MPI applications.


The OMPI trunk does provide continuous re-spawn of failed processes,  
mapping them to other nodes and considering fault relationships  
between nodes, but this only works if they are -not- MPI processes. I  
can detail that for you, if you would like.


The problem with MPI processes is that restart is a much larger  
problem than just re-spawning a process. The entire MPI system becomes  
out-of-sync when one process fails - messages in-flight can be lost,  
collectives hang, etc.


Even if you rewire the connections after re-spawning the process, you  
still have the problem of re-synchronizing the MPI communications -  
recovering lost messages, determining if a collective is already in  
operation and waiting for this process to respond, etc. Hence, our  
default response is to simply terminate the job, letting the user  
restart it from some prior checkpoint.


Of course, the issue of how to recover from a single process failure  
remains the subject of considerable research. I assume you are  
engaging in such research?


On Jun 2, 2009, at 10:49 PM, Tee Wen Kai wrote:


Hi,

I am writing a program for a central controller that will spawn
processes depending on the user's selection. When there is some fault
in the spawned processes, for example the computer on which a process
was spawned suddenly goes down, the controller should react to this
and respawn the processes on available machines. However, when a
computer goes down, all communication hangs. To resolve this, the
controller sends a SIGTERM signal to kill those spawned processes. In
the spawned program, I have written a signal handler to handle such
cases. However, when I include MPI_Finalize in the handler, there are
some error messages when the processes exit because some communication
is not complete. Thus, I modified my program so that when the
processes need to exit through the handler, there is no MPI_Finalize
statement. I am using openmpi 1.2.8 and this works. However, version
1.2.8 has other bugs: for example, when processes spawned with
MPI_Comm_spawn exit, the orted services are not terminated, leading to
a lot of orted services when processes are spawned over and over
again. Thus, I started evaluating version 1.3.2. 1.3.2 solves that
bug, but the whole program exits once a process exits without calling
MPI_Finalize. Therefore, I seek your help or suggestions on how I
should overcome this, or on the proper way to quit processes when
they are stuck because one process has gone down.


Thank you.

Regards,
Wenkai





[OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

2009-06-03 Thread Tee Wen Kai
Hi,
 
I am writing a program for a central controller that will spawn processes
depending on the user's selection. When there is some fault in the spawned
processes, for example the computer on which a process was spawned suddenly
goes down, the controller should react to this and respawn the processes on
available machines. However, when a computer goes down, all communication
hangs. To resolve this, the controller sends a SIGTERM signal to kill those
spawned processes. In the spawned program, I have written a signal handler to
handle such cases. However, when I include MPI_Finalize in the handler, there
are some error messages when the processes exit because some communication is
not complete. Thus, I modified my program so that when the processes need to
exit through the handler, there is no MPI_Finalize statement. I am using
openmpi 1.2.8 and this works. However, version 1.2.8 has other bugs: for
example, when processes spawned with MPI_Comm_spawn exit, the orted services
are not terminated, leading to a lot of orted services when processes are
spawned over and over again. Thus, I started evaluating version 1.3.2. 1.3.2
solves that bug, but the whole program exits once a process exits without
calling MPI_Finalize. Therefore, I seek your help or suggestions on how I
should overcome this, or on the proper way to quit processes when they are
stuck because one process has gone down.
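
For reference, the handler approach I am using looks roughly like this
minimal sketch (MPI_Finalize is skipped on purpose when exiting through
the handler):

    /* Sketch only: on SIGTERM the spawned process exits immediately and
     * deliberately skips MPI_Finalize.  Whether the rest of the job
     * survives this depends on the Open MPI version, as described above. */
    #include <mpi.h>
    #include <signal.h>
    #include <unistd.h>

    static void on_sigterm(int sig)
    {
        (void)sig;
        _exit(1);        /* async-signal-safe; no MPI_Finalize on purpose */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        signal(SIGTERM, on_sigterm);

        /* ... normal work, communicating with the controller ... */

        MPI_Finalize();  /* only reached on a clean shutdown */
        return 0;
    }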
 
Thank you.
 
Regards,
Wenkai