Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Eugene Loh

Matthew MacManes wrote:


I would be happy to help troubleshoot, but I am not much of a programmer, so I 
don't really know how. The hang is reproducible, and -mca btl ^sm is about 15% faster.

If you want to shoot me some instructions off list, I can give it a go. 


The application that I am working with, primarily, is ABySS:  
http://www.bcgsc.ca/platform/bioinfo/software/abyss
 

How about this?  File a trac ticket for each issue (the hang with more 
FIFOs, ^sm being 15% faster), describing in the simplest terms possible how 
to reproduce each problem (which OMPI release, simple test code if possible 
so people don't need to come up to speed on ABySS, etc.).
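
For what it's worth, the simple test code for a ticket could be as small as 
the sketch below.  This is only a suggested shape, not a verified reproducer: 
the all-pairs pattern mimics connectivity_c, and the 10000-iteration count is 
an arbitrary guess.

/* Hypothetical minimal reproducer sketch: repeated all-pairs exchanges of
 * small messages, in the same spirit as the connectivity_c example.  The
 * iteration count is arbitrary; adjust until the problem (if any) appears. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, iter, i, j, buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 10000; iter++) {
        /* Ordered pairwise pattern: rank i sends to j and waits for the
         * echo; rank j echoes back.  No pair can deadlock this way. */
        for (i = 0; i < size - 1; i++) {
            for (j = i + 1; j < size; j++) {
                if (rank == i) {
                    MPI_Send(&rank, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
                    MPI_Recv(&buf, 1, MPI_INT, j, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == j) {
                    MPI_Recv(&buf, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(&rank, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
                }
            }
        }
        if (rank == 0 && iter % 1000 == 0)
            printf("iteration %d\n", iter);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("all iterations completed\n");
    MPI_Finalize();
    return 0;
}

Run it with something like "mpirun -np 8 ./a.out", once with the default btls 
and once with -mca btl ^sm, and note where (if anywhere) it stalls.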


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Matthew MacManes
I would be happy to help troubleshoot, but I am not much of a programmer, so I 
don't really know how. The hang is reproducible, and -mca btl ^sm is about 15% faster.

If you want to shoot me some instructions off list, I can give it a go. 

The application that I am working with, primarily, is ABySS:  
http://www.bcgsc.ca/platform/bioinfo/software/abyss

Matt

On Dec 15, 2009, at 11:55 AM, Eugene Loh wrote:

> Matthew MacManes wrote:
> 
>> On my system,  mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and 
>> appeared to hang after several thousand iterations) than -mca btl ^sm
>> 
> If the hang is reproducible, we should perhaps have a look.  Also, the fact 
> that it's much slower is interesting.  Can you characterize the message 
> pattern?  Increasing the number of FIFOs means that there are more places to 
> look to find messages, but this should make a difference mainly for very 
> large on-node process counts (more than 8 I would have thought) and very 
> latency-sensitive applications (but perhaps that's what you have).
> 
>> Is there another better way I should be modifying fifos to get better 
>> performance?
>> 
> Actually, there have been some promising developments on the trac-2043 
> front.  So, maybe 1-3 days of patience could pay off here.  But, I'm not in a 
> position to promise anything.
> 
>> On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:
>> 
>>>> Date: Thu, 10 Dec 2009 17:57:27 -0500
>>>> From: Jeff Squyres 
>>>> 
>>>> On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
>>>> 
>>>>> How does the efficiency of loopback
>>>>> (let's say, over TCP and over IB) compare with "sm"?
>>>> 
>>>> Definitely not as good; that's why we have sm.   :-)   I don't have any 
>>>> quantification of that assertion, though (i.e., no numbers to back that up).
>>>> 
>>> However, as Eugene wrote earlier you can actually increase the number of 
>>> fifos used by the SM and avoid the hang that way.  Unless you are really 
>>> strapped for memory I think that would be the best way to go.
> 

_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/








Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Eugene Loh

Matthew MacManes wrote:


On my system,  mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and 
appeared to hang after several thousand iterations) than -mca btl ^sm
 

If the hang is reproducible, we should perhaps have a look.  Also, the 
fact that it's much slower is interesting.  Can you characterize the 
message pattern?  Increasing the number of FIFOs means that there are 
more places to look to find messages, but this should make a difference 
mainly for very large on-node process counts (more than 8 I would 
have thought) and very latency-sensitive applications (but perhaps 
that's what you have).



Is there another better way I should be modifying fifos to get better 
performance?
 

Actually, there have been some promising developments on the 
trac-2043 front.  So, maybe 1-3 days of patience could pay off here.  
But, I'm not in a position to promise anything.



On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:

Date: Thu, 10 Dec 2009 17:57:27 -0500
From: Jeff Squyres 

On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

How does the efficiency of loopback
(let's say, over TCP and over IB) compare with "sm"?

Definitely not as good; that's why we have sm.   :-)   I don't have any 
quantification of that assertion, though (i.e., no numbers to back that up).

However, as Eugene wrote earlier you can actually increase the number of fifos 
used by the SM and avoid the hang that way.  Unless you are really strapped for 
memory I think that would be the best way to go.





Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Matthew MacManes
On my system,  mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and 
appeared to hang after several thousand iterations) than -mca btl ^sm

Is there another better way I should be modifying fifos to get better 
performance?

Matt



On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:

>> Date: Thu, 10 Dec 2009 17:57:27 -0500
>> From: Jeff Squyres 
>> 
>> On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
>> 
>>> How does the efficiency of loopback
>>> (let's say, over TCP and over IB) compare with "sm"?
>> 
>> Definitely not as good; that's why we have sm.   :-)   I don't have any 
>> quantification of that assertion, though (i.e., no numbers to back that up).
>> 
> However, as Eugene wrote earlier you can actually increase the number of 
> fifos used by the SM and avoid the hang that way.  Unless you are really 
> strapped for memory I think that would be the best way to go.
> 
> --td
> 

_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/








Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Terry Dontje


Date: Thu, 10 Dec 2009 17:57:27 -0500
From: Jeff Squyres 

On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?

Definitely not as good; that's why we have sm.   :-)   I don't have any 
quantification of that assertion, though (i.e., no numbers to back that up).

However, as Eugene wrote earlier you can actually increase the number of 
fifos used by the SM and avoid the hang that way.  Unless you are really 
strapped for memory I think that would be the best way to go.

--td



Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mark Bolstad
Some additional data:

Without threads it still hangs, similar behavior as before.

All of the tests were run on a system running FC11 with X5550 processors.

I just reran on a node of a RHEL 5.3 cluster with E5530 processors (dual
Nehalem):
 - openmpi 1.3.4 and gcc 4.1.2
 - No issues: connectivity_c works through np = 128

 - openmpi 1.3.4 and gcc 4.4.0
 - Hangs as before

Anything else you want me to try? ;-)

Mark

On Thu, Dec 10, 2009 at 5:20 PM, Jeff Squyres  wrote:

> On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:
>
> > > Just a quick interjection, I also have a dual-quad Nehalem system, HT
> > > on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
> > > --enable-mpi-f77=no --with-openib=no
> > >
> > > With v1.3.4 I see roughly the same behavior, hello, ring work,
> > > connectivity fails randomly with np >= 8. Turning on -v increased the
> > > success, but still hangs. np = 16 fails more often, and the hang is
> > > random in which pair of processes are communicating.
> > >
> > > However, it seems to be related to the shared memory layer problem.
> > > Running with -mca btl ^sm works consistently through np = 128.
>
> Note, too, that --enable-mpi-threads "works" but I would not say that it is
> production-quality hardened yet.  IBM is looking into thread safety issues
> to harden up this code.  If the same hangs can be observed without
> --enable-mpi-threads, that would be a good data point.
>
> --
> Jeff Squyres
> jsquy...@cisco.com


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Eugene Loh

Gus Correa wrote:


Why wouldn't shared memory work right on Nehalem?


We don't know exactly what is driving this problem, but the issue 
appears to be related to memory fences.  Messages have to be posted to a 
receiver's queue.  By default, each process (since OMPI 1.3.2) has only 
one queue.  A sender acquires a lock to the queue, writes a pointer to 
its message, advances the queue index, and releases the lock.  If there 
are problems with memory barriers (or our use of them), messages can get 
lost, overwritten, etc.  One manifestation could be hangs.  One 
workaround, as described on this mail list, is to increase the number of 
queues (FIFOs) so that each sender gets its own.
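
To picture that, here is a deliberately simplified sketch of the posting 
pattern, written with C11 atomics.  It is an illustration only, with made-up 
names; it is not the actual sm BTL code, and it omits full-queue and 
wraparound handling.

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define QUEUE_LEN 64

typedef struct {
    atomic_flag lock;              /* spinlock shared by the senders         */
    _Atomic size_t tail;           /* next free slot; advanced by senders    */
    size_t head;                   /* next slot to read; receiver-private    */
    void *_Atomic slot[QUEUE_LEN]; /* pointers to posted message fragments   */
} sm_fifo_t;

/* Sender side: take the lock, write the fragment pointer, advance the index. */
static void fifo_post(sm_fifo_t *q, void *frag)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                                              /* spin until free */
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    atomic_store_explicit(&q->slot[t % QUEUE_LEN], frag, memory_order_relaxed);
    /* The release ordering on the index advance is the critical fence: if
     * the slot write were allowed to become visible after the new tail, the
     * receiver could read a stale pointer, i.e. a "lost" message. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Receiver side: poll the index, then read the slot it now covers. */
static void *fifo_poll(sm_fifo_t *q)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (q->head == t)
        return NULL;                                   /* nothing posted */
    return atomic_load_explicit(&q->slot[q->head++ % QUEUE_LEN],
                                memory_order_relaxed);
}

int main(void)                                         /* trivial demo */
{
    static sm_fifo_t q = { .lock = ATOMIC_FLAG_INIT };
    int payload = 42;

    fifo_post(&q, &payload);
    int *msg = fifo_poll(&q);
    printf("polled message: %d\n", msg ? *msg : -1);
    return 0;
}

Presumably the btl_sm_num_fifos workaround helps because each sender then 
posts to a queue of its own, so senders no longer contend on the same lock 
and index; the sender-to-receiver ordering requirements stay the same.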


I think that's what's happening, but we don't know the root cause.  The 
test case in 2043 on the node I used for testing works like a gem for 
GCC versions prior to 4.4.x, but with 4.4.x variants it falls hard on 
its face.  Is the problem with GCC 4.4.x?  Or, does that compiler expose 
a problem with OMPI?  Etc.



It is amazing to me that this issue hasn't surfaced on this list before.


The trac ticket refers to a number of e-mail messages that might be 
related.  At this point, however, it's hard to know what's related and 
what isn't.


Gus Correa wrote:

FYI, I do NOT see the problem reported by Matthew et al. on our AMD 
Opteron Shanghai dual-socket quad-core.  They run a quite outdated 
CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2 and OpenMPI 1.3.2.


In my mind, GCC 4.1.2 may well be the ticket here.  I find strong 
correspondence with GCC rev (< 4.4.x vs >= 4.4.x).


Moreover, all works fine if I oversubscribe up to 256 processes on one 
node.
Beyond that I get segmentation fault (not hanging) sometimes, but not 
always.

I understand that extreme oversubscription is a no-no.


Sounds like another set of problems.


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?

Definitely not as good; that's why we have sm.  :-)  I don't have any 
quantification of that assertion, though (i.e., no numbers to back that up).

> FYI, I do NOT see the problem reported by Matthew et al.
> on our AMD Opteron Shanghai dual-socket quad-core.
> They run a quite outdated
> CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2
> and OpenMPI 1.3.2.
> (I've been lazy to upgrade, it is a production machine.)
> 
> I could run all three OpenMPI test programs (hello_c, ring_c, and
> connectivity_c) on all 8 cores on a single node WITH "sm" turned ON
> with no problem whatsoever.

Good.

> Moreover, all works fine if I oversubscribe up to 256 processes on
> one node.
> Beyond that I get segmentation fault (not hanging) sometimes,
> but not always.
> I understand that extreme oversubscription is a no-no.

It's quite possible that extreme oversubscription and/or that many procs in sm 
have not been well-tested.

> Moreover, on the screenshots that Matthew posted, the cores
> were at 100% CPU utilization on the simple connectivity_c
> (although this was when he had "sm" turned on on Nehalem).
> On my platform I don't get anything more than 3% or so.

100% CPU utilization usually means that some completion hasn't occurred that 
was expected and therefore everything is spinning waiting for that completion.  
The "hasn't occurred" bit is probably the bug here -- it's likely that there 
should have been a completion that somehow got missed.  But this is speculative 
-- we're still investigating...
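
As a toy user-level analogy of that symptom (this is not what OMPI's progress 
engine looks like, just an MPI program that shows the same 100%-CPU 
signature): post a receive for a message that never arrives and poll it.

/* Toy analogy only: rank 0 polls for a message that rank 1 never sends, so
 * rank 0 pegs a core at ~100% while making no progress.  Run with -np 2;
 * it hangs by design, so kill it with Ctrl-C. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, flag = 0, buf = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(&buf, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &req);
        while (!flag)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* busy poll: 100% CPU */
        printf("got %d\n", buf);                       /* never reached */
    }
    /* rank 1 deliberately sends nothing and just sits in MPI_Finalize */
    MPI_Finalize();
    return 0;
}

In the real case the message presumably was sent; the completion just never 
became visible to the receiver, which is the "hasn't occurred" bit above.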

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Gus Correa

Hi Jeff

Thanks for jumping in!  :)
And for your clarifications too, of course.

How does the efficiency of loopback
(let's say, over TCP and over IB) compare with "sm"?

FYI, I do NOT see the problem reported by Matthew et al.
on our AMD Opteron Shanghai dual-socket quad-core.
They run a quite outdated
CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2
and OpenMPI 1.3.2.
(I've been lazy to upgrade, it is a production machine.)

I could run all three OpenMPI test programs (hello_c, ring_c, and 
connectivity_c) on all 8 cores on a single node WITH "sm" turned ON

with no problem whatsoever.
(I also had IB turned on, but I can run again
with sm only if you think this can make a difference.)

Moreover, all works fine if I oversubscribe up to 256 processes on
one node.
Beyond that I get segmentation fault (not hanging) sometimes,
but not always.
I understand that extreme oversubscription is a no-no.

Moreover, on the screenshots that Matthew posted, the cores
were at 100% CPU utilization on the simple connectivity_c
(although this was when he had "sm" turned on on Nehalem).
On my platform I don't get anything more than 3% or so.

Matthew: Which levels of CPU utilization do you see now?

My two speculative cents.
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Jeff Squyres wrote:

On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:


A couple of questions to the OpenMPI pros:
If shared memory ("sm") is turned off on a standalone computer,
which mechanism is used for MPI communication?
TCP via loopback port?  Other?


Whatever device supports node-local loopback.  TCP is one; some OpenFabrics 
devices do, too.


Why wouldn't shared memory work right on Nehalem?
(That is probably distressing for Mark, Matthew, and other Nehalem owners.)


To be clear, we don't know that this is a Nehalem-specific problem.  We 
actually thought it was an AMD-specific problem, but these results are 
interesting.  We've had a notoriously difficult time reproducing the problem 
reliably, which is why it hasn't been fixed yet.  :-(

The best luck so far in reproducing the problem has been with GCC 4.4.x (at 
Sun).  I've been trying for a few days to install GCC 4.4 on my machines 
without much luck yet.  Still working on it...





Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jonathan Dursi

Jeff Squyres wrote:


Why wouldn't shared memory work right on Nehalem?
(That is probably distressing for Mark, Matthew, and other Nehalem owners.)


To be clear, we don't know that this is a Nehalem-specific problem.


I have definitely had this problem on Harpertown cores.

- Jonathan
--
Jonathan Dursi 


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:

> > Just a quick interjection, I also have a dual-quad Nehalem system, HT
> > on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
> > --enable-mpi-f77=no --with-openib=no
> >
> > With v1.3.4 I see roughly the same behavior, hello, ring work,
> > connectivity fails randomly with np >= 8. Turning on -v increased the
> > success, but still hangs. np = 16 fails more often, and the hang is
> > random in which pair of processes are communicating.
> >
> > However, it seems to be related to the shared memory layer problem.
> > Running with -mca btl ^sm works consistently through np = 128.

Note, too, that --enable-mpi-threads "works" but I would not say that it is 
production-quality hardened yet.  IBM is looking into thread safety issues to 
harden up this code.  If the same hangs can be observed without 
--enable-mpi-threads, that would be a good data point.

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] mpirun only works when -np <4 (Gus Correa) RESOLVED FOR NOW

2009-12-10 Thread Gus Correa

Hi Matthew, Mark, Mattijs

Great news that a solution was found, actually two,
which seem to have been around for a while.
Thanks Mark and Mattijs for posting the solutions.
Much better that all can be solved by software,
with a single mca parameter.

A pity that it took a while for the actual
nature of the problem to be identified.
I have no Nehalem to test, so I could only speculate about
possible causes.

For a while it looked like a specific code issue
(MrBayes and ABySS that Matthew uses) or a broken MPI package,
or BIOS settings, etc.
It only became clear that this was not the case when Matthew
had the same problem with the test programs ring_c, connectivity_c, etc.

Matthew:  After all this mess, could you eventually compile and
run MrBayes and ABySS?
Do they work right?
Efficiently?

Good luck on your research.
Gus Correa


Matthew MacManes wrote:
Mark, 

Exciting.. SOLVED.. There is an open ticket #2043 regarding the 
Nehalem/OpenMPI/hang 
problem (https://svn.open-mpi.org/trac/ompi/ticket/2043).. Seems like 
the problem might be specific to gcc 4.4.x and OMPI <1.3.2.. It seems like 
there is a group of us with dual-socket Nehalems trying to use ompi 
without much luck (or at least not without headaches).. 


Of note, -mca btl_sm_num_fifos 7 seems to work as well..

now off to see if I can get some real code to work... 


Thanks, Mark, Gus, and the rest of the OMPI Users Group!





On Dec 10, 2009, at 7:42 AM, Mark Bolstad wrote:



Just a quick interjection, I also have a dual-quad Nehalem system, HT 
on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads 
--enable-mpi-f77=no --with-openib=no


With v1.3.4 I see roughly the same behavior, hello, ring work, 
connectivity fails randomly with np >= 8. Turning on -v increased the 
success, but still hangs. np = 16 fails more often, and the hang is 
random in which pair of processes are communicating.


However, it seems to be related to the shared memory layer problem. 
Running with -mca btl ^sm works consistently through np = 128.


Hope this helps.

Mark

On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa wrote:


Hi Matthew

Save any misinterpretation I may have made of the code:

Hello_c has no real communication, except for a final Barrier
synchronization.
Each process prints "hello world" and that's it.

Ring probes a little more, with processes Send(ing) and
Recv(ing) messages.
Ring just passes a message sequentially along all process
ranks, then back to rank 0, and repeats the game 10 times.
Rank 0 is in charge of counting turns, decrementing the counter,
and printing that (nobody else prints).
With 4 processes:
0->1->2->3->0->1... 10 times

In connectivity every pair of processes exchange a message.
Therefore it probes all pairwise connections.
In verbose mode you can see that.

These programs shouldn't hang at all, if the system were sane.
Actually, they should even run with a significant level of
oversubscription, say,
-np 128  should work easily for all three programs on a powerful
machine like yours.


**

Suggestions

1) Stick to the OpenMPI you compiled.

**

2) You can run connectivity_c in verbose mode:

/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v

(Note the trailing "-v".)

It should tell more about who's talking to who.

**

3) I wonder if there are any BIOS settings that may be required
(and perhaps not in place) to make the Nehalem hyperthreading to
work properly in your computer.

You reach the BIOS settings by typing  or 
when the computer boots up.
The key varies by
BIOS and computer vendor, but shows quickly on the bootup screen.

You may ask the computer vendor about the recommended BIOS settings.
If you haven't done this before, be careful to change and save only
what really needs to change (if anything really needs to change),
or the result may be worse.
(Overclocking is for gamers, not for genome researchers ... :) )

**

4) What I read about Nehalem DDR3 memory is that it is optimal
on configurations that are multiples of 3GB per CPU.
Common configs. in dual CPU machines like yours are
6, 12, 24 and 48GB.
The sockets where you install the memory modules also matter.

Your computer has 20GB.
Did you build the computer or upgrade the memory yourself?
Do you know how the memory is installed, in which memory sockets?
What does the vendor have to say about it?

See this:

http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx

**

5) As I said before, typing "f" then "j" on "top" will add
a column (labeled "P") that shows in which core each process is
running.
This will let you observe how the Linux scheduler is distributing
the MPI load across the cores.
   

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:

> A couple of questions to the OpenMPI pros:
> If shared memory ("sm") is turned off on a standalone computer,
> which mechanism is used for MPI communication?
> TCP via loopback port?  Other?

Whatever device supports node-local loopback.  TCP is one; some OpenFabrics 
devices do, too.

> Why wouldn't shared memory work right on Nehalem?
> (That is probably distressing for Mark, Matthew, and other Nehalem owners.)

To be clear, we don't know that this is a Nehalem-specific problem.  We 
actually thought it was an AMD-specific problem, but these results are 
interesting.  We've had a notoriously difficult time reproducing the problem 
reliably, which is why it hasn't been fixed yet.  :-(

The best luck so far in reproducing the problem has been with GCC 4.4.x (at 
Sun).  I've been trying for a few days to install GCC 4.4 on my machines 
without much luck yet.  Still working on it...

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Matthew MacManes
Hi All, 

I agree that the issue is troublesome.  It apparently has been reported, and 
there is an active bug report, with some technical discussion of the underlying 
problems, found here: https://svn.open-mpi.org/trac/ompi/ticket/2043

For now, it is OK, but it is an issue that hopefully will be resolved sooner 
rather than later. 

Thanks again for everybody's help!
Matt


On Dec 10, 2009, at 2:01 PM, Gus Correa wrote:

> HI Mark, Matthew, list
> 
> Oh well, Mark's direct experience on a Nehalem
> is a game changer, and his recommendation to turn off the shared
> memory feature may be the way to go for Matthew, at least to have
> things working.
> Thank you Mark, your interjection sheds new light on the awkward
> situation reported by Matthew.
> I don't have a Nehalem platform, hence I cannot do any testing.
> 
> A couple of questions to the OpenMPI pros:
> If shared memory ("sm") is turned off on a standalone computer,
> which mechanism is used for MPI communication?
> TCP via loopback port?  Other?
> Why wouldn't shared memory work right on Nehalem?
> (That is probably distressing for Mark, Matthew, and other
> Nehalem owners.)
> 
> So, judging from Mark's experiments,
> it looks like Nehalem, or perhaps its interaction with
> the current Linux kernels, still hasn't solved problems regarding
> efficient memory access.
> Or is this a rough misinterpretation of Mark's experiences?
> 
> It is amazing to me that this issue hasn't surfaced on this list
> before.
> Or maybe it did but I wasn't paying attention, after all,
> I don't have Nehalem.
> After all this is about the very basic functionality of MPI
> in the latest hardware, which has been in the market for several
> months now.
> 
> Anybody running MPI production code on Nehalem,
> that can report scaling experiments, perhaps compare with other
> hardware platforms?
> 
> Any possibility that tweaking with BIOS settings or
> special kernel parameters can solve this problem?
> 
> Any word from the pros on the list that have direct experience
> with Nehalem and OpenMPI?
> 
> Has anybody experimented with MPICH2 on a single node dual
> socket Nehalem, for a comparison?
> 
> Thanks,
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
> 
> 
> Mark Bolstad wrote:
>> Just a quick interjection, I also have a dual-quad Nehalem system, HT on, 
>> 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads 
>> --enable-mpi-f77=no --with-openib=no
>> With v1.3.4 I see roughly the same behavior, hello, ring work, connectivity 
>> fails randomly with np >= 8. Turning on -v increased the success, but still 
>> hangs. np = 16 fails more often, and the hang is random in which pair of 
>> processes are communicating.
>> However, it seems to be related to the shared memory layer problem. Running 
>> with -mca btl ^sm works consistently through np = 128.
>> Hope this helps.
>> Mark
>> On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa wrote:
>>Hi Matthew
>>Save any misinterpretation I may have made of the code:
>>Hello_c has no real communication, except for a final Barrier
>>synchronization.
>>Each process prints "hello world" and that's it.
>>Ring probes a little more, with processes Send(ing) and
>>Recv(ing) messages.
>>Ring just passes a message sequentially along all process
>>ranks, then back to rank 0, and repeats the game 10 times.
>>Rank 0 is in charge of counting turns, decrementing the counter,
>>and printing that (nobody else prints).
>>With 4 processes:
>>0->1->2->3->0->1... 10 times
>>In connectivity every pair of processes exchange a message.
>>Therefore it probes all pairwise connections.
>>In verbose mode you can see that.
>>These programs shouldn't hang at all, if the system were sane.
>>Actually, they should even run with a significant level of
>>oversubscription, say,
>>-np 128  should work easily for all three programs on a powerful
>>machine like yours.
>>**
>>Suggestions
>>1) Stick to the OpenMPI you compiled.
>>**
>>2) You can run connectivity_c in verbose mode:
>>/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
>>(Note the trailing "-v".)
>>It should tell more about who's talking to who.
>>**
>>3) I wonder if there are any BIOS settings that may be required
>>(and perhaps not in place) to make the Nehalem hyperthreading to
>>work properly in your computer.
>>You reach the BIOS settings by typing  or 
>>when the computer boots up.
>>The key varies by
>>BIOS and computer vendor, but shows quickly on the bootup screen.
>>You may ask the computer vendor about the recommended BIOS settings.
>>If you haven't done 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Gus Correa

HI Mark, Matthew, list

Oh well, Mark's direct experience on a Nehalem
is a game changer, and his recommendation to turn off the shared
memory feature may be the way to go for Matthew, at least to have
things working.
Thank you Mark, your interjection sheds new light on the awkward
situation reported by Matthew.
I don't have a Nehalem platform, hence I cannot do any testing.

A couple of questions to the OpenMPI pros:
If shared memory ("sm") is turned off on a standalone computer,
which mechanism is used for MPI communication?
TCP via loopback port?  Other?
Why wouldn't shared memory work right on Nehalem?
(That is probably distressing for Mark, Matthew, and other
Nehalem owners.)

So, judging from Mark's experiments,
it looks like Nehalem, or perhaps its interaction with
the current Linux kernels, still hasn't solved problems regarding
efficient memory access.
Or is this a rough misinterpretation of Mark's experiences?

It is amazing to me that this issue hasn't surfaced on this list
before.
Or maybe it did but I wasn't paying attention, after all,
I don't have Nehalem.
After all this is about the very basic functionality of MPI
in the latest hardware, which has been in the market for several
months now.

Anybody running MPI production code on Nehalem,
that can report scaling experiments, perhaps compare with other
hardware platforms?

Any possibility that tweaking with BIOS settings or
special kernel parameters can solve this problem?

Any word from the pros on the list that have direct experience
with Nehalem and OpenMPI?

Has anybody experimented with MPICH2 on a single node dual
socket Nehalem, for a comparison?

Thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Mark Bolstad wrote:


Just a quick interjection, I also have a dual-quad Nehalem system, HT 
on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads 
--enable-mpi-f77=no --with-openib=no


With v1.3.4 I see roughly the same behavior, hello, ring work, 
connectivity fails randomly with np >= 8. Turning on -v increased the 
success, but still hangs. np = 16 fails more often, and the hang is 
random in which pair of processes are communicating.


However, it seems to be related to the shared memory layer problem. 
Running with -mca btl ^sm works consistently through np = 128.


Hope this helps.

Mark

On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa wrote:


Hi Matthew

Save any misinterpretation I may have made of the code:

Hello_c has no real communication, except for a final Barrier
synchronization.
Each process prints "hello world" and that's it.

Ring probes a little more, with processes Send(ing) and
Recv(ing) messages.
Ring just passes a message sequentially along all process
ranks, then back to rank 0, and repeats the game 10 times.
Rank 0 is in charge of counting turns, decrementing the counter,
and printing that (nobody else prints).
With 4 processes:
0->1->2->3->0->1... 10 times

In connectivity every pair of processes exchange a message.
Therefore it probes all pairwise connections.
In verbose mode you can see that.

These programs shouldn't hang at all, if the system were sane.
Actually, they should even run with a significant level of
oversubscription, say,
-np 128  should work easily for all three programs on a powerful
machine like yours.


**

Suggestions

1) Stick to the OpenMPI you compiled.

**

2) You can run connectivity_c in verbose mode:

/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v

(Note the trailing "-v".)

It should tell more about who's talking to who.

**

3) I wonder if there are any BIOS settings that may be required
(and perhaps not in place) to make the Nehalem hyperthreading to
work properly in your computer.

You reach the BIOS settings by typing  or 
when the computer boots up.
The key varies by
BIOS and computer vendor, but shows quickly on the bootup screen.

You may ask the computer vendor about the recommended BIOS settings.
If you haven't done this before, be careful to change and save only
what really needs to change (if anything really needs to change),
or the result may be worse.
(Overclocking is for gamers, not for genome researchers ... :) )

**

4) What I read about Nehalem DDR3 memory is that it is optimal
on configurations that are multiples of 3GB per CPU.
Common configs. in dual CPU machines like yours are
6, 12, 24 and 48GB.
The sockets where you install the memory modules also matter.

Your computer has 20GB.
Did you build the computer or upgrade the memory yourself?
Do you know how the memory is 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa) RESOLVED FOR NOW

2009-12-10 Thread Matthew MacManes
Mark, 

Exciting.. SOLVED.. There is an open ticket #2043 regarding the 
Nehalem/OpenMPI/hang problem (https://svn.open-mpi.org/trac/ompi/ticket/2043).. 
Seems like the problem might be specific to gcc 4.4.x and OMPI <1.3.2.. It seems 
like there is a group of us with dual-socket Nehalems trying to use ompi 
without much luck (or at least not without headaches).. 

Of note, -mca btl_sm_num_fifos 7 seems to work as well..

now off to see if I can get some real code to work... 

Thanks, Mark, Gus, and the rest of the OMPI Users Group!





On Dec 10, 2009, at 7:42 AM, Mark Bolstad wrote:

> 
> Just a quick interjection, I also have a dual-quad Nehalem system, HT on, 
> 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads 
> --enable-mpi-f77=no --with-openib=no
> 
> With v1.3.4 I see roughly the same behavior, hello, ring work, connectivity 
> fails randomly with np >= 8. Turning on -v increased the success, but still 
> hangs. np = 16 fails more often, and the hang is random in which pair of 
> processes are communicating.
> 
> However, it seems to be related to the shared memory layer problem. Running 
> with -mca btl ^sm works consistently through np = 128.
> 
> Hope this helps.
> 
> Mark
> 
> On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa  wrote:
> Hi Matthew
> 
> Save any misinterpretation I may have made of the code:
> 
> Hello_c has no real communication, except for a final Barrier
> synchronization.
> Each process prints "hello world" and that's it.
> 
> Ring probes a little more, with processes Send(ing) and
> Recv(ing) messages.
> Ring just passes a message sequentially along all process
> ranks, then back to rank 0, and repeats the game 10 times.
> Rank 0 is in charge of counting turns, decrementing the counter,
> and printing that (nobody else prints).
> With 4 processes:
> 0->1->2->3->0->1... 10 times
> 
> In connectivity every pair of processes exchange a message.
> Therefore it probes all pairwise connections.
> In verbose mode you can see that.
> 
> These programs shouldn't hang at all, if the system were sane.
> Actually, they should even run with a significant level of
> oversubscription, say,
> -np 128  should work easily for all three programs on a powerful
> machine like yours.
> 
> 
> **
> 
> Suggestions
> 
> 1) Stick to the OpenMPI you compiled.
> 
> **
> 
> 2) You can run connectivity_c in verbose mode:
> 
> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
> 
> (Note the trailing "-v".)
> 
> It should tell more about who's talking to who.
> 
> **
> 
> 3) I wonder if there are any BIOS settings that may be required
> (and perhaps not in place) to make the Nehalem hyperthreading to
> work properly in your computer.
> 
> You reach the BIOS settings by typing  or 
> when the computer boots up.
> The key varies by
> BIOS and computer vendor, but shows quickly on the bootup screen.
> 
> You may ask the computer vendor about the recommended BIOS settings.
> If you haven't done this before, be careful to change and save only
> what really needs to change (if anything really needs to change),
> or the result may be worse.
> (Overclocking is for gamers, not for genome researchers ... :) )
> 
> **
> 
> 4) What I read about Nehalem DDR3 memory is that it is optimal
> on configurations that are multiples of 3GB per CPU.
> Common configs. in dual CPU machines like yours are
> 6, 12, 24 and 48GB.
> The sockets where you install the memory modules also matter.
> 
> Your computer has 20GB.
> Did you build the computer or upgrade the memory yourself?
> Do you know how the memory is installed, in which memory sockets?
> What does the vendor have to say about it?
> 
> See this:
> http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
> 
> **
> 
> 5) As I said before, typing "f" then "j" on "top" will add
> a column (labeled "P") that shows in which core each process is running.
> This will let you observe how the Linux scheduler is distributing
> the MPI load across the cores.
> Hopefully it is load-balanced, and different processes go to different
> cores.
> 
> ***
> 
> It is very disconcerting when MPI processes hang.
> You are not alone.
> The reasons are not always obvious.
> At least in your case there is no network involved or to troubleshoot.
> 
> 
> **
> 
> I hope it helps,
> 
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
> 
> 
> 
> 
> 
> Matthew MacManes wrote:
> Hi Gus and List,
> 
> 1st of all Gus, I want to say thanks.. you have been a huge help, and when I 
> get this fixed, I owe you big time!
> 
> However, the problems continue...
> 
> I formatted the HD, reinstalled OS to make sure that I was working from 
> scratch.  I did your step A, which seemed to go fine:
> 
> 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mattijs Janssens
On Thursday 10 December 2009 15:42:49 Mark Bolstad wrote:
> Just a quick interjection, I also have a dual-quad Nehalem system, HT on,
> 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
> --enable-mpi-f77=no --with-openib=no
>
> With v1.3.4 I see roughly the same behavior, hello, ring work, connectivity
> fails randomly with np >= 8. Turning on -v increased the success, but still
> hangs. np = 16 fails more often, and the hang is random in which pair of
> processes are communicating.
>
> However, it seems to be related to the shared memory layer problem. Running
> with -mca btl ^sm works consistently through np = 128.
>

I have the same problem, same machine (dual-quad Nehalem system, HT on) - for 
me the fix was the one from

(https://svn.open-mpi.org/trac/ompi/ticket/2043)

mpirun -np 8 -mca btl_sm_num_fifos 7


Mattijs

> Hope this helps.
>
> Mark
>
> On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa  wrote:
> > Hi Matthew
> >
> > Save any misinterpretation I may have made of the code:
> >
> > Hello_c has no real communication, except for a final Barrier
> > synchronization.
> > Each process prints "hello world" and that's it.
> >
> > Ring probes a little more, with processes Send(ing) and
> > Recv(ing) messages.
> > Ring just passes a message sequentially along all process
> > ranks, then back to rank 0, and repeats the game 10 times.
> > Rank 0 is in charge of counting turns, decrementing the counter,
> > and printing that (nobody else prints).
> > With 4 processes:
> > 0->1->2->3->0->1... 10 times
> >
> > In connectivity every pair of processes exchange a message.
> > Therefore it probes all pairwise connections.
> > In verbose mode you can see that.
> >
> > These programs shouldn't hang at all, if the system were sane.
> > Actually, they should even run with a significant level of
> > oversubscription, say,
> > -np 128  should work easily for all three programs on a powerful
> > machine like yours.
> >
> >
> > **
> >
> > Suggestions
> >
> > 1) Stick to the OpenMPI you compiled.
> >
> > **
> >
> > 2) You can run connectivity_c in verbose mode:
> >
> > /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
> >
> > (Note the trailing "-v".)
> >
> > It should tell more about who's talking to who.
> >
> > **
> >
> > 3) I wonder if there are any BIOS settings that may be required
> > (and perhaps not in place) to make the Nehalem hyperthreading to
> > work properly in your computer.
> >
> > You reach the BIOS settings by typing  or 
> > when the computer boots up.
> > The key varies by
> > BIOS and computer vendor, but shows quickly on the bootup screen.
> >
> > You may ask the computer vendor about the recommended BIOS settings.
> > If you haven't done this before, be careful to change and save only
> > what really needs to change (if anything really needs to change),
> > or the result may be worse.
> > (Overclocking is for gamers, not for genome researchers ... :) )
> >
> > **
> >
> > 4) What I read about Nehalem DDR3 memory is that it is optimal
> > on configurations that are multiples of 3GB per CPU.
> > Common configs. in dual CPU machines like yours are
> > 6, 12, 24 and 48GB.
> > The sockets where you install the memory modules also matter.
> >
> > Your computer has 20GB.
> > Did you build the computer or upgrade the memory yourself?
> > Do you know how the memory is installed, in which memory sockets?
> > What does the vendor have to say about it?
> >
> > See this:
> >
> > http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/ne
> >halem-and-memory-configurations.aspx
> >
> > **
> >
> > 5) As I said before, typing "f" then "j" on "top" will add
> > a column (labeled "P") that shows in which core each process is running.
> > This will let you observe how the Linux scheduler is distributing
> > the MPI load across the cores.
> > Hopefully it is load-balanced, and different processes go to different
> > cores.
> >
> > ***
> >
> > It is very disconcerting when MPI processes hang.
> > You are not alone.
> > The reasons are not always obvious.
> > At least in your case there is no network involved or to troubleshoot.
> >
> >
> > **
> >
> > I hope it helps,
> >
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> >
> > Matthew MacManes wrote:
> >> Hi Gus and List,
> >>
> >> 1st of all Gus, I want to say thanks.. you have been a huge help, and
> >> when I get this fixed, I owe you big time!
> >>
> >> However, the problems continue...
> >>
> >> I formatted the HD, reinstalled OS to make sure that I was working from
> >> scratch.  I did your step A, which seemed to go fine:
> >>
> >> macmanes@macmanes:~$ which mpicc
> >> /home/macmanes/apps/openmpi1.4/bin/mpicc
> >> macmanes@macmanes:~$ which mpirun
> >> 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mark Bolstad
Just a quick interjection, I also have a dual-quad Nehalem system, HT on,
24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
--enable-mpi-f77=no --with-openib=no

With v1.3.4 I see roughly the same behavior, hello, ring work, connectivity
fails randomly with np >= 8. Turning on -v increased the success, but still
hangs. np = 16 fails more often, and the hang is random in which pair of
processes are communicating.

However, it seems to be related to the shared memory layer problem. Running
with -mca btl ^sm works consistently through np = 128.

Hope this helps.

Mark

On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa  wrote:

> Hi Matthew
>
> Save any misinterpretation I may have made of the code:
>
> Hello_c has no real communication, except for a final Barrier
> synchronization.
> Each process prints "hello world" and that's it.
>
> Ring probes a little more, with processes Send(ing) and
> Recv(ing) messages.
> Ring just passes a message sequentially along all process
> ranks, then back to rank 0, and repeats the game 10 times.
> Rank 0 is in charge of counting turns, decrementing the counter,
> and printing that (nobody else prints).
> With 4 processes:
> 0->1->2->3->0->1... 10 times
>
> In connectivity every pair of processes exchange a message.
> Therefore it probes all pairwise connections.
> In verbose mode you can see that.
>
> These programs shouldn't hang at all, if the system were sane.
> Actually, they should even run with a significant level of
> oversubscription, say,
> -np 128  should work easily for all three programs on a powerful
> machine like yours.
>
>
> **
>
> Suggestions
>
> 1) Stick to the OpenMPI you compiled.
>
> **
>
> 2) You can run connectivity_c in verbose mode:
>
> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
>
> (Note the trailing "-v".)
>
> It should tell more about who's talking to who.
>
> **
>
> 3) I wonder if there are any BIOS settings that may be required
> (and perhaps not in place) to make the Nehalem hyperthreading to
> work properly in your computer.
>
> You reach the BIOS settings by typing  or 
> when the computer boots up.
> The key varies by
> BIOS and computer vendor, but shows quickly on the bootup screen.
>
> You may ask the computer vendor about the recommended BIOS settings.
> If you haven't done this before, be careful to change and save only
> what really needs to change (if anything really needs to change),
> or the result may be worse.
> (Overclocking is for gamers, not for genome researchers ... :) )
>
> **
>
> 4) What I read about Nehalem DDR3 memory is that it is optimal
> on configurations that are multiples of 3GB per CPU.
> Common configs. in dual CPU machines like yours are
> 6, 12, 24 and 48GB.
> The sockets where you install the memory modules also matter.
>
> Your computer has 20GB.
> Did you build the computer or upgrade the memory yourself?
> Do you know how the memory is installed, in which memory sockets?
> What does the vendor have to say about it?
>
> See this:
>
> http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
>
> **
>
> 5) As I said before, typing "f" then "j" on "top" will add
> a column (labeled "P") that shows in which core each process is running.
> This will let you observe how the Linux scheduler is distributing
> the MPI load across the cores.
> Hopefully it is load-balanced, and different processes go to different
> cores.
>
> ***
>
> It is very disconcerting when MPI processes hang.
> You are not alone.
> The reasons are not always obvious.
> At least in your case there is no network involved or to troubleshoot.
>
>
> **
>
> I hope it helps,
>
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
>
>
>
>
>
> Matthew MacManes wrote:
>
>> Hi Gus and List,
>>
>> 1st of all Gus, I want to say thanks.. you have been a huge help, and when
>> I get this fixed, I owe you big time!
>>
>> However, the problems continue...
>>
>> I formatted the HD, reinstalled OS to make sure that I was working from
>> scratch.  I did your step A, which seemed to go fine:
>>
>> macmanes@macmanes:~$ which mpicc
>> /home/macmanes/apps/openmpi1.4/bin/mpicc
>> macmanes@macmanes:~$ which mpirun
>> /home/macmanes/apps/openmpi1.4/bin/mpirun
>>
>> Good stuff there...
>>
>> I then compiled the example files:
>>
>> macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
>> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Gus Correa

Hi Matthew

Save any misinterpretation I may have made of the code:

Hello_c has no real communication, except for a final Barrier
synchronization.
Each process prints "hello world" and that's it.

Ring probes a little more, with processes Send(ing) and
Recv(ing) messages.
Ring just passes a message sequentially along all process
ranks, then back to rank 0, and repeats the game 10 times.
Rank 0 is in charge of counting turns, decrementing the counter,
and printing that (nobody else prints).
With 4 processes:
0->1->2->3->0->1... 10 times
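
If it helps, the ring logic is roughly the sketch below (a simplified 
rendition from memory, not the exact ring_c source that ships with OpenMPI):

/* Simplified sketch of the ring_c logic described above: rank 0 starts a
 * counter at 10, the message goes 0 -> 1 -> ... -> N-1 -> 0, and rank 0
 * decrements it on every lap until it hits 0.  Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, next, prev, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    if (rank == 0) {
        value = 10;
        printf("Process 0 sending %d to %d (%d processes in ring)\n",
               value, next, size);
        MPI_Send(&value, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&value, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0) {
            value--;                      /* only rank 0 counts the laps */
            printf("Process 0 decremented value: %d\n", value);
        }
        MPI_Send(&value, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
        if (value == 0)                   /* pass the 0 along once, then stop */
            break;
    }

    if (rank == 0)                        /* absorb the final 0 coming back */
        MPI_Recv(&value, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("Process %d exiting\n", rank);
    MPI_Finalize();
    return 0;
}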

In connectivity every pair of processes exchange a message.
Therefore it probes all pairwise connections.
In verbose mode you can see that.

These programs shouldn't hang at all, if the system were sane.
Actually, they should even run with a significant level of
oversubscription, say,
-np 128  should work easily for all three programs on a powerful
machine like yours.


**

Suggestions

1) Stick to the OpenMPI you compiled.

**

2) You can run connectivity_c in verbose mode:

/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v

(Note the trailing "-v".)

It should tell more about who's talking to who.

**

3) I wonder if there are any BIOS settings that may be required
(and perhaps not in place) to make the Nehalem hyperthreading to
work properly in your computer.

You reach the BIOS settings by typing  or 
when the computer boots up.
The key varies by
BIOS and computer vendor, but shows quickly on the bootup screen.

You may ask the computer vendor about the recommended BIOS settings.
If you haven't done this before, be careful to change and save only
what really needs to change (if anything really needs to change),
or the result may be worse.
(Overclocking is for gamers, not for genome researchers ... :) )

**

4) What I read about Nehalem DDR3 memory is that it is optimal
on configurations that are multiples of 3GB per CPU.
Common configs. in dual CPU machines like yours are
6, 12, 24 and 48GB.
The sockets where you install the memory modules also matter.

Your computer has 20GB.
Did you build the computer or upgrade the memory yourself?
Do you know how the memory is installed, in which memory sockets?
What does the vendor have to say about it?

See this:
http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx

**

5) As I said before, typing "f" then "j" on "top" will add
a column (labeled "P") that shows in which core each process is running.
This will let you observe how the Linux scheduler is distributing
the MPI load across the cores.
Hopefully it is load-balanced, and different processes go to different
cores.

***

It is very disconcerting when MPI processes hang.
You are not alone.
The reasons are not always obvious.
At least in your case there is no network involved or to troubleshoot.


**

I hope it helps,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-





Matthew MacManes wrote:

Hi Gus and List,

1st of all Gus, I want to say thanks.. you have been a huge help, and 
when I get this fixed, I owe you big time!


However, the problems continue...

I formatted the HD, reinstalled OS to make sure that I was working from 
scratch.  I did your step A, which seemed to go fine:


macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there...

I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c

Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c

Connectivity test on 8 processes PASSED.
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c

..HANGS..NO OUTPUT

this is maddening because ring_c works.. and connectivity_c worked the 
1st time, but not the second... I did it 10 times, and it worked twice.. 
here is the TOP screenshot:


http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what 
circumstances should one fail and not the other...


I'm 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus and List,

1st of all Gus, I want to say thanks.. you have been a huge help, and when I
get this fixed, I owe you big time!

However, the problems continue...

I formatted the HD, reinstalled OS to make sure that I was working from
scratch.  I did your step A, which seemed to go fine:

macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there...

I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
Connectivity test on 8 processes PASSED.
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
..HANGS..NO OUTPUT

this is maddening because ring_c works.. and connectivity_c worked the 1st
time, but not the second... I did it 10 times, and it worked twice.. here is
the TOP screenshot:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what
circumstances should one fail and not the other...

I'm off to the Linux forums to see about the Nehalem kernel issues..

Matt



On Wed, Dec 9, 2009 at 13:25, Gus Correa  wrote:

> Hi Matthew
>
> There is no point in trying to troubleshoot MrBayes and ABySS
> if not even the OpenMPI test programs run properly.
> You must straighten them out first.
>
> **
>
> Suggestions:
>
> **
>
> A) While you are at OpenMPI, do yourself a favor,
> and install it from source on a separate directory.
> Who knows if the OpenMPI package distributed with Ubuntu
> works right on Nehalem?
> Better install OpenMPI yourself from source code.
> It is not a big deal, and may save you further trouble.
>
> Recipe:
>
> 1) Install gfortran and g++ if you don't have them using apt-get.
> 2) Put the OpenMPI tarball in, say /home/matt/downloads/openmpi
> 3) Make another install directory *not in the system directory tree*.
> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z=version)
> will work
> 4) cd /home/matt/downloads/openmpi
> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran  \
> --prefix=/home/matt/apps/openmpi-X.Y.Z
> (Use the prefix flag to install in the directory of item 3.)
> 6) make
> 7) make install
> 8) At the bottom of your /home/matt/.bashrc or .profile file
> put these lines:
>
> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
> export LD_LIBRARY_PATH=/home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>
> (If you use csh/tcsh use instead:
> setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> etc)
>
> 9) Logout and login again to freshen up the environment variables.
> 10) Do "which mpicc"  to check that it is pointing to your newly
> installed OpenMPI.
> 11) Recompile and rerun the OpenMPI test programs
> with 2, 4, 8, 16,  processors.
> Use full path names to mpicc and to mpirun,
> if the change of PATH above doesn't work right.
>
> 
>
> B) Nehalem is quite new hardware.
> I don't know if the Ubuntu kernel 2.6.31-16 fully supports all
> of Nehalem features, particularly hyperthreading, and NUMA,
> which are used by MPI programs.
> I am not the right person to give you advice about this.
> I googled out but couldn't find a clear information about
> minimal kernel age/requirements to have Nehalem fully supported.
> Some Nehalem owner in the list could come forward and tell.
>
> **
>
> C) On the top screenshot you sent me, please try it again
> (after you do item A) but type "f" and "j" to show the processors
> that are running each process.
>
> **
>
> D) Also, the screenshot shows 20GB of memory.
> This does not sound like an optimal memory configuration for Nehalem,
> which tends to be 6GB, 12GB, 24GB, or 48GB.
> Did you put together the system, or upgrade the memory yourself,
> or did you buy the computer as is?
> However, this should not break MPI anyway.
>
> **
>
> E) Answering your question:
> It is true that different flavors of MPI
> used to compile (mpicc) and run (mpiexec) a program would probably
> break right away, regardless of the number of processes.
> However, when it comes to different versions of the
> same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3)
> I am not 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus,

Interestingly the results for the connectivity_c test... works fine with -np 
<8. For -np >8 it works some of the time, other times it HANGS. I have got to 
believe that this is a big clue!! Also, when it hangs, sometimes I get the 
message "mpirun was unable to cleanly terminate the daemons on the nodes shown 
below" Note that NO nodes are shown below.   Once, I got -np 250 to pass the 
connectivity test, but I was not able to replicate this reliable, so I'm not 
sure if it was a fluke, or what.  Here is a like to a screenshop of TOP when 
connectivity_c is hung with -np 14.. I see that 2 processes are only at 50% CPU 
usage.. H  

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw=directlink

The other tests, ring_c, hello_c, as well as the cxx versions of these guys, 
work with all values of -np.
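
(For reference, a pairwise connectivity check exercises every pair of ranks; 
the following is only a rough sketch, and the actual connectivity_c.c shipped 
in the OpenMPI examples directory may differ:)

#include <stdio.h>
#include <mpi.h>

/* Hypothetical sketch of a pairwise connectivity check; not the
 * actual connectivity_c.c from the OpenMPI examples directory. */
int main(int argc, char *argv[])
{
    int rank, size, i, j, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every lower rank i exchanges one message with every higher rank j */
    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            if (rank == i) {
                MPI_Send(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == j) {
                MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            }
        }
    }
    if (rank == 0)
        printf("Connectivity test on %d processes passed\n", size);
    MPI_Finalize();
    return 0;
}

(With a check of that shape, a hang that only appears at larger -np points at 
message delivery between some particular pair of ranks, which is why the 
shared-memory BTL settings matter here.)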

Using -mca mpi_paffinity_alone 1 I get the same behavior.

I agree that I should worry about the mismatch between where the libraries 
are installed versus where I am telling my programs to look for them. Would 
this type of mismatch cause behavior like what I am seeing, i.e. working with 
a small number of processors, but failing with larger? It seems like a 
mismatch would have the same effect regardless of the number of processors 
used. Maybe I am mistaken. Anyway, to address this: which mpirun gives me 
/usr/local/bin/mpirun, so I configure with ./configure 
--with-mpi=/usr/local/bin/mpirun and run with /usr/local/bin/mpirun -np X ...  
This should 

uname -a gives me: Linux macmanes 2.6.31-16-generic #52-Ubuntu SMP Thu Dec 3 
22:07:16 UTC 2009 x86_64 GNU/Linux

Matt

On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:

> Hi Matthew
> 
> Please see comments/answers inline below.
> 
> Matthew MacManes wrote:
>> Hi Gus, Thanks for your ideas.. I have a few questions, and will try to 
>> answer yours in hopes of solving this!!
> 
> A simple way to test OpenMPI on your system is to run the
> test programs that come with the OpenMPI source code,
> hello_c.c, connectivity_c.c, and ring_c.c:
> http://www.open-mpi.org/
> 
> Get the tarball from the OpenMPI site, gzip and untar it,
> and look for it in the "examples" directory.
> Compile it with /your/path/to/openmpi/bin/mpicc hello_c.c
> Run it with /your/path/to/openmpi/bin/mpiexec -np X a.out
> using X = 2, 4, 8, 16, 32, 64, ...
> 
> This will tell if your OpenMPI is functional,
> and if you can run on many Nehalem cores,
> even with oversubscription perhaps.
> It will also set the stage for further investigation of your
> actual programs.
> 
> 
>> Should I worry about setting things like --num-cores --bind-to-cores?  This, 
>> I think, gets at your questions about processor affinity.. Am I right? I 
>> could not exactly figure out the -mca mpi_paffinity_alone stuff...
> 
> I use the simple-minded -mca mpi_paffinity_alone 1.
> This is probably the easiest way to assign a process to a core.
> There are more complex ways in OpenMPI, but I haven't tried them.
> Indeed, -mca mpi_paffinity_alone 1 does improve performance of
> our programs here.
> There is a chance that without it the 16 virtual cores of
> your Nehalem get confused with more than 3 processes
> (you reported that -np > 3 breaks).
> 
> Did you try adding just -mca mpi_paffinity_alone 1 to
> your mpiexec command line?
> 
> 
>> 1. Additional load: nope. nothing else, most of the time not even firefox. 
> 
> Good.
> Turn off firefox, etc, to make it even better.
> Ideally, use runlevel 3, no X, like a computer cluster node,
> but this may not be required.
> 
>> 2. RAM: no problems apparent when monitoring through TOP. Interesting, I did 
>> wonder about oversubscription, so I tried the option --nooversubscription, 
>> but this gave me an error mssage.
> 
> Oversubscription from your program would only happen if
> you asked for more processes than available cores, i.e.,
> -np > 8 (or "virtual" cores, in case of Nehalem hyperthreading,
> -np > 16).
> Since you have -np=4 there is no oversubscription,
> unless you have other external load (e.g. Matlab, etc),
> but you said you don't.
> 
> Yet another possibility would be if your program is threaded
> (e.g. using OpenMP along with MPI), but considering what you
> said about OpenMP I would guess the programs don't use it.
> For instance, you launch the program with 4 MPI processes,
> and each process decides to start, say, 8 OpenMP threads.
> You end up with 32 threads and 8 (real) cores (or 16 hyperthreaded
> ones on Nehalem).
> 
> 
> What else does top say?
> Any hog processes (memory- or CPU-wise)
> besides your program processes?
> 
>> 3. I have not tried other MPI flavors.. I've been speaking to the authors of 
>> the programs, and they are both using openMPI.  
> 
> I was not trying to convince you to use another MPI.
> I use MPICH2 also, but OpenMPI reigns here.
> The idea of trying it with MPICH2 was just to check whether OpenMPI
> is causing the problem, but I don't think it is.
> 
>> 4. I don't think that this 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-08 Thread Gus Correa

Hi Matthew

Please see comments/answers inline below.

Matthew MacManes wrote:
Hi Gus, 

Thanks for your ideas.. I have a few questions, and will try to answer 
yours in hopes of solving this!!


A simple way to test OpenMPI on your system is to run the
test programs that come with the OpenMPI source code,
hello_c.c, connectivity_c.c, and ring_c.c:
http://www.open-mpi.org/

Get the tarball from the OpenMPI site, gzip and untar it,
and look for it in the "examples" directory.
Compile it with /your/path/to/openmpi/bin/mpicc hello_c.c
Run it with /your/path/to/openmpi/bin/mpiexec -np X a.out
using X = 2, 4, 8, 16, 32, 64, ...

This will tell if your OpenMPI is functional,
and if you can run on many Nehalem cores,
even with oversubscription perhaps.
It will also set the stage for further investigation of your
actual programs.
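
(For reference, hello_c.c is essentially a minimal MPI "hello world"; a sketch 
along these lines, which may differ in detail from the file in the tarball, is:)

#include <stdio.h>
#include <mpi.h>

/* Minimal MPI "hello world" sketch; the hello_c.c shipped with
 * OpenMPI may differ in detail. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

(Compile and run it exactly as described above.)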




Should I worry about setting things like --num-cores --bind-to-cores? 
 This, I think, gets at your questions about processor affinity.. Am I 
right? I could not exactly figure out the -mca mpi_paffinity_alone stuff...




I use the simple-minded -mca mpi_paffinity_alone 1.
This is probably the easiest way to assign a process to a core.
There are more complex ways in OpenMPI, but I haven't tried them.
Indeed, -mca mpi_paffinity_alone 1 does improve performance of
our programs here.
There is a chance that without it the 16 virtual cores of
your Nehalem get confused with more than 3 processes
(you reported that -np > 3 breaks).

Did you try adding just -mca mpi_paffinity_alone 1 to
your mpiexec command line?


1. Additional load: nope. nothing else, most of the time not even firefox. 


Good.
Turn off firefox, etc, to make it even better.
Ideally, use runlevel 3, no X, like a computer cluster node,
but this may not be required.

2. RAM: no problems apparent when monitoring through TOP. Interesting, I 
did wonder about oversubscription, so I tried the option 
--nooversubscription, but this gave me an error message.


Oversubscription from your program would only happen if
you asked for more processes than available cores, i.e.,
-np > 8 (or "virtual" cores, in case of Nehalem hyperthreading,
-np > 16).
Since you have -np=4 there is no oversubscription,
unless you have other external load (e.g. Matlab, etc),
but you said you don't.

Yet another possibility would be if your program is threaded
(e.g. using OpenMP along with MPI), but considering what you
said about OpenMP I would guess the programs don't use it.
For instance, you launch the program with 4 MPI processes,
and each process decides to start, say, 8 OpenMP threads.
You end up with 32 threads and 8 (real) cores (or 16 hyperthreaded
ones on Nehalem).


What else does top say?
Any hog processes (memory- or CPU-wise)
besides your program processes?

3. I have not tried other MPI flavors.. I've been speaking to the authors 
of the programs, and they are both using openMPI.  


I was not trying to convince you to use another MPI.
I use MPICH2 also, but OpenMPI reigns here.
The idea of trying it with MPICH2 was just to check whether OpenMPI
is causing the problem, but I don't think it is.

4. I don't think that this is a problem, as I'm specifying 
--with-mpi=/usr/bin/...  when I compile the programs. Is there any other 
way to be sure that this is not a problem?


Hmmm 
I don't know about your Ubuntu (we have CentOS and Fedora on various
machines).
However, most Linux distributions come with their MPI flavors,
and so do compilers, etc.
Often times they install these goodies in unexpected places,
and this has caused a lot of frustration.
There are tons of postings on this list that eventually
boiled down to mismatched versions of MPI in unexpected places.


The easy way is to use full path names to compile and to run.
Something like this:
/my/openmpi/bin/mpicc (in your program's configuration script),

and something like this
/my/openmpi/bin/mpiexec -np  ... bla, bla ...
when you submit the job.

You can check your version with "which mpicc", "which mpiexec",
and (perhaps using full path names) with
"ompi_info", "mpicc --showme", "mpiexec --help".


5. I had not been, and you could see some shuffling when monitoring the 
load on specific processors. I have tried to use --bind-to-cores to deal 
with this. I don't understand how to use the -mca options you asked about. 
6. I am using Ubuntu 9.10. gcc 4.4.1 and g++  4.4.1


I am afraid I won't be of help, because I don't have Nehalem.
However, I read about Nehalem requiring quite recent kernels
to get all of its features working right.

What is the output of "uname -a"?
This will tell the kernel version, etc.
Other list subscribers may give you a suggestion if you post the
information.




MrBayes is a program for Bayesian phylogenetics: 
 http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page 
ABySS is a program for assembly of DNA sequence 
data: http://www.bcgsc.ca/platform/bioinfo/software/abyss




Thanks for the links!
I had found the MrBayes link.
I eventually found what your ABySS was about, but 

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-08 Thread Matthew MacManes
Hi Gus, 

Thanks for your ideas.. I have a few questions, and will try to answer yours in 
hopes of solving this!!

Should I worry about setting things like --num-cores --bind-to-cores?  This, I 
think, gets at your questions about processor affinity.. Am I right? I could 
not exactly figure out the -mca mpi_paffinity_alone stuff...

1. Additional load: nope. nothing else, most of the time not even firefox. 
2. RAM: no problems apparent when monitoring through TOP. Interesting, I did 
wonder about oversubscription, so I tried the option --nooversubscription, but 
this gave me an error message.
3. I have not tried other MPI flavors.. I've been speaking to the authors of the 
programs, and they are both using openMPI.  
4. I don't think that this is a problem, as I'm specifying 
--with-mpi=/usr/bin/...  when I compile the programs. Is there any other way to 
be sure that this is not a problem?
5. I had not been, and you could see some shuffling when monitoring the load on 
specific processors. I have tried to use --bind-to-cores to deal with this. I 
don't understand how to use the -mca options you asked about. 
6. I am using Ubuntu 9.10. gcc 4.4.1 and g++  4.4.1


MrBayes is a program for Bayesian phylogenetics:  
http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page 
ABySS is a program for assembly of DNA sequence data: 
http://www.bcgsc.ca/platform/bioinfo/software/abyss

> Do the programs mix MPI (message passing) with OpenMP (threads)? 
> 
I'm honestly not sure what this means..

Thanks for all your help!

Matt

>  Hi Matthew 
> More guesses/questions than anything else: 
> 1) Is there any additional load on this machine? 
> We had problems like that (on different machines) when 
> users start listening to streaming video, doing Matlab calculations, 
> etc, while the MPI programs are running. 
> This tends to oversubscribe the cores, and may lead to crashes. 
> 2) RAM: 
> Can you monitor the RAM usage through "top"? 
> (I presume you are on Linux.) 
> It may show unexpected memory leaks, if they exist. 
> On "top", type "1" (one) see all cores, type "f" then "j" 
> to see the core number associated to each process. 
> 3) Do the programs work right with other MPI flavors (e.g. MPICH2)? 
> If not, then it is not OpenMPI's fault. 
> 4) Any possibility that the MPI versions/flavors of mpicc and 
> mpirun that you are using to compile and launch the program are not the 
> same? 
> 5) Are you setting processor affinity on mpiexec? 
> mpiexec -mca mpi_paffinity_alone 1 -np ... bla, bla ... 
> Context switching across the cores may also cause trouble, I suppose. 
> 6) Which Linux are you using (uname -a)? 
> On other mailing lists I read reports that only quite recent kernels 
> support all the Intel Nehalem processor features well. 
> I don't have Nehalem, I can't help here, 
> but the information may be useful 
> for other list subscribers to help you. 
> *** 
> As for the programs, some programs require specific setup 
> (and even specific compilation) when the number of MPI processes 
> varies. 
> It may help if you tell us a link to the program sites. 
> Bayesian statistics is not totally out of our business, 
> but phylogenetic trees are not really my league, 
> so please forgive me any bad guesses, 
> but would it need specific compilation or a different 
> set of input parameters to run correctly on a different 
> number of processors? 
> Do the programs mix MPI (message passing) with OpenMP (threads)? 
> I found this MrBayes, which seems to do the above: 
> http://mrbayes.csit.fsu.edu/ 
> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page 
> As for the ABySS, what is it, where can it be found? 
> Doesn't look like a deep ocean circulation model, as the name suggests. 
> My $0.02 
> Gus Correa