Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-09-01 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote: > It would simplify testing if you could get all the eth0's to be of one type > and on the same subnet, and the same for eth1. > > Once you do that, try using just one of the networks by telling OMPI to use > only one of

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-26 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote: > Once you do that, try using just one of the networks by telling OMPI to use > only one of the devices, something like this: > >    mpirun --mca btl_tcp_if_include eth0 ... Thanks Jeff! Just tried the exact test that you
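For anyone scripting the single-network test Jeff suggests, here is a minimal sketch that assembles the command line. Only the `--mca btl_tcp_if_include eth0` part comes from the thread. The process count, the `./IMB-MPI1` binary path and the `bcast` benchmark argument are assumptions for illustration, so adjust them to your own setup.

```python
import shlex


def build_mpirun_command(interface, num_procs, program, program_args=()):
    """Return an mpirun argv list that restricts the TCP BTL to one interface."""
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", interface,  # option quoted in the thread
        "-np", str(num_procs),                     # assumed process count
        program,                                   # assumed benchmark binary
        *program_args,
    ]


# Hypothetical invocation: 256 ranks, IMB binary in the current directory.
cmd = build_mpirun_command("eth0", 256, "./IMB-MPI1", ["bcast"])
print(shlex.join(cmd))
```

Building the command as a list keeps it ready for `subprocess.run(cmd)`, and makes it easy to repeat the run with the other interface name to compare the two networks.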

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote: > You could sort that out with udev rules on each machine. Sure. I'd always wanted consistent names for the eth interfaces when I set up the cluster but I couldn't get udev to co-operate. Maybe this time! Let me try. >

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Jeff Squyres
On Aug 24, 2010, at 6:26 PM, Rahul Nabar wrote: >> Are all the eth0's on one subnet and all the eth2's on a different subnet? >> >> Or are all eth0's and eth2's all on the same subnet? > > Thanks Jeff! Different subnets. All 10GigE's are on 192.168.x.x and > all 1GigE's are on 10.0.x.x It
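Since the interface order varies from node to node, one way to find the 10GigE device on each node is to classify the interface by its address rather than by its name. The sketch below assumes /16 netmasks for the two ranges mentioned above (the thread only says 192.168.x.x and 10.0.x.x), and the per-node address tables are made up for illustration.

```python
import ipaddress

# Assumed netmasks; the thread gives only 192.168.x.x and 10.0.x.x.
FABRICS = {
    "10GigE": ipaddress.ip_network("192.168.0.0/16"),
    "1GigE": ipaddress.ip_network("10.0.0.0/16"),
}


def fabric_of(address):
    """Return the fabric name whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in FABRICS.items():
        if ip in network:
            return name
    return None


def pick_interface(interfaces, wanted="10GigE"):
    """Return the first interface name (sorted) whose address is on the wanted fabric."""
    for ifname, address in sorted(interfaces.items()):
        if fabric_of(address) == wanted:
            return ifname
    return None


# Hypothetical nodes with the eth devices in different orders.
node_a = {"eth0": "10.0.1.5", "eth2": "192.168.1.5"}
node_b = {"eth0": "192.168.1.6", "eth1": "10.0.1.6"}
print(pick_interface(node_a), pick_interface(node_b))
```

A helper like this only reports which device sits on which subnet. Making the names consistent across nodes is still a job for udev rules or similar.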

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar wrote: > -- > gather: >    NP256    hangs >    NP128    hangs >    NP64    hangs >    NP32    OK > > Note: "gather" always hangs at the following line of the test: >    

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread John Hearns
On 24 August 2010 18:58, Rahul Nabar wrote: > There are a few unusual things about the cluster. We are using a > 10GigE ethernet fabric. Each node has dual eth adapters. One 1GigE and > the other 10GigE. These are on separate subnets although the order of > the eth interfaces

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Randolph Pullen
Rahul Nabar <rpna...@gmail.com> wrote: From: Rahul Nabar <rpna...@gmail.com> Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas? To: "Open MPI Users" <us...@open-mpi.org> Received: Wednesday, 25 August, 2010, 3:38 AM On Mon, Aug 2

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Jeff Squyres
On Aug 24, 2010, at 1:58 PM, Rahul Nabar wrote: > There are a few unusual things about the cluster. We are using a > 10GigE ethernet fabric. Each node has dual eth adapters. One 1GigE and > the other 10GigE. These are on separate subnets although the order of > the eth interfaces is variable.

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote: > Bugs are always a possibility but unless there is something very unusual > about the cluster and interconnect or this is an unstable version of MPI, it > seems very unlikely this use of MPI_Bcast with so few tasks and

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote: > > I have had a similar load related problem with Bcast. Thanks Randolph! That's interesting to know! What was the hardware you were using? Does your bcast fail at the exact same point too? > > I don't know

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote: > It is hard to imagine how a total data load of 41,943,040 bytes could be a > problem. That is really not much data. By the time the BCAST is done, each > task (except root) will have received a single half meg message
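As a quick sanity check on the figure quoted above, the arithmetic can be sketched as follows. It assumes "half meg" means 512 KiB (524,288 bytes), and the transfer times are idealized line-rate numbers that ignore protocol overhead and contention.

```python
TOTAL_BYTES = 41_943_040     # total data load quoted in the thread
MESSAGE_BYTES = 512 * 1024   # "half meg", assumed to mean 524,288 bytes

# How many half-meg messages make up the total load?
messages, remainder = divmod(TOTAL_BYTES, MESSAGE_BYTES)
print(messages, remainder)   # 80 0


def ideal_seconds(num_bytes, bits_per_second):
    """Time to push num_bytes through a single link at line rate, no overhead."""
    return num_bytes * 8 / bits_per_second


one_gige = ideal_seconds(TOTAL_BYTES, 1e9)    # 1GigE link
ten_gige = ideal_seconds(TOTAL_BYTES, 1e10)   # 10GigE link
print(round(one_gige, 3), round(ten_gige, 3))
```

Even if all of that data went through one 1GigE link it would take about a third of a second, which supports the point that the volume alone should not cause a stall.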

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-23 Thread Richard Treumann
Network saturation could produce arbitrarily long delays, but the total data load we are talking about is really small. It is the responsibility of an MPI library to do one of the following: 1) Use a reliable message protocol for each message (e.g. Infiniband RC or TCP/IP) 2) detect lost packets and

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-23 Thread Randolph Pullen
Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas? To: "Open MPI Users" <us...@open-mpi.org> Received: Tuesday, 24 August, 2010, 9:39 AM It is hard to imagine how a total data load of 41,943,040 bytes could be a problem.

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-23 Thread Richard Treumann
& Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 users-boun...@open-mpi.org wrote on 08/23/2010 05:09:56 PM: > Re: [OMPI users] IMB-MPI broadcast test stalls for large core > cou

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-22 Thread Randolph Pullen
wrote: From: Rahul Nabar <rpna...@gmail.com> Subject: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas? To: "Open MPI Users" <us...@open-mpi.org> Received: Friday, 20 August, 2010, 12:03 PM My Intel IMB-MPI tests stall, but only in very specif

[OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-19 Thread Rahul Nabar
My Intel IMB-MPI tests stall, but only in very specific cases: larger packet sizes + large core counts. This only happens for the bcast, gather and exchange tests, and only for the larger core counts (~256 cores). Other tests like pingpong and sendrecv run fine even with larger core counts. e.g. This bcast