Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS
Jeff, only the processes of the program where process 0 succeeded in publishing the name have srv=1 and then call MPI_Comm_accept. The processes of the program where process 0 failed to publish the name have srv=0 and then call MPI_Comm_connect. That is how it worked with Open MPI 1.4.1. Is it
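
Bernard's full program was not posted, but for readers following the thread, here is a minimal sketch of the pattern he describes, assuming an illustrative service name ("my_service") and that rank 0 decides the role by whether MPI_Publish_name succeeds; it is not his actual code.

    /* Sketch of the publish/accept vs. lookup/connect pattern (not Bernard's code). */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, srv = 0;
        char port[MPI_MAX_PORT_NAME] = "";
        MPI_Comm inter;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Make a failed publish return an error instead of aborting. */
            MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
            MPI_Open_port(MPI_INFO_NULL, port);
            if (MPI_Publish_name("my_service", MPI_INFO_NULL, port) == MPI_SUCCESS) {
                srv = 1;                        /* this program is the server */
            } else {
                MPI_Close_port(port);
                MPI_Lookup_name("my_service", MPI_INFO_NULL, port);
            }
        }

        /* All ranks of one program must take the same collective branch. */
        MPI_Bcast(&srv, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (srv)
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        else
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

        /* ... exchange data over the intercommunicator ... */

        MPI_Comm_disconnect(&inter);
        if (rank == 0 && srv) {
            MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
            MPI_Close_port(port);
        }
        MPI_Finalize();
        return 0;
    }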

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS
Jeff, the deadlock is not in MPI_Comm_accept and MPI_Comm_connect, but earlier, in MPI_Publish_name and MPI_Lookup_name. So the broadcast of srv is not involved in the deadlock. Best, Bernard. Bernard Secher - SFME/LGLS wrote: Jeff, Only the processes of the program where process 0

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS
I get the same deadlock with the Open MPI tests pubsub, accept and connect with version 1.5.1. Bernard Secher - SFME/LGLS wrote: Jeff, The deadlock is not in MPI_Comm_accept and MPI_Comm_connect, but before in MPI_Publish_name and MPI_Lookup_name. So the broadcast of srv is not involved in

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS
The accept and connect tests are OK with Open MPI 1.4.1. I think there is a bug in version 1.5.1. Best, Bernard. Bernard Secher - SFME/LGLS wrote: I get the same deadlock with the Open MPI tests pubsub, accept and connect with version 1.5.1. Bernard Secher - SFME/LGLS wrote: Jeff,

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread John Hearns
On 6 January 2011 21:10, Gilbert Grosdidier wrote: > Hi Jeff, where is the lstopo command located on SuSE Linux, please? And/or hwloc-bind, which seems related to it? I was able to get hwloc to install quite easily on SuSE - download/configure/make. Configure it to

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
> lstopo
> Machine (35GB)
>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#8)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#1)
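
As a side note for anyone reproducing this discussion: a tiny program against the hwloc 1.x C API (hypothetical, not from the thread) can confirm whether hyperthreading is enabled by comparing core and PU counts, which is what the lstopo output above shows (2 PUs per core).

    /* count_pus.c -- hypothetical example using the hwloc 1.x C API to count
     * cores and logical PUs; 2 PUs per core means hyperthreading is enabled,
     * matching the lstopo output above.  Build: cc count_pus.c -lhwloc
     */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        int cores, pus;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        pus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        printf("%d cores, %d PUs (%d hardware thread(s) per core)\n",
               cores, pus, cores > 0 ? pus / cores : 0);

        hwloc_topology_destroy(topo);
        return 0;
    }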

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Hi Jeff, thanks for taking care of this. Here is what I got on a worker node: > mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get 0x0001 Is this what is expected, please? Or should I try yet another command? Thanks, Regards,

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 5:27 AM, John Hearns wrote: > Actually, the topic of hyperthreading is interesting, and we should discuss it, please. Hyperthreading is supposedly implemented better and 'properly' on Nehalem - I would be interested to see some genuine performance measurements with

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Can you run with np=8? On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote: > Hi Jeff, thanks for taking care of this. Here is what I got on a worker node: mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get 0x0001 Is

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Yes, here it is: > mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get 0x0001 0x0002 0x0004 0x0008 0x0010 0x0020 0x0040 0x0080 Gilbert. On 7 Jan 2011, at 15:50, Jeff Squyres wrote: Can you run with np=8? On
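
For anyone wanting to cross-check such masks without hwloc, here is a hypothetical helper (not from the thread) in which each rank prints its own Linux affinity mask via sched_getaffinity; with mpi_paffinity_alone=1 and -np 8 it should print the same 0x0001 through 0x0080 values shown above.

    /* show_binding.c -- hypothetical helper: each rank prints its CPU affinity
     * mask (first 64 logical CPUs), roughly what hwloc-bind --get reports.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, cpu;
        cpu_set_t set;
        unsigned long mask = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        CPU_ZERO(&set);
        if (sched_getaffinity(0, sizeof(set), &set) == 0) {
            for (cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++)
                if (CPU_ISSET(cpu, &set))
                    mask |= 1UL << cpu;
        }
        printf("rank %d: affinity mask 0x%04lx\n", rank, mask);

        MPI_Finalize();
        return 0;
    }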

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Shamis, Pavel
The FW version looks OK, but it may be a driver issue as well. I guess that an OFED 1.4.x or 1.5.x driver should be fine. To check the driver version, you may run the ofed_info command. Regards, Pavel (Pasha) Shamis --- Application Performance Tools Group Computer Science and Math Division Oak Ridge

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
You're calling bcast with root=0, so whatever value rank 0 has for srv is what everyone will have after the bcast. Plus, I didn't see in your code where *srv was ever set to 0. In my runs, rank 0 is usually the one that publishes first. Everyone then gets the lookup properly, and then the bcast

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Tim Prince
On 1/7/2011 6:49 AM, Jeff Squyres wrote: My understanding is that hyperthreading can only be activated/deactivated at boot time -- once the core resources are allocated to hyperthreads, they can't be changed while running. Whether disabling the hyperthreads or simply telling Linux not to

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Jeff Squyres
+1 AFAIR (and I stopped being an IB vendor a long time ago, so I might be wrong), the _resize_cq function being there or not is not an issue of the underlying HCA; it's a function of what version of OFED you're running. On Jan 7, 2011, at 10:14 AM, Shamis, Pavel wrote: > The FW version looks

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Well, bummer -- there goes my theory. According to the hwloc info you posted earlier, this shows that OMPI is binding to the 1st hyperthread on each core; *not* to both hyperthreads on a single core. :-\ It would still be slightly interesting to see if there's any difference when you run

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
I'll give hyperthreading a try with our app very soon and keep you posted about the improvements, if any. Our current cluster is made of 4-core dual-socket Nehalem nodes. Cheers, Gilbert. On 7 Jan 2011, at 16:17, Tim Prince wrote: On 1/7/2011 6:49 AM, Jeff Squyres wrote:

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Gilbert Grosdidier
Hello Pavel, here is the output of the ofed_info command: == OFED-1.4.1 libibverbs: git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4 commit b00dc7d2f79e0660ac40160607c9c4937a895433 libmthca:

Re: [OMPI users] srun and openmpi

2011-01-07 Thread Michael Di Domenico
I'm still testing the Slurm integration, which seems to work fine so far. However, I just upgraded another cluster to Open MPI 1.5 and Slurm 2.1.15, but this machine has no InfiniBand. If I salloc the nodes and mpirun the command, it seems to run and complete fine; however, if I srun the command I get

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 10:41 AM, Bernard Secher - SFME/LGLS wrote: > srv = 0 is set in my main program. I call Bcast because all the processes must call MPI_Comm_accept (collective) or must call MPI_Comm_connect (collective). Ah -- I see. I thought this was a test program where some processes

Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 11:16 AM, Jeff Squyres wrote: > Ok, I can replicate the hang in publish now. I'll file a bug report. Filed here: https://svn.open-mpi.org/trac/ompi/ticket/2681 Thanks for your persistence! -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to:

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Unfortunately, I was unable to spot any striking difference in performance when using --bind-to-core. Sorry. Any other suggestion? Regards, Gilbert. On 7 Jan 2011, at 16:32, Jeff Squyres wrote: Well, bummer -- there goes my theory. According to the hwloc info you posted earlier, this

Re: [OMPI users] mpirun --nice 10 prog ??

2011-01-07 Thread David Mathog
Ralph Castain wrote: > Afraid not - though you could alias your program name to be "nice --10 prog". Is there an OMPI wish list? If so, can we please add to it "a method to tell mpirun what nice values to use when it starts programs on nodes"? Minimally, something like this: --nice 12
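
Until a flag like that exists, one way to apply Ralph's workaround without shell aliases is a tiny wrapper that mpirun launches in place of the real program. The name nice_wrap and the calling convention below are illustrative assumptions, not an existing Open MPI feature.

    /* nice_wrap.c -- hypothetical wrapper illustrating Ralph's suggestion:
     * raise the niceness of the launched process, then exec the real program.
     * Build: cc -o nice_wrap nice_wrap.c
     * Use:   mpirun -np 4 ./nice_wrap 10 ./prog arg1 arg2
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <nice-increment> <program> [args...]\n", argv[0]);
            return 1;
        }

        errno = 0;
        if (nice(atoi(argv[1])) == -1 && errno != 0) {
            perror("nice");
            return 1;
        }

        /* Replace the wrapper with the real program; MPI only ever sees the child. */
        execvp(argv[2], &argv[2]);
        perror("execvp");   /* only reached if exec failed */
        return 1;
    }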

Re: [OMPI users] mpirun --nice 10 prog ??

2011-01-07 Thread Eugene Loh
David Mathog wrote: Ralph Castain wrote: > Afraid not - though you could alias your program name to be "nice --10 prog". Is there an OMPI wish list? If so, can we please add to it "a method to tell mpirun what nice values to use when it starts programs on nodes"?

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Eugene Loh
Gilbert Grosdidier wrote: Any other suggestion? Can any more information be extracted from profiling? Here is where I think things left off: Eugene Loh wrote: Gilbert Grosdidier wrote: # [time] [calls] <%mpi> <%wall> # MPI_Waitall