Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes.

2009-01-12 Thread Jeff Squyres
On Jan 12, 2009, at 2:50 PM, kmur...@lbl.gov wrote: Is there any requirement on the size of the data buffers I should use in these warmup broadcasts? If I use small buffers like 1000 real values during warmup, the following actual and timed MPI_BCAST over IB is taking a lot of time (more

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Jeff Squyres
Cross your fingers; we might release tomorrow (I've probably now jinxed it by saying that!). On Jan 12, 2009, at 1:54 PM, Justin wrote: In order for me to test this out I need to wait for TACC to install this version on Ranger. Right now they have version 1.3a1r19685 installed. I'm

Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes.

2009-01-12 Thread kmuriki
Hi Jeff, Thanks for your response. Is there any requirement on the size of the data buffers I should use in these warmup broadcasts? If I use small buffers like 1000 real values during warmup, the following actual and timed MPI_BCAST over IB is taking a lot of time (more than that on GigE).

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin
In order for me to test this out I need to wait for TACC to install this version on Ranger. Right now they have version 1.3a1r19685 installed. I'm guessing this is probably an older version. I'm not sure when TACC will get around to updating their Open MPI version. I could request them to

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Jeff Squyres
Justin -- Could you actually give your code a whirl with 1.3rc3 to ensure that it fixes the problem for you? http://www.open-mpi.org/software/ompi/v1.3/ On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote: Hi Justin, I applied the fixes for this particular deadlock to the 1.3 code base

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Tim Mattox
Hi Justin, I applied the fixes for this particular deadlock to the 1.3 code base late last week, see ticket #1725: https://svn.open-mpi.org/trac/ompi/ticket/1725 This should fix the described problem, but I personally have not tested to see if the deadlock in question is now gone. Everyone

Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes.

2009-01-12 Thread Jeff Squyres
You might want to do some "warmup" bcasts before doing your timing measurements. Open MPI makes network connections lazily, meaning that we only make connections upon the first send (e.g., the sends underneath the MPI_BCAST). So the first MPI_BCAST is likely to be quite slow, while all
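
A minimal sketch of the warmup-then-measure pattern described above (this is an illustration, not code from the original posts; the buffer size of 100,000 doubles, the warmup count, and the use of MPI_COMM_WORLD are assumptions chosen for the example):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT  100000   /* 100,000 doubles = 800 KB; use the size you intend to time */
#define WARMUP 5        /* a few untimed rounds to trigger lazy connection setup */

int main(int argc, char **argv)
{
    int rank, i;
    double *buf, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(COUNT * sizeof(double));
    if (rank == 0)
        for (i = 0; i < COUNT; i++)
            buf[i] = (double)i;

    /* Untimed warmup: the first sends underneath MPI_Bcast open the
       network connections, so their cost stays out of the measurement. */
    for (i = 0; i < WARMUP; i++)
        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("timed MPI_Bcast of %zu bytes: %f s\n",
               (size_t)COUNT * sizeof(double), t1 - t0);

    free(buf);
    MPI_Finalize();
    return 0;
}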

Re: [OMPI users] Problem with openmpi and infiniband

2009-01-12 Thread Jeff Squyres
On Jan 7, 2009, at 6:28 PM, Biagio Lucini wrote: [[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0 Ah! If we're dealing with an RNR retry

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin
Hi, has this deadlock been fixed in the 1.3 source yet? Thanks, Justin Jeff Squyres wrote: On Dec 11, 2008, at 5:30 PM, Justin wrote: The more I look at this bug the more I'm convinced it is in Open MPI and not our code. Here is why: Our code generates a communication/execution

Re: [OMPI users] problem with xfmpi_sane

2009-01-12 Thread Jeff Squyres
On Jan 11, 2009, at 3:57 AM, Hana Milani wrote: make: *** No rule to make target `home/hana/openmpi-1.2.8/include/mpif.h', needed by `mpif.h'. Stop. Are you missing a leading "/" somewhere in your Bmake.inc? It lists "home/hana/" in your error message, not "/home/hana/" -- Jeff
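
Purely for illustration, assuming a BLACS-style Bmake.inc (the variable names below are an assumption, not taken from the original post), the fix would amount to restoring the leading slash in the MPI install path:

# hypothetical Bmake.inc excerpt -- adjust the variable names to whatever
# your Bmake.inc actually uses; the only point is the leading "/"
MPIdir    = /home/hana/openmpi-1.2.8      # was: home/hana/openmpi-1.2.8
MPIINCdir = $(MPIdir)/include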

Re: [OMPI users] Error message when using MPI_Type_struct()

2009-01-12 Thread Thomas Ropars
Hi Aurelien, Thank you for your answer. Aurélien Bouteiller wrote: Hi Thomas, The message you get comes from the convertor. The convertor is in charge of packing/unpacking the data. Since you yourself add an extra int to the wire data, the convertor gets confused on the receiver side, as it
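
A generic sketch of building a struct datatype so that sender and receiver hand the convertor the exact same description of the wire data; the struct layout here is an assumption for illustration, not the original poster's code:

#include <mpi.h>

struct particle {
    int    id;
    double pos[3];
};

/* Any extra int must be part of the datatype on BOTH sides, never
   appended to the wire data outside the description. */
static MPI_Datatype make_particle_type(void)
{
    struct particle p;
    MPI_Datatype    newtype;
    int             blocklens[2] = { 1, 3 };
    MPI_Datatype    types[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Aint        base, disps[2];

    /* Displacements relative to the start of the struct */
    MPI_Address(&p,        &base);
    MPI_Address(&p.id,     &disps[0]);
    MPI_Address(&p.pos[0], &disps[1]);
    disps[0] -= base;
    disps[1] -= base;

    MPI_Type_struct(2, blocklens, disps, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Datatype ptype = make_particle_type();
    /* ...sender and receiver would both use ptype in matching calls... */
    MPI_Type_free(&ptype);
    MPI_Finalize();
    return 0;
}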