Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
That's what we needed to know - i.e., that setting num_sockets=1 generates an error instead of segfaulting down the road. I can submit a CMR to do so. thx! On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote: > On 02/22/12 14:54, Ralph Castain wrote: >> That doesn't really address the issue, though.

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 02/22/12 14:54, Ralph Castain wrote: That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't present.

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 20:24, Eugene Loh wrote: > On 2/22/2012 11:08 AM, Ralph Castain wrote: >> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: >>> On 22/02/2012 17:48, Ralph Castain wrote: On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: > On 2/21/2012 10:31 PM, Eugene Loh wrote: >> ...

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote: > On 2/22/2012 11:08 AM, Ralph Castain wrote: >> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: >>> On 22/02/2012 17:48, Ralph Castain wrote: On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: > On 2/21/2012 10:31 PM, Eugene Loh wrote: >>>

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/22/2012 11:08 AM, Ralph Castain wrote: On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: On 22/02/2012 17:48, Ralph Castain wrote: On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and O

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: > On 22/02/2012 17:48, Ralph Castain wrote: >> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: >> >>> On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 17:48, Ralph Castain wrote: > On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: > >> On 2/21/2012 10:31 PM, Eugene Loh wrote: >>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI >>> pukes on divide by zero. OS info was listed in the original message >>> (belo

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: > On 2/21/2012 10:31 PM, Eugene Loh wrote: >> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes >> on divide by zero. OS info was listed in the original message (below). >> Might we want to do something else? E.g., ass

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
Much simpler solution - on that platform, you should add "orte_num_sockets=1" to your default mca param file. Problem solved. It's why that param exists, and we added it specifically at Terry's request for an earlier, similar problem. On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote: > Le 22/02

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I
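For readers skimming the thread, the fallback suggested above (treat num_sockets==0 as num_sockets==1) might look roughly like the sketch below. This is only an illustration, not the actual Open MPI code: the helper names effective_num_sockets and cores_per_socket are hypothetical, and the real fix discussed in the thread may well differ.

```c
/* Hypothetical helper, not taken from Open MPI: if the topology layer
 * reports no socket level (a count of 0), fall back to one socket so
 * that later arithmetic never divides by zero. */
static int effective_num_sockets(int reported_num_sockets)
{
    return (reported_num_sockets > 0) ? reported_num_sockets : 1;
}

/* Hypothetical helper: number of cores per socket, rounded up.
 * Uses the guarded socket count, so a reported count of 0 is safe. */
static int cores_per_socket(int num_cores, int reported_num_sockets)
{
    int sockets = effective_num_sockets(reported_num_sockets);
    return (num_cores + sockets - 1) / sockets;
}
```

With a guard like this, a machine that reports 0 sockets is handled as if it had a single socket containing all cores, rather than crashing on the division.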

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 07:36, Eugene Loh wrote: > On 2/21/2012 5:40 PM, Paul H. Hargrove wrote: >> Here are the first of the results of the testing I promised. >> I am not 100% sure how to reach the code that Eugene reported as >> problematic, > I don't think you're going to see it. Somehow, hwloc on th

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/21/2012 5:40 PM, Paul H. Hargrove wrote: Here are the first of the results of the testing I promised. I am not 100% sure how to reach the code that Eugene reported as problematic, I don't think you're going to see it. Somehow, hwloc on the config in question thinks there is no socket leve

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 02/21/12 19:29, Jeffrey Squyres wrote: What's the output of running lstopo from hwloc 1.3.2? (this is the version that's in the OMPI trunk and v1.5 branches) http://www.open-mpi.org/software/hwloc/v1.3/ Is there any difference from v1.4 hwloc? http://www.open-mpi.org/software/hw

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove
My build with the "2011_sp1.8.273" Intel compilers passes the same tests as I detailed below for "2011_sp1.7.256". I don't suspect any longer that the compiler is at fault, but am willing to try additional/alternate tests to help confirm. -Paul On 2/21/2012 5:40 PM, Paul H. Hargrove wrote: He

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove
Here are the first of the results of the testing I promised. I am not 100% sure how to reach the code that Eugene reported as problematic, so I tried just running the ring test with various -bind-to-* options. I am quite willing to run additional test cases. All runs are w/ OMPI_MCA_btl=sm,s

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove
I have been testing v1.5 with slightly older Intel "composerxe-2011.5.220" compilers. I see a "make check" failure in opal_datatype_test which is not present with any other compiler (such as gcc on the same node). This has been seen most recently on the 1.5.5rc2r25990 tarball generated earlier t

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Jeffrey Squyres
What's the output of running lstopo from hwloc 1.3.2? (this is the version that's in the OMPI trunk and v1.5 branches) http://www.open-mpi.org/software/hwloc/v1.3/ Is there any difference from v1.4 hwloc? http://www.open-mpi.org/software/hwloc/v1.4/ On Feb 21, 2012, at 7:20 PM, Eugen

[OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Eugene Loh
We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux and I'm encountering the problem with Intel (composer_xe_201