Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-12 Thread Brad Benton
I think that this is relatively contained and has not been seen out of MTT under normal operating conditions. Also, as Jeff has argued, it doesn't appear to be a regression against 1.3. George & I talked about this and we are in agreement that we should go ahead and release 1.3.1 as it currently

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-12 Thread Jeff Squyres
On Mar 11, 2009, at 12:19 PM, Eugene Loh wrote: I don't understand what's going on, but I guess each process is calling sm_btl_first_time_init(), during which it initializes its own shm_bases value, FIFOs, and shm_fifo pointer. If a remote process saw those memory operations in that order,
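For readers following the ordering question above: the quoted message is cut off, so what follows is not the sm BTL code and not a diagnosis of the failure. It is only a minimal C11 sketch of the general pattern for publishing a pointer to a peer after the data behind it has been initialized. The names peer_slot_t, slot_reset, slot_publish, and slot_lookup are hypothetical, and a real cross-process segment would likely publish an offset from the segment base rather than a raw pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FIFO_SLOTS 4

/* Hypothetical per-process slot; NOT Open MPI's actual shared-memory layout. */
typedef struct {
    int fifo[FIFO_SLOTS];     /* stand-in for the FIFO storage */
    _Atomic(int *) fifo_ptr;  /* stand-in for the published pointer; NULL until ready */
} peer_slot_t;

/* Mark the slot as "not yet published". */
static void slot_reset(peer_slot_t *slot)
{
    assert(slot != NULL);
    atomic_init(&slot->fifo_ptr, NULL);
}

/* Owner side: fill in the FIFO first, then publish the pointer.
 * The release store keeps the FIFO writes from being reordered after it. */
static void slot_publish(peer_slot_t *slot)
{
    assert(slot != NULL);
    for (size_t i = 0; i < FIFO_SLOTS; i++) {
        slot->fifo[i] = -1;   /* -1 marks an empty entry in this sketch */
    }
    atomic_store_explicit(&slot->fifo_ptr, slot->fifo, memory_order_release);
}

/* Peer side: returns NULL until the owner has published.
 * The acquire load pairs with the release store above, so a non-NULL
 * result implies the FIFO initialization is visible to the caller. */
static int *slot_lookup(peer_slot_t *slot)
{
    assert(slot != NULL);
    return atomic_load_explicit(&slot->fifo_ptr, memory_order_acquire);
}
```

A peer would poll slot_lookup() until it returns non-NULL before touching the FIFO. Whether the 1.3.1 startup path needs anything like this is exactly the open question in the thread.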

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
On Mar 11, 2009, at 2:18 PM, Eugene Loh wrote: > Can this error happen on any test? Presumably yes if two or more processes are on the same node. Yes, because these failures were occurring during MPI_INIT (i.e., 'zactly what Eugene said...). -- Jeff Squyres Cisco Systems

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ethan Mallove wrote: Can this error happen on any test? Presumably yes if two or more processes are on the same node. What do these tests have in common? They all try to start. :^) The problem is in MPI_Init. It almost looks like the problem is more likely to occur if MPI_UB or MPI_L

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ethan Mallove
On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote: > As Terry stated, I think this bugger is quite rare. I'm having a helluva > time trying to reproduce it manually (over 5k runs this morning and still > no segv). Ugh. 5k of which test(s)? Can this error happen on any test? I am wondering if

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
If it is that hard to replicate outside of MTT, then by all means let's just release it - users will probably never see it. On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote: Ralph Castain wrote: You know, this isn't the first time we have encountered errors that -only- appear when running

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ralph Castain wrote: Could be nobody is saying anything...but I would be surprised if -nobody- barked at a segfault during startup. Well, if it segfaulted during startup, someone's first reaction would probably be, "Oh really?" They would try again, have success, attribute to cosmic rays,

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Ralph Castain wrote: You know, this isn't the first time we have encountered errors that -only- appear when running under MTT. As per my other note, we are not seeing these failures here, even though almost all our users run under batch/scripts. This has been the case with at least some of th

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
You know, this isn't the first time we have encountered errors that -only- appear when running under MTT. As per my other note, we are not seeing these failures here, even though almost all our users run under batch/scripts. This has been the case with at least some of these other MTT-only

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
FWIW, we have people running dozens of jobs every day with 1.3.0 built with Intel 10.0.23 and PGI 7.2-5 compilers, using -mca btl sm,openib,self...and have not received a single report of this failure. This is all on Linux machines (various kernels), under both slurm and torque environments

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
As Terry stated, I think this bugger is quite rare. I'm having a helluva time trying to reproduce it manually (over 5k runs this morning and still no segv). Ugh. Looking through the sm startup code, I can't see exactly what the problem would be. :-( On Mar 11, 2009, at 11:34 AM, Ralph

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
I'll run some tests with 1.3.1 on one of our systems and see if it shows up there. If it is truly rare and was in 1.3.0, then I personally don't have a problem with it. Got bigger problems with hanging collectives, frankly - and we don't know how the sm changes will affect this problem, if

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Jeff Squyres wrote: So -- Brad/George -- this technically isn't a regression against v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing it there, but if it's so elusive... I haven't been MTT testing the 1.2 series in a long time). But it is a nonzero problem. I have not se

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
So -- Brad/George -- this technically isn't a regression against v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing it there, but if it's so elusive... I haven't been MTT testing the 1.2 series in a long time). But it is a nonzero problem. Should we release 1.3.1 without

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
Could be true; it unfortunately doesn't help us for 1.3.1, though. :-( Maybe I'll add a big memset of 0 across the sm segment at the beginning of time and see if this problem goes away. On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote: I actually wasn't implying that Eugene's changes -ca

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
I actually wasn't implying that Eugene's changes -caused- the problem, but rather that I thought they might have -fixed- the problem. :-) On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote: I forgot to mention that since I ran into this issue so long ago I really doubt that Eugene's SM change

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
I forgot to mention that since I ran into this issue so long ago I really doubt that Eugene's SM changes have caused this issue. --td Terry Dontje wrote: Hey!!! I ran into this problem many months ago but it's been so elusive that I haven't nailed it down. First time we saw this was last O

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Hey!!! I ran into this problem many months ago but it's been so elusive that I haven't nailed it down. First time we saw this was last October. I did some MTT gleaning and could not find anyone but Solaris having this issue under MTT. What's interesting is I gleaned Sun's MTT results and

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ralph Castain wrote: Hey Jeff I seem to recall seeing the identical problem reported on the user list not long ago...or may have been the devel list. Anyway, it was during btl_sm_add_procs, and the code was segv'ing. I don't have the archives handy here, but perhaps you might search the

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-10 Thread Ralph Castain
Hey Jeff I seem to recall seeing the identical problem reported on the user list not long ago...or may have been the devel list. Anyway, it was during btl_sm_add_procs, and the code was segv'ing. I don't have the archives handy here, but perhaps you might search them and see if there is a

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-10 Thread Jeff Squyres
On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote: Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-( I can't reproduce them manually, but they seem to only happen in a very small fraction of overall MTT runs. I'm seeing at least 3 classes of errors: 1. btl_sm_a

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-10 Thread Jeff Squyres
On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote: The one thing that these failures have in common is that they all appear to be compiled by icc. Here's the configure line: Check that; I've found at least one case of the pgi compiler resulting in the same kind of btl sm error.

[OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-10 Thread Jeff Squyres
Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-( I can't reproduce them manually, but they seem to only happen in a very small fraction of overall MTT runs. I'm seeing at least 3 classes of errors: 1. btl_sm_add_procs.c:529 which is this: if(mca_btl_sm_compo