I think that this is relatively contained and has not been seen outside of MTT
under normal operating conditions. Also, as Jeff has argued, it doesn't
appear to be a regression against 1.3. George & I talked about this and we
are in agreement that we should go ahead and release 1.3.1 as it currently
On Mar 11, 2009, at 12:19 PM, Eugene Loh wrote:
I don't understand what's going on, but I guess each process is calling
sm_btl_first_time_init(), during which it initializes its own shm_bases
value, FIFOs, and shm_fifo pointer. If a remote process saw those
memory operations in that order,
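[Editor's sketch, not part of the original message.] The snippet above is cut off, so the following is only an assumption about the concern: that a peer might observe a published pointer before the data it points to has been initialized. A minimal C11 illustration of publishing a pointer with release/acquire ordering is below. The names peer_slot_t, publish_fifo and try_read_fifo are hypothetical and are not Open MPI code; the sketch also assumes a single address space, whereas the real sm BTL shares memory across processes.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one process's slot in the shared segment. */
typedef struct {
    int fifo_data;            /* stands in for the FIFO contents */
    _Atomic(int *) fifo_ptr;  /* published last; NULL means "not ready" */
} peer_slot_t;

/* Writer: initialize the data first, then publish the pointer.
 * The release store keeps the data write from being reordered after it. */
static void publish_fifo(peer_slot_t *slot, int value)
{
    slot->fifo_data = value;
    atomic_store_explicit(&slot->fifo_ptr, &slot->fifo_data,
                          memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so a
 * non-NULL pointer implies the data write is visible as well.
 * Returns 1 and fills *out if the slot was published, 0 otherwise. */
static int try_read_fifo(peer_slot_t *slot, int *out)
{
    int *p = atomic_load_explicit(&slot->fifo_ptr, memory_order_acquire);
    if (p == NULL) {
        return 0;
    }
    *out = *p;
    return 1;
}
```

A reader that finds the pointer still NULL would simply retry later rather than dereference it.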
On Mar 11, 2009, at 2:18 PM, Eugene Loh wrote:
> Can this error happen on any test?
Presumably yes if two or more processes are on the same node.
Yes, because these failures were occurring during MPI_INIT (i.e.,
'zactly what Eugene said...).
--
Jeff Squyres
Cisco Systems
Ethan Mallove wrote:
Can this error happen on any test?
Presumably yes if two or more processes are on the same node.
What do these tests have in common?
They all try to start. :^) The problem is in MPI_Init.
It almost looks like the problem is more likely to occur if MPI_UB or MPI_L
On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote:
> As Terry stated, I think this bugger is quite rare. I'm having a helluva
> time trying to reproduce it manually (over 5k runs this morning and still
> no segv). Ugh.
5k of which test(s)? Can this error happen on any test? I am wondering if
If it is that hard to replicate outside of MTT, then by all means
let's just release it - users will probably never see it.
On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote:
Ralph Castain wrote:
You know, this isn't the first time we have encountered errors that
-only- appear when running
Ralph Castain wrote:
Could be nobody is saying anything...but I would be surprised if
-nobody- barked at a segfault during startup.
Well, if it segfaulted during startup, someone's first reaction would
probably be, "Oh really?" They would try again, have success, attribute
it to cosmic rays,
Ralph Castain wrote:
You know, this isn't the first time we have encountered errors that
-only- appear when running under MTT. As per my other note, we are not
seeing these failures here, even though almost all our users run under
batch/scripts.
This has been the case with at least some of th
You know, this isn't the first time we have encountered errors that
-only- appear when running under MTT. As per my other note, we are not
seeing these failures here, even though almost all our users run under
batch/scripts.
This has been the case with at least some of these other MTT-only
FWIW, we have people running dozens of jobs every day with 1.3.0 built
with Intel 10.0.23 and PGI 7.2-5 compilers, using -mca btl
sm,openib,self...and have not received a single report of this failure.
This is all on Linux machines (various kernels), under both slurm and
torque environments
As Terry stated, I think this bugger is quite rare. I'm having a
helluva time trying to reproduce it manually (over 5k runs this
morning and still no segv). Ugh.
Looking through the sm startup code, I can't see exactly what the
problem would be. :-(
On Mar 11, 2009, at 11:34 AM, Ralph
I'll run some tests with 1.3.1 on one of our systems and see if it
shows up there. If it is truly rare and was in 1.3.0, then I
personally don't have a problem with it. Got bigger problems with
hanging collectives, frankly - and we don't know how the sm changes
will affect this problem, if
Jeff Squyres wrote:
So -- Brad/George -- this technically isn't a regression against
v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing
it there, but if it's so elusive... I haven't been MTT testing the
1.2 series in a long time). But it is a nonzero problem.
I have not se
So -- Brad/George -- this technically isn't a regression against
v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing
it there, but if it's so elusive... I haven't been MTT testing the
1.2 series in a long time). But it is a nonzero problem.
Should we release 1.3.1 without
Could be true; it unfortunately doesn't help us for 1.3.1, though. :-(
Maybe I'll add a big memset of 0 across the sm segment at the
beginning of time and see if this problem goes away.
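[Editor's sketch, not part of the original message.] The diagnostic floated above, zeroing the whole sm segment before anyone uses it, could look roughly like the following. This is a minimal sketch under stated assumptions: the segment is assumed to be already mapped, and zero_sm_segment and segment_is_zero are hypothetical helper names, not functions from the Open MPI source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: clear an already-mapped segment of `len` bytes so
 * that no stale contents can be mistaken for initialized state. */
static void zero_sm_segment(void *seg, size_t len)
{
    memset(seg, 0, len);
}

/* Hypothetical check: returns 1 if every byte of the segment is zero,
 * 0 otherwise. Useful for confirming the clear took effect. */
static int segment_is_zero(const void *seg, size_t len)
{
    const unsigned char *p = (const unsigned char *)seg;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

In a real run the clear would have to happen exactly once, before any peer attaches; this sketch does not address which process does it or how the others wait.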
On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
I actually wasn't implying that Eugene's changes -ca
I actually wasn't implying that Eugene's changes -caused- the problem,
but rather that I thought they might have -fixed- the problem.
:-)
On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
I forgot to mention that since I ran into this issue so long ago I
really doubt that Eugene's SM change
I forgot to mention that since I ran into this issue so long ago I
really doubt that Eugene's SM changes have caused this issue.
--td
Terry Dontje wrote:
Hey!!! I ran into this problem many months ago but it's been so
elusive that I haven't nailed it down. First time we saw this was
last O
Hey!!! I ran into this problem many months ago but it's been so elusive
that I haven't nailed it down. First time we saw this was last
October. I did some MTT gleaning and could not find anyone but Solaris
having this issue under MTT. What's interesting is I gleaned Sun's MTT
results and
Ralph Castain wrote:
Hey Jeff
I seem to recall seeing the identical problem reported on the user
list not long ago...or may have been the devel list. Anyway, it was
during btl_sm_add_procs, and the code was segv'ing.
I don't have the archives handy here, but perhaps you might search
the
Hey Jeff
I seem to recall seeing the identical problem reported on the user
list not long ago...or may have been the devel list. Anyway, it was
during btl_sm_add_procs, and the code was segv'ing.
I don't have the archives handy here, but perhaps you might search
them and see if there is a
On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
I can't reproduce them manually, but they seem to only happen in a
very small fraction of overall MTT runs. I'm seeing at least 3
classes of errors:
1. btl_sm_a
On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
The one thing that these failures have in common is that they all
appear to be compiled by icc. Here's the configure line:
Check that; I've found at least one case of the pgi compiler resulting
in the same kind of btl sm error.
Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
I can't reproduce them manually, but they seem to only happen in a
very small fraction of overall MTT runs. I'm seeing at least 3
classes of errors:
1. btl_sm_add_procs.c:529 which is this:
if(mca_btl_sm_compo