Jonathan Dursi wrote:
> Continuing the conversation with myself:
>
> Google pointed me to Trac ticket #1944, which spoke of deadlocks in
> looped collective operations; there is no collective operation [...]

Sorry to interrupt... :^)

Okay, I managed to reproduce the hang. I'll try to look at this.
Hi, Eugene:

If it continues to be a problem for people to reproduce this, I'll see
what can be done about having an account made here for someone to poke
around. Alternately, any suggestions for tests that I can do to help
diagnose/verify the problem, or figure out what's different about [...]
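One low-cost diagnostic often requested in threads like this is the build and parameter information that `ompi_info` reports, which shows how Open MPI was compiled and what the shared-memory BTL's tunables are currently set to. A sketch of the kind of commands involved (standard Open MPI tooling; exact output varies by version):

```shell
# Report the Open MPI version and build configuration
# (compiler used, configure flags, etc.):
ompi_info | head -n 20

# List the shared-memory BTL's tunable MCA parameters,
# including btl_sm_num_fifos, with their current values:
ompi_info --param btl sm
```

Attaching this output to a bug report lets others spot differences between a site where the hang reproduces and one where it does not.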
Hi, Jeff:

I wish I had your problems reproducing this. This problem apparently
rears its head when Open MPI is compiled with the Intel compilers as
well, but only ~1% of the time. Unfortunately, we have users who
launch ~1400 single-node jobs at a go, so they see on the order of a
dozen or [...]
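The failure-rate arithmetic above checks out: a ~1% hang rate across ~1400 single-node jobs does come out to roughly a dozen hung jobs per batch. A quick check, using only the figures stated in the post:

```python
# Figures taken from the post above:
hang_rate = 0.01       # the hang occurs in ~1% of runs
jobs_per_batch = 1400  # ~1400 single-node jobs launched at a go

# Expected number of hung jobs per batch:
expected_hangs = hang_rate * jobs_per_batch
print(expected_hangs)  # 14.0 -- i.e. on the order of a dozen
```

This is why a bug that is nearly invisible in casual testing can still be a serious operational problem at scale.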
Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations; there is no collective operation anywhere
in this sample code, but trying one of the suggested workarounds/clues:
that is, setting btl_sm_num_fifos to at [...]
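For readers following along: MCA parameters like btl_sm_num_fifos can be set on the mpirun command line or through the environment. A sketch using standard Open MPI mechanisms; the value 4 and the program name are placeholders, not the setting recommended in the ticket:

```shell
# Set the MCA parameter on the command line:
mpirun --mca btl_sm_num_fifos 4 -np 4 ./trivial_test

# Equivalently, through the environment:
export OMPI_MCA_btl_sm_num_fifos=4
mpirun -np 4 ./trivial_test

# As a further diagnostic, exclude the sm BTL entirely,
# forcing another transport (e.g. TCP loopback) for on-node traffic:
mpirun --mca btl ^sm -np 4 ./trivial_test
```

Excluding sm is a common way to confirm that a hang is specific to the shared-memory transport rather than to the application code.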
I hate to repost, but I'm still stuck with the problem that, on a
completely standard install with a standard gcc compiler, we're getting
random hangs with a trivial test program when using the sm btl, and we
still have no clues as to how to track down the problem.
Using a completely standard [...]