Argh. I know the problem here - per a note on the user list, I actually found
more than five months ago that we weren't properly serializing commands in the
system and created a fix for it. I applied that fix only to the comm_spawn
scenario at the time, as that was the source of the pain - but I noted
Too bad. But no problem - it's very kind of you to have spent so much
time on this.
I wish I knew why our experiments are so different, maybe we will find out
eventually ...
Sylvain
On Wed, 2 Dec 2009, Ralph Castain wrote:
I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet another
slurm system):
./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug
[rhc@odin ~]$ salloc -N 16 tcsh
salloc: Granted job allocation 75294
[rhc@odin mpi]$ mpirun -pernode ./hello
Hello, World, I am
Ok, so I tried with RHEL5 and I get the same (even at 6 nodes): when
setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the
typical stack.
Without my "reproducer patch", 80 nodes was the lower bound to reproduce
the bug (and you needed a couple of runs to get it). But
On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:
> So perhaps it can become a param in the downcall to the MCA base
> as to whether the priority params should be automatically
> registered...?
I can live with that, though I again question why anything needs to
be automatically registered. It
On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:
The only issue with that is it implies there is a param that can be adjusted -
and there isn't. So it can confuse a user - or even a developer, as it did here.
I should think we wouldn't want MCA to automatically add any parameter. If the
component doesn't register it, then it shouldn't exist.
This is not a bug, it's a feature. :-)
The MCA base automatically adds a priority MCA parameter for every
component.
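If a component wanted to own the param instead of inheriting it, it could
register it explicitly in its open function - something along these lines
(a sketch against the 1.3/1.4-era param API; "my_component", the open
function, and the default value are all made up, so check mca_base_param.h
on your branch):

/* Sketch with hypothetical names: register "priority" explicitly so the
 * param exists because the component declared it, not because the MCA
 * base injected it. Compiles only inside the OMPI tree. */
#include "opal/constants.h"
#include "opal/mca/base/mca_base_param.h"

extern mca_base_component_t my_component;   /* hypothetical component */
static int my_priority = 0;

static int my_component_open(void)
{
    /* not internal, not read-only, default 0; current value -> my_priority */
    mca_base_param_reg_int(&my_component, "priority",
                           "Priority of this component",
                           false, false, 0, &my_priority);
    return OPAL_SUCCESS;
}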
On Dec 1, 2009, at 7:40 AM, Ralph Castain wrote:
I'm afraid Sylvain is right, and we have a bug in ompi_info:
MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value)
On Nov 30, 2009, at 10:48 AM, Sylvain Jeaugey wrote:
About my previous e-mail, I was wrong about all components having a 0
priority: it was based on the default parameters reported by
"ompi_info -a | grep routed". It seems that the truth is not always in
ompi_info ...
ompi_info *does*
Ok. Maybe I should try on RHEL5 then.
About the compilers, I've tried with both gcc and intel and it doesn't
seem to make a difference.
On Mon, 30 Nov 2009, Ralph Castain wrote:
Interesting. The only difference I see is the FC11 - I haven't seen anyone
running on that OS yet. I wonder if that is the source of the trouble? Do we
know that our code works on that one? I know we had problems in the past with
FC9, for example, that required fixes.
Also, what compiler are
Hi Ralph,
I'm also puzzled :-)
Here is what I did today:
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with:
./configure
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
> Hi Ralph,
>
> I tried with the trunk and it makes no difference for me.
Strange
Hi Ralph,
I tried with the trunk and it makes no difference for me.
Looking at potential differences, I found out something strange. The bug
may have something to do with the "routed" framework. I can reproduce the
bug with binomial and direct, but not with cm and linear (you disabled the
Just to clarify something: I have been testing with the trunk, NOT the 1.5
branch. I haven't even bothered to look at that code since it was branched.
From what little I have heard plus what I (and others) have done since the
branch, I strongly suspect a complete ORTE refresh will be required
Hi Sylvain
Well, I hate to tell you this, but I cannot reproduce the "bug" even with this
code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs
really slow as I increase the delay, but it completes the job just fine. I ran
jobs across 16 nodes on a slurm machine, 1-4
Thanks! I'll give it a try.
My tests are all conducted with fast launches (just running slurm on large
clusters) and using an MPI hello world that calls MPI_Init as its first
instruction. I'll see if adding the delay causes it to misbehave.
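For reference, my hello world is just the canonical one - roughly the
following (the exact source doesn't matter, only that MPI_Init is the
first executable statement):

/* hello.c - every rank calls MPI_Init immediately on startup, so ranks
 * try to join the job as fast as the launch can spawn them. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* first statement: no delay */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, World, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}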
On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
Hi Ralph,
Thanks for your efforts. I will look at our configuration and see how it
may differ from yours.
Here is a patch which helps reproduce the bug even with a small number
of nodes.
diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009
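In words: the patch just injects a tunable delay into the orted's relay
path so the race becomes easy to hit on few nodes. The sketch below gives
the spirit of it - the function name and placement are illustrative, not
the actual diff:

/* Illustrative sketch, not the real patch: sleep ORTE_RELAY_DELAY
 * seconds in the relay path, widening the window in which MPI_Init
 * can race the relay. */
#include <stdlib.h>
#include <unistd.h>

static void maybe_delay_relay(void)
{
    const char *delay = getenv("ORTE_RELAY_DELAY");
    if (NULL != delay && atoi(delay) > 0) {
        sleep((unsigned int)atoi(delay));   /* e.g. ORTE_RELAY_DELAY=1 */
    }
}

int main(void)
{
    maybe_delay_relay();   /* standalone demo driver; no-op unless set */
    return 0;
}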
Hi Sylvain
I've spent several hours trying to replicate the behavior you described on
clusters up to a couple of hundred nodes (all running slurm), without success.
I'm becoming increasingly convinced that this is a configuration issue as
opposed to a code issue.
I have enclosed the platform
Thank you, Ralph, for this valuable help.
I set up a quick-and-dirty patch basically postponing process_msg (hence
daemon_collective) until the launch is done. In process_msg, I therefore
requeue a process_msg handler and return.
In this "all-must-be-non-blocking-and-done-through-opal_progress"
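Stripped of the ORTE specifics, the pattern is the one below - a
self-contained toy with made-up names (the real handler sits in the
daemon collective path and requeues itself through the event library):

/* Toy model of the workaround: refuse to run the collective until the
 * launch phase has completed; otherwise ask to be requeued and return,
 * so the progress loop never blocks. */
#include <stdbool.h>
#include <stdio.h>

static bool launch_complete = false;

/* returns true if consumed, false if the caller must requeue it */
static bool process_msg(const char *msg)
{
    if (!launch_complete) {
        printf("launch still running - requeue '%s' and return\n", msg);
        return false;
    }
    printf("launch done - processing '%s'\n", msg);
    return true;
}

int main(void)
{
    bool deferred = !process_msg("daemon collective"); /* launch running */
    launch_complete = true;            /* launch finishes on a later cycle */
    if (deferred) {
        process_msg("daemon collective");       /* requeued handler runs */
    }
    return 0;
}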
Very strange. As I said, we routinely launch jobs spanning several hundred
nodes without problem. You can see the platform files for that setup in
contrib/platform/lanl/tlcc
That said, it is always possible you are hitting some kind of race condition we
don't hit. In looking at the code, one
I would say I use the default settings, i.e. I don't set anything
"special" at configure.
I'm launching my processes with SLURM (salloc + mpirun).
Sylvain
On Wed, 18 Nov 2009, Ralph Castain wrote:
How did you configure OMPI?
What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
I don't think so, and I'm not doing it explicitly at least. How do I
know?
Sylvain
On Tue, 17 Nov 2009, Ralph Castain wrote:
We routinely launch across thousands of nodes without a problem...I have never
seen it stick in this fashion.
Did you build and/or are using ORTE threaded by any chance? If so, that
definitely won't work.
On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
Hi all,
We are currently experiencing problems at launch on the 1.5 branch on a
relatively large number of nodes (at least 80). Some processes are not
spawned and orted processes are deadlocked.
When MPI processes call MPI_Init before send_relay is complete, the
send_relay function and