Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-16 Thread Ralph Castain
Argh. I know the problem here - per note on user list, I actually found more than five months ago that we weren't properly serializing commands in the system and created a fix for it. I applied that fix only to the comm_spawn scenario at the time as this was the source of the pain - but I noted
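
[Editor's sketch] The fix Ralph mentions is not shown in this preview. As a generic illustration of what "serializing commands" can mean in a single-threaded daemon, the following assumes a hypothetical `SerialCommandProcessor` (not an ORTE type). A command submitted while another command is still running is queued and runs only after the current one finishes.

```python
from collections import deque


class SerialCommandProcessor:
    """Hypothetical illustration only; not the ORTE implementation."""

    def __init__(self):
        self._pending = deque()
        self._busy = False
        self.trace = []

    def submit(self, name, action):
        self._pending.append((name, action))
        if self._busy:
            # A command is already executing; it will drain the queue,
            # so the new command never runs nested inside the current one.
            return
        self._busy = True
        try:
            while self._pending:
                cmd_name, cmd_action = self._pending.popleft()
                self.trace.append("start " + cmd_name)
                cmd_action(self)
                self.trace.append("end " + cmd_name)
        finally:
            self._busy = False


def demo_serialized():
    proc = SerialCommandProcessor()

    def launch(p):
        # Submitting from inside a running command only enqueues.
        p.submit("collective", lambda q: None)

    proc.submit("launch", launch)
    return proc.trace
```

In this sketch the "collective" command starts only after "launch" has ended, instead of interleaving with it.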

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-03 Thread Sylvain Jeaugey
Too bad. But no problem, that's very nice of you to have spent so much time on this. I wish I knew why our experiments are so different, maybe we will find out eventually ... Sylvain On Wed, 2 Dec 2009, Ralph Castain wrote: I'm sorry, Sylvain - I simply cannot replicate this problem

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Ralph Castain
I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet another slurm system):
./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug
[rhc@odin ~]$ salloc -N 16 tcsh
salloc: Granted job allocation 75294
[rhc@odin mpi]$ mpirun -pernode ./hello
Hello, World, I am

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Sylvain Jeaugey
Ok, so I tried with RHEL5 and I get the same (even at 6 nodes): when setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the typical stack. Without my "reproducer patch", 80 nodes was the lower bound to reproduce the bug (and you needed a couple of runs to get it). But

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
On Dec 1, 2009, at 5:48 PM, Jeff Squyres wrote:
> On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:
>
>> > So perhaps it can become a param in the downcall to the MCA base as to
>> > whether the priority params should be automatically registered...?
>>
>> I can live with that, though I again

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote: > So perhaps it can become a param in the downcall to the MCA base as to whether the priority params should be automatically registered...? I can live with that, though I again question why anything needs to be automatically registered. It

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
On Dec 1, 2009, at 3:40 PM, Jeff Squyres wrote:
> On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:
>
>> The only issue with that is it implies there is a param that can be adjusted
>> - and there isn't. So it can confuse a user - or even a developer, as it did
>> here.
>>
>> I should think

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote: The only issue with that is it implies there is a param that can be adjusted - and there isn't. So it can confuse a user - or even a developer, as it did here. I should think we wouldn't want MCA to automatically add any parameter. If the

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
The only issue with that is it implies there is a param that can be adjusted - and there isn't. So it can confuse a user - or even a developer, as it did here. I should think we wouldn't want MCA to automatically add any parameter. If the component doesn't register it, then it shouldn't exist.

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
This is not a bug, it's a feature. :-) The MCA base automatically adds a priority MCA parameter for every component. On Dec 1, 2009, at 7:40 AM, Ralph Castain wrote: I'm afraid Sylvain is right, and we have a bug in ompi_info: MCA routed: parameter
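
[Editor's sketch] The behavior Jeff describes can be illustrated with a toy registry. Everything here is hypothetical (`ParamRegistry`, `HardCodedComponent`, and the internal priority value 70 are made up for illustration and are not the MCA API or real Open MPI values). The sketch assumes a component that hard-codes its priority. The base still lists a `<framework>_<component>_priority` parameter with a default of 0, which is how a listing can suggest a tunable that the component never reads.

```python
class ParamRegistry:
    """Hypothetical stand-in for a parameter registry; not the real MCA API."""

    def __init__(self):
        self.params = {}  # name -> (value, data source)

    def register(self, name, default):
        # First registration wins; later calls do not overwrite.
        self.params.setdefault(name, (default, "default value"))

    def open_component(self, framework, component):
        # The base registers "<framework>_<component>_priority" on the
        # component's behalf, whether or not the component ever reads it.
        self.register(f"{framework}_{component.name}_priority", 0)
        component.register_params(self)


class HardCodedComponent:
    """Assumed component that never consults the registry for its priority."""

    def __init__(self, name, internal_priority):
        self.name = name
        self.internal_priority = internal_priority  # made-up value

    def register_params(self, registry):
        pass  # registers nothing of its own

    def query_priority(self, registry):
        return self.internal_priority


def demo_priority_listing():
    registry = ParamRegistry()
    comp = HardCodedComponent("binomial", internal_priority=70)
    registry.open_component("routed", comp)
    listed = registry.params["routed_binomial_priority"]
    return listed, comp.query_priority(registry)
```

In this sketch the listing reports 0 from the default value while the component answers with its own hard-coded number, which is the kind of mismatch discussed in the surrounding messages.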

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
I'm afraid Sylvain is right, and we have a bug in ompi_info:
MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value)

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Jeff Squyres
On Nov 30, 2009, at 10:48 AM, Sylvain Jeaugey wrote: About my previous e-mail, I was wrong about all components having a 0 priority : it was based on default parameters reported by "ompi_info -a | grep routed". It seems that the truth is not always in ompi_info ... ompi_info *does*

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
Ok. Maybe I should try on a RHEL5 then. About the compilers, I've tried with both gcc and intel and it doesn't seem to make a difference. On Mon, 30 Nov 2009, Ralph Castain wrote: Interesting. The only difference I see is the FC11 - I haven't seen anyone running on that OS yet. I wonder if

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Ralph Castain
Interesting. The only difference I see is the FC11 - I haven't seen anyone running on that OS yet. I wonder if that is the source of the trouble? Do we know that our code works on that one? I know we had problems in the past with FC9, for example, that required fixes. Also, what compiler are

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
Hi Ralph, I'm also puzzled :-) Here is what I did today:
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with: ./configure

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Ralph Castain
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
> Hi Ralph,
>
> I tried with the trunk and it makes no difference for me.
Strange
>
> Looking at potential differences, I found out something strange. The bug may
> have something to do with the "routed" framework. I can reproduce the bug

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Sylvain Jeaugey
Hi Ralph, I tried with the trunk and it makes no difference for me. Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-26 Thread Ralph Castain
Just to clarify something: I have been testing with the trunk, NOT the 1.5 branch. I haven't even bothered to look at that code since it was branched. From what little I have heard plus what I (and others) have done since the branch, I strongly suspect a complete ORTE refresh will be required

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-26 Thread Ralph Castain
Hi Sylvain Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Ralph Castain
Thanks! I'll give it a try. My tests are all conducted with fast launches (just running slurm on large clusters) and using an mpi hello world that calls mpi_init at first instruction. I'll see if adding the delay causes it to misbehave.
On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey
Hi Ralph, Thanks for your efforts. I will look at our configuration and see how it may differ from yours. Here is a patch which helps reproduce the bug even with a small number of nodes. diff -r b622b9e8f1ac orte/orted/orted_comm.c --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Ralph Castain
Hi Sylvain I've spent several hours trying to replicate the behavior you described on clusters up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue. I have enclosed the platform

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
Thank you Ralph for this precious help. I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return. In this "all-must-be-non-blocking-and-done-through-opal_progress"
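
[Editor's sketch] The patch itself is not included in this preview. The idea it describes, a handler that requeues itself and returns until the launch is done, can be sketched with a toy single-threaded event loop. `EventLoop`, `finish_launch`, and the `launch_done` flag are hypothetical names for illustration; only `process_msg` is taken from the message, and this is not the ORTE code.

```python
from collections import deque


class EventLoop:
    """Toy single-threaded event loop; a stand-in for a progress engine."""

    def __init__(self):
        self.queue = deque()
        self.launch_done = False
        self.log = []

    def post(self, handler, *args):
        self.queue.append((handler, args))

    def run(self, max_steps=100):
        # max_steps guards against spinning forever if the launch never ends.
        steps = 0
        while self.queue and steps < max_steps:
            handler, args = self.queue.popleft()
            handler(self, *args)
            steps += 1
        return steps


def process_msg(loop, msg):
    if not loop.launch_done:
        # Launch still in progress: requeue this handler and return,
        # so other queued events get a chance to run first.
        loop.post(process_msg, msg)
        return
    loop.log.append(("processed", msg))


def finish_launch(loop):
    loop.launch_done = True
    loop.log.append(("launch_done",))


def demo_requeue():
    loop = EventLoop()
    loop.post(process_msg, "collective")  # arrives before the launch ends
    loop.post(finish_launch)
    loop.run()
    return loop.log
```

One caveat with this pattern: if nothing else is queued, the requeued handler busy-spins until the condition changes, which is why the toy loop caps the number of steps.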

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Ralph Castain
Very strange. As I said, we routinely launch jobs spanning several hundred nodes without problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
I would say I use the default settings, i.e. I don't set anything "special" at configure. I'm launching my processes with SLURM (salloc + mpirun). Sylvain On Wed, 18 Nov 2009, Ralph Castain wrote: How did you configure OMPI? What launch mechanism are you using - ssh? On Nov 17, 2009, at

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-18 Thread Ralph Castain
How did you configure OMPI? What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
> I don't think so, and I'm not doing it explicitly at least. How do I know?
>
> Sylvain
>
> On Tue, 17 Nov 2009, Ralph Castain wrote:
>
>> We routinely launch across

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
I don't think so, and I'm not doing it explicitly at least. How do I know? Sylvain On Tue, 17 Nov 2009, Ralph Castain wrote: We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion. Did you build and/or are you using ORTE threaded by any

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Ralph Castain
We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion. Did you build and/or are you using ORTE threaded by any chance? If so, that definitely won't work.
On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
> Hi all,
>
> We are currently

[OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
Hi all, We are currently experiencing problems at launch on the 1.5 branch on a relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked. When MPI processes are calling MPI_Init before send_relay is complete, the send_relay function and