Re: [OMPI devel] Orte update

2007-07-16 Thread Ralph H Castain
Sigh - somehow, the fix slid out of that commit. I have now fixed it in
r15437.

Thanks
Ralph



On 7/16/07 6:11 AM, "Sven Stork"  wrote:

> On Friday 13 July 2007 15:35, Ralph H Castain wrote:
>> 
>> On 7/13/07 7:22 AM, "Sven Stork"  wrote:
>> 
>>> Hi Ralph,
>>> 
>>> On Thursday 12 July 2007 15:53, Ralph H Castain wrote:
>>>> Yo all
>>>> 
>>>> I have a fairly significant change coming to the orte part of the code base
>>>> that will require an autogen (sorry). I'll check it in late this afternoon
>>>> (can't do it at night as it is on my office desktop).
>>>> 
>>>> The commit will fix the singleton operations, including singleton
>>>> comm_spawn. It also takes the first step towards removing event-driven
>>>> operations, replacing them with more serial code (to be explained
>>>> separately). As part of all this, I had to modify the various pls
>>>> components. For those I could not compile, I made a first cut at them that
>>>> should (hopefully) allow them to continue to operate.
>>>> 
>>>> Any of you using TM: we discovered that the trunk is not working currently
>>>> on that environment. We are investigating - it has nothing to do with this
>>>> commit, but predates it.
>>> 
>>> What do you mean by broken?
>>> I tried r15394 on our cluster and TM appears to be working for me. The only
>>> issue I currently know about is the problem with iof (see ticket #1071; it
>>> can be temporarily fixed by using -mca iof ^null).
>> 
>> That is correct - the null component was being incorrectly selected because
>> of an error in its selection logic. We fixed it in the r15390 commit - it
>> was a trivial fix - so now everything works fine.
>> 
> 
> I cannot see anything in r15390 that fixes this issue. I checked with the
> latest version of the trunk and still have the same issue:
> 
> hpcstork@noco042:~/ > ompi_info
> Open MPI: 1.3a1r15427
>    Open MPI SVN revision: r15427
> ...
> hpcstork@noco042:~/ > mpiexec date
> hpcstork@noco042:~/ > mpiexec -mca iof ^null date
> Mon Jul 16 14:00:57 CEST 2007
> Mon Jul 16 14:00:57 CEST 2007
> hpcstork@noco042:~/ >
> 
> Thanks,
>   Sven
> 
>>> 
>>> Thanks,
>>>   Sven 
>>> 
>>>> Just wanted to give you a heads-up. Please refrain from making changes to
>>>> the orte codebase today, if you could - it would simplify the commit and
>>>> ensure we don't lose your changes.
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> 
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>> 
>> 
>> 




Re: [OMPI devel] Orte update

2007-07-16 Thread Sven Stork
On Friday 13 July 2007 15:35, Ralph H Castain wrote:
> 
> On 7/13/07 7:22 AM, "Sven Stork"  wrote:
> 
> > Hi Ralph,
> > 
> > On Thursday 12 July 2007 15:53, Ralph H Castain wrote:
> >> Yo all
> >> 
> >> I have a fairly significant change coming to the orte part of the code base
> >> that will require an autogen (sorry). I'll check it in late this afternoon
> >> (can't do it at night as it is on my office desktop).
> >> 
> >> The commit will fix the singleton operations, including singleton
> >> comm_spawn. It also takes the first step towards removing event-driven
> >> operations, replacing them with more serial code (to be explained
> >> separately). As part of all this, I had to modify the various pls
> >> components. For those I could not compile, I made a first cut at them that
> >> should (hopefully) allow them to continue to operate.
> >> 
> >> Any of you using TM: we discovered that the trunk is not working currently
> >> on that environment. We are investigating - it has nothing to do with this
> >> commit, but predates it.
> > 
> > What do you mean by broken?
> > I tried r15394 on our cluster and TM appears to be working for me. The only
> > issue I currently know about is the problem with iof (see ticket #1071; it
> > can be temporarily fixed by using -mca iof ^null).
> 
> That is correct - the null component was being incorrectly selected because
> of an error in its selection logic. We fixed it in the r15390 commit - it
> was a trivial fix - so now everything works fine.
> 

I cannot see anything in r15390 that fixes this issue. I checked with the
latest version of the trunk and still have the same issue:

hpcstork@noco042:~/ > ompi_info
Open MPI: 1.3a1r15427
   Open MPI SVN revision: r15427
...
hpcstork@noco042:~/ > mpiexec date
hpcstork@noco042:~/ > mpiexec -mca iof ^null date
Mon Jul 16 14:00:57 CEST 2007
Mon Jul 16 14:00:57 CEST 2007
hpcstork@noco042:~/ > 

Thanks,
  Sven

> > 
> > Thanks,
> >   Sven 
> > 
> >> Just wanted to give you a heads-up. Please refrain from making changes to
> >> the orte codebase today, if you could - it would simplify the commit and
> >> ensure we don't lose your changes.
> >> 
> >> Thanks
> >> Ralph
> >> 
> 
> 
> 


Re: [OMPI devel] Orte update

2007-07-13 Thread Ralph H Castain



On 7/13/07 7:22 AM, "Sven Stork"  wrote:

> Hi Ralph,
> 
> On Thursday 12 July 2007 15:53, Ralph H Castain wrote:
>> Yo all
>> 
>> I have a fairly significant change coming to the orte part of the code base
>> that will require an autogen (sorry). I'll check it in late this afternoon
>> (can't do it at night as it is on my office desktop).
>> 
>> The commit will fix the singleton operations, including singleton
>> comm_spawn. It also takes the first step towards removing event-driven
>> operations, replacing them with more serial code (to be explained
>> separately). As part of all this, I had to modify the various pls
>> components. For those I could not compile, I made a first cut at them that
>> should (hopefully) allow them to continue to operate.
>> 
>> Any of you using TM: we discovered that the trunk is not working currently
>> on that environment. We are investigating - it has nothing to do with this
>> commit, but predates it.
> 
> What do you mean by broken?
> I tried r15394 on our cluster and TM appears to be working for me. The only
> issue I currently know about is the problem with iof (see ticket #1071; it
> can be temporarily fixed by using -mca iof ^null).

That is correct - the null component was being incorrectly selected because
of an error in its selection logic. We fixed it in the r15390 commit - it
was a trivial fix - so now everything works fine.
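
For readers unfamiliar with the "^null" workaround quoted above: a leading
"^" in an MCA component list excludes the named components from selection
instead of requesting them. The following is only a minimal sketch of that
include/exclude rule, not Open MPI's selection code. The function name
select_components is hypothetical, and the component names other than "null"
are placeholders chosen for illustration.

```python
def select_components(spec, available):
    """Filter `available` by an MCA-style component list.

    "a,b"  -> keep only a and b
    "^a,b" -> keep everything except a and b
    (Hypothetical helper; a simplification for illustration only.)
    """
    spec = spec.strip()
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    names = {n.strip() for n in spec.split(",") if n.strip()}
    if exclude:
        return [c for c in available if c not in names]
    return [c for c in available if c in names]


# Placeholder component names; only "null" comes from the thread.
available = ["null", "proxy", "svc"]
print(select_components("^null", available))  # ['proxy', 'svc']
```

With "-mca iof ^null" the null component is simply removed from the candidate
set, which is why the workaround sidesteps the faulty selection logic.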


> 
> Thanks,
>   Sven 
> 
>> Just wanted to give you a heads-up. Please refrain from making changes to
>> the orte codebase today, if you could - it would simplify the commit and
>> ensure we don't lose your changes.
>> 
>> Thanks
>> Ralph
>> 




Re: [OMPI devel] Orte update

2007-07-13 Thread Sven Stork
Hi Ralph,

On Thursday 12 July 2007 15:53, Ralph H Castain wrote:
> Yo all
> 
> I have a fairly significant change coming to the orte part of the code base
> that will require an autogen (sorry). I'll check it in late this afternoon
> (can't do it at night as it is on my office desktop).
> 
> The commit will fix the singleton operations, including singleton
> comm_spawn. It also takes the first step towards removing event-driven
> operations, replacing them with more serial code (to be explained
> separately). As part of all this, I had to modify the various pls
> components. For those I could not compile, I made a first cut at them that
> should (hopefully) allow them to continue to operate.
> 
> Any of you using TM: we discovered that the trunk is not working currently
> on that environment. We are investigating - it has nothing to do with this
> commit, but predates it.

What do you mean by broken?
I tried r15394 on our cluster and TM appears to be working for me. The only
issue I currently know about is the problem with iof (see ticket #1071; it
can be temporarily fixed by using -mca iof ^null).

Thanks,
  Sven 

> Just wanted to give you a heads-up. Please refrain from making changes to
> the orte codebase today, if you could - it would simplify the commit and
> ensure we don't lose your changes.
> 
> Thanks
> Ralph
> 


Re: [OMPI devel] Orte update

2007-07-12 Thread Ralph H Castain
The commit has been made - it is r15390.

This commit restored the ability to execute singletons and singleton
comm_spawn, both in single node and multi-node environments. It also
includes a first step in our plan to reduce the ORTE system to the minimum
functionality required to support Open MPI (more on that separately).

Short description of major changes:

1. singletons now fork/exec a local daemon to manage their operations. This
was required not only to resolve the current problem, but also to deal with
threading issues in the progress engine down the road.

2. the orte daemon code now resides in libopen-rte. This was needed so that
mpirun could fully provide all daemon services since we no longer allow
multiple daemons to share a node (so an orted could not co-reside with
mpirun).

3. daemons no longer use the orte triggering system during startup. Instead,
they directly call back to their parent pls component to report ready to
operate.
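
Items 1 and 3 above describe a general pattern: a process fork/execs a local
helper and then waits for that helper to report directly back to it that it
is ready. The sketch below illustrates the pattern in Python on a POSIX
system. It is not ORTE code: the pipe used for the report, the stand-in
"daemon" command, and the name launch_local_daemon are all assumptions made
for the example.

```python
import os
import sys


def launch_local_daemon():
    """Fork/exec a stand-in daemon and block until it reports ready.

    Returns (message, exit_status). Hypothetical helper for illustration.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end and let it survive the exec.
        os.close(read_fd)
        os.set_inheritable(write_fd, True)
        # Stand-in daemon: write "ready" to the inherited descriptor and exit.
        code = "import os, sys; os.write(int(sys.argv[1]), b'ready')"
        try:
            os.execv(sys.executable,
                     [sys.executable, "-c", code, str(write_fd)])
        finally:
            # Reached only if exec failed.
            os._exit(127)
    # Parent: close the write end so a dead child yields EOF, not a hang.
    os.close(write_fd)
    message = os.read(read_fd, 16)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return message.decode(), os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(launch_local_daemon())
```

The parent does not go through any event or trigger system here; it just
blocks on the report from the child, which is the "more serial" style the
commit message refers to.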

I have modified all the pls components except xcpu and poe (don't understand
either well enough to do it). Full functionality has been verified for rsh,
SLURM, and TM systems. Compile has been verified for xgrid and gridengine,
and hopefully those environments will work - though I could not verify that
was true.

Note that singletons will *not* operate in Windows environments at this
time. The ability to fork/exec the local daemon would need to be added
first, assuming Windows can support singletons (I honestly don't know).

Please let me know of any problems.
Ralph


On 7/12/07 1:45 PM, "Ralph H Castain"  wrote:

> Yo folks
> 
> Several of us are stuck waiting for this commit to hit. Rather than wasting
> the next several hours, I'm going to make the commit now.
> 
> So please be advised: if you do an update after this commit hits, you will
> need to autogen. You may want to wait until a convenient time before doing
> the update.
> 
> Thanks
> Ralph
> 
> 
> On 7/12/07 7:53 AM, "Ralph H Castain"  wrote:
> 
>> Yo all
>> 
>> I have a fairly significant change coming to the orte part of the code base
>> that will require an autogen (sorry). I'll check it in late this afternoon
>> (can't do it at night as it is on my office desktop).
>> 
>> The commit will fix the singleton operations, including singleton
>> comm_spawn. It also takes the first step towards removing event-driven
>> operations, replacing them with more serial code (to be explained
>> separately). As part of all this, I had to modify the various pls
>> components. For those I could not compile, I made a first cut at them that
>> should (hopefully) allow them to continue to operate.
>> 
>> Any of you using TM: we discovered that the trunk is not working currently
>> on that environment. We are investigating - it has nothing to do with this
>> commit, but predates it.
>> 
>> Just wanted to give you a heads-up. Please refrain from making changes to
>> the orte codebase today, if you could - it would simplify the commit and
>> ensure we don't lose your changes.
>> 
>> Thanks
>> Ralph
>> 




Re: [OMPI devel] Orte update

2007-07-12 Thread Ralph H Castain
Yo folks

Several of us are stuck waiting for this commit to hit. Rather than wasting
the next several hours, I'm going to make the commit now.

So please be advised: if you do an update after this commit hits, you will
need to autogen. You may want to wait until a convenient time before doing
the update.

Thanks
Ralph


On 7/12/07 7:53 AM, "Ralph H Castain"  wrote:

> Yo all
> 
> I have a fairly significant change coming to the orte part of the code base
> that will require an autogen (sorry). I'll check it in late this afternoon
> (can't do it at night as it is on my office desktop).
> 
> The commit will fix the singleton operations, including singleton
> comm_spawn. It also takes the first step towards removing event-driven
> operations, replacing them with more serial code (to be explained
> separately). As part of all this, I had to modify the various pls
> components. For those I could not compile, I made a first cut at them that
> should (hopefully) allow them to continue to operate.
> 
> Any of you using TM: we discovered that the trunk is not working currently
> on that environment. We are investigating - it has nothing to do with this
> commit, but predates it.
> 
> Just wanted to give you a heads-up. Please refrain from making changes to
> the orte codebase today, if you could - it would simplify the commit and
> ensure we don't lose your changes.
> 
> Thanks
> Ralph
> 




[OMPI devel] Orte update

2007-07-12 Thread Ralph H Castain
Yo all

I have a fairly significant change coming to the orte part of the code base
that will require an autogen (sorry). I'll check it in late this afternoon
(can't do it at night as it is on my office desktop).

The commit will fix the singleton operations, including singleton
comm_spawn. It also takes the first step towards removing event-driven
operations, replacing them with more serial code (to be explained
separately). As part of all this, I had to modify the various pls
components. For those I could not compile, I made a first cut at them that
should (hopefully) allow them to continue to operate.

Any of you using TM: we discovered that the trunk is not working currently
on that environment. We are investigating - it has nothing to do with this
commit, but predates it.

Just wanted to give you a heads-up. Please refrain from making changes to
the orte codebase today, if you could - it would simplify the commit and
ensure we don't lose your changes.

Thanks
Ralph