Re: [OMPI devel] [RFC] Runtime Services Layer

2007-08-16 Thread Doug Tody
Sounds like a no-brainer to me (I agree).  In a system of this complexity
a strong interface between MPI and the RTE layer is desirable, if for
no other reason than good software engineering.  Such an abstraction
layer also increases flexibility, as it allows the RTE layer to be
a modular component which can be customized, evolved, or otherwise
modified as necessary, with minimal effect upon the MPI functionality
(assuming a strong interface).  Whether this should take the form of
a single framework I can't say, but that could be a reasonable way
to achieve a strong interface.

- Doug Tody
(merely a potential user watching from the sidelines, from astronomy)


On Thu, 16 Aug 2007, Tim Prins wrote:

> WHAT: Solicitation of feedback on the possibility of adding a runtime 
> services layer to Open MPI to abstract out the runtime.
> 
> WHY: To solidify the interface between OMPI and the runtime environment, 
> and to allow the use of different runtime systems, including different 
> versions of ORTE.
> 
> WHERE: Addition of a new framework to OMPI, and changes to many of the 
> files in OMPI to funnel all runtime requests through this framework. Few 
> changes should be required in OPAL and ORTE.
> 
> WHEN: Development has started in tmp/rsl, but is still in its infancy. We 
> hope to have a working system in the next month.
> 
> TIMEOUT: 8/29/07
> 
> --
> Short version:
> 
> I am working on creating an interface between OMPI and the runtime system. 
> This would take the form of an RSL framework in OMPI through which all 
> runtime services would be accessed. Attached is a graphic depicting this.
> 
> This change would be invasive to the OMPI layer. Few (if any) changes 
> will be required of the ORTE and OPAL layers.
> 
> At this point I am soliciting feedback as to whether people are 
> supportive or not of this change both in general and for v1.3.
> 
> 
> Long version:
> 
> The current model used in Open MPI assumes that one runtime system is 
> the best for all environments. However, in many environments it may be 
> beneficial to have specialized runtime systems. With our current system this 
> is not easy to do.
> 
> With this in mind, the idea of creating a 'runtime services layer' was 
> hatched. This would take the form of a framework within OMPI, through which 
> all runtime functionality would be accessed. This would allow new or 
> different runtime systems to be used with Open MPI. Additionally, with such a
> system it would be possible to have multiple versions of open rte coexisting,
> which may facilitate development and testing. Finally, this would solidify 
> the interface between OMPI and the runtime system, as well as provide 
> documentation of the behavior and side effects of each interface function.
> 
> However, such a change would be fairly invasive to the OMPI layer, and 
> needs buy-in from everyone for it to be possible.
> 
> Here is a summary of the changes required for the RSL (at least how it is 
> currently envisioned):
> 
> 1. Add a framework to ompi for the rsl, and a component to support orte.
> 2. Change ompi so that it uses the new interface. This involves:
>  a. Moving runtime specific code into the orte rsl component.
>  b. Changing the process names in ompi to an opaque object.
>  c. Change all references to orte in ompi to be to the rsl.
> 3. Change the configuration code so that open-rte is only linked where needed.
> 
> Of course, all this would happen on a tmp branch.
> 
> The design of the rsl is not solidified. I have been playing in a tmp branch 
> (located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is 
> welcome to look at and comment on, but be advised that things here are 
> subject to change (I don't think it even compiles right now). There are 
> some fairly large open questions on this, including:
> 
> 1. How to handle mpirun (that is, when a user types 'mpirun', do they 
> always get ORTE, or do they sometimes get a system specific runtime). Most 
> likely mpirun will always use ORTE, and alternative launching programs would 
> be used for other runtimes.
> 2. Whether there will be any performance implications. My guess is not, 
> but I am not quite sure of this yet.
> 
> Again, I am interested in people's comments on whether they think adding 
> such abstraction is good or not, and whether it is reasonable to do such a 
> thing for v1.3.
> 
> Thanks,
> 
> Tim Prins
> 
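
The abstraction Tim proposes is in the spirit of OMPI's other MCA frameworks. Purely as an illustrative sketch (all names here are hypothetical and do not reflect the actual tmp/rsl branch), a runtime component might expose its services through a function table, with the OMPI layer calling only through that table and never touching ORTE symbols directly:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch of an RSL-style interface -- names are illustrative
 * and do not reflect the real tmp/rsl code.  A runtime component fills in
 * this function table; OMPI calls only through it. */
typedef struct rsl_process_name_t rsl_process_name_t;   /* opaque to OMPI */

typedef struct rsl_module_t {
    int (*init)(void);
    int (*send)(const rsl_process_name_t *peer, int tag,
                const void *buf, size_t len);
    int (*finalize)(void);
} rsl_module_t;

/* A trivial stub component, standing in for an ORTE-backed one. */
static int stub_init(void)     { return 0; }
static int stub_finalize(void) { return 0; }
static int stub_send(const rsl_process_name_t *peer, int tag,
                     const void *buf, size_t len)
{
    (void)peer; (void)buf;
    printf("stub send: tag=%d len=%zu\n", tag, len);
    return 0;
}

rsl_module_t rsl_stub = { stub_init, stub_send, stub_finalize };
```

Swapping runtimes would then mean selecting a different rsl_module_t at startup, with no other OMPI code aware of the change.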


Re: [OMPI devel] [Pkg-openmpi-maintainers] Bug#435581: [u...@hermann-uwe.de: Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]

2007-08-16 Thread Uwe Hermann
On Mon, Aug 13, 2007 at 08:04:39PM -0500, Dirk Eddelbuettel wrote:
> On 14 August 2007 at 00:08, Adrian Knoth wrote:
> | On Mon, Aug 13, 2007 at 04:26:31PM -0500, Dirk Eddelbuettel wrote:
> | 
> | > > I'll now compile the 1.2.3 release tarball and see if I can reproduce
> | 
> | The 1.2.3 release also works fine:
> | 
> | adi@debian:~$ ./ompi123/bin/mpirun -np 2 ring
> | 0: sending message (0) to 1
> | 0: sent message
> | 1: waiting for message
> | 1: got message (1) from 0, sending to 0
> | 0: got message (1) from 1
> 
> Now I'm even more confused. I thought the bug was that it segfaulted when used
> on a Debian-on-FreeBSD-kernel system?

I think Adrian used a tarball, not the Debian package?
I'll try a local, manual install too, maybe the bug is Debian-related only?


> | adi@debian:~$ ./ompi123/bin/ompi_info 
> | Open MPI: 1.2.3
> |Open MPI SVN revision: r15136
> | Open RTE: 1.2.3
> |Open RTE SVN revision: r15136
> | OPAL: 1.2.3
> |OPAL SVN revision: r15136
> |   Prefix: /home/adi/ompi123
> |  Configured architecture: x86_64-unknown-kfreebsd6.2-gnu

Same here.


> | > | JFTR: It's currently not possible to compile OMPI on amd64 (out of the
> | > | box). Though it compiles on i386
> | > | 
> | > |
> http://experimental.debian.net/fetch.php?=openmpi=1.2.3-3=kfreebsd-i386=1187000200=log=raw
> | > | 
> | > | it fails on amd64:
> | > | 
> | > |
> http://experimental.debian.net/fetch.php?=openmpi=1.2.3-3=kfreebsd-amd64=1186969782=log=raw
> | > | 
> | > | stacktrace.c: In function 'opal_show_stackframe':
> | > | stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this
> | > | function)
> | > | stacktrace.c:145: error: (Each undeclared identifier is reported only
> | > | once
> | > | stacktrace.c:145: error: for each function it appears in.)
> | > | stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this
> | > | function)
> | > | stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this
> | > | function)
> | > | make[4]: *** [stacktrace.lo] Error 1
> | > | make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
> | > | 
> | > | 
> | > | This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h, the
> | > | relevant #define's are placed in an #ifdef __i386__ condition. After
> | > | extending this for __x86_64__, everything works fine.
> | > | 
> | > | Should I file a bug report against libc0.1-dev or will you take care of it?
> | > I'm confused. What is libc0.1-dev?
> | 
> | 
> |http://packages.debian.org/unstable/libdevel/libc0.1-dev
> | 
> | It's the "libc6-dev" for GNU/kFreeBSD, at least that's how I understand
> | it.
> 
> I see, thanks.  Well if the bug is in the header files supplied by that
> package, please go ahead and file a bug report.

I talked to Aurelien Jarno on IRC and he fixed this issue in svn (an
updated libc0.1 package will soon be uploaded).

I guess the openmpi Debian packages should then depend on the new, fixed
version. I verified that with the fixed version openmpi compiles on
kfreebsd-i386 and kfreebsd-amd64.


> | If you follow my two links and read their headlines, you can see that
> | these are the buildlogs of 1.2.3-3 on kfreebsd, working for i386, but
> | failing for amd64.
> | 
> | This is caused by "wrong" libc headers on kfreebsd, that's why I thought
> | Uwe might want to have a look at it.
> 
> Ok. Back to the initial bug of Open MPI on Debian/kFreeBSD. What exactly is
> the status now?

With the libc0.1 fix (and another small patch for Debian which I'll send soon)
both the kfreebsd-i386 and kfreebsd-amd64 packages build fine.

However, on my systems, both i386 and amd64 still segfault. I'm using
the openmpi Debian packages, version 1.2.3-3.

I'll try the stock tarballs soon, and/or wait for 1.2.4 to see if the
bug is already fixed there...


HTH, Uwe.
-- 
http://www.hermann-uwe.de  | http://www.holsham-traders.de
http://www.crazy-hacks.org | http://www.unmaintained-free-software.org




Re: [OMPI devel] simple compilation error

2007-08-16 Thread George Bosilca
There was a problem with this particular version. Please update, and  
the problem will vanish.


  george.

On Aug 16, 2007, at 4:49 PM, Alexander Margolin wrote:


This question seems so simple - and yet I ask:

I tried following all the steps in the manual:

1) svn co http://svn.open-mpi.org/svn/ompi/trunk ompi
2)  *
3) ./autogen.sh ; ./configure --prefix 
4) make all install

what do I get? The following compilation error:

...
make[2]: Leaving directory `somewhere/ompi/ompi/datatype'
Making all in debuggers
make[2]: Entering directory
`/a/mosna/vol/vol0/aa/alexam02/ompi/ompi/debuggers'
/bin/sh ../../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H  
-I.

-I../../opal/include -I../../orte/include -I../../ompi/include
-I../../opal/mca/paffinity/linux/plpa/src/libplpa   -I../..   -g -Wall
-Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes
-Wstrict-prototypes -Wcomment -pedantic
-Werror-implicit-function-declaration -finline-functions
-fno-strict-aliasing -pthread -g -MT libompitv_la-ompi_dll.lo -MD - 
MP -MF
.deps/libompitv_la-ompi_dll.Tpo -c -o libompitv_la-ompi_dll.lo  
`test -f

'ompi_dll.c' || echo './'`ompi_dll.c
mkdir .libs
 gcc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include
-I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa
-I../.. -g -Wall -Wundef -Wno-long-long -Wsign-compare
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
-Werror-implicit-function-declaration -finline-functions
-fno-strict-aliasing -pthread -g -MT libompitv_la-ompi_dll.lo -MD - 
MP -MF

.deps/libompitv_la-ompi_dll.Tpo -c ompi_dll.c  -fPIC -DPIC -o
.libs/libompitv_la-ompi_dll.o
ompi_dll.c:102: error: initializer element is not constant
make[2]: *** [libompitv_la-ompi_dll.lo] Error 1
make[2]: Leaving directory `somewhere/ompi/ompi/debuggers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/a/mosna/vol/vol0/aa/alexam02/ompi/ompi'
make: *** [all-recursive] Error 1


-Is there a problem with the specific checkout?
-How can I solve/work around the problem?
(tried removing the debuggers directory - error in autogen.sh)

* Then I tried again without the modification and it still produced the 
same error.




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] simple compilation error

2007-08-16 Thread Jeff Squyres
Heh; sorry about that.  This was a problem from a checkin last night  
(a developer accidentally committed something that compiled/worked on  
OSX but didn't compile on Linux); it's been fixed now.  Do an "svn  
up" and you should be ok.
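
For anyone curious what the checkin tripped over: gcc on Linux rejects a file-scope initializer that is not a compile-time constant expression, while some other toolchains are more lenient, which matches the compiled-on-OSX-but-not-Linux symptom. A generic illustration (not the actual contents of ompi_dll.c):

```c
#include <stdio.h>

/* Generic illustration of "initializer element is not constant";
 * this is not the actual code at ompi_dll.c:102. */
static int base = 42;

/* static int copy = base;   <-- rejected at file scope by gcc on Linux */
static int copy;             /* fix: defer the assignment to runtime */

int demo(void)
{
    copy = base;
    return copy;
}
```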




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI svn] svn:open-mpi r15881

2007-08-16 Thread Tim Prins

Jeff Squyres wrote:

On Aug 16, 2007, at 11:48 AM, Tim Prins wrote:


+#define ORTE_RML_TAG_UDAPL  25
+#define ORTE_RML_TAG_OPENIB 26
+#define ORTE_RML_TAG_MVAPI  27

I think that UDAPL, OPENIB, MVAPI should not appear anywhere in the
ORTE layer ...

I tend to agree with you. However, the precedent has been set long ago
to put all these constants in this file (i.e. there is
ORTE_RML_TAG_WIREUP and ORTE_RML_TAG_SM_BACK_FILE_CREATED which are 
only used in OMPI), and it makes sense to have all tags defined in 
one place.


I think George's point is that the names UDAPL, OPENIB, MVAPI are all  
specific to the OMPI layer and refer to specific components.  The  
generic action WIREUP was probably somewhat forgivable, but  
SM_BACK_FILE_CREATED is probably the same kind of abstraction break  
as UDAPL/OPENIB/MVAPI, which is your point.


So you're both right.  :-)  But Tim's falling back on an older (and  
unfortunately bad) precedent.  It would be nice to not extend that  
bad precedent, IMHO...


I really don't care where the constants are defined, but they do need to 
be unique. I think it is easiest if all the constants are stored in one 
file, but if someone else wants to chop them up, that's fine with me. We 
would just have to be more careful when adding new constants to check 
both files.
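
One way to keep such constants collision-free without that manual double-checking is to let the compiler number them: consecutive enumerators in a single enum can never collide. A sketch with illustrative values (not the real rml_types.h assignments):

```c
/* Sketch only: consecutive enumerators are numbered automatically, so
 * tags kept in one enum cannot collide.  Values are illustrative, not
 * the real rml_types.h assignments. */
enum example_rml_tag {
    EX_RML_TAG_WIREUP = 10,   /* explicit base for this block of tags */
    EX_RML_TAG_UDAPL,         /* 11 */
    EX_RML_TAG_OPENIB,        /* 12 */
    EX_RML_TAG_MVAPI,         /* 13 */
    EX_RML_TAG_MAX            /* first free value for the next addition */
};
```

Splitting the tags across files would reintroduce the bookkeeping problem, since separate enums cannot see each other's values.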





If we end up doing the runtime services layer, all the ompi tags would
be defined in the RSL, and this will become moot.


True.  We will need a robust tag reservation system, though, to  
guarantee that every process gets the same tag values (e.g., if udapl  
is available on some nodes but not others, will that cause openib to  
have different values on different nodes?  And so on).
Not really. All that is needed is a list of constants (similar to the 
one in rml_types.h). If an RSL component doesn't like the particular 
constant tag values, it can do whatever it wants in its 
implementation, as long as a message sent on a tag is received on the 
same tag.


Tim


Re: [OMPI devel] Problem with group code

2007-08-16 Thread Tim Prins

Sorry, I pushed the wrong button and sent this before it was ready.

Tim Prins wrote:

Hi folks,

I am running into a problem with the ibm test 'group'. I will try to 
explain what I think is going on, but I do not really understand the 
group code so please forgive me if it is wrong...


The test creates a group based on MPI_COMM_WORLD (group1), and a group 
that has half the procs in group1 (newgroup). Next, all the processes do:


MPI_Group_intersection(newgroup,group1,)

ompi_group_intersection figures out what procs are needed for group2, 
then calls


ompi_group_incl, passing 'newgroup' and ''

This then calls (since I am not using sparse groups) ompi_group_incl_plist

However, ompi_group_incl_plist assumes that the current process is a member 
of the passed group ('newgroup'). Thus when it calls 
ompi_group_peer_lookup on 'newgroup', half of the processes get garbage 
back since they are not in 'newgroup'. In most cases, memory is 
initialized to \0 and things fall through, but we get intermittent 
segfaults in optimized builds.



Here is a patch to an error check which highlights the problem:
Index: group/group.h
===
--- group/group.h   (revision 15869)
+++ group/group.h   (working copy)
@@ -308,7 +308,7 @@
 static inline struct ompi_proc_t* ompi_group_peer_lookup(ompi_group_t
*group, int peer_id)
 {
 #if OMPI_ENABLE_DEBUG
-if (peer_id >= group->grp_proc_count) {
+if (peer_id >= group->grp_proc_count || peer_id < 0) {
 opal_output(0, "ompi_group_lookup_peer: invalid peer index
(%d)", peer_id);
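
The failure mode is easy to reproduce generically: with only the upper-bound check, a negative peer index reads before the start of the proc array and returns whatever happens to be there. A simplified stand-in for the lookup (not the real OMPI types):

```c
#include <stddef.h>

/* Simplified stand-in for ompi_group_peer_lookup -- not the real OMPI
 * types.  Checking both ends of the range turns a silent out-of-bounds
 * read into an explicit failure. */
typedef struct { int rank; } proc_t;

static proc_t *peer_lookup(proc_t *procs, int count, int peer_id)
{
    if (peer_id >= count || peer_id < 0) {
        return NULL;   /* invalid index: fail loudly instead of randomly */
    }
    return &procs[peer_id];
}

/* Returns 1 if the bounds checks behave as expected. */
int check_lookup(void)
{
    proc_t procs[2] = { { 0 }, { 1 } };
    return peer_lookup(procs, 2, 1)  == &procs[1]
        && peer_lookup(procs, 2, -1) == NULL
        && peer_lookup(procs, 2, 2)  == NULL;
}
```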


Thanks,

Tim
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





[OMPI devel] Problem with group code

2007-08-16 Thread Tim Prins

Hi folks,

I am running into a problem with the ibm test 'group'. I will try to 
explain what I think is going on, but I do not really understand the 
group code so please forgive me if it is wrong...


The test creates a group based on MPI_COMM_WORLD (group1), and a group 
that has half the procs in group1 (newgroup). Next, all the processes do:


MPI_Group_intersection(newgroup,group1,)

ompi_group_intersection figures out what procs are needed for group2, 
then calls


ompi_group_incl, passing 'newgroup' and ''

This then calls (since I am not using sparse groups) ompi_group_incl_plist

However, ompi_group_incl_plist assumes that the current process is a member 
of the passed group ('newgroup'). Thus when it calls 
ompi_group_peer_lookup on 'newgroup', half of the processes get garbage 
back since they are not in 'newgroup'. In most cases, memory is 
initialized to \0 and things fall through, but we get intermittent 
segfaults in optimized builds.


In r I have put in a correction to an error check which should help show 
this problem.


Thanks,

Tim