[OMPI devel] Build failure on FreeBSD 7

2008-04-04 Thread Karol Mroz
Hello everyone... it's been some time since I posted here. I pulled the 
latest svn revision (18079) and had some trouble building Open MPI on a 
FreeBSD 7 machine (i386).


Make failed when compiling opal/event/kqueue.c. It appears that FreeBSD
needs sys/types.h, sys/ioctl.h, termios.h and libutil.h included in
order to reference openpty(). I added ifdef/includes for these header
files into kqueue.c and managed to build. Note that I also tried the
latest nightly tarball; that build actually succeeded without any
changes. I'm curious whether anyone else has seen this behavior. A
colleague of mine suggested it could be a FreeBSD autotools issue.
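
For reference, here is a minimal standalone program showing the headers
openpty() wants on FreeBSD, as listed in the FreeBSD openpty(3) man page.
It is only an illustration (not part of Open MPI); link it with -lutil.

/* Illustration only: the headers FreeBSD needs before openpty() can be
 * referenced.  Link with -lutil. */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <libutil.h>   /* openpty() prototype lives here on FreeBSD */
#include <stdio.h>

int main(void)
{
    int master, slave;
    char name[64];

    /* Allocate a pseudo-terminal pair. */
    if (openpty(&master, &slave, name, NULL, NULL) != 0) {
        perror("openpty");
        return 1;
    }
    printf("allocated pty %s\n", name);
    return 0;
}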


Although builds were successful (with modification for the svn build, 
and without modification for the nightly tarball), I tried running a 
simple app locally with 2 processes using the TCP BTL that does a 
non-blocking send/recv. The app simply hung. After attaching gdb to one 
of the 2 processes, the console output (not gdb) reported the following 
output:


[warn] kq_init: detected broken kqueue (failed add); not using error 4 
(Interrupted system call)

: Interrupted system call

I'm including the diff of kqueue.c here for completeness. If anyone 
requires any further information, please let me know.


Thanks.
--
Karol

Index: opal/event/kqueue.c
===================================================================
--- opal/event/kqueue.c (revision 18079)
+++ opal/event/kqueue.c (working copy)
@@ -52,7 +52,17 @@
 #ifdef HAVE_UTIL_H
 #include <util.h>
 #endif
+#ifdef HAVE_SYS_IOCTL_H
+#include <sys/ioctl.h>
+#endif
+#ifdef HAVE_LIBUTIL_H
+#include <libutil.h>
+#endif
+#ifdef HAVE_TERMIOS_H
+#include <termios.h>
+#endif

+
 /* Some platforms apparently define the udata field of struct kevent as
  * intptr_t, whereas others define it as void*.  There doesn't seem to be an
  * easy way to tell them apart via autoconf, so we need to use OS macros. */


Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-04 Thread Ralph H Castain
Okay, I have a partial fix in there now. You'll have to use -mca routed
unity, as I still need to fix it for the routed tree component.

Couple of things:

1. I fixed the --debug flag so it automatically turns on the debug output
from the data server code itself. Now ompi-server will tell you when it is
accessed.

2. Remember, we added an MPI_Info key that specifies whether you want the
data stored locally (on your own mpirun) or globally (on the ompi-server). If
you specify nothing, the precedence built into the code defaults to "local",
so you have to say that the data is to be published "global" if you want to
connect multiple mpiruns.

I believe Jeff wrote all that up somewhere - could be in an email thread,
though. Been too long ago for me to remember... ;-) You can look it up in
the code though as a last resort - it is in
ompi/mca/pubsub/orte/pubsub_orte.c.
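
For illustration, here is a minimal sketch of the publishing side with an
explicit "global" scope. The info key name is an assumption from memory -
check pubsub_orte.c for the exact string - and the service name "my_service"
is arbitrary.

/* Sketch only: publish a port "globally" (on the ompi-server) instead of
 * relying on the "local" default.  The info key name is an assumption --
 * verify it against ompi/mca/pubsub/orte/pubsub_orte.c. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Info info;
    MPI_Comm client;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);

    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");   /* assumed key name */
    MPI_Publish_name("my_service", info, port);

    /* A client under a different mpirun can now find the port with
     * MPI_Lookup_name("my_service", ...) as long as both mpiruns point at
     * the same ompi-server. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    MPI_Comm_disconnect(&client);
    MPI_Unpublish_name("my_service", info, port);
    MPI_Info_free(&info);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}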

Ralph



On 4/4/08 12:55 PM, "Ralph H Castain"  wrote:

> Well, something got borked in here - will have to fix it, so this will
> probably not get done until next week.
> 
> 
> On 4/4/08 12:26 PM, "Ralph H Castain"  wrote:
> 
>> Yeah, you didn't specify the file correctly...plus I found a bug in the code
>> when I looked (out-of-date a little in orterun).
>> 
>> I am updating orterun (commit soon) and will include a better help message
>> about the proper format of the orterun cmd-line option. The syntax is:
>> 
>> -ompi-server uri
>> 
>> or -ompi-server file:filename-where-uri-exists
>> 
>> Problem here is that you gave it a uri of "test", which means nothing. ;-)
>> 
>> Should have it up-and-going soon.
>> Ralph
>> 
>> On 4/4/08 12:02 PM, "Aurélien Bouteiller"  wrote:
>> 
>>> Ralph,
>>> 
>>> I've not been very successful at using ompi-server. I tried this :
>>> 
>>> xterm1$ ompi-server --debug-devel -d --report-uri test
>>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>>> daemon uri NULL
>>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
>>> 
>>> 
>>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>>> Port name:
>>> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
>>> 
>>> xterm3$ mpirun -ompi-server test  -np 1 simple_connect
>>> --
>>> Process rank 0 attempted to lookup from a global ompi_server that
>>> could not be contacted. This is typically caused by either not
>>> specifying the contact info for the server, or by the server not
>>> currently executing. If you did specify the contact info for a
>>> server, please check to see that the server is running and start
>>> it again (or have your sys admin start it) if it isn't.
>>> 
>>> --
>>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>> --
>>> 
>>> 
>>> 
>>> The server code Open_port, and then PublishName. Looks like the
>>> LookupName function cannot reach the ompi-server. The ompi-server in
>>> debug mode does not show any output when a new event occurs (like when
>>> the server is launched). Is there something wrong in the way I use it ?
>>> 
>>> Aurelien
>>> 
>>> Le 3 avr. 08 à 17:21, Ralph Castain a écrit :
 Take a gander at ompi/tools/ompi-server - I believe I put a man page
 in
 there. You might just try "man ompi-server" and see if it shows up.
 
 Holler if you have a question - not sure I documented it very
 thoroughly at
 the time.
 
 
 On 4/3/08 3:10 PM, "Aurélien Bouteiller" 
 wrote:
 
> Ralph,
> 
> 
> I am using trunk. Is there a documentation for ompi-server ? Sounds
> exactly like what I need to fix point 1.
> 
> Aurelien
> 
> Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
>> I guess I'll have to ask the basic question: what version are you
>> using?
>> 
>> If you are talking about the trunk, there no longer is a "universe"
>> concept
>> anywhere in the code. Two mpiruns can connect/accept to each other
>> as long
>> as they can make contact. To facilitate that, we created an "ompi-
>> server"
>> tool that is supposed to be run by the sys-admin (or a user, doesn't
>> matter
>> which) on the head node - there are various ways to tell mpirun
>> how to
>> contact the server, or it can self-discover it.
>> 
>> I have tested publish/lookup pretty thoroughly and it seems to
>> work. I
>> haven't spent much time testing connect/accept except via
>> comm_spawn, which
>> seems to be working. Since that uses the same mechanism, I would
>> have
>> expected connect/accept to work as well.
>> 
>> If you are talking about 1.2

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-04 Thread Ralph H Castain
Well, something got borked in here - will have to fix it, so this will
probably not get done until next week.


On 4/4/08 12:26 PM, "Ralph H Castain"  wrote:

> Yeah, you didn't specify the file correctly...plus I found a bug in the code
> when I looked (out-of-date a little in orterun).
> 
> I am updating orterun (commit soon) and will include a better help message
> about the proper format of the orterun cmd-line option. The syntax is:
> 
> -ompi-server uri
> 
> or -ompi-server file:filename-where-uri-exists
> 
> Problem here is that you gave it a uri of "test", which means nothing. ;-)
> 
> Should have it up-and-going soon.
> Ralph
> 
> On 4/4/08 12:02 PM, "Aurélien Bouteiller"  wrote:
> 
>> Ralph,
>> 
>> I've not been very successful at using ompi-server. I tried this :
>> 
>> xterm1$ ompi-server --debug-devel -d --report-uri test
>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>> daemon uri NULL
>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
>> 
>> 
>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>> Port name:
>> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
>> 
>> xterm3$ mpirun -ompi-server test  -np 1 simple_connect
>> --
>> Process rank 0 attempted to lookup from a global ompi_server that
>> could not be contacted. This is typically caused by either not
>> specifying the contact info for the server, or by the server not
>> currently executing. If you did specify the contact info for a
>> server, please check to see that the server is running and start
>> it again (or have your sys admin start it) if it isn't.
>> 
>> --
>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> --
>> 
>> 
>> 
>> The server code Open_port, and then PublishName. Looks like the
>> LookupName function cannot reach the ompi-server. The ompi-server in
>> debug mode does not show any output when a new event occurs (like when
>> the server is launched). Is there something wrong in the way I use it ?
>> 
>> Aurelien
>> 
>> Le 3 avr. 08 à 17:21, Ralph Castain a écrit :
>>> Take a gander at ompi/tools/ompi-server - I believe I put a man page
>>> in
>>> there. You might just try "man ompi-server" and see if it shows up.
>>> 
>>> Holler if you have a question - not sure I documented it very
>>> thoroughly at
>>> the time.
>>> 
>>> 
>>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" 
>>> wrote:
>>> 
 Ralph,
 
 
 I am using trunk. Is there a documentation for ompi-server ? Sounds
 exactly like what I need to fix point 1.
 
 Aurelien
 
 Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
> I guess I'll have to ask the basic question: what version are you
> using?
> 
> If you are talking about the trunk, there no longer is a "universe"
> concept
> anywhere in the code. Two mpiruns can connect/accept to each other
> as long
> as they can make contact. To facilitate that, we created an "ompi-
> server"
> tool that is supposed to be run by the sys-admin (or a user, doesn't
> matter
> which) on the head node - there are various ways to tell mpirun
> how to
> contact the server, or it can self-discover it.
> 
> I have tested publish/lookup pretty thoroughly and it seems to
> work. I
> haven't spent much time testing connect/accept except via
> comm_spawn, which
> seems to be working. Since that uses the same mechanism, I would
> have
> expected connect/accept to work as well.
> 
> If you are talking about 1.2.x, then the story is totally different.
> 
> Ralph
> 
> 
> 
> On 4/3/08 2:29 PM, "Aurélien Bouteiller" 
> wrote:
> 
>> Hi everyone,
>> 
>> I'm trying to figure out how complete is the implementation of
>> Comm_connect/Accept. I found two problematic cases.
>> 
>> 1) Two different programs are started in two different mpirun. One
>> makes accept, the second one use connect. I would not expect
>> MPI_Publish_name/Lookup_name to work because they do not share the
>> HNP. Still I would expect to be able to connect by copying (with
>> printf-scanf) the port_name string generated by Open_port;
>> especially
>> considering that in Open MPI, the port_name is a string containing
>> the
>> tcp address and port of the rank 0 in the server communicator.
>> However, doing so results in "no route to host" and the connecting
>> application aborts. Is the problem related to an explicit check of
>> the
>> universes on the accept HNP ? Do I ex

Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-04 Thread Edgar Gabriel
Actually, we used LZO a looong time ago with PACX-MPI; it was indeed
faster than zlib. Our findings at that time were, however, similar to what
George mentioned, namely that a benefit from compression was only visible
when the network latency was really high (e.g. multiple ms)...


Thanks
Edgar

Roland Dreier wrote:

 > Based on some discussion on this list, I integrated a zlib-based compression
 > ability into ORTE. Since the launch message sent to the orteds and the modex
 > between the application procs are the only places where messages of any size
 > are sent, I only implemented compression for those two exchanges.
 > 
 > I have found virtually no benefit to the compression. Essentially, the

 > overhead consumed in compression/decompressing the messages pretty much
 > balances out any transmission time differences. However, I could only test
 > this for 64 nodes, 8ppn, so perhaps there is some benefit at larger sizes.

A faster compression library might change the balance... eg LZO
(http://www.oberhumer.com/opensource/lzo) might be worth a look although
I'm not an expert on all the compression libraries that are out there.

 - R.


--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-04 Thread Jeff Squyres

LZO looks cool, but it's unfortunately GPL (Open MPI is BSD).  Bummer.

On Apr 4, 2008, at 2:29 PM, Roland Dreier wrote:
Based on some discussion on this list, I integrated a zlib-based
compression ability into ORTE. Since the launch message sent to the orteds
and the modex between the application procs are the only places where
messages of any size are sent, I only implemented compression for those
two exchanges.

I have found virtually no benefit to the compression. Essentially, the
overhead consumed in compression/decompressing the messages pretty much
balances out any transmission time differences. However, I could only test
this for 64 nodes, 8ppn, so perhaps there is some benefit at larger sizes.


A faster compression library might change the balance... eg LZO
(http://www.oberhumer.com/opensource/lzo) might be worth a look although
I'm not an expert on all the compression libraries that are out there.

- R.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-04 Thread Roland Dreier
 > Based on some discussion on this list, I integrated a zlib-based compression
 > ability into ORTE. Since the launch message sent to the orteds and the modex
 > between the application procs are the only places where messages of any size
 > are sent, I only implemented compression for those two exchanges.
 > 
 > I have found virtually no benefit to the compression. Essentially, the
 > overhead consumed in compression/decompressing the messages pretty much
 > balances out any transmission time differences. However, I could only test
 > this for 64 nodes, 8ppn, so perhaps there is some benefit at larger sizes.

A faster compression library might change the balance... e.g. LZO
(http://www.oberhumer.com/opensource/lzo) might be worth a look, although
I'm not an expert on all the compression libraries that are out there.

 - R.


Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-04 Thread Ralph H Castain
Yeah, you didn't specify the file correctly... plus I found a bug in the
code when I looked (orterun was a little out of date).

I am updating orterun (commit soon) and will include a better help message
about the proper format of the orterun cmd-line option. The syntax is:

-ompi-server uri

or -ompi-server file:filename-where-uri-exists

The problem here is that you gave it a URI of "test", which means nothing. ;-)
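
In other words, with the transcript from your mail, the mpirun commands need
the file: prefix (assuming the URI really was written to the file named
"test"):

xterm1$ ompi-server --debug-devel -d --report-uri test
xterm2$ mpirun -ompi-server file:test -np 1 mpi_accept_test
xterm3$ mpirun -ompi-server file:test -np 1 simple_connect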

Should have it up-and-going soon.
Ralph

On 4/4/08 12:02 PM, "Aurélien Bouteiller"  wrote:

> Ralph,
> 
> I've not been very successful at using ompi-server. I tried this :
> 
> xterm1$ ompi-server --debug-devel -d --report-uri test
> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
> daemon uri NULL
> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
> 
> 
> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
> Port name:
> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
> 
> xterm3$ mpirun -ompi-server test  -np 1 simple_connect
> --
> Process rank 0 attempted to lookup from a global ompi_server that
> could not be contacted. This is typically caused by either not
> specifying the contact info for the server, or by the server not
> currently executing. If you did specify the contact info for a
> server, please check to see that the server is running and start
> it again (or have your sys admin start it) if it isn't.
> 
> --
> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
> --
> 
> 
> 
> The server code Open_port, and then PublishName. Looks like the
> LookupName function cannot reach the ompi-server. The ompi-server in
> debug mode does not show any output when a new event occurs (like when
> the server is launched). Is there something wrong in the way I use it ?
> 
> Aurelien
> 
> Le 3 avr. 08 à 17:21, Ralph Castain a écrit :
>> Take a gander at ompi/tools/ompi-server - I believe I put a man page
>> in
>> there. You might just try "man ompi-server" and see if it shows up.
>> 
>> Holler if you have a question - not sure I documented it very
>> thoroughly at
>> the time.
>> 
>> 
>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" 
>> wrote:
>> 
>>> Ralph,
>>> 
>>> 
>>> I am using trunk. Is there a documentation for ompi-server ? Sounds
>>> exactly like what I need to fix point 1.
>>> 
>>> Aurelien
>>> 
>>> Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
 I guess I'll have to ask the basic question: what version are you
 using?
 
 If you are talking about the trunk, there no longer is a "universe"
 concept
 anywhere in the code. Two mpiruns can connect/accept to each other
 as long
 as they can make contact. To facilitate that, we created an "ompi-
 server"
 tool that is supposed to be run by the sys-admin (or a user, doesn't
 matter
 which) on the head node - there are various ways to tell mpirun
 how to
 contact the server, or it can self-discover it.
 
 I have tested publish/lookup pretty thoroughly and it seems to
 work. I
 haven't spent much time testing connect/accept except via
 comm_spawn, which
 seems to be working. Since that uses the same mechanism, I would
 have
 expected connect/accept to work as well.
 
 If you are talking about 1.2.x, then the story is totally different.
 
 Ralph
 
 
 
 On 4/3/08 2:29 PM, "Aurélien Bouteiller" 
 wrote:
 
> Hi everyone,
> 
> I'm trying to figure out how complete is the implementation of
> Comm_connect/Accept. I found two problematic cases.
> 
> 1) Two different programs are started in two different mpirun. One
> makes accept, the second one use connect. I would not expect
> MPI_Publish_name/Lookup_name to work because they do not share the
> HNP. Still I would expect to be able to connect by copying (with
> printf-scanf) the port_name string generated by Open_port;
> especially
> considering that in Open MPI, the port_name is a string containing
> the
> tcp address and port of the rank 0 in the server communicator.
> However, doing so results in "no route to host" and the connecting
> application aborts. Is the problem related to an explicit check of
> the
> universes on the accept HNP ? Do I expect too much from the MPI
> standard ? Is it because my two applications does not share the
> same
> universe ? Should we (re) add the ability to use the same universe
> for
> several mpirun ?
> 
> 2) Second issue is when the program setup a port, and then accept
> multi

Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-04 Thread George Bosilca

Ralph,

There are several studies about compression and data exchange. A few
years ago we integrated such a mechanism (adaptive compression of
communication) into one of the projects here at ICL (called GridSolve).
The idea was to optimize the network traffic when sending the large
matrices used for computation from a server to specific workers. Under
some (few) circumstances it can improve the network traffic, and according
to the main author it does no harm in the worst case. However, it is still
unclear whether there is any benefit when the data is reasonably small
(which is the case in ORTE).


The project is hosted at http://www.loria.fr/~ejeannot/adoc/adoc.html.
It's a simple drop-in for read/write, so it is fairly simple to integrate.
On the author's webpage you can find some publications about this approach
that highlight its performance.


  george.

PS: A reference to the paper is available in the ACM digital library:

E. Jeannot, B. Knutsson, M. Bjorkmann.
Adaptive Online Data Compression, in: High Performance Distributed
Computing (HPDC'11), Edinburgh, Scotland, IEEE, July 2002.



On Apr 4, 2008, at 12:52 PM, Ralph H Castain wrote:

Hello all

Based on some discussion on this list, I integrated a zlib-based
compression ability into ORTE. Since the launch message sent to the orteds
and the modex between the application procs are the only places where
messages of any size are sent, I only implemented compression for those
two exchanges.

I have found virtually no benefit to the compression. Essentially, the
overhead consumed in compression/decompressing the messages pretty much
balances out any transmission time differences. However, I could only test
this for 64 nodes, 8ppn, so perhaps there is some benefit at larger sizes.

Even though my test size wasn't very big, I did try forcing the worst-case
scenario. I included all available BTL's, and ran the OOB over Ethernet.
Although there was some difference, it wasn't appreciable - easily within
the variations I see on this rather unstable machine.

I invite you to try it yourself. You can get a copy of the code via:

hg clone http://www.open-mpi.org/hg/hgwebdir.cgi/rhc/gather

You will need to configure with LIBS=-lz.

Compression is normally turned "off". You can turn it on by setting:

-mca orte_compress 1

You can also adjust the level of compression:

-mca orte_compress_level [1-9]

If you don't specify the level and select compression, the level will
default to 1. From my tests, this seemed a good compromise. The other
levels provided some small amount of better compression, but took longer.

With compression "on", you will get output telling you the original size
of the message and its compressed size so you can see what was done.

Please let me know what you find out. I would like to reach a decision as
to whether or not compression is worthwhile.

Thanks
Ralph




Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-04 Thread Aurélien Bouteiller

Ralph,

I've not been very successful at using ompi-server. I tried this:

xterm1$ ompi-server --debug-devel -d --report-uri test
[grosse-pomme.local:01097] proc_info: hnp_uri NULL
daemon uri NULL
[grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!


xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
Port name:
2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300

xterm3$ mpirun -ompi-server test  -np 1 simple_connect
--
Process rank 0 attempted to lookup from a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.

--
[grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
[grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
[grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
[grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
--



The server code calls Open_port and then Publish_name. It looks like the
Lookup_name function cannot reach the ompi-server. The ompi-server in
debug mode does not show any output when a new event occurs (like when
the server is launched). Is there something wrong in the way I use it?


Aurelien

Le 3 avr. 08 à 17:21, Ralph Castain a écrit :
Take a gander at ompi/tools/ompi-server - I believe I put a man page in
there. You might just try "man ompi-server" and see if it shows up.

Holler if you have a question - not sure I documented it very thoroughly
at the time.


On 4/3/08 3:10 PM, "Aurélien Bouteiller" wrote:



Ralph,


I am using trunk. Is there a documentation for ompi-server ? Sounds
exactly like what I need to fix point 1.

Aurelien

Le 3 avr. 08 à 17:06, Ralph Castain a écrit :

I guess I'll have to ask the basic question: what version are you
using?

If you are talking about the trunk, there no longer is a "universe"
concept
anywhere in the code. Two mpiruns can connect/accept to each other
as long
as they can make contact. To facilitate that, we created an "ompi-
server"
tool that is supposed to be run by the sys-admin (or a user, doesn't
matter
which) on the head node - there are various ways to tell mpirun how to
contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to work. I
haven't spent much time testing connect/accept except via comm_spawn,
which seems to be working. Since that uses the same mechanism, I would
have expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph



On 4/3/08 2:29 PM, "Aurélien Bouteiller" 
wrote:


Hi everyone,

I'm trying to figure out how complete the implementation of
Comm_connect/Accept is. I found two problematic cases.

1) Two different programs are started under two different mpiruns. One
calls accept, the second one calls connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still, I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port, especially
considering that in Open MPI the port_name is a string containing the
tcp address and port of rank 0 in the server communicator. However,
doing so results in "no route to host" and the connecting application
aborts. Is the problem related to an explicit check of the universes on
the accept HNP? Do I expect too much from the MPI standard? Is it
because my two applications do not share the same universe? Should we
(re)add the ability to use the same universe for several mpiruns?

2) The second issue is when the program sets up a port and then accepts
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever when waiting for the second one.
My understanding of the standard is that it should work: 5.4.2 states
"it must call MPI_Open_port to establish a port [...] it must call
MPI_Comm_accept to accept connections from clients". I understand that
for one MPI_Open_port I should be able to manage several MPI clients.
Am I understanding the standard correctly here, and should we fix this?

Here is a copy of the non-working code for reference.

/*
* Copyright (c) 2004-2007 The Trustees of the University of
Tennessee.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
   char port[MPI_MAX_PORT_NAME];
   int rank;
   int np;


   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI
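
In case it helps, here is a minimal sketch of the pattern point 2 describes
- one MPI_Open_port followed by several MPI_Comm_accept calls. It is written
from the description above rather than taken from the original test program,
so treat it as an illustration only; each client would pass the printed
string to MPI_Comm_connect (via scanf or argv).

/* Sketch only: a server that opens one port and accepts several clients
 * on it, one after the other. */
#include <mpi.h>
#include <stdio.h>

#define NCLIENTS 2

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm clients[NCLIENTS];
    int i;

    MPI_Init(&argc, &argv);

    /* One port for all clients ... */
    MPI_Open_port(MPI_INFO_NULL, port);
    printf("Port name: %s\n", port);
    fflush(stdout);

    /* ... and one accept per client.  Point 2 above is that the second
     * accept stalls, although the standard seems to allow this pattern. */
    for (i = 0; i < NCLIENTS; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients[i]);
        printf("accepted client %d\n", i);
    }

    for (i = 0; i < NCLIENTS; i++) {
        MPI_Comm_disconnect(&clients[i]);
    }

    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}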

[OMPI devel] Affect of compression on modex and launch messages

2008-04-04 Thread Ralph H Castain
Hello all

Based on some discussion on this list, I integrated a zlib-based compression
ability into ORTE. Since the launch message sent to the orteds and the modex
between the application procs are the only places where messages of any size
are sent, I only implemented compression for those two exchanges.

I have found virtually no benefit to the compression. Essentially, the
overhead consumed in compressing/decompressing the messages pretty much
balances out any transmission time differences. However, I could only test
this for 64 nodes, 8ppn, so perhaps there is some benefit at larger sizes.

Even though my test size wasn't very big, I did try forcing the worst-case
scenario. I included all available BTL's, and ran the OOB over Ethernet.
Although there was some difference, it wasn't appreciable - easily within
the variations I see on this rather unstable machine.

I invite you to try it yourself. You can get a copy of the code via:

 hg clone http://www.open-mpi.org/hg/hgwebdir.cgi/rhc/gather

You will need to configure with LIBS=-lz.

Compression is normally turned "off". You can turn it on by setting:

-mca orte_compress 1

You can also adjust the level of compression:

-mca orte_compress_level [1-9]

If you don't specify the level but do select compression, the level will
default to 1. From my tests, this seemed a good compromise: the other levels
provided slightly better compression, but took longer.
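
If you want to see the level trade-off in isolation, here is a tiny
standalone zlib example (plain zlib calls, not the ORTE code). Build it with
the same -lz you configure Open MPI with.

/* Sketch only: compress a buffer with zlib at a chosen level (1-9).
 * Compile with -lz. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *msg = "modex data modex data modex data modex data";
    uLong srclen = (uLong) strlen(msg) + 1;
    uLongf destlen = compressBound(srclen);  /* worst-case output size */
    Bytef *dest = malloc(destlen);
    int level = 1;                 /* same default as orte_compress_level */

    if (compress2(dest, &destlen, (const Bytef *) msg, srclen, level) != Z_OK) {
        fprintf(stderr, "compress2 failed\n");
        return 1;
    }
    printf("original %lu bytes, compressed %lu bytes at level %d\n",
           (unsigned long) srclen, (unsigned long) destlen, level);
    free(dest);
    return 0;
}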

With compression "on", you will get output telling you the original size of
the message and its compressed size so you can see what was done.

Please let me know what you find out. I would like to reach a decision as to
whether or not compression is worthwhile.

Thanks
Ralph




Re: [OMPI devel] init_thread + spawn error

2008-04-04 Thread Tim Prins
Thanks for the report. As Ralph indicated, the threading support in Open
MPI is not good right now, but we are working to make it better.


I have filed a ticket (https://svn.open-mpi.org/trac/ompi/ticket/1267)
so we do not lose track of this issue, and attached a potential fix to
the ticket.
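
In the meantime, a defensive pattern worth using (just a sketch, not the fix
attached to the ticket) is to check what MPI_Init_thread actually granted
before relying on MPI_THREAD_MULTIPLE:

/* Sketch only: request MPI_THREAD_MULTIPLE but verify what was granted
 * before doing anything that depends on full thread support. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* Fall back to single-threaded MPI usage rather than assuming
         * thread support is present. */
        printf("MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
    }

    MPI_Finalize();
    return 0;
}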


Thanks,

Tim

Joao Vicente Lima wrote:

Hi,
I'm getting an error when calling init_thread and comm_spawn in this code:

#include "mpi.h"
#include <stdio.h>

int
main (int argc, char *argv[])
{
  int provided;
  MPI_Comm parentcomm, intercomm;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_get_parent (&parentcomm);

  if (parentcomm == MPI_COMM_NULL)
{
  printf ("spawning ... \n");
  MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
                  MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm,
                  MPI_ERRCODES_IGNORE);
  MPI_Comm_disconnect (&intercomm);
}
  else
  {
printf ("child!\n");
MPI_Comm_disconnect (&parentcomm);
  }

  MPI_Finalize ();
  return 0;
}

and the error is:

spawning ...
opal_mutex_lock(): Resource deadlock avoided
[localhost:18718] *** Process received signal ***
[localhost:18718] Signal: Aborted (6)
[localhost:18718] Signal code:  (-6)
[localhost:18718] [ 0] /lib/libpthread.so.0 [0x2b6e5d9fced0]
[localhost:18718] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b6e5dc3b3c5]
[localhost:18718] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b6e5dc3c73e]
[localhost:18718] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ff]
[localhost:18718] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95601d]
[localhost:18718] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ac]
[localhost:18718] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c956a93]
[localhost:18718] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9569dd]
[localhost:18718] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95797d]
[localhost:18718] [ 9]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
[0x2b6e5c957dd9]
[localhost:18718] [10]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b6e607f05cf]
[localhost:18718] [11]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(MPI_Comm_spawn+0x459)
[0x2b6e5c98ede9]
[localhost:18718] [12] ./spawn1(main+0x7a) [0x400ae2]
[localhost:18718] [13] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e5dc28b74]
[localhost:18718] [14] ./spawn1 [0x4009d9]
[localhost:18718] *** End of error message ***
opal_mutex_lock(): Resource deadlock avoided
[localhost:18719] *** Process received signal ***
[localhost:18719] Signal: Aborted (6)
[localhost:18719] Signal code:  (-6)
[localhost:18719] [ 0] /lib/libpthread.so.0 [0x2b9317a17ed0]
[localhost:18719] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b9317c563c5]
[localhost:18719] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b9317c5773e]
[localhost:18719] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ff]
[localhost:18719] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697101d]
[localhost:18719] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ac]
[localhost:18719] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316971a93]
[localhost:18719] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169719dd]
[localhost:18719] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697297d]
[localhost:18719] [ 9]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
[0x2b9316972dd9]
[localhost:18719] [10]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80b5cf]
[localhost:18719] [11]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80dad7]
[localhost:18719] [12] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316977207]
[localhost:18719] [13]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Init_thread+0x166)
[0x2b93169b8622]
[localhost:18719] [14] ./spawn1(main+0x25) [0x400a8d]
[localhost:18719] [15] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b9317c43b74]
[localhost:18719] [16] ./spawn1 [0x4009d9]
[localhost:18719] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 18719 on node localhost
exited on signal 6 (Aborted).
--

If I change MPI_Init_thread to MPI_Init, everything works.
Any suggestions?
The attachments contain my ompi_info (r18077) and config.log.

thanks in advance,
Joao.



