Re: [OMPI users] Message compression in OpenMPI

2008-04-24 Thread Aurélien Bouteiller
From a pretty old experiment I made, compression was giving good
results on a 10 Mbps network but was actually hurting performance at
100 Mbps and above. I played with all the zlib settings from 1 to 9,
and even the lowest compression setting was unable to reach decent
performance. I don't believe the computing/bandwidth ratio has since
changed to favor compression.
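
For anyone who wants to repeat that kind of experiment at the application
level, here is a minimal compress-before-send / decompress-after-receive
sketch (an illustration only, not an Open MPI feature; it assumes zlib is
available, links with -lz, and omits error handling):

/* Compress a buffer with zlib before MPI_Send, decompress after MPI_Recv.
 * The zlib "level" argument is the same 1..9 knob mentioned above.
 * Compile with e.g.:  mpicc zsend.c -lz */
#include <mpi.h>
#include <zlib.h>
#include <stdlib.h>

/* Send 'count' bytes from 'buf', compressed at the given zlib level (1..9). */
static void send_compressed(const void *buf, int count, int dest, int tag,
                            MPI_Comm comm, int level)
{
    uLongf clen = compressBound(count);
    Bytef *cbuf = malloc(clen);
    compress2(cbuf, &clen, buf, count, level);
    MPI_Send(cbuf, (int)clen, MPI_BYTE, dest, tag, comm);
    free(cbuf);
}

/* Receive a compressed message and inflate it into 'buf' (capacity
 * 'maxcount' bytes).  Returns the uncompressed size. */
static int recv_compressed(void *buf, int maxcount, int src, int tag,
                           MPI_Comm comm)
{
    MPI_Status st;
    int clen;
    MPI_Probe(src, tag, comm, &st);
    MPI_Get_count(&st, MPI_BYTE, &clen);
    Bytef *cbuf = malloc(clen);
    MPI_Recv(cbuf, clen, MPI_BYTE, src, tag, comm, &st);
    uLongf ulen = maxcount;
    uncompress(buf, &ulen, cbuf, clen);
    free(cbuf);
    return (int)ulen;
}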


Aurelien.

On Apr 24, 2008, at 11:06 AM, George Bosilca wrote:

Actually, even in this particular condition (over the internet),
compression makes sense only for very specific data. The problem is
that the compression algorithm is usually very expensive if you really
want to get an interesting factor of size reduction. And there is the
tradeoff: what you save in terms of data transfer you lose in terms of
compression time. In other words, compression becomes interesting in
only 2 scenarios: you have a very congested network (really very, very
congested) or you have a network with a limited bandwidth.


The algorithm used in the paper you cited is fast, but unfortunately
very specific to MPI_DOUBLE, and it only works if the data exhibit the
properties I cited in my previous email. The generic compression
algorithms are at least one order of magnitude slower. And then again,
one needs a very slow network in order to get any benefit from doing
the compression ... And of course slow networks are not exactly the
most common place where you will find MPI applications.


But as Jeff stated in his email, contributions are always welcomed :)

 george.


On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:


George Bosilca wrote:


The paper you cited, while presenting a particular implementation,
doesn't present any new ideas. The compression of data has been studied
for a long time, and [unfortunately] it always came back to the same
result: in the general case, it is not worth the effort!


Now of course, if one limits oneself to very regular applications
(such as the one presented in the paper), where the matrices
involved in the computation are well conditioned (such as in the
paper), and if you only use MPI_DOUBLE (\cite{same_paper}), and
finally if you only expect to run over slow Ethernet (1 Gb/s)
(\cite{same_paper_again})... then yes, one might get some benefit.


Yes, you are probably right that it's not worth the effort in general,
and especially not in HPC environments where you have a very fast network.

But I can think of (rather important) special cases where it is important:

- non-HPC environments with a slow network: Beowulf clusters and/or
  internet + normal PCs, where you use existing workstations and networks
  for the computations.
- communication/IO-bound computations where you transfer
  large, redundant datasets between nodes.

So it would be nice to be able to turn on the compression (for specific
communicators and/or data transfers) when you need it.

--
Tomas

 george.

On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:


Hello

I read from somewhere that OpenMPI supports
some kind of data compression but I couldn't find
any information about it.

Is this true, and how can it be used?

Does anyone have any experience using it?

Is it possible to use compression in just some
subset of communications (communicator
specific compression settings)?

In our MPI application we are transferring large
amounts of sparse/redundant data that compresses
very well. Also my initial tests showed significant
improvements in performance.

There are also articles that suggest that compression
should be used [1].

[1] J. Ke, M. Burtscher and E. Speight.
Runtime Compression of MPI Messages to Improve the
Performance and Scalability of Parallel Applications.


Thanks in advance,
Tomas






Re: [OMPI users] Openmpi (VASP): Signal code: Address not mapped (1)

2008-04-24 Thread Andreas Schäfer
Hi, 

On 10:03 Thu 24 Apr , Steven Truong wrote:
> Could somebody tell me what might cause this error?

I'll try.

> [compute-1-27:31550] *** Process received signal ***
> [compute-1-27:31550] Signal: Segmentation fault (11)
> [compute-1-27:31550] Signal code: Address not mapped (1)

"Address not mapped" means that the program tried to access a memory
location that is not part of the process' address space (e.g. null
pointer).
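
For illustration, a minimal program along these lines typically produces
exactly this kind of report (signal 11, "Address not mapped", failing
address (nil)):

/* Dereferencing a NULL pointer: the address is not mapped into the
 * process' address space, so the kernel delivers SIGSEGV (signal 11)
 * and the Open MPI handler reports "Address not mapped" and
 * "Failing at address: (nil)". */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int *p = 0;   /* NULL */
    *p = 42;      /* segmentation fault happens here */
    MPI_Finalize();
    return 0;
}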

> [compute-1-27:31550] Failing at address: (nil)
> [compute-1-27:31550] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
> [compute-1-27:31550] [ 1]
> /usr/local/bin/vaspopenmpi_scala(__dfast__cnormn+0x18e) [0x4dd0ee]
> [compute-1-27:31550] [ 2]
> /usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x59be) [0x5b11fe]
> [compute-1-27:31550] [ 3]
> /usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
> [compute-1-27:31550] [ 4]
> /usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
> [compute-1-27:31550] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) 
> [0x6ed9ee]
> [compute-1-27:31550] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x34e5f1c3fb]
> [compute-1-27:31550] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
> [compute-1-27:31550] *** End of error message ***
> [compute-1-27:31549] *** Process received signal ***

What follows is a backtrace of the functions currently being executed
(in reverse order, as found on the stack). I'd hazard a guess that
it's not OMPI's fault but VASP's, since the segfault happens in one of
its functions. Maybe you should have a look there.

HTH
-Andi


-- 

Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
PGP/GPG key via keyserver
I'm a bright... http://www.the-brights.net


(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your 
signature to help him gain world domination!




Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Josh Hursey

Tamer,

I'm confident that this particular problem is now fixed in the trunk  
(r18276). If you are interested in the details on the bug and how it  
was fixed the commit message is fairly detailed:

 https://svn.open-mpi.org/trac/ompi/changeset/18276

Let me know if this patch fixes things. Like I said I'm confident that  
it does, but there are always more bugs :)


Thanks again for the bug report.

Cheers,
Josh

On Apr 24, 2008, at 11:02 AM, Josh Hursey wrote:


Tamer,

Another user contacted me off list yesterday with a similar problem
with the current trunk. I have been able to reproduce this, and am
currently trying to debug it again. It seems to occur more often with
builds without the checkpoint thread (--disable-ft-thread). It seems
to be a race in our connection wireup which is why it does not always
occur.

Thank you for your patience as I try to track this down. I'll let you
know as soon as I have a fix.

Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:


Josh, Thank you for your help. I was able to do the following with
r18241:

start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint but failed to restart with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107)
[0xc5df97]
[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 23857 on node
dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).


So, this time the process went further than before. I tested on a
different platform (64 bit machine with fedora core 7) and openmpi
checkpoints and restarts as many times as I want to without any
problems. This means that the issue above must be platform dependent
and I must be missing some option in building the code.

Cheers,
Tamer


On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:


Tamer,

This should now be fixed in r18241.

Though I was able to replicate this bug, it only occurred
sporadically for me. It seemed to be caused by some socket descriptor
caching that was not properly cleaned up by the restart procedure.

My testing appears to conclude that this bug is now fixed, but since
it is difficult to reproduce if you see it happen again definitely
let me know.


With the current trunk you may see the following error message:
--
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--
This is not caused by the checkpoint/restart code, but by some recent
changes to our TCP component. We are working on fixing this, but I
just wanted to give you a heads up in case you see this error. As far
as I can tell it does not interfere with the checkpoint/restart
functionality.

Let me know if this fixes your problem.

Cheers,
Josh


On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:


Tamer,

Just wanted to update you on my progress. I am able to reproduce
something similar to this problem. I am currently working on a
solution to it. I'll let you know when it is available, probably in
the next day or two.

Thank you for the bug report.

Cheers,
Josh

On Apr 18, 2008, at 1:11 PM, Tamer wrote:


Hi Josh:

I am running on linux fedora core 7 kernel: 2.6.23.15-80.fc7

The machine is dual-core with shared memory so it's not even a
cluster.

I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr

when I run mpirun I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully and was able to
checkpoint the restarted job (mpirun showed up with ps -efa | grep
mpirun under r18208) but was unable to restart again; here's the
error message:

ompi-restart ompi_global_snapshot_23865.ckpt
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
Connection to 

Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Danesh Daroui
I just wanted to add my last comment, since this discussion seems to be
very hot! As Jeff mentioned, while a process is waiting to receive a
message it doesn't really matter whether it uses blocking or polling.
What I really meant was that blocking can be useful to free CPU cycles
for other calculations that are supposed to be done by this node, if
OMPI is smart enough to decide such things. Otherwise, because HPC nodes
are usually dedicated nodes, there will be no other tasks running in the
background to be affected by the spinning. Nevertheless, I think that
using blocking instead of busy loops should have higher priority, since
it can save CPU idle cycles at least for OMPI's internal tasks...

D.


Jeff Squyres skrev:
What George said is what I meant by "it's a non-trivial amount of  
work." :-)


In addition to when George adds these patches (allowing components to  
register for blocking progress), there's going to be some work to deal  
with shared memory (we have some ideas here, but it's a bit more than  
just allowing shmem to register to blocking progress) and other random  
issues that will arise.



On Apr 24, 2008, at 11:17 AM, George Bosilca wrote:

  
Well, blocking or not blocking, that is the question !!! Unfortunately,
it's more complex than this thread seems to indicate. It's not that we
didn't want to implement it in Open MPI, it's that at one point we had
to make a choice ... and we decided to always go for performance first.


However, there were some experiments with going blocking, at least when
only TCP is used. Unfortunately, this breaks some other things in Open
MPI, because of our progression model. We are component based, and these
components are allowed to register periodically called callbacks ... and
here "periodically" means as often as possible. There are at least 2
components that use this mechanism for their own progression: romio
(mca/io/romio) and one-sided communications (mca/osc/*). Switching to
blocking mode would break these 2 components completely. This is the
reason why we're not blocking even when only TCP is used.


Anyway, there is a solution. We have to move from poll-based progress
for these components to event-based progress. There were some
discussions, and if I remember well ... everybody's waiting for one of
my patches :) A patch that allows a component to add a completion
callback to MPI requests ... I don't have a clear deadline for this, and
unfortunately I'm a little busy right now ... but I'll work on it asap.


 george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:



On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
  
I am using one of the nodes as a desktop computer. Therefore it is  
most
important for me that the mpi program is not so greedily acquiring  
cpu

time.

This is a kernel scheduling issue, not an OpenMPI issue.  Busy waiting in
one process should not cause noticeable loss of responsiveness in other
processes.  Have you experimented with the "nice" command?

  

But I would imagine that the energy consumption is generally a big
issue, since energy is a major cost factor in a computer cluster.


Yup.

  

When a
cpu is idle, it uses considerably less energy. Last time I checked  
my
computer used 180W when both cpu cores were working and 110W when  
both

cores were idle.


What processor is this?

  
I just made a small hack to solve the problem. I inserted a simple  
sleep

call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
#endif
   } else {
   while (c->c_signaled == 0) {
+   usleep(1000);
   opal_progress();
   }
   }



I expect this would lead to increased execution time for all programs
and increased energy consumption for most programs.  Recall that  
energy

is power multiplied by time.  You're reducing the power on some nodes
and increasing time on all nodes.

  

The usleep call will let the program sleep for about 4 ms (it won't
sleep for a shorter time because of some timer granularity). But  
that is
good enough for me. The cpu usage is (almost) zero when the tasks  
are

waiting for one another.

I think your mistake here is considering CPU load to be a useful  
metric.
It isn't.  Responsiveness is a useful metric, energy is a useful  
metric,

but CPU load isn't a reliable guide to either of these.

  
For a proper implementation you would want to actively poll  
without a
sleep call for a few milliseconds, and then use some other method  
that

sleeps not for a fixed time, but until new messages arrive.

Well, it sounds like you can get to this before I can.  Post your  
patch

here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
synthetic benchmarks I've written.  The cluster I use has multimeters
hooked up so I can 

Re: [OMPI users] install intel mac with Leopard

2008-04-24 Thread Doug Reeder

Jeff,

I don't know if there is a way to capture the "not of required
architecture" response and add it to the error message. I agree that
the current error message captures the problem in broad terms and
points to the config.log file. It is just not very specific. If the
architecture problem can't be added to the error message, then I
think we are stuck with what we have. If that is the case, would it be
worthwhile to add this to the FAQ for building Open MPI?


Doug
On Apr 24, 2008, at 9:34 AM, Jeff Squyres wrote:


On Apr 24, 2008, at 12:24 PM, George Bosilca wrote:


There are so many special errors that are compiler and operating
system dependent that there is no way to handle each of them
specifically. And even if it were possible, I would not use autoconf
if the resulting configure file were 100 MB ...


More specifically, the error messages in config.log are mostly written
by the compiler/linker (i.e., redirect stdout/stderr from the command
line to config.log). We don't usually modify that -- the Autoconf Way
is that Autoconf is 100% responsible for config.log.


Additionally, I think the error message is more than clear. It
clearly states that the problem is coming from a mismatch between the
CFLAGS and FFLAGS. There is even a hint that one has to look in
config.log to find the real cause...


As George specifies, the stdout from configure is what we can most
directly affect, and that's why we chose to output this message:


* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler.  This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.


OMPI doesn't know *why* the test link failed; we just know that it
failed.  I agree with George that trying to put in compiler-specific
stdout/stderr analysis is a black hole that would be extraordinarily
difficult.

Do you have any suggestions for re-wording this message?  That's
probably the best that we can do.




 george.

On Apr 24, 2008, at 11:57 AM, Doug Reeder wrote:


Jeff,

For the specific problem of the gcc compiler creating i386 objects
and ifort creating x86_64 objects, in the config.log file it says

configure:26935: ifort -o conftest conftest.f conftest_c.o >&5 ld:
warning in conftest_c.o, file is not of required architecture

If configure could pick up on this and write an error message
something like "Your C and Fortran compilers are creating objects for
different architectures. You probably need to change your CFLAGS or
FFLAGS arguments to ensure that they are consistent", it would point
the user more directly to the real problem. Right now the information
is in the config.log file but it doesn't jump out at you.

Doug Reeder
On Apr 24, 2008, at 8:40 AM, Jeff Squyres wrote:


On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:


Make sure that your compilers are all creating code for the same
architecture (i386 or x86_64). ifort usually installs such that the
64 bit version of the compiler is the default, while the Apple gcc
compiler creates i386 output by default. Check the architecture of
the .o files with "file *.o", and if the gcc output needs to be x86_64,
add the -m64 flag to the C and C++ flags. That has worked for me.
You shouldn't need the Intel C/C++ compilers. I find the configure
error message to be a little bit cryptic and not very insightful.


Do you have a suggestion for a new configure error message?  I
thought
it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**********************************************************************
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler.  This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: C and Fortran 77 compilers are not link compatible.
Can not continue.




--
Jeff Squyres
Cisco Systems




[OMPI users] Openmpi (VASP): Signal code: Address not mapped (1)

2008-04-24 Thread Steven Truong
Hi.  I recently encountered this error and can not really understand
what this means.  I googled and could not find any relevant
information.  Could somebody tell me what might cause this error?

Our systems:  Rocks 4.3 x86_64, openmpi-1.2.5, scalapack-1.8.0,
Barcelona, Gigabit interconnections.

Thank you very much.


ERROR MESSAGE:
[compute-1-27:31550] *** Process received signal ***
[compute-1-27:31550] Signal: Segmentation fault (11)
[compute-1-27:31550] Signal code: Address not mapped (1)
[compute-1-27:31550] Failing at address: (nil)
[compute-1-27:31550] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
[compute-1-27:31550] [ 1]
/usr/local/bin/vaspopenmpi_scala(__dfast__cnormn+0x18e) [0x4dd0ee]
[compute-1-27:31550] [ 2]
/usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x59be) [0x5b11fe]
[compute-1-27:31550] [ 3]
/usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
[compute-1-27:31550] [ 4]
/usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
[compute-1-27:31550] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) [0x6ed9ee]
[compute-1-27:31550] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x34e5f1c3fb]
[compute-1-27:31550] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
[compute-1-27:31550] *** End of error message ***
[compute-1-27:31549] *** Process received signal ***
[compute-1-27:31549] Signal: Segmentation fault (11)
[compute-1-27:31549] Signal code: Address not mapped (1)
[compute-1-27:31549] Failing at address: (nil)
[compute-1-27:31549] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
[compute-1-27:31549] [ 1]
/usr/local/bin/vaspopenmpi_scala(__dfast__cnorma+0x1e4) [0x4dd884]
[compute-1-27:31549] [ 2]
/usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x6dbd) [0x5b25fd]
[compute-1-27:31549] [ 3]
/usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
[compute-1-27:31549] [ 4]
/usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
[compute-1-27:31549] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) [0x6ed9ee]
[compute-1-27:31549] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x34e5f1c3fb]
[compute-1-27:31549] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
[compute-1-27:31549] *** End of error message ***
mpiexec noticed that job rank 0 with PID 31544 on node
compute-1-27.local exited on signal 15 (Terminated).


Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Alberto Giannetti


On Apr 24, 2008, at 9:09 AM, Adrian Knoth wrote:

On Thu, Apr 24, 2008 at 08:25:44AM -0400, Alberto Giannetti wrote:


I am using one of the nodes as a desktop computer. Therefore it is
most important for me that the mpi program is not so greedily
acquiring cpu time.



From a performance/usability standpoint, you could set interactive
applications at a higher priority to guarantee that your desktop
applications work as expected.


What you really mean is to renice the MPI program to 10 or even 19.


Linux has also a Posix real-time scheduling mode (priocntl).



It's usually not a good idea to raise the priority of any program below
0, as this could lock up your machine (that's why nice levels below 0
are reserved for privileged users (root)).

(Note that lower nice levels actually mean higher priority, just to
avoid confusion. I guess I don't have to mention "man nice" on a
technical mailing list.)

Anyway, I suggest you set mpi_yield_when_idle=1 in your mca-params.conf.
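
For example, that can go in the per-user file $HOME/.openmpi/mca-params.conf
as the line "mpi_yield_when_idle = 1", or it can be passed for a single run
as "mpirun --mca mpi_yield_when_idle 1 ...". Note that this causes the
progress loop to yield the processor between polls rather than truly block.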



--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de





Re: [OMPI users] install intel mac with Leopard

2008-04-24 Thread Doug Reeder

Jeff,

For the specific problem of the gcc compiler creating i386 objects  
and ifort creating x86_64 objects, in the config.log file it says


configure:26935: ifort -o conftest conftest.f conftest_c.o >&5 ld:
warning in conftest_c.o, file is not of required architecture


If configure could pick up on this and write an error message  
something like "Your C and fortran compilers are creating objects for  
different architectures. You probably need to change your CFLAG or  
FFLAG arguments to ensure that they are consistent" it would point  
the user more directly to the real problem. Right now the information  
is in the config.log file but it doesn't jump out at you.


Doug Reeder
On Apr 24, 2008, at 8:40 AM, Jeff Squyres wrote:


On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:


Make sure that your compilers are all creating code for the same
architecture (i386 or x86_64). ifort usually installs such that the
64 bit version of the compiler is the default, while the Apple gcc
compiler creates i386 output by default. Check the architecture of
the .o files with "file *.o", and if the gcc output needs to be x86_64,
add the -m64 flag to the C and C++ flags. That has worked for me.
You shouldn't need the Intel C/C++ compilers. I find the configure
error message to be a little bit cryptic and not very insightful.


Do you have a suggestion for a new configure error message?  I thought
it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**********************************************************************
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler.  This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: C and Fortran 77 compilers are not link compatible.
Can not continue.




--
Jeff Squyres
Cisco Systems





Re: [OMPI users] Proper use of sigaction in Open MPI?

2008-04-24 Thread Ralph H Castain
I have never tested this before, so I could be wrong. However, my best guess
is that the following is happening:

1. you trap the signal and do your cleanup. However, when your proc now
exits, it does not exit with a status of "terminated-by-signal". Instead, it
exits normally.

2. the local daemon sees the proc exit, but since it exit'd normally, it
takes no action to abort the job. Hence, mpirun has no idea that anything
"wrong" has happened, nor that it should do anything about it.

3. if you re-raise the signal, the proc now exits with
"terminated-by-signal", so the abort procedure works as intended.

Since you call mpi_finalize before leaving, even the upcoming 1.3 release
would be "fooled" by this behavior. It will again think that the proc exit'd
normally, and happily wait for all the procs to "complete".

Now, if -all- of your procs receive this signal and terminate, then the
system should shutdown. But I gather from your note that this isn't the case
- that only a subset, perhaps only one, of the procs is taking this action?

If all of the procs are exiting, then it is possible that there is a bug in
the 1.2 release that is getting confused by the signals. Mpirun does trap
SIGTERM to order a clean abort of all procs, so it is possible that a race
condition is getting activated and causing mpirun to hang. Unfortunately,
that can happen in the 1.2 series. The 1.3 release should be more robust in
that regard.

I don't think what you are doing will cause any horrid problems. Like I
said, I have never tried something like this, so I might be surprised.

But if your job cleans up the way you want, I certainly wouldn't worry about
it. At the worst, there might be some dangling tmp files from Open MPI.
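
For reference, the "clean up, restore the old handler, re-raise" pattern
from point 3 boils down to a handler like this (a sketch along the lines of
the snippet quoted below; error handling omitted):

#include <signal.h>

static struct sigaction sa_old_term;   /* saved when the handler is installed */

static void term_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)info; (void)ctx;
    /* ... clean up partial files here ... */
    sigaction(SIGTERM, &sa_old_term, NULL);  /* put the previous handler back */
    raise(sig);   /* re-raise so the proc exits "terminated-by-signal" */
}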

Ralph



On 4/24/08 8:51 AM, "Jeff Squyres (jsquyres)"  wrote:

> Thoughts?
> 
> Is this a "fixed in 1.3" issue?
> 
> -jms
> Sent from my PDA.  No type good.
> 
>  -Original Message-
> From:   Keller, Jesse [mailto:jesse.kel...@roche.com]
> Sent:   Thursday, April 24, 2008 09:35 AM Eastern Standard Time
> To: us...@open-mpi.org
> Subject:[OMPI users] Proper use of sigaction in Open MPI?
> 
> Hello, all -
> 
> 
> 
> I have an OpenMPI application that generates a file while it runs.  No big
> deal.  However, I'd like to delete the partial file if the job is aborted via
> a user signal.  In a non-MPI application, I'd use sigaction to intercept the
> SIGTERM and delete the open files there.  I'd then call the "old" signal
> handler.   When I tried this with my OpenMPI program, the signal was caught,
> the files deleted, the processes exited, but the MPI exec command as a whole
> did not exit.   This is the technique, by the way, that was described in this
> IBM MPI document:
> 
> 
> 
> http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe.doc/pe_linux42/am106l0037.html
> 
> 
> 
> My question is, what is the "right" way to do this under OpenMPI?  The only
> way I got the thing to work was by resetting the sigaction to the old handler
> and re-raising the signal.  It seems to work, but I want to know if I am going
> to get "bit" by this.  Specifically, am I "closing" MPI correctly by doing
> this?
> 
> 
> 
> I am running OpenMPI 1.2.5 under Fedora 8 on Linux in a x86_64 environment.
> My compiler is gcc 4.1.2.  This behavior happens when all processes are
> running on the same node using shared memory and between nodes when using TCP
> transport.  I don't have access to any other transport.
> 
> 
> 
> Thanks for your help.
> 
> 
> 
> Jesse Keller
> 
> 454 Life Sciences
> 
> 
> 
> Here's a code snippet to demonstrate what I'm talking about.
> 
> 
> 
> --
> --
> 
> 
> 
> struct sigaction sa_old_term;  /* Global. */
> 
> 
> 
> void
> 
> SIGTERM_handler(int signal , siginfo_t * siginfo , void * a)
> 
> {
> 
> UnlinkOpenedFiles(); /* Global function to delete partial files. */
> 
> /* The commented code doesn't work. */
> 
> //if (sa_old_term.sa_sigaction)
> 
> //{
> 
> //  sa_old_term.sa_flags =SA_SIGINFO;
> 
> //  (*sa_old_term.sa_sigaction)(signal,siginfo,a);
> 
> //}
> 
> sigaction(SIGTERM, &sa_old_term, NULL);
> 
> raise(signal);
> 
> }
> 
> 
> 
> int main( int argc, char * argv[])
> 
> {
> 
> MPI::Init(argc, argv);
> 
>
> 
> struct sigaction sa_term;
> 
> sigemptyset(&sa_term.sa_mask);
> 
> sa_term.sa_flags = SA_SIGINFO;
> 
> sa_term.sa_sigaction = SIGTERM_handler;
> 
> sigaction(SIGTERM, &sa_term, &sa_old_term);
> 
> 
> 
>doSomeMPIComputation();
> 
>MPI::Finalize();
> 
>return 0;
> 
> }
> 
> 
> 
> 
> 






Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Ingo Josopait


Barry Rountree schrieb:
> On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
>> I am using one of the nodes as a desktop computer. Therefore it is most
>> important for me that the mpi program is not so greedily acquiring cpu
>> time. 
> 
> This is a kernel scheduling issue, not an OpenMPI issue.  Busy waiting in
> one process should not cause noticeable loss of responsiveness in other
> processes.  Have you experimented with the "nice" command?

I don't think that is a kernel issue. In the current OpenMPI
implementation, when MPI is waiting for new messages, it simply spins in
a loop until they arrive. The kernel then has no way to know whether the
program is actually doing some useful calculation or whether it is
simply busy waiting. If, on the other hand, MPI told the kernel that it
is waiting for new messages, the kernel could schedule its CPU time more
efficiently to background programs, or issue an idle call if no other
program is running (which would lower the energy consumption).

> 
>> But I would imagine that the energy consumption is generally a big
>> issue, since energy is a major cost factor in a computer cluster. 
> 
> Yup.  
> 
>> When a
>> cpu is idle, it uses considerably less energy. Last time I checked my
>> computer used 180W when both cpu cores were working and 110W when both
>> cores were idle.
> 
> What processor is this?

Athlon X2 6000+ (3 Ghz)

> 
>> I just made a small hack to solve the problem. I inserted a simple sleep
>> call into the function 'opal_condition_wait':
>>
>> --- orig/openmpi-1.2.6/opal/threads/condition.h
>> +++ openmpi-1.2.6/opal/threads/condition.h
>> @@ -78,6 +78,7 @@
>>  #endif
>>  } else {
>>  while (c->c_signaled == 0) {
>> +   usleep(1000);
>>  opal_progress();
>>  }
>>  }
>>
> 
> I expect this would lead to increased execution time for all programs
> and increased energy consumption for most programs.  Recall that energy
> is power multiplied by time.  You're reducing the power on some nodes
> and increasing time on all nodes.  
> 
>> The usleep call will let the program sleep for about 4 ms (it won't
>> sleep for a shorter time because of some timer granularity). But that is
>> good enough for me. The cpu usage is (almost) zero when the tasks are
>> waiting for one another.
> 
> I think your mistake here is considering CPU load to be a useful metric.
> It isn't.  Responsiveness is a useful metric, energy is a useful metric,
> but CPU load isn't a reliable guide to either of these.  
> 
>> For a proper implementation you would want to actively poll without a
>> sleep call for a few milliseconds, and then use some other method that
>> sleeps not for a fixed time, but until new messages arrive.
> 
> Well, it sounds like you can get to this before I can.  Post your patch
> here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
> synthetic benchmarks I've written.  The cluster I use has multimeters
> hooked up so I can also let you know how much energy is being saved.
> 
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
> 


Here is now a slightly more sophisticated patch:

--- orig/openmpi-1.2.6/opal/threads/condition.h 2006-11-09
19:53:32.0 +0100
+++ openmpi-1.2.6/opal/threads/condition.h  2008-04-24
17:15:29.0 +0200
@@ -77,7 +77,11 @@
 }
 #endif
 } else {
+int nosleep_counter = 30;
 while (c->c_signaled == 0) {
+if (--nosleep_counter < 0) {
+usleep(1000);
+}
 opal_progress();
 }
 }


It will actively poll for a short time (0.1 seconds on my 2 GHz Athlon64
laptop; this may be adjusted by choosing a different number than 30),
and after that it will sleep for about 4 ms in each loop cycle.

You may test it. It should not increase the latency by much. The cpu
usage (as displayed by 'top') is nearly zero when waiting for new data,
and judging from the noise level of my laptop fan, the cpu uses far less
power.

A better solution would certainly be to use some other blocking
mechanism, but as others have said in this thread, this seems to be a
bit less trivial.
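
As a purely application-level approximation of the same idea (no patching
of Open MPI at all), one can spin on MPI_Iprobe for a while and then back
off with short sleeps before posting the receive. A sketch, with SPIN_ITERS
and the 1 ms sleep chosen arbitrarily:

#include <mpi.h>
#include <unistd.h>

#define SPIN_ITERS 100000   /* arbitrary: how long to spin before backing off */

/* Wait for a matching message: spin on MPI_Iprobe for a while, then
 * sleep ~1 ms between probes (subject to the same timer granularity
 * discussed in this thread). */
static void wait_for_message(int src, int tag, MPI_Comm comm, MPI_Status *st)
{
    int flag = 0;
    long iters = 0;
    while (!flag) {
        MPI_Iprobe(src, tag, comm, &flag, st);
        if (!flag && ++iters > SPIN_ITERS)
            usleep(1000);
    }
}

Once wait_for_message() returns, the matching MPI_Recv completes
immediately; the usleep granularity caveat mentioned above applies here too.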



Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Jeff Squyres
What George said is what I meant by "it's a non-trivial amount of  
work." :-)


In addition to when George adds these patches (allowing components to  
register for blocking progress), there's going to be some work to deal  
with shared memory (we have some ideas here, but it's a bit more than  
just allowing shmem to register to blocking progress) and other random  
issues that will arise.



On Apr 24, 2008, at 11:17 AM, George Bosilca wrote:

Well, blocking or not blocking, that is the question !!! Unfortunately,
it's more complex than this thread seems to indicate. It's not that we
didn't want to implement it in Open MPI, it's that at one point we had
to make a choice ... and we decided to always go for performance first.


However, there were some experiments with going blocking, at least when
only TCP is used. Unfortunately, this breaks some other things in Open
MPI, because of our progression model. We are component based, and these
components are allowed to register periodically called callbacks ... and
here "periodically" means as often as possible. There are at least 2
components that use this mechanism for their own progression: romio
(mca/io/romio) and one-sided communications (mca/osc/*). Switching to
blocking mode would break these 2 components completely. This is the
reason why we're not blocking even when only TCP is used.


Anyway, there is a solution. We have to move from poll-based progress
for these components to event-based progress. There were some
discussions, and if I remember well ... everybody's waiting for one of
my patches :) A patch that allows a component to add a completion
callback to MPI requests ... I don't have a clear deadline for this, and
unfortunately I'm a little busy right now ... but I'll work on it asap.


 george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:


On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
I am using one of the nodes as a desktop computer. Therefore it is  
most
important for me that the mpi program is not so greedily acquiring  
cpu

time.


This is a kernel scheduling issue, not an OpenMPI issue.  Busy waiting in
one process should not cause noticeable loss of responsiveness in other
processes.  Have you experimented with the "nice" command?


But I would imagine that the energy consumption is generally a big
issue, since energy is a major cost factor in a computer cluster.


Yup.


When a
cpu is idle, it uses considerably less energy. Last time I checked  
my
computer used 180W when both cpu cores were working and 110W when  
both

cores were idle.


What processor is this?



I just made a small hack to solve the problem. I inserted a simple  
sleep

call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
#endif
   } else {
   while (c->c_signaled == 0) {
+   usleep(1000);
   opal_progress();
   }
   }



I expect this would lead to increased execution time for all programs
and increased energy consumption for most programs.  Recall that  
energy

is power multiplied by time.  You're reducing the power on some nodes
and increasing time on all nodes.


The usleep call will let the program sleep for about 4 ms (it won't
sleep for a shorter time because of some timer granularity). But  
that is
good enough for me. The cpu usage is (almost) zero when the tasks  
are

waiting for one another.


I think your mistake here is considering CPU load to be a useful  
metric.
It isn't.  Responsiveness is a useful metric, energy is a useful  
metric,

but CPU load isn't a reliable guide to either of these.

For a proper implementation you would want to actively poll  
without a
sleep call for a few milliseconds, and then use some other method  
that

sleeps not for a fixed time, but until new messages arrive.


Well, it sounds like you can get to this before I can.  Post your  
patch

here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
synthetic benchmarks I've written.  The cluster I use has multimeters
hooked up so I can also let you know how much energy is being saved.

Barry Rountree
Ph.D. Candidate, Computer Science
University of Georgia







--
Jeff Squyres
Cisco Systems



Re: [OMPI users] install intel mac with Leopard

2008-04-24 Thread Jeff Squyres

On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:

Make sure that your compilers are all creating code for the same
architecture (i386 or x86_64). ifort usually installs such that the
64 bit version of the compiler is the default, while the Apple gcc
compiler creates i386 output by default. Check the architecture of
the .o files with "file *.o", and if the gcc output needs to be x86_64,
add the -m64 flag to the C and C++ flags. That has worked for me.
You shouldn't need the Intel C/C++ compilers. I find the configure
error message to be a little bit cryptic and not very insightful.


Do you have a suggestion for a new configure error message?  I thought  
it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**********************************************************************
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler.  This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: C and Fortran 77 compilers are not link compatible.
Can not continue.




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread George Bosilca
Well, blocking or not blocking, that is the question !!! Unfortunately,
it's more complex than this thread seems to indicate. It's not that we
didn't want to implement it in Open MPI, it's that at one point we had
to make a choice ... and we decided to always go for performance first.


However, there were some experiments with going blocking, at least when
only TCP is used. Unfortunately, this breaks some other things in Open
MPI, because of our progression model. We are component based, and these
components are allowed to register periodically called callbacks ... and
here "periodically" means as often as possible. There are at least 2
components that use this mechanism for their own progression: romio
(mca/io/romio) and one-sided communications (mca/osc/*). Switching to
blocking mode would break these 2 components completely. This is the
reason why we're not blocking even when only TCP is used.


Anyway, there is a solution. We have to move from poll-based progress
for these components to event-based progress. There were some
discussions, and if I remember well ... everybody's waiting for one of
my patches :) A patch that allows a component to add a completion
callback to MPI requests ... I don't have a clear deadline for this, and
unfortunately I'm a little busy right now ... but I'll work on it asap.


  george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:


On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
I am using one of the nodes as a desktop computer. Therefore it is  
most
important for me that the mpi program is not so greedily acquiring  
cpu

time.


This is a kernel scheduling issue, not an OpenMPI issue.  Busy waiting in
one process should not cause noticeable loss of responsiveness in other
processes.  Have you experimented with the "nice" command?


But I would imagine that the energy consumption is generally a big
issue, since energy is a major cost factor in a computer cluster.


Yup.


When a
cpu is idle, it uses considerably less energy. Last time I checked my
computer used 180W when both cpu cores were working and 110W when  
both

cores were idle.


What processor is this?



I just made a small hack to solve the problem. I inserted a simple  
sleep

call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
#endif
} else {
while (c->c_signaled == 0) {
+   usleep(1000);
opal_progress();
}
}



I expect this would lead to increased execution time for all programs
and increased energy consumption for most programs.  Recall that  
energy

is power multiplied by time.  You're reducing the power on some nodes
and increasing time on all nodes.


The usleep call will let the program sleep for about 4 ms (it won't
sleep for a shorter time because of some timer granularity). But  
that is

good enough for me. The cpu usage is (almost) zero when the tasks are
waiting for one another.


I think your mistake here is considering CPU load to be a useful  
metric.
It isn't.  Responsiveness is a useful metric, energy is a useful  
metric,

but CPU load isn't a reliable guide to either of these.


For a proper implementation you would want to actively poll without a
sleep call for a few milliseconds, and then use some other method  
that

sleeps not for a fixed time, but until new messages arrive.


Well, it sounds like you can get to this before I can.  Post your  
patch

here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
synthetic benchmarks I've written.  The cluster I use has multimeters
hooked up so I can also let you know how much energy is being saved.

Barry Rountree
Ph.D. Candidate, Computer Science
University of Georgia










Re: [OMPI users] Message compression in OpenMPI

2008-04-24 Thread George Bosilca
Actually, even in this particular condition (over the internet),
compression makes sense only for very specific data. The problem is
that the compression algorithm is usually very expensive if you really
want to get an interesting factor of size reduction. And there is the
tradeoff: what you save in terms of data transfer you lose in terms of
compression time. In other words, compression becomes interesting in
only 2 scenarios: you have a very congested network (really very, very
congested) or you have a network with a limited bandwidth.
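
(As a rough illustration, with numbers chosen purely for the sake of
argument: pushing 10 MB over a 100 Mb/s link takes about 0.8 s, so a 2:1
compressor saves roughly 0.4 s of transfer time and only pays off if
compressing plus decompressing those 10 MB costs well under 0.4 s, i.e.
runs at well over 25 MB/s end to end. On a 10 Gb/s fabric the same saving
shrinks to about 4 ms, which essentially no general-purpose compressor
can beat.)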


The algorithm used in the paper you cited is fast, but unfortunately
very specific to MPI_DOUBLE, and it only works if the data exhibit the
properties I cited in my previous email. The generic compression
algorithms are at least one order of magnitude slower. And then again,
one needs a very slow network in order to get any benefit from doing
the compression ... And of course slow networks are not exactly the
most common place where you will find MPI applications.


But as Jeff stated in his email, contributions are always welcomed :)

  george.


On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:


George Bosilca wrote:


The paper you cited, while presenting a particular implementation,
doesn't present any new ideas. The compression of data has been studied
for a long time, and [unfortunately] it always came back to the same
result: in the general case, it is not worth the effort!


Now of course, if one limits oneself to very regular applications
(such as the one presented in the paper), where the matrices
involved in the computation are well conditioned (such as in the
paper), and if you only use MPI_DOUBLE (\cite{same_paper}), and
finally if you only expect to run over slow Ethernet (1 Gb/s)
(\cite{same_paper_again})... then yes, one might get some benefit.


Yes, you are probably right that it's not worth the effort in general,
and especially not in HPC environments where you have a very fast network.

But I can think of (rather important) special cases where it is important:

- non-HPC environments with a slow network: Beowulf clusters and/or
  internet + normal PCs, where you use existing workstations and networks
  for the computations.
- communication/IO-bound computations where you transfer
  large, redundant datasets between nodes.

So it would be nice to be able to turn on the compression (for specific
communicators and/or data transfers) when you need it.

--
Tomas

  george.

On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:


Hello

I read from somewhere that OpenMPI supports
some kind of data compression but I couldn't find
any information about it.

Is this true, and how can it be used?

Does anyone have any experience using it?

Is it possible to use compression in just some
subset of communications (communicator
specific compression settings)?

In our MPI application we are transferring large
amounts of sparse/redundant data that compresses
very well. Also my initial tests showed significant
improvements in performance.

There are also articles that suggest that compression
should be used [1].

[1] J. Ke, M. Burtscher and E. Speight.
Runtime Compression of MPI Messages to Improve the
Performance and Scalability of Parallel Applications.


Thanks in advance,
Tomas







Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Josh Hursey

Tamer,

Another user contacted me off list yesterday with a similar problem  
with the current trunk. I have been able to reproduce this, and am  
currently trying to debug it again. It seems to occur more often with  
builds without the checkpoint thread (--disable-ft-thread). It seems  
to be a race in our connection wireup which is why it does not always  
occur.


Thank you for your patience as I try to track this down. I'll let you  
know as soon as I have a fix.


Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:


Josh, Thank you for your help. I was able to do the following with
r18241:

start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint but failed to restart with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107)
[0xc5df97]
[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 23857 on node
dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).


So, this time the process went further than before. I tested on a
different platform (64 bit machine with fedora core 7) and openmpi
checkpoints and restarts as many times as I want to without any
problems. This means that the issue above must be platform dependent
and I must be missing some option in building the code.

Cheers,
Tamer


On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:


Tamer,

This should now be fixed in r18241.

Though I was able to replicate this bug, it only occurred
sporadically for me. It seemed to be caused by some socket descriptor
caching that was not properly cleaned up by the restart procedure.

My testing appears to conclude that this bug is now fixed, but since
it is difficult to reproduce if you see it happen again definitely
let me know.


With the current trunk you may see the following error message:
--
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--
This is not caused by the checkpoint/restart code, but by some recent
changes to our TCP component. We are working on fixing this, but I
just wanted to give you a heads up in case you see this error. As far
as I can tell it does not interfere with the checkpoint/restart
functionality.

Let me know if this fixes your problem.

Cheers,
Josh


On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:


Tamer,

Just wanted to update you on my progress. I am able to reproduce
something similar to this problem. I am currently working on a
solution to it. I'll let you know when it is available, probably in
the next day or two.

Thank you for the bug report.

Cheers,
Josh

On Apr 18, 2008, at 1:11 PM, Tamer wrote:


Hi Josh:

I am running on linux fedora core 7 kernel: 2.6.23.15-80.fc7

The machine is dual-core with shared memory so it's not even a
cluster.

I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr

when I run mpirun I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully and was able to
checkpoint the restarted job (mpirun showed up with ps -efa | grep
mpirun under r18208) but was unable to restart again; here's the
error message:

ompi-restart ompi_global_snapshot_23865.ckpt
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
Connection to lifeline 

Re: [OMPI users] PubSub and MPI

2008-04-24 Thread Jeff Squyres (jsquyres)
Additionally, the MPI-2 spec has some accept/connect examples in the dynamic
processes chapter.

-jms
Sent from my PDA.  No type good.

 -Original Message-
From:   Tim Prins [mailto:tpr...@open-mpi.org]
Sent:   Thursday, April 24, 2008 09:33 AM Eastern Standard Time
To: Open MPI Users
Subject:Re: [OMPI users] PubSub and MPI

Open MPI ships with a full set of man pages for all the MPI functions, 
you might want to start with those.

Tim

Alberto Giannetti wrote:
> I am looking to use MPI in a publisher/subscriber context. Haven't  
> found much relevant information online.
> Basically I would need to deal with dynamic tag subscriptions from  
> independent components (connectors) and a number of other issues. I  
> can provide more details if there is an interest. Am also looking for  
> more information on these calls:
> 
> MPI_Open_port
> MPI_Publish_name
> MPI_Comm_spawn_multiple
> 
> Any code example or snapshot would be great.


Re: [OMPI users] PubSub and MPI

2008-04-24 Thread Tim Prins
Open MPI ships with a full set of man pages for all the MPI functions;
you might want to start with those.


Tim

Alberto Giannetti wrote:
I am looking to use MPI in a publisher/subscriber context. Haven't  
found much relevant information online.
Basically I would need to deal with dynamic tag subscriptions from  
independent components (connectors) and a number of other issues. I  
can provide more details if there is an interest. Am also looking for  
more information on these calls:


MPI_Open_port
MPI_Publish_name
MPI_Comm_spawn_multiple

Any code example or snapshot would be great.


Re: [OMPI users] Message compression in OpenMPI

2008-04-24 Thread Jeff Squyres

On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:

Yes, you are probably right that it's not worth the effort in general,
and especially not in HPC environments where you have a very fast network.

But I can think of (rather important) special cases where it is important:

- non-HPC environments with a slow network: Beowulf clusters and/or
  internet + normal PCs, where you use existing workstations and networks
  for the computations.
- communication/IO-bound computations where you transfer
  large, redundant datasets between nodes.

So it would be nice to be able to turn on the compression (for specific
communicators and/or data transfers) when you need it.


Quite possibly so.  Note that there are a few proposals going on in  
MPI-2.2/MPI-3 about how to pass "hints" or "assertions" to the MPI  
implementation.  Compression could be one of these hints -- the MPI  
may not be able to detect that it's in a situation that is favorable  
for compression, so having the user/app tell it "use compression on  
this communicator" could be helpful.


Would you be willing to contribute the work to Open MPI to enable  
compression?  Per a post yesterday (http://www.open-mpi.org/community/lists/users/2008/04/5473.php 
), contributions are always welcome.


--
Jeff Squyres
Cisco Systems



[OMPI users] PubSub and MPI

2008-04-24 Thread Alberto Giannetti
I am looking to use MPI in a publisher/subscriber context. Haven't  
found much relevant information online.
Basically I would need to deal with dynamic tag subscriptions from  
independent components (connectors) and a number of other issues. I  
can provide more details if there is an interest. Am also looking for  
more information on these calls:


MPI_Open_port
MPI_Publish_name
MPI_Comm_spawn_multiple

Any code example or snapshot would be great.


Re: [OMPI users] Message compression in OpenMPI

2008-04-24 Thread Tomas Ukkonen
George Bosilca wrote:
> The paper you cited, while presenting a particular implementation
> doesn't present any new ideas. The compression of the data was
> studied for a long time, and [unfortunately] it always came back to the
> same result. In the general case, not worth the effort !
>
> Now of course, if one limits itself to very regular applications (such
> as the one presented in the paper), where the matrices involved in the
> computation are well conditioned (such as in the paper), and if you
> only use MPI_DOUBLE (\cite{same_paper}), and finally if you only
> expect to run over slow Ethernet (1Gbs) (\cite{same_paper_again})...
> then yes one might get some benefit.
>
Yes, you are probably right that it's not worth the effort in general and
especially not in HPC environments where you have very fast network.

But I can think of (rather important) special cases where it is important

- non HPC environments with slow network: beowulf clusters and/or
  internet + normal PCs where you use existing workstations and network
  for computations.
- communication/io-bound computations where you transfer
  large redundant datasets between nodes

So it would be nice to be able to turn on the compression (for specific
communicators and/or data transfers) when you need it.

-- 
Tomas

>   george.
>
> On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:
>
>> Hello
>>
>> I read from somewhere that OpenMPI supports
>> some kind of data compression but I couldn't find
>> any information about it.
>>
>> Is this true and how it can be used?
>>
>> Does anyone have any experiences about using it?
>>
>> Is it possible to use compression in just some
>> subset of communications (communicator
>> specific compression settings)?
>>
>> In our MPI application we are transferring large
>> amounts of sparse/redundant data that compresses
>> very well. Also my initial tests showed significant
>> improvements in performance.
>>
>> There are also articles that suggest that compression
>> should be used [1].
>>
>> [1] J. Ke, M. Burtscher and E. Speight.
>> Runtime Compression of MPI Messages to Improve the
>> Performance and Scalability of Parallel Applications.
>>
>>
>> Thanks in advance,
>> Tomas
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> 
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Alberto Giannetti


On Apr 24, 2008, at 6:56 AM, Ingo Josopait wrote:
I am using one of the nodes as a desktop computer. Therefore it is most
important for me that the mpi program is not so greedily acquiring cpu
time.


From a performance/usability standpoint, you could give interactive  
applications a higher priority to guarantee that your desktop  
applications work as expected.

http://www.informit.com/articles/article.aspx?p=101760


[OMPI users] install intel mac with Leopard

2008-04-24 Thread Koun SHIRAI

 Dear Sir:

I think that this problem must have been solved before, and maybe some  
information is already in the archives, but I could not find the right  
answer in my searches, so please allow me to ask again.


I tried to install openmpi-1.2.5 on a new Xserve (Xeon) running Leopard.  
The Intel compiler is used for Fortran.


My options for configure were
CC=/usr/bin/gcc-4.0
CXX=/usr/bin/g++-4.0
F77=ifort
along with
--with-rsh="ssh -x" --enable-shared --without-cs-fs --without-memory-manager


Then I saw an error message, which says:

checking if C and Fortran 77 are link compatible... no
**
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler.  This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**
configure: error: C and Fortran 77 compilers are not link compatible.   
Can not continue.


I suppose that the problem is the default architecture selection (32-  
or 64-bit). I don't know the correct options. Of course, I would like  
to use the 64-bit architecture as long as it works.
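
If it really is a 32- versus 64-bit mismatch, then perhaps (this is only
my guess, I have not verified it) forcing both compilers to the same word
size is the direction to try, e.g.

./configure CC=/usr/bin/gcc-4.0 CXX=/usr/bin/g++-4.0 F77=ifort \
    CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 \
    --with-rsh="ssh -x" --enable-shared --without-cs-fs --without-memory-manager

config.log should show the exact link command that failed.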


Best regards,


---
Koun SHIRAI
Nanoscience and Nanotechnology Center
ISIR, Osaka University
8-1, Mihogaoka, Ibaraki
Osaka 567-0047, JAPAN
PH: +81-6-6879-4302
FAX: +81-6-6879-8539




Re: [OMPI users] Message compression in OpenMPI

2008-04-24 Thread Tomas Ukkonen
Jeff Squyres wrote:
> On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:
>   
>> I read from somewhere that OpenMPI supports
>>
>> some kind of data compression but I couldn't find
>> any information about it.
>>
>> Is this true and how it can be used?
>> 
> Nope, sorry -- not true.
>
> This just came up in a different context, actually.  We added some  
> preliminary compression on our startup/mpirun messages and found that  
> it really had no effect; any savings that you get in bandwidth (and  
> therefore overall wall clock time) are eaten up by the time necessary  
> to compress/uncompress the messages.  There were a few more things we  
> could have tried, but frankly we had some higher priority items to  
> finish for the upcoming v1.3 series.  :-(
>   
OK, so I have to do it myself. Not a problem, really, because there are
only a few places where the compression really seems to matter.
>> Does anyone have any experiences about using it?
>>
>> Is it possible to use compression in just some
>> subset of communications (communicator
>> specific compression settings)?
>>
>> In our MPI application we are transferring large
>> amounts of sparse/redundant data that compresses
>> very well. Also my initial tests showed significant
>> improvements in performance.
>> 
>
> If your particular data is well-suited for fast compression, you might  
> want to compress it before calling MPI_SEND / after calling MPI_RECV.   
> Use the MPI_BYTE datatype to send/receive the messages, and then MPI  
> won't do anything additional for datatype conversions, etc
Yeah, already did something like this. I have a situation where all
the nodes are sending large amounts of redundant data at once. The
combination: "compress --> MPI_SEND --> MPI_RECV --> decompress"
works of course, but it forces one to allocate large amounts of memory
(or disk space) for the compressed data. You can do it manually in parts,
of course, but it would be nice if the MPI library could do it behind the
scenes.
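
To make the pattern concrete, a minimal sketch of
compress --> MPI_Send --> MPI_Recv --> uncompress (zlib is used here only
as an example; error handling and real tag/peer management are omitted):

#include <mpi.h>
#include <zlib.h>
#include <stdlib.h>

void send_compressed(void *buf, unsigned long nbytes, int dest, int tag, MPI_Comm comm)
{
    uLongf clen = compressBound(nbytes);
    Bytef *cbuf = malloc(clen);

    /* Z_BEST_SPEED favours speed over ratio, since the network is the bottleneck. */
    compress2(cbuf, &clen, (const Bytef *)buf, nbytes, Z_BEST_SPEED);

    /* Send the uncompressed size first so the receiver can size its output buffer. */
    MPI_Send(&nbytes, 1, MPI_UNSIGNED_LONG, dest, tag, comm);
    MPI_Send(cbuf, (int)clen, MPI_BYTE, dest, tag, comm);
    free(cbuf);
}

void recv_compressed(void *buf, int src, int tag, MPI_Comm comm)
{
    unsigned long nbytes;
    int clen;
    MPI_Status status;

    MPI_Recv(&nbytes, 1, MPI_UNSIGNED_LONG, src, tag, comm, &status);

    /* Probe to learn the compressed size, then receive and decompress. */
    MPI_Probe(src, tag, comm, &status);
    MPI_Get_count(&status, MPI_BYTE, &clen);

    Bytef *cbuf = malloc(clen);
    MPI_Recv(cbuf, clen, MPI_BYTE, src, tag, comm, &status);

    uLongf dlen = nbytes;
    uncompress((Bytef *)buf, &dlen, cbuf, clen);
    free(cbuf);
}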

Thanks,

-- 
Tomas Ukkonen




Re: [OMPI users] Busy waiting [was Re: (no subject)]

2008-04-24 Thread Ingo Josopait
I am using one of the nodes as a desktop computer. Therefore it is most
important for me that the mpi program is not so greedily acquiring cpu
time. But I would imagine that the energy consumption is generally a big
issue, since energy is a major cost factor in a computer cluster. When a
cpu is idle, it uses considerably less energy. Last time I checked my
computer used 180W when both cpu cores were working and 110W when both
cores were idle.

I just made a small hack to solve the problem. I inserted a simple sleep
call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
 #endif
 } else {
 while (c->c_signaled == 0) {
+   usleep(1000);
 opal_progress();
 }
 }

The usleep call will let the program sleep for about 4 ms (it won't
sleep for a shorter time because of some timer granularity). But that is
good enough for me. The cpu usage is (almost) zero when the tasks are
waiting for one another.

For a proper implementation you would want to actively poll without a
sleep call for a few milliseconds, and then use some other method that
sleeps not for a fixed time, but until new messages arrive.
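
As a sketch of the first half of that idea (the spin-then-back-off part
only; truly sleeping until a message arrives would need support from the
library itself, and the 2 ms window, the usleep() fallback and the
progress_once() placeholder below are arbitrary choices, not Open MPI code):

#include <sys/time.h>
#include <unistd.h>

extern void progress_once(void);        /* stands in for opal_progress() above */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void wait_until_signaled(volatile int *signaled)
{
    const double spin_window = 0.002;   /* busy-wait this long before backing off */
    double start = now_sec();

    while (*signaled == 0) {
        progress_once();
        if (now_sec() - start > spin_window)
            usleep(1000);               /* then poll only every few milliseconds */
    }
}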



Barry Rountree wrote:
> On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
>> I can think of several advantages that using blocking or signals to
>> reduce the cpu load would have:
>>
>> - Reduced energy consumption
> 
> Not necessarily.  Any time the program ends up running longer, the
> cluster is up and running (and wasting electricity) for that amount of
> time.  In the case where lots of tiny messages are being sent you could
> easily end up using more energy.  
> 
>> - Running additional background programs could be done far more efficiently
> 
> It's usually more efficient -- especially in terms of cache -- to batch
> up programs to run one after the other instead of running them
> simultaneously.  
> 
>> - It would be much simpler to examine the load balance.
> 
> This is true, but it's still pretty trivial to measure load imbalance.
> MPI allows you to write a wrapper library that intercepts any MPI_*
> call.  You can instrument the code however you like, then call PMPI_*,
> then catch the return value, finish your instrumentation, and return
> control to your program.  Here's some pseudocode:
> 
> int MPI_Barrier(MPI_Comm comm){
>   struct timeval start, end;
>   int rc;
>   gettimeofday(&start, NULL);
>   rc = PMPI_Barrier( comm );
>   gettimeofday(&end, NULL);
>   fprintf( logfile, "Barrier on node %d took %lf seconds\n",
>   rank, delta(&start, &end) );
>   return rc;
> }
> 
> I've got some code that does this for all of the MPI calls in OpenMPI
> (ah, the joys of writing C code using python scripts).  Let me know if
> you'd find it useful.
> 
>> It may depend on the type of program and the computational environment,
>> but there are certainly many cases in which putting the system in idle
>> mode would be advantageous. This is especially true for programs with
>> low network traffic and/or high load imbalances.
> 
>   I could use a few more benchmarks like that.  Seriously, if
> you're mostly concerned about saving energy, a quick hack is to set a
> timer as soon as you enter an MPI call (say for 100ms) and if the timer
> goes off while you're still in the call, use DVS to drop your CPU
> frequency to the lowest value it has.  Then, when you exit the MPI call,
> pop it back up to the highest frequency.  This can save a significant
> amount of energy, but even here there can be a performance penalty.  For
> example, UMT2K schleps around very large messages, and you really need
> to be running as fast as possible during the MPI_Waitall calls or the
> program will slow down by 1% or so (thus using more energy).
> 
> Doing this just for Barriers and Allreduces seems to speed up the
> program a tiny bit, but I haven't done enough runs to make sure this
> isn't an artifact.
> 
> (This is my dissertation topic, so before asking any question be advised
> that I WILL talk your ear off.)
>  
>> The "spin for a while and then block" method that you mentioned earlier
>> seems to be a good compromise. Just do polling for some time that is
>> long compared to the corresponding system call, and then go to sleep if
>> nothing happens. In this way, the latency would be only marginally
>> increased, while less cpu time is wasted in the polling loops, and I
>> would be much happier.
>>
> 
> I'm interested in seeing what this does for energy savings.  Are you
> volunteering to test a patch?  (I've got four other papers I need to
> get finished up, so it'll be a few weeks before I start coding.)
> 
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
> 
>>
>>
>>
>> Jeff Squyres wrote:
>>> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
>>>
 Do you really mean that Open MPI uses a busy loop in order to handle
 incoming calls? It seems to be incorrect since