Re: [OMPI users] Message compression in OpenMPI
From a pretty old experiment I made, compression gave good results on a 10 Mbps network but actually increased RTT on 100 Mbps and faster. I played with all the zlib settings from 1 to 9, and even the lowest compression setting was unable to reach decent performance. I don't believe that the computing/bandwidth ratio has changed to favor compression.

Aurelien.

On Apr 24, 2008, at 11:06, George Bosilca wrote:

Actually, even in this particular condition (over the internet), compression makes sense only for very specific data. The problem is that the compression algorithm is usually very expensive if you want to get a really interesting size-reduction factor. And there is the tradeoff: what you save in terms of data transfer you lose in terms of compression time. In other words, compression becomes interesting in only 2 scenarios: you have a very congested network (really very very congested) or you have a network with limited bandwidth.

The algorithm used in the paper you cited is fast, but unfortunately very specific to MPI_DOUBLE, and it only works if the data exhibit the properties I cited in my previous email. The generic compression algorithms are at least one order of magnitude slower. And then again, one needs a very slow network in order to get any benefit from doing the compression ... And of course slow networks are not exactly the most common place where you will find MPI applications.

But as Jeff stated in his email, contributions are always welcome :)

george.

On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:

George Bosilca wrote:

The paper you cited, while presenting a particular implementation, doesn't present any new ideas. The compression of data has been studied for a long time, and [unfortunately] it always came back to the same result. In the general case, not worth the effort!

Now of course, if one limits oneself to very regular applications (such as the one presented in the paper), where the matrices involved in the computation are well conditioned (such as in the paper), and if you only use MPI_DOUBLE (\cite{same_paper}), and finally if you only expect to run over slow Ethernet (1 Gbps) (\cite{same_paper_again})... then yes, one might get some benefit.

Yes, you are probably right that it's not worth the effort in general, and especially not in HPC environments where you have a very fast network. But I can think of (rather important) special cases where it is important:

- non-HPC environments with a slow network: Beowulf clusters and/or internet + normal PCs where you use existing workstations and network for computations.
- communication/IO-bound computations where you transfer large redundant datasets between nodes.

So it would be nice to be able to turn on the compression (for specific communicators and/or data transfers) when you need it.

-- 
Tomas

george.

On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:

Hello

I read somewhere that OpenMPI supports some kind of data compression, but I couldn't find any information about it. Is this true, and how can it be used? Does anyone have any experience using it? Is it possible to use compression in just some subset of communications (communicator-specific compression settings)?

In our MPI application we are transferring large amounts of sparse/redundant data that compresses very well. Also, my initial tests showed significant improvements in performance. There are also articles that suggest that compression should be used [1].

[1] J. Ke, M. Burtscher and E. Speight. Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications.
Thanks in advance,
Tomas

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
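
The tradeoff George describes (time saved on the wire versus time spent compressing) can be put into a back-of-the-envelope model. The sketch below is only an illustration, not anything from Open MPI: the function names and the simple cost model (compress, send the smaller message, decompress, with no overlap between the stages) are assumptions. Compression wins when 1/C + 1/D < (1 - 1/r)/B, where B is the link bandwidth, C and D are the compressor and decompressor throughputs, and r is the compression ratio.

```c
#include <assert.h>

/* Highest link bandwidth (bytes/s) at which compressing still pays off,
 * under an assumed model with no overlap of compression and transfer.
 *   ratio      : original size / compressed size (> 1 means the data shrinks)
 *   comp_bps   : compressor throughput in bytes/s   (assumed, measured offline)
 *   decomp_bps : decompressor throughput in bytes/s (assumed, measured offline)
 */
double breakeven_bandwidth(double ratio, double comp_bps, double decomp_bps)
{
    assert(comp_bps > 0.0 && decomp_bps > 0.0);
    if (ratio <= 1.0) {
        return 0.0;                      /* data does not shrink: never worth it */
    }
    double saved_fraction = 1.0 - 1.0 / ratio;               /* wire time saved per byte */
    double cpu_cost = 1.0 / comp_bps + 1.0 / decomp_bps;     /* CPU time spent per byte  */
    return saved_fraction / cpu_cost;
}

/* Returns 1 if a link of link_bps bytes/s is slow enough for compression to help. */
int compression_pays_off(double link_bps, double ratio,
                         double comp_bps, double decomp_bps)
{
    return link_bps < breakeven_bandwidth(ratio, comp_bps, decomp_bps);
}
```

With made-up inputs (ratio 2, 20 MB/s compression, 80 MB/s decompression) the break-even bandwidth is 8 MB/s, or 64 Mbit/s, so under those assumptions a 10 Mbit/s link benefits and a 100 Mbit/s link does not. Real numbers depend entirely on the data and the CPU.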
Re: [OMPI users] Openmpi (VASP): Signal code: Address not mapped (1)
Hi,

On 10:03 Thu 24 Apr, Steven Truong wrote:
> Could somebody tell me what might cause this error?

I'll try.

> [compute-1-27:31550] *** Process received signal ***
> [compute-1-27:31550] Signal: Segmentation fault (11)
> [compute-1-27:31550] Signal code: Address not mapped (1)

"Address not mapped" means that the program tried to access a memory location that is not part of the process' address space (e.g. a null pointer).

> [compute-1-27:31550] Failing at address: (nil)
> [compute-1-27:31550] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
> [compute-1-27:31550] [ 1] /usr/local/bin/vaspopenmpi_scala(__dfast__cnormn+0x18e) [0x4dd0ee]
> [compute-1-27:31550] [ 2] /usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x59be) [0x5b11fe]
> [compute-1-27:31550] [ 3] /usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
> [compute-1-27:31550] [ 4] /usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
> [compute-1-27:31550] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) [0x6ed9ee]
> [compute-1-27:31550] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x34e5f1c3fb]
> [compute-1-27:31550] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
> [compute-1-27:31550] *** End of error message ***
> [compute-1-27:31549] *** Process received signal ***

The lines after the failing address are a backtrace of the functions currently being executed (in reverse order, as found on the stack). I'd hazard a guess that it's not OMPI's fault but VASP's, since the segfault happens in one of its functions. Maybe you should have a look there.

HTH
-Andi

-- 
Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
PGP/GPG key via keyserver
I'm a bright... http://www.the-brights.net

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!
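
For readers wondering where a report like this comes from, here is a minimal sketch of a SIGSEGV handler that prints the signal code and a backtrace. It is not Open MPI's actual handler: the names segv_code_name and install_segv_backtrace and the message strings are made up for illustration, and backtrace() and backtrace_symbols_fd() are glibc extensions, so this assumes Linux with glibc. On Linux, SEGV_MAPERR has the value 1, which is the "(1)" in "Address not mapped (1)".

```c
#define _POSIX_C_SOURCE 200809L
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): glibc extensions */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Translate the si_code of a SIGSEGV into readable text (wording is ours). */
const char *segv_code_name(int si_code)
{
    switch (si_code) {
    case SEGV_MAPERR: return "Address not mapped";   /* address outside the process */
    case SEGV_ACCERR: return "Invalid permissions";  /* mapped, but access denied   */
    default:          return "Unknown";
    }
}

static void segv_handler(int signo, siginfo_t *info, void *ctx)
{
    void *frames[32];
    const char *why = segv_code_name(info->si_code);
    int depth;

    (void)ctx;
    /* write() is async-signal-safe; printf() is not. */
    (void)write(STDERR_FILENO, why, strlen(why));
    (void)write(STDERR_FILENO, "\n", 1);

    /* Caveat: backtrace() is not formally async-signal-safe. */
    depth = backtrace(frames, 32);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);

    /* Restore the default action and re-raise, so the process still dies
     * from SIGSEGV and its parent sees the real cause. */
    signal(signo, SIG_DFL);
    raise(signo);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_segv_backtrace(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;          /* ask for the siginfo_t argument */
    sa.sa_sigaction = segv_handler;
    return sigaction(SIGSEGV, &sa, NULL);
}
```

The frames come out innermost first, which is why the first application frame listed (here a VASP routine) is usually the place to start looking.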
Re: [OMPI users] How to restart a job twice
Tamer,

I'm confident that this particular problem is now fixed in the trunk (r18276). If you are interested in the details of the bug and how it was fixed, the commit message is fairly detailed:
https://svn.open-mpi.org/trac/ompi/changeset/18276

Let me know if this patch fixes things. Like I said, I'm confident that it does, but there are always more bugs :)

Thanks again for the bug report.

Cheers,
Josh

On Apr 24, 2008, at 11:02 AM, Josh Hursey wrote:

Tamer,

Another user contacted me off list yesterday with a similar problem with the current trunk. I have been able to reproduce this, and am currently trying to debug it again. It seems to occur more often with builds without the checkpoint thread (--disable-ft-thread). It seems to be a race in our connection wireup, which is why it does not always occur.

Thank you for your patience as I try to track this down. I'll let you know as soon as I have a fix.

Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:

Josh,

Thank you for your help. I was able to do the following with r18241:

start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint, but failed to restart with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107) [0xc5df97]
[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 23857 on node dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).

So, this time the process went further than before. I tested on a different platform (64-bit machine with Fedora Core 7) and openmpi checkpoints and restarts as many times as I want without any problems. This means that the issue above must be platform dependent, and I must be missing some option in building the code.

Cheers,
Tamer

On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:

Tamer,

This should now be fixed in r18241. Though I was able to replicate this bug, it only occurred sporadically for me. It seemed to be caused by some socket descriptor caching that was not properly cleaned up by the restart procedure.

My testing indicates that this bug is now fixed, but since it is difficult to reproduce, definitely let me know if you see it happen again.

With the current trunk you may see the following error message:
--
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--
This is not caused by the checkpoint/restart code, but by some recent changes to our TCP component. We are working on fixing this, but I just wanted to give you a heads up in case you see this error. As far as I can tell it does not interfere with the checkpoint/restart functionality.

Let me know if this fixes your problem.

Cheers,
Josh

On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:

Tamer,

Just wanted to update you on my progress. I am able to reproduce something similar to this problem. I am currently working on a solution to it. I'll let you know when it is available, probably in the next day or two.

Thank you for the bug report.
Cheers,
Josh

On Apr 18, 2008, at 1:11 PM, Tamer wrote:

Hi Josh:

I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7. The machine is dual-core with shared memory, so it's not even a cluster. I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr

When I run mpirun I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully and was able to checkpoint the restarted job (mpirun showed up with ps -efa | grep mpirun under r18208) but was unable to restart again; here's the error message:

ompi-restart ompi_global_snapshot_23865.ckpt
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity: Connection to
Re: [OMPI users] Busy waiting [was Re: (no subject)]
I just wanted to add my last comment since this discussion seems to be very hot! As Jeff mentioned, while a process is waiting to receive a message it doesn't really matter whether it blocks or polls. What I really meant was that blocking can free CPU cycles for other calculations that this node is supposed to do, if OMPI is smart enough to decide such things. Otherwise, because HPC nodes are usually dedicated, there will be no other background tasks that could be affected by spinning. Nevertheless, I think that using blocking instead of busy loops should have higher priority, since it can save idle CPU cycles, at least for OMPI's internal tasks...

D.

Jeff Squyres wrote:

What George said is what I meant by "it's a non-trivial amount of work." :-)

In addition to when George adds these patches (allowing components to register for blocking progress), there's going to be some work to deal with shared memory (we have some ideas here, but it's a bit more than just allowing shmem to register for blocking progress) and other random issues that will arise.

On Apr 24, 2008, at 11:17 AM, George Bosilca wrote:

Well, blocking or not blocking, this is the question!!! Unfortunately, it's more complex than this thread seems to indicate. It's not that we didn't want to implement it in Open MPI; it's that at one point we had to make a choice ... and we decided to always go for performance first.

However, there were some experiments with going into blocking mode, at least when only TCP was used. Unfortunately, this breaks some other things in Open MPI because of our progression model. We are component based, and these components are allowed to register periodically called callbacks ... And here periodically means as often as possible. There are at least 2 components that use this mechanism for their own progression: romio (mca/io/romio) and one-sided communications (mca/osc/*). Switching to blocking mode will break these 2 components completely. This was the reason why we're not blocking when only TCP is used.

Anyway, there is a solution. We have to move from poll-based progress for these components to event-based progress. There were some discussions, and if I remember well ... everybody's waiting for one of my patches :) A patch that allows a component to add a completion callback to MPI requests ... I don't have a clear deadline for this, and unfortunately I'm a little busy right now ... but I'll work on it asap.

george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:

On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
> I am using one of the nodes as a desktop computer. Therefore it is most
> important for me that the mpi program is not so greedily acquiring cpu
> time.

This is a kernel scheduling issue, not an OpenMPI issue. Busy waiting in one process should not cause noticeable loss of responsiveness in other processes. Have you experimented with the "nice" command?

> But I would imagine that the energy consumption is generally a big
> issue, since energy is a major cost factor in a computer cluster.

Yup.

> When a cpu is idle, it uses considerably less energy. Last time I checked my
> computer used 180W when both cpu cores were working and 110W when both
> cores were idle.

What processor is this?

> I just made a small hack to solve the problem. I inserted a simple sleep
> call into the function 'opal_condition_wait':
>
> --- orig/openmpi-1.2.6/opal/threads/condition.h
> +++ openmpi-1.2.6/opal/threads/condition.h
> @@ -78,6 +78,7 @@
>  #endif
>      } else {
>          while (c->c_signaled == 0) {
> +            usleep(1000);
>              opal_progress();
>          }
>      }

I expect this would lead to increased execution time for all programs and increased energy consumption for most programs. Recall that energy is power multiplied by time. You're reducing the power on some nodes and increasing time on all nodes.

> The usleep call will let the program sleep for about 4 ms (it won't
> sleep for a shorter time because of some timer granularity). But that is
> good enough for me. The cpu usage is (almost) zero when the tasks are
> waiting for one another.

I think your mistake here is considering CPU load to be a useful metric. It isn't. Responsiveness is a useful metric, energy is a useful metric, but CPU load isn't a reliable guide to either of these.

> For a proper implementation you would want to actively poll without a
> sleep call for a few milliseconds, and then use some other method that
> sleeps not for a fixed time, but until new messages arrive.

Well, it sounds like you can get to this before I can. Post your patch here and I'll test it on the NAS suite, UMT2K, Paradis, and a few synthetic benchmarks I've written. The cluster I use has multimeters hooked up so I can
Re: [OMPI users] install intel mac with Leopard
Jeff,

I don't know if there is a way to capture the "not of required architecture" response and add it to the error message. I agree that the current error message captures the problem in broad terms and points to the config.log file. It is just not very specific. If the architecture problem can't be added to the error message then I think we are stuck with what we have. If that is the case, is it worthwhile to add this to the FAQ for building Open MPI?

Doug

On Apr 24, 2008, at 9:34 AM, Jeff Squyres wrote:

On Apr 24, 2008, at 12:24 PM, George Bosilca wrote:

There are so many special errors that are compiler and operating system dependent that there is no way to handle each of them specifically. And even if it was possible, I will not use autoconf if the resulting configure file was 100MB ...

More specifically, the error messages in config.log are mostly written by the compiler/linker (i.e., redirect stdout/stderr from the command line to config.log). We don't usually modify that -- the Autoconf Way is that Autoconf is 100% responsible for config.log.

Additionally, I think the error message is more than clear. It clearly states that the problem is coming from a mismatch between the CFLAGS and FFLAGS. There is even a hint that one has to look in config.log to find the real cause...

As George specifies, the stdout from configure is what we can most directly affect, and that's why we chose to output this message:

* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler. This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.

OMPI doesn't know *why* the test link failed; we just know that it failed. I agree with George that trying to put in compiler-specific stdout/stderr analysis is a black hole that would be extraordinarily difficult.

Do you have any suggestions for re-wording this message? That's probably the best that we can do.

george.

On Apr 24, 2008, at 11:57 AM, Doug Reeder wrote:

Jeff,

For the specific problem of the gcc compiler creating i386 objects and ifort creating x86_64 objects, the config.log file says

configure:26935: ifort -o conftest conftest.f conftest_c.o >&5
ld: warning in conftest_c.o, file is not of required architecture

If configure could pick up on this and write an error message something like "Your C and Fortran compilers are creating objects for different architectures. You probably need to change your CFLAGS or FFLAGS arguments to ensure that they are consistent", it would point the user more directly to the real problem. Right now the information is in the config.log file but it doesn't jump out at you.

Doug Reeder

On Apr 24, 2008, at 8:40 AM, Jeff Squyres wrote:

On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:

Make sure that your compilers are all creating code for the same architecture (i386 or x86_64). ifort usually installs such that the 64-bit version of the compiler is the default, while the Apple gcc compiler creates i386 output by default. Check the architecture of the .o files with file *.o, and if the gcc output needs to be x86_64, add the -m64 flag to the C and C++ flags. That has worked for me. You shouldn't need the Intel C/C++ compilers.

I find the configure error message to be a little bit cryptic and not very insightful.

Do you have a suggestion for a new configure error message? I thought it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**********************************************************************
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler. This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: C and Fortran 77 compilers are not link compatible. Can not continue.

-- 
Jeff Squyres
Cisco Systems
[OMPI users] Openmpi (VASP): Signal code: Address not mapped (1)
Hi.

I recently encountered this error and can not really understand what it means. I googled and could not find any relevant information. Could somebody tell me what might cause this error?

Our systems: Rocks 4.3 x86_64, openmpi-1.2.5, scalapack-1.8.0, Barcelona, Gigabit interconnections.

Thank you very much.

ERROR MESSAGE:
[compute-1-27:31550] *** Process received signal ***
[compute-1-27:31550] Signal: Segmentation fault (11)
[compute-1-27:31550] Signal code: Address not mapped (1)
[compute-1-27:31550] Failing at address: (nil)
[compute-1-27:31550] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
[compute-1-27:31550] [ 1] /usr/local/bin/vaspopenmpi_scala(__dfast__cnormn+0x18e) [0x4dd0ee]
[compute-1-27:31550] [ 2] /usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x59be) [0x5b11fe]
[compute-1-27:31550] [ 3] /usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
[compute-1-27:31550] [ 4] /usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
[compute-1-27:31550] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) [0x6ed9ee]
[compute-1-27:31550] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x34e5f1c3fb]
[compute-1-27:31550] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
[compute-1-27:31550] *** End of error message ***
[compute-1-27:31549] *** Process received signal ***
[compute-1-27:31549] Signal: Segmentation fault (11)
[compute-1-27:31549] Signal code: Address not mapped (1)
[compute-1-27:31549] Failing at address: (nil)
[compute-1-27:31549] [ 0] /lib64/tls/libpthread.so.0 [0x34e6c0c4f0]
[compute-1-27:31549] [ 1] /usr/local/bin/vaspopenmpi_scala(__dfast__cnorma+0x1e4) [0x4dd884]
[compute-1-27:31549] [ 2] /usr/local/bin/vaspopenmpi_scala(__rmm_diis__eddrmm+0x6dbd) [0x5b25fd]
[compute-1-27:31549] [ 3] /usr/local/bin/vaspopenmpi_scala(elmin_+0x32fa) [0x608a9a]
[compute-1-27:31549] [ 4] /usr/local/bin/vaspopenmpi_scala(MAIN__+0x15492) [0x425f4a]
[compute-1-27:31549] [ 5] /usr/local/bin/vaspopenmpi_scala(main+0xe) [0x6ed9ee]
[compute-1-27:31549] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x34e5f1c3fb]
[compute-1-27:31549] [ 7] /usr/local/bin/vaspopenmpi_scala [0x410a2a]
[compute-1-27:31549] *** End of error message ***
mpiexec noticed that job rank 0 with PID 31544 on node compute-1-27.local exited on signal 15 (Terminated).
Re: [OMPI users] Busy waiting [was Re: (no subject)]
On Apr 24, 2008, at 9:09 AM, Adrian Knoth wrote:

On Thu, Apr 24, 2008 at 08:25:44AM -0400, Alberto Giannetti wrote:

I am using one of the nodes as a desktop computer. Therefore it is most important for me that the mpi program is not so greedily acquiring cpu time.

From a performance/usability standpoint, you could set interactive applications to a higher priority to guarantee your desktop applications work as expected.

What you really mean is to renice the MPI program to 10 or even 19.

Linux also has a POSIX real-time scheduling mode (priocntl).

It's usually not a good idea to raise the priority of any program to a nice level below 0, as this could lock up your machine (that's why nice levels below 0 are reserved for privileged users (root)). (Note that lower nice levels actually mean higher priority. Just to avoid confusion. I guess I don't have to mention "man nice" on a technical mailing list.)

Anyway, I suggest you set mpi_yield_when_idle=1 in your mca-params.conf.

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de
Re: [OMPI users] install intel mac with Leopard
Jeff,

For the specific problem of the gcc compiler creating i386 objects and ifort creating x86_64 objects, the config.log file says

configure:26935: ifort -o conftest conftest.f conftest_c.o >&5
ld: warning in conftest_c.o, file is not of required architecture

If configure could pick up on this and write an error message something like "Your C and Fortran compilers are creating objects for different architectures. You probably need to change your CFLAGS or FFLAGS arguments to ensure that they are consistent", it would point the user more directly to the real problem. Right now the information is in the config.log file but it doesn't jump out at you.

Doug Reeder

On Apr 24, 2008, at 8:40 AM, Jeff Squyres wrote:

On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:

Make sure that your compilers are all creating code for the same architecture (i386 or x86_64). ifort usually installs such that the 64-bit version of the compiler is the default, while the Apple gcc compiler creates i386 output by default. Check the architecture of the .o files with file *.o, and if the gcc output needs to be x86_64, add the -m64 flag to the C and C++ flags. That has worked for me. You shouldn't need the Intel C/C++ compilers.

I find the configure error message to be a little bit cryptic and not very insightful.

Do you have a suggestion for a new configure error message? I thought it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**********************************************************************
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler. This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: C and Fortran 77 compilers are not link compatible. Can not continue.

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI users] Proper use of sigaction in Open MPI?
I have never tested this before, so I could be wrong. However, my best guess is that the following is happening:

1. You trap the signal and do your cleanup. However, when your proc now exits, it does not exit with a status of "terminated-by-signal". Instead, it exits normally.

2. The local daemon sees the proc exit, but since it exited normally, it takes no action to abort the job. Hence, mpirun has no idea that anything "wrong" has happened, nor that it should do anything about it.

3. If you re-raise the signal, the proc now exits with "terminated-by-signal", so the abort procedure works as intended.

Since you call mpi_finalize before leaving, even the upcoming 1.3 release would be "fooled" by this behavior. It will again think that the proc exited normally, and happily wait for all the procs to "complete".

Now, if -all- of your procs receive this signal and terminate, then the system should shut down. But I gather from your note that this isn't the case: that only a subset, perhaps only one, of the procs is taking this action?

If all of the procs are exiting, then it is possible that there is a bug in the 1.2 release that is getting confused by the signals. Mpirun does trap SIGTERM to order a clean abort of all procs, so it is possible that a race condition is getting activated and causing mpirun to hang. Unfortunately, that can happen in the 1.2 series. The 1.3 release should be more robust in that regard.

I don't think what you are doing will cause any horrid problems. Like I said, I have never tried something like this, so I might be surprised. But if your job cleans up the way you want, I certainly wouldn't worry about it. At the worst, there might be some dangling tmp files from Open MPI.

Ralph

On 4/24/08 8:51 AM, "Jeff Squyres (jsquyres)" wrote:

> Thoughts?
>
> Is this a "fixed in 1.3" issue?
>
> -jms
> Sent from my PDA. No type good.
>
> -----Original Message-----
> From: Keller, Jesse [mailto:jesse.kel...@roche.com]
> Sent: Thursday, April 24, 2008 09:35 AM Eastern Standard Time
> To: us...@open-mpi.org
> Subject: [OMPI users] Proper use of sigaction in Open MPI?
>
> Hello, all -
>
> I have an OpenMPI application that generates a file while it runs. No big
> deal. However, I'd like to delete the partial file if the job is aborted via
> a user signal. In a non-MPI application, I'd use sigaction to intercept the
> SIGTERM and delete the open files there. I'd then call the "old" signal
> handler. When I tried this with my OpenMPI program, the signal was caught,
> the files deleted, the processes exited, but the MPI exec command as a whole
> did not exit. This is the technique, by the way, that was described in this
> IBM MPI document:
>
> http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe.doc/pe_linux42/am106l0037.html
>
> My question is, what is the "right" way to do this under OpenMPI? The only
> way I got the thing to work was by resetting the sigaction to the old handler
> and re-raising the signal. It seems to work, but I want to know if I am going
> to get "bit" by this. Specifically, am I "closing" MPI correctly by doing
> this?
>
> I am running OpenMPI 1.2.5 under Fedora 8 on Linux in an x86_64 environment.
> My compiler is gcc 4.1.2. This behavior happens when all processes are
> running on the same node using shared memory and between nodes when using TCP
> transport. I don't have access to any other transport.
>
> Thanks for your help.
>
> Jesse Keller
> 454 Life Sciences
>
> Here's a code snippet to demonstrate what I'm talking about.
>
> ----------------------------------------------------------------------
>
> struct sigaction sa_old_term; /* Global. */
>
> void
> SIGTERM_handler(int signal, siginfo_t * siginfo, void * a)
> {
>     UnlinkOpenedFiles(); /* Global function to delete partial files. */
>
>     /* The commented code doesn't work. */
>     //if (sa_old_term.sa_sigaction)
>     //{
>     //    sa_old_term.sa_flags = SA_SIGINFO;
>     //    (*sa_old_term.sa_sigaction)(signal, siginfo, a);
>     //}
>
>     /* Restore the previous handler and re-raise the signal. */
>     sigaction(SIGTERM, &sa_old_term, NULL);
>     raise(signal);
> }
>
> int main(int argc, char ** argv)
> {
>     MPI::Init(argc, argv);
>
>     struct sigaction sa_term;
>     sigemptyset(&sa_term.sa_mask);
>     sa_term.sa_flags = SA_SIGINFO;
>     sa_term.sa_sigaction = SIGTERM_handler;
>     sigaction(SIGTERM, &sa_term, &sa_old_term);
>
>     doSomeMPIComputation();
>     MPI::Finalize();
>     return 0;
> }
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
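
Ralph's point about exit status can be checked without MPI at all. The sketch below is an illustration under stated assumptions, not Open MPI code: install_sigterm_cleanup and demo_child_termination are hypothetical names, and the MPI calls are left out so the example runs standalone. It installs a SIGTERM handler that deletes a partial file, restores the previous disposition and re-raises the signal, then forks a child to show that the parent (playing the role of the local daemon) sees "terminated by signal" rather than a normal exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static struct sigaction sa_old_term;     /* previous SIGTERM disposition */
static const char *partial_file = NULL;  /* hypothetical: file to delete on abort */

static void sigterm_handler(int signo, siginfo_t *info, void *ctx)
{
    (void)info;
    (void)ctx;
    if (partial_file != NULL) {
        unlink(partial_file);            /* unlink() is async-signal-safe */
    }
    /* Put the old disposition back and re-raise. SIGTERM is blocked while
     * this handler runs, so the re-raised signal is delivered on return. */
    sigaction(signo, &sa_old_term, NULL);
    raise(signo);
}

/* Install the cleanup handler; returns 0 on success, -1 on failure. */
int install_sigterm_cleanup(const char *path)
{
    struct sigaction sa;

    partial_file = path;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = sigterm_handler;
    return sigaction(SIGTERM, &sa, &sa_old_term);
}

/* Fork a child that creates `path`, installs the handler and sends itself
 * SIGTERM. Returns the signal that terminated the child, 0 if the child
 * exited normally, or -1 on error. */
int demo_child_termination(const char *path)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd >= 0) {
            close(fd);
        }
        install_sigterm_cleanup(path);
        raise(SIGTERM);
        _exit(0);                        /* not reached if the re-raise kills us */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

In an MPI program the handler would be installed after MPI initialization, as in Jesse's snippet. If the handler simply called exit() after the cleanup, the parent would see a normal exit, which is the situation Ralph describes in his step 2.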
Re: [OMPI users] Busy waiting [was Re: (no subject)]
Barry Rountree wrote:
> On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
>> I am using one of the nodes as a desktop computer. Therefore it is most
>> important for me that the mpi program is not so greedily acquiring cpu
>> time.
>
> This is a kernel scheduling issue, not an OpenMPI issue. Busy waiting in
> one process should not cause noticeable loss of responsiveness in other
> processes. Have you experimented with the "nice" command?

I don't think that is a kernel issue. In the current OpenMPI implementation, when mpi is waiting for new messages, it simply spins in a loop until they arrive. The kernel then has no way to know whether the program is actually doing some useful calculations or whether it is simply busy waiting. If, on the other hand, mpi told the kernel that it is waiting for new messages, the kernel could schedule its cpu time more efficiently to background programs, or make an idle call if no other program is running (which would lower the energy consumption).

>> But I would imagine that the energy consumption is generally a big
>> issue, since energy is a major cost factor in a computer cluster.
>
> Yup.
>
>> When a
>> cpu is idle, it uses considerably less energy. Last time I checked my
>> computer used 180W when both cpu cores were working and 110W when both
>> cores were idle.
>
> What processor is this?

Athlon X2 6000+ (3 GHz)

>> I just made a small hack to solve the problem. I inserted a simple sleep
>> call into the function 'opal_condition_wait':
>>
>> --- orig/openmpi-1.2.6/opal/threads/condition.h
>> +++ openmpi-1.2.6/opal/threads/condition.h
>> @@ -78,6 +78,7 @@
>>  #endif
>>      } else {
>>          while (c->c_signaled == 0) {
>> +            usleep(1000);
>>              opal_progress();
>>          }
>>      }
>>
>
> I expect this would lead to increased execution time for all programs
> and increased energy consumption for most programs. Recall that energy
> is power multiplied by time. You're reducing the power on some nodes
> and increasing time on all nodes.
>
>> The usleep call will let the program sleep for about 4 ms (it won't
>> sleep for a shorter time because of some timer granularity). But that is
>> good enough for me. The cpu usage is (almost) zero when the tasks are
>> waiting for one another.
>
> I think your mistake here is considering CPU load to be a useful metric.
> It isn't. Responsiveness is a useful metric, energy is a useful metric,
> but CPU load isn't a reliable guide to either of these.
>
>> For a proper implementation you would want to actively poll without a
>> sleep call for a few milliseconds, and then use some other method that
>> sleeps not for a fixed time, but until new messages arrive.
>
> Well, it sounds like you can get to this before I can. Post your patch
> here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
> synthetic benchmarks I've written. The cluster I use has multimeters
> hooked up so I can also let you know how much energy is being saved.
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>

Here is now a slightly more sophisticated patch:

--- orig/openmpi-1.2.6/opal/threads/condition.h 2006-11-09 19:53:32.0 +0100
+++ openmpi-1.2.6/opal/threads/condition.h 2008-04-24 17:15:29.0 +0200
@@ -77,7 +77,11 @@
     }
 #endif
     } else {
+        int nosleep_counter = 30;
         while (c->c_signaled == 0) {
+            if (--nosleep_counter < 0) {
+                usleep(1000);
+            }
             opal_progress();
         }
     }

It will actively poll for a short time (0.1 seconds on my 2 GHz Athlon 64 laptop; this may be adjusted by choosing a different number than 30), and after that it will sleep for about 4 ms in each loop cycle. You may test it. It should not increase the latency by much. The cpu usage (as displayed by 'top') is nearly zero when waiting for new data, and judging from the noise level of my laptop fan, the cpu uses far less power.
A better solution would certainly be to use some other blocking mechanism, but as others have said in this thread, this seems to be a bit less trivial.
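The spin-then-sleep idea from the patch can also be tried outside of Open MPI. What follows is only a minimal sketch of that pattern, not Open MPI code: `wait_spin_then_sleep`, `progress_fn`, `struct countdown` and `countdown_progress` are hypothetical names, the callback stands in for `opal_progress()`, and `nanosleep` is used in place of `usleep`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for opal_progress(): advances pending communication. */
typedef void (*progress_fn)(void *ctx);

/*
 * Poll *signaled. The first `spin_limit` iterations poll back to back;
 * every later iteration sleeps `sleep_us` microseconds before polling.
 * Returns how many times the loop went to sleep.
 */
long wait_spin_then_sleep(volatile int *signaled, progress_fn progress,
                          void *ctx, long spin_limit, long sleep_us)
{
    assert(spin_limit >= 0 && sleep_us >= 0);
    long polls = 0, sleeps = 0;

    while (*signaled == 0) {
        if (polls >= spin_limit) {
            struct timespec ts;
            ts.tv_sec = sleep_us / 1000000L;
            ts.tv_nsec = (sleep_us % 1000000L) * 1000L;
            nanosleep(&ts, NULL);
            sleeps++;
        }
        progress(ctx);
        polls++;
    }
    return sleeps;
}

/* Demo progress function: raises the flag after a fixed number of calls. */
struct countdown {
    volatile int *flag;
    int remaining;
};

void countdown_progress(void *ctx)
{
    struct countdown *c = ctx;
    if (--c->remaining <= 0)
        *c->flag = 1;
}
```

With a spin limit of 30, a wait that completes within 30 polls never sleeps, so short waits keep their latency; only longer waits pay the sleep granularity on each further poll.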
Re: [OMPI users] Busy waiting [was Re: (no subject)]
What George said is what I meant by "it's a non-trivial amount of work." :-)

Even once George adds these patches (allowing components to register for blocking progress), there's going to be some work to deal with shared memory (we have some ideas here, but it's a bit more than just allowing shmem to register for blocking progress) and other random issues that will arise.

On Apr 24, 2008, at 11:17 AM, George Bosilca wrote:

> Well, blocking or not blocking, that is the question!!! Unfortunately, it's more complex than this thread seems to indicate. It's not that we didn't want to implement it in Open MPI, it's that at one point we had to make a choice ... and we decided to always go for performance first.
>
> However, there were some experiments with going into blocking mode, at least when only TCP was used. Unfortunately, this breaks some other things in Open MPI, because of our progression model. We are component based and these components are allowed to register periodically called callbacks ... And here periodically means as often as possible. There are at least 2 components that use this mechanism for their own progression: romio (mca/io/romio) and one-sided communications (mca/osc/*). Switching to blocking mode will break these 2 components completely. This is the reason why we're not blocking when only TCP is used.
>
> Anyway, there is a solution. We have to move from poll-based progress for these components to event-based progress. There were some discussions, and if I remember well ... everybody's waiting for one of my patches :) A patch that allows a component to add a completion callback to MPI requests ... I don't have a clear deadline for this, and unfortunately I'm a little busy right now ... but I'll work on it asap.
>
> george.
>
> On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:
>
>> On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
>>> I am using one of the nodes as a desktop computer. Therefore it is most important for me that the mpi program is not so greedily acquiring cpu time.
>>
>> This is a kernel scheduling issue, not an OpenMPI issue. Busy waiting in one process should not cause noticeable loss of responsiveness in other processes. Have you experimented with the "nice" command?
>>
>>> But I would imagine that the energy consumption is generally a big issue, since energy is a major cost factor in a computer cluster.
>>
>> Yup.
>>
>>> When a cpu is idle, it uses considerably less energy. Last time I checked my computer used 180W when both cpu cores were working and 110W when both cores were idle.
>>
>> What processor is this?
>>
>>> I just made a small hack to solve the problem. I inserted a simple sleep call into the function 'opal_condition_wait':
>>>
>>> --- orig/openmpi-1.2.6/opal/threads/condition.h
>>> +++ openmpi-1.2.6/opal/threads/condition.h
>>> @@ -78,6 +78,7 @@
>>>  #endif
>>>      } else {
>>>          while (c->c_signaled == 0) {
>>> +            usleep(1000);
>>>              opal_progress();
>>>          }
>>>      }
>>
>> I expect this would lead to increased execution time for all programs and increased energy consumption for most programs. Recall that energy is power multiplied by time. You're reducing the power on some nodes and increasing time on all nodes.
>>
>>> The usleep call will let the program sleep for about 4 ms (it won't sleep for a shorter time because of some timer granularity). But that is good enough for me. The cpu usage is (almost) zero when the tasks are waiting for one another.
>>
>> I think your mistake here is considering CPU load to be a useful metric. It isn't. Responsiveness is a useful metric, energy is a useful metric, but CPU load isn't a reliable guide to either of these.
>>
>>> For a proper implementation you would want to actively poll without a sleep call for a few milliseconds, and then use some other method that sleeps not for a fixed time, but until new messages arrive.
>>
>> Well, it sounds like you can get to this before I can. Post your patch here and I'll test it on the NAS suite, UMT2K, Paradis, and a few synthetic benchmarks I've written. The cluster I use has multimeters hooked up so I can also let you know how much energy is being saved.
>>
>> Barry Rountree
>> Ph.D. Candidate, Computer Science
>> University of Georgia

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] install intel mac with Laopard
On Apr 24, 2008, at 11:07 AM, Doug Reeder wrote:

> Make sure that your compilers are all creating code for the same architecture (i386 or x86-64). ifort usually installs such that the 64-bit version of the compiler is the default, while the Apple gcc compiler creates i386 output by default. Check the architecture of the .o files with "file *.o", and if the gcc output needs to be x86_64, add the -m64 flag to the C and C++ flags. That has worked for me. You shouldn't need the Intel C/C++ compilers.
>
> I find the configure error message to be a little bit cryptic and not very insightful.

Do you have a suggestion for a new configure error message? I thought it was very clear, but then again, I'm one of the implementors...

checking if C and Fortran 77 are link compatible... no
**
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler. This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**
configure: error: C and Fortran 77 compilers are not link compatible. Can not continue.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Busy waiting [was Re: (no subject)]
Well, blocking or not blocking, that is the question!!! Unfortunately, it's more complex than this thread seems to indicate. It's not that we didn't want to implement it in Open MPI, it's that at one point we had to make a choice ... and we decided to always go for performance first.

However, there were some experiments with going into blocking mode, at least when only TCP was used. Unfortunately, this breaks some other things in Open MPI, because of our progression model. We are component based and these components are allowed to register periodically called callbacks ... And here periodically means as often as possible. There are at least 2 components that use this mechanism for their own progression: romio (mca/io/romio) and one-sided communications (mca/osc/*). Switching to blocking mode will break these 2 components completely. This is the reason why we're not blocking when only TCP is used.

Anyway, there is a solution. We have to move from poll-based progress for these components to event-based progress. There were some discussions, and if I remember well ... everybody's waiting for one of my patches :) A patch that allows a component to add a completion callback to MPI requests ... I don't have a clear deadline for this, and unfortunately I'm a little busy right now ... but I'll work on it asap.

george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:

> On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
>> I am using one of the nodes as a desktop computer. Therefore it is most important for me that the mpi program is not so greedily acquiring cpu time.
>
> This is a kernel scheduling issue, not an OpenMPI issue. Busy waiting in one process should not cause noticeable loss of responsiveness in other processes. Have you experimented with the "nice" command?
>
>> But I would imagine that the energy consumption is generally a big issue, since energy is a major cost factor in a computer cluster.
>
> Yup.
>
>> When a cpu is idle, it uses considerably less energy. Last time I checked my computer used 180W when both cpu cores were working and 110W when both cores were idle.
>
> What processor is this?
>
>> I just made a small hack to solve the problem. I inserted a simple sleep call into the function 'opal_condition_wait':
>>
>> --- orig/openmpi-1.2.6/opal/threads/condition.h
>> +++ openmpi-1.2.6/opal/threads/condition.h
>> @@ -78,6 +78,7 @@
>>  #endif
>>      } else {
>>          while (c->c_signaled == 0) {
>> +            usleep(1000);
>>              opal_progress();
>>          }
>>      }
>
> I expect this would lead to increased execution time for all programs and increased energy consumption for most programs. Recall that energy is power multiplied by time. You're reducing the power on some nodes and increasing time on all nodes.
>
>> The usleep call will let the program sleep for about 4 ms (it won't sleep for a shorter time because of some timer granularity). But that is good enough for me. The cpu usage is (almost) zero when the tasks are waiting for one another.
>
> I think your mistake here is considering CPU load to be a useful metric. It isn't. Responsiveness is a useful metric, energy is a useful metric, but CPU load isn't a reliable guide to either of these.
>
>> For a proper implementation you would want to actively poll without a sleep call for a few milliseconds, and then use some other method that sleeps not for a fixed time, but until new messages arrive.
>
> Well, it sounds like you can get to this before I can. Post your patch here and I'll test it on the NAS suite, UMT2K, Paradis, and a few synthetic benchmarks I've written. The cluster I use has multimeters hooked up so I can also let you know how much energy is being saved.
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
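To make the difference between the two progress models a bit more concrete, here is a toy sketch. It is not Open MPI code and does not reflect the actual patch, which is not shown in this thread; every name in it (`register_poller`, `progress`, `struct request`, `request_set_callback`, `request_complete`, `count_event`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8   /* illustrative limit */

typedef void (*poll_fn)(void *ctx);        /* called on every progress() pass */
typedef void (*completion_fn)(void *ctx);  /* called once, when a request completes */

struct poller {
    poll_fn fn;
    void *ctx;
};

static struct poller pollers[MAX_POLLERS];
static int n_pollers = 0;

/* Poll-based model: a component asks to be called "as often as possible". */
int register_poller(poll_fn fn, void *ctx)
{
    if (n_pollers == MAX_POLLERS)
        return -1;
    pollers[n_pollers].fn = fn;
    pollers[n_pollers].ctx = ctx;
    n_pollers++;
    return 0;
}

/* One pass of the progress loop; pollers only run while somebody keeps calling it. */
void progress(void)
{
    for (int i = 0; i < n_pollers; i++)
        pollers[i].fn(pollers[i].ctx);
}

/* Event-based model: the callback travels with the request itself. */
struct request {
    int complete;
    completion_fn on_complete;  /* NULL if nobody is interested */
    void *ctx;
};

void request_set_callback(struct request *req, completion_fn fn, void *ctx)
{
    assert(req != NULL);
    req->on_complete = fn;
    req->ctx = ctx;
}

/* Called by the transport when the operation finishes. */
void request_complete(struct request *req)
{
    req->complete = 1;
    if (req->on_complete != NULL)
        req->on_complete(req->ctx);
}

/* Example callback usable with either model: bumps an int counter. */
void count_event(void *ctx)
{
    ++*(int *)ctx;
}
```

In the first model a component is starved as soon as the caller blocks instead of calling `progress()`, which is the breakage described above. In the second model nothing needs to spin: the callback fires whenever the transport reports completion.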
Re: [OMPI users] Message compression in OpenMPI
Actually, even in this particular condition (over internet), compression makes sense only for very specific data. The problem is that usually the compression algorithm is very expensive if you want to really get an interesting factor of size reduction. And there is the tradeoff: what you save in terms of data transfer you lose in terms of compression time. In other terms, compression becomes interesting in only 2 scenarios: you have a very congested network (really very very congested) or you have a network with a limited bandwidth.

The algorithm used in the paper you cited is fast, but unfortunately very specific to MPI_DOUBLE and only works if the data exhibit the properties I cited in my previous email. The generic compression algorithms are at least one order of magnitude slower. And then again, one needs a very slow network in order to get any benefits from doing the compression ... And of course slow networks are not exactly the most common place where you will find MPI applications.

But as Jeff stated in his email, contributions are always welcomed :)

george.

On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:

> George Bosilca wrote:
>> The paper you cited, while presenting a particular implementation, doesn't present any new ideas. The compression of the data was studied for a long time, and [unfortunately] it always came back to the same result. In the general case, not worth the effort!
>>
>> Now of course, if one limits oneself to very regular applications (such as the one presented in the paper), where the matrices involved in the computation are well conditioned (such as in the paper), and if you only use MPI_DOUBLE (\cite{same_paper}), and finally if you only expect to run over slow Ethernet (1 Gb/s) (\cite{same_paper_again})... then yes one might get some benefit.
>
> Yes, you are probably right that it's not worth the effort in general and especially not in HPC environments where you have a very fast network.
>
> But I can think of (rather important) special cases where it is important:
>
> - non HPC environments with slow network: beowulf clusters and/or internet + normal PCs where you use existing workstations and network for computations.
> - communication/io-bound computations where you transfer large redundant datasets between nodes
>
> So it would be nice to be able to turn on the compression (for specific communicators and/or data transfers) when you need it.
>
> --
> Tomas
>
>> george.
>>
>> On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:
>>
>>> Hello
>>>
>>> I read from somewhere that OpenMPI supports some kind of data compression but I couldn't find any information about it.
>>>
>>> Is this true and how it can be used?
>>>
>>> Does anyone have any experiences about using it?
>>>
>>> Is it possible to use compression in just some subset of communications (communicator specific compression settings)?
>>>
>>> In our MPI application we are transferring large amounts of sparse/redundant data that compresses very well. Also my initial tests showed significant improvements in performance.
>>>
>>> There are also articles that suggest that compression should be used [1].
>>>
>>> [1] J. Ke, M. Burtscher and E. Speight. Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications.
>>>
>>> Thanks in advance,
>>> Tomas
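The tradeoff George describes can be put into rough numbers. This is a minimal sketch, assuming compression, transfer and decompression run strictly one after another (no pipelining); the function names are hypothetical, and the throughput figures in the note below are made up for illustration, not measurements from this thread.

```c
#include <assert.h>

/* Seconds needed to push `bytes` through a link of `net_bps` bytes/second. */
double time_uncompressed(double bytes, double net_bps)
{
    assert(net_bps > 0.0);
    return bytes / net_bps;
}

/*
 * Seconds needed when the data is compressed first:
 *   compress + transfer of the reduced payload + decompress.
 * `ratio` is original size / compressed size (e.g. 4.0 for 4:1);
 * `comp_bps` and `decomp_bps` are throughputs measured on the original size.
 */
double time_compressed(double bytes, double ratio,
                       double comp_bps, double decomp_bps, double net_bps)
{
    assert(ratio > 0.0 && comp_bps > 0.0 && decomp_bps > 0.0 && net_bps > 0.0);
    return bytes / comp_bps + (bytes / ratio) / net_bps + bytes / decomp_bps;
}

/* Non-zero when compressing first is the faster option. */
int compression_pays_off(double bytes, double ratio,
                         double comp_bps, double decomp_bps, double net_bps)
{
    return time_compressed(bytes, ratio, comp_bps, decomp_bps, net_bps)
         < time_uncompressed(bytes, net_bps);
}
```

For instance, with an assumed 4:1 ratio, 50 MB/s compression and 200 MB/s decompression, sending 100 MB over a 10 Mb/s link (1.25 MB/s) takes about 80 s raw versus about 22.5 s compressed, while over a 1 Gb/s link (125 MB/s) it takes about 0.8 s raw versus about 2.7 s compressed. The same data that benefits on the slow link loses on the fast one.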
Re: [OMPI users] How to restart a job twice
Tamer,

Another user contacted me off list yesterday with a similar problem with the current trunk. I have been able to reproduce this, and am currently trying to debug it again. It seems to occur more often with builds without the checkpoint thread (--disable-ft-thread). It seems to be a race in our connection wireup, which is why it does not always occur.

Thank you for your patience as I try to track this down. I'll let you know as soon as I have a fix.

Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:

> Josh, Thank you for your help. I was able to do the following with r18241:
>
> start the parallel job
> checkpoint and restart
> checkpoint and restart
> checkpoint but failed to restart with the following message:
>
> ompi-restart ompi_global_snapshot_23800.ckpt
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
> [dhcp-119-202:23650] *** Process received signal ***
> [dhcp-119-202:23650] Signal: Segmentation fault (11)
> [dhcp-119-202:23650] Signal code: Address not mapped (1)
> [dhcp-119-202:23650] Failing at address: 0x3e0f50
> [dhcp-119-202:23650] [ 0] [0x110440]
> [dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107) [0xc5df97]
> [dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
> [dhcp-119-202:23650] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 23857 on node dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).
>
> So, this time the process went further than before. I tested on a different platform (64 bit machine with fedora core 7) and openmpi checkpoints and restarts as many times as I want to without any problems. This means that the issue above must be platform dependent and I must be missing some option in building the code.
>
> Cheers,
> Tamer
>
> On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:
>
>> Tamer,
>>
>> This should now be fixed in r18241. Though I was able to replicate this bug, it only occurred sporadically for me. It seemed to be caused by some socket descriptor caching that was not properly cleaned up by the restart procedure.
>>
>> My testing appears to conclude that this bug is now fixed, but since it is difficult to reproduce, if you see it happen again definitely let me know.
>>
>> With the current trunk you may see the following error message:
>> --
>> [odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> --
>> This is not caused by the checkpoint/restart code, but by some recent changes to our TCP component. We are working on fixing this, but I just wanted to give you a heads up in case you see this error. As far as I can tell it does not interfere with the checkpoint/restart functionality.
>>
>> Let me know if this fixes your problem.
>>
>> Cheers,
>> Josh
>>
>> On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:
>>
>>> Tamer,
>>>
>>> Just wanted to update you on my progress. I am able to reproduce something similar to this problem. I am currently working on a solution to it. I'll let you know when it is available, probably in the next day or two.
>>>
>>> Thank you for the bug report.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On Apr 18, 2008, at 1:11 PM, Tamer wrote:
>>>
>>>> Hi Josh:
>>>>
>>>> I am running on linux fedora core 7 kernel: 2.6.23.15-80.fc7
>>>>
>>>> The machine is dual-core with shared memory so it's not even a cluster.
>>>>
>>>> I downloaded r18208 and built it with the following options:
>>>>
>>>> ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr
>>>>
>>>> when I run mpirun I pass the following command:
>>>>
>>>> mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760
>>>>
>>>> I was able to checkpoint and restart successfully and was able to checkpoint the restarted job (mpirun showed up with ps -efa | grep mpirun under r18208) but was unable to restart again; here's the error message:
>>>>
>>>> ompi-restart ompi_global_snapshot_23865.ckpt
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity: Connection to lifeline [[45670,0],0] lost
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity: Connection to lifeline [[45670,0],0] lost
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity: Connection to lifeline
Re: [OMPI users] PubSub and MPI
Additionally, the mpi-t spec has some accept/connect examples in the dynamic processes chapter.

-jms
Sent from my PDA. No type good.

-----Original Message-----
From: Tim Prins [mailto:tpr...@open-mpi.org]
Sent: Thursday, April 24, 2008 09:33 AM Eastern Standard Time
To: Open MPI Users
Subject: Re: [OMPI users] PubSub and MPI

Open MPI ships with a full set of man pages for all the MPI functions, you might want to start with those.

Tim

Alberto Giannetti wrote:
> I am looking to use MPI in a publisher/subscriber context. Haven't found much relevant information online.
> Basically I would need to deal with dynamic tag subscriptions from independent components (connectors) and a number of other issues. I can provide more details if there is an interest. Am also looking for more information on these calls:
>
> MPI_Open_port
> MPI_Publish_name
> MPI_Comm_spawn_multiple
>
> Any code example or snapshot would be great.
Re: [OMPI users] PubSub and MPI
Open MPI ships with a full set of man pages for all the MPI functions; you might want to start with those.

Tim

Alberto Giannetti wrote:
> I am looking to use MPI in a publisher/subscriber context. Haven't found much relevant information online. Basically I would need to deal with dynamic tag subscriptions from independent components (connectors) and a number of other issues. I can provide more details if there is an interest. Am also looking for more information on these calls:
>
> MPI_Open_port
> MPI_Publish_name
> MPI_Comm_spawn_multiple
>
> Any code example or snapshot would be great.
Re: [OMPI users] Message compression in OpenMPI
On Apr 24, 2008, at 8:26 AM, Tomas Ukkonen wrote:

> Yes, you are probably right that it's not worth the effort in general and especially not in HPC environments where you have a very fast network.
>
> But I can think of (rather important) special cases where it is important:
>
> - non HPC environments with slow network: beowulf clusters and/or internet + normal PCs where you use existing workstations and network for computations.
> - communication/io-bound computations where you transfer large redundant datasets between nodes
>
> So it would be nice to be able to turn on the compression (for specific communicators and/or data transfers) when you need it.

Quite possibly so. Note that there are a few proposals going on in MPI-2.2/MPI-3 about how to pass "hints" or "assertions" to the MPI implementation. Compression could be one of these hints -- the MPI may not be able to detect that it's in a situation that is favorable for compression, so having the user/app tell it "use compression on this communicator" could be helpful.

Would you be willing to contribute the work to Open MPI to enable compression? Per a post yesterday (http://www.open-mpi.org/community/lists/users/2008/04/5473.php), contributions are always welcome.

--
Jeff Squyres
Cisco Systems
[OMPI users] PubSub and MPI
I am looking to use MPI in a publisher/subscriber context. Haven't found much relevant information online. Basically I would need to deal with dynamic tag subscriptions from independent components (connectors) and a number of other issues. I can provide more details if there is an interest.

Am also looking for more information on these calls:

MPI_Open_port
MPI_Publish_name
MPI_Comm_spawn_multiple

Any code example or snapshot would be great.
Re: [OMPI users] Message compression in OpenMPI
George Bosilca wrote:
> The paper you cited, while presenting a particular implementation, doesn't present any new ideas. The compression of the data was studied for a long time, and [unfortunately] it always came back to the same result. In the general case, not worth the effort!
>
> Now of course, if one limits oneself to very regular applications (such as the one presented in the paper), where the matrices involved in the computation are well conditioned (such as in the paper), and if you only use MPI_DOUBLE (\cite{same_paper}), and finally if you only expect to run over slow Ethernet (1 Gb/s) (\cite{same_paper_again})... then yes one might get some benefit.

Yes, you are probably right that it's not worth the effort in general and especially not in HPC environments where you have a very fast network.

But I can think of (rather important) special cases where it is important:

- non HPC environments with slow network: beowulf clusters and/or internet + normal PCs where you use existing workstations and network for computations.
- communication/io-bound computations where you transfer large redundant datasets between nodes

So it would be nice to be able to turn on the compression (for specific communicators and/or data transfers) when you need it.

--
Tomas

> george.
>
> On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:
>
>> Hello
>>
>> I read from somewhere that OpenMPI supports some kind of data compression but I couldn't find any information about it.
>>
>> Is this true and how it can be used?
>>
>> Does anyone have any experiences about using it?
>>
>> Is it possible to use compression in just some subset of communications (communicator specific compression settings)?
>>
>> In our MPI application we are transferring large amounts of sparse/redundant data that compresses very well. Also my initial tests showed significant improvements in performance.
>>
>> There are also articles that suggest that compression should be used [1].
>>
>> [1] J. Ke, M. Burtscher and E. Speight. Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications.
>>
>> Thanks in advance,
>> Tomas
Re: [OMPI users] Busy waiting [was Re: (no subject)]
On Apr 24, 2008, at 6:56 AM, Ingo Josopait wrote:

> I am using one of the nodes as a desktop computer. Therefore it is most important for me that the mpi program is not so greedily acquiring cpu time.

From a performance/usability standpoint, you could set interactive applications to a higher priority to guarantee that your desktop applications work as expected.

http://www.informit.com/articles/article.aspx?p=101760
[OMPI users] install intel mac with Laopard
Dear Sir:

I think that this problem must already have been solved, and some information may be in the archives, but I could not find the right answer in my search, so please allow me to ask again.

I tried to install openmpi-1.2.5 on a new Xserve (Xeon) with Leopard. The Intel compiler is used for Fortran. My options for configure were

CC=/usr/bin/gcc-4.0 CXX=/usr/bin/g++-4.0 F77=ifort

along with

--with-rsh="ssh -x" --enable-shared --without-cs-fs --without-memory-manager

Then I saw an error message:

checking if C and Fortran 77 are link compatible... no
**
* It appears that your Fortran 77 compiler is unable to link against
* object files created by your C compiler. This generally indicates
* either a conflict between the options specified in CFLAGS and FFLAGS
* or a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compilers and what error resulted when the commands were executed) is
* available in the config.log file in this directory.
**
configure: error: C and Fortran 77 compilers are not link compatible. Can not continue.

I suppose that the problem is the default selection for the architecture (32 or 64 bit). I don't know the correct options. Of course, I would like to use the 64-bit architecture as long as it works.

Best regards,
---
Koun SHIRAI
Nanoscience and Nanotechnology Center
ISIR, Osaka University
8-1, Mihogaoka, Ibaraki
Osaka 567-0047, JAPAN
PH: +81-6-6879-4302
FAX: +81-6-6879-8539
Re: [OMPI users] Message compression in OpenMPI
Jeff Squyres wrote:
> On Apr 22, 2008, at 9:03 AM, Tomas Ukkonen wrote:
>
>> I read from somewhere that OpenMPI supports some kind of data compression but I couldn't find any information about it.
>>
>> Is this true and how it can be used?
>
> Nope, sorry -- not true.
>
> This just came up in a different context, actually. We added some preliminary compression on our startup/mpirun messages and found that it really had no effect; any savings that you get in bandwidth (and therefore overall wall clock time) are eaten up by the time necessary to compress/uncompress the messages. There were a few more things we could have tried, but frankly we had some higher priority items to finish for the upcoming v1.3 series. :-(

Ok, so I have to do it myself. Not a problem really, because there are only a few places where the compression really seems to matter.

>> Does anyone have any experiences about using it?
>>
>> Is it possible to use compression in just some subset of communications (communicator specific compression settings)?
>>
>> In our MPI application we are transferring large amounts of sparse/redundant data that compresses very well. Also my initial tests showed significant improvements in performance.
>
> If your particular data is well-suited for fast compression, you might want to compress it before calling MPI_SEND / after calling MPI_RECV. Use the MPI_BYTE datatype to send/receive the messages, and then MPI won't do anything additional for datatype conversions, etc.

Yeah, already did something like this. I have a situation where all the nodes are sending large amounts of redundant data at once.

The combination "compress --> MPI_SEND --> MPI_RECV --> decompress" works of course, but it forces one to allocate large amounts of memory (or disk space) for the compressed data. You can do it manually in parts of course, but it would be nice if the MPI library could do it behind the scenes.

Thanks,
--
Tomas Ukkonen
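Doing the compression "manually in parts" could look roughly like the sketch below, which bounds the scratch memory to one chunk. Everything here is an illustrative assumption: the run-length coder is only a stand-in for a real compressor, `send_fn` is a hypothetical hook where an MPI program would call MPI_Send with MPI_BYTE, and `receive_chunk` simulates the receiving side in the same process.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096  /* illustrative chunk size in bytes */

/* Run-length encode n bytes as (count, value) pairs; returns the encoded size.
 * `out` must hold at least 2*n bytes (worst case: no repeated bytes). */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char v = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == v && run < 255)
            run++;
        out[o++] = (unsigned char)run;
        out[o++] = v;
        i += run;
    }
    return o;
}

/* Inverse of rle_encode; returns the number of bytes written to `out`. */
size_t rle_decode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        memset(out + o, in[i + 1], in[i]);
        o += in[i];
    }
    return o;
}

/* Hypothetical transport hook; an MPI program would send `len` MPI_BYTEs here. */
typedef void (*send_fn)(const unsigned char *buf, size_t len, void *ctx);

/* Compress and hand off the data one chunk at a time, so only 2*CHUNK bytes
 * of scratch space are needed. Returns the total number of bytes handed off. */
size_t send_compressed(const unsigned char *data, size_t n, send_fn send, void *ctx)
{
    unsigned char scratch[2 * CHUNK];
    size_t total = 0;
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        size_t clen = rle_encode(data + off, len, scratch);
        send(scratch, clen, ctx);
        total += clen;
    }
    return total;
}

/* Simulated receiver: decodes each chunk straight into its output buffer. */
struct receiver {
    unsigned char *buf;
    size_t len;
};

void receive_chunk(const unsigned char *buf, size_t len, void *ctx)
{
    struct receiver *r = ctx;
    r->len += rle_decode(buf, len, r->buf + r->len);
}
```

Because each chunk is encoded independently, the receiver can decode it as soon as it arrives and neither side ever holds the whole compressed message. The price is a somewhat worse compression ratio than compressing the buffer in one piece.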
Re: [OMPI users] Busy waiting [was Re: (no subject)]
I am using one of the nodes as a desktop computer. Therefore it is most important for me that the mpi program is not so greedily acquiring cpu time. But I would imagine that the energy consumption is generally a big issue, since energy is a major cost factor in a computer cluster. When a cpu is idle, it uses considerably less energy. Last time I checked my computer used 180W when both cpu cores were working and 110W when both cores were idle. I just made a small hack to solve the problem. I inserted a simple sleep call into the function 'opal_condition_wait': --- orig/openmpi-1.2.6/opal/threads/condition.h +++ openmpi-1.2.6/opal/threads/condition.h @@ -78,6 +78,7 @@ #endif } else { while (c->c_signaled == 0) { + usleep(1000); opal_progress(); } } The usleep call will let the program sleep for about 4 ms (it won't sleep for a shorter time because of some timer granularity). But that is good enough for me. The cpu usage is (almost) zero when the tasks are waiting for one another. For a proper implementation you would want to actively poll without a sleep call for a few milliseconds, and then use some other method that sleeps not for a fixed time, but until new messages arrive. Barry Rountree schrieb: > On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote: >> I can think of several advantages that using blocking or signals to >> reduce the cpu load would have: >> >> - Reduced energy consumption > > Not necessarily. Any time the program ends up running longer, the > cluster is up and running (and wasting electricity) for that amount of > time. In the case where lots of tiny messages are being sent you could > easily end up using more energy. > >> - Running additional background programs could be done far more efficiently > > It's usually more efficient -- especially in terms of cache -- to batch > up programs to run one after the other instead of running them > simultaneously. > >> - It would be much simpler to examine the load balance. 
>
> This is true, but it's still pretty trivial to measure load imbalance.
> MPI allows you to write a wrapper library that intercepts any MPI_*
> call. You can instrument the code however you like, then call PMPI_*,
> then catch the return value, finish your instrumentation, and return
> control to your program. Here's some pseudocode:
>
> int MPI_Barrier(MPI_Comm comm){
>     struct timeval start, stop;
>     int rc;
>     gettimeofday(&start, NULL);
>     rc = PMPI_Barrier( comm );
>     gettimeofday(&stop, NULL);
>     fprintf( logfile, "Barrier on node %d took %lf seconds\n",
>              rank, delta(&start, &stop) );
>     return rc;
> }
>
> I've got some code that does this for all of the MPI calls in OpenMPI
> (ah, the joys of writing C code using python scripts). Let me know if
> you'd find it useful.
>
>> It may depend on the type of program and the computational environment,
>> but there are certainly many cases in which putting the system in idle
>> mode would be advantageous. This is especially true for programs with
>> low network traffic and/or high load imbalances.
>
> I could use a few more benchmarks like that. Seriously, if
> you're mostly concerned about saving energy, a quick hack is to set a
> timer as soon as you enter an MPI call (say for 100ms) and if the timer
> goes off while you're still in the call, use DVS to drop your CPU
> frequency to the lowest value it has. Then, when you exit the MPI call,
> pop it back up to the highest frequency. This can save a significant
> amount of energy, but even here there can be a performance penalty. For
> example, UMT2K schleps around very large messages, and you really need
> to be running as fast as possible during the MPI_Waitall calls or the
> program will slow down by 1% or so (thus using more energy).
>
> Doing this just for Barriers and Allreduces seems to speed up the
> program a tiny bit, but I haven't done enough runs to make sure this
> isn't an artifact.
>
> (This is my dissertation topic, so before asking any question be advised
> that I WILL talk your ear off.)
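
For anyone who wants to try the timing idea from the wrapper pseudocode
without an MPI installation at hand, here is a minimal sketch in plain C.
It is not the code Barry mentions. The names timed_call and fake_barrier
are hypothetical, and delta is only my guess at what the helper in the
pseudocode might look like. In a real wrapper library the function pointer
would be replaced by a direct call to the PMPI_* entry point.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Assumed helper: elapsed seconds between two gettimeofday() samples. */
static double delta(const struct timeval *start, const struct timeval *stop)
{
    return (double)(stop->tv_sec - start->tv_sec)
         + (double)(stop->tv_usec - start->tv_usec) / 1e6;
}

/* Hypothetical stand-in for PMPI_Barrier so the sketch builds without MPI. */
static int fake_barrier(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Time a single call and write one log line for it.
 * Returns whatever the wrapped function returned, so the caller
 * sees the same result it would have seen without instrumentation.
 */
static int timed_call(const char *name, int rank, FILE *logfile,
                      int (*fn)(void *), void *arg, double *elapsed)
{
    struct timeval start, stop;
    int rc;

    assert(fn != NULL && elapsed != NULL);

    gettimeofday(&start, NULL);
    rc = fn(arg);                 /* a real wrapper would call PMPI_* here */
    gettimeofday(&stop, NULL);

    *elapsed = delta(&start, &stop);
    fprintf(logfile, "%s on node %d took %f seconds\n", name, rank, *elapsed);
    return rc;
}
```

Comparing the logged times per rank for the same call should give a rough
picture of load imbalance: the ranks that wait longest in the barrier are
the ones that arrived early.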
>
>> The "spin for a while and then block" method that you mentioned earlier
>> seems to be a good compromise. Just do polling for some time that is
>> long compared to the corresponding system call, and then go to sleep if
>> nothing happens. In this way, the latency would be only marginally
>> increased, while less cpu time is wasted in the polling loops, and I
>> would be much happier.
>>
>
> I'm interested in seeing what this does for energy savings. Are you
> volunteering to test a patch? (I've got four other papers I need to
> get finished up, so it'll be a few weeks before I start coding.)
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>
>>
>>
>>
>> Jeff Squyres wrote:
>>> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
>>> Do you really mean that Open-MPI uses a busy loop in order to handle incoming calls? It seems to be incorrect since