Re: [OMPI users] Distribute app over open mpi
I'm afraid we don't position binaries for you, though we have talked about providing that capability. I have some code in the system that -will- do it for certain special circumstances, but not in the general case, and much of the required infrastructure exists in the code today. I'm not sure how well that will do what you want. There are a number of reasons why we don't do it (locating required libraries, etc.); I think they have been reported on the user list several times over the years.

On Nov 6, 2009, at 9:10 AM, Arnaud Westenberg wrote:

> Hi all,
>
> Sorry for the newbie question, but I'm having a hard time finding the
> answer, as I'm not even familiar with the terminology...
>
> I've set up a small cluster on Ubuntu (Hardy) and everything is working
> great, including SLURM etc. If I run the well-known 'Pi' program I get
> the proper results returned from all the nodes.
>
> However, I'm looking for a way to avoid installing the application on
> each node, or on the shared NFS. Currently I get the obvious error that
> the app is not found on the nodes on which it isn't installed. The idea
> is that the master node would distribute the required (parts of the)
> program to the slave nodes so they can perform the assigned work.
>
> The reason is that I want to run an FEA package on a much larger
> (Red Hat) cluster we currently use for CFD calculations. I really don't
> want to mess up the cluster, as we bought it already configured, and
> compiling new versions of the FEA package on it turns out to be a
> missing-library nightmare.
>
> Thanks for your help.
>
> Regards,
> Arnaud
Re: [OMPI users] an environment variable with same meaning than the -x option of mpiexec
Not at the moment - though I imagine we could create one. It is a tad tricky in that we allow multiple -x options on the cmd line, but we obviously can't do that with an envar. The most likely solution would be to specify multiple "-x" equivalents by separating them with a comma in the envar. It would take some parsing to make it all work, but not impossible. I can add it to the "to-do" list for a rainy day :-)

On Nov 6, 2009, at 7:59 AM, Paul Kapinos wrote:

> Dear Open MPI developers,
>
> With the -x option of mpiexec there is a way to distribute environment
> variables:
>
>   -x  Export the specified environment variables to the remote nodes
>       before executing the program.
>
> Is there an environment variable (OMPI_...) with the same meaning?
> Writing environment variables on the command line is ugly and
> tedious... I've searched for this info on the Open MPI web pages for
> about an hour and didn't find the answer :-/
>
> Thanking you in anticipation,
> Paul
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
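(For reference, the command-line form that exists today looks like this; the variable names are just placeholders:

  mpiexec -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -np 4 ./a.out

The envar proposed above would collapse that into a single comma-separated list, e.g. a hypothetical "export OMPI_x_vars=LD_LIBRARY_PATH,OMP_NUM_THREADS" -- no such variable exists as of this thread; it is only the sketched design.)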
Re: [OMPI users] Programming Help needed
Thanks. Jonathan's answer seems almost perfect; I perceive the same.

On Fri, Nov 6, 2009 at 6:17 PM, Tom Rosmond wrote:
> AMJAD
>
> On your first question, the answer is probably, if everything else is
> done correctly. The first test is to not try to do the overlapping
> communication and computation, but do them sequentially and make sure
> the answers are correct. Have you done this test? Debugging your
> original approach will be challenging, and having a control solution
> will be a big help.

I followed the path of sequential, then parallel blocking, and then parallel non-blocking. My serial solution is the control solution.

> On your second question, if I understand it correctly, is that it is
> always better to minimize the number of messages. In problems like this
> communication costs are dominated by latency, so bundling the data into
> the fewest possible messages will ALWAYS be better.

That's good. But the point made by Jonathan - "If you really do hide most of the communications cost with your non-blocking communications, then it may not matter too much" - is the one I want to be sure about.

> T. Rosmond
>
> [...]
Re: [OMPI users] Programming Help needed
AMJAD,

On your first question, the answer is probably yes, if everything else is done correctly. The first test is to not try to do the overlapping communication and computation, but to do them sequentially and make sure the answers are correct. Have you done this test? Debugging your original approach will be challenging, and having a control solution will be a big help.

On your second question, if I understand it correctly: it is always better to minimize the number of messages. In problems like this, communication costs are dominated by latency, so bundling the data into the fewest possible messages will ALWAYS be better.

T. Rosmond

On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> [...]
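(Tom's bundling advice, as a minimal C sketch -- the names TAG_FACES, nfaces, neighbor, and face_vals are illustrative placeholders, not taken from Amjad's code:

  #include <mpi.h>

  #define TAG_FACES 100

  /* Send all face values shared with one neighbor as a single packed
   * message, instead of one message per face: one latency cost
   * instead of nfaces of them. */
  void send_faces_packed(double *face_vals, int nfaces,
                         int neighbor, MPI_Request *req)
  {
      MPI_Isend(face_vals, nfaces, MPI_DOUBLE, neighbor, TAG_FACES,
                MPI_COMM_WORLD, req);
  }

The matching receiver posts one MPI_Irecv of nfaces doubles per neighbor.)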
Re: [OMPI users] Programming Help needed
Hi, Amjad:

> [...] What I do is that I start non-blocking MPI communication at the
> partition boundary faces (faces shared between any two processors), and
> then start computing values on the internal/non-shared faces. When I
> complete this computation, I put WAITALL to ensure MPI communication
> completion. Then I do computation on the partition boundary faces
> (shared ones). This way I try to hide the communication behind
> computation. Is it correct?

As long as your numerical method allows you to do this (that is, you definitely don't need those boundary values to compute the internal values), then yes, this approach can hide some of the communication costs very effectively.

The way I'd program this if I were doing it from scratch would be to do the usual blocking approach first (no one computes anything until all the faces are exchanged) and get that working, then break up the computation step into internal and boundary computations and make sure it still works, then change the messaging to isends/irecvs/waitalls and make sure it still works, and only then interleave the two.

> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> elements) with another processor B, then it sends/recvs 50 different
> messages. [...] Is there much of a difference between these two
> approaches?

Your individual element faces that are being communicated are likely quite small. It is quite generally the case that bundling many small messages into large messages can significantly improve performance, as you avoid incurring the repeated latency costs of sending many messages. As always, though, the answer is 'it depends', and the only way to know is to try it both ways. If you really do hide most of the communications cost with your non-blocking communications, then it may not matter too much.

In addition, if you don't know beforehand how much data you need to send/receive, then you'll need a handshaking step, which introduces more synchronization and may actually hurt performance, or you'll have to use MPI-2 one-sided communications. On the other hand, if this shared boundary doesn't change through the simulation, you could just figure out at start-up time how big the messages will be between neighbours and use that as the basis for the usual two-sided messages.

My experience is that there's an excellent chance you'll improve the performance by packing the little messages into fewer larger messages.

Jonathan
--
Jonathan Dursi
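(A hedged C sketch of the staged pattern Jonathan describes; every name here -- nneigh, neighbors, counts, sendbuf, recvbuf, and the two compute_* routines -- is an assumed placeholder for the application's own bookkeeping:

  #include <mpi.h>

  #define MAX_NEIGH 64

  void compute_interior_faces(void);   /* assumed application routines */
  void compute_boundary_faces(void);

  void exchange_and_compute(int nneigh, const int *neighbors,
                            const int *counts,
                            double **sendbuf, double **recvbuf)
  {
      MPI_Request reqs[2 * MAX_NEIGH];
      int i, n = 0;

      /* 1. Start the non-blocking halo exchange with every neighbor. */
      for (i = 0; i < nneigh; i++) {
          MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE, neighbors[i],
                    0, MPI_COMM_WORLD, &reqs[n++]);
          MPI_Isend(sendbuf[i], counts[i], MPI_DOUBLE, neighbors[i],
                    0, MPI_COMM_WORLD, &reqs[n++]);
      }

      /* 2. Compute on interior faces; these need no halo data. */
      compute_interior_faces();

      /* 3. Block until all halo data has arrived and sends completed. */
      MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

      /* 4. Only now touch the partition-boundary (shared) faces. */
      compute_boundary_faces();
  }

Per Jonathan's advice, get each stage working separately before interleaving them.)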
[OMPI users] Programming Help needed
Hi all,

I need/request some help from those who have experience in debugging/profiling/tuning parallel scientific codes, especially for PDEs/CFD.

I have parallelized a Fortran CFD code to run on an Ethernet-based Linux cluster. Regarding MPI communication, what I do is this: suppose that the grid/mesh is decomposed for n processors, such that each processor has a number of elements that share their side/face with different processors. I start non-blocking MPI communication at the partition boundary faces (faces shared between any two processors), and then start computing values on the internal/non-shared faces. When I complete this computation, I put WAITALL to ensure MPI communication completion. Then I do computation on the partition boundary faces (the shared ones). This way I try to hide the communication behind computation. Is it correct?

IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer elements) with another processor B, then it sends/recvs 50 different messages. So in general, if a processor has X faces shared with any number of other processors, it sends/recvs X messages. Does this approach have "very much reduced" performance in comparison to the alternative in which processor A sends/recvs a single bundled message (containing all 50 faces' data) to processor B? That is, in general a processor would only send/recv as many messages as it has neighbouring processors, sending a single bundle/pack of data to each neighbour. Is there much of a difference between these two approaches?

THANK YOU VERY MUCH.
AMJAD.
[OMPI users] Distribute app over open mpi
Hi all,

Sorry for the newbie question, but I'm having a hard time finding the answer, as I'm not even familiar with the terminology...

I've set up a small cluster on Ubuntu (Hardy) and everything is working great, including SLURM etc. If I run the well-known 'Pi' program I get the proper results returned from all the nodes.

However, I'm looking for a way to avoid installing the application on each node, or on the shared NFS. Currently I get the obvious error that the app is not found on the nodes on which it isn't installed. The idea is that the master node would distribute the required (parts of the) program to the slave nodes so they can perform the assigned work.

The reason is that I want to run an FEA package on a much larger (Red Hat) cluster we currently use for CFD calculations. I really don't want to mess up the cluster, as we bought it already configured, and compiling new versions of the FEA package on it turns out to be a missing-library nightmare.

Thanks for your help.

Regards,
Arnaud
[OMPI users] mpirun noticed that process rank 1 ... exited on signal 13 (Broken pipe).
Hi Everyone,

I have installed Open MPI 1.3 and BLCR 0.8.1 on my laptop (single processor). I am trying to checkpoint a small test application:

###
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("mpisleep bye \n");
    MPI_Finalize();
    return 0;
}
###

I compile it as follows:

  mpicc mpisleep.c -o mpisleep

and I run it as follows:

  mpirun -am ft-enable-cr -np 2 mpisleep

When I try checkpointing it (ompi-checkpoint -v 8118), it checkpoints fine, but when I restart it, I get the following:

I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
mpisleep bye
--
mpirun noticed that process rank 1 with PID 8118 on node raj-laptop exited on signal 13 (Broken pipe).
--

Any suggestions are very much appreciated.

Raj
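(One candidate explanation, given the fork() warnings that appear later in this digest: system() forks a child process, and checkpoint/restart services handle forked children poorly. A hedged variant of the test program that avoids fork entirely replaces each system("sleep 10") with:

  #include <unistd.h>   /* for sleep() */

  sleep(10);   /* pauses in-process; no fork()ed child for BLCR to trip over */

This is only a guess at the cause, but it removes one known source of trouble under -am ft-enable-cr.)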
[OMPI users] an environment variable with same meaning than the -x option of mpiexec
Dear Open MPI developers,

With the -x option of mpiexec there is a way to distribute environment variables:

  -x  Export the specified environment variables to the remote nodes
      before executing the program.

Is there an environment variable (OMPI_...) with the same meaning? Writing environment variables on the command line is ugly and tedious... I've searched for this info on the Open MPI web pages for about an hour and didn't find the answer :-/

Thanking you in anticipation,
Paul

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
Re: [OMPI users] Segmentation fault whilst running RaXML-MPI
Hi,

Thank you for the information. I'm going to try the new Intel compilers, which I'm downloading now, but as they're taking so long to download I don't think I'll be able to look into this again until after the weekend. BTW, using their Java-based downloader is a bit less painful than their normal download.

In the meantime, if anyone else has some suggestions then please let me know.

Thanks

Nick

2009/11/5 Jeff Squyres:
> FWIW, I think Intel released 11.1.059 earlier today (I've been trying
> to download it all morning). I doubt it's an issue in this case, but I
> thought I'd mention it as a public service announcement. ;-)
>
> Seg faults are *usually* an application issue (never say "never", but
> they *usually* are). You might want to first contact the RAxML team to
> see if there are any known issues with their software and Open MPI
> 1.3.3...? (Sorry, I'm totally unfamiliar with RAxML.)
>
> On Nov 5, 2009, at 12:30 PM, Nick Holway wrote:
>
>> Dear all,
>>
>> I'm trying to run RAxML 7.0.4 on my 64-bit Rocks 5.1 cluster (i.e.
>> CentOS 5.2). I compiled Open MPI 1.3.3 using the Intel compilers
>> v11.1.056 using ./configure CC=icc CXX=icpc F77=ifort FC=ifort
>> --with-sge --prefix=/usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man
>> --with-memory-manager=none.
>>
>> When I run RAxML in a qlogin session using
>> /usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man/bin/mpirun -np 8
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI
>> -f a -x 12345 -p 12345 -# 10 -m GTRGAMMA -s
>> /users/holwani1/jay/ornodko-1582 -n mpitest39
>>
>> I get the following output:
>>
>> This is the RAxML MPI Worker Process Number: 1
>> This is the RAxML MPI Worker Process Number: 3
>> This is the RAxML MPI Master process
>> This is the RAxML MPI Worker Process Number: 7
>> This is the RAxML MPI Worker Process Number: 4
>> This is the RAxML MPI Worker Process Number: 5
>> This is the RAxML MPI Worker Process Number: 2
>> This is the RAxML MPI Worker Process Number: 6
>>
>> IMPORTANT WARNING: Alignment column 1695 contains only undetermined
>> values which will be treated as missing data
>>
>> IMPORTANT WARNING: Sequences A4_H10 and A3ii_E11 are exactly identical
>> IMPORTANT WARNING: Sequences A2_A08 and A9_C10 are exactly identical
>> IMPORTANT WARNING: Sequences A3ii_B03 and A3ii_C06 are exactly identical
>> IMPORTANT WARNING: Sequences A9_D08 and A9_F10 are exactly identical
>> IMPORTANT WARNING: Sequences A3ii_F07 and A9_C08 are exactly identical
>> IMPORTANT WARNING: Sequences A6_F05 and A6_F11 are exactly identical
>>
>> IMPORTANT WARNING
>> Found 6 sequences that are exactly identical to other sequences in
>> the alignment. Normally they should be excluded from the analysis.
>>
>> IMPORTANT WARNING
>> Found 1 column that contains only undetermined values which will be
>> treated as missing data. Normally these columns should be excluded
>> from the analysis.
>>
>> An alignment file with undetermined columns and sequence duplicates
>> removed has already been printed to file
>> /users/holwani1/jay/ornodko-1582.reduced
>>
>> You are using RAxML version 7.0.4 released by Alexandros Stamatakis
>> in April 2008
>>
>> Alignment has 1280 distinct alignment patterns
>> Proportion of gaps and completely undetermined characters in this
>> alignment: 0.124198
>>
>> RAxML rapid bootstrapping and subsequent ML search
>>
>> Executing 10 rapid bootstrap inferences and thereafter a thorough ML
>> search
>>
>> All free model parameters will be estimated by RAxML
>> GAMMA model of rate heterogeneity, ML estimate of alpha-parameter
>> GAMMA model parameters will be estimated up to an accuracy of
>> 0.10 Log Likelihood units
>>
>> Partition: 0
>> Name: No Name Provided
>> DataType: DNA
>> Substitution Matrix: GTR
>> Empirical Base Frequencies:
>> pi(A): 0.261129 pi(C): 0.228570 pi(G): 0.315946 pi(T): 0.194354
>>
>> Switching from GAMMA to CAT for rapid bootstrap; the final ML search
>> will be conducted under the GAMMA model you specified
>> Bootstrap[10]: Time 44.442728 bootstrap likelihood -inf, best
>> rearrangement setting 5
>> Bootstrap[0]: Time 44.814948 bootstrap likelihood -inf, best
>> rearrangement setting 5
>> Bootstrap[6]: Time 46.470371 bootstrap likelihood -inf, best
>> rearrangement setting 6
>> [compute-0-11:08698] *** Process received signal ***
>> [compute-0-11:08698] Signal: Segmentation fault (11)
>> [compute-0-11:08698] Signal code: Address not mapped (1)
>> [compute-0-11:08698] Failing at address: 0x408
>> [compute-0-11:08698] [ 0] /lib64/libpthread.so.0 [0x3fb580de80]
>> [compute-0-11:08698] [ 1]
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(hookup+0)
>> [0x413ca0]
>> [compute-0-11:08698] [ 2]
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(restoreTL+0xd9)
>> [0x442c09]
>> [compute-0-11:08698] [ 3]
>>
Re: [OMPI users] Question about checkpoint/restart protocol
On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:

> Dear Sergio,
>
> Thank you for your reply. I've inserted the modules into the kernel and
> it all worked fine. But there is still a weird issue. I use the command
> "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test" to
> start an MPI job. I then use "ompi-checkpoint PID" to checkpoint the
> job, but ompi-checkpoint doesn't respond and mpirun produces the
> following:
>
> --
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>   Local host:          comp001.local (PID 23514)
>   MPI_COMM_WORLD rank: 0
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --
> [login01.local:21425] 1 more process has sent help message
> help-mpi-runtime.txt / mpi_init:warn-fork
> [login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
>
> Notice: if the -n option has a value of more than 1, then this error
> occurs; but if the -n option has the value 1, then ompi-checkpoint
> succeeds, mpirun produces the same message, and ompi-restart fails with
> the message:
>
> [login01:21417] *** Process received signal ***
> [login01:21417] Signal: Segmentation fault (11)
> [login01:21417] Signal code: Address not mapped (1)
> [login01:21417] Failing at address: (nil)
> [login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
> [login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so [0x2b093509dfee]
> [login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]
> [login01:21417] [ 3] opal-restart [0x401c3e]
> [login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4]
> [login01:21417] [ 5] opal-restart [0x401399]
> [login01:21417] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 21417 on node login01.local
> exited on signal 11 (Segmentation fault).
> --
>
> Any help with that will be appreciated.

I have not seen this behavior before. The first error is Open MPI warning you that one of your MPI processes is trying to use fork(), so you may want to make sure that your application is not using any system() or fork() function calls. Open MPI internally should not be using any of these functions from within the MPI library linked to the application.

When you reloaded the BLCR module, did you rebuild Open MPI and install it in a clean directory (not over the top of the old directory)?

Have you tried to checkpoint/restart a non-MPI process with BLCR on your system? This will help to rule out installation problems with BLCR.

I suspect that Open MPI is not building correctly, or something in your build environment is confusing/corrupting the build. Can you send me your config.log? It may help me pinpoint the problem if it is build related.
--
Josh

> Thanks in advance,
> Mohamed Adel
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf
> Of Sergio Díaz [sd...@cesga.es]
> Sent: Thursday, November 05, 2009 11:38 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Question about checkpoint/restart protocol
>
> Hi,
>
> Did you load the BLCR modules before compiling Open MPI?
>
> Regards,
> Sergio
>
> Mohamed Adel escribió:
>> Dear OMPI users,
>>
>> I'm a new Open MPI user. I've configured openmpi-1.3.3 with the
>> options "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge
>> --enable-ft-thread --with-ft=cr --enable-mpi-threads --enable-static
>> --disable-shared --with-blcr=/home/mab/blcr-0.8.2/" then compiled and
>> installed it successfully. Now I'm trying to use the
>> checkpoint/restart protocol. I run a program with the options
>> "mpirun -n 2 -am ft-enable-cr -H localhost
>> prime/checkpoint-restart-test" but I receive the following error:
>>
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [madel:28896] Abort before MPI_INIT completed successfully; not able
>> to guarantee that all other processes were killed!
>> --
>> It looks like opal_init failed for some reason; your parallel process
>> is likely to abort. There are many reasons that a parallel
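(For reference, the knob named in the warning can be set on the command line; a sketch using the same test job as above:

  mpirun --mca mpi_warn_on_fork 0 -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test

This only silences the warning -- per Josh's reply, the real fix is to remove the fork()/system() calls from the application.)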
Re: [OMPI users] checkpoint opempi-1.3.3+sge62
On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:

> Hello,
>
> I have achieved the checkpoint of an easy program without SGE. Now, I'm
> trying to do the integration of Open MPI + SGE, but I have some
> problems... When I try to checkpoint the mpirun PID, I get an error
> similar to the one you get when the PID doesn't exist. Example below.

I do not have any experience with the SGE environment, so I suspect that there may be something 'special' about the environment that is tripping up the ompi-checkpoint tool.

First of all, what version of Open MPI are you using?

Some things to check:
- Does 'ompi-ps' work when your application is running?
- Is there a /tmp/openmpi-sessions-* directory on the node where mpirun
  is currently running? This directory contains information on how to
  connect to the mpirun process from an external tool; if it's missing
  then this could be the cause of the problem.

> Any ideas? Does somebody have a script to do it automatically with SGE?
> For example, I have one to do a checkpoint every X seconds with BLCR
> and non-MPI jobs. It is launched by SGE if you have configured the
> queue and the ckpt environment.

I do not know of any integration of the Open MPI checkpointing work with SGE at the moment. As far as time-triggered checkpointing, I have a feature ticket open about this:
  https://svn.open-mpi.org/trac/ompi/ticket/1961
It is not available yet, but in the works.

> Is it possible to choose the name of the ckpt folder when you do the
> ompi-checkpoint? I can't find the option to do it.

Not at this time. Though I could see it as a useful feature, and it shouldn't be too hard to implement. I filed a ticket if you want to follow the progress:
  https://svn.open-mpi.org/trac/ompi/ticket/2098

--
Josh

> Regards,
> Sergio
>
> [sdiaz@compute-3-17 ~]$ ps auxf
> root   20044  0.0  0.0   4468 1224 ?  S   13:28  0:00 \_ sge_shepherd-2645150 -bg
> sdiaz  20072  0.0  0.0  53172 1212 ?  Ss  13:28  0:00   \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz  20112  0.2  0.0  41028 2480 ?  S   13:28  0:00     \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz  20113  0.0  0.0  36484 1824 ?  Sl  13:28  0:00       \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..
> sdiaz  20116  1.2  0.0  99464 4616 ?  Sl  13:28  0:00       \_ pi3
>
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --
> ompi-checkpoint PID_OF_MPIRUN
>   Open MPI Checkpoint Tool
>
>   -am                  Aggregate MCA parameter set file list
>   -gmca|--gmca         Pass global MCA parameters that are applicable
>                        to all contexts (arg0 is the parameter name;
>                        arg1 is the parameter value)
>   -h|--help            This help message
>   --hnp-jobid          This should be the jobid of the HNP whose
>                        applications you wish to checkpoint.
>   --hnp-pid            This should be the pid of the mpirun whose
>                        applications you wish to checkpoint.
>   -mca|--mca           Pass context-specific MCA parameters; they are
>                        considered global if --gmca is not used and only
>                        one context is specified (arg0 is the parameter
>                        name; arg1 is the parameter value)
>   -s|--status          Display status messages describing the
>                        progression of the checkpoint
>   --term               Terminate the application after checkpoint
>   -v|--verbose         Be Verbose
>   -w|--nowait          Do not wait for the application to finish
>                        checkpointing before returning
> --
>
> [sdiaz@compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf | grep sdiaz
sdiaz  14412  0.0  0.0   1888  560 ?  Ss  13:28  0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
sdiaz  14419  0.0  0.0  35728 2260 ?  S   13:28  0:00   \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2295267328.0;tcp://192.168.4.144:36596" -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path
Re: [OMPI users] problems with checkpointing an mpi job
On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:

> Hi All,
>
> I got a problem when trying to checkpoint an MPI job. I will really
> appreciate it if you can help me fix the problem. The BLCR package was
> installed successfully on the cluster. I configured Open MPI with the
> flags:
>   ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>     --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
> The installation looks correct. The Open MPI version is 1.3.3.
>
> I got the following output when issuing ompi_info:
>
> root@hec:/export/home/hjin/test# ompi_info | grep ft
>   MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
> root@hec:/export/home/hjin/test# ompi_info | grep crs
>   MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
>
> It seems the MCA crs is lost, but I have no idea about how to get it.

This is an artifact of the way ompi_info searches for components. This came up before on the users list:
  http://www.open-mpi.org/community/lists/users/2009/09/10667.php
I filed a bug about this, if you want to track its progress:
  https://svn.open-mpi.org/trac/ompi/ticket/2097

> To run a checkpointable application, I run:
>   mpirun -np 2 --host hec -am ft-enable-cr test_mpi
>
> However, when trying to checkpoint from another terminal on the same
> host, I get the following:
>
> root@hec:~# ompi-checkpoint -v 29234
> [hec:29243] orte_checkpoint: Checkpointing...
> [hec:29243]   PID 29234
> [hec:29243]   Connected to Mpirun [[46621,0],0]
> [hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process PID 29234
> [hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243]   Requested - Global Snapshot Reference: (null)
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243]   Pending - Global Snapshot Reference: (null)
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243]   Running - Global Snapshot Reference: (null)
>
> There is an error message at the terminal of the running application:
>
> --
> Error: The process with PID 29236 is not checkpointable. This could be
> due to one of the following:
>  - An application with this PID doesn't currently exist
>  - The application with this PID isn't checkpointable
>  - The application with this PID isn't an OPAL application.
> We were looking for the named files:
>   /tmp/opal_cr_prog_write.29236
>   /tmp/opal_cr_prog_read.29236
> --
> [hec:29234] local) Error: Unable to initiate the handshake with peer [[46621,1],1]. -1
> [hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
> [hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

This means that either the MPI application did not respond to the checkpoint request in time, or that the application was not checkpointable for some other reason. Some options to try:

- Set the 'snapc_full_max_wait_time' MCA parameter to, say, 60; the
  default is 20 seconds before giving up. You can also set it to 0,
  which tells the runtime to wait indefinitely.
    shell$ mpirun -mca snapc_full_max_wait_time 60
- Try cleaning out the /tmp directory on all of the nodes; maybe this
  has something to do with disks being full (though usually we would
  see other symptoms).

If that doesn't help, can you send me the config.log from your build of Open MPI?
If those do not work, I would suspect that something in the configure of Open MPI went wrong.

--
Josh

> Does anyone have some hints to fix this problem?
>
> Thanks,
> Hui Jin
Re: [OMPI users] problem using openmpi with DMTCP
(Sorry for the excessive delay in replying.)

I do not have any experience with the DMTCP project, so I can only speculate on what might be going on here.

If you are using DMTCP to transparently checkpoint Open MPI, you will need to make sure that you are not using any interconnect other than TCP.

If you are building an OPAL CRS component for DMTCP (actually you probably want their MTCP project, which is just the local checkpoint/restart service), then what you might be seeing are the TCP sockets that are left open across a checkpoint operation. As an optimization for checkpoint->continue we leave sockets open when we checkpoint. Since most checkpoint/restart services will skip over the socket fds (since they are not supported) and take the checkpoint, we leave them open, and close them only on restart. I suspect that DMTCP is erroring out since it is trying to do something else with those fds. You may want to try just using the MTCP project, or ask for a way to shut off the socket negotiation and just ignore the socket fds.

Let me know how it goes.

--
Josh

On Sep 28, 2009, at 9:55 AM, Kritiraj Sajadah wrote:

> Dear All,
>
> I am trying to integrate DMTCP with Open MPI. If I run a C application,
> it works fine. But when I execute the program using mpirun, it
> checkpoints the application but gives an error when restarting it:
>
> #
> [31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
>   id() = 2ab3f248-30933-4ac0d75a(99007)
>   _sockDomain = 10
>   _sockType = 1
>   _sockProtocol = 0
>   Message: socket type not yet [fully] supported
> [31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
>   id() = 2ab3f248-30943-4ac0d75c(99007)
>   _sockDomain = 10
>   _sockType = 1
>   _sockProtocol = 0
>   Message: socket type not yet [fully] supported
> [31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING (_real_dup2 ( oldFd, fd ) == fd) failed'
>   oldFd = 537
>   fd = 1
>   (strerror((*__errno_location ()))) = Bad file descriptor
> [31013] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
>   i->second = 537
>   (strerror((*__errno_location ()))) = Bad file descriptor
> [31015] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
>   i->second = 537
>   (strerror((*__errno_location ()))) = Bad file descriptor
> [31017] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
>   i->second = 537
>   (strerror((*__errno_location ()))) = Bad file descriptor
> [31007] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
>   i->second = 537
>   (strerror((*__errno_location ()))) = Bad file descriptor
> MTCP: mtcp_restart_nolibc: mapping current version of
> /usr/lib/gconv/gconv-modules.cache into memory; _not_ file as it
> existed at time of checkpoint. Change mtcp_restart_nolibc.c:634 and
> re-compile, if you want different behavior.
> [31015] ERROR at connection.cpp:372 in restoreOptions; REASON='JASSERT(ret == 0) failed'
>   (strerror((*__errno_location ()))) = Invalid argument
>   fds[0] = 6
>   opt->first = 26
>   opt->second.size() = 4
>   Message: restoring setsockopt failed
> Terminating...
> #
>
> Any suggestions are very welcome.
>
> Regards,
> Raj
Re: [OMPI users] Changing location where checkpoints are saved
(Sorry for the excessive delay in replying.)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:

> Thanks for the reply!
>
> Concerning the MCA options for checkpointing:
> - are verbosity options (e.g., crs_base_verbose) limited to 0 and 1
>   values?
> - in priority options (e.g., crs_blcr_priority), do lower numbers
>   indicate higher priority?
>
> By searching in the archives of the mailing list I found two
> interesting/useful posts:
> - [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
>   (for different checkpointing schemes)
> - [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
>   (for restarting)
>
> Following the indications given in [1], I tried to make each process
> checkpoint itself in its local /tmp and centralize the resulting
> checkpoints in /tmp or $HOME.
>
> Excerpt from mca-params.conf:
> -
>   snapc_base_store_in_place=0
>   snapc_base_global_snapshot_dir=/tmp or $HOME
>   crs_base_snapshot_dir=/tmp
>
> COMMANDS used:
> --
>   mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
>   ompi-checkpoint mpirun_pid
>
> OUTPUT of ompi-checkpoint -v 16753:
> --
> [ic85:17044] orte_checkpoint: Checkpointing...
> [ic85:17044]   PID 17036
> [ic85:17044]   Connected to Mpirun [[42098,0],0]
> [ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
> [ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044]   Requested - Global Snapshot Reference: (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044]   Pending - Global Snapshot Reference: (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044]   Running - Global Snapshot Reference: (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044]   File Transfer - Global Snapshot Reference: (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044]   Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt
>
> OUTPUT of mpirun:
> --
> [ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
> [ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
> --
> WARNING: Could not preload specified file: File already exists.
>   Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
>   Host: ic85
> Will continue attempting to launch the process.
> --
> [ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
> [ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

This is a warning about creating the global snapshot directory (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It seems to indicate that the directory existed when the file gather started. A couple of things to check:
- Did you clean out the /tmp on all of the nodes of any files starting
  with "opal" or "ompi"?
- Does the error go away when you set
  snapc_base_global_snapshot_dir=$HOME?
- Could you try running against a v1.3 release? (I wonder if this
  feature has been broken on the trunk.)

Let me know what you find. In the next couple of days, I'll try to test the trunk again with this feature to make sure that it is still working on my test machines.

--
Josh

> Does anyone have an idea about what is wrong?
> Best regards,
>
> -- Constantinos
>
> Josh Hursey wrote:
>> This is described in the C/R User's Guide attached to the webpage
>> below:
>>   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>
>> Additionally this has been addressed on the users mailing list in the
>> past, so searching around will likely turn up some examples.
>>
>> -- Josh
>>
>> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>>> Dear all,
>>>
>>> I have installed BLCR 0.8.2 and Open MPI (r21973) on my NFS account.
>>> By default, it seems that checkpoints are saved in $HOME. However, I
>>> would prefer them to be saved on a local disk (e.g., /tmp). Does
>>> anyone know how I can change the location where Open MPI saves
>>> checkpoints?
>>>
>>> Best regards,
>>>
>>> -- Constantinos
Re: [OMPI users] mpirun example program fail on multiple nodes- unable to launch specified application on client node
As an alternative technique for distributing the binary, you could ask Open MPI's runtime to do it for you (made available in the v1.3 series). You still need to make sure that the same version of Open MPI is installed on all nodes, but if you pass the --preload-binary option to mpirun, the runtime environment will distribute the binary across the machine (staging it to a temporary directory) before launching it. You can do the same with any arbitrary set of files or directories (comma separated) using the --preload-files option as well.

If you type 'mpirun --help', the options that you are looking for are:

  --preload-files            Preload the comma-separated list of files to
                             the remote machines before starting the
                             remote process.
  --preload-files-dest-dir   The destination directory to use in
                             conjunction with --preload-files. By default
                             the absolute and relative paths provided by
                             --preload-files are used.
  -s|--preload-binary        Preload the binary on the remote machine
                             before starting the remote process.

--
Josh

On Nov 5, 2009, at 6:56 PM, Terry Frankcombe wrote:

> For small ad hoc COWs I'd vote for sshfs too. It may well be as slow as
> a dog, but it actually has some security, unlike NFS, and is a doddle
> to make work with no superuser access on the server, unlike NFS.
>
> On Thu, 2009-11-05 at 17:53 -0500, Jeff Squyres wrote:
>> On Nov 5, 2009, at 5:34 PM, Douglas Guptill wrote:
>>> I am currently using sshfs to mount both OpenMPI and my application
>>> on the "other" computers/nodes. The advantage to this is that I have
>>> only one copy of OpenMPI and my application. There may be a
>>> performance penalty, but I haven't seen it yet.
>>
>> For a small number of nodes (where small <= 32 or sometimes even
>> <= 64), I find that simple NFS works just fine. If your apps aren't
>> IO-intensive, that can greatly simplify installation and deployment
>> of both Open MPI and your MPI applications, IMNSHO. But -- every app
>> is different. :-) YMMV.
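(A sketch of the usage Josh describes; the hostfile and program names are placeholders:

  mpirun -np 8 --hostfile machines --preload-binary ./my_fea_app

The runtime stages ./my_fea_app into a temporary directory on each remote node before launch, so the binary only has to exist on the node where mpirun is invoked.)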
Re: [OMPI users] Help: Firewall problems
On Nov 6, 2009, at 5:49 AM, Lee Amy wrote:

> Thanks. And actually I don't know if I need to disable iptables to run
> MPI programs properly. Obviously from your words Open MPI will use
> random ports, so how do I set up iptables to let trusted machines open
> their random ports?

I'm afraid I'm not enough of an iptables expert to know (I don't know if anyone else on this list will be, either) -- you'll need to check the iptables docs to see. Sorry!

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] Help: Firewall problems
On Fri, Nov 6, 2009 at 2:39 AM, Jeff Squyres wrote:
> On Nov 5, 2009, at 11:28 AM, Lee Amy wrote:
>
>> I remembered MPI does not count on TCP/IP, but why does the default
>> iptables prevent the MPI programs from running? After I stop iptables,
>> the programs run well. I use Ethernet as the connection.
>
> Note that Open MPI *can* use TCP as an interface for MPI messaging. It
> definitely uses TCP for administrative control of MPI jobs, even if TCP
> is not used for MPI messaging. Open MPI therefore basically requires
> the ability to open sockets between all nodes in the job on random TCP
> ports.
>
> You could probably configure iptables to "trust" all the machines in
> your cluster (i.e., allow TCP sockets to/from random ports) but
> disallow most (all?) TCP connections from outside your cluster, if you
> wanted to...?
>
> --
> Jeff Squyres
> jsquy...@cisco.com

Thanks. And actually I don't know if I need to disable iptables to run MPI programs properly. Obviously from your words Open MPI will use random ports, so how do I set up iptables to let trusted machines open their random ports?

Regards,

Amy
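(Along the lines of Jeff's suggestion, a hedged iptables sketch -- assuming the cluster nodes all live on 192.168.1.0/24 and eth1 is the cluster-facing interface; both are placeholders to adjust:

  # Trust all TCP traffic from cluster nodes, whatever ports Open MPI picks:
  iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p tcp -j ACCEPT

with the existing restrictive policy left in place for everything else. Verify against your distribution's iptables documentation before relying on it.)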
Re: [OMPI users] using specific algorithm for collective communication and knowing the root cpu?
On Nov 4, 2009, at 12:46, George Markomanolis wrote:

> I have some questions, because I am using some programs for profiling.
> When you say that the cost of allreduce rises, do you mean the time
> only, or also the flops of this command? Is there some additional work
> added to the allreduce, or is it only about time? During profiling I
> want to count the flops, so a small difference in timing because of
> debug mode and the declaration of the allreduce algorithm is no big
> deal, but if it also changes the flops then it is bad for me.

Using a linear algorithm for reduce will clearly increase the number of fp operations on the root (of course, if we suppose the reduction is working on fp data), and will decrease the fp operations on the other nodes. Imagine that instead of having the computations nicely spread all over the nodes, you put them all on the root. This is what happens with the linear reduction.

> When I executed a program in debug mode, I saw that Open MPI uses
> various algorithms, and I looked at your code and I saw that rank 0 is
> not always the root cpu (if I understood right). Finally, do you have
> any opinion about the best way to know which algorithm is used in a
> collective communication, and the root cpu of the communication?

For the linear implementation of allreduce, Open MPI always uses rank 0 in the communicator as the root. The code is in the $(OMPI_SRCDIR)/ompi/mca/coll/tuned/coll_tuned_allreduce.c file at line 895.

  george.

> Best regards,
> George

George Bosilca had earlier written (Tue, 3 Nov 2009):

> You can add the following MCA parameters either on the command line or
> in the $(HOME)/.openmpi/mca-params.conf file.
>
> On Nov 2, 2009, at 08:52, George Markomanolis wrote:
>
>> Dear all,
>>
>> I would like to ask about collective communication. With debug mode
>> enabled, I can see much info during execution about which algorithm
>> is used, etc. But I would like to use a specific algorithm (the
>> simplest, I suppose). I am profiling some applications and I want to
>> simulate them with another program, so I must be able to know, for
>> example, what mpi_allreduce is doing. I saw many algorithms that
>> depend on the message size and the number of processors, so I would
>> like to ask:
>>
>> 1) What is the way to tell Open MPI to use a simple algorithm for
>> allreduce (is there any way to use the simplest algorithm for all
>> collective communication?). Basically I would like to know the root
>> cpu for every collective communication. What are the disadvantages
>> of demanding the simplest algorithm?
>
> coll_tuned_use_dynamic_rules=1 to allow you to manually set the
> algorithms to be used.
>
> coll_tuned_allreduce_algorithm=*something between 0 and 5* to describe
> the algorithm to be used. For the simplest algorithm I guess you will
> want to use 1 (star-based fan-in fan-out).
>
> The main disadvantage is that the cost of the allreduce will rise,
> which will negatively impact the overall performance of the
> application.
>
>> 2) Is there any overhead because I installed Open MPI with debug
>> mode, even if I just run a program without any --mca flag?
>
> There is a lot of overhead because you compiled in debug mode.
> We do a lot of extra tracking of internally allocated memory, checks
> on most/all internal objects, and so on. Based on previous results I
> would say your latency increases by about 2-3 microseconds, but the
> impact on the bandwidth is minimal.
>
>> 3) How would you describe allreduce in words? Can we say that the
>> root cpu does a reduce and then a broadcast? I mean, is that right
>> for your implementation? I saw that which cpu is the root depends on
>> the algorithm, so is it possible to use an algorithm such that I will
>> know every time that the cpu with rank 0 is the root?
>
> Exactly: allreduce = reduce + bcast (and btw this is what the basic
> algorithm will do). However, there is no root in an allreduce, as all
> processors execute symmetric work. Of course, if one sees the
> allreduce as a reduce followed by a broadcast, then one has to select
> a root (I guess we pick rank 0 in our implementation).
>
>   george.
>
>> Thanks a lot,
>> George
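(Putting the parameters from this thread together, a sketch of forcing the basic linear allreduce -- the application name is a placeholder:

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 1 \
         -np 4 ./my_app

or, equivalently, in $(HOME)/.openmpi/mca-params.conf:

  coll_tuned_use_dynamic_rules=1
  coll_tuned_allreduce_algorithm=1
)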