Re: [OMPI users] Related to project ideas in OpenMPI
There's also kernel-level checkpointing vs. user-level checkpointing - if you can checkpoint an MPI task and restart it on a new node, then this is also "process migration".

Of course, doing a checkpoint & restart can be slower than pure in-kernel process migration, but the advantage is that you don't need any kernel support, and can in fact do all of it in user space.

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

On Thu, Aug 25, 2011 at 10:26 AM, Ralph Castain <r...@open-mpi.org> wrote:

> It also depends on what part of migration interests you - are you wanting to look at the MPI part of the problem (reconnecting MPI transports, ensuring messages are not lost, etc.) or the RTE part of the problem (where to restart processes, detecting failures, etc.)?
>
> On Aug 24, 2011, at 7:04 AM, Jeff Squyres wrote:
>
> > Be aware that process migration is a pretty complex issue.
> >
> > Josh is probably the best one to answer your question directly, but he's out today.
> >
> > On Aug 24, 2011, at 5:45 AM, srinivas kundaram wrote:
> >
> > > I am a final-year grad student looking for my final-year project in Open MPI. We are a group of 4 students.
> > > I wanted to know about the "process migration" of MPI processes in Open MPI.
> > > Can anyone suggest ideas for a project related to process migration in Open MPI, or other topics in systems?
> > >
> > > regards,
> > > Srinivas Kundaram
> > > srinu1...@gmail.com
> > > +91-8149399160
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] BLCR support not building on 1.5.3
I'm glad that worked. I understand the confusion - the configure output could be better, and it shouldn't be too difficult to clean up. I filed a ticket so we don't forget about this issue. The ticket is linked below if you are interested:
https://svn.open-mpi.org/trac/ompi/ticket/2807
Next time I cycle back to the C/R functionality I'll try to address it, but if someone else beats me to it then that should be reflected in the ticket.

-- Josh

On May 27, 2011, at 3:54 PM, Bill Johnstone wrote:

> Hello,
>
> Thank you very much for this. I've replied further below:
>
> ----- Original Message -----
> > From: Joshua Hursey <jjhur...@open-mpi.org>
> [...]
> > What other configure options are you passing to Open MPI? Specifically, the configure test will always fail if '--with-ft=cr' is not specified - by default Open MPI will only build the BLCR component if C/R FT is requested by the user.
>
> This was it! Now the BLCR support builds in just fine.
>
> If I may offer some feedback:
>
> When I think "checkpoint/restart", I don't immediately think "fault tolerance"; rather, I'm interested in it as a better alternative to suspend/resume. So I had *no* idea that turning on the "ft" configure option was a prerequisite for BLCR support to compile, just from reading the configure help, configure output, docs, etc.
>
> I'd like to request that this be made easier to spot. At a minimum, the 'configure --help' output could mention this when it gets to talking about BLCR, or C/R in general.
>
> Additionally, when configuring components in general, it would be nice if the config logs offered a way to get more details about the tests (and why they failed) than just "can compile... no". This may require more invasive changes - not being super-knowledgeable about configure, I don't know how much work this would be.
>
> Lastly, the standard Open MPI documentation (particularly the FAQ) could be updated in the C/R or BLCR sections to reflect the need for the "--with-ft=cr" argument.
>
> Again, I really appreciate the assistance.
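Putting the resolution of this thread in one place: a configure invocation that lets the crs:blcr component build needs the C/R fault-tolerance option in addition to the BLCR paths. This is only a sketch - the BLCR prefix (/usr here, matching the Debian packages discussed in this thread) and the install prefix are assumptions to adjust for your system:

```shell
# Build Open MPI 1.5.x with BLCR checkpoint/restart support.
# Without '--with-ft=cr', the crs:blcr component is silently skipped
# and configure reports "can compile... no".
./configure \
    --with-ft=cr \
    --with-blcr=/usr \
    --with-blcr-libdir=/usr/lib \
    --prefix=$HOME/opt/openmpi
make all install
```
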
Re: [OMPI users] BLCR support not building on 1.5.3
What version of BLCR are you using? What other configure options are you passing to Open MPI? Specifically, the configure test will always fail if '--with-ft=cr' is not specified - by default Open MPI will only build the BLCR component if C/R FT is requested by the user.

Can you send a zip'ed up config.log to the list? That might show something that the configure is missing.

Thanks,
Josh

On May 26, 2011, at 2:26 PM, Bill Johnstone wrote:

> Hello all.
>
> I'm building 1.5.3 from source on a Debian Squeeze AMD64 system, and trying to get BLCR support built in. I've installed all the packages that I think should be relevant to BLCR support, including:
>
> + blcr-dkms
> + libcr0
> + libcr-dev
> + blcr-util
>
> I've also installed blcr-testsuite. I only run Open MPI's configure after loading the BLCR modules, and the tests in blcr-testsuite pass. The relevant headers seem to be in /usr/include and the relevant libraries in /usr/lib.
>
> I've tried three different invocations of configure:
>
> 1. No BLCR-related arguments.
>
> Output snippet from configure:
> checking --with-blcr value... simple ok (unspecified)
> checking --with-blcr-libdir value... simple ok (unspecified)
> checking if MCA component crs:blcr can compile... no
>
> 2. With --with-blcr=/usr only.
>
> Output snippet from configure:
> checking --with-blcr value... sanity check ok (/usr)
> checking --with-blcr-libdir value... simple ok (unspecified)
> configure: WARNING: BLCR support requested but not found. Perhaps you need to specify the location of the BLCR libraries.
> configure: error: Aborting.
>
> 3. With --with-blcr-libdir=/usr/lib only.
>
> Output snippet from configure:
> checking --with-blcr value... simple ok (unspecified)
> checking --with-blcr-libdir value... sanity check ok (/usr/lib)
> checking if MCA component crs:blcr can compile... no
>
> config.log only seems to contain the output of whatever tests were run to determine whether or not BLCR support could be compiled, but I don't see any way to get details on what code and compile invocation actually failed, in order to get to the root of the problem. I'm not a configure or m4 expert, so I'm not sure how to go further in troubleshooting this.
>
> Help would be much appreciated.
>
> Thanks!
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
Thanks for the program. I created a ticket for this performance bug and attached the tarball to the ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2743
I do not know exactly when I will be able to get back to this, but hopefully soon. I added you to the CC, so you should receive any progress updates regarding the ticket as we move forward.

Thanks again,
Josh

On Mar 3, 2011, at 2:12 AM, Nguyen Toan wrote:

> Dear Josh,
>
> Attached with this email is a small program that illustrates the performance problem. You can find simple instructions in the README file. There are also 2 sample result files (cpu.256^3.8N.*) which show the execution time difference between the 2 cases.
> Hope you can take some time to find the problem. Thanks for your kindness.
>
> Best Regards,
> Nguyen Toan

[Earlier quoted history omitted here; the full messages appear below in this thread.]
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
I have not had the time to look into the performance problem yet, and probably won't for a little while. Can you send me a small program that illustrates the performance problem, and I'll file a bug so we don't lose track of it.

Thanks,
Josh

On Feb 25, 2011, at 1:31 PM, Nguyen Toan wrote:

> Dear Josh,
>
> Did you find out the problem? I still cannot make any progress. Hope to hear some good news from you.
>
> Regards,
> Nguyen Toan
>
> On Sun, Feb 13, 2011 at 3:04 PM, Nguyen Toan <nguyentoan1...@gmail.com> wrote:
>
> > Hi Josh,
> >
> > I tried the MCA parameter you mentioned but it did not help; the unknown overhead still exists. Here I attach the output of 'ompi_info' for both version 1.5 and 1.5.1. Hope you can find out the problem. Thank you.
> >
> > Regards,
> > Nguyen Toan

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI users] --without-tm [SEC=UNCLASSIFIED]
There is no restriction on using the C/R functionality of Open MPI in a TM environment (that I am aware of), if you use the ompi-checkpoint/ompi-restart commands directly. If you want TM to checkpoint/restart Open MPI processes for you as part of its resource-management role, then there is a bit of a workaround that you have to go through. The 'cr_mpirun' wrapper (mentioned in the email) that BLCR is/will be providing does the necessary things to make the two work together. The BLCR folks would be the best to contact if there are compatibility issues when using that script, since they maintain it.

-- Josh

On Feb 21, 2011, at 9:57 AM, Jeff Squyres wrote:

> On Feb 21, 2011, at 12:50 AM, DOHERTY, Greg wrote:
>
> > blcr needs cr_mpirun to start the job without torque support to be able to checkpoint the mpi job correctly.
>
> Josh --
>
> Do we have a restriction on BLCR support when used with TM?
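Using the ompi-checkpoint/ompi-restart commands directly, as described above, looks roughly like the following. This is a sketch: the application name and PID are placeholders, and the exact global-snapshot handle name printed by ompi-checkpoint may differ between Open MPI versions:

```shell
# Start the job with the C/R aggregate MCA profile enabled
mpirun -np 4 -am ft-enable-cr ./my_app &

# From another shell, checkpoint the whole job by pointing
# ompi-checkpoint at the PID of the mpirun process
ompi-checkpoint <PID_of_mpirun>

# Later, restart the job from the global snapshot handle that
# ompi-checkpoint printed (name shown here is illustrative)
ompi-restart ompi_global_snapshot_<PID>.ckpt
```
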
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
It looks like the logic in the configure script is turning on the FT thread for you when you specify both '--with-ft=cr' and '--enable-mpi-threads'.

Can you send me the output of 'ompi_info'? Can you also try the MCA parameter that I mentioned earlier to see if that changes the performance?

If there are many non-blocking sends and receives, there might be a performance bug with the way the point-to-point wrapper is tracking request objects. If the above MCA parameter does not help the situation, let me know and I might be able to take a look at this next week.

Thanks,
Josh

On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:

> Hi Josh,
> Thanks for the reply. I did not use the '--enable-ft-thread' option. Here are my build options:
>
> CFLAGS=-g \
> ./configure \
> --with-ft=cr \
> --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr \
> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --with-openib \
> --enable-mpirun-prefix-by-default
>
> My application requires lots of communication in every loop, focusing on MPI_Isend, MPI_Irecv and MPI_Wait. Also, I want to make only one checkpoint per application execution for my purpose, but the unknown overhead exists even when no checkpoint was taken.
>
> Do you have any other idea?
>
> Regards,
> Nguyen Toan
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
There are a few reasons why this might be occurring. Did you build with the '--enable-ft-thread' option?

If so, it looks like I didn't move over the thread_sleep_wait adjustment from the trunk - the thread was being a bit too aggressive. Try adding the following to your command line options, and see if it changes the performance:
"-mca opal_cr_thread_sleep_wait 1000"

There are other places to look as well, depending on how frequently your application communicates, how often you checkpoint, process layout, ... But usually the aggressive nature of the thread is the main problem.

Let me know if that helps.

-- Josh

On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:

> Hi all,
>
> I am using the latest version of Open MPI (1.5.1) and BLCR (0.8.2).
> I found that when running an application which uses MPI_Isend, MPI_Irecv and MPI_Wait with C/R enabled, i.e. using "-am ft-enable-cr", the application runtime is much longer than the normal execution with mpirun (no checkpoint was taken).
> This overhead becomes larger when the normal execution runtime is longer.
> Does anybody have any idea about this overhead, and how to eliminate it? Thanks.
>
> Regards,
> Nguyen
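Putting the suggestion above together, the test run would look something like this. The process count and application name are placeholders, and the right sleep interval is workload-dependent; 1000 is just the value suggested in the thread:

```shell
# Run with C/R enabled, but tell the FT agent thread to sleep
# longer between polls so it competes less with the application
mpirun -np 4 -am ft-enable-cr \
    -mca opal_cr_thread_sleep_wait 1000 \
    ./my_app
```
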
Re: [OMPI users] allow job to survive process death
On Jan 27, 2011, at 9:47 AM, Reuti wrote:

> On 27.01.2011, at 15:23, Joshua Hursey wrote:
>
> > The current version of Open MPI does not support continued operation of an MPI application after process failure within a job. If a process dies, so will the MPI job. Note that this is true of many MPI implementations out there at the moment.
> >
> > At Oak Ridge National Laboratory, we are working on a version of Open MPI that will be able to run through process failure, if the application wishes to do so. The semantics and interfaces needed to support this functionality are being actively developed by the MPI Forum's Fault Tolerance Working Group, and can be found at the wiki page below:
> > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>
> I had a look at this document, but what is really covered - does the application have to react to the notification of a failed rank and act appropriately on its own?

Yes. This is to support application-based fault tolerance (ABFT). Libraries could be developed on top of these semantics to hide some of the fault handling. The purpose is to enable fault-tolerant MPI applications and libraries to be built on top of MPI.

This document only covers run-through stabilization, not process recovery, at the moment. So the application will have well-defined semantics that allow it to continue processing without the failed process. Recovering the failed process is not specified in this document; that is the subject of a supplemental document in preparation - the two proposals are meant to be complementary and build upon one another.

> Having a true ability to survive a dying process (i.e. rank) which might already have been computing for hours would mean having some kind of "rank RAID" or "rank Parchive". E.g., start 12 ranks when you need 10 - whatever 2 ranks fail, your job will be ready in time.

Yes, that is one possible technique. Once a process failure occurs, the application is notified via the existing error-handling mechanisms. The application is then responsible for determining how best to recover from that process failure. This could include using MPI_Comm_spawn to create new processes (useful in manager/worker applications), recovering the state from an in-memory checksum, using spare processes in the communicator, rolling back some/all ranks to an application-level checkpoint, ignoring the failure and allowing the residual error to increase, aborting the job or a single sub-communicator, ... the list goes on. But the purpose of the proposal is to allow an application or library to start building such techniques based on portable semantics and well-defined interfaces.

Does that help clarify? If you would like to discuss the developing proposals further, or have input on how to make them better, I would suggest moving the discussion to the MPI3-ft mailing list so that other groups which do not normally follow the Open MPI lists can participate. The mailing list information is below:
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

-- Josh

> -- Reuti
Re: [OMPI users] allow job to survive process death
The current version of Open MPI does not support continued operation of an MPI application after process failure within a job. If a process dies, so will the MPI job. Note that this is true of many MPI implementations out there at the moment.

At Oak Ridge National Laboratory, we are working on a version of Open MPI that will be able to run through process failure, if the application wishes to do so. The semantics and interfaces needed to support this functionality are being actively developed by the MPI Forum's Fault Tolerance Working Group, and can be found at the wiki page below:
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization

This work is ongoing, but once we have a stable prototype we will assess how to bring it back to the mainline Open MPI trunk. For the moment, there is no public release of this branch, but once there is, we will be sure to announce it on the appropriate Open MPI mailing list for folks to start playing around with it.

-- Josh

On Jan 27, 2011, at 9:11 AM, Kirk Stako wrote:

> Hi,
>
> I was wondering what support Open MPI has for allowing a job to continue running when one or more processes in the job die unexpectedly? Is there a special mpirun flag for this? Any other ways?
>
> It seems obvious that collectives will fail once a process dies, but would it be possible to create a new group (if you knew which ranks are dead) that excludes the dead processes - then turn this group into a working communicator?
>
> Thanks,
> Kirk
[OMPI users] Fwd: BLCR at SC10
For those interested in the developing fault tolerance capabilities of Open MPI (particularly BLCR and CIFTS FTB support), you may find the events hosted by Lawrence Berkeley National Laboratory of interest. Also, the Indiana University booth has a demonstration of "Application-level Fault Tolerance in Open MPI through Preemptive Process Migration and Resiliency" that may be of interest:
http://sc10.supercomputing.iu.edu/demos

-- Josh

Begin forwarded message:

> From: "Paul H. Hargrove"
> Date: November 14, 2010 11:23:01 AM CST
> Subject: BLCR at SC10
>
> Hello BLCR users,
>
> I am writing to let you all know about some BLCR-related events at SC10 in New Orleans this week.
>
> Tues Nov 16:
> 12:15 to 1:15 - CIFTS BoF (CIFTS = Coordinated Infrastructure for Fault Tolerant Systems)
> 3:00 to 4:00 - CIFTS round-table discussion in LBNL booth #2448
>
> Wed Nov 17:
> 3:00 to 4:00 - BLCR round-table discussion in LBNL booth #2448
> 5:30 to 6:30 - brief BLCR talk at TORQUE BoF
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department  Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory  Fax: +1-510-486-6900
Re: [OMPI users] Running on crashing nodes
As one of the Open MPI developers actively working on the MPI layer stabilization/recover feature set, I don't think we can give you a specific timeframe for availability, especially availability in a stable release. Once the initial functionality is finished, we will open it up for user testing by making a public branch available. After addressing the concerns highlighted by public testing, we will attempt to work this feature into the mainline trunk and eventual release. Unfortunately it is difficult to assess the time needed to go through these development stages. What I can tell you is that the work to this point on the MPI layer is looking promising, and that as soon as we feel that the code is ready we will make it available to the public for further testing. -- Josh On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote: > Ralph, could you tell us when this functionality will be available in the > stable version? A rough estimate will be fine. > > > On Fri, Sep 24, 2010 at 01:24, Ralph Castain <r...@open-mpi.org> wrote: > In a word, no. If a node crashes, OMPI will abort the currently-running job > if it had processes on that node. There is no current ability to "ride-thru" > such an event. > > That said, there is work being done to support "ride-thru". Most of that is > in the current developer's code trunk, and more is coming, but I wouldn't > consider it production-quality just yet. > > Specifically, the code that does what you specify below is done and works. It > is recovery of the MPI job itself (collectives, lost messages, etc.) that > remains to be completed. > > > On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <andrei.fo...@neutron.kth.se> > wrote: > Dear users, > > Our cluster has a number of nodes which have high probability to crash, so it > happens quite often that calculations stop due to one node getting down. May > be you know if it is possible to block the crashed nodes during run-time when > running with OpenMPI? 
I am asking whether it is possible in principle to program such > behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious > about is the following: > > 1. A code starts its tasks via mpirun on several nodes > 2. At some moment one node goes down > 3. The code realizes that the node is down (the results are lost) and > excludes it from the list of nodes to run its tasks on > 4. At a later moment the user restarts the crashed node > 5. The code notices that the node is up again, and puts it back to the list > of active nodes > > > Regards, > Andrei > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://www.cs.indiana.edu/~jjhursey
Re: [OMPI users] Question on staging in checkpoint
Adjust the 'filem_rsh_max_incomming' parameter: http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-filem_rsh_max_incomming I defaulted this MCA parameter to 10 since, depending on how big each individual checkpoint is, you will find that sending them all at once is often worse than sending only a window of them at a time. I would recommend trying a few different values for this parameter and seeing the impact it has both on checkpoint overhead (additional application overhead) and checkpoint latency (the time it takes for the checkpoint to completely finish). -- Josh On Sep 13, 2010, at 7:42 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote: > Hi > > I was trying out the staging option in checkpoint where I save the checkpoint > image in the local file system and have the image transferred to the global > filesystem in the background. As part of the background process I see that > the “scp” command is launched to transfer the images from the local file system > to the global file system. I am using openmpi-1.5rc6 with BLCR 0.8.2. > > In my experiment, I had about 128 cores save their respective checkpoint > images on the local file system. During the background process, I see that only > 10 “scp” requests are sent at a time. Is this a configurable parameter? Since > these commands will run on the respective nodes, how can I launch all 128 scp > requests (to take care of all 128 images in my experiment) simultaneously? > > Thanks > Ananda > Please do not print this email unless it is absolutely necessary. > > The information contained in this electronic message and any attachments to > this message are intended for the exclusive use of the addressee(s) and may > contain proprietary, confidential or privileged information. If you are not > the intended recipient, you should not disseminate, distribute or copy this > e-mail. Please notify the sender immediately and destroy all copies of this > message and any attachments.
> > WARNING: Computer viruses can be transmitted via email. The recipient should > check this email and any attachments for the presence of viruses. The company > accepts no liability for any damage caused by any virus transmitted by this > email. > > www.wipro.com > > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://www.cs.indiana.edu/~jjhursey
Re: [OMPI users] High Checkpoint Overhead Ratio
Have you tried testing without using the NFS? So setting the mca-params.conf to something like:

crs_base_snapshot_dir=/tmp/
snapc_base_global_snapshot_dir=/tmp/global
snapc_base_store_in_place=0

This would remove the NFS time from the checkpoint time. However, if you are using staging this may or may not reduce the application overhead significantly. If you want to save to NFS, and it is globally mounted, you could try setting the 'snapc_base_global_shared' parameter (deprecated in the trunk) which tells the system to use standard UNIX copy commands (i.e., cp) instead of the rsh varieties. You might try changing the '--mca filem_rsh_max_incomming' parameter (default 10) to increase or decrease the number of concurrent rcp/scp operations. Something else to try is to look at the SnapC timing to pinpoint where the system is taking the most time:

snapc_full_enable_timing=1

Since you are using the C/R thread, it takes up some CPU cycles that may interfere with application performance. You can adjust the aggressiveness of this thread by adjusting the 'opal_cr_thread_sleep_wait' parameter. In 1.5.0 it defaults to 0 microseconds, but on the trunk this has been adjusted to 1000 microseconds. Try setting the parameter:

opal_cr_thread_sleep_wait=1000

Depending on how much memory is required by CG.C and available on each node, you may be hitting a memory barrier that BLCR is struggling to overcome. What happens if you reduce the number of processes per node? Those are some things to play around with to see what works best for your system and application. For a full list of parameters available in the C/R infrastructure see the link below: http://osl.iu.edu/research/ft/ompi-cr/api.php -- Josh On Aug 30, 2010, at 11:08 PM, 陈文浩 wrote: > Dear OMPI Users, > > I’m now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that it takes a > very long time to checkpoint.
> > BLCR configuration:
> ./configure --prefix=/opt/blcr --enable-static
> Open MPI configuration:
> ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-static --enable-ft-thread --enable-mpi-threads
> > Our blades use NFS. $HOME and /opt are shared. > > In $HOME/.openmpi/mca-params.conf:
> crs_base_snapshot_dir=/tmp/
> snapc_base_global_snapshot_dir=/home/chenwh
> snapc_base_store_in_place=0
> > Now I run CG NPB (NPROCS=16, CLASS=C) on two nodes (blade02, blade04). > With no checkpoint, 'Time in seconds' is about 100s. It's normal. > But when I take a single checkpoint, 'Time in seconds' is up to 300s. The > overhead ratio is over 200%! WHY? How can I improve it? > > blade02:~> ompi-checkpoint --status 27115
> [blade02:27130] [ 0.00 / 0.25] Requested - ...
> [blade02:27130] [ 0.00 / 0.25] Pending - ...
> [blade02:27130] [ 0.21 / 0.46] Running - ...
> [blade02:27130] [221.25 / 221.71] Finished - ompi_global_snapshot_27115.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt
> > As you see, it takes 200+ seconds to checkpoint. btw, what do the former and latter numbers in [ , ] represent? > > Regards > > Whchen > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://www.cs.indiana.edu/~jjhursey
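Pulling Josh's suggestions from this thread together, a local-disk test configuration in $HOME/.openmpi/mca-params.conf might look like the following sketch (parameter names are the ones quoted above; /tmp must be local scratch on every node, and the timing/thread knobs are optional):

```
# keep both local and "global" snapshots off NFS for the timing test
crs_base_snapshot_dir=/tmp/
snapc_base_global_snapshot_dir=/tmp/global
snapc_base_store_in_place=0

# diagnostics and tuning knobs mentioned above
snapc_full_enable_timing=1
opal_cr_thread_sleep_wait=1000
```

Comparing 'Time in seconds' with this configuration against the NFS-backed one isolates how much of the 200% overhead is file-system transfer rather than BLCR itself.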
Re: [OMPI users] OpenMPI with BLCR runtime problem
On Aug 24, 2010, at 10:27 AM, 陈文浩 wrote: > Dear OMPI users, > > I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 - blade10, nfs)
> BLCR configure script: ./configure --prefix=/opt/blcr --enable-static
> After the installation, I can see the 'blcr' module loaded correctly (lsmod | grep blcr). And I can also run 'cr_run', 'cr_checkpoint', 'cr_restart' to C/R the examples correctly under /blcr/examples/.
> Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static
> The installation is okay too. > > Then here comes the problem.
> On one node:
> mpirun -np 2 ./hello_c.c
> mpirun -np 2 -am ft-enable-cr ./hello_c.c
> are both okay.
> On two nodes (blade01, blade02):
> mpirun -np 2 -machinefile mf ./hello_c.c OK.
> mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed below:
> > *** An error occurred in MPI_Init > *** before MPI was initialized > *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) > [blade02:28896] Abort before MPI_INIT completed successfully; not able to > guarantee that all other processes were killed! > -- > It looks like opal_init failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during opal_init; some of which are due to configuration or > environment problems. This failure appears to be an internal failure; > here's some additional information (which may only be relevant to an > Open MPI developer): > opal_cr_init() failed failed > --> Returned value -1 instead of OPAL_SUCCESS > -- > [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file > runtime/orte_init.c at line 77 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or environment > problems.
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > ompi_mpi_init: orte_init failed > --> Returned "Error" (-1) instead of "Success" (0) > -- > > I have no idea about the error. Our blades use NFS, does it matter? Can > anyone help me solve the problem? I really appreciate it. Thank you. > > btw, a similar error: > “Oops, cr_init() failed (the initialization call to the BLCR checkpointing > system). Abort in despair. > The crmpi SSI subsystem failed to initialized modules successfully during > MPI_INIT. This is a fatal error; I must abort.” occurs when I use LAM/MPI + > BLCR. This seems to indicate that BLCR is not working correctly on one of the compute nodes. Did you try some of the BLCR example programs on both of the compute nodes? If BLCR's cr_init() fails, then there is not much the MPI library can do for you. I would check the installation of BLCR on all of the compute nodes (blade01 and blade02). Make sure the modules are loaded and that the BLCR single-process examples work on all nodes. I suspect that one of the nodes is having trouble initializing the BLCR library. You may also want to check to make sure prelinking is turned off on all nodes as well: https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink If that doesn't work then I would suggest trying the current Open MPI trunk. There should not be any problem with using NFS: since this is occurring in MPI_Init, it is well before we ever try to use the file system. I also test with NFS and local staging on a fairly regular basis, so it shouldn't be a problem even when checkpointing/restarting. -- Josh > > Regards > > whchen > > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://www.cs.indiana.edu/~jjhursey
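Josh's checklist (module loaded, tools present, prelink disabled) can be scripted so it is easy to run on every blade over ssh. A rough sketch — the BLCR command names are real, but the script itself is hypothetical:

```shell
#!/bin/sh
# Per-node BLCR sanity check; run as, e.g.:  ssh blade02 sh check_blcr.sh
kmod_status="missing"
lsmod 2>/dev/null | grep -q '^blcr' && kmod_status="loaded"

tools_status="missing"
command -v cr_checkpoint >/dev/null 2>&1 && tools_status="found"

prelink_status="not installed"
command -v prelink >/dev/null 2>&1 && \
    prelink_status="installed (verify it is disabled, see the BLCR FAQ)"

echo "BLCR kernel module:          $kmod_status"
echo "BLCR tools (cr_checkpoint):  $tools_status"
echo "prelink:                     $prelink_status"
```

If any node reports the module or tools missing while the others are fine, that node is the likely source of the cr_init() failure during MPI_Init.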
[OMPI users] Checkpoint/Restart Process Migration and Automatic Recovery Support
I am pleased to announce that Open MPI now supports checkpoint/restart process migration and automatic recovery. This is in addition to our current support for more traditional checkpoint/restart fault tolerance. These new features were introduced in the Open MPI development trunk in commit r23587, and are currently scheduled for the v1.5.1 release of Open MPI. In addition to the two features mentioned above, this commit also includes support for C/R-enabled parallel debugging (documentation for this feature will become available in September). This commit also introduces an API for the C/R functionality allowing applications to request a checkpoint/restart/migration from within the application. We also abstracted the stable storage technique and added support for checkpoint caching and compression. So lots of good stuff to play with. At the bottom of this email is a list of the new features, major/minor changes and bug fixes that are included in this release. The current implementation deprecates some MCA parameters to make way for a more extensible C/R infrastructure, so please check the online documentation for information on how to use this new functionality. Documentation is available at the link below: http://osl.iu.edu/research/ft/ If you have any questions or problems using these new features please send them to the users list. Enjoy :) Josh

Major Changes:
--
* Add two new ErrMgr recovery policies to the 'hnp' ErrMgr component
  * {{{crmig}}} C/R Process Migration
  * {{{autor}}} C/R Automatic Recovery
* Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
  * 'central' component does a direct to central storage save
  * 'stage' component stages checkpoints to central storage while the application continues execution
    * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
    * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support compression of checkpoints
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr recovery policy
* Added CR MPI Ext functions (enable them with the {{{--enable-mpi-ext=cr}}} configure option)
  * {{{OMPI_CR_Checkpoint}}} (Fixes #2342)
  * {{{OMPI_CR_Restart}}}
  * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
  * {{{OMPI_CR_INC_register_callback}}} (Fixes #2192)
  * {{{OMPI_CR_Quiesce_start}}}
  * {{{OMPI_CR_Quiesce_checkpoint}}}
  * {{{OMPI_CR_Quiesce_end}}}
  * {{{OMPI_CR_self_register_checkpoint_callback}}}
  * {{{OMPI_CR_self_register_restart_callback}}}
  * {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
  * FileM rsh (filem_rsh_process_meter)
  * SnapC full (snapc_full_progress_meter)
  * SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
  * --showme : Display the full command line that would have been exec'ed.
  * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes #2413)
* Deprecated some MCA params:
  * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
  * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
  * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
  * snapc_base_store_in_place deprecated, replaced with different components of SStore
  * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
  * snapc_base_establish_global_snapshot_dir deprecated, never well supported
  * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem

Minor Changes:
--
* Fixes #1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes #2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes #2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes #2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework.
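For configuration files written against the old names — like the mca-params.conf files quoted earlier in this digest — the deprecation list above translates directly. For example (paths illustrative):

```
# old: crs_base_snapshot_dir=/tmp/
sstore_stage_local_snapshot_dir=/tmp/

# old: snapc_base_global_snapshot_dir=/home/chenwh
sstore_base_global_snapshot_dir=/home/chenwh

# old: snapc_base_global_shared=1
sstore_stage_global_is_shared=1
```

The old names still work for now (they are deprecated, not removed), but new setups should use the SStore names.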
Re: [OMPI users] Checkpointing mpi4py program
I just fixed the --stop bug that you highlighted in r23627. As for the mpi4py program, I don't really know what to suggest. I don't have a setup to test this locally and am completely unfamiliar with mpi4py. Can you reproduce this with just a C program? -- Josh On Aug 16, 2010, at 12:25 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote: > Josh > > I have one more update on my observation while analyzing this issue. > > Just to refresh, I am using openmpi-trunk revision 23596 with mpi4py-1.2.1 and > BLCR 0.8.2. When I checkpoint the python script written using mpi4py, the > program doesn’t progress after the checkpoint is taken successfully. I tried > it with openmpi 1.4.2 and then tried it with the latest trunk version as > suggested. I see similar behavior in both releases. > > I have one more interesting observation which I thought may be useful. I > tried the “--stop” option of ompi-checkpoint (trunk version) and the mpirun > prints the following error messages when I run the command “ompi-checkpoint > --stop -v ”: > > Error messages in the window where the mpirun command was running START > == > [hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5 > ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"] > [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file > ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253 > [hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5 > ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"] > [hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file > ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253 > [hpdcnln001:15146] Error: ( app) Passed an invalid handle (0) [5 > ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"] > [hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file > ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253 > [hpdcnln001:15147] Error: ( app) Passed an invalid handle (0) [5 >
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"] > [hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file > ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253 > Error messages in the window where mpirun command was running END > == > > Please note that the checkpoint image was created at the end of it. However > when I run the command “kill –CONT ”, it fails to move forward > which is same as the original problem I have reported. > > Let me know if you need any additional information. > > Thanks for your time in advance > > - Ananda > > Ananda B Mudar, PMP > Senior Technical Architect > Wipro Technologies > Ph: 972 765 8093 > ananda.mu...@wipro.com > > From: Ananda Babu Mudar (WT01 - Energy and Utilities) > Sent: Sunday, August 15, 2010 11:25 PM > To: us...@open-mpi.org > Subject: Re: [OMPI users] Checkpointing mpi4py program > Importance: High > > Josh > > I tried running the mpi4py program with the latest trunk version of openmpi. > I have compiled openmpi-1.7a1r23596 from trunk and recompiled mpi4py to use > this library. Unfortunately I see the same behavior as I have seen with > openmpi 1.4.2 ie; checkpoint will be successful but the program doesn’t > proceed after that. > > I have attached the stack traces of all the MPI processes that are part of > the mpirun. I really appreciate if you can take a look at the stack trace and > let m e know the potential problem. I am kind of stuck at this point and need > your assistance to move forward. Please let me know if you need any > additional information. > > Thanks for your time in advance > > Thanks > > Ananda > > -Original Message- > Subject: Re: [OMPI users] Checkpointing mpi4py program > From: Joshua Hursey (jjhursey_at_[hidden]) > Date: 2010-08-13 12:28:31 > > Nope. I probably won't get to it for a while. I'll let you know if I do. > > On Aug 13, 2010, at 12:17 PM, <ananda.mudar_at_[hidden]> > <ananda.mudar_at_[hidden]> wrote: > > > OK, I will do that. 
> > > > But did you try this program on a system where the latest trunk is > > installed? Were you successful in checkpointing? > > > > - Ananda > > -Original Message- > > Message: 9 > > Date: Fri, 13 Aug 2010 10:21:29 -0400 > > From: Joshua Hursey <jjhursey_at_[hidden]> > > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2 > > To: Open MPI Users <users_at_[hidden]> > > Message-ID: <7A43615B-A462-4C
Re: [OMPI users] Checkpointing mpi4py program
Nope. I probably won't get to it for a while. I'll let you know if I do. On Aug 13, 2010, at 12:17 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote: > OK, I will do that. > > But did you try this program on a system where the latest trunk is > installed? Were you successful in checkpointing? > > - Ananda > -Original Message- > Message: 9 > Date: Fri, 13 Aug 2010 10:21:29 -0400 > From: Joshua Hursey <jjhur...@open-mpi.org> > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2 > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <7a43615b-a462-4c72-8112-496653d8f...@open-mpi.org> > Content-Type: text/plain; charset=us-ascii > > I probably won't have an opportunity to work on reproducing this on the > 1.4.2. The trunk has a bunch of bug fixes that probably will not be > backported to the 1.4 series (things have changed too much since that > branch). So I would suggest trying the 1.5 series. > > -- Josh > > On Aug 13, 2010, at 10:12 AM, <ananda.mu...@wipro.com> > <ananda.mu...@wipro.com> wrote: > >> Josh >> >> I am having problems compiling the sources from the latest trunk. It >> complains of libgomp.spec missing even though that file exists on my >> system. I will see if I have to change any other environment variables >> to have a successful compilation. I will keep you posted. >> >> BTW, were you successful in reproducing the problem on a system with >> OpenMPI 1.4.2? >> >> Thanks >> Ananda >> -Original Message- >> Date: Thu, 12 Aug 2010 09:12:26 -0400 >> From: Joshua Hursey <jjhur...@open-mpi.org> >> Subject: Re: [OMPI users] Checkpointing mpi4py program >> To: Open MPI Users <us...@open-mpi.org> >> Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org> >> Content-Type: text/plain; charset=us-ascii >> >> Can you try this with the current trunk (r23587 or later)? >> >> I just added a number of new features and bug fixes, and I would be >> interested to see if it fixes the problem. 
In particular I suspect > that >> this might be related to the Init/Finalize bounding of the checkpoint >> region. >> >> -- Josh >> >> On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com> >> <ananda.mu...@wipro.com> wrote: >> >>> Josh >>> >>> Please find attached is the python program that reproduces the hang >> that >>> I described. Initial part of this file describes the prerequisite >>> modules and the steps to reproduce the problem. Please let me know if >>> you have any questions in reproducing the hang. >>> >>> Please note that, if I add the following lines at the end of the >> program >>> (in case sleep_time is True), the problem disappears ie; program >> resumes >>> successfully after successful completion of checkpoint. >>> # Add following lines at the end for sleep_time is True >>> else: >>> time.sleep(0.1) >>> # End of added lines >>> >>> >>> Thanks a lot for your time in looking into this issue. >>> >>> Regards >>> Ananda >>> >>> Ananda B Mudar, PMP >>> Senior Technical Architect >>> Wipro Technologies >>> Ph: 972 765 8093 >>> ananda.mu...@wipro.com >>> >>> >>> -Original Message- >>> Date: Mon, 9 Aug 2010 16:37:58 -0400 >>> From: Joshua Hursey <jjhur...@open-mpi.org> >>> Subject: Re: [OMPI users] Checkpointing mpi4py program >>> To: Open MPI Users <us...@open-mpi.org> >>> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org> >>> Content-Type: text/plain; charset=windows-1252 >>> >>> I have not tried to checkpoint an mpi4py application, so I cannot say >>> for sure if it works or not. You might be hitting something with the >>> Python runtime interacting in an odd way with either Open MPI or > BLCR. >>> >>> Can you attach a debugger and get a backtrace on a stuck checkpoint? >>> That might show us where things are held up. >>> >>> -- Josh >>> >>> >>> On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com> >>> <ananda.mu...@wipro.com> wrote: >>> >>>> Hi >>>> >>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR >>> 0.8.2. 
When I run ompi-checkpoint on the program written using > mpi4py, >> I >>> see that program doesn't resu
Re: [OMPI users] users Digest, Vol 1658, Issue 2
I probably won't have an opportunity to work on reproducing this on the 1.4.2. The trunk has a bunch of bug fixes that probably will not be backported to the 1.4 series (things have changed too much since that branch). So I would suggest trying the 1.5 series. -- Josh On Aug 13, 2010, at 10:12 AM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote: > Josh > > I am having problems compiling the sources from the latest trunk. It > complains of libgomp.spec missing even though that file exists on my > system. I will see if I have to change any other environment variables > to have a successful compilation. I will keep you posted. > > BTW, were you successful in reproducing the problem on a system with > OpenMPI 1.4.2? > > Thanks > Ananda > -Original Message----- > Date: Thu, 12 Aug 2010 09:12:26 -0400 > From: Joshua Hursey <jjhur...@open-mpi.org> > Subject: Re: [OMPI users] Checkpointing mpi4py program > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org> > Content-Type: text/plain; charset=us-ascii > > Can you try this with the current trunk (r23587 or later)? > > I just added a number of new features and bug fixes, and I would be > interested to see if it fixes the problem. In particular I suspect that > this might be related to the Init/Finalize bounding of the checkpoint > region. > > -- Josh > > On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com> > <ananda.mu...@wipro.com> wrote: > >> Josh >> >> Please find attached is the python program that reproduces the hang > that >> I described. Initial part of this file describes the prerequisite >> modules and the steps to reproduce the problem. Please let me know if >> you have any questions in reproducing the hang. >> >> Please note that, if I add the following lines at the end of the > program >> (in case sleep_time is True), the problem disappears ie; program > resumes >> successfully after successful completion of checkpoint. 
>> # Add following lines at the end for sleep_time is True >> else: >> time.sleep(0.1) >> # End of added lines >> >> >> Thanks a lot for your time in looking into this issue. >> >> Regards >> Ananda >> >> Ananda B Mudar, PMP >> Senior Technical Architect >> Wipro Technologies >> Ph: 972 765 8093 >> ananda.mu...@wipro.com >> >> >> -Original Message- >> Date: Mon, 9 Aug 2010 16:37:58 -0400 >> From: Joshua Hursey <jjhur...@open-mpi.org> >> Subject: Re: [OMPI users] Checkpointing mpi4py program >> To: Open MPI Users <us...@open-mpi.org> >> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org> >> Content-Type: text/plain; charset=windows-1252 >> >> I have not tried to checkpoint an mpi4py application, so I cannot say >> for sure if it works or not. You might be hitting something with the >> Python runtime interacting in an odd way with either Open MPI or BLCR. >> >> Can you attach a debugger and get a backtrace on a stuck checkpoint? >> That might show us where things are held up. >> >> -- Josh >> >> >> On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com> >> <ananda.mu...@wipro.com> wrote: >> >>> Hi >>> >>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR >> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, > I >> see that program doesn't resume sometimes after successful checkpoint >> creation. This doesn't occur always, meaning the program resumes after >> successful checkpoint creation most of the time and completes >> successfully. Has anyone tested the checkpoint/restart functionality >> with mpi4py programs? Are there any best practices that I should keep > in >> mind while checkpointing mpi4py programs? >>> >>> Thanks for your time >>> - Ananda >>> Please do not print this email unless it is absolutely necessary.
>>> >>> The information contained in this electronic message and any >> attachments to this message are intended for the exclusive use of the >> addressee(s) and may contain proprietary, confidential or privileged >> information. If you are not the intended recipient, you should not >> disseminate, distribute or copy this e-mail. Please notify the sender >> immediately and destroy all copies of this message and any > attachments. >>> >>> WARNING: Computer viruses can be transmitted via email. The recipient >> sh
Re: [OMPI users] Checkpointing mpi4py program
Can you try this with the current trunk (r23587 or later)? I just added a number of new features and bug fixes, and I would be interested to see if it fixes the problem. In particular I suspect that this might be related to the Init/Finalize bounding of the checkpoint region. -- Josh On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote: > Josh > > Please find attached is the python program that reproduces the hang that > I described. Initial part of this file describes the prerequisite > modules and the steps to reproduce the problem. Please let me know if > you have any questions in reproducing the hang. > > Please note that, if I add the following lines at the end of the program > (in case sleep_time is True), the problem disappears ie; program resumes > successfully after successful completion of checkpoint. > # Add following lines at the end for sleep_time is True > else: > time.sleep(0.1) > # End of added lines > > > Thanks a lot for your time in looking into this issue. > > Regards > Ananda > > Ananda B Mudar, PMP > Senior Technical Architect > Wipro Technologies > Ph: 972 765 8093 > ananda.mu...@wipro.com > > > -Original Message- > Date: Mon, 9 Aug 2010 16:37:58 -0400 > From: Joshua Hursey <jjhur...@open-mpi.org> > Subject: Re: [OMPI users] Checkpointing mpi4py program > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org> > Content-Type: text/plain; charset=windows-1252 > > I have not tried to checkpoint an mpi4py application, so I cannot say > for sure if it works or not. You might be hitting something with the > Python runtime interacting in an odd way with either Open MPI or BLCR. > > Can you attach a debugger and get a backtrace on a stuck checkpoint? > That might show us where things are held up. 
> > -- Josh > > > On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com> > <ananda.mu...@wipro.com> wrote: > >> Hi >> >> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR > 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I > see that program doesn't resume sometimes after successful checkpoint > creation. This doesn't occur always, meaning the program resumes after > successful checkpoint creation most of the time and completes > successfully. Has anyone tested the checkpoint/restart functionality > with mpi4py programs? Are there any best practices that I should keep in > mind while checkpointing mpi4py programs? >> >> Thanks for your time >> - Ananda >> Please do not print this email unless it is absolutely necessary. >> >> The information contained in this electronic message and any > attachments to this message are intended for the exclusive use of the > addressee(s) and may contain proprietary, confidential or privileged > information. If you are not the intended recipient, you should not > disseminate, distribute or copy this e-mail. Please notify the sender > immediately and destroy all copies of this message and any attachments. >> >> WARNING: Computer viruses can be transmitted via email. The recipient > should check this email and any attachments for the presence of viruses. > The company accepts no liability for any damage caused by any virus > transmitted by this email. >> >> www.wipro.com >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > -- > > Message: 8 > Date: Mon, 9 Aug 2010 13:50:03 -0700 > From: John Hsu <john...@willowgarage.com> > Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5 > To: Open MPI Users <us...@open-mpi.org> > Message-ID: >
Re: [OMPI users] Checkpointing mpi4py program
I have not tried to checkpoint an mpi4py application, so I cannot say for sure if it works or not. You might be hitting something with the Python runtime interacting in an odd way with either Open MPI or BLCR. Can you attach a debugger and get a backtrace on a stuck checkpoint? That might show us where things are held up. -- Josh On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com> wrote: > Hi > > I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR 0.8.2. > When I run ompi-checkpoint on the program written using mpi4py, I see that > program doesn’t resume sometimes after successful checkpoint creation. This > doesn’t occur always meaning the program resumes after successful checkpoint > creation most of the time and completes successfully. Has anyone tested the > checkpoint/restart functionality with mpi4py programs? Are there any best > practices that I should keep in mind while checkpointing mpi4py programs? > > Thanks for your time > - Ananda > Please do not print this email unless it is absolutely necessary. > > The information contained in this electronic message and any attachments to > this message are intended for the exclusive use of the addressee(s) and may > contain proprietary, confidential or privileged information. If you are not > the intended recipient, you should not disseminate, distribute or copy this > e-mail. Please notify the sender immediately and destroy all copies of this > message and any attachments. > > WARNING: Computer viruses can be transmitted via email. The recipient should > check this email and any attachments for the presence of viruses. The company > accepts no liability for any damage caused by any virus transmitted by this > email. > > www.wipro.com > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault (11)
That is interesting. I cannot think of any reason why this might be causing a problem just in Open MPI. popen() is similar to fork()/system(), so you have to be careful with interconnects that do not play nice with fork(), like openib. But since it looks like you are excluding openib, this should not be the problem. I wonder if this has something to do with the way we use BLCR (maybe we need to pass additional parameters to cr_checkpoint()). When the process fails, are there any messages in the system logs from BLCR indicating an issue that it encountered? It is common for BLCR to post a 'socket open' warning, but that is expected/normal since we leave TCP sockets open in most cases as an optimization. I am wondering if there is a warning about the popen'ed process. Personally, I will not have an opportunity to look into this in more detail until probably mid-April. :/ Let me know what you find, and maybe we can sort out what is happening on the list. -- Josh On Mar 29, 2010, at 2:28 PM, Jean Potsam wrote: > Hi Josh/All, > I just tested a simple C application with BLCR and it worked > fine. > > ## > #include <stdio.h> > #include <string.h> > #include <unistd.h> > > char * getprocessid() > { > FILE * read_fp; > char buffer[BUFSIZ + 1]; > int chars_read; > char * buffer_data="12345"; > memset(buffer, '\0', sizeof(buffer)); > read_fp = popen("uname -a", "r"); > /* > ... > */ > return buffer_data; > } > > int main(int argc, char ** argv) > { > > int rank; > int size; > char * thedata; > int n=0; > thedata=getprocessid(); > printf(" the data is %s", thedata); > > while( n <10) > { > printf("value is %d\n", n); > n++; > sleep(1); > } > printf("bye\n"); > > } > > > jean@sun32:/tmp$ cr_run ./pipetest3 & > [1] 31807 > jean@sun32:~$ the data is 12345value is 0 > value is 1 > value is 2 > ...
> value is 9 > bye > > jean@sun32:/tmp$ cr_checkpoint 31807 > > jean@sun32:/tmp$ cr_restart context.31807 > value is 7 > value is 8 > value is 9 > bye > > ## > > > It looks like it's more to do with Open MPI. Any ideas from your side? > > Thank you. > > Kind regards, > > Jean. > > > > > > --- On Mon, 29/3/10, Josh Hursey wrote: > > From: Josh Hursey > Subject: Re: [OMPI users] Segmentation fault (11) > To: "Open MPI Users" > Date: Monday, 29 March, 2010, 16:08 > > I wonder if this is a bug with BLCR (since the segv stack is in the BLCR > thread). Can you try a non-MPI version of this application that uses > popen(), and see if BLCR properly checkpoints/restarts it? > > If so, we can start to see what Open MPI might be doing to confuse things, > but I suspect that this might be a bug with BLCR. Either way let us know what > you find out. > > Cheers, > Josh > > On Mar 27, 2010, at 6:17 AM, jody wrote: > > > I'm not sure if this is the cause of your problems: > > You define the constant BUFFER_SIZE, but in the code you use a constant > > called BUFSIZ... > > Jody > > > > > > On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam > > wrote: > > Dear All, > > I am having a problem with openmpi. I have installed openmpi > > 1.4 and blcr 0.8.1 > > > > I have written a small MPI application as follows: > > > > ### > > #include <stdio.h> > > #include <string.h> > > #include <unistd.h> > > #include <limits.h> > > #include <mpi.h> > > > > #define BUFFER_SIZE PIPE_BUF > > > > char * getprocessid() > > { > > FILE * read_fp; > > char buffer[BUFSIZ + 1]; > > int chars_read; > > char * buffer_data="12345"; > > memset(buffer, '\0', sizeof(buffer)); > > read_fp = popen("uname -a", "r"); > > /* > > ...
> > */ > > return buffer_data; > > } > > > > int main(int argc, char ** argv) > > { > > MPI_Status status; > > int rank; > > int size; > > char * thedata; > > MPI_Init(&argc, &argv); > > MPI_Comm_size(MPI_COMM_WORLD, &size); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > thedata=getprocessid(); > > printf(" the data is %s", thedata); > > MPI_Finalize(); > > } > > > > > > I get the following result: > > > > ### > > jean@sunn32:~$ mpicc pipetest2.c -o pipetest2 > > jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2 > > [sun32:19211] *** Process received signal *** > > [sun32:19211] Signal: Segmentation fault (11) > > [sun32:19211] Signal code: Address not mapped (1) > > [sun32:19211] Failing at address: 0x4 > > [sun32:19211] [ 0] [0xb7f3c40c] > > [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b] > > [sun32:19211] [ 2]
Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint
On Mar 5, 2010, at 3:15 AM, 马少杰 wrote: > Dear Sir: > - What version of Open MPI are you using? > my version is 1.3.4 > - What configure options are you using? > ./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread > --with-blcr=$dir --with-blcr-libdir=/$dir/lib > --prefix=/public/mpi/openmpi134-gnu-cr --enable-mpirun-prefix-by-default > make > make install > - What MCA parameters are you using? > mpirun -np 8 --am ft-enable-cr -machinefile ma xhpl > vim $HOME/.openmpi/mca-params.conf > # Local snapshot directory (not used in this scenario) > crs_base_snapshot_dir=/home/me/tmp > # Remote snapshot directory (globally mounted file system) > snapc_base_global_snapshot_dir=/home/me/checkpoints > > > - Are you building from a release tarball or a SVN checkout? > building from openmpi-1.3.4.tar.gz > > > Now, I have solved the problem successfully. > I found that with the mpirun command > > mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma > ./xhpl > > the time cost is almost equal to the time cost by the command: mpirun -np 8 > -machinefile ma ./xhpl > > I think it should be a bug. Since you have configured Open MPI to use the C/R thread (--enable-ft-thread) then Open MPI will start the concurrent C/R thread when you ask for C/R to be enabled. By default the thread polls very aggressively (waiting only 0 microseconds, or the same as calling sched_yield() on most systems). By turning it off you eliminate the contention the thread is causing on the system. There are two MCA parameters that control this behavior, links below: http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_wait I agree that the default behavior is probably too aggressive for most applications. However by increasing these values the user is also increasing the amount of time before a checkpoint can begin.
In my setup I usually set: opal_cr_thread_sleep_wait=1000 Which will throttle down the thread when the application is in the MPI library. You might want to play around with these MCA parameters to tune the aggressiveness of the C/R thread to your performance needs. In the meantime I will look into finding better default parameters for these options. Cheers, Josh > > > 2010-03-05 > 马少杰 > From: Joshua Hursey > Sent: 2010-03-05 00:07:19 > To: Open MPI Users > Cc: > Subject: Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint > There is some overhead involved when activating the current C/R functionality > in Open MPI due to the wrapping of the internal point-to-point stack. The > wrapper (CRCP framework) tracks the signature of each message (not the > buffer, so constant time for any size MPI message) so that when we need to > quiesce the network we know of all the outstanding messages that need to be > drained. > > So there is an overhead, but it should not be as significant as you have > mentioned. I looked at some of the performance aspects in the paper at the > link below: > http://www.open-mpi.org/papers/hpdc-2009/ > Though I did not look at HPL explicitly in this paper (just NPB, GROMACS, and > NetPipe), I have in testing and the time difference was definitely not 2x > (cannot recall the exact differences at the moment). > > Can you tell me a bit about your setup: > - What version of Open MPI are you using? > - What configure options are you using? > - What MCA parameters are you using? > - Are you building from a release tarball or a SVN checkout? > > -- Josh > > > On Mar 3, 2010, at 10:07 PM, 马少杰 wrote: > > > > > > > 2010-03-04 > > 马少杰 > > Dear Sir: > > I want to use blcr and openmpi to checkpoint; now I can save a check > > point and restart my work successfully. However I find the option "--am > > ft-enable-cr" causes a large cost.
For example, when I run my HPL job > > without and with the option "--am ft-enable-cr" on 4 hosts (32 processes, IB > > network) respectively, the times are 8m21.180s and 16m37.732s > > respectively. It should be noted that I did not save a checkpoint when > > I ran the job, so the additional cost is caused by "--am ft-enable-cr" > > alone. Why does the option "--am ft-enable-cr" cause so much system > > cost? Is it normal? How can I solve the problem? > > I also tested other MPI applications; the problem still exists. > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
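Josh's throttling advice above can be sketched as a small MCA parameter file. This is illustrative only: the file is written to ./mca-params.conf here so the sketch is self-contained, whereas Open MPI actually reads $HOME/.openmpi/mca-params.conf, and 1000 microseconds is simply the value suggested in the thread.

```shell
# Sketch: slow the C/R polling thread down to one check per 1000 us.
# Written locally for illustration; in practice this line belongs in
# $HOME/.openmpi/mca-params.conf (or can be passed via --mca on mpirun).
cat > ./mca-params.conf <<'EOF'
opal_cr_thread_sleep_wait=1000
EOF
grep opal_cr_thread_sleep_wait ./mca-params.conf
```

The same setting can also be given per run, e.g. `mpirun --mca opal_cr_thread_sleep_wait 1000 ...`.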
Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint
There is some overhead involved when activating the current C/R functionality in Open MPI due to the wrapping of the internal point-to-point stack. The wrapper (CRCP framework) tracks the signature of each message (not the buffer, so constant time for any size MPI message) so that when we need to quiesce the network we know of all the outstanding messages that need to be drained. So there is an overhead, but it should not be as significant as you have mentioned. I looked at some of the performance aspects in the paper at the link below: http://www.open-mpi.org/papers/hpdc-2009/ Though I did not look at HPL explicitly in this paper (just NPB, GROMACS, and NetPipe), I have in testing and the time difference was definitely not 2x (cannot recall the exact differences at the moment). Can you tell me a bit about your setup: - What version of Open MPI are you using? - What configure options are you using? - What MCA parameters are you using? - Are you building from a release tarball or a SVN checkout? -- Josh On Mar 3, 2010, at 10:07 PM, 马少杰 wrote: > > > 2010-03-04 > 马少杰 > Dear Sir: > I want to use blcr and openmpi to checkpoint; now I can save a check > point and restart my work successfully. However I find the option "--am > ft-enable-cr" causes a large cost. For example, when I run my HPL job > without and with the option "--am ft-enable-cr" on 4 hosts (32 processes, IB > network) respectively, the times are 8m21.180s and 16m37.732s > respectively. It should be noted that I did not save a checkpoint when I > ran the job, so the additional cost is caused by "--am ft-enable-cr" > alone. Why does the option "--am ft-enable-cr" cause so much system > cost? Is it normal? How can I solve the problem? > I also tested other MPI applications; the problem still exists. > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] checkpointing multi node and multi process applications
On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote: > On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote: > >> Is there anything I can do to provide more information about this bug? >> E.g. try to compile the code in the SVN trunk? I also have kept the >> snapshots intact, I can tar them up and upload them somewhere in case >> you guys need it. I can also provide the source code to the ring >> program, but it's really the canonical ring MPI example. >> > > I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags). > This time taking the checkpoint didn't generate any error message: > > root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 > -np 2 --host debian1,debian2 ring > Process 1 sending 2761 to 0 Process 1 received 2760 Process 1 sending 2760 to 0 > root@debian1:~# > > But restoring it did: > > root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt > [debian1:23129] Error: Unable to access the path > [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]! > -- > Error: The filename (opal_snapshot_1.ckpt) is invalid because either > you have not provided a filename > or provided an invalid filename. > Please see --help for usage. > > -- > -- > mpirun has exited due to process rank 1 with PID 23129 on > node debian1 exiting improperly. There are two reasons this could occur: > > 1. this process did not call "init" before exiting, but others in > the job did. This can cause a job to hang indefinitely while it waits > for all processes to call "init". By rule, if one process calls "init", > then ALL processes must call "init" prior to termination. > > 2. this process called "init", but exited without calling "finalize". > By rule, all processes that call "init" MUST call "finalize" prior to > exiting or it will be considered an "abnormal termination" > > This may have caused other processes in the application to be > terminated by signals sent by mpirun (as reported here).
> -- > root@debian1:~# > > Indeed, opal_snapshot_1.ckpt does not exist: > > root@debian1:~# find ompi_global_snapshot_23071.ckpt/ > ompi_global_snapshot_23071.ckpt/ > ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data > ompi_global_snapshot_23071.ckpt/restart-appfile > ompi_global_snapshot_23071.ckpt/0 > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073 > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data > root@debian1:~# > > It can be found in debian2: > > root@debian2:~# find ompi_global_snapshot_23071.ckpt/ > ompi_global_snapshot_23071.ckpt/ > ompi_global_snapshot_23071.ckpt/0 > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data > ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501 > root@debian2:~# By default, Open MPI requires a shared file system to save checkpoint files. So by default the local snapshot is moved, since the system assumes that it is writing to the same directory on a shared file system. If you want to use the local disk staging functionality (which is known to be broken in the 1.4 series), check out the example on the webpage below: http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local > > Then I tried supplying a hostfile for ompi-restart and it worked just > fine! I thought the checkpoint included the hosts information? We intentionally do not save the hostfile as part of the checkpoint. Typically folks will want to restart on different nodes than those they checkpointed on (such as in a batch scheduling environment). If we saved the hostfile then it could lead to unexpected user behavior on restart if the machines that they wish to restart on change. If you need to pass a hostfile, then you can pass one to ompi-restart just as you would mpirun. > > So I think it's fixed in 1.5.
Should I try the 1.4 branch in SVN? The file staging functionality is known to be broken in the 1.4 series at this time, per the ticket below: https://svn.open-mpi.org/trac/ompi/ticket/2139 Unfortunately the fix is likely to be both custom for the branch (since we redesigned the functionality for the trunk and v1.5) and fairly involved. I don't have the time at the moment to work on a fix, but hopefully in the coming months I will be able to look into this issue. In the meantime, patches are always welcome :) Hope that helps, Josh > > > Thanks a bunch, > ___ > users mailing list > us...@open-mpi.org >
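Since, as Josh explains, the checkpoint deliberately omits host information, restarting on (possibly different) nodes means passing a hostfile explicitly. A minimal sketch follows; the hostnames are hypothetical, the snapshot name mirrors the one in the thread, and `ompi-restart` is only invoked if it is actually installed:

```shell
# Hypothetical hosts for the restart; they need not match the hosts
# the job was checkpointed on.
cat > ./restart-hosts <<'EOF'
debian1
debian2
EOF
if command -v ompi-restart >/dev/null 2>&1; then
    # Per the thread, a hostfile is passed just as it would be to mpirun.
    ompi-restart --hostfile ./restart-hosts ompi_global_snapshot_23071.ckpt
else
    echo "ompi-restart not installed; hostfile written to ./restart-hosts"
fi
```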
Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
On Mar 3, 2010, at 3:42 PM, Fernando Lemos wrote: > On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey <jjhur...@open-mpi.org> wrote: > >> >> Yes, ompi-restart should be printing a helpful message and exiting normally. >> Thanks for the bug report. I believe that I have seen and fixed this on a >> development branch making its way to the trunk. I'll make sure to move the >> fix to the 1.4 series once it has been applied to the trunk. >> >> I filed a ticket on this if you wanted to track the issue. >> https://svn.open-mpi.org/trac/ompi/ticket/2329 > > Ah, that's great. Just wondering, do you have any idea why blcr-util > is required? That package only contains the cr_* binaries (cr_restart, > cr_checkpoint, cr_run) and some docs (manpages, changelog, etc.). I've > filed a Debian bug (#572229) about making openmpi-checkpoint depend > on blcr-util, but the package maintainer told me he found it unusual > that ompi-restart would depend on the cr_* binaries since libcr > supposedly provides all the functionality ompi-restart needs. > > I'm about to compile OpenMPI in debug mode and take a look at the > backtrace to see if I can understand what's going on. > > Btw, this is the list of files in the blcr-util package: > http://packages.debian.org/sid/amd64/blcr-util/filelist . As you can > see, only cr_* binaries and docs. Open MPI currently calls 'cr_restart' for each process it restarts, exec'ed from the 'opal-restart' binary (LAM/MPI also used cr_restart directly, in case anyone is interested). We use the internal library interface for checkpoint, but not restarting at this time. If I recall correctly, it wasn't until relatively recently that BLCR added the ability to restart a process from a library call. We have not put in the code to use this functionality (though all of the framework interfaces are in place to do so). On my development branch I will add the ability to use the BLCR library interface if available.
That functionality will not likely make it to the v1.4 release series since it is not really a bug fix, but I will plan on including it in the v1.5 and later releases. And just so I don't lose track of it, I created an enhancement ticket for this: https://svn.open-mpi.org/trac/ompi/ticket/2330 Cheers, Josh > >> >> Thanks again, >> Josh > > Thank you! > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote: > On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos wrote: >> Hello, >> >> >> I'm trying to come up with a fault tolerant OpenMPI setup for research >> purposes. I'm doing some tests now, but I'm stuck with a segfault when >> I try to restart my test program from a checkpoint. >> >> My test program is the "ring" program, where messages are sent to the >> next node in the ring N times. It's pretty simple, I can supply the >> source code if needed. I'm running it like this: >> >> # mpirun -np 4 -am ft-enable-cr ring >> ... > Process 1 sending 703 to 2 > Process 3 received 704 > Process 3 sending 704 to 0 > Process 3 received 703 > Process 3 sending 703 to 0 >> -- >> mpirun noticed that process rank 0 with PID 18358 on node debian1 >> exited on signal 0 (Unknown signal 0). >> -- >> 4 total processes killed (some possibly by mpirun during cleanup) >> >> That's the output when I ompi-checkpoint the mpirun PID from another >> terminal. >> >> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the >> checkpoint directory has been created in $HOME. >> >> This is what I get when I try to run ompi-restart >> >> root@debian1:~# ps ax | grep mpirun >> 18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring >> 18378 pts/5 S+ 0:00 grep mpirun >> root@debian1:~# ompi-checkpoint 18357 >> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt >> root@debian1:~# ompi-checkpoint --term 18357 >> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt >> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt >> -- >> Error: Unable to obtain the proper restart command to restart from the >> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>> >> -- >> [debian1:18384] *** Process received signal *** >> [debian1:18384] Signal: Segmentation fault (11) >> [debian1:18384] Signal code: Address not mapped (1) >> [debian1:18384] Failing at address: 0x725f725f >> [debian1:18384] [ 0] [0xb775f40c] >> [debian1:18384] [ 1] >> /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63] >> [debian1:18384] [ 2] >> /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0] >> [debian1:18384] [ 3] >> /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5] >> [debian1:18384] [ 4] opal-restart [0x804908e] >> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) >> [0xb7568b55] >> [debian1:18384] [ 6] opal-restart [0x8048fc1] >> [debian1:18384] *** End of error message *** >> -- >> mpirun noticed that process rank 2 with PID 18384 on node debian1 >> exited on signal 11 (Segmentat >> -- >> >> I used a clean install of Debian Squeeze (testing) to make sure my >> environment was ok. Those are the steps I took: >> >> - Installed Debian Squeeze, only base packages >> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build >> tools, BLCR dev and run-time environment) >> - Compiled openmpi-1.4.1 >> >> Note that I did compile openmpi-1.4.1 because the Debian package >> (openmpi-checkpoint) doesn't seem to be usable at the moment. There >> are no leftovers from any previous install of Debian packages >> supplying OpenMPI because this is a fresh install, no openmpi package >> had been installed before. >> >> I used the following configure options: >> >> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads >> >> I also tried to add the option --with-memory-manager=none because I >> saw an e-mail on the mailing list that described this as a possible >> solution to an (apparently) not related problem, but the problem >> remains the same. >> >> I don't have config.log (I rm'ed the build dir), but if you think it's >> necessary I can recompile OpenMPI and provide it. 
>> >> Some information about the system (VirtualBox virtual machine, single >> processor, btw): >> >> Kernel version 2.6.32-trunk-686 >> >> root@debian1:~# lsmod | grep blcr >> blcr 79084 0 >> blcr_imports 2077 1 blcr >> >> libcr (BLCR) is version 0.8.2-9. >> >> gcc is version 4.4.3. >> >> >> Please let me know of any other information you might need. >> >> >> Thanks in advance, >> > > Hello, > > I figured it out. The problem is that the Debian package blcr-util, > which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.), > wasn't installed. I believe OpenMPI could perhaps show a more > descriptive message instead of segfaulting, though? Also, you might > want to add that information to the FAQ. > > Anyways,
Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes
You can use the 'checkpoint to local disk' example to checkpoint and restart without access to a globally shared storage device. There is an example on the website that does not use a globally mounted file system: http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local What version of Open MPI are you using? This functionality is known to be broken on the v1.3/1.4 branches, per the ticket below: https://svn.open-mpi.org/trac/ompi/ticket/2139 Try the nightly snapshot of the 1.5 branch or the development trunk, and see if this issue still occurs. -- Josh On Feb 8, 2010, at 8:35 AM, Andreea Costea wrote: > I asked this question because checkpointing to NFS is successful, but > checkpointing without a mounted filesystem or a shared storage throws this > warning: > > WARNING: Could not preload specified file: File already exists. > Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0 > Host: X > > Will continue attempting to launch the process. > > > filem:rsh: wait_all(): Wait failed (-1) > [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 > > even if I set the mca-parameters like this: > snapc_base_store_in_place=0 > > crs_base_snapshot_dir=/home/andreea/checkpoints/local > > snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global > and the nodes can connect through ssh without a password. > > Thanks, > Andreea > > On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote: > Hi, > > Let's say I have an MPI application running on several hosts. Is there any > way to checkpoint this application without having a shared storage between > the nodes? > I already took a look at the examples here > http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in > both cases there is a globally mounted file system. > > Thanks, > Andreea > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
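Andreea's settings can be collected into one sketch of the 'checkpoint to local disk' configuration. The directory paths are illustrative, and note the caveat in the thread that this staging path is broken on the v1.3/1.4 branches:

```shell
# Illustrative MCA settings for checkpointing without shared storage:
# each rank writes its snapshot locally, then the system gathers them
# into a global directory on the mpirun node.
mkdir -p /tmp/ckpt-local /tmp/ckpt-global
cat > ./mca-params.conf <<'EOF'
snapc_base_store_in_place=0
crs_base_snapshot_dir=/tmp/ckpt-local
snapc_base_global_snapshot_dir=/tmp/ckpt-global
EOF
cat ./mca-params.conf
```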
Re: [OMPI users] Checkpoint/Restart error
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote: > Hi, > > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have > downloaded today. When I want to checkpoint I am having the following error > message: > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399 > HNP with PID 2337 Not found! This looks like an error coming from the 1.3.3 install. In 1.4.1 there is no error at line 399, in 1.3.3 there is. Check your installation of Open MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected problems. Try a clean installation of 1.4.1 and double check that 1.3.3 is not in your path/lib_path any longer. -- Josh > > I tried the same thing with version 1.3.3 and it works perfectly. > > Any idea why? > > thanks, > Andreea > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] OpenMPI checkpoint/restart
On Jan 14, 2010, at 2:50 AM, Andreea Costea wrote: > Hei there > > I have some questions regarding checkpoint/restart: > > 1. Until recently I thought that ompi-checkpoint and ompi-restart are used to > checkpoint a process inside an MPI application. Now I reread this and I > realized that actually what it does is to checkpoint the mpirun process. Does > this mean that if I run my application with multiple processes and on > multiple nodes in my network the checkpoint file will contain the states of > all the processes of my MPI application? I think you slightly misread the entry. ompi-checkpoint checkpoints the entire MPI application, across node boundaries. It requires that the user pass the PID of mpirun to serve as a reference point for the command. This way a user can run multiple mpiruns from the same machine and only checkpoint a subset of those. > 2. Can I restart the application on a different node? Yes. If you have trouble doing this, then I would suggest following the directions in the BLCR FAQ entry below (it usually addresses 99% of the problems people have doing this): https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink -- Josh > > Thanks a lot, > Andreea > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
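Josh's description corresponds to a session like the following sketch. The commands are printed rather than executed, since they need an FT-enabled Open MPI build; the PID, snapshot name, and program name are illustrative:

```shell
# Print an illustrative whole-job checkpoint/restart session; nothing here
# actually requires Open MPI to be installed.
cat > ./ckpt-session.txt <<'EOF'
$ mpirun -np 4 -am ft-enable-cr ./my_app          # terminal 1: the whole job
$ ompi-checkpoint 18357                           # terminal 2: PID of mpirun
Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
$ ompi-restart ompi_global_snapshot_18357.ckpt    # possibly on another node
EOF
cat ./ckpt-session.txt
```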
Re: [OMPI users] Elementary question on openMPI application location when using PBS submission
The --preload-* options to 'mpirun' currently use the ssh/scp commands (or rsh/rcp via an MCA parameter) to move files from the machine local to the 'mpirun' command to the compute nodes during launch. This assumes that you have Open MPI already installed on all of the machines. It was an option targeted to users that do not wish to have an NFS or similar mount on all machines. Torque/PBS may be faster at this depending on how they organize the staging, but I assume that we are essentially doing the same thing. There was a post on the users list a little while back discussing these options a bit more fully. -- Josh On Dec 1, 2009, at 3:21 PM, Belaid MOA wrote: > I saw those options before but somehow I did not pay attention to them :(. > I was thinking that the copying is done automatically, so I felt the options > were useless but I was wrong. > Thanks a lot Gus; that's exactly what I was looking for. I will try them then. > > Best Regards. > ~Belaid. > > > Date: Tue, 1 Dec 2009 15:14:01 -0500 > > From: g...@ldeo.columbia.edu > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Elementary question on openMPI application > > location when using PBS submission > > > > Hi Belaid Moa > > > > I spoke too fast, and burnt my tongue. > > I should have double checked before speaking out. > > I just looked up "man mpiexec" and found the options below. > > I never used or knew about them, but you may want to try. > > They seem to be similar to the Torque/PBS stage_in feature. > > I would guess they use scp to copy the executable and other > > files to the nodes, but I don't really know which copying > > mechanism is used. > > > > Gus Correa > > - > > Gustavo Correa > > Lamont-Doherty Earth Observatory - Columbia University > > Palisades, NY, 10964-8000 - USA > > - > > > > # > > Excerpt from (OpenMPI 1.3.2) "man mpiexec": > > # > > > > --preload-binary > > Copy the specified executable(s) to remote machines > > prior to > > starting remote processes. 
The executables will be > > copied to > > the Open MPI session directory and will be deleted > > upon completion of the job. > > > > --preload-files > > Preload the comma separated list of files to the > > current > > working directory of the remote machines where > > processes will > > be launched prior to starting those processes. > > > > --preload-files-dest-dir > > The destination directory to be used for > > preload-files, if > > other than the current working directory. By > > default, the > > absolute and relative paths provided by > > --preload-files are > > used. > > > > > > > > > > Gus Correa wrote: > > > Hi Belaid Moa > > > > > > Belaid MOA wrote: > > >> Thank you very very much Gus. Does this mean that OpenMPI does not > > >> copy the executable from the master node to the worker nodes? > > > > > > Not that I know. > > > Making the executable available on the nodes, and any > > > input files the program may need, is the user's responsibility, > > > not of mpiexec. > > > > > > On the other hand, > > > Torque/PBS has a "stage_in/stage_out" feature that is supposed to > > > copy files over to the nodes, if you want to give it a shot. > > > See "man qsub" and look into the (numerous) "-W" option under > > > the "stage[in,out]=file_list" sub-options. > > > This is a relic from the old days where everything had to be on > > > local disks on the nodes, and NFS ran over Ethernet 10/100, > > > but it is still used by people that > > > run MPI programs with heavy I/O, to avoid pounding on NFS or > > > even on parallel file systems. > > > I tried the stage_in/out feature a long time ago, > > > (old PBS before Torque), but it had issues. > > > It probably works now with the newer/better > > > versions of Torque. > > > > > > However, the easy way to get this right is just to use an NFS mounted > > > directory. > > > > > >> If that's the case, I will go ahead and NFS mount my working directory. > > >> > > > > > > This would make your life much easier.
> > > > > > My $0.02. > > > Gus Correa > > > - > > > Gustavo Correa > > > Lamont-Doherty Earth Observatory - Columbia University > > > Palisades, NY, 10964-8000 - USA > > > - > > > > > > > > > > > > > > >> ~Belaid. > > >> > > >> > > >> > Date: Tue, 1 Dec 2009 13:50:57 -0500 > > >> > From: g...@ldeo.columbia.edu > > >> > To: us...@open-mpi.org > > >> > Subject: Re: [OMPI users] Elementary question on openMPI > > >> application location when using PBS submission > > >> > > > >> > Hi Belaid MOA > > >> > > > >> > See this FAQ: > > >> > > > >>
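The --preload-* options quoted from the man page above map onto command lines like the following sketch. The hostfile and file names are illustrative, and the commands are printed rather than run, since they require an Open MPI install on every node:

```shell
# Print illustrative --preload-* invocations (nothing is launched here).
cat > ./preload-session.txt <<'EOF'
# Copy the executable itself to each node before launch:
$ mpirun -np 8 --hostfile my_hosts --preload-binary ./a.out
# Also stage input files into each remote working directory:
$ mpirun -np 8 --hostfile my_hosts --preload-files input.dat,params.conf ./a.out
EOF
cat ./preload-session.txt
```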
Re: [OMPI users] How to build OMPI with Checkpoint/restart.
On Sep 16, 2009, at 8:30 AM, Marcin Stolarek wrote: Hi, It seems I solved my problem. The root of the error was that I hadn't loaded the blcr module, so I couldn't checkpoint even a single-threaded application. I am glad to hear that you have things working now. However I still can't find MCA:blcr in ompi_info --all; it's working. This may have been a red herring, sorry. I think ompi_info will only show the 'none' component due to the way it searches for components in the system. This is a bug in how the CRS selection logic plays with ompi_info. I will take a note/file a bug to look into fixing it. Unfortunately I do not have a workaround other than looking in the install directory for the mca_crs_blcr.so file. -- Josh marcin 2009/9/15 Marcin Stolarek: Hi, I've done everything from the beginning: rm -r $ompi_install make clean make make install In $ompi_install, I've got the files you mentioned: mstol@halo2:/home/guests/mstol/openmpi/lib/openmp# ls mca_crs_bl* mca_crs_blcr.la mca_crs_blcr.so but, when I try: # ompi_info --all | grep "crs:" mstol@halo2:/home/guests/mstol/openmpi/openmpi-1.3.3# ompi_info --all | grep "crs:" MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3) MCA crs: parameter "crs_base_verbose" (current value: "0", data source: default value) MCA crs: parameter "crs" (current value: "none", data source: default value) MCA crs: parameter "crs_none_select_warning" (current value: "0", data source: default value) MCA crs: parameter "crs_none_priority" (current value: "0", data source: default value) I don't have the crs: blcr component. marcin 2009/9/14 Josh Hursey The config.log looked fine, so I think you have fixed the configure problem that you previously posted about. Though the config.log indicates that the BLCR component is scheduled for compile, ompi_info does not indicate that it is available. I suspect that the error below is because the CRS could not find any CRS components to select (though there should have been an error displayed indicating as such).
I would check your Open MPI installation to make sure that it is the one that you configured with. Specifically, I would check that the installation location contains the following files:

    $install_dir/lib/openmpi/mca_crs_blcr.so
    $install_dir/lib/openmpi/mca_crs_blcr.la

If that checks out, then I would remove the old installation directory and try reinstalling fresh. Let me know how it goes.

-- Josh

On Sep 13, 2009, at 5:49 AM, Marcin Stolarek wrote:

I've tried again. Here is what I get when trying to run using 1.4a1r21964:

    (terminus:~) mstol% mpirun --am ft-enable-cr ./a.out
    --------------------------------------------------------------------------
    It looks like opal_init failed for some reason; your parallel process is
    likely to abort. There are many reasons that a parallel process can
    fail during opal_init; some of which are due to configuration or
    environment problems. This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      opal_cr_init() failed failed
      --> Returned value -1 instead of OPAL_SUCCESS
    --------------------------------------------------------------------------
    [terminus:06120] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 79
    --------------------------------------------------------------------------
    It looks like MPI_INIT failed for some reason; your parallel process is
    likely to abort. There are many reasons that a parallel process can
    fail during MPI_INIT; some of which are due to configuration or
    environment problems. This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      ompi_mpi_init: orte_init failed
      --> Returned "Error" (-1) instead of "Success" (0)
    --------------------------------------------------------------------------
    *** An error occurred in MPI_Init
    *** before MPI was initialized
    *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
    [terminus:6120] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process
    that caused that situation.
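A failure in opal_cr_init() like the one above generally means the CRS framework could not open or select a checkpointing component at runtime. A few diagnostic commands, using the crs_base_verbose parameter that appears in the ompi_info output quoted earlier in the thread; the BLCR library path is an example:

```shell
# List what the CRS framework reports.
ompi_info --all | grep "crs"

# Re-run with CRS selection debugging turned up; the output should either
# name the blcr component or show why it could not be opened.
mpirun -am ft-enable-cr -mca crs_base_verbose 10 ./a.out

# A common cause of a component failing to open: the BLCR library (libcr)
# is not on the loader path.
export LD_LIBRARY_PATH=/usr/local/blcr/lib:$LD_LIBRARY_PATH
```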
I've included config.log and the ompi_info --all output in the attachment. LD_LIBRARY_PATH is set correctly. Any idea?

marcin

2009/9/12 Marcin Stolarek:

Hi, I'm trying to compile Open MPI with checkpoint/restart via BLCR. I'm not sure which path I should set as the value of the --with-blcr option. I'm using the 1.3.3 release; which version of BLCR should I use? I've compiled the
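On the --with-blcr question above: the option takes the BLCR installation prefix, i.e. the directory that contains include/libcr.h and lib/libcr.so, not the BLCR source tree. A configure sketch for a 1.3-era checkpoint/restart build; the prefixes are examples, and the exact flag set should be verified against ./configure --help for the release in hand:

```shell
# Build Open MPI 1.3.x with BLCR-backed checkpoint/restart support.
./configure --prefix=$HOME/openmpi \
            --with-ft=cr \
            --enable-ft-thread \
            --enable-mpi-threads \
            --with-blcr=/usr/local/blcr \
            --with-blcr-libdir=/usr/local/blcr/lib
make all install

# Launch with checkpoint/restart enabled, then checkpoint the job by
# handing ompi-checkpoint the PID of mpirun.
mpirun -am ft-enable-cr ./a.out &
ompi-checkpoint "$!"
```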
Re: [MTT users] Perl Wrap Error
That seemed to have done the trick. Thanks, Josh

On Jul 6, 2007, at 3:04 PM, Ethan Mallove wrote:

On Fri, Jul/06/2007 01:22:06PM, Joshua Hursey wrote:

> Anyone seen the following error from MTT before? It looks like it is in the reporter stage.
>
> <->
> shell$ /spin/home/jjhursey/testing/mtt//client/mtt --mpi-install --scratch /spin/home/jjhursey/testing/scratch/20070706 --file /spin/home/jjhursey/testing/etc/jaguar/simple-svn.ini --print-time --verbose --debug 2>&1 1>> /spin/home/jjhursey/testing/scratch/20070706/output.txt
> This shouldn't happen at /usr/lib/perl5/5.8.3/Text/Wrap.pm line 64.
> shell$
> <->

"This shouldn't happen at ..." is the die message? Try this INI [Reporter: TextFile] section:

{{{
[Reporter: text file backup]
module = TextFile
textfile_filename = $phase-$section-$mpi_name-$mpi_version.txt

# User-defined report headers/footers
textfile_summary_header = <

> The return code is: 6400
> I attached the output log in case that helps, and the INI file.
> -- Josh

Josh Hursey jjhur...@open-mpi.org http://www.open-mpi.org/
[MTT users] Perl Wrap Error
Anyone seen the following error from MTT before? It looks like it is in the reporter stage.

<->
shell$ /spin/home/jjhursey/testing/mtt//client/mtt --mpi-install --scratch /spin/home/jjhursey/testing/scratch/20070706 --file /spin/home/jjhursey/testing/etc/jaguar/simple-svn.ini --print-time --verbose --debug 2>&1 1>> /spin/home/jjhursey/testing/scratch/20070706/output.txt
This shouldn't happen at /usr/lib/perl5/5.8.3/Text/Wrap.pm line 64.
shell$
<->

The return code is: 6400

I attached the output log in case that helps, and the INI file.

-- Josh

jjhursey-mtt.tar.bz2
Description: Binary data

Josh Hursey jjhur...@open-mpi.org http://www.open-mpi.org/