Re: [OMPI users] Related to project ideas in OpenMPI

2011-08-27 Thread Joshua Hursey
91-8149399160 >>>>>>>> ___ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>

Re: [OMPI users] BLCR support not building on 1.5.3

2011-05-27 Thread Joshua Hursey
urther below: > > > - Original Message - >> From: Joshua Hursey <jjhur...@open-mpi.org> > [...] >> What other configure options are you passing to Open MPI? Specifically the >> configure test will always fail if '--with-ft=cr' is not specified - by >> default

Re: [OMPI users] BLCR support not building on 1.5.3

2011-05-27 Thread Joshua Hursey
What version of BLCR are you using? What other configure options are you passing to Open MPI? Specifically the configure test will always fail if '--with-ft=cr' is not specified - by default Open MPI will only build the BLCR component if C/R FT is requested by the user. Can you send a zip'ed

Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-03-03 Thread Joshua Hursey
here are also 2 sample result files (cpu.256^3.8N.*) which show the > execution time difference between 2 cases. > Hope you can take some time to find the problem. > Thanks for your kindness. > > Best Regards, > Nguyen Toan > > On Wed, Mar 2, 2011 at 3:00 AM, Joshua Hu

Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-03-01 Thread Joshua Hursey
the MCA parameter you mentioned but it did not help, the unknown > overhead still exists. > Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1. > Hope you can find out the problem. > Thank you. > > Regards, > Nguyen Toan > > On Wed, Feb 9, 2011 at 1

Re: [OMPI users] --without-tm [SEC=UNCLASSIFIED]

2011-02-21 Thread Joshua Hursey
st > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://users.nccs.gov/~jjhursey

Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-09 Thread Joshua Hursey
e checkpoint > per application execution for my purpose, but the unknown overhead exists > even when no checkpoint was taken. > > Do you have any other idea? > > Regards, > Nguyen Toan > > > On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey <jjhur...@open-mpi

Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-08 Thread Joshua Hursey
eliminate it? > Thanks. > > Regards, > Nguyen > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://users.nccs.gov/~jjhursey

Re: [OMPI users] allow job to survive process death

2011-01-27 Thread Joshua Hursey
On Jan 27, 2011, at 9:47 AM, Reuti wrote: > Am 27.01.2011 um 15:23 schrieb Joshua Hursey: > >> The current version of Open MPI does not support continued operation of an >> MPI application after process failure within a job. If a process dies, so >> will the MPI job

Re: [OMPI users] allow job to survive process death

2011-01-27 Thread Joshua Hursey
this group into > a working communicator? > > Thanks, > Kirk > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > Joshua Hursey Postdoctoral

[OMPI users] Fwd: BLCR at SC10

2010-11-14 Thread Joshua Hursey
roup > HPC Research Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://users.nccs.gov/~jjhursey

Re: [OMPI users] Running on crashing nodes

2010-09-24 Thread Joshua Hursey
Regards, > Andrei > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > _______ > users mailing list > us...@open-

Re: [OMPI users] Question on staging in checkpoint

2010-09-13 Thread Joshua Hursey

Re: [OMPI users] High Checkpoint Overhead Ratio

2010-08-31 Thread Joshua Hursey
21.71] Finished - > ompi_global_snapshot_27115.ckpt > Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt > > As you see, it takes 200+ seconds to checkpoint. btw, what do the former and > latter numbers represent in [ , ]? > > Regards > > Whchen >

Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread Joshua Hursey
AQ.html#prelink If that doesn't work then I would suggest trying the current Open MPI trunk. There should not be any problem with using NFS: since this is occurring in MPI_Init, it is well before we ever try to use the file system. I also test with NFS, and local staging on a fairly regular b

[OMPI users] Checkpoint/Restart Process Migration and Automatic Recovery Support

2010-08-19 Thread Joshua Hursey
I am pleased to announce that Open MPI now supports checkpoint/restart process migration and automatic recovery. This is in addition to our current support for more traditional checkpoint/restart fault tolerance. These new features were introduced in the Open MPI development trunk in commit

Re: [OMPI users] Checkpointing mpi4py program

2010-08-18 Thread Joshua Hursey
’t > proceed after that. > > I have attached the stack traces of all the MPI processes that are part of > the mpirun. I would really appreciate it if you could take a look at the stack trace and > let me know the potential problem. I am kind of stuck at this point and need > your

Re: [OMPI users] Checkpointing mpi4py program

2010-08-13 Thread Joshua Hursey
led? Were you successful in checkpointing? > > - Ananda > -Original Message- > Message: 9 > Date: Fri, 13 Aug 2010 10:21:29 -0400 > From: Joshua Hursey <jjhur...@open-mpi.org> > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2 > To: Open MPI Users <u

Re: [OMPI users] users Digest, Vol 1658, Issue 2

2010-08-13 Thread Joshua Hursey
ronment variables > to have a successful compilation. I will keep you posted. > > BTW, were you successful in reproducing the problem on a system with > OpenMPI 1.4.2? > > Thanks > Ananda > -Original Message----- > Date: Thu, 12 Aug 2010 09:12:26 -0400 > From: J

Re: [OMPI users] Checkpointing mpi4py program

2010-08-12 Thread Joshua Hursey
hnical Architect > Wipro Technologies > Ph: 972 765 8093 > ananda.mu...@wipro.com > > > -Original Message- > Date: Mon, 9 Aug 2010 16:37:58 -0400 > From: Joshua Hursey <jjhur...@open-mpi.org> > Subject: Re: [OMPI users] Checkpointing mpi4py program > To: Open MPI Us

Re: [OMPI users] Checkpointing mpi4py program

2010-08-09 Thread Joshua Hursey
I have not tried to checkpoint an mpi4py application, so I cannot say for sure if it works or not. You might be hitting something with the Python runtime interacting in an odd way with either Open MPI or BLCR. Can you attach a debugger and get a backtrace on a stuck checkpoint? That might show

Re: [OMPI users] Segmentation fault (11)

2010-03-31 Thread Joshua Hursey
That is interesting. I cannot think of any reason why this might be causing a problem just in Open MPI. popen() is similar to fork()/system() so you have to be careful with interconnects that do not play nice with fork(), like openib. But since it looks like you are excluding openib, this

Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint

2010-03-05 Thread Joshua Hursey
cr_thread_sleep_wait=1000 Which will throttle down the thread when the application is in the MPI library. You might want to play around with these MCA parameters to tune the aggressiveness of the C/R thread to your performance needs. In the meantime I will look into finding better default para
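To make the throttling trade-off in the message above concrete, here is a small Python sketch of a polling loop with a configurable sleep interval. It is only an illustration and not Open MPI code: `run_poller`, `sleep_usec`, and `budget_usec` are hypothetical names, and treating the value 1000 as microseconds is an assumption.

```python
import time


def run_poller(check, sleep_usec, budget_usec, sleep=time.sleep):
    """Poll `check` until it returns True or the time budget is used up.

    Hypothetical model of a background thread that wakes up, looks for
    a pending request, and goes back to sleep. Returns the number of
    wakeups that happened.
    """
    if sleep_usec <= 0:
        raise ValueError("sleep_usec must be positive")
    elapsed_usec = 0
    wakeups = 0
    while elapsed_usec < budget_usec:
        wakeups += 1
        if check():
            break
        sleep(sleep_usec / 1_000_000)
        elapsed_usec += sleep_usec  # modeled time, not wall-clock time
    return wakeups


def no_request():
    # No checkpoint request ever arrives in this comparison.
    return False


def no_sleep(_seconds):
    # Stand-in for time.sleep so the comparison runs instantly.
    return None


# Same modeled time window, two different sleep intervals (assumed usec).
aggressive = run_poller(no_request, sleep_usec=10, budget_usec=10_000, sleep=no_sleep)
throttled = run_poller(no_request, sleep_usec=1000, budget_usec=10_000, sleep=no_sleep)
```

In this toy model a longer sleep means far fewer wakeups competing with the application for CPU time, at the cost of a slower reaction when a request does arrive.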

Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint

2010-03-04 Thread Joshua Hursey
There is some overhead involved when activating the current C/R functionality in Open MPI due to the wrapping of the internal point-to-point stack. The wrapper (CRCP framework) tracks the signature of each message (not the buffer, so constant time for any size MPI message) so that when we need
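The idea of tracking a message's signature rather than its buffer can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the CRCP framework itself: `MessageSignature`, `SignatureTracker`, and the chosen fields are hypothetical. What the recorded signatures are used for is cut off in the snippet above, so the `unmatched` helper shows just one possible use and is purely an assumption.

```python
from collections import Counter
from typing import NamedTuple


class MessageSignature(NamedTuple):
    peer: int     # rank of the other endpoint
    tag: int
    comm_id: int  # identifier of the communicator (hypothetical field)
    nbytes: int   # size only; the payload itself is never stored


class SignatureTracker:
    """Hypothetical wrapper state kept around a point-to-point layer."""

    def __init__(self):
        self.sent = Counter()
        self.received = Counter()
        self.last_signature = None

    def on_send(self, peer, tag, comm_id, payload):
        # len() on bytes is O(1), so the bookkeeping cost does not
        # grow with the size of the message.
        self.last_signature = MessageSignature(peer, tag, comm_id, len(payload))
        self.sent[peer] += 1

    def on_recv(self, peer, tag, comm_id, payload):
        self.last_signature = MessageSignature(peer, tag, comm_id, len(payload))
        self.received[peer] += 1

    def unmatched(self, peer, peer_received_count):
        # Assumed use: messages sent to `peer` that the peer has not
        # yet reported receiving.
        return self.sent[peer] - peer_received_count


tracker = SignatureTracker()
tracker.on_send(peer=1, tag=7, comm_id=0, payload=b"x")
tracker.on_send(peer=1, tag=7, comm_id=0, payload=bytes(10**6))
```

Both sends cost the same bookkeeping work even though the second payload is a million times larger, which is the constant-time property the message describes.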

Re: [OMPI users] checkpointing multi node and multi process applications

2010-03-04 Thread Joshua Hursey
On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote: > On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote: > >> Is there anything I can do to provide more information about this bug? >> E.g. try to compile the code in the SVN trunk? I also have kept the >> snapshots

Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-03 Thread Joshua Hursey
On Mar 3, 2010, at 3:42 PM, Fernando Lemos wrote: > On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey <jjhur...@open-mpi.org> wrote: > >> >> Yes, ompi-restart should be printing a helpful message and exiting normally. >> Thanks for the bug report. I beli

Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-03 Thread Joshua Hursey
On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote: > On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos > wrote: >> Hello, >> >> >> I'm trying to come up with a fault tolerant OpenMPI setup for research >> purposes. I'm doing some tests now, but I'm stuck with a segfault

Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Joshua Hursey
You can use the 'checkpoint to local disk' example to checkpoint and restart without access to a globally shared storage device. There is an example on the website that does not use a globally mounted file system: http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local What

Re: [OMPI users] Checkpoint/Restart error

2010-01-14 Thread Joshua Hursey
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote: > Hi, > > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have > downloaded today. When I want to checkpoint I am having the following error > message: > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line

Re: [OMPI users] OpenMPI checkpoint/restart

2010-01-14 Thread Joshua Hursey
On Jan 14, 2010, at 2:50 AM, Andreea Costea wrote: > Hei there > > I have some questions regarding checkpoint/restart: > > 1. Until recently I thought that ompi-checkpoint and ompi-restart are used to > checkpoint a process inside an MPI application. Now I reread this and I > realized that

Re: [OMPI users] Elementary question on openMPI application location when using PBS submission

2009-12-02 Thread Joshua Hursey
The --preload-* options to 'mpirun' currently use the ssh/scp commands (or rsh/rcp via an MCA parameter) to move files from the machine local to the 'mpirun' command to the compute nodes during launch. This assumes that you have Open MPI already installed on all of the machines. It was an

Re: [OMPI users] How to build OMPI with Checkpoint/restart.

2009-09-17 Thread Joshua Hursey
On Sep 16, 2009, at 8:30 AM, Marcin Stolarek wrote: Hi, It seems I solved my problem. The root of the error was that I hadn't loaded the blcr module, so I couldn't checkpoint even a one-thread application. I am glad to hear that you have things working now. However I still can't find MCA:blcr

Re: [MTT users] Perl Wrap Error

2007-07-06 Thread Joshua Hursey
That seemed to have done the trick. Thanks, Josh On Jul 6, 2007, at 3:04 PM, Ethan Mallove wrote: On Fri, Jul/06/2007 01:22:06PM, Joshua Hursey wrote: Anyone seen the following error from MTT before? It looks like it is in the reporter stage. <-> shell$ /spi

[MTT users] Perl Wrap Error

2007-07-06 Thread Joshua Hursey
Anyone seen the following error from MTT before? It looks like it is in the reporter stage. <-> shell$ /spin/home/jjhursey/testing/mtt//client/mtt --mpi-install --scratch /spin/home/jjhursey/testing/scratch/20070706 --file /spin/