Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2010-01-11 Thread Josh Hursey
On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote: Hi Reuti, Yes, I submitted a job with SGE and checkpointed the mpirun process by hand, after logging into the MPI master node. Then I killed the job with qdel and afterwards ran ompi-restart. I will try to integrate with SGE by creating a ckpt

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-15 Thread Sergio Díaz
Hi, Thanks Reuti. These links were very useful when I integrated BLCR with SGE. I will review them to check whether there is more useful information. Regards, Sergio Reuti wrote: Hi, no, I never tried Open MPI's checkpointing. But there are two HOWTOs from which you may get

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Sergio Díaz
Hi Reuti, Yes, I submitted a job with SGE and checkpointed the mpirun process by hand, after logging into the MPI master node. Then I killed the job with qdel and afterwards ran ompi-restart. I will try to integrate with SGE by creating a ckpt environment but I think that it could be a bit
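
The manual sequence described in this message (checkpoint the mpirun process, kill the SGE job with qdel, then run ompi-restart) might look roughly like the dry-run sketch below. The PID and job ID are placeholder values, not taken from the thread, and the snapshot handle name assumes the ompi_global_snapshot_<PID>.ckpt pattern that ompi-checkpoint normally reports; check the handle printed on your own system. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the manual checkpoint / qdel / restart sequence.
# Assumptions (not from the thread): MPIRUN_PID and SGE_JOB_ID are
# placeholders, and the snapshot handle follows the
# ompi_global_snapshot_<PID>.ckpt naming pattern.

MPIRUN_PID=${MPIRUN_PID:-12345}   # hypothetical PID of mpirun on the master node
SGE_JOB_ID=${SGE_JOB_ID:-67890}   # hypothetical SGE job ID
DRY_RUN=${DRY_RUN:-1}             # set DRY_RUN=0 to really execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Build the assumed global snapshot handle for a given mpirun PID.
snapshot_handle() {
    echo "ompi_global_snapshot_$1.ckpt"
}

run ompi-checkpoint "$MPIRUN_PID"                      # 1. checkpoint mpirun by hand
run qdel "$SGE_JOB_ID"                                 # 2. kill the job in SGE
run ompi-restart "$(snapshot_handle "$MPIRUN_PID")"    # 3. restart from the snapshot
```

Running it with DRY_RUN=0 on the master node would execute the three commands for real; in that case the job must have been started with checkpoint support enabled in Open MPI.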

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Reuti
Hi, On 14.12.2009 at 17:05, Sergio Díaz wrote: I got a successful checkpoint with a fresh installation and without using the trunk. I can't understand why it is working now and before I could do a successful restart... Maybe there was something wrong in the openmpi installation and then

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Sergio Díaz
Hi Josh, I got a successful checkpoint with a fresh installation and without using the trunk. I can't understand why it is working now and before I could do a successful restart... Maybe there was something wrong in the openmpi installation and then the metadata was created in a wrong way. I

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-11 Thread Sergio Díaz
Hi Josh, Here is the file. I will try to apply the trunk, but I think I broke my openmpi installation by doing "something" and I don't know what :-( . I was modifying the mca parameters... When I submit a job, the orted daemon spawned on the SLAVE host is launched in a loop till they

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-09 Thread Josh Hursey
On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote: Hi Josh, You were right. The main problem was the /tmp. SGE uses a scratch directory in which the jobs have temporary files. Setting TMPDIR to /tmp, checkpoint works! However, when I try to restart it... I got the following error (see

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-12 Thread Sergio Díaz
Hi Josh, You were right. The main problem was /tmp. SGE uses a scratch directory in which the jobs keep their temporary files. With TMPDIR set to /tmp, the checkpoint works! However, when I try to restart it... I get the following error (see ERROR1). The -v option adds these lines (see ERROR2). I was

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-11 Thread Josh Hursey
On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote: > Hi Josh, > > The OpenMPI version is 1.3.3. > > The command ompi-ps doesn't work. > > [root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241 > [root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241 > [compute-3-18.local:16254] orte_ps: Acquiring list

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-09 Thread Sergio Díaz
Hi Josh, The OpenMPI version is 1.3.3. The command ompi-ps doesn't work. [root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241 [root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241 [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting contact info into RML... [root@compute-3-18

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-06 Thread Josh Hursey
On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote: Hello, I have managed to checkpoint a simple program without SGE. Now I'm trying to integrate openmpi+sge, but I have some problems... When I try to checkpoint the mpirun PID, I get an error similar to the error gotten

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-02 Thread Andreea m. (Costea)
I am having the same problem when I want to checkpoint manually: "HNP with PID Not found!", though I am sure I put the right PID --- On Mon, 11/2/09, Sergio Díaz <sd...@cesga.es> wrote: From: Sergio Díaz <sd...@cesga.es> Subject: Re: [OMPI users] checkpoint opempi

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-02 Thread Sergio Díaz
Hi again, I found a C program to test ompi-checkpoint/restart and it works fine. The program was written by Alan Woodland and shared on the following mailing list: debian-bugs-d...@lists.debian.org This program starts a countdown from 10 to 0 and, when the countdown reaches 6, takes a checkpoint,