Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2010-01-11 Thread Josh Hursey


On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote:


Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, logging into the MPI master node. Then I killed the job with
qdel, and after that I did the ompi-restart.
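
The by-hand sequence was roughly the following (the mpirun PID, job
id and snapshot name obviously differ per job):

   # on the node where mpirun (the HNP) runs, as the job owner:
   ompi-checkpoint -s <PID_OF_MPIRUN>   # writes ompi_global_snapshot_<PID>.ckpt in $HOME
   qdel <JOB_ID>                        # kill the SGE job
   # later, with a machinefile listing the same nodes:
   ompi-restart -machinefile <machinefile> ompi_global_snapshot_<PID>.ckpt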
I will try to integrate it with SGE by creating a ckpt environment
(see the sketch after this list), but I think that it could be a bit
difficult because:
 1 - when I do a checkpoint I can't specify a directory with
a name like checkpoint_jobid
 2 - I can't specify the scratch directory, and I have to
use /tmp instead of SGE's scratch directory.
 3 - I tried to restart the snapshot and it only works if I
use the same machinefile. That is, if the job ran on c3-13 and
c3-14, I have to restart the job using a machinefile with these two
nodes.
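
Something like the following is what I have in mind for the ckpt
environment (only a rough, untested sketch; the script names and
paths are placeholders):

   $ qconf -sckpt ompi_blcr
   ckpt_name          ompi_blcr
   interface          userdefined
   ckpt_command       /path/to/ompi_ckpt.sh $job_id $job_pid
   migr_command       /path/to/ompi_migr.sh $job_id
   restart_command    /path/to/ompi_restart.sh $job_id
   clean_command      /path/to/ompi_clean.sh $job_id
   ckpt_dir           /home/checkpoint
   signal             none
   when               sx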


This is usually caused by prelink'ing interfering with BLCR. See the  
BLCR FAQ for how to disable this option:

  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
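
(On RHEL-style systems that usually amounts to something like the
following; check the FAQ above for the authoritative steps for your
distribution:)

   # disable future prelinking
   sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
   # undo the prelinking already applied to installed binaries/libraries
   /usr/sbin/prelink --undo --all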

Let me know if that fixes this problem.

Josh



[sdiaz@svgd ~]$ ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
[svgd.cesga.es:28836] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
[svgd.cesga.es:28836] Restarting from file (ompi_global_snapshot_12554.ckpt)

[svgd.cesga.es:28836]   Exec in self
 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


To solve problem 1, there is a feature request opened by Josh
(https://svn.open-mpi.org/trac/ompi/ticket/2098).
To solve problem 2, there is a thread discussing it ([OMPI users]
Changing location where checkpoints are saved) and also a bug opened
by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I think
that it could work... we will see.
For problem 3, I didn't have time to look into it yet. But if Josh or
anyone has an idea... please tell us :-)


Reuti, did you test it successfully? How did you solve these problems?

Regards,
Sergio


Reuti wrote:


Hi,

On 14.12.2009, at 17:05, Sergio Díaz wrote:

I got a successful checkpoint with a fresh installation, without
using the trunk. I can't understand why it is working now when before
I couldn't do a successful restart... Maybe there was something wrong
in the Open MPI installation and then the metadata was created in a
wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
 tiempo  120
 Process 1 : compute-3-14.local of 2
 tiempo  120
 Process 0 : compute-3-13.local of 2
...
...

[sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca


In a tight integration with SGE, the daemon should get the argument
--no-daemonize. Are you restarting a job on the command line which
ran before under SGE's supervision?


-- Reuti 



orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz 26274 0.1 0.0 15984  504 ? Sl 15:58 0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00      \_ ./pi3


[sdiaz@compute-3-13 ~]$ ps auxf | grep sdiaz
root  12878 0.0 0.0 90260 3000 pts/0 S  15:55 0:00 |   \_ su - sdiaz
sdiaz 12880 0.0 0.0 53432 1512 pts/0 S  15:55 0:00 |       \_ -bash
sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00 |           \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-15 Thread Sergio Díaz

Hi,

Thanks, Reuti. These links were very useful when I did the
integration of BLCR with SGE. I will review them to check whether
there is more useful information.


Regards,
Sergio

Reuti wrote:

Hi,

no, I never tried Open MPI's checkpointing. But there are two HOWTOs
from which you may get some ideas for integrating it with SGE:


http://gridengine.sunsource.net/howto/checkpointing.html
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf (but Open
MPI's checkpointing seems to be more like Condor's, as you don't have
to deal with any process list on your own, AFAIK)


Included is also an example to integrate SGE with the Condor 
checkpointing library in standalone mode.


The purpose of the checkpointing interface can be to copy the files
from a local (checkpointing) directory on a node to a shared space
like /home/checkpoint (the $SGE_CKPT_DIR [in the examples I even
created a subdirectory with the $JOB_ID therein]). Later on, the
files can be copied to the (maybe different) nodes again (either in a
queue prolog or the job script) when the job restarts.
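
A rough sketch of what the checkpoint/migration command could do
(assuming /home/checkpoint as the shared space and the usual SGE
variables; the script name is just a placeholder):

   #!/bin/sh
   # ompi_migr.sh - invoked by SGE's ckpt interface on checkpoint/migration
   CKPT_SHARE=/home/checkpoint/$JOB_ID
   mkdir -p "$CKPT_SHARE"
   # copy the node-local snapshot to the shared space
   cp -r "$TMPDIR"/ompi_global_snapshot_*.ckpt "$CKPT_SHARE"/

On restart, a queue prolog (or the job script itself) would do the
reverse copy onto the new nodes.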



-- Reuti


On 14.12.2009, at 18:25, Sergio Díaz wrote:


Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, logging into the MPI master node. Then I killed the job with
qdel, and after that I did the ompi-restart.
I will try to integrate it with SGE by creating a ckpt environment,
but I think that it could be a bit difficult because:
 1 - when I do a checkpoint I can't specify a directory with a
name like checkpoint_jobid
 2 - I can't specify the scratch directory, and I have to use
/tmp instead of SGE's scratch directory.
 3 - I tried to restart the snapshot and it only works if I
use the same machinefile. That is, if the job ran on c3-13 and
c3-14, I have to restart the job using a machinefile with these two
nodes.


[sdiaz@svgd ~]$ ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
[svgd.cesga.es:28836] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
[svgd.cesga.es:28836] Restarting from file (ompi_global_snapshot_12554.ckpt)

[svgd.cesga.es:28836]   Exec in self
 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------



To solve problem 1, there is a feature request opened by Josh
(https://svn.open-mpi.org/trac/ompi/ticket/2098).
To solve problem 2, there is a thread discussing it ([OMPI users]
Changing location where checkpoints are saved) and also a bug opened
by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I think
that it could work... we will see.
For problem 3, I didn't have time to look into it yet. But if Josh or
anyone has an idea... please tell us :-)


Reuti, did you test it successfully? How did you solve these problems?

Regards,
Sergio


Reuti wrote:


Hi,

On 14.12.2009, at 17:05, Sergio Díaz wrote:

I got a successful checkpoint with a fresh installation, without
using the trunk. I can't understand why it is working now when before
I couldn't do a successful restart... Maybe there was something wrong
in the Open MPI installation and then the metadata was created in a
wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
 tiempo  120
 Process 1 : compute-3-14.local of 2
 tiempo  120
 Process 0 : compute-3-13.local of 2
...
...

[sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca


In a tight integration with SGE, the daemon should get the argument
--no-daemonize. Are you restarting a job on the command line which
ran before under SGE's supervision?


-- Reuti

orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Sergio Díaz

Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, logging into the MPI master node. Then I killed the job with
qdel, and after that I did the ompi-restart.
I will try to integrate it with SGE by creating a ckpt environment,
but I think that it could be a bit difficult because:
1 - when I do a checkpoint I can't specify a directory with a
name like checkpoint_jobid
2 - I can't specify the scratch directory, and I have to use
/tmp instead of SGE's scratch directory.
3 - I tried to restart the snapshot and it only works if I use
the same machinefile. That is, if the job ran on c3-13 and c3-14, I
have to restart the job using a machinefile with these two nodes.


   [sdiaz@svgd ~]$ ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
   [svgd.cesga.es:28836] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
   [svgd.cesga.es:28836] Restarting from file (ompi_global_snapshot_12554.ckpt)

   [svgd.cesga.es:28836]   Exec in self
 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2

--------------------------------------------------------------------------
   mpirun noticed that process rank 1 with PID 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

To solve problem 1, there is a feature request opened by Josh
(https://svn.open-mpi.org/trac/ompi/ticket/2098).
To solve problem 2, there is a thread discussing it ([OMPI users]
Changing location where checkpoints are saved) and also a bug opened
by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I think
that it could work... we will see.
For problem 3, I didn't have time to look into it yet. But if Josh or
anyone has an idea... please tell us :-)


Reuti, did you test it successfully? How did you solve these problems?

Regards,
Sergio


Reuti wrote:

Hi,

On 14.12.2009, at 17:05, Sergio Díaz wrote:

I got a successful checkpoint with a fresh installation, without
using the trunk. I can't understand why it is working now when before
I couldn't do a successful restart... Maybe there was something wrong
in the Open MPI installation and then the metadata was created in a
wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
 tiempo  120
 Process 1 : compute-3-14.local of 2
 tiempo  120
 Process 0 : compute-3-13.local of 2
...
...

[sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca


In a tight integration with SGE, the daemon should get the argument
--no-daemonize. Are you restarting a job on the command line which
ran before under SGE's supervision?


-- Reuti


orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz 26274 0.1 0.0 15984  504 ? Sl 15:58 0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00      \_ ./pi3


[sdiaz@compute-3-13 ~]$ ps auxf | grep sdiaz
root  12878 0.0 0.0 90260 3000 pts/0 S  15:55 0:00 |   \_ su - sdiaz
sdiaz 12880 0.0 0.0 53432 1512 pts/0 S  15:55 0:00 |       \_ -bash
sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00 |           \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
sdiaz 13073 0.0 0.0 15988  508 pts/0 Sl+ 15:58 0:00 |               \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00 |                   \_ ./pi3



Sergio Díaz wrote:


Hi Josh

Here is the file.

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Reuti

Hi,

On 14.12.2009, at 17:05, Sergio Díaz wrote:

I got a successful checkpoint with a fresh installation, without
using the trunk. I can't understand why it is working now when before
I couldn't do a successful restart... Maybe there was something wrong
in the Open MPI installation and then the metadata was created in a
wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
 tiempo  120
 Process 1 : compute-3-14.local of 2
 tiempo  120
 Process 0 : compute-3-13.local of 2
...
...

[sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca


In a tight integration with SGE, the daemon should get the argument
--no-daemonize. Are you restarting a job on the command line which
ran before under SGE's supervision?


-- Reuti


orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz 26274 0.1 0.0 15984  504 ? Sl 15:58 0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00      \_ ./pi3


[sdiaz@compute-3-13 ~]$ ps auxf | grep sdiaz
root  12878 0.0 0.0 90260 3000 pts/0 S  15:55 0:00 |   \_ su - sdiaz
sdiaz 12880 0.0 0.0 53432 1512 pts/0 S  15:55 0:00 |       \_ -bash
sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00 |           \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
sdiaz 13073 0.0 0.0 15988  508 pts/0 Sl+ 15:58 0:00 |               \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00 |                   \_ ./pi3



Sergio Díaz wrote:


Hi Josh

Here is the file.

I will try to apply the trunk, but I think that I broke my Open MPI
installation doing "something" and I don't know what :-(. I was
modifying the MCA parameters...
When I send a job, the orted daemon spawned on the SLAVE host is
launched in a loop until it exhausts all the reserved memory.
It is very strange, so I will compile it again, reproduce the bug,
and then test the trunk.


Thanks a lot for the support and tickets opened.
Sergio


sdiaz 30279 0.0 0.0  1888  560 ? Ds 12:54 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute
sdiaz 30286 0.0 0.0 52772 1188 ? D  12:54 0:00  \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess env -mca orte_ess_jobid 219
sdiaz 30322 0.0 0.0 52772 1188 ? S  12:54 0:00   \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30358 0.0 0.0 52772 1188 ? D  12:54 0:00    \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30394 0.0 0.0 52772 1188 ? D  12:54 0:00     \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30430 0.0 0.0 52772 1188 ? D  12:54 0:00      \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30466 0.0 0.0 52772 1188 ? D  12:54 0:00       \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30502 0.0 0.0 52772 1188 ? D  12:54 0:00        \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30538 0.0 0.0 52772 1188 ? D  12:54 0:00         \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30574 0.0 0.0 52772 1188 ? D  12:54 0:00          \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted





Josh Hursey wrote:



On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs have their temporary files. Setting
TMPDIR to /tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).


It is concerning that ompi-restart is segfault'ing when it errors out.

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-14 Thread Sergio Díaz

Hi Josh,

I got a successful checkpoint with a fresh installation, without
using the trunk. I can't understand why it is working now when before
I couldn't do a successful restart... Maybe there was something wrong
in the Open MPI installation and then the metadata was created in a
wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process 1 : compute-3-14.local of 2
 tiempo  110
 Process 0 : compute-3-13.local of 2
 tiempo  120
 Process 1 : compute-3-14.local of 2
 tiempo  120
 Process 0 : compute-3-13.local of 2
...
...


[sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz 26274 0.1 0.0 15984  504 ? Sl 15:58 0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00      \_ ./pi3

[sdiaz@compute-3-13 ~]$ ps auxf | grep sdiaz
root  12878 0.0 0.0 90260 3000 pts/0 S  15:55 0:00 |   \_ su - sdiaz
sdiaz 12880 0.0 0.0 53432 1512 pts/0 S  15:55 0:00 |       \_ -bash
sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00 |           \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
sdiaz 13073 0.0 0.0 15988  508 pts/0 Sl+ 15:58 0:00 |               \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00 |                   \_ ./pi3



Sergio Díaz wrote:

Hi Josh

Here is the file.

I will try to apply the trunk, but I think that I broke my Open MPI
installation doing "something" and I don't know what :-(. I was
modifying the MCA parameters...
When I send a job, the orted daemon spawned on the SLAVE host is
launched in a loop until it exhausts all the reserved memory.
It is very strange, so I will compile it again, reproduce the bug,
and then test the trunk.


Thanks a lot for the support and tickets opened.
Sergio


sdiaz 30279 0.0 0.0  1888  560 ? Ds 12:54 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute
sdiaz 30286 0.0 0.0 52772 1188 ? D  12:54 0:00  \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess env -mca orte_ess_jobid 219
sdiaz 30322 0.0 0.0 52772 1188 ? S  12:54 0:00   \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30358 0.0 0.0 52772 1188 ? D  12:54 0:00    \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30394 0.0 0.0 52772 1188 ? D  12:54 0:00     \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30430 0.0 0.0 52772 1188 ? D  12:54 0:00      \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30466 0.0 0.0 52772 1188 ? D  12:54 0:00       \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30502 0.0 0.0 52772 1188 ? D  12:54 0:00        \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30538 0.0 0.0 52772 1188 ? D  12:54 0:00         \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30574 0.0 0.0 52772 1188 ? D  12:54 0:00          \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted





Josh Hursey wrote:


On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs have their temporary files. Setting
TMPDIR to /tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).


It is concerning that ompi-restart is segfault'ing when it errors
out. The error message is being generated between the launch of the
opal-restart starter command and when we try to exec(cr_restart).
Usually the failure is related to a corruption of the metadata stored
in the checkpoint.

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-11 Thread Sergio Díaz

Hi Josh

Here is the file.

I will try to apply the trunk, but I think that I broke my Open MPI
installation doing "something" and I don't know what :-(. I was
modifying the MCA parameters...
When I send a job, the orted daemon spawned on the SLAVE host is
launched in a loop until it exhausts all the reserved memory.
It is very strange, so I will compile it again, reproduce the bug,
and then test the trunk.


Thanks a lot for the support and tickets opened.
Sergio


sdiaz 30279 0.0 0.0  1888  560 ? Ds 12:54 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute
sdiaz 30286 0.0 0.0 52772 1188 ? D  12:54 0:00  \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess env -mca orte_ess_jobid 219
sdiaz 30322 0.0 0.0 52772 1188 ? S  12:54 0:00   \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30358 0.0 0.0 52772 1188 ? D  12:54 0:00    \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30394 0.0 0.0 52772 1188 ? D  12:54 0:00     \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30430 0.0 0.0 52772 1188 ? D  12:54 0:00      \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30466 0.0 0.0 52772 1188 ? D  12:54 0:00       \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30502 0.0 0.0 52772 1188 ? D  12:54 0:00        \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30538 0.0 0.0 52772 1188 ? D  12:54 0:00         \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30574 0.0 0.0 52772 1188 ? D  12:54 0:00          \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted





Josh Hursey wrote:


On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs have their temporary files. Setting
TMPDIR to /tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).


It is concerning that ompi-restart is segfault'ing when it errors out. 
The error message is being generated between the launch of the 
opal-restart starter command and when we try to exec(cr_restart). 
Usually the failure is related to a corruption of the metadata stored 
in the checkpoint.


Can you send me the file below:
 ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


I was able to reproduce the segv (at least I think it is the same 
one). We failed to check the validity of a string when we parse the 
metadata. I committed a fix to the trunk in r22290, and requested that 
the fix be moved to the v1.4 and v1.5 branches. If you are interested 
in seeing when they get applied you can follow the following tickets:

  https://svn.open-mpi.org/trac/ompi/ticket/2140
  https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The
development trunk and v1.5 series have a bunch of improvements to the
C/R functionality that were never brought over to the v1.3/v1.4
series.




I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh, and if it finds a problem it will use rsh.
It seems that ssh doesn't work, because it always uses rsh.

If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
use ssh and doesn't work. I got --> "bash: orted: command not found"
and the MPI process dies.
The command it tries to execute is the following, and I haven't found
out yet why it doesn't find orted, because I set /etc/bashrc so that
the right path is always set, and I have the right path in my
application (see ERROR4).


This seems like an SGE specific issue, so a bit out of my domain. 
Maybe others have suggestions here.


-- Josh




Many thanks!,
Sergio

P.S. Sorry about these long emails. I'm just trying to show you
useful information to help identify my problems.



ERROR 1
> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> --------------------------------------------------------------------------

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-09 Thread Josh Hursey


On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs have their temporary files. Setting
TMPDIR to /tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).


It is concerning that ompi-restart is segfault'ing when it errors
out. The error message is being generated between the launch of the
opal-restart starter command and when we try to exec(cr_restart).
Usually the failure is related to a corruption of the metadata stored
in the checkpoint.


Can you send me the file below:
 ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


I was able to reproduce the segv (at least I think it is the same  
one). We failed to check the validity of a string when we parse the  
metadata. I committed a fix to the trunk in r22290, and requested that  
the fix be moved to the v1.4 and v1.5 branches. If you are interested  
in seeing when they get applied you can follow the following tickets:

  https://svn.open-mpi.org/trac/ompi/ticket/2140
  https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The
development trunk and v1.5 series have a bunch of improvements to the
C/R functionality that were never brought over to the v1.3/v1.4
series.




I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh, and if it finds a problem it will use rsh.
It seems that ssh doesn't work, because it always uses rsh.

If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
use ssh and doesn't work. I got --> "bash: orted: command not found"
and the MPI process dies.
The command it tries to execute is the following, and I haven't found
out yet why it doesn't find orted, because I set /etc/bashrc so that
the right path is always set, and I have the right path in my
application (see ERROR4).


This seems like an SGE specific issue, so a bit out of my domain.  
Maybe others have suggestions here.


-- Josh




Many thanks!,
Sergio

P.S. Sorry about these long emails. I'm just trying to show you
useful information to help identify my problems.



ERROR 1
> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> --------------------------------------------------------------------------
> [compute-3-18:28792] *** Process received signal ***
> [compute-3-18:28792] Signal: Segmentation fault (11)
> [compute-3-18:28792] Signal code:  (128)
> [compute-3-18:28792] Failing at address: (nil)
> [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28792] [ 5] opal-restart [0x40312a]
> [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28792] [ 7] opal-restart [0x40272a]
> [compute-3-18:28792] *** End of error message ***
> [compute-3-18:28793] *** Process received signal ***
> [compute-3-18:28793] Signal: Segmentation fault (11)
> [compute-3-18:28793] Signal code:  (128)
> [compute-3-18:28793] Failing at address: (nil)
> [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28793] [ 5] opal-restart [0x40312a]
> [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28793] [ 7] opal-restart [0x40272a]
> [compute-3-18:28793] *** End of error message ***

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-12 Thread Sergio Díaz

Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs have their temporary files. Setting
TMPDIR to /tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).


I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh, and if it finds a problem it will use rsh.
It seems that ssh doesn't work, because it always uses rsh.

If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
use ssh and doesn't work. I got --> "bash: orted: command not found"
and the MPI process dies.
The command it tries to execute is the following, and I haven't found
out yet why it doesn't find orted, because I set /etc/bashrc so that
the right path is always set, and I have the right path in my
application (see ERROR4).


Many thanks!,
Sergio

P.S. Sorry about these long emails. I'm just trying to show you
useful information to help identify my problems.



ERROR 1
> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> --------------------------------------------------------------------------
> [compute-3-18:28792] *** Process received signal ***
> [compute-3-18:28792] Signal: Segmentation fault (11)
> [compute-3-18:28792] Signal code:  (128)
> [compute-3-18:28792] Failing at address: (nil)
> [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28792] [ 5] opal-restart [0x40312a]
> [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28792] [ 7] opal-restart [0x40272a]
> [compute-3-18:28792] *** End of error message ***
> [compute-3-18:28793] *** Process received signal ***
> [compute-3-18:28793] Signal: Segmentation fault (11)
> [compute-3-18:28793] Signal code:  (128)
> [compute-3-18:28793] Failing at address: (nil)
> [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28793] [ 5] opal-restart [0x40312a]
> [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28793] [ 7] opal-restart [0x40272a]
> [compute-3-18:28793] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 28792 on node compute-3-18.local exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------


ERROR 2
> [sdiaz@compute-3-18 ~]$ ompi-restart -v ompi_global_snapshot_28454.ckpt
> [compute-3-18.local:28941] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
> [compute-3-18.local:28941] Restarting from file (ompi_global_snapshot_28454.ckpt)
> [compute-3-18.local:28941]   Exec in self
> ...





ERROR 3
> [sdiaz@compute-3-18 ~]$ ompi_info --all | grep "plm_rsh_agent"
> How many plm_rsh_agent instances to invoke concurrently (must be > 0)
> MCA plm: parameter "plm_rsh_agent" (current value: 

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-11 Thread Josh Hursey

On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:

> Hi Josh,
> 
> The OpenMPI version is 1.3.3.
> 
> The command ompi-ps doesn't work.
> 
> [root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
> [root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting contact info into RML...
> [root@compute-3-18 ~]# ompi-ps -v -j 2726959
> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and setting contact info into RML...
> 
> [root@compute-3-18 ~]# ps uaxf | grep sdiaz
> root  16260 0.0 0.0 51084  680 pts/0 S+ 13:38 0:00 \_ grep sdiaz
> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
> sdiaz 16241 0.0 0.0 41028 2480 ? S  13:37 0:00  \_ mpirun -np 2 -am ft-enable-cr ./pi3
> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37 0:00   \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-17.local orted -mca ess env -mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://192.168.4.143:57010" -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37 0:00    \_ ./pi3
> 
> [root@compute-3-18 ~]# ompi-ps -n c3-18
> [root@compute-3-18 ~]# ompi-ps -n compute-3-18
> [root@compute-3-18 ~]# ompi-ps -n
> 
> There is no such directory in the /tmp of the node. However, if the application
> is run without SGE, the directory is created

This may be the core of the problem. ompi-ps and other command line tools 
(e.g., ompi-checkpoint) look for the Open MPI session directory in /tmp in 
order to find the connection information to connect to the mpirun process 
(internally called the HNP or Head Node Process).

Can you change the location of the temporary directory in SGE? The temporary 
directory is usually set via an environment variable (e.g., TMPDIR, or TMP). So 
removing the environment variable or setting it to /tmp might help.
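
In an SGE job script that would look something like this (note it
gives up SGE's per-job scratch directory, as discussed elsewhere in
this thread; the PE name is site-specific):

   #!/bin/bash
   #$ -pe mpi 2
   # let Open MPI create /tmp/openmpi-sessions-* where the tools expect it
   export TMPDIR=/tmp
   mpirun -np 2 -am ft-enable-cr ./pi3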


> but if I do ompi-ps -j MPIRUN_PID, it seems hung and I interrupted it. Does
> it take a long time?

It should not take a long time. It is just querying the mpirun process for 
state information.

> What does the -j option of the ompi-ps command mean? It isn't related to a
> batch system (like SGE, Condor...), is it?

The '-j' option allows the user to specify the Open MPI jobid. This is 
completely different than the jobid provided by the batch system. In general, 
users should not need to specify the -j option. It is useful when you have 
multiple Open MPI jobs, and want a summary of just one of them.

> 
> Thanks for the ticket. I will follow it.
> 
> Talking with Alan, I realized that only a few transport protocols are
> supported, and maybe that is the problem. Currently, SGE is using qrsh to
> spawn the MPI processes. I can change this protocol and use ssh. So, I'm
> going to test it this afternoon and I will report the results to you.

Try 'ssh' and see if that helps. I suspect the problem is with the session 
directory location though.
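
One way to force ssh is via the plm_rsh_agent parameter (shown in
ERROR3 above); whether it coexists with SGE's tight integration is
exactly the open question here:

   export OMPI_MCA_plm_rsh_agent=ssh
   # or, equivalently, on the command line:
   mpirun -mca plm_rsh_agent ssh -np 2 -am ft-enable-cr ./pi3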

> 
> Regards,
> Sergio
> 
> 
> Josh Hursey wrote:
>> 
>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote: 
>> 
>>> Hello, 
>>> 
>>> I have achieved the checkpoint of an easy program without SGE. Now, I'm
>>> trying to do the openmpi+sge integration, but I have some problems... When I
>>> try to do a checkpoint of the mpirun PID, I get an error similar to the one
>>> I get when the PID doesn't exist. Example below.
>> 
>> I do not have any experience with the SGE environment, so I suspect that
>> there may be something 'special' about the environment that is tripping up
>> the ompi-checkpoint tool.
>> 
>> First of all, what version of Open MPI are you using? 
>> 
>> Somethings to check: 
>>  - Does 'ompi-ps' work when your application is running? 
>>  - Is there a /tmp/openmpi-sessions-* directory on the node where mpirun is
>> currently running? This directory contains information on how to connect to
>> the mpirun process from an external tool; if it's missing then this could be
>> the cause of the problem.
>> 
>>> 
>>> Any ideas? 
>>> Does somebody have a script to do it automatically with SGE? For example I
>>> have one to do a checkpoint every X seconds with BLCR and non-MPI jobs. It is
>>> launched by SGE if you have configured the queue and the ckpt environment.
>> 
>> I do not know of any integration of the Open MPI checkpointing work with SGE 
>> at the moment. 
>> 
>> As far as time triggered checkpointing, I have a feature ticket open about 
>> this: 
>>   https://svn.open-mpi.org/trac/ompi/ticket/1961 
>> 
>> It is not available yet, but in the works. 
>> 
>> 
>>> 
>>> Is it possible to choose the name of the ckpt folder when you do the
>>> ompi-checkpoint? I can't find the option to do it.

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-09 Thread Sergio Díaz

Hi Josh,

The OpenMPI version is 1.3.3.

The command ompi-ps doesn't work.

[root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
[root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
[compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting contact info into RML...

[root@compute-3-18 ~]# ompi-ps -v -j 2726959
[compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and setting contact info into RML...


[root@compute-3-18 ~]# ps uaxf | grep sdiaz
root  16260 0.0 0.0 51084  680 pts/0 S+ 13:38 0:00 \_ grep sdiaz
sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
sdiaz 16241 0.0 0.0 41028 2480 ? S  13:37 0:00  \_ mpirun -np 2 -am ft-enable-cr ./pi3
sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37 0:00   \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-17.local orted -mca ess env -mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://192.168.4.143:57010" -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37 0:00    \_ ./pi3


[root@compute-3-18 ~]# ompi-ps -n c3-18
[root@compute-3-18 ~]# ompi-ps -n compute-3-18
[root@compute-3-18 ~]# ompi-ps -n

There is no such directory in the /tmp of the node. However, if the
application is run without SGE, the directory is created; but if I do
ompi-ps -j MPIRUN_PID, it seems hung and I interrupted it. Does it
take a long time?
What does the -j option of the ompi-ps command mean? It isn't related
to a batch system (like SGE, Condor...), is it?


Thanks for the ticket. I will follow it.

Talking with Alan, I realized that only a few transport protocols
are supported, and maybe that is the problem. Currently, SGE is using
qrsh to spawn the MPI processes. I can change this protocol and use
ssh. So, I'm going to test it this afternoon and I will report the
results to you.


Regards,
Sergio


Josh Hursey wrote:


On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:


Hello,

I have achieved the checkpoint of an easy program without SGE. Now,
I'm trying to do the openmpi+sge integration, but I have some
problems... When I try to do a checkpoint of the mpirun PID, I get an
error similar to the one I get when the PID doesn't exist. Example
below.


I do not have any experience with the SGE environment, so I suspect
that there may be something 'special' about the environment that is
tripping up the ompi-checkpoint tool.


First of all, what version of Open MPI are you using?

Somethings to check:
 - Does 'ompi-ps' work when your application is running?
 - Is there a /tmp/openmpi-sessions-* directory on the node where
mpirun is currently running? This directory contains information on
how to connect to the mpirun process from an external tool; if it's
missing then this could be the cause of the problem.




Any ideas?
Does somebody have a script to do it automatically with SGE? For
example, I have one to do a checkpoint every X seconds with BLCR and
non-MPI jobs. It is launched by SGE if you have configured the queue
and the ckpt environment.


I do not know of any integration of the Open MPI checkpointing work 
with SGE at the moment.


As far as time triggered checkpointing, I have a feature ticket open 
about this:

  https://svn.open-mpi.org/trac/ompi/ticket/1961

It is not available yet, but in the works.




Is it possible to choose the name of the ckpt folder when you do the
ompi-checkpoint? I can't find the option to do it.


Not at this time. Though I could see it being a useful feature, and
it shouldn't be too hard to implement. I filed a ticket if you want
to follow the progress:

  https://svn.open-mpi.org/trac/ompi/ticket/2098

-- Josh




Regards,
Sergio




[sdiaz@compute-3-17 ~]$ ps auxf

root  20044 0.0 0.0  4468 1224 ? S  13:28 0:00 \_ sge_shepherd-2645150 -bg
sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00  \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
sdiaz 20112 0.2 0.0 41028 2480 ? S  13:28 0:00   \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00    \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..
sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00     \_ pi3



[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-06 Thread Josh Hursey


On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:


Hello,

I have achieved the checkpoint of an easy program without SGE. Now,
I'm trying to do the openmpi+sge integration, but I have some
problems... When I try to do a checkpoint of the mpirun PID, I get an
error similar to the one I get when the PID doesn't exist. Example
below.


I do not have any experience with the SGE environment, so I suspect
that there may be something 'special' about the environment that is
tripping up the ompi-checkpoint tool.


First of all, what version of Open MPI are you using?

Somethings to check:
 - Does 'ompi-ps' work when your application is running?
 - Is there a /tmp/openmpi-sessions-* directory on the node where
mpirun is currently running? This directory contains information on
how to connect to the mpirun process from an external tool; if it's
missing then this could be the cause of the problem.




Any ideas?
Does somebody have a script to do it automatically with SGE? For
example, I have one to do a checkpoint every X seconds with BLCR and
non-MPI jobs. It is launched by SGE if you have configured the queue
and the ckpt environment.


I do not know of any integration of the Open MPI checkpointing work  
with SGE at the moment.


As far as time triggered checkpointing, I have a feature ticket open  
about this:

  https://svn.open-mpi.org/trac/ompi/ticket/1961

It is not available yet, but in the works.




Is it possible to choose the name of the ckpt folder when you do the
ompi-checkpoint? I can't find the option to do it.


Not at this time. Though I could see it being a useful feature, and
it shouldn't be too hard to implement. I filed a ticket if you want
to follow the progress:

  https://svn.open-mpi.org/trac/ompi/ticket/2098

-- Josh




Regards,
Sergio




[sdiaz@compute-3-17 ~]$ ps auxf

root  20044 0.0 0.0  4468 1224 ? S  13:28 0:00 \_ sge_shepherd-2645150 -bg
sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00  \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
sdiaz 20112 0.2 0.0 41028 2480 ? S  13:28 0:00   \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00    \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..
sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00     \_ pi3



[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
[compute-3-17.local:20136] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
--------------------------------------------------------------------------
ompi-checkpoint PID_OF_MPIRUN
  Open MPI Checkpoint Tool

   -am <arg0>               Aggregate MCA parameter set file list
   -gmca|--gmca <arg0> <arg1>
                            Pass global MCA parameters that are applicable to
                            all contexts (arg0 is the parameter name; arg1 is
                            the parameter value)
   -h|--help                This help message
   --hnp-jobid <arg0>       This should be the jobid of the HNP whose
                            applications you wish to checkpoint.
   --hnp-pid <arg0>         This should be the pid of the mpirun whose
                            applications you wish to checkpoint.
   -mca|--mca <arg0> <arg1>
                            Pass context-specific MCA parameters; they are
                            considered global if --gmca is not used and only
                            one context is specified (arg0 is the parameter
                            name; arg1 is the parameter value)
   -s|--status              Display status messages describing the progression
                            of the checkpoint
   --term                   Terminate the application after checkpoint
   -v|--verbose             Be Verbose
   -w|--nowait              Do not wait for the application to finish
                            checkpointing before returning

--------------------------------------------------------------------------
[sdiaz@compute-3-17 ~]$ exit
logout
Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf | grep sdiaz

sdiaz 14412 0.0 0.0  1888  560 ? Ss 13:28 0:00  \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
sdiaz 14419 0.0 0.0 35728 2260 ? S  13:28 0:00   \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-02 Thread Andreea m. (Costea)
I am having the same problem when I want to checkpoint manually: "HNP
with PID  Not found!", though I am sure I put in the right PID.

--- On Mon, 11/2/09, Sergio Díaz <sd...@cesga.es> wrote:

From: Sergio Díaz <sd...@cesga.es>
Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
To: "Open MPI Users" <us...@open-mpi.org>
Date: Monday, November 2, 2009, 6:43 PM

Hi again,

I found a C program to test ompi-checkpoint/restart and it works
fine. The program was written by Alan Woodland and shared on the
following distribution list: debian-bugs-d...@lists.debian.org
This program starts a countdown from 10 to 0 and, when the countdown
reaches 6, does a checkpoint, kills the process and restarts it.

However, I still have the problem when I try to do the checkpointing
by hand directly on a node.

Any ideas? :-(

Best regards
Sergio



Sergio Díaz wrote:
> Hello,
> 
> I have achieved the checkpoint of an easy program without SGE. Now, I'm
> trying to do the openmpi+sge integration, but I have some problems... When I
> try to do a checkpoint of the mpirun PID, I get an error similar to the one I
> get when the PID doesn't exist. Example below.
> 
> Any ideas?
> Does somebody have a script to do it automatically with SGE? For example I
> have one to do a checkpoint every X seconds with BLCR and non-MPI jobs. It is
> launched by SGE if you have configured the queue and the ckpt environment.
> 
> Is it possible to choose the name of the ckpt folder when you do the
> ompi-checkpoint? I can't find the option to do it.
> 
> 
> Regards,
> Sergio
> 
> 
> 
> 
> [sdiaz@compute-3-17 ~]$ ps auxf
>
> root  20044 0.0 0.0  4468 1224 ? S  13:28 0:00 \_ sge_shepherd-2645150 -bg
> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00  \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz 20112 0.2 0.0 41028 2480 ? S  13:28 0:00   \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00    \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..
> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00     \_ pi3
> 
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
>   Open MPI Checkpoint Tool
>
>    -am <arg0>               Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                             Pass global MCA parameters that are applicable to
>                             all contexts (arg0 is the parameter name; arg1 is
>                             the parameter value)
>    -h|--help                This help message
>    --hnp-jobid <arg0>       This should be the jobid of the HNP whose
>                             applications you wish to checkpoint.
>    --hnp-pid <arg0>         This should be the pid of the mpirun whose
>                             applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                             Pass context-specific MCA parameters; they are
>                             considered global if --gmca is not used and only
>                             one context is specified (arg0 is the parameter
>                             name; arg1 is the parameter value)
>    -s|--status              Display status messages describing the progression
>                             of the checkpoint
>    --term                   Terminate the application after checkpoint
>    -v|--verbose             Be Verbose
>    -w|--nowait              Do not wait for the application to finish
>                             checkpointing before returning
>
> --------------------------------------------------------------------------
> [sdiaz@compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz@svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf | grep sdiaz
>
> sdiaz 14412 0.0 0.0  1888  560 ? Ss 13:28 0:00  \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz 14419 0.0 0.0 35728 2260 ? S  13:28 0:00   \_ orted -mca ess env -mca orte_e

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-02 Thread Sergio Díaz

Hi again,

I found a C program to test ompi-checkpoint/restart and it works
fine. The program was written by Alan Woodland and shared on the
following distribution list: debian-bugs-d...@lists.debian.org
This program starts a countdown from 10 to 0 and, when the countdown
reaches 6, does a checkpoint, kills the process and restarts it.
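
For reference, a minimal checkpointable countdown in that spirit
might look like the sketch below (this is only a guess at the shape
of Alan's program, not his actual code; the checkpoint itself is
driven externally with ompi-checkpoint):

   #include <stdio.h>
   #include <unistd.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank, size, len, t;
       char host[MPI_MAX_PROCESSOR_NAME];

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Get_processor_name(host, &len);

       /* count down slowly so there is time to run
        * "ompi-checkpoint <pid_of_mpirun>" from another shell */
       for (t = 10; t >= 0; t--) {
           printf("countdown %d: process %d of %d on %s\n",
                  t, rank, size, host);
           fflush(stdout);
           sleep(5);
           MPI_Barrier(MPI_COMM_WORLD);
       }

       MPI_Finalize();
       return 0;
   }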


However, I still have the problem when I try to do the checkpointing
by hand directly on a node.


Any ideas? :-(

Best regards
Sergio



Sergio Díaz wrote:

Hello,

I have achieved the checkpoint of an easy program without SGE. Now,
I'm trying to do the openmpi+sge integration, but I have some
problems... When I try to do a checkpoint of the mpirun PID, I get an
error similar to the one I get when the PID doesn't exist. Example
below.


Any ideas?
Does somebody have a script to do it automatically with SGE? For
example, I have one to do a checkpoint every X seconds with BLCR and
non-MPI jobs. It is launched by SGE if you have configured the queue
and the ckpt environment.


Is it possible to choose the name of the ckpt folder when you do the
ompi-checkpoint? I can't find the option to do it.



Regards,
Sergio




[sdiaz@compute-3-17 ~]$ ps auxf

root  20044 0.0 0.0  4468 1224 ? S  13:28 0:00 \_ sge_shepherd-2645150 -bg
sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00  \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
sdiaz 20112 0.2 0.0 41028 2480 ? S  13:28 0:00   \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00    \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..
sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00     \_ pi3



[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
[compute-3-17.local:20136] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
--------------------------------------------------------------------------
ompi-checkpoint PID_OF_MPIRUN
  Open MPI Checkpoint Tool

   -am <arg0>               Aggregate MCA parameter set file list
   -gmca|--gmca <arg0> <arg1>
                            Pass global MCA parameters that are applicable to
                            all contexts (arg0 is the parameter name; arg1 is
                            the parameter value)
   -h|--help                This help message
   --hnp-jobid <arg0>       This should be the jobid of the HNP whose
                            applications you wish to checkpoint.
   --hnp-pid <arg0>         This should be the pid of the mpirun whose
                            applications you wish to checkpoint.
   -mca|--mca <arg0> <arg1>
                            Pass context-specific MCA parameters; they are
                            considered global if --gmca is not used and only
                            one context is specified (arg0 is the parameter
                            name; arg1 is the parameter value)
   -s|--status              Display status messages describing the progression
                            of the checkpoint
   --term                   Terminate the application after checkpoint
   -v|--verbose             Be Verbose
   -w|--nowait              Do not wait for the application to finish
                            checkpointing before returning

--------------------------------------------------------------------------
[sdiaz@compute-3-17 ~]$ exit
logout
Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf | grep sdiaz

sdiaz 14412 0.0 0.0  1888  560 ? Ss 13:28 0:00  \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
sdiaz 14419 0.0 0.0 35728 2260 ? S  13:28 0:00   \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00    \_ pi3






--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/



