Lane,

On 16.03.2011, at 21:34, Lane Schwartz wrote:

> On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz <[email protected]> wrote:
> On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:
> On 16.03.2011, at 19:35, Lane Schwartz wrote:
> 
> > <snip>
> > The job gets queued up and assigned to run, and the stderr and stdout files 
> > are created. When a checkpointable job starts, condor and DMTCP each print 
> > a small log message. That log message shows up in the logs. But no output 
> > from my program appears. SGE lists my job's status as "r" but when I ssh in 
> > to the machine where the job is running and run ps aux, ps lists my job's 
> > status as suspended.
> >
> > When I launch my checkpointable jobs locally (not using qsub) they run and 
> > produce immediate output. When I run those same jobs using qsub, they go 
> > into "r" status, but never produce output and appear to not be actually 
> > running.
> >
> > On a related topic, using 6.2u5p1 I've had mixed results following the 
> > checkpointing interface tutorial at 
> > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The initial 
> > examples describe setting up a transparent interface and running it with 
> > some simple shell scripts; I've been able to get these to work as 
> > described. I've also followed the examples for setting up an application-level 
> > interface with shell scripts; that works, but only the migr_command and 
> > clean_command appear to run. When I run example 6, which uses condor in 
> > conjunction with transparent checkpointing, no condor checkpoint files are 
> > created.
> 
> You set usr2 as the signal to be used and waited at least min_cpu_interval? 
> Still no checkpoint file is created in /home/checkpoint or the like? Can you try 
> sending usr2 by hand to the complete process group on the node?
> I can confirm. I ran the following, and no checkpoint file was created:
>  
> $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh

I wouldn't use "-b y" here - it's a script.
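That is, a submission along these lines (the same options as above, just without binary mode):

```
qsub -ckpt transparent -cwd -V ./condor_transparent6.sh
```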


>  
> $ qstat
> ... lists the above job in state "r", with job-ID 114 ...
>  
> $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> ... lists the processes associated with the job.
> ... The parent process has PID 10240, and is "-csh c 
> ./condor_transparent6.sh" with ps state "Ss"
> ... The second process has PID 10322, and is "/bin/bash 
> ./condor_transparent6.sh" with ps state "S"
> ... The third process has PID, and is running the actual condor-linked binary 
> with ps state "S"
> ... These three jobs have group PGID 10240.

$ ps -e f

(f without the leading -) will show a nice overview as a tree.


>  
> $ kill -s USR2 -- -10240

Do you use my demo-script and trap usr2 as outlined?
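For reference, a minimal sketch of trapping usr2 this way (CKPT_DIR and do_checkpoint are illustrative names, not taken from the Howto's demo-script):

```shell
#!/bin/sh
# Hypothetical sketch of trapping usr2 for checkpointing.
# CKPT_DIR and do_checkpoint are illustrative names, not from the Howto.
CKPT_DIR=${CKPT_DIR:-/tmp}

do_checkpoint() {
    # The real script would write the actual checkpoint here.
    touch "$CKPT_DIR/ckpt.$$"
    echo "checkpoint written to $CKPT_DIR/ckpt.$$"
}
trap do_checkpoint USR2

# SGE (or "kill -s USR2" by hand) would deliver the signal from outside;
# here the script signals itself just to demonstrate the trap firing.
kill -USR2 $$
test -f "$CKPT_DIR/ckpt.$$" && echo OK
```

Running it should leave a ckpt.<pid> marker in CKPT_DIR and print OK.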


>  
> $ qstat
> ... My job is no longer listed ...
>  
> $ ls /tmp/114
> ... No files are listed. The directory exists, though, which makes sense 
> since "Checkpoint Directory" is set to /tmp in the checkpointing 
> configuration.
>  
>  
> My checkpoint interface definition is below:
>  
> Name: transparent
> 
> Interface: TRANSPARENT
> Checkpoint command: NONE
> Migrate command: NONE
> Clean command: NONE
> Checkpoint directory: /tmp

It should be a location shared between all nodes; otherwise you can't restart 
on another node.


> Checkpoint When: xsr

Without "m" here, a checkpoint will never be created, as explained in the 
Howto and the state diagram I mentioned before.
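Per the Howto, the definition would then carry "m" in the when-field and usr2 as the signal, roughly (only the changed fields shown):

```
Checkpoint When: xmsr
Checkpoint Signal: usr2
```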


> Checkpoint Signal: NONE

In the Howto it's defined as usr2.

Do you find my Howto misleading?

-- Reuti


> This is all on a sandbox grid setup using version 6.2u5p1. The script is a 
> slightly modified version of the condor_transparent6.sh script in the howto 
> (I added some echo statements to print variable values). The binary is a toy 
> C++ program that increments integer values and then prints them out in a big loop.
>  
>  
> Reuti,
>  
> Just to make sure it wasn't my toy binary, I re-ran with your ever.c 
> program. Using that, a checkpoint file was created. My toy binary used the 
> sleep command. OK, this is good. :)
>  
> Cheers,
> Lane
>  


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
