Lane,

On 16.03.2011 at 21:34, Lane Schwartz wrote:
> On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz wrote:
> On Wed, Mar 16, 2011 at 2:47 PM, Reuti wrote:
> On 16.03.2011 at 19:35, Lane Schwartz wrote:
>
> > <snip>
> > The job gets queued up and assigned to run, and the stderr and stdout
> > files are created. When a checkpointable job starts, condor and DMTCP
> > each print a small log message. That log message shows up in the logs.
> > But no output from my program appears. SGE lists my job's status as "r",
> > but when I ssh in to the machine where the job is running and run
> > ps aux, ps lists my job's status as suspended.
> >
> > When I launch my checkpointable jobs locally (not using qsub) they run
> > and produce immediate output. When I run those same jobs using qsub,
> > they go into "r" status, but never produce output and appear not to be
> > actually running.
> >
> > On a related topic, using 6.2u5p1 I've had mixed results following the
> > checkpointing interface tutorial at
> > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The
> > initial examples describe setting up a transparent interface and running
> > it with some simple shell scripts; I've been able to get these to work
> > as described. I've also followed the examples for setting up an
> > application-level interface with shell scripts; that works, but only the
> > migr_command and clean_command appear to run. When I run example 6,
> > which uses condor in conjunction with transparent checkpointing, no
> > condor checkpoint files are created.
>
> Did you set usr2 as the signal to be used and wait at least
> min_cpu_interval? Is there still no checkpoint file created in
> /home/checkpoint or similar? Can you try sending usr2 by hand to the
> complete process group on the node?
>
> I can confirm. I ran the following, and no checkpoint file was created:
>
> $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh

I wouldn't use "-b y" here - it's a script.

>
> $ qstat
> ... lists the above job in state "r", with job-ID 114 ...
>
> $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> ... lists the processes associated with the job.
> ... The parent process has PID 10240, and is
>     "-csh c ./condor_transparent6.sh" with ps state "Ss"
> ... The second process has PID 10322, and is
>     "/bin/bash ./condor_transparent6.sh" with ps state "S"
> ... The third process has PID, and is running the actual condor-linked
>     binary with ps state "S"
> ... These three jobs have group PGID 10240.

$ ps -e f

(f without a leading -) will show an overview as a nice tree.

>
> $ kill -s USR2 -- -10240

Do you use my demo-script and trap usr2 as outlined?

>
> $ qstat
> ... My job is no longer listed ...
>
> $ ls /tmp/114
> ... No files are listed. The directory exists, though, which makes sense
> since "Checkpoint Directory" is set to /tmp in the checkpointing
> configuration.
>
>
> My checkpoint interface definition is below:
>
> Name: transparent
>
> Interface: TRANSPARENT
> Checkpoint command: NONE
> Migrate command: NONE
> Clean command: NONE
> Checkpoint directory: /tmp

It should be a location shared between all nodes, otherwise you can't restart
on another node.

> Checkpoint When: xsr

Without "m" here, a checkpoint will never be created, as explained in the
Howto and the state diagram I mentioned before.

> Checkpoint Signal: NONE

In the Howto it's defined as usr2. Do you find my Howto misleading?

-- Reuti

> This is all on a sandbox grid setup using version 6.2u5p1. The script is a
> slightly modified version of the condor_transparent6.sh script in the howto
> (I added some echo statements to print variable values). The binary is a
> toy C++ program that increments integer values then prints them out in a
> big loop.
>
>
> Reuti,
>
> Just to make sure it wasn't my toy binary, I just re-ran with your ever.c
> program. Using that, a checkpoint file was created. My toy binary used the
> sleep command.

OK, this is good. :)

>
> Cheers,
> Lane

_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
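
For reference, the three corrections suggested in the replies above (a shared checkpoint directory, "m" added to the "when" setting, and usr2 as the signal) might look roughly like this when combined into one definition. This is only a sketch: the attribute names follow the checkpoint(5) file format as I recall it, /home/checkpoint is simply the example path mentioned in the thread and is assumed to be shared between all nodes, and "xsmr" is nothing more than the original "xsr" with "m" added. The lines starting with # are annotations and not part of what qconf prints.

```
# Sketch only: compare against the output of "qconf -sckpt transparent"
# on your own cluster before changing anything.
ckpt_name          transparent
interface          transparent
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
# Assumed to be a directory shared between all execution hosts.
ckpt_dir           /home/checkpoint
# The Howto defines usr2; the job script has to trap it.
signal             usr2
# "m" enables checkpoints at min_cpu_interval.
when               xsmr
```

An existing definition can be opened for editing with "qconf -mckpt transparent"; the min_cpu_interval itself is a queue setting, so it has to be checked separately in the queue configuration.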
