On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:
> On 16.03.2011 at 19:35, Lane Schwartz wrote:
>
> > <snip>
> > The job gets queued up and assigned to run, and the stderr and stdout
> > files are created. When a checkpointable job starts, condor and DMTCP each
> > print a small log message. That log message shows up in the logs. But no
> > output from my program appears. SGE lists my job's status as "r", but when I
> > ssh in to the machine where the job is running and run ps aux, ps lists my
> > job's status as suspended.
> >
> > When I launch my checkpointable jobs locally (not using qsub), they run
> > and produce immediate output. When I run those same jobs using qsub, they go
> > into "r" status, but never produce output and appear not to be actually
> > running.
> >
> > On a related topic, using 6.2u5p1 I've had mixed results following the
> > checkpointing interface tutorial at
> > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The initial
> > examples describe setting up a transparent interface and running it with
> > some simple shell scripts; I've been able to get these to work as described.
> > I've also followed the examples for setting up an application-level interface
> > with shell scripts; that works, but only the migr_command and clean_command
> > appear to run. When I run example 6, which uses condor in conjunction with
> > transparent checkpointing, no condor checkpoint files are created.
>
> You set usr2 as the signal to be used and waited at least min_cpu_interval?
> Still no checkpoint file is created in /home/checkpoint or alike? Can you
> try sending usr by hand to the complete process group on the node?

I can confirm. I ran the following, and no checkpoint file was created:

$ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh
$ qstat
... lists the above job in state "r", with job-ID 114 ...
$ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
... lists the processes associated with the job.
... The parent process has PID 10240, and is "-csh c ./condor_transparent6.sh" with ps state "Ss"
... The second process has PID 10322, and is "/bin/bash ./condor_transparent6.sh" with ps state "S"
... The third process has PID, and is running the actual condor-linked binary with ps state "S"
... These three processes have PGID 10240.
$ kill -s USR2 -- -10240
$ qstat
... My job is no longer listed ...
$ ls /tmp/114
... No files are listed.

The directory exists, though, which makes sense since "Checkpoint Directory" is
set to /tmp in the checkpointing configuration.

My checkpoint interface definition is below:

Name:                 transparent
Interface:            TRANSPARENT
Checkpoint command:   NONE
Migrate command:      NONE
Clean command:        NONE
Checkpoint directory: /tmp
Checkpoint When:      xsr
Checkpoint Signal:    NONE

This is all on a sandbox grid setup using version 6.2u5p1. The script is a
slightly modified version of the condor_transparent6.sh script in the howto (I
added some echo statements to print variable values). The binary is a toy C++
program that increments integer values and then prints them out in a big loop.

Thanks,
Lane
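
P.S. In case anyone wants to repeat the manual test, the ps and kill steps above can be sketched as one small shell helper. This is only a rough sketch: the function name signal_group is made up for illustration, and it assumes a ps that accepts POSIX-style "-o name=" columns. The process group ID is whatever ps reports on the execution node (10240 in my run).

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or condor): show the members of a
# process group, then send one signal to the whole group.
# Usage: signal_group PGID [SIGNAL]    (SIGNAL defaults to USR2)
signal_group() {
    pgid=$1
    sig=${2:-USR2}

    # List PID, PGID, state and command line of every process in the group.
    # The empty "name=" headers suppress the ps header line.
    ps -e -o pid= -o pgid= -o stat= -o args= | awk -v g="$pgid" '$2 == g'

    # A negative PID addresses the whole process group;
    # "--" keeps kill from parsing it as an option.
    kill -s "$sig" -- "-$pgid"
}
```

Calling signal_group 10240 USR2 on the node does the same thing as the kill command in my session, after printing the processes that will receive the signal.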
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
