On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz <[email protected]> wrote:
> On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:
>
>> On 16.03.2011 at 19:35, Lane Schwartz wrote:
>>
>> > <snip>
>> > The job gets queued up and assigned to run, and the stderr and stdout
>> > files are created. When a checkpointable job starts, condor and DMTCP
>> > each print a small log message. That log message shows up in the logs.
>> > But no output from my program appears. SGE lists my job's status as "r",
>> > but when I ssh in to the machine where the job is running and run
>> > ps aux, ps lists my job's status as suspended.
>> >
>> > When I launch my checkpointable jobs locally (not using qsub) they run
>> > and produce immediate output. When I run those same jobs using qsub,
>> > they go into "r" status, but never produce output and appear not to be
>> > actually running.
>> >
>> > On a related topic, using 6.2u5p1 I've had mixed results following the
>> > checkpointing interface tutorial at
>> > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The
>> > initial examples describe setting up a transparent interface and running
>> > it with some simple shell scripts; I've been able to get these to work
>> > as described. I've also followed the examples for setting up an
>> > application-level interface with shell scripts; that works, but only the
>> > migr_command and clean_command appear to run. When I run example 6,
>> > which uses condor in conjunction with transparent checkpointing, no
>> > condor checkpoint files are created.
>>
>> Did you set usr2 as the signal to be used, and wait at least
>> min_cpu_interval? Is there still no checkpoint file created in
>> /home/checkpoint or the like? Can you try sending usr by hand to the
>> complete process group on the node?
>>
> I can confirm. I ran the following, and no checkpoint file was created:
>
> $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh
>
> $ qstat
> ... lists the above job in state "r", with job-ID 114 ...
>
> $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> ... lists the processes associated with the job.
> ... The parent process has PID 10240, and is "-csh c ./condor_transparent6.sh"
>     with ps state "Ss"
> ... The second process has PID 10322, and is "/bin/bash
>     ./condor_transparent6.sh" with ps state "S"
> ... The third process has PID, and is running the actual condor-linked
>     binary with ps state "S"
> ... These three processes have PGID 10240.
>
> $ kill -s USR2 -- -10240
>
> $ qstat
> ... My job is no longer listed ...
>
> $ ls /tmp/114
> ... No files are listed. The directory exists, though, which makes sense
>     since "Checkpoint Directory" is set to /tmp in the checkpointing
>     configuration.
>
> My checkpoint interface definition is below:
>
> Name: transparent
> Interface: TRANSPARENT
> Checkpoint command: NONE
> Migrate command: NONE
> Clean command: NONE
> Checkpoint directory: /tmp
> Checkpoint When: xsr
> Checkpoint Signal: NONE
>
> This is all on a sandbox grid setup using version 6.2u5p1. The script is a
> slightly modified version of the condor_transparent6.sh script in the howto
> (I added some echo statements to print variable values). The binary is a
> toy C++ program that increments integer values, then prints them out in a
> big loop.

Reuti,

Just to make sure it wasn't my toy binary, I just re-ran with your ever.c
program. Using that, a checkpoint file was created. My toy binary used the
sleep command.

OK, this is good. :)

Cheers,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
