On Wed, Mar 16, 2011 at 4:46 PM, Reuti <[email protected]> wrote:
> Lane,
>
> On 16.03.2011 at 21:34, Lane Schwartz wrote:
>
> > On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz <[email protected]> wrote:
> > On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:
> > On 16.03.2011 at 19:35, Lane Schwartz wrote:
> >
> > > <snip>
> > > The job gets queued up and assigned to run, and the stderr and
> > > stdout files are created. When a checkpointable job starts, condor
> > > and DMTCP each print a small log message. That log message shows up
> > > in the logs. But no output from my program appears. SGE lists my
> > > job's status as "r" but when I ssh in to the machine where the job
> > > is running and run ps aux, ps lists my job's status as suspended.
> > >
> > > When I launch my checkpointable jobs locally (not using qsub) they
> > > run and produce immediate output. When I run those same jobs using
> > > qsub, they go into "r" status, but never produce output and appear
> > > to not be actually running.
> > >
> > > On a related topic, using 6.2u5p1 I've had mixed results following
> > > the checkpointing interface tutorial at
> > > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The
> > > initial examples describe setting up a transparent interface and
> > > running it with some simply shell scripts; I've been able to get
> > > these to work as described. I've also followed the examples for
> > > setting up application-level interface with shell scripts; that
> > > works, but only the migr_command and clean_command appear to run.
> > > When I run example 6, which uses condor in conjunction with
> > > transparent checkpointing, no condor checkpoint files are created.
> >
> > You set usr2 as the to be used signal and waited at least
> > min_cpu_interval? Still no checkpoint file is created in
> > /home/checkpoint or alike? Can you try sending usr by hand to the
> > complete process group on the node?
> >
> > I can confirm. I ran the following, and no checkpoint file was created:
> >
> > $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh
>
> I wouldn't use "-b y" here - it's a script.
>
> >
> > $ qstat
> > ... lists the above job in state "r", with job-ID 114 ...
> >
> > $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> > ... lists the processes associated with the job.
> > ... The parent process has PID 10240, and is "-csh c
> >     ./condor_transparent6.sh" with ps state "Ss"
> > ... The second process has PID 10322, and is "/bin/bash
> >     ./condor_transparent6.sh" with ps state "S"
> > ... The third process has PID, and is running the actual
> >     condor-linked binary with ps state "S"
> > ... These three jobs have group PGID 10240.
>
> $ ps -e f
>
> (f w/o -) will show an overview as a nice tree.
>
> >
> > $ kill -s USR2 -- -10240
>
> Do you use my demo-script and trap usr2 as outlined?
>
> >
> > $ qstat
> > ... My job is no longer listed ...
> >
> > $ ls /tmp/114
> > ... No files are listed. The directory exists, though, which makes
> > sense since "Checkpoint Directory" is set to /tmp in the
> > checkpointing configuration.
> >
> >
> > My checkpoint interface definition is below:
> >
> > Name: transparent
> >
> > Interface: TRANSPARENT
> > Checkpoint command: NONE
> > Migrate command: NONE
> > Clean command: NONE
> > Checkpoint directory: /tmp
>
> It should be a location shared between all nodes, otherwise you can't
> restart on another node.
>
> >
> > Checkpoint When: xsr
>
> Without "m" here, never ever a checkpoint will be created, as explained
> in the Howto and the state diagram I mentioned before.
>
> >
> > Checkpoint Signal: NONE
>
> In the Howto it's defined as usr2.
>
> Do you find my Howto misleading?

Reuti,

Your Howto is excellent! :-) That's the only reason I've been able to
make as much progress as I have.

> Checkpoint Signal: NONE

This was a copy-paste error.
I do actually have the signal set to usr2, and I have the script set to
trap it via "trap '' usr2" in condor_transparent6.sh.

Now that I've got the example working via a manually-sent USR2 signal,
my next step will be to try to get the signal sent by the migrate script
under application-level checkpointing. I'll keep you apprised of my
progress. :)

Cheers,
Lane
