On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:

> On 16.03.2011 at 19:35, Lane Schwartz wrote:
>
> > <snip>
> > The job gets queued up and assigned to run, and the stderr and stdout
> > files are created. When a checkpointable job starts, condor and DMTCP each
> > print a small log message. That log message shows up in the logs. But no
> > output from my program appears. SGE lists my job's status as "r" but when I
> > ssh in to the machine where the job is running and run ps aux, ps lists my
> > job's status as suspended.
> >
> > When I launch my checkpointable jobs locally (not using qsub) they run
> > and produce immediate output. When I run those same jobs using qsub, they go
> > into "r" status, but never produce output and appear to not be actually
> > running.
> >
> > On a related topic, using 6.2u5p1 I've had mixed results following the
> > checkpointing interface tutorial at
> > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The initial
> > examples describe setting up a transparent interface and running it with
> > some simple shell scripts; I've been able to get these to work as described.
> > I've also followed the examples for setting up an application-level interface
> > with shell scripts; that works, but only the migr_command and clean_command
> > appear to run. When I run example 6, which uses condor in conjunction with
> > transparent checkpointing, no condor checkpoint files are created.
>
> Did you set usr2 as the signal to use, and wait at least min_cpu_interval?
> Is there still no checkpoint file created in /home/checkpoint or similar?
> Can you try sending usr2 by hand to the complete process group on the node?
>
I can confirm. I ran the following, and no checkpoint file was created:

$ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh

$ qstat
... lists the above job in state "r", with job-ID 114 ...

$ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
... lists the processes associated with the job.
... The parent process has PID 10240, and is "-csh c
./condor_transparent6.sh" with ps state "Ss"
... The second process has PID 10322, and is "/bin/bash
./condor_transparent6.sh" with ps state "S"
... The third process has PID, and is running the actual condor-linked
binary with ps state "S"
... These three processes share PGID 10240.

$ kill -s USR2 -- -10240

$ qstat
... My job is no longer listed ...

$ ls /tmp/114
... No files are listed. The directory exists, though, which makes sense
since "Checkpoint Directory" is set to /tmp in the checkpointing
configuration.
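
For anyone who wants to reproduce the signalling step without SGE or condor, here is a minimal sketch, assuming bash and a POSIX ps. It starts a stand-in child in its own process group, sends USR2 to the whole group with the same "kill -s USR2 -- -PGID" form used above, and checks that the child's handler ran. The marker file and the "checkpointed" string are hypothetical stand-ins for a real checkpoint file. Note that a process in the group with no USR2 handler installed gets the default action, which is termination.

```shell
#!/bin/bash
# Sketch only: the child below is a stand-in, not the condor-linked binary.
set -m                      # job control: each background job gets its own process group

marker=$(mktemp)            # hypothetical marker file standing in for a checkpoint

# The child installs a USR2 handler that writes the marker, then loops.
bash -c 'trap "echo checkpointed > \"$1\"; exit 0" USR2; while :; do sleep 1; done' _ "$marker" &
child=$!
sleep 1                     # give the child time to install its trap

# Look up the child's process group ID, then signal the whole group.
pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
kill -s USR2 -- "-$pgid"
wait "$child"

result=$(cat "$marker")
rm -f "$marker"
echo "$result"
```

The inner sleep has no handler and is killed by the signal; the bash child survives long enough to run its trap only because it installed one.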


My checkpoint interface definition is below:

Name: transparent
Interface: TRANSPARENT
Checkpoint command: NONE
Migrate command: NONE
Clean command: NONE
Checkpoint directory: /tmp
Checkpoint When: xsr
Checkpoint Signal: NONE
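
Reuti's question about usr2 corresponds to the "Checkpoint Signal" entry above. As a rough sketch, this is how the same definition might look in the text form shown by "qconf -sckpt transparent", assuming the field names from checkpoint(5). The usr2 value is a hypothetical change (my current setting is NONE), restart_command is assumed to be NONE because it is not shown in the summary above, and the # lines are annotations rather than qconf output.

```
# hypothetical variant of the definition above, with the signal set to usr2
ckpt_name          transparent
interface          TRANSPARENT
ckpt_command       NONE
migr_command       NONE
restart_command    NONE      # assumed; not shown in the summary above
clean_command      NONE
ckpt_dir           /tmp
signal             usr2      # hypothetical; currently NONE
when               xsr
```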

This is all on a sandbox grid setup using version 6.2u5p1. The script is a
slightly modified version of the condor_transparent6.sh script in the howto
(I added some echo statements to print variable values). The binary is a toy
C++ program that increments integer values and prints them out in a big
loop.

Thanks,
Lane
