Re: [gridengine users] How to use condor checkpointing with SGE

Lane Schwartz Thu, 17 Mar 2011 05:53:52 -0700

On Wed, Mar 16, 2011 at 4:46 PM, Reuti <[email protected]> wrote:


> Lane,
>
> Am 16.03.2011 um 21:34 schrieb Lane Schwartz:
>
> > On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz <[email protected]> wrote:
> > On Wed, Mar 16, 2011 at 2:47 PM, Reuti <[email protected]> wrote:
> > Am 16.03.2011 um 19:35 schrieb Lane Schwartz:
> >
> > > <snip>
> > > The job gets queued up and assigned to run, and the stderr and stdout
> > > files are created. When a checkpointable job starts, condor and DMTCP
> > > each print a small log message. That log message shows up in the logs.
> > > But no output from my program appears. SGE lists my job's status as "r"
> > > but when I ssh in to the machine where the job is running and run
> > > ps aux, ps lists my job's status as suspended.
> > >
> > > When I launch my checkpointable jobs locally (not using qsub) they run
> > > and produce immediate output. When I run those same jobs using qsub,
> > > they go into "r" status, but never produce output and appear to not be
> > > actually running.
> > >
> > > On a related topic, using 6.2u5p1 I've had mixed results following the
> > > checkpointing interface tutorial at
> > > http://gridscheduler.sourceforge.net/howto/checkpointing.html. The
> > > initial examples describe setting up a transparent interface and running
> > > it with some simple shell scripts; I've been able to get these to work
> > > as described. I've also followed the examples for setting up an
> > > application-level interface with shell scripts; that works, but only the
> > > migr_command and clean_command appear to run. When I run example 6,
> > > which uses condor in conjunction with transparent checkpointing, no
> > > condor checkpoint files are created.
> >
> > You set usr2 as the signal to be used and waited at least
> > min_cpu_interval? Still no checkpoint file is created in /home/checkpoint
> > or the like? Can you try sending usr2 by hand to the complete process
> > group on the node?
> > I can confirm. I ran the following, and no checkpoint file was created:
> >
> > $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh
>
> I wouldn't use "-b y" here - it's a script.
>
>
> >
> > $ qstat
> > ... lists the above job in state "r", with job-ID 114 ...
> >
> > $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> > ... lists the processes associated with the job.
> > ... The parent process has PID 10240, and is "-csh c ./condor_transparent6.sh" with ps state "Ss"
> > ... The second process has PID 10322, and is "/bin/bash ./condor_transparent6.sh" with ps state "S"
> > ... The third process has PID, and is running the actual condor-linked binary with ps state "S"
> > ... These three processes have PGID 10240.
>
> $ ps -e f
>
> (f w/o -) will show an overview as a nice tree.
>
>
> >
> > $ kill -s USR2 -- -10240
>
> Do you use my demo-script and trap usr2 as outlined?
>
>
> >
> > $ qstat
> > ... My job is no longer listed ...
> >
> > $ ls /tmp/114
> > ... No files are listed. The directory exists, though, which makes sense
> > since "Checkpoint Directory" is set to /tmp in the checkpointing
> > configuration.
> >
> >
> > My checkpoint interface definition is below:
> >
> > Name: transparent
> >
> > Interface: TRANSPARENT
> > Checkpoint command: NONE
> > Migrate command: NONE
> > Clean command: NONE
> > Checkpoint directory: /tmp
>
> It should be a location shared between all nodes, otherwise you can't
> restart on another node.
>
>
> > Checkpoint When: xsr
>
> Without "m" here, a checkpoint will never be created, as explained in
> the Howto and the state diagram I mentioned before.
>
>
> > Checkpoint Signal: NONE
>
> In the Howto it's defined as usr2.
>
> Do you find my Howto misleading?
>
Reuti,

Your Howto is excellent! :-) That's the only reason I've been able to make
as much progress as I have.

> Checkpoint Signal: NONE

This was a copy-paste error. I do actually have the signal set to usr2, and
I have the script set to trap it via "trap '' usr2" in
condor_transparent6.sh. Now that I've got the example working via a
manually-sent USR2 signal, my next step will be to try to get the signal
sent via the migrate script under application-level checkpointing. I'll
keep you apprised of my success. :)
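
For anyone following along, here is a minimal sketch of what the "trap '' usr2"
line does in a wrapper script. It is not the Howto's script or my actual
condor_transparent6.sh. The script sends USR2 to itself as a stand-in for SGE
(or a manual kill) signalling the job, and the "status" variable is only there
for illustration. With an empty trap the shell ignores the signal; without it,
the default action for USR2 would terminate the shell.

```shell
#!/bin/sh
# Ignore USR2 in this wrapper shell. Without this line the default
# action for USR2 would terminate the shell.
trap '' USR2

# Stand-in for the checkpoint signal: send USR2 to this shell itself.
# (In the real setup the signal goes to the job's whole process group.)
kill -s USR2 $$

# Reaching this line shows that the wrapper ignored the signal.
status="wrapper still running after USR2"
echo "$status"
```

Commenting out the trap line should make the script end at the kill, before
the echo is reached.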

Cheers,
Lane
