On 10.10.2012 at 20:32, Orion Poplawski wrote:

> So I've been playing around with trying to get dmtcp integrated into 
> gridengine.  I'm not terribly close (I've been (re)learning a lot), but I 
> figured I'd post what I've got in case anyone else out there has better 
> ideas on how to do this or how to proceed.
> 
> Current approach - hopefully transparent to the user:

You can check the examples in:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html

for application-level checkpointing and how to give the job-/task-id to a job 
script to restart.
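
As a rough sketch of that idea (not taken from the howto itself): Grid Engine sets RESTARTED=1 in the job environment when a checkpointing job is started again, so a starter can branch on it. The function name choose_action and the per-job directory passed to it are hypothetical, and the restart script name is simply the symlink that dmtcp is shown producing in the listing further down.

```shell
#!/bin/bash
# Sketch only: decide whether this invocation is a fresh start or a restart.
# Assumption: Grid Engine exports RESTARTED=1 when the job has been restarted.
# Assumption: the checkpoint directory contains dmtcp_restart_script.sh.

choose_action() {
    local ckpt_dir=$1
    # Restart only if SGE says so AND a restart script is actually present.
    if [ "${RESTARTED:-0}" = "1" ] && [ -x "$ckpt_dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo start
    fi
}

# Possible use inside a starter method (CKPT_ROOT is a hypothetical variable):
#   case "$(choose_action "$CKPT_ROOT/$JOB_ID")" in
#       restart) exec "$CKPT_ROOT/$JOB_ID/dmtcp_restart_script.sh" ;;
#       start)   exec dmtcp_checkpoint "$SGE_STARTER_SHELL_PATH" "$@" ;;
#   esac
```

Falling back to a fresh start when no restart script exists means a job that was rescheduled before its first checkpoint simply runs from the beginning.
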

-- Reuti


> ckpt_name          dmtcp
> interface          APPLICATION-LEVEL
> ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
> migr_command       /usr/share/gridengine/util/dmtcp_migrate
> restart_command    NONE
> clean_command      NONE
> ckpt_dir           /tmp
> signal             NONE
> when               xsr
> 
> qname                 dmtcp
> ckpt_list             dmtcp
> starter_method        /usr/share/gridengine/util/starter_dmtcp
> 
> ---
> 
> #!/bin/bash
> # starter_dmtcp - dmtcp job starter - runs jobs under dmtcp checkpointing
> 
> # Set up dmtcp_coordinator - this will get killed by the shepherd when the
> # job completes
> export DMTCP_PORT=$(dmtcp_coordinator --port 0 --exit-on-last --interval 0 \
>     --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g')
> 
> # Record the port for later use by the checkpointing scripts
> echo "$DMTCP_PORT" > "$TMPDIR/dmtcp_port"
> 
> # Start the job (TODO - be able to set the argv[0] for login shell)
> exec dmtcp_checkpoint "$SGE_STARTER_SHELL_PATH" "$@"
> 
> ---
> 
> #!/bin/bash
> # dmtcp_checkpoint - checkpoint a dmtcp job
> 
> # Retrieve the dmtcp port recorded by the starter method
> export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")
> 
> # Checkpoint the job, waiting until done
> /usr/bin/dmtcp_command --quiet bc
> 
> ---
> 
> #!/bin/bash
> # dmtcp_migrate - migrate a dmtcp job
> 
> # Retrieve the dmtcp port recorded by the starter method
> export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")
> 
> # Checkpoint the job, blocking until done, and then quit
> /usr/bin/dmtcp_command --quiet bc && /usr/bin/dmtcp_command --quiet q
> 
> ---
> 
> I tried using the default dmtcp checkpoint signal (USR2), but that doesn't 
> appear to work in this case.
> 
> These scripts do appear to produce the dmtcp restart files in the job's 
> working directory:
> 
> -rw-------. 1 orion nwra  2108167 Oct 10 12:29 ckpt_foo_422ca3e65019b42-1421-5075be27.dmtcp
> -rw-------. 1 orion nwra 25996438 Oct 10 12:29 ckpt_bash_422ca3e65019b42-1407-5075be27.dmtcp
> lrwxrwxrwx. 1 orion nwra       55 Oct 10 12:29 dmtcp_restart_script.sh -> ./dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
> -rwxr--r--. 1 orion nwra     4007 Oct 10 12:29 dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
> 
> Now to figure out how to restart.  Probably need to move the restart files to 
> a network directory.  I'd also like to handle multiple jobs/tasks running in 
> the same directory.
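
One way to keep jobs and array tasks apart might be a per-job checkpoint directory on shared storage, named from the JOB_ID and SGE_TASK_ID variables that Grid Engine exports to the job. A minimal sketch follows; the function name ckpt_dir_for_job and the base path are hypothetical, and pointing dmtcp at the directory through DMTCP_CHECKPOINT_DIR is an assumption that should be checked against the installed dmtcp version.

```shell
#!/bin/bash
# Sketch only: build a checkpoint directory that is unique per job and task.
# Assumption: JOB_ID is set by Grid Engine, and SGE_TASK_ID is either a task
# number (array jobs) or the literal string "undefined" (non-array jobs).

ckpt_dir_for_job() {
    local base=$1
    local dir="$base/${JOB_ID:?JOB_ID not set}"
    # Append the task number only for array jobs.
    if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
        dir="$dir.$SGE_TASK_ID"
    fi
    echo "$dir"
}

# Possible use in the starter method (/net/ckpt is a hypothetical shared path):
#   export DMTCP_CHECKPOINT_DIR=$(ckpt_dir_for_job /net/ckpt)
#   mkdir -p "$DMTCP_CHECKPOINT_DIR"
```

With this layout the restart side only needs the same job and task ids to find its files, regardless of which host the job lands on.
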
> 
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder Office                  FAX: 303-415-9702
> 3380 Mitchell Lane                       [email protected]
> Boulder, CO 80301                   http://www.nwra.com
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

