I've been trying to get dmtcp integrated into gridengine. I'm not terribly
close (I've been (re)learning a lot), but I figured I'd post what I have in
case anyone out there has better ideas on how to do this or how to proceed.
Current approach - hopefully transparent to the user:
Checkpoint environment:
ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
migr_command       /usr/share/gridengine/util/dmtcp_migrate
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               xsr
Queue settings:
qname              dmtcp
ckpt_list          dmtcp
starter_method     /usr/share/gridengine/util/starter_dmtcp
---
#!/bin/bash
# starter_dmtcp - dmtcp job starter - runs jobs under dmtcp checkpointing
# Set up dmtcp_coordinator - this will get killed by the shepherd when the job
# completes
export DMTCP_PORT=$(dmtcp_coordinator --port 0 --exit-on-last --interval 0 \
    --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g')
# Record the port for later use by checkpointing scripts
echo "$DMTCP_PORT" > "$TMPDIR/dmtcp_port"
# Start the job (TODO - be able to set the argv[0] for login shell)
exec dmtcp_checkpoint "$SGE_STARTER_SHELL_PATH" "$@"
---
#!/bin/bash
# dmtcp_checkpoint - checkpoint a dmtcp job
# Retrieve the dmtcp port
export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")
# Checkpoint the job, waiting until done
/usr/bin/dmtcp_command --quiet bc
---
#!/bin/bash
# dmtcp_migrate - migrate a dmtcp job
# Retrieve the dmtcp port
export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")
# Checkpoint the jobs, blocking until done, and then quit
/usr/bin/dmtcp_command --quiet bc && /usr/bin/dmtcp_command --quiet q
---
I tried using the default dmtcp checkpoint signal (USR2), but that doesn't
appear to work in this case.
These scripts do appear to produce the dmtcp restart files in the job's
working directory:
-rw-------. 1 orion nwra  2108167 Oct 10 12:29 ckpt_foo_422ca3e65019b42-1421-5075be27.dmtcp
-rw-------. 1 orion nwra 25996438 Oct 10 12:29 ckpt_bash_422ca3e65019b42-1407-5075be27.dmtcp
lrwxrwxrwx. 1 orion nwra       55 Oct 10 12:29 dmtcp_restart_script.sh -> ./dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
-rwxr--r--. 1 orion nwra     4007 Oct 10 12:29 dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
Now to figure out how to restart. Probably need to move the restart files to
a network directory. I'd also like to handle multiple jobs/tasks running in
the same directory.
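One possible direction for both open points, as a rough sketch only and not
tested on a real cluster: give each job (and array task) its own checkpoint
directory on shared storage, and have the starter pick "restart" over "start"
when gridengine reports RESTARTED=1. CKPT_BASE and both function names are
made up for illustration. It also assumes that dmtcp honors
DMTCP_CHECKPOINT_DIR and that dmtcp_restart_script.sh ends up in that same
directory, which may not hold for every dmtcp version. Bringing up a
coordinator for the restart is not handled here.

```bash
#!/bin/bash
# Sketch only: CKPT_BASE and both function names are hypothetical.

# Print a checkpoint directory unique to this job (and array task, if any).
ckpt_dir_for_job() {
    local base="$1" job_id="$2" task_id="${3:-}"
    # Grid Engine sets SGE_TASK_ID to "undefined" for non-array jobs.
    if [ -z "$task_id" ] || [ "$task_id" = "undefined" ]; then
        printf '%s/%s\n' "$base" "$job_id"
    else
        printf '%s/%s.%s\n' "$base" "$job_id" "$task_id"
    fi
}

# Print "restart" if the job was restarted and an executable restart script
# exists in the given directory, otherwise print "start".
start_or_restart() {
    local dir="$1" restarted="${2:-0}"
    if [ "$restarted" = "1" ] && [ -x "$dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo start
    fi
}

# Intended use near the top of the starter method (only runs under Grid Engine).
if [ -n "${JOB_ID:-}" ]; then
    dir=$(ckpt_dir_for_job "${CKPT_BASE:-$HOME/dmtcp_ckpt}" "$JOB_ID" "${SGE_TASK_ID:-}")
    mkdir -p "$dir"
    # Assumption: dmtcp writes checkpoint images (and the restart script) here.
    export DMTCP_CHECKPOINT_DIR="$dir"
    if [ "$(start_or_restart "$dir" "${RESTARTED:-0}")" = restart ]; then
        exec "$dir/dmtcp_restart_script.sh"
    fi
fi
```

With a layout like this, job 42 would checkpoint into $CKPT_BASE/42 and task 3
of array job 42 into $CKPT_BASE/42.3, so jobs sharing a working directory
would not overwrite each other's restart script symlink.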
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com