So I've been playing around with trying to get dmtcp integrated into gridengine. I'm not terribly close (I've been (re)learning a lot), but I figured I'd post what I've got in case anyone else out there has better ideas about how to do this or how to proceed.

Current approach - hopefully transparent to the user:

ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
migr_command       /usr/share/gridengine/util/dmtcp_migrate
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               xsr

qname                 dmtcp
ckpt_list             dmtcp
starter_method        /usr/share/gridengine/util/starter_dmtcp

---

#!/bin/bash
# starter_dmtcp - dmtcp job starter - runs jobs under dmtcp checkpointing

# Setup dmtcp_coordinator - this will get killed by the shepherd when the job completes
export DMTCP_PORT=`dmtcp_coordinator --port 0 --exit-on-last --interval 0 --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`

# Record the port for later use by checkpointing scripts
echo "$DMTCP_PORT" > "$TMPDIR/dmtcp_port"

# Start the job (TODO - be able to set the argv[0] for login shell)
exec dmtcp_checkpoint $SGE_STARTER_SHELL_PATH "$@"

---

#!/bin/bash
# dmtcp_checkpoint - checkpoint a dmtcp job

# Retrieve the dmtcp port
export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")

# Checkpoint the job, waiting until done
/usr/bin/dmtcp_command --quiet bc

---

#!/bin/bash
# dmtcp_migrate - migrate a dmtcp job

# Retrieve the dmtcp port
export DMTCP_PORT=$(cat "$TMPDIR/dmtcp_port")

# Checkpoint the jobs, blocking until done, and then quit
/usr/bin/dmtcp_command --quiet bc && /usr/bin/dmtcp_command --quiet q

---
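One thing the starter script does not do is check that it actually got a port back from the coordinator. A minimal sketch of how that parsing could be made a bit more defensive is below. The function name parse_dmtcp_port is made up for illustration, and it only assumes what the grep above already assumes: that the coordinator prints a line containing "Port:" followed by the number.

```shell
#!/bin/bash
# Sketch only: parse_dmtcp_port is a hypothetical helper, not part of dmtcp.
# Reads coordinator output on stdin and prints the port number.
# Returns non-zero if no line with a purely numeric port is found.
parse_dmtcp_port() {
    local port
    # Same idea as the grep/sed pipeline in the starter: keep the first
    # "Port:" line, drop the label, and strip all whitespace.
    port=$(grep "Port:" | sed -e 's/.*Port://' -e 's/[[:space:]]//g' | head -n 1)
    case $port in
        ''|*[!0-9]*) return 1 ;;   # empty or not all digits
        *) echo "$port" ;;
    esac
}
```

The starter could then pipe the coordinator output through this function and bail out early if it fails, rather than writing an empty dmtcp_port file.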

I tried using the default dmtcp checkpoint signal (USR2), but that doesn't appear to work in this case.

These scripts do appear to produce the dmtcp restart stuff in the job's working directory:

-rw-------. 1 orion nwra  2108167 Oct 10 12:29 ckpt_foo_422ca3e65019b42-1421-5075be27.dmtcp
-rw-------. 1 orion nwra 25996438 Oct 10 12:29 ckpt_bash_422ca3e65019b42-1407-5075be27.dmtcp
lrwxrwxrwx. 1 orion nwra       55 Oct 10 12:29 dmtcp_restart_script.sh -> ./dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
-rwxr--r--. 1 orion nwra     4007 Oct 10 12:29 dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh

Now to figure out how to restart. Probably need to move the restart files to a network directory. I'd also like to handle multiple jobs/tasks running in the same directory.
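
One possible direction for both the network directory and the same-directory problem, as an untested sketch: give each job/task its own checkpoint directory on shared storage. The helper names and the /net/ckpt path used below are hypothetical. It assumes JOB_ID and SGE_TASK_ID are set by gridengine as usual (SGE_TASK_ID is "undefined" for non-array jobs) and that dmtcp honors the DMTCP_CHECKPOINT_DIR environment variable.

```shell
#!/bin/bash
# Sketch only: helper names and the shared root path are hypothetical.

# Print a per-job checkpoint directory under the given root,
# e.g. /net/ckpt/1234 for a plain job or /net/ckpt/1234.7 for array task 7.
dmtcp_ckpt_dir() {
    local root=$1 job=$2 task=$3
    if [ -z "$task" ] || [ "$task" = "undefined" ]; then
        echo "$root/$job"
    else
        echo "$root/$job.$task"
    fi
}

# Print the path of the restart script in the given directory if an
# executable one exists there; return non-zero otherwise.
dmtcp_restart_script() {
    local script=$1/dmtcp_restart_script.sh
    [ -x "$script" ] && echo "$script"
}
```

In the starter this might be used roughly as export DMTCP_CHECKPOINT_DIR=$(dmtcp_ckpt_dir /net/ckpt "$JOB_ID" "$SGE_TASK_ID"), followed by running the restart script if dmtcp_restart_script finds one and falling back to dmtcp_checkpoint otherwise. Where dmtcp actually writes the restart script may depend on the coordinator's working directory, so that part would need checking.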

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
