On 10/10/2012 12:32 PM, Orion Poplawski wrote:
Now to figure out how to restart.  Probably need to move the restart files to
a network directory.  I'd also like to handle multiple jobs/tasks running in
the same directory.


Okay, I moved the checkpoint directory to a shared filesystem to support migrating between machines (set with ckpt_dir) and now handle restarting:

ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
migr_command       /usr/share/gridengine/util/dmtcp_migrate
restart_command    NONE
clean_command      /usr/share/gridengine/util/dmtcp_cleanup
ckpt_dir           /data/cora/dmtcp
signal             NONE
when               xsr

---

#!/bin/bash
# dmtcp_starter - dmtcp job starter - runs jobs under dmtcp checkpointing

# Get the base from the config file
eval "$(grep '^ckpt_dir=' "$SGE_JOB_SPOOL_DIR/config")"

# Make the per task checkpoint directory if it doesn't already exist
CKPTDIR=${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}
mkdir -p "$CKPTDIR"

# Set up dmtcp_coordinator - this will get killed by the shepherd
export DMTCP_PORT=$(dmtcp_coordinator --port 0 --ckptdir "$CKPTDIR" --exit-on-last --interval 0 --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g')

# Record the port for later use by checkpointing scripts
echo "$DMTCP_PORT" > "$TMPDIR/dmtcp_port"

# Default to 0 so the test does not error out if RESTARTED is unset
if [ "${RESTARTED:-0}" -eq 2 ]
then
  # Override the setting in dmtcp_restart_script.sh
  export DMTCP_HOST=$HOSTNAME
  # Restart the job
  exec "$CKPTDIR/dmtcp_restart_script.sh"
else
  # We need to move the job script to remove the hostname from the path
  cp "$1" "$CKPTDIR/jobscript"
  shift
  # Start the job (TODO - be able to set the argv[0] for login shell)
  exec dmtcp_checkpoint --quiet "$SGE_STARTER_SHELL_PATH" "$CKPTDIR/jobscript" "$@"
fi

---
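
To illustrate the per-task directory naming used above: for non-array jobs Grid Engine sets SGE_TASK_ID to the literal string "undefined", so the bash substitution maps it to task 1. A minimal sketch follows; the helper name ckpt_task_dir and the job id 1234 are made up for illustration and are not part of the scripts.

```bash
#!/bin/bash
# Illustrative helper only (hypothetical name, not part of dmtcp_starter).
# Builds the per-task checkpoint directory the same way the scripts do.
ckpt_task_dir() {
  local base=$1 job_id=$2 task_id=$3
  # ${var/undefined/1} replaces the string "undefined" with "1"
  printf '%s/%s.%s\n' "$base" "$job_id" "${task_id/undefined/1}"
}

ckpt_task_dir /data/cora/dmtcp 1234 undefined   # non-array job
ckpt_task_dir /data/cora/dmtcp 1234 7           # array task 7
```

The first call prints /data/cora/dmtcp/1234.1 and the second /data/cora/dmtcp/1234.7, so each task of an array job gets its own directory.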

Also added a cleanup script:


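#!/bin/bash
# dmtcp_cleanup - run at job exit

# Get the base from the config file
eval "$(grep '^ckpt_dir=' "$SGE_JOB_SPOOL_DIR/config")"
CKPTDIR=${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}

# Clean up the proper *.dmtcp checkpoint files (as listed in the restart script).
# The restart script only exists if a checkpoint was actually taken.
if [ -f "${CKPTDIR}/dmtcp_restart_script.sh" ]
then
  eval "$(grep '^given_ckpt_files=' "${CKPTDIR}/dmtcp_restart_script.sh")"
  # Left unquoted on purpose: the variable holds a space-separated file list
  rm -f $given_ckpt_files
fi

# Remove dmtcp checkpoint directory
rm -rf "${CKPTDIR}"

# Always exit true
exit 0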


This appears to work fine for migrating within the same queue, and I think it should work between queues as well.

The next thing I'd like to tackle is where to configure the use of the dmtcp_starter script. Ideally this would be set in the dmtcp checkpointing configuration, but there is currently no starter_method/command setting there. Perhaps worth an RFE? Otherwise I may be stuck setting it per queue or globally, which presents cleanup issues, although it looks like I can check for SGE_CKPT_ENV=dmtcp to see whether dmtcp checkpointing has been requested. That looks like the thing to do for now.
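
A rough sketch of what such a check could look like, assuming a wrapper installed as the queue or global starter_method. The wrapper itself, the pick_starter function, and the install path of dmtcp_starter are assumptions for illustration, not something that exists yet:

```bash
#!/bin/bash
# starter_wrapper - hypothetical starter_method that only uses dmtcp_starter
# for jobs submitted with the dmtcp checkpointing environment.

# Assumed install location, alongside the other dmtcp_* scripts.
DMTCP_STARTER=${DMTCP_STARTER:-/usr/share/gridengine/util/dmtcp_starter}

# Print the starter to use: dmtcp_starter when SGE_CKPT_ENV is "dmtcp",
# nothing otherwise.
pick_starter() {
  if [ "${SGE_CKPT_ENV:-}" = "dmtcp" ]; then
    printf '%s\n' "$DMTCP_STARTER"
  fi
}

# Only act when called with a job script (passed as $1 by Grid Engine).
if [ $# -gt 0 ]; then
  starter=$(pick_starter)
  if [ -n "$starter" ]; then
    exec "$starter" "$@"
  else
    # Not a dmtcp job: run the job script under the configured shell.
    exec "${SGE_STARTER_SHELL_PATH:-/bin/sh}" "$@"
  fi
fi
```

Jobs that do not request the dmtcp environment never start a coordinator this way, so there would be nothing extra to clean up for them.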



--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com