On 10/10/2012 12:32 PM, Orion Poplawski wrote:
> Now to figure out how to restart. Probably need to move the restart files to
> a network directory. I'd also like to handle multiple jobs/tasks running in
> the same directory.

Okay, I moved the checkpoint directory to a shared filesystem to support
migrating between machines (set with ckpt_dir) and now handle restarting:
ckpt_name dmtcp
interface APPLICATION-LEVEL
ckpt_command /usr/share/gridengine/util/dmtcp_checkpoint
migr_command /usr/share/gridengine/util/dmtcp_migrate
restart_command NONE
clean_command /usr/share/gridengine/util/dmtcp_cleanup
ckpt_dir /data/cora/dmtcp
signal NONE
when xsr
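
For anyone wanting to reproduce this, here is a rough sketch of how the definition above could be written to a file and loaded. The function name and the /tmp/dmtcp.ckpt path are placeholders of mine, not part of the setup described here, and the qconf/qsub lines are left as comments because they need a live cluster and admin rights.

```bash
#!/bin/bash
# Write the checkpoint environment definition shown above to the file
# given as $1. The function name is hypothetical.
write_dmtcp_ckpt_conf() {
    cat > "$1" <<'EOF'
ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
migr_command       /usr/share/gridengine/util/dmtcp_migrate
restart_command    NONE
clean_command      /usr/share/gridengine/util/dmtcp_cleanup
ckpt_dir           /data/cora/dmtcp
signal             NONE
when               xsr
EOF
}

# Assumed usage (not run here; file and job names are placeholders):
#   write_dmtcp_ckpt_conf /tmp/dmtcp.ckpt
#   qconf -Ackpt /tmp/dmtcp.ckpt    # add the checkpoint environment from a file
#   qsub -ckpt dmtcp myjob.sh       # submit a job under that environment
```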
---
#!/bin/bash
# dmtcp_starter - dmtcp job starter - runs jobs under dmtcp checkpointing
# Get the base from the config file
eval `grep ^ckpt_dir= $SGE_JOB_SPOOL_DIR/config`
# Make the per task checkpoint directory if it doesn't already exist
CKPTDIR=${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}
mkdir -p $CKPTDIR
# Setup dmtcp_coordinator - this will get killed by the shepherd
export DMTCP_PORT=$(dmtcp_coordinator --port 0 --ckptdir $CKPTDIR \
    --exit-on-last --interval 0 --background 2>&1 |
    grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g')
# Record the port for later use by checkpointing scripts
echo $DMTCP_PORT > $TMPDIR/dmtcp_port
if [ "$RESTARTED" -eq 2 ]
then
# Override the setting in dmtcp_restart_script.sh
export DMTCP_HOST=$HOSTNAME
# Restart the job
exec $CKPTDIR/dmtcp_restart_script.sh
else
# We need to move the job script to remove the hostname from the path
cp $1 $CKPTDIR/jobscript
shift
# Start the job (TODO - be able to set the argv[0] for login shell)
exec dmtcp_checkpoint --quiet $SGE_STARTER_SHELL_PATH $CKPTDIR/jobscript "$@"
fi
---
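
Two pieces of the starter script are easier to follow in isolation: how the per-task directory name is built, and how the port is pulled out of the coordinator output. The sketch below is only an illustration. The function names are mine, and the sample "Port:" line in the usage comment is a made-up stand-in, since the script only relies on the output containing a line with "Port:" in it.

```bash
#!/bin/bash
# Build the per-task checkpoint directory name. SGE sets SGE_TASK_ID to
# the literal string "undefined" for non-array jobs; the bash pattern
# substitution maps that to task 1 so every job gets a JOB_ID.TASK name.
ckpt_task_dir() {
    local base=$1 job_id=$2 task_id=${3:-undefined}
    echo "${base}/${job_id}.${task_id/undefined/1}"
}

# Keep only the line containing "Port:", drop the label, then strip all
# whitespace so that only the port number is left.
parse_port() {
    grep "Port:" | sed -e 's/Port://g' -e 's/[[:space:]]//g'
}

# Example usage (the input line below is hypothetical):
#   ckpt_task_dir /data/cora/dmtcp 1234 undefined   # /data/cora/dmtcp/1234.1
#   ckpt_task_dir /data/cora/dmtcp 1234 7           # /data/cora/dmtcp/1234.7
#   printf '    Port: 7779\n' | parse_port          # 7779
```

Since array tasks and plain jobs end up in separate JOB_ID.TASK directories, several jobs can share a working directory without overwriting each other's checkpoint data.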
Also added a cleanup script:
#!/bin/bash
# dmtcp_cleanup - run at job exit
# Get the base from the config file
eval `grep ^ckpt_dir= $SGE_JOB_SPOOL_DIR/config`
CKPTDIR=${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}
# Cleanup the proper *.dmtcp checkpoint files in the current directory
eval `grep ^given_ckpt_files= ${CKPTDIR}/dmtcp_restart_script.sh`
# (-f keeps rm quiet if the job never checkpointed and nothing exists yet)
rm -f $given_ckpt_files
# Remove dmtcp checkpoint directory
rm -rf ${CKPTDIR}
# Always exit true
exit 0
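
The same cleanup logic can be written as a function that copes with a job that never wrote a checkpoint. This is a sketch, not what is installed here. The function name is hypothetical, and it only assumes what the script above assumes: that dmtcp_restart_script.sh contains a line starting with given_ckpt_files= holding a space-separated list of checkpoint files.

```bash
#!/bin/bash
# Remove the checkpoint images listed in the restart script, then the
# per-task checkpoint directory itself. Always returns success so that
# a failed cleanup never marks the job as failed.
cleanup_ckpt() {
    local ckptdir=$1 given_ckpt_files=
    local script="$ckptdir/dmtcp_restart_script.sh"
    if [ -f "$script" ]; then
        # Pull in only the given_ckpt_files=... assignment
        eval "$(grep '^given_ckpt_files=' "$script")"
        # Left unquoted on purpose: the value is a space-separated list
        [ -n "$given_ckpt_files" ] && rm -f $given_ckpt_files
    fi
    rm -rf "$ckptdir"
    return 0
}

# Example usage inside a clean_command script:
#   cleanup_ckpt "${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}"
```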
This appears to work fine for migrating within the same queue, and I think it
should work between queues as well.
The next thing I'd like to tackle is where to configure the dmtcp_starter
script as the starter method. Ideally this would be set in the dmtcp
checkpointing environment configuration, but there is currently no
starter_method/command setting there. Perhaps worth an RFE? Otherwise I may be
stuck setting it on a per-queue basis or globally, which presents cleanup
issues, although it looks like I can check for SGE_CKPT_ENV=dmtcp to see
whether dmtcp checkpointing has been requested for the job. That looks like
the thing to do for now.
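
A queue-wide or global starter_method could branch on SGE_CKPT_ENV roughly as follows. This is an untested sketch: the function name, the DMTCP_STARTER variable, the install path of dmtcp_starter, and the /bin/sh fallback are all assumptions of mine.

```bash
#!/bin/bash
# Assumed install location of the dmtcp_starter script shown earlier.
DMTCP_STARTER=${DMTCP_STARTER:-/usr/share/gridengine/util/dmtcp_starter}

# Print the program that should launch the job: the dmtcp starter when
# the job was submitted with the dmtcp checkpoint environment, otherwise
# the shell SGE would normally use (falling back to /bin/sh).
select_starter() {
    if [ "${SGE_CKPT_ENV:-}" = "dmtcp" ]; then
        echo "$DMTCP_STARTER"
    else
        echo "${SGE_STARTER_SHELL_PATH:-/bin/sh}"
    fi
}

# In a real wrapper the last line would be something like:
#   exec "$(select_starter)" "$@"
```

Jobs submitted without -ckpt dmtcp would then bypass the coordinator and the checkpoint directory entirely, so there would be nothing extra to clean up for them.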
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com