On 10/10/2012 03:49 PM, Orion Poplawski wrote:
The next thing I'd like to tackle is where to configure the use of the
dmtcp_starter script. Ideally this would be set in the dmtcp checkpointing
configuration file, but there is currently no starter_method/command setting
there. Perhaps worth an RFE? Otherwise I may be stuck setting it per queue or
globally, which presents cleanup issues, although it looks like I can check
for SGE_CKPT_ENV=dmtcp to see whether dmtcp checkpointing has been requested.
That looks like the thing to do for now.
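For the per-queue route, the starter can be attached to a queue with qconf. A rough sketch, assuming a hypothetical install path (/opt/sge/dmtcp/dmtcp_starter) and queue name (all.q), neither of which comes from this thread. It only prints the command so it can be reviewed before an SGE manager runs it:

```shell
# Hypothetical install path and queue name; adjust for the local cluster.
STARTER=/opt/sge/dmtcp/dmtcp_starter
QUEUE=all.q

# starter_cmd PATH QUEUE: print the qconf call that would set the queue's
# starter_method attribute, without executing it.
starter_cmd() {
    printf 'qconf -mattr queue starter_method %s %s\n' "$1" "$2"
}

starter_cmd "$STARTER" "$QUEUE"
```

Since the starter then runs for every job in that queue, it has to fall through to a normal start when SGE_CKPT_ENV is not dmtcp.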
Okay, this uses SGE_CKPT_ENV:
#!/bin/bash
# dmtcp_starter - dmtcp job starter - runs jobs under dmtcp checkpointing
# starter_methods need to be installed per queue, but we only
# want to set up dmtcp checkpointing for jobs that use it
if [ "$SGE_CKPT_ENV" = dmtcp ]
then
    # Get the base checkpoint directory (ckpt_dir) from the job's config file
    eval "$(grep '^ckpt_dir=' "$SGE_JOB_SPOOL_DIR/config")"
    # Make the per task checkpoint directory if it doesn't already exist
    # (SGE_TASK_ID is "undefined" for non-array jobs, so use 1 in that case)
    CKPTDIR=${ckpt_dir}/${JOB_ID}.${SGE_TASK_ID/undefined/1}
    mkdir -p "$CKPTDIR"
    # Set up dmtcp_coordinator - this will get killed by the shepherd.
    # The port it listens on is parsed from its startup output.
    export DMTCP_PORT=$(dmtcp_coordinator --port 0 --ckptdir "$CKPTDIR" \
        --exit-on-last --interval 0 --background 2>&1 |
        grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g')
    # Record the port for later use by checkpointing scripts
    echo "$DMTCP_PORT" > "$TMPDIR/dmtcp_port"
    # Treat an unset RESTARTED as 0 so the numeric test cannot fail
    if [ "${RESTARTED:-0}" -eq 2 ]
    then
        # Override the setting in dmtcp_restart_script.sh
        export DMTCP_HOST=$HOSTNAME
        # Restart the job
        exec "$CKPTDIR/dmtcp_restart_script.sh"
    else
        # We need to move the job script to remove the hostname from the path
        cp "$1" "$CKPTDIR/jobscript"
        shift
        # Start the job (TODO - be able to set the argv[0] for login shell)
        exec dmtcp_checkpoint --quiet "$SGE_STARTER_SHELL_PATH" \
            "$CKPTDIR/jobscript" "$@"
    fi
else
    # Start the job normally with proper login shell handling
    if [ "$SGE_STARTER_USE_LOGIN_SHELL" = true ]
    then
        # A leading "-" in argv[0] makes the shell behave as a login shell
        shellname=$(basename "$SGE_STARTER_SHELL_PATH")
        exec -a "-${shellname}" "$SGE_STARTER_SHELL_PATH" "$@"
    else
        exec "$SGE_STARTER_SHELL_PATH" "$@"
    fi
fi
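The two least obvious bits of the script, pulling the port number out of the coordinator's startup output and building the per task directory name, can be tried in isolation. A minimal sketch follows. The function names parse_port and task_ckpt_dir and the sample coordinator output are made up for illustration and are not part of the script or of DMTCP. The directory helper uses a portable case statement in place of the bash-only ${SGE_TASK_ID/undefined/1} expansion, which should give the same result for the values SGE sets:

```shell
# parse_port: read coordinator startup output on stdin, print just the port.
# Same grep/sed idea as in dmtcp_starter.
parse_port() {
    grep "Port:" | sed -e 's/Port://g' -e 's/[[:space:]]//g'
}

# task_ckpt_dir BASE JOB_ID TASK_ID: print BASE/JOB_ID.TASK_ID, mapping the
# "undefined" task id of a non-array job to 1.
task_ckpt_dir() {
    case $3 in
        undefined) echo "$1/$2.1" ;;
        *)         echo "$1/$2.$3" ;;
    esac
}

# Made-up sample of coordinator output, for illustration only.
printf 'dmtcp_coordinator starting...\n    Port: 7779\n' | parse_port
task_ckpt_dir /ckpt 1234 undefined
task_ckpt_dir /ckpt 1234 7
```

With the sample input this prints the bare port number, then /ckpt/1234.1 and /ckpt/1234.7.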
I've set up a repository here:
https://github.com/opoplawski/gridengine_dmtcp
This is all very raw, hot off the press, but I hope to do more testing in the
coming days.
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com