On 10.10.2012 at 20:32, Orion Poplawski wrote:

> So I've been playing around with trying to get dmtcp integrated into
> gridengine. I'm not terribly close (I've been (re)learning a lot), but I
> figured I'd post what I got in case anyone else out there has some better
> ideas as to how to do this or to proceed.
>
> Current approach - hopefully transparent to the user:

You can check the examples in:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html

for application-level checkpointing and for how to pass the job-/task-id to a job script on restart.

-- Reuti

> ckpt_name          dmtcp
> interface          APPLICATION-LEVEL
> ckpt_command       /usr/share/gridengine/util/dmtcp_checkpoint
> migr_command       /usr/share/gridengine/util/dmtcp_migrate
> restart_command    NONE
> clean_command      NONE
> ckpt_dir           /tmp
> signal             NONE
> when               xsr
>
> qname              dmtcp
> ckpt_list          dmtcp
> starter_method     /usr/share/gridengine/util/starter_dmtcp
>
> ---
>
> #!/bin/bash
> # dmtcp_starter - dmtcp job starter - runs jobs under dmtcp checkpointing
>
> # Setup dmtcp_coordinator - this will get killed by the shepherd when the job completes
> export DMTCP_PORT=`dmtcp_coordinator --port 0 --exit-on-last --interval 0 --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`
>
> # Record the port for later use by checkpointing scripts
> echo $DMTCP_PORT > $TMPDIR/dmtcp_port
>
> # Start the job (TODO - be able to set the argv[0] for login shell)
> exec dmtcp_checkpoint $SGE_STARTER_SHELL_PATH "$@"
>
> ---
>
> #!/bin/bash
> # dmtcp_checkpoint - checkpoint a dmtcp job
>
> # Retrieve the dmtcp port
> export DMTCP_PORT=$(cat $TMPDIR/dmtcp_port)
>
> # Checkpoint the job, waiting until done
> /usr/bin/dmtcp_command --quiet bc
>
> ---
>
> #!/bin/bash
> # dmtcp_migrate - migrate a dmtcp job
>
> # Retrieve the dmtcp port
> export DMTCP_PORT=$(cat $TMPDIR/dmtcp_port)
>
> # Checkpoint the jobs, blocking until done, and then quit
> /usr/bin/dmtcp_command --quiet bc && /usr/bin/dmtcp_command --quiet q
>
> ---
>
> I tried using the default dmtcp checkpoint signal (USR2), but that doesn't
> appear to work in this case.
>
> These scripts do appear to produce the dmtcp restart stuff in the job's
> working directory:
>
> -rw-------. 1 orion nwra  2108167 Oct 10 12:29 ckpt_foo_422ca3e65019b42-1421-5075be27.dmtcp
> -rw-------. 1 orion nwra 25996438 Oct 10 12:29 ckpt_bash_422ca3e65019b42-1407-5075be27.dmtcp
> lrwxrwxrwx. 1 orion nwra       55 Oct 10 12:29 dmtcp_restart_script.sh -> ./dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
> -rwxr--r--. 1 orion nwra     4007 Oct 10 12:29 dmtcp_restart_script_422ca3e65019b42-1407-5075be27.sh
>
> Now to figure out how to restart. Probably need to move the restart files to
> a network directory. I'd also like to handle multiple jobs/tasks running in
> the same directory.
>
> --
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder Office                  FAX: 303-415-9702
> 3380 Mitchell Lane                       [email protected]
> Boulder, CO 80301                   http://www.nwra.com
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
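
On the two open points in the quoted message (restart files on a network directory, and several jobs/tasks sharing one working directory), here is a rough, untested sketch of one way the starter could pick a per-job checkpoint directory. It assumes that `JOB_ID` and `SGE_TASK_ID` are set by gridengine (with `SGE_TASK_ID` being `undefined` for non-array jobs) and that dmtcp honours the `DMTCP_CHECKPOINT_DIR` environment variable. The helper name `ckpt_subdir` and the `CKPT_BASE` path are hypothetical, invented for illustration only.

```shell
#!/bin/bash
# Sketch only: derive a checkpoint directory unique to one job/array task.
# ckpt_subdir is a hypothetical helper, not part of gridengine or dmtcp.

ckpt_subdir() {
    local base=$1 job_id=$2 task_id=$3
    # Non-array jobs have no usable task id, so use the job id alone.
    if [ -z "$task_id" ] || [ "$task_id" = "undefined" ]; then
        printf '%s/%s\n' "$base" "$job_id"
    else
        printf '%s/%s.%s\n' "$base" "$job_id" "$task_id"
    fi
}

# Possible use inside the starter method (CKPT_BASE is an assumed shared path):
#   CKPT_BASE=/net/ckpt
#   export DMTCP_CHECKPOINT_DIR=$(ckpt_subdir "$CKPT_BASE" "$JOB_ID" "$SGE_TASK_ID")
#   mkdir -p "$DMTCP_CHECKPOINT_DIR"
```

With the images kept apart per job, the starter could then test gridengine's `RESTARTED` variable and run the generated `dmtcp_restart_script.sh` instead of `dmtcp_checkpoint` when it is 1. Whether the restart script ends up in that directory or in the job's working directory would need checking against the installed dmtcp version.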
