Thanks all for your tips and suggestions. Here's what I did
and it seems to be working.
1. The SGE submission script is attached to this email
2. I followed James Rudd's tip on 'qconf -mconf' and set
execd_params to NOTIFY_KILL=INT
3. The attachment may also be found at
http://sgowtham.net/misc/gaussian_2009p_l82.txt
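For anyone else setting this up, step 2 amounts to something like the
following (a sketch only, assuming SGE manager rights; the submission
script name below is hypothetical, and my reading of the sge_conf(5)
semantics may need checking against your SGE version):

```shell
# Edit the global cluster configuration (opens in $EDITOR)
qconf -mconf
#
# In the editor, change the execd_params line so that the "notify"
# signal sent ahead of SIGKILL is SIGINT, which the Gaussian/Linda
# master traps to shut down its workers:
#
#   execd_params   NOTIFY_KILL=INT
#
# NOTIFY_KILL only takes effect for jobs submitted with -notify, e.g.:
#
#   qsub -notify g09_submit.sh
```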
The Gaussian 09 calculation runs fine, and the log file reports:
...
%NProcShared=4
Will use up to 4 processors via shared memory.
%LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
...
Is there a script or a tool with which I can confidently verify
that all of the listed compute nodes are actually being used by
this Gaussian 09 calculation?
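One crude check (a sketch only, untested; it assumes passwordless ssh
to the compute nodes, and the '.exel' process-name pattern is a guess
based on Linda link names such as l502.exel -- adjust it for the links
your job actually runs):

```shell
#! /bin/bash
# Crude check: pull the worker list out of the Gaussian log file and
# ask each node whether any Linda link executable is running.

# Extract the node list from the %LindaWorkers line of a log file
linda_nodes () {
  grep -m 1 '%LindaWorkers=' "$1" | cut -d '=' -f 2 | tr ',' ' '
}

# Log file name matches the submission script's output convention
log="Test_G09_Linda82_4.log"
if [ -r "$log" ]; then
  for node in $(linda_nodes "$log"); do
    if ssh "$node" "pgrep -f '\.exel'" > /dev/null 2>&1; then
      echo "$node: Linda worker running"
    else
      echo "$node: no Linda worker found"
    fi
  done
fi
```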
Thanks again for your time and help,
g
--
Gowtham
Advanced IT Research Support
Michigan Technological University
(906) 487-3593
On Thu, 4 Aug 2011, James Rudd wrote:
| I agree about it wasting resources. We found that, with big jobs, the
| l502.exel process could keep running for days on the slave nodes after
| the job had been terminated on the master.
|
| I contacted Gaussian Support but they pretty much said they don't run SGE
| so rely on info from users. Their first suggestion was to add -catch_rsh to
| the PE config file.
| Then they suggested looking at the signals being sent:
| >>>>>>>>>>>>>>>>
| I have not heard back from the other users I have discussed this with,
| but I did get an insight from some users on PBS systems. The qdel
| command with PBS actually sends out two signals: first SIGTERM, then,
| after a configurable delay, SIGKILL. When we set the delay between the
| two signals to about 120 seconds, this gave the master sufficient time
| to reliably advise the workers to exit.
| >>>>>>>>>>>>>>>>>>>
|
| After looking around, I found that SGE just sends a SIGKILL; this kills
| the master with no time to send the shutdown signal to the workers. My
| previous post is what I sent back to Gaussian to let them know how I had
| resolved the problem.
|
|
| Regards,
| James
|
| On Thu, Aug 4, 2011 at 1:10 PM, Sudarshan Wadkar <[email protected]> wrote:
|
| > Interesting approach, James.
| > I was faced with the same problem, and I did not think of using
| > Torque's similar feature (I am not sure if it is there in Torque).
| > I had to hack the way Gaussian is run: I kept a trail of running
| > Gaussian jobs and used post-job scripts to clean the nodes of stale
| > Gaussian jobs.
| > The problem was really annoying and was hogging up system resources.
| > I reported it to Gaussian Support, but they did not respond
| > (apart from a mail or two asking for debug output).
| >
| > -Sudarshan Wadkar
| >
| > On Thu, Aug 4, 2011 at 3:56 AM, James Rudd <[email protected]> wrote:
| > > We had problems with SGE not properly killing Linda jobs if they
| > > were canceled. Master would stop but slaves would keep on running.
| >
| > --
| > -Sudarshan Wadkar
| >
| > "Success is getting what you want. Happiness is wanting what you get."
| > - Dale Carnegie
| > "It's always our decision who we are"
| > - Robert Solomon in Waking Life
| > "The Truth is The Truth, so all you can do is live with it."
| > - $udhi :)
| >
#! /bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 4
#
# Set required variables [PATH, LD_LIBRARY_PATH, G09 stuff, etc.]
. /share/apps/bin/batch_env.sh
# Folder where the input files are located and where the
# calculation will be done
export INIT_DIR="/home/sgowtham/test_runs/G09"
# Name of the Gaussian 2009 input file
export INAME="Test_G09_Linda82"
# Prepare for running Gaussian 2009
export GAUSS_LFLAGS=' -vv -opt "Tsnet.Node.lindarsharg: ssh"'
LINDAWORKERS=$(grep -v "catch_rsh" ${PE_HOSTFILE} | awk -F '.' '{ print $1 }' \
               | tr '\n' ',' | sed 's/,$//')
# Prepend input deck with necessary information to run
# Gaussian 2009 with Linda 8.2
# Run Gaussian 2009
( echo "%NProcShared=${NSLOTS}"; echo "%LindaWorkers=${LINDAWORKERS}"; \
  cat ${INAME}.com ) | \
  /share/apps/g09/g09 > ${INAME}_${NSLOTS}.log
# Delete the core dumps, if any
/bin/rm -f ${INIT_DIR}/core*
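For reference, the LINDAWORKERS pipeline in the script above turns the
SGE-provided ${PE_HOSTFILE} into the comma-separated list Gaussian
expects in %LindaWorkers. Here is the same pipeline run against a
mocked-up host file (hostnames invented for illustration):

```shell
# Build a fake 2-node mpich $PE_HOSTFILE
# (format: hostname slots queue processor-range)
PE_HOSTFILE=$(mktemp)
cat > "${PE_HOSTFILE}" <<'EOF'
compute-1-21.local 1 all.q@compute-1-21.local UNDEFINED
compute-1-19.local 1 all.q@compute-1-19.local UNDEFINED
EOF

# Same pipeline as the submission script: skip catch_rsh lines, keep
# the short hostname (everything before the first dot), join with commas
LINDAWORKERS=$(grep -v "catch_rsh" "${PE_HOSTFILE}" | awk -F '.' '{ print $1 }' \
               | tr '\n' ',' | sed 's/,$//')
echo "${LINDAWORKERS}"   # -> compute-1-21,compute-1-19

rm -f "${PE_HOSTFILE}"
```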
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users