Hi,

On 04.08.2011 at 17:10, Gowtham wrote:

> Thanks all for your tips and suggestions. Here's what I did 
> and it seems to be working. 
> 
> 1. SGE submission script is attached with this email
> 
> 2. I followed James Rudd's tip on 'qconf -mconf' and set
> 
>    execd_params  NOTIFY_KILL=INT
> 
> 3. The attachment may also be found in
> 
>    http://sgowtham.net/misc/gaussian_2009p_l82.txt
> 
> 
> The Gaussian 09 calculation runs fine, the log file reports 
> that 
> 
> ...
> %NProcShared=4
> Will use up to    4 processors via shared memory.
> %LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
> ...
> 
> 
> Is there some way (a script or a tool) with which I can 
> confidently make sure that all the said compute nodes are 
> actually being used by this Gaussian 09 calculation?

You will have to go to each node and check whether the processes are there. As 
Gaussian runs some of its links serially and others in parallel, the 
processes may come and go on the slave nodes.
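
A minimal sketch of such a check. It assumes passwordless ssh to the compute 
nodes (swap in rsh if that is what your cluster uses) and that the Gaussian 
links show up under names like l502.exe or l502.exel. The function names 
linda_hosts and check_linda_hosts are made up for this example.

```shell
#!/bin/sh
# Sketch only: list Gaussian link processes on every Linda worker of a job.

# Print the hosts from the last %LindaWorkers line of a Gaussian log or
# input file, one host per line.
linda_hosts() {
    sed -n 's/^ *%[Ll]inda[Ww]orkers=//p' "$1" \
        | tail -n 1 | tr -d ' \r' | tr ',' '\n' | sed '/^$/d'
}

# For each worker host, show the current user's processes whose names look
# like Gaussian links (l502.exe, l502.exel, ...).
check_linda_hosts() {
    for host in $(linda_hosts "$1"); do
        echo "== $host"
        # ASSUMPTION: ssh works without a password prompt on this cluster.
        ssh -n "$host" ps -u "$USER" -o pid,pcpu,etime,comm 2>/dev/null \
            | grep -E 'l[0-9]+\.exel?' \
            || echo "   no Gaussian links found right now"
    done
}
```

Calling check_linda_hosts with the log file of the running job a few times 
over the course of the run should give a fair picture, since an empty result 
on a slave node may only mean that a serial link is active at that moment.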

But let me add some statements here:

1. to achieve a tight integration into SGE it's necessary to change the file 
linda8.2/opteron-linux/bin/linda_rsh near the end:

          *)  exec /usr/bin/rsh $host $user -n "$@"

to

          *)  exec rsh $host $user -n "$@"

This way SGE's rsh wrapper can catch the call, and you can use a PE modeled on 
the MPICH one, set up with -catch_rsh.
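
To confirm inside a job that the plain rsh really resolves to the wrapper, 
something like the following could go near the top of the job script. It is a 
sketch under the assumption that -catch_rsh places the wrapper as $TMPDIR/rsh 
and that $TMPDIR comes first in the job's PATH; the function name 
rsh_is_wrapped is made up.

```shell
#!/bin/sh
# Sketch only: return 0 if the first `rsh` found in PATH lives in the
# directory given as $1 (inside a job, pass "$TMPDIR").
rsh_is_wrapped() {
    found=$(command -v rsh) || return 1
    [ "$(dirname "$found")" = "$1" ]
}

# Example use inside a job script (left commented out here):
# rsh_is_wrapped "$TMPDIR" || echo "warning: rsh is not the SGE wrapper" >&2
```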

2. I patched the rsh wrapper to a) not echo the commands that start slave tasks 
into the user's output file and b) forward all variables to all nodes (-V):

if [ x$just_wrap = x ]; then 
   if [ $minus_n -eq 1 ]; then
#      echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
   else
#      echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
   fi
else

I create a dedicated PE for each parallel library.


3. The necessary PE would be:

$ qconf -sp linda
pe_name            linda
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/cluster/linda/startlinda.sh -catch_rsh $pe_hostfile
stop_proc_args     /usr/sge/cluster/linda/stoplinda.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE


4. While startlinda.sh is just a renamed startmpi.sh, stoplinda.sh includes 
commands to remove empty output and error files. You will have to adjust the 
test on the counter in case the listed machinefile has more than one line (my 
startlinda.sh already assembles the %lindaworkers line, so I get only 2 
lines therein: the echo of the start options and the %lindaworkers line).

#!/bin/sh
rm $TMPDIR/machines

rshcmd=rsh
case "$ARC" in
   hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
   *) ;;
esac
rm $TMPDIR/$rshcmd

if [ -r "$SGE_STDOUT_PATH" -a -f "$SGE_STDOUT_PATH" ] ; then
  # Read the line count from stdin so wc prints only the number
  counter=`wc -l < "$SGE_STDOUT_PATH"`
  [ "$counter" -eq 2 ] && rm -f "$SGE_STDOUT_PATH"
fi
[ -r "$SGE_STDERR_PATH" -a -f "$SGE_STDERR_PATH" ] && [ ! -s "$SGE_STDERR_PATH" ] && rm -f "$SGE_STDERR_PATH"

exit 0


5. Changing the input file to get the list of lindaworkers

As said, I prefer to change a copy of the input file, but it may be a matter 
of taste whether you prefer one job per directory or just arbitrary ones.
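
A rough sketch of how such a copy could be produced from the PE hostfile. This 
is not my startlinda.sh; the function names are made up for illustration. It 
relies on the first column of $PE_HOSTFILE being the host name and writes a 
plain comma-separated host list, as in the log excerpt quoted above.

```shell
#!/bin/sh
# Sketch only: build a %LindaWorkers line and put it into a copy of the input.

# Print "%LindaWorkers=host1,host2,..." from an SGE PE hostfile
# (first column = host name, each host listed once).
linda_workers_line() {
    awk 'NF && !seen[$1]++ { hosts = hosts (hosts == "" ? "" : ",") $1 }
         END { print "%LindaWorkers=" hosts }' "$1"
}

# usage: make_linda_input input.com hostfile output.com
# Writes a copy of the input with a fresh %LindaWorkers line on top and any
# old %LindaWorkers line dropped; the original input file is not touched.
make_linda_input() {
    {
        linda_workers_line "$2"
        grep -iv '^ *%lindaworkers=' "$1"
    } > "$3"
}
```

In a job script this might be called as 
make_linda_input job.com "$PE_HOSTFILE" "$TMPDIR/job.com" before Gaussian is 
started on the copy (job.com is a placeholder name).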

-- Reuti



> Thanks again for your time and help,
> g
> 
> --
> Gowtham
> Advanced IT Research Support
> Michigan Technological University
> 
> (906) 487/3593
> 
> 
> On Thu, 4 Aug 2011, James Rudd wrote:
> 
> | I agree about it wasting resources. We found that with big jobs the
> | l502.exel process could keep running for days on slave nodes after it had
> | been terminated on master.
> | 
> | I contacted Gaussian Support but they pretty much said they don't run SGE
> | so rely on info from users. Their first suggestion was to add -catch_rsh to
> | the PE config file.
> | Then they suggested looking at Sig signals sent:
> | >>>>>>>>>>>>>>>>
> |      I have not heard from other users who I
> | have discussed this with but I did get an insight from some users on
> | PBS systems.  The qdel command with PBS actually sends out two signals,
> | first is SIGTERM and then followed by a variable delay SIGKILL.  If
> | we set the delay between the two signals to about 120 seconds this gave
> | sufficient time for the master to reliably advise the workers to
> | exit.
> | >>>>>>>>>>>>>>>>>>>
> | 
> | After looking around I found SGE just sends a SIGKILL, this kills the master
> | with no time to send out the shutdown signal. My previous post is what I
> | sent back to Gaussian to let them know how I had resolved the problem.
> | 
> | 
> | Regards,
> |  James
> | 
> | On Thu, Aug 4, 2011 at 1:10 PM, Sudarshan Wadkar <[email protected]> wrote:
> | 
> | > interesting approach James,
> | > i was faced with same problem. I did not think of using torque's
> | > similar feature (I am not sure if its there in torque)
> | > I had to hack the way gaussian is run. I kept a trail of running
> | > gaussian jobs and used post job scripts to clean the node of stale
> | > gaussian jobs.
> | > It (the problem) was really annoying and hogging up the system resources a
> | > lot.
> | > I reported the problem to Gaussian Support, but they didn't respond
> | > (except a mail or two asking for debug outputs)
> | >
> | > -Sudarshan Wadkar
> | >
> | > On Thu, Aug 4, 2011 at 3:56 AM, James Rudd <[email protected]> wrote:
> | > > We had problems with SGE not properly killing Linda jobs if they
> | > > were canceled. Master would stop but slaves would keep on running.
> | >
> | > --
> | > -Sudarshan Wadkar
> | >
> | > "Success is getting what you want. Happiness is wanting what you get."
> | > - Dale Carnegie
> | > "It's always our decision who we are"
> | > - Robert Solomon in Waking Life
> | > "The Truth is The Truth, so all you can do is live with it."
> | > - $udhi :)
> | >
> | <gaussian_2009p_l82.txt>_______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

