Dr. Trigon wrote:

On 03.03.2012 22:46, Merlissimo wrote:
Hello toolserver users,

[...]

First, thanks a lot for your effort!! At first I thought: why make it even
more complicated (this increases the possibility of misconfiguration)? But after
setting up my cron(ie)tab I have to say it makes sense! It will be a good
thing, even though not a simple one.

--
Example:

This morning Dr. Trigon complained that his job "mainbot" did not run immediately and was
queued for a long time. I would guess he submitted his job from cron using "cronsub mainbot -l
/home/drtrigon/pywikipedia/mainbot.py"
This indicates that the job runs forever (longrun) with unknown memory usage,
so grid engine was only able to start this job on willow.
It is not possible to run infinite jobs on the webservers (only shorter jobs are
allowed there, so that most jobs have finished before high webserver usage is expected
during the evening). Nor was it possible to run it on the server running the mail
transfer agent, which has less than 500 MB of memory free but plenty of CPU power
(and the expected memory usage is unknown). Other servers like nightshade and yarrow
aren't currently available.
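(Side note: if the scheduler's job info is enabled, the reason a job is still waiting can
usually be inspected with "qstat -j mainbot", which lists messages about the hosts the job
cannot currently run on.)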

Thanks for taking me as an example - that helps a lot... ;))

The exact command was:
cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron
(very close... ;)

According to the last run of this job, it took about 2 hours and 30 minutes and
had a peak usage of 370 MB of memory. I got these values by asking grid engine for
usage statistics of the last ten days: "qacct -j mainbot -d 10".
To be safe that the job always gets enough resources I would suggest raising
the values to 4 hours and 500 MB of memory. It is not a problem if you request more
resources than really needed, but a job needing more resources than requested may
be killed. So the new submit command would be:

"qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB 
/home/drtrigon/pywikipedia/mainbot.py"
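For reference, the runtime and peak memory figures above can be read straight from the
accounting output; a minimal sketch, using the standard SGE accounting fields:

qacct -j mainbot -d 10
# relevant output fields:
#   ru_wallclock - elapsed runtime in seconds
#   maxvmem      - peak memory used by the run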

I now use:
qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot
$HOME/pywikipedia/bot_control.py -default -cron
(a little more time, plus '-m a -b y')

And here my key question arises; you mentioned 'qacct' to get more info
(thanks for this hint), and this is one of the biggest problems I had with
the whole SGE stuff: I was not able to find complete documentation, neither
on the toolserver nor anywhere else. At the moment, commands like 'qstat'
or 'qdel' are not covered on the toolserver anymore.
I (we) would like to know more about this great system.

E.g. what is the analogue to the old commands:

'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname]
[command]'

'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname]
[command]'

In both cases the old behavior was without -m a -b y, so

'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -N [jobname] [command]'

'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -N [jobname] [command]'

The -b y option is mostly useful for binaries, e.g. if you don't submit the
python script itself but call the python interpreter with the script as an argument.
It simply selects whether the submitted script file is copied to a local filesystem on the
execution server (which increases performance, rules out NFS errors, and was always the
default setting) or is executed directly from your home directory (if you use -b y). In
most cases this option isn't needed, and copying is best for most shell scripts.
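A minimal sketch of the two submission styles (job name, limits and script path are
illustrative):

# default (-b n): the script file is copied to the spool directory at submit time
qcronsub -l h_rt=1:00:00 -l virtual_free=100M -N myjob $HOME/pywikipedia/myjob.py

# with -b y: nothing is copied; interpreter and script are read from $HOME at run time
qcronsub -l h_rt=1:00:00 -l virtual_free=100M -b y -N myjob python $HOME/pywikipedia/myjob.py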

I added this to the interwiki bot example because:
1: if you submit a job interwiki.py, the file is copied to the SGE spool directory
and the job is queued
2: then you update your local svn copy
3: afterwards the job is started
Now the copied interwiki.py can be older than the rest of your pywikipedia files. So in this
case it's better to use the same version of all pywikipedia files directly from your home
directory. It's very unlikely that this problem really happens, but I wanted to write a
perfect example.


as far as I can see... (I do not remember what the '-s' was for...)
Is this correct?

The -s at cronsub is for merging the standard error stream into the standard output
stream. This is now -j y. I don't know why river used -s for this (perhaps for
stream?).
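So the full equivalent of the old "cronsub -sl mainbot ..." command above would presumably
be something like:

qcronsub -l h_rt=INFINITY -l virtual_free=100M -j y -N mainbot
$HOME/pywikipedia/bot_control.py -default -cron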
Some weeks ago I installed a script that removes empty log files for the standard
error/output streams after job execution. Many people used this option to keep their
home directory from filling up with empty error logs.
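As for the documentation question above: until the toolserver wiki catches up, the generic
grid engine man pages should apply, and qcronsub appears to pass these options straight
through to qsub. A minimal sketch of the everyday commands (job name illustrative):

qstat -u $USER          # list your queued and running jobs
qstat -j mainbot        # details on a job, including scheduling messages
qdel mainbot            # delete a job by name (or by job id)
qacct -j mainbot -d 10  # accounting data for finished runs
man qsub                # full reference for the -l/-N/-m/-b/-j options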

Thanks for your work Merlissimo
and greetings
DrTrigon

Merlissimo

_______________________________________________
Toolserver-l mailing list ([email protected])
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
