-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03.03.2012 22:46, Merlissimo wrote:
> Hello toolserver users,
> 
> as you may know, there have been some major problems with Sun Grid
> Engine (SGE) since November 2011. I asked DaB. to make me an SGE manager
> so that I could help solve these problems.
> Over the last few months I quietly reconfigured SGE in small steps, so
> that it could always be used as before and no downtime was needed. This
> took some time because I am only a volunteer and I had to change nearly
> everything. In addition, Nosy and DaB. changed some Solaris
> configurations that I proposed.
> 
> All scripts that used grid engine before can continue to run without
> changes. But you may be able to improve your script's performance by
> supplying additional information.
> 
> In the past you were asked to choose a suitable queue (all.q or
> longrun) for your job. Many people chose a queue that was not the best
> fit for their task, so I changed this procedure.
> 
> Now you have to specify all resources that your job needs during runtime
> when you submit the job. SGE will then choose the queue and host that
> best fit your requirements, so you no longer have to care about the
> different queues (you may have seen that there are many more queues than
> before).
> 
> All jobs must at least contain information about maximum runtime (h_rt)
> and peak memory usage (virtual_free). This information may become
> mandatory in the future. Currently only a warning message is shown.
> You also have to request other resources like SQL connections, free temp
> space, etc. if your job needs them. Please read the documentation on the
> toolserver wiki, which I updated today:
> <https://wiki.toolserver.org/view/Job_scheduling>
> It currently contains the main information you need to know, but I may
> add some more examples later.
> 
> I have also added a new script called "qcronsub". It is the
> replacement for "cronsub", which most of you used before. Unlike
> cronsub, it accepts the same arguments as grid engine's original "qsub"
> command, so it is now possible to add all resource values on the
> command line.
> 
> Please note that you should always use cronie at submit.toolserver.org
> for submitting jobs to SGE by cron. These cron tasks will always be
> executed even if one host (e.g. clematis or willow) is down. This has
> been the suggested usage for about 17 months. Many people have migrated
> their cron jobs from nightshade to willow during the last weeks, but
> they will have the same problem again if willow has to be shut down for
> a longer time (which hopefully never happens).
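
A small sketch of how the resource requests from the quoted announcement
(h_rt and virtual_free) could be kept inside the job script itself
instead of on the command line. Grid engine's qsub reads lines starting
with "#$" in a script as submit options. The job name, the values, and
the function job_main are made up for illustration, and it is only an
assumption that qcronsub passes such scripts on to qsub unchanged (the
directives are not read when the job is submitted with '-b y').

```shell
#!/bin/sh
# Hypothetical job script. Lines starting with "#$" are ordinary shell
# comments, but qsub parses them as submit options for script jobs.
#$ -N example_job
#$ -l h_rt=0:30:00
#$ -l virtual_free=100M

# Placeholder for the real work of the job (illustrative only).
job_main() {
    echo "example_job started"
}

job_main
```

Run directly, the script just prints its message; the "#$" lines only
take effect when the file is submitted to grid engine.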

First, thanks a lot for your effort!! At first I wondered why make it
even more complicated (this increases the chance of misconfiguration),
but after setting up my cron(ie)tab I have to say it makes sense! It will
be a good thing even though it is not simple.

> -- 
> Example:
> 
> This morning Dr. Trigon complained that his job "mainbot" did not run
> immediately and was queued for a long time. I would guess he submitted his
> job from cron using "cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py".
> This indicates that the job runs forever (longrun) with unknown memory usage,
> so grid engine was only able to start this job on willow.
> It is not possible to run infinite jobs on the webservers (only shorter jobs
> are allowed, so that most jobs have finished before the high webserver usage
> expected during the evening). Nor was it possible to run it on the server
> running the mail transfer agent, which has less than 500MB of memory free but
> plenty of CPU power (the expected memory usage is unknown). Other servers like
> nightshade and yarrow aren't currently available.

Thanks for taking me as an example - that helps a lot... ;))

The exact command was:
cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron
(very close... ;)

> According to the last run of this job, it takes about 2 hours and 30 minutes
> of runtime and had a peak usage of 370 MB of memory. I got these values by
> querying grid engine for the usage statistics of the last ten days: "qacct -j
> mainbot -d 10".
> To make sure that the job always gets enough resources, I would suggest
> raising the values to 4 hours and 500MB of memory. It is not a problem if you
> request more resources than really needed, but a job needing more resources
> than requested may be killed. So the new submit command would be:
> 
> "qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500M
> /home/drtrigon/pywikipedia/mainbot.py"
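
The step from measured usage to an h_rt request is simple arithmetic, so
here is a minimal sketch of it. The helper names to_hrt and max_wallclock
are made up for illustration. The sketch assumes that "qacct -j" prints
one "ru_wallclock <seconds>" line per recorded run, which may differ
between grid engine versions.

```shell
#!/bin/sh
# Convert a runtime in seconds to the H:MM:SS form used for "-l h_rt=...".
to_hrt() {
    secs=$1
    printf '%d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Read qacct-style output on stdin and print the largest ru_wallclock
# value in whole seconds (assumes lines of the form "ru_wallclock <n>").
max_wallclock() {
    awk '$1 == "ru_wallclock" { v = $2 + 0; if (v > max) max = v }
         END { print int(max) }'
}

# The 2 hours and 30 minutes mentioned above are 9000 seconds:
to_hrt 9000     # prints 2:30:00
# The suggested 4 hours of headroom:
to_hrt 14400    # prints 4:00:00
```

On a host with grid engine installed, something like
"qacct -j mainbot -d 10 | max_wallclock" should then give the longest
recorded run, to which some headroom can be added before passing it to
to_hrt.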

I now use:
qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot
$HOME/pywikipedia/bot_control.py -default -cron
(a bit more time, plus '-m a -b y')

And here my key question arises: you mentioned 'qacct' to get more info
(thanks for this hint), and this is one of the biggest problems I have had
with the whole SGE stuff: I was not able to find complete documentation,
neither on the toolserver nor elsewhere. At the moment, commands like
'qstat' or 'qdel' are no longer covered on the toolserver.
I (we) would like to know more about this great system.

E.g. what are the equivalents of the old commands?

'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname]
[command]'

'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname]
[command]'

as far as I can see... (I do not remember what the '-s' was for...)
Is this correct?
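
To make the question concrete, here is the mapping above written as a
small dry-run helper. It only prints the qcronsub command line and
submits nothing. The function name cronsub_to_qcronsub is made up, the
default values are just the ones guessed above and are not confirmed,
and the old '-s' option is not handled at all.

```shell
#!/bin/sh
# Print the qcronsub command line that would correspond to an old
# "cronsub [-l] jobname command..." call, following the unconfirmed
# mapping from this mail. Nothing is submitted.
cronsub_to_qcronsub() {
    runtime=06:00:00
    if [ "$1" = "-l" ]; then
        # old "longrun" jobs: assume no runtime limit
        runtime=INFINITY
        shift
    fi
    jobname=$1
    shift
    echo "qcronsub -l h_rt=$runtime -l virtual_free=100M -m a -b y -N $jobname $*"
}

cronsub_to_qcronsub mainbot /bin/true
cronsub_to_qcronsub -l mainbot /bin/true
```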

Thanks for your work Merlissimo
and greetings
DrTrigon
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9SpnkACgkQAXWvBxzBrDBfoACgjkU9Cq/BT7eRp5RokONOxb5K
GekAniUSTuTaTKufOyjD9+lGiqmRRVDw
=g9x4
-----END PGP SIGNATURE-----

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
