-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03.03.2012 22:46, Merlissimo wrote:
> Hello toolserver users,
>
> as you may know, there were some bigger problems related to Sun Grid
> Engine starting in November 2011. I asked DaB. whether I could become
> an SGE manager to help them solve these problems.
> During the last months I quietly reconfigured SGE in small steps, so
> that it could always be used as before and no downtime was needed.
> This took some time because I am only a volunteer and I had to change
> nearly everything. In addition, Nosy and DaB. changed some Solaris
> configuration settings that I proposed.
>
> All scripts that used grid engine before can continue to run without
> changes. But you may be able to improve your scripts' performance by
> adding additional information.
>
> In the past you were asked to choose a suitable queue (all.q or
> longrun) for your job. Many people chose a queue that was not the best
> fit for their task, so I changed this procedure.
>
> Now you have to specify all resources that your job needs at runtime
> when you submit the job. SGE will then choose the queue and host that
> best fit your requirements. So you don't have to care about different
> queues anymore (you may have seen that there are many more queues than
> before).
>
> All jobs should at least specify their maximum runtime (h_rt) and peak
> memory usage (virtual_free). This information may become mandatory in
> the future. Currently only a warning message is shown.
> You also have to request other resources like SQL connections, free
> temp space, etc. if your job needs them. Please read the documentation
> on the Toolserver wiki, which I updated today:
> <https://wiki.toolserver.org/view/Job_scheduling>
> This currently contains the main information you need to know, but I
> may add some more examples later.
>
> I have also added a new script called "qcronsub". This is the
> replacement for "cronsub", which most of you used before.
> Unlike cronsub, it accepts the same arguments as grid engine's
> original "qsub" command, so it is now possible to pass all resource
> values on the command line.
>
> Please note that you should always use cronie at submit.toolserver.org
> for submitting jobs to SGE by cron. These cron tasks will always be
> executed even if one host (e.g. clematis or willow) is down. This has
> been the recommended usage for about 17 months. Many people have
> migrated their cron jobs from nightshade to willow during the last
> weeks, but they will have the same problem again if willow has to be
> shut down for a longer time (which hopefully never happens).

First, thanks a lot for your effort! At first I wondered why make it even
more complicated (which increases the chance of misconfiguration), but
after setting up my cron(ie)tab I have to say it makes sense. It will be
a good thing, even though it is not simple.

> --
> Example:
>
> This morning Dr. Trigon complained that his job "mainbot" did not run
> immediately and was queued for a long time. I would guess he submitted
> his job from cron using
> "cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py"
> This indicates that the job runs forever (longrun) with unknown memory
> usage, so grid engine was only able to start this job on willow.
> It is not possible to run infinite jobs on the webservers (only shorter
> jobs are allowed, so that most jobs have finished before high webserver
> usage is expected during the evening). Nor was it possible to run it on
> the server running the mail transfer agent, which has less than 500 MB
> of memory free but a lot of CPU power (the expected memory usage is
> unknown). Other servers like nightshade and yarrow aren't currently
> available.

Thanks for taking me as an example, that helps a lot... ;)) The exact
command was:

cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron

(very close... ;)

> According to the last run of this job, it takes about 2 hours and 30
> minutes of runtime and had a peak usage of 370 MB of memory. I got
> these values by asking grid engine for the usage statistics of the last
> ten days: "qacct -j mainbot -d 10".
> To be sure that the job always gets enough resources, I would suggest
> raising the values to 4 hours and 500 MB of memory. It is not a problem
> if you request more resources than really needed, but a job needing
> more resources than requested may be killed.
> So the new submit command would be:
>
> "qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB
> /home/drtrigon/pywikipedia/mainbot.py"

I now use:

qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot $HOME/pywikipedia/bot_control.py -default -cron

(a little more time, plus '-m a -b y')

And here my key question arises. You mentioned 'qacct' for getting more
information (thanks for this hint), and this is one of the biggest
problems I have had with the whole SGE stuff: I was not able to find
complete documentation, neither on the toolserver nor elsewhere. At the
moment, the toolserver documentation no longer covers commands like
'qstat' or 'qdel'. I (we) would like to know more about this great
system. E.g. what are the equivalents of the old commands?

'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname] [command]'

'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname] [command]'

as far as I can see... (I do not remember what the '-s' was for...)

Is this correct?

Thanks for your work, Merlissimo, and greetings
DrTrigon
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9SpnkACgkQAXWvBxzBrDBfoACgjkU9Cq/BT7eRp5RokONOxb5K
GekAniUSTuTaTKufOyjD9+lGiqmRRVDw
=g9x4
-----END PGP SIGNATURE-----

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
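
The cronsub-to-qcronsub mapping asked about in this message can be
sketched as a small shell function. This is only an illustration, and it
assumes the guessed mapping is right, which this message does not
confirm. The function name cronsub_to_qcronsub is made up, the values
h_rt=06:00:00, h_rt=INFINITY and virtual_free=100M are simply the ones
from the question, and the old '-s' flag is not handled because its
meaning is not known here. The function only prints the command line; it
does not submit anything.

```shell
#!/bin/sh
# Hypothetical helper, not a toolserver tool: prints the qcronsub line that
# might replace an old cronsub call. It never submits a job.
# Assumed (unconfirmed) mapping, taken from the question in this message:
#   cronsub NAME CMD...     -> h_rt=06:00:00
#   cronsub -l NAME CMD...  -> h_rt=INFINITY
cronsub_to_qcronsub() {
    h_rt="06:00:00"              # runtime guessed for a normal job
    if [ "$1" = "-l" ]; then     # old "longrun" flag
        h_rt="INFINITY"
        shift
    fi
    name="$1"
    shift
    # virtual_free=100M and "-m a -b y" are copied from the message as-is.
    printf '%s\n' "qcronsub -l h_rt=$h_rt -l virtual_free=100M -m a -b y -N $name $*"
}

# Single quotes keep $HOME literal, as it would appear in a crontab entry.
cronsub_to_qcronsub mainbot '$HOME/pywikipedia/bot_control.py' -default -cron
cronsub_to_qcronsub -l mainbot '$HOME/pywikipedia/bot_control.py' -default -cron
```

In practice the runtime and memory values should come from the job's own
history (for example via "qacct -j mainbot -d 10", as quoted in this
message) rather than from these placeholder defaults.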