Re: [Labs-l] Cron job concurrency: consider adding `-once` to your cron tasks

2017-08-02 Thread Bryan Davis
On Wed, Aug 2, 2017 at 11:05 AM, Maximilian Doerr wrote:
> Which tools are the offending tools?

I'm not sure I would classify any of the tools with lots of parallel
jobs running as "offending". That word has some aggressive
connotations, at least in English. I'm also not sure that naming and
shaming anyone is useful. If you really want to know where I've been
intervening, you can grep for !log messages by me in the
#wikimedia-cloud freenode IRC channel logs for the last 24 hours or
so.

We do have a tool at http://tools.wmflabs.org/grid-jobs/ that shows
hourly-updated data and allows sorting and drilling down into per-tool
information. This tool and a graphite view of running jobs over time
are what led to deeper investigation.

Bryan
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Cloud Services  Boise, ID USA
irc: bd808  v:415.839.6885 x6855

___
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l


Re: [Labs-l] Cron job concurrency: consider adding `-once` to your cron tasks

2017-08-02 Thread Maximilian Doerr
Which tools are the offending tools?

Cyberpower678
English Wikipedia Account Creation Team
English Wikipedia Administrator
Global User Renamer

> On Aug 2, 2017, at 13:04, Bryan Davis wrote:
> 
> We saw a big spike of active Grid Engine jobs starting around
> 2017-08-01T00:00. I've been looking at the list of active jobs and
> noticed that several tools had a lot of copies of the same job
> running. There are tools that are designed to have several copies of
> the same job running working from a shared queue of some sort, but
> often this is a sign that something is wrong with the script.
> 
> Here's a fancy shell pipeline that will give you a list of all of your
> tool's running jobs grouped by job name and sorted by start time:
> 
>  qstat -xml |
>  tr '\n' ' ' |
>  sed 's#<job_list[^>]*>#\n#g' |
>  sed 's#<[^>]*>##g' |
>  grep " " |
>  column -t |
>  awk 'BEGIN { OFS="\t" } {print $1, $3, $6, $5}' |
>  sort -n -k 3 | sort -s -k 2
> 
> You can use this to see if you have parallel jobs running and if so
> when the "stuck" jobs started. It seems that there may have been some
> database related events happening between 2017-07-31T23:00 and
> 2017-08-01T06:00 that left a bunch of jobs stuck in a bad state
> internally.
> 
> To keep your cron scheduled jobs from running in parallel, you can add
> the `-once` flag to your crontab. Either `jsub -once ...` or `qcronsub
> ...` will do this for you. When the once flag is active, jsub and
> qcronsub will look for jobs that your tool is already running and if
> there is an active job with the same name then the new job will *not*
> be started and an error message will be logged. The name is either
> provided explicitly with `-N <name>` or automatically added based on the
> command if -N is not used.
> 
> (This should probably end up on wikitech in the help somewhere...)
> 
> Bryan
> -- 
> Bryan Davis  Wikimedia Foundation
> [[m:User:BDavis_(WMF)]] Manager, Cloud Services  Boise, ID USA
> irc: bd808  v:415.839.6885 x6855

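For anyone skimming the thread, the `-once` name check described in the quoted message can be sketched in plain shell. This is an illustration only: `already_running` and `submit_once` are hypothetical helpers, not part of `jsub` or `qcronsub`, and the list of running job names is passed in as an argument instead of being read from the grid, so the snippet runs anywhere.

```shell
# Illustration only: hypothetical helpers, not the real jsub implementation.

# Succeeds if the job name in $1 appears as a whole line on stdin.
already_running() {
    grep -Fxq -- "$1"
}

# $1 = job name, $2 = newline-separated names of jobs already running.
# Prints "submit: NAME" when no job of that name is active; otherwise
# logs an error to stderr and returns 1 without submitting anything.
submit_once() {
    if printf '%s\n' "$2" | already_running "$1"; then
        echo "error: a job named '$1' is already running" >&2
        return 1
    fi
    echo "submit: $1"
}
```

With the real tools none of this is needed; adding `-once` to the `jsub` line in the crontab (or using `qcronsub`) is what the quoted message recommends.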