Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-07-10 Thread Peter Cock
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker dannonba...@me.com wrote:
 I'll take care of it.  Thanks for reminding me about the TODO!


This seems to have reached galaxy-central now:
https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce

i.e. when Galaxy creates sub-jobs from tools using the parallelism
tag to split tasks over the cluster, if the user kills the parent job the
child jobs should get killed too.
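
In outline the new behaviour presumably amounts to something like the
rough sketch below - I haven't gone through the changeset line by line,
so the names here are made up for illustration, not the actual code:

    # Toy sketch of the idea, not Galaxy code: stopping a split job means
    # cancelling each child task through whatever runner is executing it,
    # then marking the parent job deleted as well.
    class FakeDrmaaRunner:
        # Stand-in for a cluster runner; the real one talks to DRMAA/SGE.
        def stop_job( self, external_id ):
            print( "cancelling external job %s" % external_id )

    def stop_split_job( parent_job, child_tasks, runner ):
        for task in child_tasks:
            if task[ "state" ] in ( "queued", "running" ):
                runner.stop_job( task[ "external_id" ] )
                task[ "state" ] = "deleted"
        parent_job[ "state" ] = "deleted"

    # e.g. stop_split_job( { "state": "running" },
    #                      [ { "external_id": "26504", "state": "running" } ],
    #                      FakeDrmaaRunner() )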

That will be appreciated next time our cluster is heavily loaded :)

Thanks,

Peter

Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-07-10 Thread Scott McManus

A suggested change will be coming down the pipe shortly, but it's good to
hear that it will be useful!

-Scott

- Original Message -
 On Tue, May 1, 2012 at 3:46 PM, Dannon Baker dannonba...@me.com
 wrote:
  I'll take care of it.  Thanks for reminding me about the TODO!
 
 
 This seems to have reached galaxy-central now:
 https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce
 
 i.e. when Galaxy creates sub-jobs from tools using the parallelism
 tag to split tasks over the cluster, if the user kills the parent job
 the child jobs should get killed too.
 
 That will be appreciated next time our cluster is heavily loaded :)
 
 Thanks,
 
 Peter

Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-05-31 Thread Peter Cock
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker dannonba...@me.com wrote:
On Tue, May 1, 2012 at 3:10 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On May 1, 2012, at 10:03 AM, Dannon Baker dannonba...@me.com wrote:
  On May 1, 2012, at 9:51 AM, Peter Cock wrote:
 
  I'm a little confused about tasks.py vs drmaa.py but that TODO
  comment looks pertinent. Is that the problem here?
 
  The runner in tasks.py is what executes the primary job, splitting
  and creating the tasks.  The tasks themselves are actually injected
  back into the regular job queue and run as normal jobs with the
  usual runners (in your case drmaa).
 
  And, yes, it should be fairly straightforward to add, but this just hasn't
  been implemented yet.
 
  -Dannon

 So the stop_job method for the runner in tasks.py needs to call the
 stop_job method of each of the child tasks it created for that job
 (which in this case are drmaa jobs - but could be PBS etc. jobs).
 I'm not really clear how all that works.

 Should I open an issue on this?

 Peter

 I'll take care of it.  Thanks for reminding me about the TODO!


Hi Dannon,

Is this any nearer the top of your TODO list? I was reminded by having
to log onto our cluster today and issue a bunch of SGE qdel commands
to manually kill a job which was hogging the queue, but had already
been deleted in Galaxy.
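
Until then, a throwaway helper along these lines saves a little typing
when clearing out the orphaned SGE jobs by hand (this is just a local
script, nothing to do with Galaxy - the job IDs have to be copied from
the qstat output first):

    # Quick cleanup script: run SGE's qdel for each orphaned job ID.
    import subprocess

    def qdel_all( job_ids ):
        for job_id in job_ids:
            # Same as typing "qdel <job_id>" at the shell prompt.
            subprocess.call( [ "qdel", str( job_id ) ] )

    # e.g. qdel_all( [ 26504, 26505, 26506 ] )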

Thanks,

Peter


Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-05-03 Thread Peter Cock
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker dannonba...@me.com wrote:
 I'll take care of it.  Thanks for reminding me about the TODO!


On a related point, I've noticed that sometimes one child job from a split
task can fail, yet the rest of the child jobs continue to run on the cluster,
wasting CPU time. As soon as one child job dies (assuming there are no plans
for attempting a retry), I would like the parent task to kill all the other
children and fail itself. I suppose you could merge the output of any
children which did finish... but it would be simpler not to bother.
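
Something like the following sketch is all I have in mind (again the names
are invented for illustration, this is not the real task runner code):

    # Sketch of the fail-fast behaviour described above.
    def check_split_job( parent_job, child_tasks, stop_task ):
        # If any child task has failed, cancel the children still queued or
        # running and fail the parent, rather than wasting more CPU time.
        if any( task[ "state" ] == "error" for task in child_tasks ):
            for task in child_tasks:
                if task[ "state" ] in ( "queued", "running" ):
                    stop_task( task )
                    task[ "state" ] = "deleted"
            parent_job[ "state" ] = "error"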

Regards,

Peter


Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-05-03 Thread Dannon Baker
 On a related point, I've noticed sometimes one child job from a split task
 can fail, yet the rest of the child jobs continue to run on the cluster 
 wasting
 CPU time. As soon as one child job dies (assuming there are no plans for
 attempting a retry), I would like the parent task to kill all the
 other children,
 and fail itself. I suppose you could merge the output of any children which
 did finish... but it would be simpler not to bother.

Right now, yes, this would make sense - I'll see about adding it.  Ultimately we
want to build in a mechanism for retrying child tasks that fail due to cluster
errors, etc., so it isn't necessary to rerun the entire job.

-Dannon

Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-05-03 Thread Peter Cock
On Thu, May 3, 2012 at 3:54 PM, Dannon Baker dannonba...@me.com wrote:
 On a related point, I've noticed sometimes one child job from a split task
 can fail, yet the rest of the child jobs continue to run on the cluster 
 wasting
 CPU time. As soon as one child job dies (assuming there are no plans for
 attempting a retry), I would like the parent task to kill all the other 
 children,
 and fail itself. I suppose you could merge the output of any children which
 did finish... but it would be simpler not to bother.

 Right now, yes, this would make sense- I'll see about adding it.

Great.

 Ultimately we want to build in a mechanism for retrying child tasks that
 fail due to cluster errors, etc, so it isn't necessary to rerun the entire 
 job.

That could be helpful - but also rather fiddly to detect when it is
appropriate to retry a job or not. For the split tasks, right now I'm finding
some child jobs fail when the OS kills them for running out of RAM -
in which case a neat idea would be to further sub-divide the jobs and
resubmit. This is probably over-engineering though... KISS principle.

Peter


[galaxy-dev] Galaxy not killing split cluster jobs

2012-05-01 Thread Peter Cock
Hi all,

We're running our Galaxy with an SGE cluster, using the DRMAA
support in Galaxy, and job splitting. I've noticed that if the user cancels
a job (that was running or queued on the cluster), then although the job
shows as deleted in Galaxy, looking at the queue on the cluster with
qstat shows that it persists.

I've not seen anything similar reported except for this PBS issue:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003633.html

When I don't use job splitting, cancelling jobs seems to work:

galaxy.jobs.handler DEBUG 2012-05-01 14:46:47,755 stopping job 57 in drmaa runner
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,756 (57/26504) Being killed...
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,757 (57/26504) Removed from DRM queue at user's request
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 (57/26504) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 Job output not returned from cluster

When I am using job splitting, cancelling jobs fails:

galaxy.jobs.handler DEBUG 2012-05-01 14:28:30,364 stopping job 56 in tasks runner
galaxy.jobs.runners.tasks WARNING 2012-05-01 14:28:30,386 stop_job(): 56: no PID in database for job, unable to stop

That warning comes from lib/galaxy/jobs/runners/tasks.py which starts:

    def stop_job( self, job ):
        # DBTODO Call stop on all of the tasks.
        # If our local job has JobExternalOutputMetadata associated, then
        # our primary job has to have already finished.
        if job.external_output_metadata:
            # Every JobExternalOutputMetadata has a pid set, we just need
            # to take it from one of them.
            pid = job.external_output_metadata[0].job_runner_external_pid
        else:
            pid = job.job_runner_external_id
        if pid in [ None, '' ]:
            log.warning( "stop_job(): %s: no PID in database for job, unable to stop" % job.id )
            return
        pid = int( pid )
        ...

I'm a little confused about tasks.py vs drmaa.py but that TODO
comment looks pertinent. Is that the problem here?
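
If so, presumably the TODO boils down to something along these lines - just
a sketch, with made-up helper names rather than the real model attributes or
runner API:

    # Sketch only: a split job has no single external PID, so stopping it
    # would mean finding its child tasks and delegating each one to the
    # runner (drmaa, pbs, ...) that is actually executing it.
    def stop_tasks_for_job( job, tasks_for_job, runner_for_task ):
        for task in tasks_for_job( job ):      # e.g. a database query on the job id
            runner = runner_for_task( task )   # e.g. the drmaa runner instance
            runner.stop_job( task )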

Regards,

Peter

Re: [galaxy-dev] Galaxy not killing split cluster jobs

2012-05-01 Thread Dannon Baker
I'll take care of it.  Thanks for reminding me about the TODO!



On May 1, 2012, at 10:03 AM, Dannon Baker dannonba...@me.com wrote:

 On May 1, 2012, at 9:51 AM, Peter Cock wrote:
 
 I'm a little confused about tasks.py vs drmaa.py but that TODO
 comment looks pertinent. Is that the problem here?
 
 The runner in tasks.py is what executes the primary job, splitting and 
 creating the tasks.  The tasks themselves are actually injected back into the 
 regular job queue and run as normal jobs with the usual runners (in your case 
 drmaa).
 
 And, yes, it should be fairly straightforward to add, but this just hasn't 
 been implemented yet.
 
 -Dannon
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/