Re: [galaxy-dev] Some workflows not scheduled until handler restart

2016-01-08 Thread Charles Girardot
Hi all, 

I work with Jelle and want to add to this issue. 

Help would be *greatly* appreciated as this is a *major* blocker on our 
production server right now.

In the database 'workflow_invocation' table, one can see a 'state' column with 
values like 'scheduled' or 'failed'. 

Before December 18, I only see the values 'scheduled' or 'failed'. 
After this date, a new state appeared: 'new'. It is always associated 
with handler 1 (we have 2 job handlers, i.e. '0' and '1'). 
As time goes on, we see a mix of 'new' and 'scheduled' states with more and 
more 'new', and from Jan 4 on it is only 'new' (only for handler '1').

It looks like workflows assigned to handler1 never reach the 'scheduled' 
state, so their jobs are never created.
I have 269 entries in the 'workflow_invocation' table with the 'new' state, and 
restarting the job handlers no longer has any effect (it used to work a few 
days ago).
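
For anyone wanting to reproduce the check described above, here is a minimal sketch of the kind of query involved. It runs against an in-memory SQLite table with made-up rows rather than a real Galaxy database. Only the 'workflow_invocation' table and its 'state' column are named in this thread; the `handler` and `create_time` column names, the handler ids and the sample data are assumptions for illustration.

```python
import sqlite3

# Hypothetical, minimal stand-in for Galaxy's workflow_invocation table.
# Only `state` is named in this thread; `handler` and `create_time` are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_invocation ("
    "id INTEGER PRIMARY KEY, create_time TEXT, state TEXT, handler TEXT)"
)
conn.executemany(
    "INSERT INTO workflow_invocation (create_time, state, handler) "
    "VALUES (?, ?, ?)",
    [
        ("2015-12-10", "scheduled", "handler0"),
        ("2015-12-10", "failed", "handler1"),
        ("2015-12-20", "scheduled", "handler0"),
        ("2015-12-20", "new", "handler1"),
        ("2016-01-05", "new", "handler1"),
    ],
)


def count_by_handler_and_state(connection):
    """Return {(handler, state): count} over all invocations."""
    rows = connection.execute(
        "SELECT handler, state, COUNT(*) FROM workflow_invocation "
        "GROUP BY handler, state ORDER BY handler, state"
    ).fetchall()
    return {(handler, state): n for handler, state, n in rows}


def stuck_invocations(connection, handler):
    """Return ids of invocations still in state 'new' for one handler."""
    rows = connection.execute(
        "SELECT id FROM workflow_invocation "
        "WHERE state = 'new' AND handler = ? ORDER BY id",
        (handler,),
    ).fetchall()
    return [row[0] for row in rows]


counts = count_by_handler_and_state(conn)
print(counts)
print(stuck_invocations(conn, "handler1"))
```

Against a production database the same two SELECT statements would be run through that database's own client, after checking the actual column names.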

How can I fix this?

Thanks for your help

Charles

> On 5 Jan 2016, at 11:29, Jelle Scholtalbers wrote:
> 
> Hi all,
> 
> On our installation (v15.07) we suddenly see that one of the two job handlers 
> gets stuck with a high CPU load (the last message is generally `cleaning up 
> external metadata files`) without new messages appearing. In addition, when 
> running workflows in batch (>6x), only a few of them (~3) get their workflow 
> steps/jobs scheduled (LSF-DRMAA). For the remaining 3, their new histories 
> are created but remain empty (according to the GUI). Only upon restart of the 
> two job handlers are the remaining workflow steps scheduled and shown in the 
> history.
> 
> First question: how do we resolve this issue?
> Second: how does this actually work? How are the workflow steps stored in the 
> database, i.e. why are they not shown in the web interface until they are 
> processed by a handler?
> 
> Possibly relevant config settings:
> [server:handler0]
> use_threadpool = true
> threadpool_workers = 5
> 
> [server:handler1]
> use_threadpool = true
> threadpool_workers = 5
> 
> [app:main]
> force_beta_workflow_scheduled_min_steps=1
> force_beta_workflow_scheduled_for_collections=True
> track_jobs_in_database = True
> enable_job_recovery = True
> retry_metadata_internally = False
> cache_user_job_count = True # only a limit set for the very few local tools like upload
> 
> Cheers,
> 
> Jelle
> ___
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>  https://lists.galaxyproject.org/
> 
> To search Galaxy mailing lists use the unified search at:
>  http://galaxyproject.org/search/mailinglists/


Re: [galaxy-dev] Some workflows not scheduled until handler restart

2016-01-08 Thread John Chilton
I've been swamped with release-related things, but my intention is to
dig deeply into this. This is a very serious bug. A workaround is
simply to disable beta workflow scheduling for now:

switch

force_beta_workflow_scheduled_min_steps=1
force_beta_workflow_scheduled_for_collections=True

to

force_beta_workflow_scheduled_min_steps=250
force_beta_workflow_scheduled_for_collections=False

Your old workflows wouldn't run, but new ones wouldn't have any
problems. This would, however, decrease the size of the workflows you
could easily run.
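
If it helps, the switch above can also be scripted. The following is only a sketch using Python's standard `configparser` on a small sample string; the `[app:main]` section and the two option names come from the config quoted in this thread, while the sample text and the `disable_beta_scheduling` helper are made up for illustration. Note that `configparser` drops comments when writing, so on a real config file it is probably safer to edit the two lines by hand.

```python
import configparser
import io

# Small made-up sample; a real Galaxy config has many more options.
SAMPLE_INI = """\
[app:main]
force_beta_workflow_scheduled_min_steps=1
force_beta_workflow_scheduled_for_collections=True
track_jobs_in_database = True
"""


def disable_beta_scheduling(ini_text):
    """Return ini_text with the two beta-scheduling options switched."""
    # interpolation=None so that any '%' in values is left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["app:main"]
    section["force_beta_workflow_scheduled_min_steps"] = "250"
    section["force_beta_workflow_scheduled_for_collections"] = "False"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = disable_beta_scheduling(SAMPLE_INI)
print(updated)
```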

The forthcoming 16.01 has a great number of enhancements to the beta
workflow scheduling - improved state tracking, improved logging of
problems, many optimizations to the scheduling process - and any of
these could help with the problem. usegalaxy.org will start running the
release_16.01 branch (which already exists) on Monday - it might be
worth upgrading to that shortly after.

My best guess is that some workflow that got scheduled is raising an
exception that causes workflow scheduling to stop. I think logging of
this might be absent prior to 16.01 due to a huge oversight on my
part. Could you restart whatever handler is running workflows and send
me the first 5 minutes' worth of Galaxy logs for that process? It might
help me figure out what is happening specifically.
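
To make that guess concrete, here is a purely illustrative sketch. It is not Galaxy's scheduler code, and every name in it is hypothetical. It shows how one invocation that raises during a scheduling pass can leave every later invocation in the 'new' state, and how catching and logging per invocation lets the pass continue.

```python
import logging

log = logging.getLogger("workflow_scheduler_sketch")


def schedule(invocation):
    """Hypothetical per-invocation scheduling step; raises on bad input."""
    if invocation.get("broken"):
        raise ValueError("cannot schedule invocation %s" % invocation["id"])
    invocation["state"] = "scheduled"


def fragile_pass(invocations):
    """One failing invocation aborts the pass; later ones stay 'new'."""
    for inv in invocations:
        schedule(inv)


def robust_pass(invocations):
    """Log and mark the failing invocation, then keep going."""
    for inv in invocations:
        try:
            schedule(inv)
        except Exception:
            log.exception("Failed to schedule invocation %s", inv["id"])
            inv["state"] = "failed"


def make_queue():
    """Three pending invocations, the second of which cannot be scheduled."""
    return [
        {"id": 1, "state": "new"},
        {"id": 2, "state": "new", "broken": True},
        {"id": 3, "state": "new"},
    ]


fragile = make_queue()
try:
    fragile_pass(fragile)
except ValueError:
    pass  # an uncaught exception here would end the polling loop
fragile_states = [inv["state"] for inv in fragile]

robust = make_queue()
robust_pass(robust)
robust_states = [inv["state"] for inv in robust]

print(fragile_states)
print(robust_states)
```

In the fragile version invocation 3 is never reached, which would match the growing pile of 'new' rows if this guess is right.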

