Re: [galaxy-dev] Job handler keeps crashing

2013-01-22 Thread Derrick Lin
Hey Ido,

I actually copied & pasted those committed changes into drmaa.py manually,
and that serves well as a temporary solution.

D



Re: [galaxy-dev] Job handler keeps crashing

2013-01-22 Thread Ido Tamir
Is it possible to backport this onto the latest distribution?
Yes, I'm lazy, but there are also others who will only update
within the next few weeks and would hit this problem without
being aware of the fix.

best,
ido



Re: [galaxy-dev] Job handler keeps crashing

2013-01-21 Thread Derrick Lin
Thanks Nate,

I tried that commit and it seems to fix the issue.

Thanks

Derrick



Re: [galaxy-dev] Job handler keeps crashing

2013-01-21 Thread Nate Coraor
Hi all,

The commit[1] that fixes this is not in the January 11 distribution.  It'll be 
part of the next distribution.

--nate

[1] 
https://bitbucket.org/galaxy/galaxy-central/commits/c015b82b3944f967e2c859d5552c00e3e38a2da0


Re: [galaxy-dev] Job handler keeps crashing

2013-01-21 Thread Anthonius deBoer
I have seen exactly this same issue. Python just dies without any errors in
the log. Using the latest galaxy-dist.

Sent from my iPhone


Re: [galaxy-dev] Job handler keeps crashing

2013-01-20 Thread Derrick Lin
Updating to the 11 Jan 2013 dist does not help with this issue. :(

I checked the database and had a look at the job entries that handler0
tried to stop before shutting down:

| 3088 | 2013-01-03 14:25:38 | 2013-01-03 14:27:05 | 531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_findPeaks/0.1.2 | 0.1.2 | deleted_new | Job output deleted by user before job completed. | NULL | NULL | NULL | NULL | NULL | NULL | 1659 | drmaa://-V -j n -R y -q intel.q/ | NULL | NULL | 76 | 0 | NULL | NULL | handler0 | NULL |
| 3091 | 2013-01-04 10:52:19 | 2013-01-07 09:14:34 | 531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_findPeaks/0.1.2 | 0.1.2 | deleted_new | Job output deleted by user before job completed. | NULL | NULL | NULL | NULL | NULL | NULL | 1659 | drmaa://-V -j n -R y -q intel.q/ | NULL | NULL | 76 | 0 | NULL | NULL | handler0 | NULL |
| 3093 | 2013-01-07 22:02:21 | 2013-01-07 22:16:27 | 531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_pos2bed/1.0.0 | 1.0.0 | deleted_new | Job output deleted by user before job completed. | NULL | NULL | NULL | NULL | NULL | NULL | 1749 | drmaa://-V -j n -R y -q intel.q/ | NULL | NULL | 76 | 0 | NULL | NULL | handler0 | NULL |

So basically the job table has several of these entries that are assigned to
handler0 and marked as "deleted_new". When handler0 comes up, it starts
stopping these jobs; after the first job has been "stopped", handler0 crashes
and dies, but that job is then marked as "deleted".

I think if I manually change the job state from "deleted_new" to "deleted"
in the db, handler0 will come up fine. I am just concerned about how these
jobs ended up in this state (assigned to a handler but marked as
"deleted_new").

Cheers,
D



Re: [galaxy-dev] Job handler keeps crashing

2013-01-20 Thread Derrick Lin
I had a close look at the code in

galaxy-dist / lib / galaxy / jobs / handler.py
galaxy-dist / lib / galaxy / jobs / runners / drmaa.py

and found that stopping "deleted" and "deleted_new" jobs seems to be a
normal routine for the job handler. I could not find any exception that
would have caused the shutdown.
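
Roughly, the shape of that stop routine as I read it, reduced to a
stand-in you can run on its own (this is not the actual Galaxy source;
names and structure are my simplification):

# Simplified stand-in for the handler's stop routine (NOT Galaxy source):
# pick up jobs in "deleted_new" assigned to this handler, mark each one
# "deleted", then ask the runner to stop it. A crash inside the stop call
# would leave the remaining jobs in "deleted_new", which would explain the
# job number advancing by one on every restart.

class Job(object):
    def __init__(self, id, state, handler):
        self.id, self.state, self.handler = id, state, handler

def stop_deleted_jobs(jobs, handler_id, stop_in_runner):
    for job in jobs:
        if job.handler == handler_id and job.state == "deleted_new":
            job.state = "deleted"   # persisted before the stop call
            stop_in_runner(job)     # e.g. hands off to the drmaa runner

jobs = [Job(3088, "deleted_new", "handler0"),
        Job(3091, "deleted_new", "handler0")]
stop_deleted_jobs(jobs, "handler0", lambda job: None)  # dummy runner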

I do notice that in galaxy-dist on bitbucket there is one commit with the
comment "Fix shutdown on python >= 2.6.2 by calling setDaemon when creating
threads (these are still...", which seems relevant?
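
For context, my understanding of what setDaemon buys you, as a generic
Python illustration only (not the Galaxy change itself):

import threading
import time

def monitor():
    # Stand-in for a runner monitor loop that never returns on its own.
    while True:
        time.sleep(1)

t = threading.Thread(target=monitor, name="monitor-thread")
t.setDaemon(True)  # without this, the non-daemon thread stuck in the
                   # loop keeps the interpreter from finishing shutdown
t.start()

time.sleep(2)  # pretend to do some work
print "main exiting; the daemon thread dies with it"  # Python 2 syntax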

I will update to the 11 Jan release and see if it fixes the issue.

D


On Fri, Jan 18, 2013 at 4:03 PM, Derrick Lin  wrote:

> Hi guys,
>
> We have updated our Galaxy to the 20 Dec 2012 release. Recently we found
> that some submitted jobs could not start (they stay gray forever).
>
> We found that it was caused by the job manager sending jobs to a handler
> (handler0) whose Python process had crashed and died.
>
> From the handler log we found the last messages right before the crash:
>
> galaxy.jobs.handler DEBUG 2013-01-18 15:00:34,481 Stopping job 3032:
> galaxy.jobs.handler DEBUG 2013-01-18 15:00:34,481 stopping job 3032 in
> drmaa runner
>
> We restarted Galaxy; handler0 stays up for a few seconds, then dies again
> with the same messages, except the job number moves on to the next one.
>
> We observed that the jobs it was trying to stop are all previous jobs
> whose status is either "deleted" or "deleted_new".
>
> We have never seen this in the past, so we are wondering if there is a bug
> in the new release.
>
> Cheers,
> Derrick
>