Re: [galaxy-dev] Jobs crash only in workflow context

2012-10-18 Thread Todd Oakley

Hi James,
We have made some progress in understanding the workflow-specific 
job crashes.


It seems that 'parallel' workflows are sending jobs simultaneously, and 
this is problematic for torque.


We get this error:
10/18/2012 10:06:18;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @



There is a thread here:
http://osdir.com/ml/galaxy-source-control/2011-08/msg00136.html

which is very similar to what we are experiencing.

In the post linked above, the author indicates he found a fix (pasted 
below). Would you recommend we make the same change?


Thanks!
Todd

"To deal with this I modified the lib/galaxy/jobs/runners/pbs.py script
to make multiple attempts at submitting in the following way:

@@ -286,6 +286,12 @@ class PBSJobRunner( BaseJobRunner ):
         log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
         log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
         job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+        ##Modified to give ten tries for qsubbing a job
+        num_try=0
+        while(not job_id and num_try<10):
+            job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+            num_try+=1
+
         pbs.pbs_disconnect(c)

         # check to see if it submitted
"
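
For anyone who wants to try the idea outside of Galaxy first, the retry logic in the quoted patch can be sketched as a small standalone helper. This is only a sketch under some assumptions: `submit_with_retries` and `make_flaky_submit` are hypothetical names, not part of Galaxy or pbs_python; the submit call is assumed to return a falsy value on failure (as the quoted patch implies); and the optional delay between attempts is an addition that the quoted patch does not have (it retries immediately). Note also that the quoted patch makes one initial attempt plus up to ten retries, whereas this sketch caps the total number of attempts.

```python
import time


def submit_with_retries(submit, max_tries=10, delay=0.0, backoff=2.0):
    """Call submit() until it returns a truthy job id or max_tries is reached.

    Returns (job_id, attempts); job_id is falsy if every attempt failed.
    The delay/backoff parameters are an assumption, not part of the quoted patch.
    """
    job_id = None
    attempts = 0
    wait = delay
    while not job_id and attempts < max_tries:
        # Sleep only between attempts, never before the first one.
        if attempts > 0 and wait > 0:
            time.sleep(wait)
            wait *= backoff
        job_id = submit()
        attempts += 1
    return job_id, attempts


def make_flaky_submit(failures, job_id="123.server"):
    """Hypothetical stand-in for pbs.pbs_submit: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def submit():
        state["calls"] += 1
        return None if state["calls"] <= failures else job_id

    return submit


if __name__ == "__main__":
    job_id, attempts = submit_with_retries(make_flaky_submit(2))
    print(job_id, attempts)  # prints: 123.server 3
```

Inside the runner, the callable would presumably wrap the existing call, e.g. `lambda: pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)`, with the result still checked afterwards since all attempts can fail.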


On 10/17/2012 9:40 AM, James Taylor wrote:

Todd, this is definitely unusual. Can you post (or send directly)
relevant sections from the Galaxy log?

-- jt


On Tue, Oct 16, 2012 at 8:15 PM, Todd Oakley wrote:

Hello,
 We just did a few tweaks to improve Galaxy performance, and a new issue
popped up that I would like advice on troubleshooting.

 When we run workflows, we see that tools later in the workflow run and
crash before the results they depend on have completed running.

 We can re-run the crashed jobs later and they work fine, suggesting that
they are only failing in the context of running workflows.

 I'd appreciate any advice on how to start troubleshooting this problem.

Thanks much!
Todd


--

***
Todd Oakley, Professor
Ecology Evolution and Marine Biology
University of California, Santa Barbara
Santa Barbara, CA 93106 USA
***

___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/



Re: [galaxy-dev] Jobs crash only in workflow context

2012-10-17 Thread James Taylor
Todd, this is definitely unusual. Can you post (or send directly)
relevant sections from the Galaxy log?

-- jt


On Tue, Oct 16, 2012 at 8:15 PM, Todd Oakley wrote:
> Hello,
> We just did a few tweaks to improve Galaxy performance, and a new issue
> popped up that I would like advice on troubleshooting.
>
> When we run workflows, we see that tools later in the workflow run and
> crash before the results they depend on have completed running.
>
> We can re-run the crashed jobs later and they work fine, suggesting that
> they are only failing in the context of running workflows.
>
> I'd appreciate any advice on how to start troubleshooting this problem.
>
> Thanks much!
> Todd