Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Peter Cock
On Tue, Mar 29, 2011 at 1:31 AM, Assaf Gordon gor...@cshl.edu wrote:
 Hello all,

 We're developing alternative bowtie tools that more closely suit our
 needs, and we're happy to share (and get comments).

 The main differences are:
 1. separate tools for paired-end and single-end

Sounds sensible to me.

 2. the tools accept FASTA, FASTQ in both Sanger and Illumina
 format (no more need for grooming). Illumina is the default for
 newly uploaded FASTQ files.

I think that's a bad idea - use Sanger FASTQ as the default to be
consistent with the rest of Galaxy. Also, with CASAVA 1.8, Illumina
machines will produce that too, see:
http://seqanswers.com/forums/showthread.php?t=8895

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Assaf Gordon
Hi Peter,

Peter Cock wrote, On 03/29/2011 05:39 AM:
 2. the tools accept FASTA, FASTQ in both Sanger and Illumina
 format (no more need for grooming). Illumina is the default for
 newly uploaded FASTQ files.
 
 I think that's a bad idea - use Sanger FASTQ as the default to be
 consistent with the rest of Galaxy. Also, with CASAVA 1.8, Illumina
 machines will produce that too, see:
 http://seqanswers.com/forums/showthread.php?t=8895
 

Thanks for the link - very interesting read, I wasn't aware of it.

However, for our local Galaxy server - I'm sticking with Illumina scale until I 
see real samples with phred-33 in the wild.

The defaults can be easily changed (in the XML file, simply assume a different 
scale when the extension is fastq),
or don't accept fastq at all and force the user to change the format to 
either fastqillumina or fastqsanger.

I'll explain my reasoning:
We (at our lab) deal mostly with Illumina FASTQ files, with the Illumina scale.
I'm trying to make life as easy as possible for our users.
When they upload a FASTQ file, it is by default an Illumina FASTQ file, and I 
want them to be able to use a workflow on it immediately.
All of our internal tools assume Illumina scale.

The one time I tried to make the built-in Bowtie tool available, I got 
complaints asking why a FASTQ file didn't appear in the input list - 
because it was fastq and not fastqsanger after grooming - this is a silly 
technical step that should not be a concern to users - so I'm taking it out of 
the equation here (not to mention that grooming two 14GB FASTQ files for every 
lane is a huge waste of space and time).

When CASAVA 1.8 is ready (that is - when it is actually running in our 
sequencing center), then we'll have to deal with it.
Ideally - galaxy will have some metadata code that will scan the first 
1,000,000 lines and heuristically detect which scale it is.
I'm not leaving this choice for the users, because they will make the wrong 
choice and then come crying back.
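
As an illustration of the kind of heuristic scan described above, here is a minimal Python sketch. It is not Galaxy code: the function name guess_fastq_variant is hypothetical, it assumes exactly four lines per read, and it only looks at the range of ASCII codes on the quality lines. A file whose lowest quality character happens to be high cannot be classified with certainty, so the result is a guess rather than a proof.

```python
import itertools


def guess_fastq_variant(path, max_lines=1000000):
    """Guess the quality encoding from the first max_lines lines of a FASTQ file.

    Hypothetical helper, assuming exactly four lines per read.
    Returns 'fastqsanger', 'fastqsolexa', 'fastqillumina' or 'unknown'.
    """
    lowest, highest = 255, 0
    with open(path, "rb") as handle:
        for index, line in enumerate(itertools.islice(handle, max_lines)):
            if index % 4 != 3:
                continue  # only every fourth line holds quality characters
            quals = line.rstrip(b"\r\n")
            if quals:
                lowest = min(lowest, min(quals))
                highest = max(highest, max(quals))
    if highest < lowest:
        return "unknown"  # no quality characters were seen
    if lowest < 59:
        return "fastqsanger"  # characters below ';' only occur with offset 33
    if lowest < 64:
        return "fastqsolexa"  # ';' to '?' are the negative Solexa scores
    return "fastqillumina"  # everything is '@' or above: most likely offset 64
```

The last two branches are the weak ones: a Sanger file containing only high-quality reads would be reported as Illumina, which is one reason to scan many lines instead of a handful.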

Just my two cents,
 -gordon


Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Daniel Blankenberg
Hi Assaf,

Just a quick note that the standard bowtie tool in Galaxy was enhanced in 
changeset 5157:7a9476924daf to work on 'fastqillumina' and 'fastqsolexa' 
variants in addition to the already possible 'fastqsanger'. In general, it is 
not a good idea to have a tool accept dataset.ext=='fastq' unless it doesn't 
care about quality scores or it determines the correct offset/scale itself or 
the variant type is declared by the user in the tool interface. 
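
One way to read the rule above: a wrapper picks the aligner's quality option from the declared variant and refuses a plain 'fastq' dataset. A minimal sketch follows. The mapping and the function name bowtie_quality_flag are hypothetical and are not taken from the Galaxy bowtie wrapper; the three flags are bowtie's own quality options.

```python
# Declared Galaxy fastq variant -> bowtie quality option.
BOWTIE_QUALITY_FLAGS = {
    "fastqsanger": "--phred33-quals",
    "fastqillumina": "--phred64-quals",
    "fastqsolexa": "--solexa-quals",
}


def bowtie_quality_flag(ext):
    """Return the bowtie quality option for a declared fastq variant.

    Plain 'fastq' is rejected on purpose: the offset/scale is unknown,
    so any choice made here would be a silent guess.
    """
    try:
        return BOWTIE_QUALITY_FLAGS[ext]
    except KeyError:
        raise ValueError("cannot choose a quality scale for datatype %r" % ext)
```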

When files are added to Galaxy, the datatype can be directly set to any of the 
fastq variants (e.g. fastqillumina), which removes the requirement of grooming 
(but should only be done when users know what they are doing).


 The one time I tried to make the built-in Bowtie tool available, I got 
 complaints asking why a FASTQ file didn't appear in the input list - 
 because it was fastq and not fastqsanger after grooming - this is a silly 
 technical step that should not be a concern to users - so I'm taking it out 
 of the equation here (not to mention that grooming two 14GB FASTQ files for 
 every lane is a huge waste of space and time).


It should not be possible to have a data.ext=='fastq' after Grooming (unless 
manually changed by a user), please report the steps that lead to this. 


Thanks,

Dan






Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Assaf Gordon
Hi Dan,

Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:
 When files are added to Galaxy, the datatype can be directly set to
 any of the fastq variants (e.g. fastqillumina), which removes the
 requirement of grooming (but should only be done when users know what
 they are doing).

I'm not using the get data tool, we have our own import tools (uploading huge 
files with HTTP is not stable enough for me).
You're right that I should change the format of this tool from 'fastq' to 
'fastqillumina' (but the tool pre-dated all those built-in formats in galaxy, 
so I never bothered to update it...).

 
 The one time I tried to make the built-in Bowtie tool available,
 I got complaints asking why a FASTQ file didn't appear in the input
 list - because it was fastq and not fastqsanger after grooming
 - this is a silly technical step that should not be a concern to
 users - so I'm taking it out of the equation here (not to mention
 that grooming two 14GB FASTQ files for every lane is a huge waste
 of space and time).

 It should not be possible to have a data.ext=='fastq' after Grooming
 (unless manually changed by a user), please report the steps that
 lead to this.

Sorry, I didn't explain myself correctly:
I forbid users from grooming anything (just joking, but I really really 
discourage its use) - so all the datasets are 'fastq' not 'fastqsanger'.
There is no bug - the groomer is simply not used.
As stated above, I should change the output format from 'fastq' to 
'fastqillumina' (but up until this recent changeset it wouldn't have made any 
difference, because the bowtie tool would not have accepted fastqillumina).


-gordon


Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Peter Cock
On Tue, Mar 29, 2011 at 4:46 PM, Assaf Gordon gor...@cshl.edu wrote:
 Hi Dan,

 Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:
 When files are added to Galaxy, the datatype can be directly set to
 any of the fastq variants (e.g. fastqillumina), which removes the
 requirement of grooming (but should only be done when users know what
 they are doing).

 I'm not using the get data tool, we have our own import tools (uploading
 huge files with HTTP is not stable enough for me). You're right that I should
 change the format of this tool from 'fastq' to 'fastqillumina' (but the tool
 pre-dated all those built-in formats in galaxy, so I never bothered to update
 it...).

Why not do the Illumina to Sanger conversion as part of your pipeline
that gets the data into Galaxy (and mark the files as fastqsanger)?
As Glen said, with a C tool that isn't really so slow. That future proofs
you for the pending Illumina CASAVA 1.8 release, and means you don't
need to maintain divergent Bowtie wrappers for Galaxy.
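
For reference, a rough sketch of what such an import-time conversion does, written in Python for readability (a compiled tool would be faster on large files). It assumes Illumina 1.3+ qualities (Phred+64) and exactly four lines per read; the function names are hypothetical and this is not the Galaxy Groomer.

```python
ILLUMINA_OFFSET = 64  # Illumina 1.3+ FASTQ (Phred+64)
SANGER_OFFSET = 33    # Sanger FASTQ (Phred+33)
SHIFT = ILLUMINA_OFFSET - SANGER_OFFSET  # 31


def illumina_to_sanger_quals(quals):
    """Shift one line of Phred+64 quality bytes down to Phred+33."""
    if quals and min(quals) < ILLUMINA_OFFSET:
        raise ValueError("quality below the Phred+64 range; wrong input scale?")
    return bytes(q - SHIFT for q in quals)


def convert_fastq(in_handle, out_handle):
    """Copy a FASTQ stream (binary mode), rewriting every fourth line."""
    for index, line in enumerate(in_handle):
        if index % 4 == 3:
            quals = line.rstrip(b"\r\n")
            out_handle.write(illumina_to_sanger_quals(quals) + b"\n")
        else:
            out_handle.write(line)
```

Both handles are expected to be opened in binary mode, for example open(src, "rb") and open(dst, "wb"). Solexa-scaled files (with characters below '@') are rejected rather than converted, since they need a different formula.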

Peter


Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Ryan Golhar


 Note about multithreaded bowtie:
 currently the tools use 10 threads (hard-coded in the XML files) - easily 
 changeable.



If possible, have the user indicate as a parameter how many threads they 
wish to use.
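
A sketch of how a wrapper script might turn a user-supplied thread count into bowtie's -p option instead of hard-coding it. The function name bowtie_command and the max_threads limit are hypothetical; only single-end input is shown, and other options are omitted.

```python
def bowtie_command(index_base, reads_path, threads=10, max_threads=16):
    """Build a single-end bowtie argument list with a validated thread count.

    max_threads is an assumed site-wide cap, so that one user cannot
    request more cores than a cluster node provides.
    """
    threads = int(threads)
    if not 1 <= threads <= max_threads:
        raise ValueError("threads must be between 1 and %d" % max_threads)
    return ["bowtie", "-p", str(threads), index_base, reads_path]
```

The default of 10 mirrors the value currently hard-coded in the tools; the returned list can be passed to subprocess without going through a shell.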



Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Daniel Blankenberg

The Grooming step is currently very time consuming and can be quite wasteful in 
disk space if the source and target fastq files are the same, but I have seen 
many occasions where Grooming has 'saved the day' by e.g. detecting truncated 
files that may have gone undetected by downstream tools or by indicating to the 
user that the variant they had selected as the source was incorrect.

However, I have been thinking about adding a 'check only' option to the Groomer 
that would use a naive parser (assume exactly 4 lines to a read, ascii scores, 
require input variant==output variant, etc.) and reuse the underlying original 
dataset file as the output (without writing over the file). This would be 
significantly faster and not waste disk space, but it would require 
enhancements to the framework. 
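
A minimal sketch of such a naive check, under the assumptions listed above (exactly four lines per read, ASCII scores, no conversion). It is not the Groomer's implementation: the function name check_fastq and the default score range are hypothetical, and the framework part (reusing the original dataset file as the output) is not shown. The function only reads the input and never writes.

```python
def check_fastq(handle, min_ord=33, max_ord=126):
    """Validate a FASTQ stream opened in binary mode; return the read count.

    Raises ValueError at the first malformed or truncated record.
    min_ord/max_ord bound the allowed quality characters, e.g. 33..126
    for fastqsanger or 64..126 for fastqillumina.
    """
    reads = 0
    while True:
        header = handle.readline()
        if not header:
            return reads  # clean end of file
        number = reads + 1
        seq = handle.readline().rstrip(b"\r\n")
        plus = handle.readline()
        qual_line = handle.readline()
        if not qual_line:
            raise ValueError("record %d is truncated" % number)
        qual = qual_line.rstrip(b"\r\n")
        if not header.startswith(b"@"):
            raise ValueError("record %d: header does not start with '@'" % number)
        if not plus.startswith(b"+"):
            raise ValueError("record %d: third line does not start with '+'" % number)
        if len(seq) != len(qual):
            raise ValueError("record %d: sequence/quality length mismatch" % number)
        if qual and (min(qual) < min_ord or max(qual) > max_ord):
            raise ValueError("record %d: quality outside declared range" % number)
        reads += 1
```

Passing min_ord=64 for a dataset declared as fastqillumina catches the common case of a Sanger file labelled with the wrong variant, as soon as a low-quality base appears.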


Thanks,

Dan





Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread Assaf Gordon
Dan and Peter,

Peter Cock wrote, On 03/29/2011 12:08 PM:
 Why not do the Illumina to Sanger conversion as part of your
 pipeline that gets the data into Galaxy (and mark the files as
 fastqsanger)? As Glen said, with a C tool that isn't really so slow.
 That future proofs you for the pending Illumina CASAVA 1.8 release,
 and means you don't need to maintain divergent Bowtie wrappers for
 Galaxy.

I refuse to groom on general principle.
The idea itself is unreasonable - all the tools support Illumina scale natively.
I'm not going to waste my disk space and users' time (and SGE time) by grooming.

When I see CASAVA 1.8 running, then I'll switch (as we are software people, 
we know that there's a gap between the planning document and the real 
software). Note that even in that CASAVA 1.8 document they mention that the 
export files will still be in Illumina format, so it won't be completely gone.

Daniel Blankenberg wrote, On 03/29/2011 12:41 PM:
 
 The Grooming step is currently very time consuming and can be quite 
 wasteful in disk space if the source and target fastq files are the 
 same.
It is wasteful in any case, not just if they are the same...

 but I have seen many occasions where Grooming has 'saved the 
 day' by e.g. detecting truncated files that may have gone undetected 
 by downstream tools or by indicating to the user that the variant 
 they had selected as the source was incorrect.

I would humbly guess that most of those truncated files are due to problematic 
HTTP uploads - so it saves the day from another problem, which should be 
avoided altogether.

 However, I have been thinking about adding a 'check only' option to 
 the Groomer that would use a naive parser (assume exactly 4 lines to 
 a read, ascii scores, require input variant==output variant, etc.) 
 and reuse the underlying original dataset file as the output
 (without writing over the file). This would be significantly faster
 and not waste disk space, but it would require enhancements to the
 framework.

I know you (the galaxy team) try very hard to have everything in native python 
(for easy deployment) but I still hold the opinion that these tools should not 
be done in python. No matter how much you minimize the processing, it will not 
be as efficient as a good compiled program. Python (or perl, I don't 
discriminate) can probably do this entire check only mode in just a few lines 
of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not 
practical.

Bottom line - I wouldn't use a python checker anyhow.

-gordon


Re: [galaxy-dev] Alternative bowtie tools

2011-03-29 Thread James Taylor
 I would humbly guess that most of those truncated files are due to
 problematic HTTP uploads - so it saves the day from another problem,
 which should be avoided altogether.


Maybe most, but definitely not all. We see all kinds of strange  
corruption.



 However, I have been thinking about adding a 'check only' option to
 the Groomer that would use a naive parser (assume exactly 4 lines to
 a read, ascii scores, require input variant==output variant, etc.)
 and reuse the underlying original dataset file as the output
 (without writing over the file). This would be significantly faster
 and not waste disk space, but it would require enhancements to the
 framework.


 I know you (the galaxy team) try very hard to have everything in
 native python (for easy deployment) but I still hold the opinion
 that these tools should not be done in python. No matter how much
 you minimize the processing, it will not be as efficient as a good
 compiled program. Python (or perl, I don't discriminate) can probably
 do this entire check only mode in just a few lines of regexes -
 but try it on twenty 14GB FASTQ files and you'll realize it's not
 practical.


 Bottom line - I wouldn't use a python checker anyhow.


We care more about easy deployment than language. If you have a nice C  
function that can do this, wrapping it in cython and packaging it is  
trivial and adds minimal overhead.


