Re: [galaxy-dev] Alternative bowtie tools
On Tue, Mar 29, 2011 at 1:31 AM, Assaf Gordon gor...@cshl.edu wrote:
> Hello all,
>
> We're developing alternative bowtie tools that more closely suit our needs, and we're happy to share (and get comments).
> The main differences are:
> 1. separate tools for paired-end and single-end

Sounds sensible to me.

> 2. the tools accept FASTA, and FASTQ in both Sanger and Illumina format (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.

I think that's a bad idea. Use Sanger FASTQ as the default to be consistent with the rest of Galaxy; also, with CASAVA 1.8, Illumina machines will produce that too. See: http://seqanswers.com/forums/showthread.php?t=8895

Peter
___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] Alternative bowtie tools
Hi Peter,

Peter Cock wrote, On 03/29/2011 05:39 AM:
>> 2. the tools accept FASTA, and FASTQ in both Sanger and Illumina format (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.
> I think that's a bad idea. Use Sanger FASTQ as the default to be consistent with the rest of Galaxy; also, with CASAVA 1.8, Illumina machines will produce that too. See: http://seqanswers.com/forums/showthread.php?t=8895

Thanks for the link - very interesting read, I wasn't aware of it.

However, for our local Galaxy server, I'm sticking with the Illumina scale until I see real samples with phred-33 in the wild. The defaults can be easily changed (in the XML file, simply assume a different scale when the extension is fastq), or the tool can refuse fastq altogether and force the user to change the format to either fastqillumina or fastqsanger.

I'll explain my reasoning: We (at our lab) deal mostly with Illumina FASTQ files, with the Illumina scale. I'm trying to make life as easy as possible for our users. When they upload a FASTQ file, it is by default an Illumina FASTQ file, and I want them to be able to use a workflow on it immediately. All of our internal tools assume the Illumina scale.

The one time I tried to make the built-in Bowtie tool available, I got complaints asking why a FASTQ file doesn't appear in the input list - because it was fastq and not fastqsanger after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

When CASAVA 1.8 is ready (that is, when it is actually running in our sequencing center), then we'll have to deal with it. Ideally, Galaxy will have some metadata code that will scan the first 1,000,000 lines and heuristically detect which scale it is. I'm not leaving this choice to the users, because they will make the wrong choice and then come crying back.
Just my two cents,
 -gordon
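To make the "scan the first 1,000,000 lines" idea concrete, here is a minimal Python sketch of such a heuristic. It is not Galaxy code: the function name `guess_fastq_offset` and the thresholds are assumptions for illustration. The reasoning is that characters below ASCII 59 cannot occur in Solexa/Illumina (offset 64) files, while characters above ASCII 74 are rare in Sanger (offset 33) files. The second rule is only a heuristic, and anything in between is reported as undecided.

```python
import itertools


def guess_fastq_offset(lines, max_lines=1_000_000):
    """Guess the quality offset (33 or 64) from the first max_lines lines.

    Hypothetical helper: assumes naive 4-line FASTQ records and returns
    None when the observed character range does not settle the question.
    """
    lo, hi = 255, 0
    for i, line in enumerate(itertools.islice(lines, max_lines)):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        qual = line.rstrip("\r\n")
        if not qual:
            continue
        codes = [ord(c) for c in qual]
        lo = min(lo, min(codes))
        hi = max(hi, max(codes))
    if lo > hi:  # no quality characters were seen at all
        return None
    if lo < 59:  # below ';' is impossible with a 64-based encoding
        return 33
    if hi > 74:  # above 'J' (Q41 at offset 33) is unlikely in Sanger files
        return 64
    return None  # ambiguous range: do not guess
```

Returning `None` for the ambiguous case is a deliberate choice in this sketch: a caller could then fall back to a site default or ask the user, rather than silently picking a scale.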
Re: [galaxy-dev] Alternative bowtie tools
Hi Assaf,

Just a quick note that the standard bowtie tool in Galaxy was enhanced in changeset 5157:7a9476924daf to work on the 'fastqillumina' and 'fastqsolexa' variants in addition to the already supported 'fastqsanger'.

In general, it is not a good idea to have a tool accept dataset.ext=='fastq' unless it doesn't care about quality scores, it determines the correct offset/scale itself, or the variant type is declared by the user in the tool interface.

When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).

On Mar 29, 2011, at 10:25 AM, Assaf Gordon wrote:
> The one time I tried to make the built-in Bowtie tool available, I got complaints asking why a FASTQ file doesn't appear in the input list - because it was fastq and not fastqsanger after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

It should not be possible to have a data.ext=='fastq' after Grooming (unless manually changed by a user); please report the steps that lead to this.

Thanks,

Dan
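As an illustration of why a wrapper needs a specific fastq variant rather than plain 'fastq', here is a minimal Python sketch that maps the Galaxy datatype extension to a bowtie (version 1) quality-encoding option. The `--phred33-quals`, `--phred64-quals` and `--solexa-quals` options are real bowtie options; the table and the function name `quality_flag` are hypothetical and are not taken from the Galaxy wrapper.

```python
# Hypothetical lookup table: Galaxy fastq datatype -> bowtie 1 quality option.
QUALITY_FLAGS = {
    "fastqsanger": "--phred33-quals",
    "fastqillumina": "--phred64-quals",
    "fastqsolexa": "--solexa-quals",
}


def quality_flag(ext):
    """Return the bowtie quality option for a dataset extension.

    Plain 'fastq' is rejected on purpose: the extension alone does not
    say which scale the quality scores use.
    """
    try:
        return QUALITY_FLAGS[ext]
    except KeyError:
        raise ValueError(
            f"cannot infer quality encoding from datatype {ext!r}; "
            "set a specific fastq variant first"
        ) from None
```

A wrapper built this way fails loudly on an undeclared variant instead of aligning with the wrong scale.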
Re: [galaxy-dev] Alternative bowtie tools
Hi Dan,

Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:
> When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).

I'm not using the get data tool; we have our own import tools (uploading huge files over HTTP is not stable enough for me). You're right that I should change the format of this tool from 'fastq' to 'fastqillumina' (but the tool pre-dated all those built-in formats in Galaxy, so I never bothered to update it...).

>> The one time I tried to make the built-in Bowtie tool available, I got complaints asking why a FASTQ file doesn't appear in the input list - because it was fastq and not fastqsanger after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).
> It should not be possible to have a data.ext=='fastq' after Grooming (unless manually changed by a user); please report the steps that lead to this.

Sorry, I didn't explain myself correctly: I forbid users from grooming anything (just joking, but I really really discourage its use) - so all the datasets are 'fastq' not 'fastqsanger'. There is no bug - the groomer is simply not used. As stated above, I should change the output format from 'fastq' to 'fastqillumina' (but up until this recent changeset it wouldn't have made any difference, because the bowtie tool would not have accepted fastqillumina).

-gordon
Re: [galaxy-dev] Alternative bowtie tools
On Tue, Mar 29, 2011 at 4:46 PM, Assaf Gordon gor...@cshl.edu wrote:
> Hi Dan,
>
> Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:
>> When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).
>
> I'm not using the get data tool; we have our own import tools (uploading huge files over HTTP is not stable enough for me). You're right that I should change the format of this tool from 'fastq' to 'fastqillumina' (but the tool pre-dated all those built-in formats in Galaxy, so I never bothered to update it...).

Why not do the Illumina to Sanger conversion as part of your pipeline that gets the data into Galaxy (and mark the files as fastqsanger)? As Glen said, with a C tool that isn't really so slow.

That future-proofs you for the pending Illumina CASAVA 1.8 release, and means you don't need to maintain divergent Bowtie wrappers for Galaxy.

Peter
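For reference, the Illumina to Sanger conversion suggested here amounts to shifting every quality character down by 31 (64 minus 33). Below is a minimal Python sketch of that step, assuming naive 4-line FASTQ records and the Illumina 1.3+ phred64 encoding (not the older Solexa scale, which needs a score remapping as well). The names `illumina_to_sanger` and `convert` are illustrative, and the thread itself suggests a compiled tool for production volumes.

```python
# Translation table: ASCII 64..126 (phred64) -> ASCII 33..95 (phred33).
_SHIFT = {code: code - 31 for code in range(64, 127)}


def illumina_to_sanger(qual):
    """Convert one phred64 quality string to phred33."""
    if any(not 64 <= ord(c) <= 126 for c in qual):
        raise ValueError(f"not a phred64 quality string: {qual!r}")
    return qual.translate(_SHIFT)


def convert(lines):
    """Yield FASTQ lines with every 4th line (the qualities) converted."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            line = illumina_to_sanger(line.rstrip("\r\n")) + "\n"
        yield line
```

Because `convert` is a generator, it can be placed in an import pipeline that streams a file once, so no second pass over the data is needed.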
Re: [galaxy-dev] Alternative bowtie tools
> Note about multithreaded bowtie: currently the tools use 10 threads (hard-coded in the XML files) - easily changeable.

If possible, have the user indicate as a parameter how many threads they wish to use.

attachment: golharam.vcf
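One way to read this suggestion: take the thread count as a user parameter but clamp it to what the host allows. Here is a minimal Python sketch that builds the corresponding bowtie argument. The `-p` option is bowtie's real thread-count option; the helper name `bowtie_thread_args` and the `hard_cap` parameter are assumptions for illustration.

```python
import os


def bowtie_thread_args(requested, hard_cap=None):
    """Return the bowtie '-p N' arguments for a user-requested thread count.

    Hypothetical helper: the request is clamped to the range 1..hard_cap,
    where hard_cap defaults to the number of CPUs on this machine.
    """
    if hard_cap is None:
        hard_cap = os.cpu_count() or 1
    threads = max(1, min(int(requested), hard_cap))
    return ["-p", str(threads)]
```

The clamp keeps a user from asking for more threads than the cluster node can provide, which matters when jobs are scheduled through a queue.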
Re: [galaxy-dev] Alternative bowtie tools
The Grooming step is currently very time-consuming and can be quite wasteful of disk space if the source and target fastq files are the same, but I have seen many occasions where Grooming has 'saved the day', e.g. by detecting truncated files that may have gone undetected by downstream tools, or by indicating to the user that the variant they had selected as the source was incorrect.

However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.

Thanks,

Dan

On Mar 29, 2011, at 11:46 AM, Assaf Gordon wrote:
> Sorry, I didn't explain myself correctly: I forbid users from grooming anything (just joking, but I really really discourage its use) - so all the datasets are 'fastq' not 'fastqsanger'. There is no bug - the groomer is simply not used.
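A rough Python sketch of what such a 'check only' pass could look like, under the stated assumptions (exactly 4 lines per read, ASCII scores, no conversion). This is not the Groomer's code: the function name `check_fastq` and the `min_code` parameter are hypothetical. It reads the input once, writes nothing, and raises on the first problem, including a truncated final record.

```python
def check_fastq(lines, min_code=33):
    """Validate naive 4-line FASTQ records; return the number of reads.

    min_code is the lowest ASCII code allowed in a quality string
    (33 for Sanger, 64 for Illumina 1.3+, 59 for Solexa).
    """
    reads = 0
    record = []
    for lineno, raw in enumerate(lines, 1):
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        start = lineno - 3  # line number of this record's header
        if not header.startswith("@"):
            raise ValueError(f"line {start}: header does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {start + 2}: separator does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"line {lineno}: sequence and quality lengths differ")
        if any(not min_code <= ord(c) <= 126 for c in qual):
            raise ValueError(f"line {lineno}: quality character out of range")
        reads += 1
        record = []
    if record:
        raise ValueError(f"truncated file: last record has {len(record)} of 4 lines")
    return reads
```

Passing the wrong `min_code` for a file (for example 64 on Sanger data) makes the range check fail early, which is one way a check-only mode could flag an incorrectly selected variant.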
Re: [galaxy-dev] Alternative bowtie tools
Dan and Peter,

Peter Cock wrote, On 03/29/2011 12:08 PM:
> Why not do the Illumina to Sanger conversion as part of your pipeline that gets the data into Galaxy (and mark the files as fastqsanger)? As Glen said, with a C tool that isn't really so slow.
>
> That future-proofs you for the pending Illumina CASAVA 1.8 release, and means you don't need to maintain divergent Bowtie wrappers for Galaxy.

I refuse to groom on general principle. The idea itself is unreasonable - all the tools support the Illumina scale natively. I'm not going to waste my disk space and users' time (and SGE time) by grooming. When I see CASAVA 1.8 running, then I'll switch (as we are software people, we know that there's a gap between the planning document and the real software). Note that even in that CASAVA 1.8 document they mention that the export files will still be in Illumina format, so it won't be completely gone.

Daniel Blankenberg wrote, On 03/29/2011 12:41 PM:
> The Grooming step is currently very time consuming and can be quite wasteful in disk space if the source and target fastq files are the same.

It is wasteful in any case, not just if they are the same...

> but I have seen many occasions where Grooming has 'saved the day' by e.g. detecting truncated files that may have gone undetected by downstream tools or by indicating to the user that the variant they had selected as the source was incorrect.

I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.

> However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.
I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be done in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire check-only mode in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical. Bottom line - I wouldn't use a Python checker anyhow.

-gordon
Re: [galaxy-dev] Alternative bowtie tools
> I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.

Maybe most, but definitely not all. We see all kinds of strange corruption.

>> However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.
>
> I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be done in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire check-only mode in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical. Bottom line - I wouldn't use a Python checker anyhow.

We care more about easy deployment than language. If you have a nice C function that can do this, wrapping it in Cython and packaging it is trivial and adds minimal overhead.