Re: [galaxy-dev] Tool shed and datatypes
Hello Jim,

Thanks for sending your converter. I've committed change set 6484:4fdceec512f5 to our central repository, and proprietary datatype converters and indexers are now handled properly. I've also added the following paragraph to the tool shed wiki. It doesn't apply to your mothur data types since you use only 1 converter, but you should be aware of this requirement for future tool development.

If you make the changes we discussed yesterday to your mothur tool suite (and add your missing converter), all data types and the converter should load properly when your repository is installed to a local Galaxy instance. Thanks very much for your help on this, and please let me know if you bump into any issues.

If you include datatype converters or indexers in your repository, all converter files (the disk file referred to by the value of the "file" attribute) must be located in the same directory in your repository hierarchy. The same requirement applies to indexers. If you include both converters and indexers in your repository, the relevant files may all be located within the same directory, or you could decide to keep all converters in one directory and all indexers in a different directory within your repository hierarchy. This is critical because the Galaxy components that load these proprietary items assume they are all located in the same directory.

On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:
> I'll also upload those to the toolshed soon.
>
> Big Question?
> When I started creating all those datatype classes for mothur, I just labeled
> the file_ext as mothur generated them for its output files.
> Will we have namespace issues?
> Should the file_ext fields include the toolshed name, e.g. should "otu"
> be named "mothur.otu" to avoid conflicts with other downloaded tools from the
> toolshed?
> Seems like this would be the time to establish rules/practices for such
> concerns.
> > JJ > > > > On 1/5/12 3:06 PM, Greg Von Kuster wrote: >> >> I will make sure that the converters are functional when installed, but I'm >> fairly sure it is currently not working. If you could pass your 2 files >> along to me, I'll make sure to fix whatever bugs may exist. >> >> On Jan 5, 2012, at 3:51 PM, Jim Johnson wrote: >> >>> >>> This was a converter that I used on my local installation, but forgot to >>> include for the ToolShed: >>> >>> >> type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true"> >>> >> target_datatype="seq.taxonomy"/> >>> >>> >>> $ find lib -name ref_to_seq_taxonomy_converter.xml >>> lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.xml >>> $ find lib -name ref_to_seq_taxonomy_converter.py >>> lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.py >>> >>> I'll add those 2 files to my repository along with the other changes you >>> specified. >>> Can converters as such be auto-installed as well? >>> >>> Thanks, >>> >>> JJ >>> >>> >>> >>> On 1/5/12 2:14 PM, Greg Von Kuster wrote: Hi Jim, Here are the changes you'll need to make to your mothur tool suite. CHANGE 1 Add the following datatypes.conf.xml file to your repository. 
>>> display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:OtuList" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Sabund" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Rabund" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:SharedRabund" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:RelAbund" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Names" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Design" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Summary" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Group" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:Oligos" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:SequenceAlignment" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:AccNos" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:SecondaryStructureMap" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:AlignCheck" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:AlignReport" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:LaneMask" display_in_upload="true"/> >>> type="galaxy.datatypes.metagenomics:DistanceMatrix" display_in_upload="true"/>
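The same-directory requirement for converter files, quoted from the wiki at the top of this message, can be checked before uploading a repository. What follows is only a rough sketch under stated assumptions, not the code Galaxy itself runs: it assumes each converter is declared as a `<converter file="..."/>` element inside the datatypes config, and the helper names (`converter_file_names`, `converter_directories`, `share_one_directory`) are made up for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET


def converter_file_names(datatypes_conf_xml):
    """Collect the "file" attribute of every <converter> element."""
    root = ET.fromstring(datatypes_conf_xml)
    return [c.get("file") for c in root.iter("converter") if c.get("file")]


def converter_directories(repo_files, converter_names):
    """Map each converter file name to the repository directories holding it.

    repo_files is a list of "/"-separated paths relative to the repository root.
    """
    located = {name: set() for name in converter_names}
    for path in repo_files:
        directory, base = posixpath.split(path)
        if base in located:
            located[base].add(directory)
    return located


def share_one_directory(located):
    """True when every converter was found and all of them sit in one directory."""
    if any(not dirs for dirs in located.values()):
        return False  # at least one declared converter file is missing
    return len(set().union(*located.values())) <= 1


if __name__ == "__main__":
    # Sample config built from the converter Jim mentions in this thread.
    conf = (
        '<datatypes><registration>'
        '<datatype extension="ref.taxonomy" '
        'type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true">'
        '<converter file="ref_to_seq_taxonomy_converter.xml" '
        'target_datatype="seq.taxonomy"/>'
        '</datatype></registration></datatypes>'
    )
    files = [
        "lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.xml",
        "lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.py",
    ]
    names = converter_file_names(conf)
    print(share_one_directory(converter_directories(files, names)))
```

The same check could be run a second time over indexer files, since the wiki paragraph allows converters and indexers to live in two different directories.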
Re: [galaxy-dev] Tool shed and datatypes
Of course, your approach of prepending the repository name would probably eliminate any future issue in this regard. Whatever you feel is best... ;)

On Jan 5, 2012, at 4:49 PM, Greg Von Kuster wrote:
> Yes, this is certainly important, but I think the hope is that proprietary
> data types will not become so prevalent that name-spacing the extensions is
> necessary.
>
> On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:
>
>> Big Question?
>> When I started creating all those datatype classes for mothur, I just
>> labeled the file_ext as mothur generated them for its output files.
>> Will we have namespace issues?
>> Should the file_ext fields include the toolshed name, e.g. should "otu"
>> be named "mothur.otu" to avoid conflicts with other downloaded tools from
>> the toolshed?
>> Seems like this would be the time to establish rules/practices for such
>> concerns.
>>
>> JJ
>
> ___
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
> http://lists.bx.psu.edu/

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu
Re: [galaxy-dev] Tool shed and datatypes
Yes, this is certainly important, but I think the hope is that proprietary data types will not become so prevalent that name-spacing the extensions is necessary.

On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:
>
> Big Question?
> When I started creating all those datatype classes for mothur, I just labeled
> the file_ext as mothur generated them for its output files.
> Will we have namespace issues?
> Should the file_ext fields include the toolshed name, e.g. should "otu"
> be named "mothur.otu" to avoid conflicts with other downloaded tools from the
> toolshed?
> Seems like this would be the time to establish rules/practices for such
> concerns.
>
> JJ
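To make the name-spacing question concrete, here is a small sketch of how clashing file_ext values could be spotted across installed repositories, and how the "mothur.otu" style prefix would remove them. It is purely illustrative: the repository name `other_suite` and both helper functions are hypothetical, and nothing here reflects how the tool shed actually resolves extensions.

```python
from collections import defaultdict


def find_extension_clashes(repositories):
    """Return {extension: sorted repository names} for extensions that more
    than one repository claims.

    repositories maps a repository name to the file_ext values it defines.
    """
    owners = defaultdict(set)
    for repo, extensions in repositories.items():
        for ext in extensions:
            owners[ext].add(repo)
    return {ext: sorted(repos) for ext, repos in owners.items() if len(repos) > 1}


def namespaced(repo, ext):
    """Prefix an extension with its repository name, e.g. "mothur.otu"."""
    prefix = repo + "."
    return ext if ext.startswith(prefix) else prefix + ext


if __name__ == "__main__":
    # "other_suite" is a made-up second repository that also defines "otu".
    repos = {"mothur": ["otu", "names"], "other_suite": ["otu"]}
    print(find_extension_clashes(repos))
    fixed = {r: [namespaced(r, e) for e in exts] for r, exts in repos.items()}
    print(find_extension_clashes(fixed))
```

The trade-off discussed above still applies: prefixing avoids clashes, but it also stops two suites from sharing a format on purpose.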
Re: [galaxy-dev] Tool shed and datatypes
Of course, this assumes that there is not more than one datatype class module with the same name in your repository. That would definitely pose problems, so take care to avoid it.

On Jan 5, 2012, at 3:29 PM, Greg Von Kuster wrote:
> However, your datatype class module files will be found no matter where they
> are located within your repository hierarchy.
>
> On Jan 5, 2012, at 3:25 PM, Jim Johnson wrote:
>
>> Greg,
>>
>> I have been putting datatype def files in relative path:
>> lib/galaxy/datatypes/
>> This was just to make it more clear for someone manually modifying their own
>> galaxy installation.
>> Is there any preferred best practice for where a datatypes implementation
>> file should be?
>>
>> Thanks,
>>
>> JJ
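Since class modules are found wherever they sit in the repository, two modules with the same file name are the case to guard against. A minimal sketch of such a check follows; the function name and the sample paths are invented for illustration, and it only looks at file names, not at how Galaxy actually imports the modules.

```python
import posixpath
from collections import defaultdict


def duplicate_module_names(repo_files):
    """Return {module name: sorted paths} for .py files that share a base name.

    repo_files is a list of "/"-separated paths relative to the repository root.
    """
    by_name = defaultdict(list)
    for path in repo_files:
        base = posixpath.basename(path)
        if base.endswith(".py") and base != "__init__.py":
            by_name[base[:-3]].append(path)
    return {name: sorted(paths) for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    # The second metagenomics.py is a hypothetical stray copy.
    files = [
        "lib/galaxy/datatypes/metagenomics.py",
        "extras/metagenomics.py",
        "lib/galaxy/datatypes/gmap.py",
        "README",
    ]
    print(duplicate_module_names(files))
```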
Re: [galaxy-dev] Tool shed and datatypes
Your approach is great since it models the Galaxy distribution and, as you say, makes it clear to those downloading your repository. However, your datatype class module files will be found no matter where they are located within your repository hierarchy.

On Jan 5, 2012, at 3:25 PM, Jim Johnson wrote:
> Greg,
>
> I have been putting datatype def files in relative path:
> lib/galaxy/datatypes/
> This was just to make it more clear for someone manually modifying their own
> galaxy installation.
> Is there any preferred best practice for where a datatypes implementation
> file should be?
>
> Thanks,
>
> JJ
>
> On 1/5/12 1:38 PM, Greg Von Kuster wrote:
>>
>> Hello Jim,
>>
>> I've implemented support for proprietary datatypes that use class modules
>> included in tool shed repositories. To see how this works, you'll need at
>> least change set revision 6479:4d131422777f, which is currently available
>> only from our central repo at https://bitbucket.org/galaxy/galaxy-central.
>>
>> I've documented the way this works in the following 2 sections of the tool
>> shed wiki. In the second section, I've taken the liberty of using your gmap
>> tool repository as an example. I hope you don't mind. I've written the
>> document section assuming that your gmap repository includes the 2 changes
>> I've described below.
>>
>> http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_subclass_from_Galaxy_data_types_in_the_distribution
>> http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_use_class_modules_included_in_your_repository
>>
>> There are 2 categories of datatypes that are currently supported:
>>
>> 1. data types that subclass from the datatype classes included in the Galaxy
>> distribution - these require no code files that define proprietary datatype
>> classes to be included in the tool shed repository, and are documented in
>> the first wiki section listed above.
>>
>> 2. datatypes that use proprietary classes defined in code files included in
>> the tool shed repository - documented in the second wiki section listed
>> above. Your gmap tool suite falls into this category.
>>
>> If you make the following changes to your gmap tool suite, your proprietary
>> data types will automatically load into a local Galaxy instance when the
>> Galaxy admin installs your tool suite to that instance. The data types will
>> be loaded at the time of installation as well as whenever the Galaxy server
>> is stopped / restarted. I'll send you a separate message detailing the
>> changes you'll need to make to your mothur tool suite.
>>
>> CHANGE 1
>>
>> Add a file named datatypes_conf.xml to your repository. This is the
>> approach I'm using to support proprietary datatypes included in tool shed
>> repositories instead of your proposed addition of datatypes in the tool
>> config's tag set. The datatypes_conf.xml file can be located
>> anywhere in the repository, but the obvious location for your gmap
>> repository is your ~/tool-data directory.
>>
>> This file should contain the following datatype definitions.
>> >> >> >> >> >> >> >> > display_in_upload="False"/> >> > type="galaxy.datatypes.gmap:GmapSnpIndex" display_in_upload="False"/> >> > type="galaxy.datatypes.gmap:IntervalIndexTree" display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:SpliceSitesIntervalIndexTree" >> display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:IntronsIntervalIndexTree" >> display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:SNPsIntervalIndexTree" display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:IntervalAnnotation" display_in_upload="False"/> >> > type="galaxy.datatypes.gmap:SpliceSiteAnnotation" display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:IntronAnnotation" display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:SNPAnnotation" display_in_upload="True"/> >> >> >> >> >> >> >> >> >> >> I noticed that your README in your current gmap repository on the main >> Galaxy tool shed includes the following datatype definitions, but they refer >> to classes that are not included in your repository so I've eliminated them >> from the above datatypes_conf.xml file. You may need to add the classes to >> your current gmap.py datatypes class file and add them to the above >> datatypes_conf.xml file if your tools actually require them. >> >> > type="galaxy.datatypes.gmap:TallyIntervalIndexTree" >> display_in_upload="True"/> >> > type="galaxy.datatypes.gmap:TallyAnnotation" display_in_upload="True"/> >> > display_in_upload="True"/> >> >> >> CHANGE 2 >> >> Modules that include proprietary datatype class definitions cannot use >> relative import references for imported modules. Imports must be defined as >> a
Re: [galaxy-dev] Tool shed and datatypes
Hi Jim,

Here are the changes you'll need to make to your mothur tool suite.

CHANGE 1

Add the following datatypes_conf.xml file to your repository.

I'm probably not correctly handling the converter for your ref.taxonomy data type - I've not been able to find the ref_to_seq_taxonomy_converter.xml file. Can you pass it along to me so I can see if I have some debugging to do?

Also, I've eliminated the following entry from your README in the above file because the Newick class is not included in your metagenomics.py class module. It seems you may have included the Newick class in your local copy of ~/lib/galaxy/datatypes/data.py. If your tools use this class, it should be added to either your metagenomics.py class file or another class file in your repository, and the value of the "type" attribute in the following should be changed accordingly.

CHANGE 2

The following relative imports in your metagenomics.py class module:

    import data
    from sniff import *

need to look like this:

    from galaxy.datatypes import data
    from galaxy.datatypes.sniff import *

CHANGE 3

You can optionally choose to remove your suite_config.xml file from your repository as it is no longer used in any way.

Thanks!

Greg Von Kuster

On Oct 18, 2011, at 11:03 AM, Jim Johnson wrote:
> Greg,
>
> The mothur_toolsuite in the ToolShed contains a file with added datatypes
> for metagenomics (used by mothur and some by qiime):
> mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py
> The README has info on how I incorporated mothur into our local galaxy server.
>
> I'm also working on GMAP/GSNAP ( http://research-pub.gene.com/gmap/ )
> So far I've created a GmapDB class, analogous to the ngsindex.BowtieIndex
> class, but with more metadata.
> I'm also adding an IntervalIndexTree class for indexing maps of splice
> junctions, introns, and SNPs.
> I'll send you this as soon as I've got it working.
>
> Thanks,
>
> JJ

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu
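The import rewrite asked for in CHANGE 2 can be spotted mechanically. The sketch below is an illustration only, not part of Galaxy: it parses a module's source with the standard ast module and suggests the absolute form used in this message. The `SIBLINGS` set is an assumption, limited to the galaxy.datatypes modules named in this thread (data, sniff, metadata), and the function name is made up.

```python
import ast

# Assumption: only the galaxy.datatypes modules mentioned in this thread.
SIBLINGS = frozenset(["data", "sniff", "metadata"])


def _with_alias(alias):
    """Render "name" or "name as other" for one imported name."""
    return alias.name + (" as " + alias.asname if alias.asname else "")


def implicit_relative_imports(source, siblings=SIBLINGS):
    """Return sorted (line number, suggested absolute import) pairs for
    imports that name a sibling module without its package."""
    suggestions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in siblings:
                    suggestions.append(
                        (node.lineno, "from galaxy.datatypes import " + _with_alias(alias))
                    )
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            if node.module in siblings:
                names = ", ".join(_with_alias(a) for a in node.names)
                suggestions.append(
                    (node.lineno, "from galaxy.datatypes.%s import %s" % (node.module, names))
                )
    return sorted(suggestions)


if __name__ == "__main__":
    old = "import data\nfrom sniff import *\n"
    for lineno, fix in implicit_relative_imports(old):
        print(lineno, fix)
```

Plain `import data` is rewritten as `from galaxy.datatypes import data` so that the name `data` stays bound and the rest of the module needs no changes.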
Re: [galaxy-dev] Tool shed and datatypes
Hello Jim,

I've implemented support for proprietary datatypes that use class modules included in tool shed repositories. To see how this works, you'll need at least change set revision 6479:4d131422777f, which is currently available only from our central repo at https://bitbucket.org/galaxy/galaxy-central.

I've documented the way this works in the following 2 sections of the tool shed wiki. In the second section, I've taken the liberty of using your gmap tool repository as an example. I hope you don't mind. I've written the document section assuming that your gmap repository includes the 2 changes I've described below.

http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_subclass_from_Galaxy_data_types_in_the_distribution
http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_use_class_modules_included_in_your_repository

There are 2 categories of datatypes that are currently supported:

1. data types that subclass from the datatype classes included in the Galaxy distribution - these require no code files that define proprietary datatype classes to be included in the tool shed repository, and are documented in the first wiki section listed above.

2. datatypes that use proprietary classes defined in code files included in the tool shed repository - documented in the second wiki section listed above. Your gmap tool suite falls into this category.

If you make the following changes to your gmap tool suite, your proprietary data types will automatically load into a local Galaxy instance when the Galaxy admin installs your tool suite to that instance. The data types will be loaded at the time of installation as well as whenever the Galaxy server is stopped / restarted. I'll send you a separate message detailing the changes you'll need to make to your mothur tool suite.

CHANGE 1

Add a file named datatypes_conf.xml to your repository. This is the approach I'm using to support proprietary datatypes included in tool shed repositories instead of your proposed addition of datatypes in the tool config's tag set. The datatypes_conf.xml file can be located anywhere in the repository, but the obvious location for your gmap repository is your ~/tool-data directory.

This file should contain the following datatype definitions.

I noticed that your README in your current gmap repository on the main Galaxy tool shed includes the following datatype definitions, but they refer to classes that are not included in your repository so I've eliminated them from the above datatypes_conf.xml file. You may need to add the classes to your current gmap.py datatypes class file and add them to the above datatypes_conf.xml file if your tools actually require them.

CHANGE 2

Modules that include proprietary datatype class definitions cannot use relative import references for imported modules. Imports must be defined as absolute from the galaxy subdirectory inside the Galaxy root's lib subdirectory. So for your ~/lib/galaxy/datatypes/gmap.py datatypes module in your gmap repository, the following changes are necessary.

Your current imports look like this:

    import logging
    import os,os.path,re
    import data
    from data import Text
    from galaxy import util
    from metadata import MetadataElement

But they need to be changed to this - note the elimination of relative imports:

    import logging
    import os,os.path,re
    import galaxy.datatypes.data
    from galaxy.datatypes.data import Text
    from galaxy import util
    from galaxy.datatypes.metadata import MetadataElement

Thanks very much for helping out with this, and please let me know if you bump into any problems.

Greg Von Kuster

On Oct 21, 2011, at 1:13 PM, Jim Johnson wrote:
> Greg,
>
> I put the gmap tool suite in the galaxy Tool Shed, let me know if there is
> more I should do.
> > It has 5 galaxy tools: > GMAP - Genomic Mapping and Alignment Program for mRNA and EST > sequences > GSNAP- Genomic Short-read Nucleotide Alignment Program > GMAP Build- a database genome index for GMAP and GSNAP ( calls: > gmap_build, iit_store, snpindex, cmetindex, atoiindex ) > GMAP SNP Index- build index files for known SNPs > (calls: iit_store, snpindex) > GMAP IIT- Create a map store for known genes or SNPs > (calls: iit_store) > > It uses these added datatypes: > % grep -E '(^class | file_ext)' lib/galaxy/datatypes/gmap.py > class GmapDB( Text ): > file_ext = 'gmapdb' > class GmapSnpIndex( Text ): > file_ext = 'gmapsnpindex' > class IntervalIndexTree( Text ): > file_ext = 'iit' > class SpliceSitesIntervalIndexTree( IntervalIndexTree ): > file_ext = 'splicesites.iit' > class IntronsIntervalIndexTree( Int
Re: [galaxy-dev] Tool shed and datatypes
Ahh - sorry. I finally found the format specification for BGZF in the SAM format specification, and it seems that it is 100% GZIP-compatible. There is still the issue of needing an external file index, since all BGZF seems to give you is the size of the compressed block, not anything format-specific, like the number of sequences in the block. In any case, whether it's GZIP or BGZF, it seems the solutions are very similar, and porting my work should be pretty simple - I just used larger blocks and put all the data in the index file and none in the headers. John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jdu...@illumina.com -Original Message- From: Peter Cock [mailto:p.j.a.c...@googlemail.com] Sent: Tuesday, November 08, 2011 4:04 PM To: Duddy, John Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor Subject: Re: [galaxy-dev] Tool shed and datatypes On Tue, Nov 8, 2011 at 11:45 PM, Duddy, John wrote: > It's not public yet, and it involves a little conundrum - we want > it so we can support large amounts of data efficiently on a variety > of aligners, including our ELAND from CASAVA. However, ELAND > does not support unaligned BAM inputs yet, and apparently it > would be a lot of work to make it so (and another team's area > of responsibility as well). OK, so using (unaligned) BAM isn't about to happen. > So in the near term, BGZF would not meet our needs. > I don't follow you there, BAM != BGZF. We can use BGZF to compress FASTQ, FASTA, GenBank, basically anything. You get compression approaching that of plain GZIP (depending on the characteristics of the data) plus efficient random access. > However, work is quite far along on a GZIP-based one > that works with ELAND and BWA, since they both read > GZIP FASTQ files, and works/will work with a converter > to fastq_sanger for other tools. > > I can put you in touch with the engineer doing the work if > you are interested. 
That might be a good idea, or ask them to post here?

Peter
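For readers following the BGZF side of this exchange, the two points made above (a BGZF block is an ordinary gzip member, and the only extra thing it records is its own compressed size) can be shown in a few lines. This is a sketch written from the SAM/BAM specification's description of the format, not code from samtools or Galaxy; the function names and the `MAX_INPUT` cap are choices made for this example, and the empty end-of-file marker block that real BGZF writers append is left out.

```python
import gzip
import struct
import zlib

MAX_INPUT = 0xFF00  # per-block input cap chosen to leave headroom below 64 KiB


def bgzf_block(data):
    """Wrap data in one gzip member carrying a "BC" extra subfield whose
    value is the total block length minus 1."""
    if len(data) > MAX_INPUT:
        raise ValueError("too much data for one block")
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    deflated = compressor.compress(data) + compressor.flush()
    bsize = 12 + 6 + len(deflated) + 8 - 1  # header + extra + payload + trailer
    if bsize > 0xFFFF:
        raise ValueError("block does not fit in a 16-bit size field")
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + deflated + trailer


def block_sizes(stream):
    """Yield (offset, block length) pairs by hopping from header to header."""
    offset = 0
    while offset < len(stream):
        xlen = struct.unpack_from("<H", stream, offset + 10)[0]
        extra = stream[offset + 12:offset + 12 + xlen]
        pos, bsize = 0, None
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2) == (ord("B"), ord("C")) and slen == 2:
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
            pos += 4 + slen
        if bsize is None:
            raise ValueError("no BC subfield at offset %d" % offset)
        yield offset, bsize + 1
        offset += bsize + 1


if __name__ == "__main__":
    chunks = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nTTTT\n+\nIIII\n"]
    stream = b"".join(bgzf_block(c) for c in chunks)
    print(gzip.decompress(stream) == b"".join(chunks))  # plain gzip reads it
    print(list(block_sizes(stream)))
```

As John notes, the header gives block boundaries only; anything format-specific, such as the number of sequences per block, would still have to live in a separate index.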
Re: [galaxy-dev] Tool shed and datatypes
On Tue, Nov 8, 2011 at 11:45 PM, Duddy, John wrote: > It's not public yet, and it involves a little conundrum - we want > it so we can support large amounts of data efficiently on a variety > of aligners, including our ELAND from CASAVA. However, ELAND > does not support unaligned BAM inputs yet, and apparently it > would be a lot of work to make it so (and another team's area > of responsibility as well). OK, so using (unaligned) BAM isn't about to happen. > So in the near term, BGZF would not meet our needs. > I don't follow you there, BAM != BGZF. We can use BGZF to compress FASTQ, FASTA, GenBank, basically anything. You get compression approaching that of plain GZIP (depending on the characteristics of the data) plus efficient random access. > However, work is quite far along on a GZIP-based one > that works with ELAND and BWA, since they both read > GZIP FASTQ files, and works/will work with a converter > to fastq_sanger for other tools. > > I can put you in touch with the engineer doing the work if > you are interested. That might be a good idea, or ask them to post here? Peter ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
BTW - the pull request for the GZIP-based splitting is actually integrated - I was referring to the GZIP-based datatype. John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jdu...@illumina.com -Original Message- From: Peter Cock [mailto:p.j.a.c...@googlemail.com] Sent: Tuesday, November 08, 2011 3:29 PM To: Duddy, John Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor Subject: Re: [galaxy-dev] Tool shed and datatypes On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote: > GZIP files are definitely our plan. I just finished testing the code > that distributes the processing of a FASTQ (or pair for PE) to an > arbitrary number of tasks, where each subtask extracts just the > data it needs without reading any of the file it does not need. It > extracts the blocks of GZIPped data into a standalone GZIP file > just by copying whole blocks and appending them (if the window > is not aligned perfectly, there is additional processing). Since > the entire file does not need to be read, it distributes quite nicely. > > I'll be preparing a pull request for it soon. > > > John Duddy Hi John, Is your pull request public yet? I'd like to know more about your GZIP based plan (and how it differs from BGZF). It would seem silly to reinvent something slightly different if an existing and well tested mechanism like BGZF (used in BAM files) would work. BGZF is based on GZIP with blocks each up to 64kb, where the block size is recorded in the GZIP block header. This may be more fine grained than the block sizes you are using, but should serve equally well for distribution of data chunks between machines/cores. I appreciate the SAM/BAM specification where BGZF is defined is quite dry reading, and the broad potential of this GZIP variant beyond BAM is not articulated clearly. 
So I've written a blog post about how BGZF can be used for efficient random access to sequential files (in the sense of one self contained record after another, e.g. many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open Galaxy feature request for general support of gzipped data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter
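The GZIP-based splitting John describes in the quoted message (copying whole compressed blocks into a standalone GZIP file, guided by an index kept outside the data) can be sketched roughly as below. This is not John's implementation, which is not shown anywhere in this thread. It assumes the file was written as concatenated gzip members with a fixed number of records each, and the index layout (offset, compressed length, record count) is invented here for illustration.

```python
import gzip


def write_members(records, per_block):
    """Compress records in groups of per_block, one gzip member per group.

    Returns (blob, index), where each index row is
    (byte offset, compressed length, record count).
    """
    blob = bytearray()
    index = []
    for start in range(0, len(records), per_block):
        group = records[start:start + per_block]
        member = gzip.compress(b"".join(group))
        index.append((len(blob), len(member), len(group)))
        blob += member
    return bytes(blob), index


def slice_members(blob, index, first_block, last_block):
    """Copy members first_block..last_block (inclusive) without decompressing.

    The result is itself a valid gzip stream.
    """
    start = index[first_block][0]
    end = index[last_block][0] + index[last_block][1]
    return blob[start:end]


if __name__ == "__main__":
    records = [("@r%d\nACGT\n+\nIIII\n" % i).encode() for i in range(10)]
    blob, index = write_members(records, 4)
    part = slice_members(blob, index, 1, 2)  # work for one subtask
    print(gzip.decompress(part) == b"".join(records[4:]))
```

A window that does not line up with member boundaries would need the additional processing John mentions; this sketch only covers the aligned case.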
Re: [galaxy-dev] Tool shed and datatypes
It's not public yet, and it involves a little conundrum - we want it so we can support large amounts of data efficiently on a variety of aligners, including our ELAND from CASAVA. However, ELAND does not support unaligned BAM inputs yet, and apparently it would be a lot of work to make it so (and another team's area of responsibility as well). So in the near term, BGZF would not meet our needs. However, work is quite far along on a GZIP-based one that works with ELAND and BWA, since they both read GZIP FASTQ files, and works/will work with a converter to fastq_sanger for other tools. I can put you in touch with the engineer doing the work if you are interested. John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jdu...@illumina.com -Original Message- From: Peter Cock [mailto:p.j.a.c...@googlemail.com] Sent: Tuesday, November 08, 2011 3:29 PM To: Duddy, John Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor Subject: Re: [galaxy-dev] Tool shed and datatypes On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote: > GZIP files are definitely our plan. I just finished testing the code > that distributes the processing of a FASTQ (or pair for PE) to an > arbitrary number of tasks, where each subtask extracts just the > data it needs without reading any of the file it does not need. It > extracts the blocks of GZIPped data into a standalone GZIP file > just by copying whole blocks and appending them (if the window > is not aligned perfectly, there is additional processing). Since > the entire file does not need to be read, it distributes quite nicely. > > I'll be preparing a pull request for it soon. > > > John Duddy Hi John, Is your pull request public yet? I'd like to know more about your GZIP based plan (and how it differs from BGZF). It would seem silly to reinvent something slightly different if an existing and well tested mechanism like BGZF (used in BAM files) would work. 
BGZF is based on GZIP with blocks each up to 64kb, where the block size is recorded in the GZIP block header. This may be more fine grained than the block sizes you are using, but should serve equally well for distribution of data chunks between machines/cores. I appreciate the SAM/BAM specification where BGZF is defined is quite dry reading, and the broad potential of this GZIP variant beyond BAM is not articulated clearly. So I've written a blog post about how BGZF can be used for efficient random access to sequential files (in the sense of one self contained record after another, e.g. many sequence file formats including FASTA & FASTQ): http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html I've also added a reference to BGZF on the open Galaxy feature request for general support of gzipped data types: https://bitbucket.org/galaxy/galaxy-central/issue/666/ Regards, Peter ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote:
> GZIP files are definitely our plan. I just finished testing the code
> that distributes the processing of a FASTQ (or pair for PE) to an
> arbitrary number of tasks, where each subtask extracts just the
> data it needs without reading any of the file it does not need. It
> extracts the blocks of GZIPped data into a standalone GZIP file
> just by copying whole blocks and appending them (if the window
> is not aligned perfectly, there is additional processing). Since
> the entire file does not need to be read, it distributes quite nicely.
>
> I'll be preparing a pull request for it soon.
>
> John Duddy

Hi John,

Is your pull request public yet? I'd like to know more about your GZIP based plan (and how it differs from BGZF). It would seem silly to reinvent something slightly different if an existing and well tested mechanism like BGZF (used in BAM files) would work.

BGZF is based on GZIP with blocks each up to 64kb, where the block size is recorded in the GZIP block header. This may be more fine grained than the block sizes you are using, but should serve equally well for distribution of data chunks between machines/cores.

I appreciate the SAM/BAM specification where BGZF is defined is quite dry reading, and the broad potential of this GZIP variant beyond BAM is not articulated clearly. So I've written a blog post about how BGZF can be used for efficient random access to sequential files (in the sense of one self contained record after another, e.g. many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open Galaxy feature request for general support of gzipped data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter
Re: [galaxy-dev] Tool shed and datatypes
On 10/21/11 12:29 PM, James Taylor wrote:
> Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +:
>> I put the gmap tool suite in the galaxy Tool Shed, let me know if there is
>> more I should do.
>
> Awesome!
>
>> I added a requirement tag for the datatypes to the tool-configs:
>> % grep 'requirement.*datatype' *.xml
>> gmap_build.xml:gmapdb
>
> Requirement tags for datatypes are an interesting idea, but I'm wondering if
> this is something we should require? It seems like all this information is
> implicit -- a tool requires a datatype if it has an input or output parameter
> that references that type. Is there other information that should go in the
> requirement tag?

It is certainly correct that the tag would be redundant; the tool config parser could identify the list of datatype formats. I was just trying to think of some way to indicate that additional datatypes were required above those in the central distribution.

My goal would be to have the installation of tools from the Tool Shed also be able to install the extra datatypes that those tools require. Having datatypes specified separately in the Tool Shed from tools would hopefully promote less redundancy of datatypes and better interoperability among developers' tools. For example, the metagenomics applications mothur and qiime have many specific formats that are internal to their tools, but also a few that might be used to migrate data between those applications.

We'd need a way to avoid name clashes, perhaps adopting a namespace pattern for the file_ext attribute.
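James's point that the datatype requirement is already implicit in a tool's inputs and outputs can be illustrated with a short sketch. It assumes the usual tool config layout, where input `<param>` and output `<data>` elements carry a `format` attribute. The sample tool XML is a stand-in written for this example (not the real gmap_build.xml), the helper names are hypothetical, and the "core" set is a placeholder rather than the actual list of datatypes in the distribution.

```python
import xml.etree.ElementTree as ET


def formats_used(tool_xml):
    """Return the sorted format names referenced by <param> and <data> elements."""
    root = ET.fromstring(tool_xml)
    formats = set()
    for tag in ("param", "data"):
        for element in root.iter(tag):
            # A format attribute may hold a comma-separated list.
            for name in (element.get("format") or "").split(","):
                if name.strip():
                    formats.add(name.strip())
    return sorted(formats)


def extra_datatypes(tool_xml, core_formats):
    """Formats the tool references that are not in the given core set."""
    return [f for f in formats_used(tool_xml) if f not in core_formats]


if __name__ == "__main__":
    # Stand-in tool config; the real gmap_build.xml is not reproduced here.
    tool = (
        '<tool id="gmap_build" name="GMAP Build">'
        '<inputs><param name="input" type="data" format="fasta"/></inputs>'
        '<outputs><data format="gmapdb" name="output"/></outputs>'
        '</tool>'
    )
    print(extra_datatypes(tool, {"fasta", "tabular"}))
```

A list derived this way tells an installer which extra datatypes a tool needs, though not where to get them, which is the gap the separate datatype entries proposed above are meant to fill.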
Re: [galaxy-dev] Tool shed and datatypes
hnson wrote:
>>>>
>>>>> Greg,
>>>>>
>>>>> It would be great if there were a way to expand upon the core datatypes
>>>>> using the ToolShed.
>>>>>
>>>>> Would it be possible to have a separate datatype repository within the
>>>>> ToolShed?
>>>>>
>>>>> Datatype
>>>>>     name=""
>>>>>     description=""
>>>>>     datatype_dependencies=[]
>>>>>     definition=
>>>>>
>>>>> The tool config could be expanded to have a requirement for datatypes.
>>>>> ssmap
>>>>>
>>>>> Table datatype
>>>>>    Column    |          Type          | Modifiers
>>>>> -------------+------------------------+-------------------------------------------------------
>>>>>  id          | integer                | not null default nextval('datatype_id_seq'::regclass)
>>>>>  name        | character varying(255) |
>>>>>  version     | character varying(40)  |
>>>>>  description | text                   |
>>>>>  definition  | text                   |
>>>>> UNIQUE (name)
>>>>>
>>>>> Table datatype_datatype_association
>>>>>    Column    |  Type   | Modifiers
>>>>> -------------+---------+-------------------------------------------------------
>>>>>  id          | integer | not null default nextval('datatype_id_seq'::regclass)
>>>>>  datatype_id | integer |
>>>>>  requires_id | integer |
>>>>> FOREIGN KEY (datatype_id) REFERENCES datatype(id)
>>>>> FOREIGN KEY (requires_id) REFERENCES datatype(id)
>>>>>
>>>>> Then for my mothur metagenomics tools I could define:
>>>>>
>>>>> name="ssmap" description="Secondary Structure Map" version="1.0"
>>>>> datatype_dependencies=[tabular]
>>>>> definition=
>>>>> from galaxy.datatypes.tabular import Tabular
>>>>> class SecondaryStructureMap(Tabular):
>>>>>     file_ext = 'ssmap'
>>>>>     def __init__(self, **kwd):
>>>>>         """Initialize secondary structure map datatype"""
>>>>>         Tabular.__init__( self, **kwd )
>>>>>         self.column_names = ['Map']
>>>>>
>>>>>     def sniff( self, filename ):
>>>>>         """
>>>>>         Determines whether the file is a secondary structure map format.
>>>>>         A single column with an integer value which indicates the row that
>>>>>         this row maps to.
>>>>>         Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
>>>>>         """
>>>>>         ...
>>>>>
>>>>> Then the align.check.xml tool_config could require the 'ssmap' datatype:
>>>>>
>>>>> Calculate the number of potentially misaligned bases
>>>>> mothur
>>>>> ssmap
>>>>>
>>>>>> John,
>>>>>>
>>>>>> I've been following this message thread, and it seems it's gone in a
>>>>>> direction that differs from your initial question about the possibility
>>>>>> for Galaxy to handle automatic editing of the datatypes_conf.xml file
>>>>>> when certain Galaxy tool shed tools are automatically installed. There
>>>>>> are some complexities to consider in attempting this. One of the issues
>>>>>> to consider is that the work for adding support for a new datatype to
>>>>>> Galaxy lies outside of the intended function of the tool shed. If new
>>>>>> support is added to the Galaxy cod
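The sniff() body is elided in the proposal above, so here is one way the check described in its docstring might be sketched. This is a guess at the intent, not the actual mothur datatype code: the helper name `looks_like_ssmap`, the 1-based row numbering, and the convention that 0 means "unpaired" are all my assumptions.

```python
import re

INTEGER_LINE = re.compile(r"[0-9]+")


def looks_like_ssmap(lines):
    """Return True if lines hold one integer each and the pairing is symmetric.

    Assumptions (not taken from the thread): rows are numbered from 1 and a
    value of 0 marks a row that maps to nothing.
    """
    values = []
    for line in lines:
        text = line.strip()
        if not INTEGER_LINE.fullmatch(text):
            return False  # more than one column, or not an integer
        values.append(int(text))
    if not values:
        return False
    for row, partner in enumerate(values, start=1):
        if partner == 0:
            continue
        # If structMap[row] = partner then structMap[partner] must equal row.
        if partner > len(values) or values[partner - 1] != row:
            return False
    return True


symmetric = looks_like_ssmap(["3", "0", "1"])   # row 1 <-> row 3, row 2 unpaired
asymmetric = looks_like_ssmap(["2", "0"])       # row 1 -> 2, but row 2 -> nothing
two_columns = looks_like_ssmap(["1\t2", "2\t1"])
```

A real sniffer usually inspects only the head of a file, so the symmetry test could only be applied to partners that fall inside the lines actually read.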
Re: [galaxy-dev] Tool shed and datatypes
Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +:
> I put the gmap tool suite in the galaxy Tool Shed, let me know if there is
> more I should do.

Awesome!

> I added a requirement tag for the datatypes to the tool-configs:
>
> % grep 'requirement.*datatype' *.xml
> gmap_build.xml: gmapdb

Requirement tags for datatypes are an interesting idea, but I'm wondering if this is something we should require? It seems like all this information is implicit -- a tool requires a datatype if it has an input or output parameter that references that type. Is there other information that should go in the requirement tag?

--
James Taylor, Assistant Professor, Biology / Computer Science, Emory University
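As an illustration of the "implicit" point, the formats a tool references can be read straight out of its config. The sketch below is only that: the XML is a cut-down, made-up tool config (not Jim's actual align.check.xml), `CORE_FORMATS` is an invented stand-in for whatever the central distribution ships, and `referenced_formats` is a hypothetical helper rather than Galaxy's tool parser.

```python
import xml.etree.ElementTree as ET

# Illustrative tool config; element and attribute names follow the usual
# Galaxy tool XML layout, but the tool itself is invented for this example.
TOOL_XML = """
<tool id="align_check" name="Align.check">
  <inputs>
    <param name="fasta" type="data" format="align"/>
    <param name="map" type="data" format="ssmap"/>
    <param name="label" type="text"/>
  </inputs>
  <outputs>
    <data name="out" format="tabular"/>
  </outputs>
</tool>
"""

# Hypothetical list of datatypes already present in the core distribution.
CORE_FORMATS = {"tabular", "fasta", "fastq"}


def referenced_formats(tool_xml):
    """Collect every format named on an input param or output data element."""
    root = ET.fromstring(tool_xml)
    formats = set()
    for elem in root.iter():
        declared = elem.get("format")
        if elem.tag in ("param", "data") and declared:
            # A format attribute may list several comma-separated types.
            formats.update(name.strip() for name in declared.split(","))
    return formats


needed = referenced_formats(TOOL_XML)
extra = needed - CORE_FORMATS  # datatypes an installer would have to supply
```

What this cannot tell an installer is where the extra datatypes come from or which version is wanted, which may be the "other information" a requirement tag could carry.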
Re: [galaxy-dev] Tool shed and datatypes
John,

I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about the possibility for Galaxy to handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One of the issues to consider is that the work for adding support for a new datatype to Galaxy lies outside of the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but perhaps adding a completely new datatype to the file may not be appropriate. I'll continue to think about this. Please send additional thoughts and feedback, as doing so is always helpful.

Thanks!

Greg

On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:

> One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools.
>
> Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats – we’d have to add a datatype.
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jduddy at illumina.com
>
> From: Greg Von Kuster [mailto:greg at bx.psu.edu]
> Sent: Wednesday, October 05, 2011 6:25 PM
> To: Duddy, John
> Cc: galaxy-dev at lists.bx.psu.edu
> Subject: Re: [galaxy-dev] Tool shed and datatypes
>
> Hello John,
>
> The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?
>
> Thanks!
>
> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
>
>> Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?
>>
>> John Duddy
>> Sr. Staff Software Engineer
>> Illumina, Inc.
>
> Greg Von Kuster
> Galaxy Development Team
> greg at bx.psu.edu
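The "CompressedFastQ with a conversion to FASTQ defined" idea boils down to a very small converter. The sketch below shows only the conversion step using the standard library; the function names are invented for illustration, and wiring such a script into Galaxy as a datatype converter is not shown here.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Cheap check a sniffer might use to tell compressed from plain FASTQ."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def compressed_fastq_to_fastq(src_path, dest_path):
    """Stream-decompress src_path into dest_path without loading it in memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Round trip on a one-record file in a scratch directory.
record = b"@read1\nACGT\n+\nIIII\n"
workdir = tempfile.mkdtemp()
packed = os.path.join(workdir, "reads.fastq.gz")
plain = os.path.join(workdir, "reads.fastq")
with gzip.open(packed, "wb") as handle:
    handle.write(record)
compressed_fastq_to_fastq(packed, plain)
with open(plain, "rb") as handle:
    restored = handle.read()
```

Tools that read gzip natively (BWA and ELAND in John's example) would accept the compressed type directly, and only the remaining tools would trigger the conversion.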
Re: [galaxy-dev] Tool shed and datatypes
Hello Jim,

On Oct 10, 2011, at 1:01 PM, Jim Johnson wrote:

> There are a number of well defined formats that are exchanged between
> applications, e.g. BAM, GTF, etc.; I wouldn't advocate proliferating those.
>
> I see the need for Toolshed datatypes more for the intermediate file formats
> used within a suite of commands. These can be helpful in guiding a user to
> select appropriate inputs for successive steps in an analysis.
>
> For example, when developing the 90-some tool wrappers for the mothur
> metagenomic package, there are many file formats that get passed among the
> mothur commands. It greatly simplifies the user's experience if the outputs
> are typed so as to correctly filter the acceptable inputs to another command.
> I fear the amount of time I would spend providing user support if the
> outputs and inputs were generically typed.

An approach for simplifying this is to include one or more exported Galaxy workflows in the tool shed repository along with the tools. The workflows cannot currently be automatically imported into Galaxy, but they can be manually imported, giving the user an idea of the steps in the analyses for which the tools are intended. Additional features related to Galaxy workflows included in Galaxy tool shed repositories will be available in future Galaxy releases.

> I'm also seeing a similar need as I am creating tool wrappers for
> the GMAP/GSNAP mapping commands. While input to GSNAP and GMAP can be fastq
> and output in SAM format, some of the more interesting use cases involve
> creating additional map stores, where specific datatypes would guide the user
> in setting the tool parameters correctly.
>
> JJ
>
> James E Johnson
> Minnesota Supercomputing Institute, University of Minnesota
>
> On 10/10/11 11:09 AM, Duddy, John wrote:
>> I agree with the risks you cited.
Re: [galaxy-dev] Tool shed and datatypes
Peter has the right idea here - we will add support for appropriate data types to the Galaxy distribution. Of course, the key word here is "appropriate", but any industry-standard data format should fall under this category.

On Oct 10, 2011, at 12:46 PM, Peter Cock wrote:

> On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John wrote:
>> I agree with the risks you cited.
>>
>> There is a risk in the other direction that I think is even scarier -
>> without the ability to add data types, tool authors may be forced
>> to use a "typeless" system, declaring all inputs/outputs as "data"
>> or "text". While this works, it has the same drawbacks as typeless
>> programming languages - deferring error detection to runtime,
>> impairing the ability to perform static analysis, inability to perform
>> transparent type conversions - in other words, the tools have to
>> take over responsibilities from the framework.
>>
>> Like all interesting problems, I don't think there is an "obviously
>> right" answer ;-}
>>
>> John Duddy
>
> Indeed. I'm going with lobbying the Galaxy team to include new
> datatypes when I need them (InterProScan XML is on my
> to-do list, perhaps v4 and v5 as two types), but I've been
> able to get along with "tabular" as a tool output.
>
> Peter

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu
Re: [galaxy-dev] Tool shed and datatypes
There are a number of well defined formats that are exchanged between applications, e.g. BAM, GTF, etc.; I wouldn't advocate proliferating those.

I see the need for Toolshed datatypes more for the intermediate file formats used within a suite of commands. These can be helpful in guiding a user to select appropriate inputs for successive steps in an analysis.

For example, when developing the 90-some tool wrappers for the mothur metagenomic package, there are many file formats that get passed among the mothur commands. It greatly simplifies the user's experience if the outputs are typed so as to correctly filter the acceptable inputs to another command. I fear the amount of time I would spend providing user support if the outputs and inputs were generically typed.

I'm also seeing a similar need as I am creating tool wrappers for the GMAP/GSNAP mapping commands. While input to GSNAP and GMAP can be fastq and output in SAM format, some of the more interesting use cases involve creating additional map stores, where specific datatypes would guide the user in setting the tool parameters correctly.

JJ

James E Johnson
Minnesota Supercomputing Institute, University of Minnesota

On 10/10/11 11:09 AM, Duddy, John wrote:

I agree with the risks you cited.

There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework.

Like all interesting problems, I don't think there is an "obviously right" answer ;-}

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
Re: [galaxy-dev] Tool shed and datatypes
On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John wrote:
> I agree with the risks you cited.
>
> There is a risk in the other direction that I think is even scarier -
> without the ability to add data types, tool authors may be forced
> to use a "typeless" system, declaring all inputs/outputs as "data"
> or "text". While this works, it has the same drawbacks as typeless
> programming languages - deferring error detection to runtime,
> impairing the ability to perform static analysis, inability to perform
> transparent type conversions - in other words, the tools have to
> take over responsibilities from the framework.
>
> Like all interesting problems, I don't think there is an "obviously
> right" answer ;-}
>
> John Duddy

Indeed. I'm going with lobbying the Galaxy team to include new datatypes when I need them (InterProScan XML is on my to-do list, perhaps v4 and v5 as two types), but I've been able to get along with "tabular" as a tool output.

Peter
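A small sketch may help show what is lost in the "typeless" case John describes. It models datatypes as a parent chain and filters a history by what a tool accepts. The hierarchy, dataset names, and helpers (`PARENT`, `is_a`, `selectable`) are invented for illustration and are not Galaxy code; in particular, treating "otu" as a subtype of "tabular" is an assumption.

```python
# Hypothetical datatype hierarchy: each extension points at its parent type.
PARENT = {
    "ssmap": "tabular",
    "otu": "tabular",
    "tabular": "data",
    "fastq": "data",
    "data": None,
}


def is_a(ext, accepted):
    """True if ext is the accepted type or inherits from it."""
    while ext is not None:
        if ext == accepted:
            return True
        ext = PARENT.get(ext)
    return False


def selectable(history, accepted_formats):
    """Names of history datasets a tool input with these formats would offer."""
    return [name for name, ext in history
            if any(is_a(ext, fmt) for fmt in accepted_formats)]


history = [("reads", "fastq"), ("structure map", "ssmap"), ("otu table", "otu")]

typed = selectable(history, ["ssmap"])      # only the one valid input is offered
loose = selectable(history, ["tabular"])    # both tabular-derived datasets
typeless = selectable(history, ["data"])    # everything, errors surface at runtime
```

With a specific type the framework narrows the choice to the one valid dataset; with "data" every dataset is offered and any mistake only shows up when the tool runs.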
Re: [galaxy-dev] Tool shed and datatypes
I agree with the risks you cited. There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework. Like all interesting problems, I don't think there is an "obviously right" answer ;-} John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jdu...@illumina.com -Original Message- From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Paniagua, Eric Sent: Friday, October 07, 2011 5:53 PM To: j...@umn.edu; galaxy-dev@lists.bx.psu.edu Cc: Greg Von Kuster Subject: Re: [galaxy-dev] Tool shed and datatypes Hi all, Just my 2 cents. This is a really great idea to have dynamically (down-)loadable datatypes, and a tool config tag to express a datatype dependency is right on the money. I agree with Greg in having hesitations about adding that feature though. The purpose (at least as far I see it) of the tool shed is to allow the community to share its productivity. New tools written by one group can be used by another group that may not have adequate skill, resources, or time to create the same tool on their own. One issue this model can suffer from, however, is over-proliferation of contributions. In this case, new tools with the same, overlapping, or very similar functions might be developed independently by multiple groups who then want to contribute to the tool shed. 
I don't know how often this situation arises or what official contingencies are in place to manage it, but it is important to manage that situation carefully. If it occurs with any appreciable frequency, then eventually there are many clusters of tools available that do almost the same thing but not quite. This is bad for the user, bad for the maintainer, complicates communication between researchers, etc. This model can work nicely if the frequency of very similar tool submissions is small, and even better if there is some management for cleaning out broken or redundant tools.

When you allow custom datatypes to enter the picture, however, the story can become hairy much more quickly. Having a limited set of officially supplied / supported datatypes forces the contributors of new tools to use datatypes drawn from a standard set. Without that constraint, the number of datatype variants could explode. Now the concern is not only that multiple contributors may submit very similar tool variants, or that each of them might choose to create their own datatypes to optimize their methods, but also that contributors of tools which are functionally dissimilar but manipulate the same general types of data will write their tools using new datatypes that are variants of each other. Tools are essentially typed by the datatypes they accept and produce, so you won't be able to chain these tools together very easily at all. Most pairs of tools will have the "wrong" datatype, on input or output, for what a user wants to do. The general trend is then proliferation of clusters of redundant tools, clusters of redundant datatypes, and growing sparsity in the "tool graph" (think of datatypes as vertices and tools as directed [hyper]edges).
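The "tool graph" picture above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the tool names and datatype names below are hypothetical, each tool is simplified to a single input and a single output datatype (a plain directed edge rather than a hyperedge), and nothing here reflects how Galaxy itself resolves tool chains.

```python
from collections import deque

# Hypothetical tools, each typed by (input datatype, output datatype).
TOOLS = {
    "fastq_to_fasta": ("fastq", "fasta"),
    "align": ("fasta", "align"),
    "align_check": ("align", "tabular"),
    "variant_caller": ("bam", "vcf"),
}


def find_chain(tools, source, target):
    """Breadth-first search for a shortest chain of tools that turns
    the source datatype into the target datatype.

    Returns a list of tool names, or None when no chain exists.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        datatype, path = queue.popleft()
        if datatype == target:
            return path
        for name, (in_type, out_type) in tools.items():
            if in_type == datatype and out_type not in seen:
                seen.add(out_type)
                queue.append((out_type, path + [name]))
    return None
```

If one contributor's aligner emitted a private variant of "align" instead of the shared type, `find_chain` would no longer find a path to "tabular" unless a converter edge between the two variants were added, which is the sparsity problem described above.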
So, a move in the direction of supporting something like a "TypeShed" would require careful consideration and would need to consist of at least either a well-defined policy for managing *Shed rot and the capability to execute it, or a very slick tool / datatype versioning system with flexible control for users and some also very slick method for maintaining implicit conversions between the datatypes in a datatype cluster (ideally automatically generated).

I think at least the implicit conversion part can be done, if not in a fully automated manner, then by a combination of policy and engineering. For policy, you can define, identify, or construct a canonical datatype in each cluster and require that a contributor of a variant datatype submit methods for implicit conversion to/from the canonical datatype in that cluster. One idea that could help reduce complexity is to potentially place some additional structure on datatypes and take the canonical datatype for a cluster to be a form of the union (mathematical, not the "union" from C) of the variants in the cluster, which would simplify implicit conversions somewhat. Or, if there's some reason for this, there can also be a set of "canonical"
Re: [galaxy-dev] Tool shed and datatypes
alf of Jim Johnson [j...@umn.edu]
Sent: Friday, October 07, 2011 2:06 PM
To: galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Greg,

It would be great if there were a way to expand upon the core datatypes using the ToolShed. Would it be possible to have a separate datatype repository within the ToolShed?

Datatype name="" description="" datatype_dependencies=[] definition=

The tool config could be expanded to have a requirement for datatypes.

ssmap

Table datatype

   Column    |          Type          | Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)

Table datatype_datatype_association

   Column    |  Type   | Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)

Then for my mothur metagenomics tools I could define:

name="ssmap"
description="Secondary Structure Map"
version="1.0"
datatype_dependencies=[tabular]
definition=

    from galaxy.datatypes.tabular import Tabular

    class SecondaryStructureMap(Tabular):
        file_ext = 'ssmap'

        def __init__(self, **kwd):
            """Initialize secondary structure map datatype"""
            Tabular.__init__( self, **kwd )
            self.column_names = ['Map']

        def sniff( self, filename ):
            """
            Determines whether the file is a secondary structure map format
            A single column with an integer value which indicates the row that this row maps to.
            Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
            """
            ...
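The body of `sniff` is elided in the message, so the following is only a guess at the kind of check its docstring describes: one integer per line, with the mapping required to be symmetric. It is a standalone sketch that does not import Galaxy. The function name `looks_like_ssmap`, the 1-based row numbering, and the use of 0 for an unpaired row are all assumptions made for illustration and are not taken from the original code.

```python
def looks_like_ssmap(lines):
    """Return True if the lines form a symmetric single-column integer map.

    Assumptions (not from the original message): rows are numbered
    from 1, and a value of 0 marks a row that maps to nothing.
    """
    struct_map = [0]  # placeholder at index 0 so that rows are 1-based
    for line in lines:
        fields = line.split()
        if len(fields) != 1:
            return False  # more or fewer than one column
        try:
            value = int(fields[0])
        except ValueError:
            return False  # not an integer
        if value < 0:
            return False
        struct_map.append(value)

    row_count = len(struct_map) - 1
    if row_count == 0:
        return False  # an empty file is not a map

    for row in range(1, row_count + 1):
        partner = struct_map[row]
        if partner == 0:
            continue  # unpaired row (assumed convention)
        # If row maps to partner, partner must map back to row.
        if partner > row_count or struct_map[partner] != row:
            return False
    return True
```

A real `sniff` method would read the lines from `filename` and would probably inspect only the start of a large file; this sketch takes the lines directly to stay self-contained.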
Then the align.check.xml tool_config could require the 'ssmap' datatype: Calculate the number of potentially misaligned bases mothur ssmap > John, > > I've been following this message thread, and it seems it's gone in a > direction that differs from your initial question about the possibility for > Galaxy to handle automatic editing of the datatypes_conf.xml file when > certain Galaxy tool shed tools are automatically installed. There are some > complexities to consider in attempting this. One of the issues to consider > is that the work for adding support for a new datatype to Galaxy lies outside > of the intended function of the tool shed. If new support is added to the > Galaxy code base, an entry for that new datatype should be manually added to > the table at the same time. There may be benefits to enabling automatic > changes to datatype entries that already exist in the file (e.g., adding a > new converter for an existing datatype entry), but perhaps adding a > completely new datatype to the file may not be appropriate. I'll continue to > think about this - send additional thought and feedback, as doing so is > always helpful > > Thanks! > > Greg > > > On Oct 5, 2011, at 11:48 PM, Duddy, John wrote: > >> One of the things we’re facing is the sheer size of a whole human genome at >> 30x coverage. An effective way to deal with that is by compressing the FASTQ >> files. That works for BWA and our ELAND, which can directly read a >> compressed FASTQ, but other tools crash when reading compressed FASTQ >> filesfiles. One way to address that would be to introduce a new type, for >> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could >> take both types as input. This would allow the best of both worlds – >> efficient storage and use by all existing tools. >> >> Another example would be adding the CASAVA tools to Galaxy. Some of the >> statistics generation tools use custom file formats. 
To be able to make the >> use of those tools optional and configurable, they should be separate from >> the aligner, but that would require that Galaxy be made aware of the custom >> file formats – we’d have to add a datatype. >> >> John Duddy >> Sr. Staff Software Engineer >> Illumina, Inc. >> 9885 Towne Centre Drive >> San Diego, CA 92121 >> Tel: 8
Re: [galaxy-dev] Tool shed and datatypes
Greg,

It would be great if there were a way to expand upon the core datatypes using the ToolShed. Would it be possible to have a separate datatype repository within the ToolShed?

Datatype name="" description="" datatype_dependencies=[] definition=

The tool config could be expanded to have a requirement for datatypes.

ssmap

Table datatype

   Column    |          Type          | Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)

Table datatype_datatype_association

   Column    |  Type   | Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)

Then for my mothur metagenomics tools I could define:

name="ssmap"
description="Secondary Structure Map"
version="1.0"
datatype_dependencies=[tabular]
definition=

    from galaxy.datatypes.tabular import Tabular

    class SecondaryStructureMap(Tabular):
        file_ext = 'ssmap'

        def __init__(self, **kwd):
            """Initialize secondary structure map datatype"""
            Tabular.__init__( self, **kwd )
            self.column_names = ['Map']

        def sniff( self, filename ):
            """
            Determines whether the file is a secondary structure map format
            A single column with an integer value which indicates the row that this row maps to.
            Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
            """
            ...

Then the align.check.xml tool_config could require the 'ssmap' datatype:

Calculate the number of potentially misaligned bases mothur ssmap

John,

I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about the possibility for Galaxy to handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this.
One of the issues to consider is that the work for adding support for a new datatype to Galaxy lies outside of the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but perhaps adding a completely new datatype to the file may not be appropriate. I'll continue to think about this - send additional thought and feedback, as doing so is always helpful Thanks! Greg On Oct 5, 2011, at 11:48 PM, Duddy, John wrote: One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ filesfiles. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools. Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats – we’d have to add a datatype. John Duddy Sr. Staff Software Engineer Illumina, Inc. 
9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com From: Greg Von Kuster [mailto:greg at bx.psu.edu] Sent: Wednesday, October 05, 2011 6:25 PM To: Duddy, John Cc: galaxy-dev at lists.bx.psu.edu Subject: Re: [galaxy-dev] Tool shed and datatypes Hello John, The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this? Thanks! On Oct 5, 2011, at 1:52
Re: [galaxy-dev] Tool shed and datatypes
John, I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about the possibility for Galaxy to handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One of the issues to consider is that the work for adding support for a new datatype to Galaxy lies outside of the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but perhaps adding a completely new datatype to the file may not be appropriate. I'll continue to think about this - send additional thought and feedback, as doing so is always helpful Thanks! Greg On Oct 5, 2011, at 11:48 PM, Duddy, John wrote: > One of the things we’re facing is the sheer size of a whole human genome at > 30x coverage. An effective way to deal with that is by compressing the FASTQ > files. That works for BWA and our ELAND, which can directly read a compressed > FASTQ, but other tools crash when reading compressed FASTQ filesfiles. One > way to address that would be to introduce a new type, for example > “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both > types as input. This would allow the best of both worlds – efficient storage > and use by all existing tools. > > Another example would be adding the CASAVA tools to Galaxy. Some of the > statistics generation tools use custom file formats. To be able to make the > use of those tools optional and configurable, they should be separate from > the aligner, but that would require that Galaxy be made aware of the custom > file formats – we’d have to add a datatype. > > John Duddy > Sr. 
Staff Software Engineer > Illumina, Inc. > 9885 Towne Centre Drive > San Diego, CA 92121 > Tel: 858-736-3584 > E-mail: jdu...@illumina.com > > From: Greg Von Kuster [mailto:g...@bx.psu.edu] > Sent: Wednesday, October 05, 2011 6:25 PM > To: Duddy, John > Cc: galaxy-dev@lists.bx.psu.edu > Subject: Re: [galaxy-dev] Tool shed and datatypes > > Hello John, > > The Galaxy tool shed currently is not enabled to automatically edit the > datatypes_conf.xml file, although I could add this feature if the need > exists. Can you elaborate on what you are looking to do regarding this? > > Thanks! > > > On Oct 5, 2011, at 1:52 PM, Duddy, John wrote: > > > Can we introduce new file types via tools in the tool shed? It seems Galaxy > can load them if they are in the datatypes configuration file. Does tool > installation automate the editing of that file? > > > John Duddy > Sr. Staff Software Engineer > Illumina, Inc. > 9885 Towne Centre Drive > San Diego, CA 92121 > Tel: 858-736-3584 > E-mail: jdu...@illumina.com > > ___ > Please keep all replies on the list by using "reply all" > in your mail client. To manage your subscriptions to this > and other Galaxy lists, please use the interface at: > > http://lists.bx.psu.edu/ > > Greg Von Kuster > Galaxy Development Team > g...@bx.psu.edu > > > > ___ > Please keep all replies on the list by using "reply all" > in your mail client. To manage your subscriptions to this > and other Galaxy lists, please use the interface at: > > http://lists.bx.psu.edu/ Greg Von Kuster Galaxy Development Team g...@bx.psu.edu ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
GZIP files are definitely our plan. I just finished testing the code that distributes the processing of a FASTQ (or pair for PE) to an arbitrary number of tasks, where each subtask extracts just the data it needs without reading any of the file it does not need. It extracts the blocks of GZIPped data into a standalone GZIP file just by copying whole blocks and appending them (if the window is not aligned perfectly, there is additional processing). Since the entire file does not need to be read, it distributes quite nicely.

I'll be preparing a pull request for it soon.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com

-----Original Message-----
From: Peter Cock [mailto:p.j.a.c...@googlemail.com]
Sent: Thursday, October 06, 2011 9:19 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John wrote:
> As I understand it, Isilon is built up from "bricks" that have storage
> and compute power. They replicate files amongst themselves, so
> that for every IO request there are multiple systems that could
> respond. They are interconnected by an ultra fast fibre backbone.

So why not use gzipped files on top of that? Smaller chunks of data to access so should be faster even with the decompression once it gets to the CPU.

> So, depending on your topology, it's possible to get a lot more
> throughput by working on different sections of the same file from
> different physical computers.

Nice.

> I haven't delved into BGZF, so I can't comment. My approach to
> block GZIP was just to concatenate multiple GZIP files and keep
> a record of the offsets and sequences contained in each.
The > advantage is compatibility, in that it decompresses just like it > was one big chunk, yet you can compose subsets of the data > without decompressing/recompressing and (as long as we > actually have to write out the file subsets) can reap the reduced > IO benefits of smaller writes. That sounds VERY similar to BGZF - have a read over the SAM specification which covers this. Basically they stick the block size into the gzip headers, and the BAM index files (BAI) use a 64 bit offset which is split into the BGZF block offset and the offset within that decompressed block. See: http://samtools.sourceforge.net/SAM1.pdf Peter ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
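The concatenated-GZIP approach described in this message can be sketched in a few lines. This is not the code from the pending pull request, which is not shown in the thread; the function names and the in-memory index are hypothetical. It relies on the fact that a series of gzip members written back to back is itself a valid gzip stream.

```python
import gzip


def build_blocked_gzip(chunks):
    """Compress each chunk as its own gzip member.

    Returns (data, index), where index holds the (start, end) byte
    offsets of every member inside data.
    """
    data = bytearray()
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk)
        index.append((len(data), len(data) + len(member)))
        data += member
    return bytes(data), index


def extract_members(data, index, first, last):
    """Copy members first..last (inclusive) without recompressing them."""
    start = index[first][0]
    end = index[last][1]
    return data[start:end]
```

Decompressing the whole of `data` yields the chunks joined together, while a slice cut on member boundaries decompresses to just those chunks, so a worker can be handed a byte range and never touch the rest of the file.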
Re: [galaxy-dev] Tool shed and datatypes
On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John wrote:
> As I understand it, Isilon is built up from "bricks" that have storage
> and compute power. They replicate files amongst themselves, so
> that for every IO request there are multiple systems that could
> respond. They are interconnected by an ultra fast fibre backbone.

So why not use gzipped files on top of that? Smaller chunks of data to access so should be faster even with the decompression once it gets to the CPU.

> So, depending on your topology, it's possible to get a lot more
> throughput by working on different sections of the same file from
> different physical computers.

Nice.

> I haven't delved into BGZF, so I can't comment. My approach to
> block GZIP was just to concatenate multiple GZIP files and keep
> a record of the offsets and sequences contained in each. The
> advantage is compatibility, in that it decompresses just like it
> was one big chunk, yet you can compose subsets of the data
> without decompressing/recompressing and (as long as we
> actually have to write out the file subsets) can reap the reduced
> IO benefits of smaller writes.

That sounds VERY similar to BGZF - have a read over the SAM specification which covers this. Basically they stick the block size into the gzip headers, and the BAM index files (BAI) use a 64 bit offset which is split into the BGZF block offset and the offset within that decompressed block.

See: http://samtools.sourceforge.net/SAM1.pdf

Peter
___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
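The split 64 bit offset mentioned above can be illustrated as follows. In the SAM/BAM specification the upper 48 bits of a virtual file offset hold the byte position of the BGZF block in the compressed file and the lower 16 bits hold the position inside that block once decompressed. The function names here are made up for the sketch.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a BGZF block start and an in-block offset into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Return (block_start, within_block) for a packed virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because a BGZF block holds at most 64 KiB of uncompressed data, 16 bits are enough for the in-block part, and an index only needs to store one such integer per position of interest.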
Re: [galaxy-dev] Tool shed and datatypes
As I understand it, Isilon is built up from "bricks" that have storage and compute power. They replicate files amongst themselves, so that for every IO request there are multiple systems that could respond. They are interconnected by an ultra fast fibre backbone.

So, depending on your topology, it's possible to get a lot more throughput by working on different sections of the same file from different physical computers.

I haven't delved into BGZF, so I can't comment. My approach to block GZIP was just to concatenate multiple GZIP files and keep a record of the offsets and sequences contained in each. The advantage is compatibility, in that it decompresses just like it was one big chunk, yet you can compose subsets of the data without decompressing/recompressing and (as long as we actually have to write out the file subsets) can reap the reduced IO benefits of smaller writes.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com

-----Original Message-----
From: Peter Cock [mailto:p.j.a.c...@googlemail.com]
Sent: Thursday, October 06, 2011 8:16 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John wrote:
> I'd be up for something like that, although I have other tasking
> in the short term after I finish my parallelism work. I'd rather not have
> Galaxy do the compression/decompression, because that will not
> effectively utilize the distributed nature of many filesystems, such
> as Isilon, that our customers use.

Is that like a compressed filesystem, where there is probably less benefit to storing the data gzipped?

> My parallelism work (second
> phase almost done) handles that by using a block-gzipped
> format and index files that allow the compute nodes to seek to
> the blocks they need and extract from there.
How similar is your block-gzipped approach to BGZF used in BAM?

> Another thing that should probably go along with this is an
> enhancement to metadata such that it can be fed in from the
> outside. We upload files by linking to file paths, and at that
> point, we know everything about the files (index information).
> So no need to decompress a 500GB file and read the whole
> thing just to count the lines - all you have to do is ask ;-}

I can see how that might be useful.

Peter
___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John wrote:
> I'd be up for something like that, although I have other tasking
> in the short term after I finish my parallelism work. I'd rather not have
> Galaxy do the compression/decompression, because that will not
> effectively utilize the distributed nature of many filesystems, such
> as Isilon, that our customers use.

Is that like a compressed filesystem, where there is probably less benefit to storing the data gzipped?

> My parallelism work (second
> phase almost done) handles that by using a block-gzipped
> format and index files that allow the compute nodes to seek to
> the blocks they need and extract from there.

How similar is your block-gzipped approach to BGZF used in BAM?

> Another thing that should probably go along with this is an
> enhancement to metadata such that it can be fed in from the
> outside. We upload files by linking to file paths, and at that
> point, we know everything about the files (index information).
> So no need to decompress a 500GB file and read the whole
> thing just to count the lines - all you have to do is ask ;-}

I can see how that might be useful.

Peter
___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
I'd be up for something like that, although I have other tasking in the short term after I finish my parallelism work. I'd rather not have Galaxy do the compression/decompression, because that will not effectively utilize the distributed nature of many filesystems, such as Isilon, that our customers use. My parallelism work (second phase almost done) handles that by using a block-gzipped format and index files that allow the compute nodes to seek to the blocks they need and extract from there.

Another thing that should probably go along with this is an enhancement to metadata such that it can be fed in from the outside. We upload files by linking to file paths, and at that point, we know everything about the files (index information). So no need to decompress a 500GB file and read the whole thing just to count the lines - all you have to do is ask ;-}

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com

-----Original Message-----
From: Peter Cock [mailto:p.j.a.c...@googlemail.com]
Sent: Thursday, October 06, 2011 1:28 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John wrote:
> One of the things we're facing is the sheer size of a whole human genome at
> 30x coverage. An effective way to deal with that is by compressing the FASTQ
> files. That works for BWA and our ELAND, which can directly read a
> compressed FASTQ, but other tools crash when reading compressed FASTQ
> files. One way to address that would be to introduce a new type, for
> example "CompressedFastQ", with a conversion to FASTQ defined. BWA could
> take both types as input. This would allow the best of both worlds -
> efficient storage and use by all existing tools.
We'd discussed this and a more general approach where any file could be gzipped, but the code to do that doesn't exist yet:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html

Issue filed:
https://bitbucket.org/galaxy/galaxy-central/issue/666/

That seems a better long term solution than the pragmatic short term solution of fastqsanger-gzip (or whatever it gets called). Note that it sounded like Edward Kirton might already be using this - you should be consistent.

The other strong idea from that thread was moving from FASTQ to unaligned BAM, which is gzip compressed, and has explicit support for paired end reads, read groups, etc.

Peter
___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John wrote:
> One of the things we’re facing is the sheer size of a whole human genome at
> 30x coverage. An effective way to deal with that is by compressing the FASTQ
> files. That works for BWA and our ELAND, which can directly read a
> compressed FASTQ, but other tools crash when reading compressed FASTQ
> files. One way to address that would be to introduce a new type, for
> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could
> take both types as input. This would allow the best of both worlds –
> efficient storage and use by all existing tools.

We'd discussed this and a more general approach where any file could be gzipped, but the code to do that doesn't exist yet:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html

Issue filed:
https://bitbucket.org/galaxy/galaxy-central/issue/666/

That seems a better long term solution than the pragmatic short term solution of fastqsanger-gzip (or whatever it gets called). Note that it sounded like Edward Kirton might already be using this - you should be consistent.

The other strong idea from that thread was moving from FASTQ to unaligned BAM, which is gzip compressed, and has explicit support for paired end reads, read groups, etc.

Peter
___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
Re: [galaxy-dev] Tool shed and datatypes
One of the things we're facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example "CompressedFastQ", with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds - efficient storage and use by all existing tools.

Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats - we'd have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com

From: Greg Von Kuster [mailto:g...@bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?

Thanks!

On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:

Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com
___
Please keep all replies on the list by using "reply all" in your mail client.
To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/ Greg Von Kuster Galaxy Development Team g...@bx.psu.edu<mailto:g...@bx.psu.edu> ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
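The "conversion to FASTQ" that a "CompressedFastQ" type would need could be as small as the sketch below. It is a plain standard-library illustration, not a Galaxy converter: the function name is hypothetical, and the tool wrapper and datatype registration that a real converter requires are left out.

```python
import gzip
import shutil


def gzipped_fastq_to_fastq(src_path, dest_path):
    """Stream a gzip-compressed FASTQ file to an uncompressed copy.

    The data is copied in chunks, so the whole file never has to fit
    in memory, which matters for whole-genome inputs.
    """
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

Tools that read compressed FASTQ directly, such as BWA in the example above, would accept the compressed type as it is; only the remaining tools would trigger a conversion like this one.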
Re: [galaxy-dev] Tool shed and datatypes
Hello John, The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this? Thanks! On Oct 5, 2011, at 1:52 PM, Duddy, John wrote: > Can we introduce new file types via tools in the tool shed? It seems Galaxy > can load them if they are in the datatypes configuration file. Does tool > installation automate the editing of that file? > > > John Duddy > Sr. Staff Software Engineer > Illumina, Inc. > 9885 Towne Centre Drive > San Diego, CA 92121 > Tel: 858-736-3584 > E-mail: jdu...@illumina.com > > ___ > Please keep all replies on the list by using "reply all" > in your mail client. To manage your subscriptions to this > and other Galaxy lists, please use the interface at: > > http://lists.bx.psu.edu/ Greg Von Kuster Galaxy Development Team g...@bx.psu.edu ___ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
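For readers wondering what "automatically edit the datatypes_conf.xml file" might involve, here is a rough sketch using only the Python standard library. It assumes the usual layout of that file, a `<datatypes>` root with `<datatype>` entries under `<registration>`, each carrying `extension` and `type` attributes. The function name is hypothetical and this is not how the tool shed implements or would implement the feature.

```python
import xml.etree.ElementTree as ET


def add_datatype(conf_xml, extension, type_path, display_in_upload=True):
    """Return conf_xml with a <datatype> entry added under <registration>.

    If an entry with the same extension already exists, the input is
    returned unchanged rather than overwritten.
    """
    root = ET.fromstring(conf_xml)
    registration = root.find("registration")
    if registration is None:
        registration = ET.SubElement(root, "registration")
    for existing in registration.findall("datatype"):
        if existing.get("extension") == extension:
            return conf_xml
    ET.SubElement(
        registration,
        "datatype",
        {
            "extension": extension,
            "type": type_path,
            "display_in_upload": "true" if display_in_upload else "false",
        },
    )
    return ET.tostring(root, encoding="unicode")
```

Refusing to overwrite an existing extension is a deliberate choice in this sketch: two repositories that both claim the same extension is exactly the kind of conflict that would need a policy decision rather than a silent edit.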