Re: [galaxy-dev] Tool shed and datatypes

2012-01-06 Thread Greg Von Kuster
Hello Jim,

Thanks for sending your converter.  I've committed change set 6484:4fdceec512f5 
to our central repository.  I now have things working for properly handling 
proprietary datatype converters and indexers.  I've also added the following 
paragraph to the tool shed wiki.  It doesn't apply to your mothur data types 
since you use only 1 converter, but you should be aware of this requirement for 
future tool development.  If you make the changes we discussed yesterday to 
your mothur tool suite (and add your missing converter), all data types and 
the converter should properly load when your repository is installed to a local 
Galaxy instance. 

Thanks very much for your help on this, and please let me know if you bump into 
any issues.

If you include datatype converters or indexers in your repository, all 
converter files (the disk file referred to by the value of the "file" 
attribute) must be located in the same directory in your repository hierarchy.  
The same requirement applies to indexers.  If you include both converters and 
indexers in your repository, the relevant files may all be located within the 
same directory or you could decide to keep all converters in one directory and 
all indexers in a different directory within your repository hierarchy.  This 
is critical because the Galaxy components that load these proprietary items 
assume they are all located in the same directory.
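
A rough way to check this requirement before uploading is sketched below in 
Python.  It is not Galaxy code: the function name and the file index are 
made up for illustration, and the XML layout is assumed to follow the usual 
datatypes_conf.xml structure.

```python
import os
import xml.etree.ElementTree as ET


def converter_dirs(conf_xml, file_index):
    """Return the set of directories holding the converter files named in conf_xml.

    conf_xml   -- text of a datatypes_conf.xml file
    file_index -- mapping of file name -> repository-relative path (hypothetical;
                  in practice it could be built with os.walk over the repository)
    """
    root = ET.fromstring(conf_xml)
    dirs = set()
    for converter in root.iter("converter"):
        name = converter.get("file")
        path = file_index.get(name)
        if path is None:
            raise ValueError("converter file not found in repository: %s" % name)
        dirs.add(os.path.dirname(path))
    return dirs


# Illustrative input; the element layout is an assumption.
EXAMPLE_CONF = """
<datatypes>
  <registration>
    <datatype extension="ref.taxonomy" type="galaxy.datatypes.metagenomics:RefTaxonomy">
      <converter file="ref_to_seq_taxonomy_converter.xml" target_datatype="seq.taxonomy"/>
    </datatype>
  </registration>
</datatypes>
"""

EXAMPLE_INDEX = {
    "ref_to_seq_taxonomy_converter.xml":
        "lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.xml",
}

dirs = converter_dirs(EXAMPLE_CONF, EXAMPLE_INDEX)
print("converters live in one directory:", len(dirs) <= 1)
```

The same check could be run separately over indexer files, since converters 
and indexers are allowed to live in two different directories.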




On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:

> I'll also upload those to the toolshed soon. 
> 
> Big Question?
> When I started creating all those datatype classes for mothur, I just labeled 
> the file_ext as mothur generated them for its output files.  
> Will we have namespace issues?  
> Should the file_ext fields include the toolshed name, e.g.   should "otu"   
> be named "mothur.otu" to avoid conflicts with other downloaded tools from the 
> toolshed?  
> Seems like this would be the time to establish rules/practices for such 
> concerns.
> 
> JJ
> 
> 
> 
> On 1/5/12 3:06 PM, Greg Von Kuster wrote:
>> 
>> I will make sure that the converters are functional when installed, but I'm 
>> fairly sure it is currently not working.  If you could pass your 2 files 
>> along to me, I'll make sure to fix whatever bugs may exist.
>> 
>> On Jan 5, 2012, at 3:51 PM, Jim Johnson wrote:
>> 
>>> 
>>> This was a converter that I used on my local installation, but forgot to 
>>> include for the ToolShed:
>>> 
>>> <datatype extension="ref.taxonomy" type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true">
>>>     <converter file="ref_to_seq_taxonomy_converter.xml" target_datatype="seq.taxonomy"/>
>>> </datatype>
>>> 
>>> 
>>> $ find lib  -name ref_to_seq_taxonomy_converter.xml
>>> lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.xml
>>> $ find lib  -name ref_to_seq_taxonomy_converter.py 
>>> lib/galaxy/datatypes/converters/ref_to_seq_taxonomy_converter.py
>>> 
>>> I'll add those 2 files to my repository along with the other changes you 
>>> specified.   
>>> Can converters as such be auto-installed as well?   
>>> 
>>> Thanks,
>>> 
>>> JJ
>>> 
>>> 
>>> 
>>> On 1/5/12 2:14 PM, Greg Von Kuster wrote:
 
>>>> Hi Jim,
>>>> 
>>>> Here are the changes you'll need to make to your mothur tool suite.
>>>> 
>>>> CHANGE 1
>>>> 
>>>> Add the following datatypes_conf.xml file to your repository.
>>>> 
>>>> display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:OtuList" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Sabund" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Rabund" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:SharedRabund" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:RelAbund" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Names" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Design" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Summary" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Group" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:Oligos" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:SequenceAlignment" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:AccNos" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:SecondaryStructureMap" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:AlignCheck" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:AlignReport" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:LaneMask" display_in_upload="true"/>
>>>> type="galaxy.datatypes.metagenomics:DistanceMatrix" display_in_upload="true"/>

Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Of course, your approach of prepending the repository name would probably 
eliminate any future issue in this regard.  Whatever you feel is best...   ;)

On Jan 5, 2012, at 4:49 PM, Greg Von Kuster wrote:

> Yes, this is certainly important, but I think the hope is that proprietary 
> data types will not become so prevalent that name-spacing the extensions is 
> necessary. 
> 
> On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:
> 
>> 
>> Big Question?
>> When I started creating all those datatype classes for mothur, I just 
>> labeled the file_ext as mothur generated them for its output files.  
>> Will we have namespace issues?  
>> Should the file_ext fields include the toolshed name, e.g.   should "otu"   
>> be named "mothur.otu" to avoid conflicts with other downloaded tools from 
>> the toolshed?  
>> Seems like this would be the time to establish rules/practices for such 
>> concerns.
>> 
>> JJ
>> 
>> 
>> 
> 
> 
> ___
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> 
> http://lists.bx.psu.edu/

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu




___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Yes, this is certainly important, but I think the hope is that proprietary data 
types will not become so prevalent that name-spacing the extensions is 
necessary. 

On Jan 5, 2012, at 4:15 PM, Jim Johnson wrote:

> 
> Big Question?
> When I started creating all those datatype classes for mothur, I just labeled 
> the file_ext as mothur generated them for its output files.  
> Will we have namespace issues?  
> Should the file_ext fields include the toolshed name, e.g.   should "otu"   
> be named "mothur.otu" to avoid conflicts with other downloaded tools from the 
> toolshed?  
> Seems like this would be the time to establish rules/practices for such 
> concerns.
> 
> JJ
> 
> 
> 




Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Of course, this assumes that there is not more than one datatypes class module 
in your repository with the same name.  This would definitely pose problems, so 
care should be taken that it is not done.
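
A quick way to spot such a clash before uploading is sketched below.  This is 
only an illustration, not part of Galaxy; the function name and the example 
paths are hypothetical.

```python
import os
from collections import defaultdict


def duplicate_module_names(paths):
    """Map each .py file name that occurs more than once to its sorted paths."""
    by_name = defaultdict(list)
    for path in paths:
        name = os.path.basename(path)
        if name.endswith(".py"):
            by_name[name].append(path)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


# Hypothetical repository listing; in practice it could come from os.walk.
paths = [
    "lib/galaxy/datatypes/gmap.py",
    "extras/gmap.py",
    "lib/galaxy/datatypes/metagenomics.py",
]
print(duplicate_module_names(paths))
```

An empty result means every class module name in the listing is unique.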

On Jan 5, 2012, at 3:29 PM, Greg Von Kuster wrote:

>  However, your datatype class module files will be found no matter where they 
> are located within your repository hierarchy.
> 
> 
> On Jan 5, 2012, at 3:25 PM, Jim Johnson wrote:
> 
>> Greg,
>> 
>> I have been putting datatype def files in relative path:   
>> lib/galaxy/datatypes/
>> This was just to make it more clear for someone manually modifying their own 
>> galaxy installation.  
>> Is there any preferred best practice for where a datatypes implementation 
>> file should be?
>> 
>> Thanks,
>> 
>> JJ
>> 







Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Your approach is great since it models the Galaxy distribution, and as you say, 
makes it clear to those downloading your repository.  However, your datatype 
class module files will be found no matter where they are located within your 
repository hierarchy.


On Jan 5, 2012, at 3:25 PM, Jim Johnson wrote:

> Greg,
> 
> I have been putting datatype def files in relative path:   
> lib/galaxy/datatypes/
> This was just to make it more clear for someone manually modifying their own 
> galaxy installation.  
> Is there any preferred best practice for where a datatypes implementation 
> file should be?
> 
> Thanks,
> 
> JJ
> 
> On 1/5/12 1:38 PM, Greg Von Kuster wrote:
>> 
>> Hello Jim,
>> 
>> I've implemented support for proprietary datatypes that use class modules 
>> included in tool shed repositories.  To see how this works, you'll need at 
>> least change set revision 6479:4d131422777f, which is currently available 
>> only from our central repo at https://bitbucket.org/galaxy/galaxy-central.
>> 
>> I've documented the way this works in the following 2 sections of the tool 
>> shed wiki.  In the second section, I've taken the liberty of using your gmap 
>> tool repository as an example.  I hope you don't mind.  I've written the 
>> document section assuming that your gmap repository includes the 2 changes 
>> I've described below.
>> 
>> http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_subclass_from_Galaxy_data_types_in_the_distribution
>> http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_use_class_modules_included_in_your_repository
>> 
>> There are 2 categories of datatypes that are currently supported:
>> 
>> 1. data types that subclass from the datatype classes included in the Galaxy 
>> distribution - these require no code files that define proprietary datatype 
>> classes to be included in the tool shed repository, and are documented in 
>> the first wiki section listed above.
>> 
>> 2. datatypes that use proprietary classes defined in code files included in 
>> the tool shed repository - documented in the second wiki section listed 
>> above.  Your gmap tool suite falls into this category.
>> 
>> If you make the following changes to your gmap tool suite, your proprietary 
>> data types will automatically load into a local Galaxy instance when the 
>> Galaxy admin installs your tool suite to that instance.  The data types will 
>> be loaded at the time of installation as well as whenever the Galaxy server 
>> is stopped / restarted.  I'll send you a separate message detailing the 
>> changes you'll need to make to your mothur tool suite.
>> 
>> 
>> CHANGE 1
>> 
>> Add a file named datatypes_conf.xml to your repository.  This is the 
>> approach I'm using to support proprietary datatypes included in tool shed 
>> repositories instead of your proposed addition of datatypes in the tool 
>> config's <requirements> tag set.  The datatypes_conf.xml file can be located 
>> anywhere in the repository, but the obvious location for your gmap 
>> repository is your ~/tool-data directory.
>> 
>> This file should contain the following datatype definitions.
>> 
>> 
>> display_in_upload="False"/>
>> type="galaxy.datatypes.gmap:GmapSnpIndex" display_in_upload="False"/>
>> type="galaxy.datatypes.gmap:IntervalIndexTree" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:SpliceSitesIntervalIndexTree" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:IntronsIntervalIndexTree" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:SNPsIntervalIndexTree" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:IntervalAnnotation" display_in_upload="False"/>
>> type="galaxy.datatypes.gmap:SpliceSiteAnnotation" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:IntronAnnotation" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:SNPAnnotation" display_in_upload="True"/>
>> 
>> I noticed that your README in your current gmap repository on the main 
>> Galaxy tool shed includes the following datatype definitions, but they refer 
>> to classes that are not included in your repository so I've eliminated them 
>> from the above datatypes_conf.xml file.  You may need to add the classes to 
>> your current gmap.py datatypes class file and add them to the above 
>> datatypes_conf.xml file if your tools actually require them.
>> 
>> type="galaxy.datatypes.gmap:TallyIntervalIndexTree" display_in_upload="True"/>
>> type="galaxy.datatypes.gmap:TallyAnnotation" display_in_upload="True"/>
>> display_in_upload="True"/>
>> 
>> 
>> CHANGE 2
>> 
>> Modules that include proprietary datatype class definitions cannot use 
>> relative import references for imported modules.  Imports must be defined as 
>> a

Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Hi Jim,

Here are the changes you'll need to make to your mothur tool suite.

CHANGE 1

Add the following datatypes_conf.xml file to your repository.

I'm probably not correctly handling the converter for your ref.taxonomy data 
type - I've not been able to find the ref_to_seq_taxonomy_converter.xml file.  
Can you pass it along to me so I can see if I have some debugging to do?

Also, I've eliminated the following entry from your README in the above file 
because the Newick class is not included in your metagenomics.py class module.  
It seems you may have included the Newick class in your local copy of 
~/lib/galaxy/datatypes/data.py.  If your tools use this class, it should be 
added to either your metagenomics.py class file or another class file in your 
repository and the value of the "type" attribute in the following should be 
changed accordingly.



CHANGE 2
---

The following relative imports in your metagenomics.py class module:

import data
from sniff import *

need to look like this:

from galaxy.datatypes import data
from galaxy.datatypes.sniff import *
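
For anyone wanting to scan a class module for this kind of import, here is a 
minimal sketch using Python's ast module.  It is not part of Galaxy.  The set 
of sibling module names is an assumption (only data, sniff and metadata are 
mentioned in this thread), and aliases such as "import data as d" are not 
handled.

```python
import ast

# Module names assumed to live in Galaxy's lib/galaxy/datatypes package.
SIBLING_MODULES = {"data", "sniff", "metadata"}


def implicit_relative_imports(source, siblings=SIBLING_MODULES, package="galaxy.datatypes"):
    """Return (line number, suggested replacement) for each implicit relative import."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in siblings:
                    findings.append((node.lineno, "from %s import %s" % (package, alias.name)))
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module in siblings:
            names = ", ".join(alias.name for alias in node.names)
            findings.append((node.lineno, "from %s.%s import %s" % (package, node.module, names)))
    return sorted(findings)


source = "import data\nfrom sniff import *\nfrom galaxy import util\n"
for lineno, suggestion in implicit_relative_imports(source):
    print(lineno, suggestion)
```

Imports that are already absolute, such as "from galaxy import util", are 
left alone because their top-level name is not in the sibling set.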

CHANGE 3
---
You can optionally choose to remove your suite_config.xml file from your 
repository as it is no longer used in any way.

Thanks!

Greg Von Kuster


On Oct 18, 2011, at 11:03 AM, Jim Johnson wrote:

> Greg,
> 
> The mothur_toolsuite in the ToolShed  contains a file with added datatypes 
> for metagenomics (used by mothur and some by qiime):
> mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py
> The README has info on how I incorporated mothur into our local galaxy server.
> 
> I'm also working on GMAP/GSNAP  (  http://research-pub.gene.com/gmap/ )
> So far I've created a GmapDB class,  analogous to the ngsindex.BowtieIndex 
> class, but with more metadata.
> I'm also adding a IntervalIndexTree class for indexing maps of splice 
> junctions, introns, and SNPs.
> I'll send you this as soon as I've got it working.
> 
> Thanks,
> 
> JJ
> 

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu




Re: [galaxy-dev] Tool shed and datatypes

2012-01-05 Thread Greg Von Kuster
Hello Jim,

I've implemented support for proprietary datatypes that use class modules 
included in tool shed repositories.  To see how this works, you'll need at 
least change set revision 6479:4d131422777f, which is currently available only 
from our central repo at https://bitbucket.org/galaxy/galaxy-central.

I've documented the way this works in the following 2 sections of the tool shed 
wiki.  In the second section, I've taken the liberty of using your gmap tool 
repository as an example.  I hope you don't mind.  I've written the document 
section assuming that your gmap repository includes the 2 changes I've 
described below.

http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_subclass_from_Galaxy_data_types_in_the_distribution
http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_use_class_modules_included_in_your_repository

There are 2 categories of datatypes that are currently supported:

1. data types that subclass from the datatype classes included in the Galaxy 
distribution - these require no code files that define proprietary datatype 
classes to be included in the tool shed repository, and are documented in the 
first wiki section listed above.

2. datatypes that use proprietary classes defined in code files included in the 
tool shed repository - documented in the second wiki section listed above.  
Your gmap tool suite falls into this category.

If you make the following changes to your gmap tool suite, your proprietary 
data types will automatically load into a local Galaxy instance when the Galaxy 
admin installs your tool suite to that instance.  The data types will be loaded 
at the time of installation as well as whenever the Galaxy server is stopped / 
restarted.  I'll send you a separate message detailing the changes you'll need 
to make to your mothur tool suite.


CHANGE 1

Add a file named datatypes_conf.xml to your repository.  This is the approach 
I'm using to support proprietary datatypes included in tool shed repositories 
instead of your proposed addition of datatypes in the tool config's 
<requirements> tag set.  The datatypes_conf.xml file can be located anywhere in 
the repository, but the obvious location for your gmap repository is your 
~/tool-data directory.
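
To give a feel for what such a file carries, the sketch below reads datatype 
entries with Python's standard XML parser.  It is not how Galaxy itself loads 
the file.  The element layout is assumed, and the single example entry pairs 
the gmapdb extension with the GmapDB class named elsewhere in this thread; its 
display_in_upload value is a guess.

```python
import xml.etree.ElementTree as ET


def read_datatypes(conf_xml):
    """Return a list of (extension, module, class name, display_in_upload) tuples."""
    entries = []
    for elem in ET.fromstring(conf_xml).iter("datatype"):
        # The type attribute has the form "module.path:ClassName".
        module, _, class_name = elem.get("type", "").partition(":")
        shown = elem.get("display_in_upload", "false").lower() == "true"
        entries.append((elem.get("extension"), module, class_name, shown))
    return entries


# Illustrative content only; the display_in_upload value here is assumed.
EXAMPLE_CONF = """
<datatypes>
  <registration>
    <datatype extension="gmapdb" type="galaxy.datatypes.gmap:GmapDB" display_in_upload="False"/>
  </registration>
</datatypes>
"""

for entry in read_datatypes(EXAMPLE_CONF):
    print(entry)
```

Splitting the type attribute at the colon gives the module to import and the 
class to look up inside it.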

This file should contain the following datatype definitions.

I noticed that your README in your current gmap repository on the main Galaxy 
tool shed includes the following datatype definitions, but they refer to 
classes that are not included in your repository so I've eliminated them from 
the above datatypes_conf.xml file.  You may need to add the classes to your 
current gmap.py datatypes class file and add them to the above 
datatypes_conf.xml file if your tools actually require them.


CHANGE 2

Modules that include proprietary datatype class definitions cannot use relative 
import references for imported modules.  Imports must be defined as absolute 
from the galaxy subdirectory inside the Galaxy root's lib subdirectory.  So for 
your ~/lib/galaxy/datatypes/gmap.py datatypes module in your gmap repository, 
the following changes are necessary.

Your current imports look like this:

import logging
import os,os.path,re
import data
from data import Text
from galaxy import util
from metadata import MetadataElement

But they need to be changed to this - note the elimination of relative imports:

import logging
import os,os.path,re
from galaxy.datatypes import data
from galaxy.datatypes.data import Text
from galaxy import util
from galaxy.datatypes.metadata import MetadataElement

Thanks very much for helping out with this, and please let me know if you bump 
into any problems.

Greg Von Kuster


On Oct 21, 2011, at 1:13 PM, Jim Johnson wrote:

> Greg,
> 
> I put the gmap tool suite in the galaxy Tool Shed,  let me know if there is 
> more I should do.  
>   
> It has 5 galaxy tools:
> GMAP            - Genomic Mapping and Alignment Program for mRNA and EST sequences
> GSNAP           - Genomic Short-read Nucleotide Alignment Program
> GMAP Build      - a database genome index for GMAP and GSNAP
>                   (calls: gmap_build, iit_store, snpindex, cmetindex, atoiindex)
> GMAP SNP Index  - build index files for known SNPs
>                   (calls: iit_store, snpindex)
> GMAP IIT        - Create a map store for known genes or SNPs
>                   (calls: iit_store)
> 
> It uses these added datatypes:
> % grep -E '(^class | file_ext)' lib/galaxy/datatypes/gmap.py 
> class GmapDB( Text ):
>     file_ext = 'gmapdb'
> class GmapSnpIndex( Text ):
>     file_ext = 'gmapsnpindex'
> class IntervalIndexTree( Text ):
>     file_ext = 'iit'
> class SpliceSitesIntervalIndexTree( IntervalIndexTree ):
>     file_ext = 'splicesites.iit'
> class IntronsIntervalIndexTree( Int

Re: [galaxy-dev] Tool shed and datatypes

2011-11-08 Thread Duddy, John
Ahh - sorry. I finally found the format specification for BGZF in the SAM 
format specification, and it seems that it is 100% GZIP-compatible. There is 
still the issue of needing an external file index, since all BGZF seems to give 
you is the size of the compressed block, not anything format-specific, like the 
number of sequences in the block.

In any case, whether it's GZIP or BGZF, it seems the solutions are very 
similar, and porting my work should be pretty simple - I just used larger 
blocks and put all the data in the index file and none in the headers.
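
The idea of whole compressed blocks plus an external index can be sketched 
with Python's gzip module.  This is an illustration under stated assumptions, 
not the code described above: the index layout (offset, compressed length, 
line count) and the function names are invented here.

```python
import gzip


def write_members(chunks):
    """Compress each chunk as its own gzip member.

    Returns the concatenated bytes and an external index of
    (offset, compressed length, line count) per member.
    """
    blob = bytearray()
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk)
        index.append((len(blob), len(member), chunk.count(b"\n")))
        blob.extend(member)
    return bytes(blob), index


def extract_members(blob, index, first, last):
    """Copy whole members first..last (inclusive) into a standalone gzip stream."""
    start = index[first][0]
    end = index[last][0] + index[last][1]
    return blob[start:end]


chunks = [b"read1\n", b"read2\nread3\n", b"read4\n"]
blob, index = write_members(chunks)
subset = extract_members(blob, index, 1, 2)
print(gzip.decompress(subset))
```

Because each member is a complete gzip stream, a byte range that starts and 
ends on member boundaries can be decompressed without touching the rest of 
the file.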

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Tuesday, November 08, 2011 4:04 PM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Tue, Nov 8, 2011 at 11:45 PM, Duddy, John wrote:
> It's not public yet, and it involves a little conundrum - we want
> it so we can support large amounts of data efficiently on a variety
> of aligners, including our ELAND from CASAVA. However, ELAND
> does not support unaligned BAM inputs yet, and apparently it
> would be a lot of work to make it so (and another team's area
> of responsibility as well).

OK, so using (unaligned) BAM isn't about to happen.

> So in the near term, BGZF would not meet our needs.
>

I don't follow you there, BAM != BGZF.

We can use BGZF to compress FASTQ, FASTA, GenBank,
basically anything. You get compression approaching that
of plain GZIP (depending on the characteristics of the data)
plus efficient random access.

> However, work is quite far along on a GZIP-based one
> that works with ELAND and BWA, since they both read
> GZIP FASTQ files, and works/will work with a converter
> to fastq_sanger for other tools.
>
> I can put you in touch with the engineer doing the work if
> you are interested.

That might be a good idea, or ask them to post here?

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-11-08 Thread Peter Cock
On Tue, Nov 8, 2011 at 11:45 PM, Duddy, John wrote:
> It's not public yet, and it involves a little conundrum - we want
> it so we can support large amounts of data efficiently on a variety
> of aligners, including our ELAND from CASAVA. However, ELAND
> does not support unaligned BAM inputs yet, and apparently it
> would be a lot of work to make it so (and another team's area
> of responsibility as well).

OK, so using (unaligned) BAM isn't about to happen.

> So in the near term, BGZF would not meet our needs.
>

I don't follow you there, BAM != BGZF.

We can use BGZF to compress FASTQ, FASTA, GenBank,
basically anything. You get compression approaching that
of plain GZIP (depending on the characteristics of the data)
plus efficient random access.

> However, work is quite far along on a GZIP-based one
> that works with ELAND and BWA, since they both read
> GZIP FASTQ files, and works/will work with a converter
> to fastq_sanger for other tools.
>
> I can put you in touch with the engineer doing the work if
> you are interested.

That might be a good idea, or ask them to post here?

Peter


Re: [galaxy-dev] Tool shed and datatypes

2011-11-08 Thread Duddy, John
BTW - the pull request for the GZIP-based splitting is actually integrated - I 
was referring to the GZIP-based datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Tuesday, November 08, 2011 3:29 PM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote:
> GZIP files are definitely our plan. I just finished testing the code
> that distributes the processing of a FASTQ (or pair for PE) to an
> arbitrary number of tasks, where each subtask extracts just the
> data it needs without reading any of the file it does not need. It
> extracts the blocks of GZIPped data into a standalone GZIP file
> just by copying whole blocks and appending them (if the window
> is not aligned perfectly, there is additional processing). Since
> the entire file does not need to be read, it distributes quite nicely.
>
> I'll be preparing a pull request for it soon.
>
>
> John Duddy

Hi John,

Is your pull request public yet? I'd like to know more about
your GZIP based plan (and how it differs from BGZF). It
would seem silly to reinvent something slightly different
if an existing and well tested mechanism like BGZF (used
in BAM files) would work.

BGZF is based on GZIP with blocks each up to 64kb,
where the block size is recorded in the GZIP block
header. This may be more fine grained than the block
sizes you are using, but should serve equally well for
distribution of data chunks between machines/cores.

I appreciate the SAM/BAM specification where BGZF is
defined is quite dry reading, and the broad potential of
this GZIP variant beyond BAM is not articulated clearly.
So I've written a blog post about how BGZF can be used
for efficient random access to sequential files (in the
sense of one self contained record after another, e.g.
many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open
Galaxy feature request for general support of gzipped
data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-11-08 Thread Duddy, John
It's not public yet, and it involves a little conundrum - we want it so we can 
support large amounts of data efficiently on a variety of aligners, including 
our ELAND from CASAVA. However, ELAND does not support unaligned BAM inputs 
yet, and apparently it would be a lot of work to make it so (and another team's 
area of responsibility as well). So in the near term, BGZF would not meet our 
needs.

However, work is quite far along on a GZIP-based one that works with ELAND and 
BWA, since they both read GZIP FASTQ files, and works/will work with a 
converter to fastq_sanger for other tools.

I can put you in touch with the engineer doing the work if you are interested.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Tuesday, November 08, 2011 3:29 PM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote:
> GZIP files are definitely our plan. I just finished testing the code
> that distributes the processing of a FASTQ (or pair for PE) to an
> arbitrary number of tasks, where each subtask extracts just the
> data it needs without reading any of the file it does not need. It
> extracts the blocks of GZIPped data into a standalone GZIP file
> just by copying whole blocks and appending them (if the window
> is not aligned perfectly, there is additional processing). Since
> the entire file does not need to be read, it distributes quite nicely.
>
> I'll be preparing a pull request for it soon.
>
>
> John Duddy

Hi John,

Is your pull request public yet? I'd like to know more about
your GZIP based plan (and how it differs from BGZF). It
would seem silly to reinvent something slightly different
if an existing and well tested mechanism like BGZF (used
in BAM files) would work.

BGZF is based on GZIP with blocks each up to 64kb,
where the block size is recorded in the GZIP block
header. This may be more fine grained than the block
sizes you are using, but should serve equally well for
distribution of data chunks between machines/cores.

I appreciate the SAM/BAM specification where BGZF is
defined is quite dry reading, and the broad potential of
this GZIP variant beyond BAM is not articulated clearly.
So I've written a blog post about how BGZF can be used
for efficient random access to sequential files (in the
sense of one self contained record after another, e.g.
many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open
Galaxy feature request for general support of gzipped
data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-11-08 Thread Peter Cock
On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John wrote:
> GZIP files are definitely our plan. I just finished testing the code
> that distributes the processing of a FASTQ (or pair for PE) to an
> arbitrary number of tasks, where each subtask extracts just the
> data it needs without reading any of the file it does not need. It
> extracts the blocks of GZIPped data into a standalone GZIP file
> just by copying whole blocks and appending them (if the window
> is not aligned perfectly, there is additional processing). Since
> the entire file does not need to be read, it distributes quite nicely.
>
> I'll be preparing a pull request for it soon.
>
>
> John Duddy

Hi John,

Is your pull request public yet? I'd like to know more about
your GZIP based plan (and how it differs from BGZF). It
would seem silly to reinvent something slightly different
if an existing and well tested mechanism like BGZF (used
in BAM files) would work.

BGZF is based on GZIP with blocks each up to 64kb,
where the block size is recorded in the GZIP block
header. This may be more fine grained than the block
sizes you are using, but should serve equally well for
distribution of data chunks between machines/cores.
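
As a rough illustration of the block size living in the header, the Python 
sketch below builds two small BGZF-style blocks by hand and then walks them 
using only the stored sizes.  The field layout follows my reading of the 
SAM/BAM specification; the helper names are made up, the sketch assumes the 
"BC" subfield is the only extra subfield, and it omits the empty end-of-file 
block that real BGZF files carry.

```python
import gzip
import struct
import zlib


def make_block(data):
    """Build one BGZF-style gzip member holding data (must fit in a 64 KiB block)."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib wrapper
    cdata = deflater.compress(data) + deflater.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block length minus one
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Extra subfield "BC" with a 2-byte payload holding the block size
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def block_sizes(blob):
    """Walk the blocks using only the size stored in each header."""
    sizes = []
    pos = 0
    while pos < len(blob):
        xlen = struct.unpack_from("<H", blob, pos + 10)[0]
        extra = blob[pos + 12:pos + 12 + xlen]
        si1, si2, slen, bsize = struct.unpack_from("<BBHH", extra, 0)
        if (si1, si2, slen) != (ord("B"), ord("C"), 2):
            raise ValueError("not a BGZF-style block at offset %d" % pos)
        sizes.append(bsize + 1)
        pos += bsize + 1
    return sizes


blob = make_block(b"ACGT\n") + make_block(b"TTGA\n")
print(block_sizes(blob))
print(gzip.decompress(blob))
```

The ordinary gzip reader decompresses the concatenated blocks unchanged, 
while the stored sizes let a reader jump from block to block without 
inflating anything.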

I appreciate the SAM/BAM specification where BGZF is
defined is quite dry reading, and the broad potential of
this GZIP variant beyond BAM is not articulated clearly.
So I've written a blog post about how BGZF can be used
for efficient random access to sequential files (in the
sense of one self contained record after another, e.g.
many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open
Galaxy feature request for general support of gzipped
data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter


Re: [galaxy-dev] Tool shed and datatypes

2011-10-21 Thread Jim Johnson

On 10/21/11 12:29 PM, James Taylor wrote:

Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +:

I put the gmap tool suite in the galaxy Tool Shed,  let me know if there is 
more I should do.

Awesome!


I added a requirement tag for the datatypes to the tool-configs:

 % grep 'requirement.*datatype' *.xml
 gmap_build.xml:gmapdb

Requirement tags for datatypes are an interesting idea, but I'm
wondering if this is something we should require? It seems like all this
information is implicit -- a tool requires a datatype if it has an input
or output parameter that references that type. Is there other
information that should go in the requirement tag?


That is certainly correct that the tag would be redundant, the tool config 
parser could identify the list of datatype formats.
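As a rough illustration of how a parser could recover that list, here is a small Python sketch. It is not Galaxy's actual tool parser; the CORE_FORMATS set, the function names and the example tool XML are all made up for the example. It only assumes the usual tool config layout, where data inputs are param elements with type="data" and a format attribute, and outputs are data elements with a format attribute.

```python
import xml.etree.ElementTree as ET

# Assumption: a handful of extensions standing in for the core distribution's list.
CORE_FORMATS = {"data", "txt", "tabular", "fasta", "fastq", "sam", "bam"}

# Hypothetical tool config, used only to exercise the functions below.
EXAMPLE_TOOL = """
<tool id="example_tool" name="Example">
  <inputs>
    <param name="fasta" type="data" format="fasta"/>
    <param name="map" type="data" format="ssmap"/>
    <param name="label" type="text"/>
  </inputs>
  <outputs>
    <data name="out" format="tabular"/>
  </outputs>
</tool>
"""


def formats_used(tool_xml):
    """Collect every format named by the data inputs and outputs of a tool config."""
    root = ET.fromstring(tool_xml)
    found = set()
    for elem in root.findall("./inputs//param") + root.findall("./outputs//data"):
        if elem.tag == "param" and elem.get("type") != "data":
            continue  # text, integer, select, ... parameters carry no datatype
        for fmt in elem.get("format", "").split(","):
            if fmt.strip():
                found.add(fmt.strip())
    return found


def extra_formats(tool_xml, core=CORE_FORMATS):
    """Formats a tool references that are not in the core set."""
    return formats_used(tool_xml) - set(core)
```

With the example above, extra_formats(EXAMPLE_TOOL) picks out only "ssmap", which is the kind of list an installer would need in order to know which additional datatypes to fetch.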

I was just trying to think of some way to indicate that additional datatypes 
were required above those in the central distribution.
My goal would be to have the installation of tools from the Tool Shed also be 
able to install the extra datatypes that those tools require.

Having datatypes specified separately in the Tool Shed from tools would 
hopefully promote less redundancy of datatypes and better interoperability 
among developers' tools.  For example, the metagenomics applications mothur and 
qiime have many specific formats that are internal to their tools, but also a 
few that might be used to migrate data between those applications.   We'd need 
a way to avoid name clashes, perhaps adopting a namespace pattern for the 
file_ext attribute.
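To sketch what such a namespace pattern might look like (purely hypothetical, this is not how the Galaxy registry behaves), the Python below registers extensions under an optional namespace prefix and refuses to let two different owners claim the same key. ToyRegistry, ExtensionClash and the "qiime.otu" style of name are assumptions made for the illustration.

```python
class ExtensionClash(Exception):
    """Raised when two different owners claim the same extension key."""


class ToyRegistry:
    """Hypothetical stand-in for a datatype registry keyed by file_ext."""

    def __init__(self):
        self._by_ext = {}

    def register(self, file_ext, owner, namespace=None):
        # With a namespace, "otu" from the mothur suite becomes "mothur.otu".
        key = "%s.%s" % (namespace, file_ext) if namespace else file_ext
        current = self._by_ext.get(key)
        if current is not None and current != owner:
            raise ExtensionClash("%r already registered by %s" % (key, current))
        self._by_ext[key] = owner
        return key

    def owner_of(self, key):
        return self._by_ext.get(key)
```

Without a namespace, the second suite to register "otu" gets an error; with one, both "otu" and "qiime.otu" can coexist. The cost is that a format meant to be shared between suites would need one agreed, un-prefixed name.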




Re: [galaxy-dev] Tool shed and datatypes

2011-10-21 Thread Greg Von Kuster
hnson wrote:
>>>> 
>>>>> Greg,
>>>>> 
>>>>> It would be great if there were a way to expand upon the core datatypes 
>>>>> using the ToolShed.
>>>>> 
>>>>> Would it be possible to have a separate datatype repository within the 
>>>>> ToolShed?
>>>>> 
>>>>> Datatype
>>>>>  name=""
>>>>>  description=""
>>>>>  datatype_dependencies=[]
>>>>>  definition=
>>>>> 
>>>>> The tool config could be expanded to have requirement for datatypes.
>>>>>   ssmap
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Table datatype
>>>>>   Column    | Type                   | Modifiers
>>>>> ------------+------------------------+-------------------------------------------------------
>>>>> id          | integer                | not null default nextval('datatype_id_seq'::regclass)
>>>>> name        | character varying(255) |
>>>>> version     | character varying(40)  |
>>>>> description | text                   |
>>>>> definition  | text                   |
>>>>> UNIQUE (name)
>>>>> 
>>>>> Table datatype_datatype_association
>>>>>   Column    | Type    | Modifiers
>>>>> ------------+---------+-------------------------------------------------------
>>>>> id          | integer | not null default nextval('datatype_id_seq'::regclass)
>>>>> datatype_id | integer |
>>>>> requires_id | integer |
>>>>> FOREIGN KEY (datatype_id) REFERENCES datatype(id)
>>>>> FOREIGN KEY (requires_id) REFERENCES datatype(id)
>>>>> 
>>>>> 
>>>>> Then for my mothur metagenomics tools I could define:
>>>>> 
>>>>> name="ssmap"   description="Secondary Structure Map"  version="1.0"  
>>>>> datatype_dependencies=[tabular]
>>>>> definition=
>>>>> from galaxy.datatypes.tabular import Tabular
>>>>> class SecondaryStructureMap(Tabular):
>>>>>     file_ext = 'ssmap'
>>>>>     def __init__(self, **kwd):
>>>>>         """Initialize secondary structure map datatype"""
>>>>>         Tabular.__init__( self, **kwd )
>>>>>         self.column_names = ['Map']
>>>>> 
>>>>>     def sniff( self, filename ):
>>>>>         """
>>>>>         Determines whether the file is a secondary structure map format.
>>>>>         A single column with an integer value which indicates the row that
>>>>>         this row maps to.
>>>>>         Check that if structMap[10] = 380 then structMap[380] = 10.
>>>>>         """
>>>>>         ...
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Then the align.check.xml tool_config could require the 'ssmap' datatype:
>>>>> 
>>>>> 
>>>>> Calculate the number of potentially misaligned 
>>>>> bases
>>>>> 
>>>>>   mothur
>>>>>   ssmap
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> John,
>>>>>> 
>>>>>> I've been following this message thread, and it seems it's gone in a 
>>>>>> direction that differs from your initial question about the possibility 
>>>>>> for Galaxy to handle automatic editing of the datatypes_conf.xml file 
>>>>>> when certain Galaxy tool shed tools are automatically installed.  There 
>>>>>> are some complexities to consider in attempting this.  One of the issues 
>>>>>> to consider is that the work for adding support for a new datatype to 
>>>>>> Galaxy lies outside of the intended function of the tool shed.  If new 
>>>>>> support is added to the Galaxy cod

Re: [galaxy-dev] Tool shed and datatypes

2011-10-21 Thread James Taylor
Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +:
> I put the gmap tool suite in the galaxy Tool Shed,  let me know if there is 
> more I should do.

Awesome!

> I added a requirement tag for the datatypes to the tool-configs:
> 
> % grep 'requirement.*datatype' *.xml
> gmap_build.xml: gmapdb

Requirement tags for datatypes are an interesting idea, but I'm
wondering if this is something we should require? It seems like all this
information is implicit -- a tool requires a datatype if it has an input
or output parameter that references that type. Is there other
information that should go in the requirement tag? 

-- 
James Taylor, Assistant Professor, Biology / Computer Science, Emory University


Re: [galaxy-dev] Tool shed and datatypes

2011-10-21 Thread Jim Johnson
extval('datatype_id_seq'::regclass)
datatype_id | integer |
requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
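For anyone who wants to experiment with this layout, here is a small Python sketch using an in-memory SQLite database instead of PostgreSQL (so the sequences become INTEGER PRIMARY KEY columns). The demo rows and the function names are assumptions for the illustration; only the "ssmap requires tabular" dependency comes from the example in this message, while "tabular requires data" is added just to show a two-step chain.

```python
import sqlite3

SCHEMA = """
CREATE TABLE datatype (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    version VARCHAR(40),
    description TEXT,
    definition TEXT
);
CREATE TABLE datatype_datatype_association (
    id INTEGER PRIMARY KEY,
    datatype_id INTEGER REFERENCES datatype(id),
    requires_id INTEGER REFERENCES datatype(id)
);
"""


def build_demo_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO datatype (id, name, version) VALUES (?, ?, ?)",
        [(1, "data", "1.0"), (2, "tabular", "1.0"), (3, "ssmap", "1.0")],
    )
    conn.executemany(
        "INSERT INTO datatype_datatype_association (datatype_id, requires_id) "
        "VALUES (?, ?)",
        [(2, 1), (3, 2)],  # tabular requires data; ssmap requires tabular
    )
    return conn


def required_datatypes(conn, name):
    """Return the names of all datatypes that `name` depends on, transitively."""
    rows = conn.execute(
        """
        WITH RECURSIVE deps(id) AS (
            SELECT a.requires_id
            FROM datatype d
            JOIN datatype_datatype_association a ON a.datatype_id = d.id
            WHERE d.name = ?
            UNION
            SELECT a.requires_id
            FROM deps
            JOIN datatype_datatype_association a ON a.datatype_id = deps.id
        )
        SELECT name FROM datatype WHERE id IN (SELECT id FROM deps) ORDER BY name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The recursive query is what an installer would need to run before loading a datatype, so that everything it builds on is loaded first.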


Then for my mothur metagenomics tools I could define:

name="ssmap"   description="Secondary Structure Map"  version="1.0"  
datatype_dependencies=[tabular]
definition=
from galaxy.datatypes.tabular import Tabular
class SecondaryStructureMap(Tabular):
    file_ext = 'ssmap'
    def __init__(self, **kwd):
        """Initialize secondary structure map datatype"""
        Tabular.__init__( self, **kwd )
        self.column_names = ['Map']

    def sniff( self, filename ):
        """
        Determines whether the file is a secondary structure map format.
        A single column with an integer value which indicates the row that this
        row maps to.
        Check that if structMap[10] = 380 then structMap[380] = 10.
        """
        ...
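The body of sniff is elided above. As a rough idea of the reciprocity check the docstring describes, here is a standalone Python sketch that works on a list of lines rather than a filename. It is not the actual mothur datatype code; the function name is made up, rows are assumed to be numbered from 1, and a value of 0 is assumed to mark an unpaired row.

```python
def looks_like_ssmap(lines):
    """Return True if `lines` form a reciprocal single-column integer map."""
    try:
        struct_map = [int(line.strip()) for line in lines if line.strip()]
    except ValueError:
        return False  # a non-integer value rules the format out
    if not struct_map:
        return False
    size = len(struct_map)
    for row, partner in enumerate(struct_map, start=1):
        if partner == 0:  # assumption: 0 means the row maps to nothing
            continue
        if not 1 <= partner <= size:
            return False
        if struct_map[partner - 1] != row:
            return False  # structMap[a] = b must imply structMap[b] = a
    return True
```

A real sniffer would probably look only at the first part of a large file, in which case partners pointing beyond the lines read could not be checked and would have to be tolerated.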




Then the align.check.xml tool_config could require the 'ssmap' datatype:


Calculate the number of potentially misaligned bases

   mothur
   ssmap
  










John,

I've been following this message thread, and it seems it's gone in a direction 
that differs from your initial question about the possibility for Galaxy to 
handle automatic editing of the datatypes_conf.xml file when certain Galaxy 
tool shed tools are automatically installed.  There are some complexities to 
consider in attempting this.  One of the issues to consider is that the work 
for adding support for a new datatype to Galaxy lies outside of the intended 
function of the tool shed.  If new support is added to the Galaxy code base, an 
entry for that new datatype should be manually added to the table at the same 
time.  There may be benefits to enabling automatic changes to datatype entries 
that already exist in the file (e.g., adding a new converter for an existing 
datatype entry), but perhaps adding a completely new datatype to the file may 
not be appropriate.  I'll continue to think about this - send additional 
thoughts and feedback, as doing so is always helpful.

Thanks!

Greg


On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:


One of the things we’re facing is the sheer size of a whole human genome at 30x 
coverage. An effective way to deal with that is by compressing the FASTQ files. 
That works for BWA and our ELAND, which can directly read a compressed FASTQ, 
but other tools crash when reading compressed FASTQ files. One way to 
address that would be to introduce a new type, for example “CompressedFastQ”, 
with a conversion to FASTQ defined. BWA could take both types as input. This 
would allow the best of both worlds – efficient storage and use by all existing 
tools.
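As a minimal sketch of what the conversion step for such a type could do (assuming the compression is plain gzip, which the message does not actually state), the Python below streams a compressed FASTQ out to an uncompressed file. The function names are made up, and this is not an existing Galaxy converter.

```python
import gzip
import shutil


def looks_gzipped(path):
    """Check for the two-byte GZIP magic number at the start of the file."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def compressed_fastq_to_fastq(src, dest):
    """Stream a gzipped FASTQ to plain FASTQ; copy through if already plain."""
    opener = gzip.open if looks_gzipped(src) else open
    with opener(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # constant memory, whatever the file size
```

Tools that can read the compressed form directly would simply declare both types as accepted input, and the conversion would only run for the tools that cannot.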

Another example would be adding the CASAVA tools to Galaxy. Some of the 
statistics generation tools use custom file formats. To be able to make the use 
of those tools optional and configurable, they should be separate from the 
aligner, but that would require that Galaxy be made aware of the custom file 
formats – we’d have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com

From: Greg Von Kuster [mailto:greg at bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev at lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?

Thanks!


On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:


Can we introduce new file types via tools in the tool shed? It seems Galaxy can 
load them if they are in the datatypes configuration file. Does tool 
installation automate the editing of that file?


John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com


Greg Von Kuster
Galaxy Development Team
greg at bx.psu.edu




Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu







Re: [galaxy-dev] Tool shed and datatypes

2011-10-18 Thread Greg Von Kuster
entially misaligned 
>>> bases
>>> 
>>>   mothur
>>>   ssmap
>>>  
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> John,
>>>> 
>>>> I've been following this message thread, and it seems it's gone in a 
>>>> direction that differs from your initial question about the possibility 
>>>> for Galaxy to handle automatic editing of the datatypes_conf.xml file when 
>>>> certain Galaxy tool shed tools are automatically installed.  There are 
>>>> some complexities to consider in attempting this.  One of the issues to 
>>>> consider is that the work for adding support for a new datatype to Galaxy 
>>>> lies outside of the intended function of the tool shed.  If new support is 
>>>> added to the Galaxy code base, an entry for that new datatype should be 
>>>> manually added to the table at the same time.  There may be benefits to 
>>>> enabling automatic changes to datatype entries that already exist in the 
>>>> file (e.g., adding a new converter for an existing datatype entry), but 
>>>> perhaps adding a completely new datatype to the file may not be 
>>>> appropriate.  I'll continue to think about this - send additional thoughts 
>>>> and feedback, as doing so is always helpful.
>>>> 
>>>> Thanks!
>>>> 
>>>> Greg
>>>> 
>>>> 
>>>> On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
>>>> 
>>>>> One of the things we’re facing is the sheer size of a whole human genome 
>>>>> at 30x coverage. An effective way to deal with that is by compressing the 
>>>>> FASTQ files. That works for BWA and our ELAND, which can directly read a 
>>>>> compressed FASTQ, but other tools crash when reading compressed FASTQ 
>>>>> files. One way to address that would be to introduce a new type, for 
>>>>> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could 
>>>>> take both types as input. This would allow the best of both worlds – 
>>>>> efficient storage and use by all existing tools.
>>>>> 
>>>>> Another example would be adding the CASAVA tools to Galaxy. Some of the 
>>>>> statistics generation tools use custom file formats. To be able to make 
>>>>> the use of those tools optional and configurable, they should be separate 
>>>>> from the aligner, but that would require that Galaxy be made aware of the 
>>>>> custom file formats – we’d have to add a datatype.
>>>>> 
>>>>> John Duddy
>>>>> Sr. Staff Software Engineer
>>>>> Illumina, Inc.
>>>>> 9885 Towne Centre Drive
>>>>> San Diego, CA 92121
>>>>> Tel: 858-736-3584
>>>>> E-mail: jduddy at illumina.com
>>>>> 
>>>>> From: Greg Von Kuster [mailto:greg at bx.psu.edu]
>>>>> Sent: Wednesday, October 05, 2011 6:25 PM
>>>>> To: Duddy, John
>>>>> Cc: galaxy-dev at lists.bx.psu.edu
>>>>> Subject: Re: [galaxy-dev] Tool shed and datatypes
>>>>> 
>>>>> Hello John,
>>>>> 
>>>>> The Galaxy tool shed currently is not enabled to automatically edit the 
>>>>> datatypes_conf.xml file, although I could add this feature if the need 
>>>>> exists.  Can you elaborate on what you are looking to do regarding this?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> 
>>>>> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
>>>>> 
>>>>> 
>>>>> Can we introduce new file types via tools in the tool shed? It seems 
>>>>> Galaxy can load them if they are in the datatypes configuration file. 
>>>>> Does tool installation automate the editing of that file?
>>>>> 
>>>>> 
>>>>> John Duddy
>>>>> Sr. Staff Software Engineer
>>>>> Illumina, Inc.
>>>>> 9885 Towne Centre Drive
>>>>> San Diego, CA 92121
>>>>> Tel: 858-736-3584
>>>>> E-mail: jduddy at illumina.com
>>>>> 
>>>>> 
>>>>> Greg Von Kuster
>>>>> Galaxy Development Team
>>>>> greg at bx.psu.edu
>>>>> 
>>> 
>> Greg Von Kuster
>> Galaxy Development Team
>> g...@bx.psu.edu
>> 
>> 
>> 
> 
> 
> 
> 

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu






Re: [galaxy-dev] Tool shed and datatypes

2011-10-18 Thread Jim Johnson
h that is by compressing the FASTQ files. 
That works for BWA and our ELAND, which can directly read a compressed FASTQ, 
but other tools crash when reading compressed FASTQ files. One way to 
address that would be to introduce a new type, for example “CompressedFastQ”, 
with a conversion to FASTQ defined. BWA could take both types as input. This 
would allow the best of both worlds – efficient storage and use by all existing 
tools.

Another example would be adding the CASAVA tools to Galaxy. Some of the 
statistics generation tools use custom file formats. To be able to make the use 
of those tools optional and configurable, they should be separate from the 
aligner, but that would require that Galaxy be made aware of the custom file 
formats – we’d have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com

From: Greg Von Kuster [mailto:greg at bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev at lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?

Thanks!


On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:


Can we introduce new file types via tools in the tool shed? It seems Galaxy can 
load them if they are in the datatypes configuration file. Does tool 
installation automate the editing of that file?


John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com


Greg Von Kuster
Galaxy Development Team
greg at bx.psu.edu




Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu









Re: [galaxy-dev] Tool shed and datatypes

2011-10-17 Thread Greg Von Kuster
iles. That works for BWA and our ELAND, which can directly read a 
>>> compressed FASTQ, but other tools crash when reading compressed FASTQ 
>>> files. One way to address that would be to introduce a new type, for 
>>> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could 
>>> take both types as input. This would allow the best of both worlds – 
>>> efficient storage and use by all existing tools.
>>> 
>>> Another example would be adding the CASAVA tools to Galaxy. Some of the 
>>> statistics generation tools use custom file formats. To be able to make the 
>>> use of those tools optional and configurable, they should be separate from 
>>> the aligner, but that would require that Galaxy be made aware of the custom 
>>> file formats – we’d have to add a datatype.
>>> 
>>> John Duddy
>>> Sr. Staff Software Engineer
>>> Illumina, Inc.
>>> 9885 Towne Centre Drive
>>> San Diego, CA 92121
>>> Tel: 858-736-3584
>>> E-mail: jduddy at illumina.com
>>> 
>>> From: Greg Von Kuster [mailto:greg at bx.psu.edu]
>>> Sent: Wednesday, October 05, 2011 6:25 PM
>>> To: Duddy, John
>>> Cc: galaxy-dev at lists.bx.psu.edu
>>> Subject: Re: [galaxy-dev] Tool shed and datatypes
>>> 
>>> Hello John,
>>> 
>>> The Galaxy tool shed currently is not enabled to automatically edit the 
>>> datatypes_conf.xml file, although I could add this feature if the need 
>>> exists.  Can you elaborate on what you are looking to do regarding this?
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
>>> 
>>> 
>>> Can we introduce new file types via tools in the tool shed? It seems Galaxy 
>>> can load them if they are in the datatypes configuration file. Does tool 
>>> installation automate the editing of that file?
>>> 
>>> 
>>> John Duddy
>>> Sr. Staff Software Engineer
>>> Illumina, Inc.
>>> 9885 Towne Centre Drive
>>> San Diego, CA 92121
>>> Tel: 858-736-3584
>>> E-mail: jduddy at illumina.com
>>> 
>>> 
>>> Greg Von Kuster
>>> Galaxy Development Team
>>> greg at bx.psu.edu
>>> 
> 

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu






Re: [galaxy-dev] Tool shed and datatypes

2011-10-10 Thread Greg Von Kuster
Hello Jim,

On Oct 10, 2011, at 1:01 PM, Jim Johnson wrote:

> There are a number of well defined formats that are exchanged between 
> applications, e.g. BAM, gtf, etc,   I wouldn't advocate proliferating those.
> 
> I see the need for Toolshed datatypes more for the intermediate file formats 
> used within a suite of commands.   These can be helpful in guiding a user to 
> select appropriate inputs for successive steps in an analysis.
> 
> For example, when developing the 90 some tool wrappers for the mothur 
> metagenomic package,  there are many file formats that get passed among the 
> mothur commands.   It greatly simplifies the user's experience if the outputs 
> are typed so as to correctly filter the acceptable inputs to another command. 
>   I fear the amount of time I would spend providing user support if the 
> outputs and inputs were generically typed.

An approach for simplifying this is to include one or more exported Galaxy 
workflows in the tool shed repository along with the tools.  The workflows 
cannot currently be automatically imported into Galaxy, but they can be 
manually imported, providing the user an idea of the steps in the analyses for 
which the tools are intended.  Additional features related to Galaxy workflows 
included in Galaxy tool shed repositories will be available in future Galaxy 
releases.

> 
> I'm also seeing a similar need as I am creating tool wrappers for 
> the GMAP/GSNAP mapping commands.   While input to GSNAP and GMAP can be fastq 
> and output in SAM format, some of the more interesting use cases involve 
> creating additional map stores, where specific datatypes would guide the user 
> in setting the tool parameters correctly.
> 
> JJ
> 
> James E Johnson
> Minnesota Supercomputing Institute, University of Minnesota
> 
> 
> On 10/10/11 11:09 AM, Duddy, John wrote:
>> I agree with the risks you cited.
>> 
>> There is a risk in the other direction that I think is even scarier - 
>> without the ability to add data types, tool authors may be forced to use a 
>> "typeless" system, declaring all inputs/outputs as "data" or "text". While 
>> this works, it has the same drawbacks as typeless programming languages - 
>> deferring error detection to runtime, impairing the ability to perform 
>> static analysis, inability to perform transparent type conversions - in 
>> other words, the tools have to take over responsibilities from the framework.
>> 
>> Like all interesting problems, I don't think there is an "obviously right" 
>> answer ;-}
>> 
>> John Duddy
>> Sr. Staff Software Engineer
>> Illumina, Inc.
>> 9885 Towne Centre Drive
>> San Diego, CA 92121
>> Tel: 858-736-3584
>> E-mail: jdu...@illumina.com
>> 
>> 
>> -----Original Message-----
>> From: galaxy-dev-boun...@lists.bx.psu.edu 
>> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Paniagua, Eric
>> Sent: Friday, October 07, 2011 5:53 PM
>> To: j...@umn.edu; galaxy-dev@lists.bx.psu.edu
>> Cc: Greg Von Kuster
>> Subject: Re: [galaxy-dev] Tool shed and datatypes
>> 
>> Hi all,
>> 
>> Just my 2 cents.
>> 
>> This is a really great idea to have dynamically (down-)loadable datatypes, 
>> and a tool config tag to express a datatype dependency is right on the 
>> money.  I agree with Greg in having hesitations about adding that feature 
>> though.  The purpose (at least as far I see it) of the tool shed is to allow 
>> the community to share its productivity.  New tools written by one group can 
>> be used by another group that may not have adequate skill, resources, or 
>> time to create the same tool on their own.  One issue this model can suffer 
>> from, however, is over-proliferation of contributions.  In this case, new 
>> tools with the same, overlapping, or very similar functions might be 
>> developed independently by multiple groups who then want to contribute to 
>> the tool shed.  I don't know how often this situation arises or what 
>> official contingencies are in place to manage them, but it is important to 
>> manage that situation carefully.  If it occurs with any appreciable 
>> frequency, then eventually there are many clusters of tools available that 
>> do almost the same thing but not quite.  This is bad for the user, bad for 
>> the maintainer, complicates communication between researchers, etc.  This 
>> model can work nicely if the frequency of very similar tool submissions is 
>> small, and even better if there is some management for cleaning out broken 
>> or redundant tools.
>> 
>> When you allow custom datatypes

Re: [galaxy-dev] Tool shed and datatypes

2011-10-10 Thread Greg Von Kuster
Peter has the right idea here - we will add support for appropriate data types 
to the Galaxy distribution.  Of course, the key word here is "appropriate", but 
any industry-standard data format should fall under this category.

On Oct 10, 2011, at 12:46 PM, Peter Cock wrote:

> On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John  wrote:
>> I agree with the risks you cited.
>> 
>> There is a risk in the other direction that I think is even scarier -
>> without the ability to add data types, tool authors may be forced
>> to use a "typeless" system, declaring all inputs/outputs as "data"
>> or "text". While this works, it has the same drawbacks as typeless
>> programming languages - deferring error detection to runtime,
>> impairing the ability to perform static analysis, inability to perform
>> transparent type conversions - in other words, the tools have to
>> take over responsibilities from the framework.
>> 
>> Like all interesting problems, I don't think there is an "obviously
>> right" answer ;-}
>> 
>> John Duddy
> 
> Indeed. I'm going with lobbying the Galaxy team to include new
> datatypes when I need them (InterProScan XML is on my
> todo list, perhaps v4 and v5 as two types), but I've been
> able to get along with "tabular" as a tool output.
> 
> Peter

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu






Re: [galaxy-dev] Tool shed and datatypes

2011-10-10 Thread Jim Johnson

There are a number of well defined formats that are exchanged between 
applications, e.g. BAM, gtf, etc,   I wouldn't advocate proliferating those.

I see the need for Toolshed datatypes more for the intermediate file formats 
used within a suite of commands.   These can be helpful in guiding a user to 
select appropriate inputs for successive steps in an analysis.

For example, when developing the 90 some tool wrappers for the mothur 
metagenomic package,  there are many file formats that get passed among the 
mothur commands.   It greatly simplifies the user's experience if the outputs 
are typed so as to correctly filter the acceptable inputs to another command.   
I fear the amount of time I would spend providing user support if the outputs 
and inputs were generically typed.

I'm also seeing a similar need as I am creating tool wrappers for the 
GMAP/GSNAP mapping commands.   While input to GSNAP and GMAP can be fastq and 
output in SAM format, some of the more interesting use cases involve creating 
additional map stores, where specific datatypes would guide the user in setting 
the tool parameters correctly.

JJ

James E Johnson
Minnesota Supercomputing Institute, University of Minnesota


On 10/10/11 11:09 AM, Duddy, John wrote:

I agree with the risks you cited.

There is a risk in the other direction that I think is even scarier - without the ability to add data types, 
tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as 
"data" or "text". While this works, it has the same drawbacks as typeless programming 
languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability 
to perform transparent type conversions - in other words, the tools have to take over responsibilities from 
the framework.

Like all interesting problems, I don't think there is an "obviously right" 
answer ;-}

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-----Original Message-----
From: galaxy-dev-boun...@lists.bx.psu.edu 
[mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Paniagua, Eric
Sent: Friday, October 07, 2011 5:53 PM
To: j...@umn.edu; galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hi all,

Just my 2 cents.

This is a really great idea to have dynamically (down-)loadable datatypes, and 
a tool config tag to express a datatype dependency is right on the money.  I 
agree with Greg in having hesitations about adding that feature though.  The 
purpose (at least as far I see it) of the tool shed is to allow the community 
to share its productivity.  New tools written by one group can be used by 
another group that may not have adequate skill, resources, or time to create 
the same tool on their own.  One issue this model can suffer from, however, is 
over-proliferation of contributions.  In this case, new tools with the same, 
overlapping, or very similar functions might be developed independently by 
multiple groups who then want to contribute to the tool shed.  I don't know how 
often this situation arises or what official contingencies are in place to 
manage them, but it is important to manage that situation carefully.  If it 
occurs with any appreciable frequency, then eventually there are many clusters 
of tools available that do almost the same thing but not quite.  This is bad 
for the user, bad for the maintainer, complicates communication between 
researchers, etc.  This model can work nicely if the frequency of very similar 
tool submissions is small, and even better if there is some management for 
cleaning out broken or redundant tools.


When you allow custom datatypes to enter the picture, however, the story can become hairy 
much more quickly.  Having a limited set of officially supplied / supported datatypes 
forces the contributors of new tools to use datatypes drawn from a standard set.  Without 
that constraint, the number of datatype variants could explode.  Now the concern is not 
only that multiple contributors may submit very similar tool variants, or that each of 
them might choose to create their own datatypes to optimize their methods, but also that 
contributors of tools which are functionally dissimilar but manipulate the same general 
types of data will write their tools using new datatypes that are variants of each other. 
 Tools are essentially typed by the datatypes they accept and produce, so you won't be 
able to chain these tools together very easily at all.  Most pairs of tools will have the 
"wrong" datatype, on input or output, for what a user wants to do.  The general 
trend is then proliferation of clusters of redundant tools, clusters of redundant 
datatypes, and growing sparsity in the "tool graph" (think of datatypes as vertices and tool

Re: [galaxy-dev] Tool shed and datatypes

2011-10-10 Thread Peter Cock
On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John  wrote:
> I agree with the risks you cited.
>
> There is a risk in the other direction that I think is even scarier -
> without the ability to add data types, tool authors may be forced
> to use a "typeless" system, declaring all inputs/outputs as "data"
> or "text". While this works, it has the same drawbacks as typeless
> programming languages - deferring error detection to runtime,
> impairing the ability to perform static analysis, inability to perform
> transparent type conversions - in other words, the tools have to
> take over responsibilities from the framework.
>
> Like all interesting problems, I don't think there is an "obviously
> right" answer ;-}
>
> John Duddy

Indeed. I'm going with lobbying the Galaxy team to include new
datatypes when I need them (InterProScan XML is on my
todo list, perhaps v4 and v5 as two types), but I've been
able to get along with "tabular" as a tool output.

Peter
___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Tool shed and datatypes

2011-10-10 Thread Duddy, John
I agree with the risks you cited.

There is a risk in the other direction that I think is even scarier - without 
the ability to add data types, tool authors may be forced to use a "typeless" 
system, declaring all inputs/outputs as "data" or "text". While this works, it 
has the same drawbacks as typeless programming languages - deferring error 
detection to runtime, impairing the ability to perform static analysis, 
inability to perform transparent type conversions - in other words, the tools 
have to take over responsibilities from the framework.

Like all interesting problems, I don't think there is an "obviously right" 
answer ;-}
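
The static-versus-runtime point can be illustrated with a minimal Python sketch. The tool names, the format strings, and the can_connect helper are all hypothetical and are not Galaxy's API; the sketch only shows that a declared format lets a mismatch be rejected when tools are wired together, while a catch-all "data" input accepts anything and leaves the error to surface when the job runs.

```python
# Hypothetical tool signatures: each tool declares the format it reads and writes.
TOOLS = {
    "fastq_filter": {"input": "fastqsanger", "output": "fastqsanger"},
    "aligner": {"input": "fastqsanger", "output": "sam"},
    "untyped_script": {"input": "data", "output": "data"},
}


def can_connect(producer, consumer):
    """Check, before anything runs, whether producer's output may feed consumer."""
    out_fmt = TOOLS[producer]["output"]
    in_fmt = TOOLS[consumer]["input"]
    # An input declared as "data" accepts everything, so no mismatch can be
    # detected here; the tool itself has to cope with bad input at runtime.
    return in_fmt == "data" or out_fmt == in_fmt


if __name__ == "__main__":
    print(can_connect("fastq_filter", "aligner"))    # typed and compatible
    print(can_connect("aligner", "fastq_filter"))    # typed mismatch, rejected early
    print(can_connect("aligner", "untyped_script"))  # always accepted, checked by nobody
```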

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu 
[mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Paniagua, Eric
Sent: Friday, October 07, 2011 5:53 PM
To: j...@umn.edu; galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hi all,

Just my 2 cents.

This is a really great idea to have dynamically (down-)loadable datatypes, and 
a tool config tag to express a datatype dependency is right on the money.  I 
agree with Greg in having hesitations about adding that feature though.  The 
purpose (at least as far as I see it) of the tool shed is to allow the community 
to share its productivity.  New tools written by one group can be used by 
another group that may not have adequate skill, resources, or time to create 
the same tool on their own.  One issue this model can suffer from, however, is 
over-proliferation of contributions.  In this case, new tools with the same, 
overlapping, or very similar functions might be developed independently by 
multiple groups who then want to contribute to the tool shed.  I don't know how 
often this situation arises or what official contingencies are in place to 
manage them, but it is important to manage that situation carefully.  If it 
occurs with any appreciable frequency, then eventually there are 
many clusters of tools available that do almost the same thing but not quite.  
This is bad for the user, bad for the maintainer, complicates communication 
between researchers, etc.  This model can work nicely if the frequency of very 
similar tool submissions is small, and even better if there is some management 
for cleaning out broken or redundant tools.

When you allow custom datatypes to enter the picture, however, the story can 
become hairy much more quickly.  Having a limited set of officially supplied / 
supported datatypes forces the contributors of new tools to use datatypes drawn 
from a standard set.  Without that constraint, the number of datatype variants 
could explode.  Now the concern is not only that multiple contributors may 
submit very similar tool variants, or that each of them might choose to create 
their own datatypes to optimize their methods, but also that contributors of 
tools which are functionally dissimilar but manipulate the same general types 
of data will write their tools using new datatypes that are variants of each 
other.  Tools are essentially typed by the datatypes they accept and produce, 
so you won't be able to chain these tools together very easily at all.  Most 
pairs of tools will have the "wrong" datatype, on input or output, for what a 
user wants to do.  The general trend is then proliferation 
of clusters of redundant tools, clusters of redundant datatypes, and 
growing sparsity in the "tool graph" (think of datatypes as vertices and tools 
as directed [hyper]edges).
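
The "tool graph" picture can be sketched in a few lines of Python. Everything here is hypothetical: the tool and datatype names are invented, and each tool is simplified to a single input and a single output, so plain edges stand in for the hyperedges mentioned above. A breadth-first search then answers the chaining question directly: is there a sequence of tools that turns one datatype into another?

```python
from collections import deque

# Hypothetical tools, each an edge (name, input datatype, output datatype).
TOOLS = [
    ("trim", "fastq", "fastq"),
    ("align", "fastq", "sam"),
    ("sort", "sam", "bam"),
    # A tool written against variant datatypes forms its own island.
    ("custom_align", "fastq_variant", "sam_variant"),
]


def chain(tools, source, target):
    """Return a list of tool names converting source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        dtype, path = queue.popleft()
        if dtype == target:
            return path
        for name, src, dst in tools:
            if src == dtype and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None


if __name__ == "__main__":
    print(chain(TOOLS, "fastq", "bam"))
    print(chain(TOOLS, "fastq_variant", "bam"))
```

In this toy graph the standard datatypes chain together, while the variant datatypes cannot reach "bam" at all, which is the sparsity being described.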

So, a move in the direction of supporting something like a "TypeShed" would 
require careful consideration and would need at least either a well-defined 
policy for managing *Shed rot and the capability to execute it, or a very slick 
tool / datatype versioning system with flexible control for users, plus an 
equally slick method for maintaining implicit conversions between the datatypes 
in a datatype cluster (ideally automatically generated).  I think at least the 
implicit conversion part can be done, if not in a fully automated manner, 
then by a combination of policy and engineering.  For policy, you can define, 
identify, or construct a canonical datatype in each cluster and require that a 
contributor of a variant datatype submit methods for implicit conversion 
to/from the canonical datatype in that cluster.  One idea that could help 
reduce complexity is to place some additional structure on 
datatypes and take the canonical datatype for a cluster to be a form of the 
union (mathematical, not the "union" from C) of the variants in the cluster, 
which would simplify implicit conversions somewhat.  Or, if there's some 
reason for this, there can also be a set of "canonical" 

Re: [galaxy-dev] Tool shed and datatypes

2011-10-07 Thread Paniagua, Eric
alf of Jim Johnson [j...@umn.edu]
Sent: Friday, October 07, 2011 2:06 PM
To: galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Greg,

It would be great if there were a way to expand upon the core datatypes using 
the ToolShed.

Would it be possible to have a separate datatype repository within the ToolShed?

Datatype
   name=""
   description=""
   datatype_dependencies=[]
   definition=


The tool config could be expanded to have requirement for datatypes.
ssmap




Table datatype
   Column    |          Type          | Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)

Table datatype_datatype_association
   Column    |  Type   | Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)


Then for my mothur metagenomics tools I could define:

name="ssmap"   description="Secondary Structure Map"  version="1.0"  
datatype_dependencies=[tabular]
definition=
from galaxy.datatypes.tabular import Tabular

class SecondaryStructureMap(Tabular):
    file_ext = 'ssmap'

    def __init__(self, **kwd):
        """Initialize secondary structure map datatype"""
        Tabular.__init__( self, **kwd )
        self.column_names = ['Map']

    def sniff( self, filename ):
        """
        Determines whether the file is a secondary structure map format
        A single column with an integer value which indicates the row that
        this row maps to.
        Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
        """
        ...




Then the align.check.xml tool_config could require the 'ssmap' datatype:


  Calculate the number of potentially misaligned 
bases
  
mothur
ssmap
   









> John,
>
> I've been following this message thread, and it seems it's gone in a 
> direction that differs from your initial question about the possibility for 
> Galaxy to handle automatic editing of the datatypes_conf.xml file when 
> certain Galaxy tool shed tools are automatically installed.  There are some 
> complexities to consider in attempting this.  One of the issues to consider 
> is that the work for adding support for a new datatype to Galaxy lies outside 
> of the intended function of the tool shed.  If new support is added to the 
> Galaxy code base, an entry for that new datatype should be manually added to 
> the table at the same time.  There may be benefits to enabling automatic 
> changes to datatype entries that already exist in the file (e.g., adding a 
> new converter for an existing datatype entry), but perhaps adding a 
> completely new datatype to the file may not be appropriate.  I'll continue to 
> think about this; please send additional thoughts and feedback, as doing so is 
> always helpful.
>
> Thanks!
>
> Greg
>
>
> On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
>
>> One of the things we’re facing is the sheer size of a whole human genome at 
>> 30x coverage. An effective way to deal with that is by compressing the FASTQ 
>> files. That works for BWA and our ELAND, which can directly read a 
>> compressed FASTQ, but other tools crash when reading compressed FASTQ 
>> files. One way to address that would be to introduce a new type, for 
>> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could 
>> take both types as input. This would allow the best of both worlds – 
>> efficient storage and use by all existing tools.
>>
>> Another example would be adding the CASAVA tools to Galaxy. Some of the 
>> statistics generation tools use custom file formats. To be able to make the 
>> use of those tools optional and configurable, they should be separate from 
>> the aligner, but that would require that Galaxy be made aware of the custom 
>> file formats – we’d have to add a datatype.
>>
>> John Duddy
>> Sr. Staff Software Engineer
>> Illumina, Inc.
>> 9885 Towne Centre Drive
>> San Diego, CA 92121
>> Tel: 8

Re: [galaxy-dev] Tool shed and datatypes

2011-10-07 Thread Jim Johnson

Greg,

It would be great if there were a way to expand upon the core datatypes using 
the ToolShed.

Would it be possible to have a separate datatype repository within the ToolShed?

Datatype
  name=""
  description=""
  datatype_dependencies=[]
  definition=
  


The tool config could be expanded to have requirement for datatypes.
   ssmap




Table datatype
   Column    |          Type          | Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)

Table datatype_datatype_association
   Column    |  Type   | Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
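
One consequence of the datatype_datatype_association table is that datatypes would have to be loaded in dependency order. A minimal Python sketch of that ordering follows, using the standard-library graphlib module (Python 3.9 or later). The load_order helper is hypothetical, and only the "ssmap requires tabular" pair comes from this message; the "tabular requires text" row is made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (datatype, required datatype) pairs, mirroring rows of the proposed
# datatype_datatype_association table. The second row is an assumption.
REQUIRES = [
    ("ssmap", "tabular"),
    ("tabular", "text"),
]


def load_order(requires):
    """Return datatype names ordered so each one follows what it requires."""
    sorter = TopologicalSorter()
    for datatype, required in requires:
        sorter.add(datatype, required)
    # static_order() raises graphlib.CycleError if the requirements are circular.
    return list(sorter.static_order())


if __name__ == "__main__":
    print(load_order(REQUIRES))
```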


Then for my mothur metagenomics tools I could define:

name="ssmap"   description="Secondary Structure Map"  version="1.0"  
datatype_dependencies=[tabular]
definition=
from galaxy.datatypes.tabular import Tabular

class SecondaryStructureMap(Tabular):
    file_ext = 'ssmap'

    def __init__(self, **kwd):
        """Initialize secondary structure map datatype"""
        Tabular.__init__( self, **kwd )
        self.column_names = ['Map']

    def sniff( self, filename ):
        """
        Determines whether the file is a secondary structure map format
        A single column with an integer value which indicates the row that this
        row maps to.
        Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
        """
        ...
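
The sniff body is elided in the message, so the following is only a standalone sketch of the check the docstring describes, not the mothur datatype's actual code. The function name looks_like_ssmap is hypothetical, and two details are assumptions: rows are numbered from 1, and a value of 0 marks an unpaired row.

```python
def looks_like_ssmap(path):
    """Return True if the file is one integer per line with symmetric pairings."""
    partners = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Exactly one non-negative integer is expected on every line.
            if len(fields) != 1 or not fields[0].isdecimal():
                return False
            partners.append(int(fields[0]))
    if not partners:
        return False
    for row, partner in enumerate(partners, start=1):
        if partner == 0:
            continue  # assumption: 0 means this row is unpaired
        # If row maps to partner, then partner must map back to row.
        if partner > len(partners) or partners[partner - 1] != row:
            return False
    return True
```

A real sniffer would probably inspect only the first part of a large file; this sketch reads the whole file because the symmetry check needs both ends of each pair.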




Then the align.check.xml tool_config could require the 'ssmap' datatype:


 Calculate the number of potentially misaligned bases
 
   mothur
   ssmap
  










John,

I've been following this message thread, and it seems it's gone in a direction 
that differs from your initial question about the possibility for Galaxy to 
handle automatic editing of the datatypes_conf.xml file when certain Galaxy 
tool shed tools are automatically installed.  There are some complexities to 
consider in attempting this.  One of the issues to consider is that the work 
for adding support for a new datatype to Galaxy lies outside of the intended 
function of the tool shed.  If new support is added to the Galaxy code base, an 
entry for that new datatype should be manually added to the table at the same 
time.  There may be benefits to enabling automatic changes to datatype entries 
that already exist in the file (e.g., adding a new converter for an existing 
datatype entry), but perhaps adding a completely new datatype to the file may 
not be appropriate.  I'll continue to think about this; please send additional 
thoughts and feedback, as doing so is always helpful.

Thanks!

Greg


On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:


One of the things we’re facing is the sheer size of a whole human genome at 30x 
coverage. An effective way to deal with that is by compressing the FASTQ files. 
That works for BWA and our ELAND, which can directly read a compressed FASTQ, 
but other tools crash when reading compressed FASTQ files. One way to 
address that would be to introduce a new type, for example “CompressedFastQ”, 
with a conversion to FASTQ defined. BWA could take both types as input. This 
would allow the best of both worlds – efficient storage and use by all existing 
tools.

Another example would be adding the CASAVA tools to Galaxy. Some of the 
statistics generation tools use custom file formats. To be able to make the use 
of those tools optional and configurable, they should be separate from the 
aligner, but that would require that Galaxy be made aware of the custom file 
formats – we’d have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com

From: Greg Von Kuster [mailto:greg at bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev at lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?

Thanks!


On Oct 5, 2011, at 1:52

Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Greg Von Kuster
John,

I've been following this message thread, and it seems it's gone in a direction 
that differs from your initial question about the possibility for Galaxy to 
handle automatic editing of the datatypes_conf.xml file when certain Galaxy 
tool shed tools are automatically installed.  There are some complexities to 
consider in attempting this.  One of the issues to consider is that the work 
for adding support for a new datatype to Galaxy lies outside of the intended 
function of the tool shed.  If new support is added to the Galaxy code base, an 
entry for that new datatype should be manually added to the table at the same 
time.  There may be benefits to enabling automatic changes to datatype entries 
that already exist in the file (e.g., adding a new converter for an existing 
datatype entry), but perhaps adding a completely new datatype to the file may 
not be appropriate.  I'll continue to think about this; please send additional 
thoughts and feedback, as doing so is always helpful.

Thanks!

Greg


On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:

> One of the things we’re facing is the sheer size of a whole human genome at 
> 30x coverage. An effective way to deal with that is by compressing the FASTQ 
> files. That works for BWA and our ELAND, which can directly read a compressed 
> FASTQ, but other tools crash when reading compressed FASTQ files. One 
> way to address that would be to introduce a new type, for example 
> “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both 
> types as input. This would allow the best of both worlds – efficient storage 
> and use by all existing tools.
>  
> Another example would be adding the CASAVA tools to Galaxy. Some of the 
> statistics generation tools use custom file formats. To be able to make the 
> use of those tools optional and configurable, they should be separate from 
> the aligner, but that would require that Galaxy be made aware of the custom 
> file formats – we’d have to add a datatype.
>  
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>  
> From: Greg Von Kuster [mailto:g...@bx.psu.edu] 
> Sent: Wednesday, October 05, 2011 6:25 PM
> To: Duddy, John
> Cc: galaxy-dev@lists.bx.psu.edu
> Subject: Re: [galaxy-dev] Tool shed and datatypes
>  
> Hello John,
>  
> The Galaxy tool shed currently is not enabled to automatically edit the 
> datatypes_conf.xml file, although I could add this feature if the need 
> exists.  Can you elaborate on what you are looking to do regarding this?
>  
> Thanks!
>  
>  
> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
> 
> 
> Can we introduce new file types via tools in the tool shed? It seems Galaxy 
> can load them if they are in the datatypes configuration file. Does tool 
> installation automate the editing of that file?
>  
>  
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>  
>  
> Greg Von Kuster
> Galaxy Development Team
> g...@bx.psu.edu
>  
>  
>  

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu




Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Duddy, John
GZIP files are definitely our plan. I just finished testing the code that 
distributes the processing of a FASTQ (or pair for PE) to an arbitrary number 
of tasks, where each subtask extracts just the data it needs without reading 
any of the file it does not need. It extracts the blocks of GZIPped data into a 
standalone GZIP file just by copying whole blocks and appending them (if the 
window is not aligned perfectly, there is additional processing). Since the 
entire file does not need to be read, it distributes quite nicely.

I'll be preparing a pull request for it soon.


John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Thursday, October 06, 2011 9:19 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John  wrote:
> As I understand it, Isilon is built up from "bricks" that have storage
> and compute power. They replicate files amongst themselves, so
> that for every IO request there are multiple systems that could
> respond. They are interconnected by an ultra fast fibre backbone.

So why not use gzipped files on top of that? Smaller chunks of
data to access so should be faster even with the decompression
once it gets to the CPU.

> So, depending on your topology, it's possible to get a lot more
> throughput by working on different sections of the same file from
> different physical computers.

Nice.

> I haven't delved into BGZF, so I can't comment. My approach to
> block GZIP was just to concatenate multiple GZIP files and keep
> a record of the offsets and sequences contained in each. The
> advantage is compatibility, in that it decompresses just like it
> was one big chunk, yet you can compose subsets of the data
> without decompressing/recompressing and (as long as we
> actually have to write out the file subsets) can reap the reduced
> IO benefits of smaller writes.

That sounds VERY similar to BGZF - have a read over the
SAM specification which covers this. Basically they stick
the block size into the gzip headers, and the BAM index files
(BAI) use a 64 bit offset which is split into the BGZF block
offset and the offset within that decompressed block. See:
http://samtools.sourceforge.net/SAM1.pdf

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Peter Cock
On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John  wrote:
> As I understand it, Isilon is built up from "bricks" that have storage
> and compute power. They replicate files amongst themselves, so
> that for every IO request there are multiple systems that could
> respond. They are interconnected by an ultra fast fibre backbone.

So why not use gzipped files on top of that? Smaller chunks of
data to access so should be faster even with the decompression
once it gets to the CPU.

> So, depending on your topology, it's possible to get a lot more
> throughput by working on different sections of the same file from
> different physical computers.

Nice.

> I haven't delved into BGZF, so I can't comment. My approach to
> block GZIP was just to concatenate multiple GZIP files and keep
> a record of the offsets and sequences contained in each. The
> advantage is compatibility, in that it decompresses just like it
> was one big chunk, yet you can compose subsets of the data
> without decompressing/recompressing and (as long as we
> actually have to write out the file subsets) can reap the reduced
> IO benefits of smaller writes.

That sounds VERY similar to BGZF - have a read over the
SAM specification which covers this. Basically they stick
the block size into the gzip headers, and the BAM index files
(BAI) use a 64 bit offset which is split into the BGZF block
offset and the offset within that decompressed block. See:
http://samtools.sourceforge.net/SAM1.pdf
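
The 64-bit offset described here can be sketched in Python. Per the SAM specification, a BGZF virtual offset keeps the compressed block's start position in the upper 48 bits and the position within the decompressed block in the lower 16 bits. The function names below are made up for this example.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a BGZF block start and an in-block offset into one 64-bit value."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


if __name__ == "__main__":
    voffset = make_virtual_offset(block_start=123456, within_block=789)
    print(voffset, split_virtual_offset(voffset))
```

Because the block start occupies the high bits, virtual offsets sort in the same order as positions in the uncompressed data.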

Peter


Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Duddy, John
As I understand it, Isilon is built up from "bricks" that have storage and 
compute power. They replicate files amongst themselves, so that for every IO 
request there are multiple systems that could respond. They are interconnected 
by an ultra fast fibre backbone.

So, depending on your topology, it's possible to get a lot more throughput by 
working on different sections of the same file from different physical 
computers.

I haven't delved into BGZF, so I can't comment. My approach to block GZIP was 
just to concatenate multiple GZIP files and keep a record of the offsets and 
sequences contained in each. The advantage is compatibility, in that it 
decompresses just like it was one big chunk, yet you can compose subsets of the 
data without decompressing/recompressing and (as long as we actually have to 
write out the file subsets) can reap the reduced IO benefits of smaller writes.
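
The concatenation approach can be tried out with Python's standard gzip module. This is a sketch under stated assumptions, not the code from the parallelism work: build_blocked_gzip and read_block are hypothetical names, and the index here records only byte offsets and lengths rather than which sequences each block contains.

```python
import gzip


def build_blocked_gzip(chunks):
    """Compress each chunk as its own gzip member and concatenate the members.

    Returns the combined bytes plus an index of (offset, length) per member.
    """
    blob = b""
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk)
        index.append((len(blob), len(member)))
        blob += member
    return blob, index


def read_block(blob, index, i):
    """Decompress only member i, using the recorded offset and length."""
    start, length = index[i]
    return gzip.decompress(blob[start:start + length])


if __name__ == "__main__":
    records = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nTTGA\n+\nIIII\n"]
    blob, index = build_blocked_gzip(records)
    # The whole file still decompresses as if it were one stream.
    print(gzip.decompress(blob) == b"".join(records))
    # A single block can be extracted without touching the others.
    print(read_block(blob, index, 1) == records[1])
```

A subset file can be composed by copying the selected members byte for byte, with no recompression, since each member is a valid gzip file on its own.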

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Thursday, October 06, 2011 8:16 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John  wrote:
> I'd be up for something like that, although I have other tasking
> in the short term after I finish my parallelism work. I'd rather not have
> Galaxy do the compression/decompression, because that will not
> effectively utilize the distributed nature of many filesystems, such
> as Isilon, that our customers use.

Is that like a compressed filesystem, where there is probably less
benefit to storing the data gzipped?

> My parallelism work (second
> phase almost done) handles that by using a block-gzipped
> format and index files that allow the compute nodes to seek to
> the blocks they need and extract from there.

How similar is your block-gzipped approach to BGZF used in BAM?

> Another thing that should probably go along with this is an
> enhancement to metadata such that it can be fed in from the
> outside. We upload files by linking to file paths, and at that
> point, we know everything about the files (index information).
> So no need to decompress a 500GB file and read the whole
> thing just to count the lines - all you have to do is ask ;-}

I can see how that might be useful.

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Peter Cock
On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John  wrote:
> I'd be up for something like that, although I have other tasking
> in the short term after I finish my parallelism work. I'd rather not have
> Galaxy do the compression/decompression, because that will not
> effectively utilize the distributed nature of many filesystems, such
> as Isilon, that our customers use.

Is that like a compressed filesystem, where there is probably less
benefit to storing the data gzipped?

> My parallelism work (second
> phase almost done) handles that by using a block-gzipped
> format and index files that allow the compute nodes to seek to
> the blocks they need and extract from there.

How similar is your block-gzipped approach to BGZF used in BAM?

> Another thing that should probably go along with this is an
> enhancement to metadata such that it can be fed in from the
> outside. We upload files by linking to file paths, and at that
> point, we know everything about the files (index information).
> So no need to decompress a 500GB file and read the whole
> thing just to count the lines - all you have to do is ask ;-}

I can see how that might be useful.

Peter


Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Duddy, John
I'd be up for something like that, although I have other tasking in the 
short term after I finish my parallelism work. I'd rather not have Galaxy do 
the compression/decompression, because that will not effectively utilize the 
distributed nature of many filesystems, such as Isilon, that our customers use. 
My parallelism work (second phase almost done) handles that by using a 
block-gzipped format and index files that allow the compute nodes to seek to 
the blocks they need and extract from there.

Another thing that should probably go along with this is an enhancement to 
metadata such that it can be fed in from the outside. We upload files by 
linking to file paths, and at that point, we know everything about the files 
(index information). So no need to decompress a 500GB file and read the whole 
thing just to count the lines - all you have to do is ask ;-}

 
John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Thursday, October 06, 2011 1:28 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John  wrote:
> One of the things we're facing is the sheer size of a whole human genome at
> 30x coverage. An effective way to deal with that is by compressing the FASTQ
> files. That works for BWA and our ELAND, which can directly read a
> compressed FASTQ, but other tools crash when reading compressed FASTQ
> files. One way to address that would be to introduce a new type, for
> example "CompressedFastQ", with a conversion to FASTQ defined. BWA could
> take both types as input. This would allow the best of both worlds -
> efficient storage and use by all existing tools.

We'd discussed this and a more general approach where any file
could be gzipped, but the code to do that doesn't exist yet:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html

Issue filed:
https://bitbucket.org/galaxy/galaxy-central/issue/666/

That seems a better long term solution than the pragmatic short term
solution of fastqsanger-gzip (or whatever it gets called). Note that it
sounded like Edward Kirton might already be using this - you should
be consistent.

The other strong idea from that thread was moving from FASTQ to
unaligned BAM, which is gzip compressed, and has explicit
support for paired end reads, read groups, etc.

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-10-06 Thread Peter Cock
On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John  wrote:
> One of the things we’re facing is the sheer size of a whole human genome at
> 30x coverage. An effective way to deal with that is by compressing the FASTQ
> files. That works for BWA and our ELAND, which can directly read a
> compressed FASTQ, but other tools crash when reading compressed FASTQ
> files. One way to address that would be to introduce a new type, for
> example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could
> take both types as input. This would allow the best of both worlds –
> efficient storage and use by all existing tools.

We'd discussed this and a more general approach where any file
could be gzipped, but the code to do that doesn't exist yet:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html

Issue filed:
https://bitbucket.org/galaxy/galaxy-central/issue/666/

That seems a better long term solution than the pragmatic short term
solution of fastqsanger-gzip (or whatever it gets called). Note that it
sounded like Edward Kirton might already be using this - you should
be consistent.

The other strong idea from that thread was moving from FASTQ to
unaligned BAM, which is gzip compressed, and has explicit
support for paired end reads, read groups, etc.

Peter



Re: [galaxy-dev] Tool shed and datatypes

2011-10-05 Thread Duddy, John
One of the things we're facing is the sheer size of a whole human genome at 30x 
coverage. An effective way to deal with that is by compressing the FASTQ files. 
That works for BWA and our ELAND, which can directly read a compressed FASTQ, 
but other tools crash when reading compressed FASTQ files. One way to 
address that would be to introduce a new type, for example "CompressedFastQ", 
with a conversion to FASTQ defined. BWA could take both types as input. This 
would allow the best of both worlds - efficient storage and use by all existing 
tools.
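
As a rough illustration of the "compressed type plus a conversion to FASTQ" idea, here is a minimal Python sketch. It is not Galaxy code: the function names (is_gzipped, open_fastq, convert_to_plain_fastq, count_reads) are hypothetical, and it assumes the compressed files are ordinary gzip files with four lines per FASTQ record.

```python
import gzip
import shutil

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")


def convert_to_plain_fastq(src, dest):
    """Write an uncompressed copy of src to dest (hypothetical converter step)."""
    with open_fastq(src) as reader, open(dest, "w") as writer:
        shutil.copyfileobj(reader, writer)


def count_reads(path):
    """Count records, assuming the usual four-lines-per-record layout."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4
```

A tool that can read gzip directly would use something like open_fastq; a tool that cannot would be handed the output of convert_to_plain_fastq instead.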

Another example would be adding the CASAVA tools to Galaxy. Some of the 
statistics generation tools use custom file formats. To be able to make the use 
of those tools optional and configurable, they should be separate from the 
aligner, but that would require that Galaxy be made aware of the custom file 
formats - we'd have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com

From: Greg Von Kuster [mailto:g...@bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?

Thanks!


On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:


Can we introduce new file types via tools in the tool shed? It seems Galaxy can 
load them if they are in the datatypes configuration file. Does tool 
installation automate the editing of that file?


John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu




Re: [galaxy-dev] Tool shed and datatypes

2011-10-05 Thread Greg Von Kuster
Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?
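
For context, automating such an edit could look roughly like the sketch below. This is an illustration only, not how the tool shed does or will do it. It assumes a datatypes_conf.xml laid out as a "datatypes" root with a "registration" child holding "datatype" elements that carry "extension" and "type" attributes; the function name register_datatype and the example extension and class path are hypothetical.

```python
import xml.etree.ElementTree as ET


def register_datatype(conf_xml, extension, type_path):
    """Return conf_xml with a datatype entry added, unless one already exists.

    Assumes <datatypes><registration><datatype .../></registration></datatypes>.
    """
    root = ET.fromstring(conf_xml)
    registration = root.find("registration")
    if registration is None:
        registration = ET.SubElement(root, "registration")
    # Leave the configuration untouched if the extension is already registered.
    for existing in registration.findall("datatype"):
        if existing.get("extension") == extension:
            return conf_xml
    ET.SubElement(
        registration, "datatype", {"extension": extension, "type": type_path}
    )
    return ET.tostring(root, encoding="unicode")
```

For example, register_datatype(conf_text, "compressedfastq", "example.datatypes:CompressedFastq") would append one entry, where both the extension and the class path are made-up names.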

Thanks!


On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:

> Can we introduce new file types via tools in the tool shed? It seems Galaxy 
> can load them if they are in the datatypes configuration file. Does tool 
> installation automate the editing of that file?
>  
>  
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>  

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu


