Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread James Taylor
open_compressed in bx-python does this already (for bz2 as well).

On Jul 8, 2013, at 5:58 PM, Peter Cock p.j.a.c...@googlemail.com wrote:

 On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
 robert.baert...@gmail.com wrote:
 Peter and Dan,
 I like the idea of replacing all open() with galaxy_open() in all tools. You
 can tell the format by looking at the first 4 bytes (see C code below from
 the UCSC browser team). Is there some pythonic way of overriding open?
 
 There is monkey patching (replace the current 'open' function with
 your modified version), but that is not a good idea in general.
 
 In any case, this would only affect the small number of Python
 tools which happen to use the Galaxy parsing libraries - which
 is a very small fraction of the tools in Galaxy. Most of the tools
 in Galaxy are compiled programs and are entirely separate.
 
 Peter
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/
 
 To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/



Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
Peter and Dan,
I like the idea of replacing all open() with galaxy_open() in all tools. You 
can tell the format by looking at the first 4 bytes (see C code below from the 
UCSC browser team). Is there some pythonic way of overriding open?

You need to read the first four bytes of the file to see if it is compressed, 
then call gzip.open inside the function and pass back the handle. 

For now, it would require a global sweep through the tools to change open() to 
galaxy_open(), but it is probably a good idea to have tool developers avoid 
calling open directly.

You would have to have special handling if there are multiple files in the 
compressed archive but that support could be added later.

-Robert


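import bz2
import gzip
import zipfile

# Leading "magic" bytes of each supported format (same as the C code below).
SIGNATURES = ((b"\x1f\x8b", "gz"), (b"BZ", "bz2"), (b"PK\x03\x04", "zip"))


def getCompressor(filename):
    # Read the first four bytes and match them against the known signatures.
    with open(filename, "rb") as handle:
        first4bytes = handle.read(4)
    for magic, ext in SIGNATURES:
        if first4bytes.startswith(magic):
            return ext
    return None


def galaxy_open(filename, mode="r"):
    compressor = getCompressor(filename)
    if compressor is not None:
        return openCompressed(filename, mode, compressor)
    return open(filename, mode)


def openCompressed(filename, mode, ext):
    if ext == "gz":
        return gzip.open(filename, mode)
    elif ext == "bz2":
        return bz2.BZ2File(filename, mode)
    elif ext == "zip":
        # An archive object, not a file handle: multi-file zips need extra care.
        return zipfile.ZipFile(filename, mode)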

  

char *getExtensionFromHdrSig(char *first4bytes)
/* Check if header has signature of supported compression stream,
   and return the matching file extension, or NULL if no sig found. */
{
char *ext = NULL;
if (startsWith("\x1f\x8b", first4bytes)) ext = "gz";
else if (startsWith("\x1f\x9d\x90", first4bytes)) ext = "Z";
else if (startsWith("BZ", first4bytes)) ext = "bz2";
else if (startsWith("PK\x03\x04", first4bytes)) ext = "zip";
return ext;
}


Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Peter Cock
On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch rbaer...@ucsc.edu wrote:
 On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
 The tools available in Galaxy are written in a range
 of languages including C, Perl, R, etc. Yes, some are in Python,
 but of those most are independent of Galaxy and can be used
 separately from Galaxy.

 The helper function would have to be ported to R. We are talking
 about how Galaxy compresses data. Once we decide that, we
 can determine how best to implement it.

Individual tools called from Galaxy read and create the files -
and we can't usually control them at this level (modifying them all
to call a Galaxy managed file open mechanism is not an option).

 Proposal: Do not treat compressed data as a separate data type.
 Treat it as an independent attribute that can be applied to any data.
 Otherwise you will have to create a gzipped, zip and bz2 type for
 every type that you want to compress.

That's what I've been saying - the fact that some people are
already using a new gzipped FASTQ format within their Galaxy
instances is practical, but I view it as a short term solution only.

 Encoding the gzip status in the datatype will create an explosion of
 datatypes. Compression is not actually a datatype, it tells you nothing
 about the content data that is stored in the file.

 What we'd previously discussed was a dual system, holding
 the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
 compression (e.g., None, normal GZIP, BGZF which is a
 GZIP variant, BZIP2, etc).

 What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?

Note ZIP is a bit different, as it is often a multiple file bundle -
it behaves differently from GZIP, BGZF, XZ, BZIP2 etc. in that
regard.

But otherwise, yes. As a specific example, the tabix tool uses BGZF
compressed tabular data to combine compression and efficient
random access. This would be useful for many annotation files
(e.g. GTF, GFF3).

 This will quickly get out of hand and create a mess for tool
 developers that need to support all these types.

Why? Individual tool developers don't need to know if Galaxy
is keeping the original data file on disk compressed - unless
the tool XML says otherwise, Galaxy would hide this detail
and call the tool with an uncompressed input file.

(A Unix named pipe which decompresses the file on the fly would
be a potential alternative - but only if the tool XML was marked
up to say that an input could be streamed. The default must be
to assume potential random access to the input files.)
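
As a rough sketch of the named pipe idea (an illustration only, not how
Galaxy works): a background thread decompresses a gzipped file into a FIFO
while the tool reads the FIFO as if it were a plain file. The function name
stream_decompressed and the file names are made up for this example, and it
assumes a POSIX system where os.mkfifo is available. Because a pipe cannot
seek, this only suits tools that read their input once from start to end.

```python
import gzip
import os
import shutil
import tempfile
import threading


def stream_decompressed(gz_path, fifo_path):
    """Create a FIFO and feed it the decompressed contents of gz_path."""
    os.mkfifo(fifo_path)

    def pump():
        # Opening the FIFO for writing blocks until a reader opens it too.
        with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    worker = threading.Thread(target=pump)
    worker.start()
    return worker


# Build a tiny gzipped input so the example is self-contained.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "reads.fastq.gz")
fifo_path = os.path.join(workdir, "reads.fastq")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b"@r1\nACGT\n+\nIIII\n")

worker = stream_decompressed(gz_path, fifo_path)
# This read stands in for the tool consuming its input sequentially.
with open(fifo_path, "rb") as tool_input:
    data = tool_input.read()
worker.join()
shutil.rmtree(workdir)
```

No uncompressed copy is ever written to disk here, which is the attraction,
but a tool that tries to seek in its input would fail on the pipe.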

 The tool code and tool xml should be written to handle uncompressed
 data and galaxy should handle the details of decompression. This
 is not hard to do.

It isn't trivial either ;)

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
Great. Let's put the bx-python calls in a galaxy_open helper function.

On Jul 8, 2013, at 3:20 PM, James Taylor wrote:

 open_compressed in bx-python does this already (for bz2 as well).
 




Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
I will implement this if the Galaxy team likes the approach. 

We did this in the UCSC Genome Browser code years ago: a single open_helper call 
handles gzip, http, ftp and pipes. There is no need to care about how the data 
is compressed or where the data resides. 

Wouldn't it be great to be able to pipe data between workflow steps rather than 
writing to disk? I admit that this will require some work, but the first step 
is to abstract the open.





Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch
robert.baert...@gmail.com wrote:
 Dan,
 Do these readers support gzip files?

 reader = fastqVerboseErrorReader
 reader = fastqReader

Presumably you are writing a Python script using this library?
The answer is a qualified yes. Instead of passing them a normal
file handle from open("example.fastq"), you pass them one from
gzip.open("example.fastq"), after doing import gzip.
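
A minimal sketch of that idea, assuming the reader only needs a file-like
handle. The iter_fastq function below is a hypothetical stand-in written for
this example, not Galaxy's fastqReader, and the file name is made up; the
only library call relied on is gzip.open from the Python standard library.

```python
import gzip
import os
import tempfile


def iter_fastq(handle):
    """Hypothetical stand-in reader: yield (name, sequence, quality) tuples."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the "+" separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


# Write a tiny gzipped FASTQ file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

# Text mode ("rt") makes the gzip handle yield str lines, like open() would.
with gzip.open(path, "rt") as handle:
    records = list(iter_fastq(handle))
```

The reader itself is unchanged; only the call that produces the handle
differs, which is why the answer is a qualified yes.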

 Do I have to define a special type in galaxy for gzipped files or will the 
 fastq type be ok?


This needs a special file format - but you are not the first person to
look at this, some groups have defined custom gzipped variants of
the FASTQ formats within their own Galaxy instances. I've not
done this but there should be some useful emails in the archive.

Note you'd also need to modify any tool definitions so that they
can accept a gzipped FASTQ file.

 Ideally, I would like to keep my files zipped and not have galaxy unzip them, 
 since they triple in size when unzipped.

 I'm happy to do a pull request if you don't support this but I want to make 
 sure I'm in line with your roadmap.

Personally I would like a more general system in Galaxy for
potentially any file type to be held compressed in a range of
formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions
for things like BAM which are already compressed. This way
naive tools would get the gzipped file uncompressed to a
temporary folder before use (i.e. no change for the tool wrapper),
but if a tool accepts a gzipped file it will get that (less disk IO
and CPU usage, but requires updating tool wrappers).
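
To make the proposal concrete, here is a small sketch of the decision it
describes. Everything named here is an assumption for illustration:
prepare_input, the compression labels and the "accepted" list are not
existing Galaxy APIs. A naive tool gets an uncompressed copy in a temporary
folder, while a tool that declares gzip support is handed the stored file
untouched.

```python
import bz2
import gzip
import os
import shutil
import tempfile

# Openers for the compression formats this sketch knows about.
OPENERS = {"gzip": gzip.open, "bz2": bz2.open}


def prepare_input(path, compression, accepted=("none",)):
    """Return a path the tool can read, decompressing to a temp folder if needed.

    `compression` is how the dataset is stored; `accepted` lists what the
    tool wrapper says it can read directly (both hypothetical attributes).
    """
    if compression == "none" or compression in accepted:
        return path  # hand the stored file to the tool unchanged
    tmp_dir = tempfile.mkdtemp()
    plain_path = os.path.join(tmp_dir, "input.dat")
    with OPENERS[compression](path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return plain_path


# Build a tiny gzipped dataset so the example is self-contained.
stored = os.path.join(tempfile.mkdtemp(), "reads.fastq.gz")
with gzip.open(stored, "wb") as handle:
    handle.write(b"@read1\nACGT\n+\nIIII\n")

naive_path = prepare_input(stored, "gzip")                      # naive tool
aware_path = prepare_input(stored, "gzip", accepted=("gzip",))  # gzip-aware tool
```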

That idea is quite ambitious though ;)

 I have written a simple tool to convert Illumina fastq to mapsplice fastq. 
 Does that already exist somewhere?


I don't know.

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
robert.baert...@gmail.com wrote:
 Peter and Dan,
 I like the idea of replacing all open() with galaxy_open() in all tools. You
 can tell the format by looking at the first 4 bytes (see C code below from
 the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with
your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python
tools which happen to use the Galaxy parsing libraries - which
is a very small fraction of the tools in Galaxy. Most of the tools
in Galaxy are compiled programs and are entirely separate.

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch rbaer...@ucsc.edu wrote:
 I respectfully disagree. If you want an extensible system, you should
 always wrap primitive system level calls.

 Any tool that opens a file that could be compressed would be affected.
 That is a huge number of tools. Do you really want a cottage industry of
 tools that have different methods of dealing with compression?

But defining a Python helper function within the Galaxy Python
libraries doesn't achieve that.

Are you talking about patching the OS level POSIX open functions
or something? The tools available in Galaxy are written in a range
of languages including C, Perl, R, etc. Yes, some are in Python,
but of those most are independent of Galaxy and can be used
separately from Galaxy.

 Encoding the gzip status in the datatype will create an explosion of
 datatypes. Compression is not actually a datatype, it tells you nothing
 about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding
the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
compression (e.g., None, normal GZIP, BGZF which is a
GZIP variant, BZIP2, etc).

Galaxy tool wrappers currently define input files with a list
of file types - they'd also have to give a list of supported
compression types (defaulting to none). Likewise for any
output files - if they are already compressed the XML for
the tool wrapper would have to tell Galaxy this.
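
A sketch of how such a dual system might be modelled, purely as an
assumption for discussion: the class names, field names and compression
labels below are invented here and do not correspond to existing Galaxy
code or tool XML. The datatype describes the content, the compression is a
separate attribute, and a wrapper that says nothing accepts only
uncompressed input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    path: str
    datatype: str               # content format, e.g. "fastq" or "gff3"
    compression: str = "none"   # storage detail, e.g. "gzip", "bgzf", "bz2"


@dataclass(frozen=True)
class ToolInput:
    datatypes: tuple                 # file types the wrapper accepts
    compressions: tuple = ("none",)  # compression types it accepts


def must_decompress(dataset, tool_input):
    """True if the dataset would need decompressing before the tool runs."""
    if dataset.datatype not in tool_input.datatypes:
        raise ValueError("tool does not accept datatype %r" % dataset.datatype)
    return dataset.compression not in tool_input.compressions


reads = Dataset("reads.fastq.gz", "fastq", "gzip")
naive_tool = ToolInput(datatypes=("fastq",))
gzip_aware_tool = ToolInput(datatypes=("fastq",), compressions=("none", "gzip"))
```

With this split, supporting a new compression format means adding one label
rather than one new datatype per existing file type.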

 It is up to the galaxy team to provide a standard way to interact
 with compressed files.

That is my preference too - although this could be driven by
the Galaxy community rather than the core team? I see
defining new datatypes like 'gzippedfastq' as a stop gap
special case (but a very practical route for now).

 My proposed solution is a very small change that could
 be phased in over time. Any tool that uses open() would not support
 compressed files, but it would not break on uncompressed files.

 Do others have an opinion?

Either I don't understand your plan, or it would only help in
a tiny minority of cases.

Regards,

Peter