Re: [galaxy-dev] disk space and file formats

2011-10-06 Thread Peter Cock
On Tue, Sep 6, 2011 at 5:12 PM, Nate Coraor wrote: > Peter Cock wrote: >> On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor wrote: >> > Peter Cock wrote: >> >> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor wrote: >> >> > Ideally, there'd just be a column on the dataset table indicating >> >> > whether t

Re: [galaxy-dev] disk space and file formats

2011-09-08 Thread Fields, Christopher J
The use of (unaligned) BAM for readgroups seems like a good idea. At the very least it prevents inconsistently hacking this information into the FASTQ descriptor (a common problem with any simple format). chris On Sep 8, 2011, at 1:35 PM, Edward Kirton wrote: > copied from another thread: >

Re: [galaxy-dev] disk space and file formats

2011-09-08 Thread Edward Kirton
copied from another thread: On Thu, Sep 8, 2011 at 7:30 AM, Anton Nekrutenko wrote: > What we are thinking of lately is switching to unaligned BAM for > everything. One of the benefits here is the ability to add readgroups from > day 1 simplifying multisample analyses down the road. > this seems

Re: [galaxy-dev] disk space and file formats

2011-09-06 Thread Nate Coraor
Peter Cock wrote: > On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor wrote: > > Peter Cock wrote: > >> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor wrote: > >> > Ideally, there'd just be a column on the dataset table indicating > >> > whether the dataset is compressed or not, and then tools get a new >

Re: [galaxy-dev] disk space and file formats

2011-09-06 Thread Peter Cock
On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor wrote: > Peter Cock wrote: >> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor wrote: >> > Ideally, there'd just be a column on the dataset table indicating >> > whether the dataset is compressed or not, and then tools get a new >> > way to indicate whether

Re: [galaxy-dev] disk space and file formats

2011-09-06 Thread Nate Coraor
Peter Cock wrote: > On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor wrote: > > Edward Kirton wrote: > >> Peter wrote: > >> > I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) > >> > datatype? > >> > However this seems generally useful (not just for FASTQ) so perhaps a > >> > more >

Re: [galaxy-dev] disk space and file formats

2011-09-06 Thread Peter Cock
On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor wrote: > Edward Kirton wrote: >> Peter wrote: >> > I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) >> > datatype? >> > However this seems generally useful (not just for FASTQ) so perhaps a more >> > general mechanism would be better w

Re: [galaxy-dev] disk space and file formats

2011-09-06 Thread Nate Coraor
Edward Kirton wrote: > > In your position I agree that is a pragmatic choice. > > Thanks for helping me muddle through my options. > > > You might be able to > > modify the file upload code to gzip any FASTQ files... that would prevent > > uncompressed FASTQ getting into new histories. > > Right

Re: [galaxy-dev] disk space and file formats

2011-09-03 Thread Paul Gordon
Probably not, as it is somewhat a competitor of SAM/BAM (a bit broader in scope, beyond just alignments). As Peter indicated, I know the BioHDF folks (they are here in town); however, my actual question was whether anyone is actually using HDF5 or SRA in production? I haven't seen adoption

Re: [galaxy-dev] disk space and file formats

2011-09-03 Thread Scott Smith
On Sep 2, 2011, at 8:02 PM, Peter Cock wrote: > On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J > wrote: >> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote: >> What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Edward Kirton
> In your position I agree that is a pragmatic choice. Thanks for helping me muddle through my options. > You might be able to > modify the file upload code to gzip any FASTQ files... that would prevent > uncompressed FASTQ getting into new histories. Right! > I wonder if Galaxy would benefit f
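
The idea quoted above, gzipping FASTQ files as they are uploaded, could look roughly like the sketch below. It is not Galaxy's upload code: the function names gzip_fastq and read_fastq_records are hypothetical, the sketch uses only the Python standard library, and it assumes simple four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(src, dest=None):
    """Compress a FASTQ file with gzip and return the path of the compressed copy.

    Hypothetical helper: the original file is left in place, and the
    default output name is the input name with ".gz" appended.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        # Stream in chunks so large read files are never held in memory.
        shutil.copyfileobj(fin, fout)
    return dest


def read_fastq_records(path):
    """Yield (header, sequence, quality) from a plain or gzipped FASTQ file.

    Assumes four lines per record; the file type is guessed from the
    ".gz" suffix, which is an assumption of this sketch.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header, seq, qual
```

A tool wrapper could call read_fastq_records on either form, so downstream steps would not need to care whether a dataset is stored compressed. Whether this ends up faster or slower overall depends on the balance of CPU and disk IO on the hardware in question.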

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Fields, Christopher J
On Sep 2, 2011, at 8:02 PM, Peter Cock wrote: > On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J > wrote: >> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote: >> What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some t

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Peter Cock
On Saturday, September 3, 2011, Edward Kirton wrote: > of course there is a computational cost to compressing/uncompressing > files but that's probably better than storing unnecessarily huge > files. it's a trade-off. It may still be faster due to less IO, probably depends on your hardware. > s

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Edward Kirton
>>> i actually think illumina's pipeline produces files in this format >>>(unaligned-bam) now. > Oh do they? - that's interesting. Do you have a reference/link? i caught wind of this at the recent illumina user's conference but i asked someone in our sequencing team to confirm and he hadn't hear

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Peter Cock
On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J wrote: > On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote: > >>> What, like a BAM file of unaligned reads? Uses gzip compression, and >>> tracks the pairing information explicitly :) Some tools will already take >>> this as an input format, but

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Fields, Christopher J
On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote: >> What, like a BAM file of unaligned reads? Uses gzip compression, and >> tracks the pairing information explicitly :) Some tools will already take >> this as an input format, but not all. > > ah, yes, precisely. i actually think illumina's pipel

Re: [galaxy-dev] disk space and file formats

2011-09-02 Thread Edward Kirton
> What, like a BAM file of unaligned reads? Uses gzip compression, and > tracks the pairing information explicitly :) Some tools will already take > this as an input format, but not all. ah, yes, precisely. i actually think illumina's pipeline produces files in this format now. wrappers which cre

Re: [galaxy-dev] disk space and file formats

2011-09-01 Thread Peter Cock
On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton wrote: > Read QC intermediate files account for most of the storage used on our > galaxy site. And it's a real problem that I must solve soon. > My first attempt at taming the beast was to try to create a single read QC > tool that did such things as

Re: [galaxy-dev] disk space and file formats

2011-09-01 Thread Edward Kirton
Read QC intermediate files account for most of the storage used on our galaxy site. And it's a real problem that I must solve soon. My first attempt at taming the beast was to try to create a single read QC tool that did such things as convert qual encoding, qual-end trimming, etc. (very basic fun

Re: [galaxy-dev] disk space and file formats

2011-08-19 Thread Jelle Scholtalbers
Hi Patrick, the issue you are having is partly related to Galaxy's aim of ensuring reproducible science by saving each intermediate step and its output files. For example, in your current workflow in Galaxy you can easily do something else with each intermediate file - feed it to a different tool

[galaxy-dev] disk space and file formats

2011-08-19 Thread Patrick Page-McCaw
I'm not a bioinformaticist or programmer so apologies if this is a silly question. I've been occasionally running galaxy on my laptop and on the public server and I love it. The issue that I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a ne