Re: [galaxy-user] Patch for better FASTQ description handling

2011-11-29 Thread Florent Angly

On 19/10/11 23:31, Daniel Blankenberg wrote:


Sorry for the delay.  I did try the patch out shortly after you 
contributed it, but it caused the functional to fail.  I was able to 
fix the issue and allow the existing tests to start passing, but I've 
been bogged down lately and haven't been able to perform a more 
thorough review of the code. If you could provide tests with files 
(e.g. for the tools affected) that test the new functionality, that 
would be a great help.




Hi Dan,
I finally addressed your comments. See the updated pull request: 
https://bitbucket.org/galaxy/galaxy-central/pull-request/8/paired-end-code-mishandles-description-of

Best,
Florent

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-20 Thread Eric Cabot

Florent Angly wrote:

Peter and Daniel, thanks for the comments.

On 19/10/11 23:49, Peter Cock wrote:
On Wed, Oct 19, 2011 at 2:31 PM, Daniel Blankenbergd...@bx.psu.edu  
wrote:

Hi Florent,
Sorry for the delay.  I did try the patch out shortly after you 
contributed
it, but it caused the functional to fail.  I was able to fix the 
issue and
allow the existing tests to start passing, but I've been bogged down 
lately
and haven't been able to perform a more thorough review of the code. 
If you
could provide tests with files (e.g. for the tools affected) that 
test the

new functionality, that would be a great help.

I'll have a look at that.
The use of partition removes python compatibility for2.5, although 
this is

a lesser/non-concern.

I guess you could use split, but special case on there being no space.

Also, I'm not entirely sold on having the Identifier line being 
parsed as
  identifier +space  + description instead a single identifier 
line.

That is the normal convention, just like with FASTA.
http://dx.doi.org/10.1093/nar/gkp1137
The Bioperl and Biopython projects use this convention for FASTA and 
FASTQ files.



This would mean that identifiers could not themselves contain spaces,
but There is no standardization for identifiers (so they could 
technically
have spaces?). Could two different reads be identified as Read A 
and Read
B, but then would no longer be uniquely identifiable as each would 
then be

identified as Read.  If this added functionalilty were introduced as
optional behavior (e.g. a user needs to click a checkbox on the tools to
apply the id line splitting), these concerns can be mitigated.
That is expected, @Read A and @Read B have the same identifier, 
Read.


Peter, Florent, anyone else: I'd be very interested to hear your 
thoughts on
the above, particularly in respect to know real-world data. For now, 
lets

discount SRA data from this discussion.

See also the new Illumina 1.8 naming convention where they dropped
the /1 and /2 and hit it in the description. It should be tested, but 
I think
Florent's patch will work here (while the current Galaxy behaviour 
won't).


Peter
I was not aware of this new naming. It seems like a terrible decision 
from Illumina because now both reads in a pair technically have the same 
ID (but a different description).


This is not quite the case. Here are two fastq header lines for a pair of 
reads produced by Illumina's CASAVA 1.8:


@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA

The two key things to note, relevant to this discussion are:

1. A space character is used to split the fields into two groups.
This is actually a good thing, because that particular character can NEVER 
appear in either a sequence or a quality line. This make it easy to detect 
name lines as those beginning with @ (a valid quality character) and 
also having a space. If you are writing a parser for the new Illumina 
fastq format, please don't break the names on spaces!


2. Appart from the read number, encoded as the digit immediately following 
the space, the two lines are identical--as they were with earlier CASAVA 
versions.  Why is this worse than two lines differing by /1 vs. /2?


An additional improvement with the new naming convention is that flowcell 
and run ID's, as well as a flag for not passing filters (where N means 
does PF), are now included.




Eric L. Cabot
Biotechnology Center
University of Wisconsin-Madison




___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-20 Thread Peter Cock
On Thu, Oct 20, 2011 at 2:15 PM, Eric Cabot ca...@biotech.wisc.edu wrote:
 I was not aware of this new naming. It seems like a terrible decision from
 Illumina because now both reads in a pair technically have the same ID (but
 a different description).

 This is not quite the case. Here are two fastq header lines for a pair of
 reads produced by Illumina's CASAVA 1.8:

 @XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA
 @XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA

Yes, Illumina gives both read 1 and read 2 the same template ID
of XYZZY:123:D0ABCDEFG:7:1101:1445:2057 (much like the
two reads would have the same ID in a SAM/BAM file).

 The two key things to note, relevant to this discussion are:

 1. A space character is used to split the fields into two groups.
 This is actually a good thing, because that particular character can NEVER
 appear in either a sequence or a quality line. This make it easy to detect
 name lines as those beginning with @ (a valid quality character) and also
 having a space. If you are writing a parser for the new Illumina fastq
 format, please don't break the names on spaces!

Yes, you could use the space as a sanity test for *this* style Illumina
FASTQ, and have a bespoke parser which treats this all specially.
But for a generic FASTQ parser you *should* split at the space.

The point is Illumina have changed the meaning of their FASTQ
identifier, it used to be the template ID plus a /1 or /2 suffix, but
now it is just the common template ID used for both parts.

 2. Appart from the read number, encoded as the digit immediately following
 the space, the two lines are identical--as they were with earlier CASAVA
 versions.  Why is this worse than two lines differing by /1 vs. /2?

Because it is a change from the existing well established convention,
which will require changed to hundreds of scripts and and tools
(guessed number including user's bespoke scripts).

 An additional improvement with the new naming convention is that flowcell
 and run ID's, as well as a flag for not passing filters (where N means does
 PF), are now included.

Yes, that is good.

Peter

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-19 Thread Peter Cock
On Wed, Oct 19, 2011 at 4:53 AM, Florent Angly florent.an...@gmail.com wrote:

 I have had the chance to try the patch on several datasets and it looks good
 :)
 I reiterate my suggestion to pull the patch in galaxy-central.
 Best,
 Florent

It looks sensible, although I would add a comment to the join method
to say the description from read1 is taken (if the reads differ in their
descriptions). Mind you, the whole module seems to lack docstrings ;)

Are there any unit tests (not that Galaxy seems to insist on them)?

Peter
___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-19 Thread Daniel Blankenberg
Hi Florent,

Sorry for the delay.  I did try the patch out shortly after you contributed it, 
but it caused the functional to fail.  I was able to fix the issue and allow 
the existing tests to start passing, but I've been bogged down lately and 
haven't been able to perform a more thorough review of the code. If you could 
provide tests with files (e.g. for the tools affected) that test the new 
functionality, that would be a great help. 

The use of partition removes python compatibility for 2.5, although this is a 
lesser/non-concern.

Also, I'm not entirely sold on having the Identifier line being parsed as  
identifier + space + description instead a single identifier line. This 
would mean that identifiers could not themselves contain spaces, but There is 
no standardization for identifiers (so they could technically have spaces?). 
Could two different reads be identified as Read A and Read B, but then 
would no longer be uniquely identifiable as each would then be identified as 
Read.  If this added functionalilty were introduced as optional behavior 
(e.g. a user needs to click a checkbox on the tools to apply the id line 
splitting), these concerns can be mitigated. 


Peter, Florent, anyone else: I'd be very interested to hear your thoughts on 
the above, particularly in respect to know real-world data. For now, lets 
discount SRA data from this discussion.


Thanks,

Dan



On Oct 19, 2011, at 5:00 AM, Peter Cock wrote:

 On Wed, Oct 19, 2011 at 4:53 AM, Florent Angly florent.an...@gmail.com 
 wrote:
 
 I have had the chance to try the patch on several datasets and it looks good
 :)
 I reiterate my suggestion to pull the patch in galaxy-central.
 Best,
 Florent
 
 It looks sensible, although I would add a comment to the join method
 to say the description from read1 is taken (if the reads differ in their
 descriptions). Mind you, the whole module seems to lack docstrings ;)
 
 Are there any unit tests (not that Galaxy seems to insist on them)?
 
 Peter
 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:
 
  http://lists.bx.psu.edu/listinfo/galaxy-dev
 
 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:
 
  http://lists.bx.psu.edu/

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-19 Thread Florent Angly

Peter and Daniel, thanks for the comments.

On 19/10/11 23:49, Peter Cock wrote:

On Wed, Oct 19, 2011 at 2:31 PM, Daniel Blankenbergd...@bx.psu.edu  wrote:

Hi Florent,
Sorry for the delay.  I did try the patch out shortly after you contributed
it, but it caused the functional to fail.  I was able to fix the issue and
allow the existing tests to start passing, but I've been bogged down lately
and haven't been able to perform a more thorough review of the code. If you
could provide tests with files (e.g. for the tools affected) that test the
new functionality, that would be a great help.

I'll have a look at that.

The use of partition removes python compatibility for2.5, although this is
a lesser/non-concern.

I guess you could use split, but special case on there being no space.


Also, I'm not entirely sold on having the Identifier line being parsed as
  identifier +space  + description instead a single identifier line.

That is the normal convention, just like with FASTA.
http://dx.doi.org/10.1093/nar/gkp1137
The Bioperl and Biopython projects use this convention for FASTA and 
FASTQ files.



This would mean that identifiers could not themselves contain spaces,
but There is no standardization for identifiers (so they could technically
have spaces?). Could two different reads be identified as Read A and Read
B, but then would no longer be uniquely identifiable as each would then be
identified as Read.  If this added functionalilty were introduced as
optional behavior (e.g. a user needs to click a checkbox on the tools to
apply the id line splitting), these concerns can be mitigated.

That is expected, @Read A and @Read B have the same identifier, Read.


Peter, Florent, anyone else: I'd be very interested to hear your thoughts on
the above, particularly in respect to know real-world data. For now, lets
discount SRA data from this discussion.

See also the new Illumina 1.8 naming convention where they dropped
the /1 and /2 and hit it in the description. It should be tested, but I think
Florent's patch will work here (while the current Galaxy behaviour won't).

Peter
I was not aware of this new naming. It seems like a terrible decision 
from Illumina because now both reads in a pair technically have the same 
ID (but a different description).


Florent


___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-user] Patch for better FASTQ description handling

2011-10-18 Thread Florent Angly


I have had the chance to try the patch on several datasets and it looks 
good :)

I reiterate my suggestion to pull the patch in galaxy-central.
Best,
Florent


On 05/10/11 18:28, Florent Angly wrote:

Hi,

I have found some issue with the way FASTQ read description is handled 
by Galaxy utilities:
https://bitbucket.org/galaxy/galaxy-central/issue/665/paired-end-code-mishandles-description-of 


Please consider pulling my patch, thanks,

Florent


___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/