Re: [galaxy-user] Counting RNA-seq reads per class.

2012-09-13 Thread Jennifer Jackson

Hello Mo,

This may be a coordinate problem with 0-based vs 1-based start files. 
Using tools from "Operate on Genomic Intervals" might be an alternative 
since it works with the coordinates appropriately. File formats can be 
converted as needed BAM <-> SAM -> Interval.
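
To illustrate the kind of off-by-one issue meant here, below is a minimal sketch in Python. It assumes the usual conventions (BED-style intervals are 0-based with an exclusive end, GFF/GTF-style intervals are 1-based with an inclusive end). The function names and the example coordinates are made up for illustration and are not part of any Galaxy tool.

```python
# BED-style intervals: 0-based start, end exclusive (half-open).
# GFF/GTF-style intervals: 1-based start, end inclusive (closed).

def bed_to_one_based(start, end):
    """Convert a 0-based half-open interval to a 1-based closed one."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based closed interval to a 0-based half-open one."""
    return start - 1, end


def overlap_closed(a, b):
    """Number of shared bases between two 1-based closed intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


# Hypothetical read covering the first 100 bases, in BED convention.
read_bed = (0, 100)
read_gtf = bed_to_one_based(*read_bed)

# Hypothetical feature starting at the very next base, in GTF convention.
feature_gtf = (101, 200)
feature_bed = one_based_to_bed(*feature_gtf)

# Both intervals in the same convention: no overlap.
correct = overlap_closed(read_gtf, feature_gtf)

# Feature left in BED convention but treated as 1-based: a spurious
# single-base overlap appears.
mixed = overlap_closed(read_gtf, feature_bed)

print(read_gtf, feature_bed, correct, mixed)
```

When both intervals are in the same convention the read and the feature do not touch; mixing conventions produces a one-base overlap that is not really there.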


Alternatively, and this may sound simple, would the tool "Join, Subtract 
and Group -> Group" do the summary with enough specificity? These files 
(eg transcript/gene expression) have both the 'class_code' and a 
'coverage' column. Coverage isn't exactly the same number but it does 
quantify the read data Cufflinks actually used to create the assembled 
transcripts assigned to the various class_codes, if that is what you are 
looking for.
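
Outside of Galaxy, the same kind of grouping could be sketched roughly as follows. This is only an illustration: the column names ('transcript_id', 'class_code', 'coverage') and the rows are made up, so please check them against the header of your own file.

```python
import csv
import io
from collections import defaultdict

# Made-up tab-separated rows standing in for a transcript expression file.
# Column names are assumptions, not taken from a real Cufflinks output.
example = io.StringIO(
    "transcript_id\tclass_code\tcoverage\n"
    "TCONS_1\t=\t12.5\n"
    "TCONS_2\tj\t3.0\n"
    "TCONS_3\t=\t7.5\n"
    "TCONS_4\tu\t1.25\n"
)


def summarize_by_class(handle):
    """Count transcripts and sum coverage for each class_code."""
    totals = defaultdict(lambda: {"transcripts": 0, "coverage": 0.0})
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = totals[row["class_code"]]
        entry["transcripts"] += 1
        entry["coverage"] += float(row["coverage"])
    return dict(totals)


summary = summarize_by_class(example)
for code, stats in sorted(summary.items()):
    print(code, stats["transcripts"], stats["coverage"])
```

Summed coverage is a per-class summary of read support, not a read count, so it should be read as a proxy only.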


Please let us know if your question has been misunderstood. Others are 
also welcome to add in more comments!


Best,

Jen
Galaxy team

On 9/10/12 8:52 AM, Mohammad Heydarian wrote:

Hi All,
I have been trying to count the number of RNA-seq reads that fall into
the various Cufflinks class codes ('=', 'j', 'u', 'x', etc...) and I am
curious how others are counting reads per class.

I tried first using the BedTools tool where you "count" the number of
reads overlapping another set of intervals and later realized that each
interval is extended 1 kb up and downstream prior to the analysis (by
default and not adjustable on Galaxy), so the number of reads that were
"counted" for all of the classes was always much more than the number of
reads that I had in my BAM file. I then tried to isolate reads from
each class into separate BAM files, using the BedTools "intersect" tool
and there I consistently end up with significantly fewer reads than I
have in my sample.

I am very curious to find out how others are tackling this problem on
Galaxy.

Thanks for any input!

Cheers,
Mo Heydarian





___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/



--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] Counting RNA-seq reads per class.

2012-09-13 Thread Jennifer Jackson

Hi Mo,

I wanted to send a follow-up after reading your question again.

The tools in the group "BEDTools" should also be interpreting 
coordinates from various file formats correctly, but if you have an 
exact example where you believe it is not, please share that history 
with me privately and I will take a look.


I also was thinking about the method of counting up the number of 
alignments (from the BAM file) based on overlapping coordinates. You 
mentioned that this was giving you a total that was greater than the 
original number of alignments. A single extra base of overlap (if 
this turns out to be a problem) seems unlikely to be responsible for so 
much over-counting. Distinct gene bounds wouldn't be that close (in any 
great number). Something else might be the problem.


Was the counting for genes or transcripts? If transcripts, then 
alignments counting more than once would be expected (since transcripts 
in the same gene bound will have overlap). Maybe this was the case?
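
As a rough illustration of why per-transcript counts can add up to more than the number of alignments, here is a small sketch in Python. The two transcript intervals and the four reads are invented, use 0-based half-open coordinates on a single chromosome, and ignore strand.

```python
def overlaps(read, feature):
    """True if two 0-based half-open intervals share at least one base."""
    return read[0] < feature[1] and feature[0] < read[1]


# Two hypothetical isoforms of the same gene; they share bases 300..500.
transcripts = {"tx_A": (100, 500), "tx_B": (300, 800)}

# Four hypothetical alignments.
reads = [(120, 170), (350, 400), (450, 500), (700, 750)]

# Count per transcript, the way an overlap-counting tool would.
per_transcript = {
    name: sum(overlaps(read, interval) for read in reads)
    for name, interval in transcripts.items()
}
summed = sum(per_transcript.values())

# Count each read once if it hits any transcript at all.
distinct = sum(
    any(overlaps(read, interval) for interval in transcripts.values())
    for read in reads
)

print(per_transcript, summed, distinct)
```

Reads that fall in the shared region are counted once per transcript, so the summed total exceeds the number of distinct reads.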


But even when using genes, it is possible that there is still some 
overlap - so some double counting might be expected at a low level. This 
could be tested using the tool "Operate on Genomic Intervals -> Merge". 
If any genes merge, then there is overlap and this is where a few 
over-counted alignments might come from.
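
The idea behind that check can be sketched in a few lines of Python. This is not the Galaxy tool itself, only the general technique. It assumes 0-based half-open intervals on one chromosome, ignores strand, and merges only intervals that truly overlap (not ones that merely touch). The gene coordinates are made up.

```python
def merge_intervals(intervals):
    """Merge overlapping 0-based half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Hypothetical gene bounds; the first two overlap by 200 bases.
genes = [(1000, 2000), (1800, 2500), (5000, 6000)]

merged = merge_intervals(genes)

# Fewer intervals after merging means at least two genes overlapped.
has_overlap = len(merged) < len(genes)

print(merged, has_overlap)
```

If the merged set is smaller than the original set, some gene bounds overlap and reads in those regions can be counted more than once.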


I haven't addressed strand, but I am sure that you are taking that into 
consideration with the analysis.


I do not know of a method to track back and find out exactly which 
alignments contributed to which assembled transcripts. Using coverage seems 
to be the alternative to consider, but the question is still definitely 
open for others to comment (and correct!).


Take care,

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org