Re: [galaxy-user] Tophat mapping and Cufflinks output issues

2013-01-16 Thread Jeremy Goecks
Tophat is a spliced aligner and should be used when mapping reads to the
genome, not the transcriptome. Because you're mapping your reads to a
transcriptome assembled via Trinity, which has no introns to align across, an
unspliced aligner such as Bowtie or BWA is a good choice.

This also changes your downstream analyses, because Cufflinks does not work 
well on reads mapped to the transcriptome. Tools for quantitating 
transcriptome-mapped reads include RSEM and eXpress.
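
As a rough sketch of the route suggested above (Bowtie for alignment, eXpress for quantification), the Python below only assembles the command lines and does not run anything. The file names and the index name are hypothetical placeholders, and the flags shown (`-a` to report all alignments, `-S` for SAM output, `-o` for the eXpress output directory) should be checked against the help text of the installed versions.

```python
import shlex


def transcriptome_pipeline(transcripts_fa, reads_fq,
                           index_base="trinity_index",
                           sam_out="hits.sam",
                           out_dir="express_out"):
    """Return three command lines as argument lists; nothing is executed.

    All file and directory names are illustrative placeholders.
    """
    return [
        # 1. Build a Bowtie index from the Trinity transcripts.
        ["bowtie-build", transcripts_fa, index_base],
        # 2. Unspliced alignment; -a reports every valid alignment so that
        #    multi-mapping reads stay visible to the quantifier, -S emits SAM.
        ["bowtie", "-a", "-S", index_base, reads_fq, sam_out],
        # 3. Estimate transcript abundances from the alignments.
        ["express", "-o", out_dir, transcripts_fa, sam_out],
    ]


if __name__ == "__main__":
    for cmd in transcriptome_pipeline("Trinity.fasta", "reads.fastq"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

RSEM could take the place of the last step; it is left out here only to keep the sketch short.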

Good luck,
J.

> I've been using the Main Galaxy server to work on an RNA-Seq project for a 
> non-model plant, and I've noticed that my output from Tophat and Cufflinks 
> might not be as good as I'd like.  I have a reference transcriptome assembled 
> in Trinity, and it is based on the same Illumina-generated 100 bp reads I'm 
> trying to map to it.  When I use Tophat to map the reads to the reference 
> transcriptome (I have trimmed the reads and filtered the lower quality ones), 
> only about 10% of the reads actually map, so I go from 30,000,000 reads 
> before mapping to 3,000,000 that are actually mapped.  Therefore, I feel like 
> I'm losing a lot of data.  When I've changed the parameters to allow for more 
> mismatches, not many more reads seem to map, and in many cases, the Tophat 
> run fails and I receive the error message: "Settings: Output files: 
> "/tmp/3030460.cyberstar.psu.edu/tmpWbxTnm/dataset_5530451.*.ebwt" Line rate: 
> 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 5 (one 
> in 32) FTable chars: 10 Strings: unpacked Max bucket size: def".  I've had 
> similar numbers of reads map with Bowtie by itself and BWA as well.  I've 
> also tried mapping the reads to the assembled isoforms (contigs) of the 
> transcriptome, and this results in many more reads (close to 90%) being 
> mapped.  Therefore, I figure the reads should map to the reference 
> transcriptome, and I'm not sure why this isn't happening.
> 
> The other issue I've run into is that in Cuffdiff only about 4,800 genes 
> appear in the output files as being tested for differential expression.  
> There are approximately 100,000 genes in the reference transcriptome, so I 
> was thinking that there should be more than ca. 4,800 that are tested for 
> differential expression.  Should each gene be tested?  Does Cuffdiff just not 
> report some of the genes that are not differentially expressed, or is the 
> program not testing all of the genes?  
> 
> If anyone can provide some help, guidance, or a suggestion, I'd greatly 
> appreciate it.  Thanks, and take care.
> 
> Jim

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

[galaxy-user] Tophat mapping and Cufflinks output issues

2013-01-15 Thread Jim Cohen
Hello Galaxy Users-

I've been using the Main Galaxy server to work on an RNA-Seq project for a
non-model plant, and I've noticed that my output from Tophat and Cufflinks
might not be as good as I'd like.  I have a reference transcriptome
assembled in Trinity, and it is based on the same Illumina-generated 100 bp
reads I'm trying to map to it.  When I use Tophat to map the reads to the
reference transcriptome (I have trimmed the reads and filtered the lower
quality ones), only about 10% of the reads actually map, so I go from
30,000,000 reads before mapping to 3,000,000 that are actually mapped.
Therefore, I feel like I'm losing a lot of data.  When I've changed the
parameters to allow for more mismatches, not many more reads seem to map,
and in many cases, the Tophat run fails and I receive the error message:
"Settings: Output files:
"/tmp/3030460.cyberstar.psu.edu/tmpWbxTnm/dataset_5530451.*.ebwt" Line rate:
6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 5 (one
in 32) FTable chars: 10 Strings: unpacked Max bucket size: def".  I've had
similar numbers of reads map with Bowtie by itself and BWA as well.  I've
also tried mapping the reads to the assembled isoforms (contigs) of the
transcriptome, and this results in many more reads (close to 90%) being
mapped.  Therefore, I figure the reads should map to the reference
transcriptome, and I'm not sure why this isn't happening.
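
The mapping percentages quoted above (about 10% versus close to 90%) can be recomputed directly from an aligner's SAM output. The sketch below is a minimal illustration, not part of any Galaxy tool: it assumes uncompressed SAM text, uses the standard FLAG bit 0x4 for "unmapped", and treats the two mates of a pair as separate reads. It keeps every read name in memory, so it suits small test files rather than 30,000,000 reads.

```python
def mapping_rate(sam_lines):
    """Fraction of reads that have at least one alignment.

    sam_lines: iterable of SAM text lines (header lines start with '@').
    A read is identified by its name plus the first/second-in-pair bits,
    so secondary alignments of the same read are counted only once.
    """
    seen = set()
    mapped = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header records
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        key = (fields[0], flag & 0xC0)  # 0x40 = first mate, 0x80 = second
        seen.add(key)
        if not (flag & 0x4):  # 0x4 set means the read is unmapped
            mapped.add(key)
    return len(mapped) / len(seen) if seen else 0.0


if __name__ == "__main__":
    # The figures from the message: 3,000,000 mapped out of 30,000,000.
    print(3_000_000 / 30_000_000)
```

With a real file this could be called as `mapping_rate(open("hits.sam"))`, where `hits.sam` is a placeholder name.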

The other issue I've run into is that in Cuffdiff only about 4,800 genes
appear in the output files as being tested for differential expression.
There are approximately 100,000 genes in the reference transcriptome, so I
was thinking that there should be more than ca. 4,800 that are tested for
differential expression.  Should each gene be tested?  Does Cuffdiff just
not report some of the genes that are not differentially expressed, or is
the program not testing all of the genes?

If anyone can provide some help, guidance, or a suggestion, I'd greatly
appreciate it.  Thanks, and take care.

Jim

Re: [galaxy-user] Tophat mapping error

2012-04-27 Thread Jennifer Jackson
Hi Jiwen,

Please submit this error using the green bug icon associated with the dataset 
and we can check to see if this is related to the other issues discussed 
earlier today.

Thank you,

Jen
Galaxy Team

On Apr 27, 2012, at 3:18 PM, 杨继文 wrote:

> Hi all,
> I got the following error information during Tophat mapping:
> An error occurred running this job: Job output not returned by PBS: the 
> output datasets were deleted while the job was running, the job was manually 
> dequeued or there was a cluster error.
>  
>  
> Please let me know what's wrong.
> Help will be appreciated.
> Thanks
> Jiwen

[galaxy-user] Tophat mapping error

2012-04-27 Thread 杨继文
Hi all,
I got the following error information during Tophat mapping:
An error occurred running this job: Job output not returned by PBS: the output 
datasets were deleted while the job was running, the job was manually dequeued 
or there was a cluster error.
 
 
Please let me know what's wrong.
Help will be appreciated.
Thanks
Jiwen

Re: [galaxy-user] Tophat mapping

2012-04-18 Thread Jeremy Goecks
> Jeremy, do you have a workflow to estimate what percent of the reads
> are mapping to unknown expressed regions?


Here's a simple approach assuming mapped reads are in BAM format:

BAM --> SAM

SAM --> Interval

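Intersect the reads (as intervals) with the known annotation, keeping only the
reads that have no overlap with it. The share of reads that remain is the
percentage mapping to unknown expressed regions.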
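
Outside Galaxy, the same idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Galaxy tools themselves: reads and annotation features are assumed to be already available as (chrom, start, end) tuples with 0-based, half-open coordinates, and the function name is made up for this example.

```python
from bisect import bisect_left
from collections import defaultdict


def fraction_outside_annotation(reads, features):
    """Fraction of read intervals that overlap no annotated feature.

    reads, features: iterables of (chrom, start, end), 0-based, half-open.
    """
    # Merge the features on each chromosome into sorted, disjoint intervals.
    by_chrom = defaultdict(list)
    for chrom, start, end in features:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])

    total = outside = 0
    for chrom, start, end in reads:
        total += 1
        starts, ends = merged.get(chrom, ([], []))
        # Last merged feature that begins before the read ends.
        i = bisect_left(starts, end) - 1
        if i < 0 or ends[i] <= start:
            outside += 1  # no feature reaches into the read
    return outside / total if total else 0.0
```

Because the features are merged first, each read needs only one binary search, which keeps the check cheap even for millions of reads.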

Best,
J.


Re: [galaxy-user] Tophat mapping

2012-04-18 Thread Carlos Borroto
On Wed, Apr 18, 2012 at 8:37 AM, Jeremy Goecks wrote:
> I am wondering if these "non-coding reads" will be included when cufflinks
> calculates transcript/gene expression.
>
>
> Reads will only be included if they map to assembled/known transcripts.

Well, it depends on what transcript annotation file you pass to cuffdiff.
If you run cufflinks with -G/--GTF, the manual says:

"Tells Cufflinks to use the supplied reference annotation (a GFF file)
to estimate isoform expression. It will not assemble novel
transcripts, and the program will ignore alignments not structurally
compatible with any reference transcript."[1]

In Galaxy terms, this is the option "Use Reference Annotation:" with "Use
reference annotation" selected. The two other choices, "No" or
"Use reference annotation as guide", allow cufflinks to assemble
unknown transcripts. If you later use cuffmerge to produce the
transcript annotation from your cufflinks runs and use it for
cuffdiff, the "non-coding reads" will almost certainly pollute your
transcript expression estimates.

[1]http://cufflinks.cbcb.umd.edu/manual.html
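
For readers working on the command line, the three Galaxy choices can be sketched as cufflinks argument lists. The mapping of "Use reference annotation" to -G comes from the manual text quoted above; mapping "Use reference annotation as guide" to -g/--GTF-guide is an assumption about the Galaxy wrapper, and the file names and function name are hypothetical. Nothing is executed here.

```python
def cufflinks_args(bam, mode="No", annotation=None, out_dir="cufflinks_out"):
    """Build a cufflinks command line for one of the three Galaxy choices.

    mode: "No", "Use reference annotation", or
          "Use reference annotation as guide".
    """
    args = ["cufflinks", "-o", out_dir]
    if mode == "No":
        pass  # de novo assembly; no annotation is passed
    elif mode in ("Use reference annotation",
                  "Use reference annotation as guide"):
        if annotation is None:
            raise ValueError("this mode needs an annotation file")
        # -G: quantify known transcripts only, no novel assembly.
        # -g: use the annotation as a guide and still assemble novel ones.
        flag = "-G" if mode == "Use reference annotation" else "-g"
        args += [flag, annotation]
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    args.append(bam)
    return args


if __name__ == "__main__":
    print(" ".join(cufflinks_args("accepted_hits.bam",
                                  "Use reference annotation",
                                  "genes.gtf")))
```

Only the -G form restricts expression estimates to known transcripts, which is the case where reads outside the annotation are ignored.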

Jeremy, do you have a workflow to estimate what percent of the reads
are mapping to unknown expressed regions? I would like to be able to
produce this estimate before I make a decision on which transcript
annotation I should pass to cuffdiff. I would expect a small percent
of reads to map outside of known expressed regions, but if this number
is too big, then I would like to check for potential problems with my
library.

Regards,
Carlos


Re: [galaxy-user] Tophat mapping

2012-04-18 Thread Jeremy Goecks
> I am wondering if these "non-coding reads" will be included when cufflinks 
> calculates transcript/gene expression. 

Reads will only be included if they map to assembled/known transcripts.

> And another question: how can I find out the number of reads mapped to a
> certain exon?

This isn't possible because a single read may map to multiple exons and/or 
transcripts. Cufflinks assigns reads probabilistically when their mapping 
cannot be uniquely determined.

See

http://cufflinks.cbcb.umd.edu/faq.html#count
http://cufflinks.cbcb.umd.edu/howitworks.html

for details.
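
To make "assigns reads probabilistically" concrete, here is a toy expectation-maximization sketch. It is not Cufflinks' actual algorithm: Cufflinks' model also accounts for transcript length and the fragment length distribution, both of which this example ignores. The input format (one set of compatible transcript ids per read) and the function name are invented for the illustration.

```python
def em_read_assignment(read_compat, n_iter=50):
    """Split ambiguous reads among transcripts in proportion to abundance.

    read_compat: list of sets; each set holds the transcript ids one read
    is compatible with. Returns expected read counts per transcript.
    """
    reads = [compat for compat in read_compat if compat]
    transcripts = sorted(set().union(*reads)) if reads else []
    # Start from equal relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict.fromkeys(transcripts, 0.0)
    for _ in range(n_iter):
        # E-step: give each read out fractionally, weighted by abundance.
        counts = dict.fromkeys(transcripts, 0.0)
        for compat in reads:
            norm = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / norm
        # M-step: abundances become the normalized expected counts.
        total = sum(counts.values())
        theta = {t: c / total for t, c in counts.items()}
    return counts


if __name__ == "__main__":
    # 8 reads unique to A, 2 unique to B, 10 compatible with both.
    example = [{"A"}] * 8 + [{"B"}] * 2 + [{"A", "B"}] * 10
    print(em_read_assignment(example))
```

In the example, the ten ambiguous reads end up split roughly 8:2 between A and B, following the ratio of the uniquely mapped reads. This also shows why a per-exon count is an estimate rather than a simple tally.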

Best,
J.

[galaxy-user] Tophat mapping

2012-04-15 Thread 杨继文
 Hi,

After mapping RNA-Seq paired-end reads with Tophat, I can see that most of
the reads fall into the right regions. However, I can still see lots of reads
mapped to non-coding regions (the locations these reads map to don't
contain exons).

I am wondering if these "non-coding reads" will be included when cufflinks 
calculates transcript/gene expression.
Dying to know your opinion.

And another question: how can I find out the number of reads mapped to a
certain exon?

Thanks