Re: [galaxy-user] identify different number of differential expressed genes using ensemble or reseq GTF

2013-01-14 Thread Wei Liao
Hi, Jennifer,

Thanks for your reply!
My raw RNA-seq data were mapped to hg19 without a reference GTF in our
local instance. To troubleshoot, I tried the following:
(1) Map the data again with TopHat against hg19 using the iGenome
ensembl.GTF, then use Cuffdiff to find differentially expressed genes.
There are still 250 significant genes.
(2) Map the data again with TopHat against hg19 without a reference GTF,
then use Cufflinks with Homo_sapiens.GRCh37.69.gtf downloaded from
ensembl.org. Same result: 250 significant genes.
(3) Map the data again with TopHat against hg19 without a reference GTF,
then use Cufflinks with the RefSeq refFlat.GTF. The result is ~1000
significant genes.
(4) Map the data again with TopHat against hg19 without a reference GTF,
then use Cufflinks with the iGenome refseq.GTF. The result is ~1000
significant genes.
However, I still need to confirm which release or version of the hg19
reference genome I am using. Do you think the different results are caused
by mapping to a different hg19 genome? If so, how can I find the GTF that
correctly matches a given hg19 build? I thought the choice of Ensembl or
RefSeq would not affect the results of the Cuffdiff step. These reference
GTF files (refFlat.GTF, iGenome refseq.GTF, or iGenome ensembl.GTF) should
all represent complete transcripts.
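
One inexpensive check on whether a GTF matches the genome used for mapping is to compare sequence names. UCSC-style hg19 sequences are named like chr1, while Ensembl GRCh37 files use names like 1; whether that applies to the files in this thread is not known. The sketch below is only an illustration in Python: the function names and the toy records are assumptions, not part of TopHat, Cufflinks, or Galaxy.

```python
def fasta_seqnames(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_seqnames(lines):
    """Collect the first column (sequence name) of every GTF record."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(genome_names, annotation_names):
    """Report which sequence names are shared and which appear on one side only."""
    return {
        "shared": sorted(genome_names & annotation_names),
        "only_in_genome": sorted(genome_names - annotation_names),
        "only_in_gtf": sorted(annotation_names - genome_names),
    }


# Toy data (hypothetical): a chr-prefixed genome and a GTF without the prefix.
GENOME_FASTA = [">chr1 demo sequence\n", "ACGT\n", ">chr2\n", "GGCC\n", ">chrM\n", "TTAA\n"]
GTF_LINES = [
    '1\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '2\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
    'MT\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "G3"; transcript_id "T3";\n',
]

report = compare_seqnames(fasta_seqnames(GENOME_FASTA), gtf_seqnames(GTF_LINES))
print(report)
```

An empty "shared" list, as in this toy case, would mean that no annotation record can be placed on the mapped genome. For real files, the same functions can be given open file handles instead of lists.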
Wei



-- 
Wei Liao
Research Scientist,
Brentwood Biomedical Research Institute
16111 Plummer St.
Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645
___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-user] identify different number of differential expressed genes using ensemble or reseq GTF

2013-01-08 Thread Loraine, Ann
Hi,

Another approach you can try is to use DESeq or edgeR from Bioconductor to
assess differential expression.

I personally like these two methods LOTS better than Cuff* mainly because
they are a lot closer to tried and true statistical methods developed for
microarrays. I esp. like how both methods let you test different factors.
For example, if you are testing a treatment (drug or no drug) and a
genotype (mutant vs. wildtype) you can find out which genes' expression
depends on having a wild-type copy of the gene by testing an "interaction
term."
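
As a rough illustration of what an interaction term measures in the drug/genotype example above, the Python sketch below computes a "difference of differences" on the log2 scale for one gene. It is not what edgeR or DESeq do internally (both fit statistical models with dispersion estimates and report significance); the function name, the condition labels, and the expression values are made up for the example.

```python
import math


def interaction_log2(mean_expr):
    """Difference between the drug effect in wild type and the drug effect in mutant.

    mean_expr maps (genotype, treatment) to a positive mean expression value.
    A result near 0 means the drug changes expression equally in both genotypes.
    """
    log2_expr = {key: math.log2(value) for key, value in mean_expr.items()}
    drug_effect_wt = log2_expr[("wildtype", "drug")] - log2_expr[("wildtype", "no_drug")]
    drug_effect_mut = log2_expr[("mutant", "drug")] - log2_expr[("mutant", "no_drug")]
    return drug_effect_wt - drug_effect_mut


# Hypothetical gene: the drug raises expression 4-fold in wild type only.
example = {
    ("wildtype", "no_drug"): 64.0,
    ("wildtype", "drug"): 256.0,
    ("mutant", "no_drug"): 64.0,
    ("mutant", "drug"): 64.0,
}
interaction = interaction_log2(example)
print(interaction)
```

A large value here is the pattern an interaction test looks for: the response to the drug depends on the genotype.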

Both methods start with simple counts - numbers of reads overlapping
annotated genes.
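
To make "numbers of reads overlapping annotated genes" concrete, here is a minimal Python sketch that counts reads per gene by interval overlap. It is a toy under stated assumptions: coordinates are 1-based and inclusive, genes are single intervals rather than exon sets, strand is ignored, and a read overlapping two genes is counted for both. Real counting tools make more careful choices, and all names and coordinates below are hypothetical.

```python
def count_reads_per_gene(genes, reads):
    """Count reads whose interval overlaps each gene interval.

    genes: dict mapping gene_id -> (chrom, start, end)
    reads: iterable of (chrom, start, end)
    """
    counts = {gene_id: 0 for gene_id in genes}
    for read_chrom, read_start, read_end in reads:
        for gene_id, (gene_chrom, gene_start, gene_end) in genes.items():
            same_chrom = read_chrom == gene_chrom
            overlaps = read_start <= gene_end and read_end >= gene_start
            if same_chrom and overlaps:
                counts[gene_id] += 1
    return counts


GENES = {
    "G1": ("chr1", 100, 200),
    "G2": ("chr1", 500, 600),
}
READS = [
    ("chr1", 150, 199),  # inside G1
    ("chr1", 190, 240),  # partly overlaps G1
    ("chr1", 550, 599),  # inside G2
    ("chr2", 150, 199),  # other chromosome, not counted
    ("chr1", 300, 349),  # between genes, not counted
]

gene_counts = count_reads_per_gene(GENES, READS)
print(gene_counts)
```

A table of such counts, one column per sample, is the kind of input that count-based methods start from.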

Probably there is a Galaxy workflow that can calculate counts of reads per
gene, but I don't know if Galaxy currently incorporates R/Bioconductor
tools. If you can get Galaxy to calculate reads per gene, you can then
download the file and run it through edgeR or DESeq.

R is free but it does take some time to master it. But it is incredibly
powerful and well worth the effort!

To get started with R, I recommend doing the free-of-charge O'Reilly Press
"try R" tutorial which is on-line here: http://tryr.codeschool.com/

I hope this will be helpful!

Best wishes,

Ann Loraine

---
Ann Loraine, Ph.D.
Associate Professor
Department of Bioinformatics and Genomics
University of North Carolina at Charlotte
North Carolina Research Campus
600 Laureate Way
Kannapolis, NC 28081
704-250-5750
alora...@uncc.edu
http://www.transvar.org
http://www.bioviz.org
http://www.uncc.edu






Re: [galaxy-user] identify different number of differential expressed genes using ensemble or reseq GTF

2013-01-07 Thread Jennifer Hillman-Jackson

Hello Wei,

The contents of the reference GTF files (original, before analysis) will 
probably provide some explanation. My guess is that GTF files have 
different contents and are not directly comparable - RefSeq with full 
transcripts and Ensembl with full transcripts + potentially partial 
predictions and/or predicted splice sites. Alternative versions of each 
may be available. When possible, you most likely will want to be using a 
reference GTF file that represents complete transcripts.


I don't know what genome you are using, but you can check the source 
notes at Ensembl (& NCBI) to find out what each annotation build 
contains. A raw count on the number of entries in the GTF files can also 
be a clue - if greatly different, then you very likely have different 
populations in the two files.
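
The raw count mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming standard 9-column GTF records whose last column holds gene_id and transcript_id attributes; the function name and the example records are made up, and attribute values that themselves contain a semicolon are not handled.

```python
from collections import Counter


def summarize_gtf(lines):
    """Count records per feature type and distinct gene/transcript IDs in a GTF."""
    features = Counter()
    genes, transcripts = set(), set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip malformed rows
        features[cols[2]] += 1
        # The attribute column looks like: gene_id "X"; transcript_id "Y";
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition(" ")
            value = value.strip().strip('"')
            if key == "gene_id":
                genes.add(value)
            elif key == "transcript_id":
                transcripts.add(value)
    return {
        "features": dict(features),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }


# Hypothetical records standing in for a real annotation file.
EXAMPLE_GTF = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr2\tdemo\tCDS\t50\t90\t.\t-\t0\tgene_id "G2"; transcript_id "T3";\n',
]

summary = summarize_gtf(EXAMPLE_GTF)
print(summary)
```

Running the same function on each reference GTF (passing an open file handle instead of the list) gives gene and transcript totals that can be compared side by side.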


Good luck with your project!

Jen
Galaxy team

On 1/7/13 1:47 PM, Wei Liao wrote:

Hi all,

I am analyzing significantly differentially expressed genes for a
normal vs. tumor pair, using Cuffdiff 2.0.2.
I noticed that the Ensembl GTF and the RefSeq GTF give very different
numbers of significantly differentially expressed genes.

With the Ensembl GTF, only 250 genes are differentially expressed.
With the RefSeq GTF, there are about 1000 genes.

I am running these data on the Galaxy server with the same workflow.

Can anyone explain what is going on here? And which result should I trust?

Thanks.


--
Wei Liao
Research Scientist,
Brentwood Biomedical Research Institute
16111 Plummer St.
Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645





--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org