Re: [galaxy-user] Exceptionally high RPKM values of miRNA and other short genes in Cuffdiff's output

2013-07-18 Thread Mohammad Heydarian
Hi Thanh,
This is due to Cuffdiff correcting for the size of short transcripts, which
the authors call the effective length correction. It is meant to compensate
for the loss of shorter transcripts during size selection when the RNA-seq
library is prepared. The default setting on Galaxy is to use the effective
length correction.
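
To see why this correction can inflate the values for very short genes, here
is a minimal Python sketch. It is a toy model, not Cuffdiff's actual code: the
fragment length distribution, the read counts, and the function names are all
assumptions made up for illustration. The effective length is taken as the
expected number of positions at which a fragment can start on the transcript,
and RPKM is computed once with the annotated length and once with the
effective length.

```python
def effective_length(transcript_len, frag_len_probs):
    """Toy effective length: sum over fragment lengths l that fit on the
    transcript of P(l) * (number of possible start positions)."""
    return sum(
        prob * (transcript_len - frag_len + 1)
        for frag_len, prob in frag_len_probs.items()
        if frag_len <= transcript_len
    )


def rpkm(read_count, length, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length * total_mapped_reads)


# Hypothetical fragment length distribution left after size selection.
FRAG_LEN_PROBS = {50: 0.25, 100: 0.25, 150: 0.25, 200: 0.25}

# Hypothetical numbers: 100 reads on each transcript, 10 million mapped reads.
for name, length in [("short (75 bp)", 75), ("long (2000 bp)", 2000)]:
    eff = effective_length(length, FRAG_LEN_PROBS)
    plain = rpkm(100, length, 10_000_000)
    corrected = rpkm(100, eff, 10_000_000)
    print(name, eff, round(plain, 2), round(corrected, 2))
```

With these assumed numbers, the 75 bp transcript has an effective length of
6.5 bp, so its RPKM comes out about 11.5 times higher than with the annotated
length, while the value for the 2000 bp transcript changes by only about 7%.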

Cole Trapnell, the creator of the Cuff-suite tools, discusses this length
correction here:
http://seqanswers.com/forums/showpost.php?p=76430&postcount=32

Some library preparation protocols don't include a size selection step. The
one we favor, which Illumina also recommends, is ScriptSeq v2 from Epicentre
(owned by Illumina), and it has no size selection step. It would be great if
there were an option in the Cuffdiff wrapper in Galaxy to turn off the
effective length correction.



Cheers,
Mo Heydarian

PhD candidate
The Johns Hopkins School of Medicine
Department of Biological Chemistry
725 Wolfe Street
402 Biophysics
Baltimore, MD 21205


On Thu, Jul 18, 2013 at 12:55 PM, Hoang, Thanh hoan...@miamioh.edu wrote:

 Hi all,
 I have been analyzing my RNA-seq data from mouse tissues. The reads are
 single-end and 51 bp long. I ran TopHat/Cufflinks/Cuffdiff to test for
 differential gene expression.
 In Cuffdiff's output, I got very high RPKM values for some miRNAs and
 some other short genes (less than 100 bp). These genes are among the genes
 with the highest RPKM. I think the RPKM values of these genes are probably
 too high to be true.
 test_id         gene_id         gene     locus                   sample_1    sample_2  status  value_1   value_2  log2(fold_change)  test_stat  p_value  q_value   significant
 ENSMUSG0093077  ENSMUSG0093077  Mir5105  5:146231229-146302874   Epithelium  Fiber     OK      1.53E+06  445558   -1.78097           -355.367   0.00715  0.016986  yes
 ENSMUSG0093098  ENSMUSG0093098  Gm22641  7:130162450-133124354   Epithelium  Fiber     OK      87894.1   36474.7  -1.26887           -0.59863   0.4913   0.587174  no
 ENSMUSG0089855  ENSMUSG0089855  Gm15662  10:105187662-105583874  Epithelium  Fiber     OK      42868.9   21566.5  -0.99114           -20.7066   0.0186   0.039568  yes
 ENSMUSG0092984  ENSMUSG0092984  Mir5115  2:73012853-73012927     Epithelium  Fiber     OK      21104.8   8317.49  -1.34335           -447.314   0.0001   0.000354  yes
 ENSMUSG0086324  ENSMUSG0086324  Gm15564  16:35926510-36037131    Epithelium  Fiber     OK      6443.35   3664.15  -0.81433           -1.52095   0.2129   0.301429  no
 ENSMUSG0092981  ENSMUSG0092981  Mir5125  17:23803186-23824739    Epithelium  Fiber     OK      5974.14   2390.75  -1.32127           -0.34111   0.5746   0.661937  no

 I checked some forums, and they said this is a drawback of
 TopHat/Cufflinks/Cuffdiff when dealing with short genes, but I am still not
 clear about it. Has anyone had the same problem? What can I do in this
 situation?
 Can anyone suggest other good tools to test for (1) differential gene
 expression OR (2) both differential gene expression and gene discovery?

 Thank you
 Thanh

 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

   http://lists.bx.psu.edu/

 To search Galaxy mailing lists use the unified search at:

   http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] Exceptionally high RPKM values of miRNA and other short genes in Cuffdiff's output

2013-07-18 Thread Ross
Hi, Thanh,
If your primary goal is inference about differential 'gene' expression that
takes biological variability into account, with biological replicates for
each of two conditions, you might want to try (and compare!) edgeR, and
optionally DESeq and VOOM/limma. See, e.g., Dillies et al.,
http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long and
http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAbstracts.2FPosters.P4:_Comparing_R-based_methods_and_Cuffdiff2_for_analysis_of_RNA-seq_data_in_Galaxy
A set of *very much beta* tools is available for admin installation and user
testing from the test toolshed, in the statistics section, owned by fubar.

The edgeR tool can optionally run a two-way GLM. It requires raw count
matrices as input, which can be generated from a GTF/'gene' model of your
choice and any number of mapped SAM/BAM inputs using the htseq-based
companion tool in the same tool shed section. Please don't install it on a
production machine yet, but we're getting good results from it. Feedback and
code improvements are welcome from willing beta testers.
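
As a rough illustration of what "raw count matrix" means here, the Python
sketch below merges per-sample htseq-count style outputs (tab-separated
feature ID and count, one feature per line) into one genes-by-samples table.
This is not the code of the Galaxy companion tool; the function names, sample
names, and gene IDs are hypothetical. The summary counters that htseq-count
appends (no_feature, ambiguous, and so on, prefixed with "__" in newer
versions) are skipped because they are not genes.

```python
# Summary counters written by htseq-count at the end of its output
# (older versions write them without the "__" prefix).
SPECIAL_COUNTERS = {
    "no_feature", "ambiguous", "too_low_aQual",
    "not_aligned", "alignment_not_unique",
}


def parse_htseq_counts(text):
    """Parse 'feature<TAB>count' lines into a dict, skipping summary rows."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        feature, value = line.split("\t")
        if feature.startswith("__") or feature in SPECIAL_COUNTERS:
            continue
        counts[feature] = int(value)
    return counts


def build_count_matrix(samples):
    """samples maps sample name -> htseq-count output text.

    Returns (header, rows); a gene missing from a sample gets a count of 0.
    """
    parsed = {name: parse_htseq_counts(text) for name, text in samples.items()}
    names = list(parsed)
    genes = sorted(set().union(*parsed.values()))
    rows = [[gene] + [parsed[n].get(gene, 0) for n in names] for gene in genes]
    return ["gene_id"] + names, rows


# Hypothetical toy inputs for two samples.
epithelium = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
fiber = "geneA\t3\ngeneB\t7\n__ambiguous\t1\n"
header, rows = build_count_matrix({"epithelium_1": epithelium, "fiber_1": fiber})
print("\t".join(header))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```

The counts stay as untransformed integers, with no length or depth
normalisation applied, which is the kind of input that count-based tools
such as edgeR expect.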

The R 3.0.x tool shed dependency package in particular is still under
development and is likely to change substantially in the next week or two
as we sort out a sane and generalised ATLAS dependency installation.
