Re: [galaxy-user] RNA-seq analysis without reference genome

2012-10-08 Thread Jennifer Jackson

Hello Alicia,

There are two tools in the main Galaxy Tool Shed 
(http://toolshed.g2.bx.psu.edu/) that would likely be helpful. Read the 
input requirements for each to decide which is a better fit. Search for 
'deseq' to find them:


  deseq_and_sam2counts
  deseq_hts
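
For a de novo assembly, one common route to the count values DESeq needs is to map the reads back to the assembled contigs and count the reads per contig. The following is a minimal sketch of that counting step only, not the implementation of either Tool Shed tool. The function names are hypothetical, and it assumes a tab-separated SAM file in which column 3 is the contig name and the flag bits 0x4, 0x100 and 0x800 mark unmapped, secondary and supplementary records.

```python
from collections import Counter


def count_reads_per_contig(sam_lines):
    """Count primary mapped alignment records per contig name.

    Hypothetical helper: takes any iterable of SAM text lines.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        contig = fields[2]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800 or contig == "*":
            continue
        counts[contig] += 1
    return counts


def write_count_table(counts, out):
    """Write one 'contig<TAB>count' line per contig, sorted by name."""
    for contig in sorted(counts):
        out.write(f"{contig}\t{counts[contig]}\n")
```

Each alignment record is counted once, so with paired-end data both mates contribute to the total. One table per sample, joined on the contig name, gives the kind of count matrix DESeq works from.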

To use these tools, a local or cloud instance will be needed. Help to 
get started is in these wikis:

http://usegalaxy.org/cloud
http://getgalaxy.org
http://wiki.g2.bx.psu.edu/Tool%20Shed

The best list to use for local/cloud support is the galaxy-...@bx.psu.edu.
http://wiki.g2.bx.psu.edu/Support#Mailing_Lists

Also be sure to subscribe to the galaxy-announce mailing list and follow 
the Distribution News Briefs to stay updated:

http://wiki.g2.bx.psu.edu/News

Best,

Jen
Galaxy team


On 10/5/12 8:40 AM, Alicia R. Pérez-Porro wrote:

Hi all,

I'm about to start a gene expression analysis, but I don't have a
reference genome.
I got my sequences from Illumina and did my assemblies with CLC bio.
I was planning to use DESeq for the analysis, but for that I need my gene
expression count values.
Is there any tool in Galaxy that I can use for that?

Thanks,
Alicia.


––

Alicia R. Pérez-Porro
PhD student

Giribet lab
Department of Organismic and Evolutionary Biology
MCZ labs
Harvard University
26 Oxford St, Cambridge MA 02138
phone: +1 617-496-5308
fax: +1 617-495-5667
www.oeb.harvard.edu/faculty/giribet/




___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/


--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] RNA seq analysis

2011-05-07 Thread puvan001


Hi

Thank you! Yes, your guess is correct. Now it works.



Sumathy






On May 7 2011, Jeremy Goecks wrote:


Sumathy,

It sounds like you're on the right track. To visualize data for a custom 
build in Trackster, you need to create a custom build and use that in 
Trackster:


(1) using the top tabs in Galaxy, go to User --> Custom Builds;
(2) add a new build with the length info as follows:
 

Important note: you'll need to make sure that your contig name matches the 
one used in your fasta file. This is my best guess about what's causing 
problems for you.


(3) Create a Trackster visualization using the custom build and add your 
dataset.


Let us know if you have more questions/problems.

Thanks,
J.

On May 6, 2011, at 10:43 PM,  wrote:




Hi

I may be doing this the wrong way. I clicked Trackster and added the
custom build genome. Since it is a very small genome (~2kb), I treated
it as a single contig. Then I clicked "add tracks" and added my data file,
but I got the message "no data for this contig". Whenever I used built-in
genomes I did not have any problem. I guess I am doing something wrong
here.



Sumathy










On May 6 2011, Jeremy Goecks wrote:


Sumathy,

What kind of problems are you having with Trackster?

J.

On May 6, 2011, at 8:30 PM,  wrote:


Hello
   I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. 
Can you help me?

Thanks
Sumathy





--
Sumathy Puvanendiran
Graduate student







--
Sumathy Puvanendiran
Graduate student




Re: [galaxy-user] RNA seq analysis

2011-05-07 Thread Jeremy Goecks
Sumathy,

It sounds like you're on the right track. To visualize data for a custom build 
in Trackster, you need to create a custom build and use that in Trackster:

(1) using the top tabs in Galaxy, go to User --> Custom Builds;
(2) add a new build with the length info as follows:
 

Important note: you'll need to make sure that your contig name matches the one 
used in your fasta file. This is my best guess about what's causing problems 
for you.

(3) Create a Trackster visualization using the custom build and add your 
dataset.
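
The length info and the contig-name check from the steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the contig name is taken to be the first whitespace-delimited token of each FASTA header, and the "name<TAB>length" line layout printed in the usage note is an assumption, since the example that followed step (2) did not survive in this archive.

```python
def fasta_contig_lengths(fasta_lines):
    """Return [(contig_name, length), ...] in file order.

    Hypothetical helper: takes any iterable of FASTA text lines.
    """
    lengths = []
    name, size = None, 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                lengths.append((name, size))
            # Assumption: the name is the first token after '>'
            header = line[1:].split()
            name = header[0] if header else ""
            size = 0
        elif name is not None:
            size += len(line)
    if name is not None:
        lengths.append((name, size))
    return lengths


def names_missing_from_build(dataset_contigs, build_lengths):
    """List contig names used by a dataset that the build does not define."""
    known = {name for name, _ in build_lengths}
    return sorted(set(dataset_contigs) - known)
```

Printing `f"{name}\t{length}"` for each pair gives one line per contig to compare against what was entered for the custom build. A non-empty result from `names_missing_from_build` points at the kind of name mismatch described in the important note.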

Let us know if you have more questions/problems.

Thanks,
J.

On May 6, 2011, at 10:43 PM,  wrote:

> 
> 
> Hi
> 
> I may be doing this the wrong way. I clicked Trackster and added the custom 
> build genome. Since it is a very small genome (~2kb), I treated it as a 
> single contig. Then I clicked "add tracks" and added my data file, but I got the 
> message "no data for this contig". Whenever I used built-in genomes I did not 
> have any problem. I guess I am doing something wrong here.
> 
> 
> Sumathy
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On May 6 2011, Jeremy Goecks wrote:
> 
>> Sumathy,
>> 
>> What kind of problems are you having with Trackster?
>> 
>> J.
>> 
>> On May 6, 2011, at 8:30 PM,  wrote:
>> 
>>> Hello
>>> I was able to run RNA seq data against a custom build genome. How can I 
>>> visualize the results. I tried via trackster and unfortunately I couldn't. 
>>> Can you help me?
>>> Thanks
>>> Sumathy
>> 
>> 
> 
> -- 
> Sumathy Puvanendiran
> Graduate student
> 
> 




Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread vasu punj
Thanks Jim,
 
Vasu
 
--- On Fri, 5/6/11, Jim Robinson  wrote:


From: Jim Robinson 
Subject: Re: [galaxy-user] RNA seq analysis
To: "vasu punj" 
Cc: "Austin Paul" , "Sean Davis" , 
"galaxy-user@lists.bx.psu.edu" , 
"puvan...@umn.edu" 
Date: Friday, May 6, 2011, 9:01 PM


Hi Vasu,

I'm going to add the function to index BAM files soon, using Picard. In the 
beginning there was no Java BAM reader, only SAM, and I added the index 
then. Indexed BAMs came along later, but that's probably more than you want to 
know... I think most people will still use Galaxy to index, as it can take a 
long time, but I agree with you on the convenience factor.

Jim


On May 6, 2011, at 9:36 PM, vasu punj wrote:

> One problem is that IGV doesn't have an option to create an index file, so one 
> has to create the index file in Galaxy first to view the data in IGV. Jim, I have 
> been using the IGV 2 beta version and it is great work, but how hard would it be 
> to include index functionality within IGV? I know we can also use SAMtools, 
> but it would be a convenience if it is not that much work.
> Vasu
> 
> --- On Fri, 5/6/11, Sean Davis  wrote:
> 
> From: Sean Davis 
> Subject: Re: [galaxy-user] RNA seq analysis
> To: "Austin Paul" 
> Cc: "galaxy-user@lists.bx.psu.edu" , 
> "puvan...@umn.edu" 
> Date: Friday, May 6, 2011, 8:02 PM
> 
> IGV reads BAM files just fine; no need to convert to SAM.
> Sean
> 
> On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:
> There are many ways.  I typically use IGV.  It needs a sam file, so I first 
> convert the bam to sam in galaxy, then download the sam file.  In IGV, I 
> upload the reference and the sam file, then use IGVtools to index the sam 
> file, then I can visualize the data.
> 
> Austin
> On Fri, May 6, 2011 at 5:30 PM,  wrote:
> Hello
> 
> I was able to run RNA seq data against a custom build genome. How can I 
> visualize the results. I tried via trackster and unfortunately I couldn't. 
> Can you help me?
> 
> 
> Thanks
> 
> Sumathy
> 

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread puvan001


Hi

Thanks! I am a little bit familiar with IGV. I'll try it then. 




Sumathy



On May 6 2011, vasu punj wrote:

One problem is that IGV doesn't have an option to create an index file, so 
one has to create the index file in Galaxy first to view the data in IGV. Jim, I 
have been using the IGV 2 beta version and it is great work, but how hard would 
it be to include index functionality within IGV? I know we can also use 
SAMtools, but it would be a convenience if it is not that much work.

Vasu

--- On Fri, 5/6/11, Sean Davis  wrote:


From: Sean Davis 
Subject: Re: [galaxy-user] RNA seq analysis
To: "Austin Paul" 
Cc: "galaxy-user@lists.bx.psu.edu" , 
"puvan...@umn.edu" 

Date: Friday, May 6, 2011, 8:02 PM


IGV reads BAM files just fine; no need to convert to SAM.

Sean


On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:


There are many ways.  I typically use IGV.  It needs a sam file, so I 
first convert the bam to sam in galaxy, then download the sam file.  In 
IGV, I upload the reference and the sam file, then use IGVtools to index 
the sam file, then I can visualize the data.

 
Austin

On Fri, May 6, 2011 at 5:30 PM,  wrote:

Hello

I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. 
Can you help me?



Thanks

Sumathy 






--
Sumathy Puvanendiran
Graduate student




Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread puvan001



Hi

I may be doing this the wrong way. I clicked Trackster and added the custom 
build genome. Since it is a very small genome (~2kb), I treated it as 
a single contig. Then I clicked "add tracks" and added my data file, but I 
got the message "no data for this contig". Whenever I used built-in genomes I 
did not have any problem. I guess I am doing something wrong here.



Sumathy










On May 6 2011, Jeremy Goecks wrote:


Sumathy,

What kind of problems are you having with Trackster?

J.

On May 6, 2011, at 8:30 PM,  wrote:


Hello

  I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. 
Can you help me?



Thanks

Sumathy





--
Sumathy Puvanendiran
Graduate student




Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Jim Robinson

Hi Vasu,

I'm going to add the function to index BAM files soon, using Picard.
In the beginning there was no Java BAM reader, only SAM, and I
added the index then. Indexed BAMs came along later, but that's
probably more than you want to know... I think most people will
still use Galaxy to index, as it can take a long time, but I agree with
you on the convenience factor.


Jim


On May 6, 2011, at 9:36 PM, vasu punj wrote:

One problem is that IGV doesn't have an option to create an index file, so
one has to create the index file in Galaxy first to view the data in IGV. Jim, I
have been using the IGV 2 beta version and it is great work, but how hard would
it be to include index functionality within IGV? I know we can also use
SAMtools, but it would be a convenience if it is not that much work.

Vasu

--- On Fri, 5/6/11, Sean Davis  wrote:

From: Sean Davis 
Subject: Re: [galaxy-user] RNA seq analysis
To: "Austin Paul" 
Cc: "galaxy-user@lists.bx.psu.edu" , "puvan...@umn.edu" 

Date: Friday, May 6, 2011, 8:02 PM

IGV reads BAM files just fine; no need to convert to SAM.
Sean

On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:
There are many ways.  I typically use IGV.  It needs a sam file, so  
I first convert the bam to sam in galaxy, then download the sam  
file.  In IGV, I upload the reference and the sam file, then use  
IGVtools to index the sam file, then I can visualize the data.


Austin
On Fri, May 6, 2011 at 5:30 PM,  wrote:
Hello

I was able to run RNA seq data against a custom build genome. How  
can I visualize the results. I tried via trackster and unfortunately  
I couldn't. Can you help me?



Thanks

Sumathy



Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Jeremy Goecks
Sumathy,

What kind of problems are you having with Trackster?

J.

On May 6, 2011, at 8:30 PM,  wrote:

> Hello
> 
> I was able to run RNA seq data against a custom build genome. How can I 
> visualize the results. I tried via trackster and unfortunately I couldn't. 
> Can you help me?
> 
> 
> Thanks
> 
> Sumathy


Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread vasu punj
One problem is that IGV doesn't have an option to create an index file, so one has 
to create the index file in Galaxy first to view the data in IGV. Jim, I have been 
using the IGV 2 beta version and it is great work, but how hard would it be to 
include index functionality within IGV? I know we can also use SAMtools, but it 
would be a convenience if it is not that much work.
Vasu

--- On Fri, 5/6/11, Sean Davis  wrote:


From: Sean Davis 
Subject: Re: [galaxy-user] RNA seq analysis
To: "Austin Paul" 
Cc: "galaxy-user@lists.bx.psu.edu" , 
"puvan...@umn.edu" 
Date: Friday, May 6, 2011, 8:02 PM


IGV reads BAM files just fine; no need to convert to SAM.

Sean


On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:


There are many ways.  I typically use IGV.  It needs a sam file, so I first 
convert the bam to sam in galaxy, then download the sam file.  In IGV, I upload 
the reference and the sam file, then use IGVtools to index the sam file, then I 
can visualize the data.
 
Austin

On Fri, May 6, 2011 at 5:30 PM,  wrote:

Hello

I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. Can 
you help me?


Thanks

Sumathy 




Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread vasu punj
I generally take the GTF file to the UCSC Genome Browser. 
If you are visualizing a BAM file after alignment, I found IGV convenient, though 
you may be able to visualize it in Galaxy.
 
Vasu

--- On Fri, 5/6/11, puvan...@umn.edu  wrote:


From: puvan...@umn.edu 
Subject: Re: [galaxy-user] RNA seq analysis
To: "David Matthews" 
Cc: galaxy-user@lists.bx.psu.edu
Date: Friday, May 6, 2011, 7:30 PM


Hello

I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. Can 
you help me?


Thanks

Sumathy

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Austin Paul
Oops.  Good to know.  Thanks.

Austin

On Fri, May 6, 2011 at 6:02 PM, Sean Davis  wrote:

> IGV reads BAM files just fine; no need to convert to SAM.
>
> Sean
>
> On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:
>
>> There are many ways.  I typically use IGV.  It needs a sam file, so I
>> first convert the bam to sam in galaxy, then download the sam file.  In IGV,
>> I upload the reference and the sam file, then use IGVtools to index the sam
>> file, then I can visualize the data.
>>
>> Austin
>> On Fri, May 6, 2011 at 5:30 PM,  wrote:
>>
>>> Hello
>>>
>>> I was able to run RNA seq data against a custom build genome. How can I
>>> visualize the results. I tried via trackster and unfortunately I couldn't.
>>> Can you help me?
>>>
>>>
>>> Thanks
>>>
>>> Sumathy
>>>

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Sean Davis
IGV reads BAM files just fine; no need to convert to SAM.
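
For anyone who prefers to prepare the BAM file outside Galaxy, the sort-and-index step can be sketched as a small Python wrapper around the samtools command line. This is a sketch under assumptions, not something taken from this thread: the function names and the ".sorted.bam" naming are hypothetical, samtools must be installed and on the PATH, and the `samtools sort -o` syntax is that of samtools 1.x (older releases took an output prefix instead).

```python
import shutil
import subprocess


def samtools_index_commands(bam_path, sorted_path=None):
    """Build the argv lists for coordinate-sorting and indexing a BAM file.

    Hypothetical helper; only constructs the commands, runs nothing.
    """
    if sorted_path is None:
        stem = bam_path[:-4] if bam_path.endswith(".bam") else bam_path
        sorted_path = stem + ".sorted.bam"
    return [
        # samtools 1.x syntax: sort by coordinate into sorted_path
        ["samtools", "sort", "-o", sorted_path, bam_path],
        # writes the index next to the sorted BAM (sorted_path + ".bai")
        ["samtools", "index", sorted_path],
    ]


def sort_and_index(bam_path):
    """Run the two commands and return the path of the sorted BAM."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    commands = samtools_index_commands(bam_path)
    for argv in commands:
        subprocess.run(argv, check=True)
    return commands[0][3]
```

The sorted BAM and its .bai index should sit in the same folder when the BAM is loaded in IGV.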

Sean

On Fri, May 6, 2011 at 8:45 PM, Austin Paul  wrote:

> There are many ways.  I typically use IGV.  It needs a sam file, so I first
> convert the bam to sam in galaxy, then download the sam file.  In IGV, I
> upload the reference and the sam file, then use IGVtools to index the sam
> file, then I can visualize the data.
>
> Austin
> On Fri, May 6, 2011 at 5:30 PM,  wrote:
>
>> Hello
>>
>> I was able to run RNA seq data against a custom build genome. How can I
>> visualize the results. I tried via trackster and unfortunately I couldn't.
>> Can you help me?
>>
>>
>> Thanks
>>
>> Sumathy
>>

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Austin Paul
There are many ways.  I typically use IGV.  It needs a sam file, so I first
convert the bam to sam in galaxy, then download the sam file.  In IGV, I
upload the reference and the sam file, then use IGVtools to index the sam
file, then I can visualize the data.

Austin
On Fri, May 6, 2011 at 5:30 PM,  wrote:

> Hello
>
> I was able to run RNA seq data against a custom build genome. How can I
> visualize the results. I tried via trackster and unfortunately I couldn't.
> Can you help me?
>
>
> Thanks
>
> Sumathy
>

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread puvan001

Hello

I was able to run RNA seq data against a custom build genome. How can I 
visualize the results. I tried via trackster and unfortunately I couldn't. 
Can you help me?



Thanks

Sumathy


Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread puvan001


Hi Austin

I did all of these (grooming and trimming) on the RNA-seq data, and I don't 
have a problem with a built-in genome. I'll try again!



Thanks

Sumathy




On May 6 2011, Austin Paul wrote:


Hi,

You need to run fastq groomer on your rna-seq data.  Your reference is fine
as a fasta.

Austin

On Fri, May 6, 2011 at 10:26 AM,  wrote:



Hi David,

Thanks! When I tried to run Tophat, it doesn't recognise my FASTA file and
it says "History does not include a dataset of the required format /
build".

Do you have any thoughts about this?

Now it makes more sense about "multihits". Thanks for sharing your
workflow.

With regards

Sumathy


On May 6 2011, David Matthews wrote:

Hi,


I have done exactly the same kind of thing for adenovirus so I can help with
it. In answer to question 1, you do not need to index the FASTA file; that
will be done for you when TopHat is called. Secondly, you should leave the 40
multihits setting as it is and filter out the multihits after the analysis.
This will allow you to determine whether you have a multihit problem and, if
so, how big it is and where it is on the genome. I have a workflow on Galaxy
which you can use called "Bristol workflow to get sorted unique proper pair
mapped reads". If you plug in your SAM file it should give you files listing
only unique hits and those which map more than once. This workflow assumes you
have paired-end data but it can be modified to work with single-end reads as
well.



Hope this helps.


Best Wishes,
David.

__
Dr David A. Matthews

Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.

Tel. +44 117 3312058
Fax. +44 117 3312091

d.a.matth...@bristol.ac.uk






On 6 May 2011, at 17:09, puvan...@umn.edu wrote:

Hi


I have a couple of questions regarding RNA-seq analysis. My questions are:

1. I need to use a viral genome (very small, ~2 kb) as a reference genome
and it is not available in Galaxy. I guess I can use this data from my
history. I have a FASTA file but I am not sure whether I have to do some
kind of indexing or not.

2. In TopHat, the default for "maximum number of alignments to be allowed"
is 40. My understanding is that a single read can be aligned to a maximum
of 40 different places. I am wondering why this is 40. Is there any specific
reason? If I need unique mapping, I have to use 1 instead of 40. Am I
correct?





Thanks

SP









--
Sumathy Puvanendiran
Graduate student









--
Sumathy Puvanendiran
Graduate student




Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread Austin Paul
Hi,

You need to run fastq groomer on your rna-seq data.  Your reference is fine
as a fasta.
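
For context, here is a minimal sketch of the kind of conversion FASTQ Groomer performs. It assumes the reads use the older Illumina 1.3+ quality encoding (Phred+64) and need to become Sanger (Phred+33); that assumption, the function names and the sample record are illustrative only, and the real tool also validates the file and handles other input variants.

```python
def illumina13_to_sanger(quality_line):
    """Convert one Phred+64 quality string to Phred+33 (Sanger)."""
    converted = []
    for ch in quality_line:
        phred = ord(ch) - 64              # decode the Illumina 1.3+ offset
        if phred < 0:
            raise ValueError("quality character %r is not valid Phred+64" % ch)
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)


def groom_record(record_lines):
    """Take the four lines of one FASTQ record and return them groomed."""
    header, sequence, plus, quality = record_lines
    if not header.startswith("@") or len(sequence) != len(quality):
        raise ValueError("malformed FASTQ record")
    # The third line is reduced to a bare "+", as is usual for Sanger FASTQ.
    return [header, sequence, "+", illumina13_to_sanger(quality)]


# Hypothetical record: four bases, each with Phred quality 40 ("h" in Phred+64).
example = ["@read1", "ACGT", "+read1", "hhhh"]
print(groom_record(example))
```

If the reads are already in Sanger encoding, a conversion like this must not be applied, which is why the Galaxy tool asks for the input quality type.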

Austin

On Fri, May 6, 2011 at 10:26 AM,  wrote:

>
> Hi David,
>
> Thanks! When I tried to run Tophat, it doesn't recognise my FASTA file and
> it says "History does not include a dataset of the required format / build".
> Do you have any thoughts about this?
>
> Now it makes more sense about "multihits". Thanks for sharing your
> workflow.
>
> With regards
>
> Sumathy
>
>
> On May 6 2011, David Matthews wrote:
>
>> Hi,
>>
>> I have done exactly the same kind of thing for adenovirus so I can help
>> with it. In answer to question 1, you do not need to index the FASTA file;
>> that will be done for you when TopHat is called. Secondly, you should leave
>> the 40 multihits setting as it is and filter out the multihits after the
>> analysis. This will allow you to determine whether you have a multihit
>> problem and, if so, how big it is and where it is on the genome. I have a
>> workflow on Galaxy which you can use called "Bristol workflow to get sorted
>> unique proper pair mapped reads". If you plug in your SAM file it should
>> give you files listing only unique hits and those which map more than once.
>> This workflow assumes you have paired-end data but it can be modified to
>> work with single-end reads as well.
>>
>> Hope this helps.
>>
>>
>> Best Wishes,
>> David.
>>
>> __
>> Dr David A. Matthews
>>
>> Senior Lecturer in Virology
>> Room E49
>> Department of Cellular and Molecular Medicine,
>> School of Medical Sciences
>> University Walk,
>> University of Bristol
>> Bristol.
>> BS8 1TD
>> U.K.
>>
>> Tel. +44 117 3312058
>> Fax. +44 117 3312091
>>
>> d.a.matth...@bristol.ac.uk
>>
>>
>>
>>
>>
>>
>> On 6 May 2011, at 17:09, puvan...@umn.edu wrote:
>>
>>> Hi
>>>
>>> I have a couple of questions regarding RNA-seq analysis. My questions are:
>>>
>>> 1. I need to use a viral genome (very small, ~2 kb) as a reference genome
>>> and it is not available in Galaxy. I guess I can use this data from my
>>> history. I have a FASTA file but I am not sure whether I have to do some
>>> kind of indexing or not.
>>>
>>> 2. In TopHat, the default for "maximum number of alignments to be allowed"
>>> is 40. My understanding is that a single read can be aligned to a maximum
>>> of 40 different places. I am wondering why this is 40. Is there any
>>> specific reason? If I need unique mapping, I have to use 1 instead of 40.
>>> Am I correct?
>>>
>>> Thanks
>>>
>>> SP
>>>
>>>
>>>
>>
>>
>>
> --
> Sumathy Puvanendiran
> Graduate student
>
>
>
>

Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread puvan001


Hi David,

Thanks! When I tried to run Tophat, it doesn't recognise my FASTA file and 
it says "History does not include a dataset of the required format / 
build". Do you have any thoughts about this?


Now it makes more sense about "multihits". Thanks for sharing your 
workflow.


With regards

Sumathy

On May 6 2011, David Matthews wrote:


Hi,

I have done exactly the same kind of thing for adenovirus so I can help with
it. In answer to question 1, you do not need to index the FASTA file; that
will be done for you when TopHat is called. Secondly, you should leave the 40
multihits setting as it is and filter out the multihits after the analysis.
This will allow you to determine whether you have a multihit problem and, if
so, how big it is and where it is on the genome. I have a workflow on Galaxy
which you can use called "Bristol workflow to get sorted unique proper pair
mapped reads". If you plug in your SAM file it should give you files listing
only unique hits and those which map more than once. This workflow assumes you
have paired-end data but it can be modified to work with single-end reads as
well.


Hope this helps.


Best Wishes,
David.

__
Dr David A. Matthews

Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.

Tel. +44 117 3312058
Fax. +44 117 3312091

d.a.matth...@bristol.ac.uk






On 6 May 2011, at 17:09, puvan...@umn.edu wrote:


Hi

I have a couple of questions regarding RNA-seq analysis. My questions are:

1. I need to use a viral genome (very small, ~2 kb) as a reference genome
and it is not available in Galaxy. I guess I can use this data from my
history. I have a FASTA file but I am not sure whether I have to do some
kind of indexing or not.

2. In TopHat, the default for "maximum number of alignments to be allowed"
is 40. My understanding is that a single read can be aligned to a maximum
of 40 different places. I am wondering why this is 40. Is there any specific
reason? If I need unique mapping, I have to use 1 instead of 40. Am I
correct?



Thanks

SP








--
Sumathy Puvanendiran
Graduate student





Re: [galaxy-user] RNA seq analysis

2011-05-06 Thread David Matthews
Hi,

I have done exactly the same kind of thing for adenovirus so I can help with
it. In answer to question 1, you do not need to index the FASTA file; that
will be done for you when TopHat is called. Secondly, you should leave the 40
multihits setting as it is and filter out the multihits after the analysis.
This will allow you to determine whether you have a multihit problem and, if
so, how big it is and where it is on the genome. I have a workflow on Galaxy
which you can use called "Bristol workflow to get sorted unique proper pair
mapped reads". If you plug in your SAM file it should give you files listing
only unique hits and those which map more than once. This workflow assumes you
have paired-end data but it can be modified to work with single-end reads as
well.
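
As a rough illustration of the post-analysis filtering described above (this is not the Bristol workflow itself), the sketch below splits SAM alignment lines into unique hits and multihits. It assumes the aligner wrote the optional NH:i tag (number of reported alignments for the read), which TopHat output normally carries; the function name and the sample lines are hypothetical.

```python
def split_by_hit_count(sam_lines):
    """Return (unique, multi) lists of alignment lines based on the NH:i tag."""
    unique, multi = [], []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:          # FLAG bit 0x4: read is unmapped
            continue
        hits = 1                          # assumption: no NH tag means one hit
        for tag in fields[11:]:           # optional fields start at column 12
            if tag.startswith("NH:i:"):
                hits = int(tag[5:])
                break
        (unique if hits == 1 else multi).append(line)
    return unique, multi


# Made-up alignments against a small "virus" reference.
sam = [
    "@SQ\tSN:virus\tLN:2000",
    "r1\t0\tvirus\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tvirus\t50\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r2\t256\tvirus\t90\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
unique, multi = split_by_hit_count(sam)
print(len(unique), "unique;", len(multi), "multihit alignments")
```

Keeping the default of 40 and filtering afterwards, as suggested, preserves the multihit lines so their positions on the genome can still be inspected.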

Hope this helps.


Best Wishes,
David.

__
Dr David A. Matthews

Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.

Tel. +44 117 3312058
Fax. +44 117 3312091

d.a.matth...@bristol.ac.uk






On 6 May 2011, at 17:09, puvan...@umn.edu wrote:

> Hi
> 
> I have a couple of questions regarding RNA-seq analysis. My questions are:
>
> 1. I need to use a viral genome (very small, ~2 kb) as a reference genome
> and it is not available in Galaxy. I guess I can use this data from my
> history. I have a FASTA file but I am not sure whether I have to do some
> kind of indexing or not.
>
> 2. In TopHat, the default for "maximum number of alignments to be allowed"
> is 40. My understanding is that a single read can be aligned to a maximum
> of 40 different places. I am wondering why this is 40. Is there any specific
> reason? If I need unique mapping, I have to use 1 instead of 40. Am I
> correct?
> 
> 
> Thanks
> 
> SP
> 
> 
> 


Re: [galaxy-user] RNA seq analysis

2011-05-03 Thread Jennifer Jackson

Hello,

On 4/28/11 9:21 PM, puvan...@umn.edu wrote:

> I am new to Galaxy and I am not sure whether these topics were discussed
> earlier. I followed the steps up to Cufflinks and I did not have any
> problems. Thanks for the RNA-seq tutorial. My questions are:
> 1. How do I know the number of reads mapped against the reference genome
> used after TopHat mapping?


Please try: "NGS: SAM Tools -> flagstat provides simple stats on BAM files"
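
For a sense of what such a count involves, here is a minimal sketch that tallies mapped reads directly from SAM text. flagstat itself works on BAM files; this sketch reads the text form purely for illustration. It relies on the SAM FLAG bits 0x4 (read unmapped) and 0x100 (secondary alignment); the function name and sample lines are invented.

```python
def count_mapped_reads(sam_lines):
    """Count (total, mapped) primary records in SAM text."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x100:              # secondary alignment: same read again
            continue
        total += 1
        if not flag & 0x4:            # bit 0x4 clear means the read mapped
            mapped += 1
    return total, mapped


# Made-up records: r1 and r2 map, r2 has one secondary hit, r3 is unmapped.
records = [
    "@SQ\tSN:virus\tLN:2000",
    "r1\t0\tvirus\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tvirus\t50\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t256\tvirus\t90\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
total, mapped = count_mapped_reads(records)
print("%d of %d reads mapped" % (mapped, total))
```

With paired-end data each mate is its own record, so the totals count mates rather than pairs.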


> 2. I am aware that Cuffdiff is used to find the differences in
> expression. How do I combine replicates (3) of different treatments?


Set the "NGS: RNA Analysis -> Cuffdiff" form so that the second choice, 
"Perform replicate analysis:", is "Yes". The form will then update to let 
you add and define groups of datasets from your history.


Best wishes for your research,

Jen
Galaxy team





SP


--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org


Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-08 Thread David K Crossman
Jeremy,

Thank you very much for this information. One quick question. I added the
gene_id values to the 10th column of my patched GTF file. After uploading it
to Galaxy, the column doesn't have a name (the others do, e.g. column 1 =
Seqname; column 2 = Source; etc.). Do I need to assign it a name (e.g.
gene_name or gene_id) for it to be recognized, and if so, how do you assign
column names to GTF files?

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 9:40 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David,

Your analysis looks reasonable. In fact, in your isoform tracking FPKM file you 
get nearest_ref_id, so that's promising. What I think is needed is the addition 
of an attribute called gene_name to your reference file; you can use whatever 
value you want for gene name, and using the same value as gene_id probably 
makes sense.

Rerun your analysis with the further-patched GTF file, and let us know if this 
doesn't solve the problem. Also note that even using this attribute, some gene 
name/ids and some nearest_ref_id columns will not be populated in some cuffdiff 
files. See the post from Howie in this thread for an explanation from a 
Cufflinks developer: http://seqanswers.com/forums/showthread.php?t=6288
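
The gene_name addition suggested above might look like the following sketch. It assumes a standard nine-column, tab-separated GTF in which attributes live in the 9th column as `key "value";` pairs (so gene_name is an attribute inside that column, not a separate column). The function name and the sample line are made up for illustration.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]*)";')


def add_gene_name(gtf_line):
    """Append gene_name (copied from gene_id) to the 9th GTF column if absent."""
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line                       # comments and blanks pass through
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    attributes = fields[8].rstrip()
    match = GENE_ID.search(attributes)
    if match and "gene_name " not in attributes:
        if not attributes.endswith(";"):
            attributes += ";"
        attributes += ' gene_name "%s";' % match.group(1)
    fields[8] = attributes
    return "\t".join(fields)


# Hypothetical GTF line; the gene and transcript names are placeholders.
line = 'chr1\tsrc\texon\t100\t200\t0\t+\t.\tgene_id "GeneA"; transcript_id "TX1";'
patched = add_gene_name(line)
print(patched)
```

Running the function over every line of the reference file and uploading the result keeps the file at nine columns, which is what GTF-aware tools expect.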

Best,
J.

On Apr 7, 2011, at 5:00 PM, David K Crossman wrote:


Jeremy,

I've shared it with you using your email address.

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 3:42 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David, can you please share your history with me so I can take a look (History 
Options --> Share/Publish --> Share with User --> my email)?

Thanks,
J.

On Apr 7, 2011, at 3:23 PM, David K Crossman wrote:



Hello!

I would like to ask a question related to the thread below. I ran into the
same issues and was unaware of having to swap some columns around in the
GTF file. So, after swapping "the gene name from the complete table (name2
value, column 12) into the GTF file's gene_id value (which by default is
the same as transcript_id)," I uploaded this "patched" file (mm9) into
Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this "patched" GTF
file as the reference annotation. For both Cufflinks and CuffCompare, the
gene_id was present in their respective columns. The problem I have
encountered now is that in all of the CuffDiff output files, the gene_id
column is blank (it contains only a "-"; see the "gene" column below). This
example is from the CuffDiff gene expression output file:

test_id  gene  locus                 sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
XLOC_01  -     chr1:4797973-4836816  q1        q2        OK      73.1908  82.1567  0.115559         -0.71896   0.472168  no
XLOC_02  -     chr1:4847774-4887990  q1        q2        OK      81.7264  53.1165  -0.43089         2.44474    0.014496  no
XLOC_03  -     chr1:5073253-5152630  q1        q2        OK      408.289  333.749  -0.20159         2.73173    0.0063    no
XLOC_04  -     chr1:5578573-5596214  q1        q2        NOTEST  2.34764  4.79772  0.71473          -0.89735   0.369532  no


What am I doing wrong?  I am interested in the differentially 
expressed genes in this RNA-Seq dataset (as well as calling variants, which is 
my next step, but want to get this answered first before moving on).  Any info, 
suggestions or help would be greatly appreciated.

Thanks,
David


-----Original Message-----
From: galaxy-user-boun...@lists.bx.psu.edu [mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks
Sent: Friday, April 01, 2011 8:47 AM
To: ssa...@ccib.mgh.harvard.edu
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files



On Mar 31, 2011, at 12:30 PM, ssa...@ccib.mgh.harvard.edu wrote:

> Hi Jeremy,
> I used your exercise to perform an RNA-seq analysis. First I encountered a 
> problem where the gene IDs were missing from the results. Jen from the Galaxy 
> team suggested this:
>
> "Yes, the team has taken a look and there are a few things going on.
>
> The first is that when running the Cuffcompare program, a reference 
> annotation file in GTF format should be used in order to obtain the same 
> results as in Jeremy's exercise. This seemed to be missing from your runs, 
> which resulted in badly formatted output that later resulted in a poor result 
> when Cuffdiff was used.
>
> The second has to do with the reference GTF file itself. For the best 
> results, the GTF file must have the "gene_id" attribute defined in the 9th 
> column of the file and the chromosome 

Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-07 Thread Jeremy Goecks
David, can you please share your history with me so I can take a look 
(History Options --> Share/Publish --> Share with User --> my email)?


Thanks,
J.

On Apr 7, 2011, at 3:23 PM, David K Crossman wrote:


Hello!

I would like to ask a question related to the thread below. I ran into the
same issues and was unaware of having to swap some columns around in the
GTF file. So, after swapping "the gene name from the complete table (name2
value, column 12) into the GTF file's gene_id value (which by default is
the same as transcript_id)," I uploaded this "patched" file (mm9) into
Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this "patched" GTF
file as the reference annotation. For both Cufflinks and CuffCompare, the
gene_id was present in their respective columns. The problem I have
encountered now is that in all of the CuffDiff output files, the gene_id
column is blank (it contains only a "-"; see the "gene" column below). This
example is from the CuffDiff gene expression output file:


test_id  gene  locus                 sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
XLOC_01  -     chr1:4797973-4836816  q1        q2        OK      73.1908  82.1567  0.115559         -0.71896   0.472168  no
XLOC_02  -     chr1:4847774-4887990  q1        q2        OK      81.7264  53.1165  -0.43089         2.44474    0.014496  no
XLOC_03  -     chr1:5073253-5152630  q1        q2        OK      408.289  333.749  -0.20159         2.73173    0.0063    no
XLOC_04  -     chr1:5578573-5596214  q1        q2        NOTEST  2.34764  4.79772  0.71473          -0.89735   0.369532  no

What am I doing wrong?  I am interested in the  
differentially expressed genes in this RNA-Seq dataset (as well as  
calling variants, which is my next step, but want to get this  
answered first before moving on).  Any info, suggestions or help  
would be greatly appreciated.


Thanks,
David


-----Original Message-----
From: galaxy-user-boun...@lists.bx.psu.edu [mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks
Sent: Friday, April 01, 2011 8:47 AM
To: 
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files



On Mar 31, 2011, at 12:30 PM,  wrote:


> Hi Jeremy,
> I used your exercise to perform an RNA-seq analysis. First I encountered a
> problem where the gene IDs were missing from the results. Jen from the Galaxy
> team suggested this:
>
> "Yes, the team has taken a look and there are a few things going on.
>
> The first is that when running the Cuffcompare program, a reference
> annotation file in GTF format should be used in order to obtain the same
> results as in Jeremy's exercise. This seemed to be missing from your runs,
> which resulted in badly formatted output that later resulted in a poor result
> when Cuffdiff was used.
>
> The second has to do with the reference GTF file itself. For the best
> results, the GTF file must have the "gene_id" attribute defined in the 9th
> column of the file and the chromosome names must be in the same format as the
> genome native to Galaxy. Depending on the source of the reference GTF, one of
> these may need to be adjusted. Chromosome names can be adjusted using
> Galaxy's "Text Manipulation" tools. The gene_id attribute would need to be
> adjusted prior to loading into Galaxy.
>
> For mm9, using the "Get Data -> UCSC Main table browser" tool can help you to
> obtain all of the raw data necessary to create a complete GTF file with a
> gene_id identifier. Extract data from the track "RefSeq Genes" and output the
> primary data table "refGene" twice - first in GTF format, then again as the
> complete table in tabular format (not BED). Then, using your own tools, swap
> in the gene name from the complete table (name2 value, column 12) into the
> GTF file's gene_id value (which by default is the same as transcript_id).
> Upload and the tools will function as intended.
>
> The team is aware of the issues associated with GTF source files and is
> discussing solutions. Any changes to native data content will be reported to
> the mailing list in a News Brief or other communications.
>
> Our apologies for the inconvenience! Thanks for using Galaxy and please let
> us know if we can help again,
>
> Best,
>
> Jen
> Galaxy team"
>
>
> I followed the directions (or at least I think I did) and things seemed to
> work better but there is one more issue for example in file:
> Galaxy287-[Cuffdiff_on_data_197,_data_197,_and_data_274__isoform_FPKM_tracking].tabular.txt
> The column gene_short_name does not have any names in it. nearest_ref_id does
> have the gene ID info so I can still interpret the data, but I was wondering
> if there remains another problem that I'm not aware of with the GTF file.


Slim,

Please send questions to the galaxy-user mailing list (cc'd) rather

Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-07 Thread David K Crossman
Hello!



I would like to ask a question related to the thread below. I ran into the
same issues and was unaware of having to swap some columns around in the
GTF file. So, after swapping "the gene name from the complete table (name2
value, column 12) into the GTF file's gene_id value (which by default is
the same as transcript_id)," I uploaded this "patched" file (mm9) into
Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this "patched" GTF
file as the reference annotation. For both Cufflinks and CuffCompare, the
gene_id was present in their respective columns. The problem I have
encountered now is that in all of the CuffDiff output files, the gene_id
column is blank (it contains only a "-"; see the "gene" column below). This
example is from the CuffDiff gene expression output file:


test_id  gene  locus                 sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
XLOC_01  -     chr1:4797973-4836816  q1        q2        OK      73.1908  82.1567  0.115559         -0.71896   0.472168  no
XLOC_02  -     chr1:4847774-4887990  q1        q2        OK      81.7264  53.1165  -0.43089         2.44474    0.014496  no
XLOC_03  -     chr1:5073253-5152630  q1        q2        OK      408.289  333.749  -0.20159         2.73173    0.0063    no
XLOC_04  -     chr1:5578573-5596214  q1        q2        NOTEST  2.34764  4.79772  0.71473          -0.89735   0.369532  no




What am I doing wrong?  I am interested in the differentially 
expressed genes in this RNA-Seq dataset (as well as calling variants, which is 
my next step, but want to get this answered first before moving on).  Any info, 
suggestions or help would be greatly appreciated.



Thanks,

David





-----Original Message-----
From: galaxy-user-boun...@lists.bx.psu.edu [mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks
Sent: Friday, April 01, 2011 8:47 AM
To: 
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files







On Mar 31, 2011, at 12:30 PM, ssa...@ccib.mgh.harvard.edu wrote:



> Hi Jeremy,
> I used your exercise to perform an RNA-seq analysis. First I encountered a
> problem where the gene IDs were missing from the results. Jen from the Galaxy
> team suggested this:
>
> "Yes, the team has taken a look and there are a few things going on.
>
> The first is that when running the Cuffcompare program, a reference
> annotation file in GTF format should be used in order to obtain the same
> results as in Jeremy's exercise. This seemed to be missing from your runs,
> which resulted in badly formatted output that later resulted in a poor result
> when Cuffdiff was used.
>
> The second has to do with the reference GTF file itself. For the best
> results, the GTF file must have the "gene_id" attribute defined in the 9th
> column of the file and the chromosome names must be in the same format as the
> genome native to Galaxy. Depending on the source of the reference GTF, one of
> these may need to be adjusted. Chromosome names can be adjusted using
> Galaxy's "Text Manipulation" tools. The gene_id attribute would need to be
> adjusted prior to loading into Galaxy.
>
> For mm9, using the "Get Data -> UCSC Main table browser" tool can help you to
> obtain all of the raw data necessary to create a complete GTF file with a
> gene_id identifier. Extract data from the track "RefSeq Genes" and output the
> primary data table "refGene" twice - first in GTF format, then again as the
> complete table in tabular format (not BED). Then, using your own tools, swap
> in the gene name from the complete table (name2 value, column 12) into the
> GTF file's gene_id value (which by default is the same as transcript_id).
> Upload and the tools will function as intended.
>
> The team is aware of the issues associated with GTF source files and is
> discussing solutions. Any changes to native data content will be reported to
> the mailing list in a News Brief or other communications.
>
> Our apologies for the inconvenience! Thanks for using Galaxy and please let
> us know if we can help again,
>
> Best,
>
> Jen
> Galaxy team"
>
>
> I followed the directions (or at least I think I did) and things seemed to
> work better but there is one more issue for example in file:
> Galaxy287-[Cuffdiff_on_data_197,_data_197,_and_data_274__isoform_FPKM_tracking].tabular.txt
> The column gene_short_name does not have any names in it. nearest_ref_id does
> have the gene ID info so I can still interpret the data, but I was wondering
> if there remains another problem that I'm not aware of with the GTF file.



Slim,



Please send questions to the galaxy-user mailing 

Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-01 Thread Jeremy Goecks


On Mar 31, 2011, at 12:30 PM,  
 wrote:

> Hi Jeremy, 
> I used your exercise to perform an RNA-seq analysis. First I encountered a 
> problem where the gene IDs were missing from the results. Jen from the Galaxy 
> team suggested this:  
> 
> "Yes, the team has taken a look and there are a few things going on.
> 
> The first is that when running the Cuffcompare program, a reference 
> annotation file in GTF format should be used in order to obtain the same 
> results as in Jeremy's exercise. This seemed to be missing from your runs, 
> which resulted in badly formatted output that later resulted in a poor result 
> when Cuffdiff was used.
> 
> The second has to do with the reference GTF file itself. For the best 
> results, the GTF file must have the "gene_id" attribute defined in the 9th 
> column of the file and the chromosome names must be in the same format as the 
> genome native to Galaxy. Depending on the source of the reference GTF, one of 
> these may need to be adjusted. Chromosome names can be adjusted using 
> Galaxy's "Text Manipulation" tools. The gene_id attribute would need to be 
> adjusted prior to loading into Galaxy.
> 
> For mm9, using the "Get Data -> UCSC Main table browser" tool can help you to 
> obtain all of the raw data necessary to create a complete GTF file with a 
> gene_id identifier. Extract data from the track "RefSeq Genes" and output the 
> primary data table "refGene" twice - first in GTF format, then again as the 
> complete table in tabular format (not BED). Then, using your own tools, swap 
> in the gene name from the complete table (name2 value, column 12) into the 
> GTF file's gene_id value (which by default is the same as transcript_id). 
> Upload and the tools will function as intended.
> 
> The team is aware of the issues associated with GTF source files and is 
> discussing solutions. Any changes to native data content will be reported to 
> the mailing list in a News Brief or other communications.
> 
> Our apologies for the inconvenience! Thanks for using Galaxy and please let 
> us know if we can help again,
> 
> Best,
> 
> Jen
> Galaxy team"
> 
> 
> I followed the directions (or at least I think I did) and things seemed to 
> work better but there is one more issue for example in file:
> Galaxy287-[Cuffdiff_on_data_197,_data_197,_and_data_274__isoform_FPKM_tracking].tabular.txt
> The column gene_short_name does not have any names in it. nearest_ref_id does 
> have the gene ID info so I can still interpret the data, but I was wondering 
> if there remains another problem that I'm not aware of with the GTF file.

Slim,

Please send questions to the galaxy-user mailing list (cc'd) rather than 
individual Galaxy team members; there are many people on the list that may be 
able to address your question, and discussions are archived for future use as 
well. Without seeing your analysis, I'd suggest trying two things:

(1) Provide a gene annotation reference file to Cufflinks as well as to 
Cuffcompare and Cuffdiff; in other words, you'll want to do a guided assembly.
(2) Try using an Ensembl GTF, which has the gene name in the attributes.

I think (2) is more likely to generate the results you want, but there are 
many known problems in using Ensembl GTFs with Cufflinks/compare/diff.
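
For reference, the gene_id swap that Jen describes in the quoted text (taking 
the gene name from the complete refGene table and writing it into the GTF's 
gene_id) could be done outside Galaxy with a short script. This is only a 
rough sketch under some assumptions: the function names are made up for 
illustration, both files are tab-separated, and the column positions of the 
transcript name and of name2 are passed in as 0-based indexes because they 
depend on how the table was exported (check them against the header line of 
your own file).

```python
import re

GENE_ID = re.compile(r'gene_id "[^"]*"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def build_name_map(table_lines, name_col, name2_col):
    """Map transcript ID -> gene name from a tab-separated refGene table.

    name_col and name2_col are 0-based column indexes (assumed, not fixed).
    """
    mapping = {}
    for line in table_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.rstrip("\n").split("\t")
        mapping[fields[name_col]] = fields[name2_col]
    return mapping


def swap_gene_id(gtf_line, name_map):
    """Return the GTF line with gene_id replaced by the mapped gene name."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return gtf_line.rstrip("\n")  # not a feature line; leave it untouched
    match = TRANSCRIPT_ID.search(fields[8])
    if match and match.group(1) in name_map:
        gene = name_map[match.group(1)]
        # A function is used as the replacement so the name is inserted literally.
        fields[8] = GENE_ID.sub(lambda m: 'gene_id "%s"' % gene, fields[8])
    return "\t".join(fields)
```

Lines whose transcript_id is not in the table are passed through unchanged, 
so it is worth checking afterwards how many gene_id values were rewritten.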

Good luck,
J.


Re: [galaxy-user] RNA seq analysis

2011-02-24 Thread David Matthews
Hi Jeremy,

Thanks for the feedback. I know what you mean about tophat not having the same 
functionality as bowtie. However, whatever tophat does (now or in the future), 
I think it is useful to collect the multihits separately, since either you 
leave them in and overestimate gene expression or remove them and 
underestimate it. As you suggested, I put this up on Seqanswers to see whether 
anyone else likes or doesn't like it; we'll see how it goes. I certainly find 
it handy, not least to reassure myself that when I get the gene expression 
data I can tell if there are any "funny" reads making up the numbers!

Cheers
David

P.S. I modified the workflow to include collecting the multihits in a separate 
sorted sam file. 
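
For anyone wanting to do the sorting step outside Galaxy, here is a minimal 
sketch of coordinate-sorting SAM records. It is not what the workflow itself 
runs; the function name is invented for illustration, and it orders reference 
names alphabetically (so chr10 comes before chr2) rather than following the 
order of the @SQ header lines, which is a simplification.

```python
def sort_sam_records(sam_lines):
    """Return header lines followed by alignments sorted by reference and position."""
    header = [line for line in sam_lines if line.startswith("@")]
    records = [line for line in sam_lines if not line.startswith("@")]

    def coordinate(line):
        fields = line.split("\t")
        # Column 3 is the reference name, column 4 the 1-based leftmost position.
        return (fields[2], int(fields[3]))

    return header + sorted(records, key=coordinate)
```

For real data sets a dedicated tool such as samtools sort is the safer choice; 
the sketch only shows which two columns the ordering depends on.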


On 24 Feb 2011, at 04:05, Jeremy Goecks wrote:

> Hi David,
> 
> This is a really interesting workflow. My comments:
> 
> (1) I encourage you to start a discussion about this idea on seqanswers.com; 
> you'll reach more people and may have a better discussion there. Ideally, 
> you'll get a Tophat developer to chime in on what I perceive to be the main 
> issue, which is:
> 
>> This may seem similar to setting tophat to ignore non-unique reads. However, 
>> it is not. This approach gives you 10-15% more reads. I think it is because 
>> if tophat finds (for example) that the forward read maps to one site but the 
>> reverse read maps to two sites it throws away the whole read.
> 
> Remember that Tophat uses Bowtie to map reads, so it would make sense to look 
> carefully at the Bowtie documentation to see how it handles paired-end reads. 
> I can't find anything that directly addresses your issue. The other thing to 
> consider is how Tophat maps reads -- it breaks them up in order to find 
> splice junctions -- and so I'm not sure that Tophat/Bowtie is really mapping 
> paired reads; it may be doing some hybrid single/paired-end mapping. Also, at 
> one time, you could specify Bowtie parameters when running Tophat, but I 
> don't see that option anymore.
> 
> (2) It would be interesting to know whether you get qualitatively different 
> results via Cufflinks (or another transcriptome analysis software package) 
> using your method vs. just using Tophat w/ and w/o ignoring non-unique reads. 
> A skeptical view of your workflow would note that (a) multi-mapping reads may 
> be legitimate and should not be filtered out and (b) Cufflinks/compare/diff 
> assembly and quantitation may smooth out stray reads enough so that your 
> method isn't necessary.
> 
> Thanks for the interesting post,
> J.
> 
> On Feb 23, 2011, at 9:41 AM, David Matthews wrote:
> 
>> Hi Jeremy,
>> 
>> I thought I'd write to get a discussion of a workflow for people doing RNA 
>> seq that I have found very useful and addresses some issues in mapping mRNA 
>> derived RNA-seq paired end data to the genome using tophat. Here is the 
>> approach I use (I have a human mRNA sample deep sequenced with a 56bp paired 
>> end read on an illumina generating 29 million reads):
>> 
>> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for 
>> each sequence read
>> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is 
>> mapped in a proper pair"
>> 3. Use "group" to group the filtered sam file on c1 (which is the 
>> "bio-sequencer" read number) and set an operation to count on c1 as well. 
>> This provides a list of the reads and how many times they map to the human 
>> genome, because you have filtered the set for reads that have a mate pair 
>> there will be an even number for each read. For most of the reads the number 
>> will be 2 (indicating the forward read maps once and the reverse read maps 
>> once and in a proper pair) but for reads that map ambiguously the number 
>> will be multiples of 2. If you count these up I find that 18 million reads 
>> map once, 1.3 million map twice, 400,000 reads map 3 times and so on until 
>> you get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
>> 4. Filter the reads to remove any reads that map more than 2 times.
>> 5. Use "compare two datasets" to compare your new list of reads that map 
>> only twice to pull out all the reads in your sam file that only map twice 
>> (i.e. the mate pairs).
>> 6. You'll need to sort the sam file before you can use it with other 
>> applications like IGV.
>> 
>> What you end up with is a sam file where all the reads map to one site only 
>> and all the reads map as a proper pair. This may seem similar to setting 
>> tophat to ignore non-unique reads. However, it is not. This approach gives 
>> you 10-15% more reads. I think it is because if tophat finds (for example) 
>> that the forward read maps to one site but the reverse read maps to two 
>> sites it throws away the whole read. By filtering the sam file to restrict 
>> it to only those mappings that make sense you increase the number of unique 
>> reads by getting rid of irrational mappings.
>> 
>> Has anyone else found this

Re: [galaxy-user] RNA seq analysis

2011-02-24 Thread Ann Loraine
Hello,

I like your approach of running the alignment tools with liberal settings
and then filtering the results into different categories.

This discussion reminds me of how in expression microarray analysis, we face
uncertainty as to what molecules (exactly) are hybridizing to the probes on
a chip. 

Maybe the ambiguity of mapping short sequence reads introduces similar
uncertainty?  

I also like your idea of capturing the reads that map multiple times.

It's interesting to visualize the alignments for reads that map onto
multiple locations in a genome.

An example (from data expressed in "wiggle" format) is described here:

https://wiki.transvar.org/confluence/x/w4BJAQ

My apologies for posting another IGB citation, but I think it can be
interesting and informative to see the data in this way, and IGB makes it
easy to zoom in and out through the data and find patterns quickly.

One of the first things I noticed when I started looking at coverage graphs
made from multi-mapping reads is that (1) there are a lot of them and (2)
they expose tandemly duplicated genes.

I attach an image that shows a particularly striking example from a
single-read, 75 bp RNA-Seq data set from Arabidopsis thaliana Col-0.

The pattern of read alignment is nearly identical between the two genes.

You can't see it from the image, of course, but if I right-click one of the
genes, IGB links out to a Web page describing the gene at
www.arabidopsis.org, the main on-line database for Arabidopsis. (Human genes
link to NCBI.) 

-Ann





On 2/23/11 11:05 PM, "Jeremy Goecks"  wrote:

> Hi David,
> 
> This is a really interesting workflow. My comments:
> 
> (1) I encourage you to start a discussion about this idea on seqanswers.com;
> you'll reach more people and may have a better discussion there. Ideally,
> you'll get a Tophat developer to chime in on what I perceive to be the main
> issue, which is:
> 
>> This may seem similar to setting tophat to ignore non-unique reads. However,
>> it is not. This approach gives you 10-15% more reads. I think it is because
>> if tophat finds (for example) that the forward read maps to one site but the
>> reverse read maps to two sites it throws away the whole read.
> 
> Remember that Tophat uses Bowtie to map reads, so it would make sense to look
> carefully at the Bowtie documentation to see how it handles paired-end reads.
> I can't find anything that directly addresses your issue. The other thing to
> consider is how Tophat maps reads -- it breaks them up in order to find splice
> junctions -- and so I'm not sure that Tophat/Bowtie is really mapping paired
> reads; it may be doing some hybrid single/paired-end mapping. Also, at one
> time, you could specify Bowtie parameters when running Tophat, but I don't see
> that option anymore.
> 
> (2) It would be interesting to know whether you get qualitatively different
> results via Cufflinks (or another transcriptome analysis software package)
> using your method vs. just using Tophat w/ and w/o ignoring non-unique reads.
> A skeptical view of your workflow would note that (a) multi-mapping reads may
> be legitimate and should not be filtered out and (b) Cufflinks/compare/diff
> assembly and quantitation may smooth out stray reads enough so that your
> method isn't necessary.
> 
> Thanks for the interesting post,
> J.
> 
> On Feb 23, 2011, at 9:41 AM, David Matthews wrote:
> 
>> Hi Jeremy,
>> 
>> I thought I'd write to get a discussion of a workflow for people doing RNA
>> seq that I have found very useful and addresses some issues in mapping mRNA
>> derived RNA-seq paired end data to the genome using tophat. Here is the
>> approach I use (I have a human mRNA sample deep sequenced with a 56bp paired
>> end read on an illumina generating 29 million reads):
>> 
>> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for
>> each sequence read
>> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is
>> mapped in a proper pair"
>> 3. Use "group" to group the filtered sam file on c1 (which is the
>> "bio-sequencer" read number) and set an operation to count on c1 as well.
>> This provides a list of the reads and how many times they map to the human
>> genome, because you have filtered the set for reads that have a mate pair
>> there will be an even number for each read. For most of the reads the number
>> will be 2 (indicating the forward read maps once and the reverse read maps
>> once and in a proper pair) but for reads that map ambiguously the number will
>> be multiples of 2. If you count these up I find that 18 million reads map
>> once, 1.3 million map twice, 400,000 reads map 3 times and so on until you
>> get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
>> 4. Filter the reads to remove any reads that map more than 2 times.
>> 5. Use "compare two datasets" to compare your new list of reads that map only
>> twice to pull out all the reads in your sam file that on

Re: [galaxy-user] RNA seq analysis

2011-02-24 Thread David Matthews
Thanks Ann for your comments and for the stuff you showed at IGB - looks very 
interesting. I agree that multihits may be the equivalent of the problem you 
describe from microarrays. I think, for me anyway, knowing the scale of the 
issue is the key thing at this stage. As you imply in your email, the next 
(and potentially very interesting) step is to figure out how/where these 
multihits are and how they came to be. I guess it all comes down to where 
genes come from. Well, many of them come from other genes via duplications, 
transpositions etc.!
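
One way to put a number on the scale of the multihit issue is to tally how 
many read pairs map to one site, two sites, and so on. The sketch below 
assumes the SAM records have already been filtered to proper pairs and that 
both mates of a pair share the same read name in column 1, so each mapped 
location contributes two records. The function name is just for illustration.

```python
from collections import Counter


def multihit_histogram(sam_lines):
    """Count how many read pairs map to 1 site, 2 sites, 3 sites, ..."""
    records_per_read = Counter(
        line.split("\t", 1)[0]           # column 1 is the read name
        for line in sam_lines
        if not line.startswith("@")      # skip SAM header lines
    )
    # Two records (forward and reverse mate) per mapped location of a pair.
    return Counter(n // 2 for n in records_per_read.values())
```

The result maps "number of sites" to "number of read pairs", which is the same 
kind of breakdown as the counts quoted from the original post.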

I have made a slight alteration to this "bristol" workflow which now 
automatically creates a sorted sam file of the multihits (forgot to put it in 
1st time round!)

Cheers
David


On 24 Feb 2011, at 12:08, Ann Loraine wrote:

> 
> Hello,
> 
> I like your approach of running the alignment tools with liberal settings and 
> then filtering the results into different categories.
> 
> This discussion reminds me of how in expression microarray analysis, we face 
> uncertainty as to what molecules (exactly) are hybridizing to the probes on a 
> chip. 
> 
> Maybe the ambiguity of mapping short sequence reads introduces similar 
> uncertainty?  
> 
> I also like your idea of capturing the reads that map multiple times. 
> 
> It’s interesting to visualize the alignments for reads that map onto multiple 
> locations in a genome.
> 
> An example (from data expressed in “wiggle” format) is described here:
> 
> https://wiki.transvar.org/confluence/x/w4BJAQ
> 
> My apologies for posting another IGB citation, but I think it can be 
> interesting and informative to see the data in this way, and IGB makes it 
> easy to zoom in and out through the data and find patterns quickly.
> 
> One of the first things I noticed when I started looking at coverage graphs 
> made from multi-mapping reads is that (1) there are a lot of them and (2) 
> they expose tandemly duplicated genes.  
> 
> Here’s a link to an image showing a particularly striking example from a 
> single-read, 75 bp RNA-Seq data set from Arabidopsis thaliana Col-0. The 
> pattern of read alignment is nearly identical between the two genes. 
> https://wiki.transvar.org/confluence/download/attachments/21594307/tandem-duplication.png
> 
> You can’t see it from the image, of course, but if I right-click one of the 
> genes, IGB links out to a Web page describing the gene at 
> www.arabidopsis.org, the main on-line database for Arabidopsis. (Human genes 
> link to NCBI.) 
> 
> -Ann
> 
> 
> On 2/23/11 11:05 PM, "Jeremy Goecks"  wrote:
> 
>> Hi David,
>> 
>> This is a really interesting workflow. My comments:
>> 
>> (1) I encourage you to start a discussion about this idea on 
>> seqanswers.com; you'll reach more people and may have a better 
>> discussion there. Ideally, you'll get a Tophat developer to chime in on what 
>> I perceive to be the main issue, which is:
>> 
>>> This may seem similar to setting tophat to ignore non-unique reads. 
>>> However, it is not. This approach gives you 10-15% more reads. I think it 
>>> is because if tophat finds (for example) that the forward read maps to one 
>>> site but the reverse read maps to two sites it throws away the whole read.
>> 
>> Remember that Tophat uses Bowtie to map reads, so it would make sense to 
>> look carefully at the Bowtie documentation to see how it handles paired-end 
>> reads. I can't find anything that directly addresses your issue. The other 
>> thing to consider is how Tophat maps reads -- it breaks them up in order to 
>> find splice junctions -- and so I'm not sure that Tophat/Bowtie is really 
>> mapping paired reads; it may be doing some hybrid single/paired-end mapping. 
>> Also, at one time, you could specify Bowtie parameters when running Tophat, 
>> but I don't see that option anymore.
>> 
>> (2) It would be interesting to know whether you get qualitatively different 
>> results via Cufflinks (or another transcriptome analysis software package) 
>> using your method vs. just using Tophat w/ and w/o ignoring non-unique 
>> reads. A skeptical view of your workflow would note that (a) multi-mapping 
>> reads may be legitimate and should not be filtered out and (b) 
>> Cufflinks/compare/diff assembly and quantitation may smooth out stray reads 
>> enough so that your method isn't necessary.
>> 
>> Thanks for the interesting post,
>> J.
>> 
>> On Feb 23, 2011, at 9:41 AM, David Matthews wrote:
>> 
>>> Hi Jeremy,
>>> 
>>> I thought I'd write to get a discussion of a workflow for people doing RNA 
>>> seq that I have found very useful and addresses some issues in mapping mRNA 
>>> derived RNA-seq paired end data to the genome using tophat. Here is the 
>>> approach I use (I have a human mRNA sample deep sequenced with a 56bp 
>>> paired end read on an illumina generating 29 million reads):
>>> 
>>> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for 
>>> each sequence read
>>> 2. In samtools filt

Re: [galaxy-user] RNA seq analysis

2011-02-24 Thread Ann Loraine

Hello,

I like your approach of running the alignment tools with liberal settings
and then filtering the results into different categories.

This discussion reminds me of how in expression microarray analysis, we face
uncertainty as to what molecules (exactly) are hybridizing to the probes on
a chip. 

Maybe the ambiguity of mapping short sequence reads introduces similar
uncertainty?  

I also like your idea of capturing the reads that map multiple times.

It's interesting to visualize the alignments for reads that map onto
multiple locations in a genome.

An example (from data expressed in "wiggle" format) is described here:

https://wiki.transvar.org/confluence/x/w4BJAQ

My apologies for posting another IGB citation, but I think it can be
interesting and informative to see the data in this way, and IGB makes it
easy to zoom in and out through the data and find patterns quickly.

One of the first things I noticed when I started looking at coverage graphs
made from multi-mapping reads is that (1) there are a lot of them and (2)
they expose tandemly duplicated genes.

Here's a link to an image showing a particularly striking example from
a single-read, 75 bp RNA-Seq data set from Arabidopsis thaliana Col-0. The
pattern of read alignment is nearly identical between the two genes.
https://wiki.transvar.org/confluence/download/attachments/21594307/tandem-duplication.png

You can't see it from the image, of course, but if I right-click one of the
genes, IGB links out to a Web page describing the gene at
www.arabidopsis.org, the main on-line database for Arabidopsis. (Human genes
link to NCBI.) 

-Ann


On 2/23/11 11:05 PM, "Jeremy Goecks"  wrote:

> Hi David,
> 
> This is a really interesting workflow. My comments:
> 
> (1) I encourage you to start a discussion about this idea on seqanswers.com;
> you'll reach more people and may have a better discussion there. Ideally,
> you'll get a Tophat developer to chime in on what I perceive to be the main
> issue, which is:
> 
>> This may seem similar to setting tophat to ignore non-unique reads. However,
>> it is not. This approach gives you 10-15% more reads. I think it is because
>> if tophat finds (for example) that the forward read maps to one site but the
>> reverse read maps to two sites it throws away the whole read.
> 
> Remember that Tophat uses Bowtie to map reads, so it would make sense to look
> carefully at the Bowtie documentation to see how it handles paired-end reads.
> I can't find anything that directly addresses your issue. The other thing to
> consider is how Tophat maps reads -- it breaks them up in order to find splice
> junctions -- and so I'm not sure that Tophat/Bowtie is really mapping paired
> reads; it may be doing some hybrid single/paired-end mapping. Also, at one
> time, you could specify Bowtie parameters when running Tophat, but I don't see
> that option anymore.
> 
> (2) It would be interesting to know whether you get qualitatively different
> results via Cufflinks (or another transcriptome analysis software package)
> using your method vs. just using Tophat w/ and w/o ignoring non-unique reads.
> A skeptical view of your workflow would note that (a) multi-mapping reads may
> be legitimate and should not be filtered out and (b) Cufflinks/compare/diff
> assembly and quantitation may smooth out stray reads enough so that your
> method isn't necessary.
> 
> Thanks for the interesting post,
> J.
> 
> On Feb 23, 2011, at 9:41 AM, David Matthews wrote:
> 
>> Hi Jeremy,
>> 
>> I thought I'd write to get a discussion of a workflow for people doing RNA
>> seq that I have found very useful and addresses some issues in mapping mRNA
>> derived RNA-seq paired end data to the genome using tophat. Here is the
>> approach I use (I have a human mRNA sample deep sequenced with a 56bp paired
>> end read on an illumina generating 29 million reads):
>> 
>> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for
>> each sequence read
>> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is
>> mapped in a proper pair"
>> 3. Use "group" to group the filtered sam file on c1 (which is the
>> "bio-sequencer" read number) and set an operation to count on c1 as well.
>> This provides a list of the reads and how many times they map to the human
>> genome, because you have filtered the set for reads that have a mate pair
>> there will be an even number for each read. For most of the reads the number
>> will be 2 (indicating the forward read maps once and the reverse read maps
>> once and in a proper pair) but for reads that map ambiguously the number will
>> be multiples of 2. If you count these up I find that 18 million reads map
>> once, 1.3 million map twice, 400,000 reads map 3 times and so on until you
>> get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
>> 4. Filter the reads to remove any reads that map more than 2 times.
>> 5. Use "compare two datasets" to compar

Re: [galaxy-user] RNA seq analysis

2011-02-23 Thread vasu punj
David,
 
Just curious: are you using the latest version of TopHat (1.2.0, released on 
01/18/11)? I was in contact with the TopHat team about a similar issue with 
paired-end reads. They suggested that this concern has been addressed in the 
new version. I still have to confirm the claim with my own run.
 
Vasu Punj 

--- On Wed, 2/23/11, Jeremy Goecks  wrote:


From: Jeremy Goecks 
Subject: Re: [galaxy-user] RNA seq analysis
To: "David Matthews" 
Cc: galaxy-u...@bx.psu.edu
Date: Wednesday, February 23, 2011, 10:05 PM


Hi David, 


This is a really interesting workflow. My comments:


(1) I encourage you to start a discussion about this idea on seqanswers.com; 
you'll reach more people and may have a better discussion there. Ideally, 
you'll get a Tophat developer to chime in on what I perceive to be the main 
issue, which is:





This may seem similar to setting tophat to ignore non-unique reads. However, it 
is not. This approach gives you 10-15% more reads. I think it is because if 
tophat finds (for example) that the forward read maps to one site but the 
reverse read maps to two sites it throws away the whole read.

Remember that Tophat uses Bowtie to map reads, so it would make sense to look 
carefully at the Bowtie documentation to see how it handles paired-end reads. I 
can't find anything that directly addresses your issue. The other thing to 
consider is how Tophat maps reads -- it breaks them up in order to find splice 
junctions -- and so I'm not sure that Tophat/Bowtie is really mapping paired 
reads; it may be doing some hybrid single/paired-end mapping. Also, at one 
time, you could specify Bowtie parameters when running Tophat, but I don't see 
that option anymore.


(2) It would be interesting to know whether you get qualitatively different 
results via Cufflinks (or another transcriptome analysis software package) 
using your method vs. just using Tophat w/ and w/o ignoring non-unique reads. A 
skeptical view of your workflow would note that (a) multi-mapping reads may be 
legitimate and should not be filtered out and (b) Cufflinks/compare/diff 
assembly and quantitation may smooth out stray reads enough so that your method 
isn't necessary.


Thanks for the interesting post,
J.



On Feb 23, 2011, at 9:41 AM, David Matthews wrote:


Hi Jeremy, 


I thought I'd write to get a discussion of a workflow for people doing RNA seq 
that I have found very useful and addresses some issues in mapping mRNA derived 
RNA-seq paired end data to the genome using tophat. Here is the approach I use 
(I have a human mRNA sample deep sequenced with a 56bp paired end read on an 
illumina generating 29 million reads):


1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for each 
sequence read
2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is 
mapped in a proper pair"
3. Use "group" to group the filtered sam file on c1 (which is the 
"bio-sequencer" read number) and set an operation to count on c1 as well. This 
provides a list of the reads and how many times they map to the human genome, 
because you have filtered the set for reads that have a mate pair there will be 
an even number for each read. For most of the reads the number will be 2 
(indicating the forward read maps once and the reverse read maps once and in a 
proper pair) but for reads that map ambiguously the number will be multiples of 
2. If you count these up I find that 18 million reads map once, 1.3 million map 
twice, 400,000 reads map 3 times and so on until you get down to 1 read mapping 
30 times, 1 read mapping 31 times and so on...
4. Filter the reads to remove any reads that map more than 2 times.
5. Use "compare two datasets" to compare your new list of reads that map only 
twice to pull out all the reads in your sam file that only map twice (i.e. the 
mate pairs).
6. You'll need to sort the sam file before you can use it with other 
applications like IGV.


What you end up with is a sam file where all the reads map to one site only and 
all the reads map as a proper pair. This may seem similar to setting tophat to 
ignore non-unique reads. However, it is not. This approach gives you 10-15% 
more reads. I think it is because if tophat finds (for example) that the 
forward read maps to one site but the reverse read maps to two sites it throws 
away the whole read. By filtering the sam file to restrict it to only those 
mappings that make sense you increase the number of unique reads by getting rid 
of irrational mappings.


Has anyone else found this? Does this make sense to anyone else? Am I making a 
huge mistake somewhere?


A nice aspect of this (or at least I think so!) is that by filtering in this 
manner you can also create a sam file of non-unique mappings which you can 
monitor. This can be useful if one or more genes has a problem of generating a 
lot of non-unique maps which may give problems 

Re: [galaxy-user] RNA seq analysis

2011-02-23 Thread Jeremy Goecks
Hi David,

This is a really interesting workflow. My comments:

(1) I encourage you to start a discussion about this idea on seqanswers.com; 
you'll reach more people and may have a better discussion there. Ideally, 
you'll get a Tophat developer to chime in on what I perceive to be the main 
issue, which is:

> This may seem similar to setting tophat to ignore non-unique reads. However, 
> it is not. This approach gives you 10-15% more reads. I think it is because 
> if tophat finds (for example) that the forward read maps to one site but the 
> reverse read maps to two sites it throws away the whole read.

Remember that Tophat uses Bowtie to map reads, so it would make sense to look 
carefully at the Bowtie documentation to see how it handles paired-end reads. I 
can't find anything that directly addresses your issue. The other thing to 
consider is how Tophat maps reads -- it breaks them up in order to find splice 
junctions -- and so I'm not sure that Tophat/Bowtie is really mapping paired 
reads; it may be doing some hybrid single/paired-end mapping. Also, at one 
time, you could specify Bowtie parameters when running Tophat, but I don't see 
that option anymore.

(2) It would be interesting to know whether you get qualitatively different 
results via Cufflinks (or another transcriptome analysis software package) 
using your method vs. just using Tophat w/ and w/o ignoring non-unique reads. A 
skeptical view of your workflow would note that (a) multi-mapping reads may be 
legitimate and should not be filtered out and (b) Cufflinks/compare/diff 
assembly and quantitation may smooth out stray reads enough so that your method 
isn't necessary.

Thanks for the interesting post,
J.

On Feb 23, 2011, at 9:41 AM, David Matthews wrote:

> Hi Jeremy,
> 
> I thought I'd write to get a discussion of a workflow for people doing RNA 
> seq that I have found very useful and addresses some issues in mapping mRNA 
> derived RNA-seq paired end data to the genome using tophat. Here is the 
> approach I use (I have a human mRNA sample deep sequenced with a 56bp paired 
> end read on an illumina generating 29 million reads):
> 
> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for 
> each sequence read
> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is 
> mapped in a proper pair"
> 3. Use "group" to group the filtered sam file on c1 (which is the 
> "bio-sequencer" read number) and set an operation to count on c1 as well. 
> This provides a list of the reads and how many times they map to the human 
> genome, because you have filtered the set for reads that have a mate pair 
> there will be an even number for each read. For most of the reads the number 
> will be 2 (indicating the forward read maps once and the reverse read maps 
> once and in a proper pair) but for reads that map ambiguously the number will 
> be multiples of 2. If you count these up I find that 18 million reads map 
> once, 1.3 million map twice, 400,000 reads map 3 times and so on until you 
> get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
> 4. Filter the reads to remove any reads that map more than 2 times.
> 5. Use "compare two datasets" to compare your new list of reads that map only 
> twice to pull out all the reads in your sam file that only map twice (i.e. 
> the mate pairs).
> 6. You'll need to sort the sam file before you can use it with other 
> applications like IGV.
> 
> What you end up with is a sam file where all the reads map to one site only 
> and all the reads map as a proper pair. This may seem similar to setting 
> tophat to ignore non-unique reads. However, it is not. This approach gives 
> you 10-15% more reads. I think it is because if tophat finds (for example) 
> that the forward read maps to one site but the reverse read maps to two sites 
> it throws away the whole read. By filtering the sam file to restrict it to 
> only those mappings that make sense you increase the number of unique reads 
> by getting rid of irrational mappings.
> 
> Has anyone else found this? Does this make sense to anyone else? Am I making 
> a huge mistake somewhere?
> 
> A nice aspect of this (or at least I think so!) is that by filtering in this
> manner you can also create a SAM file of the non-unique mappings, which you
> can monitor. This is useful if one or more genes generate a lot of
> non-unique mappings, which may make it hard to estimate their expression
> accurately. You also get a list of how many multi-hits you have in your
> data, so you know the scale of the problem.
> 
> Best Wishes,
> David.
> 
> __
> Dr David A. Matthews
> 
> Senior Lecturer in Virology
> Room E49
> Department of Cellular and Molecular Medicine,
> School of Medical Sciences
> University Walk,
> University of Bristol
> Bristol.
> BS8 1TD
> U.K.
> 
> Tel. +44 117 3312058
> Fax. +44 117 3312091
> 
> d.a.matth...@bristol.ac.uk
> 
> 
> 
> 

__

Re: [galaxy-user] RNA seq analysis

2011-02-23 Thread David Matthews
Hi all,

Further to my last email, I've published a workflow (Bristol workflow)
which does what I described below - hope this helps in understanding what I'm 
on about (!).

Best Wishes,
David.



On 23 Feb 2011, at 14:41, David Matthews wrote:

> Hi Jeremy,
> 
> I thought I'd write to start a discussion of a workflow for people doing
> RNA-seq. I have found it very useful, and it addresses some issues in mapping
> mRNA-derived paired-end RNA-seq data to the genome using TopHat. Here is the
> approach I use (I have a human mRNA sample deep sequenced with 56bp
> paired-end reads on an Illumina, generating 29 million reads):
> 
> 1. Align to hg19 (in my case) using TopHat, allowing up to 40 hits for
> each sequence read
> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is 
> mapped in a proper pair"
> 3. Use "group" to group the filtered SAM file on c1 (which is the
> "bio-sequencer" read number) and set an operation to count on c1 as well.
> This gives a list of the reads and how many times each maps to the human
> genome. Because you have filtered the set for reads that have a mapped mate,
> there will be an even number for each read. For most of the reads the number
> will be 2 (indicating the forward read maps once and the reverse read maps
> once, in a proper pair), but for reads that map ambiguously the number will
> be a larger multiple of 2. Counting these up, I find that 18 million reads
> map once, 1.3 million map twice, 400,000 map 3 times and so on, down to
> 1 read mapping 30 times, 1 read mapping 31 times and so on...
> 4. Filter the list to remove any reads with a count greater than 2 (i.e.
> reads that map more than once).
> 5. Use "compare two datasets" to match this new list of reads against your
> SAM file and pull out all the alignments for reads that map only once (i.e.
> the mate pairs).
> 6. You'll need to sort the SAM file before you can use it with other
> applications like IGV.
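
For anyone who wants to check the logic outside Galaxy, steps 2 to 5 above
can be sketched in a few lines of Python. This is only a minimal sketch under
stated assumptions, not the published workflow: it assumes the alignments are
available as plain-text, tab-separated SAM lines, it reads the filter in step
2 as "keep alignments where the read and its mate are both mapped, in a proper
pair" (using the standard SAM flag bits), and the function names are made up
for illustration.

```python
from collections import Counter

# Standard SAM FLAG bits used by the filter.
PROPER_PAIR = 0x2
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_proper_mapped_pair(flag):
    """True when read and mate are both mapped and form a proper pair."""
    return bool(flag & PROPER_PAIR) and not flag & (READ_UNMAPPED | MATE_UNMAPPED)


def unique_proper_pairs(sam_lines):
    """Return (header, kept, counts) for a list of SAM text lines.

    kept holds the alignments whose read name occurs exactly twice
    after the flag filter, i.e. one forward and one reverse hit.
    """
    header = [ln for ln in sam_lines if ln.startswith("@")]
    records = [ln for ln in sam_lines if ln and not ln.startswith("@")]

    # Step 2: keep alignments mapped as a proper pair (column 2 is FLAG).
    paired = [ln for ln in records
              if is_proper_mapped_pair(int(ln.split("\t")[1]))]

    # Step 3: group on column 1 (the read name) and count.
    counts = Counter(ln.split("\t")[0] for ln in paired)

    # Steps 4 and 5: keep only alignments of reads counted exactly twice.
    kept = [ln for ln in paired if counts[ln.split("\t")[0]] == 2]
    return header, kept, counts
```

Sorting (step 6) is deliberately left out of the sketch; a dedicated tool is
the safer choice for producing a coordinate-sorted file.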
> 
> What you end up with is a SAM file where all the reads map to one site only
> and all the reads map as a proper pair. This may seem similar to setting
> TopHat to ignore non-unique reads, but it is not: this approach gives
> you 10-15% more reads. I think this is because if TopHat finds (for example)
> that the forward read maps to one site but the reverse read maps to two
> sites, it throws away the whole read. By filtering the SAM file to restrict
> it to only those mappings that make sense, you increase the number of unique
> reads by getting rid of irrational mappings.
> 
> Has anyone else found this? Does this make sense to anyone else? Am I making 
> a huge mistake somewhere?
> 
> A nice aspect of this (or at least I think so!) is that by filtering in this
> manner you can also create a SAM file of the non-unique mappings, which you
> can monitor. This is useful if one or more genes generate a lot of
> non-unique mappings, which may make it hard to estimate their expression
> accurately. You also get a list of how many multi-hits you have in your
> data, so you know the scale of the problem.
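
The split into unique and non-unique mappings, plus the tally of multi-hits,
could look roughly like the following. Again this is a hedged sketch rather
than the Galaxy workflow itself: it assumes the input is a list of SAM
alignment lines that have already been filtered to proper pairs, and the
function name is hypothetical. The keys of the tally are the number of
alignment lines per read name, so 2 means the pair maps once, 4 means it maps
twice, and so on.

```python
from collections import Counter


def split_by_uniqueness(alignments):
    """Split proper-pair SAM alignment lines into unique and non-unique sets.

    Returns (unique, non_unique, tally), where tally maps
    "alignment lines per read name" to "number of read names".
    """
    # Count how many alignment lines each read name (column 1) has.
    counts = Counter(ln.split("\t", 1)[0] for ln in alignments)

    # A count of exactly 2 is one forward hit plus one reverse hit.
    unique = [ln for ln in alignments
              if counts[ln.split("\t", 1)[0]] == 2]

    # Everything else goes to the file you would monitor for problem genes.
    non_unique = [ln for ln in alignments
                  if counts[ln.split("\t", 1)[0]] != 2]

    # How many reads have 2, 4, 6, ... alignment lines.
    tally = Counter(counts.values())
    return unique, non_unique, tally
```

Writing `non_unique` out as its own file gives the set of ambiguous mappings
to inspect, and `tally` gives the scale of the multi-hit problem.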
> 
> Best Wishes,
> David.
> 
> __
> Dr David A. Matthews
> 
> Senior Lecturer in Virology
> Room E49
> Department of Cellular and Molecular Medicine,
> School of Medical Sciences
> University Walk,
> University of Bristol
> Bristol.
> BS8 1TD
> U.K.
> 
> Tel. +44 117 3312058
> Fax. +44 117 3312091
> 
> d.a.matth...@bristol.ac.uk
> 
> 
> 
> 
