[galaxy-user] NGS: Indel Analysis tool order

2011-08-22 Thread David K Crossman
Hello!

I have a question about the NGS: Indel Analysis toolset in 
Galaxy.  I have aligned my samples from Illumina's HiSeq2000 to the reference 
genome using BWA.  I've called SNPs using SAMTools and now need to call indels.  
Under the NGS: Indel Analysis toolset, I see two options: "Filter Indels for 
SAM" and "Extract indels from SAM".  Since the descriptions are a little vague, 
I would presume that I start with Filter Indels for SAM on the BWA 
SAM file and then run Extract indels from SAM on the Filter output SAM file 
(i.e. BWA SAM -> Filter Indels for SAM -> Extract indels from SAM).  Is this 
the correct order?  Or would I skip the Filter Indels for SAM step (since the 
BWA SAM file technically already contains indels so there would be no need to 
filter) and just go straight to Extract indels from SAM (i.e. BWA SAM -> 
Extract indels from SAM)?
I've tried both ways and get different results.  For example:

1.   BWA SAM -> Filter Indels for SAM -> Extract indels from SAM: gives 
me 26,417 regions of interest

2.   BWA SAM -> Extract indels from SAM: gives me 93,974 regions of 
interest

Which way is correct?  Any help/info would be greatly 
appreciated!

Thanks,
David
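
As background to the question above: a SAM alignment records indels in its CIGAR string (column 6), where "I" marks an insertion and "D" a deletion relative to the reference. The sketch below only illustrates that encoding. It is not how Galaxy's "Filter Indels for SAM" or "Extract indels from SAM" tools are implemented, and the function name and the example record are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_in_sam_line(line):
    """Return (op, ref_pos, length) for each I/D operation in one SAM record.

    ref_pos is the 1-based reference coordinate at which the operation starts.
    """
    fields = line.rstrip("\n").split("\t")
    pos = int(fields[3])   # SAM column 4: 1-based leftmost mapping position
    cigar = fields[5]      # SAM column 6: CIGAR string
    if cigar == "*":       # no alignment information available
        return []
    found = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID":
            found.append((op, ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return found

# Hypothetical record: 10 matches, a 2-base insertion, 5 matches,
# a 3-base deletion, then 8 matches.
example = "\t".join(
    ["read1", "0", "chr1", "100", "60", "10M2I5M3D8M", "*", "0", "0",
     "A" * 25, "I" * 25]
)
print(indels_in_sam_line(example))  # [('I', 110, 2), ('D', 115, 3)]
```

A scan like this reports every read that carries an indel; any filtering applied beforehand (for example on the quality of the inserted bases) would reduce the count, which is one plausible reason the two tool orders give different numbers.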

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

[galaxy-user] Cuffdiff question about using an unspecified (?) database/build

2011-08-19 Thread David K Crossman
Hello!

I have an RNA-Seq project which consists of 5 samples from the 
species tree shrew.  When uploading these fastq files into Galaxy, I chose 
unspecified (?) for the database/build since the latest tree shrew version is 
not in the drop down list.  When using TopHat, Cufflinks/Compare I have 
selected a reference genome from my history instead of using a built-in index, 
as well as a gtf annotation file for Cufflinks/Compare and everything has been 
working fine.  Now, I am at the Cuffdiff step and I am running into an error 
when setting it up to perform replicate analysis.  When I select my TopHat 
accepted hits bam file I see a red X and the error: Unspecified genome build, 
click the pencil icon in the history item to set the genome build.  Here's a 
screenshot of what I'm seeing:

[Screenshot: image001.png]

Since the latest reference genome for tree shrew wasn't listed, 
that's why I chose unspecified (?).  Should I go back and edit these accepted 
hits bam files to say the Database/Build from the drop down list is Tree shrew 
Dec. 2006 (Broad/tupBel1) (tupBel1)?  I know that this is simple to change, 
but will this affect my results in any way?  Any help/info would be greatly 
appreciated.

Thanks,
David

Re: [galaxy-user] Cuffdiff question about using an unspecified (?) database/build

2011-08-19 Thread David K Crossman
Jen,

Thank you very much for the reply.  I'm glad to know it is a known bug 
and not something on my side of things.  So, would my analysis be affected if I 
did change the bam file Database/Build to the older tree shrew version found 
in the drop down list?  What significance does this Database/Build box have 
in downstream analysis if you have your own fasta reference genome file and gtf 
annotation file that is being referenced instead of a locally cached one?  I'm 
just trying to obtain a better understanding of the Database/Build box for 
analyses where I provide the fasta and gtf file.

Thanks,
David


-Original Message-
From: Jennifer Jackson [mailto:j...@bx.psu.edu] 
Sent: Friday, August 19, 2011 9:20 AM
To: David K Crossman
Cc: galaxy-user (galaxy-user@lists.bx.psu.edu)
Subject: Re: [galaxy-user] Cuffdiff question about using an unspecified (?) 
database/build

Hello David,

This is a known bug. The correction is planned to be moved out onto the public 
Galaxy instance at the next update (within a week).

Sorry for the current inconvenience,

Best,

Jen
Galaxy team


--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/Support



Re: [galaxy-user] using files produced by Barcode Splitter

2011-07-18 Thread David K Crossman
Jeremy,

The files need to be groomed using the FastQ Groomer so that 
they will end up in the fastqsanger state.  Then your files will show up in the 
pull-down menus.

David
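
For readers wondering what grooming changes in the file itself: the sketch below assumes the Barcode Splitter output uses Illumina 1.3+ quality encoding (Phred+64) and rewrites it as Sanger encoding (Phred+33). That input encoding is an assumption, the function names are hypothetical, and this is not the FastQ Groomer's actual code. If the reads are already Sanger-encoded, the quality strings need no conversion and the step mainly serves to get the dataset labelled as fastqsanger.

```python
def illumina64_to_sanger(quality):
    """Convert a Phred+64 quality string to Phred+33 (Sanger) encoding."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64          # decode the Phred+64 character
        if score < 0:
            raise ValueError(
                "quality character %r is below the Phred+64 range" % ch
            )
        converted.append(chr(score + 33))  # re-encode as Phred+33
    return "".join(converted)

def groom_record(record):
    """Groom one FASTQ record given as (header, sequence, plus_line, quality)."""
    header, seq, _plus, qual = record
    # The repeated read name on the '+' line is optional, so it is dropped here.
    return (header, seq, "+", illumina64_to_sanger(qual))

# 'h' is Phred 40 in Phred+64; the same score is 'I' in Phred+33.
print(groom_record(("@read1", "ACGT", "+read1", "hhhh")))
# ('@read1', 'ACGT', '+', 'IIII')
```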


From: galaxy-user-boun...@lists.bx.psu.edu 
[mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Coate
Sent: Monday, July 18, 2011 1:44 PM
To: galaxy-user@lists.bx.psu.edu
Subject: [galaxy-user] using files produced by Barcode Splitter

I used the Barcode Splitter tool to split multiplexed RNA-Seq libraries into 
separate files. I would now like to map the reads from each of these fastq 
files to a reference genome. However, the fastq files generated by Barcode 
Splitter don't appear in the Fastq File pull-down menus within the BWA or 
Bowtie launch pages. I'm probably missing something obvious, but what is the 
trick for making these files available for the mapping tools? Do I need to 
import them into my history somehow?

Thanks!
Jeremy

[galaxy-user] Mycoplasma pneumoniae M129 and FH reference genome

2011-05-23 Thread David K Crossman
Hello!

I noticed that Mycoplasma pneumoniae M129 and FH are not found 
among the reference genomes in Galaxy.  Would it be possible to have both of 
those added?

Thanks,
David

Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-08 Thread David K Crossman
Jeremy,

Thank you very much for this information.  One quick question.  
I added the gene_id values to the 10th column of my patched GTF file.  After 
uploading it to Galaxy, the column doesn't have a name (i.e. column 1 = 
Seqname; column 2 = Source; etc...).  Do I need to assign it a name (i.e. 
gene_name or gene_id) for it to be recognized and if so, how do you assign 
column names to GTF files?

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 9:40 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David,

Your analysis looks reasonable. In fact, in your isoform tracking FPKM file you 
get nearest_ref_id, so that's promising. What I think is needed is the addition 
of an attribute called gene_name to your reference file; you can use whatever 
value you want for gene name, and using the same value as gene_id probably 
makes sense.

Rerun your analysis with the further-patched GTF file, and let us know if this 
doesn't solve the problem. Also note that even using this attribute, some gene 
name/ids and some nearest_ref_id columns will not be populated in some cuffdiff 
files. See the post from Howie in this thread for an explanation from a 
Cufflinks developer: http://seqanswers.com/forums/showthread.php?t=6288
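
A minimal sketch of the gene_name patch suggested above, assuming a standard 9-column, tab-separated GTF in which attributes are `key "value";` pairs in the 9th column (so the new attribute goes inside that column rather than into a separate 10th one). The function name and the example line are hypothetical; the value is simply copied from gene_id, as proposed.

```python
import re

# Captures the quoted value of the gene_id attribute.
GENE_ID = re.compile(r'gene_id "([^"]*)"')

def add_gene_name(gtf_line):
    """Append gene_name (copied from gene_id) to the 9th column of a GTF line."""
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line                      # leave comments and blank lines alone
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return gtf_line                      # not a well-formed GTF record
    attributes = fields[8].rstrip()
    match = GENE_ID.search(attributes)
    if match and "gene_name" not in attributes:
        if not attributes.endswith(";"):
            attributes += ";"
        attributes += ' gene_name "%s";' % match.group(1)
        fields[8] = attributes
    return "\t".join(fields) + "\n"

# Made-up example record; real files would carry real gene and transcript IDs.
line = 'chr1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "TX1";\n'
print(add_gene_name(line), end="")
```

Applied line by line to the patched reference file, this leaves the first eight columns untouched and only extends the attribute column.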

Best,
J.

On Apr 7, 2011, at 5:00 PM, David K Crossman wrote:


Jeremy,

I've shared it with you using your email address.

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 3:42 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David, can you please share your history with me and I'll take a look (History 
Options -> Share/Publish -> Share with User -> my email)?

Thanks,
J.

On Apr 7, 2011, at 3:23 PM, David K Crossman wrote:



Hello!

I would like to ask a question related to this thread below.  I 
ran into the same issues as below and was unaware of having to swap some 
columns around in the GTF file.  So, after swapping the gene name from the 
complete table (name2 value, column 12) into the GTF file's gene_id value 
(which by default is the same as transcript_id), I uploaded this patched 
file (mm9) into Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this 
patched GTF file as the reference annotation.  For both Cufflinks and 
CuffCompare, the gene_id was present in their respective columns.  The problem 
I have encountered now is that in all of the output files in CuffDiff, the 
gene_id column is blank (contains a -; highlighted in yellow below).  This 
example is from the CuffDiff gene expression output file:

test_id  gene  locus                 sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
XLOC_01  -     chr1:4797973-4836816  q1        q2        OK      73.1908  82.1567  0.115559         -0.71896   0.472168  no
XLOC_02  -     chr1:4847774-4887990  q1        q2        OK      81.7264  53.1165  -0.43089         2.44474    0.014496  no
XLOC_03  -     chr1:5073253-5152630  q1        q2        OK      408.289  333.749  -0.20159         2.73173    0.0063    no
XLOC_04  -     chr1:5578573-5596214  q1        q2        NOTEST  2.34764  4.79772  0.71473          -0.89735   0.369532  no


What am I doing wrong?  I am interested in the differentially 
expressed genes in this RNA-Seq dataset (as well as calling variants, which is 
my next step, but want to get this answered first before moving on).  Any info, 
suggestions or help would be greatly appreciated.

Thanks,
David


-Original Message-
From: galaxy-user-boun...@lists.bx.psu.edu 
[mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks
Sent: Friday, April 01, 2011 8:47 AM
To: ssa...@ccib.mgh.harvard.edu
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files



On Mar 31, 2011, at 12:30 PM, ssa...@ccib.mgh.harvard.edu wrote:

 Hi Jeremy,
 I used your exercise to perform an RNA-seq analysis. First I encountered a 
 problem where the gene IDs were missing from the results. Jen from the Galaxy 
 team suggested this:

 Yes, the team has taken a look and there are a few things going on.

 The first is that when running the Cuffcompare program, a reference 
 annotation file in GTF format should be used in order to obtain the same 
 results as in Jeremy's exercise. This seemed to be missing from your runs, 
 which resulted in badly formatted output that later resulted in a poor result 
 when Cuffdiff was used.

 The second has to do with the reference GTF file itself. For the best 
 results, the GTF file must have the gene_id attribute defined in the 9th 
 column of the file and the chromosome names must be in the same format as the 
 genome native to Galaxy. Depending on the source of the reference GTF, one