[galaxy-user] FASTQ splitter produced empty dataset, please help

2012-08-10 Thread Du, Jianguang
I am having trouble splitting a paired-end FASTQ dataset into two separate 
datasets. To explain the problem clearly, here are the details of what I did 
with my dataset:



Step 1) My aim is to compare datasets for differential alternative 
splicing. I downloaded paired-end datasets in FASTQ format from the NCBI SRA 
as the original data.



Below is part of the paired-end FASTQ dataset that I downloaded from the NCBI 
SRA. Does this dataset look OK?

@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
GCTGAGTGAGGGTGTGTTTGGAGTTTG
+SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
I28II;II*2/5:++,(..*943F@I.('+.35'
@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
AAAGATGTTAGTGATACGGAAAGGATATCTC
+SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
9+*9+7@?F1206,IGI+D122/0++-.+6/@?

Step 2) Then I ran FASTQ Groomer with the following settings:

a) Input FASTQ quality scores type: Illumina 1.3-1.7

b) Advanced Options: Hide Advanced Options.

Did I choose the right settings for FASTQ Groomer? Should I use Advanced 
Options? If yes, what should the Advanced Options settings be?

Below is part of the groomed dataset:

@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
GCTGAGTGAGGGTGTGTTTGGAGTTTG
+SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
*!!**!**'!*
@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
AAAGATGTTAGTGATACGGAAAGGATATCTC
+SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
'!*(*!%

Does the groomed data look right? Is the number representing the member of a 
pair correct? Here they are .1 and .2; should they be /1 and /2?
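
One way to sanity-check the "Input FASTQ quality scores type" choice is to look at the lowest character in the raw quality lines. The sketch below is only a heuristic under the assumption that the data uses one of the two common ASCII offsets (33 or 64); the function name is hypothetical and this is not how FASTQ Groomer itself decides. For the two quality strings quoted in Step 1 the lowest character is ' (ASCII 39), which points toward Sanger (Phred+33) rather than Illumina 1.3-1.7, and would be consistent with the groomed qualities collapsing to very low values.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from raw FASTQ quality strings."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Phred+64 (Illumina 1.3-1.7) never uses characters below '@' (ASCII 64),
    # and Solexa+64 can only go down to ';' (ASCII 59).
    # Anything lower than that can only be Phred+33.
    if lowest < 59:
        return 33
    # Ambiguous in principle: uniformly high-quality Phred+33 data lands here too.
    return 64


# Quality lines copied from the two records quoted in Step 1.
raw = [
    "I28II;II*2/5:++,(..*943F@I.('+.35'",
    "9+*9+7@?F1206,IGI+D122/0++-.+6/@?",
]
print(guess_phred_offset(raw))  # prints 33
```

If the guess is 33, the matching Groomer input type would be Sanger, although a larger sample of reads should be checked before relying on it.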



Step 3) Then I ran FASTQ splitter on the groomed file. There are no settings 
for the splitter. I chose the correct groomed file and then clicked Execute. 
Below is the description of the resulting dataset:



empty
format: fastqsanger, database:hg19
Info: Split 0 of 15277248 reads (0.00%).



Please help me deal with this problem.
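
For illustration, the kind of split being asked for (one dataset per mate, chosen by the trailing .1 or .2 in the read identifier) can be sketched as follows. This is not the Galaxy FASTQ splitter or Manipulate FASTQ; the regular expression and function names are hypothetical, and the sequences in the sample are made-up placeholders that only reuse the header style from the post.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # (header, sequence, plus line, quality)

# Assumed pattern for SRA-style identifiers such as "@SRR192532.1.2 ...":
# the mate number is the last dot-separated field of the first token.
MATE_SUFFIX = re.compile(r"^@\S+\.([12])(?:\s|$)")


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records, ignoring blank lines."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        block.append(line)
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def split_mates(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records into forward (.1) and reverse (.2) lists."""
    forward: List[Record] = []
    reverse: List[Record] = []
    for record in records:
        match = MATE_SUFFIX.match(record[0])
        if match is None:
            raise ValueError(f"no .1/.2 mate suffix in header: {record[0]}")
        (forward if match.group(1) == "1" else reverse).append(record)
    return forward, reverse


# Toy input: headers follow the post, sequences and qualities are placeholders.
sample = """@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=4
ACGT
+
IIII
@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=4
TTGA
+
IIII
""".splitlines()

forward, reverse = split_mates(read_fastq(sample))
print(len(forward), len(reverse))  # prints 1 1
```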

Thanks.

Jianguang Du








___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

[galaxy-user] need help to split paired-end dataset

2012-08-10 Thread Du, Jianguang
I am new to NGS analysis and need help solving this problem.

As described in my previous email, I have some paired-end datasets in FASTQ 
format, and I am having trouble splitting each of them into two datasets 
(one forward and one reverse).



Jennifer instructed me to assign the datatype to be fastqsanger first and then 
run 'Manipulate FASTQ'.



I have two questions:

1) Since the datasets were already split into forward and reverse reads when 
extracted in FASTQ format from the SRA, can I use them as single-end data?

2) If I do need to split each dataset into two datasets, which settings should 
I choose when I run Manipulate FASTQ?



Thanks.



Jianguang


[galaxy-user] Which settings should be used to upload mouse reference genome?

2012-08-14 Thread Du, Jianguang
Dear All,

I am going to search for alternative splicing events between datasets. I am 
not sure which settings to use for the mouse reference genome (mm9) when I 
upload it from UCSC Main.

Would you please tell me the settings for

1) group:

2) Track:

3) Table:

4) Output format:



Thanks.

Jianguang

[galaxy-user] How to decide Mean Inner Distance between Mate Pairs?

2012-08-15 Thread Du, Jianguang
Dear All,

I am analyzing downloaded RNA-seq datasets. However, I am not sure what the 
Mean Inner Distance between Mate Pairs is for these paired-end datasets.

Taking one paired-end RNA-seq dataset as an example, its description in the 
NCBI SRA database reads: Layout: PAIRED, Orientation: 5'-3'-3'-5', Nominal 
length: 400, Nominal Std Dev: 20.

At first I thought the Mean Inner Distance between Mate Pairs should be 325 bp 
because the reads on both ends are 36 bp long. However, when I aligned the 
paired reads to transcripts and the genome using BLASTn, the distance between 
the paired reads was about 200 bp. How should I decide the Mean Inner Distance 
between Mate Pairs in my case?
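
As a worked example of the arithmetic, the inner distance is usually taken as the fragment length minus both read lengths. The sketch below assumes that the SRA "Nominal length" is the fragment length without adapters, which is not guaranteed; the function name is hypothetical. Under that assumption the nominal values above give 328 bp rather than 325 bp.

```python
def mean_inner_distance(fragment_length: int, read_length: int) -> int:
    """Inner distance = fragment length minus the two sequenced ends."""
    return fragment_length - 2 * read_length


# Nominal length 400 and 36 bp reads, as described in the post.
print(mean_inner_distance(400, 36))  # prints 328
```

When the nominal value and the observed alignments disagree, as with the roughly 200 bp seen with BLASTn, a value estimated from mapped pairs is probably the safer choice.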

Thanks.

Jianguang Du

[galaxy-user] Do I need to allow indel search?

2012-08-15 Thread Du, Jianguang
Dear All,

I want to compare pre-mRNA alternative splicing events between RNA-seq 
datasets. Do I need to allow indel search when I run Tophat? What is the indel 
search for? I could not find detailed information about indel search in the 
Tophat documentation.

Thanks.

Jianguang Du

[galaxy-user] Use Own Junctions or not

2012-08-15 Thread Du, Jianguang
Dear All,

I want to compare pre-mRNA alternative splicing events between RNA-seq 
datasets. Should I use "Use Own Junctions" when I run Tophat? What does "Own 
Junctions" mean?
Thanks.
Jianguang DU

[galaxy-user] Minimum length of read segments

2012-08-16 Thread Du, Jianguang
Dear All,

I am going to run Tophat on an RNA-seq dataset to observe alternative splicing 
events. Tophat has a parameter called "Minimum length of read segments". 
According to the implemented Tophat options, its description is: "Each read is 
cut up into segments, each at least this long. These segments are mapped 
independently." The default is 25. My reads are 36 bp long; should I change 
this parameter based on the read length? What value should I enter?

Thanks.

Jianguang Du

[galaxy-user] run Bowtie to estimate Mean Inner Distance between Mate Pairs

2012-08-16 Thread Du, Jianguang
Dear All,

In order to figure out the Mean Inner Distance between Mate Pairs of my 
paired-end RNA-seq datasets, I ran Bowtie (Map with Bowtie for Illumina) with 
both forward and reverse datasets and mouse mm9 as reference genome. Below I 
list the Bowtie output for only one pair of reads (I put the fields on the left 
side):


For the forward read
QNAME:  SRR322837.8.1
FLAG:   99
RNAME:  chr1
POS:    163761156
MAPQ:   255
CIGAR:  36M
MRNM:   =
MPOS:   163761301
ISIZE:  181
SEQ:    NTGGATACTAGCCATAAATGAATT
QUAL:   %(,,')(())@@@22358852@@@##
OPT:    XA:i:1 MD:Z:0A35  NM:i:1

For the reverse read
QNAME:  SRR322837.8.2
FLAG:   147
RNAME:  chr1
POS:    163761301
MAPQ:   255
CIGAR:  36M
MRNM:   =
MPOS:   163761156
ISIZE:  -181
SEQ:    TATTATGTCAATCTATGAAGAAGGACGGCGAGGTGA
QUAL:   GDBE@BEEGDB=BD-=GEDDGGBGD8GB?
OPT:    MD:Z:29A6  NM:i:1



Is ISIZE the insert size? The difference between POS and MPOS is 145 bp, 
which is 36 bp shorter than ISIZE (181). My question is: if ISIZE does mean 
insert size, how should I convert ISIZE into the Mean Inner Distance between 
Mate Pairs?
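
The numbers in the record above fit together if ISIZE spans from the leftmost base of the forward read to the rightmost base of the reverse read. The sketch below reproduces that arithmetic; it assumes both mates align ungapped (CIGAR 36M) with the forward mate on the left, and the function names are hypothetical.

```python
def template_length(pos: int, mate_pos: int, mate_aligned_length: int) -> int:
    """Span from the leftmost base of the pair to the rightmost base."""
    return (mate_pos + mate_aligned_length) - pos


def inner_distance(isize: int, read_length: int) -> int:
    """Gap between the two reads: template span minus both reads."""
    return abs(isize) - 2 * read_length


# POS, MPOS and read length from the forward read shown above.
isize = template_length(163761156, 163761301, 36)
print(isize, inner_distance(isize, 36))  # prints 181 109
```

A single pair is only one data point; averaging the inner distance over many properly paired reads would give a more usable estimate of the mean.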



Thanks,



Jianguang Du

[galaxy-user] How to find the alternatively spliced segment of genes in Cuffdiff output

2012-08-21 Thread Du, Jianguang
Dear All,

I have run the programs from Tophat to Cuffdiff in Galaxy to look for 
differences in alternative splicing events between cell types. However, I do 
not know how to find detailed information (such as the sequence and the genomic 
coordinates) for the alternatively spliced part of a given gene. I looked at the 
Cuffdiff output "splicing differential expression testing"; there is no 
column showing the position of the alternatively spliced region. Please help me 
solve this problem.

Thanks in advance.

Jianguang Du

[galaxy-user] How much can I trim my reads

2012-08-23 Thread Du, Jianguang
Dear All,

I am analysing RNA-seq datasets for differential splicing events between 
cell types. My reads are 36 bp long. To increase the quality of the reads, 
I need to trim some nucleotides from the ends. How many nucleotides can I trim? 
I am afraid that if I trim too much, the reliability of the alignment will be 
affected.

Thanks in advance.

Jianguang

[galaxy-user] What minimum quality should I set for Filter FASTQ?

2012-08-23 Thread Du, Jianguang
Dear All,

I am analysing RNA-seq datasets for differential splicing events between cell 
types.

Some of my reads contain bad nucleotides. Should I run Filter FASTQ to remove 
these lower-quality reads? If so, what Minimum Quality should I set for the 
filter?
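
To make the effect of a minimum-quality cutoff concrete, here is a minimal sketch of such a filter for fastqsanger (Phred+33) quality strings. It is not the Galaxy Filter FASTQ implementation; the function and parameter names are hypothetical, and the cutoff of 20 is only an example value (a Phred score of 20 corresponds to an error probability of 1 in 100).

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(phred: int) -> float:
    """Probability that a base call is wrong for a given Phred score."""
    return 10 ** (-phred / 10)


def passes_min_quality(quality: str, min_quality: int, max_below: int = 0) -> bool:
    """Keep a read unless more than max_below bases fall under min_quality."""
    below = sum(1 for q in phred_scores(quality) if q < min_quality)
    return below <= max_below


print(passes_min_quality("IIII", 20))               # prints True
print(passes_min_quality("II#I", 20))               # prints False
print(passes_min_quality("II#I", 20, max_below=1))  # prints True
```

With 36 bp reads, allowing a small number of bases below the cutoff may discard far fewer reads than requiring every base to pass.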

Thanks.

Jianguang

[galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

2012-08-23 Thread Du, Jianguang
Dear All,

I am analysing RNA-seq datasets for differential splicing events between cell 
types. These are mouse cells. Jen suggested that I use the iGenomes version of 
the reference GTF to take full advantage of the options in CuffDiff. My 
question is: should I use this iGenomes reference GTF when I run Tophat?

Thanks.

Jianguang

Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

2012-08-23 Thread Du, Jianguang
Hi Jen,
Thanks for your help.
Do you mean that if I want to find novel isoforms/splicing, I need to select 
"No" under "Use Reference Annotation" when I run Cufflinks, and then use the 
iGenomes version of the reference GTF when I run Cuffmerge?

Based on your information and some protocols found online, my understanding is 
that: 
1) If I use the iGenomes version of the reference GTF, I only need to run 
Cuffmerge with the Cufflinks outputs, because the iGenomes reference GTF 
already contains attributes such as p_id and tss_id. The Cuffmerge output can 
then be used for Cuffdiff.
2) However, if I use the reference GTF from Ensembl/UCSC (rather than from 
iGenomes), I need to run Cuffcompare to create p_id and tss_id, which are 
required for Cuffdiff.
Am I right?

Another question: should I use the iGenomes version of the reference GTF when 
I run Tophat if I want to see novel isoforms/splicing?

Thanks.
Jianguang


From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Thursday, August 23, 2012 11:46 AM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for 
Tophat?

Hello Jianguang,

The point in the analysis at which to start using the reference GTF file can
depend on whether or not you intend to do any discovery along with
differential expression testing. At the TopHat and Cufflinks steps,
using a reference GTF file can influence how datasets will map and
assemble. In general, if your intention is to do discovery (e.g. work
with novel isoforms in your data, but not in the reference), then do not
add in the reference GTF until the CuffMerge step (to produce the input
annotation GTF file for Cuffdiff). But if you want to guide the analysis
toward known isoforms, then use the reference GTF.

This is the process our RNA-seq example protocol follows:
http://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise

For reference, there are other variations of this on the Cufflinks web
site, some that never lead to Cuffdiff, but still may be useful to
review. Please see the Cufflinks paper (linked from the right sidebar as
"Protocol") for many more options/discussion.
http://cufflinks.cbcb.umd.edu/tutorial.html
-- Common uses of the Cufflinks package

The end decision will be up to you, and a few runs with different
options may be a useful way to make the final call, but hopefully this
provides some resources to help you understand the option.

Jen
Galaxy team


--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

2012-08-23 Thread Du, Jianguang
Hi Jen,
I had a problem when I tried to run Tophat with the iGenomes reference GTF.
What I did:
1) Uploaded the iGenomes version of mm9 genes.gtf via: Shared Data -> Data 
Libraries -> iGenomes -> click genes.gtf under mm9 -> click Go for "Import to 
current history". The genes.gtf appeared in the history and turned green.
2) Clicked "Tophat for Illumina Find splice junctions using RNA-seq data" to 
open the Tophat for Illumina (version 1.5.0) form.
3) Selected the dataset to be analysed under "RNA-Seq FASTQ file:".
4) Chose "Use one from the history" under "Will you select a reference genome 
from your history or use a built-in index?:"
Then the screen refreshed and the box (pulldown menu) under "Select the 
reference genome:" became smaller. Nothing showed up in the pulldown menu 
(in fact the menu cannot be pulled down), so I could not select the iGenomes 
reference GTF. It looks like Tophat can only "Use a built-in index".
How can I solve this problem?
Thanks in advance.
Jianguang



From: galaxy-user-boun...@lists.bx.psu.edu 
[galaxy-user-boun...@lists.bx.psu.edu] on behalf of Du, Jianguang 
[jia...@iupui.edu]
Sent: Thursday, August 23, 2012 4:01 PM
To: Jennifer Jackson
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for 
Tophat?

Hi Jen,
Thank you very much for your help.
Jianguang


From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Thursday, August 23, 2012 3:53 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for 
Tophat?

Hello Jianguang,

On 8/23/12 11:28 AM, Du, Jianguang wrote:
 Hi Jen,
 Thanks for your help.
 Do you mean that if I want to find novel isoforms/splicing, I need to select 
 "No" under "Use Reference Annotation" when I run Cufflinks, and then use the 
 iGenomes version of the reference GTF when I run Cuffmerge?

Yes, according to the tool documentation, this is the method.

 Based on your information and some protocols found online, my understanding 
 is that:
 1) If I use the iGenomes version of the reference GTF, I only need to run 
 Cuffmerge with the Cufflinks outputs, because the iGenomes reference GTF 
 already contains attributes such as p_id and tss_id. The Cuffmerge output can 
 then be used for Cuffdiff.
Yes, this is the example protocol I shared.

 2) however, if I use the reference GTF from Ensembl/UCSC (rather than from 
 iGenome), I need to run Cuffcompare to create p_id and tss_id, which is 
 required for Cuffdiff.
This can be tricky, it depends on what order you run the tools with and
without the GTF annotation. The protocol in #1 is recommended.

 Am I right?

 Another question is: should I use iGenome version of reference GTF when I run 
 Tophat if I want to see novel isoforms/splicing?
Yes, this is what I intended to answer in my original reply, I apologize
if that was not clear. The reference GTF can influence both mapping and
assembly. So, both Tophat and Cufflinks. The information on the TopHat
web site for the parameter provides more information (see link on TopHat
tool form). The tool authors can also be contacted if there are some
details that you are curious about that are not covered in the primary
documentation: tophat.cuffli...@gmail.com

Others are welcome to add to the thread with their experiences if they
have used a reference annotation GTF with Tophat (or chosen not to for a
particular reason that they would like to share),

Best,

Jen
Galaxy team


[galaxy-user] Please help me check the quality of the Tophat mapping to reference genome

2012-08-27 Thread Du, Jianguang
Dear All,

I ran Flagstat under "NGS: SAM Tools" to check the quality of the Tophat 
output (the accepted hits file). I got the following results:

9471730 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
9471730 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ=5)
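
For readers who want to compare such reports across runs, the lines above can be read into numbers with a few lines of code. This is only a sketch against the report format shown here; the function names are hypothetical. Note that the zero counts in the paired categories are what one would expect when the run is configured as single-end, and a 100% mapped figure is likely if the accepted hits file contains only aligned reads.

```python
import re
from typing import List, Tuple

# Each report line looks like "<QC-passed> + <QC-failed> <description>".
ROW = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text: str) -> List[Tuple[int, int, str]]:
    """Return (qc_passed, qc_failed, description) for every report line."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


def qc_passed(rows: List[Tuple[int, int, str]], prefix: str) -> int:
    """QC-passed count of the first row whose description starts with prefix."""
    for passed, _failed, description in rows:
        if description.startswith(prefix):
            return passed
    raise KeyError(prefix)


# Three lines copied from the report above.
report = """9471730 + 0 in total (QC-passed reads + QC-failed reads)
9471730 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
"""
rows = parse_flagstat(report)
print(qc_passed(rows, "in total"), qc_passed(rows, "paired in sequencing"))
# prints 9471730 0
```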

I ran Tophat with the settings shown below:

Will you select a reference genome from your history or use a built-in index?
  Use a built-in index
Select a reference genome
  /galaxy/data/mm9/bowtie_index/mm9
Is this library mate-paired?
  Single-end
TopHat settings to use
  Full parameter list
Library Type
  FR Unstranded
Anchor length (at least 3)
  8
Maximum number of mismatches that can appear in the anchor region of spliced alignment
  0
The minimum intron length
  70
The maximum intron length
  50
Allow indel search
  Yes
Max insertion length
  3
Max deletion length
  3
Maximum number of alignments to be allowed
  20
Minimum intron length that may be found during split-segment (default) search
  50
Maximum intron length that may be found during split-segment (default) search
  50
Number of mismatches allowed in the initial read mapping
  1
Number of mismatches allowed in each segment alignment for reads mapped independently
  1
Minimum length of read segments
  25
Use Own Junctions
  Yes
Use Gene Annotation Model
  Yes
Gene Model Annotations
  iGenome version of mm9 genes.gtf
Use Raw Junctions
  No
Only look for supplied junctions
  No
Use Closure Search
  No
Use Coverage Search
  Yes
Minimum intron length that may be found during coverage search
  50
Maximum intron length that may be found during coverage search
  2
Use Microexon Search
  No

Please help me find out what is wrong with the Tophat run.

Thanks,

Jianguang

[galaxy-user] Please help with the settings for Cufflinks, Cuffmerge and Cuffdiff

2012-08-27 Thread Du, Jianguang
Dear All,
I am looking for differential splicing events between cell types.
Although I got a lot of help from Jen and from protocols found online, I am 
still not sure about some settings for Cufflinks, Cuffmerge and Cuffdiff.

1) For Cufflinks:
There is a setting for Bias Correction. I set it as below:

Perform Bias Correction:  Yes
Reference sequence data:  Locally cached

Did I make the right settings?

2) For Cuffmerge:
As for whether to use sequence data, I set it as below:

Use Sequence Data:  Yes
Choose the source for the reference list:   Locally cached

Did I make the right settings?

3) For Cuffdiff:
There is another choice of whether to perform Bias Correction. I set it as 
below:

Perform Bias Correction:   Yes
Reference sequence data:   Locally cached

Did I make the right settings?

Thanks.
Jianguang



[galaxy-user] How to decide if the deference is significant

2012-08-27 Thread Du, Jianguang
Dear All,

I am looking for differential splicing events between cell types. I have run 
Cuffdiff and I am going through the output file "splicing differential 
expression testing". I have read the documentation and protocols about how 
Cuffdiff tests for differential expression and regulation. However, although I 
know that changes in relative abundance are quantified by the square root of the 
Jensen-Shannon divergence, I still cannot understand the concept 
(unfortunately I am not good at math and statistics). Is there any way to 
convert the square root of the Jensen-Shannon divergence into a fold 
difference? What value of the square root of the Jensen-Shannon divergence 
corresponds to a 2-fold difference?

Thanks.

Jianguang

[galaxy-user] Should I use raw junction and Only look for supplied junctions

2012-08-28 Thread Du, Jianguang
Dear All,

I have two more questions about settings for Tophat.

My aim is to look for differential splicing events between cell types.

After I checked Use Own Junctions, three more options appeared:



1) Use Gene Annotation Model

2) Use Raw Junctions

3) Only look for supplied junctions



As instructed by Jen, I checked Use Gene Annotation Model, and input iGenome 
mm9 genes.gtf as Gene Model Annotations.

However, I am not sure whether I should select Use Raw Junctions and Only look 
for supplied junctions. Please help me set these two options.

Thanks.

Jianguang



[galaxy-user] Please help to understand the square root of Jensen-Shannon divergence

2012-09-04 Thread Du, Jianguang
Dear All,

I am looking for differential splicing events between cell types. However, 
Cuffdiff reports the difference using the square root of the Jensen-Shannon 
divergence.

Although I tried my best to understand the definition of the square root of the 
Jensen-Shannon divergence, I still cannot understand what a specific value of 
it means. I would appreciate it very much if anyone could let me know how to 
convert the square root of the Jensen-Shannon divergence into a fold change. 
For example, what value of the square root of the Jensen-Shannon divergence 
corresponds to a 2-fold difference?

Thanks in advance.

Jianguang

[galaxy-user] Tophat settings

2012-09-06 Thread Du, Jianguang
Dear All,

I am not so sure about two Tophat settings. Please help.



1) Number of mismatches allowed in the initial read mapping

Based on the documentation, my understanding is that reads are re-aligned to 
the transcriptome/genome if the number of mismatches in the initial alignment 
is more than the set number (for example, the default setting is 2). In other 
words, re-aligning continues until the number of mismatches is equal to or 
below the set number. Is my understanding correct? If I am right, I have one 
worry: will TopHat stop re-aligning if the number of mismatches is below 2 (if 
I use the default setting)? If so, the read will not be aligned to where it 
belongs (with 0 mismatches).



2) Number of mismatches allowed in each segment alignment for reads mapped 
independently

Does this mean that reads will be cut into segments if the number of mismatches 
in the alignment is more than the set number?



Thanks in advance.

Jianguang

Re: [galaxy-user] Please help to understand the square root of Jensen-Shannon divergence

2012-09-06 Thread Du, Jianguang
Hi Jen,
Thank you for your answer.
However, the output file "transcript differential expression testing" gives the 
ratio (log2 of the fold change) of the FPKM of a specific transcript between two 
conditions. This fold change in FPKM does not take overall gene expression into 
consideration (the expression of a gene may be much higher in condition A than 
in condition B) and therefore cannot be used as a measure of the difference in 
alternative splicing.
What I am doing is looking for differences in splicing between two cell 
types by examining the output file "splicing differential expression testing". 
In this file, column 10 gives the value of sqrt(JS) (the explanation of it 
is The splice overloading of the primary transcript, as measured by the square 
root of the Jensen-Shannon divergence computed on the relative abundances of 
the splice variants, and the value is never larger than 1). My understanding 
is that this value has already taken overall gene expression into 
consideration. But I do not know what value of sqrt(JS) corresponds to a 2-fold 
change, and I want to focus on alternative splicing with a 2-fold difference 
between the two cell types. Do you know how to convert the value of sqrt(JS) 
into a fold change?
In addition, how should I understand the phrase "The splice overloading of the 
primary transcript"? If one gene has 3 transcripts, A, B and C, and the 
expression of these transcripts is A=60%, B=25%, and C=15%, do you mean that the 
primary transcript is A? Does Cuffdiff take the overall expression 
(A+B+C=100%) or just the primary transcript (A=60%) into consideration when it 
calculates the ratio of transcript B? Actually, it would be much easier for us 
if Cuffdiff calculated the ratio of the expression of the alternatively spliced 
exon to overall gene expression, and then compared it between conditions.
Thanks in advance,
Jianguang
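
For readers following this thread, here is a minimal Python sketch of the 
quantity being asked about. It computes the square root of the Jensen-Shannon 
divergence between two sets of relative isoform abundances. The use of base-2 
logarithms is an assumption (it is consistent with the value never exceeding 1, 
but it has not been checked against the Cuffdiff source), the function name is 
made up for illustration, and the second set of fractions is hypothetical; only 
A=60%, B=25%, C=15% comes from the message above.

```python
import math

def sqrt_js_divergence(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2) between
    two lists of relative abundances that each sum to 1."""
    if len(p) != len(q):
        raise ValueError("both conditions need the same isoforms")
    # Mixture distribution: the average of the two conditions.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(js, 0.0))

cond1 = [0.60, 0.25, 0.15]   # A, B, C fractions from the message
cond2 = [0.30, 0.50, 0.20]   # hypothetical second cell type
print(round(sqrt_js_divergence(cond1, cond2), 3))
```

Because the value depends on the whole set of fractions at once, different 
combinations of per-isoform fold changes can give the same sqrt(JS), so there 
is no single sqrt(JS) value that corresponds to a 2-fold difference.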


From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Thursday, September 06, 2012 12:38 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu; closetic...@galaxyproject.org
Subject: Re: [galaxy-user] Please help to understand the square root of 
Jensen-Shannon divergence

Hello Jianguang,

Fold is included in the Cuffdiff output. Section Differential
expression tests, first file, column #9.
http://cufflinks.cbcb.umd.edu/manual.html

Hopefully this helps,

Jen
Galaxy team

On 9/4/12 1:16 PM, Du, Jianguang wrote:
 Dear All,

 I am looking for differential splicing events between cell types.
 However, Cuffdiff reports the difference using the square root of the
 Jensen-Shannon divergence.

 Although I tried my best to understand the definition of the square
 root of the Jensen-Shannon divergence, I still cannot understand what
 a specific value of it means. I would appreciate it very much if anyone
 could let me know how to convert the square root of the Jensen-Shannon
 divergence into a fold change. For example, what value of the square
 root of the Jensen-Shannon divergence corresponds to a 2-fold
 difference?

 Thanks in advance.

 Jianguang





--
Jennifer Jackson
http://galaxyproject.org


[galaxy-user] Number of mismatches allowed in the initial read mapping

2012-09-06 Thread Du, Jianguang
Dear All,



I tested how to set the Number of mismatches allowed in the initial read 
mapping as follows.



First, I ran FASTQ Groomer on a dataset to get the total number of reads. 
The total number of reads is 17510227.



Then I ran TopHat with Number of mismatches allowed in the initial read 
mapping set to 1, and then ran flagstat under NGS: SAM Tools. Here are the 
statistics for the TopHat output:
18162942 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
18162942 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ=5)



Next I ran TopHat with Number of mismatches allowed in the initial read 
mapping set to 0, and then ran flagstat under NGS: SAM Tools. Here are the 
statistics for the TopHat output:
16100027 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
16100027 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ=5)



Does it mean that about 0.6 million reads are aligned two or more times when I 
set Number of mismatches allowed in the initial read mapping to 1, whereas 
about 1.4 million reads cannot be aligned with the more stringent setting of 0? 
Which number should we choose?



Thanks.

Jianguang
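
One point that may help with the comparison above: flagstat counts alignment 
records, not distinct reads, so a read reported at several locations is counted 
once per location. The sketch below, which works on SAM text (for example a BAM 
converted to SAM), counts both numbers. It assumes single-end data, where the 
read name alone identifies a read; the function name and the example records 
are invented for illustration.

```python
def count_alignments_and_reads(sam_lines):
    """Return (mapped_records, distinct_mapped_read_names) for SAM text lines."""
    records = 0
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # 0x4 = read unmapped
        records += 1
        names.add(fields[0])
    return records, len(names)

# Made-up records: r1 is reported at two locations, r2 at one.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t255\t36M\t*\t0\t0\t*\t*",
    "r1\t256\tchr2\t500\t255\t36M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t900\t255\t36M\t*\t0\t0\t*\t*",
]
print(count_alignments_and_reads(example))  # (3, 2)
```

Comparing the second number with the FASTQ read count gives the number of reads 
that aligned at least once, which the flagstat total alone cannot show.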

[galaxy-user] Does Tophat output *.accepted hits file contain headers?

2012-09-13 Thread Du, Jianguang
Dear All,

I want to use the TopHat output files (accepted hits) to do analysis 
outside Galaxy. However, the program I am using requires the TopHat output to 
be indexed, sorted BAM files that contain headers. Do the TopHat accepted hits 
outputs produced at Galaxy contain headers? Are the headers of BAM 
files generated by TopHat always the same?

Thanks,

Jianguang
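
A quick way to check this on a downloaded file is "samtools view -H". As an 
alternative that needs only the Python standard library, the sketch below reads 
the header text directly, relying on the BAM layout in the SAM/BAM 
specification (gzip-compatible BGZF blocks, the magic bytes "BAM\1", then a 
length-prefixed header text). The function names and the file name in the usage 
note are hypothetical.

```python
import gzip
import struct

def read_bam_header_text(path):
    """Return the plain-text SAM header stored at the start of a BAM file."""
    with gzip.open(path, "rb") as handle:   # BGZF blocks are gzip members
        magic = handle.read(4)
        if magic != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", handle.read(4))  # header text length
        text = handle.read(l_text)
    return text.rstrip(b"\x00").decode("utf-8")

def is_coordinate_sorted(header_text):
    """True if the @HD line declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            return "SO:coordinate" in line.split("\t")
    return False
```

For example, read_bam_header_text("accepted_hits.bam") returns the @HD, @SQ and 
@PG lines if they are present. An empty string means the file has no header 
text, and is_coordinate_sorted() only reports what the header declares, not 
whether the records are actually in order.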

[galaxy-user] How much FPKM can be take into consideration when compare gene expression

2012-09-19 Thread Du, Jianguang
Dear All,

I am comparing gene expression between two cell types by examining the 
Cufflinks output file "gene differential expression testing". The file lists 
the FPKM of genes in the two cell types and the log2 of the fold change. I want 
to look for genes that have more than 2-fold higher expression in cell type A 
than in cell type B. What is the minimum FPKM in cell type A, so that only 
genes with an FPKM higher than this number are taken into consideration for 
further analysis?

For example,

The FPKM of gene X in cell type A is 80, and in cell type B is 20, the fold of 
difference is 4.

The FPKM of gene Y in cell type A is 4, and in cell type B is 1, the fold of 
difference is also 4.

Is there a minimum FPKM in cell type A for genes to be selected for further 
analysis?

Thanks.

Jianguang
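
To make the question concrete, here is a small Python sketch that applies a 
fold-change cutoff together with a minimum FPKM in cell type A. The floor of 
5.0 is only a placeholder chosen for illustration, not a recommended value, and 
the function and parameter names are invented. The two genes are the X and Y 
examples from the message.

```python
import math

def select_genes(genes, min_fpkm_a, min_log2_fold=1.0):
    """Keep genes whose FPKM in cell type A passes a user-chosen floor
    and whose log2(A/B) reaches min_log2_fold."""
    kept = []
    for name, fpkm_a, fpkm_b in genes:
        if fpkm_a < min_fpkm_a or fpkm_b <= 0:
            continue  # too weakly expressed in A, or ratio undefined
        if math.log2(fpkm_a / fpkm_b) >= min_log2_fold:
            kept.append(name)
    return kept

# (gene, FPKM in A, FPKM in B); both have a 4-fold difference.
genes = [("X", 80.0, 20.0), ("Y", 4.0, 1.0)]
print(select_genes(genes, min_fpkm_a=5.0))   # ['X']
```

Genes with an FPKM of 0 in cell type B are skipped here only because the ratio 
is undefined; they would need to be handled separately.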

[galaxy-user] please restore my account

2012-10-08 Thread Du, Jianguang
Dear Sir or Madam,

I had opened multiple accounts at Galaxy Main. I did not know that this is 
against policy. I noticed the policy only when I found that all the accounts 
were blocked. Would you please restore the account with the email address 
jia...@iupui.edu?

If you are not responsible for account management, would you please forward 
this email to the Galaxy administrator, or give me the right email address of 
the Galaxy administrator (I could not find it)?

Thanks.

Jianguang

[galaxy-user] Do I need to specify the file format when I upload datasets using FTP method?

2013-03-21 Thread Du, Jianguang
Hi Everyone,

When I upload my datasets to my history via the FTP method (using FileZilla), 
do I need to specify the file format under File Format of Upload File from your 
computer?

I noticed that the screencast of how to upload datasets via FTP just leaves the 
File Format as Auto-detect. However, I also noticed this sentence in the 
help for Auto-detect: the system will attempt to detect Axt, Fasta, 
Fastqsolexa, Gff, Gff3, Html, Lav, Maf, Tabular, Wiggle, Bed and Interval (Bed 
with headers) formats. Do I need to specify the format of my datasets if the 
format of my datasets is not listed in the sentence above?

Thanks.

Jianguang

[galaxy-user] is there size limit of dataset for running Tophat?

2013-03-27 Thread Du, Jianguang
Hi All,

Is there a size limit of dataset for running Tophat at Galaxy? If there is, how 
many reads is the limit?

Thanks.

Jianguang

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

[galaxy-user] Parameters for merging BAM files

2013-04-05 Thread Du, Jianguang
Hi All,

I want to merge the TopHat output (Accepted Hits) of several datasets. I want 
the merged BAM file to have exactly the same format as the individual input BAM 
files. Should I check Merge all component bam file headers into the merged bam 
file?

Thanks.

Have a nice weekend.

Jianguang

[galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?

2013-04-08 Thread Du, Jianguang
Hi All,

I have a very basic question. I have RNA-seq datasets of several cell types and 
want to compare the alternative splicing events between cell types. The reads 
are 36nt in length. Are these reads long enough to map on the splicing 
junctions accurately when I run TopHat with stringent parameters (no mismatch)?

Thanks.

Best,

Jianguang Du



Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?

2013-04-09 Thread Du, Jianguang
Hi Jeremy,

Thank you for the information.

In addition to reducing the Minimum length of read segments, do I also 
need to reduce Anchor length to get more mapping on splicing junctions?

It looks like the setting for Anchor length only affects the number of mapped 
splicing junctions reported in the .splicing junctions output. Is my 
understanding correct? Does "regions" mean the number of mapped splicing 
junctions?

Thanks.

Best,

Jianguang




From: Jeremy Goecks [jeremy.goe...@emory.edu]
Sent: Tuesday, April 09, 2013 9:03 AM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly 
map on splicing junctions?

36bp reads will map across splice junctions but at a relatively low rate; you 
can try changing segment length to get better mapping, but you'll want to 
evaluate the results carefully to ensure that you're getting good results.

Good luck,
J.

On Apr 8, 2013, at 5:45 PM, Du, Jianguang wrote:

Hi All,
I have a very basic question. I have RNA-seq datasets of several cell types and 
want to compare the alternative splicing events between cell types. The reads 
are 36nt in length. Are these reads long enough to map on the splicing 
junctions accurately when I run TopHat with stringent parameters (no mismatch)?
Thanks.
Best,
Jianguang Du





Re: [galaxy-user] Parameters for merging BAM files

2013-04-10 Thread Du, Jianguang
Hi Jen,

Thanks for the information. I used this setting and the merged BAM files 
(.accepted hits) worked quite well for the downstream analysis.

Best,

Jianguang


From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Tuesday, April 09, 2013 4:10 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Parameters for merging BAM files

Hello Jianguang,

This setting is recommended to be used. It will merge all headers, but if there 
are differences between the input files these will be combined together in the 
final output.  If you want to see what this content will be, convert the BAM 
files before and after the merge to SAM format (using the option to include 
headers) and review the results. You can always later delete these permanently 
to recover disk space.

Hopefully this helps,

Jen
Galaxy team

On 4/5/13 1:22 PM, Du, Jianguang wrote:

Hi All,

I want to merge the TopHat output (Accepted Hits) of several datasets. I want 
the merged BAM file to have exactly the same format as the individual input BAM 
files. Should I check Merge all component bam file headers into the merged bam 
file?

Thanks.

Have a nice weekend.

Jianguang





--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org

Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?

2013-04-10 Thread Du, Jianguang
Hi Jeremy,

Thank you very much for the reply. I have some more questions on the same topic.



1) My reads are 36nt long. What should I set the Minimum length of read 
segments to in order to get the most reliable output with the highest mapping 
of splicing junctions? In my previous run of TopHat, I set it to 18. Can I 
reduce it further to get better mapping on splicing junctions?



2) I do not understand exactly how TopHat uses the Anchor length, 
although I have read the TopHat manual.

Suppose I set the Anchor length to 8 and the Maximum number of mismatches that 
can appear in the anchor region of spliced alignment to 0 when I run TopHat. 
Does it mean that, for a read that maps on two adjacent exons, TopHat will 
report this alignment in the outputs .accepted hits and .splicing junctions if 
either end of the read has 8 or more nucleotides mapping on one exon?



3) Is there any disadvantage/negative effect if I set the Anchor length 
to the lowest value, for example 3? My understanding is that, under the 0 
mismatch condition, if 3 nucleotides at one end of a read map on one exon, the 
other part of the read will map on the adjacent exon (in my case, the other 
part would be 33 nucleotides). So my understanding is that setting the Anchor 
length to 3 does not increase the inaccuracy of the alignment. Am I correct?



Best,

Jianguang
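
To make questions 2) and 3) concrete, the sketch below takes the CIGAR string 
of a spliced alignment and reports how many read bases fall on each side of 
every junction, then checks them against an anchor length. It assumes, based on 
my reading of the TopHat manual, that the anchor length is required on each 
side of a junction rather than at either end; please treat that as an 
assumption. The function names and CIGAR strings are invented, and for reads 
with several junctions the "side" is simplified to all aligned bases before or 
after the junction.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junction_flanks(cigar):
    """For each N (intron) operation, return (bases_left, bases_right):
    read bases aligned before and after that junction."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    read_consuming = {"M", "I", "=", "X"}   # soft clips are not counted as aligned
    total = sum(n for n, op in ops if op in read_consuming)
    flanks, seen = [], 0
    for n, op in ops:
        if op == "N":
            flanks.append((seen, total - seen))
        elif op in read_consuming:
            seen += n
    return flanks

def meets_anchor(cigar, anchor_length):
    """True if the read is spliced and every junction has at least
    anchor_length read bases on both sides."""
    flanks = junction_flanks(cigar)
    return bool(flanks) and all(min(l, r) >= anchor_length for l, r in flanks)

print(junction_flanks("28M500N8M"))     # [(28, 8)]
print(meets_anchor("33M500N3M", 8))     # False
print(meets_anchor("33M500N3M", 3))     # True
```

For a 36nt read, the short side of the junction is what the anchor length 
constrains; the 33 bases on the other side do not make the 3-base side any more 
specific.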




From: Jeremy Goecks [jeremy.goe...@emory.edu]
Sent: Tuesday, April 09, 2013 1:57 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly 
map on splicing junctions?

In addition to reducing the Minimum length of read segments, do I also 
need to reduce Anchor length to get more mapping on splicing junctions?

Definitely worth a try.

Looks like the setting for Anchor length only affects the number of mapped 
splicing junctions reported in the .splicing junctions output. Is my 
understanding correct?

No, it will affect mapped reads as well.

Does "regions" mean the number of mapped splicing junctions?

Yes.

Best,
J.

Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?

2013-04-11 Thread Du, Jianguang
Hi Jeremy,

Thank you very much for your reply.

I have one more question about the Anchor length. For an RNA-seq read mapped 
on a splicing junction under the 0 mismatch condition, if 5 nucleotides at 
one end map on one exon, does it mean that the rest of the read must map on the 
adjacent exon? What I want to understand is this: although reducing the Anchor 
length may reduce the reliability of mapping at one end/exon, the 
increased number of nucleotides mapped on the adjacent exon may increase the 
reliability of mapping. Does that mean the overall reliability of mapping is 
unchanged?

Best,

Jianguang




From: Jeremy Goecks [jeremy.goe...@emory.edu]
Sent: Wednesday, April 10, 2013 3:16 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly 
map on splicing junctions?

1) My reads are 36nt long. What should I set the Minimum length of read 
segments to in order to get the most reliable output with the highest mapping 
of splicing junctions? In my previous run of TopHat, I set it to 18. Can I 
reduce it further to get better mapping on splicing junctions?

You'll need to define for yourself what you mean by better/best mapping and 
experiment to find the parameters that give you the best results.

2) I do not understand exactly how TopHat uses the Anchor length, 
although I have read the TopHat manual.
Suppose I set the Anchor length to 8 and the Maximum number of mismatches that 
can appear in the anchor region of spliced alignment to 0 when I run TopHat. 
Does it mean that, for a read that maps on two adjacent exons, TopHat will 
report this alignment in the outputs .accepted hits and .splicing junctions if 
either end of the read has 8 or more nucleotides mapping on one exon?

I think that's correct.

3) Is there any disadvantage/negative effect if I set the Anchor length 
to the lowest value, for example 3? My understanding is that, under the 0 
mismatch condition, if 3 nucleotides at one end of a read map on one exon, the 
other part of the read will map on the adjacent exon (in my case, the other 
part would be 33 nucleotides). So my understanding is that setting the Anchor 
length to 3 does not increase the inaccuracy of the alignment. Am I correct?

Setting the anchor length especially small reduces the constraints on mapping, 
so more reads will map but there are likely to be more false positives as well.

Good luck,
J.
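
A rough back-of-the-envelope sketch of the false-positive point, in Python. It 
assumes a deliberately simple model in which bases are independent and equally 
likely, which real genomes do not follow, and the number of candidate positions 
is a hypothetical figure chosen only for illustration.

```python
def chance_match_probability(anchor_length):
    """Probability that a fixed anchor of this length matches one random
    genomic position exactly, assuming independent, equally likely bases."""
    return 0.25 ** anchor_length

def expected_chance_matches(anchor_length, candidate_positions):
    """Expected number of exact matches arising purely by chance."""
    return candidate_positions * chance_match_probability(anchor_length)

# 10_000 candidate positions is a made-up number.
for a in (3, 5, 8):
    print(a, chance_match_probability(a), expected_chance_matches(a, 10_000))
```

Under this model a 3-base anchor matches a random position 1 time in 64, while 
an 8-base anchor matches 1 time in 65536, which is one way to see why very 
short anchors admit more spurious junctions even with 0 mismatches.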

[galaxy-user] Which Library Type should I use for single-end reads

2013-04-15 Thread Du, Jianguang
Hi All,

I have a very basic question about parameters for running TopHat.

I have datasets of single-end reads. These datasets were generated with 
Illumina Genome Analyzer IIx. Which Library Type should I choose to run 
Tophat?

Thanks.

Best,

Jianguang

[galaxy-user] View details of Tophat alignment

2013-05-30 Thread Du, Jianguang
Hi All,

After I finished the TopHat alignment for RNA-seq, I took a look at the details 
of the parameters by clicking the View details icon, and I got the information 
shown below:



Input Parameter | Value | Note for rerun
RNA-Seq FASTQ file | 73: Filtered Groomed data1_rep2
Use a built in reference genome or own from your history | indexed
Select a reference genome | /galaxy/data/mm9/bowtie_index/mm9
Is this library mate-paired? | single
TopHat settings to use | full
Library Type | FR Unstranded
Anchor length (at least 3) | None
Maximum number of mismatches that can appear in the anchor region of spliced alignment | None
The minimum intron length | None
The maximum intron length | None
Allow indel search | No
Maximum number of alignments to be allowed | None
Minimum intron length that may be found during split-segment (default) search | None
Maximum intron length that may be found during split-segment (default) search | None
Number of mismatches allowed in the initial read mapping | None
Number of mismatches allowed in each segment alignment for reads mapped independently | None
Minimum length of read segments | None
Use Own Junctions | Yes
Use Gene Annotation Model | Yes
Gene Model Annotations | 1: mm9 genes.gtf
Use Raw Junctions | No
Only look for supplied junctions | No
Use Closure Search | No
Use Coverage Search | Yes
Minimum intron length that may be found during coverage search | None
Maximum intron length that may be found during coverage search | None
Use Microexon Search | No



I am totally confused by so many None values.

Then I checked the workflow I had set up and used for the TopHat alignment;
the details are the same as above.



However, the brief description just under the title of the alignment output
(accepted_hits) is as follows:



format: bam, database: mm9
Tophat for Illumina on data 1 and data 73: accepted_hits, TopHat v1.4.0 tophat 
-p 8 -a 8 -m 0 -i 70 -I 50 -g 20 -G 
/galaxy/main_pool/pool1/files/004/425/dataset_4425972.dat --library-type 
fr-unstranded --no-novel-indels --coverage-search --min-cove



Could you please tell me whether anything is wrong, given that so many of the
detailed parameters show None?
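
One way to cross-check the details table against what actually ran is to pull
the flags out of the command line shown under the output title. The sketch
below is only an illustration: parse_flags is a hypothetical helper, and the
command string is the visible part of the description above, cut before the
point where the display truncates it. In TopHat, -a is the minimum anchor
length, -m the splice mismatches, -i and -I the minimum and maximum intron
length, and -g the maximum number of multihits.

```python
import shlex

# Visible part of the command from the dataset description; the original
# display is truncated after --coverage-search, so nothing beyond it is used.
COMMAND = (
    "tophat -p 8 -a 8 -m 0 -i 70 -I 50 -g 20 -G "
    "/galaxy/main_pool/pool1/files/004/425/dataset_4425972.dat "
    "--library-type fr-unstranded --no-novel-indels --coverage-search"
)


def parse_flags(command):
    """Map each flag to its value, or to True for flags that take no value."""
    tokens = shlex.split(command)
    flags = {}
    index = 0
    while index < len(tokens):
        token = tokens[index]
        if token.startswith("-"):
            has_value = (index + 1 < len(tokens)
                         and not tokens[index + 1].startswith("-"))
            flags[token] = tokens[index + 1] if has_value else True
            index += 2 if has_value else 1
        else:
            index += 1  # skip the program name and any stray positional token
    return flags


for flag, value in parse_flags(COMMAND).items():
    print(flag, value)
```

If the flags carry explicit values, as they do in the visible part of the
command, then a None in the table may simply reflect how the form values are
displayed rather than what was passed to TopHat. That reading is my assumption
and is not confirmed anywhere in this thread.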



Thanks.

Jianguang DU

[galaxy-user] Which Input FASTQ quality scores type should I choose when run FASTQ Groomer?

2013-08-30 Thread Du, Jianguang
Hi All,

I downloaded some RNA-seq datasets from NCBI. The datasets were generated on
an Illumina HiSeq 2000. I am not sure which Input FASTQ quality scores type I
should choose when running FASTQ Groomer. Below are the scores for 2 reads
from one dataset, which I renamed read 1 and read 2.


1) Sequence and quality score displayed in Galaxy
@read 1 length=51
NTGAGATTCTTGACTAGTTATTTCTGCTTTCAGGGAAGAAATCAGCTGGGC
+read 1 length=51
#1=ADADEHIIGIHJGJJJHJIIJJJH@HEGBFH;FHEH@HI
@read 2 length=51
NGAAGAGTCAGTTGTTTCCCTCATAACTTGCTAGATTCCGGATTGCT
+read 2 length=51
#1=DDDEDHHFHHJIJJHIIIJJJIJJJIJIJJII



2) Sequence and one-channel quality scores shown in the SRA at NCBI when I
downloaded the dataset.
gnl|SRA|read 1
NTGAGATTCTTGACTAGTTATTTCTGCTTTCAGGGAAGAAATCAGCTGGGC
One channel quality score
 2 16 28 32 35 32 35 36 39 39 39 39 39 40 40 38 40 39 41 38 41 41 41 39 41 40 
40 41 41 41 39 31 39 36 38 33 37 39 26 37 39 36 39 29 31 39 40 41 41 41 41

gnl|SRA|read 2
NGAAGAGTCAGTTGTTTCCCTCATAACTTGCTAGATTCCGGATTGCT
One channel quality score
 2 16 28 35 35 35 36 35 39 39 37 39 39 41 41 41 41 41 40 41 41 39 40 40 40 41 
41 41 40 41 41 41 41 41 41 41 40 41 40 41 41 41 41 41 41 40 41 41 41 41 40



It looks like the dataset was generated with Illumina pipeline version 1.8 or
later, because some bases have a quality score of 41. Can I choose Sanger as
the Input FASTQ quality scores type when I run FASTQ Groomer?
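
A quick way to check this, as a sketch: under the Sanger (Phred+33) encoding
each quality character is the score plus 33, so converting the numeric SRA
scores should reproduce the start of the quality string displayed in Galaxy.
The function names are my own, and the scores are the first eight values of
read 1 above.

```python
def phred_to_sanger(scores):
    """Encode numeric Phred scores as a Sanger (Phred+33) quality string."""
    return "".join(chr(score + 33) for score in scores)


def must_be_phred33(quality):
    """True if the string holds characters an offset-64 encoding cannot make.

    Offset-64 encodings start at ';' (ASCII 59, Solexa score -5), so any
    character below that can only come from a Phred+33 encoding.
    """
    return any(ord(char) < 59 for char in quality)


first_scores = [2, 16, 28, 32, 35, 32, 35, 36]  # start of read 1 from SRA
print(phred_to_sanger(first_scores))  # compare with the Galaxy quality line
print(phred_to_sanger([41]))          # character used for score 41
print(must_be_phred33("#1=ADADE"))    # start of the Galaxy quality line
```

The converted scores match the first characters of the quality line shown in
Galaxy, and score 41 maps to J, which is consistent with the Sanger setting.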



Thanks.



Jianguang Du

