[galaxy-user] FASTQ splitter produced empty dataset, please help
I am having trouble splitting a paired-end FASTQ dataset into two separate datasets. To explain the problem clearly, here are the details of what I did with my dataset:

Step 1) My aim is to compare datasets for differential alternative splicing. I downloaded paired-end datasets in FASTQ format from the NCBI SRA as the original data. Below is part of the paired-end FASTQ dataset that I downloaded. Does this dataset look OK?

@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
GCTGAGTGAGGGTGTGTTTGGAGTTTG
+SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
I28II;II*2/5:++,(..*943F@I.('+.35'
@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
AAAGATGTTAGTGATACGGAAAGGATATCTC
+SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
9+*9+7@?F1206,IGI+D122/0++-.+6/@?

Step 2) Then I ran FASTQ Groomer with the following settings: a) Input FASTQ quality scores type: Illumina 1.3-1.7 b) Advanced Options: Hide Advanced Options. Did I choose the right settings for FASTQ Groomer? Should I use Advanced Options? If so, what should the Advanced Options settings be? Below is part of the groomed dataset:

@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
GCTGAGTGAGGGTGTGTTTGGAGTTTG
+SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=35
*!!**!**'!*
@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
AAAGATGTTAGTGATACGGAAAGGATATCTC
+SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=35
'!*(*!%

Does the groomed data look right? Is the number representing the member of a pair correct? Here they are .1 and .2; should they be /1 and /2?

Step 3) Then I ran FASTQ splitter on the groomed file. There are no settings for the splitter. I chose the correct groomed file and clicked Execute. Below is the description of the split dataset:

empty
format: fastqsanger, database: hg19
Info: Split 0 of 15277248 reads (0.00%).

Please help me deal with this problem. Thanks.
Jianguang Du

___
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org.
Please keep all replies on the list by using reply all in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
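One thing that may be worth checking before grooming is the quality encoding of the downloaded reads. The sketch below is plain Python, not a Galaxy tool, and the function name `guess_quality_offset` is made up for illustration. It relies on one fact about the encodings: characters below ';' (ASCII 59) cannot occur in the +64 Solexa/Illumina scales, so finding them suggests Sanger (Phred+33) data. The opposite conclusion is only a guess, as noted in the comments.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Heuristic only: any character below ';' (ASCII 59) is impossible in the
    +64 encodings, so it implies Phred+33 (Sanger).
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    # No low characters seen: probably +64, but uniformly high-quality
    # Phred+33 data would look the same, so inspect more reads to be sure.
    return 64


# Quality strings copied from the SRA records quoted in the message above.
sra_quals = [
    "I28II;II*2/5:++,(..*943F@I.('+.35'",
    "9+*9+7@?F1206,IGI+D122/0++-.+6/@?",
]
print(guess_quality_offset(sra_quals))
```

If the guess comes out as 33 for the full dataset, the FASTQ Groomer input type would presumably need to be Sanger rather than Illumina 1.3-1.7, which might also explain why the groomed quality strings look damaged.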
[galaxy-user] need help to split paired-end dataset
I am new to NGS analysis and need help solving this problem. As described in my previous email, I have some paired-end datasets in FASTQ format, and I am having trouble splitting each of them into two datasets (one forward and one reverse). Jennifer instructed me to first assign the datatype fastqsanger and then run 'Manipulate FASTQ'. I have two questions:

1) Since the datasets were already split into forward and reverse reads when extracted in FASTQ format from the SRA, can I use them just as single-end data?
2) If I do need to split each dataset into two datasets, which settings should I choose when I run Manipulate FASTQ?

Thanks.
Jianguang
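For readers who want to see what splitting by read-name suffix amounts to, here is a minimal Python sketch. It is not how the Galaxy FASTQ tools work internally; it simply assumes 4-line FASTQ records whose first header token ends in .1 or .2, as in the SRA-style names quoted earlier in this thread. The function name `split_by_suffix` and the short toy records are made up for illustration.

```python
def split_by_suffix(lines):
    """Split FASTQ records into (forward, reverse) lists by read-name suffix.

    Assumes 4-line records and names such as "@SRR192532.1.1" (forward)
    and "@SRR192532.1.2" (reverse). Records with other suffixes are skipped.
    """
    forward, reverse = [], []
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        name = record[0].split()[0]  # first token of the header line
        if name.endswith(".1"):
            forward.append(record)
        elif name.endswith(".2"):
            reverse.append(record)
    return forward, reverse


# Toy records (sequences and qualities shortened, not real data).
toy = [
    "@SRR192532.1.1 HWI-EAS269:1:4:655:110.1 length=4", "GCTG", "+", "IIII",
    "@SRR192532.1.2 HWI-EAS269:1:4:655:110.2 length=4", "AAAG", "+", "IIII",
]
fwd, rev = split_by_suffix(toy)
print(len(fwd), len(rev))
```

A name that ends in .1 for another reason (for example an unpaired spot numbered 1) would be misclassified, so the naming scheme should be checked on real data first.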
[galaxy-user] which setting should be used to upload mouse reference genome?
Dear All, I am going to search for alternative splicing events between datasets. I am not sure which settings to use for the mouse reference genome (mm9) when I upload it from UCSC Main. Would you please tell me the settings for 1) Group: 2) Track: 3) Table: 4) Output format: Thanks. Jianguang
[galaxy-user] How to decide Mean Inner Distance between Mate Pairs?
Dear All, I am analyzing downloaded RNA-seq datasets. However, I am not sure what the Mean Inner Distance between Mate Pairs is for these paired-end datasets. Take one paired-end RNA-seq dataset as an example. Its description in the NCBI SRA database reads: Layout: PAIRED, Orientation: 5'-3'-3'-5', Nominal length: 400, Nominal Std Dev: 20. At first I thought the Mean Inner Distance between Mate Pairs should be 325 bp, because the reads on both ends are 36 bp long. However, when I aligned the sequences of the paired reads to transcripts and to the genome using BLASTn, the distance between the paired reads was about 200 bp. How should I decide the Mean Inner Distance between Mate Pairs in my case? Thanks. Jianguang Du
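The arithmetic behind the first estimate can be written out as a small Python sketch. It assumes the usual convention that the inner distance is the fragment length minus both read lengths, and that the SRA "Nominal length" is the fragment length without adapters (this second assumption may not hold for every submission). The function name is made up for illustration.

```python
def inner_distance_from_fragment(fragment_length, read_length):
    """Inner distance = fragment length minus the two mate reads."""
    return fragment_length - 2 * read_length


# Nominal length 400 with 36 bp reads, as quoted above.
print(inner_distance_from_fragment(400, 36))
```

With the numbers quoted above this gives 328 bp, close to the 325 bp first estimate. Going the other way, a gap of about 200 bp with 36 bp reads would correspond to fragments of roughly 272 bp, so the nominal length and the observed alignments disagree; the observed value from real alignments is probably the safer guide.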
[galaxy-user] Do I need to allow indel search?
Dear All, I want to compare pre-mRNA alternative splicing events between RNA-seq datasets. Do I need to allow indel search when I run TopHat? What is the indel search for? I could not find detailed information about indel search in the TopHat documentation. Thanks. Jianguang Du
[galaxy-user] Use Own Junctions or not
Dear All, I want to compare pre-mRNA alternative splicing events between RNA-seq datasets. Should I use Own Junctions when I run TopHat? What does Own Junctions mean? Thanks. Jianguang Du
[galaxy-user] Minimum length of read segments
Dear All, I am going to run TopHat on an RNA-seq dataset to look for alternative splicing events. TopHat has a parameter called Minimum length of read segments. According to the description of the implemented TopHat options, it means: "Each read is cut up into segments, each at least this long. These segments are mapped independently." The default is 25. My reads are 36 bp long. Should I change this parameter based on the length of my reads? What value should I enter? Thanks. Jianguang Du
[galaxy-user] run Bowtie to estimate Mean Inner Distance between Mate Pairs
Dear All, To figure out the Mean Inner Distance between Mate Pairs of my paired-end RNA-seq datasets, I ran Bowtie (Map with Bowtie for Illumina) with both the forward and reverse datasets and mouse mm9 as the reference genome. Below is the Bowtie output for one pair of reads only (field names are on the left):

For the forward read
QNAME: SRR322837.8.1
FLAG: 99
RNAME: chr1
POS: 163761156
MAPQ: 255
CIGAR: 36M
MRNM: =
MPOS: 163761301
ISIZE: 181
SEQ: NTGGATACTAGCCATAAATGAATT
QUAL: %(,,')(())@@@22358852@@@##
OPT: XA:i:1 MD:Z:0A35 NM:i:1

For the reverse read
QNAME: SRR322837.8.2
FLAG: 147
RNAME: chr1
POS: 163761301
MAPQ: 255
CIGAR: 36M
MRNM: =
MPOS: 163761156
ISIZE: -181
SEQ: TATTATGTCAATCTATGAAGAAGGACGGCGAGGTGA
QUAL: GDBE@BEEGDB=BD-=GEDDGGBGD8GB?
OPT: MD:Z:29A6 NM:i:1

Is ISIZE the insert size? The difference between POS and MPOS is 145 bp, which is 36 bp shorter than ISIZE (181). My question is: if ISIZE does mean insert size, how should I convert ISIZE into the Mean Inner Distance between Mate Pairs? Thanks, Jianguang Du
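The numbers in the pair above are consistent with ISIZE spanning the whole fragment, from the leftmost mapped base of one mate to the rightmost mapped base of the other (145 + 36 = 181). Under that reading, a minimal Python sketch of the conversion looks like this. The function names are made up, and the list of ISIZE values in the second example is invented toy data, not output from any real run.

```python
from statistics import mean


def inner_distance_from_isize(isize, read_len_1, read_len_2):
    """Inner distance = outer fragment span minus both mate lengths."""
    return abs(isize) - read_len_1 - read_len_2


def mean_inner_distance(isizes, read_len):
    """Average over pairs, using only positive ISIZE values so that each
    pair is counted once (the mate carries the same value negated)."""
    return mean(inner_distance_from_isize(i, read_len, read_len)
                for i in isizes if i > 0)


# The pair quoted above: ISIZE 181, both mates 36 bp.
print(inner_distance_from_isize(181, 36, 36))

# Toy ISIZE column (made-up values) from several pairs.
print(mean_inner_distance([181, -181, 175, -175, 190, -190], 36))
```

For the quoted pair this gives 109 bp, which matches MPOS minus the end of the forward read (163761301 - 163761192). Averaging over many uniquely mapped pairs would give a data-driven estimate of the mean.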
[galaxy-user] How to find the alternatively spliced segment of genes in Cuffdiff output
Dear All, I have run the programs from TopHat to Cuffdiff in Galaxy to look for differences in alternative splicing events between cell types. However, I do not know how to find detailed information (such as the sequence and the genomic coordinates) about the alternatively spliced part of a given gene. I looked at the Cuffdiff output "splicing differential expression testing", but there is no column showing the position of the alternatively spliced region. Please help me solve this problem. Thanks in advance. Jianguang Du
[galaxy-user] How much can I trim my reads
Dear All, I am analysing RNA-seq datasets for differential splicing events between cell types. My reads are 36 bp long. To increase the quality of the reads, I need to trim some nucleotides from the ends. How many nucleotides can I trim? I am afraid that if I trim too much, the reliability of the alignment will be affected. Thanks in advance. Jianguang
[galaxy-user] What minimum quality should I set for Filter FASTQ?
Dear All, I am analysing RNA-seq datasets for differential splicing events between cell types. Some of my reads contain bad nucleotides. Should I run Filter FASTQ to remove these lower-quality reads? If so, what Minimum Quality should I set for the filter? Thanks. Jianguang
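To make the idea of a minimum-quality filter concrete, here is a small Python sketch. It is not the Filter FASTQ tool itself. It assumes Sanger (Phred+33) quality strings, and the threshold of 20 and the parameter `max_low_bases` are illustrative choices, not recommendations from this thread.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def passes_filter(quality, min_quality=20, max_low_bases=0, offset=33):
    """Keep a read if at most `max_low_bases` bases fall below `min_quality`.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    low = sum(1 for q in phred_scores(quality, offset) if q < min_quality)
    return low <= max_low_bases


print(phred_scores("I5#"))
print(passes_filter("IIII"), passes_filter("II#I"))
```

Counting how many reads survive at a few different thresholds before committing to one can show how much data each setting would cost, which matters for short 36 bp reads.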
[galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?
Dear All, I am analysing RNA-seq datasets for differential splicing events between cell types. These are mouse cells. Jen suggested that I use the iGenomes version of the reference GTF to take full advantage of the options in Cuffdiff. My question is: should I also use this iGenomes reference GTF when I run TopHat? Thanks. Jianguang
Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?
Hi Jen, Thanks for your help. Do you mean that if I want to find novel isoforms/splicing, I need to select No under Use Reference Annotation when I run Cufflinks, and then use the iGenomes version of the reference GTF when I run Cuffmerge? Based on your information and some protocols found online, my understanding is:

1) If I use the iGenomes version of the reference GTF, I only need to run Cuffmerge with the Cufflinks outputs, because the iGenomes reference GTF already contains attributes such as p_id and tss_id. The Cuffmerge output can then be used for Cuffdiff.
2) However, if I use the reference GTF from Ensembl/UCSC (rather than from iGenomes), I need to run Cuffcompare to create p_id and tss_id, which are required for Cuffdiff.

Am I right? Another question: should I use the iGenomes version of the reference GTF when I run TopHat if I want to see novel isoforms/splicing? Thanks. Jianguang

From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Thursday, August 23, 2012 11:46 AM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

Hello Jianguang, The point in the analysis at which to start using the reference GTF file can depend on whether or not you intend to do any discovery along with differential expression testing. At the TopHat and Cufflinks steps, using a reference GTF file can influence how datasets will map and assemble. In general, if your intention is to do discovery (e.g. work with novel isoforms that are in your data but not in the reference), then do not add the reference GTF until the Cuffmerge step (to produce the input annotation GTF file for Cuffdiff). But if you want to guide the analysis toward known isoforms, then use the reference GTF. This is the process our RNA-seq example protocol follows: http://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise

For reference, there are other variations of this on the Cufflinks web site, some that never lead to Cuffdiff but may still be useful to review. Please see the Cufflinks paper (linked from the right side bar as Protocol) for many more options/discussion. http://cufflinks.cbcb.umd.edu/tutorial.html -- Common uses of the Cufflinks package

The final decision will be up to you, and a few runs with different options may be a useful way to make the final call, but hopefully this provides some resources to help you understand the option.

Jen
Galaxy team

-- Jennifer Jackson http://galaxyproject.org
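Whether a given reference GTF already carries the p_id and tss_id attributes mentioned in this exchange can be checked directly. The Python sketch below assumes the usual 9-column, tab-separated GTF layout with `key "value";` pairs in the last column. The function names and the example line are made up for illustration and are not taken from any iGenomes file.

```python
import re


def gtf_attributes(line):
    """Return the attribute keys found in column 9 of one GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return set()
    # Matches pairs of the form: key "value"
    return set(re.findall(r'(\w+)\s+"[^"]*"', fields[8]))


def has_cuffdiff_ids(line):
    """True if the line carries both p_id and tss_id attributes."""
    return {"p_id", "tss_id"} <= gtf_attributes(line)


# Hypothetical GTF line, for illustration only.
example = ('chr1\tunknown\tCDS\t100\t200\t.\t+\t0\t'
           'gene_id "GeneA"; transcript_id "TxA"; p_id "P1"; tss_id "TSS1";')
print(sorted(gtf_attributes(example)))
print(has_cuffdiff_ids(example))
```

Running such a check over the first few hundred lines of a GTF would show quickly whether a Cuffcompare step is likely to be needed before Cuffdiff.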
Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?
Hi Jen, I had a problem when I tried to run TopHat with the iGenomes reference GTF. Here is what I did:

1) Imported the iGenomes version of the mm9 genes.gtf via Shared Data - Data Libraries - iGenomes - click genes.gtf under mm9 - click Go for Import to current history. The genes.gtf appeared in the history and turned green.
2) Clicked "Tophat for Illumina Find splice junctions using RNA-seq data" to open the Tophat for Illumina (version 1.5.0) form.
3) Selected the dataset to be analysed under "RNA-Seq FASTQ file:".
4) Chose "Use one from the history" under "Will you select a reference genome from your history or use a built-in index?:"

Then the screen refreshed and the pulldown menu under "Select the reference genome:" became smaller. Nothing showed up in the menu (in fact the menu cannot be pulled down), so I could not select the iGenomes reference GTF. It looks like TopHat can only use a built-in index. How can I solve this problem? Thanks in advance. Jianguang

From: galaxy-user-boun...@lists.bx.psu.edu [galaxy-user-boun...@lists.bx.psu.edu] on behalf of Du, Jianguang [jia...@iupui.edu]
Sent: Thursday, August 23, 2012 4:01 PM
To: Jennifer Jackson
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

Hi Jen, Thank you very much for your help. Jianguang

From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Thursday, August 23, 2012 3:53 PM
To: Du, Jianguang
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Should I use iGenomes version of a reference GTF for Tophat?

Hello Jianguang,

On 8/23/12 11:28 AM, Du, Jianguang wrote:
> Hi Jen, Thanks for your help. Do you mean that if I want to find novel isoform/splicing, I need to select No under Use Reference Annotation when I run Cufflinks, and then use the iGenomes version of the reference GTF when I run Cuffmerge?

Yes, according to the tool documentation, this is the method.

> Based on your information and some protocols found online, my understanding is that: 1) if I use the iGenomes version of the reference GTF, I only need to run Cuffmerge with the Cufflinks outputs, because the iGenomes reference GTF already contains attributes such as p_id and tss_id. Then the Cuffmerge output can be used for Cuffdiff.

Yes, this is the example protocol I shared.

> 2) however, if I use the reference GTF from Ensembl/UCSC (rather than from iGenomes), I need to run Cuffcompare to create p_id and tss_id, which are required for Cuffdiff.

This can be tricky; it depends on the order in which you run the tools with and without the GTF annotation. The protocol in #1 is recommended.

> Am I right? Another question is: should I use the iGenomes version of the reference GTF when I run TopHat if I want to see novel isoforms/splicing?

Yes, this is what I intended to answer in my original reply; I apologize if that was not clear. The reference GTF can influence both mapping and assembly, so both TopHat and Cufflinks. The information on the TopHat web site for the parameter provides more detail (see the link on the TopHat tool form). The tool authors can also be contacted if there are details you are curious about that are not covered in the primary documentation: tophat.cuffli...@gmail.com

Others are welcome to add to the thread with their experiences if they have used a reference annotation GTF with TopHat (or chosen not to for a particular reason that they would like to share).

Best,
Jen
Galaxy team
[galaxy-user] Please help me check the quality of the Tophat mapping to reference genome
Dear All, I ran Flagstat under NGS: SAM Tools to check the quality of the TopHat output (the accepted hits file). I got the following results:

9471730 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
9471730 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I ran TopHat with the settings shown below:

Will you select a reference genome from your history or use a built-in index?: Use a built-in index
Select a reference genome: /galaxy/data/mm9/bowtie_index/mm9
Is this library mate-paired?: Single-end
TopHat settings to use: Full parameter list
Library Type: FR Unstranded
Anchor length (at least 3): 8
Maximum number of mismatches that can appear in the anchor region of spliced alignment: 0
The minimum intron length: 70
The maximum intron length: 50
Allow indel search: Yes
Max insertion length: 3
Max deletion length: 3
Maximum number of alignments to be allowed: 20
Minimum intron length that may be found during split-segment (default) search: 50
Maximum intron length that may be found during split-segment (default) search: 50
Number of mismatches allowed in the initial read mapping: 1
Number of mismatches allowed in each segment alignment for reads mapped independently: 1
Minimum length of read segments: 25
Use Own Junctions: Yes
Use Gene Annotation Model: Yes
Gene Model Annotations: iGenomes version of mm9 genes.gtf
Use Raw Junctions: No
Only look for supplied junctions: No
Use Closure Search: No
Use Coverage Search: Yes
Minimum intron length that may be found during coverage search: 50
Maximum intron length that may be found during coverage search: 2
Use Microexon Search: No

Please help me find out what is wrong with the TopHat run.

Thanks,
Jianguang
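The paired-end rows of a flagstat report are driven by bits in the SAM FLAG field, so a run mapped as Single-end (as in the settings listed above) would be expected to show zeros there. The Python sketch below illustrates that counting on toy flag values. The bit constants follow the SAM specification; the function name and the toy inputs are made up for illustration.

```python
# SAM FLAG bits (from the SAM specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
READ1 = 0x40
READ2 = 0x80


def summarize_flags(flags):
    """Count reads per category, roughly in the way flagstat reports them."""
    paired = [f for f in flags if f & PAIRED]
    return {
        "total": len(flags),
        "mapped": sum(1 for f in flags if not (f & UNMAPPED)),
        "paired": len(paired),
        "read1": sum(1 for f in paired if f & READ1),
        "read2": sum(1 for f in paired if f & READ2),
        "properly_paired": sum(1 for f in paired if f & PROPER_PAIR),
    }


# Single-end style flags (0 = forward strand, 16 = reverse strand).
print(summarize_flags([0, 16, 0]))
# Paired-end style flags, like 99 and 147 from a Bowtie paired run.
print(summarize_flags([99, 147]))
```

Under this reading, the zeros for "paired in sequencing", "read1" and "read2" would reflect the Single-end setting rather than a mapping failure; the `-nan%` values are then just divisions by zero.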
[galaxy-user] Please help with the settings for Cufflinks, Cuffmerge and Cuffdiff
Dear All, I am looking for differential splicing events between cell types. Although I have had a lot of help from Jen and from protocols found online, I am still not sure about some settings for Cufflinks, Cuffmerge and Cuffdiff.

1) For Cufflinks: there is a setting for Bias Correction. I set it as follows: Perform Bias Correction: Yes; Reference sequence data: Locally cached. Are these the right settings?
2) For Cuffmerge: for the option on whether to use sequence data, I set it as follows: Use Sequence Data: Yes; Choose the source for the reference list: Locally cached. Are these the right settings?
3) For Cuffdiff: there is again a choice of whether to perform Bias Correction. I set it as follows: Perform Bias Correction: Yes; Reference sequence data: Locally cached. Are these the right settings?

Thanks. Jianguang
[galaxy-user] How to decide if the difference is significant
Dear All, I am looking for differential splicing events between cell types. I have run Cuffdiff and I am going through the output file "splicing differential expression testing". I have read the documentation and protocols about how Cuffdiff tests for differential expression and regulation. However, although I know that changes in relative abundance are quantified by the square root of the Jensen-Shannon divergence, I still cannot grasp the concept (unfortunately I am not good at math and statistics). Is there any way to convert the square root of the Jensen-Shannon divergence into a fold difference? What value of the square root of the Jensen-Shannon divergence corresponds to a 2-fold difference? Thanks. Jianguang
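A small numerical experiment may help with the fold-change question. The Python sketch below computes the square root of the Jensen-Shannon divergence between two sets of relative isoform abundances. It uses base-2 logarithms, which is an assumption made here so that the value runs from 0 to 1; the log base used inside Cuffdiff is not stated in this thread. The function name and the example proportions are made up for illustration.

```python
from math import log2, sqrt


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two
    distributions given as lists of proportions that each sum to 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = (kl(p, m) + kl(q, m)) / 2
    return sqrt(max(0.0, jsd))  # guard against tiny negative rounding errors


# Two made-up genes, each with two isoforms. In both cases the first
# isoform doubles its share (a 2-fold change in relative abundance).
print(js_distance([0.1, 0.9], [0.2, 0.8]))
print(js_distance([0.4, 0.6], [0.8, 0.2]))
```

The two printed values differ even though both cases are a 2-fold change in one isoform's share, which suggests there is no single conversion factor: the measure compares the whole distribution of isoforms rather than one ratio.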
[galaxy-user] Should I use raw junction and Only look for supplied junctions
Dear All, I have two more questions about settings for TopHat. My aim is to look for differential splicing events between cell types. After I checked Use Own Junctions, three more options appeared: 1) Use Gene Annotation Model 2) Use Raw Junctions 3) Only look for supplied junctions. As instructed by Jen, I checked Use Gene Annotation Model and supplied the iGenome mm9 genes.gtf as Gene Model Annotations. However, I am not sure whether I should also choose Use Raw Junctions and Only look for supplied junctions. Please help me set these two options. Thanks. Jianguang
[galaxy-user] Please help to understand the square root of Jensen-Shannon divergence
Dear All, I am looking for differential splicing events between cell types. However, Cuffdiff reports the difference using the square root of the Jensen-Shannon divergence. Although I tried my best to understand the definition of the square root of the Jensen-Shannon divergence, I still cannot understand what a specific value of it means. I would appreciate it very much if anyone could let me know how to convert the square root of the Jensen-Shannon divergence into a fold change. For example, what value of the square root of the Jensen-Shannon divergence corresponds to a 2-fold difference? Thanks in advance. Jianguang
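A minimal sketch of the quantity being asked about, assuming base-2 logarithms (which keeps the value between 0 and 1, consistent with the "never larger than 1" remark in the Cuffdiff documentation). The function name and the example isoform fractions are made up for illustration and are not taken from Cuffdiff itself. The sketch also suggests why there may be no single sqrt(JS) value that corresponds to "2-fold": doubling the share of a minor isoform and doubling the share of a major isoform give different values.

```python
import math

def sqrt_js(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2) of two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x == 0 contribute nothing
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(js, 0.0))

# Hypothetical two-isoform genes; in both, the second isoform's share doubles.
minor_doubles = sqrt_js([0.9, 0.1], [0.8, 0.2])   # 10% -> 20%
major_doubles = sqrt_js([0.6, 0.4], [0.2, 0.8])   # 40% -> 80%
print(minor_doubles, major_doubles)
```

Both comparisons are a 2-fold change in one isoform's fraction, yet the second gives a clearly larger sqrt(JS), so any fixed conversion factor would be misleading.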
[galaxy-user] Tophat settings
Dear All, I am not sure about two TopHat settings. Please help. 1) Number of mismatches allowed in the initial read mapping: Based on the documentation, my understanding is that reads are re-aligned to the transcriptome/genome if the number of mismatches in the initial alignment exceeds the set number (for example, the default setting is 2). In other words, re-aligning continues until the number of mismatches is equal to or below the set number. Is my understanding correct? If I am right, I have one worry: will TopHat stop re-aligning once the number of mismatches is below 2 (if I use the default setting)? If so, the read will not be aligned to where it really belongs (with 0 mismatches). 2) Number of mismatches allowed in each segment alignment for reads mapped independently: Does this mean that reads will be cut into segments if the number of mismatches in the alignment exceeds the set number? Thanks in advance. Jianguang
Re: [galaxy-user] Please help to understand the square root of Jensen-Shannon divergence
Hi Jen, Thank you for your answer. However, the output file transcript differential expression testing gives the ratio (log2 of the fold change) of the FPKM of a specific transcript between two conditions. This fold change in FPKM does not take overall gene expression into consideration (the expression of one gene may be much higher in condition A than in condition B) and therefore cannot be used as a measure of differences in alternative splicing. What I am doing is looking for differences in splicing between two cell types by examining the output file splicing differential expression testing. In this file, column 10 gives the value of sqrt(JS) (the explanation of it is The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants, and the value is never larger than 1). My understanding is that this value has already taken overall gene expression into consideration. But I do not know what value of sqrt(JS) corresponds to a 2-fold change, and I want to focus on alternative splicing with a 2-fold difference between two cell types. Do you know how to convert the value of sqrt(JS) into a fold change? In addition, how should I understand the sentence The splice overloading of the primary transcript? If one gene has 3 transcripts, A, B and C, and the expression of these transcripts is A=60%, B=25%, and C=15%, do you mean the primary transcript is A? Does Cuffdiff take the overall expression (A+B+C=100%) or just the primary transcript (A=60%) into consideration when it calculates the ratio of transcript B? Actually, it would be much easier for us if Cuffdiff calculated the ratio of the expression of the alternatively spliced exon to overall gene expression, and then compared it between conditions. 
Thanks in advance, Jianguang From: Jennifer Jackson [j...@bx.psu.edu] Sent: Thursday, September 06, 2012 12:38 PM To: Du, Jianguang Cc: galaxy-user@lists.bx.psu.edu; closetic...@galaxyproject.org Subject: Re: [galaxy-user] Please help to understand the square root of Jensen-Shannon divergence Hello Jianguang, Fold is included in the Cuffdiff output. Section Differential expression tests, first file, column #9. http://cufflinks.cbcb.umd.edu/manual.html Hopefully this helps, Jen Galaxy team
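The distinction drawn in this message, between a per-transcript FPKM fold change and a change in relative abundance, can be sketched with the A/B/C example. This is an illustration only: the FPKM values for the second condition are invented (every transcript simply doubled), and the helper name is hypothetical. As far as I read the Cufflinks manual, the "primary transcript" is the group of isoforms spliced from one pre-mRNA, so in this example it would be A+B+C together rather than A alone; please treat that reading as an assumption.

```python
import math

def relative_abundance(fpkm):
    """Fraction of the group's total FPKM contributed by each isoform."""
    total = sum(fpkm.values())
    return {name: value / total for name, value in fpkm.items()}

# Condition 1 uses the 60/25/15 split from the message; condition 2 is hypothetical.
cond1 = {"A": 60.0, "B": 25.0, "C": 15.0}
cond2 = {"A": 120.0, "B": 50.0, "C": 30.0}   # whole gene expressed 2-fold higher

frac1 = relative_abundance(cond1)
frac2 = relative_abundance(cond2)

for name in cond1:
    log2_fold = math.log2(cond2[name] / cond1[name])
    print(name, "log2 fold:", log2_fold, "fractions:", frac1[name], frac2[name])
```

Every transcript shows a log2 fold change of 1, yet the fractions are identical in both conditions, so a measure computed on relative abundances would report no splicing difference here.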
[galaxy-user] Number of mismatches allowed in the initial read mapping
Dear All, I tested how to set the Number of mismatches allowed in the initial read mapping as follows. First, I ran FASTQ Groomer on a dataset to get the total number of reads, which is 17510227. Then I ran TopHat with Number of mismatches allowed in the initial read mapping set to 1, and ran flagstat under NGS: SAM Tools on the output. Here are the statistics for the TopHat output:
18162942 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
18162942 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Next I ran TopHat with Number of mismatches allowed in the initial read mapping set to 0, and again ran flagstat under NGS: SAM Tools. Here are the statistics for the TopHat output:
16100027 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
16100027 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Does this mean that about 0.6 million reads are aligned two or more times when I set Number of mismatches allowed in the initial read mapping to 1, whereas about 1.4 million reads cannot be aligned under the more stringent setting of 0? Which value should we choose? Thanks. Jianguang
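The arithmetic behind the "0.6 million" and "1.4 million" figures can be checked directly. One caveat, stated as my understanding rather than something from this thread: flagstat counts alignment records, not distinct reads, so a read reported at several locations contributes several records. The differences below are therefore only rough bounds. The second half of the sketch counts distinct mapped read names in SAM text; the toy records are invented for illustration.

```python
total_reads = 17_510_227          # reads reported by FASTQ Groomer
records_1_mismatch = 18_162_942   # flagstat "in total" with 1 mismatch allowed
records_0_mismatch = 16_100_027   # flagstat "in total" with 0 mismatches allowed

extra_records = records_1_mismatch - total_reads   # records beyond one per read
shortfall = total_reads - records_0_mismatch       # reads with no record at all, at minimum
print(extra_records, shortfall)

def count_mapped_reads(sam_lines):
    """Count distinct read names among mapped records in SAM-formatted text."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):        # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:        # FLAG bit 0x4 means the read is unmapped
            continue
        names.add(fields[0])
    return len(names)

# Hypothetical SAM text: read1 is reported at two locations, read2 at one.
toy_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchr1\t100\t255\t36M\t*\t0\t0\t*\t*",
    "read1\t256\tchr2\t500\t255\t36M\t*\t0\t0\t*\t*",
    "read2\t16\tchr1\t300\t255\t36M\t*\t0\t0\t*\t*",
]
print(len(toy_sam) - 1, "records,", count_mapped_reads(toy_sam), "distinct reads")
```

Counting distinct read names in each accepted hits file, rather than comparing record totals, would show how many reads actually aligned under each setting.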
[galaxy-user] Does Tophat output *.accepted hits file contain headers?
Dear All, I want to use the TopHat output files named .accepted hits for analysis outside Galaxy. However, the program I am using requires the TopHat output to be indexed, sorted BAM files that contain headers. Do the TopHat .accepted hits outputs produced by Galaxy contain headers? Will the headers of BAM files generated by TopHat always be the same? Thanks, Jianguang
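One way to check a downloaded file yourself is to read the header text stored at the start of the BAM. The sketch below relies only on the layout given in the SAM/BAM specification (the magic bytes BAM\1, a 4-byte little-endian text length, then the header text) and on BAM being gzip-compatible; the function names and the file name in the comment are hypothetical. Tools such as samtools can do the same job and are the more usual choice.

```python
import gzip
import struct

def read_bam_header_text(handle):
    """Return the SAM-style header text from a stream of decompressed BAM bytes."""
    if handle.read(4) != b"BAM\x01":
        raise ValueError("stream does not start with the BAM magic bytes")
    (text_length,) = struct.unpack("<i", handle.read(4))
    return handle.read(text_length).decode("ascii", errors="replace").rstrip("\x00")

def read_bam_header_from_path(path):
    """Open a BAM file (BGZF is readable as gzip) and return its header text."""
    with gzip.open(path, "rb") as handle:
        return read_bam_header_text(handle)

def is_coordinate_sorted(header_text):
    """True if the first header line is @HD and declares SO:coordinate."""
    lines = header_text.splitlines()
    return bool(lines) and lines[0].startswith("@HD") and "SO:coordinate" in lines[0].split("\t")

# Usage (hypothetical file name):
#   header = read_bam_header_from_path("accepted_hits.bam")
#   print(header or "no header text stored")
#   print(is_coordinate_sorted(header))
```

An empty string means the file carries no header text. The @SQ lines depend on the reference genome used, so headers from runs against different genomes would not be expected to match.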
[galaxy-user] How much FPKM can be take into consideration when compare gene expression
Dear All, I am comparing gene expression between two cell types by examining the Cuffdiff output file gene differential expression testing. The file lists the FPKM of genes in the two cell types and the log2 of the fold change. I want to look for genes whose expression is more than 2-fold higher in cell type A than in cell type B. What is the minimum FPKM in cell type A, so that only genes with an FPKM higher than this number are taken into consideration for further analysis? For example, the FPKM of gene X is 80 in cell type A and 20 in cell type B, so the fold difference is 4. The FPKM of gene Y is 4 in cell type A and 1 in cell type B, so the fold difference is also 4. Is there a minimum FPKM in cell type A for genes to be selected for further analysis? Thanks. Jianguang
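The gene X and gene Y example can be written as a small filter. The cutoff of 5 FPKM below is purely illustrative: this thread does not give a recommended minimum, and as far as I know there is no universal one. The function and variable names are hypothetical.

```python
import math

MIN_FPKM = 5.0   # hypothetical cutoff for cell type A, not a recommended value

# (gene, FPKM in cell type A, FPKM in cell type B), taken from the message
genes = [("X", 80.0, 20.0), ("Y", 4.0, 1.0)]

def select_genes(rows, min_fpkm, min_log2_fold=1.0):
    """Keep genes above the FPKM floor in A whose log2(A/B) meets the fold threshold."""
    kept = []
    for name, fpkm_a, fpkm_b in rows:
        if fpkm_a < min_fpkm or fpkm_b <= 0:
            continue   # too weakly expressed, or the ratio is undefined
        log2_fold = math.log2(fpkm_a / fpkm_b)
        if log2_fold >= min_log2_fold:
            kept.append((name, log2_fold))
    return kept

print(select_genes(genes, MIN_FPKM))
```

Both genes have the same 4-fold ratio (log2 fold of 2), but only gene X passes the floor. The significance columns that Cuffdiff reports alongside the fold change are probably a better guide than any fixed FPKM floor.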
[galaxy-user] please restore my account
Dear Sir or Madam, I had opened multiple accounts at Galaxy Main and did not know that this is against policy. I noticed the policy only when I found that all the accounts were blocked. Would you please restore the account with the email address jia...@iupui.edu? If you are not responsible for account management, would you please forward this email to the Galaxy administrator, or give me the right email address for the Galaxy administrator (I could not find it)? Thanks. Jianguang
[galaxy-user] Do I need to specify the file format when I upload datasets using FTP method?
Hi Everyone, When I upload my datasets to my history via the FTP method (using FileZilla), do I need to specify the file format under File Format of Upload File from your computer? I noticed that the screencast on how to upload datasets via FTP just leaves the File Format as Auto-detect. However, I also noticed this sentence in the help for Auto-detect: the system will attempt to detect Axt, Fasta, Fastqsolexa, Gff, Gff3, Html, Lav, Maf, Tabular, Wiggle, Bed and Interval (Bed with headers) formats. Do I need to specify the format of my datasets if it is not listed in the sentence above? Thanks. Jianguang
[galaxy-user] is there size limit of dataset for running Tophat?
Hi All, Is there a size limit of dataset for running Tophat at Galaxy? If there is, how many reads is the limit? Thanks. Jianguang ___ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using reply all in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/ To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
[galaxy-user] Parameters for merging BAM files
Hi All, I want to merge the TopHat output (Accepted Hits) of several datasets. I want the merged BAM file to have exactly the same format as the individual input BAM files. Should I check Merge all component bam file headers into the merged bam file? Thanks. Have a nice weekend. Jianguang
[galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?
Hi All, I have a very basic question. I have RNA-seq datasets of several cell types and want to compare alternative splicing events between cell types. The reads are 36nt in length. Are these reads long enough to map to splice junctions accurately when I run TopHat with stringent parameters (no mismatches)? Thanks. Best, Jianguang Du
Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?
Hi Jeremy, Thank you for the information. In addition to reducing the Minimum length of read segments, do I also need to reduce Anchor length to get more mapping on splice junctions? It looks like the setting for Anchor length only affects the number of mapped splice junctions reported in the .splicing junctions output. Is my understanding correct? Does the regions mean the number of mapped splice junctions? Thanks. Best, Jianguang From: Jeremy Goecks [jeremy.goe...@emory.edu] Sent: Tuesday, April 09, 2013 9:03 AM To: Du, Jianguang Cc: galaxy-user@lists.bx.psu.edu Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions? 36bp reads will map across splice junctions but at a relatively low rate; you can try changing segment length to get better mapping, but you'll want to evaluate the results carefully to ensure that you're getting good results. Good luck, J.
Re: [galaxy-user] Parameters for merging BAM files
Hi Jen, Thanks for the information. I used this setting and the merged BAM files (.accepted hits) worked quite well for the downstream analysis. Best, Jianguang From: Jennifer Jackson [j...@bx.psu.edu] Sent: Tuesday, April 09, 2013 4:10 PM To: Du, Jianguang Cc: galaxy-user@lists.bx.psu.edu Subject: Re: [galaxy-user] Parameters for merging BAM files Hello Jianguang, This setting is recommended to be used. It will merge all headers, but if there are differences between the input files these will be combined together in the final output. If you want to see what this content will be, convert the BAM files before and after the merge to SAM format (using the option to include headers) and review the results. You can always later delete these permanently to recover disk space. Hopefully this helps, Jen Galaxy team
Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?
Hi Jeremy, Thank you very much for the reply. I have some more questions on the same topic. 1) My reads are 36nt long. What value should I set for the Minimum length of read segments to get the most reliable output with the highest mapping to splice junctions? In my previous run of TopHat, I set it to 18. Can I reduce it further to get better mapping to splice junctions? 2) I do not understand exactly how TopHat uses the Anchor length, although I have read the TopHat manual. Suppose I set the Anchor length to 8 and the Maximum number of mismatches that can appear in the anchor region of spliced alignment to 0 when I run TopHat. Does this mean that, for a read that maps to two adjacent exons, TopHat will report the alignment in the outputs .accepted hits and .splicing junctions if either end of the read has 8 or more nucleotides mapping to one exon? 3) Is there a disadvantage or negative effect if I set the Anchor length to the lowest value, for example 3? My understanding is that, under the 0 mismatch condition, if 3 nucleotides at one end of a read map to one exon, the other part of the read will map to the adjacent exon (in my case, the other part would be 33 nucleotides). So my understanding is that setting the Anchor length to 3 does not make the alignment less accurate. Am I correct? Best, Jianguang From: Jeremy Goecks [jeremy.goe...@emory.edu] Sent: Tuesday, April 09, 2013 1:57 PM To: Du, Jianguang Cc: galaxy-user@lists.bx.psu.edu Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions? In addition to reducing the Minimum length of read segments, do I also need to reduce Anchor length to get more mapping on splice junctions? Definitely worth a try. It looks like the setting for Anchor length only affects the number of mapped splice junctions reported in the .splicing junctions output. Is my understanding correct? No, it will affect mapped reads as well. Does the regions mean the number of mapped splice junctions? Yes. Best, J.
Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions?
Hi Jeremy, Thank you very much for your reply. I have one more question about the Anchor length. For an RNA-seq read mapped to a splice junction under the 0 mismatch condition, if 5 nucleotides at one end map to one exon, does it mean the rest of the read must map to the adjacent exon? What I want to understand is this: although reducing the Anchor length may reduce the reliability of mapping at one end/exon, the increased number of nucleotides mapped to the adjacent exon may increase the reliability of mapping. Does this mean that the overall reliability of mapping is unchanged? Best, Jianguang From: Jeremy Goecks [jeremy.goe...@emory.edu] Sent: Wednesday, April 10, 2013 3:16 PM To: Du, Jianguang Cc: galaxy-user@lists.bx.psu.edu Subject: Re: [galaxy-user] Are reads of 36nt in length long enough to accutatly map on splicing junctions? 1) My reads are 36nt long. What value should I set for the Minimum length of read segments to get the most reliable output with the highest mapping to splice junctions? In my previous run of TopHat, I set it to 18. Can I reduce it further to get better mapping to splice junctions? You'll need to define for yourself what you mean by better/best mapping and experiment to find the parameters that give you the best results. 2) I do not understand exactly how TopHat uses the Anchor length, although I have read the TopHat manual. Suppose I set the Anchor length to 8 and the Maximum number of mismatches that can appear in the anchor region of spliced alignment to 0 when I run TopHat. Does this mean that, for a read that maps to two adjacent exons, TopHat will report the alignment in the outputs .accepted hits and .splicing junctions if either end of the read has 8 or more nucleotides mapping to one exon? I think that's correct. 3) Is there a disadvantage or negative effect if I set the Anchor length to the lowest value, for example 3? My understanding is that, under the 0 mismatch condition, if 3 nucleotides at one end of a read map to one exon, the other part of the read will map to the adjacent exon (in my case, the other part would be 33 nucleotides). So my understanding is that setting the Anchor length to 3 does not make the alignment less accurate. Am I correct? Setting the anchor length especially small reduces the constraints on mapping, so more reads will map but there are likely to be more false positives as well. Good luck, J.
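The trade-off Jeremy describes can be illustrated with a rough calculation. Two assumptions are made here and neither comes from this thread: first, that the anchor length must be met on each side of the junction (this is how I read the TopHat manual's description of the anchor length option), and second, that bases are uniformly random, which real genomes are not. The function names are hypothetical.

```python
def junction_offsets(read_length, anchor_length):
    """Split points that leave at least anchor_length bases on each side of a junction."""
    return list(range(anchor_length, read_length - anchor_length + 1))

def chance_match_probability(anchor_length):
    """Probability that an anchor matches one given location by chance (uniform bases)."""
    return 0.25 ** anchor_length

for anchor in (3, 5, 8):
    offsets = junction_offsets(36, anchor)
    print("anchor", anchor,
          "usable split points:", len(offsets),
          "chance match per site:", chance_match_probability(anchor))
```

With 36nt reads, lowering the anchor from 8 to 3 makes more split points usable, but a 3-base anchor matches a given site by chance about once in 64 tries under this crude model. The long side of the read pins down one exon, while the short anchor alone has to identify the other, which is where the extra false positives would come from.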
[galaxy-user] Which Library Type should I use for single-end reads
Hi All, I have a very basic question about parameters for running TopHat. I have datasets of single-end reads. These datasets were generated with Illumina Genome Analyzer IIx. Which Library Type should I choose to run TopHat? Thanks. Best, Jianguang
[galaxy-user] View details of Tophat alignment
Hi All, After I finished a TopHat alignment for RNA-seq, I looked at the details of the parameters by clicking the View details icon, and I got the information shown below:
Input Parameter | Value | Note for rerun
RNA-Seq FASTQ file | 73: Filtered Groomed data1_rep2 |
Use a built in reference genome or own from your history | indexed |
Select a reference genome | /galaxy/data/mm9/bowtie_index/mm9 |
Is this library mate-paired? | single |
TopHat settings to use | full |
Library Type | FR Unstranded |
Anchor length (at least 3) | None |
Maximum number of mismatches that can appear in the anchor region of spliced alignment | None |
The minimum intron length | None |
The maximum intron length | None |
Allow indel search | No |
Maximum number of alignments to be allowed | None |
Minimum intron length that may be found during split-segment (default) search | None |
Maximum intron length that may be found during split-segment (default) search | None |
Number of mismatches allowed in the initial read mapping | None |
Number of mismatches allowed in each segment alignment for reads mapped independently | None |
Minimum length of read segments | None |
Use Own Junctions | Yes |
Use Gene Annotation Model | Yes |
Gene Model Annotations | 1: mm9 genes.gtf |
Use Raw Junctions | No |
Only look for supplied junctions | No |
Use Closure Search | No |
Use Coverage Search | Yes |
Minimum intron length that may be found during coverage search | None |
Maximum intron length that may be found during coverage search | None |
Use Microexon Search | No |
I am totally confused by so many Nones. I then checked the workflow I set up and used for the TopHat alignment, and the details are the same as above. However, the brief description just under the title of the alignment output (.accepted hits) is as below:
format: bam, database: mm9
Tophat for Illumina on data 1 and data 73: accepted_hits, TopHat v1.4.0
tophat -p 8 -a 8 -m 0 -i 70 -I 50 -g 20 -G /galaxy/main_pool/pool1/files/004/425/dataset_4425972.dat --library-type fr-unstranded --no-novel-indels --coverage-search --min-cove
Could you please tell me whether anything is wrong (given so many None values in the parameter details)? Thanks. Jianguang DU
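The command line shown in the dataset description records what was actually passed to TopHat, so it can be read back even when the details page shows None. Below is a small sketch that splits that line into options. The command string is copied from the message, including the final token that the display cut off, which is left as-is. The long option names in SHORT_NAMES are how I recall the TopHat manual naming them, so please verify them against the manual for your version; the parser itself is a hypothetical helper, not part of Galaxy or TopHat.

```python
import shlex

COMMAND = (
    "tophat -p 8 -a 8 -m 0 -i 70 -I 50 -g 20 "
    "-G /galaxy/main_pool/pool1/files/004/425/dataset_4425972.dat "
    "--library-type fr-unstranded --no-novel-indels --coverage-search --min-cove"
)  # the last token is truncated in the original display and is not completed here

# Assumed long names for the short options (check the TopHat manual).
SHORT_NAMES = {
    "-p": "num-threads", "-a": "min-anchor-length", "-m": "splice-mismatches",
    "-i": "min-intron-length", "-I": "max-intron-length",
    "-g": "max-multihits", "-G": "GTF",
}

def parse_flags(command):
    """Map each option to its value, or to True for options given without a value."""
    tokens = shlex.split(command)[1:]   # drop the program name
    params = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            params[token] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1                      # positional argument, ignored in this sketch
    return params

params = parse_flags(COMMAND)
for flag, value in params.items():
    print(SHORT_NAMES.get(flag, flag.lstrip("-")), "=", value)
```

Comparing this list with the values set in the workflow is a quick way to confirm that the run used the intended settings.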
[galaxy-user] Which Input FASTQ quality scores type should I choose when run FASTQ Groomer?
Hi All, I downloaded some RNA-seq datasets from NCBI. The datasets were generated on an Illumina HiSeq 2000. I am not sure which Input FASTQ quality scores type I should choose when I run FASTQ Groomer. Below are the scores of 2 reads from a dataset; I renamed them read 1 and read 2.
1) Sequence and quality score as displayed in Galaxy:
@read 1 length=51
NTGAGATTCTTGACTAGTTATTTCTGCTTTCAGGGAAGAAATCAGCTGGGC
+read 1 length=51
#1=ADADEHIIGIHJGJJJHJIIJJJH@HEGBFH;FHEH@HI
@read 2 length=51
NGAAGAGTCAGTTGTTTCCCTCATAACTTGCTAGATTCCGGATTGCT
+read 2 length=51
#1=DDDEDHHFHHJIJJHIIIJJJIJJJIJIJJII
2) Sequence and one-channel quality score as shown in the SRA at NCBI when I downloaded the dataset:
gnl|SRA|read 1
NTGAGATTCTTGACTAGTTATTTCTGCTTTCAGGGAAGAAATCAGCTGGGC
One channel quality score: 2 16 28 32 35 32 35 36 39 39 39 39 39 40 40 38 40 39 41 38 41 41 41 39 41 40 40 41 41 41 39 31 39 36 38 33 37 39 26 37 39 36 39 29 31 39 40 41 41 41 41
gnl|SRA|read 2
NGAAGAGTCAGTTGTTTCCCTCATAACTTGCTAGATTCCGGATTGCT
One channel quality score: 2 16 28 35 35 35 36 35 39 39 37 39 39 41 41 41 41 41 40 41 41 39 40 40 40 41 41 41 40 41 41 41 41 41 41 41 40 41 40 41 41 41 41 41 41 40 41 41 41 41 40
It looks like the dataset was generated with Illumina pipeline version 1.8 or later, because some of the bases have a quality score of 41. Can I choose sanger as the Input FASTQ quality scores type when I run FASTQ Groomer? Thanks. Jianguang Du
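The two displays in this message can be compared directly: if the quality characters decode to the SRA numbers with an offset of 33, the file uses Sanger-style (Phred+33) encoding. The sketch below decodes the first eight quality characters of read 1 as shown above. The helper names are hypothetical, and the offset guess only uses the fact that characters below ';' (ASCII 59) cannot occur in the older +64 encodings.

```python
def decode_phred(quality, offset=33):
    """Convert a FASTQ quality string to integer scores using the given ASCII offset."""
    return [ord(ch) - offset for ch in quality]

def guess_offset(quality):
    """Return 33 when a character rules out the +64 encodings, else None (ambiguous)."""
    return 33 if min(ord(ch) for ch in quality) < 59 else None

prefix = "#1=ADADE"                       # first eight quality characters of read 1
sra_scores = [2, 16, 28, 32, 35, 32, 35, 36]   # first eight SRA scores for read 1

print(guess_offset(prefix))
print(decode_phred(prefix) == sra_scores)
print(decode_phred("J"))                  # 'J' is the character for score 41 at offset 33
```

The decoded values match the SRA scores for these positions, which supports choosing the Sanger quality type for this dataset.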