Re: [galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-14 Thread Yang Bi
Hi Jen:

I still have a small problem with the chromosome names. The chloroplast and 
mitochondrial genes are on ChrC and ChrM in the gff3 file, and I need to 
change these to chrC and chrM. How do I change the case of only the initial 
letter rather than the entire name?

Thanks
Yang 


___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-14 Thread Jennifer Jackson

Hi Yang,

Here is a method to do this. In short, you'll split the dataset into 
three parts, alter two of them, then merge the three resulting datasets 
back together. Once you have completed the method, a workflow can be 
extracted from the history and saved for future use.


1. Use 'Filter and Sort - Select'.

   The default string would match all of the lines in your dataset. 
   Alter it to create three files, using "Matching" for all three:

   All chroms, minus ChrM and ChrC
   ^chr([0-9])+

   ChrM
   ^ChrM

   ChrC
   ^ChrC

2. For the ChrM and ChrC datasets, use 'Text Manipulation - Add column' 
on each file individually. The new column should hold the final desired 
name, e.g. chrM or chrC.

3. For both results, use 'Text Manipulation - Cut' to replace column 1 
with the new column.

4. Use the tool 'Concatenate datasets' to combine the three files again, 
using the new results.

5. Reassign the metadata as needed using the pencil icon.

These tools all work on the tabular datatype and generally on other text 
data, but assign a dataset to tabular format using the pencil icon if 
a tool does not recognize it. This is fine until the last step, where 
you can set it back to GFF.
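
For anyone working outside Galaxy, the same rename can be sketched in a few lines of Python. This is only an illustration, not part of the Galaxy method above: the function names and file names are hypothetical, and it assumes a tab-separated GFF3 file whose seqids all start with "Chr". It lowercases only the leading "C" of column 1, so Chr1 becomes chr1 and ChrM becomes chrM.

```python
def fix_seqid_case(line: str) -> str:
    """Rewrite a leading 'Chr' in column 1 of a GFF3 line to 'chr'.

    Comment and directive lines (starting with '#') and blank lines are
    returned unchanged. Assumption: '##sequence-region' directives, if
    present, are handled separately.
    """
    if not line.strip() or line.startswith("#"):
        return line
    seqid, sep, rest = line.partition("\t")
    if seqid.startswith("Chr"):
        seqid = "chr" + seqid[3:]  # only the first letter changes case
    return seqid + sep + rest


def fix_gff3(in_path: str, out_path: str) -> None:
    """Stream a GFF3 file and write a copy with corrected seqids."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fix_seqid_case(line))
```

A call such as fix_gff3("annotation.gff3", "annotation.chr_fixed.gff3") (hypothetical file names) would write the corrected copy; the result can then be uploaded to Galaxy and assigned the GFF datatype.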







--
Jennifer Hillman-Jackson
http://galaxyproject.org



Re: [galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-13 Thread Jennifer Jackson

Hello,

It looks like the data is mapping as novel, i.e. not linked with the 
reference annotation. A few factors can cause this for part of a dataset 
(often desirable), but when it occurs for an entire dataset there is 
usually a data mismatch or a parameter issue.


The first item I always check is that the reference genome is the same 
across inputs. Do this by confirming that the identifiers in the 
reference GFF file are the same as those in the Tophat BAM output 
(convert to SAM, with headers, to see the chromosome names). For the GFF 
file, the tool 'Join, Subtract and Group - Group', run on the first column 
(chromosome name) with the action "count distinct", will isolate these.
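
The same identifier check can be sketched outside Galaxy. The following is a minimal Python illustration, not a Galaxy tool: the function names are hypothetical, and it assumes the BAM has already been converted to SAM with headers, so that reference names appear as SN: fields on @SQ lines. It reports names found in only one of the two inputs.

```python
def gff_seqids(lines):
    """Collect the distinct column-1 values from GFF lines, skipping comments."""
    return {
        ln.split("\t", 1)[0]
        for ln in lines
        if ln.strip() and not ln.startswith("#")
    }


def sam_header_seqids(lines):
    """Collect the SN: names from the @SQ header lines of a SAM file."""
    names = set()
    for ln in lines:
        if not ln.startswith("@"):
            break  # the header ends at the first alignment record
        if ln.startswith("@SQ"):
            for field in ln.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def report_mismatch(gff_lines, sam_lines):
    """Return (names only in the GFF, names only in the SAM header)."""
    gff, sam = gff_seqids(gff_lines), sam_header_seqids(sam_lines)
    return sorted(gff - sam), sorted(sam - gff)
```

Both returned lists are empty when the naming agrees. A case difference such as Chr1 versus chr1 shows up as an entry in each list.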


But the real problem could be in the parameters, see below:

On 1/11/14 10:43 PM, Yang Bi wrote:

 Dear all:

 I am new to Galaxy and I followed online tutorials/tips to analyze my RNA seq 
 data for alternative splicing. I used tophat for illumina to align my 
 sequencing data after QC/filtering. Other than setting min intron to 20, I 
 used the default settings. Then I feed the accepted hit files to cufflink. I 
 set Min isoform fraction to 0, use annotation (tair10 gff3) as guide and 
 choose yes for perform bias correction (locally cached tair10).

My guess is that this Cufflinks run had the same issue - have you 
checked it? Setting 'Min isoform fraction' to 0 may be problematic (I 
have never run Cufflinks this way). It may seem that this setting is 
permissive, capturing even very small expression levels, but it may 
have had the reverse effect of not assigning any reads.


(The Tophat run with min intron at 20 is quite low/sensitive, but with 
a smaller genome this probably will not cause memory issues with the 
mapping. Was this set based on the genome having transcripts with known, 
characterized introns this short? I didn't check, but you can in the 
reference GFF file.)


Maybe double check the above Cufflinks run and confirm the results were as 
expected, then try the Cufflinks default (0.1) as a first-pass test to see 
how that works out. If you want to make this more sensitive in a 
subsequent run, you could try 0.01, although how significant those 
results are, given this genome and your specific input data, would need 
to be evaluated.


After that, if you are still having trouble, please feel free to share a 
history link and we can try to help (copy and email a share link from 
the public server, direct to me, to keep your data private). Here is how:

https://wiki.galaxyproject.org/Support#Shared_and_Published_data

Hopefully the parameter change works, or a reference genome issue is 
found and corrected, but if not, I'll watch for your email,


Jen
Galaxy team



Re: [galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-13 Thread Yang Bi
Hi Jen:

Thank you for the prompt reply. The FPKMs produced by Cufflinks look normal 
(from an assembled transcript file):

Seqname  Source     Feature     Start  End    Score  Strand  Frame  Attributes
chr1     Cufflinks  transcript  11960  13178  1000   .       .      gene_id CUFF.180; transcript_id CUFF.180.1; FPKM 6.5441928094; frac 1.00; conf_lo 3.594986; conf_hi 8.987465; cov 2.413218; full_read_support yes;
chr1     Cufflinks  exon        11960  13178  1000   .       .      gene_id CUFF.180; transcript_id CUFF.180.1; exon_number 1; FPKM 6.5441928094; frac 1.00; conf_lo 3.594986; conf_hi 8.987465; cov 2.413218;
chr1     Cufflinks  transcript  4536   5314   1000   +       .      gene_id CUFF.178; transcript_id CUFF.178.1; FPKM 11.0556332840; frac 1.00; conf_lo 3.645830; conf_hi 13.216134; cov 4.076844; full_read_support no;
chr1     Cufflinks  exon        4536   4605   1000   +       .      gene_id CUFF.178; transcript_id CUFF.178.1; exon_number 1; FPKM 11.0556332840; frac 1.00; conf_lo 3.645830; conf_hi 13.216134; cov 4.076844;
chr1     Cufflinks  exon        4706   5095   1000   +       .      gene_id CUFF.178; transcript_id CUFF.178.1; exon_number 2; FPKM 11.0556332840; frac 1.00; conf_lo 3.645830; conf_hi 13.216134; cov 4.076844;
chr1     Cufflinks  exon        5174   5314   1000   +       .      gene_id CUFF.178; transcript_id CUFF.178.1; exon_number 3; FPKM 11.0556332840; frac 1.00; conf_lo 3.645830; conf_hi 13.216134; cov 4.076844;

I checked the chromosome names and realized that the BAM outputs use lower 
case for RNAME, e.g. chr1, while my gff3 file uses an initial capital letter 
for seqid, e.g. Chr1. Could this be the problem? What is the fastest way to 
convert the capital C in my gff3 file to lower case?

Thank you very much
Yang
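
As a side note, per-transcript FPKM values like those shown above can be pulled out of a Cufflinks GTF with a short script. This is a rough Python sketch under stated assumptions: the function names are hypothetical, the file is tab-separated with the attributes in column 9, and attribute values may or may not be wrapped in double quotes.

```python
def parse_attributes(attr_field):
    """Split a GTF attribute string such as 'gene_id X; FPKM 1.0;' into a dict."""
    attrs = {}
    for part in attr_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')  # tolerate quoted values
    return attrs


def transcript_fpkms(gtf_lines):
    """Map transcript_id -> FPKM for every 'transcript' record."""
    result = {}
    for ln in gtf_lines:
        if not ln.strip() or ln.startswith("#"):
            continue
        cols = ln.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # skips exon records and any column-header line
        attrs = parse_attributes(cols[8])
        result[attrs["transcript_id"]] = float(attrs["FPKM"])
    return result
```

Non-zero values here, combined with zeros in the tmap file, point at the comparison step rather than at the Cufflinks assembly itself.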


Re: [galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-13 Thread Jennifer Jackson

Hello Yang,

Glad the problem was isolated - the mismatched chromosome names are 
definitely something to fix.


The tools in 'Text Manipulation' can help. The tool 'Change Case of 
selected columns' can change the case for you. Click on the pencil icon 
after running the tool to reassign the datatype correctly as needed.


Take care,

Jen
Galaxy team



--
Jennifer Hillman-Jackson
http://galaxyproject.org



[galaxy-user] all FPKMs are 0 in the tmap files produced by cuffcompare

2014-01-11 Thread Yang Bi
Dear all:

I am new to Galaxy and I followed online tutorials/tips to analyze my RNA-seq 
data for alternative splicing. I used Tophat for Illumina to align my 
sequencing data after QC/filtering. Other than setting min intron to 20, I used 
the default settings. Then I fed the accepted hits files to Cufflinks. I set 
Min isoform fraction to 0, used the annotation (TAIR10 gff3) as a guide, and 
chose yes for perform bias correction (locally cached TAIR10). I merged the 
assembled transcripts with Cuffmerge and used Cuffcompare to compare the 
resulting merged assembled transcripts to the reference annotation file 
(TAIR10 gff3). I chose yes for use sequence data, with the locally cached 
TAIR10 as the reference list. I get this for the transcript accuracy analysis:

# Cuffcompare v2.1.1 | Command line was:
#cuffcompare -o cc_output -r /galaxy-repl/main/files/007/386/dataset_7386886.dat -s /galaxy/data/Arabidopsis_thaliana_TAIR10/sam_index/Arabidopsis_thaliana_TAIR10.fa ./input1
#

#= Summary for dataset: ./input1 :
# Query mRNAs :   72778 in   51779 loci  (57559 multi-exon transcripts)
#            (12679 multi-transcript loci, ~1.4 transcripts per locus)
# Reference mRNAs :   42163 in   33350 loci  (30127 multi-exon)
# Corresponding super-loci:  33140
#                    |   Sn   |  Sp   |  fSn |  fSp
         Base level:    100.0    62.7     -       -
         Exon level:    104.6    59.5   100.0    60.5
       Intron level:    100.0    55.5   100.0    56.5
 Intron chain level:     98.3    51.5   100.0    60.3
   Transcript level:     98.7    57.2    94.8    54.9
        Locus level:     99.4    64.0   100.0    64.1

 Matching intron chains:   29618
          Matching loci:   33147

   Missed exons:       1/169820  (  0.0%)
    Novel exons:  128021/298149  ( 42.9%)
 Missed introns:       0/127896  (  0.0%)
  Novel introns:  102614/230568  ( 44.5%)
    Missed loci:       1/33350   (  0.0%)
     Novel loci:    2962/51779   (  5.7%)

 Total union super-loci across all input datasets: 51779

For the tmap file, all my FPKMs are 0:

ref_gene_id  ref_id       class_code  cuff_gene_id  cuff_id     FMI  FPKM  FPKM_conf_lo  FPKM_conf_hi  cov   len   major_iso_id  ref_match_len
AT1G01010    AT1G01010.1  =           AT1G01010     TCONS_0001  0    0.00  0.00          0.00          0.00  1688  TCONS_0001    1688
AT1G01040    AT1G01040.1  =           AT1G01040     TCONS_0002  0    0.00  0.00          0.00          0.00  6251  TCONS_0002    6251
AT1G01040    AT1G01040.2  =           AT1G01040     TCONS_0003  0    0.00  0.00          0.00          0.00  5877  TCONS_0002    5877
AT1G01046    AT1G01046.1  =           AT1G01046     TCONS_0004  0    0.00  0.00          0.00          0.00  207   TCONS_0004    207
AT1G01073    AT1G01073.1  =           AT1G01073     TCONS_0005  0    0.00  0.00          0.00          0.00  111   TCONS_0005    111
AT1G01110    AT1G01110.2  =           AT1G01110     TCONS_0006  0    0.00  0.00          0.00          0.00  1782  TCONS_0006    1782
AT1G01110    AT1G01110.1  =           AT1G01110     TCONS_0007  0    0.00  0.00          0.00          0.00  1439  TCONS_0006    1439
AT1G01115    AT1G01115.1  =           AT1G01115     TCONS_0008  0    0.00  0.00          0.00          0.00  117   TCONS_0008    117
AT1G01160    AT1G01160.1  =           AT1G01160     TCONS_0009  0    0.00  0.00          0.00          0.00  1045  TCONS_0010    1045
AT1G01160    AT1G01160.2  =           AT1G01160     TCONS_0010  0    0.00  0.00          0.00          0.00  1129  TCONS_0010    1129
AT1G01180    AT1G01180.1  =           AT1G01180     TCONS_0011  0    0.00  0.00          0.00          0.00  1176  TCONS_0011    1176
AT1G01210    AT1G01210.1  =           AT1G01210     TCONS_0012  0    0.00  0.00          0.00          0.00  616   TCONS_0012    616
AT1G01220    AT1G01220.1  =           AT1G01220     TCONS_0013  0    0.00  0.00          0.00          0.00  3532  TCONS_0013    3532

The FPKMs were normal in the assembled transcripts produced by Cufflinks.

Please enlighten me on the possible mistakes that I have made. I really 
appreciate your help.

Best
Yang 
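
To confirm that every row of a tmap file really has an FPKM of zero, rather than judging from the first few lines, a quick count can help. The sketch below is a hypothetical Python helper, not part of Cuffcompare or Galaxy. It assumes the tmap file is tab-separated with a header line containing an FPKM column, as in the header shown in the message above.

```python
import csv


def zero_fpkm_summary(tmap_lines):
    """Return (rows with FPKM == 0, total rows) for a tab-separated tmap table."""
    reader = csv.DictReader(tmap_lines, delimiter="\t")
    total = 0
    zero = 0
    for row in reader:
        total += 1
        if float(row["FPKM"]) == 0.0:
            zero += 1
    return zero, total
```

The function accepts any iterable of lines, so an open file handle can be passed directly. If the two returned numbers are equal, no transcript in the comparison carried an expression value.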