Hello Jianpeng,

The output files from TopHat are described on the TopHat tool form:

--- quote ---
Outputs

Tophat produces two output files:

    junctions -- A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
    accepted_hits -- A list of read alignments in BAM format.

Two other possible outputs, depending on the options you choose, are insertions and deletions, both of which are in BED format.
-------------

BED format is described in the Galaxy wiki, which includes links to the UCSC BED format description (UCSC authored the format).
http://wiki.g2.bx.psu.edu/Learn/Datatypes#Bed

Two important rules to remember about BED format:

rule #1: coordinate data is already reported with respect to the (+) strand

rule #2: "start" is always the smallest coordinate and "end" is always the largest coordinate, whatever the strand, as a consequence of rule #1.

BED files use 0-based, half-open coordinates: the "start" position is 0-based, and the "end" position is exclusive (it is the first base after the feature). Browsers display positions as 1-based and fully closed. This means you'll need to add "1" to any "start" coordinate in a .bed file to locate it in a display application. The two will not and should not match. The "end" coordinate needs no adjustment: a 0-based exclusive end is numerically the same as a 1-based inclusive end, so it will match between data files and display applications.
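
As a small illustration of the start/end convention, here is a Python sketch of the conversion from BED coordinates to the 1-based position a browser would show. The function name bed_to_display is made up for this example; it is not part of Galaxy, TopHat, or UCSC.

```python
def bed_to_display(chrom, start, end):
    """Convert a BED interval (0-based start, exclusive end) into a
    1-based, inclusive position string of the kind browsers display.

    Hypothetical helper for illustration only.
    """
    # Only the start shifts by one; the exclusive end already equals
    # the 1-based inclusive end.
    return f"{chrom}:{start + 1}-{end}"


# Columns 1-3 of the first junction row in this thread
print(bed_to_display("chr20", 199821, 204701))  # chr20:199822-204701

# The feature length is simply end - start in BED coordinates
print(204701 - 199821)  # 4880
```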

-------------
Using the first data row as an example and this information, we can tell that:

chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787

* column 5 is 'score', or 'number of alignments spanning the junction'. In this case, "17" alignments.

* column 11 is the blockSizes, or 'read maximal overhang' of the junctions (max alignment length). The first is 87 bases, the second 93 bases.

* column 12 is the blockStarts, or 'overhang start' of the junctions (alignment start), given as offsets from the start coordinate in column 2. The first is 0, the second 4787 bases. In BED format the first blockStart is always 0. The second is not the 'intron' length by itself: it also includes the first block, so the 'intron' length is the second blockStart minus the first blockSize (4787 - 87 = 4700 bases).
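
For anyone who wants to pull these columns out programmatically, here is a minimal Python sketch that splits one junction row into named fields. The field names follow the UCSC BED description; the Junction type and the parse_junction function are hypothetical names chosen for this example, and the sketch assumes a complete 12-column, whitespace-separated row.

```python
from collections import namedtuple

# Column names as given in the UCSC BED format description
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount",
    "blockSizes", "blockStarts",
]
Junction = namedtuple("Junction", BED12_FIELDS)  # hypothetical type


def parse_int_list(text):
    """Turn '87,93' (a trailing comma is tolerated) into [87, 93]."""
    return [int(x) for x in text.split(",") if x]


def parse_junction(line):
    """Parse one 12-column BED row into a Junction (illustrative only)."""
    cols = line.split()
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    return Junction(
        chrom=cols[0], chromStart=int(cols[1]), chromEnd=int(cols[2]),
        name=cols[3], score=int(cols[4]), strand=cols[5],
        thickStart=int(cols[6]), thickEnd=int(cols[7]), itemRgb=cols[8],
        blockCount=int(cols[9]),
        blockSizes=parse_int_list(cols[10]),
        blockStarts=parse_int_list(cols[11]),
    )


row = "chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787"
j = parse_junction(row)
print(j.score, j.blockSizes, j.blockStarts)  # 17 [87, 93] [0, 4787]
```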

Some calculations can be done with these numbers with respect to the overall position of the junction already defined in columns 1,2,3 (chrom, start, end): chr20:199821-204701 (-). Together they define the location of the predicted splices, the flanking aligned regions, and the (presumed) 'intron'. This example looks tricky because the alignment is on the (-) strand, but because of rule #1 the strand does not change the arithmetic: block starts are always offsets from the start coordinate (column 2), counted along the (+) strand. The strand only tells you which block comes first in the transcript (for a (-) strand junction, the block nearest the end coordinate). If this sounds confusing, that's because it is! When you visualize the data the concept will make more sense and it is definitely worth learning about.

Brief explanation: The first block start is 0, which literally means that the first block begins at the very start of the junction (0-based), at position 199821 on chr20. This block continues for 87 bases and ends at (199821 + 87) = 199908. Then the splice would be present. The second block starts at (199821 + 4787) = 204608 and continues for 93 bases. This places its end at (204608 + 93) = 204701, which is the same as the reported global junction end position. And, it all adds up. The (presumed) 'intron' lies between the two blocks, from 199908 to 204608 in BED coordinates, and is (204608 - 199908) = 4700 bases long. In a browser it would display as chr20:199909-204608.
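
The block arithmetic can be double-checked with a short Python sketch. It relies on the UCSC BED definition that blockStarts are offsets from the start coordinate (column 2) regardless of strand. The function names junction_blocks and intron_between are made up for this example, and the second one assumes exactly two blocks, as in TopHat junction rows.

```python
def junction_blocks(chrom_start, block_sizes, block_starts):
    """Return genomic (start, end) pairs for each BED block.

    Coordinates are 0-based, half-open, as in the .bed file.
    Hypothetical helper for illustration only.
    """
    return [
        (chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(block_starts, block_sizes)
    ]


def intron_between(blocks):
    """Return the (start, end) gap between two blocks, assuming two blocks."""
    (_, left_end), (right_start, _) = blocks
    return left_end, right_start


# First junction row: start 199821, blockSizes 87,93, blockStarts 0,4787
blocks = junction_blocks(199821, [87, 93], [0, 4787])
print(blocks)  # [(199821, 199908), (204608, 204701)]

start, end = intron_between(blocks)
print(start, end, end - start)  # 199908 204608 4700
```

The end of the last block equals the junction end in column 3 (204701), which is a quick sanity check that a row was read correctly.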

Trackster would be a good place to start for visualization (use the "Visualization" link in the top menu bar). The dataset can also be saved as a regular .bed file and loaded as a custom track into the UCSC Genome Browser (if the direct link is not fully configured yet).

Hopefully this helps,

Jen
Galaxy team

On 4/17/12 7:20 AM, Xu, Jianpeng wrote:
Hi,

I have installed a local Galaxy. I used TopHat to do the RNA-seq
alignment and got an output file: splice junctions in BED format.

I cannot understand it clearly. What do the numbers 17, 14 ... in
column 5 mean? What does the 87,93 mean? What does the 0,4787 mean?
Can you explain a little bit to me? Which tool can be used to view this
file?

Thanks,

track name=junctions description="TopHat junctions"
chr20  199821  204701  JUNC00000001  17  -  199821  204701  255,0,0  2  87,93  0,4787
chr20  204631  205520  JUNC00000002  14  -  204631  205520  255,0,0  2  96,87  0,802
chr20  205428  205775  JUNC00000003  9   -  205428  205775  255,0,0  2  92,91  0,256
chr20  205699  205958  JUNC00000004  15  -  205699  205958  255,0,0  2  87,92  0,167
chr20  205929  207067  JUNC00000005  31  -  205929  207067  255,0,0  2  95,97  0,1041
chr20  206977  207909  JUNC00000006  19  -  206977  207909  255,0,0  2  93,97  0,835
chr20  207884  212679  JUNC00000007  15  -  207884  212679  255,0,0  2  87,76  0,4719
chr20  207910  218238  JUNC00000008  1   -  207910  218238  255,0,0  2  61,39  0,10289
chr20  212628  218293  JUNC00000009  28  -  212628  218293  255,0,0  2  94,94  0,5571



___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

   http://lists.bx.psu.edu/

--
Jennifer Jackson
http://galaxyproject.org