Hello Jianpeng,
The output files from TopHat are described on the TopHat tool form:
--- quote ---
Outputs
Tophat produces two output files:
junctions -- A UCSC BED track of junctions reported by TopHat. Each
junction consists of two connected BED blocks, where each block is as
long as the maximal overhang of any read spanning the junction. The
score is the number of alignments spanning the junction.
accepted_hits -- A list of read alignments in BAM format.
Two other possible outputs, depending on the options you choose, are
insertions and deletions, both of which are in BED format.
-------------
BED format is described in the Galaxy wiki, which includes links to the
UCSC BED format description (they authored the format).
http://wiki.g2.bx.psu.edu/Learn/Datatypes#Bed
Two important rules to remember about BED format:
rule #1: coordinate data is already reported with respect to the (+) strand
rule #2: "start" is defined as the smallest coordinate, "end" is defined
as the largest coordinate, due to the rule #1.
BED files use a 0-based "start" position in data files, but browsers
display positions as 1-based. This means you'll need to add "1" to any
"start" coordinate in a .bed file to locate it in a display application.
The two will not and should not match. The "end" coordinate is also
0-based, but half-open (the base at "end" itself is not included). This
gives it the same value as a 1-based end, so it will match between data
files and display applications.
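As a small illustration of the start/end convention above, here is a
minimal Python sketch. The function names are made up for this example
and are not part of Galaxy, TopHat, or any UCSC tool.

```python
def bed_to_display(chrom, start, end):
    """Turn a 0-based, half-open BED interval into the 1-based,
    inclusive position string a genome browser would show."""
    # Only the start shifts by one; the half-open end already has the
    # same value as a 1-based inclusive end.
    return f"{chrom}:{start + 1}-{end}"


def bed_length(start, end):
    """Number of bases covered by a BED interval."""
    return end - start


print(bed_to_display("chr20", 199821, 204701))  # chr20:199822-204701
print(bed_length(199821, 204701))               # 4880
```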
-------------
Using the first data row as an example and this information, we can tell
that:
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
* column 5 is 'score', or 'number of alignments spanning the
junction'. In this case, "17" alignments.
* column 11 is the blockSizes, or 'read maximal overhang' of the
junctions (max alignment length). The first is 87 bases, the second 93
bases.
* column 12 is the blockStarts, or 'overhang start' of the junctions
(alignment start). The first is 0, the second 4787 bases. The first is
always 0. The second is the offset at which the second block begins, so
the (presumed) 'intron' length is that offset minus the first block
size: 4787 - 87 = 4700 bases.
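To pull these columns out of a junctions file programmatically, a
sketch along the following lines may help. It assumes a 12-column,
whitespace-delimited BED row like the one above. The Junction class and
parse_junction function are hypothetical names used only for this
example.

```python
from typing import List, NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int               # column 2, 0-based
    end: int                 # column 3, half-open
    name: str
    score: int               # column 5, alignments spanning the junction
    strand: str
    block_sizes: List[int]   # column 11
    block_starts: List[int]  # column 12


def parse_junction(line: str) -> Junction:
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    # Some BED writers leave a trailing comma on list columns; strip it.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    return Junction(fields[0], int(fields[1]), int(fields[2]), fields[3],
                    int(fields[4]), fields[5], sizes, starts)


row = ("chr20 199821 204701 JUNC00000001 17 - "
       "199821 204701 255,0,0 2 87,93 0,4787")
junction = parse_junction(row)
print(junction.score)         # 17
print(junction.block_sizes)   # [87, 93]
print(junction.block_starts)  # [0, 4787]
```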
Some calculations can be done with these numbers with respect to the
overall position of the junction already defined in columns 1,2,3
(chrom, start, end): chr20:199821-204701 (-). Together they define the
location of the predicted splices, the flanking aligned regions, and the
(presumed) 'intron'. This example looks tricky because the alignment is
on the (-) strand, but per rule #1 the strand does not change the
arithmetic: blockStarts are always offsets from the start coordinate in
column 2, counted along the (+) strand. The strand only tells you the
direction of transcription, so on the (-) strand the rightmost block is
the upstream one. If this sounds confusing, that's because it is! When
you visualize the data the concept will make more sense and it is
definitely worth learning about.
Brief explanation: The first blockStart is 0, which literally means that
the first block starts at the very beginning of the junction record
(0-based), at BED coordinate 199821 on chr20. This block continues for
87 bases and ends at (199821 + 87) = 199908. Then the first splice site
is present. The second block starts at (199821 + 4787) = 204608, which
is where the second splice site is present. This block continues for 93
bases, which places its end at (204608 + 93) = 204701. This is the same
as the reported global junction end position in column 3. The
(presumed) 'intron' lies between the two blocks, from 199908 to 204608,
or (4787 - 87) = 4700 bases. And, it all adds up.
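The block arithmetic can be sketched in a few lines of Python. This
follows the UCSC BED definition, in which blockStarts are offsets from
the start coordinate in column 2 on the (+) strand, whatever the strand
column says. The function names are hypothetical and chosen only for
this example.

```python
def junction_blocks(chrom_start, block_sizes, block_starts):
    """Return (start, end) for each BED block, 0-based and half-open."""
    return [(chrom_start + offset, chrom_start + offset + size)
            for offset, size in zip(block_starts, block_sizes)]


def intron_interval(chrom_start, block_sizes, block_starts):
    """Return the gap between the two blocks of a TopHat junction."""
    left, right = junction_blocks(chrom_start, block_sizes, block_starts)
    return left[1], right[0]


blocks = junction_blocks(199821, [87, 93], [0, 4787])
print(blocks)  # [(199821, 199908), (204608, 204701)]

intron_start, intron_end = intron_interval(199821, [87, 93], [0, 4787])
print(intron_start, intron_end)   # 199908 204608
print(intron_end - intron_start)  # 4700
```

The last block ends at 204701, the value in column 3, which is a quick
sanity check that a row has been read correctly.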
Trackster would be a good place to start for "Visualization" (use the
top menu bar link). The dataset can also be saved as a regular .bed file
and loaded as a custom track into the UCSC Genome Browser (if the direct
link is not fully configured yet).
Hopefully this helps,
Jen
Galaxy team
On 4/17/12 7:20 AM, Xu, Jianpeng wrote:
Hi,
I have installed a local Galaxy. I used TopHat to do the RNA-seq
alignment and got an output file: splice junctions in BED format.
I cannot understand it clearly. What do the numbers 17, 14 ... in
column 5 mean? What does 87,93 mean? What does 0,4787 mean?
Can you explain it a little bit to me? Which tool can be used to view
this file?
Thanks,
track name=junctions description="TopHat junctions"
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
chr20 204631 205520 JUNC00000002 14 - 204631 205520 255,0,0 2 96,87 0,802
chr20 205428 205775 JUNC00000003 9 - 205428 205775 255,0,0 2 92,91 0,256
chr20 205699 205958 JUNC00000004 15 - 205699 205958 255,0,0 2 87,92 0,167
chr20 205929 207067 JUNC00000005 31 - 205929 207067 255,0,0 2 95,97 0,1041
chr20 206977 207909 JUNC00000006 19 - 206977 207909 255,0,0 2 93,97 0,835
chr20 207884 212679 JUNC00000007 15 - 207884 212679 255,0,0 2 87,76 0,4719
chr20 207910 218238 JUNC00000008 1 - 207910 218238 255,0,0 2 61,39 0,10289
chr20 212628 218293 JUNC00000009 28 - 212628 218293 255,0,0 2 94,94 0,5571
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/
--
Jennifer Jackson
http://galaxyproject.org