Hi guys,
here is some late feedback on this discussion:
* When talking about copy numbers, it is important to always be very
clear and distinguish between whether we talk about normal/germline
CNs or tumor CNs. The former take integer CN levels (0, 1, 2, 3,
...), whereas for tumors we very rarely observe pure homogeneous tumor
cells, which is why we only measure and observe non-integer CN levels.
Hopefully, we observe at least discrete CN levels in tumors, but one
should never expect integer levels.
* aCGH: a historical term often used as a synonym for total copy
numbers. For example, some say aCGH analysis when they really mean
total copy-number analysis. aCGH stands for array-CGH, or in full
'array comparative genomic hybridization'. This refers to the older
generation two-color/two-channel arrays where a test and a reference
sample were labelled with two different dyes and competitively
hybridized to the same array and the same probes. I recommend that we
stop using this term and instead use total copy number, total CN, or
TCN (when it's clear). By being explicit about total, you're
also explicitly contrasting it to parent-specific CNs (which you can
do if you have SNP data).
* CNA: Copy-Number Aberration. This term can be applied to both tumor
and germline samples. In tumors you expect non-integer CN levels. In
germline/normals you expect integer CN levels (0, 1, 2, 3, ...).
* CNP: Copy-Number Polymorphism. This term applies to copy-number
differences in relationship to a population. This also implies we're
talking about germline genomes. In other words, CNPs are also integer
CN levels (0, 1, 2, 3, ...). CNPs are used to state that, say, 2% of
Europeans have a one-copy deletion of length 1.0-1.5 Mb on Chr 3 at
124.5 Mb. CNPs are to segment deletions and gains what SNPs are to
nucleotide polymorphisms. The term CNP is rare. It is much more
common to hear/see CNV.
* CNV: Copy-Number Variation. Ideally the word variation refers to
polymorphism and therefore the term CNV should be used only to refer
to CNPs. I don't know if there is a formal definition, but I find it
unfortunate to see CNV being used when CNA should be used. In my
book, CNV only takes integer CN levels (0, 1, 2, 3, ...). The term
CNV should never be used to refer to CN levels in tumors.
* Calling total CN levels is very hard in tumors, and as the first
point above alludes to, it may not even be a well-defined problem.
For instance, imagine you have a tumor sample with 5% tumor cells and
95% normal cells, and that those tumor cells all have a deletion
on Chr 2. Then, at what point do you consider that sample itself to
have a deletion on Chr 2? Are you after the sample/tissue itself, or
are you after those 5% tumor cells? What if you have a heterogeneous
mix of tumor cells? The more precisely you can specify your question,
the easier it is for you to decide which approach (may) work and
which doesn't. Here work can also be read as make sense.
* The first and most important task for almost all segmentation
methods is to *segment* the genome, that is, identify at what genomic
locations the observed DNA (tumor, normal or a mix) changes in CN
level. Together, these locations, aka change points, define how the
genome can be partitioned into segments with equal CN levels, such
that when we look at a particular segment, we can assume that all
genomic locations within that segment have the same underlying genomic
composition (e.g. gain, loss, loss in 5% of the cells, etc.). CBS,
GLAD, and many other methods, segment the genome this way as a first
step.
* A common task after having decided on the segments (partitioning of
the genome) is to decide on what is going on within each segment.
Not all methods do this. For instance, CBS only provides you with
the change points. GLAD on the other hand does both the segmentation
and then also provides a method for calling. Theoretically, there is
nothing preventing you from using the GLAD *calling* algorithm using
the segmentation found by CBS. Unfortunately, I don't think it is
straightforward to do that in practice; at least you have to coerce
one data format into one that GLAD understands.
* GLAD does not scale well with the number of loci, because its
computational complexity is ~O(n^2), unless things have changed since.
In 2007, I tried to predict GLAD's processing time when we were using
the Affymetrix 500K chips and the GenomeWideSNP_5 and GenomeWideSNP_6
were starting to come out. A GWS6 chip would basically take days to
segment. See attached PNG for a table.
* CBS is much faster as an algorithm. Also, the implementation in the
DNAcopy package has been made even faster over time. There was a
major speedup back in 2009, cf.
http://aroma-project.org/benchmarks/DNAcopy_v1.19.2-speedup/
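* To put rough numbers on the 5%/95% example above, here is a minimal
sketch in Python. It assumes a simple linear mixture model, where the
observed total CN is the cell-fraction-weighted average of the tumor
and normal CNs. The function name and the default normal CN of 2 are
my own choices for illustration; they do not come from any package.

```python
def observed_total_cn(tumor_fraction, tumor_cn, normal_cn=2.0):
    """Average total CN of a sample that mixes tumor and normal cells.

    Assumes a linear mixture: each cell contributes its own CN in
    proportion to its share of the sample (an idealization).
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    return tumor_fraction * tumor_cn + (1.0 - tumor_fraction) * normal_cn


# 5% tumor cells with a one-copy deletion (CN = 1), 95% normal cells (CN = 2)
print(round(observed_total_cn(0.05, tumor_cn=1), 3))  # 1.95

# The same deletion in a sample with 60% tumor cells
print(round(observed_total_cn(0.60, tumor_cn=1), 3))  # 1.4
```

Under this assumption the 5% sample sits at about 1.95, which is hard
to tell apart from a plain CN of 2 in noisy data, and neither value is
an integer.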
Over and out for now
Henrik
On Thu, Jan 22, 2015 at 12:42 AM, Chengyu Liu chengyu.liu...@gmail.com wrote:
Hi,
I have tried this and works good but at the