[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2013-02-27 Thread Benjamin Coverston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588347#comment-13588347
 ] 

Benjamin Coverston commented on CASSANDRA-2699:
---

You're right, which also means that in the face of idempotent writes and 
replay, the incremental scenario is also broken with the in-memory tree.

 continuous incremental anti-entropy
 ---

 Key: CASSANDRA-2699
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2699
 Project: Cassandra
  Issue Type: Improvement
Reporter: Peter Schuller
Assignee: Peter Schuller

 Currently, repair works by periodically running bulk jobs that (1)
 perform a validating compaction, building up an in-memory Merkle tree,
 and (2) stream ring segments as needed according to differences
 indicated by the Merkle tree.
 There are some disadvantages to this approach:
 * There is a trade-off between memory usage and the precision of the
   merkle tree. Less precision means more data streamed relative to
   what is strictly required.
 * Repair is a periodic bulk process that runs for a significant
   period and, although possibly rate limited like compaction (on 0.8, or
   with the backported throttling patch applied), diverges in its
   performance characteristics from normal operation of the cluster.
 * The impact of imprecision can be huge on a workload that is dominated
   by I/O and where cache locality is critical, since you will suddenly
   transfer lots of data to the target node.
 I propose a process whereby anti-entropy is incremental and
 continuous over time. In order to avoid being seek-bound, one still
 wants to do work in some form of bursty fashion,
 but the amount of data processed at a time could be sufficiently small
 that the impact on the cluster feels a lot more continuous, and that
 the page cache allows us to avoid re-reading differing data twice.
 Consider a process whereby a node is constantly performing a repair
 operation for each CF. The current state of the repair process
 is defined by:
 * A starting timestamp of the current iteration through the token
   range the node is responsible for.
 * A finger indicating the current position along the token ring to
   which iteration has completed.
 This information, in addition to being kept in memory, could
 periodically (every few minutes or so) be stored persistently on disk.
 The finger advances by the node selecting the next small bit of the
 ring and doing whatever merkling/hashing/checksumming is necessary on
 that small part, and then asking neighbors to do the same, and
 arranging for neighbors to send the node data for mismatching
 ranges. The data would be sent either by way of mutations like with
 read repair, or by streaming sstables. But it would be small amounts
 of data that will act roughly the same as regular writes from the
 perspective of compaction.
 Some nice properties of this approach:
 * It's always on; no periodic sudden effects on cluster performance.
 * Restarting nodes never cancels or breaks anti-entropy.
 * Huge compactions of entire CFs never clog up the compaction queue
   (not necessarily a non-issue even with concurrent compactions in
   0.8).
 * Because we're always operating on small chunks, there is never the
   same kind of trade-off for memory use. A Merkle tree or similar
   could potentially be calculated at a very detailed level. Although
   the precision from the perspective of reading from disk would likely
   not matter much if we are in page cache anyway, very high precision
   could be *very* useful when doing anti-entropy across data centers
   on slow links.
 There are devils in the details, like how to select an appropriate ring
 segment given that you don't have knowledge of the data density on
 other nodes. But I feel that the overall idea/process seems very
 promising.
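
As a rough illustration of the bookkeeping described above (the iteration start timestamp plus a finger into the token ring, persisted every few minutes), here is a minimal sketch in Java; all names are invented and nothing here is actual Cassandra code:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// Per-CF repair state: when the current pass over the node's token range
// started, and how far along the ring ("the finger") the pass has gotten.
public class ContinuousRepairState {
    private Instant iterationStartedAt;  // start of the current pass
    private long finger;                 // token up to which this pass is done

    public ContinuousRepairState(long rangeStartToken) {
        this.iterationStartedAt = Instant.now();
        this.finger = rangeStartToken;
    }

    /** Advance the finger after one small chunk of the ring has been handled. */
    public void advance(long nextToken, long rangeStartToken, long rangeEndToken) {
        finger = nextToken;
        if (finger >= rangeEndToken) {   // wrapped around: start a new pass
            finger = rangeStartToken;
            iterationStartedAt = Instant.now();
        }
    }

    /** Persist every few minutes so a restart never loses much progress. */
    public void persist(Path file) throws IOException {
        Files.writeString(file, iterationStartedAt.toEpochMilli() + " " + finger);
    }
}
{code}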



[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2013-02-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588366#comment-13588366
 ] 

Sylvain Lebresne commented on CASSANDRA-2699:
-

bq. in the face of idempotent writes and replay the incremental scenario is 
also broken with the in-memory tree

Yeah, and streaming and read-repair break things too, I think (for the 
in-memory tree idea), and I'm not sure how you even compute the initial 
in-memory tree at startup in the first place (there's talk of saving the 
in-memory tree to disk to reload it on restart, but I see many problems with 
that, so it could be that I misunderstood the idea). But overall it sounded 
like the conclusion of CASSANDRA-4482 was that the in-memory trees themselves 
do not sound very useful, which from my current comprehension of the idea 
(which could be very partial/wrong) sounds about right.



[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2013-02-26 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586942#comment-13586942
 ] 

Sylvain Lebresne commented on CASSANDRA-2699:
-

bq. I'm pretty sure this means that we should be able to XOR the buckets 
together from pre-computed merkle tree SSTable components

No, I don't think that works, because nodes are certainly not guaranteed to be 
at the same state of compaction, even if they have the same data. That means 
that for a given column whose value A has been overwritten by B at some point, 
one node may have 1 sstable with just B while another node may have 2 
sstables, one with A and one with B, because it hasn't compacted things yet. 
But hash(B) (on the first node) will not be equal to hash(A) xor hash(B) (on 
the second node), even though both nodes really have the same data (since B 
overwrites A upon column merge on the second node).
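
As a tiny, self-contained illustration of this counterexample (plain Java using MD5 from java.security.MessageDigest; not Cassandra code): the XOR of per-sstable column hashes on the node that has not yet compacted differs from the hash of the single merged value on the node that has, even though both hold the same logical data.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Node 1 has one sstable containing only the live value B; node 2 has two
// sstables containing A and B, which merge to B on read. XOR-ing per-sstable
// hashes therefore yields different bucket values on the two nodes.
public class XorBucketCounterexample {
    static byte[] hash(String value) throws Exception {
        return MessageDigest.getInstance("MD5")
                            .digest(value.getBytes(StandardCharsets.UTF_8));
    }

    static byte[] xor(byte[] x, byte[] y) {
        byte[] out = new byte[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (byte) (x[i] ^ y[i]);
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] node1 = hash("B");                  // 1 sstable: just B
        byte[] node2 = xor(hash("A"), hash("B"));  // 2 sstables: A and B
        System.out.println(Arrays.equals(node1, node2));  // prints false
    }
}
{code}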




[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2013-02-21 Thread Benjamin Coverston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583453#comment-13583453
 ] 

Benjamin Coverston commented on CASSANDRA-2699:
---

From CASSANDRA-4482

{quote}
Instead, we maintain a Merkle Tree (MT) in memory and update it with every 
single column insert in ColumnFamilyStore.apply(). We use 
column.updateDigest(digest) on all the changes in order to create a hash per 
column update and then XOR this hash with the existing one in the Merkle Tree 
bucket for the corresponding row.
This Merkle Tree is created with the column family (one per range), initialized 
with zeros, and persisted to disk with regular snapshots.

The commutative properties of XOR make it possible to update the MT 
incrementally without having to read on write.
{quote}

I'm pretty sure this means that we should be able to XOR the buckets together 
from pre-computed Merkle tree SSTable components.

We could just create these on flush and merge them on compaction; a validation 
compaction would then just read the MT components and merge them.
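
To make the quoted mechanism a bit more concrete, here is a minimal sketch of XOR-updated buckets; the class and method names below are invented for illustration and are not actual Cassandra code (the quote's ColumnFamilyStore.apply() and column.updateDigest() are the real hooks it refers to):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// One running hash "bucket" per token-range leaf; every column update is
// folded in with XOR. Because XOR is commutative and associative, buckets
// built separately (e.g. per flushed sstable) could in principle be combined
// without re-reading data on write.
public class XorMerkleBuckets {
    private static final int HASH_LEN = 16;  // MD5 digest length
    private final byte[][] buckets;          // one running hash per leaf

    public XorMerkleBuckets(int leafCount) {
        buckets = new byte[leafCount][HASH_LEN];
    }

    /** Fold one column update into the bucket owning its partition. */
    public void apply(int bucket, byte[] key, byte[] name, byte[] value,
                      long timestamp) throws Exception {
        MessageDigest d = MessageDigest.getInstance("MD5");
        d.update(key);
        d.update(name);
        d.update(value);
        d.update(Long.toString(timestamp).getBytes(StandardCharsets.UTF_8));
        xorInto(buckets[bucket], d.digest());
    }

    /** Merge another set of buckets, e.g. a pre-computed sstable component. */
    public void merge(XorMerkleBuckets other) {
        for (int i = 0; i < buckets.length; i++) {
            xorInto(buckets[i], other.buckets[i]);
        }
    }

    private static void xorInto(byte[] target, byte[] src) {
        for (int i = 0; i < target.length; i++) target[i] ^= src[i];
    }
}
{code}

(As the follow-up comments point out, this only works if every node folds in exactly the same set of updates; differing compaction state, replayed writes, streaming and read repair all break that assumption.)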





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2012-11-30 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507615#comment-13507615
 ] 

Jonathan Ellis commented on CASSANDRA-2699:
---

A big problem with the only-repair-new-data idea is that it doesn't deal with 
hardware-level data loss -- i.e., we want to re-repair data that *was* 
complete, until we lost a disk or a machine.



[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2012-02-15 Thread Jonathan Ellis (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208467#comment-13208467
 ] 

Jonathan Ellis commented on CASSANDRA-2699:
---

I assume you meant to link a different issue?





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2012-02-15 Thread Peter Schuller (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208593#comment-13208593
 ] 

Peter Schuller commented on CASSANDRA-2699:
---

Sorry yes - CASSANDRA-3912.






[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2012-02-14 Thread Peter Schuller (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208257#comment-13208257
 ] 

Peter Schuller commented on CASSANDRA-2699:
---

CASSANDRA-2699 has a baby step towards this which addresses incremental repair 
and the merkle tree resolution problem (but does not remove, in fact increases, 
the need for external scripting).





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-08-17 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086772#comment-13086772
 ] 

Jonathan Ellis commented on CASSANDRA-2699:
---

bq. one way to incorporate terje's idea would be to continue to build merkle 
trees, but to only build them using sstables that have been created since the 
last time we ran repair

I'm not sure this works, because as compaction does its work you will have a 
lot of sstable turnover.  Even if you preserve repaired state where possible 
(two repaired sstables merge to form an sstable also marked repaired), new 
sstables will fragment out as they're compacted under leveled (LevelDB-style) 
compaction, contaminating what they are merged with.

Note that the problem gets worse if a peer is down, because then un-repaired 
sstables start to pile up.

You could create separate sets of levels for repaired and unrepaired 
sstables, I suppose.  That feels ugly.

You could also keep a repaired bloom filter at the row level for 
partially-repaired sstables.  That feels more reasonable to me.  (But that 
brings us back to doing repair-by-CL.ALL reads rather than trees of ranges.)

bq. one problem I see with CL.ALL reads is that you won't get new keys repaired 
to that node until another node is repaired that has the key

I don't think this is a blocker -- it just means you still have to run repair 
against each node, which has always been the case.
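
A minimal sketch, with invented names, of the "preserve repaired state where possible" rule from the first paragraph: a compaction output can only be marked repaired when every input was, which is exactly why a single unrepaired sstable contaminates whatever it is merged with.

{code:java}
import java.util.List;

// Invented names, not Cassandra code: repaired state survives compaction only
// if all inputs carried it, so one unrepaired input "poisons" the output.
final class SSTableMeta {
    final boolean repaired;
    SSTableMeta(boolean repaired) { this.repaired = repaired; }
}

final class RepairedStateMerger {
    static SSTableMeta compactionOutputOf(List<SSTableMeta> inputs) {
        boolean allRepaired = inputs.stream().allMatch(s -> s.repaired);
        return new SSTableMeta(allRepaired);
    }
}
{code}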


[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-07-25 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070528#comment-13070528
 ] 

Jeremiah Jordan commented on CASSANDRA-2699:


I like the idea of not repairing stuff that has already been repaired. One 
problem I see with CL.ALL reads is that you won't get new keys repaired to 
that node until another node that has the key is repaired.





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-07-25 Thread Chris Burroughs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070651#comment-13070651
 ] 

Chris Burroughs commented on CASSANDRA-2699:


If we don't repair stuff that has already been marked as repaired, doesn't that 
prevent repair from recovering from bitrot or bugs in the (previous) repair 
process?





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-07-25 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070657#comment-13070657
 ] 

Peter Schuller commented on CASSANDRA-2699:
---

I would argue that repair's ability to do that is limited as it is, and the 
primary concern is making anti-entropy a less invasive and more reliable 
process. Isn't 'scrub' closer to what you'd want for bitrot btw?

However, longer-term with checksumming and the ability to truly handle 
arbitrary corruption, I think it would be a great feature to integrate the 
detection of checksum mismatches with automatic repair from neighboring nodes 
(this time the word 'repair' truly being repair, not anti-entropy - well, 
unless you count bitrot as entropy, which I suppose one should ;)).





[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-07-23 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070079#comment-13070079
 ] 

Jonathan Ellis commented on CASSANDRA-2699:
---

This seems like a good place to summarize a discussion from IRC:

Terje had an interesting idea: instead of repair operating on ranges (with 
Merkle trees), maybe the right unit of repair is an sstable: i.e., when you 
repair an sstable, it does CL.ALL reads of each row and then adds repaired 
metadata. Compared to Merkle-tree repair, you are doing random I/O, in 
exchange for only having to worry about new-since-last-repair data (and 
whatever we have to merge that the bloom filter doesn't skip).

This feels like a better fit for Cassandra to me, because it takes advantage of 
SSTable immutability.  Constantly re-repairing data that doesn't need it (in 
large increments or small) is wasteful.

Stu added: one way to incorporate terje's idea would be to continue to build 
merkle trees, but to only build them using sstables that have been created 
since the last time we ran repair.

Peter commented: The only significant downside I can think of off the top of 
my head is that when nodes are unavailable or otherwise have some problem for 
a long time, the sstables created since the last repair might be a non-trivial 
amount of data. So it doesn't [in the worst case] fulfil the goal of 
guaranteeing that the bulkiness of an individual repair does not exceed some 
reasonable value. But it still sounds a *lot* better than the current 
situation.
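
A rough sketch of the sstable-as-unit-of-repair idea summarized above, using hypothetical interfaces rather than Cassandra APIs: walk every partition key of an unrepaired sstable, issue a CL.ALL read so the normal read path reconciles the replicas (triggering read repair where they disagree), then record the repaired marker so the sstable is skipped next time.

{code:java}
import java.util.Iterator;

// Hypothetical stand-ins, not Cassandra APIs.
interface SSTable {
    Iterator<byte[]> partitionKeys();
    boolean isMarkedRepaired();
    void markRepaired();
}

interface ReadExecutor {
    /** CL.ALL read of one key; repairs any replica that disagrees. */
    void readAtAll(byte[] partitionKey);
}

final class SSTableRepairer {
    private final ReadExecutor reads;

    SSTableRepairer(ReadExecutor reads) { this.reads = reads; }

    void repair(SSTable sstable) {
        if (sstable.isMarkedRepaired()) return;  // already done: skip entirely
        Iterator<byte[]> keys = sstable.partitionKeys();
        while (keys.hasNext()) {
            reads.readAtAll(keys.next());        // random I/O, but only new data
        }
        sstable.markRepaired();                  // persisted as sstable metadata
    }
}
{code}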


[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy

2011-07-13 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064905#comment-13064905
 ] 

Peter Schuller commented on CASSANDRA-2699:
---

I should also add that this type of approach completely eliminates the need for 
operators to initiate and monitor repair, which is something that seems to 
cause lots of confusion. What is still needed is monitoring that Cassandra 
isn't saying "something is wrong, I seem to be running behind with repairs". 
But actually exposing such a binary condition should be trivial if the rest 
is implemented.

The alerting would be even less critical if the information kept as part of an 
implementation of this were used when doing compaction; one could 
automatically prevent tombstones from ever being removed prematurely, based on 
whether or not AES has happened recently enough.

(All of this assumes we can *accurately* and *reliably* determine that repair 
succeeded, so bugs causing silent failure of repair need fixing and it needs to 
not easily break again.)
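
As a sketch of that tombstone-gating idea (invented names, purely illustrative): compaction would only purge a tombstone when, in addition to the usual gc_grace check, anti-entropy is known to have covered the range since the deletion happened.

{code:java}
// A tombstone is purgeable only if it is past gc_grace AND a repair of its
// range has completed since the deletion, so a node that is running behind
// on anti-entropy can never drop a tombstone prematurely.
final class TombstonePurgePolicy {
    static boolean canPurge(long deletionTimeMillis,
                            long gcGraceMillis,
                            long lastSuccessfulRepairMillis,
                            long nowMillis) {
        boolean pastGcGrace = nowMillis - deletionTimeMillis > gcGraceMillis;
        boolean repairedSince = lastSuccessfulRepairMillis > deletionTimeMillis;
        return pastGcGrace && repairedSince;
    }
}
{code}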
