[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588347#comment-13588347 ]

Benjamin Coverston commented on CASSANDRA-2699:
-----------------------------------------------

You're right, which also means that in the face of idempotent writes and replay, the incremental scenario is also broken with the in-memory tree.

continuous incremental anti-entropy
-----------------------------------

                Key: CASSANDRA-2699
                URL: https://issues.apache.org/jira/browse/CASSANDRA-2699
            Project: Cassandra
         Issue Type: Improvement
           Reporter: Peter Schuller
           Assignee: Peter Schuller

Currently, repair works by periodically running bulk jobs that (1) perform a validating compaction, building up an in-memory Merkle tree, and (2) stream ring segments as needed according to differences indicated by the Merkle tree. There are some disadvantages to this approach:

* There is a trade-off between memory usage and the precision of the Merkle tree. Less precision means more data streamed relative to what is strictly required.
* Repair is a periodic bulk process that runs for a significant period and, although possibly rate limited like compaction (on 0.8, or with the throttling patch backported), diverges in performance characteristics from normal operation of the cluster.
* The impact of imprecision can be huge on a workload dominated by I/O and where cache locality is critical, since lots of data is suddenly transferred to the target node.

I propose a process whereby anti-entropy is incremental and continuous over time. In order to avoid being seek-bound one still wants to do work in some form of bursty fashion, but the amount of data processed at a time could be small enough that the impact on the cluster feels much more continuous, and that the page cache allows us to avoid re-reading differing data twice.

Consider a process whereby a node is constantly performing a repair operation for each CF. The current state of the repair process is defined by:

* A starting timestamp of the current iteration through the token range the node is responsible for.
* A finger indicating the current position along the token ring to which iteration has completed.

This information, besides being held in memory, could periodically (every few minutes or so) be stored persistently on disk. The finger advances by the node selecting the next small bit of the ring, doing whatever Merkling/hashing/checksumming is necessary on that small part, then asking neighbors to do the same and arranging for neighbors to send the node data for mismatching ranges. The data would be sent either by way of mutations, as with read repair, or by streaming sstables. But it would be small amounts of data that behave roughly like regular writes from the perspective of compaction.

Some nice properties of this approach:

* It's always on; no periodic sudden effects on cluster performance.
* Restarting nodes never cancels or breaks anti-entropy.
* Huge compactions of entire CFs never clog up the compaction queue (not necessarily a non-issue even with concurrent compactions in 0.8).
* Because we're always operating on small chunks, there is never the same kind of trade-off on memory use. A Merkle tree or similar could potentially be calculated at a very detailed level. Although the precision would likely not matter much for reading from disk if we are in the page cache anyway, very high precision could be *very* useful when doing anti-entropy across data centers on slow links.

There are devils in the details, like how to select an appropriate ring segment given that you don't have knowledge of the data density on other nodes. But I feel the overall idea/process is very promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
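The repair loop proposed above (a persisted iteration timestamp plus a finger advancing over small ring segments) can be sketched as follows. This is an illustrative sketch only; the class and field names are invented, not Cassandra internals:

```python
import json
import os
import time

class RepairState:
    """Per-CF repair state: the timestamp at which the current pass
    started, plus a 'finger' marking how far around the token ring the
    pass has progressed. Periodically persisted so a restart resumes
    rather than cancels anti-entropy."""

    def __init__(self, path, ring_size=2**127):
        self.path = path
        self.ring_size = ring_size
        self.iteration_started = time.time()
        self.finger = 0  # current position along the token ring
        self._load()

    def _load(self):
        # Resume from the last persisted finger, if any.
        if os.path.exists(self.path):
            with open(self.path) as f:
                state = json.load(f)
            self.iteration_started = state["iteration_started"]
            self.finger = state["finger"]

    def persist(self):
        # Called every few minutes or so, per the proposal.
        with open(self.path, "w") as f:
            json.dump({"iteration_started": self.iteration_started,
                       "finger": self.finger}, f)

    def advance(self, segment_width):
        """Claim the next small ring segment; wrap around and start a
        new iteration once the whole ring has been covered."""
        start = self.finger
        end = start + segment_width
        if end >= self.ring_size:
            end = self.ring_size
            self.finger = 0
            self.iteration_started = time.time()
        else:
            self.finger = end
        return (start, end)
```

Each returned `(start, end)` segment would then be hashed locally, compared against the corresponding hashes from neighbors, and mismatching ranges repaired, before the finger is persisted again.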
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588366#comment-13588366 ]

Sylvain Lebresne commented on CASSANDRA-2699:
---------------------------------------------

bq. in the face of idempotent writes and replay the incremental scenario is also broken with the in-memory tree

Yeah, and streaming and read repair break things too (for the in-memory tree idea), and I'm not sure how you would even compute the initial in-memory tree at startup in the first place (there's talk of saving the in-memory tree to disk to reload it on restart, but I see many problems with that, so it could be that I misunderstood the idea). Overall, it sounded like the conclusion of CASSANDRA-4482 was that the in-memory trees themselves are not very useful, which from my current comprehension of the idea (which could be very partial/wrong) sounds about right.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586942#comment-13586942 ]

Sylvain Lebresne commented on CASSANDRA-2699:
---------------------------------------------

bq. I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed merkle tree SSTable components

No, I don't think that works, because nodes are not guaranteed to be at the same state of compaction even if they have the same data. For a given column whose value A has been overwritten by B at some point, one node may have 1 sstable with just B, while another node may have 2 sstables, one with A and one with B, because it hasn't compacted things yet. But hash(B) (on the first node) will not be equal to hash(A) xor hash(B) (on the second node), even though both nodes really have the same data (since B overwrites A upon column merge on the second node).
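The compaction-state counterexample above can be made concrete with a toy hash model (illustrative only; `col_hash` is an invented stand-in for Cassandra's per-column digests):

```python
import hashlib

def col_hash(name: str, value: str) -> int:
    """Hash one column update, as an integer so hashes can be XORed."""
    h = hashlib.md5(f"{name}:{value}".encode()).digest()
    return int.from_bytes(h, "big")

# Node 1 has already compacted: its single sstable contains only the
# winning value B, so its bucket is just hash(B).
node1_bucket = col_hash("col", "B")

# Node 2 still holds two sstables (A overwritten by B, not yet
# compacted). XOR-merging per-sstable hashes folds the obsolete value A
# into the bucket.
node2_bucket = col_hash("col", "A") ^ col_hash("col", "B")

# The buckets disagree even though both nodes resolve to the same data.
assert node1_bucket != node2_bucket
```

The mismatch is purely an artifact of compaction state, so XOR-combining pre-computed per-sstable trees would report a spurious difference between replicas holding identical logical data.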
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583453#comment-13583453 ]

Benjamin Coverston commented on CASSANDRA-2699:
-----------------------------------------------

From CASSANDRA-4482:

{quote}
Instead, we maintain a Merkle Tree (MT) in memory and update it with every single column insert in ColumnFamilyStore.apply(). We use column.updateDigest(digest) on all the changes in order to create a hash per column update and then XOR this hash with the existing one in the Merkle Tree bucket for the corresponding row. This Merkle Tree is created with the column family (one per range), initialized with zeros, and persisted to disk with regular snapshots. The commutative properties of XOR make it possible to update the MT incrementally without having to read on write.
{quote}

I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed Merkle tree SSTable components. We could just create these on flush and merge them on compaction; validation compaction then becomes just reading the MT components and merging them.
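The XOR bucket scheme quoted from CASSANDRA-4482 can be sketched as below. This is a toy model, not Cassandra code: `update_digest`, the bucket count, and the bucketing function are all invented for illustration:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; a real tree would have far more leaves

def update_digest(row_key: str, col_name: str, value: str) -> int:
    """Stand-in for column.updateDigest(): one hash per column update."""
    h = hashlib.md5(f"{row_key}:{col_name}:{value}".encode()).digest()
    return int.from_bytes(h, "big")

class XorMerkleBuckets:
    """Leaf buckets of a Merkle tree, updated incrementally by XORing
    each column update's digest into the row's bucket."""

    def __init__(self):
        self.buckets = [0] * NUM_BUCKETS  # initialized with zeros

    def _bucket_for(self, row_key: str) -> int:
        # Map the row to a bucket; stands in for token-range bucketing.
        return hashlib.md5(row_key.encode()).digest()[0] % NUM_BUCKETS

    def apply(self, row_key, col_name, value):
        # XOR is commutative and associative, so updates may arrive in
        # any order and the bucket never needs to be read before write.
        i = self._bucket_for(row_key)
        self.buckets[i] ^= update_digest(row_key, col_name, value)

# Commutativity demo: the same updates in different orders converge.
a, b = XorMerkleBuckets(), XorMerkleBuckets()
updates = [("r1", "c1", "v1"), ("r2", "c1", "v2"), ("r1", "c2", "v3")]
for u in updates:
    a.apply(*u)
for u in reversed(updates):
    b.apply(*u)
assert a.buckets == b.buckets
```

Note the flip side that the thread goes on to discuss: because `x ^ d ^ d == x`, replaying an identical update cancels itself out of the bucket, which is exactly the idempotent-write/replay weakness raised in the later comments.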
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507615#comment-13507615 ]

Jonathan Ellis commented on CASSANDRA-2699:
-------------------------------------------

A big problem with the only-repair-new-data idea is that it doesn't deal with hardware-level data loss -- i.e., we want to re-repair data that *was* complete, until we lost a disk or a machine.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208467#comment-13208467 ]

Jonathan Ellis commented on CASSANDRA-2699:
-------------------------------------------

I assume you meant to link a different issue?
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208593#comment-13208593 ]

Peter Schuller commented on CASSANDRA-2699:
-------------------------------------------

Sorry, yes -- CASSANDRA-3912.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208257#comment-13208257 ] Peter Schuller commented on CASSANDRA-2699:
---
CASSANDRA-2699 has a baby step towards this which addresses incremental repair and the merkle tree resolution problem (but does not remove, in fact increases, the need for external scripting).
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086772#comment-13086772 ] Jonathan Ellis commented on CASSANDRA-2699:
---
bq. one way to incorporate terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

I'm not sure this works, because as compaction does its work you will have a lot of sstable turnover. Even if you preserve repaired state where possible (two repaired sstables merge to form an sstable also marked repaired), new sstables will fragment out as they're compacted under leveled (leveldb-style) compaction, contaminating whatever they are merged with. Note that the problem gets worse if a peer is down, because then un-repaired sstables start to pile up.

You could create separate sets of levels for repaired and unrepaired sstables, I suppose. That feels ugly. You could also keep a repaired bloom filter at the row level for partially-repaired sstables. That feels more reasonable to me. (But that brings us back to doing repair-by-CL.ALL reads rather than trees of ranges.)

bq. one problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node is repaired that has the key

I don't think this is a blocker -- it just means you still have to run repair against each node, which has always been the case.
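The contamination problem described above follows from a conservative merge rule: a compacted sstable can only be marked repaired if every input was repaired, so one unrepaired input poisons the output. A minimal sketch (dict fields and names are illustrative, not Cassandra's actual metadata API):

```python
# Conservative repaired-state propagation through compaction: the merged
# output is "repaired" only if ALL inputs were repaired.

def compact(sstables):
    """Merge sstables; the result inherits repaired status conservatively."""
    merged_rows = set().union(*(s["rows"] for s in sstables))
    return {"rows": merged_rows,
            "repaired": all(s["repaired"] for s in sstables)}

a = {"rows": {"k1", "k2"}, "repaired": True}
b = {"rows": {"k3"}, "repaired": True}
c = {"rows": {"k4"}, "repaired": False}

ab = compact([a, b])    # both inputs repaired -> output stays repaired
abc = compact([ab, c])  # one unrepaired input -> repaired state is lost
```

Under frequent compaction this rule steadily erodes repaired coverage, which is exactly why separate repaired/unrepaired levels or a row-level repaired bloom filter are floated as alternatives.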
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070528#comment-13070528 ] Jeremiah Jordan commented on CASSANDRA-2699:
---
I like the idea of not repairing stuff that has already been repaired. One problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node that has the key is repaired.
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070651#comment-13070651 ] Chris Burroughs commented on CASSANDRA-2699:
---
If we don't repair stuff that has already been marked as repaired, doesn't that prevent repair from recovering from bitrot or bugs in the (previous) repair process?
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070657#comment-13070657 ] Peter Schuller commented on CASSANDRA-2699:
---
I would argue that repair's ability to do that is limited as it is, and the primary concern is making anti-entropy a less invasive and more reliable process. Isn't 'scrub' closer to what you'd want for bitrot, btw?

However, longer-term, with checksumming and the ability to truly handle arbitrary corruption, I think it would be a great feature to integrate the detection of checksum mismatches with automatic repair from neighboring nodes (this time the word 'repair' truly meaning repair, not anti-entropy -- well, unless you count bitrot as entropy, which I suppose one should ;)).
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070079#comment-13070079 ] Jonathan Ellis commented on CASSANDRA-2699:
---
This seems like a good place to summarize a discussion from IRC.

Terje had an interesting idea: instead of repair operating on ranges (with merkle trees), maybe the right unit of repair is an sstable: i.e., when you repair an sstable, it does CL.ALL reads of each row, then adds a "repaired" metadata flag. Compared to merkle-tree repair, you are doing random i/o, in exchange for being able to only worry about new-since-last-repair data (and whatever we have to merge that the bloom filter doesn't skip). This feels like a better fit for Cassandra to me, because it takes advantage of sstable immutability. Constantly re-repairing data that doesn't need it (in large increments or small) is wasteful.

Stu added: one way to incorporate Terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair.

Peter commented: The only significant downside I can think of off the top of my head is that when nodes are unavailable or otherwise have some problem for a long time, the sstables created since the last repair might be a non-trivial amount of data. So it doesn't [in the worst case] fulfill the goal of guaranteeing that the bulkiness of an individual repair does not exceed some reasonable value. But it still sounds a *lot* better than the current situation.
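Stu's variant above amounts to a simple filter when choosing inputs for the next Merkle tree: consider only sstables newer than the last completed repair (and not already flagged repaired). A sketch, with field names assumed for illustration:

```python
# Select only the sstables that need validation: created after the last
# completed repair and not already marked repaired. Field names are
# hypothetical, not Cassandra's actual sstable metadata.

def sstables_to_validate(sstables, last_repair_ts):
    """Return the sstables a since-last-repair Merkle tree should cover."""
    return [s for s in sstables
            if s["created_at"] > last_repair_ts and not s["repaired"]]

tables = [
    {"name": "old-repaired",   "created_at": 100, "repaired": True},
    {"name": "old-unrepaired", "created_at": 150, "repaired": False},
    {"name": "new",            "created_at": 300, "repaired": False},
]
pending = sstables_to_validate(tables, last_repair_ts=200)
```

Peter's caveat shows up directly here: if a peer is down for a long time, `last_repair_ts` stays stale and the `pending` set grows without bound, so the per-repair data volume is no longer capped.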
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064905#comment-13064905 ] Peter Schuller commented on CASSANDRA-2699:
---
I should also add that this type of approach completely eliminates the need for operators to initiate and monitor repair, which is something that seems to cause lots of confusion. What is still needed is monitoring that Cassandra isn't saying "something is wrong, I seem to be running behind with repairs". But actually exposing such a binary condition should be trivial if the rest is implemented.

The alerting would be even less critical if the information kept as part of an implementation of this were used when doing compaction: one could automatically prevent tombstones from ever being removed prematurely, based on whether or not AES (anti-entropy) has happened recently enough. (All of this assumes we can *accurately* and *reliably* determine that repair actually succeeded, so bugs causing silent failure of repair need fixing, and it needs to not easily break again.)
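The tombstone-safety idea above can be stated as an extra condition on garbage collection: purge a tombstone only if it is past the usual grace period *and* a repair has verifiably succeeded since the tombstone was written. A sketch under those assumptions (all parameter names are hypothetical):

```python
# Gate tombstone purging on repair recency, in addition to gc_grace expiry.
# If no successful repair has covered the delete, keep the tombstone so it
# cannot "resurrect" on a replica that never saw the deletion.

def can_purge_tombstone(tombstone_ts, now, gc_grace_seconds,
                        last_successful_repair_ts):
    expired = now - tombstone_ts >= gc_grace_seconds
    repaired_since_delete = last_successful_repair_ts > tombstone_ts
    return expired and repaired_since_delete
```

With a rule like this, a lagging repair makes tombstones linger (costing disk) instead of risking deleted data reappearing, which is the safer failure mode; it also hinges on the comment's caveat that repair success must be reported accurately.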