[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588347#comment-13588347 ]

Benjamin Coverston commented on CASSANDRA-2699:
-----------------------------------------------

You're right, which also means that in the face of idempotent writes and replay, the incremental scenario is also broken with the in-memory tree.

continuous incremental anti-entropy
-----------------------------------

                Key: CASSANDRA-2699
                URL: https://issues.apache.org/jira/browse/CASSANDRA-2699
            Project: Cassandra
         Issue Type: Improvement
           Reporter: Peter Schuller
           Assignee: Peter Schuller

Currently, repair works by periodically running bulk jobs that (1) perform a validating compaction, building up an in-memory Merkle tree, and (2) stream ring segments as needed according to differences indicated by the Merkle tree. There are some disadvantages to this approach:

* There is a trade-off between memory usage and the precision of the Merkle tree. Less precision means more data streamed relative to what is strictly required.
* Repair is a periodic bulk process that runs for a significant period and, although possibly rate limited like compaction (on 0.8, or with the throttling patch backported), diverges in performance characteristics from normal operation of the cluster.
* The impact of imprecision can be huge on a workload dominated by I/O and where cache locality is critical, since lots of data is suddenly transferred to the target node.

I propose a process whereby anti-entropy is incremental and continuous over time. In order to avoid being seek-bound one still wants to do work in some form of bursty fashion, but the amount of data processed at a time could be small enough that the impact on the cluster feels much more continuous, and that the page cache allows us to avoid re-reading differing data twice.

Consider a process whereby a node is constantly performing a repair operation for each CF. The current state of the repair process is defined by:

* A starting timestamp of the current iteration through the token range the node is responsible for.
* A finger indicating the current position along the token ring to which iteration has completed.

This information, besides being held in memory, could periodically (every few minutes or so) be stored persistently on disk. The finger advances by the node selecting the next small bit of the ring, doing whatever Merkling/hashing/checksumming is necessary on that small part, then asking neighbors to do the same and arranging for neighbors to send the node data for mismatching ranges. The data would be sent either by way of mutations, as with read repair, or by streaming sstables. But it would be small amounts of data that behave roughly like regular writes from the perspective of compaction.

Some nice properties of this approach:

* It's always on; no periodic sudden effects on cluster performance.
* Restarting nodes never cancels or breaks anti-entropy.
* Huge compactions of entire CFs never clog up the compaction queue (not necessarily a non-issue even with concurrent compactions in 0.8).
* Because we're always operating on small chunks, there is never the same kind of trade-off on memory use. A Merkle tree or similar could potentially be calculated at a very detailed level. Although the precision would likely not matter much for reading from disk if we are in the page cache anyway, very high precision could be *very* useful when doing anti-entropy across data centers on slow links.

There are devils in the details, like how to select an appropriate ring segment given that you don't have knowledge of the data density on other nodes. But I feel the overall idea/process is very promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
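The repair loop proposed above (a persisted iteration timestamp plus a finger advancing over small ring segments) can be sketched as follows. This is an illustrative sketch only; the class and field names are invented, not Cassandra internals:

```python
import json
import os
import time

class RepairState:
    """Per-CF repair state: the timestamp at which the current pass
    started, plus a 'finger' marking how far around the token ring the
    pass has progressed. Periodically persisted so a restart resumes
    rather than cancels anti-entropy."""

    def __init__(self, path, ring_size=2**127):
        self.path = path
        self.ring_size = ring_size
        self.iteration_started = time.time()
        self.finger = 0  # current position along the token ring
        self._load()

    def _load(self):
        # Resume from the last persisted finger, if any.
        if os.path.exists(self.path):
            with open(self.path) as f:
                state = json.load(f)
            self.iteration_started = state["iteration_started"]
            self.finger = state["finger"]

    def persist(self):
        # Called every few minutes or so, per the proposal.
        with open(self.path, "w") as f:
            json.dump({"iteration_started": self.iteration_started,
                       "finger": self.finger}, f)

    def advance(self, segment_width):
        """Claim the next small ring segment; wrap around and start a
        new iteration once the whole ring has been covered."""
        start = self.finger
        end = start + segment_width
        if end >= self.ring_size:
            end = self.ring_size
            self.finger = 0
            self.iteration_started = time.time()
        else:
            self.finger = end
        return (start, end)
```

Each returned `(start, end)` segment would then be hashed locally, compared against the corresponding hashes from neighbors, and mismatching ranges repaired, before the finger is persisted again.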
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588366#comment-13588366 ]

Sylvain Lebresne commented on CASSANDRA-2699:
---------------------------------------------

bq. in the face of idempotent writes and replay the incremental scenario is also broken with the in-memory tree

Yeah, and streaming and read repair break things too (for the in-memory tree idea), and I'm not sure how you would even compute the initial in-memory tree at startup in the first place (there's talk of saving the in-memory tree to disk to reload it on restart, but I see many problems with that, so it could be that I misunderstood the idea). Overall, it sounded like the conclusion of CASSANDRA-4482 was that the in-memory trees themselves are not very useful, which from my current comprehension of the idea (which could be very partial/wrong) sounds about right.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586942#comment-13586942 ]

Sylvain Lebresne commented on CASSANDRA-2699:
---------------------------------------------

bq. I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed merkle tree SSTable components

No, I don't think that works, because nodes are not guaranteed to be at the same state of compaction even if they have the same data. For a given column whose value A has been overwritten by B at some point, one node may have 1 sstable with just B, while another node may have 2 sstables, one with A and one with B, because it hasn't compacted things yet. But hash(B) (on the first node) will not be equal to hash(A) xor hash(B) (on the second node), even though both nodes really have the same data (since B overwrites A upon column merge on the second node).
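The compaction-state counterexample above can be made concrete with a toy hash model (illustrative only; `col_hash` is an invented stand-in for Cassandra's per-column digests):

```python
import hashlib

def col_hash(name: str, value: str) -> int:
    """Hash one column update, as an integer so hashes can be XORed."""
    h = hashlib.md5(f"{name}:{value}".encode()).digest()
    return int.from_bytes(h, "big")

# Node 1 has already compacted: its single sstable contains only the
# winning value B, so its bucket is just hash(B).
node1_bucket = col_hash("col", "B")

# Node 2 still holds two sstables (A overwritten by B, not yet
# compacted). XOR-merging per-sstable hashes folds the obsolete value A
# into the bucket.
node2_bucket = col_hash("col", "A") ^ col_hash("col", "B")

# The buckets disagree even though both nodes resolve to the same data.
assert node1_bucket != node2_bucket
```

The mismatch is purely an artifact of compaction state, so XOR-combining pre-computed per-sstable trees would report a spurious difference between replicas holding identical logical data.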
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583453#comment-13583453 ]

Benjamin Coverston commented on CASSANDRA-2699:
-----------------------------------------------

From CASSANDRA-4482:

{quote}
Instead, we maintain a Merkle Tree (MT) in memory and update it with every single column insert in ColumnFamilyStore.apply(). We use column.updateDigest(digest) on all the changes in order to create a hash per column update and then XOR this hash with the existing one in the Merkle Tree bucket for the corresponding row. This Merkle Tree is created with the column family (one per range), initialized with zeros, and persisted to disk with regular snapshots. The commutative properties of XOR make it possible to update the MT incrementally without having to read on write.
{quote}

I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed Merkle tree SSTable components. We could just create these on flush and merge them on compaction; validation compaction then becomes just reading the MT components and merging them.
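The XOR bucket scheme quoted from CASSANDRA-4482 can be sketched as below. This is a toy model, not Cassandra code: `update_digest`, the bucket count, and the bucketing function are all invented for illustration:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; a real tree would have far more leaves

def update_digest(row_key: str, col_name: str, value: str) -> int:
    """Stand-in for column.updateDigest(): one hash per column update."""
    h = hashlib.md5(f"{row_key}:{col_name}:{value}".encode()).digest()
    return int.from_bytes(h, "big")

class XorMerkleBuckets:
    """Leaf buckets of a Merkle tree, updated incrementally by XORing
    each column update's digest into the row's bucket."""

    def __init__(self):
        self.buckets = [0] * NUM_BUCKETS  # initialized with zeros

    def _bucket_for(self, row_key: str) -> int:
        # Map the row to a bucket; stands in for token-range bucketing.
        return hashlib.md5(row_key.encode()).digest()[0] % NUM_BUCKETS

    def apply(self, row_key, col_name, value):
        # XOR is commutative and associative, so updates may arrive in
        # any order and the bucket never needs to be read before write.
        i = self._bucket_for(row_key)
        self.buckets[i] ^= update_digest(row_key, col_name, value)

# Commutativity demo: the same updates in different orders converge.
a, b = XorMerkleBuckets(), XorMerkleBuckets()
updates = [("r1", "c1", "v1"), ("r2", "c1", "v2"), ("r1", "c2", "v3")]
for u in updates:
    a.apply(*u)
for u in reversed(updates):
    b.apply(*u)
assert a.buckets == b.buckets
```

Note the flip side that the thread goes on to discuss: because `x ^ d ^ d == x`, replaying an identical update cancels itself out of the bucket, which is exactly the idempotent-write/replay weakness raised in the later comments.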
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507615#comment-13507615 ]

Jonathan Ellis commented on CASSANDRA-2699:
-------------------------------------------

A big problem with the only-repair-new-data idea is that it doesn't deal with hardware-level data loss -- i.e., we want to re-repair data that *was* complete, until we lost a disk or a machine.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208467#comment-13208467 ]

Jonathan Ellis commented on CASSANDRA-2699:
-------------------------------------------

I assume you meant to link a different issue?
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208593#comment-13208593 ]

Peter Schuller commented on CASSANDRA-2699:
-------------------------------------------

Sorry, yes -- CASSANDRA-3912.
[jira] [Commented] (CASSANDRA-2699) continuous incremental anti-entropy
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208257#comment-13208257 ] Peter Schuller commented on CASSANDRA-2699:
---
CASSANDRA-2699 has a baby step towards this which addresses incremental repair and the merkle tree resolution problem (but does not remove, in fact increases, the need for external scripting).
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086772#comment-13086772 ] Jonathan Ellis commented on CASSANDRA-2699:
---
bq. one way to incorporate terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

I'm not sure this works, because as compaction does its work you will have a lot of sstable turnover. Even if you preserve repaired state where possible (two repaired sstables merge to form an sstable also marked repaired), new sstables will fragment out as they're compacted under leveled (leveldb-style) compaction, contaminating whatever they are merged with. Note that the problem gets worse if a peer is down, because then un-repaired sstables start to pile up.

You could create separate sets of levels for repaired and unrepaired sstables, I suppose. That feels ugly. You could also keep a repaired bloom filter at the row level for partially-repaired sstables. That feels more reasonable to me. (But that brings us back to doing repair-by-CL.ALL reads rather than trees of ranges.)

bq. one problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node is repaired that has the key

I don't think this is a blocker -- it just means you still have to run repair against each node, which has always been the case.
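The contamination problem described above follows from a conservative merge rule: a compacted sstable can only be marked repaired if every input was repaired, so one unrepaired input poisons the output. A minimal sketch (dict fields and names are illustrative, not Cassandra's actual metadata API):

```python
# Conservative repaired-state propagation through compaction: the merged
# output is "repaired" only if ALL inputs were repaired.

def compact(sstables):
    """Merge sstables; the result inherits repaired status conservatively."""
    merged_rows = set().union(*(s["rows"] for s in sstables))
    return {"rows": merged_rows,
            "repaired": all(s["repaired"] for s in sstables)}

a = {"rows": {"k1", "k2"}, "repaired": True}
b = {"rows": {"k3"}, "repaired": True}
c = {"rows": {"k4"}, "repaired": False}

ab = compact([a, b])    # both inputs repaired -> output stays repaired
abc = compact([ab, c])  # one unrepaired input -> repaired state is lost
```

Under frequent compaction this rule steadily erodes repaired coverage, which is exactly why separate repaired/unrepaired levels or a row-level repaired bloom filter are floated as alternatives.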
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070528#comment-13070528 ] Jeremiah Jordan commented on CASSANDRA-2699:
---
I like the idea of not repairing stuff that has already been repaired. One problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node that has the key is repaired.
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070651#comment-13070651 ] Chris Burroughs commented on CASSANDRA-2699:
---
If we don't repair stuff that has already been marked as repaired, doesn't that prevent repair from recovering from bitrot or bugs in the (previous) repair process?
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070657#comment-13070657 ] Peter Schuller commented on CASSANDRA-2699:
---
I would argue that repair's ability to do that is limited as it is, and the primary concern is making anti-entropy a less invasive and more reliable process. Isn't 'scrub' closer to what you'd want for bitrot, btw?

However, longer-term, with checksumming and the ability to truly handle arbitrary corruption, I think it would be a great feature to integrate the detection of checksum mismatches with automatic repair from neighboring nodes (this time the word 'repair' truly meaning repair, not anti-entropy -- well, unless you count bitrot as entropy, which I suppose one should ;)).
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070079#comment-13070079 ] Jonathan Ellis commented on CASSANDRA-2699:
---
This seems like a good place to summarize a discussion from IRC.

Terje had an interesting idea: instead of repair operating on ranges (with merkle trees), maybe the right unit of repair is an sstable: i.e., when you repair an sstable, it does CL.ALL reads of each row, then adds a "repaired" metadata flag. Compared to merkle-tree repair, you are doing random i/o, in exchange for being able to only worry about new-since-last-repair data (and whatever we have to merge that the bloom filter doesn't skip). This feels like a better fit for Cassandra to me, because it takes advantage of sstable immutability. Constantly re-repairing data that doesn't need it (in large increments or small) is wasteful.

Stu added: one way to incorporate Terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair.

Peter commented: The only significant downside I can think of off the top of my head is that when nodes are unavailable or otherwise have some problem for a long time, the sstables created since the last repair might be a non-trivial amount of data. So it doesn't [in the worst case] fulfill the goal of guaranteeing that the bulkiness of an individual repair does not exceed some reasonable value. But it still sounds a *lot* better than the current situation.
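Stu's variant above amounts to a simple filter when choosing inputs for the next Merkle tree: consider only sstables newer than the last completed repair (and not already flagged repaired). A sketch, with field names assumed for illustration:

```python
# Select only the sstables that need validation: created after the last
# completed repair and not already marked repaired. Field names are
# hypothetical, not Cassandra's actual sstable metadata.

def sstables_to_validate(sstables, last_repair_ts):
    """Return the sstables a since-last-repair Merkle tree should cover."""
    return [s for s in sstables
            if s["created_at"] > last_repair_ts and not s["repaired"]]

tables = [
    {"name": "old-repaired",   "created_at": 100, "repaired": True},
    {"name": "old-unrepaired", "created_at": 150, "repaired": False},
    {"name": "new",            "created_at": 300, "repaired": False},
]
pending = sstables_to_validate(tables, last_repair_ts=200)
```

Peter's caveat shows up directly here: if a peer is down for a long time, `last_repair_ts` stays stale and the `pending` set grows without bound, so the per-repair data volume is no longer capped.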
[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064905#comment-13064905 ] Peter Schuller commented on CASSANDRA-2699:
---
I should also add that this type of approach completely eliminates the need for operators to initiate and monitor repair, which is something that seems to cause lots of confusion. What is still needed is monitoring that Cassandra isn't saying "something is wrong, I seem to be running behind with repairs". But actually exposing such a binary condition should be trivial if the rest is implemented.

The alerting would be even less critical if the information kept as part of an implementation of this were used when doing compaction: one could automatically prevent tombstones from ever being removed prematurely, based on whether or not AES (anti-entropy) has happened recently enough. (All of this assumes we can *accurately* and *reliably* determine that repair actually succeeded, so bugs causing silent failure of repair need fixing, and it needs to not easily break again.)
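The tombstone-safety idea above can be stated as an extra condition on garbage collection: purge a tombstone only if it is past the usual grace period *and* a repair has verifiably succeeded since the tombstone was written. A sketch under those assumptions (all parameter names are hypothetical):

```python
# Gate tombstone purging on repair recency, in addition to gc_grace expiry.
# If no successful repair has covered the delete, keep the tombstone so it
# cannot "resurrect" on a replica that never saw the deletion.

def can_purge_tombstone(tombstone_ts, now, gc_grace_seconds,
                        last_successful_repair_ts):
    expired = now - tombstone_ts >= gc_grace_seconds
    repaired_since_delete = last_successful_repair_ts > tombstone_ts
    return expired and repaired_since_delete
```

With a rule like this, a lagging repair makes tombstones linger (costing disk) instead of risking deleted data reappearing, which is the safer failure mode; it also hinges on the comment's caveat that repair success must be reported accurately.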