[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313621#comment-17313621 ]

maxwellguo commented on CASSANDRA-6246:
---------------------------------------

[~bdeggleston] any update?

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/Lightweight Transactions, Legacy/Coordination
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Normal
>              Labels: LWT, messaging-service-bump-required
>             Fix For: 4.x
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence, a period of
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to
> implement it.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128718#comment-16128718 ]

Joshua McKenzie commented on CASSANDRA-6246:
--------------------------------------------

bq. I'm looking for a solution to implement a reference counter based on Cassandra.
bq. Dear community, do you have any idea?

Please take questions to the user mailing list if you haven't already. JIRA is for discussion concerning development of Cassandra internals.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128159#comment-16128159 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

I would like to try, but I'm not familiar with the Cassandra source code. :( Wouldn't it be easier to implement the patch again, rather than rebasing four-year-old code?

BTW, I'm looking for a solution to implement a *reference counter based on Cassandra*. My first reference counter implementation was built on counter columns, but unfortunately it was ruined by the tombstone issue: when a counter gets back to zero, I can neither delete nor compact it. My guess was that Cassandra's lightweight transactions could do a very good job for my task. I was naive, and now I have an issue with WriteTimeoutException and inconsistent state. The only workaround I came up with today is to take an exclusive lock, which can easily be built with LWT plus a TTL, and then change the value afterwards, but that will have a much greater performance hit. I'm still looking for a good solution to this with Cassandra. Currently I'm being naive again and expecting that EPaxos will help me, but it seems it will never, ever be merged and released.

Dear community, do you have any idea? Huge thanks to everyone who answers.
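The lock-plus-TTL workaround mentioned above can be sketched in miniature. This is an in-memory simulation of the semantics, with invented names (it is not Cassandra code); in Cassandra itself the acquire step would be a conditional write along the lines of `INSERT INTO locks (name, owner) VALUES (?, ?) IF NOT EXISTS USING TTL 30`, so a crashed holder's lock expires on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory simulation of the lock-with-TTL workaround described above.
// All names here are illustrative, not Cassandra classes.
class LwtLock
{
    private static final class Entry
    {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis)
        {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> locks = new ConcurrentHashMap<>();

    // Compare-and-set, like LWT's IF NOT EXISTS: succeeds only if the lock
    // is absent or its TTL has lapsed. Time is passed in for testability.
    boolean tryAcquire(String name, String owner, long ttlMillis, long nowMillis)
    {
        Entry fresh = new Entry(owner, nowMillis + ttlMillis);
        Entry result = locks.compute(name, (k, cur) ->
                (cur == null || cur.expiresAtMillis <= nowMillis) ? fresh : cur);
        return result == fresh;
    }

    // Only the current holder may release early; otherwise the TTL handles it.
    void release(String name, String owner)
    {
        locks.computeIfPresent(name, (k, cur) -> cur.owner.equals(owner) ? null : cur);
    }
}
```

The value update then happens only while the lock is held, which is exactly why this costs extra round trips compared to a bare LWT update.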
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127782#comment-16127782 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

What are you looking for with this patch? It would help if you could rebase this patch and see if someone can review it.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127713#comment-16127713 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

It is a pity that these lightweight transactions cannot be used to full effect because of the delay in merging this improvement. I refer to CASSANDRA-9328. I would give the merge the highest priority.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127426#comment-16127426 ]

Joshua McKenzie commented on CASSANDRA-6246:
--------------------------------------------

[~geagle]: given that this a) needs a rebase, and b) is a [massive patch|https://github.com/apache/cassandra/compare/trunk...bdeggleston:CASSANDRA-6246-trunk] that has yet to be reviewed, I'd expect a substantial delay before this is ready to merge. Not to put words in Blake's mouth, but I'd assume a post-4.0 world.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126756#comment-16126756 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

Any update on when this can be released?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858602#comment-15858602 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

It hasn't been forgotten, but I don't have any updates right now.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858041#comment-15858041 ]

Dobrin commented on CASSANDRA-6246:
-----------------------------------

Just wondering, has there been any progress since 2015? Or are there plans not to put EPaxos in Cassandra at all? Thanks.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980634#comment-14980634 ]

Jim Meyer commented on CASSANDRA-6246:
--------------------------------------

Does anyone know if this patch will help with CASSANDRA-9328 (i.e. the outcome of an LWT not being reported to the client when there is contention)? There's a suggestion to that effect in the comments of 9328, but I don't know if anyone has tried running the test code in 9328 to see if this patch has an effect on that issue.

Is this patch compatible with rc2 of Cassandra 3.0.0, or does it need to be updated? When is it planned to add epaxos to an official build? Thanks for any info.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980722#comment-14980722 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

bq. Does anyone know if this patch will help with CASSANDRA-9328

It should, yes.

bq. Is this patch compatible with rc2 of Cassandra 3.0.0 or does it need to be updated?

It needs to be rebased onto cassandra-3.0; there are a few parts where it interacts directly with the cell timestamps.

bq. When is it planned to add epaxos to an official build?

There are no plans at the moment; the patch still needs to be reviewed.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

So I think this is at a point where it's ready for review. Epaxos rebased against trunk is here: https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The supporting dtests are here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.

Most of the problems and solutions for supporting repair, read repair, bootstrap, and failure recovery are discussed above, so I won't go into them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool upgradepaxos. That needs to be run on each node after the cluster has been upgraded, and will transition a cluster from paxos to epaxos. Serialized queries are still processed during the upgrade. How that works is explained here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83. For new clusters, or nodes being added to upgrading clusters, there's a yaml parameter you can set which will start the cluster on epaxos.

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely separate. Keeping track of the two with the same set of metadata would have introduced a ton of complexity. Obviously mixing the two is a bad idea, and you give up the serialization guarantees when you do it anyway, so epaxos doesn't even bother trying.

There is some weird stuff going on with how some things are serialized. First, for cas requests, the statement objects aren't serialized when attached to instances... the query strings and their parameters are. This is because serializing the important parts of a CQLCasRequest would have required serializers for dozens of classes that I'm not familiar with, and that didn't seem to be intended for serialization. Serializing the query string and parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 will make it a bit easier to do this correctly. Second, most of the epaxos metadata is persisted as blobs. This is mainly because the dependency graph metadata is very queue-like in its usage patterns, so dumping it as a big blob when there are changes prevents a lot of headaches. For the instances, since there are three types, each with different attributes, it seemed less risky to maintain a single serializer vs a serializer and select/insert statements. At this point, it would probably be ok to break it out into a proper table, but I don't have strong feelings about it either way. The token metadata was saved as blobs just because the other two were, iirc.

Regarding performance, I've run some tests today, this time with one and two datacenters and with more concurrent clients. For a single datacenter, the epaxos median response time is 40-50% faster than regular paxos. However, the 95th and 99th percentiles are actually worse. I'm not sure why that is, but I will be looking into it in the next week or so. In multiple datacenters, epaxos is 70-75% faster than regular paxos, and the 95th and 99th percentiles are 50-70% faster as well. I haven't tested contended performance yet, and will do that in the next week or so. I'd expect the results to be similar to last time, though.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos services wrap calls to singletons in protected methods to make testing the interaction of multiple nodes in unit tests straightforward. In addition to dtests, there are a bunch of junit tests that put a simulated cluster in strange and probably rare failure conditions to test recovery, like this one: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144. I've also written a test that executes thousands of queries against a simulated cluster in a single thread, randomly turning nodes on and off, and checking that each node executed instances in the same order. It's pretty ugly, and needs to be expanded, but it has been useful in uncovering small bugs. It's here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-L48
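The query-string-plus-parameters serialization described above can be sketched roughly as follows. This is a hypothetical, simplified serializer for illustration (not the one in the patch): the coordinator ships the CQL text and the raw bound-parameter bytes, and the receiving node re-prepares the statement locally and binds the parameters.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of serializing a cas request as query text plus raw
// parameter bytes, rather than serializing the statement objects themselves.
class QuerySerializer
{
    static byte[] serialize(String query, List<byte[]> params)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(query);            // query string, length-prefixed
            out.writeInt(params.size());    // number of bound parameters
            for (byte[] p : params)
            {
                out.writeInt(p.length);     // each parameter as opaque bytes
                out.write(p);
            }
            return bytes.toByteArray();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e); // can't happen for in-memory streams
        }
    }

    static Map.Entry<String, List<byte[]>> deserialize(byte[] data)
    {
        try
        {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String query = in.readUTF();
            int count = in.readInt();
            List<byte[]> params = new ArrayList<>(count);
            for (int i = 0; i < count; i++)
            {
                byte[] p = new byte[in.readInt()];
                in.readFully(p);
                params.add(p);
            }
            // the receiving node would re-prepare `query` and bind `params`
            return new AbstractMap.SimpleEntry<>(query, params);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is exactly the one described above: re-parsing the query on the receiving side costs a little, but avoids hand-written serializers for dozens of statement-internal classes.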
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363951#comment-14363951 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I still have some things I need to complete on this before it's really ready for review, but I haven't had the time. Maybe in a week or so.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363947#comment-14363947 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

I am not finding time to review this. If someone else can pick it up, that would be great.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339463#comment-14339463 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I just merged some commits into my CASSANDRA-6246 branch. This is the initial implementation of the epoch, instance gc, streaming/repair, read repair, and failure recovery logic. I also have a dtests fork that tests it here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246. I still have some items to complete before submitting for review, but nothing major (relative to this stuff). Mostly cleaning up and refining things.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339572#comment-14339572 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

Let me take a look.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335886#comment-14335886 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I should have them merged in, and the ticket updated with my progress, within the next few days.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335767#comment-14335767 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

When do you plan to merge the new branches back into CASSANDRA-6246?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267389#comment-14267389 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

Sure. Makes sense.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265909#comment-14265909 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

I am a little confused as to how you will use the epoch to make sure instances are executed on all replicas when incrementing?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266223#comment-14266223 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

By using the existing epaxos ordering constraints. Incrementing the epoch is done by an instance which takes, as dependencies, all unacknowledged instances for the token range whose epoch it's incrementing. The epoch can only be incremented if all previous instances have also been executed.
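The gating rule above can be sketched in a few lines. The names here (EpochTracker, begin/execute) are invented for illustration and are not the patch's classes: the epoch for a token range can only advance once every instance it depends on has executed.

```java
import java.util.*;

// Minimal sketch of epoch incrementing gated on instance execution.
// Illustrative names only; this is not code from the CASSANDRA-6246 branch.
class EpochTracker
{
    private long epoch = 0;
    private final Set<UUID> unexecuted = new HashSet<>();

    void begin(UUID instanceId)   { unexecuted.add(instanceId); }    // accepted, not yet executed
    void execute(UUID instanceId) { unexecuted.remove(instanceId); }

    // An epoch-increment instance takes every unexecuted instance in the
    // range as a dependency, so it can only execute (and bump the epoch)
    // once all of them have been executed.
    boolean tryIncrementEpoch()
    {
        if (!unexecuted.isEmpty())
            return false;
        epoch++;
        return true;
    }

    long epoch() { return epoch; }
}
```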
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230331#comment-14230331 ] Blake Eggleston commented on CASSANDRA-6246: Since it looks like the performance improvements from epaxos could be worth the (substantial) added complexity, I’ve been thinking through problems are caused by the need to garbage collect instances, and repair causing inconsistencies by sending data from ‘the future’. For repair, the only thing I’ve thought of that would work 100% of the time would be to count executed instances for a partition, and to send that count along with the repair request. If the remote count is higher than the local count, we know for sure that it has data from the future, and the repair for that partition should be deferred. For garbage collection, we’ll need to support a failure recovery mode that works without all historical instances. We also need a way to quickly determine if a prepare phase should be used, or we need a epaxos repair type operation to bring a node up to speed. Breaking the continuous execution space of partition ranges into discrete epochs would give us a relatively straightforward way of solving all of these problems. Each partition range will have it’s own epoch number. At a given instance number threshold, time threshold, or event, epaxos will run an epoch increment instance. It will take every active instance in it’s partition range as a dependency. Any instance executed before the epoch instance belongs to the last epoch, any executed after belong to the new one. How this would solve the outstanding problems: Garbage Collection: Any instance from 2 or more epochs ago can be deleted. Although epoch incrementing instances doesn’t prevent dependencies on the previous epoch, it does prevent dependencies from the previous-1 epoch Repair: Counting executions allows us to determine if repair data is from the future. 
Epochs let us scope execution counts to an epoch. If the epoch has incremented twice without new executions for a partition, the bookkeeping data for that partition can be deleted. This gives us a race-free way to delete old execution counts, so bookkeeping data isn't kept around forever.

Failure recovery: Using epochs makes the decision between prepare and failure recovery unambiguous. If a node is missing instances from 2 or more epochs ago, it will need to run a failure recovery; otherwise, prepare phases will work. Additionally, using an epaxos instance as the method of incrementing epochs guarantees that a given instance has been executed once the epoch has been incremented twice.
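The epoch rules above reduce to simple comparisons. A toy sketch, with hypothetical names (this is not the actual patch code):

```java
// Illustrative sketch of the epoch rules described above. All names here are
// hypothetical, not Cassandra's actual EPaxos classes.
public final class EpochGc {
    // An instance becomes deletable once the epoch of its partition range has
    // advanced at least twice past the epoch it executed in: the intervening
    // epoch-increment instance depended on it, so nothing two epochs later
    // can still name it as a dependency.
    public static boolean canDelete(long instanceEpoch, long currentEpoch) {
        return currentEpoch - instanceEpoch >= 2;
    }

    // Deciding between a prepare phase and full failure recovery becomes the
    // same comparison: missing instances from 2+ epochs ago means recovery.
    public static boolean needsFailureRecovery(long oldestMissingEpoch, long currentEpoch) {
        return currentEpoch - oldestMissingEpoch >= 2;
    }

    public static void main(String[] args) {
        System.out.println(canDelete(5, 7)); // two epochs old -> deletable
        System.out.println(canDelete(6, 7)); // previous epoch -> keep
    }
}
```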
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207036#comment-14207036 ] sankalp kohli commented on CASSANDRA-6246: --

One of the features we keep hearing about from people moving from an RDBMS background is replicated-log style replication. This provides timeline consistency when you do reads, say in another DC after a DC failure. Currently in C*, say you did 3 writes A, B and C, and B could not be replicated to the other DC. After failover, you will read A and C but not B. This breaks a lot of things for some applications.

One of the advantages of epaxos is that it orders all writes on all machines. If all writes are done via epaxos, I think it provides the above timeline consistency. So apart from epaxos being fast, I think this is a very important feature we get with it. What do you think [~bdeggleston]?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207383#comment-14207383 ] sankalp kohli commented on CASSANDRA-6246: --

Regarding reviewing the patch: I have some cleanups/suggestions for the code. I am yet to see the whole code, and I won't note down the things which still need to be taken care of or coded.

1) In the DependencyManager, we might want to keep the last executed instance; otherwise we won't know if the next one depends on the previous one, or whether we have missed any in between.
2) You might want to create java packages and move files there. For example in the repair code, org.apache.cassandra.repair.messages is where we keep all the requests and responses. We can do the same for verb handlers, etc.
3) We should add the new verbs to DatabaseDescriptor.getTimeout(); otherwise they will use the default timeout. I fixed this for the current paxos implementation in CASSANDRA-7752.
4) PreacceptResponse.failure can also accept missingInstances in the constructor. You can make it final and not volatile.
5) ExecutionSorter.getOrder(): here the if condition uncommitted.size() == 0 is always true. Also loadedScc is empty, as we don't insert into it.
6) In ExecuteTask.run(), Instance toExecute = state.loadInstance(toExecuteId); should be within the try, as we are holding a lock.
7) EpaxosState.commitCallbacks could be a multimap.
8) In Instance.java, successors, noop and fastPathPossible are not used. We can also get rid of the Instance.applyRemote() method.
9) PreacceptCallback.ballot need not be an instance variable, as we set completed=true after we set it.
10) PreacceptResponse.missingInstance is not required, as it can be calculated on the leader in the PreacceptCallback.
11) EpaxosState.accept(): we can filter out the skipPlaceholderPredicate when we calculate missingInstances in PreacceptCallback.getAcceptDecision().
12) PreacceptCallback.getAcceptDecision(): we don't need to calculate missingIds if accept is going to be false in the AcceptDecision.
13) ParticipantInfo.remoteEndpoints: here we are not doing any isAlive check and are just sending messages to all remote endpoints.
14) ParticipantInfo.endpoints will not be required once we remove Epaxos.getSuccessors().
15) Accept is sent to live local endpoints and to all remote endpoints. In AcceptCallback, I think we should count responses from only local endpoints.
16) When we execute the instance in ExecuteTask, what if we crash after executing the instance but before recording it?
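Suggestion 7 above (commitCallbacks as a multimap) can be sketched with plain JDK collections; the class and method names here are hypothetical, not the patch's actual API:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Multimap-style commit callbacks using only the JDK: several callbacks can
// be registered per instance id and all fire when that instance commits.
// Names (registerCommitCallback, onCommit) are illustrative.
public final class CommitCallbacks {
    private final Map<UUID, List<Runnable>> callbacks = new ConcurrentHashMap<>();

    public void registerCommitCallback(UUID instanceId, Runnable cb) {
        callbacks.computeIfAbsent(instanceId,
                                  id -> Collections.synchronizedList(new ArrayList<>()))
                 .add(cb);
    }

    // Run and clear every callback registered for the committed instance.
    public void onCommit(UUID instanceId) {
        List<Runnable> cbs = callbacks.remove(instanceId);
        if (cbs != null)
            cbs.forEach(Runnable::run);
    }
}
```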
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207557#comment-14207557 ] Blake Eggleston commented on CASSANDRA-6246:

Thanks Sankalp. Since my last post, I've been cleaning things up and improving the tests. Sorry for the delay pushing it up. I also found a problem in the execution phase that was slowing things down. Epaxos is now 40% faster than the existing implementation in uncontended workloads, and 20x faster in contended workloads. Here are the performance numbers: https://docs.google.com/spreadsheets/d/1inBuO5bxo_b36jnTn5Ff9UCOhnMGLcx6EyNxp2nFM_Q/edit?usp=sharing

bq. 1) In the DependencyManager, we might want to keep the last executed instance otherwise we won't know if the next one depends on the previous one or we have missed any in between.

Instances only become eligible for eviction when they've been both executed and acknowledged. An executed instance will be a dependency of at least one additional instance before being evicted from the manager.

{quote}
2) You might want to create java packages and move files there. For example in repair code, org.apache.cassandra.repair.messages where we keep all the Request Responses. We can do the same for verb handler, etc.
3) We should add the new verbs to DatabaseDescriptor.getTimeout(). Otherwise they will use the default timeout. I fixed this for current paxos implementation in CASSANDRA-7752
4) PreacceptResponse.failure can also accept missingInstances in the constructor. You can make it final and not volatile.
{quote}

I'll look into these.

bq. 5) ExecutionSorter.getOrder(). Here if condition uncommitted.size() == 0 is always true. Also loadedScc is empty as we don't insert into it.

Ids are being put into uncommitted in the addInstance method, so it won't always equal 0. Good catch on the loadedScc though, I'll get that fixed.

bq. 6) In ExecuteTask.run(), Instance toExecute = state.loadInstance(toExecuteId); should be within the try as we are holding a lock.

Fixed in the cleaned up code.

bq. 7) EpaxosState.commitCallbacks could be a multimap.

Agreed, I'll update.

{quote}
8) In Instance.java, successors, noop and fastPathPossible are not used. We can also get rid of Instance.applyRemote() method.
14) ParticipantInfo.endpoints will not be required once we remove the Epaxos.getSuccessors()
{quote}

successors and noop will be used in the prepare and execute phases respectively; fastPathImpossible should be removed though.

bq. 9) PreacceptCallback.ballot need not be an instance variable as we set completed=true after we set it.

Agreed, I'll update.

{quote}
10) PreacceptResponse.missingInstance is not required as it can be calculated on the leader in the PreacceptCallback.
11) EpaxosState.accept(). We can filter out the skipPlaceholderPredicate when we calculated missingInstances in PreacceptCallback.getAcceptDecision()
{quote}

Missing instances are sent both ways. When a node responds to a preaccept message, if it believes the leader is missing an instance, it will include it in its response. Once the leader has received all the responses, if it thinks any of the replicas are missing instances, it will send them along.

{quote}
12) PreacceptCallback.getAcceptDecision() We don't need to calculate missingIds if accept is going to be false in AcceptDecision.
13) ParticipantInfo.remoteEndpoints. Here we are not doing any isAlive check and just sending messages to all remote endpoints.
{quote}

I'll fix.

bq. 15) Accept is sent to live local endpoints and to all remote endpoints. In AcceptCallback, I think we should count responses from only local endpoints

Fixed in the cleaned up code.

bq. 16) When we execute the instance in ExecuteTask, what if we crash after executing the instance but before recording it.

Saving the best for last, I see :) The existing implementation has this problem as well. Cassandra doesn't have a way to mutate multiple keyspaces with a single commit log entry (that I've found). We could collect the mutations from the actual cas write, the dependency manager update, and the instances update, and hold off on applying them until the very end, but that only makes the problem less likely.

Speaking of which, the default of not waiting for an fsync before considering a write successful is a more serious problem for paxos/epaxos, since a paxos node forgetting its state can cause inconsistencies. I'll give this and your timeline consistency question some more thought.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207741#comment-14207741 ] sankalp kohli commented on CASSANDRA-6246: --

{quote}
5) ExecutionSorter.getOrder(). Here if condition uncommitted.size() == 0 is always true. Also loadedScc is empty as we don't insert into it.
Ids are being put into uncommitted in the addInstance method, so it won't always equal 0.
{quote}

We only call ExecutionSorter.getOrder() in the else branch of executionSorter.uncommitted.size() > 0 in ExecuteTask.run(), so we can remove the check.

{quote}
Missing instances are sent both ways. When a node responds to a preaccept message, if it believes the leader is missing an instance, it will include it in its response. Once the leader has received all the responses, if it thinks any of the replicas are missing instances, it will send them along.
{quote}

I think there is no need to send them. Since we are sending all the dependencies of the endpoint in the response to the leader, the leader can do the diff. There is no point sending duplicate information over the wire. So I think in PreacceptVerbHandler, we don't need to calculate and send the missing instances.

{quote}
Speaking of which, the default of not waiting for an fsync before considering a write successful is a more serious problem for paxos/epaxos, since a paxos node forgetting its state can cause inconsistencies.
{quote}

I agree we can tackle this later. But here it is more dangerous, because once an endpoint is out of sync, no further updates can be applied, as condition checks are local. In current paxos, if a machine is in this situation and could not apply the commit, the next commit will still be applied, as condition checks are at quorum level.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199147#comment-14199147 ] sankalp kohli commented on CASSANDRA-6246: --

In PreacceptCallback, boolean fpQuorum = numResponses >= participantInfo.fastQuorumSize; will always be false, since we don't accept any requests after a quorum number of requests have come in. This will cause the accept phase to always run, even when there is no contention. Am I reading it right?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199170#comment-14199170 ] Blake Eggleston commented on CASSANDRA-6246:

Only with a replication factor > 5. For rf <= 5, the fast path quorum size is the same as the basic quorum size.
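The quorum arithmetic behind this exchange can be sketched as follows, assuming the optimized fast-path quorum from the EPaxos paper: with N = 2F + 1 replicas, a simple quorum is F + 1 and a fast quorum is F + floor((F + 1) / 2). Whether the patch uses exactly this formula is an assumption.

```java
// Quorum sizes under the (assumed) optimized EPaxos fast-path formula.
public final class QuorumSizes {
    // Simple majority quorum: floor(rf / 2) + 1.
    public static int simpleQuorum(int rf) {
        return rf / 2 + 1;
    }

    // Optimized fast-path quorum: F + floor((F + 1) / 2), with rf = 2F + 1.
    public static int fastQuorum(int rf) {
        int f = (rf - 1) / 2;
        return f + (f + 1) / 2;
    }

    public static void main(String[] args) {
        for (int rf : new int[] {3, 5, 7})
            System.out.println("rf=" + rf + " quorum=" + simpleQuorum(rf)
                               + " fast=" + fastQuorum(rf));
    }
}
```

For rf = 3 and rf = 5 the two sizes coincide, so a callback that stops collecting responses at a simple quorum can still observe a fast-path quorum; at rf = 7 they diverge (4 vs 5).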
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14197053#comment-14197053 ] sankalp kohli commented on CASSANDRA-6246: --

Let me take a look at the patch. It is a big one :)
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194659#comment-14194659 ] Blake Eggleston commented on CASSANDRA-6246:

I have an initial implementation here: https://github.com/bdeggleston/cassandra/compare/CASSANDRA-6246?expand=1

It's still pretty rough; I just wanted to get it to a point where we could get a feel for the performance advantages and decide if the additional complexity was worth it. There's also none of the instance gc / optimized failure recovery we've been talking about.

I did some performance comparisons over the weekend. The tldr is that epaxos is 10% to 11.5x faster than classic paxos, depending on the workload.

To test, I used a cluster of 3 m3.xlarge instances in us-east, and a 4th instance executing queries against the cluster. Each C* node was in a different az. Commit log and data directories were on different disks. There were 2 tests, each running 10k queries against the cluster. The first test measured throughput using queries that wouldn't contend with each other: each query inserted a row for a different partition. The second test measured performance under contention, where every query contended for the same partition. Each test was run with 1, 5, and 10 concurrent client requests.

With the uncontended workload, epaxos request time is 10-14% faster than the current implementation on average. See: https://docs.google.com/spreadsheets/d/1olMYCepsE_02bMyfzV0Hke5UKuqoCNNjSIjR9yNs5iI/edit?pli=1#gid=0

With the contended workload, epaxos request time is 4.5x-11.5x faster than the current implementation on average. See: https://docs.google.com/spreadsheets/d/1olMYCepsE_02bMyfzV0Hke5UKuqoCNNjSIjR9yNs5iI/edit?pli=1#gid=1327463955

There are 2 epaxos sections: regular, and cached. With higher contended request concurrency, the execution algorithm has to visit a lot of unexecuted instances to build its dependency graph. Reading the dependency data and instances out of their tables and deserializing them for each visit slows epaxos down to the point where it's over twice as slow as classic paxos. By using a guava cache for the instance and dependency data objects, and keeping them around for a few minutes, epaxos is ~30x faster in higher contention/concurrency situations.

Some notes on the concurrent contended tests:
* The median query time for epaxos is a little slower than classic paxos for 5 concurrent contended requests. This is because epaxos is now doing an accept phase on a lot of the queries, and because classic paxos doesn't send commit messages out if the predicate doesn't apply to the query.
* With concurrent contending queries, 1-2.5% of the classic paxos queries time out and fail. At this level, there are no failing epaxos queries.
* Variance in query times is also much lower with epaxos. With 10 concurrent contending requests, the 95th %ile request time for classic paxos is 23x the median; for epaxos it is 1.8x.
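The "cached" variant above keeps deserialized instance and dependency objects in a Guava cache. A minimal JDK-only stand-in for that idea (the real patch presumably uses com.google.common.cache; this LRU map is only a sketch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered LinkedHashMap that evicts the least recently used entry
// beyond a fixed cap - a rough stand-in for a size-bounded Guava cache of
// deserialized instance and dependency objects.
public final class InstanceCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public InstanceCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: gets refresh an entry's recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once the cap is exceeded
    }
}
```

Unlike a Guava cache, this has no time-based expiry, so "keeping them around for a few minutes" would need an additional eviction policy.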
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165151#comment-14165151 ] Blake Eggleston commented on CASSANDRA-6246:

Since epaxos executes mutations at different times on each machine, each instance needs a serialized copy of the statement. The CQL3CasRequest.RowUpdate class keeps a reference to the actual ModificationStatement, and serializing that looks like it will involve implementing at least 50 (de)serializers. Since I'm not super familiar with the inner workings of UpdateStatement and DeleteStatement, I thought I'd ask here to see if there's a better solution I'm not seeing.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165177#comment-14165177 ] T Jake Luciani commented on CASSANDRA-6246: ---

Can't you just call .getMutations on the statements and serialize the actual RowMutations?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165185#comment-14165185 ] Blake Eggleston commented on CASSANDRA-6246:

That would work most of the time, but a few operations do a read before a write. I suppose I could narrow the serialization support down to just the operations and terms that are involved in those statements, but I'd like to avoid special casing specific operations if possible.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159620#comment-14159620 ] Blake Eggleston commented on CASSANDRA-6246:

I've been thinking through how epaxos would be affected by repair, read repair, and hints.

Since both the read and write parts of an epaxos instance are executed locally and asynchronously, it's possible that a repair could write the result of an instance to a node before that instance is executed on that node. This would cause the decision of that epaxos instance to be different on the node being repaired, which could create an inconsistency between nodes. Although it's difficult to imagine an instance taking more time to execute than a repair, I don't think it's impossible, and it would introduce inconsistencies during normal operation.

Something that would be more likely to cause problems would be someone performing a quorum read on a key that has instances in flight, and triggering a read repair on that key. Hints would have a similar problem, but they would also mean that people are mixing serialized and unserialized writes concurrently.

Having the node sending the repair message include some metadata about the most recent executed instance(s) it's aware of is the best solution I've come up with so far. If the receiving node is behind, it could work with the sending node to catch up before performing the repair.
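The proposed repair-message metadata boils down to a comparison of execution progress; a sketch with hypothetical names:

```java
// Hypothetical shape of the repair-time check proposed above: the sender
// includes the execution count it knows for the range, and the receiver
// defers applying repair data while it is behind.
public final class RepairCheck {
    public static boolean shouldDefer(long sendersExecutedCount, long localExecutedCount) {
        // The sender's data reflects executions this node hasn't performed
        // yet; applying it now would leak results "from the future".
        return sendersExecutedCount > localExecutedCount;
    }
}
```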
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151693#comment-14151693 ] T Jake Luciani commented on CASSANDRA-6246: ---

For write timestamps, take a look at CASSANDRA-7919; we need to change this to support better LWW semantics and RAMP.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152316#comment-14152316 ] sankalp kohli commented on CASSANDRA-6246: --

Currently we do a read at quorum/local_quorum and, based on that value, decide if the condition matches at the coordinator. With your approach, the decisions will now be local and could be different on different replicas. If some replica somehow lags behind for various reasons, the condition on it will never be satisfied going forward.

Coming back to my suggestion from the previous comment: if all replicas respond back with all committed before this instance and things are the same, we can use the read value. Correct?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152336#comment-14152336 ] Blake Eggleston commented on CASSANDRA-6246:

[~tjake] that'll solve the problem of having multiple mutations at a single timestamp, but might cause other problems when the calculated execution order puts an instance after an instance with a larger timestamp. Using arbitrary timestamps in uuid1s in this case might cause some collisions, but they wouldn't be on the same cells. In any case, collisions would be less likely than they are now.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152383#comment-14152383 ] Blake Eggleston commented on CASSANDRA-6246: [~kohlisankalp] right, the incrementing ballot numbers per partition in the current implementation, and the quorum read basically synchronizes r/w for the partition being queried. But that synchronization creates a bottleneck, and the potential for live lock. Epaxos doesn't need to synchronize any of it's reads or writes. The preaccept, accept, and commit steps are basically building a directed graph of queries. The constraints those steps impose/satisfy allow other nodes to figure out the state of the graph on other machines in case of failure, provided it can talk to a quorum of nodes, and even if the machines with a newer view of the graph are down. At execution time, this graph is sorted to determine the execution order. Since the graph will always be the same, the order instances are executed will always be the same. So even though each machine will perform it's read and write in isolation, the other nodes are guaranteed to execute instances in the same order, and therefore, guaranteed to reach the same decision. Even though they aren't talking to each other at execution time. What can cause inconsistencies, since each node is executing instances in isolation, is users mixing serialized writes with unserialized writes. The quorum read/write in the current implementation mitigates this problem. However, like I mentioned in my comment yesterday, I think we can work out a way to detect and correct this. 
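The "same graph, same order" argument can be illustrated with a toy deterministic ordering. This sketch handles only acyclic graphs, executing any instance whose dependencies are done and breaking ties by instance id; real EPaxos additionally collapses strongly connected components and orders within them by sequence number.

```java
import java.util.*;

// Toy deterministic execution ordering: given identical committed dependency
// graphs, every replica derives the identical order without coordinating.
public final class ExecutionOrder {
    public static List<String> order(Map<String, Set<String>> deps) {
        List<String> result = new ArrayList<>();
        Set<String> executed = new HashSet<>();
        // Iterating candidates in sorted order makes tie-breaking deterministic.
        TreeSet<String> pending = new TreeSet<>(deps.keySet());
        while (!pending.isEmpty()) {
            String next = null;
            for (String id : pending) {
                if (executed.containsAll(deps.get(id))) { next = id; break; }
            }
            if (next == null)
                throw new IllegalStateException("cycle: needs SCC handling");
            pending.remove(next);
            executed.add(next);
            result.add(next);
        }
        return result;
    }
}
```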
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152592#comment-14152592 ] sankalp kohli commented on CASSANDRA-6246: -- Inconsistencies can be introduced by machines being down or network partitioned for longer than we keep missed updates to replay to them. Currently for normal writes, the hint window is 1 hour. If you bring in a machine after 1 hour, you run a repair. But repair won't help here, since it takes time to run and new LWTs will come in, see a different view of the data, and won't apply.
bq. However, like I mentioned in my comment yesterday, I think we can work out a way to detect and correct this.
+1
Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Here instance includes the condition and the update. The update could be quite big, and keeping it around could be problematic.
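The ~1.8GB figure above follows directly from the stated assumptions (an editorial back-of-the-envelope sketch, not from the thread):

```python
# Back-of-the-envelope check of the retention figures quoted above.
instance_size = 170        # bytes per instance, uncompressed (estimate)
rate = 1000                # sustained instances per second
window = 3 * 3600          # three hours of history, in seconds

total = instance_size * rate * window
print(total / 1e9)         # 1.836 -> roughly the ~1.8GB quoted
```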
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152631#comment-14152631 ] Blake Eggleston commented on CASSANDRA-6246: bq. Inconsistencies can be introduced by machines being down or network partitioned for longer than we replay missed updates to it. Currently for normal writes, hint is for 1 hour. If you bring in a machine after 1 hour, you run a repair. But repair won't help here since it takes time to run the repair and new LWTs will come and will see a different view of the data and won't apply.
For serialized queries, new instances sent to a machine that's recovering from a failure will learn of missed instances during the preaccept phase, and the machine will have to catch up before it can execute the instance and respond to the client.
{quote} Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Here instance includes the condition and update. Update could be quite big and keeping it around could be problematic. {quote}
Yeah... agree 100%. Keeping an extensive history of instances for failure recovery is not a good idea. Anyway, it doesn't even solve the problem of recovery, since you'd start to get dangling pointers. So let's forget about keeping a lot of history around. For recovering from longer outages, here's my thinking: to accurately determine dependencies for the preaccept phase, we'll need to keep references to active instances around. Otherwise we can get dependency graphs that are split, or have gaps. Active instances would be instances that have been executed, and that either a quorum of nodes has accepted as a dependency for another instance, or that were a dependency of a committed instance. This should be all the historical info we need to keep around. We might want to keep a little more so we can just use the prepare phase to recover from shorter outages.
In cases where a node is joining, or has been down for a while, it seems that if we immediately start including it in paxos messages (for record only, not to act on), then send it the current dependency data described above for a row/cell from a quorum of nodes, plus the current value for that row/cell, that should be enough for the node to start participating in instances. This way we can avoid a prepare phase that depends on persisting and transmitting a ton of data. wdyt? I haven't spent a lot of time thinking through all the edge cases, but I think it has potential for making recovery practical.
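The preaccept-time catch-up mentioned above can be sketched minimally, assuming instances are tracked by id (all names here are illustrative, not from the codebase):

```python
# Hypothetical sketch: during preaccept, a replica that has been down
# compares the dependencies the coordinator lists against what it has
# locally; anything missing must be fetched before the replica can
# execute the new instance and reply.
def missing_deps(local_ids, preaccept_dep_ids):
    """Return the dependency ids this replica has never seen."""
    return sorted(set(preaccept_dep_ids) - set(local_ids))

# A recovering replica that missed instances "c" and "d":
print(missing_deps({"a", "b"}, {"a", "b", "c", "d"}))  # ['c', 'd']
```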
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152770#comment-14152770 ] sankalp kohli commented on CASSANDRA-6246: -- Yes, makes sense.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151250#comment-14151250 ] sankalp kohli commented on CASSANDRA-6246: -- bq. Keeping executed instances
In the current implementation, we only keep the last commit per CQL partition. We can do the same for this as well. I have also been reading about epaxos recently and want to know: when do you do the condition check in your implementation?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151256#comment-14151256 ] Blake Eggleston commented on CASSANDRA-6246: bq. In the current implementation, we only keep the last commit per CQL partition. We can do the same for this as well.
Yeah, I've been thinking about that some more. Just because we could keep a bunch of historical data doesn't mean we should. There may be situations where we need to keep more than one instance around, though, specifically when the instance is part of a strongly connected component. Keeping some historical data would be useful for helping nodes recover from short failures where they miss several instances, but after a point, transmitting all the activity for the last hour or two would just be nuts. The other issue with relying on historical data for failure recovery is that you can't keep all of it, so you'd have dangling pointers on the older instances. For longer partitions, and nodes joining the ring, if we transmitted our current dependency bookkeeping for the token ranges they're replicating, the corresponding instances, and the current values for those instances, that should be enough to get going.
bq. I am also reading about epaxos recently and want to know when do you do the condition check in your implementation?
It would have to be when the instance is executed.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151261#comment-14151261 ] sankalp kohli commented on CASSANDRA-6246: -- bq. It would have to be when the instance is executed.
Since the client (the application) needs to know whether this was a success or not, I was thinking of making it part of the preaccept. When a replica gets a preaccept request, along with the last instance it can also send the values of the check. If the responses from all replicas are the same (fast path), it could be committed locally and async to other replicas. Also, the response to the client will contain whether the query succeeded or not. Make sense? PS: I am quite excited to see this implementation coming along, especially since you are working on it :)
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151289#comment-14151289 ] Blake Eggleston commented on CASSANDRA-6246: Thanks [~kohlisankalp] :) So the issue with making the check part of the preaccept phase is that you can't trust the value in the database at that point. If there are other interfering instances in flight, you don't know what order they'll be executed in until they're all committed. So one of them could change the value, and you'd have replied to the client with incorrect information. Assuming the client sends the query to a replica, things would go like this:
# receive client request
# send preaccept request to replicas and wait for a fast path quorum to respond
# assuming all responses agreed, commit locally and notify replicas asynchronously
# assuming all dependencies are committed, sort the dependency graph
# execute all instances preceding the client's instance, read the value* in question and perform the check, make the mutation*
# respond with the result to the client
*performed locally
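The stale-read hazard described above can be shown with a toy example (editorial sketch, not Cassandra code):

```python
# Two interfering instances are in flight against the same cell:
#   A: SET value = 5
#   B: IF value = 0 THEN SET value = 9
value = 0

# At preaccept time, B's condition appears to hold...
b_check_at_preaccept = (value == 0)    # True

# ...but the committed dependency graph orders A before B, so by the
# time B actually executes, its condition fails.
value = 5                               # A executes first
b_check_at_execution = (value == 0)     # False

# Replying to the client at preaccept time would have been wrong.
assert b_check_at_preaccept and not b_check_at_execution
```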
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150816#comment-14150816 ] Blake Eggleston commented on CASSANDRA-6246: I've been working through some of the concerns I'd posted about last week, and talking with Iulian Moraru and Dave Anderson, who've been really helpful. I also put together a quick python implementation to test arbitrary failure scenarios, which you can check out here: https://github.com/bdeggleston/cassandra_epaxos_prototype. The failure scenario I was worried about last week is not a problem; I'd just forgotten a step. Before I put together a plan for how to implement this, there are some things we'll need to figure out about how epaxos will fit into Cassandra's architecture. Below are the main problems / questions, and some possible solutions. Sorry for the super long comment.
*Non-replica coordinators*
Optimized Egalitarian Paxos depends on the command leader also being a replica of the key being queried, because it uses information about whether replicas agreed with the command leader's preaccept attributes to make failure recovery possible. Token aware routing should make this less of an issue, but we'll still need to handle non-replica coordinators. There are 3 options when we get a query on a non-replica node:
# Optimistically forward queries to a node that is a replica. This method offers the best case network round trips (2) for queries not sent to a replica. However, this also makes the node we're forwarding queries to a single point of failure for this query. We may be able to quickly reroute if we find the node is unreachable, but if the node is just slow to respond, the query could time out, something that might not have happened if we went with a slower but more reliable route.
# Always default to the slow path. For every query with a non-replica leader, this means sending preaccept, accept, and commit messages to all replicas.
We'd need to wait for a quorum to reply to the preaccept and accept messages, and a single reply to the commit message for the result. This method would be more reliable, but would always take 3 message round trips. The exception to the 3 round trip rule is if we receive identical preaccept responses from all replicas; in that case we could skip the accept phase.
# A compromise between the first two. When a replica receives a preaccept message from the non-replica coordinator, the replica broadcasts its preaccept response to a subset of replicas, and responds after receiving broadcasted preaccept responses from a subset of the other replicas. Receiving identical responses from enough of the replicas would allow us to commit on the fast path. This puts the best case at 2.5 round trips, but involves more network activity. It's also different enough from the normal flow that I think it would be better to make this a follow-on task if we decide it's the way to go.
*Keeping executed instances*
Ideally, we could delete our record of an instance as soon as it's a) executed, b) not a dependency of an unexecuted instance, and c) not part of a strongly connected component which hasn't been fully executed. However, when nodes which have been down for a while are recovering from failure, they need to get copies of all the instances they missed. The absolute simplest solution would be to just keep all instances persisted, ttl'd with the hinted handoff time, after which we assume that the node isn't coming back. Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Based on the lz4 benchmarks I've found, that would be ~1GB per 1000 instances per second. Hypothetically, if GAE duplicated their 1 million wps benchmark on 330 machines using epaxos, with rf3, that's ~9k instances per second, or 9GB.
That shouldn't be a problem for a machine handling that kind of load, and doesn't introduce any additional complexity.
*Write timestamps*
When there are a lot of concurrent updates on the same key, some instances will be executed in a different order than they were received. If we use the query timestamp for a write, we could write an instance with a timestamp that's before the timestamp of the last instance, so reads wouldn't see the result of the most recent instance. The commit time can't be used, because a prepare phase could have committed the same instance on different nodes at different times. Finally, the execution timestamp will be different on every node; using that would give the most recent mutation, but would cause out-of-date data to get written on repair in some situations. Using max(last_write_timestamp + 1, query_timestamp) for the timestamp would work, although in situations with a lot of writes on a single key, this could put mutations a little into the future.
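The proposed rule can be written down directly (a sketch of the formula above, nothing more):

```python
def write_timestamp(last_write_timestamp, query_timestamp):
    """Monotonic per-key write timestamps: an instance that executes
    'late' (after an instance with a larger timestamp) still lands
    strictly after the last write, at the cost of occasionally drifting
    a little ahead of wall-clock time under heavy contention."""
    return max(last_write_timestamp + 1, query_timestamp)

print(write_timestamp(100, 200))  # 200: normal case, query timestamp wins
print(write_timestamp(100, 90))   # 101: late execution bumped past last write
```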
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150819#comment-14150819 ] Blake Eggleston commented on CASSANDRA-6246: /cc [~jbellis] [~slebresne]
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150847#comment-14150847 ] Albert P Tobey commented on CASSANDRA-6246: --- For backwards compatibility, if it's possible to run both protocols, make it a configuration in the yaml. Another rolling restart to disable hybrid/dual mode isn't so bad if it removes a lot of complexity from runtime. Would also make it easy for conservative users to stick with the old paxos.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150875#comment-14150875 ] Blake Eggleston commented on CASSANDRA-6246: Making switching to a special hybrid mode a required step could be error prone. Plus, the direction you're transitioning in is important. That's really the tricky part: running serialized queries while the cluster transparently transitions from one protocol to another, specifically when the nodes for a given range can and do switch. Making it configurable could be useful, at least from a peace-of-mind / opt-in perspective. We'd have to work out how to transition back from epaxos to old paxos though.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141590#comment-14141590 ] Blake Eggleston commented on CASSANDRA-6246: I'm still poring over the discussion in CASSANDRA-5062, and the current implementation, but wanted to expand on some of the advantages, list a few disadvantages and caveats of using egalitarian paxos, and talk about a few areas where we'd probably want to deviate from the process described in the paper by Moraru et al.
Advantages:
* In the ideal case we should be able to answer a client's query after the same number of inter-node messages it takes to do a quorum write. (There will be more total messages, but we don't need to wait for them to complete before responding to the client.)
** This is assuming that each node performs the cas locally instead of using paxos to set up a quorum read/write.
* Even in the non-ideal case, you're still looking at 2 network round trips before reaching commit (it looks like the current impl has 4 network round trips for cas?).
* Much higher throughput on interfering queries is possible. Multiple in-flight queries on the same row are not a problem.
** Livelock is not a risk during normal operation, only during failure recovery. However, this can be mitigated by specifying an order of succession for query leaders. Of course, really heavy 'normal' operation might start causing failure cases.
* Granular control over which operations interfere with each other.
Disadvantages:
* The epaxos optimizations are possible because it has a pretty complex failure recovery procedure.
* The concurrent programming side of things will be more complicated than the current implementation.
* Because execution is more asynchronous than classic paxos, I think we'd have to perform the operations locally rather than using paxos to set up a normal quorum read/write. On one hand, this saves us a network round trip.
On the other hand, if people are doing non-serialized writes at the same time as serialized writes that affect the same cells, it's likely that different nodes will record different results for a query. Obviously, it's not a good idea to do this, but that doesn't mean people won't.
Caveats:
* With rf3, or a non-replica coordinator, responses from more than a quorum of replicas _may_ be needed to commit in the ideal case, or we just use the 2 message commit path in those situations. I'm still working out the details, but I'm pretty sure there are failure scenarios where not doing that could result in different values being committed after recovery.
* Epaxos is pretty new. I was talking to the authors about it a few months ago, and the only implementations we were aware of were mine and theirs... I'm pretty sure there aren't any production deployments of it. That's not _necessarily_ a bad thing, but I just wanted to point out that we are in fairly new territory, and that should be weighed against the advantages. There is no 'Making EPaxos Live' paper out there.
Places where Cassandra's architecture will likely require doing things a bit differently than outlined in the paper:
* Sequence values will cause problems, but they shouldn't be necessary.
*# Since each node is responsible for different ranges of data, and therefore would have seen different queries, encountering different seq values would be very likely, and would result in a lot of otherwise unnecessary accept phases. We could get around this by using different seq values for different token ranges, but...
*# Since we'd wait until the query is actually executed before returning a result to the client (don't know why we wouldn't), it's a superfluous requirement. I discussed this with Iulian Moraru a few months ago and he agreed.
* Using a non-replica coordinator:
*# The paper assumes that an instance leader is also a replica of the data being queried.
I'd imagine we'd want to avoid optimistically forwarding queries to a single replica and hoping it's up, which would mean allowing coordinators to lead queries for keys they don't know anything about. This would prevent the non-leaders from recording that they agree with the leader, preventing some optimizations in failure recovery. It would make a good case for using prepared statements and token aware routing.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017865#comment-14017865 ] Jonathan Ellis commented on CASSANDRA-6246: --- Good overview: http://blakeeggleston.com/egalitarian-paxos-explained.html