[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313621#comment-17313621 ]

maxwellguo commented on CASSANDRA-6246:
---------------------------------------

[~bdeggleston] any update?

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/Lightweight Transactions, Legacy/Coordination
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Normal
>              Labels: LWT, messaging-service-bump-required
>             Fix For: 4.x
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence, a period of
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to
> implement it.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128718#comment-16128718 ]

Joshua McKenzie commented on CASSANDRA-6246:
--------------------------------------------

bq. I'm looking for a solution to implement a reference counter based on Cassandra.
bq. Dear community, do you have any idea?

Please take questions to the user mailing list if you haven't already. JIRA is for discussion concerning development of Cassandra internals.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128159#comment-16128159 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

I would like to try, but I'm not familiar with the Cassandra source code. :( Wouldn't it be easier to implement the patch again, rather than rebasing four-year-old code?

BTW, I'm looking for a solution to implement a *reference counter based on Cassandra*. My first reference counter implementation was built on counter columns, but unfortunately it was ruined by the tombstone issue: when a counter gets back to zero, I can neither delete nor compact it. My guess was that Cassandra's lightweight transactions could do a very good job for my task. I was naive, and now I have an issue with WriteTimeoutException and inconsistent state. The only workaround I came up with today is to take an exclusive lock, which can easily be built with LWT plus a TTL, and then change the value afterwards, but that will have a much greater performance hit. I'm still looking for a good solution to this with Cassandra. Currently I'm being naive again and expecting that EPaxos will help me, but it seems it will never, ever be merged and released.

Dear community, do you have any idea? Huge thanks to everyone who answers.
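The lock-plus-TTL workaround mentioned above can be sketched in miniature. This is an in-memory simulation of the semantics, with invented names (it is not Cassandra code); in Cassandra itself the acquire step would be a conditional write along the lines of `INSERT INTO locks (name, owner) VALUES (?, ?) IF NOT EXISTS USING TTL 30`, so a crashed holder's lock expires on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory simulation of the lock-with-TTL workaround described above.
// All names here are illustrative, not Cassandra classes.
class LwtLock
{
    private static final class Entry
    {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis)
        {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> locks = new ConcurrentHashMap<>();

    // Compare-and-set, like LWT's IF NOT EXISTS: succeeds only if the lock
    // is absent or its TTL has lapsed. Time is passed in for testability.
    boolean tryAcquire(String name, String owner, long ttlMillis, long nowMillis)
    {
        Entry fresh = new Entry(owner, nowMillis + ttlMillis);
        Entry result = locks.compute(name, (k, cur) ->
                (cur == null || cur.expiresAtMillis <= nowMillis) ? fresh : cur);
        return result == fresh;
    }

    // Only the current holder may release early; otherwise the TTL handles it.
    void release(String name, String owner)
    {
        locks.computeIfPresent(name, (k, cur) -> cur.owner.equals(owner) ? null : cur);
    }
}
```

The value update then happens only while the lock is held, which is exactly why this costs extra round trips compared to a bare LWT update.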
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127782#comment-16127782 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

What are you looking for with this patch? It would help if you could rebase this patch and see if someone can review it.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127713#comment-16127713 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

It is a pity that these lightweight transactions cannot be used to full effect because of the delay in merging this improvement. I refer to CASSANDRA-9328. I would give the merge the highest priority.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127426#comment-16127426 ]

Joshua McKenzie commented on CASSANDRA-6246:
--------------------------------------------

[~geagle]: given that this a) needs a rebase, and b) is a [massive patch|https://github.com/apache/cassandra/compare/trunk...bdeggleston:CASSANDRA-6246-trunk] that has yet to be reviewed, I'd expect a substantial delay before this is ready to merge. Not to put words in Blake's mouth, but I'd assume a post-4.0 world.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126756#comment-16126756 ]

Igor Zubchenok commented on CASSANDRA-6246:
-------------------------------------------

Any update on when this can be released?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858602#comment-15858602 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

It hasn't been forgotten, but I don't have any updates right now.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858041#comment-15858041 ]

Dobrin commented on CASSANDRA-6246:
-----------------------------------

Just wondering, has there been any progress since 2015? Or are there plans not to put EPaxos in Cassandra at all? Thanks.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980634#comment-14980634 ]

Jim Meyer commented on CASSANDRA-6246:
--------------------------------------

Does anyone know if this patch will help with CASSANDRA-9328 (i.e. the outcome of an LWT not being reported to the client when there is contention)? There's a suggestion to that effect in the comments of 9328, but I don't know if anyone has tried running the test code in 9328 to see if this patch has an effect on that issue.

Is this patch compatible with rc2 of Cassandra 3.0.0, or does it need to be updated? When is it planned to add epaxos to an official build? Thanks for any info.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980722#comment-14980722 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

bq. Does anyone know if this patch will help with CASSANDRA-9328

It should, yes.

bq. Is this patch compatible with rc2 of Cassandra 3.0.0 or does it need to be updated?

It needs to be rebased onto cassandra-3.0; there are a few parts where it interacts directly with the cell timestamps.

bq. When is it planned to add epaxos to an official build?

There are no plans at the moment; the patch still needs to be reviewed.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

So I think this is at a point where it's ready for review. Epaxos rebased against trunk is here: https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The supporting dtests are here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.

Most of the problems and solutions for supporting repair, read repair, bootstrap, and failure recovery are discussed above, so I won't go into them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool upgradepaxos. That needs to be run on each node after the cluster has been upgraded, and will transition a cluster from paxos to epaxos. Serialized queries are still processed during the upgrade. How that works is explained here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83. For new clusters, or nodes being added to upgrading clusters, there's a yaml parameter you can set which will start the cluster on epaxos.

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely separate. Keeping track of the two with the same set of metadata would have introduced a ton of complexity. Obviously mixing the two is a bad idea, and you give up the serialization guarantees when you do it anyway, so epaxos doesn't even bother trying.

There is some weird stuff going on with how some things are serialized. First, for cas requests, the statement objects aren't serialized when attached to instances... the query strings and their parameters are. This is because serializing the important parts of a CQLCasRequest would have required serializers for dozens of classes that I'm not familiar with, and that didn't seem to be intended for serialization. Serializing the query string and parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 will make it a bit easier to do this correctly. Second, most of the epaxos metadata is persisted as blobs. This is mainly because the dependency graph metadata is very queue-like in its usage patterns, so dumping it as a big blob when there are changes prevents a lot of headaches. For the instances, since there are three types, each with different attributes, it seemed less risky to maintain a single serializer vs a serializer and select/insert statements. At this point, it would probably be ok to break it out into a proper table, but I don't have strong feelings about it either way. The token metadata was saved as blobs just because the other two were, iirc.

Regarding performance, I've run some tests today, this time with one and two datacenters and with more concurrent clients. For a single datacenter, the epaxos median response time is 40-50% faster than regular paxos. However, the 95th and 99th percentiles are actually worse. I'm not sure why that is, but I will be looking into it in the next week or so. In multiple datacenters, epaxos is 70-75% faster than regular paxos, and the 95th and 99th percentiles are 50-70% faster as well. I haven't tested contended performance yet, and will do that in the next week or so. I'd expect the results to be similar to last time, though.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos services wrap calls to singletons in protected methods to make testing the interaction of multiple nodes in unit tests straightforward. In addition to dtests, there are a bunch of junit tests that put a simulated cluster in strange and probably rare failure conditions to test recovery, like this one: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144. I've also written a test that executes thousands of queries against a simulated cluster in a single thread, randomly turning nodes on and off, and checking that each node executed instances in the same order. It's pretty ugly, and needs to be expanded, but it has been useful in uncovering small bugs. It's here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-L48
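The query-string-plus-parameters serialization described above can be sketched roughly as follows. This is a hypothetical, simplified serializer for illustration (not the one in the patch): the coordinator ships the CQL text and the raw bound-parameter bytes, and the receiving node re-prepares the statement locally and binds the parameters.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of serializing a cas request as query text plus raw
// parameter bytes, rather than serializing the statement objects themselves.
class QuerySerializer
{
    static byte[] serialize(String query, List<byte[]> params)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(query);            // query string, length-prefixed
            out.writeInt(params.size());    // number of bound parameters
            for (byte[] p : params)
            {
                out.writeInt(p.length);     // each parameter as opaque bytes
                out.write(p);
            }
            return bytes.toByteArray();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e); // can't happen for in-memory streams
        }
    }

    static Map.Entry<String, List<byte[]>> deserialize(byte[] data)
    {
        try
        {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String query = in.readUTF();
            int count = in.readInt();
            List<byte[]> params = new ArrayList<>(count);
            for (int i = 0; i < count; i++)
            {
                byte[] p = new byte[in.readInt()];
                in.readFully(p);
                params.add(p);
            }
            // the receiving node would re-prepare `query` and bind `params`
            return new AbstractMap.SimpleEntry<>(query, params);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is exactly the one described above: re-parsing the query on the receiving side costs a little, but avoids hand-written serializers for dozens of statement-internal classes.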
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363951#comment-14363951 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I still have some things I need to complete on this before it's really ready for review, but I haven't had the time. Maybe in a week or so.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363947#comment-14363947 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

I am not finding time to review this. If someone else can pick it up, that would be great.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339463#comment-14339463 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I just merged some commits into my CASSANDRA-6246 branch. This is the initial implementation of the epoch, instance gc, streaming/repair, read repair, and failure recovery logic. I also have a dtests fork that tests it here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246. I still have some items to complete before submitting for review, but nothing major (relative to this stuff). Mostly cleaning up and refining things.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339572#comment-14339572 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

Let me take a look.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335886#comment-14335886 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I should have them merged in, and the ticket updated with my progress, within the next few days.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335767#comment-14335767 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

When do you plan to merge the new branches back into CASSANDRA-6246?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267389#comment-14267389 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

Sure. Makes sense.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265909#comment-14265909 ]

sankalp kohli commented on CASSANDRA-6246:
------------------------------------------

I am a little confused as to how you will use the epoch to make sure instances are executed on all replicas when incrementing?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266223#comment-14266223 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

By using the existing epaxos ordering constraints. Incrementing the epoch is done by an instance which takes, as dependencies, all unacknowledged instances for the token range whose epoch it's incrementing. The epoch can only be incremented if all previous instances have also been executed.
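The gating rule above can be sketched in a few lines. The names here (EpochTracker, begin/execute) are invented for illustration and are not the patch's classes: the epoch for a token range can only advance once every instance it depends on has executed.

```java
import java.util.*;

// Minimal sketch of epoch incrementing gated on instance execution.
// Illustrative names only; this is not code from the CASSANDRA-6246 branch.
class EpochTracker
{
    private long epoch = 0;
    private final Set<UUID> unexecuted = new HashSet<>();

    void begin(UUID instanceId)   { unexecuted.add(instanceId); }    // accepted, not yet executed
    void execute(UUID instanceId) { unexecuted.remove(instanceId); }

    // An epoch-increment instance takes every unexecuted instance in the
    // range as a dependency, so it can only execute (and bump the epoch)
    // once all of them have been executed.
    boolean tryIncrementEpoch()
    {
        if (!unexecuted.isEmpty())
            return false;
        epoch++;
        return true;
    }

    long epoch() { return epoch; }
}
```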
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230331#comment-14230331 ] Blake Eggleston commented on CASSANDRA-6246: Since it looks like the performance improvements from epaxos could be worth the (substantial) added complexity, I’ve been thinking through problems are caused by the need to garbage collect instances, and repair causing inconsistencies by sending data from ‘the future’. For repair, the only thing I’ve thought of that would work 100% of the time would be to count executed instances for a partition, and to send that count along with the repair request. If the remote count is higher than the local count, we know for sure that it has data from the future, and the repair for that partition should be deferred. For garbage collection, we’ll need to support a failure recovery mode that works without all historical instances. We also need a way to quickly determine if a prepare phase should be used, or we need a epaxos repair type operation to bring a node up to speed. Breaking the continuous execution space of partition ranges into discrete epochs would give us a relatively straightforward way of solving all of these problems. Each partition range will have it’s own epoch number. At a given instance number threshold, time threshold, or event, epaxos will run an epoch increment instance. It will take every active instance in it’s partition range as a dependency. Any instance executed before the epoch instance belongs to the last epoch, any executed after belong to the new one. How this would solve the outstanding problems: Garbage Collection: Any instance from 2 or more epochs ago can be deleted. Although epoch incrementing instances doesn’t prevent dependencies on the previous epoch, it does prevent dependencies from the previous-1 epoch Repair: Counting executions allows us to determine if repair data is from the future. 
Epochs let us scope execution counts to an epoch. If the epoch has incremented twice without new executions for a partition, the bookkeeping data for that partition can be deleted. This gives us a race-free way to delete old execution counts, so bookkeeping data isn't kept around forever.

Failure recovery: Using epochs makes the decision between prepare and failure recovery unambiguous. If a node is missing instances from 2 or more epochs ago, it will need to run a failure recovery; otherwise, prepare phases will work. Additionally, using an epaxos instance as the method of incrementing epochs guarantees that a given instance has been executed once the epoch has been incremented twice.
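The epoch rules above reduce to simple comparisons. A toy sketch, with hypothetical names (this is not the actual patch code):

```java
// Illustrative sketch of the epoch rules described above. All names here are
// hypothetical, not Cassandra's actual EPaxos classes.
public final class EpochGc {
    // An instance becomes deletable once the epoch of its partition range has
    // advanced at least twice past the epoch it executed in: the intervening
    // epoch-increment instance depended on it, so nothing two epochs later
    // can still name it as a dependency.
    public static boolean canDelete(long instanceEpoch, long currentEpoch) {
        return currentEpoch - instanceEpoch >= 2;
    }

    // Deciding between a prepare phase and full failure recovery becomes the
    // same comparison: missing instances from 2+ epochs ago means recovery.
    public static boolean needsFailureRecovery(long oldestMissingEpoch, long currentEpoch) {
        return currentEpoch - oldestMissingEpoch >= 2;
    }

    public static void main(String[] args) {
        System.out.println(canDelete(5, 7)); // two epochs old -> deletable
        System.out.println(canDelete(6, 7)); // previous epoch -> keep
    }
}
```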
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207036#comment-14207036 ] sankalp kohli commented on CASSANDRA-6246: --

One of the features we keep hearing about from people moving from an RDBMS background is replicated-log style replication. This provides timeline consistency when you do reads, say in another DC after a DC failure. Currently in C*, say you did 3 writes A, B and C, and B could not be replicated to the other DC. After failover, you will read A and C but not B. This breaks a lot of things for some applications.

One of the advantages of epaxos is that it orders all writes on all machines. If all writes are done via epaxos, I think it provides the above timeline consistency. So apart from epaxos being fast, I think this is a very important feature we get with it. What do you think [~bdeggleston]?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207383#comment-14207383 ] sankalp kohli commented on CASSANDRA-6246: --

Regarding reviewing the patch: I have some cleanups/suggestions for the code. I am yet to see the whole code, and I won't note down the things which still need to be taken care of or coded.

1) In the DependencyManager, we might want to keep the last executed instance; otherwise we won't know if the next one depends on the previous one, or whether we have missed any in between.
2) You might want to create java packages and move files there. For example in the repair code, org.apache.cassandra.repair.messages is where we keep all the requests and responses. We can do the same for verb handlers, etc.
3) We should add the new verbs to DatabaseDescriptor.getTimeout(); otherwise they will use the default timeout. I fixed this for the current paxos implementation in CASSANDRA-7752.
4) PreacceptResponse.failure can also accept missingInstances in the constructor. You can make it final and not volatile.
5) ExecutionSorter.getOrder(): here the if condition uncommitted.size() == 0 is always true. Also loadedScc is empty, as we don't insert into it.
6) In ExecuteTask.run(), Instance toExecute = state.loadInstance(toExecuteId); should be within the try, as we are holding a lock.
7) EpaxosState.commitCallbacks could be a multimap.
8) In Instance.java, successors, noop and fastPathPossible are not used. We can also get rid of the Instance.applyRemote() method.
9) PreacceptCallback.ballot need not be an instance variable, as we set completed=true after we set it.
10) PreacceptResponse.missingInstance is not required, as it can be calculated on the leader in the PreacceptCallback.
11) EpaxosState.accept(): we can filter out the skipPlaceholderPredicate when we calculate missingInstances in PreacceptCallback.getAcceptDecision().
12) PreacceptCallback.getAcceptDecision(): we don't need to calculate missingIds if accept is going to be false in the AcceptDecision.
13) ParticipantInfo.remoteEndpoints: here we are not doing any isAlive check and are just sending messages to all remote endpoints.
14) ParticipantInfo.endpoints will not be required once we remove Epaxos.getSuccessors().
15) Accept is sent to live local endpoints and to all remote endpoints. In AcceptCallback, I think we should count responses from only local endpoints.
16) When we execute the instance in ExecuteTask, what if we crash after executing the instance but before recording it?
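Suggestion 7 above (commitCallbacks as a multimap) can be sketched with plain JDK collections; the class and method names here are hypothetical, not the patch's actual API:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Multimap-style commit callbacks using only the JDK: several callbacks can
// be registered per instance id and all fire when that instance commits.
// Names (registerCommitCallback, onCommit) are illustrative.
public final class CommitCallbacks {
    private final Map<UUID, List<Runnable>> callbacks = new ConcurrentHashMap<>();

    public void registerCommitCallback(UUID instanceId, Runnable cb) {
        callbacks.computeIfAbsent(instanceId,
                                  id -> Collections.synchronizedList(new ArrayList<>()))
                 .add(cb);
    }

    // Run and clear every callback registered for the committed instance.
    public void onCommit(UUID instanceId) {
        List<Runnable> cbs = callbacks.remove(instanceId);
        if (cbs != null)
            cbs.forEach(Runnable::run);
    }
}
```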
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207557#comment-14207557 ] Blake Eggleston commented on CASSANDRA-6246:

Thanks Sankalp. Since my last post, I've been cleaning things up and improving the tests. Sorry for the delay pushing it up. I also found a problem in the execution phase that was slowing things down. Epaxos is now 40% faster than the existing implementation in uncontended workloads, and 20x faster in contended workloads. Here are the performance numbers: https://docs.google.com/spreadsheets/d/1inBuO5bxo_b36jnTn5Ff9UCOhnMGLcx6EyNxp2nFM_Q/edit?usp=sharing

bq. 1) In the DependencyManager, we might want to keep the last executed instance otherwise we won't know if the next one depends on the previous one or we have missed any in between.

Instances only become eligible for eviction when they've been both executed and acknowledged. An executed instance will be a dependency of at least one additional instance before being evicted from the manager.

{quote}
2) You might want to create java packages and move files there. For example in repair code, org.apache.cassandra.repair.messages where we keep all the Request Responses. We can do the same for verb handler, etc.
3) We should add the new verbs to DatabaseDescriptor.getTimeout(). Otherwise they will use the default timeout. I fixed this for current paxos implementation in CASSANDRA-7752
4) PreacceptResponse.failure can also accept missingInstances in the constructor. You can make it final and not volatile.
{quote}

I'll look into these.

bq. 5) ExecutionSorter.getOrder(). Here if condition uncommitted.size() == 0 is always true. Also loadedScc is empty as we don't insert into it.

Ids are being put into uncommitted in the addInstance method, so it won't always equal 0. Good catch on the loadedScc though, I'll get that fixed.

bq. 6) In ExecuteTask.run(), Instance toExecute = state.loadInstance(toExecuteId); should be within the try as we are holding a lock.

Fixed in the cleaned up code.

bq. 7) EpaxosState.commitCallbacks could be a multimap.

Agreed, I'll update.

{quote}
8) In Instance.java, successors, noop and fastPathPossible are not used. We can also get rid of Instance.applyRemote() method.
14) ParticipantInfo.endpoints will not be required once we remove the Epaxos.getSuccessors()
{quote}

successors and noop will be used in the prepare and execute phases respectively; fastPathImpossible should be removed though.

bq. 9) PreacceptCallback.ballot need not be an instance variable as we set completed=true after we set it.

Agreed, I'll update.

{quote}
10) PreacceptResponse.missingInstance is not required as it can be calculated on the leader in the PreacceptCallback.
11) EpaxosState.accept(). We can filter out the skipPlaceholderPredicate when we calculated missingInstances in PreacceptCallback.getAcceptDecision()
{quote}

Missing instances are sent both ways. When a node responds to a preaccept message, if it believes the leader is missing an instance, it will include it in its response. Once the leader has received all the responses, if it thinks any of the replicas are missing instances, it will send them along.

{quote}
12) PreacceptCallback.getAcceptDecision() We don't need to calculate missingIds if accept is going to be false in AcceptDecision.
13) ParticipantInfo.remoteEndpoints. Here we are not doing any isAlive check and just sending messages to all remote endpoints.
{quote}

I'll fix.

bq. 15) Accept is sent to live local endpoints and to all remote endpoints. In AcceptCallback, I think we should count responses from only local endpoints

Fixed in the cleaned up code.

bq. 16) When we execute the instance in ExecuteTask, what if we crash after executing the instance but before recording it.

Saving the best for last, I see :) The existing implementation has this problem as well. Cassandra doesn't have a way to mutate multiple keyspaces with a single commit log entry (that I've found). We could collect the mutations from the actual cas write, the dependency manager update, and the instances update, and hold off on applying them until the very end, but that only makes the problem less likely.

Speaking of which, the default of not waiting for an fsync before considering a write successful is a more serious problem for paxos/epaxos, since a paxos node forgetting its state can cause inconsistencies. I'll give this and your timeline consistency question some more thought.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207741#comment-14207741 ] sankalp kohli commented on CASSANDRA-6246: --

{quote}
5) ExecutionSorter.getOrder(). Here if condition uncommitted.size() == 0 is always true. Also loadedScc is empty as we don't insert into it.
Ids are being put into uncommitted in the addInstance method, so it won't always equal 0.
{quote}

We only call ExecutionSorter.getOrder() in the else branch of executionSorter.uncommitted.size() > 0 in ExecuteTask.run(), so we can remove the check.

{quote}
Missing instances are sent both ways. When a node responds to a preaccept message, if it believes the leader is missing an instance, it will include it in its response. Once the leader has received all the responses, if it thinks any of the replicas are missing instances, it will send them along.
{quote}

I think there is no need to send them. Since we are sending all the dependencies of the endpoint in the response to the leader, the leader can do the diff. There is no point sending duplicate information over the wire. So I think in PreacceptVerbHandler, we don't need to calculate and send the missing instances.

{quote}
Speaking of which, the default of not waiting for an fsync before considering a write successful is a more serious problem for paxos/epaxos, since a paxos node forgetting its state can cause inconsistencies.
{quote}

I agree we can tackle this later. But here it is more dangerous, because once an endpoint is out of sync, no further updates can be applied, as condition checks are local. In current paxos, if a machine is in this situation and could not apply the commit, the next commit will still be applied, as condition checks are at quorum level.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199147#comment-14199147 ] sankalp kohli commented on CASSANDRA-6246: --

In PreacceptCallback, boolean fpQuorum = numResponses >= participantInfo.fastQuorumSize; will always be false, since we don't accept any requests after a quorum number of requests have come in. This will cause the accept phase to always run, even when there is no contention. Am I reading it right?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199170#comment-14199170 ] Blake Eggleston commented on CASSANDRA-6246:

Only with a replication factor > 5. For rf <= 5, the fast path quorum size is the same as the basic quorum size.
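The quorum arithmetic behind this exchange can be sketched as follows, assuming the optimized fast-path quorum from the EPaxos paper: with N = 2F + 1 replicas, a simple quorum is F + 1 and a fast quorum is F + floor((F + 1) / 2). Whether the patch uses exactly this formula is an assumption.

```java
// Quorum sizes under the (assumed) optimized EPaxos fast-path formula.
public final class QuorumSizes {
    // Simple majority quorum: floor(rf / 2) + 1.
    public static int simpleQuorum(int rf) {
        return rf / 2 + 1;
    }

    // Optimized fast-path quorum: F + floor((F + 1) / 2), with rf = 2F + 1.
    public static int fastQuorum(int rf) {
        int f = (rf - 1) / 2;
        return f + (f + 1) / 2;
    }

    public static void main(String[] args) {
        for (int rf : new int[] {3, 5, 7})
            System.out.println("rf=" + rf + " quorum=" + simpleQuorum(rf)
                               + " fast=" + fastQuorum(rf));
    }
}
```

For rf = 3 and rf = 5 the two sizes coincide, so a callback that stops collecting responses at a simple quorum can still observe a fast-path quorum; at rf = 7 they diverge (4 vs 5).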
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14197053#comment-14197053 ] sankalp kohli commented on CASSANDRA-6246: --

Let me take a look at the patch. It is a big one :)
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194659#comment-14194659 ] Blake Eggleston commented on CASSANDRA-6246:

I have an initial implementation here: https://github.com/bdeggleston/cassandra/compare/CASSANDRA-6246?expand=1

It's still pretty rough; I just wanted to get it to a point where we could get a feel for the performance advantages and decide if the additional complexity was worth it. There's also none of the instance gc / optimized failure recovery we've been talking about.

I did some performance comparisons over the weekend. The tldr is that epaxos is 10% to 11.5x faster than classic paxos, depending on the workload.

To test, I used a cluster of 3 m3.xlarge instances in us-east, and a 4th instance executing queries against the cluster. Each C* node was in a different az. Commit log and data directories were on different disks. There were 2 tests, each running 10k queries against the cluster. The first test measured throughput using queries that wouldn't contend with each other: each query inserted a row for a different partition. The second test measured performance under contention, where every query contended for the same partition. Each test was run with 1, 5, and 10 concurrent client requests.

With the uncontended workload, epaxos request time is 10-14% faster than the current implementation on average. See: https://docs.google.com/spreadsheets/d/1olMYCepsE_02bMyfzV0Hke5UKuqoCNNjSIjR9yNs5iI/edit?pli=1#gid=0

With the contended workload, epaxos request time is 4.5x-11.5x faster than the current implementation on average. See: https://docs.google.com/spreadsheets/d/1olMYCepsE_02bMyfzV0Hke5UKuqoCNNjSIjR9yNs5iI/edit?pli=1#gid=1327463955

There are 2 epaxos sections: regular, and cached. With higher contended request concurrency, the execution algorithm has to visit a lot of unexecuted instances to build its dependency graph. Reading the dependency data and instances out of their tables and deserializing them for each visit slows epaxos down to the point where it's over twice as slow as classic paxos. By using a guava cache for the instance and dependency data objects, and keeping them around for a few minutes, epaxos is ~30x faster in higher contention/concurrency situations.

Some notes on the concurrent contended tests:
* The median query time for epaxos is a little slower than classic paxos for 5 concurrent contended requests. This is because epaxos is now doing an accept phase on a lot of the queries, and because classic paxos doesn't send commit messages out if the predicate doesn't apply to the query.
* With concurrent contending queries, 1-2.5% of the classic paxos queries time out and fail. At this level, there are no failing epaxos queries.
* Variance in query times is also much lower with epaxos. With 10 concurrent contending requests, the 95th %ile request time for classic paxos is 23x the median; for epaxos it is 1.8x.
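The "cached" variant above keeps deserialized instance and dependency objects in a Guava cache. A minimal JDK-only stand-in for that idea (the real patch presumably uses com.google.common.cache; this LRU map is only a sketch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered LinkedHashMap that evicts the least recently used entry
// beyond a fixed cap - a rough stand-in for a size-bounded Guava cache of
// deserialized instance and dependency objects.
public final class InstanceCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public InstanceCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: gets refresh an entry's recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once the cap is exceeded
    }
}
```

Unlike a Guava cache, this has no time-based expiry, so "keeping them around for a few minutes" would need an additional eviction policy.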
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165151#comment-14165151 ] Blake Eggleston commented on CASSANDRA-6246:

Since epaxos executes mutations at different times on each machine, each instance needs a serialized copy of the statement. The CQL3CasRequest.RowUpdate class keeps a reference to the actual ModificationStatement, and serializing that looks like it will involve implementing at least 50 (de)serializers. Since I'm not super familiar with the inner workings of UpdateStatement and DeleteStatement, I thought I'd ask here to see if there's a better solution I'm not seeing.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165177#comment-14165177 ] T Jake Luciani commented on CASSANDRA-6246: ---

Can't you just call .getMutations on the statements and serialize the actual RowMutations?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165185#comment-14165185 ] Blake Eggleston commented on CASSANDRA-6246:

That would work most of the time, but a few operations do a read before a write. I suppose I could narrow the serialization support down to just the operations and terms that are involved in those statements, but I'd like to avoid special casing specific operations if possible.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159620#comment-14159620 ] Blake Eggleston commented on CASSANDRA-6246:

I've been thinking through how epaxos would be affected by repair, read repair, and hints.

Since both the read and write parts of an epaxos instance are executed locally and asynchronously, it's possible that a repair could write the result of an instance to a node before that instance is executed on that node. This would cause the decision of that epaxos instance to be different on the node being repaired, which could create an inconsistency between nodes. Although it's difficult to imagine an instance taking more time to execute than a repair, I don't think it's impossible, and it would introduce inconsistencies during normal operation.

Something that would be more likely to cause problems would be someone performing a quorum read on a key that has instances in flight, and triggering a read repair on that key. Hints would have a similar problem, but they would also mean that people are mixing serialized and unserialized writes concurrently.

Having the node sending the repair message include some metadata about the most recent executed instance(s) it's aware of is the best solution I've come up with so far. If the receiving node is behind, it could work with the sending node to catch up before performing the repair.
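The proposed repair-message metadata boils down to a comparison of execution progress; a sketch with hypothetical names:

```java
// Hypothetical shape of the repair-time check proposed above: the sender
// includes the execution count it knows for the range, and the receiver
// defers applying repair data while it is behind.
public final class RepairCheck {
    public static boolean shouldDefer(long sendersExecutedCount, long localExecutedCount) {
        // The sender's data reflects executions this node hasn't performed
        // yet; applying it now would leak results "from the future".
        return sendersExecutedCount > localExecutedCount;
    }
}
```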
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151693#comment-14151693 ] T Jake Luciani commented on CASSANDRA-6246: ---

For write timestamps, take a look at CASSANDRA-7919; we need to change this to support better LWW semantics and RAMP.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152316#comment-14152316 ] sankalp kohli commented on CASSANDRA-6246: --

Currently we do a read at quorum/local_quorum and, based on that value, decide if the condition matches at the coordinator. With your approach, the decisions will now be local and could be different on different replicas. If some replica somehow lags behind for various reasons, the condition on it will never be satisfied going forward.

Coming back to my suggestion from the previous comment: if all replicas respond back with all committed before this instance and things are the same, we can use the read value. Correct?
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152336#comment-14152336 ] Blake Eggleston commented on CASSANDRA-6246:

[~tjake] that'll solve the problem of having multiple mutations at a single timestamp, but might cause other problems when the calculated execution order puts an instance after an instance with a larger timestamp. Using arbitrary timestamps in uuid1s in this case might cause some collisions, but they wouldn't be on the same cells. In any case, collisions would be less likely than they are now.
[jira] [Commented] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152383#comment-14152383 ] Blake Eggleston commented on CASSANDRA-6246: [~kohlisankalp] right, the incrementing ballot numbers per partition in the current implementation, and the quorum read basically synchronizes r/w for the partition being queried. But that synchronization creates a bottleneck, and the potential for live lock. Epaxos doesn't need to synchronize any of it's reads or writes. The preaccept, accept, and commit steps are basically building a directed graph of queries. The constraints those steps impose/satisfy allow other nodes to figure out the state of the graph on other machines in case of failure, provided it can talk to a quorum of nodes, and even if the machines with a newer view of the graph are down. At execution time, this graph is sorted to determine the execution order. Since the graph will always be the same, the order instances are executed will always be the same. So even though each machine will perform it's read and write in isolation, the other nodes are guaranteed to execute instances in the same order, and therefore, guaranteed to reach the same decision. Even though they aren't talking to each other at execution time. What can cause inconsistencies, since each node is executing instances in isolation, is users mixing serialized writes with unserialized writes. The quorum read/write in the current implementation mitigates this problem. However, like I mentioned in my comment yesterday, I think we can work out a way to detect and correct this. 
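The "same graph, same order" argument can be illustrated with a toy deterministic ordering. This sketch handles only acyclic graphs, executing any instance whose dependencies are done and breaking ties by instance id; real EPaxos additionally collapses strongly connected components and orders within them by sequence number.

```java
import java.util.*;

// Toy deterministic execution ordering: given identical committed dependency
// graphs, every replica derives the identical order without coordinating.
public final class ExecutionOrder {
    public static List<String> order(Map<String, Set<String>> deps) {
        List<String> result = new ArrayList<>();
        Set<String> executed = new HashSet<>();
        // Iterating candidates in sorted order makes tie-breaking deterministic.
        TreeSet<String> pending = new TreeSet<>(deps.keySet());
        while (!pending.isEmpty()) {
            String next = null;
            for (String id : pending) {
                if (executed.containsAll(deps.get(id))) { next = id; break; }
            }
            if (next == null)
                throw new IllegalStateException("cycle: needs SCC handling");
            pending.remove(next);
            executed.add(next);
            result.add(next);
        }
        return result;
    }
}
```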
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152592#comment-14152592 ] sankalp kohli commented on CASSANDRA-6246: -- Inconsistencies can be introduced by machines being down or network partitioned for longer than we keep missed updates to replay to them. Currently for normal writes, the hint window is 1 hour. If you bring in a machine after 1 hour, you run a repair. But repair won't help here, since it takes time to run and new LWTs will come in, see a different view of the data, and won't apply.
bq. However, like I mentioned in my comment yesterday, I think we can work out a way to detect and correct this.
+1
Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Here instance includes the condition and the update. The update could be quite big, and keeping it around could be problematic.
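The ~1.8GB figure above follows directly from the stated assumptions (an editorial back-of-the-envelope sketch, not from the thread):

```python
# Back-of-the-envelope check of the retention figures quoted above.
instance_size = 170        # bytes per instance, uncompressed (estimate)
rate = 1000                # sustained instances per second
window = 3 * 3600          # three hours of history, in seconds

total = instance_size * rate * window
print(total / 1e9)         # 1.836 -> roughly the ~1.8GB quoted
```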
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152631#comment-14152631 ] Blake Eggleston commented on CASSANDRA-6246: bq. Inconsistencies can be introduced by machines being down or network partitioned for longer than we replay missed updates to it. Currently for normal writes, hint is for 1 hour. If you bring in a machine after 1 hour, you run a repair. But repair won't help here since it takes time to run the repair and new LWTs will come and will see a different view of the data and won't apply.
For serialized queries, new instances sent to a machine that's recovering from a failure will learn of missed instances during the preaccept phase, and the machine will have to catch up before it can execute the instance and respond to the client.
{quote} Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Here instance includes the condition and update. Update could be quite big and keeping it around could be problematic. {quote}
Yeah... agree 100%. Keeping an extensive history of instances for failure recovery is not a good idea. Anyway, it doesn't even solve the problem of recovery, since you'd start to get dangling pointers. So let's forget about keeping a lot of history around. For recovering from longer outages, here's my thinking: to accurately determine dependencies for the preaccept phase, we'll need to keep references to active instances around. Otherwise we can get dependency graphs that are split, or have gaps. Active instances would be instances that have been executed, and that either a quorum of nodes has accepted as a dependency for another instance, or that were a dependency of a committed instance. This should be all the historical info we need to keep around. We might want to keep a little more so we can just use the prepare phase to recover from shorter outages.
In cases where a node is joining, or has been down for a while, it seems that if we immediately start including it in paxos messages (for record only, not to act on), then send it the current dependency data described above for a row/cell from a quorum of nodes, plus the current value for that row/cell, that should be enough for the node to start participating in instances. This way we can avoid a prepare phase that depends on persisting and transmitting a ton of data. wdyt? I haven't spent a lot of time thinking through all the edge cases, but I think it has potential for making recovery practical.
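The preaccept-time catch-up mentioned above can be sketched minimally, assuming instances are tracked by id (all names here are illustrative, not from the codebase):

```python
# Hypothetical sketch: during preaccept, a replica that has been down
# compares the dependencies the coordinator lists against what it has
# locally; anything missing must be fetched before the replica can
# execute the new instance and reply.
def missing_deps(local_ids, preaccept_dep_ids):
    """Return the dependency ids this replica has never seen."""
    return sorted(set(preaccept_dep_ids) - set(local_ids))

# A recovering replica that missed instances "c" and "d":
print(missing_deps({"a", "b"}, {"a", "b", "c", "d"}))  # ['c', 'd']
```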
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152770#comment-14152770 ] sankalp kohli commented on CASSANDRA-6246: -- Yes, makes sense.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151250#comment-14151250 ] sankalp kohli commented on CASSANDRA-6246: -- bq. Keeping executed instances
In the current implementation, we only keep the last commit per CQL partition. We can do the same for this as well. I have also been reading about epaxos recently and want to know: when do you do the condition check in your implementation?
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151256#comment-14151256 ] Blake Eggleston commented on CASSANDRA-6246: bq. In the current implementation, we only keep the last commit per CQL partition. We can do the same for this as well.
Yeah, I've been thinking about that some more. Just because we could keep a bunch of historical data doesn't mean we should. There may be situations where we need to keep more than one instance around, though, specifically when the instance is part of a strongly connected component. Keeping some historical data would be useful for helping nodes recover from short failures where they miss several instances, but after a point, transmitting all the activity for the last hour or two would just be nuts. The other issue with relying on historical data for failure recovery is that you can't keep all of it, so you'd have dangling pointers on the older instances. For longer partitions, and nodes joining the ring, if we transmitted our current dependency bookkeeping for the token ranges they're replicating, the corresponding instances, and the current values for those instances, that should be enough to get going.
bq. I am also reading about epaxos recently and want to know when do you do the condition check in your implementation?
It would have to be when the instance is executed.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151261#comment-14151261 ] sankalp kohli commented on CASSANDRA-6246: -- bq. It would have to be when the instance is executed.
Since the client (the application) needs to know whether this was a success or not, I was thinking of making it part of the preaccept. When a replica gets a preaccept request, along with the last instance it can also send the values of the check. If the responses from all replicas are the same (fast path), it could be committed locally and async to other replicas. Also, the response to the client will contain whether the query succeeded or not. Make sense? PS: I am quite excited to see this implementation coming along, especially since you are working on it :)
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151289#comment-14151289 ] Blake Eggleston commented on CASSANDRA-6246: Thanks [~kohlisankalp] :) So the issue with making the check part of the preaccept phase is that you can't trust the value in the database at that point. If there are other interfering instances in flight, you don't know what order they'll be executed in until they're all committed. So one of them could change the value, and you'd have replied to the client with incorrect information. Assuming the client sends the query to a replica, things would go like this:
# receive client request
# send preaccept request to replicas and wait for a fast path quorum to respond
# assuming all responses agreed, commit locally and notify replicas asynchronously
# assuming all dependencies are committed, sort the dependency graph
# execute all instances preceding the client's instance, read the value* in question and perform the check, make the mutation*
# respond with the result to the client
*performed locally
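The stale-read hazard described above can be shown with a toy example (editorial sketch, not Cassandra code):

```python
# Two interfering instances are in flight against the same cell:
#   A: SET value = 5
#   B: IF value = 0 THEN SET value = 9
value = 0

# At preaccept time, B's condition appears to hold...
b_check_at_preaccept = (value == 0)    # True

# ...but the committed dependency graph orders A before B, so by the
# time B actually executes, its condition fails.
value = 5                               # A executes first
b_check_at_execution = (value == 0)     # False

# Replying to the client at preaccept time would have been wrong.
assert b_check_at_preaccept and not b_check_at_execution
```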
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150816#comment-14150816 ] Blake Eggleston commented on CASSANDRA-6246: I've been working through some of the concerns I'd posted about last week, and talking with Iulian Moraru and Dave Anderson, who've been really helpful. I also put together a quick python implementation to test arbitrary failure scenarios, which you can check out here: https://github.com/bdeggleston/cassandra_epaxos_prototype. The failure scenario I was worried about last week is not a problem; I'd just forgotten a step. Before I put together a plan for how to implement this, there are some things we'll need to figure out about how epaxos will fit into Cassandra's architecture. Below are the main problems / questions, and some possible solutions. Sorry for the super long comment.
*Non-replica coordinators*
Optimized Egalitarian Paxos depends on the command leader also being a replica of the key being queried, because it uses information about whether replicas agreed with the command leader's preaccept attributes to make failure recovery possible. Token aware routing should make this less of an issue, but we'll still need to handle non-replica coordinators. There are 3 options when we get a query on a non-replica node:
# Optimistically forward queries to a node that is a replica. This method offers the best case network round trips (2) for queries not sent to a replica. However, this also makes the node we're forwarding queries to a single point of failure for this query. We may be able to quickly reroute if we find the node is unreachable, but if the node is just slow to respond, the query could time out, something that might not have happened if we went with a slower but more reliable route.
# Always default to the slow path. For every query with a non-replica leader, this means sending preaccept, accept, and commit messages to all replicas.
We'd need to wait for a quorum to reply to the preaccept and accept messages, and a single reply to the commit message for the result. This method would be more reliable, but would always take 3 message round trips. The exception to the 3 round trip rule is if we receive identical preaccept responses from all replicas; in that case we could skip the accept phase.
# A compromise between the first two. When a replica receives a preaccept message from the non-replica coordinator, the replica broadcasts its preaccept response to a subset of replicas, and responds after receiving broadcasted preaccept responses from a subset of the other replicas. Receiving identical responses from enough of the replicas would allow us to commit on the fast path. This puts the best case at 2.5 round trips, but involves more network activity. It's also different enough from the normal flow that I think it would be better to make this a follow-on task if we decide it's the way to go.
*Keeping executed instances*
Ideally, we could delete our record of an instance as soon as it's a) executed, b) not a dependency of an unexecuted instance, and c) not part of a strongly connected component which hasn't been fully executed. However, when nodes which have been down for a while are recovering from failure, they need to get copies of all the instances they missed. The absolute simplest solution would be to just keep all instances persisted, ttl'd with the hinted handoff time, after which we assume that the node isn't coming back. Assuming each instance is an average of ~170 bytes (uncompressed), sustained 1000 instances per second for 3 hours would keep ~1.8GB of data around. Based on the lz4 benchmarks I've found, that would be ~1GB per 1000 instances per second. Hypothetically, if GAE duplicated their 1 million wps benchmark on 330 machines using epaxos, with rf3, that's ~9k instances per second, or 9GB.
That shouldn't be a problem for a machine handling that kind of load, and doesn't introduce any additional complexity.
*Write timestamps*
When there are a lot of concurrent updates on the same key, some instances will be executed in a different order than they were received. If we use the query timestamp for a write, we could write an instance with a timestamp that's before the timestamp of the last instance, so reads wouldn't see the result of the most recent instance. The commit time can't be used, because a prepare phase could have committed the same instance on different nodes at different times. Finally, the execution timestamp will be different on every node; using that would give the most recent mutation, but would cause out-of-date data to get written on repair in some situations. Using max(last_write_timestamp + 1, query_timestamp) for the timestamp would work, although in situations with a lot of writes on a single key, this could put mutations a little into the future.
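The proposed rule can be written down directly (a sketch of the formula above, nothing more):

```python
def write_timestamp(last_write_timestamp, query_timestamp):
    """Monotonic per-key write timestamps: an instance that executes
    'late' (after an instance with a larger timestamp) still lands
    strictly after the last write, at the cost of occasionally drifting
    a little ahead of wall-clock time under heavy contention."""
    return max(last_write_timestamp + 1, query_timestamp)

print(write_timestamp(100, 200))  # 200: normal case, query timestamp wins
print(write_timestamp(100, 90))   # 101: late execution bumped past last write
```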
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150819#comment-14150819 ] Blake Eggleston commented on CASSANDRA-6246: /cc [~jbellis] [~slebresne]
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150847#comment-14150847 ] Albert P Tobey commented on CASSANDRA-6246: --- For backwards compatibility, if it's possible to run both protocols, make it a configuration in the yaml. Another rolling restart to disable hybrid/dual mode isn't so bad if it removes a lot of complexity from runtime. Would also make it easy for conservative users to stick with the old paxos.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150875#comment-14150875 ] Blake Eggleston commented on CASSANDRA-6246: Making switching to a special hybrid mode a required step could be error prone. Plus, the direction you're transitioning in is important. That's really the tricky part: running serialized queries while the cluster transparently transitions from one protocol to another, specifically when the nodes for a given range can and do switch. Making it configurable could be useful, at least from a peace-of-mind / opt-in perspective. We'd have to work out how to transition back from epaxos to old paxos though.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141590#comment-14141590 ] Blake Eggleston commented on CASSANDRA-6246: I'm still poring over the discussion in CASSANDRA-5062, and the current implementation, but wanted to expand on some of the advantages, list a few disadvantages and caveats of using egalitarian paxos, and talk about a few areas where we'd probably want to deviate from the process described in the paper by Moraru et al.
Advantages:
* In the ideal case we should be able to answer a client's query after the same number of inter-node messages it takes to do a quorum write. (There will be more total messages, but we don't need to wait for them to complete before responding to the client.)
** This is assuming that each node performs the cas locally instead of using paxos to set up a quorum read/write.
* Even in the non-ideal case, you're still looking at 2 network round trips before reaching commit (it looks like the current impl has 4 network round trips for cas?).
* Much higher throughput on interfering queries is possible. Multiple in-flight queries on the same row are not a problem.
** Livelock is not a risk during normal operation, only during failure recovery. However, this can be mitigated by specifying an order of succession for query leaders. Of course, really heavy 'normal' operation might start causing failure cases.
* Granular control over which operations interfere with each other.
Disadvantages:
* The epaxos optimizations are possible because it has a pretty complex failure recovery procedure.
* The concurrent programming side of things will be more complicated than the current implementation.
* Because execution is more asynchronous than classic paxos, I think we'd have to perform the operations locally rather than using paxos to set up a normal quorum read/write. On one hand, this saves us a network round trip.
On the other hand, if people are doing non-serialized writes at the same time as serialized writes that affect the same cells, it's likely that different nodes will record different results for a query. Obviously, it's not a good idea to do this, but that doesn't mean people won't.
Caveats:
* With rf3, or a non-replica coordinator, responses from more than a quorum of replicas _may_ be needed to commit in the ideal case, or we just use the 2 message commit path in those situations. I'm still working out the details, but I'm pretty sure there are failure scenarios where not doing that could result in different values being committed after recovery.
* Epaxos is pretty new. I was talking to the authors about it a few months ago, and the only implementations we were aware of were mine and theirs... I'm pretty sure there aren't any production deployments of it. That's not _necessarily_ a bad thing, but I just wanted to point out that we are in fairly new territory, and that should be weighed against the advantages. There is no 'Making EPaxos Live' paper out there.
Places where Cassandra's architecture will likely require doing things a bit differently than outlined in the paper:
* Sequence values will cause problems, but they shouldn't be necessary.
*# Since each node is responsible for different ranges of data, and therefore would have seen different queries, encountering different seq values would be very likely, and would result in a lot of otherwise unnecessary accept phases. We could get around this by using different seq values for different token ranges, but...
*# Since we'd wait until the query is actually executed before returning a result to the client (don't know why we wouldn't), it's a superfluous requirement. I discussed this with Iulian Moraru a few months ago and he agreed.
* Using a non-replica coordinator:
*# The paper assumes that an instance leader is also a replica of the data being queried.
I'd imagine we'd want to avoid optimistically forwarding queries to a single replica and hoping it's up, which would mean allowing coordinators to lead queries for keys they don't know anything about. This would prevent the non-leaders from recording that they agree with the leader, preventing some optimizations in failure recovery. It would make a good case for using prepared statements and token aware routing.
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017865#comment-14017865 ] Jonathan Ellis commented on CASSANDRA-6246: --- Good overview: http://blakeeggleston.com/egalitarian-paxos-explained.html