Looks like all of this is happening because we’re using CAS operations and
the driver is going to SERIAL consistency level.
SERIAL and LOCAL_SERIAL write failure scenarios
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html?scroll=concept_ds_umf_5xx_zj__failure-scenarios
If one of three nodes is down, the Paxos commit fails under the following
conditions:
- CQL query-configured consistency level of ALL
- Driver-configured serial consistency level of SERIAL
- Replication factor of 3
I don’t understand why this would fail. It seems completely broken in this
situation.
We were getting write timeouts at a replication factor of 2, and a lot of
people on the list said: of course, because 2 nodes with 1 node down
means there’s no quorum, and Paxos needs a quorum. Not sure why I
missed that :-P
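The quorum arithmetic behind that reply can be sketched in a few lines. This is only an illustration, not Cassandra code; the function names `quorum` and `can_reach_quorum` are made up for the example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: a strict majority."""
    return replication_factor // 2 + 1


def can_reach_quorum(replication_factor: int, nodes_down: int) -> bool:
    """True if enough replicas are still alive to form a quorum."""
    live = replication_factor - nodes_down
    return live >= quorum(replication_factor)


if __name__ == "__main__":
    # Compare the two replication factors discussed in this thread.
    for rf in (2, 3):
        print(f"RF={rf}: quorum={quorum(rf)}, "
              f"survives one node down: {can_reach_quorum(rf, 1)}")
```

With a replication factor of 2 the quorum is 2, so losing either replica blocks Paxos; with 3 the quorum is still 2, so one replica can be down.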
So we went with 3 replicas and a quorum,
but this is new and I didn’t see it documented. We set the driver to
QUORUM, but then I guess the driver sees that this is a CAS operation and
forces it back to SERIAL? Doesn’t this mean that all decommissions result
in CAS failures?
This is Cassandra 2.0.9 btw.
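One possible reading of the documented scenario is that a CAS write carries two separate settings: the serial consistency level (SERIAL or LOCAL_SERIAL) applies to the Paxos phase, and the regular consistency level applies to the commit. Below is a rough model of that reading, assuming the Paxos phase needs a quorum of replicas. The names `required_acks` and `cas_write_succeeds` are hypothetical, not driver or Cassandra APIs, and the model ignores anything specific to a node that is in the middle of decommissioning.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed at a given consistency level."""
    quorum = replication_factor // 2 + 1
    table = {
        "ONE": 1,
        "QUORUM": quorum,
        "SERIAL": quorum,  # assumption: the Paxos phase needs a quorum
        "ALL": replication_factor,
    }
    return table[level]


def cas_write_succeeds(commit_level: str, replication_factor: int,
                       nodes_down: int) -> bool:
    """Both the Paxos phase and the commit phase must get enough acks."""
    live = replication_factor - nodes_down
    paxos_ok = live >= required_acks("SERIAL", replication_factor)
    commit_ok = live >= required_acks(commit_level, replication_factor)
    return paxos_ok and commit_ok


if __name__ == "__main__":
    # The documented case: RF 3, one node down, commit level ALL.
    print(cas_write_succeeds("ALL", 3, 1))
    # Same cluster state, but committing at QUORUM.
    print(cas_write_succeeds("QUORUM", 3, 1))
```

Under this model the documented failure comes from the ALL commit level rather than from SERIAL itself, and a QUORUM commit with a replication factor of 3 should tolerate one node being down. It does not by itself explain timeouts during a decommission.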
On Wed, Jul 1, 2015 at 2:22 PM, Kevin Burton bur...@spinn3r.com wrote:
We get lots of write timeouts when we decommission a node. About 80% of
them are write timeouts and about 20% are read timeouts.
We’ve tried adjusting streamthroughput (and compaction throughput, for
that matter) and that doesn’t resolve the issue.
We’ve increased write_request_timeout_in_ms … and the read timeout as well.
Is there anything else I should be looking at?
I can’t seem to find the documentation that explains what the heck is
happening.
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts