Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-12 Thread Ralph Boehme

On 4/12/23 15:30, Jeff Jirsa wrote:

Are you always inserting into the same partition (with contention) or
different ones?


I'm actually updating the very same row. :)


Which version are you using?


# nodetool version
ReleaseVersion: 4.1.1


The short tldr is that the failure modes of the existing paxos
implementation (under contention, under latency, under cluster
strain) can cause undefined states. I believe that a subsequent
serial read will deterministically resolve the state (look at
CASSANDRA-12126), but that has a cost (both the extra operation and
the code complexity).


I'm definitely driving contention here in my workload. I'm basically 
implementing locks using LWTs on a row and I'm running lock/unlock in a 
tight loop *from multiple clients*. As said, this already comes to a 
grinding halt with just 2 clients.
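
A tight lock/unlock loop like the one described here retries immediately after every failed attempt, which keeps both clients colliding on the same Paxos round. One common mitigation, sketched below, is to sleep for a randomized, growing delay between attempts. This is only an illustration and not the client from this thread: `acquire`, `try_lock` and all parameter names and defaults are hypothetical, and `try_lock` stands for whatever callable runs the conditional write and returns True when it was applied.

```python
import random
import time


def acquire(try_lock, attempts=10, base=0.005, cap=0.5,
            sleep=time.sleep, rng=random.random):
    """Call try_lock() until it returns True, backing off between tries.

    try_lock is a hypothetical callable that runs the LWT once and returns
    True if the lock row was written. sleep and rng are injectable so the
    loop can be exercised without real delays.
    """
    for attempt in range(attempts):
        if try_lock():
            return True
        if attempt + 1 < attempts:
            # Full jitter: sleep a random fraction of an exponentially
            # growing bound, so competing clients drift apart in time.
            bound = min(cap, base * 2 ** attempt)
            sleep(rng() * bound)
    return False
```

Whether this helps depends on the workload. It trades lock latency for fewer competing proposals and does not remove the unknown-result case itself.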



The upcoming transactional rewrite will likely change this, but it’s
still WIP (CEP-15).


Thanks. I'm aware of Accord and can't wait to get my hands on 
Cassandra 5.0. :) In the meantime I was hoping I could use Cassandra's 
LWTs to implement locking.


Thanks!
-slow




Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-12 Thread Jeff Jirsa
Are you always inserting into the same partition (with contention) or 
different ones?

Which version are you using?

The short tldr is that the failure modes of the existing paxos implementation 
(under contention, under latency, under cluster strain) can cause undefined 
states. I believe that a subsequent serial read will deterministically resolve 
the state (look at CASSANDRA-12126), but that has a cost (both the extra 
operation and the code complexity).
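
The "subsequent serial read" idea can be sketched roughly as follows. This is a minimal illustration under several assumptions, not a recommended implementation: the helper names are hypothetical, the unknown outcome is detected by matching the error text quoted in this thread, and the `owner` value is assumed to be unique per client so that reading it back shows whose write won. `make_serial_read` uses the Python driver's `SimpleStatement` with `ConsistencyLevel.SERIAL` and imports the driver lazily, so the resolution logic itself can be tried without a cluster.

```python
UNKNOWN_MARKER = "CAS operation result is unknown"


def cas_with_resolution(write, serial_read, my_owner):
    """Run a conditional write; on an unknown outcome, ask a SERIAL read.

    write():       hypothetical callable, runs the LWT, returns True if applied
    serial_read(): hypothetical callable, returns the row's owner or None
    my_owner:      assumed to identify this client uniquely
    """
    try:
        return write()
    except Exception as exc:
        # Only the "unknown" case is handled here; anything else propagates.
        if UNKNOWN_MARKER not in str(exc):
            raise
    # A SERIAL read takes part in Paxos, so an in-flight proposal is
    # settled before the row is returned.
    return serial_read() == my_owner


def make_serial_read(session, table, key):
    """Build serial_read() for one row (table and column names are assumed)."""
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    stmt = SimpleStatement(
        f"SELECT owner FROM {table} WHERE key = %s",
        consistency_level=ConsistencyLevel.SERIAL,
    )

    def read():
        row = session.execute(stmt, (key,)).one()
        return row.owner if row else None

    return read
```

The extra read is the cost mentioned above: every unknown result turns into a second Paxos round trip on the same partition.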

The upcoming transactional rewrite will likely change this, but it’s still WIP 
(CEP-15).




> On Apr 12, 2023, at 6:11 AM, Ralph Boehme  wrote:
> 
> On 4/11/23 21:14, Ralph Boehme wrote:
>>> On 4/11/23 19:53, Bowen Song via user wrote:
>>> That error message sounds like one of the nodes timed out in the paxos 
>>> propose stage.  You can check the system.log and gc.log and see if you can 
>>> find anything unusual in them, such as network errors, out of sync clocks 
>>> or long stop-the-world GC pauses.
>> hm, I'll check the logs, but I can reproduce this 100% on an idle test 
>> cluster just by running a simple test client that generates a smallish 
>> workload where just 2 processes on a single host hammer the Cassandra 
>> cluster with LWTs.
> 
> nothing in the logs really.
> 
>> Maybe LWTs are not meant to be used this way?
> 
> fwiw, this happens 100% within a few seconds with a workload where two clients 
> hammer with LWTs on a single row.
> 
> Thanks!
> -slow
> 


Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-12 Thread Ralph Boehme

On 4/11/23 21:14, Ralph Boehme wrote:

On 4/11/23 19:53, Bowen Song via user wrote:
That error message sounds like one of the nodes timed out in the paxos 
propose stage.  You can check the system.log and gc.log and see if you 
can find anything unusual in them, such as network errors, out of sync 
clocks or long stop-the-world GC pauses.


hm, I'll check the logs, but I can reproduce this 100% on an idle test 
cluster just by running a simple test client that generates a smallish 
workload where just 2 processes on a single host hammer the Cassandra 
cluster with LWTs.


nothing in the logs really.


Maybe LWTs are not meant to be used this way?


fwiw, this happens 100% within a few seconds with a workload where two 
clients hammer with LWTs on a single row.


Thanks!
-slow





Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-11 Thread Ralph Boehme

On 4/11/23 19:53, Bowen Song via user wrote:
That error message sounds like one of the nodes timed out in the paxos 
propose stage.  You can check the system.log and gc.log and see if you 
can find anything unusual in them, such as network errors, out of sync 
clocks or long stop-the-world GC pauses.


hm, I'll check the logs, but I can reproduce this 100% on an idle test 
cluster just by running a simple test client that generates a smallish 
workload where just 2 processes on a single host hammer the Cassandra 
cluster with LWTs.


Maybe LWTs are not meant to be used this way?

BTW, since you said you want it to be fast, I think it's worth 
mentioning that LWT comes with additional cost and is much slower than a 
straightforward INSERT/UPDATE. 


Sure, but we have to swallow that pill as we need linearizability.

You should avoid using it if possible. 
For example, if all of the Cassandra clients (samba servers) are running 
on the same machine, it may be far more efficient to use a lock than LWT.


no, the goal is designing a huge scale-out SMB cluster spanning hundreds 
of nodes, used as a multitenant cloud SMB frontend much like Microsoft 
Azure SMB.


Thanks!
-slow





Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-11 Thread Bowen Song via user
That error message sounds like one of the nodes timed out in the paxos 
propose stage.  You can check the system.log and gc.log and see if you 
can find anything unusual in them, such as network errors, out of sync 
clocks or long stop-the-world GC pauses.



BTW, since you said you want it to be fast, I think it's worth 
mentioning that LWT comes with additional cost and is much slower than a 
straightforward INSERT/UPDATE. You should avoid using it if possible. 
For example, if all of the Cassandra clients (samba servers) are running 
on the same machine, it may be far more efficient to use a lock than LWT.



On 11/04/2023 18:18, Ralph Boehme wrote:

Hi folks!

Ralph here from the Samba team.

I'm currently doing research into open source distributed NoSQL 
key/value stores to be used by Samba as a more scalable alternative 
to Samba's own homegrown distributed key/value store called "ctdb" [1].


As an open source implementation of the SMB filesharing protocol from 
Microsoft, we have some specific requirements wrt database behaviour:


- fast
- fast
- fast
- highly consistent, iow linearizable

We got away without a linearizable database as historically the SMB 
protocol and the SMB client implementations were built around the 
assumption that handle and session state at the server could be lost 
due to events like process or server crashes, and clients would 
implement a best-effort strategy to recover client state.


Modern SMB3 offers stronger guarantees which require a strongly 
consistent, i.e. linearizable, database.


While prototyping a Python module for our pluggable database client in 
Samba I ran into the following issue with Cassandra:


  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4901, in cassandra.cluster.ResponseFuture.result
cassandra.protocol.ErrorMessageSub: [Unknown] message="CAS operation result is unknown - proposal accepted by 1 but not a quorum.">


This happens when executing the following LWT:

    f'''
    INSERT INTO {dbname} (key, guid, owner, refcount)
    VALUES (?, ?, ?, ?)
    IF NOT EXISTS
    ''')
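
For readers following along, here is a hedged sketch of how a statement like the one above is typically prepared and executed with the Python driver, and how the applied/not-applied outcome is read back. The `?` placeholders suggest a prepared statement, but the actual call site is not shown in the message, so `make_insert_cql`, `insert_if_absent` and the wiring in the comments are assumptions for illustration only.

```python
def make_insert_cql(table):
    """Build the conditional INSERT quoted above for a given table name."""
    return (
        f"INSERT INTO {table} (key, guid, owner, refcount) "
        "VALUES (?, ?, ?, ?) IF NOT EXISTS"
    )


def insert_if_absent(session, prepared, key, guid, owner, refcount):
    """Run the LWT once.

    Returns True if the row was written by this call and False if a row
    with that key already existed. An unknown outcome surfaces as an
    exception from session.execute() and is not handled here.
    """
    result = session.execute(prepared, (key, guid, owner, refcount))
    return result.was_applied


# Possible wiring (contact point taken from the nodetool output in this
# message; keyspace and table names are hypothetical):
#
#   from cassandra.cluster import Cluster
#   session = Cluster(["172.18.200.21"]).connect("samba")
#   prepared = session.prepare(make_insert_cql("locks"))
#   got_it = insert_if_absent(session, prepared, b"k", guid, "node-a", 1)
```

`was_applied` is only meaningful for conditional statements. It reports whether the `IF NOT EXISTS` condition held, which is a different situation from the unknown-result error above, where no answer comes back at all.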

This is the first time I'm running Cassandra. I've just set up a three 
node test cluster and everything looks ok:


# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.200.21  360,09 KiB  16      100,0%            4590f3a6-4ca5-466f-a24d-edc54afa36f0  rack1
UN  172.18.200.23  326,92 KiB  16      100,0%            9175fd4e-4d84-4899-878a-dd5266132ff8  rack1
UN  172.18.200.22  335,32 KiB  16      100,0%            35e05369-cc8a-4642-b98d-a5fcc326502f  rack1


Can anyone shed some light on what I might be doing wrong?

Thanks!
-slow

[1]