RE: LWT broken?

2018-02-13 Thread Jacques-Henri Berthemet
Yes, a non-applied LWT returns the row written by the winner. I agree: in
theory I’d expect your code to behave correctly.

You could also check the release notes of later Cassandra versions for
LWT-related bugs. If your ids are timeUUIDs, you could try to extract the time
at which the inconsistencies happened and check the corresponding Cassandra
logs to see what happened.
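Here is a rough illustration of the timestamp extraction suggested above. A timeUUID is a version-1 UUID whose 60-bit time field counts 100-nanosecond intervals since 1582-10-15 UTC, so Python's standard `uuid` module is enough to recover the wall-clock time. This is a sketch: the helper name `timeuuid_to_datetime` is hypothetical, and it only applies if the ids really are timeUUIDs (the original post describes them as 64-bit numbers).

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 (time-based) UUIDs count 100-ns intervals since this instant.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC time embedded in a version-1 UUID (a CQL timeuuid)."""
    if value.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # value.time is the 60-bit timestamp; 10 intervals of 100 ns = 1 microsecond.
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


if __name__ == "__main__":
    print(timeuuid_to_datetime(uuid.uuid1()))
```

The resulting time can then be matched against the Cassandra logs on the replicas that own the affected hash.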
--
Jacques-Henri Berthemet


Re: LWT broken?

2018-02-12 Thread Mahdi Ben Hamida

On 2/12/18 2:04 AM, Jacques-Henri Berthemet wrote:


> Mahdi, you don’t need to re-read at CL ONE on line 9. When an LWT
> statement is not applied, the values that prevented the LWT are
> returned as part of the response; I’d expect them to be more
> consistent than your read. I’m not 100% sure that’s the case for 2.0.x,
> but it is the case for Cassandra 2.2.

Yes, that's an optimization that can be added. I need to check that it
works properly with the version of Cassandra that I'm running. Right
now, line 9 is done at SERIAL consistency and the issue still happens.

> And it’s the same for line 1: you should keep only your LWT statement
> unless doing the read first gives you a huge performance benefit. In
> Cassandra, read-before-write is a bad pattern.

I'll try this next and see whether the issue disappears when we change
it to SERIAL. Still, I don't understand how this would cause any
inconsistencies. In the worst case, a non-serial read would return no
rows for the specified primary key, which I handle by attempting an LWT
insert. If it returns a result, I assume that result is the row that
the winning lightweight transaction wrote. That assumption may not
always hold, and I would love to understand why.


--
Mahdi.




RE: LWT broken?

2018-02-12 Thread Jacques-Henri Berthemet
Mahdi, you don’t need to re-read at CL ONE on line 9. When an LWT statement is
not applied, the values that prevented the LWT are returned as part of the
response; I’d expect them to be more consistent than your read. I’m not 100%
sure that’s the case for 2.0.x, but it is the case for Cassandra 2.2.
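A minimal sketch of what dropping the re-read could look like, assuming (as described above) that a non-applied conditional insert hands back the current row. `InMemoryHashIdTable` and `get_or_allocate_id` are hypothetical stand-ins, not a Cassandra driver API; a lock plays the role of Paxos, so this models the intended semantics rather than a real cluster.

```python
import itertools
import threading
from typing import Callable, Dict, Tuple


class InMemoryHashIdTable:
    """Hypothetical stand-in for the hash_id table. A lock makes each
    conditional insert atomic, which is the guarantee an LWT is meant to give."""

    def __init__(self) -> None:
        self._rows: Dict[int, int] = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, hash_value: int, new_id: int) -> Tuple[bool, int]:
        """Mimic INSERT ... IF NOT EXISTS: return (applied, id now stored).
        When not applied, the second element is the winning row's id."""
        with self._lock:
            existing = self._rows.get(hash_value)
            if existing is not None:
                return False, existing
            self._rows[hash_value] = new_id
            return True, new_id


def get_or_allocate_id(table: InMemoryHashIdTable, hash_value: int,
                       generate_unique_id: Callable[[], int]) -> int:
    # No read before the write (old line 1) and no re-read after losing
    # (old line 9): applied or not, the LWT result already carries the winner.
    _applied, winning_id = table.insert_if_not_exists(hash_value, generate_unique_id())
    return winning_id


if __name__ == "__main__":
    ids = itertools.count(100)
    table = InMemoryHashIdTable()
    print(get_or_allocate_id(table, 0xABC, lambda: next(ids)))  # 100 (won)
    print(get_or_allocate_id(table, 0xABC, lambda: next(ids)))  # 100 (lost, got winner)
```

The trade-off is that every call now pays for an LWT round even when the hash already exists, which is exactly the performance question raised about line 1.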

And it’s the same for line 1: you should keep only your LWT statement unless
doing the read first gives you a huge performance benefit. In Cassandra,
read-before-write is a bad pattern.

AFAIK an LWT statement is always executed as SERIAL; the only choice you have
is between SERIAL and LOCAL_SERIAL.
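To make the SERIAL/LOCAL_SERIAL choice concrete, here is a small back-of-the-envelope helper. It rests on the simplifying assumption that a Paxos phase needs a strict majority of the participating replicas (all replicas for SERIAL, only the local datacenter's for LOCAL_SERIAL) and it ignores pending or bootstrapping replicas. The function name `paxos_quorum` is hypothetical.

```python
from typing import Dict


def paxos_quorum(rf_by_dc: Dict[str, int], serial_level: str, local_dc: str) -> int:
    """Replicas that must respond in each Paxos phase (simplified arithmetic).

    SERIAL       -> strict majority of all replicas in every datacenter.
    LOCAL_SERIAL -> strict majority of the replicas in local_dc only.
    Pending (bootstrapping) replicas are ignored in this sketch.
    """
    if serial_level == "SERIAL":
        return sum(rf_by_dc.values()) // 2 + 1
    if serial_level == "LOCAL_SERIAL":
        return rf_by_dc[local_dc] // 2 + 1
    raise ValueError(f"serial consistency must be SERIAL or LOCAL_SERIAL, got {serial_level}")


if __name__ == "__main__":
    print(paxos_quorum({"dc1": 3}, "SERIAL", "dc1"))                  # 2
    print(paxos_quorum({"dc1": 3, "dc2": 3}, "SERIAL", "dc1"))        # 4
    print(paxos_quorum({"dc1": 3, "dc2": 3}, "LOCAL_SERIAL", "dc1"))  # 2
```

For a single datacenter with RF=3, both levels need 2 replicas; they only diverge once there is more than one datacenter.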

Regards,
--
Jacques-Henri Berthemet



Re: LWT broken?

2018-02-11 Thread DuyHai Doan
Mahdi, the issue in your code is here:

else // we lost LWT, fetch the winning value
 9    existing_id = SELECT id FROM hash_id WHERE hash=computed_hash | consistency = ONE

If you lost the LWT, it means a concurrent LWT won the Paxos round and has
applied its value using QUORUM/SERIAL.

In the best case, the winning value has already been applied to at least
2 of the 3 replicas (assuming RF=3).
In the worst case, the winning value has not been applied to any replica
yet, or is still in the process of being applied.

Now, if you immediately read with CL=ONE, you may:

1) Read a stale value from the 3rd replica, which has not yet received the
winning LWT value
2) Or, worse, read a stale value because the winning LWT is still being
applied when the read is made
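A toy model of the two failure cases above, assuming RF=3 and a single key whose id never changes once written. `Replica`, `apply_commit`, `read_one` and `read_quorum` are hypothetical helpers; real replicas reconcile by timestamp and may read-repair, which this sketch ignores.

```python
from itertools import combinations
from typing import List, Optional


class Replica:
    """Hypothetical replica holding the committed id for one key (or None)."""

    def __init__(self) -> None:
        self.value: Optional[int] = None


def apply_commit(replicas: List[Replica], value: int, reached: int) -> None:
    # The winning LWT's commit has so far reached only the first `reached`
    # replicas; the others will receive it later.
    for replica in replicas[:reached]:
        replica.value = value


def read_one(replicas: List[Replica], index: int) -> Optional[int]:
    # CL=ONE: a single replica answers on its own.
    return replicas[index].value


def read_quorum(replicas: List[Replica], indexes: List[int]) -> Optional[int]:
    # CL=QUORUM: a majority answers. Because an id never changes once written,
    # any non-null answer is the winner (real Cassandra compares timestamps).
    answers = (replicas[i].value for i in indexes)
    return next((a for a in answers if a is not None), None)


if __name__ == "__main__":
    # Best case: the commit reached 2 of 3 replicas.
    best = [Replica() for _ in range(3)]
    apply_commit(best, 42, reached=2)
    print(read_one(best, 2))  # None: the lagging 3rd replica answered alone
    print(all(read_quorum(best, list(p)) == 42 for p in combinations(range(3), 2)))  # True
    # Worst case: the commit has not reached any replica yet.
    worst = [Replica() for _ in range(3)]
    print(read_quorum(worst, [0, 1]))  # None: even QUORUM cannot see it
```

The worst case is why QUORUM alone is not enough: no majority can see a value that has not been applied anywhere yet.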

That's the main reason reading with CL=SERIAL is recommended (CL=QUORUM is
not sufficient).

Reading with CL=SERIAL will:

a. like QUORUM, contact a strict majority of replicas
b. unlike QUORUM, look for an accepted (but not yet applied) value from a
previous Paxos round and force-apply it before actually reading the value
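Steps (a) and (b) can be sketched with a toy Paxos state per replica. This is a deliberately simplified model: ballots are plain integers, promises are not tracked, and finishing an in-progress round is shown as applying it directly on the contacted replicas, whereas Cassandra re-proposes it under a new ballot before committing. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PaxosReplica:
    """Toy per-key Paxos state (hypothetical, heavily simplified)."""
    committed: Optional[int] = None
    committed_ballot: int = -1
    accepted: Optional[Tuple[int, int]] = None  # (ballot, value) not yet applied


def quorum_read(replicas: List[PaxosReplica], contacted: List[int]) -> Optional[int]:
    # Plain QUORUM: return the newest *committed* value among the majority.
    newest = max((replicas[i] for i in contacted), key=lambda r: r.committed_ballot)
    return newest.committed


def serial_read(replicas: List[PaxosReplica], contacted: List[int]) -> Optional[int]:
    # (a) Like QUORUM, a strict majority must take part.
    if len(contacted) < len(replicas) // 2 + 1:
        raise ValueError("SERIAL needs a strict majority of replicas")
    # (b) Look for an accepted proposal newer than anything committed ...
    latest_commit = max(replicas[i].committed_ballot for i in contacted)
    in_progress = [replicas[i].accepted for i in contacted
                   if replicas[i].accepted is not None
                   and replicas[i].accepted[0] > latest_commit]
    if in_progress:
        ballot, value = max(in_progress)
        # ... and force-apply it on the majority before reading.
        for i in contacted:
            replicas[i].committed, replicas[i].committed_ballot = value, ballot
            replicas[i].accepted = None
    return quorum_read(replicas, contacted)


if __name__ == "__main__":
    # The winning LWT (ballot 5, id 42) was accepted by replicas 0 and 1,
    # but its commit has not been applied anywhere yet.
    replicas = [PaxosReplica(accepted=(5, 42)), PaxosReplica(accepted=(5, 42)), PaxosReplica()]
    print(quorum_read(replicas, [1, 2]))   # None: nothing committed yet
    print(serial_read(replicas, [1, 2]))   # 42: in-progress round finished first
    print(quorum_read(replicas, [1, 2]))   # 42: now applied on that majority
```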






Re: LWT broken?

2018-02-11 Thread Mahdi Ben Hamida
Understood that it's not worth it (or is simply incorrect) to mix serial
and non-serial operations on LWT tables. Still, it would be highly
satisfying to my engineer's mind if someone could explain why that would
cause issues in this particular situation. The only explanation I have
is that a non-serial read may trigger a read repair that could interfere
with a concurrent serial write, although I still can't explain how that
would cause two different "insert if not exists" transactions to both
succeed.


--
Mahdi.



Re: LWT broken?

2018-02-09 Thread Jonathan Haddad
If you want consistent reads you have to use the CL that enforces it.
There’s no way around it.
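A quick check of the overlap arithmetic behind this, assuming a single datacenter: a read is guaranteed to touch a replica that saw a write only when R + W > RF. With RF=3, writes at QUORUM (2) and reads at ONE (1) give 2 + 1 = 3, which is not greater than 3. Overlap is necessary but, as DuyHai explains elsewhere in this thread, not sufficient on LWT tables, because an in-progress Paxos round may not be applied anywhere yet. `replicas_required` and `read_sees_write` are hypothetical helpers.

```python
def replicas_required(level: str, rf: int) -> int:
    """Replicas that must answer at a given CL in one datacenter (simplified)."""
    quorum = rf // 2 + 1
    needed = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum, "ALL": rf}
    return needed[level]


def read_sees_write(read_level: str, write_level: str, rf: int) -> bool:
    # Every read set intersects every write set exactly when R + W > RF.
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


if __name__ == "__main__":
    print(read_sees_write("ONE", "QUORUM", 3))     # False: 1 + 2 = 3, not > 3
    print(read_sees_write("QUORUM", "QUORUM", 3))  # True: 2 + 2 = 4 > 3
```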


Re: LWT broken?

2018-02-09 Thread Mahdi Ben Hamida
In this case, we only write using CAS (the code guarantees that). We also
never update; we only insert if not exists. Once a hash exists, it never
changes (it may get deleted later, and that will be a CAS delete as well).


--
Mahdi.


Re: LWT broken?

2018-02-09 Thread Jeff Jirsa
On Fri, Feb 9, 2018 at 1:33 PM, Mahdi Ben Hamida  wrote:

> Under what circumstances would we be reading inconsistent results? Is
> there a case where we end up reading a value that is never actually
> written?

If you ever write the same value with CAS and without CAS (different code
paths both updating the same value), you're using CAS wrong, and
inconsistencies can happen.
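A toy last-write-wins model of the hazard described above. It assumes, as a simplification, a single cell reconciled purely by write timestamp, with hypothetical helpers `plain_write` and `cas_insert_if_not_exists`. It only illustrates the general point, since the original poster states that all writes go through CAS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """One column value, reconciled by last-write-wins on its timestamp."""
    value: Optional[int] = None
    timestamp: int = -1


def plain_write(cell: Cell, value: int, timestamp: int) -> None:
    # Non-CAS write: no Paxos round, no condition; the timestamp alone decides.
    if timestamp > cell.timestamp:
        cell.value, cell.timestamp = value, timestamp


def cas_insert_if_not_exists(cell: Cell, value: int, timestamp: int) -> bool:
    # CAS path (toy): the condition is checked, then the value is written.
    if cell.value is not None:
        return False
    plain_write(cell, value, timestamp)
    return True


if __name__ == "__main__":
    cell = Cell()
    print(cas_insert_if_not_exists(cell, 1, timestamp=100))  # True: caller is told id 1
    print(cas_insert_if_not_exists(cell, 3, timestamp=101))  # False: CAS correctly refuses
    plain_write(cell, 2, timestamp=105)  # a different, non-CAS code path
    print(cell.value)                    # 2: the id the CAS winner returned is gone
```

The CAS winner told its caller that id 1 is canonical, yet a later non-CAS write with a newer timestamp replaced it without any Paxos round noticing.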


Re: LWT broken?

2018-02-09 Thread Mahdi Ben Hamida

Hi Stefan,

I was hoping we could avoid the cost of a serial read (which I assume is
a lot more expensive than a regular read due to the Paxos requirements).
I actually do a serial read at line #9 (i.e., when we lose the LWT and
have to read the winning value), and that still fails to guarantee
uniqueness. Under what circumstances would we be reading inconsistent
results? Is there a case where we end up reading a value that is never
actually written?


Thanks !

--
Mahdi.



Re: LWT broken?

2018-02-09 Thread Stefan Podkowinski
I wouldn't recommend using any consistency level other than SERIAL for
reading tables updated by LWT operations. Otherwise you might end up
reading inconsistent results.


On 09.02.18 08:06, Mahdi Ben Hamida wrote:
>
> Hello,
>
> I'm running a 2.0.17 cluster (I know, I know, need to upgrade) with 46
> nodes across 3 racks (and RF=3). I'm seeing that under high contention,
> LWT may not actually guarantee uniqueness. Out of a total of 16 million
> LWT transactions (with peak LWT concurrency around 5k/sec), I found 38
> conflicts that should have been impossible. I was wondering if there
> are any known issues that break LWT in this old version of
> Cassandra.
>
> I use LWT to guarantee that a 128-bit number (hash) maps to a unique
> 64-bit number (id). There could be a large number of threads trying to
> allocate an id for a given hash.
>
> I use the following logic (slightly more complicated than this due to
> timeout handling):
>
>  1  existing_id = SELECT id FROM hash_id WHERE hash=computed_hash | consistency = ONE
>  2  if existing_id != null:
>  3    return existing_id
>  4  new_id = generateUniqueId()
>  5  result = INSERT INTO hash_id (hash, id) VALUES (computed_hash, new_id) IF NOT EXISTS
>         | consistency = QUORUM, serialConsistency = SERIAL
>  6  if result == [applied]  // i.e. we won the LWT
>  7    return new_id
>  8  else  // we lost the LWT, fetch the winning value
>  9    existing_id = SELECT id FROM hash_id WHERE hash=computed_hash | consistency = ONE
> 10    return existing_id
>
> Is there anything flawed about this?
> I do the reads at lines #1 and #9 at a consistency of ONE. Would that
> cause uncommitted changes to be seen (i.e., dirty reads)? Should it be
> SERIAL consistency instead? My understanding is that only one
> transaction will be able to apply the write (at quorum), so a read at
> consistency ONE will either return null or return the id that won the
> LWT race.
>
> Any help is appreciated. I've been banging my head on this issue
> (thinking it was a bug in the code) for some time now.
>
> -- 
> Mahdi.
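Under the assumption that every read and conditional insert is linearizable (which is what SERIAL is meant to provide), the ten-line logic above should hand every concurrent caller the same id. The sketch below checks that reasoning with threads racing on one hash. `LinearizableHashIdTable`, `generate_unique_id` and `allocate` are hypothetical, and the lock stands in for Paxos, so it models the intended semantics and cannot reproduce the reported conflicts.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional


class LinearizableHashIdTable:
    """Hypothetical model in which every operation is atomic (linearizable)."""

    def __init__(self) -> None:
        self._rows: Dict[int, int] = {}
        self._lock = threading.Lock()

    def read(self, hash_value: int) -> Optional[int]:
        with self._lock:
            return self._rows.get(hash_value)

    def insert_if_not_exists(self, hash_value: int, new_id: int) -> bool:
        with self._lock:
            if hash_value in self._rows:
                return False
            self._rows[hash_value] = new_id
            return True


_id_counter = itertools.count(1)
_id_lock = threading.Lock()


def generate_unique_id() -> int:
    # Hypothetical stand-in for generateUniqueId(): a thread-safe counter.
    with _id_lock:
        return next(_id_counter)


def allocate(table: LinearizableHashIdTable, computed_hash: int) -> Optional[int]:
    existing_id = table.read(computed_hash)                # line 1
    if existing_id is not None:                            # lines 2-3
        return existing_id
    new_id = generate_unique_id()                          # line 4
    if table.insert_if_not_exists(computed_hash, new_id):  # lines 5-7
        return new_id
    return table.read(computed_hash)                       # lines 8-10


if __name__ == "__main__":
    table = LinearizableHashIdTable()
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(lambda _: allocate(table, 0xC0FFEE), range(1000)))
    print(len(set(results)))  # 1: every caller saw the same id
```

If this holds in the model but not on the cluster, the gap lies in how closely the real reads and writes match the linearizable assumption, which is the question the rest of the thread discusses.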