How are timestamps selected for LWTs?

2016-02-02 Thread Nicholas Wilson
Hi,

In the Cassandra docs I've read, it's not described how the timestamp is 
determined for LWTs. It's not possible to specify a timestamp with "USING 
TIMESTAMP ...", and my best guess is that in the "read" phase of the LWT 
(between propose and commit) the timestamp is selected based on the timestamps 
of the cells read. However, after reading through the source code (mainly 
StorageProxy::cas) I can't find any hint of that.

I'm worried about the following problem:

Node A writes (using an LWT): UPDATE table SET val = 123, version = 2 WHERE key 
= 'foo' IF version = 1
Node B writes (using an LWT): UPDATE table SET val = 234, version = 3 WHERE key 
= 'foo' IF version = 2

If the first write is completed before the second, then both updates will be 
applied, but if Node B's clock is behind Node A's clock, then the second update 
would be effectively discarded if client-generated timestamps are used. It 
wouldn't take a big clock discrepancy: the HW clocks could in fact be perfectly 
in sync, but if the kernel ticks System.currentTimeMillis() at 15ms intervals, 
it's quite possible for the two nodes to be 30ms out from each other.
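
To make the concern concrete, here is a minimal Python sketch of last-write-wins reconciliation between two versions of a cell. It is only an illustration of the worry described above, not Cassandra's code: the `reconcile` helper and the millisecond timestamps are hypothetical, and tie-breaking on equal timestamps is ignored.

```python
def reconcile(current, incoming):
    """Last-write-wins: keep whichever (value, timestamp) cell has the higher timestamp."""
    return incoming if incoming[1] > current[1] else current


# Hypothetical timestamps in milliseconds; Node B's clock lags Node A's.
cell = ("version=1", 1000)

# Node A writes first, stamping the cell with its local time 1045.
cell = reconcile(cell, ("version=2", 1045))

# Node B writes later in real time, but its lagging clock reads 1020.
cell = reconcile(cell, ("version=3", 1020))

print(cell)  # ('version=2', 1045): the second update is silently discarded
```

If timestamps were chosen this way, the second LWT would report success while leaving the row unchanged.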

So, after the update query has "succeeded", do you need to do a read to find 
out whether it was actually applied? That would be surprising, since I can't 
find mention of it anywhere in the docs. You'd actually have to do a QUORUM 
read after every LWT update, just to find out whether your client chose the 
timestamp sensibly.

The ideal thing would be if Cassandra chose the timestamp for the write, using 
the timestamp of the cells read during Paxos, to guarantee that writes are 
applied if the query condition holds, rather than leaving the potential for the 
query to succeed but do nothing if the cell already has a higher timestamp.

If I've misunderstood, please do correct me!

Thanks,
Nicholas

---
Nicholas Wilson
Software developer
RealVNC

Re: How are timestamps selected for LWTs?

2016-02-02 Thread Nicholas Wilson
Thanks, Sylvain.


I missed it because I wasn't looking in the right place! In StorageProxy::cas, 
Commit::newProposal() unpacks the ballot's UUID into a timestamp.
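
For anyone following along, this is roughly what "unpacking a timeuuid into a timestamp" means. It is a sketch using Python's standard `uuid` module rather than Cassandra's own Java code; the function name `timeuuid_to_unix_micros` is made up for illustration, and the choice of microseconds as the unit is an assumption.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def timeuuid_to_unix_micros(u: uuid.UUID) -> int:
    """Extract the embedded time of a version 1 UUID as microseconds since the Unix epoch."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # u.time is the 60-bit timestamp in 100 ns intervals since the UUID epoch.
    return (u.time - UUID_EPOCH_OFFSET_100NS) // 10


ballot = uuid.uuid1()  # stand-in for a ballot; real ballots are generated server side
print(timeuuid_to_unix_micros(ballot))
```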


I think I understand how it works now, thank you.

Regards,
Nick


From: Sylvain Lebresne <sylv...@datastax.com>
Sent: 02 February 2016 10:24
To: user@cassandra.apache.org
Subject: Re: How are timestamps selected for LWTs?

On Tue, Feb 2, 2016 at 10:46 AM, Nicholas Wilson 
<nicholas.wil...@realvnc.com> wrote:
Hi,

In the Cassandra docs I've read, it's not described how the timestamp is 
determined for LWTs. It's not possible to specify a timestamp with "USING 
TIMESTAMP ...", and my best guess is that in the "read" phase of the LWT 
(between propose and commit) the timestamp is selected based on the timestamps 
of the cells read. However, after reading through the source code (mainly 
StorageProxy::cas) I can't find any hint of that.

It's not exactly how it works, but it yields a somewhat equivalent result. 
Internally, LWTs use a so-called "ballot", which is a timeuuid, and the 
underlying algorithm basically guarantees that the order of commit of 
operations is the order of their ballots. The timestamp used for the cells of 
a given operation is the timestamp part of that timeuuid ballot, thus 
guaranteeing that this timestamp respects the order in which operations are 
committed.

This is why you can't provide the timestamp client side: that timestamp is 
picked server side and the value picked depends on when the operation is 
committed.



Re: How are timestamps selected for LWTs?

2016-02-02 Thread Sylvain Lebresne
On Tue, Feb 2, 2016 at 10:46 AM, Nicholas Wilson <
nicholas.wil...@realvnc.com> wrote:

> Hi,
>
> In the Cassandra docs I've read, it's not described how the timestamp is
> determined for LWTs. It's not possible to specify a timestamp with "USING
> TIMESTAMP ...", and my best guess is that in the "read" phase of the LWT
> (between propose and commit) the timestamp is selected based on the
> timestamps of the cells read. However, after reading through the source
> code (mainly StorageProxy::cas) I can't find any hint of that.
>

It's not exactly how it works, but it yields a somewhat equivalent result.
Internally, LWTs use a so-called "ballot", which is a timeuuid, and the
underlying algorithm basically guarantees that the order of commit of
operations is the order of their ballots. The timestamp used for the
cells of a given operation is the timestamp part of that timeuuid
ballot, thus guaranteeing that this timestamp respects the order in which
operations are committed.

This is why you can't provide the timestamp client side: that timestamp is
picked server side and the value picked depends on when the operation is
committed.
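
As a rough illustration of why ballot-derived timestamps avoid the clock-skew problem, here is a toy Python model. It is a sketch under stated assumptions, not the actual Paxos implementation: the `Coordinator` class, the `write_cell` helper, and in particular the rule that a new ballot is the larger of the local clock and "latest ballot seen + 1" are hypothetical stand-ins for whatever the real algorithm does to keep ballots ordered.

```python
class Coordinator:
    """Toy coordinator whose local clock may be skewed (all values in microseconds)."""

    def __init__(self, clock_skew_micros):
        self.clock_skew_micros = clock_skew_micros

    def next_ballot(self, true_time_micros, latest_seen_ballot):
        local_now = true_time_micros + self.clock_skew_micros
        # Assumed rule: a new ballot must exceed every ballot already seen.
        return max(local_now, latest_seen_ballot + 1)


def write_cell(cell, value, timestamp):
    """Last-write-wins on a (value, timestamp) cell."""
    return (value, timestamp) if timestamp > cell[1] else cell


node_a = Coordinator(clock_skew_micros=0)
node_b = Coordinator(clock_skew_micros=-30_000)  # 30 ms behind

cell = ("version=1", 0)
latest_ballot = 0

# Node A commits first; the cell timestamp is taken from its ballot.
ballot_a = node_a.next_ballot(1_000_000, latest_ballot)
cell = write_cell(cell, "version=2", ballot_a)
latest_ballot = ballot_a

# Node B commits 10 ms later in real time, but its clock reads 980_000.
ballot_b = node_b.next_ballot(1_010_000, latest_ballot)
cell = write_cell(cell, "version=3", ballot_b)

print(ballot_a, ballot_b, cell)  # 1000000 1000001 ('version=3', 1000001)
```

In this model the second write still wins despite Node B's lagging clock, because its timestamp comes from a ballot that is ordered after the first one rather than from the local clock alone.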


