Right you are, thank you Cody.
Wondering if I may reach out again to the list and ask a similar question
in a more specific way:
Scenario: Cassandra 3.x, small cluster (<10 nodes), 1 DC
Are a batch warn threshold of 50kb and average batch sizes in the 40kb
range a recipe for regret? Should we be considering a solution such as the
one Cody elucidated earlier in the thread, or am I over-worrying the issue?
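(For context, the knobs I mean are the batch thresholds in cassandra.yaml; in
3.x the shipped defaults are far below the numbers above, so 40kb average
batches would already be brushing the stock fail threshold. Defaults below as
I understand them, please correct me if I've misread:)

    # cassandra.yaml (Cassandra 3.x defaults)
    batch_size_warn_threshold_in_kb: 5     # default; the 50kb in question would go here
    batch_size_fail_threshold_in_kb: 50    # default
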
On Wed, Dec 7, 2016 at 4:08 PM, Cody Yancey wrote:
> There is a disconnect between write.3 and write.4, but it can only affect
> performance, not consistency. The presence or absence of a row's txnUUID in
> the IncompleteTransactions table is the ultimate source of truth, and rows
> whose txnUUID is not null will be checked against that truth in the read
> path.
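>
> (If it helps to picture that check, here is roughly what the read path's
> filtering looks like with the Python driver; the table and column names are
> placeholders, not our actual schema:)
>
>     from cassandra.cluster import Cluster
>
>     session = Cluster(['127.0.0.1']).connect('my_ks')
>     pk = 'some-partition'  # hypothetical partition being read
>
>     rows = list(session.execute(
>         "SELECT id, payload, txn_uuid FROM events"
>         " WHERE partition_key = %s", (pk,)))
>     pending = {r.txn_uuid for r in rows if r.txn_uuid is not None}
>     # A txnUUID still present in IncompleteTransactions means that
>     # transaction has not committed; hide its rows from this read.
>     in_doubt = {t for t in pending if session.execute(
>         "SELECT transaction_uuid FROM incomplete_transactions"
>         " WHERE transaction_uuid = %s", (t,)).one()}
>     visible = [r for r in rows if r.txn_uuid not in in_doubt]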
>
> And yes, it is a good point, failures with this model will accumulate and
> degrade performance if you never clear out old failed transactions. The
> tables we have that use this generally use TTLs so we don't really care as
> long as irrecoverable transaction failures are very rare.
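>
> (Concretely, "use TTLs" just means the data tables carry a default TTL, so
> leftovers from failed transactions age out on their own. A minimal sketch
> with the Python driver; the table name, columns, and the 7-day value are
> illustrative:)
>
>     from cassandra.cluster import Cluster
>
>     session = Cluster(['127.0.0.1']).connect('my_ks')
>     # 604800 seconds = 7 days; pick whatever horizon fits your data.
>     session.execute("""
>         CREATE TABLE IF NOT EXISTS events (
>             partition_key text,
>             id            timeuuid,
>             payload       text,
>             txn_uuid      uuid,
>             PRIMARY KEY (partition_key, id)
>         ) WITH default_time_to_live = 604800
>     """)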
>
> Thanks,
> Cody
>
> On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot
> wrote:
>
>> Appreciate the long writeup Cody.
>>
>> Yeah, we're good with temporary inconsistency (thankfully) as well. I'm
>> going to try to ride the batch train and hope it doesn't derail - our load
>> is fairly static (or, more precisely, the increase in load is fairly slow
>> and can be projected).
>>
>> Enjoyed your two-phase commit text. Presumably one would also have some
>> cleanup implementation that culls any failed updates (write.5) which could
>> be identified in read.3 / read.4? There's still a possible disconnect
>> between write.3 and write.4, but there's always something...
>>
>> We're insert-only (well, with some deletes via TTL, but anyway), so
>> that's somewhat tempting, but I'd rather not prematurely optimize. Unless,
>> of course, anyone's got experience such that "batches over XXkb are
>> definitely going to be a problem".
>>
>> Appreciate everyone's time.
>> --Voytek Jarnot
>>
>> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey wrote:
>>
>>> Hi Voytek,
>>> I think the way you are using it is definitely the canonical way.
>>> Unfortunately, as you learned, there are some gotchas. We tried
>>> substantially increasing the batch size and it worked for a while, until we
>>> reached a new scale and had to increase it again, and so forth. It works, but
>>> soon you start getting write timeouts, lots of them. And the thing about
>>> multi-partition batch statements is that they offer atomicity, but not
>>> isolation. This means your database can temporarily be in an inconsistent
>>> state while writes are propagating to the various machines.
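>>>
>>> (Concretely, a multi-partition logged batch looks something like this with
>>> the Python driver; table and column names are illustrative. The batch will
>>> eventually apply in full or not at all, but a concurrent reader can see one
>>> partition's write before the other's has landed:)
>>>
>>>     from cassandra.cluster import Cluster
>>>     from cassandra.query import BatchStatement, BatchType
>>>
>>>     session = Cluster(['127.0.0.1']).connect('my_ks')
>>>     ins = session.prepare(
>>>         "INSERT INTO events (partition_key, id, payload)"
>>>         " VALUES (?, now(), ?)")
>>>     batch = BatchStatement(batch_type=BatchType.LOGGED)
>>>     batch.add(ins, ('partition-a', 'first half'))   # one replica set
>>>     batch.add(ins, ('partition-b', 'second half'))  # a different one
>>>     session.execute(batch)  # atomic across partitions, but not isolated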
>>>
>>> For our use case, we could deal with temporary inconsistency, as long as
>>> it was for a strictly bounded period of time, on the order of a few
>>> seconds. Unfortunately, as with all things eventually consistent, it
>>> degrades to "totally inconsistent" when your database is under heavy load
>>> and the time-bounds expand beyond what the application can handle. When a
>>> batch write times out, it often still succeeds (eventually) but your tables
>>> can be inconsistent for minutes, even while nodetool status shows all
>>> nodes up and normal.
>>>
>>> But there is another way, one that requires us to take a page from our RDBMS
>>> ancestors' book: multi-phase commit.
>>>
>>> Similar to logged batch writes, multi-phase commit patterns typically
>>> entail some write amplification cost for the benefit of stronger
>>> consistency guarantees across isolatable units (in Cassandra's case,
>>> *partitions*). However, multi-phase commit offers stronger guarantees
>>> than batch writes, and ALL of the additional write load is completely
>>> distributed as per your load-balancing policy, whereas batch writes all go
>>> through one coordinator node, then get written in their entirety to the
>>> batch log on two or three nodes, and then get dispersed in a distributed
>>> fashion from there.
>>>
>>> A typical two-phase commit pattern looks like this:
>>>
>>> The Write Path
>>>
>>> 1. The client code chooses a random UUID.
>>> 2. The client writes the UUID into the IncompleteTransactions table,
>>> which only has one column, the transactionUUID.
>>> 3. The client makes all of the inserts involved in the transaction, IN
>>> PARALLEL, with the transactionUUID duplicated in every inserted row.
>>> 4. The client deletes the UUID from the IncompleteTransactions table.
>>> 5. The client then updates all of the rows it inserted, again IN
>>> PARALLEL, setting the transactionUUID to null (sketched in code right
>>> below).
>>>
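>>> (A rough sketch of those five steps with the Python driver; all names are
>>> illustrative, and the parallel steps simply fire execute_async and then
>>> wait on the futures:)
>>>
>>>     import uuid
>>>     from cassandra.cluster import Cluster
>>>
>>>     session = Cluster(['127.0.0.1']).connect('my_ks')
>>>
>>>     def two_phase_write(rows):
>>>         """rows: list of (partition_key, row_id, payload) in one logical txn."""
>>>         txn = uuid.uuid4()                                         # write.1
>>>         session.execute(
>>>             "INSERT INTO incomplete_transactions (transaction_uuid)"
>>>             " VALUES (%s)", (txn,))                                # write.2
>>>         ins = session.prepare(
>>>             "INSERT INTO events (partition_key, id, payload, txn_uuid)"
>>>             " VALUES (?, ?, ?, ?)")
>>>         futures = [session.execute_async(ins, (pk, rid, body, txn))
>>>                    for pk, rid, body in rows]                      # write.3
>>>         for f in futures:
>>>             f.result()  # surface any insert failure before "committing"
>>>         session.execute(
>>>             "DELETE FROM incomplete_transactions"
>>>             " WHERE transaction_uuid = %s", (txn,))                # write.4
>>>         upd = session.prepare(
>>>             "UPDATE events SET txn_uuid = null"
>>>             " WHERE partition_key = ? AND id = ?")
>>>         done = [session.execute_async(upd, (pk, rid))
>>>                 for pk, rid, _ in rows]                            # write.5
>>>         for f in done:
>>>             f.result()
>>>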
>>> The Read Path
>>>
>>> 1. The client reads some rows from a partition. If this particular client
>>> request can handle extraneous rows, you are done. If not, read on to
>>> step #2.
>>> 2. The client gathers the set of unique transactionUUIDs. In the main
>>> case, they've all been deleted by step #5 in the Write Path. If no