Right you are, thank you Cody.

Wondering if I may reach out again to the list and ask a similar question
in a more specific way:

Scenario: Cassandra 3.x, small cluster (<10 nodes), 1 DC

Is a batch warn threshold of 50kb, with average batch sizes in the 40kb
range, a recipe for regret?  Should we be considering a solution such as
the one Cody elucidated earlier in the thread, or am I over-worrying the
issue?
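
For reference, the knobs in question live in cassandra.yaml; I believe the
3.x defaults are:

    batch_size_warn_threshold_in_kb: 5
    batch_size_fail_threshold_in_kb: 50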


On Wed, Dec 7, 2016 at 4:08 PM, Cody Yancey <yan...@uber.com> wrote:

> There is a disconnect between write.3 and write.4, but it can only affect
> performance, not consistency. The presence or absence of a row's txnUUID in
> the IncompleteTransactions table is the ultimate source of truth, and rows
> whose txnUUID is not null will be checked against that truth in the read
> path.
>
> And yes, it is a good point: failures with this model will accumulate and
> degrade performance if you never clear out old failed transactions. The
> tables we have that use this generally use TTLs, so we don't really care as
> long as irrecoverable transaction failures are very rare.
>
> Thanks,
> Cody
>
> On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot <voytek.jar...@gmail.com>
> wrote:
>
>> Appreciate the long writeup Cody.
>>
>> Yeah, we're good with temporary inconsistency (thankfully) as well.  I'm
>> going to try to ride the batch train and hope it doesn't derail - our load
>> is fairly static (or, more precisely, increase in load is fairly slow and
>> can be projected).
>>
>> Enjoyed your two-phase commit text.  Presumably one would also have some
>> cleanup implementation that culls any failed updates (write.5) which could
>> be identified in read.3 / read.4?  Still a disconnect possible between
>> write.3 and write.4, but there's always something...
>>
>> We're insert-only (well, with some deletes via TTL, but anyway), so
>> that's somewhat tempting, but I'd rather not prematurely optimize.  Unless,
>> of course, anyone's got experience such that "batches over XXkb are
>> definitely going to be a problem".
>>
>> Appreciate everyone's time.
>> --Voytek Jarnot
>>
>> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey <yan...@uber.com> wrote:
>>
>>> Hi Voytek,
>>> I think the way you are using it is definitely the canonical way.
>>> Unfortunately, as you learned, there are some gotchas. We tried
>>> substantially increasing the batch size and it worked for a while, until we
>>> reached new scale, and we increased it again, and so forth. It works, but
>>> soon you start getting write timeouts, lots of them. And the thing about
>>> multi-partition batch statements is that they offer atomicity, but not
>>> isolation. This means your database can temporarily be in an inconsistent
>>> state while writes are propagating to the various machines.
>>>
>>> For our use case, we could deal with temporary inconsistency, as long as
>>> it was for a strictly bounded period of time, on the order of a few
>>> seconds. Unfortunately, as with all things eventually consistent, it
>>> degrades to "totally inconsistent" when your database is under heavy load
>>> and the time-bounds expand beyond what the application can handle. When a
>>> batch write times out, it often still succeeds (eventually) but your tables
>>> can be inconsistent for minutes, even while nodetool status shows all
>>> nodes up and normal.
>>>
>>> But there is another way, one that requires us to take a page from our
>>> RDBMS ancestors' book: multi-phase commit.
>>>
>>> Similar to logged batch writes, multi-phase commit patterns typically
>>> entail some write amplification cost for the benefit of stronger
>>> consistency guarantees across isolatable units (in Cassandra's case,
>>> *partitions*). However, multi-phase commit offers stronger guarantees
>>> than batch writes, and ALL of the additional write load is completely
>>> distributed as per your load-balancing policy, whereas batch writes all go
>>> through one coordinator node, then get written in their entirety to the
>>> batch log on two or three nodes, and then get dispersed in a distributed
>>> fashion from there.
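>>>
>>> (For concreteness, a rough sketch of such a multi-partition logged batch
>>> with the DataStax Python driver; keyspace, table, and column names here
>>> are just placeholders, not a prescription:)
>>>
>>>     import uuid
>>>     from cassandra.cluster import Cluster
>>>     from cassandra.query import BatchStatement, BatchType, SimpleStatement
>>>
>>>     session = Cluster(["127.0.0.1"]).connect("my_keyspace")
>>>
>>>     def insert_everywhere(day, username, payload):
>>>         row_id = uuid.uuid4()
>>>         batch = BatchStatement(batch_type=BatchType.LOGGED)
>>>         batch.add(SimpleStatement(
>>>             "INSERT INTO events_by_day (day, id, payload) "
>>>             "VALUES (%s, %s, %s)"), (day, row_id, payload))
>>>         batch.add(SimpleStatement(
>>>             "INSERT INTO events_by_user (username, id, payload) "
>>>             "VALUES (%s, %s, %s)"), (username, row_id, payload))
>>>         # one coordinator writes the whole batch to the batch log on a
>>>         # couple of nodes, then replays the individual mutations from there
>>>         session.execute(batch)   # atomic across partitions, not isolated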
>>>
>>> A typical two-phase commit pattern looks like this (a rough client-side
>>> sketch follows each list):
>>>
>>> The Write Path
>>>
>>>    1. The client code chooses a random UUID.
>>>    2. The client writes the UUID into the IncompleteTransactions table,
>>>    which only has one column, the transactionUUID.
>>>    3. The client makes all of the inserts involved in the transaction,
>>>    IN PARALLEL, with the transactionUUID duplicated in every inserted row.
>>>    4. The client deletes the UUID from IncompleteTransactions table.
>>>    5. The client updates all of the rows it inserted, IN PARALLEL,
>>>    setting the transactionUUID to null.
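>>>
>>> A minimal sketch of the write path, assuming the DataStax Python driver
>>> and made-up keyspace/table/column names (error handling omitted):
>>>
>>>     import uuid
>>>     from cassandra.cluster import Cluster
>>>
>>>     session = Cluster(["127.0.0.1"]).connect("my_keyspace")
>>>
>>>     def transactional_insert(rows):
>>>         """rows: list of (table, row_id, payload), one entry per table."""
>>>         txn = uuid.uuid4()                                   # step 1
>>>         # step 2: record the transaction as incomplete (the TTL bounds
>>>         # how long an irrecoverable failure can linger)
>>>         session.execute(
>>>             "INSERT INTO incomplete_transactions (txn_uuid) "
>>>             "VALUES (%s) USING TTL 86400", [txn])
>>>         # step 3: all inserts in parallel, each tagged with the txn UUID
>>>         futures = [session.execute_async(
>>>             "INSERT INTO " + table +
>>>             " (id, payload, txn_uuid) VALUES (%s, %s, %s)",
>>>             [row_id, payload, txn])
>>>             for table, row_id, payload in rows]
>>>         for f in futures:
>>>             f.result()               # raises if any insert failed
>>>         # step 4: the transaction is now complete
>>>         session.execute(
>>>             "DELETE FROM incomplete_transactions WHERE txn_uuid = %s",
>>>             [txn])
>>>         # step 5: null out the marker column on every row, in parallel
>>>         futures = [session.execute_async(
>>>             "UPDATE " + table + " SET txn_uuid = null WHERE id = %s",
>>>             [row_id])
>>>             for table, row_id, _ in rows]
>>>         for f in futures:
>>>             f.result()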
>>>
>>> The Read Path
>>>
>>>    1. The client reads some rows from a partition. If this particular
>>>    client request can handle extraneous rows, you are done. If not, read
>>>    on to step #2.
>>>    2. The client gathers the set of unique transactionUUIDs. In the
>>>    main case, they've all been deleted by step #5 in the Write Path. If not,
>>>    go to #3.
>>>    3. For remaining transactionUUIDs (which should be a very small
>>>    number), query the IncompleteTransactions table.
>>>    4. The client code culls rows where the transactionUUID existed in
>>>    the IncompleteTransactions table.
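>>>
>>> And a similarly rough sketch of the read path, under the same assumed
>>> schema (reusing the session object from the write-path sketch; the
>>> partition key column name is likewise made up):
>>>
>>>     def read_without_incomplete(table, pk):
>>>         # step 1: fetch the candidate rows
>>>         rows = list(session.execute(
>>>             "SELECT id, payload, txn_uuid FROM " + table +
>>>             " WHERE pk = %s", [pk]))
>>>         # step 2: gather still-tagged transaction UUIDs (usually none)
>>>         pending = {r.txn_uuid for r in rows if r.txn_uuid is not None}
>>>         # step 3: check that small set against the source of truth
>>>         incomplete = set()
>>>         for txn in pending:
>>>             if session.execute(
>>>                     "SELECT txn_uuid FROM incomplete_transactions "
>>>                     "WHERE txn_uuid = %s", [txn]).one():
>>>                 incomplete.add(txn)
>>>         # step 4: cull rows belonging to incomplete transactions
>>>         return [r for r in rows if r.txn_uuid not in incomplete]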
>>>
>>> This is just an example, one that is reasonably performant for
>>> ledger-style non-updated inserts. For transactions involving updates to
>>> possibly existing data, more effort is required: generally the client needs
>>> to be smart enough to merge updates based on a timestamp, with a periodic
>>> batch job that cleans out obsolete inserts. If it feels like reinventing
>>> the wheel, that's because it is. But it just might be the quickest path to
>>> what you need.
>>>
>>> Thanks,
>>> Cody
>>>
>>> On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>>> I have been circling around a thought process over batches. Now that
>>>> Cassandra has aggregating functions, it might be possible to write a type
>>>> of record that has an END_OF_BATCH type marker and the data can be
>>>> suppressed from view until it is all there.
>>>>
>>>> I.e., you write something like a checksum record that an intelligent
>>>> client can use to tell whether the rest of the batch is complete.
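>>>>
>>>> Purely as an illustration of the idea, a rough client-side sketch where
>>>> the marker row carries the number of rows the writer intended to land
>>>> (the table, column names, and the count-based check are assumptions, not
>>>> a worked-out design):
>>>>
>>>>     def read_if_complete(session, pk):
>>>>         rows = list(session.execute(
>>>>             "SELECT seq, payload, is_eob, expected_rows FROM events "
>>>>             "WHERE pk = %s", [pk]))
>>>>         # the END_OF_BATCH marker row says how many data rows should be
>>>>         # visible once the whole batch has landed
>>>>         marker = next((r for r in rows if r.is_eob), None)
>>>>         if marker is None or len(rows) - 1 < marker.expected_rows:
>>>>             return []        # suppress the partial batch from view
>>>>         return [r for r in rows if not r.is_eob]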
>>>>
>>>> On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot <voytek.jar...@gmail.com
>>>> > wrote:
>>>>
>>>>> Been about a month since I gave up on it, but it was very much related
>>>>> to the stuff you're dealing with ... Basically Cassandra just stepping on
>>>>> its own.... errrrr, tripping over its own feet streaming MVs.
>>>>>
>>>>> On Dec 7, 2016 10:45 AM, "Benjamin Roth" <benjamin.r...@jaumo.com>
>>>>> wrote:
>>>>>
>>>>>> I meant the mv thing
>>>>>>
>>>>>> On 07.12.2016 17:27, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>>>>
>>>>>>> Sure, about which part?
>>>>>>>
>>>>>>> default batch size warning is 5kb
>>>>>>> I've increased it to 30kb, and will need to increase to 40kb (8x
>>>>>>> default setting) to avoid WARN log messages about batch sizes.  I do
>>>>>>> realize it's just a WARNing, but may as well avoid those if I can
>>>>>>> configure it out.  That said, having to increase it so substantially
>>>>>>> (and we're only dealing with 5 tables) is making me wonder if I'm not
>>>>>>> taking the correct approach in terms of using batches to guarantee
>>>>>>> atomicity.
>>>>>>>
>>>>>>> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth <
>>>>>>> benjamin.r...@jaumo.com> wrote:
>>>>>>>
>>>>>>>> Could you please be more specific?
>>>>>>>>
>>>>>>>> On 07.12.2016 17:10, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Should've mentioned - running 3.9.  Also - please do not recommend
>>>>>>>>> MVs: I tried, they're broken, we punted.
>>>>>>>>>
>>>>>>>>> On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot <
>>>>>>>>> voytek.jar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The low default value for batch_size_warn_threshold_in_kb is
>>>>>>>>>> making me wonder if I'm perhaps approaching the problem of
>>>>>>>>>> atomicity in a non-ideal fashion.
>>>>>>>>>>
>>>>>>>>>> With one data set duplicated/denormalized into 5 tables to
>>>>>>>>>> support queries, we use batches to ensure inserts make it to all
>>>>>>>>>> or 0 tables.  This works fine, but I've had to bump the warn
>>>>>>>>>> threshold and fail threshold substantially (8x higher for the warn
>>>>>>>>>> threshold).  This - in turn - makes me wonder, with a default
>>>>>>>>>> setting so low, if I'm not solving this problem in the
>>>>>>>>>> canonical/standard way.
>>>>>>>>>>
>>>>>>>>>> Mostly just looking for confirmation that we're not
>>>>>>>>>> unintentionally doing something weird...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>>
>>
>
