Right you are, thank you Cody. Wondering if I may reach out again to the list and ask a similar question in a more specific way:
Scenario: Cassandra 3.x, small cluster (<10 nodes), 1 DC. Is a batch warn threshold of 50KB, with average batch sizes in the 40KB range, a recipe for regret? Should we be considering a solution such as the one Cody elucidated earlier in the thread, or am I over-worrying the issue?

On Wed, Dec 7, 2016 at 4:08 PM, Cody Yancey <yan...@uber.com> wrote:

> There is a disconnect between write.3 and write.4, but it can only affect
> performance, not consistency. The presence or absence of a row's txnUUID in
> the IncompleteTransactions table is the ultimate source of truth, and rows
> whose txnUUID is not null will be checked against that truth in the read
> path.
>
> And yes, it is a good point: failures with this model will accumulate and
> degrade performance if you never clear out old failed transactions. The
> tables we have that use this generally use TTLs, so we don't really care as
> long as irrecoverable transaction failures are very rare.
>
> Thanks,
> Cody
>
> On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot <voytek.jar...@gmail.com> wrote:
>
>> Appreciate the long writeup, Cody.
>>
>> Yeah, we're good with temporary inconsistency (thankfully) as well. I'm
>> going to try to ride the batch train and hope it doesn't derail - our load
>> is fairly static (or, more precisely, increase in load is fairly slow and
>> can be projected).
>>
>> Enjoyed your two-phase commit text. Presumably one would also have some
>> cleanup implementation that culls any failed updates (write.5), which could
>> be identified in read.3 / read.4? A disconnect is still possible between
>> write.3 and write.4, but there's always something...
>>
>> We're insert-only (well, with some deletes via TTL, but anyway), so
>> that's somewhat tempting, but I'd rather not prematurely optimize. Unless,
>> of course, anyone's got experience such that "batches over XX KB are
>> definitely going to be a problem".
>>
>> Appreciate everyone's time.
>> --Voytek Jarnot
>>
>> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey <yan...@uber.com> wrote:
>>
>>> Hi Voytek,
>>> I think the way you are using it is definitely the canonical way.
>>> Unfortunately, as you learned, there are some gotchas. We tried
>>> substantially increasing the batch size and it worked for a while, until
>>> we reached new scale, and we increased it again, and so forth. It works,
>>> but soon you start getting write timeouts, lots of them. And the thing
>>> about multi-partition batch statements is that they offer atomicity, but
>>> not isolation. This means your database can temporarily be in an
>>> inconsistent state while writes are propagating to the various machines.
>>>
>>> For our use case, we could deal with temporary inconsistency, as long as
>>> it was for a strictly bounded period of time, on the order of a few
>>> seconds. Unfortunately, as with all things eventually consistent, it
>>> degrades to "totally inconsistent" when your database is under heavy load
>>> and the time-bounds expand beyond what the application can handle. When a
>>> batch write times out, it often still succeeds (eventually), but your
>>> tables can be inconsistent for minutes, even while nodetool status shows
>>> all nodes up and normal.
>>>
>>> But there is another way, one that requires us to take a page from our
>>> RDBMS ancestors' book: multi-phase commit.
>>>
>>> Similar to logged batch writes, multi-phase commit patterns typically
>>> entail some write amplification cost for the benefit of stronger
>>> consistency guarantees across isolatable units (in Cassandra's case,
>>> *partitions*). However, multi-phase commit offers stronger guarantees
>>> than batch writes, and ALL of the additional write load is completely
>>> distributed as per your load-balancing policy, whereas batch writes all
>>> go through one coordinator node, then get written in their entirety to
>>> the batch log on two or three nodes, and then get dispersed in a
>>> distributed fashion from there.
>>>
>>> A typical two-phase commit pattern looks like this:
>>>
>>> The Write Path
>>>
>>> 1. The client code chooses a random UUID.
>>> 2. The client writes the UUID into the IncompleteTransactions table,
>>>    which only has one column, the transactionUUID.
>>> 3. The client makes all of the inserts involved in the transaction,
>>>    IN PARALLEL, with the transactionUUID duplicated in every inserted row.
>>> 4. The client deletes the UUID from the IncompleteTransactions table.
>>> 5. The client updates all of the rows it inserted, IN PARALLEL, setting
>>>    the transactionUUID to null.
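A rough sketch of that write path in Python, using the DataStax
cassandra-driver. The keyspace, the "ledger" data table, and all column
names are made up for illustration; only the IncompleteTransactions table
comes from Cody's description:

    import uuid
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])     # assumed local test node
    session = cluster.connect('my_ks')   # 'my_ks' is a made-up keyspace

    # Hypothetical schema:
    #   CREATE TABLE incomplete_transactions (txn_uuid uuid PRIMARY KEY);
    #   CREATE TABLE ledger (pk text, ck text, payload text, txn_uuid uuid,
    #                        PRIMARY KEY (pk, ck));

    insert_txn = session.prepare(
        "INSERT INTO incomplete_transactions (txn_uuid) VALUES (?)")
    insert_row = session.prepare(
        "INSERT INTO ledger (pk, ck, payload, txn_uuid) VALUES (?, ?, ?, ?)")
    delete_txn = session.prepare(
        "DELETE FROM incomplete_transactions WHERE txn_uuid = ?")
    clear_txn = session.prepare(
        "UPDATE ledger SET txn_uuid = null WHERE pk = ? AND ck = ?")

    def two_phase_write(rows):
        txn = uuid.uuid4()                                  # write.1
        session.execute(insert_txn, (txn,))                 # write.2
        # write.3: all transaction inserts, in parallel
        futures = [session.execute_async(insert_row, (pk, ck, payload, txn))
                   for (pk, ck, payload) in rows]
        for f in futures:
            f.result()  # surface any insert failure before committing
        session.execute(delete_txn, (txn,))                 # write.4
        # write.5: null out the marker column, in parallel
        futures = [session.execute_async(clear_txn, (pk, ck))
                   for (pk, ck, _) in rows]
        for f in futures:
            f.result()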
>>> The Read Path
>>>
>>> 1. The client reads some rows from a partition. If this particular
>>>    client request can handle extraneous rows, you are done. If not, read
>>>    on to step #2.
>>> 2. The client gathers the set of unique transactionUUIDs. In the main
>>>    case, they've all been deleted by step #5 in the Write Path. If not,
>>>    go to #3.
>>> 3. For the remaining transactionUUIDs (which should be a very small
>>>    number), query the IncompleteTransactions table.
>>> 4. The client code culls rows whose transactionUUID existed in the
>>>    IncompleteTransactions table.
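And a matching sketch of the read path, under the same made-up schema and
driver session as the write-path sketch above:

    select_rows = session.prepare(
        "SELECT pk, ck, payload, txn_uuid FROM ledger WHERE pk = ?")
    check_txn = session.prepare(
        "SELECT txn_uuid FROM incomplete_transactions WHERE txn_uuid = ?")

    def two_phase_read(pk):
        rows = list(session.execute(select_rows, (pk,)))      # read.1
        # read.2: gather the unique non-null transactionUUIDs
        pending = {r.txn_uuid for r in rows if r.txn_uuid is not None}
        # read.3: check the survivors against IncompleteTransactions
        incomplete = {t for t in pending
                      if session.execute(check_txn, (t,)).one() is not None}
        # read.4: cull rows from transactions that never completed
        return [r for r in rows
                if r.txn_uuid is None or r.txn_uuid not in incomplete]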
>>> This is just an example, one that is reasonably performant for
>>> ledger-style non-updated inserts. For transactions involving updates to
>>> possibly existing data, more effort is required: generally the client
>>> needs to be smart enough to merge updates based on a timestamp, with a
>>> periodic batch job that cleans out obsolete inserts. If it feels like
>>> reinventing the wheel, that's because it is. But it just might be the
>>> quickest path to what you need.
>>>
>>> Thanks,
>>> Cody
>>>
>>> On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> I have been circling around a thought process over batches. Now that
>>>> Cassandra has aggregating functions, it might be possible to write a
>>>> type of record that has an END_OF_BATCH type marker, so the data can be
>>>> suppressed from view until it is all there.
>>>>
>>>> IE you write something like a checksum record that an intelligent
>>>> client can use to tell if the rest of the batch is complete.
>>>>
>>>> On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot <voytek.jar...@gmail.com> wrote:
>>>>
>>>>> Been about a month since I gave up on it, but it was very much related
>>>>> to the stuff you're dealing with... basically Cassandra just stepping
>>>>> on its own... errrrr, tripping over its own feet streaming MVs.
>>>>>
>>>>> On Dec 7, 2016 10:45 AM, "Benjamin Roth" <benjamin.r...@jaumo.com> wrote:
>>>>>
>>>>>> I meant the MV thing
>>>>>>
>>>>>> On 07.12.2016 at 17:27, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>>>>
>>>>>>> Sure, about which part?
>>>>>>>
>>>>>>> The default batch size warning threshold is 5KB.
>>>>>>> I've increased it to 30KB, and will need to increase it to 40KB (8x
>>>>>>> the default setting) to avoid WARN log messages about batch sizes. I
>>>>>>> do realize it's just a WARNing, but may as well avoid those if I can
>>>>>>> configure it out. That said, having to increase it so substantially
>>>>>>> (and we're only dealing with 5 tables) is making me wonder if I'm not
>>>>>>> taking the correct approach in terms of using batches to guarantee
>>>>>>> atomicity.
>>>>>>>
>>>>>>> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth <benjamin.r...@jaumo.com> wrote:
>>>>>>>
>>>>>>>> Could you please be more specific?
>>>>>>>>
>>>>>>>> On 07.12.2016 at 17:10, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Should've mentioned - running 3.9. Also - please do not recommend
>>>>>>>>> MVs: I tried, they're broken, we punted.
>>>>>>>>>
>>>>>>>>> On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot <voytek.jar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The low default value for batch_size_warn_threshold_in_kb is
>>>>>>>>>> making me wonder if I'm perhaps approaching the problem of
>>>>>>>>>> atomicity in a non-ideal fashion.
>>>>>>>>>>
>>>>>>>>>> With one data set duplicated/denormalized into 5 tables to support
>>>>>>>>>> queries, we use batches to ensure inserts make it to all or 0
>>>>>>>>>> tables. This works fine, but I've had to bump the warn threshold
>>>>>>>>>> and fail threshold substantially (8x higher for the warn
>>>>>>>>>> threshold). This - in turn - makes me wonder, with a default
>>>>>>>>>> setting so low, if I'm not solving this problem in the
>>>>>>>>>> canonical/standard way.
>>>>>>>>>>
>>>>>>>>>> Mostly just looking for confirmation that we're not
>>>>>>>>>> unintentionally doing something weird...
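For reference, the batch approach being questioned above looks roughly like
this in Python (same driver session as the earlier sketches; the five table
names are made up). The serialized size of such a batch is what
batch_size_warn_threshold_in_kb in cassandra.yaml compares against:

    from cassandra.query import BatchStatement, BatchType

    # Prepared inserts for five hypothetical denormalized tables.
    suffixes = ('id', 'date', 'user', 'type', 'source')
    inserts = [session.prepare(
                   "INSERT INTO events_by_%s (k, v) VALUES (?, ?)" % s)
               for s in suffixes]

    def atomic_insert(k, v):
        # LOGGED (the default) buys all-or-nothing across the tables, at
        # the cost of the batch-log hop Cody described; it does not buy
        # isolation, so readers can briefly see partial results.
        batch = BatchStatement(batch_type=BatchType.LOGGED)
        for stmt in inserts:
            batch.add(stmt, (k, v))
        session.execute(batch)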