Preventing data showing up in Cassandra logs

2016-12-09 Thread Voytek Jarnot
I'm happy with INFO level logging from Cassandra in principle, but am
wondering if there's any option to prevent Cassandra from exposing data in
the logs (without necessarily changing log levels)?

SSTableIndex.open logs minTerm, maxTerm, minKey, and maxKey which expose
data, as does the "Writing large partition" WARNing (exposes the partition
key).  There are probably others.  The large partition warning would
probably be mostly useless without logging the partition key, but - in any
case - there are usage scenarios where data in logs is prohibited.
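
The only workaround I can think of is a per-logger override in conf/logback.xml,
but that suppresses the whole message rather than redacting the data, which is
what I'd like to avoid.  For concreteness, it would look something like this
(the logger names are my guess for 3.x; match them to whatever class actually
appears in your log lines):

<!-- conf/logback.xml -->
<logger name="org.apache.cassandra.index.sasi" level="WARN"/>
<logger name="org.apache.cassandra.io.sstable.format.big.BigTableWriter" level="ERROR"/>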

Thanks,
Voytek Jarnot


Re: Batch size warnings

2016-12-09 Thread Voytek Jarnot
Right you are, thank you Cody.

Wondering if I may reach out again to the list and ask a similar question
in a more specific way:

Scenario: Cassandra 3.x, small cluster (<10 nodes), 1 DC

Is a batch warn threshold of 50kb, with average batch sizes in the 40kb
range, a recipe for regret?  Should we be considering a solution such as the
one Cody elucidated earlier in the thread, or am I over-worrying the issue?
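
For context, the thresholds I'm referring to are the cassandra.yaml settings
below (3.x names; the warn value reflects our scenario, and the commented fail
value is, I believe, the 3.x default):

# cassandra.yaml
batch_size_warn_threshold_in_kb: 50     # our warn threshold, per the question above
# batch_size_fail_threshold_in_kb: 50   # 3.x default; batches above this are rejected outright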


On Wed, Dec 7, 2016 at 4:08 PM, Cody Yancey  wrote:

> There is a disconnect between write.3 and write.4, but it can only affect
> performance, not consistency. The presence or absence of a row's txnUUID in
> the IncompleteTransactions table is the ultimate source of truth, and rows
> whose txnUUID are not null will be checked against that truth in the read
> path.
>
> And yes, it is a good point, failures with this model will accumulate and
> degrade performance if you never clear out old failed transactions. The
> tables we have that use this generally use TTLs so we don't really care as
> long as irrecoverable transaction failures are very rare.
>
> Thanks,
> Cody
>
> On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot 
> wrote:
>
>> Appreciate the long writeup Cody.
>>
>> Yeah, we're good with temporary inconsistency (thankfully) as well.  I'm
>> going to try to ride the batch train and hope it doesn't derail - our load
>> is fairly static (or, more precisely, increase in load is fairly slow and
>> can be projected).
>>
>> Enjoyed your two-phase commit text.  Presumably one would also have some
>> cleanup implementation that culls any failed updates (write.5) which could
>> be identified in read.3 / read.4?  Still a disconnect possible between
>> write.3 and write.4, but there's always something...
>>
>> We're insert-only (well, with some deletes via TTL, but anyway), so
>> that's somewhat tempting, but I'd rather not prematurely optimize.  Unless,
>> of course, anyone's got experience such that "batches over XXkb are
>> definitely going to be a problem".
>>
>> Appreciate everyone's time.
>> --Voytek Jarnot
>>
>> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey  wrote:
>>
>>> Hi Voytek,
>>> I think the way you are using it is definitely the canonical way.
>>> Unfortunately, as you learned, there are some gotchas. We tried
>>> substantially increasing the batch size and it worked for a while, until we
>>> reached new scale, and we increased it again, and so forth. It works, but
>>> soon you start getting write timeouts, lots of them. And the thing about
>>> multi-partition batch statements is that they offer atomicity, but not
>>> isolation. This means your database can temporarily be in an inconsistent
>>> state while writes are propagating to the various machines.
>>>
>>> For our use case, we could deal with temporary inconsistency, as long as
>>> it was for a strictly bounded period of time, on the order of a few
>>> seconds. Unfortunately, as with all things eventually consistent, it
>>> degrades to "totally inconsistent" when your database is under heavy load
>>> and the time-bounds expand beyond what the application can handle. When a
>>> batch write times out, it often still succeeds (eventually) but your tables
>>> can be inconsistent for minutes, even while nodetool status shows all
>>> nodes up and normal.
>>>
>>> But there is another way, that requires us to take a page from our RDBMS
>>> ancestors' book: multi-phase commit.
>>>
>>> Similar to logged batch writes, multi-phase commit patterns typically
>>> entail some write amplification cost for the benefit of stronger
>>> consistency guarantees across isolatable units (in Cassandra's case,
>>> *partitions*). However, multi-phase commit offers stronger guarantees
>>> than batch writes, and ALL of the additional write load is completely
>>> distributed as per your load-balancing policy, whereas batch writes all go
>>> through one coordinator node, then get written in their entirety to the
>>> batch log on two or three nodes, and then get dispersed in a distributed
>>> fashion from there.
>>>
>>> A typical two-phase commit pattern looks like this:
>>>
>>> The Write Path
>>>
>>>1. The client code chooses a random UUID.
>>>2. The client writes the UUID into the IncompleteTransactions table,
>>>which only has one column, the transactionUUID.
>>>3. The client makes all of the inserts involved in the transaction,
>>>IN PARALLEL, with the transactionUUID duplicated in every inserted row.
>>>4. The client deletes the UUID from IncompleteTransactions table.
>>>5. The client makes parallel updates to all of the rows it inserted,
>>>IN PARALLEL, setting the transactionUUID to null.
>>>
>>> The Read Path
>>>
>>>1. The client reads some rows from a partition. If this particular
>>>    client request can handle extraneous rows, you are done. If not, read
>>>    on to step #2.
>>>2. The client gathers the set of unique transactionUUIDs. In the
>>>main case, they've all been deleted by step #5 in the Write Path. If no
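
For anyone following along, below is a minimal sketch of the write path above
using the DataStax Java driver 3.x.  The keyspace, table, and column names
(demo_ks, incomplete_transactions, events, txn_uuid) are illustrative
placeholders, not anything from this thread.

// Assumed placeholder schema:
//   CREATE TABLE incomplete_transactions (txn_uuid uuid PRIMARY KEY);
//   CREATE TABLE events (pk text, row_id uuid, payload text, txn_uuid uuid,
//                        PRIMARY KEY (pk, row_id));
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class TwoPhaseWriteSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo_ks")) {

            // write.1: choose a random transaction UUID
            UUID txn = UUID.randomUUID();

            // write.2: record the transaction as incomplete
            session.execute("INSERT INTO incomplete_transactions (txn_uuid) VALUES (?)", txn);

            // write.3: all inserts in parallel, each row carrying the transaction UUID
            List<UUID> rowIds = new ArrayList<>();
            List<ResultSetFuture> inserts = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                UUID rowId = UUID.randomUUID();
                rowIds.add(rowId);
                inserts.add(session.executeAsync(
                        "INSERT INTO events (pk, row_id, payload, txn_uuid) VALUES (?, ?, ?, ?)",
                        "partition-" + i, rowId, "payload-" + i, txn));
            }
            inserts.forEach(ResultSetFuture::getUninterruptibly); // wait for every insert

            // write.4: transaction complete - remove the marker row
            session.execute("DELETE FROM incomplete_transactions WHERE txn_uuid = ?", txn);

            // write.5: clear the transaction UUID from the inserted rows, again in parallel
            List<ResultSetFuture> clears = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                clears.add(session.executeAsync(
                        "UPDATE events SET txn_uuid = null WHERE pk = ? AND row_id = ?",
                        "partition-" + i, rowIds.get(i)));
            }
            clears.forEach(ResultSetFuture::getUninterruptibly);
        }
    }
}

On the read side, any row whose txn_uuid is non-null and still present in
incomplete_transactions would be treated as in-flight and filtered out (or
re-checked), per the read path described above.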

Node doesn't join to the ring

2016-12-09 Thread Aleksandr Ivanov
I'm trying to join a node to the ring with the nodetool join command, but it
fails with an error message about an inconsistent replica.
My steps:
1. run cassandra with -Dcassandra.join_ring=false
2. wait a couple of minutes
3. ensure that all nodes are in UN state
4. run nodetool join
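
(For step 1, the property can be passed on the command line of the tarball
launcher, or via JVM_OPTS in cassandra-env.sh for a service install; the paths
below are the usual defaults, not necessarily mine:)

bin/cassandra -Dcassandra.join_ring=false
# or, equivalently, in conf/cassandra-env.sh before starting the service:
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"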

Joining fails with message:
error: A node required to move the data consistently is down
(/XX.XX.XX.XX). If you wish to move the data from a potentially
inconsistent replica, restart the node with
-Dcassandra.consistent.rangemovement=false
-- StackTrace --
java.lang.RuntimeException: A node required to move the data consistently
is down (/XX.XX.X.XX). If you wish to move the data from a potentially
inconsistent replica, restart the node with
-Dcassandra.consistent.rangemovement=false
at
org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:275)
at
org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:158)
at
org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:83)
...
I'm pretty sure that node XX.XX.XX.XX is up and reachable over the network.
There are also no log entries indicating that node XX.XX.XX.XX went down.
Different attempts show different nodes as "down", but all of them are in
remote data centers.

Log from XX.XX.XX.XX node:
DEBUG [GossipStage:1] 2016-12-09 16:37:41,557 StorageService.java:1893 -
Node /YY.YY.YY.YY state bootstrapping, token [-1853578429238772836,
-155879016060746037, 8986645362194256101, 7352444016819915166,
-7093258492559265403, ..., -5513577076972264694, 411791028346144716,
-7405822878444068495]
DEBUG [PendingRangeCalculator:1] 2016-12-09 16:37:47,225
PendingRangeCalculatorService.java:66 - finished calculation for 8
keyspaces in 5667ms

C* v3.0.9

Any clues how to troubleshoot it?