Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Prasenjit Sarkar
Jeff, I don't think you can push the topic of usability back to developers by asking them to open JIRAs. It is up to the technical leaders of the Cassandra community to take the initiative in this regard. We can argue back and forth about the dynamics of open source projects, but the usability

Re: LEAK DETECTED while minor compaction

2018-02-20 Thread Jeff Jirsa
Your bloom filter settings look broken. Did you set the FP ratio to 0? If so that’s a bad idea and we should have stopped you from doing it. -- Jeff Jirsa > On Feb 20, 2018, at 11:01 PM, Дарья Меленцова wrote: > > Hello. > > Could you help me with LEAK DETECTED error
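For reference, a minimal sketch of inspecting and restoring a sane bloom filter setting; the keyspace/table names and the 0.01 value are only illustrative:

    # Check the table's current setting (keyspace/table names are placeholders)
    cqlsh -e "DESCRIBE TABLE myks.events;" | grep bloom_filter_fp_chance
    # Restore a non-zero false-positive chance and rewrite SSTables so fresh bloom filters are built
    cqlsh -e "ALTER TABLE myks.events WITH bloom_filter_fp_chance = 0.01;"
    nodetool upgradesstables -a myks events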

LEAK DETECTED while minor compaction

2018-02-20 Thread Дарья Меленцова
Hello. Could you help me with a LEAK DETECTED error during the minor compaction process? There is a table with a lot of small records, 6.6*10^9 (mapping (eventId, boxId) -> cellId). Minor compaction starts and then fails at 99% done with an error: Stacktrace ERROR [Reference-Reaper:1] 2018-02-05

RE: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Kenneth Brotman
If you watch this video through you'll see why usability is so important. You can't ignore usability issues. Cassandra does not exist in a vacuum. The competitors are world class. The video is on the New Cassandra API for Azure Cosmos DB: https://www.youtube.com/watch?v=1Sf4McGN1AQ

Re: Best approach to replace existing 8 smaller nodes in production cluster with 8 new nodes that are bigger in capacity, without downtime

2018-02-20 Thread Jeff Jirsa
You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1 and run repair, then rf=2 and run repair, then rf=3 and run repair, then you either change the app to use local quorum in the new dc, or reverse the process by decreasing the rf in the original dc by 1 at a time -- Jeff
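A rough sketch of that sequence, with made-up keyspace and DC names (each ALTER is followed by a repair on the nodes in the new DC before the next bump; omitting the new DC from the map is equivalent to rf=0):

    # New DC starts with no replicas for the keyspace, so adding its nodes streams nothing
    cqlsh -e "ALTER KEYSPACE myks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 0};"
    # Raise the new DC one step at a time and repair after each step (repeat for 2 and 3)
    cqlsh -e "ALTER KEYSPACE myks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 1};"
    nodetool repair -pr myks    # run on every node in dc_new after each RF bump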

Re: Best approach to replace existing 8 smaller nodes in production cluster with 8 new nodes that are bigger in capacity, without downtime

2018-02-20 Thread Kyrylo Lebediev
I'd say, "add new DC, then remove old DC" approach is more risky especially if they use QUORUM CL (in this case they will need to change CL to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs). Also, if there is a chance to get rid of streaming, it worth doing as usually

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread kurt greaves
Probably a lot of work but it would be incredibly useful for vnodes if flushing was range aware (to be used with RangeAwareCompactionStrategy). The writers are already range aware for JBOD, but that's not terribly valuable ATM. On 20 February 2018 at 21:57, Jeff Jirsa wrote: >

Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread kurt greaves
> Outside of rack awareness, would the next primary ranges take the replica ranges? Yes.

Re: Performance Of IN Queries On Wide Rows

2018-02-20 Thread Eric Stevens
Someone can correct me if I'm wrong, but I believe if you do a large IN() on a single partition's clustering keys, all the reads are going to be served from a single replica. Compared to many concurrent individual equality statements, you can get the performance gain of leaning on several replicas for
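To make that concrete, a hypothetical table where both query shapes apply (names are invented; in a real application the driver would issue the separate statements asynchronously rather than via backgrounded cqlsh calls):

    # One statement: the whole IN list is typically read from a single replica of the partition
    cqlsh -e "SELECT * FROM myks.events WHERE pkey = 'p1' AND ckey IN ('a', 'b', 'c');"
    # Many concurrent equality statements: each read can land on a different replica/coordinator
    for k in a b c; do
      cqlsh -e "SELECT * FROM myks.events WHERE pkey = 'p1' AND ckey = '$k';" &
    done
    wait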

Re: Best approach to replace existing 8 smaller nodes in production cluster with 8 new nodes that are bigger in capacity, without downtime

2018-02-20 Thread Nitan Kainth
You can also create a new DC and then terminate the old one. Sent from my iPhone > On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev wrote: > > Hi, > Consider using this approach, replacing nodes one by one: >

Installing the common service to start Cassandra

2018-02-20 Thread Jeff Hechter
Hi, I have Cassandra running on my machine (Windows). I have downloaded commons-daemon-1.1.0-bin-windows.zip and extracted it to cassandra\bin\daemon. I successfully created the service using cassandra.bat -install. When I go to start the service I get the error below. When I start from the command

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread Jeff Jirsa
There are some arguments to be made that the flush should consider the compaction strategy - it would allow a big flush to respect LCS file sizes, or to break into smaller pieces to try to minimize range overlaps going from L0 into L1, for example. I have no idea how much work would be involved, but may

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread Jon Haddad
The file format is independent from compaction. A compaction strategy only selects sstables to be compacted, that’s its only job. It could have side effects, like generating other files, but any decent compaction strategy will account for the fact that those other files don’t exist. I

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jürgen Albersdorfer
We archive data in order to analyze it in the future. So, yes, we expect to grow continuously. In the meantime I learned to go for predictable growth per partition rather than unpredictably large partitions. So today we are adding 250,000,000 records per day going into a single

Re: Best approach to replace existing 8 smaller nodes in production cluster with 8 new nodes that are bigger in capacity, without downtime

2018-02-20 Thread Kyrylo Lebediev
Hi, Consider using this approach, replacing nodes one by one: https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/ Regards, Kyrill From: Leena Ghatpande Sent: Tuesday, February 20, 2018 10:24:24 PM
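For comparison with the in-place approach in the link, the stock replace-a-node mechanism (which streams the old node's data instead of copying it) looks roughly like this; the IP is a placeholder and the exact flag (replace_address vs replace_address_first_boot) depends on the Cassandra version:

    # On the replacement node, before its first start, e.g. appended to conf/cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.1.12"
    # Once the node has finished joining, verify it owns the old node's ranges
    nodetool status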

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jeff Jirsa
At a past job, we set the limit at around 60 hosts per cluster - anything bigger than that got single token. Anything smaller, and we'd just tolerate the inconveniences of vnodes. But that was before the new vnode token allocation went into 3.0, and really assumed things that may not be true for

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jürgen Albersdorfer
Thanks Jeff, your answer is really not what I expected to learn - which again means more manual work as soon as we start really using C*. But I'm happy to be able to learn it now and to still have time to learn the necessary skills and ask the right questions on how to correctly drive big data with

Best approach to replace existing 8 smaller nodes in production cluster with 8 new nodes that are bigger in capacity, without downtime

2018-02-20 Thread Leena Ghatpande
Best approach to replace the existing 8 smaller nodes in a production cluster with 8 new nodes that are bigger in capacity, without downtime. We have 4 nodes each in 2 DCs, and we want to replace these 8 nodes with 8 new nodes that are bigger in capacity in terms of RAM, CPU and disk space without a

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jeff Jirsa
The scenario you describe is the typical point where people move away from vnodes and towards single-token-per-node (or a much smaller number of vnodes). The default setting puts you in a situation where virtually all hosts are adjacent/neighbors to all others (at least until you're way into the

Performance Of IN Queries On Wide Rows

2018-02-20 Thread Gareth Collins
Hello, When querying large wide rows for multiple specific values, is it better to do separate queries for each value... or to do it with one query and an "IN"? I am using Cassandra 2.1.14. I am asking because I changed my app to use 'IN' queries and it **appears** to be slower rather than faster.

Re: Cassandra counter readtimeout error

2018-02-20 Thread Carl Mueller
How "hot" are your partition keys in these counters? I would think, theoretically, if specific partition keys are getting thousands of counter increments/mutations updates, then compaction won't "compact" those together into the final value, and you'll start experiencing the problems people get

Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
Ahhh, the topology strategy does that. But if one were to maintain the same rack topology and was adding nodes just within the racks... Hm, might not be possible with new nodes. Although AWS "racks" are at the availability-zone level IIRC, so that would be doable. Outside of rack awareness, would the

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Russell Bateman
I ask Cassandra to be a database that is high-performance and highly scalable with no single point of failure. Anything "cool" that's added beyond that must be added only as a separate, optional ring around Cassandra and must not get in the way of my usage. Yes, I would like some help with some of

Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Jon Haddad
That’s why you use NTS + a snitch; it picks replicas based on rack awareness. > On Feb 20, 2018, at 9:33 AM, Carl Mueller > wrote: > > So in theory, one could double a cluster by: > > 1) moving snapshots of each node to a new node. > 2) for each snapshot moved,
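A minimal sketch of that combination, with invented names: NetworkTopologyStrategy in the keyspace definition plus a rack/DC-aware snitch in cassandra.yaml (GossipingPropertyFileSnitch is one common choice):

    # Keyspace placing replicas per DC; the snitch decides which racks those replicas land in
    cqlsh -e "CREATE KEYSPACE myks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
    # Snitch configuration; each node declares its DC/rack in cassandra-rackdc.properties
    grep endpoint_snitch /etc/cassandra/cassandra.yaml
    # endpoint_snitch: GossipingPropertyFileSnitch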

Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
So in theory, one could double a cluster by: 1) moving snapshots of each node to a new node. 2) for each snapshot moved, figure out the primary range of the new node by taking the old node's primary range token and calculating the midpoint value between that and the next primary range start token

vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
As I understand it: Replicas of data are replicated to the next primary range owner. As tokens are randomly generated (at least in 2.1.x, which I am on), can't we have this situation: Say we have RF3, but the tokens happen to line up where: NodeA handles 0-10 NodeB handles 11-20 NodeA handles

Re: Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-20 Thread Carl Mueller
Ok, so vnodes are random assignments under normal circumstances (I'm on 2.1.x; I'm assuming a derivative approach was in the works that would avoid some hot-node aspects of random primary range assignment for new nodes once you had one or two or three in a cluster). So... couldn't I just
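A rough illustration of the yaml settings that idea leans on; the token value is purely illustrative and would have to be computed as described (with vnodes, initial_token takes a comma-separated list), and auto_bootstrap: false only makes sense because the data directory was already populated from the snapshot:

    # cassandra.yaml on the new node (values are illustrative only)
    grep -E '^(num_tokens|initial_token|auto_bootstrap)' /etc/cassandra/cassandra.yaml
    # num_tokens: 1
    # initial_token: -4611686018427387904    # e.g. midpoint of the source node's primary range
    # auto_bootstrap: false                  # data dir already copied in from the snapshot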

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Carl Mueller
I think what is really necessary is providing table-level recipes for storing data. We need a lot of real world examples and the resulting schema, compaction strategies, and tunings that were performed for them. Right now I don't see such crucial cookbook data in the project. AI is a bit

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Nicolas Guyomar
Yes, you are right, it limits how much data a node will send in total to other nodes while streaming data (repair, bootstrap etc.), so that it does not affect this node's performance. Bootstrapping is initiated by the bootstrapping node itself, which determines, based on its tokens, which nodes to ask data from,

Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen
Dear Apache Enthusiast, (You’re receiving this message because you’re subscribed to a user@ or dev@ list of one or more Apache Software Foundation projects.) We’re pleased to announce the upcoming ApacheCon [1] in Montréal, September 24-27. This event is all about you — the Apache project

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jürgen Albersdorfer
Hi Nicolas, I have seen that 'stream_throughput_outbound_megabits_per_sec', but afaik this limits what each node will provide at a maximum. What I'm more concerned about is the vast amount of connections to handle and the concurrent threads, of which at least two get started for every single

Re: Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Nicolas Guyomar
Hi Jurgen, stream_throughput_outbound_megabits_per_sec is the "given total throughput in Mbps", so it does limit the "concurrent throughput" IMHO; is that not what you are looking for? The only limits I can think of are: - the number of connections between every node and the one bootstrapping - the number
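For reference, that throttle can be changed at runtime as well as in cassandra.yaml; a small sketch (200 Mbps is just an example value):

    # Per node, at runtime
    nodetool setstreamthroughput 200   # Mbps; 0 removes the throttle
    nodetool getstreamthroughput
    # Or persistently in cassandra.yaml:
    # stream_throughput_outbound_megabits_per_sec: 200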

Is it possible / does it make sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-20 Thread Jürgen Albersdorfer
Hi, I'm wondering whether it is possible, and whether it would make sense, to limit concurrent streaming when joining a new node to the cluster. I'm currently operating a 15-node C* cluster (v3.11.1) and joining another node every day. 'nodetool netstats' shows it always streams data from all other nodes.

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Kyrylo Lebediev
Agree with you, Daniel, regarding gaps in documentation. --- At the same time I disagree with the folks who are complaining in this thread that some functionality, like 'advanced backup' etc., is missing out of the box. We all live in a time where there are literally tons of open-source tools

Re: Right sizing Cassandra data nodes

2018-02-20 Thread Rahul Singh
Node density is the active data managed in the cluster divided by the number of active nodes. E.g. if you have 500TB of active data under management, then you would need 250-500 nodes to get beast-like optimum performance. It also depends on how much memory is on the boxes and if you are using
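The arithmetic behind that range, under the assumption (not stated explicitly above) of roughly 1-2 TB of active data per node:

    # 500 TB of active data at 1-2 TB per node
    echo "$((500 / 2)) to $((500 / 1)) nodes"   # prints "250 to 500 nodes"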

Re: newbie, to use cassandra when query is arbitrary?

2018-02-20 Thread Rahul Singh
Technically no. Cassandra is a NoSQL database. It is a wide-column store - and so it’s not a set of relations that can be arbitrarily queried. The SSTable structure is built for heavy writes and specific, partition-keyed queries. If you need the ability to run arbitrary queries, you are using the

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Daniel Hölbling-Inzko
Hi, I have to add my own two cents here as the main thing that keeps me from really running Cassandra is the amount of pain running it incurs. Not so much because it's actually painful but because the tools are so different and the documentation and best practices are scattered across a dozen