Re: Cassandra vnodes Streaming Reliability Calculator
Hi Ken.

1) Thanks for the great link. Ironically it was written by Netflix, who continued to use single-token for years after vnodes were released so that they could keep using Priam and their other tools that depend on single token. (I was in the early Cassandra group there.)

2) My tool agrees overall with their findings: a) it does reflect that increasing numbers of vnodes and nodes reduce reliability dramatically, so the results are conceptually the same, and the deltas at different vnode counts match what I see in my calculator. b) But they use a more complicated model. I'm happy with my calculator, which looks at a simple "probability of a streaming connection failing for any reason" and is immediately usable by any DBA or SRE.

3) As an Operations DBA, their reference to "centuries" made me laugh though. Note that my calculations are about failures within one week, which aligns more with my experience. So either they're overly optimistic, or I'm pessimistic. You can verify which by doing a grep of the logs on a production cluster for a month and counting how many connection failures there were. My blog post has some links to actual error messages to grep for.

4) Note that Datastax recommends 8 vnodes now. See my blog for the reference.

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in Bay Area or remote.
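The "probability of a streaming connection failing for any reason" model can be sketched in a few lines. This is a hypothetical reconstruction, not the actual formula behind the linked calculator: it assumes each streaming session fails independently with some per-stream probability, and that the number of sessions grows with the vnode count.

```python
# Hypothetical sketch of a streaming-reliability estimate. The per-stream
# failure probability (0.001) and streams-per-vnode multiplier (3, as if
# RF=3 gave one source per replica) are illustrative assumptions only.

def p_any_stream_fails(p_single: float, num_streams: int) -> float:
    """P(at least one of num_streams independent streams fails)."""
    return 1.0 - (1.0 - p_single) ** num_streams

# More vnodes means more (smaller) streams per bootstrap or rebuild,
# so the chance of at least one failure climbs quickly:
for vnodes in (1, 16, 256):
    streams = vnodes * 3
    print(vnodes, round(p_any_stream_fails(0.001, streams), 4))
```

This mirrors the thread's qualitative point: reliability drops dramatically as vnodes and nodes increase, whatever the exact per-stream failure rate is.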
cass_top: https://github.com/jamesbriggs/cassandra-top

From: Kenneth Brotman
To: user@cassandra.apache.org
Sent: Saturday, February 16, 2019 5:00 AM
Subject: RE: Cassandra vnodes Streaming Reliability Calculator

Hi James, Thanks for doing that. Very interesting. I haven't had a chance to check the math. Did you look at this white paper by Lynch and Snyder called Cassandra Availability with Virtual Nodes: https://github.com/jolynch/python_performance_toolkit/blob/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf Are the calculations consistent with your online calculator? Thanks again, Kenneth Brotman

From: James Briggs [mailto:james.bri...@yahoo.com.INVALID]
Sent: Friday, February 15, 2019 7:42 PM
To: user@cassandra.apache.org
Subject: Cassandra vnodes Streaming Reliability Calculator

Hi folks. Please check out my online vnodes reliability calculator and reply with any feedback: http://www.jebriggs.com/blog/2019/02/cassandra-vnodes-reliability-calculator/ Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in Bay Area or remote.
cass_top: https://github.com/jamesbriggs/cassandra-top
Cassandra vnodes Streaming Reliability Calculator
Hi folks. Please check out my online vnodes reliability calculator and reply with any feedback: http://www.jebriggs.com/blog/2019/02/cassandra-vnodes-reliability-calculator/ Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in Bay Area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top
Re: Long GC Pauses
General best practices with Java 8: If you have enough RAM for a 24 GB heap, use G1 GC. If you have less RAM, then use CMS with a medium-sized heap setting, so the GC pauses are shorter but more frequent. Graph memory use with Grafana or something and let people know what's happening.

https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuneJVM.html
http://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Your cluster: Which version of Java? How much RAM do your systems have? Is it the same on all nodes? What are your current heap settings? Anything else?

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top

From: Rajasekhar Kommineni
To: user@cassandra.apache.org
Sent: Monday, November 19, 2018 2:33 PM
Subject: Long GC Pauses

Hi All, My C* cluster configuration: 1) 2 DCs with 4 nodes each and a Replication Factor of 3 per DC. 2) Writes (bulk data load) are done to the 2nd DC and application reads are done from the 1st DC. 3) CMS GC issue: observing long GC pauses during data load, and timeouts from application reads during the same time. Questions: 1) Why am I seeing GC pauses on the 1st DC, even though I am using a stream_throughput of 16 Mb/s? 2) Is there any way to reduce the GC pause times other than changing it? Thanks,
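For reference, the heap and collector choices above live in the JVM options (cassandra-env.sh in 2.x/3.0, jvm.options in newer releases). A hedged sketch of the two setups described, with illustrative values rather than recommendations for any specific workload:

```
# G1 with a large heap (only worthwhile if you have the RAM for it):
-Xms24G
-Xmx24G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500

# Or CMS with a medium-sized heap (shorter but more frequent pauses):
# -Xms8G
# -Xmx8G
# -XX:+UseConcMarkSweepGC
# -XX:+UseParNewGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
```

Setting -Xms equal to -Xmx avoids heap-resize pauses; the pause target and occupancy fraction are common starting points, not tuned values.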
Re: Jepsen testing
For those relatively new to Cassandra, Riptano is the previous company name for Datastax, back in 2011. :)

http://www.h-online.com/open/news/item/Cassandra-service-company-Riptano-changes-name-to-DataStax.html

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top

From: Oleksandr Shulgin
To: User
Cc: d...@cassandra.apache.org
Sent: Friday, November 9, 2018 12:33 AM
Subject: Re: Jepsen testing

On Thu, Nov 8, 2018 at 10:42 PM Yuji Ito wrote:

We are working on Jepsen testing for Cassandra. https://github.com/scalar-labs/jepsen/tree/cassandra/cassandra As you may know, Jepsen is a framework for distributed systems verification. It can inject network failure and so on and check data consistency. https://github.com/jepsen-io/jepsen Our tests are based on riptano's great work. https://github.com/riptano/jepsen/tree/cassandra/cassandra I refined it for the latest Jepsen and removed some tests. Next, I'll fix clock-drift tests. I would like to get your feedback.

Cool stuff! Do you have jepsen tests as part of regular testing in scalardb? How long does it take to run all of them on average? I wonder if Apache Cassandra would be willing to include this as part of regular testing drill as well. Cheers, -- Alex
Re: JBOD disk failure - just say no
Cassandra JBOD has a bunch of issues, so I don't recommend it for production:

1) Disks fill up with load (data) unevenly, meaning you can run out of space on one disk while others are half-full.
2) One bad disk can take out the whole node.
3) Instead of a small failure probability on an LVM/RAID volume, with JBOD you end up with a near-100% chance of failure after 3 years or so.
4) Generally you will not have enough warning of a looming failure with JBOD compared to LVM/RAID. (Some companies take a week or two to replace a failed disk.)

JBOD is easy to set up, but hard to manage.

Thanks, James.

From: kurt greaves
To: User
Sent: Friday, August 17, 2018 5:42 AM
Subject: Re: JBOD disk failure

As far as I'm aware, yes. I recall hearing someone mention tying system tables to a particular disk but at the moment that doesn't exist.

On Fri., 17 Aug. 2018, 01:04 Eric Evans, wrote:

On Wed, Aug 15, 2018 at 3:23 AM kurt greaves wrote:
> Yep. It might require a full node replace depending on what data is lost from the system tables. In some cases you might be able to recover from partially lost system info, but it's not a sure thing.

Ugh, does it really just boil down to what part of `system` happens to be on the disk in question? In my mind, that makes the only sane operational procedure for a failed disk to be: "replace the entire node". IOW, I don't think we can realistically claim you can survive a failed JBOD device if it relies on happenstance.

> On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, wrote:
>> Thank you for the answers. We are using the current version 3.11.3, so this one includes CASSANDRA-6696.
>> So if I get this right, losing system tables will need a full node rebuild. Otherwise repair will get the node consistent again.
> [ ... ]

-- Eric Evans john.eric.ev...@gmail.com
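The failure math behind point 3 compounds quickly, because a JBOD node is down if any single disk is down. The disk count and annual failure rate below are illustrative assumptions, not numbers from this thread:

```python
# Illustrative only: assumes independent disk failures, 8 data disks per
# node, and a 5% annual failure rate per disk (both numbers are assumptions).

def p_any_fails(p_single: float, n: int) -> float:
    """P(at least one of n independent components fails)."""
    return 1.0 - (1.0 - p_single) ** n

p_node_1yr = p_any_fails(0.05, 8)        # node in one year, ~0.34
p_disk_3yr = p_any_fails(0.05, 3)        # one disk over 3 years, ~0.14
p_node_3yr = p_any_fails(p_disk_3yr, 8)  # node over 3 years, ~0.71
print(round(p_node_1yr, 2), round(p_node_3yr, 2))
```

With an LVM/RAID volume the node only dies on a second failure inside the rebuild window, which is why the JBOD curve looks so much worse over a few years.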
Re: Cassandra Needs to Grow Up by Version Five!
Kenneth: What you said is not wrong. Vertica and Riak are examples of distributed databases that don't require hand-holding. Cassandra is for Java-programmer DIYers, or more often Datastax clients, at this point. Thanks, James.

From: Kenneth Brotman
To: user@cassandra.apache.org
Cc: d...@cassandra.apache.org
Sent: Monday, February 19, 2018 4:56 PM
Subject: RE: Cassandra Needs to Grow Up by Version Five!

Jeff, you helped me figure out what I was missing. It just took me a day to digest what you wrote. I'm coming over from another type of engineering. I didn't know and it's not really documented. Cassandra runs in a data center.
Nowadays that means the nodes are going to be in managed containers, Docker containers, managed by Kubernetes, Mesos or something, and for that reason anyone operating Cassandra in a real-world setting would not encounter the issues I raised in the way I described. Shouldn't the architectural diagrams people reference indicate that in some way? That would have helped me. Kenneth Brotman

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com]
Sent: Monday, February 19, 2018 10:43 AM
To: 'user@cassandra.apache.org'
Cc: 'd...@cassandra.apache.org'
Subject: RE: Cassandra Needs to Grow Up by Version Five!

Well said. Very fair. I wouldn't mind hearing from others still. You're a good guy! Kenneth Brotman

From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Monday, February 19, 2018 9:10 AM
To: cassandra
Cc: Cassandra DEV
Subject: Re: Cassandra Needs to Grow Up by Version Five!

There's a lot of things below I disagree with, but it's ok. I convinced myself not to nit-pick every point. https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of Stefan's work with cert management. Beyond that, I encourage you to do what Michael suggested: open JIRAs for things you care strongly about, and work on them if you have time. Sometime this year we'll schedule an NGCC (Next Generation Cassandra Conference) where we talk about future project work and direction. I encourage you to attend if you're able (I encourage anyone who cares about the direction of Cassandra to attend; it'll probably be either free or very low cost, just to cover a venue and some food). If nothing else, you'll meet some of the teams who are working on the project, and learn why they've selected the projects on which they're working. You'll have an opportunity to pitch your vision, and maybe you can talk some folks into helping out.
- Jeff

On Mon, Feb 19, 2018 at 1:01 AM, Kenneth Brotman wrote:

Comments inline

>-----Original Message-----
>From: Jeff Jirsa [mailto:jji...@gmail.com]
>Sent: Sunday, February 18, 2018 10:58 PM
>To: user@cassandra.apache.org
>Cc: d...@cassandra.apache.org
>Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
>Comments inline
>
>> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman wrote:
>>
>> Cassandra feels like an unfinished program to me. The problem is not that it's open source or cutting edge. It's an open source cutting edge program that lacks some of its basic functionality. We are all stuck addressing fundamental mechanical tasks for Cassandra because the basic code that would do that part has not been contributed yet.
>
>There's probably 2-3 reasons why here:
>
>1) Historically the pmc has tried to keep the scope of the project very narrow. It's a database. We don't ship drivers. We don't ship developer tools. We don't ship fancy UIs. We ship a database. I think for the most part the narrow vision has been for the best, but maybe it's time to reconsider some of the scope.
>
>Postgres will autovacuum to prevent wraparound (hopefully), but everyone I know running
Re: Reg :- Multiple Node Cluster set up in Virtual Box
Nandan: The original Datastax training classes (when it was still called Riptano) used 3 VirtualBox Debian instances to set up a Cassandra cluster. Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top

From: kurt greaves <k...@instaclustr.com>
To: User <user@cassandra.apache.org>
Sent: Monday, November 6, 2017 3:08 PM
Subject: Re: Reg :- Multiple Node Cluster set up in Virtual Box

Worth keeping in mind that in 3.6 onwards nodes will not start unless they can contact a seed. Not quite SPOF but still problematic. CASSANDRA-13851
Re: cassandra.yaml configuration for large machines (scale up vs. scale out)
> I know that Cassandra is built for scale out on commodity hardware

The term "commodity hardware" is not very useful, though the average server-class machine bought in 2017 can work. Netflix found that SSD helped greatly with compactions in production. Generally servers use 10 GB networking in 2017. 128 GB is commonly used, but I would use 256+ GB in new servers. I don't recommend the Cassandra JBOD configuration, since losing one drive means rebuilding the node immediately, which many organizations aren't responsive enough to do.

Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top

From: "Steinmaurer, Thomas"
To: "user@cassandra.apache.org"
Sent: Friday, November 3, 2017 6:34 AM
Subject: cassandra.yaml configuration for large machines (scale up vs. scale out)

Hello, I know that Cassandra is built for scale out on commodity hardware, but I wonder if anyone can share some experience when running Cassandra on rather capable machines. Let's say we have a 3 node cluster with 128G RAM, 32 physical cores (16 per CPU socket), large RAID with spinning disks (so somewhere beyond 2000 IOPS). What are some recommended cassandra.yaml configuration / JVM settings? E.g. we have been using something like this as a first baseline:

- 31G heap, G1, -XX:MaxGCPauseMillis=2000
- concurrent_compactors: 8
- compaction_throughput_mb_per_sec: 128
- key_cache_size_in_mb: 2048
- concurrent_reads: 256
- concurrent_writes: 256
- native_transport_max_threads: 256

Anything else we should add to our first baseline of settings? E.g. although we have a key cache of 2G, nodetool info gives me only 0.451 as hit rate:

Key Cache : entries 2919619, size 1.99 GB, capacity 2 GB, 71493172 hits, 158411217 requests, 0.451 recent hit rate, 14400 save period in seconds

Thanks, Thomas

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential.
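The 0.451 "recent hit rate" in that nodetool info line is simply hits divided by requests, which is easy to verify from the numbers shown:

```python
# Recompute the key cache hit rate from the `nodetool info` line above:
# hit rate = hits / requests.
hits = 71_493_172
requests = 158_411_217
hit_rate = hits / requests
print(round(hit_rate, 3))  # 0.451, matching the reported "recent hit rate"
```

So the cache itself is behaving as reported; whether 45% is acceptable depends on the read pattern, since a workload scanning mostly cold partitions will never get a high key cache hit rate no matter how large the cache is.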
Re: Indexes Fragmentation
MySQL Cluster (don't use FKs yet) or Redis (in-memory databases) sound more appropriate for data that churns a lot. Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. cass_top: https://github.com/jamesbriggs/cassandra-top

From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org
Sent: Monday, September 29, 2014 5:01 PM
Subject: Re: Indexes Fragmentation

On Sun, Sep 28, 2014 at 9:49 AM, Arthur Zubarev arthur.zuba...@aol.com wrote:

There are 200+ times more updates and 50x inserts than analytical loads. In Cassandra, to just be able to query (in CQL) on a column I have to have an index. The question is what toll the fragmentation coming from the frequent updates and inserts has on a CF. Do I also need to manually defrag?

You appear to have just asked whether maintaining indexes which have a high rate of change in a log-structured database with immutable data files is likely to be more performant than maintaining them in a database with modify-in-place semantics. No. =Rob
Re: Help with approach to remove RDBMS schema from code to move to C*?
Most of the C* success stories are for greenfield applications. Migrating from one database to another database is a lot of work. C* offers no magical path. If you only have a few tables and minor RDBMS feature dependencies, it can be done. Make sure your users and QA people are cooperative first though. Most companies don't have a budget to re-QA applications a second time. Maybe introduce C* to your organization on a new, small project first? Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Les Hartzman lhartz...@gmail.com To: user@cassandra.apache.org Sent: Friday, September 19, 2014 2:46 PM Subject: Help with approach to remove RDBMS schema from code to move to C*? My company is using an RDBMS for storing time-series data. This application was developed before Cassandra and NoSQL. I'd like to move to C*, but ... The application supports data coming from multiple models of devices. Because there is enough variability in the data, the main table to hold the device data only has some core columns defined. The other columns are non-specific; a set of columns for numeric and a set for character. So for these non-specific columns, their use is defined in the code. The use of column 'numeric_1' might hold a millisecond time for one device and a fault code for another device. This appears to have been done to keep from modifying the schema whenever a new device was introduced. And they rolled their own db interface to support this mess. Now, we could just use C* like an RDBMS - defining CFs to mimic the tables. But this just pushes a bad design from one platform to another. Clearly there needs to be a code re-write. But what suggestions does anyone have on how to make this shift to C*? Would you just layout all of the columns represented by the different devices, naming them as they are used, and having jagged rows? Or is there some other way to approach this? 
Of course, the data miners already have scripts/methods for accessing the data from the RDBMS now in the user-unfriendly form it's in now. This would have to be addressed as well, but until I know how to store it, mining it gets ahead of things. Thanks. Les
Re: Blocking while a node finishes joining the cluster after restart.
Kevin:

> The serial approach would take a LONG time for large clusters. If you have sixty nodes, it could take an hour to do a rolling restart.

1) In Cassandra land, an hour is nothing. There's people doing repairs that practically never finish - as soon as one finishes after a week, they have to start the next one.

2) I met some people at the conference who were embarrassed to operate only 12 nodes. I'm not sure why, since managing 12 is a lot easier and cheaper than 60. In fact, I would be proud to operate a large site on 8 or 12 nodes. :)

3) After I finish my cass_top project this week, I'll take a look at scripting what you mentioned in this thread.

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote.

From: Kevin Burton bur...@spinn3r.com
To: user@cassandra.apache.org; James Briggs james.bri...@yahoo.com
Sent: Friday, September 19, 2014 11:30 AM
Subject: Re: Blocking while a node finishes joining the cluster after restart.

This is great feedback… I think it could actually be even easier than this… You could have an ansible (or whatever cluster management system you're using) role for just seeds. Then you would serially restart all seeds one at a time. You would need to run 'nodetool status' and make sure the node is 'U' (up) I think… but you might want to make sure the majority of other nodes have agreed that this node is up and available. I think you can ONLY do this serially… for a LARGE number of hosts, this might take a while unless you can compute nodes which have mutually exclusive key ranges. The serial approach would take a LONG time for large clusters. If you have sixty nodes, it could take an hour to do a rolling restart.

Kevin

On Tue, Sep 16, 2014 at 12:21 PM, James Briggs james.bri...@yahoo.com wrote:

FYI: OpsCenter has a default of sleep 60 seconds after each node restart, and an option of drain before stopping. I haven't noticed if they do anything special with seeds.
(At least one seed needs to be running before you restart other nodes.) I wondered the same thing as Kevin and came to these conclusions. Fixing the startup script is non-trivial as far as startup scripts go. To start, it would have to:

- parse cassandra.yaml for seeds
- if it is not itself a seed, wait for a seed to start first (could take minutes, or never)
- continue the start.

For a no-downtime cluster restart script, it would have to:

- verify cluster health (i.e. quorum/CL is met, or you lose writes)
- parse cassandra.yaml for seeds and see if a seed is up
- stop gossip and thrift
- maybe do compaction before drain
- drain the node
- stop/start or restart the cassandra process.

http://comments.gmane.org/gmane.comp.db.cassandra.user/20144

Both of those scripts would be nice to have. :) OpsCenter is flaky at doing rolling restarts in my test cluster, so an alternative is needed. Also, the free OpsCenter doesn't have the rolling repair option enabled. ccm has the options to do drain, stop and start, but a bash script would be needed to make it rolling. https://github.com/pcmanus/ccm

Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote.

From: Duncan Sands duncan.sa...@gmail.com
To: user@cassandra.apache.org
Sent: Tuesday, September 16, 2014 11:09 AM
Subject: Re: Blocking while a node finishes joining the cluster after restart.

Hi Kevin, if you are using the latest version of opscenter, then even the community (= free) edition can do a rolling restart of your cluster. It's pretty convenient. Ciao, Duncan.

On 16/09/14 19:44, Kevin Burton wrote:

Say I want to do a rolling restart of Cassandra… I can't just restart all of them because they need some time to gossip and for that gossip to get to all nodes. What is the best strategy for this. It would be something like:

/etc/init.d/cassandra restart
wait-for-cassandra.sh

… or something along those lines.
-- Founder/CEO Spinn3r.com Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
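The ordering the thread converges on (restart seeds first, one node at a time, and wait for each to rejoin before moving on) can be sketched as below. This is a hypothetical outline, not how OpsCenter does it: the restart and is_up callbacks stand in for the ssh/init-script and nodetool status calls a real script would make.

```python
# Sketch of a seeds-first serial rolling restart. The callbacks are
# placeholders; a real script would shell out to nodetool / ssh here.

def rolling_restart_order(nodes, seeds):
    """Seeds first (serially), then non-seeds, preserving input order."""
    seed_set = set(seeds)
    return [n for n in nodes if n in seed_set] + [n for n in nodes if n not in seed_set]

def restart_cluster(nodes, seeds, restart, is_up, wait=lambda: None):
    for node in rolling_restart_order(nodes, seeds):
        restart(node)
        while not is_up(node):  # block until the node reports Up in the ring
            wait()

order = rolling_restart_order(["n1", "n2", "n3", "n4"], seeds=["n3"])
print(order)  # ['n3', 'n1', 'n2', 'n4']
```

A production version would also need the health checks from James's list (verify quorum before each stop, drain the node first), which this sketch deliberately leaves out.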
Re: what's cool about cassandra 2.1.0?
I'll be blunt. The reason to use the latest 2.0 or soon 2.1 is because Apple has committed 20 patches that make Cassandra operationally useful. Apple is the QA lab for Cassandra. Their conference talk was very exciting. I hope a video of that gets posted in October. Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: DuyHai Doan doanduy...@gmail.com To: user@cassandra.apache.org Sent: Friday, September 19, 2014 7:07 AM Subject: Re: what's cool about cassandra 2.1.0? Hello Tim From this blog (http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1) you should find the pointers to other big topics of 2.1 On Fri, Sep 19, 2014 at 3:33 PM, Tim Dunphy bluethu...@gmail.com wrote: Hey all, I tried googling around to get an idea about what was new (and potentially cool) in the newest release of cassandra - 2.1.0. But all that I've been able to find so far is this kind of general statement about the new features. https://www.mail-archive.com/user@cassandra.apache.org/msg38448.html It doesn't seem to have a lot of detail! Particularly I'm curious about how CQL has been enhanced beyond just an incomplete list of new data types. I'd like to know what the performance improvements are, How the row cache has been improved. Etc. You get the idea! So where can I find a more complete description of how this update is of benefit? Thanks! Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
Re: Blocking while a node finishes joining the cluster after restart.
FYI: OpsCenter has a default of sleep 60 seconds after each node restart, and an option of drain before stopping. I haven't noticed if they do anything special with seeds. (At least one seed needs to be running before you restart other nodes.) I wondered the same thing as Kevin and came to these conclusions. Fixing the startup script is non-trivial as far as startup scripts go. To start, it would have to:

- parse cassandra.yaml for seeds
- if it is not itself a seed, wait for a seed to start first (could take minutes, or never)
- continue the start.

For a no-downtime cluster restart script, it would have to:

- verify cluster health (i.e. quorum/CL is met, or you lose writes)
- parse cassandra.yaml for seeds and see if a seed is up
- stop gossip and thrift
- maybe do compaction before drain
- drain the node
- stop/start or restart the cassandra process.

http://comments.gmane.org/gmane.comp.db.cassandra.user/20144

Both of those scripts would be nice to have. :) OpsCenter is flaky at doing rolling restarts in my test cluster, so an alternative is needed. Also, the free OpsCenter doesn't have the rolling repair option enabled. ccm has the options to do drain, stop and start, but a bash script would be needed to make it rolling. https://github.com/pcmanus/ccm

Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote.

From: Duncan Sands duncan.sa...@gmail.com
To: user@cassandra.apache.org
Sent: Tuesday, September 16, 2014 11:09 AM
Subject: Re: Blocking while a node finishes joining the cluster after restart.

Hi Kevin, if you are using the latest version of opscenter, then even the community (= free) edition can do a rolling restart of your cluster. It's pretty convenient. Ciao, Duncan.

On 16/09/14 19:44, Kevin Burton wrote:

Say I want to do a rolling restart of Cassandra… I can't just restart all of them because they need some time to gossip and for that gossip to get to all nodes. What is the best strategy for this.
It would be something like:

/etc/init.d/cassandra restart
wait-for-cassandra.sh

… or something along those lines.

-- Founder/CEO Spinn3r.com Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
Re: Blocking while a node finishes joining the cluster after restart.
Hi Robert. I just did a test (shut down all nodes, start one non-seed node). You're correct that an old non-seed node can start by itself. So startup scripts don't have to be intelligent, but apps need to wait until there are enough nodes up to serve the whole keyspace:

cqlsh:my_keyspace> consistency
Current consistency level is ONE.
cqlsh:my_keyspace> select * from numbers where v=1;

 v
---
 1

(1 rows)

cqlsh:my_keyspace> select * from numbers where v=2;
Unable to complete request: one or more nodes were unavailable.

Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote.
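Why one query succeeded and the other failed: at CL ONE the coordinator needs at least one live replica that owns that particular partition, and with a single node up, partitions whose replicas all sit on down nodes are unavailable. As a rough reference, here is the number of live replicas each common consistency level needs (a simplified sketch; LOCAL_*/EACH_* levels and hinted handoff add nuance):

```python
# Simplified sketch of live replicas required per consistency level.
# Only the basic levels are covered; datacenter-local levels count
# replicas in the local DC only.

def replicas_needed(level: str, rf: int) -> int:
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1  # strict majority of the replica set
    if level == "ALL":
        return rf
    raise ValueError(f"unhandled level: {level}")

print(replicas_needed("QUORUM", 3))  # 2
```

So with RF=3 and only one node up, even CL ONE reads work only for the subset of partitions replicated to that node, which is exactly what the cqlsh session above shows.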
Re: backport of CASSANDRA-6916
Paulo: Out of curiosity, why not just upgrade to 2.1 if you want the new features? You know you want to! :) Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Robert Coli rc...@eventbrite.com To: user@cassandra.apache.org user@cassandra.apache.org Sent: Tuesday, September 16, 2014 4:13 PM Subject: Re: backport of CASSANDRA-6916 On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Has anyone backported incremental replacement of compacted SSTables (CASSANDRA-6916) to 2.0? Is it doable or there are many dependencies introduced in 2.1? Haven't checked the ticket detail yet, but just in case anyone has interesting info to share. Are you looking to patch for public consumption, or for your own purposes? I just took the temperature of #cassandra-dev and they were cold on the idea as a public patch, because of potential impact on stability. =Rob
Announce: top for Cassandra - cass_top
I wrote cass_top, a poor man's version of OpsCenter, in bash (no dependencies).

http://www.jebriggs.com/blog/2014/09/top-utility-for-cassandra-clusters-cass_top/

Actually, if it had node or cluster restart, it would do most of what the OpsCenter free version does. :)

The features of cass_top are:

- colorizes nodetool status output: UN nodes green, DN nodes red, other statuses blue
- no extra firewall holes needed (agent-less and server-less), unlike OpsCenter
- fast initial startup time (under 2 seconds), unlike OpsCenter
- uses bash, so no programming environment needed
- run it anywhere nodetool works
- uses minimal screen real estate, so several rings can fit on one monitor
- free (Apache 2).

Please send me your comments and suggestions. The top-like infinite loop is actually a read loop, so adding a few more features like cfstats or flush would be easy.

Enjoy, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote.
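For anyone curious what the colorizing amounts to, here is a minimal sketch in Python (cass_top itself is bash and wraps nodetool; the parsing below is deliberately simplified and the two-letter status codes follow nodetool's Up/Down + state convention):

```python
# Sketch: colorize `nodetool status` lines with ANSI escapes.
# "UN" = Up/Normal -> green, "DN" = Down/Normal -> red,
# other U*/D* states (joining, leaving, moving) -> blue.

GREEN, RED, BLUE, RESET = "\033[32m", "\033[31m", "\033[34m", "\033[0m"

def colorize(line: str) -> str:
    fields = line.split()
    status = fields[0] if fields else ""
    if status == "UN":
        return f"{GREEN}{line}{RESET}"
    if status == "DN":
        return f"{RED}{line}{RESET}"
    if len(status) == 2 and status[0] in "UD":
        return f"{BLUE}{line}{RESET}"
    return line  # headers and blank lines pass through unchanged

print(colorize("UN  10.0.0.1  1.2 GB  256  33.0%  some-host-id  rack1"))
```

Wrap that in a loop that reruns nodetool every few seconds and you have the core of a top-style view.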
Re: no change observed in read latency after switching from EBS to SSD storage
To expand on what Robert said, Cassandra is a log-structured database:

- writes are append operations, so both correctly configured disk volumes and SSDs are fast at that
- reads could be helped by SSD if they're not in cache (i.e. on disk)
- but compaction is definitely helped by SSD with large data loads (compaction is the trade-off for fast writes)

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. Mailbox dimensions: 10x12x14

From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org
Sent: Tuesday, September 16, 2014 5:42 PM
Subject: Re: no change observed in read latency after switching from EBS to SSD storage

On Tue, Sep 16, 2014 at 5:35 PM, Mohammed Guller moham...@glassbeam.com wrote:

Does anyone have insight as to why we don't see any performance impact on the reads going from EBS to SSD?

What does it say when you enable tracing on this CQL query? 10 seconds is a really long time to access anything in Cassandra. There is, generally speaking, a reason why the default timeouts are lower than this. My conjecture is that the data in question was previously being served from the page cache and is now being served from SSD. You have, in switching from EBS-plus-page-cache to SSD, successfully proved that SSD and RAM are both very fast. There is also a strong suggestion that whatever access pattern you are using is not bounded by disk performance. =Rob
Re: C 2.1
Hi Ram. 1) As an Operations DBA, I consider all versions of Cassandra to be alpha. So whether you pick 2.0.10 or 2.1.0 doesn't really matter, since you will have to do your own acceptance testing anyway. 2) Data modelling is everything when it comes to a distributed database like Cassandra. You can read my blog post, which is a quick way to get up to speed with CQL: Notes on “Getting Started with Time Series Data Modeling” in Cassandra http://jbriggs.com/blog/2014/09/notes-on-getting-started-with-time-series-data-modeling-in-cassandra/ Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Ram N yrami...@gmail.com To: user@cassandra.apache.org Sent: Saturday, September 13, 2014 3:49 PM Subject: C 2.1 Team, I am pretty new to Cassandra (just 2 weeks of playing around with it on and off) and am planning a fresh deployment with the 2.1 release. The data model is pretty simple for my use case. The questions I have in mind are: Is 2.1 a production-ready release? Driver selection? I played around with Hector, Astyanax and the Java driver. I don't see much activity happening on Hector. For Astyanax, I love the fluent style of writing code and the abstractions, recipes, pooling, etc. With the Datastax Java driver, I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does a CQL index create a separate CF for the index table? How is it different from maintaining an inverted index? Are both internally the same? Does the CQL statement to create an index create a separate CF, with an atomic way of updating/managing it? Which one scales better? (Something like stargate-core or what usergrid does, or the CQL approach?) On a separate note, just curious: if I have 1000s of columns in a given row and a fixed set of indexed columns (say 30-50 columns), which approach should I be taking? Will Cassandra scale with that many indexed columns? Are there any limits?
How much of an impact do CQL indexes have on the system? I am also not sure whether these use cases are the right fit for Cassandra, but I would really appreciate any responses. Thanks. -R
Re: C 2.1
Ram, The reason secondary indexes are not recommended is that, since they can't use the partition key, the values have to be fetched from all nodes. So you get higher latency, and likely timeouts. The C* solutions are: a) use a denormalized (materialized) table, or b) use clustering columns if all the data related to the row key is in the same partition (read my blog link from this thread for more). That's the price of using distributed systems. Oh, and then there's the need to rewrite the data access layer of your entire existing app. :) AOL and Blizzard talked about porting a couple of apps to Cassandra at the conference last week, but they sounded like trivial user-db (UDB) apps, and even then Patrick was usually credited with the data modelling. I haven't heard of anybody porting a 100+ table Oracle or MySQL app to C* yet. I'm sure it's been done, but most apps written for C* are greenfield or v2.0 rewrites. Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Ram N yrami...@gmail.com To: user user@cassandra.apache.org Sent: Monday, September 15, 2014 1:34 PM Subject: Re: C 2.1 Jack, Using Solr or an external search/indexing service is an option, but it increases the complexity of managing different systems. I am curious to understand the impact of having wide rows in a separate CF for inverted-index purposes, which, if I understand Rob's response correctly, is better than using the default secondary index option. It would be great to understand the design decision behind the present secondary index implementation if the alternative is better. Looking at JIRAs, it is still confusing to work out the why :) --R On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky j...@basetechnology.com wrote: If you’re indexing and querying on that many columns (dozens, or more than a handful), consider DSE/Solr, especially if you need to query on multiple columns in the same query.
-- Jack Krupansky From: Robert Coli Sent: Monday, September 15, 2014 11:07 AM To: user@cassandra.apache.org Subject: Re: C 2.1 On Sat, Sep 13, 2014 at 3:49 PM, Ram N yrami...@gmail.com wrote: Is 2.1 a production ready release? https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does CQL indexes create a separate CF for the index table? How is it different from maintaining inverted index? Internally both are the same? Does cql stmt to create index, creates a separate CF and has an atomic way of updating/managing them? Which one is better to scale? (something like stargate-core or the ones done by usergrid? or the CQL approach?) New projects should use CQL. Access to underlying storage via Thrift is likely to eventually be removed from Cassandra. On a separate note just curious if I have 1000's of columns in a given row and a fixed set of indexed column (say 30 - 50 columns) which approach should I be taking? Will cassandra scale with these many indexed column? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for cassandra but would really appreciate any response on these. Thanks. Use of the Secondary Indexes feature is generally an anti-pattern in Cassandra. 30-50 indexed columns in a row sounds insane to me. However, 30-50 column families into which one manually denormalizes does not sound too insane to me... =Rob http://twitter.com/rcolidba
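To make the denormalization advice in this thread concrete, here is a minimal sketch (the table and column names are made up for illustration): instead of a secondary index, you maintain a second table whose partition key is the value you query on.

```cql
-- The secondary-index approach (scatter-gather across all nodes):
-- CREATE INDEX users_email_idx ON users (email);

-- The denormalized approach: a lookup table partitioned by the
-- queried value, so each read hits a single partition on one replica set.
CREATE TABLE users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
);
```

The application then writes to both users and users_by_email on every update (a logged batch can keep the two in step), and reads by email go straight to users_by_email.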
Re: Cassandra JBOD disk configuration
I've used JBOD before and here are the operational problems I noticed: 1) each volume/disk fills at a different rate, so the min might be 100 GB of data and the max might be 200 GB. That means you cannot use anywhere near your real hard disk capacity. (Then on top of that, compaction requires free space.) 2) when a disk dies you lose that node immediately, whereas with RAID you get some warning. Those issues made JBOD unusable for us. But if you're just using Cassandra as a cache, or your operations team doesn't mind rebuilding nodes with no advance notice, or your data size is small compared to the disk size, then it might be OK for you. Thanks, James Briggs. From: Chris Lohfink clohf...@blackbirdit.com To: user@cassandra.apache.org Sent: Tuesday, September 9, 2014 12:14 PM Subject: Re: Cassandra JBOD disk configuration It can get really unbalanced with STCS. What's more, even if there were a disk that could fit the 600 GB sstable, Cassandra doesn't consider free space first, so it may pick the 75%-full disk over the 10%-full one. It's a better idea to use LCS with JBOD, unless the data model really needs STCS, in which case monitor it carefully. If you want to utilize your disks more completely, you will probably just want to use RAID. I imagine you would get far better performance out of JBOD though... It Depends. Chris On Sep 4, 2014, at 4:48 AM, Hannu Kröger hkro...@gmail.com wrote: Hi, Let's imagine that I have one keyspace with one big table configured with the size-tiered compaction strategy, and nothing else. The disk configuration would be 10x 500 GB disks, each mounted to a separate directory. Each directory would then be configured as a separate entry in cassandra.yaml. Over time data accumulates, and at some point I have 4x 300 GB sstables that Cassandra would like to compact into one 1.2 TB sstable. Since each directory has at most 500 GB of disk space, that would not work. Right? Is JBOD with more than 2 disks really usable with STCS? Would LCS probably be the only way to go in this case?
Cheers, Hannu
Re: cassandra on own distributed network
What you're describing depends on the load (data size) and latency. Doing a bootstrap or backup would require a fair amount of bandwidth if you want it done quickly with a lot of data. Also, latency would be very high going over some kind of office VPN. But there's no reason you can't do what you're describing. You could set up a test cluster and see what the actual latency is. Most people use 4 nodes per POP with NetworkTopologyStrategy (NTS) for a multi-DC setup with RF=3. Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: David M da3bob...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, September 9, 2014 5:49 PM Subject: cassandra on own distributed network Hi everyone, I am at a loss for locating use cases/examples/documentation/books/etc. for deploying Cassandra where the multi-DC nodes of a single cluster are on your own network at points around the world. In my example, a Cassandra DC equates to a building. Of interest to me is how installations are inter-connecting their DCs (circuit bandwidth, latency requirements) for optimal replication/gossip/etc., and any lessons learned they can share. I know there isn't going to be a single config that applies to every deployment/usage pattern, but surely there are at least loose rules of thumb that will get me going (or maybe alternative deployments). The interesting posts/blogs/books seem to reference Cassandra in the cloud (e.g. specifying AWS instance types), leaving out descriptions/usage/requirements at the network layer. If anyone knows of any information on this topic that I've missed, I'd appreciate your sharing it. Thanks, David
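For reference, the multi-DC replication James describes is declared when the keyspace is created; a minimal sketch, where the datacenter names are placeholders that must match your snitch configuration:

```cql
-- RF=3 in each of two datacenters with NetworkTopologyStrategy.
CREATE KEYSPACE app_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_west': 3,
    'dc_east': 3
  };
```

With RF=3 per DC, LOCAL_QUORUM reads and writes can be served entirely within one datacenter, which keeps the high inter-DC (office-VPN) latency out of the request path; replication to the other DC happens asynchronously.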
Re: hardware sizing for cassandra
Regarding what Netflix does, the last time I checked: 1) sure, they use AWS VMs, but they take the whole machine. So is that really using a VM? :) 2) they use SSD mainly to reduce compaction time. We don't even notice compaction with SSD any more. When sizing nodes and clusters, the main factors I've seen are: a) What read latency are you trying to achieve? With 400 GB of data per node, 10 ms is easy, but 1 ms is hard. Your whole design will revolve around this if you want low latency. b) How much data load per node is there? Bootstrapping and backup/restore get time-consuming and hard with more than 400 GB per node. c) Are you planning to delete data? If so, that's harder to manage. Other than that, the previous comments on RAM are pretty accurate. I would want more cores with vnodes to do more parallel operations. Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Robert Coli rc...@eventbrite.com To: user@cassandra.apache.org Sent: Tuesday, September 9, 2014 2:44 PM Subject: Re: hardware sizing for cassandra On Tue, Sep 9, 2014 at 2:16 PM, Russell Bradberry rbradbe...@gmail.com wrote: Because RAM is expensive and the JVM heap is limited to 8 GB. While you do get benefit out of using extra RAM as page cache, it's often not cost-efficient to do so. Again, this is so use-case dependent. I have met several people who run small nodes with fat RAM to get it all in memory, to serve things in as few milliseconds as possible. This is a very common pattern in ad tech, where every millisecond counts. The tunable consistency and cross-datacenter replication make Cassandra very appealing, as it is difficult to set this up with other DBs. Sure, it's also very common to run an RDBMS in such a mode that hundreds of gigabytes of RAM are available as either page cache or buffer pool.
But "things are fast when you don't access slow disks" is not really a commentary on Cassandra specifically, and 8 GB is about the largest practical heap size with CMS GC.. :D The recommended setup is 3 nodes and an RF of 3, to be able to make quorum reads/writes and survive an outage. But again, this is completely use-case dependent. IMO, the minimum number of nodes you actually want to use in production with RF=3 is >= 4, probably closer to 6. But as you say, use-case dependent. =Rob