Re: Data synchronization between 2 running clusters on different availability zone
Here's a snitch we use for this situation - it uses a property file if it exists, but falls back to EC2 autodiscovery if it is missing. https://github.com/barchart/cassandra-plugins/blob/master/src/main/java/com/barchart/cassandra/plugins/snitch/GossipingPropertyFileWithEC2FallbackSnitch.java On Mon, Dec 1, 2014 at 12:33 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Nov 27, 2014 at 1:24 AM, Spico Florin spicoflo...@gmail.com wrote: I have another question. What about the following scenario: two Cassandra instances installed on different cloud providers (EC2, Flexiant)? How do you synchronize them? Can you use some internal tools or do I have to implement my own mechanism? That's what I meant by "if maybe hybrid in the future, use GPFS": http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchGossipPF_c.html "Hybrid" in this case means AWS-and-not-AWS. =Rob
EC2 SSD cluster costs
The latest consensus around the web for running Cassandra on EC2 seems to be to use the new SSD instances. I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances):
m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
i2.xlarge (1x800 SSD): $76,000 (125 nodes)
Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on the previous generation until prices come down?
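For reference, the node counts and totals above follow from simple arithmetic. The per-node monthly rates below are approximations back-computed from the totals quoted in this message, not authoritative AWS pricing, and the calculation ignores replication factor and free-space headroom:

```python
import math

# Hypothetical 2014-era on-demand monthly cost per node, back-computed
# from the totals quoted above (not actual AWS price-list figures).
INSTANCES = {
    "m1.xlarge": {"storage_gb": 2 * 420, "monthly_cost": 250},
    "m3.xlarge": {"storage_gb": 2 * 40,  "monthly_cost": 200},
    "i2.xlarge": {"storage_gb": 1 * 800, "monthly_cost": 608},
}

def cluster_cost(target_tb, instance):
    """Nodes needed to hold target_tb of raw data, and monthly cost."""
    spec = INSTANCES[instance]
    nodes = math.ceil(target_tb * 1000 / spec["storage_gb"])
    return nodes, nodes * spec["monthly_cost"]

for name in INSTANCES:
    nodes, cost = cluster_cost(100, name)
    print(f"{name}: {nodes} nodes, ${cost:,}/month")
```

Scaling any of the per-node rates changes the totals linearly, but the 2.5x cost ratio between m1.xlarge and i2.xlarge storage persists.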
Best practices for frequently updated columns
We are building a historical timeseries database for stocks and futures, with trade prices aggregated into daily bars (open, high, low, close values for the day). The latest bar for each instrument needs to be updated as new trades arrive on the realtime data feeds. Depending on the trading volume for an instrument, some columns will be updated multiple times per second. I've read comments about frequent column updates causing compaction issues with Cassandra. What is the recommended Cassandra configuration / best practices for usage scenarios like this?
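A common mitigation for columns updated multiple times per second is to fold the realtime trades into the current bar in application memory and flush the bar to Cassandra on a fixed interval (say, once per second), rather than issuing a write per trade. A minimal sketch of the in-memory fold; the DailyBar type and apply_trade helper are hypothetical, not part of any driver:

```python
from dataclasses import dataclass

@dataclass
class DailyBar:
    open: float
    high: float
    low: float
    close: float

def apply_trade(bar, price):
    """Fold one trade price into the current daily bar (client memory);
    a periodic task would then write the whole bar to Cassandra."""
    if bar is None:
        return DailyBar(price, price, price, price)
    bar.high = max(bar.high, price)
    bar.low = min(bar.low, price)
    bar.close = price
    return bar
```

This trades a small window of durability for far fewer column updates, which in turn reduces the compaction churn the poster is worried about.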
Re: vnode and NetworkTopologyStrategy: not playing well together ?
If your nodes are not actually evenly distributed across physical racks for redundancy, don't use multiple racks. On Tue, Aug 5, 2014 at 10:57 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: First, thanks for your answer. This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF = # of racks. IMHO, it's not a good enough condition. Let's use an example with RF=2 N1/rack_1 N2/rack_1 N3/rack_1 N4/rack_2 Here, you have RF = # of racks And due to NetworkTopologyStrategy, N4 will store *all* the cluster data, leading to a completely imbalanced cluster. IMHO, it happens when using nodes *or* vnodes. As well-balanced clusters with NetworkTopologyStrategy rely on carefully chosen token distribution/path along the ring *and* as tokens are randomly generated with vnodes, my guess is that with vnodes and NetworkTopologyStrategy, it's better to define a single (logical) rack // due to carefully chosen tokens vs randomly-generated token clash. I don't see other options left. Do you see other ones? Regards, Dominique -----Original Message----- From: jonathan.had...@gmail.com [mailto:jonathan.had...@gmail.com] On Behalf Of Jonathan Haddad Sent: Tuesday, August 5, 2014 5:43 PM To: user@cassandra.apache.org Subject: Re: vnode and NetworkTopologyStrategy: not playing well together ? This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF = # of racks. For each token, replicas are chosen based on the strategy. Essentially, you could have a wild imbalance in token ownership, but it wouldn't matter because the replicas would be distributed across the rest of the machines. 
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html On Tue, Aug 5, 2014 at 8:19 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, My understanding is that NetworkTopologyStrategy does NOT play well with vnodes, due to: · Vnode = tokens are (usually) randomly generated (AFAIK) · NetworkTopologyStrategy = requires carefully chosen tokens for all nodes in order not to get a VERY unbalanced ring like in https://issues.apache.org/jira/browse/CASSANDRA-3810 When playing with vnodes, is the recommendation to define one rack for the entire cluster? Thanks. Regards, Dominique -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
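Dominique's four-node example can be checked with a short simulation. The replica-selection walk below is a simplified stand-in for NetworkTopologyStrategy (one token per node rather than vnodes, and a reduced version of the real rack-aware walk), but it reproduces the imbalance he describes: with three nodes in rack_1 and one in rack_2, the rack_2 node holds a replica of every token range at RF=2:

```python
RF = 2
# The example from the thread: evenly spaced tokens, three nodes in
# rack_1 and one in rack_2. Entries are (token, node, rack), ring-ordered.
ring = [
    (0,  "N1", "rack_1"),
    (25, "N2", "rack_1"),
    (50, "N3", "rack_1"),
    (75, "N4", "rack_2"),
]

def replicas_for(token_index):
    """Simplified NetworkTopologyStrategy: walk the ring clockwise,
    taking the first node from each not-yet-used rack, then fill from
    skipped nodes if the racks are exhausted before reaching RF."""
    chosen, racks_used, skipped = [], set(), []
    n = len(ring)
    for i in range(n):
        _, node, rack = ring[(token_index + i) % n]
        if rack not in racks_used:
            chosen.append(node)
            racks_used.add(rack)
        else:
            skipped.append(node)
        if len(chosen) == RF:
            return chosen
    return chosen + skipped[: RF - len(chosen)]

# N4 is a replica for every single token range:
print(all("N4" in replicas_for(i) for i in range(len(ring))))
```

Because the second replica must come from a different rack, and rack_2 contains only N4, N4 receives a copy of 100% of the data while N1-N3 each hold a fraction.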
Re: Authentication exception
Yes, and all nodes have had at least two more scheduled repairs since then. On Jul 30, 2014 1:47 AM, Or Sher or.sh...@gmail.com wrote: Did you run a repair after changing the replication factor for system_auth? On Tue, Jul 29, 2014 at 5:48 PM, Jeremy Jongsma jer...@barchart.com wrote: This is still happening to me; is there anything else I can check? All nodes have NTP installed, all are in sync, all have open communication to each other. But usually first thing in the morning, I get this auth exception. A little while later, it starts working. I'm very puzzled. On Tue, Jul 22, 2014 at 8:53 AM, Jeremy Jongsma jer...@barchart.com wrote: Verified all clocks are in sync. On Mon, Jul 21, 2014 at 10:03 PM, Rahul Menon ra...@apigee.com wrote: Could you perhaps check your NTP? On Tue, Jul 22, 2014 at 3:35 AM, Jeremy Jongsma jer...@barchart.com wrote: I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this? -- Or Sher
Re: Measuring WAN replication latency
The brute force way would be: 1) Make client connections to a node in each datacenter from your monitoring tool. 2) Periodically write a row to one datacenter (at whatever consistency level your application typically uses). 3) Immediately query the other datacenter nodes for the same row key with LOCAL_QUORUM consistency. If not found, execute the query again immediately in a loop. 4) Once the row is available, record the time since the initial write for that datacenter. DataStax folks: this actually seems like a useful metric for OpsCenter to track, since it is already doing active statistics collection. On Wed, Jul 30, 2014 at 8:59 AM, Rahul Neelakantan ra...@rahul.be wrote: Rob, Any ideas you can provide on how to do this will be appreciated; we would like to build a latency monitoring tool/dashboard that shows how long it takes for data to get sent across various DCs. Rahul Neelakantan On Jul 29, 2014, at 8:53 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Jul 29, 2014 at 3:15 PM, Rahul Neelakantan ra...@rahul.be wrote: Does anyone know of a way to measure/monitor WAN replication latency for Cassandra? No. [1] =Rob [1] There are ways to do something like this task, but you probably don't actually want to do them. Trying to do them suggests that you are relying on WAN replication timing for your application, which is something you almost certainly do not want to do. Why do you believe you have this requirement?
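The four steps above can be sketched as follows. write_row and read_row are hypothetical callables bound to client sessions in the two datacenters; here they are simulated by an in-memory store that makes the row visible roughly 50 ms after the write, standing in for WAN replication:

```python
import time

def measure_replication_latency(write_row, read_row, key, timeout=30.0, poll=0.01):
    """Write to DC1, poll DC2 until the row appears, return elapsed
    seconds (or None on timeout). Mirrors steps 2-4 of the brute-force
    approach; in a real tool write_row/read_row would be driver calls."""
    start = time.monotonic()
    write_row(key)
    while time.monotonic() - start < timeout:
        if read_row(key) is not None:
            return time.monotonic() - start
        time.sleep(poll)
    return None

# Stand-in for a remote DC that sees the write ~50 ms later:
store, visible_at = {}, {}

def write_row(key):
    visible_at[key] = time.monotonic() + 0.05
    store[key] = "row"

def read_row(key):
    if time.monotonic() >= visible_at.get(key, float("inf")):
        return store.get(key)
    return None

latency = measure_replication_latency(write_row, read_row, "probe:1")
print(f"replication latency ~{latency * 1000:.0f} ms")
```

The poll interval bounds the measurement resolution, so keep it well below the latencies you expect to observe.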
Re: Measuring WAN replication latency
Yes, the results should definitely not be relied on as a future performance indicator for key app functionality, but knowing roughly what your current replication latency is (and whether it's outside of the normal average) can inform client failover policies, help debug data consistency issues, warn of datacenter link congestion, etc. On Wed, Jul 30, 2014 at 12:02 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jul 30, 2014 at 6:59 AM, Rahul Neelakantan ra...@rahul.be wrote: Any ideas you can provide on how to do this will be appreciated, we would like to build a latency monitoring tool/dashboard that shows how long it takes for data to get sent across various DCs. The brute force method described downthread by Jeremy Jongsma gives you something like the monitoring you're looking for, but I continue to believe it's probably a bad idea to try to design a system in this way. =Rob
Re: Authentication exception
This is still happening to me; is there anything else I can check? All nodes have NTP installed, all are in sync, all have open communication to each other. But usually first thing in the morning, I get this auth exception. A little while later, it starts working. I'm very puzzled. On Tue, Jul 22, 2014 at 8:53 AM, Jeremy Jongsma jer...@barchart.com wrote: Verified all clocks are in sync. On Mon, Jul 21, 2014 at 10:03 PM, Rahul Menon ra...@apigee.com wrote: Could you perhaps check your NTP? On Tue, Jul 22, 2014 at 3:35 AM, Jeremy Jongsma jer...@barchart.com wrote: I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this?
Re: Cassandra on AWS suggestions for data safety
We also run a nightly nodetool snapshot on all nodes, and use duplicity to sync the snapshot to S3, keeping 7 days' worth of backups. Since duplicity tracks incremental changes this gives you the benefit of point-in-time snapshots without duplicating sstables that are common across multiple backups. It also makes it easy to revert all nodes' state to X days ago in case of accidental or malicious data corruption. On Thu, Jul 24, 2014 at 12:17 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jul 23, 2014 at 4:12 PM, Hao Cheng br...@critica.io wrote: 3. Using a backup system, either manually via rsync or through something like Priam, to directly push backups of the data on ephemeral storage to S3. https://github.com/JeremyGrosser/tablesnap =Rob
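As a rough sketch, the nightly job described above boils down to three commands per node. The keyspace, paths, bucket name, and duplicity flags below are illustrative assumptions, not the poster's actual configuration; the function only constructs the command lines so the pipeline is easy to inspect and test:

```python
def snapshot_commands(keyspace, snapshot_tag, data_dir, s3_bucket, keep_days=7):
    """Build the nightly backup pipeline: snapshot locally, sync the
    snapshot to S3 with duplicity, expire backups past the retention
    window. All names here are hypothetical examples."""
    return [
        # 1. Flush and hard-link sstables into a named snapshot.
        ["nodetool", "snapshot", "-t", snapshot_tag, keyspace],
        # 2. Incremental sync of the data directory to S3.
        ["duplicity", "--full-if-older-than", f"{keep_days}D",
         f"{data_dir}/{keyspace}", f"s3://{s3_bucket}/{snapshot_tag}"],
        # 3. Drop backup sets older than the retention window.
        ["duplicity", "remove-older-than", f"{keep_days}D", "--force",
         f"s3://{s3_bucket}/{snapshot_tag}"],
    ]
```

In a real cron job each list would be passed to subprocess.run, and you would also want nodetool clearsnapshot afterwards so old snapshots don't accumulate on local disk.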
Re: Why is the cassandra documentation such poor quality?
My experience is similar to Nicholas'. Basic usage was easy to get a handle on, but the advanced tuning/tweaking info is scattered EVERYWHERE around the web, mostly on personal blogs. It feels like it took way too long to become confident enough in my understanding of Cassandra that I trust our deployment configuration in production. Without this mailing list I would still be on the fence. On Wed, Jul 23, 2014 at 8:20 AM, Peter Lin wool...@gmail.com wrote: @benedict - you're right that I haven't requested permission to edit. You're also right that I've given up on getting edit permission to the cassandra wiki. I've struggled with how to manage open source projects myself, so I totally get it. Managing projects is a thankless job most of the time. Pleasing everyone is totally impossible. Apache isn't alone in this. I've submitted stuff to google's open source projects in the past and had it go into a black hole. We all struggle with managing open source projects. I am committed to contributing to the Cassandra community, but just not through the wiki. There's lots of different ways to contribute. The jira tickets I've submitted have gotten good responses generally. It does take several days depending on how busy the committers are, but that's normal for all projects. On Wed, Jul 23, 2014 at 9:00 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: Requesting a change is very different to requesting permission to edit (which, I note, still hasn't been made); we do our best to promote community engagement, so granting a privilege request has a different mental category to a random edit request, which is much more likely to be forgotten by any particular committer in the process of attending to their more pressing work. The relationship between committers and the community is debated at length in all projects, often by vocal individuals such as yourselves who are unhappy in some way with how the project is being run. 
However it is very hard to please everyone - most of the time we can't even please all the committers, and that is a much smaller and more homogeneous group. On Wed, Jul 23, 2014 at 2:30 PM, Peter Lin wool...@gmail.com wrote: I sent a request to add a link to my .Net driver for cassandra to the wiki over 5 weeks back and got no response at all. I sent another request way back in 2013 and got zero response. Again, I totally understand people are busy and I'm just as guilty as everyone else of letting requests slip by. It's the reality of contributing to open source as a hobby. If I wasn't serious about contributing to the cassandra community, I wouldn't have spent 2.5 months porting Hector to C# manually. Perhaps the real cause is that some committers can't empathise with others in the community? On Wed, Jul 23, 2014 at 8:22 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: All requests I've seen in the past year to edit the wiki (admittedly only 2-3) have been answered promptly with editing privileges. Personally I don't have a major preference either way for policy - there are positives and negatives to each approach - but, like I said, raise it on the dev list and see if anybody else does. However I must admit I cannot empathise with your characterisation of requesting permission as 'begging', or a 'slap in the face', or that it is even particularly onerous. It is a slight psychological barrier, but in my personal experience when a psychological barrier as low as this prevents me from taking action, it's usually because I don't have as much desire to contribute as I thought I did. On Wed, Jul 23, 2014 at 1:54 PM, Peter Lin wool...@gmail.com wrote: I've submitted requests to edit the wiki in the past and nothing ever got done. Having been an apache committer and contributor over the years, I can totally understand that people are busy. I also understand that most developers find writing docs tedious. 
I'd rather not harass the committers about wiki edits, since I didn't like it when it happened to me in the past. That's why many apache projects keep their wikis open. Honestly, as much as I find writing docs challenging and tedious, it's critical and important. For my other open source projects, I force myself to write docs. My point is, the wiki should be open and the barrier should be removed. Having to beg/ask to edit the wiki feels like a slap in the face to me, but maybe I'm alone in this. Then again, I've heard the same sentiment from other people about cassandra's wiki. The thing is, they just chalk it up to cassandra committers not giving a crap about docs. I do my best to defend the committers and point out some are volunteers, but it does give the public a negative impression. I know the committers care about docs, but they don't always have time to do it. I know that given a choice between coding or writing docs, 90% of the time I'll choose coding. What I've
Re: Authentication exception
Verified all clocks are in sync. On Mon, Jul 21, 2014 at 10:03 PM, Rahul Menon ra...@apigee.com wrote: Could you perhaps check your NTP? On Tue, Jul 22, 2014 at 3:35 AM, Jeremy Jongsma jer...@barchart.com wrote: I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this?
Authentication exception
I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this?
Re: New application - separate column family or separate cluster?
Thanks Tupshin, I am thinking #2 is the way to go in my case, and always have the option of migrating column families to a new cluster if needed. Parag, At the traffic volumes I'm talking about, #2 (and especially #3) will have a lot more total VM nodes, because the other apps are used lightly enough that there is no reason to add capacity specifically for them to an already large cluster. But app-specific clusters would need at least 3 nodes each (for redundancy) when the actual traffic load would require less than one, hence the increased node costs. On Wed, Jul 9, 2014 at 7:07 AM, Parag Patel ppa...@clearpoolgroup.com wrote: In your scenario #1, is the total number of nodes staying the same? Meaning, if you launch multiple clusters for #2, you’d have N total nodes – are we assuming #1 has N or less than N? If #1 and #2 both have N, wouldn’t the performance be the same since Cassandra’s performance increases linearly? *From:* Tupshin Harper [mailto:tups...@tupshin.com] *Sent:* Tuesday, July 08, 2014 11:13 PM *To:* user@cassandra.apache.org *Subject:* Re: New application - separate column family or separate cluster? I've seen a lot of deployments, and I think you captured the scenarios and reasoning quite well. You can apply other nuances and details to #2 (e.g. segment based on SLA or topology), but I agree with all of your reasoning. -Tupshin -Global Field Strategy -Datastax On Jul 8, 2014 10:54 AM, Jeremy Jongsma jer...@barchart.com wrote: Do you prefer purpose-specific Cassandra clusters that support a single application's data set, or a single Cassandra cluster that contains column families for many applications? I realize there is no ideal answer for every situation, but what have your experiences been in this area for cluster planning? My reason for asking is that we have one application with high data volume (multiple TB, thousands of writes/sec) that caused us to adopt Cassandra in the first place. 
Now we have the tools and cluster management infrastructure built up to the point where it is not a major investment to store smaller sets of data for other applications in C* also, and I am debating whether to: 1) Store everything in one large cluster (no isolation, low cost) 2) Use one cluster for the high-volume data, and one for everything else (good isolation, medium cost) 3) Give every major service its own cluster, even if they have small amounts of data (best isolation, highest cost) I suspect #2 is the way to go as far as balancing hosting costs and application performance isolation. Any pros or cons am I missing? -j
New application - separate column family or separate cluster?
Do you prefer purpose-specific Cassandra clusters that support a single application's data set, or a single Cassandra cluster that contains column families for many applications? I realize there is no ideal answer for every situation, but what have your experiences been in this area for cluster planning? My reason for asking is that we have one application with high data volume (multiple TB, thousands of writes/sec) that caused us to adopt Cassandra in the first place. Now we have the tools and cluster management infrastructure built up to the point where it is not a major investment to store smaller sets of data for other applications in C* also, and I am debating whether to: 1) Store everything in one large cluster (no isolation, low cost) 2) Use one cluster for the high-volume data, and one for everything else (good isolation, medium cost) 3) Give every major service its own cluster, even if they have small amounts of data (best isolation, highest cost) I suspect #2 is the way to go as far as balancing hosting costs and application performance isolation. Any pros or cons am I missing? -j
Re: Storing values of mixed types in a list
Use a ByteBuffer value type with your own serialization (we use protobuf for complex value structures). On Jun 24, 2014 5:30 AM, Tuukka Mustonen tuukka.musto...@gmail.com wrote: Hello, I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>. Is this possible in Cassandra and if not, what workaround would you suggest for storing a list of mixed type items? I sketched a few (using a list per type, using a list of user types in Cassandra 2.1, etc.), but I get a bad feeling about each. Couldn't find an exact answer to this through searches... Regards, Tuukka P.S. I first asked this at SO before realizing the traffic there is very low: http://stackoverflow.com/questions/24380158/storing-a-list-of-mixed-types-in-cassandra
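A minimal illustration of the ByteBuffer-plus-own-serialization approach, using a simple tag-length-value scheme as a stand-in for protobuf (the resulting bytes would be stored in a single blob column; the encoding itself is a made-up example, not a standard format):

```python
import struct

def encode(items):
    """Encode a mixed list of int/float/str/bool into one byte string,
    tagging each item with a one-byte type marker."""
    out = b""
    for item in items:
        if isinstance(item, bool):  # check bool before int: bool subclasses int
            out += b"b" + struct.pack(">?", item)
        elif isinstance(item, int):
            out += b"i" + struct.pack(">q", item)
        elif isinstance(item, float):
            out += b"f" + struct.pack(">d", item)
        elif isinstance(item, str):
            data = item.encode("utf-8")
            out += b"s" + struct.pack(">I", len(data)) + data
        else:
            raise TypeError(f"unsupported type: {type(item)}")
    return out

def decode(buf):
    """Inverse of encode: walk the tags and rebuild the mixed list."""
    items, i = [], 0
    while i < len(buf):
        tag = buf[i:i + 1]
        i += 1
        if tag == b"b":
            items.append(struct.unpack(">?", buf[i:i + 1])[0]); i += 1
        elif tag == b"i":
            items.append(struct.unpack(">q", buf[i:i + 8])[0]); i += 8
        elif tag == b"f":
            items.append(struct.unpack(">d", buf[i:i + 8])[0]); i += 8
        elif tag == b"s":
            n = struct.unpack(">I", buf[i:i + 4])[0]; i += 4
            items.append(buf[i:i + n].decode("utf-8")); i += n
    return items
```

In practice a schema-driven serializer like protobuf is preferable because it gives you versioning and cross-language support, but the storage model is the same: opaque bytes in one column.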
Re: How to perform Range Queries in Cassandra
You'd be better off using external indexing (ElasticSearch or Solr); Cassandra isn't really designed for this sort of querying. On Jun 24, 2014 3:09 AM, Mike Carter jaloos...@gmail.com wrote: Hello! I'm a beginner in C* and I'm quite struggling with it. I’d like to measure the performance of some Cassandra range queries. The idea is to execute multidimensional range queries on Cassandra. E.g. there is a given table of 1 million rows with 10 columns and I'd like to execute some queries like “select count(*) from testable where d=1 and v1 > 10 and v2 > 20 and v3 > 45 and v4 > 70 … allow filtering”. Queries of this kind are very slow in C*, and as soon as the tables get bigger I get a read-timeout, probably caused by long scan operations. In further tests I'd like to extend the dimensions to more than 200 and the rows to 100 million, but at the moment I can’t even handle this small table. Should I reorganize the data, or is it impossible to perform such high-dimensional range queries on Cassandra? The setup: Cassandra is installed on a single node with 2 TB disk space and 180GB RAM. Connected to Test Cluster at localhost:9160. 
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Keyspace: CREATE KEYSPACE test WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
Table: CREATE TABLE testc21 ( key int, d int, v1 int, v10 int, v2 int, v3 int, v4 int, v5 int, v6 int, v7 int, v8 int, v9 int, PRIMARY KEY (key) ) WITH bloom_filter_fp_chance=0.01 AND caching='ROWS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND index_interval=128 AND read_repair_chance=0.10 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='99.0PERCENTILE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX testc21_d_idx ON testc21 (d);
select * from testc21 limit 10;
    key | d | v1 | v10 | v2 | v3 |  v4 | v5 | v6 | v7 | v8 |  v9
--------+---+----+-----+----+----+-----+----+----+----+----+-----
 302602 | 1 | 56 |  55 | 26 | 45 |  67 | 75 | 25 | 50 | 26 |  54
 531141 | 1 | 90 |  77 | 86 | 42 |  76 | 91 | 47 | 31 | 77 |  27
 693077 | 1 | 67 |  71 | 14 | 59 | 100 | 90 | 11 | 15 |  6 |  19
   4317 | 1 | 70 |  77 | 44 | 77 |  41 | 68 | 33 |  0 | 99 |  14
 927961 | 1 | 15 |  97 | 95 | 80 |  35 | 36 | 45 |  8 | 11 | 100
 313395 | 1 | 68 |  62 | 56 | 85 |  14 | 96 | 43 |  6 | 32 |   7
 368168 | 1 |  3 |  63 | 55 | 32 |  18 | 95 | 67 | 78 | 83 |  52
 671830 | 1 | 14 |  29 | 28 | 17 |  42 | 42 |  4 |  6 | 61 |  93
  62693 | 1 | 26 |  48 | 15 | 22 |  73 | 94 | 86 |  4 | 66 |  63
 488360 | 1 |  8 |  57 | 86 | 31 |  51 |  9 | 40 | 52 | 91 |  45
Mike
Re: Best way to do a multi_get using CQL
I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client-to-node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on a 12-node, 2-datacenter, RF-3 cluster in AWS. Also, why do the work the coordinator will do for you: sending all the queries, waiting for everything to come back in whatever order, and sorting the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was good advice, thanks a lot Jonathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. 
But would it be a recommended solution for these cases? Any other options? I still wonder if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async does the driver reuse the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't it be even worse than using SELECT IN? The justification in the docs is that I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)" for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error("Query '%s' returned ERROR" % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of the values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. 
It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at the Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among them is "SELECT ... IN or index lookups": SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See "When not to use IN" in SELECT and "When not to use an index" in Indexing in CQL for Cassandra 2.0. And looking at the SELECT doc, I saw: "When not to use IN": The recommendations about when not to use
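The 100-key batching mentioned upthread is just a chunking step before dispatch; a trivial sketch (the 100-key limit is the poster's rule of thumb, not a driver constant):

```python
def chunk(keys, batch_size=100):
    """Split a large key list into batches before issuing multi-key
    queries, to bound memory usage on the executing node."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

batches = list(chunk(list(range(250))))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch would then be issued as one async request, with the client waiting on all futures at the end.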
Re: Best way to do a multi_get using CQL
That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Suppose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client-to-node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on a 12-node, 2-datacenter, RF-3 cluster in AWS. 
Also, why do the work the coordinator will do for you: sending all the queries, waiting for everything to come back in whatever order, and sorting the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was good advice, thanks a lot Jonathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recommended solution for these cases? Any other options? I still wonder if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async does the driver reuse the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't it be even worse than using SELECT IN? 
The justification in the docs is that I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN:

SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"

for name, values in identifiers.items():
    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s'] * len(values)))
    args = [name] + values
    query_msg = query % tuple(args)
    futures.append((query_msg, self.session.execute_async(query, args)))

for query_msg, future in futures:
    try:
        rows = future.result(timeout=10)
        for row in rows:
            entity_ids.add(row.entity_id)
    except:
        logging.error("Query '%s' returned ERROR" % (query_msg))
        raise

Using async just with single-value selects would mean that instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of the values array above. In my head, this would mean more
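The batching advice in this thread (limit multigets to slices of roughly 100 keys, issued asynchronously) can be sketched in Python. This is an illustrative sketch, not code from the thread: chunk_keys and query_in_batches are hypothetical helper names, and session.execute_async stands in for the async query call of a driver such as the DataStax python driver.

```python
# Illustrative sketch (not from the thread): split a large key list into
# bounded batches before querying, per the advice to limit batches to
# ~100 keys. session.execute_async stands in for your driver's async call.

def chunk_keys(keys, batch_size=100):
    """Yield successive batches of at most batch_size keys."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

def query_in_batches(session, keys, batch_size=100):
    # One IN(...) query per batch bounds memory use on the coordinator,
    # trading a few extra round trips for cluster stability.
    futures = []
    for batch in chunk_keys(keys, batch_size):
        placeholders = ','.join(['%s'] * len(batch))
        cql = "SELECT * FROM instruments WHERE key IN (%s)" % placeholders
        futures.append(session.execute_async(cql, batch))
    # Collect results only after all batches are in flight.
    return [f.result() for f in futures]
```

The per-batch futures are gathered at the end so the batches overlap on the wire rather than running strictly one after another.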
Custom snitch classpath?
Where do I add my custom snitch JAR to the Cassandra classpath so I can use it?
Re: Custom snitch classpath?
Sharing in case anyone else wants to use this: https://github.com/barchart/cassandra-plugins/blob/master/src/main/java/com/barchart/cassandra/plugins/snitch/GossipingPropertyFileWithEC2FallbackSnitch.java Basically it is a proxy that attempts to use GossipingPropertyFileSnitch, and if that fails to initialize due to missing rack or datacenter values, it falls back to Ec2MultiRegionSnitch. We are using it for hybrid cloud deployments between AWS and our private datacenter. On Fri, Jun 20, 2014 at 1:04 PM, Tyler Hobbs ty...@datastax.com wrote: The lib directory (where all the other jars are). bin/cassandra.in.sh does this:

for jar in $CASSANDRA_HOME/lib/*.jar; do
    CLASSPATH=$CLASSPATH:$jar
done

On Fri, Jun 20, 2014 at 12:58 PM, Jeremy Jongsma jer...@barchart.com wrote: Where do I add my custom snitch JAR to the Cassandra classpath so I can use it? -- Tyler Hobbs DataStax http://datastax.com/
Re: Best way to do a multi_get using CQL
There is nothing preventing that in Cassandra, it's just a matter of how intelligent the driver API is. Submit a feature request to the Astyanax or Datastax driver projects. On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: The bad design part (just my opinion, no intention to offend) is not allowing the possibility of sending batches directly to the data nodes, without using a coordinator. I would choose that option. []s 2014-06-20 16:05 GMT-03:00 DuyHai Doan doanduy...@gmail.com: Well it's kind of a trade-off. Either you send data directly to the primary replica nodes to take advantage of data-locality using a token-aware strategy, and the price to pay is a high number of opened connections from the client side. Or you just batch data to a random node playing the coordinator role to dispatch requests to the right nodes. The price to pay is then spike load on 1 node (the coordinator) and intra-cluster bandwidth usage. The choice is yours; it has nothing to do with good or bad design. On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I am using python + CQL Driver. I wonder how they do... These things seem little important, but they are fundamental to get good performance in Cassandra... I wish there was a simpler way to query in batches. Opening a large number of connections and sending 1 message at a time seems bad to me, as sometimes you want to work with small rows. It's no surprise Cassandra performs better when we use average row sizes. But honestly I disagree with this part of Cassandra/Driver's design. []s 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com: That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. 
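To make the token-aware dispatch discussed in this thread concrete, here is an illustrative Python sketch of what a token-aware driver does internally: hash each key onto the ring and group keys by primary replica, so each group can be sent to the node that owns it. The ring layout, the crc32 hash, and all function names here are simplified stand-ins invented for illustration; Cassandra actually uses the Murmur3 partitioner and real drivers track the full replica set, not just the primary.

```python
# Simplified stand-in for token-aware routing: not real driver code.
import bisect
import zlib

# A toy three-node token ring: (token, node) pairs sorted by token.
ring = [(0, 'node1'), (2**31, 'node2'), (2**32 - 1, 'node3')]

def token(key):
    # crc32 is a stand-in for Cassandra's Murmur3 partitioner.
    return zlib.crc32(key.encode())

def primary_replica(key):
    # Walk clockwise to the first node whose token covers this key.
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[idx][1]

def group_by_replica(keys):
    # Bucket keys by owner so each bucket goes straight to its node,
    # avoiding the coordinator hop at the cost of more connections.
    groups = {}
    for k in keys:
        groups.setdefault(primary_replica(k), []).append(k)
    return groups
```

This is exactly the trade-off described above: direct-to-replica dispatch saves the coordinator hop but requires open connections to every node that owns data.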
Re: running out of diskspace during maintenance tasks
One option is to add new nodes, and do a node repair/cleanup on everything. That will at least reduce your per-node data size. On Wed, Jun 18, 2014 at 11:01 AM, Brian Tarbox tar...@cabotresearch.com wrote: I'm running on AWS m2.2xlarge instances using the ~800 gig ephemeral/attached disk for my data directory. My data size per node is nearing 400 gig. Sometimes during maintenance operations (repairs mostly I think) I run out of disk space as my understanding is that some of these operations require double the space of one's data. Since I can't change the size of attached storage for my instance type my question is can I somehow get these maintenance operations to use other volumes? Failing that, what are my options? Thanks. Brian Tarbox
Re: Large number of row keys in query kills cluster
Good to know, thanks Peter. I am worried about client-to-node latency if I have to do 20,000 individual queries, but that makes it clearer that at least batching in smaller sizes is a good idea. On Wed, Jun 11, 2014 at 6:34 PM, Peter Sanford psanf...@retailnext.net wrote: On Wed, Jun 11, 2014 at 10:12 AM, Jeremy Jongsma jer...@barchart.com wrote: The big problem seems to have been requesting a large number of row keys combined with a large number of named columns in a query. 20K rows with 20K columns destroyed my cluster. Splitting it into slices of 100 sequential queries fixed the performance issue. When updating 20K rows at a time, I saw a different issue - BrokenPipeException from all nodes. Splitting into slices of 1000 fixed that issue. Is there any documentation on this? Obviously these limits will vary by cluster capacity, but for new users it would be great to know that you can run into problems with large queries, and how they present themselves when you hit them. The errors I saw are pretty opaque, and took me a couple days to track down. The first thing that comes to mind is the Multiget section on the Datastax anti-patterns page: http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=concept_ds_emm_hwl_fk__multiple-gets -psanford
Re: Backup Cassandra to
That will not necessarily scale, and I wouldn't recommend it - your backup node will need as much disk space as an entire replica of the cluster data. For a cluster with a couple of nodes that may be OK; for dozens of nodes, probably not. You also lose the ability to restore individual nodes - the only way to replace a dead node is with a full repair. On Thu, Jun 12, 2014 at 1:38 PM, Jabbar Azam aja...@gmail.com wrote: There is another way. You create a Cassandra node in its own datacentre; then any changes going to the main cluster will be replicated to this node. You can backup from this node. In the event of a disaster the data in the main cluster is wiped and then replayed from the individual node. The data will then be replicated to the main cluster. This will also work for the case when the main cluster increases or decreases in size. Thanks Jabbar Azam On 12 June 2014 18:27, Andrew redmu...@gmail.com wrote: There isn't a lot of "actual documentation" on the act of backing up, but I did research for my own company into the act of backing up, and unfortunately you're not going to have a similar setup as Oracle. There are reasons for this, however. If you have more than one replica of the data, that means each node in the cluster will likely be holding its own unique set of data. So you would need to back up the ENTIRE set of nodes in order to get an accurate snapshot. Likewise, you would need to restore it to a cluster of the same size in order to restore it (and then run refresh to tell Cassandra to reload the tables from disk). Copying the snapshots is easy - it's just a bunch of files in your data directory. It's even smaller if you use incremental snapshots. I'll admit I'm no expert on tape drives, but I'd imagine it's as easy as copy/pasting the snapshots to the drive (or whatever the equivalent tape drive operation is). What you (and I, admittedly) would really like to see is a way to back up all the logical *data*, and then simply replay it. 
This is possible on Oracle because it's typically restricted to one machine (plus maybe one or two standbys) that don't "share" any data. What you could do, in theory, is literally select all the data in the entire cluster and simply dump it to a file - but this could take hours, days, or even weeks to complete, depending on the size of your data - and then simply re-load it. This is probably not a great solution, but hey - maybe it will work for you. Netflix (thankfully) has posted a lot of their operational observations and whatnot, including their utility Priam. In their documentation, they include some overviews of what they use: https://github.com/Netflix/Priam/wiki/Backups Hope this helps! Andrew On June 12, 2014 at 6:18:57 AM, Jack Krupansky (j...@basetechnology.com) wrote: The doc for backing up - and restoring - Cassandra is here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html That doesn't tell you how to move the "snapshot" to or from tape, but a snapshot is the starting point for backing up Cassandra. -- Jack Krupansky *From:* Camacho, Maria (NSN - FI/Espoo) maria.cama...@nsn.com *Sent:* Thursday, June 12, 2014 4:57 AM *To:* user@cassandra.apache.org *Subject:* Backup Cassandra to Hi there, I'm trying to find information/instructions about backing up and restoring a Cassandra DB to and from a tape unit. I was hoping someone in this forum could help me with this since I could not find anything useful in Google :( Thanks in advance, Maria
Re: Large number of row keys in query kills cluster
I'm using Astyanax with a query like this:

clusterContext
    .getClient()
    .getKeyspace(instruments)
    .prepareQuery(INSTRUMENTS_CF)
    .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
    .getKeySlice(new String[] {
        ROW1,
        ROW2,
        // 20,000 keys here...
        ROW2
    })
    .execute();

At the time this query executes the first time (resulting in unresponsive cluster), there are zero rows in the column family. Schema is below, pretty basic:

CREATE KEYSPACE instruments WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'aws-us-east-1': '2'
};

CREATE TABLE instruments (
    key bigint PRIMARY KEY,
    definition blob,
    id bigint,
    name text,
    symbol text,
    updated bigint
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance=0.01
    AND caching='KEYS_ONLY'
    AND comment=''
    AND dclocal_read_repair_chance=0.00
    AND gc_grace_seconds=864000
    AND read_repair_chance=0.10
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'SnappyCompressor'};

On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael michael.la...@nytimes.com wrote: Perhaps if you described both the schema and the query in more detail, we could help... e.g. did the query have an IN clause with 2 keys? Or is the key compound? More detail will help. On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma jer...@barchart.com wrote: I didn't explain clearly - I'm not requesting 2 unknown keys (resulting in a full scan), I'm requesting 2 specific rows by key. On Jun 10, 2014 6:02 PM, DuyHai Doan doanduy...@gmail.com wrote: Hello Jeremy Basically what you are doing is to ask Cassandra to do a distributed full scan on all the partitions across the cluster, it's normal that the nodes are somehow stressed. How did you make the query? Are you using Thrift or CQL3 API? 
Please note that there is another way to get all partition keys : SELECT DISTINCT partition_key FROM..., more details here : www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3 I ran an application today that attempted to fetch 20,000+ unique row keys in one query against a set of completely empty column families. On a 4-node cluster (EC2 m1.large instances) with the recommended memory settings (2 GB heap), every single node immediately ran out of memory and became unresponsive, to the point where I had to kill -9 the cassandra processes. Now clearly this query is not the best idea in the world, but the effects of it are a bit disturbing. What could be going on here? Are there any other query pitfalls I should be aware of that have the potential to explode the entire cluster? -j
Re: Large number of row keys in query kills cluster
The big problem seems to have been requesting a large number of row keys combined with a large number of named columns in a query. 20K rows with 20K columns destroyed my cluster. Splitting it into slices of 100 sequential queries fixed the performance issue. When updating 20K rows at a time, I saw a different issue - BrokenPipeException from all nodes. Splitting into slices of 1000 fixed that issue. Is there any documentation on this? Obviously these limits will vary by cluster capacity, but for new users it would be great to know that you can run into problems with large queries, and how they present themselves when you hit them. The errors I saw are pretty opaque, and took me a couple days to track down. In any case this seems like a bug to me - it shouldn't be possible to completely lock up a cluster with a valid query that isn't doing a table scan, should it? On Wed, Jun 11, 2014 at 9:33 AM, Jeremy Jongsma jer...@barchart.com wrote: I'm using Astyanax with a query like this: clusterContext .getClient() .getKeyspace(instruments) .prepareQuery(INSTRUMENTS_CF) .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM) .getKeySlice(new String[] { ROW1, ROW2, // 20,000 keys here... ROW2 }) .execute(); At the time this query executes the first time (resulting in unresponsive cluster), there are zero rows in the column family. 
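The slicing fix described in this thread (bounded read slices, 1000-row update slices) can be sketched generically. This is a hypothetical helper, not code from the thread; execute stands in for whatever driver call performs the batched mutation, and IOError stands in for a BrokenPipe-style transient failure, with a simple retry so one failed slice doesn't abort the whole job immediately.

```python
# Hypothetical sketch of slicing a large update into bounded batches.
# 'execute' is a placeholder for the driver's batched mutation call;
# IOError stands in for a BrokenPipe-style transient failure.

def update_in_slices(execute, rows, slice_size=1000, retries=2):
    """Apply updates in bounded slices; retry a failed slice before giving up."""
    done = 0
    for i in range(0, len(rows), slice_size):
        batch = rows[i:i + slice_size]
        for attempt in range(retries + 1):
            try:
                execute(batch)  # one bounded request per slice
                break
            except IOError:
                if attempt == retries:
                    raise  # exhausted retries for this slice
        done += len(batch)
    return done
```

Keeping each request bounded is what prevents the coordinator-side memory blowup described above; the retry loop just makes a transient failure on one slice recoverable.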
Frequent secondary index sstable corruption
I'm in the process of migrating data over to cassandra for several of our apps, and a few of the schemas use secondary indexes. Four times in the last couple months I've run into a corrupted sstable belonging to a secondary index, but have never seen this on any other sstables. When it happens, any query against the secondary index just hangs until the node is fixed. It's making me a bit nervous about using secondary indexes in production. This has usually happened after a bulk data import, so I am wondering if the firehose method of dumping initial data into cassandra (write consistency = any) is causing some sort of write concurrency issue when it comes to secondary indexes. Has anyone else experienced this? The cluster is running 1.2.16 on 4x EC2 m1.large instances.
Large number of row keys in query kills cluster
I ran an application today that attempted to fetch 20,000+ unique row keys in one query against a set of completely empty column families. On a 4-node cluster (EC2 m1.large instances) with the recommended memory settings (2 GB heap), every single node immediately ran out of memory and became unresponsive, to the point where I had to kill -9 the cassandra processes. Now clearly this query is not the best idea in the world, but the effects of it are a bit disturbing. What could be going on here? Are there any other query pitfalls I should be aware of that have the potential to explode the entire cluster? -j
Re: Large number of row keys in query kills cluster
I didn't explain clearly - I'm not requesting 2 unknown keys (resulting in a full scan), I'm requesting 2 specific rows by key. On Jun 10, 2014 6:02 PM, DuyHai Doan doanduy...@gmail.com wrote: Hello Jeremy Basically what you are doing is to ask Cassandra to do a distributed full scan on all the partitions across the cluster, it's normal that the nodes are somehow stressed. How did you make the query? Are you using Thrift or CQL3 API? Please note that there is another way to get all partition keys : SELECT DISTINCT partition_key FROM..., more details here : www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3 I ran an application today that attempted to fetch 20,000+ unique row keys in one query against a set of completely empty column families. On a 4-node cluster (EC2 m1.large instances) with the recommended memory settings (2 GB heap), every single node immediately ran out of memory and became unresponsive, to the point where I had to kill -9 the cassandra processes. Now clearly this query is not the best idea in the world, but the effects of it are a bit disturbing. What could be going on here? Are there any other query pitfalls I should be aware of that have the potential to explode the entire cluster? -j
Re: Question about replacing a dead node
A dead node is still allocated key ranges, and Cassandra will wait for it to come back online rather than redistributing its data. It needs to be decommissioned or replaced by a new node for it to be truly dead as far as the cluster is concerned. On Tue, Jun 3, 2014 at 11:12 AM, Prem Yadav ipremya...@gmail.com wrote: Hi, in the last week, we saw at least two emails about dead node replacement. Though I saw the documentation about how to do this, I am not sure I understand why this is required. Assuming replication factor is 2, if a node dies, why does it matter? If a new node is added, shouldn't it just take the chunk of data it serves as the primary node from the other existing nodes? Why do we need to worry about replacing the dead node? Thanks
Re: Cassandra snapshot
I wouldn't recommend doing this before regular backups for the simple reason that for large data sets it will take a long time to run, and it will require that your node backup schedule be properly staggered (you should never be running repair on all nodes at the same time). Backups should be treated as eventually consistent, just like Cassandra itself. That said, if you are doing a one-time backup of a node and for whatever reason you want it as up-to-date as possible without unnecessary data, you should also run nodetool compact. On Mon, Jun 2, 2014 at 2:18 PM, ng pipeli...@gmail.com wrote: I need to make sure that all the data is in sstables before taking the snapshot. I am thinking of:

nodetool cleanup
nodetool repair
nodetool flush
nodetool snapshot

Am I missing anything else? Thanks in advance for the responses/suggestions. ng
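As a sketch of the sequence being discussed, here is a small hypothetical helper that builds the nodetool command list in order: flush goes immediately before snapshot so memtable contents reach sstables, and compact is the optional extra step suggested in the reply. snapshot_plan is an invented name, and the command strings assume each nodetool subcommand accepts a keyspace argument, which it does for cleanup, repair, compact, flush, and snapshot.

```python
# Hypothetical dry-run helper: build the maintenance commands in order
# rather than executing them, so the sequence itself can be inspected.

def snapshot_plan(keyspace, pre_compact=False):
    steps = ["cleanup", "repair"]
    if pre_compact:
        steps.append("compact")  # optional: minimize snapshot size first
    # flush must come last before snapshot so memtables reach sstables
    steps += ["flush", "snapshot"]
    return ["nodetool %s %s" % (s, keyspace) for s in steps]
```

In practice you would feed each command to your scheduler or shell one node at a time, staggered across the cluster as recommended above.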
Re: Managing truststores with inter-node encryption
It appears that only adding the CA certificate to the truststore is sufficient for this. On Thu, May 22, 2014 at 10:05 AM, Jeremy Jongsma jer...@barchart.com wrote: The docs say that each node needs every other node's certificate in its local truststore: http://www.datastax.com/documentation/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html This seems like a bit of a headache for adding nodes to a cluster. How do others deal with this? 1) If I am self-signing the client certificates (with puppetmaster), is it enough that the truststore just contain the CA certificate used to sign them? This is the typical PKI mechanism for verifying trust, so I am hoping it works here. 2) If not, can I use the same certificate for every node? If so, what is the downside? I'm mainly concerned with encryption over public internet links, not node identity verification.
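As a client-side analog of that CA-only truststore, Python's ssl module expresses the same PKI idea: trust the one CA certificate that signed the node certs instead of enumerating every node certificate. This is an illustrative sketch; make_ssl_context and the ca_file path are hypothetical, and Cassandra's inter-node truststore is actually a JVM keystore managed with keytool, not a Python SSLContext.

```python
# Illustrative sketch of CA-based trust (not Cassandra's actual JVM
# truststore): any peer certificate signed by the CA is accepted.
import ssl

def make_ssl_context(ca_file=None):
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.verify_mode = ssl.CERT_REQUIRED  # always verify the peer
    if ca_file:
        # ca_file is a hypothetical path to the CA cert that signed
        # the per-node certificates (e.g. your puppetmaster CA).
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The operational win is the same one confirmed above: adding a node means signing one new cert with the existing CA, not redistributing an updated truststore to every node.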
Managing truststores with inter-node encryption
The docs say that each node needs every other node's certificate in its local truststore: http://www.datastax.com/documentation/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html This seems like a bit of a headache for adding nodes to a cluster. How do others deal with this? 1) If I am self-signing the client certificates (with puppetmaster), is it enough that the truststore just contain the CA certificate used to sign them? This is the typical PKI mechanism for verifying trust, so I am hoping it works here. 2) If not, can I use the same certificate for every node? If so, what is the downside? I'm mainly concerned with encryption over public internet links, not node identity verification.