Re: Can you query Cassandra while it's doing major compaction
At Netflix we rotate major compactions around the cluster rather than running them all at once. We also either take the node out of client traffic so it doesn't get used as a coordinator, or use the Astyanax client, which is latency and token aware, to steer traffic to the other replicas. We are running on EC2 with lots of CPU and RAM but only two internal disk spindles, so if you have lots of IOPS available in your configuration, you may find that compaction doesn't affect read times much.

Adrian

On Thu, Feb 2, 2012 at 11:44 PM, Peter Schuller peter.schul...@infidyne.com wrote:

>> If every node in the cluster is running major compaction, would it be able to answer any read request? And is it wise to write anything to a cluster while it's doing major compaction?
>
> Compaction is something that is supposed to be continuously running in the background. As noted, it will have a performance impact in that it (1) generates I/O, (2) leads to cache eviction, and (3) adds CPU load (if you're CPU bound rather than disk bound). But there is no intention that clients should have to care about compaction; it's to be viewed as a background operation that is continuously happening. A good rule of thumb is that an individual node should be able to handle your traffic while doing compaction; you don't want to be in the position where you're only just keeping up with the traffic, and a node doing compaction can no longer handle it.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
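For reference, a minimal sketch of the latency/token-aware Astyanax setup mentioned above, assuming the early Astyanax API; the cluster, keyspace, and seed addresses here are made up:

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class TokenAwareSetup {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("prod")                 // hypothetical cluster name
            .forKeyspace("my_keyspace")         // hypothetical keyspace
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)      // learn the ring from the seeds
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)) // route requests to replica nodes
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("10.0.0.1:9160,10.0.0.2:9160")) // hypothetical seeds
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getEntity(); // renamed getClient() in later Astyanax releases
    }
}

With TOKEN_AWARE routing, each request goes directly to a node that owns a replica of the key, which is what lets traffic steer around a node that is busy compacting.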
Benchmarking Cassandra scalability to over 1M writes/s on AWS
Hi folks,

We just posted a detailed Netflix technical blog entry on this:
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Hope you find it interesting/useful.

Cheers,
Adrian
Re: Very slow writes in Cassandra
You are using a replication factor of one and the Lustre clustered filesystem over the network. Not good practice. Try RF=3 and local disks. Lustre duplicates much of the functionality of Cassandra; there is no point using both. Make your Lustre server nodes into Cassandra nodes instead.

Adrian

On Sun, Oct 30, 2011 at 6:33 PM, Evgeny erepe...@cmcrc.com wrote:
> Hello Cassandra users,
>
> I'm a newbie to NoSQL, and to Cassandra in particular. At the moment I am doing some benchmarking with Cassandra and experiencing very slow write throughput. It is said that Cassandra can perform hundreds of thousands of inserts per second; however, I'm not observing this:
>
> 1) when I send 100 thousand inserts simultaneously via 8 CQL clients, throughput is ~14470 inserts per second.
> 2) when I do the same via 8 Thrift clients, throughput is ~16300 inserts per second.
>
> I think Cassandra performance can be improved, but I don't know what to tune. Please take a look at the test conditions below and advise. Thank you.
>
> Test conditions:
>
> 1. The Cassandra cluster is deployed on three machines; each machine has 8 cores (Intel(R) Xeon(R) CPU E5420 @ 2.50GHz), 16GB of RAM, and a 1000Mb/s network.
>
> 2. The data sample is:
>
> set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '1.0';
> set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA1';
> set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '47.1';
> set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['volume'] = '300.0';
> set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['se'] = '1';
> set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '2.0';
> set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA1';
> set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '44.89';
> set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['volume'] = '310.0';
> set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['se'] = '1';
> set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '3.0';
> set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA2';
> set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '0.35';
>
> 3. The commit log is written to the local hard drive; the data is written to Lustre.
>
> 4. Keyspace description:
>
> Keyspace: MD:
>   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>   Durable Writes: true
>   Options: [datacenter1:1]
>   Column Families:
>     ColumnFamily: MM
>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>       Default column value validator: org.apache.cassandra.db.marshal.BytesType
>       Columns sorted by: org.apache.cassandra.db.marshal.BytesType
>       Row cache size / save period in seconds: 0.0/0
>       Key cache size / save period in seconds: 20.0/14400
>       Memtable thresholds: 2.3247/1440/496 (millions of ops/minutes/MB)
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Built indexes: []
>
> Thanks in advance.
> Evgeny.
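A sketch of the RF change suggested above, against the 0.8-era Thrift API (client setup omitted, and the exact call behavior should be checked against your release):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class RaiseReplicationFactor {
    // Rewrite the MD keyspace with three replicas in datacenter1 instead of one.
    // After this change, run "nodetool repair" so existing rows gain their new replicas.
    static void update(Cassandra.Client client) throws Exception {
        KsDef md = client.describe_keyspace("MD");
        Map<String, String> opts = new HashMap<String, String>();
        opts.put("datacenter1", "3"); // was [datacenter1:1] in the posted schema
        md.setStrategy_options(opts);
        // As I recall, system_update_keyspace rejects embedded CF definitions,
        // so clear them and update only the keyspace-level settings.
        md.setCf_defs(new ArrayList<CfDef>());
        client.system_update_keyspace(md);
    }
}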
Re: selective replication
This has been proposed a few times, and there are some good use cases for it, but there is no current mechanism for it. It has been discussed as a possible enhancement.

Adrian

On Wed, Sep 14, 2011 at 11:06 AM, Todd Burruss bburr...@expedia.com wrote:
> Has anyone done any work on what I'll call selective replication between DCs? I want to use Cassandra to replicate data to another virtual DC (for analytical purposes), but only inserts, not deletes.
>
> Picture having two data centers, DC1 for OLTP of short-lived data (say a 90 day window) and DC2 for OLAP (years of data). DC2 would probably be a Brisk setup. In this scenario, clients would get/insert/delete from DC1 (the OLTP system) and DC1 would replicate inserts only to DC2 (the OLAP system) for analytics.
>
> I don't have any experience (yet) with multi-DC replication, but I don't think this is possible. Thoughts?
Re: Cassandra as in-memory cache
You should be using the off-heap row cache option. That way you avoid GC overhead, and the rows are stored in a compact serialized form, which means you get more cache entries into RAM. The trade-off is slightly more CPU for deserialization.

Adrian

On Sunday, September 11, 2011, aaron morton aa...@thelastpickle.com wrote:
> If the row cache is enabled, the read path will not use the sstables. Depending on the workload I would then look at setting *low* memtable flush settings to use as much memory as possible for the row cache. If the row is in the row cache, the read path will not look at SSTables. Then set the row cache save settings per CF to ensure the cache is warmed when the node starts.
>
> The write path will still use the WAL (commit log), so you may want to disable it using the durable_writes setting on the keyspace.
>
> Hope that helps.
>
> - Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/09/2011, at 4:38 AM, kapil nayar wrote:
>> Hi,
>> Can we configure some column-families (or keyspaces) in Cassandra to perform as a pure in-memory cache? The feature should let the memtables always stay in memory (never flushed to disk as sstables). The memtable flush thresholds for time/memory/operations can be set to a max value to achieve this.
>> However, it seems uneven distribution of the keys across the nodes in the cluster could lead to a Java out-of-memory error. In order to prevent this error, can we overflow some entries to the disk?
>> Thanks,
>> Kapil
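A sketch of switching a column family to the serializing (off-heap) cache through the 0.8-era Thrift API; the same change can be made from the CLI, and as I recall the off-heap provider needs JNA on the classpath. Keyspace and CF names are made up:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;

public class OffHeapRowCache {
    static void enable(Cassandra.Client client) throws Exception {
        // Fetch the live definition so other CF settings are preserved.
        for (CfDef cf : client.describe_keyspace("MyKeyspace").getCf_defs()) {
            if (cf.getName().equals("MyCF")) {
                cf.setRow_cache_size(500000);                  // rows cached per node
                cf.setRow_cache_save_period_in_seconds(14400); // re-warm the cache on restart
                cf.setRow_cache_provider("SerializingCacheProvider"); // off-heap, serialized entries
                client.system_update_column_family(cf);
            }
        }
    }
}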
Re: Direct control over where data is stored?
Sounds like Khanh thinks he can do joins... :-)

User-oriented data is easy: key by Facebook id and let Cassandra handle location. Set replication factor 3 so you don't lose data and can do consistent (but slower) read-after-write when you need it, using quorum. If you are running on AWS you should distribute your replicas over availability zones. Then you can read A, read B, and join them in your app code, at single-digit milliseconds for each read or write. If you want to do bulk operations over many users, use Brisk with a Hadoop job.

HTH
Adrian

On Sat, Jun 4, 2011 at 9:32 PM, Maki Watanabe watanabe.m...@gmail.com wrote:
> You may be able to do it with the order-preserving partitioner by building a key-to-node mapping before storing data, or you may need a custom partitioner. Please note that you are then responsible for distributing load between the nodes. From an application design perspective, it is not clear to me why you need to store user A and his friends on the same box.
>
> maki
>
> 2011/6/5 Khanh Nguyen nguyen.h.kh...@gmail.com:
>> Hi everyone,
>>
>> Is it possible to have direct control over where objects are stored in Cassandra? For example, I have a Cassandra cluster of 4 machines and 4 objects A, B, C, D; I want to store A at machine 1, B at machine 2, C at machine 3 and D at machine 4. My guess is that I need to intervene in the way Cassandra hashes an object into the keyspace? If so, how complicated will the task be? I'm new to the list and to Cassandra.
>>
>> The reason I am asking is that my current project is related to social locality of data: if A and B are Facebook friends, I want to store their data as close as possible, preferably on the same machine in a cluster. Thank you.
>>
>> Regards,
>>
>> -k
>>
>> --
>> w3m
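A sketch of the read-A/read-B/join-in-app pattern with quorum reads, written against the Hector client of the day; the keyspace, CF, and key names are made up:

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.ColumnQuery;

public class FriendJoin {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("app", "10.0.0.1:9160"); // hypothetical seed
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM); // consistent but slower reads
        Keyspace ks = HFactory.createKeyspace("social", cluster, ccl); // hypothetical keyspace

        ColumnQuery<String, String, String> q = HFactory.createColumnQuery(
            ks, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
        q.setColumnFamily("Users").setName("profile");

        HColumn<String, String> a = q.setKey("facebook:userA").execute().get(); // read A
        HColumn<String, String> b = q.setKey("facebook:userB").execute().get(); // read B
        // "Join" the two reads in application code.
        System.out.println(a.getValue() + " is friends with " + b.getValue());
    }
}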
Re: batch dump of data from cassandra?
Hi Yang,

You could also use Hadoop (i.e. Brisk) and run a MapReduce job or Hive query to extract and summarize/renormalize the data into whatever format you like. If you use sstable2json, you have to run it on every file on every node, then deduplicate/merge all the output across machines, which is what MapReduce does anyway.

Our data flow is to take backups of a production cluster, restore a backup to a different cluster running Hadoop, then run our point-in-time data extraction for ETL processing by the BI team. The backup/restore gives a frozen-in-time (consistent to a second or so) cluster for extraction. Running live with Brisk means you are running your extraction over a moving target.

Adrian

On Sun, May 22, 2011 at 11:14 PM, Yang tedd...@gmail.com wrote:
> Thanks Jonathan.
>
> On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis jbel...@gmail.com wrote:
>> I'd modify SSTableExport.serializeRow (the sstable2json class) to output to whatever system you are targeting.
>>
>> On Sun, May 22, 2011 at 11:19 PM, Yang tedd...@gmail.com wrote:
>>> Let's say that periodically (daily) I need to dump out the contents of my Cassandra DB and import them into Oracle or some other custom data store. Is there a way to do it?
>>>
>>> I checked that you can do multiget(), but you probably can't pass the entire key domain into the API, because the entire DB would be returned on a single Thrift call and would probably overflow the API. Plus, multiget underneath just sends out per-key lookups one by one, while I really do not care which key corresponds to which result; a simple scraping of the underlying SSTable would be perfect, because I could utilize file cache coherency as I read down the file.
>>>
>>> Thanks
>>> Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
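A sketch of the Hadoop route for plain Cassandra, patterned on the bundled word_count example: a map-only job that scans a column family with ColumnFamilyInputFormat and dumps one text line per row. The keyspace, CF, addresses and output path are made up, and the ConfigHelper method names shifted between 0.7 and 0.8, so check them against your release:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class BulkExport {
    // One map call per row: emit "key<TAB>name=value name=value ..." text lines.
    public static class ExportMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context ctx) throws IOException, InterruptedException {
            StringBuilder row = new StringBuilder();
            for (IColumn col : columns.values()) {
                row.append(ByteBufferUtil.string(col.name())).append('=')
                   .append(ByteBufferUtil.string(col.value())).append(' ');
            }
            ctx.write(new Text(ByteBufferUtil.string(key)), new Text(row.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-export");
        job.setJarByClass(BulkExport.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0); // map-only dump; add a reducer to renormalize
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/exports/mm")); // hypothetical path

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF");
        ConfigHelper.setRpcPort(conf, "9160");
        ConfigHelper.setInitialAddress(conf, "10.0.0.1");
        ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        // Slice up to the first 1000 columns of every row.
        SlicePredicate all = new SlicePredicate().setSlice_range(new SliceRange(
            ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));
        ConfigHelper.setInputSlicePredicate(conf, all);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A Hive query on Brisk does the equivalent of this without the Java.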
Re: batch dump of data from cassandra?
There are three ways to do this:

1. The client app does a get for every key: lots of small network operations.
2. Brisk/Hive does a select(*), which is sent to each node to map, and the Hadoop network shuffle then merges the results.
3. Write your own code to merge all the SSTables across the cluster.

So I think that Brisk is going to be easier to implement but also close in efficiency to the way you want to do it.

Adrian

On Monday, May 23, 2011, Yang tedd...@gmail.com wrote:
> thanks Sri
>
> I am trying to make sure that Brisk underneath does a simple scraping of the rows, instead of doing foreach key (keys) { lookup(key) }. After that, I can feel comfortable using Brisk for the import/export jobs.
>
> yang
>
> On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati srisat...@datastax.com wrote:
>> Adrian, +1
>>
>> Using Hive/Hadoop for the export-import of data from/to Cassandra is one of the original use cases we had in mind for Brisk. It also has the ability to parallelize the workload and finish rapidly.
>>
>> thanks,
>> Sri
>>
>> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft adrian.cockcr...@gmail.com wrote:
>>> Hi Yang,
>>>
>>> You could also use Hadoop (i.e. Brisk) and run a MapReduce job or Hive query to extract and summarize/renormalize the data into whatever format you like. [...]
>>
>> --
>> SriSatish Ambati
>> Director of Engineering, DataStax
>> @srisatish
Re: Ec2 Stress Results
Hi Alex,

This has been a useful thread; we've been comparing your numbers with our own tests. Why did you choose four big instances rather than more smaller ones? For $8/hr you get four m2.4xl with a total of 8 disks. For $8.16/hr you could have twelve m1.xl with a total of 48 disks: 3x the disk space, a bit less total RAM, and much more CPU. When an instance fails, you lose 25% of your capacity with 4 nodes, but only 8% with 12.

I don't think it makes sense (especially on EC2) to run fewer than 6 instances; we are mostly starting at 12-15. We can also spread the instances over three EC2 availability zones, with RF=3 and one copy of the data in each zone.

Cheers
Adrian

On Wed, May 11, 2011 at 5:25 PM, Alex Araujo cassandra-us...@alex.otherinbox.com wrote:
> On 5/9/11 9:49 PM, Jonathan Ellis wrote:
>> How many replicas are you writing?
>
> Replication factor is 3.
>
>> So you're actually spot on the predicted numbers: you're pushing 20k*3=60k raw rows/s across your 4 machines. You might get another 10% or so from increasing memtable thresholds, but the bottom line is you're right around what we'd expect to see. Furthermore, CPU is the primary bottleneck, which is what you want to see on a pure write workload.
>
> That makes a lot more sense. I upgraded the cluster to 4 m2.4xlarge instances (68GB of RAM/8 CPU cores) in preparation for application stress tests, and the results were impressive @ 200 threads per client:
>
>   Server Nodes:             4
>   Client Nodes:             3
>   --keep-going:             N
>   Columns:                  1000
>   Client Threads:           200
>   Total Threads:            600
>   Rep Factor:               3
>   Test Rate (writes/s):     44644
>   Cluster Rate (writes/s):  133931
>
> The issue I'm seeing with app stress tests is that the rate will be comparable/acceptable at first (~100k w/s) and will degrade considerably (~48k w/s) until a flush and restart. CPU usage will correspondingly be high at first (500-700%) and taper down to 50-200%. My data model is pretty standard (this is pseudo-type information):
>
> Users (Column):
>   UserId<32CharHash>: { email<String>: a...@b.com, first_name<String>: John, last_name<String>: Doe }
>
> UserGroups (SuperColumn):
>   GroupId<UUID>: { UserId<32CharHash>: { date_joined<DateTime>: 2011-05-10 13:14.789, date_left<DateTime>: 2011-05-11 13:14.789, active<short>: 0|1 } }
>
> UserGroupTimeline (Column):
>   GroupId<UUID>: { date_joined<TimeUUID>: UserId<32CharHash> }
>
> UserGroupStatus (Column):
>   CompositeId('GroupId<UUID>:UserId<32CharHash>'): { active<short>: 0|1 }
>
> Every new User has a row in Users and a column or supercolumn in the other 3 CFs (a total of 4 operations). One notable difference is that the RAID0 on this instance type (surprisingly) only contains two ephemeral volumes, and appears a bit more saturated in iostat, although not enough to clearly stand out as the bottleneck.
>
> Is the bottleneck in this scenario likely memtable flush and/or commitlog rotation settings? RF = 2; ConsistencyLevel = One; -Xmx = 6GB; concurrent_writes: 64; all other settings are the defaults.
>
> Thanks, Alex.
Re: best way to backup
Netflix has also gone down this path. We run a regular full backup to S3 of a compressed tar, and we have scripts that restore everything into the right place on a different cluster (it needs the same node count). We also pick up the SSTables as they are created and drop them into S3.

Whatever you do, make sure you have a regular process to restore the data and verify that it contains what you think it should...

Adrian

On Thu, Apr 28, 2011 at 1:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:
> One thing we're looking at doing is watching the cassandra data directory and backing up the sstables to S3 when they are created. Some guys at SimpleGeo started tablesnap, which does this: https://github.com/simplegeo/tablesnap
>
> For every sstable that is pushed to S3, it also copies a JSON file listing the current files in the directory, so you know what to restore in that event (as far as I understand).
>
> On Apr 28, 2011, at 2:53 PM, William Oberman wrote:
>> Even with N nodes for redundancy, I still want to have backups. I'm an Amazon person, so naturally I'm thinking S3. Reading over the docs and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how Cassandra uses hard links to avoid excessive disk use). When does that pattern break down?
>>
>> I'm basically debating whether I can do an rsync-like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file name is reused over time (only matters in the rsync case). My only concerns with compressed tars are that I'll have to have free space to create the archive, and that I get no delta space savings on the backup (the former is solved by not allowing disk space to get so low and/or adding more nodes to bring down the space; the latter is solved by S3 being really cheap anyway).
>>
>> --
>> Will Oberman
>> Civic Science, Inc.
>> 3030 Penn Avenue., First Floor
>> Pittsburgh, PA 15201
>> (M) 412-480-7835
>> (E) ober...@civicscience.com
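For the snapshot step, a sketch of making the same JMX call that nodetool snapshot makes; the StorageService operation name and signature changed across versions (takeSnapshot/takeAllSnapshot), so treat the names here as assumptions and check your StorageServiceMBean:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SnapshotNode {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi"); // 0.7-era default JMX port
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
            // Snapshot every keyspace under a tag; the resulting hard-linked
            // SSTables under each snapshots/ directory can then be tarred to S3.
            mbs.invoke(ss, "takeAllSnapshot",
                       new Object[] { "nightly" },
                       new String[] { String.class.getName() });
        } finally {
            jmxc.close();
        }
    }
}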
Re: Multi-DC Deployment
Queues replicate bad data just as well as anything else. The biggest source of bad data is broken app code... You will still need to implement a reconciliation/repair checker, as queues have their own failure modes when they get backed up.

We have also looked at using queues to bounce data between Cassandra clusters for other reasons, and they have their place. However, it is a lot more work to implement than using existing, well-tested Cassandra functionality to do it for us.

I think your code needs to retry a failed local-quorum read with a read-one to get the behavior you are asking for.

Our approach to bad data and corruption issues is backups: wind back to the last good snapshot. We have figured out incremental backups as well as full. Our code has some local dependencies, but it could be the basis for a generic solution.

Adrian

On Wed, Apr 20, 2011 at 6:08 PM, Terje Marthinussen tmarthinus...@gmail.com wrote:
> Assuming that you generally put an API on top of this, delivering to two or more systems then boils down to a message queue issue or some similar mechanism which handles secure delivery of messages. Maybe not trivial, but there are many products that can help you with this, and it is a lot easier to implement than a fully distributed storage system.
>
> Yes, ideally Cassandra will not distribute corruption, but the reason you pay up to have 2 fully redundant setups in 2 different datacenters is because we do not live in an ideal world. Anyone who has tested Cassandra since 0.7.0 with any real data will be able to testify how well it can mess things up. This is not specific to Cassandra; in fact, I would argue that this is in the blood of any distributed system. You want them to distribute, after all, and the tighter the coupling is between nodes, the better they distribute bad stuff as well as good stuff. There is a bigger risk of a complete failure with 2 tightly coupled redundant systems than with 2 almost completely isolated ones. The logic here is so simple it is really somewhat beyond discussion.
>
> There are a few other advantages to isolating the systems. Especially in terms of operation, 2 isolated systems would be much easier, as you could relatively risk-free try out a new Cassandra in one datacenter, or upgrade one datacenter at a time if you needed major operational changes such as schema changes or other large changes to the data.
>
> I see the 2 copies in one datacenter + 1 (or maybe 2) in another as a low-cost middle way between 2 full N+2 (RF=3) systems in both data centers. That is, in a traditional design where you need 1 node for normal service, you would have 1 extra replica for redundancy and one replica more (N+2 redundancy) so you can do maintenance and still be redundant. If I have redundancy across datacenters, I would probably still want 2 replicas to avoid network traffic between DCs in case of a node recovery, but N+2 may not be needed, as my risk policy may find it acceptable to run one datacenter without redundancy for a time-limited maintenance period. That is, if my original requirement is 1 node, I could do with 3x the HW, which is not all that much more than the 3x I need for one DC, and a lot less than the 6x I need for 2 full N+2 systems.
>
> However, all of the above is really beyond the point of my original suggestion. Regardless of datacenters, redundancy and distribution of bad or good stuff, it would be good to have a way to return whatever data is there, but with a flag or similar stating that the consistency level was not met. Again, for a lot of services, it is fully acceptable, and a lot better, to return an almost complete (or maybe even complete, but not verified by quorum) result than no result at all. As far as I remember from the code, this just boils down to returning whatever you collected from the cluster and setting the proper flag or similar on the result set rather than returning an error.
>
> Terje
>
> On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft adrian.cockcr...@gmail.com wrote:
>> Hi Terje,
>>
>> If you feed data to two rings, you will get inconsistency drift as an update to one succeeds and to the other fails from time to time. You would have to build your own read repair. This all starts to look like "I don't trust Cassandra code to work, so I will write my own buggy one-off versions of Cassandra functionality". I lean towards using Cassandra features rather than rolling my own because there is a large community testing, fixing and extending Cassandra, and making sure that the algorithms are robust. Distributed systems are very hard to get right; I trust lots of users and eyeballs on the code more than even the best engineer working alone.
>>
>> Cassandra doesn't replicate sstable corruptions. It detects corrupt data and only replicates good data. Also, data isn't replicated to three identical nodes in the way you imply; it's replicated around the ring. If you lose three nodes, you don't lose a whole node's
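A sketch of that local-quorum-then-read-one retry using the Astyanax API, with made-up keyspace and column family names; the caller has to treat the fallback result as not quorum-verified:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class FallbackRead {
    private static final ColumnFamily<String, String> CF =
        ColumnFamily.newColumnFamily("MyCF", StringSerializer.get(), StringSerializer.get());

    static ColumnList<String> read(Keyspace keyspace, String rowKey) throws ConnectionException {
        try {
            return keyspace.prepareQuery(CF)
                           .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                           .getKey(rowKey)
                           .execute().getResult();
        } catch (ConnectionException quorumFailed) {
            // Local quorum unavailable or timed out: take whatever a single
            // replica has, and flag the result as unverified at the app level.
            return keyspace.prepareQuery(CF)
                           .setConsistencyLevel(ConsistencyLevel.CL_ONE)
                           .getKey(rowKey)
                           .execute().getResult();
        }
    }
}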
Re: Multi-DC Deployment
If you want to use local quorum for a distributed setup, it doesn't make sense to have less than RF=3 local and remote. Three copies at both ends will give you high availability. Only one copy of the data is sent over the wide-area link (with recent versions). There is no need to use mirrored or RAID5 disks in each node in this case, since you are using RAIN (N for nodes) to protect your data, so the extra disk space to hold three copies at each end shouldn't be a big deal. Netflix is using striped internal disks on EC2 nodes for this.

Adrian

On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen tmarthinus...@gmail.com wrote:
> Hum... Seems like it could be an idea in a case like this to have a mode where a result is always returned (if possible), but with a flag saying whether the consistency level was met, or to what level it was met (the number of nodes answering, for instance)?
>
> Terje
>
> On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis jbel...@gmail.com wrote:
>> They will time out until the failure detector realizes the DC1 nodes are down (~10 seconds). After that they will immediately return UnavailableException until DC1 comes back up.
>>
>> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu baskar.duraikannu...@gmail.com wrote:
>>> We are planning to deploy Cassandra on two data centers. Let us say that we went with three replicas, with 2 being in one data center and the last replica in the 2nd data center.
>>>
>>> What will happen to quorum reads and writes when DC1 goes down (2 of 3 replicas are unreachable)? Will they time out?
>>>
>>> Regards,
>>> Baskar
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
Re: Pyramid Organization of Data
We have similar requirements for wide-area backup/archive at Netflix. I think what you want is a replica with RF of at least 3 in NY for all the satellites; then each satellite could have a lower RF, but if you want a safe local quorum I would use 3 everywhere. NY is then the sum of all the satellites, so that makes the most use of the disk space.

For archival storage I suggest you use snapshots in NY and save compressed tar files of each keyspace in NY. We've been working on this to allow full and incremental backup and restore from our EC2-hosted Cassandra clusters to/from S3. Full backup/restore works fine; incremental and per-keyspace restore is being worked on.

Adrian

From: Patrick Julien pjul...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thu, 14 Apr 2011 05:38:54 -0700
To: user@cassandra.apache.org
Subject: Re: Pyramid Organization of Data

> Thanks, I'm still working the problem, so anything I find out I will post here.
>
> Yes, you're right, that is the question I am asking. No, adding more storage is not a solution, since New York would have several hundred times more storage.
>
> On Apr 14, 2011 6:38 AM, aaron morton aa...@thelastpickle.com wrote:
>> I think your question is: NY is the archive; after a certain amount of time we want to delete the row from the original DC but keep it in the archive in NY.
>>
>> Once you delete a row, it's deleted as far as the client is concerned. GCGraceSeconds is only concerned with when the tombstone marker can be removed. If NY has a replica of a row from Tokyo and the row is deleted in either DC, it will be deleted in the other DC as well.
>>
>> Some thoughts...
>> 1) Add more storage in the satellite DCs, then tilt your chair to celebrate a job well done :)
>> 2) Run two clusters as you say.
>> 3) Just thinking out loud, and I know this does not work now: would it be possible to support per-CF strategy options, so an archive CF only replicates to NY? I can think of possible problems with repair and LOCAL_QUORUM; out of interest, what else would it break?
>>
>> Hope that helps.
>> Aaron
>>
>> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>>> We have been successful in implementing, at scale, the comments you posted here. I'm wondering what we can do about deleting data, however.
>>>
>>> The way I see it, we have considerably more storage capacity in NY, but not in the other sites. Using this technique here, it occurs to me that we would replicate non-NY deleted rows back to NY. Is there a way to tell NY not to tombstone rows?
>>>
>>> The ideas I have so far:
>>> - Set GCGracePeriod to be much higher in NY than in the other sites. This way we can get to tombstoned rows well beyond their disk life in the other sites.
>>> - A variant on this solution is to set the TTL on rows in non-NY sites and, again, set the GCGracePeriod to be considerably higher in NY.
>>> - Break this up into multiple clusters and do one write from the client to its 'local' cluster and one write to the NY cluster.
>>>
>>> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis jbel...@gmail.com wrote:
>>>> No, I'm suggesting you have a Tokyo keyspace that gets replicated as {Tokyo: 2, NYC: 1}, a London keyspace that gets replicated to {London: 2, NYC: 1}, for example.
>>>>
>>>> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien pjul...@gmail.com wrote:
>>>>> I'm familiar with this material. I hadn't thought of it from this angle, but I believe what you're suggesting is that the different data centers would hold a different properties file for node discovery instead of using auto-discovery. So Tokyo, and the others, would have a configuration that makes them oblivious to the non-New York data centers. New York would have a configuration that gives it knowledge of no other data center. Would that work? Wouldn't the NY data center wonder where these other writes are coming from?
>>>>>
>>>>> On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis jbel...@gmail.com wrote:
>>>>>> On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien pjul...@gmail.com wrote:
>>>>>>> The problem is this: we would like the historical data from Tokyo to stay in Tokyo and only be replicated to New York. The one in London to be in London and only be replicated to New York, and so on for all data centers. Is this currently possible with Cassandra? I believe we would need to run multiple clusters and migrate data manually from the data centers to North America to achieve this. Also, any suggestions would be welcomed.
>>>>>>
>>>>>> NetworkTopologyStrategy allows configuring replicas per-keyspace, per-datacenter:
>>>>>> http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers
>>>>>>
>>>>>> --
>>>>>> Jonathan Ellis
>>>>>> Project Chair, Apache Cassandra
>>>>>> co-founder of
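Jonathan's per-keyspace layout, expressed against the 0.8-era Thrift API as a sketch; the datacenter names must match what your snitch reports, and everything here is hypothetical:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class SatelliteKeyspaces {
    static void create(Cassandra.Client client) throws Exception {
        client.system_add_keyspace(keyspace("tokyo_data", "Tokyo"));
        client.system_add_keyspace(keyspace("london_data", "London"));
        // NY holds one replica of every satellite keyspace, so it accumulates
        // the union of all sites while each satellite only stores its own data.
    }

    private static KsDef keyspace(String name, String homeDc) {
        KsDef ks = new KsDef(name,
            "org.apache.cassandra.locator.NetworkTopologyStrategy",
            new ArrayList<CfDef>());
        Map<String, String> replicas = new HashMap<String, String>();
        replicas.put(homeDc, "2"); // local copies for the site's own traffic
        replicas.put("NYC", "1");  // archive copy in New York
        ks.setStrategy_options(replicas);
        return ks;
    }
}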
Re: Pyramid Organization of Data
What you are asking for breaks the eventual consistency model, so you need to create a separate cluster in NYC that collects the same updates but has a much longer setting to time out the data for deletion, or doesn't get the deletes.

One way is to have a trigger on writes on your pyramid nodes in NY that copies data over to the long-term analysis cluster. The two clusters won't be eventually consistent in the presence of failures, but with RF=3 you will get up to three triggers for each write, so you get three chances to get the copy done.

Adrian

On Apr 14, 2011, at 10:18 AM, Patrick Julien pjul...@gmail.com wrote:
> Thanks for your input Adrian, we've pretty much settled on this too. What I'm trying to figure out is how we do deletes.
>
> We want to do deletes in the satellites because:
> a) we'll run out of disk space very quickly with the amount of data we have
> b) we don't need more than 3 days' worth of history in the satellites; we're currently planning for 7 days of capacity
>
> However, the deletes will get replicated back to NY. In NY, we don't want that; we want to run hadoop/pig over all that data dating back several months/years. Even if we set the replication factor of the satellites to 1 and NY to 3, we'll run out of space very quickly in the satellites.
>
> On Thu, Apr 14, 2011 at 11:23 AM, Adrian Cockcroft acockcr...@netflix.com wrote:
>> We have similar requirements for wide-area backup/archive at Netflix. I think what you want is a replica with RF of at least 3 in NY for all the satellites; then each satellite could have a lower RF, but if you want a safe local quorum I would use 3 everywhere. NY is then the sum of all the satellites, so that makes the most use of the disk space. [...]
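Cassandra itself provides no write triggers at this point, so a client-side version of the same idea looks roughly like this (Hector API, hypothetical keyspace and CF names):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class DualWrite {
    private static final StringSerializer SS = StringSerializer.get();

    // Every write goes to the local satellite cluster, then is copied
    // best-effort to the NY archive cluster, which keeps data long after
    // the satellite has deleted it. Deletes are simply never forwarded.
    public static void write(Keyspace satellite, Keyspace archive,
                             String key, String col, String value) {
        Mutator<String> local = HFactory.createMutator(satellite, SS);
        local.insert(key, "MyCF", HFactory.createStringColumn(col, value));

        try {
            Mutator<String> remote = HFactory.createMutator(archive, SS);
            remote.insert(key, "MyCF", HFactory.createStringColumn(col, value));
        } catch (Exception e) {
            // Failures here need the reconciliation/repair checker discussed
            // above: queue the mutation for retry or log it for later repair.
        }
    }
}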
Re: Reclaim deleted rows space
How many nodes do you have? You should be able to run a rolling compaction around the ring, one node at a time, to minimize the impact. If one node is too big an impact, maybe you should have a bigger cluster? If you are on EC2, try running more but smaller instances.

Adrian

From: shimi shim...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Sun, 2 Jan 2011 11:25:42 -0800
To: user@cassandra.apache.org
Subject: Reclaim deleted rows space

> Let's assume I have:
> * a single 100GB SSTable file
> * min compaction threshold set to 2
>
> If I delete rows which are located in this file, is the only way to clean out the deleted rows to insert another 100GB of data, or to trigger a painful major compaction?
>
> Shimi
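The rolling compaction loop can be scripted against NodeProbe (the class behind nodetool); its constructor and the exact compaction method signature vary by version, so treat this as a sketch to adapt rather than a drop-in tool, with made-up hosts and keyspace:

import org.apache.cassandra.tools.NodeProbe;

public class RollingCompact {
    public static void main(String[] args) throws Exception {
        // One major compaction at a time, with a pause in between so only
        // one node is paying the compaction cost at any moment.
        String[] ring = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
        for (String host : ring) {
            NodeProbe probe = new NodeProbe(host, 8080); // 0.7-era default JMX port
            probe.forceTableCompaction("MyKeyspace");    // what "nodetool compact" calls
            Thread.sleep(60 * 60 * 1000L);               // let caches re-warm before the next node
        }
    }
}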
Re: WELCOME to user@cassandra.apache.org
The book is available via safarionline.com - many educational institutions have group memberships, and there is a free trial for individuals. It's run by the publishers, and the authors get paid. There is a NoSQL training video there as well.

Adrian

On Dec 29, 2010, at 8:52 PM, asil klin asdk...@gmail.com wrote:
> Can't I get it for free from anywhere? I am a student researching the Cassandra datastore and have no economically profitable reasons to buy. If any of you could pass me a copy, my email address is asdk...@gmail.com.
>
> Thanks so much!
>
> Asil
>
> On Thu, Dec 30, 2010 at 10:10 AM, Germán Kondolf german.kond...@gmail.com wrote:
>> Hmm... what about just paying for it? It costs less than $20 on Amazon for the Kindle version (http://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt/dp/1449390412).
>>
>> // Germán Kondolf
>> // http://twitter.com/germanklf
>> // http://code.google.com/p/seide/
>> // @iPad
>>
>> On 30/12/2010, at 01:26, asil klin asdk...@gmail.com wrote:
>>> Can anyone pass me a pdf copy of "Cassandra: The Definitive Guide"?
>>>
>>> Thanks.
>>>
>>> Asil
Re: Severe Reliability Problems - 0.7 RC2
What filesystem are you using? You might try ext3 or ext4 vs. XFS as another area of diversity. It sounds as if the page cache or filesystem is messed up. Are there any clues in /var/log/messages? How much swap space do you have configured? The kernel-level debug stuff I know is all for Solaris, unfortunately...

Adrian

From: Dan Hendry dan.hendry.j...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Mon, 20 Dec 2010 12:13:56 -0800
To: user@cassandra.apache.org
Subject: RE: Severe Reliability Problems - 0.7 RC2

> Yes, I have tried that (although only twice). Same impact as a regular kill: nothing happens and I get no stack trace output. It is however on my list of things to try again the next time a node dies. I am also not able to attach jstack to the process. I have also tried disabling JNA (did not help), and I have now changed disk_access_mode from auto to mmap_index_only on two of the nodes.
>
> Dan
>
> From: Kani [mailto:javier.canil...@gmail.com]
> Sent: December-20-10 14:14
> To: user@cassandra.apache.org
> Subject: Re: Severe Reliability Problems - 0.7 RC2
>
>> Have you tried sending KILL -3 to the Cassandra process before you send KILL -9? That way you will see what the threads are doing (and maybe blocking on). The majority of the threads may point you to the right spot to look for the problem.
>>
>> I'm not much of a Linux administrator, but when something goes weird in one of my own applications (Java running on a Linux box) I try that command to see what the application is doing, or trying to do.
>>
>> Kani
>>
>> On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry dan.hendry.j...@gmail.com wrote:
>>> I have been having severe and strange reliability problems with my Cassandra cluster. This weekend, all four of my nodes were down at once. Even now I am losing one every few hours. I have attached output from all the system monitoring commands I can think of.
>>>
>>> What seems to happen is that the java process locks up and sits at 100% system CPU usage (but no user CPU); there are 8 cores, so 100% = 1/8 total capacity. JMX freezes and the node effectively dies, but there is typically nothing unusual in the Cassandra logs. About the only thing which seems to be correlated is the flushing of memtables. One of the strangest stats I am getting in this state is memory paging: 3727168.00 pages scanned/second (see the sar -B output). Occasionally, if I leave the process alone (~1 h), it recovers (maybe 1 in 5 times); otherwise the only way to terminate the Cassandra process is with kill -9. When this happens, Cassandra memory usage (as reported by JMX before it dies) is also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB system).
>>>
>>> This feels more like a system-level problem than a Cassandra problem, so I have tried diversifying my cluster: one node runs Ubuntu 10.10, the other three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22). Neither change seems to be correlated with the problem. These are pretty much stock Ubuntu installs, so nothing special on that front.
>>>
>>> Now, this has been a relatively sudden development, and I can potentially attribute it to a few things:
>>> 1. Upgrading to RC2.
>>> 2. Ever-increasing amounts of data (there is less than 100 GB per node, so this should not be the problem).
>>> 3. Migrating from a set of machines where data+commitlog directories were on four small RAID 5 HDs to machines with two 500 GB drives: one for data and one for commitlog + OS. I have seen more IO wait on these new machines, but they have the same memory and system settings.
>>>
>>> I am about at my wits' end on this one; any help would be appreciated.
Re: Cassandra Monitoring
I'm currently working to configure AppDynamics to monitor Cassandra. It does byte-code instrumentation, so an agent is added to the Cassandra JVM, which gives the ability to capture latency for requests and see where any bottleneck is coming from. We have been using it on our other Java apps, and they have a free version to try it out. It doesn't track Thrift calls out of the box, but I'm encouraging AppDynamics to figure out a way to do that, and working on a config for capturing the entry points in the meantime.

The way the page cache works is that pages stay in memory linked to a specific file. If you delete that file, the pages are all considered invalid at that point, so they get zeroed out and go to the start of the free list. Compaction creates a new file first (which competes with existing read traffic to try to keep its pages in memory), then removes the old files that were being merged; at that point there is a supply of blank pages, but disk reads will be needed to warm up the cache again.

The use case that I'm working with is more like a persistent memcached replacement, so we are trying to have more RAM than data on m2.4xl EC2 instances (~70GB) and keep all reads in memory all the time.

Adrian

On 12/19/10 5:36 AM, Peter Schuller peter.schul...@infidyne.com wrote:
>> How / what are you monitoring? Best practices someone?
>
> I recently set up monitoring using the cassandra-munin-plugins (https://github.com/jamesgolick/cassandra-munin-plugins). However, due to various little details, that wasn't too fun to integrate properly with munin-node-configure and automated configuration management. A problem is also the starting of a JVM for each use of jmxquery, which can become a problem with many column families.
>
> I like your web server idea. Something persistent that can sit there and do the JMX acrobatics, and expose something more easily consumed for stuff like munin/zabbix/etc. It would be pretty nice to have that out of the box with Cassandra, though I expect that would be considered bloat. :)
>
> --
> / Peter Schuller
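A minimal sketch of that persistent-poller idea: one long-lived JVM holding a single JMX connection, emitting key=value lines that munin/zabbix can scrape, instead of forking a jmxquery JVM per metric. The attribute names are the 0.7-era StorageProxy ones and are an assumption; check them in jconsole:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:8080/jmxrmi"); // 0.7-era default JMX port
        JMXConnector jmxc = JMXConnectorFactory.connect(url); // connect once, poll forever
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        ObjectName proxy = new ObjectName("org.apache.cassandra.db:type=StorageProxy");
        while (true) {
            // Assumed attribute names; the StorageProxy MBean exposes recent
            // request latencies that make good graphing targets.
            System.out.println("read_latency_us="
                + mbs.getAttribute(proxy, "RecentReadLatencyMicros"));
            System.out.println("write_latency_us="
                + mbs.getAttribute(proxy, "RecentWriteLatencyMicros"));
            Thread.sleep(60 * 1000L); // munin's usual one-minute cadence
        }
    }
}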