Important Variables for Scaling
Which variables (for instance: throughput, CPU, I/O, connections) are leading in deciding to add a node to a Cassandra setup that is put under strain? We are trying to prove scalability, but when is the right moment to add a node so as to get the optimum scalability result?
Re: Multi data center configuration - A question on read correction
Yes, that's the way to do it.

On Wed, Jun 15, 2011 at 9:43 PM, Selva Kumar wwgse...@yahoo.com wrote: Thanks Jonathan. Can we turn off RR by setting READ_REPAIR_CHANCE = 0? Please advise. Selva

From: Jonathan Ellis jbel...@gmail.com To: user@cassandra.apache.org Sent: Tue, June 14, 2011 8:59:41 PM Subject: Re: Multi data center configuration - A question on read correction

That's just read repair sending MD5s of the data for comparison, so net traffic is light. You can turn off RR, but the downsides can be large. Turning it down to, say, 10% can be reasonable though. But again, if network traffic is your concern you should be fine.

On Tue, Jun 14, 2011 at 8:44 PM, Selva Kumar wwgse...@yahoo.com wrote: I have set up a multiple data center configuration in Cassandra. My primary intention is to minimize the network traffic between DC1 and DC2. I want DC1 read requests to be served without reaching DC2 nodes. After going through the documentation, I felt the following setup would do.

Replica Placement Strategy: NetworkTopologyStrategy
Replication Factor: 3
strategy_options: DC1 : 2, DC2 : 1
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
Read Consistency Level: LOCAL_QUORUM
Write Consistency Level: LOCAL_QUORUM

File: cassandra-topology.properties
# Cassandra Node IP=Data Center:Rack
10.10.10.149=DC1:RAC1
10.10.10.150=DC1:RAC1
10.10.10.151=DC1:RAC1
10.20.10.153=DC2:RAC1
10.20.10.154=DC2:RAC1
# default for unknown nodes
default=DC1:RAC1

Question I have: 1. I created a Java program to test. It was querying with consistency level LOCAL_QUORUM on a DC1 node. The read count (through cfstats) on the DC2 node showed a read happened there too. Is it because of read correction? Is there a way to avoid doing read correction on DC2 nodes when we query DC1 nodes? Thanks, Selva

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
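As a concrete illustration of Jonathan's suggestion, read repair can be turned down per column family from cassandra-cli. This is a sketch only; the column family name is an example, and you should check the attribute name against your Cassandra version:

```
update column family transactions with read_repair_chance=0.1;
```

Setting it to 0 disables read repair for that column family entirely; 0.1 means only roughly 10% of reads trigger the background digest comparison against the other replicas.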
Re: sstable2json2sstable bug with json data stored
On 6/15/11 17:41, Timo Nentwig wrote: (json can likely be boiled down even more...) Any JSON (well, probably anything with quotes...) breaks it:

{ "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] }

[default@foo] set transactions['test']['data']='{"foo":"bar"}';

I feared that storing data in a readable fashion would be a fateful idea. https://issues.apache.org/jira/browse/CASSANDRA-2780
Re: sstable2json2sstable bug with json data stored
The JSON you are showing below is an export from cassandra?

{ "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] }

Does this work?

{ "74657374": [["data", "{\"foo\":\"bar\"}", 1308209845388000]] }

-sd

On Thu, Jun 16, 2011 at 9:49 AM, Timo Nentwig timo.nent...@toptarif.de wrote: On 6/15/11 17:41, Timo Nentwig wrote: (json can likely be boiled down even more...) Any JSON (well, probably anything with quotes...) breaks it: { "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] } [default@foo] set transactions['test']['data']='{"foo":"bar"}'; I feared that storing data in a readable fashion would be a fateful idea. https://issues.apache.org/jira/browse/CASSANDRA-2780
Re: sstable2json2sstable bug with json data stored
On 6/16/11 10:06, Sasha Dolgy wrote: The JSON you are showing below is an export from cassandra?

Yes. Just posted the solution: https://issues.apache.org/jira/browse/CASSANDRA-2780?focusedCommentId=13050274&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13050274 Guess this could simply be done in the quote() method.

{ "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] } Does this work? { "74657374": [["data", "{\"foo\":\"bar\"}", 1308209845388000]] } -sd

On Thu, Jun 16, 2011 at 9:49 AM, Timo Nentwig timo.nent...@toptarif.de wrote: On 6/15/11 17:41, Timo Nentwig wrote: (json can likely be boiled down even more...) Any JSON (well, probably anything with quotes...) breaks it: { "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] } [default@foo] set transactions['test']['data']='{"foo":"bar"}'; I feared that storing data in a readable fashion would be a fateful idea. https://issues.apache.org/jira/browse/CASSANDRA-2780
Re: sstable2json2sstable bug with json data stored
On 6/16/11 10:12, Timo Nentwig wrote: On 6/16/11 10:06, Sasha Dolgy wrote: The JSON you are showing below is an export from cassandra? Yes. Just posted the solution: https://issues.apache.org/jira/browse/CASSANDRA-2780?focusedCommentId=13050274&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13050274 Guess this could simply be done in the quote() method.

Hm, is this the way it's supposed to be?

[default@foo] set transactions['test']['data']='{"foo":"bar"}';
Value inserted.
[default@foo] get transactions['test']['data'];
=> (column=data, value={"foo":"bar"}, timestamp=1308214517443000)
[default@foo] set transactions['test']['data']='{\"foo\":\"bar\"}';
Value inserted.
[default@foo] get transactions['test']['data'];
=> (column=data, value={"foo":"bar"}, timestamp=1308214532484000)

Otherwise, here's a regex that cares about existing backslashes:

private static String quote(final String val) {
    return String.format("\"%s\"", val.replaceAll("(?<!\\\\)\"", "\\\\\""));
}

{ "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] } Does this work? { "74657374": [["data", "{\"foo\":\"bar\"}", 1308209845388000]] } -sd

On Thu, Jun 16, 2011 at 9:49 AM, Timo Nentwig timo.nent...@toptarif.de wrote: On 6/15/11 17:41, Timo Nentwig wrote: (json can likely be boiled down even more...) Any JSON (well, probably anything with quotes...) breaks it: { "74657374": [["data", "{"foo":"bar"}", 1308209845388000]] } [default@foo] set transactions['test']['data']='{"foo":"bar"}'; I feared that storing data in a readable fashion would be a fateful idea. https://issues.apache.org/jira/browse/CASSANDRA-2780
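A quick way to sanity-check the escaping logic Timo proposes is to port it; this is a hedged Python sketch of the same regex (a negative lookbehind so already-escaped quotes are left alone), not the actual patch that went into CASSANDRA-2780:

```python
import re

def quote(val):
    # Escape double quotes that are not already preceded by a backslash,
    # then wrap the whole value in quotes, mirroring sstable2json's quote().
    return '"%s"' % re.sub(r'(?<!\\)"', r'\"', val)
```

With this, a stored value like {"foo":"bar"} round-trips as "{\"foo\":\"bar\"}", which is valid JSON and can be fed back through json2sstable.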
Getting Started website is out of date
Hi, the Getting Started website (http://wiki.apache.org/cassandra/GettingStarted) is out of date: the link to the Twissandra demo is broken, and the new CQL is not mentioned. :-) Besides this, I love Cassandra! Best, Christian
Re: Migration question
Lots of folks use a single disk or RAID-1 for the system and commit log, and RAID-0 for the data volumes: http://wiki.apache.org/cassandra/CassandraHardware Your money is probably better spent on more nodes with more disks and more memory. More nodes is always better. Happy to hear reasons otherwise. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 15 Jun 2011, at 15:50, Marcos Ortiz wrote: On 6/14/2011 1:43 PM, Eric Czech wrote: Thanks Aaron. I'll make sure to copy the system tables. Another thing -- do you have any suggestions on RAID configurations for the main data drives? We're looking at RAID 5 and 10 and I can't seem to find a convincing argument one way or the other. Well, I learned from administering other databases (like PostgreSQL and Oracle) that RAID 10 is the best solution for data. With RAID 5, the disks suffer a lot from the excessive I/O and it can lead to data loss. You can search for the "RAID 5 write hole" to see this. Thanks again for your help.

On Mon, Jun 6, 2011 at 5:45 AM, aaron morton aa...@thelastpickle.com wrote: Sounds like you are OK to turn off the existing cluster first. Assuming so, deliver any hints using JMX, then do a nodetool flush to write out all the memtables and checkpoint the commit logs. You can then copy the data directories. The system data directory contains the node's token and the schema, so you will want to copy this directory. You may also want to copy the cassandra.yaml or create new ones with the correct initial tokens. The nodes will sort themselves out when they start up and get new IPs; the important thing to them is the token. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6 Jun 2011, at 23:25, Eric Czech wrote: Hi, I have a quick question about migrating a cluster.
We have a cassandra cluster with 10 nodes that we'd like to move to a new DC and what I was hoping to do is just copy the SSTables for each node to a corresponding node in the new DC (the new cluster will also have 10 nodes). Is there any reason that a straight file copy like this wouldn't work? Do any system tables need to be moved as well or is there anything else that needs to be done? Thanks! -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: Slowdowns during repair
Look for log messages at the ERROR level first to find out why it's crashing. Check for GC pressure during the repair, either using JConsole or log messages from the GCInspector. Check nodetool tpstats to get an idea of whether the nodes are saturated, i.e. are there tasks in the pending list, or are they just running with high latency. If a node crashes when calculating the Merkle trees for its neighbours, the repair will hang (for 48 hours, I think) on the node that initiated the repair. I don't think this is immediately obvious through tpstats. Start with why it's crashing and what's happening with the GC. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 10:20, Aurynn Shaw wrote: Hey all; So, we have Cassandra running on a 5-server ring, with an RF of 3, and we're regularly seeing major slowdowns in read/write performance while running nodetool repair (slowdowns past 10 seconds to perform a single write), as well as the occasional Cassandra crash during the repair window. The repair cycle runs nightly on a different server, so each server has it run once a week. We're running 0.7.0 currently, and we'll be upgrading to 0.7.6 shortly. System load on the Cassandra servers is never more than 10% CPU and utterly minimal IO usage, so I wouldn't think we'd be seeing issues quite like this. What sort of knobs should I be looking at tuning to reduce the impact that nodetool repair has on Cassandra? What questions should I be asking as to why Cassandra slows down to the level that it does, and what should I be optimizing? Additionally, what should I be looking for in the logs when this is happening? There's a lot in the logs, but I'm not sure what to look for. Cassandra is, in this instance, backing a system that supports around a million requests a day, so not terribly heavy traffic. Thanks, Aurynn
Re: Where is my data?
I wrote a blog post about this sort of thing the other day http://thelastpickle.com/2011/06/13/Down-For-Me/ Let me know if you spot any problems. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 16 Jun 2011, at 02:20, AJ wrote: Thanks On 6/15/2011 3:20 AM, Sylvain Lebresne wrote: You can use the thrift call describe_ring(). It will returns a map that associate to each range of the ring who is a replica. Once any range has all it's endpoint unavailable, that range of the data is unavailable. -- Sylvain
Re: What's the best approach to search in Cassandra
Mark, Solandra doesn't use secondary indexes; the functionality is too limited for the Lucene API. It maintains its own indexes in regular column families. I suggest you look at Solr and decide if this is the functionality you need; Solandra offers the same API but on Cassandra's distributed model. -Jake

On Thu, Jun 16, 2011 at 12:56 AM, Mark Kerzner markkerz...@gmail.com wrote: Jake, "You need to maintain a huge number of distinct indexes." Are we talking about secondary indexes? If yes, this sounds like exactly my problem. There is so little documentation! - but I think that if I read all there is on GitHub, I can probably start using it. Thank you, Mark

On Fri, Jun 3, 2011 at 8:07 PM, Jake Luciani jak...@gmail.com wrote: Mark, Check out Solandra. http://github.com/tjake/Solandra On Fri, Jun 3, 2011 at 7:56 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I need to store, say, 10M-100M documents, with each document having, say, 100 fields, like author, creation date, access date, etc., and then I want to ask questions like "give me all documents whose author is like abc*, and creation date any time in 2010 and access date in 2010-2011", and so on, perhaps 10-20 conditions, matching a list of some keywords. What's best: Lucene, Katta, Cassandra CF with secondary indices, or plain scan and compare of every record? Thanks a bunch! Mark -- http://twitter.com/tjake
Re: Force a node to form part of quorum
Short answer: No. Medium answer: No, all nodes are equal. It could create a single point of failure if a QUORUM could not be formed without a specific node. Writes are sent to every replica. Reads with read repair enabled are also sent to every replica. For reads, the closest UP node as determined by the snitch (and possibly re-ordered by the dynamic snitch) is asked to return the actual data. This replica must respond for the request to complete. If it's a question about maximising cache hits, see https://github.com/apache/cassandra/blob/cassandra-0.8.0/conf/cassandra.yaml#L308 Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 05:58, A J wrote: Is there a way to favor a node to always participate (or never participate) in fulfilling read consistency as well as write consistency? Thanks AJ
Re: Atomicity of batch updates
See http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 06:26, chovatia jaydeep wrote: A Cassandra write operation is atomic for all the columns/super columns for a given row key in a column family. So in your case, not all previous operations (assuming each operation was on a separate key) will be reverted. Thank you, Jaydeep

From: Artem Orobets artem.orob...@exigenservices.com To: user@cassandra.apache.org Cc: Andrey Lomakin andrey.loma...@exigenservices.com Sent: Wednesday, 15 June 2011 7:42 AM Subject: Atomicity of batch updates Hi, the wiki says that a write operation is atomic within a ColumnFamily (http://wiki.apache.org/cassandra/ArchitectureOverview, chapter "write properties"). If I use a batch update for a single CF and get an exception in the last mutation operation, does it mean that all previous operations will be reverted? If not, what does atomic mean in this context?
Re: Easy way to overload a single node on purpose?
"DEBUG 14:36:55,546 ... timed out" is logged when the coordinator times out waiting for the replicas to respond; the timeout setting is rpc_timeout in the yaml file. This results in the client getting a TimedOutException. AFAIK there are no global "everything is good / bad" flags to check. E.g. AFAIK a node will not mark itself down if it runs out of disk space, so you need to monitor the free disk space and alert on that. Having a ping column can work if every key is replicated to every node. It would tell you the cluster is working, sort of. Once the number of nodes is greater than the RF, it only tells you a subset of the nodes works. If you google around you'll find discussions about monitoring with Munin, Ganglia, Cloudkick and OpsCenter. If you install mx4j you can access the JMX metrics via HTTP. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 10:38, Suan Aik Yeo wrote: Here's a weird one... what's the best way to get a Cassandra node into a half-crashed state? We have a 3-node cluster running 0.7.5. A few days ago this happened organically to node1 - the partition the commitlog was on was 100% full and there was a "No space left on device" error, and after a while, although the cluster and node1 were still up, to the other nodes it was down, and messages like "DEBUG 14:36:55,546 ... timed out" started to show up in its debug logs. We have a tool to indicate to the load balancer that a Cassandra node is down, but it didn't detect it that time. Now I'm having trouble purposefully getting the node back into that state, so that I can try other monitoring methods. I've tried to fill up the commitlog partition with other files, and although I get the "No space left on device" error, the node still doesn't go down and show the other symptoms it showed before. Also, if anyone could recommend a good way for a node itself to detect that it's in such a state, I'd be interested in that too.
Currently what we're doing is making a describe_cluster_name() thrift call, but that still worked when the node was down. I'm thinking of something like reading/writing to a fixed value in a keyspace as a check... Unfortunately Java-based solutions are out of the question. Thanks, Suan
Re: Is there a way from a running Cassandra node to determine whether or not itself is up?
Take a look at mx4j: http://wiki.apache.org/cassandra/Operations#Monitoring_with_MX4J Someone told me once you can call the JMX ops via HTTP; I've not checked though. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 14:45, Jake Luciani wrote: To force a node down you can use nodetool disablegossip. On Wed, Jun 15, 2011 at 6:42 PM, Suan Aik Yeo yeosuan...@gmail.com wrote: Thanks, Aaron, but we determined that adding Java into the equation just brings in too much complexity for something that's called out of an Nginx Perl module. Right now I'm having trouble even replicating the above scenario and posted a question here: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Easy-way-to-overload-a-single-node-on-purpose-tt6480958.html - Suan

On Thu, Jun 9, 2011 at 3:58 AM, aaron morton aa...@thelastpickle.com wrote: None via thrift that I can recall, but the StorageService MBean exposes getLiveNodes(); this is what nodetool uses to see which nodes are live. From the code:

/**
 * Retrieve the list of live nodes in the cluster, where "liveness" is
 * determined by the failure detector of the node being queried.
 *
 * @return set of IP addresses, as Strings
 */
public List<String> getLiveNodes();

Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jun 2011, at 17:56, Suan Aik Yeo wrote: Is there a way (preferably an exposed method accessible through Thrift), from a running Cassandra node, to determine whether or not it itself is up? (Per Cassandra standards, I'm assuming based on the gossip protocol.) Another way to think of what I'm looking for is basically running nodetool ring just on myself, but I'm only interested in knowing whether I'm Up or Down.
I'm currently using the describe_cluster method, but earlier today when the commitlogs for a node filled up and it appeared down to the other nodes, describe_cluster() still worked fine, thus failing the check. Thanks, Suan -- http://twitter.com/tjake
Re: Docs: Token Selection
See this thread for background: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html In a multi DC environment, if you calculate the initial tokens for the entire cluster, data will not be evenly distributed. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 15:51, Vijay wrote: +1 for more documentation (I guess contributions are always welcome). I will try to write it down sometime when we have a bit more time... The 0.8 nodetool ring command adds the DC and RAC information. http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers http://www.datastax.com/products/opscenter Hope this helps... Regards, /VJ

On Wed, Jun 15, 2011 at 7:24 PM, AJ a...@dude.podzone.net wrote: Ok. I understand the reasoning you laid out. But I think it should be documented more thoroughly. I was trying to get an idea as to how flexible Cass lets you be with the various combinations of strategies, snitches, token ranges, etc. It would be instructional to see a graphical representation of what a cluster ring with multiple data centers looks like. Google turned up nothing. I imagine it's a multilayer ring; one layer per data center, with the nodes of one layer slightly offset from the ones in the other (based on the example in the wiki). I would also like to know which node is next in the ring so as to understand replica placement in, for example, the OldNetworkTopologyStrategy, when its doc states, "...It places one replica in a different data center from the first (if there is any such data center), the third replica in a different rack in the first datacenter, and any remaining replicas on the first unused nodes on the ring." I can only assume for now that the ring referred to is the local ring of the first data center. On 6/15/2011 5:51 PM, Vijay wrote: No it won't, it will assume you are doing the right thing...
Regards, /VJ On Wed, Jun 15, 2011 at 2:34 PM, AJ a...@dude.podzone.net wrote: Vijay, thank you for your thoughtful reply. Will Cass complain if I don't set up my tokens like in the examples? On 6/15/2011 2:41 PM, Vijay wrote: All you heard is right... You are not overriding Cassandra's token assignment by saying "here is your token"... The logic is: calculate a token for the given key... find the node in each region independently (if you use NTS and if you set the strategy options which say you want to replicate to the other region)... search for the ranges in each region independently... replicate the data to that node. For multi DC, Cassandra needs the nodes to be equally partitioned within each DC (if you care that the load is equally distributed), and there shouldn't be any collision of tokens within a cluster. The documentation tried to explain the same, along with the example in the documentation. Hope this clarifies... More examples if it helps:

DC1 Node 1 : token 0
DC1 Node 2 : token 8..
DC2 Node 1 : token 4..
DC2 Node 2 : token 12..

or

DC1 Node 1 : token 0
DC1 Node 2 : token 1..
DC2 Node 1 : token 8..
DC2 Node 2 : token 7..
Regards, /VJ On Wed, Jun 15, 2011 at 12:28 PM, AJ a...@dude.podzone.net wrote: On 6/15/2011 12:14 PM, Vijay wrote: Correction: "The problem in the above approach is you have 2 nodes between 12 to 4 in DC1 but from 4 to 12 you just have 1" should be "The problem in the above approach is you have 1 node between 0-4 (25%) and one node covering the rest, which is 4-16, 0-0 (75%)". Regards, /VJ

Ok, I think you are saying that the computed token range intervals are incorrect and that they would be:

DC1
node 1 = 0   Range: (4, 16], (0, 0]
node 2 = 4   Range: (0, 4]
DC2
node 3 = 8   Range: (12, 16], (0, 8]
node 4 = 12  Range: (8, 12]

If so, then yes, this is what I am seeking to confirm, since I haven't found any documentation stating this directly, and the reference that I gave only implies this; that is, that the token ranges are calculated per data center rather than per cluster. I just need someone to confirm that 100% because it doesn't sound right to me based on everything else I've read. SO, the question is: does Cass calculate the consecutive node token ranges A.) per cluster, or B.) per data center? From all I understand, the answer is B. But that documentation (reprinted below) implies A... or something that doesn't make sense to me because of the token placement in the example:

With NetworkTopologyStrategy, you should calculate the tokens of the nodes in each DC independently...
DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864
DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865

However, I do see
Querying superColumn
I have a question about querying super column For example: I have a supercolumnFamily DEPARTMENT with dynamic superColumn 'EMPLOYEE'( name, country). Now for rowKey 'DEPT1' I have inserted multiple super column like: Employee1{ Name: Vivek country: India } Employee2{ Name: Vivs country: USA } Now if I want to retrieve a super column whose rowkey is 'DEPT1' and employee name is 'Vivek'. Can I get only 'EMPLOYEE1' ? -Vivek Write to us for a Free Gold Pass to the Cloud Computing Expo, NYC to attend a live session by Head of Impetus Labs on 'Secrets of Building a Cloud Vendor Agnostic PetaByte Scale Real-time Secure Web Application on the Cloud '. Looking to leverage the Cloud for your Big Data Strategy ? Attend Impetus webinar on May 27 by registering at http://www.impetus.com/webinar?eventid=42 . NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
Re: Important Variables for Scaling
It's a difficult question to answer in the abstract. Some thoughts...

Scaling by adding one node at a time is not optimal. The best case scenario is to double the number of nodes, as this means existing nodes only have to stream their data to a new node. Obviously this is not always possible. When adding fewer nodes, keeping a balanced ring may mean existing nodes stream data to other nodes as well as accept data from other nodes. In general, try to keep the data volume at most 50% full, so there is lots of free space to do moves. In general, nodes with a few 100 GBs of data are easiest to manage.

The pending column in nodetool tpstats will let you know how many read or write requests are waiting to be serviced. If this is consistently above concurrent_reads or concurrent_writes it means there is a queue (one slot for each thread). This will add to request latency; once maximum throughput is reached, additional requests will queue. See the SEDA paper. Sometime in the 0.7 dev cycle, client connection pooling was added to better manage those resources; see cassandra.yaml for info.

The o.a.c.db.StorageProxy JMX MBean provides latency trackers for total request time including wait times. And o.a.c.db.ColumnFamiles... provides latency trackers for the local node operations to do the read or write.

If your data set grows quickly, watch the disk space etc. If you do a lot of requests but your data grows slowly, watch the throughput and latency numbers.

Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 16 Jun 2011, at 18:28, Schuilenga, Jan Taeke wrote: Which variables (for instance: throughput, CPU, I/O, connections) are leading in deciding to add a node to a Cassandra setup that is put under strain? We are trying to prove scalability, but when is the right moment to add a node so as to get the optimum scalability result?
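Aaron's "pending above concurrent_reads/concurrent_writes" check can be automated against the output of nodetool tpstats. This is a hedged sketch: the column layout assumed here (Pool Name / Active / Pending / Completed) should be verified against your Cassandra version, and the sample text and threshold are illustrative only:

```python
def pools_with_backlog(tpstats_output, threshold):
    """Return {pool_name: pending} for pools whose pending count exceeds threshold."""
    flagged = {}
    for line in tpstats_output.splitlines():
        parts = line.split()
        # data rows look like: <pool> <active> <pending> <completed>
        if len(parts) >= 4 and parts[1].isdigit() and parts[2].isdigit():
            pending = int(parts[2])
            if pending > threshold:
                flagged[parts[0]] = pending
    return flagged

sample = """Pool Name    Active   Pending      Completed
ReadStage        32       187        8261096
MutationStage     2         0        9184011"""
```

With concurrent_reads at its default of 32, pools_with_backlog(sample, 32) flags ReadStage as saturated, which is exactly the "queue building up" signal described above.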
Upgrading Cassandra cluster from 0.6.3 to 0.7.5
Hi All, We are upgrading Cassandra from 0.6.3 to 0.7.5. We have two nodes in the cluster. I am a bit confused about how to upgrade them; do you have any guide? -- S.Ali Ahsan Senior System Engineer e-Business (Pvt) Ltd 49-C Jail Road, Lahore, P.O. Box 676 Lahore 54000, Pakistan Tel: +92 (0)42 3758 7140 Ext. 128 Mobile: +92 (0)345 831 8769 Fax: +92 (0)42 3758 0027 Email: ali.ah...@panasiangroup.com www.ebusiness-pg.com www.panasiangroup.com Confidentiality: This e-mail and any attachments may be confidential and/or privileged. If you are not a named recipient, please notify the sender immediately and do not disclose the contents to another person, use it for any purpose, or store or copy the information in any medium. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. We do not accept liability for any errors or omissions.
Re: Querying superColumn
Well, you are looking for a secondary index. But for now, AFAIK, supercolumns cannot use secondary indexes. On 16/06/2011 13:55, Vivek Mishra wrote: Now for rowKey 'DEPT1' I have inserted multiple super columns like: Employee1{ Name: Vivek, country: India } Employee2{ Name: Vivs, country: USA } Now if I want to retrieve a super column whose rowkey is 'DEPT1' and employee name is 'Vivek', can I get only 'Employee1'? -- Donal Zang Computing Center, IHEP 19B YuquanLu, Shijingshan District, Beijing, 100049 zan...@ihep.ac.cn 86 010 8823 6018
snitch thrift
Hi all! Assuming a node ends up in GC land for a while, there is a good chance that even though it performs terribly and the dynamic snitching will help you to avoid it on the gossip side, it will not really help you much if thrift still accepts requests and the thrift interface has choppy performance. This makes me wonder whether thrift-only client-mode nodes would be a potential idea. I don't think I have seen that this exists today (or is it possible that I have missed a way to configure it?), but it does not seem like a very hard thing to make, and it could maybe be good in some usage patterns for the data node side as well as the thrift side. Any thoughts? Regards, Terje
Re: Docs: Token Selection
AJ, sorry, I seem to have missed the original email on this thread. As Aaron said, when computing tokens for multiple data centers, you should compute them independently for each data center - as if it were its own Cassandra cluster. You can have overlapping token ranges between multiple data centers, but no two nodes can have the same token, so for subsequent data centers I just increment the tokens. For two data centers with two nodes each using RandomPartitioner, calculate the tokens for the first DC normally, but in the second data center, increment the tokens by one.

In DC 1
node 1 = 0
node 2 = 85070591730234615865843651857942052864
In DC 2
node 1 = 1
node 2 = 85070591730234615865843651857942052865

For RowMutations this will give each data center a local set of nodes that it can write to for complete coverage of the entire token space. If you are using NetworkTopologyStrategy for replication, it will give an offset mirror replication between the two data centers so that your replicas will not get pinned to a node in the remote DC. There are other ways to select the tokens, but the increment method is the simplest to manage and continue to grow with. Hope that helps. -Eric
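Eric's recipe can be written down directly. This is a hedged sketch (the function name is mine, and it assumes RandomPartitioner's 2**127 token space): space the tokens evenly within each DC, then add the DC's index as an offset so no two nodes in the cluster share a token.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space

def tokens_for_dc(num_nodes, dc_index):
    """Evenly spaced initial tokens for one DC, offset by the DC's index."""
    return [(RING_SIZE // num_nodes) * i + dc_index for i in range(num_nodes)]
```

For the two-DC, two-node example above, tokens_for_dc(2, 0) reproduces DC 1's tokens, and tokens_for_dc(2, 1) gives the same tokens incremented by one for DC 2.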
Re: Docs: Token Selection
LOL, I feel Eric's pain. This double-ring thing can throw you for a loop since, like I said, there is only one place it is documented and it is only *implied*, so one is not sure he is interpreting it correctly. Even the source for NTS doesn't mention this. Thanks for everyone's help on this. On 6/16/2011 5:43 AM, aaron morton wrote: See this thread for background http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html In a multi DC environment, if you calculate the initial tokens for the entire cluster data will not be evenly distributed. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
Re: Docs: Token Selection
Thanks Eric! I've finally got it! I feel like I've just been initiated or something by discovering this secret. I kid! But I'm thinking about using OldNetworkTopologyStrategy. Do you, or anyone else, know if the same rules for token assignment apply to ONTS? On 6/16/2011 7:21 AM, Eric Tamme wrote: AJ, sorry, I seem to have missed the original email on this thread. As Aaron said, when computing tokens for multiple data centers, you should compute them independently for each data center - as if it were its own Cassandra cluster. You can have overlapping token ranges between multiple data centers, but no two nodes can have the same token, so for subsequent data centers I just increment the tokens. For two data centers with two nodes each using RandomPartitioner, calculate the tokens for the first DC normally, but in the second data center, increment the tokens by one. In DC 1: node 1 = 0, node 2 = 85070591730234615865843651857942052864. In DC 2: node 1 = 1, node 2 = 85070591730234615865843651857942052865. For RowMutations this will give each data center a local set of nodes that it can write to for complete coverage of the entire token space. If you are using NetworkTopologyStrategy for replication, it will give an offset mirror replication between the two data centers so that your replicas will not get pinned to a node in the remote DC. There are other ways to select the tokens, but the increment method is the simplest to manage and continue to grow with. Hope that helps. -Eric
Cassandra JVM GC settings
Hi Everyone, I'm seeing Cassandra GC a lot and I would like to tune the Young and Tenured spaces. Would anyone have recommendations on the NewRatio or NewSize/MaxNewSize to use for an environment where Cassandra has several column families and where we are doing a mixed load of reading and writing? The JVM has 8G of heap space assigned to it and there are 9 nodes in this cluster. Thanks for the comments! Sébastien Coutu
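As a rough, hypothetical starting point for the 8G heap described above (these are standard HotSpot flags, not values from this thread; the exact file and numbers depend on your Cassandra version and must be benchmarked against your own mixed read/write workload), the young generation can be sized explicitly where the startup script builds JVM_OPTS, e.g. conf/cassandra-env.sh or bin/cassandra.in.sh:

```shell
# Hypothetical sketch only -- benchmark before adopting any of these values.
JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"        # fixed 8G heap, as in the question
JVM_OPTS="$JVM_OPTS -Xmn800M"             # explicit young gen (~1/10 of heap)
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"   # promote survivors quickly
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
```

Setting -Xmn overrides NewRatio; the usual trade-off is that a larger young gen lets short-lived memtable and read garbage die before promotion, at the cost of longer ParNew pauses.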
client API
I use JDK 1.6 to install and launch Cassandra on a Linux platform, but can I use JDK 1.5 for my Cassandra client?
Re: Querying superColumn
Have 1 row with employee info for country/office/division, each column an employee id and JSON info about the employee, or a reference to another row id for that employee data. No more supercolumn. On Jun 16, 2011 1:56 PM, Vivek Mishra vivek.mis...@impetus.co.in wrote: I have a question about querying a super column. For example: I have a superColumnFamily DEPARTMENT with dynamic superColumn 'EMPLOYEE' (name, country). Now for rowKey 'DEPT1' I have inserted multiple super columns like: Employee1{ Name: Vivek country: India } Employee2{ Name: Vivs country: USA } Now if I want to retrieve a super column whose rowkey is 'DEPT1' and employee name is 'Vivek', can I get only 'EMPLOYEE1'? -Vivek
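The suggested remodel (no super columns) can be sketched like this; a plain dict stands in for the column family, and the row key, column names, and helper function are illustrative, not an actual client API:

```python
import json

# One row per country/office/division; one column per employee id whose
# value is a JSON blob (or a reference to another row for that employee).
department_cf = {}

def add_employee(row_key, emp_id, info):
    department_cf.setdefault(row_key, {})[emp_id] = json.dumps(info)

add_employee("DEPT1", "employee1", {"name": "Vivek", "country": "India"})
add_employee("DEPT1", "employee2", {"name": "Vivs", "country": "USA"})

# Fetching one employee is now a single-column read by column name,
# which is exactly the "get only EMPLOYEE1" access the question asks for.
emp = json.loads(department_cf["DEPT1"]["employee1"])
print(emp["name"])  # Vivek
```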
Propose new ConsistencyLevel.ALL_AVAIL for reads
Good morning all. Hypothetical Setup: 1 data center RF = 3 Total nodes 3 Problem: Suppose I need maximum consistency for one critical operation; thus I specify CL = ALL for reads. However, this will fail if only 1 replica endpoint is down. I don't see why this failure is necessary all of the time, since the data could have been updated after the node became unavailable, in which case its data is outdated anyway. If only one node goes down and it has the key I need, then the app is not 100% available and it could take some time to make the node available again. Proposal: If all of the *available* replica nodes answer the read operation and the latest value timestamp is clearly AFTER the time the down node became unavailable, then this situation can meet the requirements for *near* 100% consistency, since the value on the down node would be outdated anyway. Clearly, the value was updated some time *after* the node went down or became unavailable. This way, you can have maximum availability when using reads with CL.ALL... or some CL close in meaning to ALL. I say near 100% consistency to leave room for a situation where the unavailable node was only unavailable to the coordinating node for some reason, such as a network issue, and thus still received an update by some other route after it appeared unavailable to the current coordinating node. In a situation like this, there is a chance the read will still not return the latest value. So, this will not be truly 100% consistent, which CL.ALL guarantees. However, I think this logic could justify a new consistency level slightly lower than ALL, such as ALL_AVAIL. What do you think? Is my logic correct? Is there a conflict with the architecture or base principles? This fits with the tunable consistency principle for sure. Thanks for listening
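The proposed rule can be stated as a small decision function. This is a hypothetical sketch of the idea only, not a Cassandra API: accept an "all available" read only when the newest timestamp among the live replicas is strictly after the moment the dead replica was last reachable, since in that case the dead replica's copy must be stale anyway:

```python
# Hypothetical ALL_AVAIL read check (names and shapes are illustrative).
# live_values: list of (value, timestamp) from the reachable replicas.
# down_since: the time the unavailable replica was last seen alive.

def all_avail_read(live_values, down_since):
    if not live_values:
        raise RuntimeError("no live replicas")
    value, ts = max(live_values, key=lambda vt: vt[1])
    if ts > down_since:
        # The winning write happened after the node went down, so the
        # down node cannot hold a newer value (modulo the partition
        # caveat the proposal itself raises).
        return value
    raise RuntimeError("cannot rule out a newer value on the down replica")

print(all_avail_read([("v2", 105), ("v2", 105)], down_since=100))  # v2
```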
Re: Docs: Token Selection
So, with ec2 ... 3 regions (DC's), each one is +1 from another? On Jun 16, 2011 3:40 PM, AJ a...@dude.podzone.net wrote: Thanks Eric! I've finally got it! I feel like I've just been initiated or something by discovering this secret. I kid! But, I'm thinking about using OldNetworkTopStrat. Do you, or anyone else, know if the same rules for token assignment apply to ONTS? On 6/16/2011 7:21 AM, Eric tamme wrote: AJ, sorry I seem to have missed the original email on this thread. As Aaron said, when computing tokens for multiple data centers, you should compute them independently for each data center - as if it were its own Cassandra cluster. You can have overlapping token ranges between multiple data centers, but no two nodes can have the same token, so for subsequent data centers I just increment the tokens. For two data centers with two nodes each using RandomPartitioner, calculate the tokens for the first DC normally, but in the second data center, increment the tokens by one. In DC 1 node 1 = 0 node 2 = 85070591730234615865843651857942052864 In DC 2 node 1 = 1 node 2 = 85070591730234615865843651857942052865 For RowMutations this will give each data center a local set of nodes that it can write to for complete coverage of the entire token space. If you are using NetworkTopologyStrategy for replication, it will give an offset mirror replication between the two data centers so that your replicas will not get pinned to a node in the remote DC. There are other ways to select the tokens, but the increment method is the simplest to manage and continue to grow with. Hope that helps. -Eric
Re: Docs: Token Selection
On Thu, Jun 16, 2011 at 11:11 AM, Sasha Dolgy sdo...@gmail.com wrote: So, with ec2 ... 3 regions (DC's), each one is +1 from another? I don't use EC2, so I am not familiar with the specifics of deployment there. That said, if you have 3 data centers with equal nodes in each (so that you would calculate the same tokens for each DC), the first DC you would add 0, the second DC you would add 1, the third DC you would add 2. So it could look like the following: In DC 1 node 1 = 0 node 2 = 85070591730234615865843651857942052864 In DC 2 node 1 = 1 node 2 = 85070591730234615865843651857942052865 In DC 3 node 1 = 2 node 2 = 85070591730234615865843651857942052866 Keep in mind, the only reason you need to offset tokens is if there is another node that would have the exact same token. So if you have different numbers of nodes in different data centers, it is possible you won't need any token offsets. Just calculate tokens normally, as if the DC were the only one, then check for any node in another DC with the same token and add +1 to offset the token. -Eric
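The per-DC calculation Eric describes can be sketched in a few lines (RandomPartitioner's token space is 0 .. 2**127 - 1; tokens are computed independently per data center, then the DC index is added as the offset so no two nodes share a token):

```python
# Evenly spaced RandomPartitioner tokens for one DC, offset by the DC index.
def tokens_for_dc(node_count, dc_index):
    return [(i * 2**127) // node_count + dc_index for i in range(node_count)]

for dc in range(3):
    print(f"DC {dc + 1}:", tokens_for_dc(2, dc))
# DC 1: [0, 85070591730234615865843651857942052864]
# DC 2: [1, 85070591730234615865843651857942052865]
# DC 3: [2, 85070591730234615865843651857942052866]
```

The printed values match the tokens quoted in the thread; with unequal node counts per DC the offsets are only needed where two computed tokens collide.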
Unable to access column family in CLI after building CF in CQL
Hi, I was following the CQL example on the DataStax website and was able to create a new column family and query it. But when I viewed the column family in the CLI, it gave me the following error. # Unable to read column family created from CQL [default@store] list users2; *users2 not found in current keyspace.* Also, when I try to query the user table from CQL, I'm unable to filter on a key. The user table was created in the CLI but is accessible from CQL with a simple select * from users; cqlsh select * from users where key='tyler'; *Bad Request: cannot parse 'tyler' as hex bytes* # In the CLI, the store keyspace displays two column families. [default@store] show keyspaces; Keyspace: store: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Options: [replication_factor:1] Column Families: *ColumnFamily: users* Key Validation Class: org.apache.cassandra.db.marshal.BytesType Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.AsciiType Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: false Built indexes: [] Column Metadata: Column Name: email Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: userName Validation Class: org.apache.cassandra.db.marshal.UTF8Type *ColumnFamily: users2* Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on
write: true Built indexes: [] Column Metadata: Column Name: session_token Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: state Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: password Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: birth_year Validation Class: org.apache.cassandra.db.marshal.LongType Column Name: gender Validation Class: org.apache.cassandra.db.marshal.UTF8Type Keyspace: system: Able to see the list of keys generated within the CLI [default@store] list users; Using default limit of 100 --- RowKey: foo = (column=age, value=3339, timestamp=1308182349595000) = (column=email, value=f...@email.com, timestamp=1308182349594000) = (column=userName, value=foo, timestamp=1308182349591000) --- RowKey: bar = (column=email, value=b...@email.com, timestamp=1308182355297000) = (column=gender, value=66, timestamp=1308182355299000) = (column=userName, value=bar, timestamp=1308182355295000) --- RowKey: tyler = (column=email, value=ty...@email.com, timestamp=1308182355303000) = (column=sports, value=6261736562616c6c, timestamp=1308182355309000) = (column=userName, value=tyler, timestamp=1308182355302000)
Re: Upgrading Cassandra cluster from 0.6.3 to 0.7.5
Read NEWS.txt. 0.7.6 is better than 0.7.5, btw. On Thu, Jun 16, 2011 at 5:03 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote: Hi All, We are upgrading Cassandra from 0.6.3 to 0.7.5. We have two nodes in the cluster. I am a bit confused about how to upgrade them; do you have any guide? -- S.Ali Ahsan Senior System Engineer e-Business (Pvt) Ltd 49-C Jail Road, Lahore, P.O. Box 676 Lahore 54000, Pakistan Tel: +92 (0)42 3758 7140 Ext. 128 Mobile: +92 (0)345 831 8769 Fax: +92 (0)42 3758 0027 Email: ali.ah...@panasiangroup.com www.ebusiness-pg.com www.panasiangroup.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Unable to access column family in CLI after building CF in CQL
If you create CFs outside the cli, you may need to restart it to refresh its internal cache of the schema. On Thu, Jun 16, 2011 at 8:51 AM, yikes bigdata yikes.bigd...@gmail.com wrote: Hi, I was following the CQL example on the DataStax website and was able to create a new column family and query it. But when I viewed the column family in the CLI, it gives me the following error. # Unable to read column family created from CQL [default@store] list users2; users2 not found in current keyspace. Also, when I try to query the user table from CQL, i'm unable to filter on a key. The user table was created in the CLI but accessible by CQL with a simple select * from users; cqlsh select * from users where key='tyler'; Bad Request: cannot parse 'tyler' as hex bytes # In the CLI, the store keyspaces displays two column families . [default@store] show keyspaces; Keyspace: store: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Options: [replication_factor:1] Column Families: ColumnFamily: users Key Validation Class: org.apache.cassandra.db.marshal.BytesType Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.AsciiType Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: false Built indexes: [] Column Metadata: Column Name: email Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: userName Validation Class: org.apache.cassandra.db.marshal.UTF8Type ColumnFamily: users2 Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 
20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Built indexes: [] Column Metadata: Column Name: session_token Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: state Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: password Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: birth_year Validation Class: org.apache.cassandra.db.marshal.LongType Column Name: gender Validation Class: org.apache.cassandra.db.marshal.UTF8Type Keyspace: system: Able to see the list of keys generate within the CLI [default@store] list users; Using default limit of 100 --- RowKey: foo = (column=age, value=3339, timestamp=1308182349595000) = (column=email, value=f...@email.com, timestamp=1308182349594000) = (column=userName, value=foo, timestamp=1308182349591000) --- RowKey: bar = (column=email, value=b...@email.com, timestamp=1308182355297000) = (column=gender, value=66, timestamp=1308182355299000) = (column=userName, value=bar, timestamp=1308182355295000) --- RowKey: tyler = (column=email, value=ty...@email.com, timestamp=1308182355303000) = (column=sports, value=6261736562616c6c, timestamp=1308182355309000) = (column=userName, value=tyler, timestamp=1308182355302000) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Unable to access column family in CLI after building CF in CQL
The second error (the CQL select) is because you have different Key Validation Class values for your two user column families. users is org.apache.cassandra.db.marshal.BytesType, while users2 is org.apache.cassandra.db.marshal.UTF8Type. The select is failing because you are comparing a String to a bunch of bytes. - Original Message - From: yikes bigdata yikes.bigd...@gmail.com To: user@cassandra.apache.org Sent: Thursday, June 16, 2011 3:51:41 PM Subject: Unable to access column family in CLI after building CF in CQL Hi, I was following the CQL example on the DataStax website and was able to create a new column family and query it. But when I viewed the column family in the CLI, it gives me the following error. # Unable to read column family created from CQL [default@store] list users2; users2 not found in current keyspace. Also, when I try to query the user table from CQL, i'm unable to filter on a key. The user table was created in the CLI but accessible by CQL with a simple select * from users; cqlsh select * from users where key='tyler'; Bad Request: cannot parse 'tyler' as hex bytes # In the CLI, the store keyspaces displays two column families .
[default@store] show keyspaces; Keyspace: store: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Options: [replication_factor:1] Column Families: ColumnFamily: users Key Validation Class: org.apache.cassandra.db.marshal.BytesType Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.AsciiType Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: false Built indexes: [] Column Metadata: Column Name: email Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: userName Validation Class: org.apache.cassandra.db.marshal.UTF8Type ColumnFamily: users2 Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Built indexes: [] Column Metadata: Column Name: session_token Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: state Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: password Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: birth_year Validation Class: org.apache.cassandra.db.marshal.LongType Column Name: gender Validation Class: org.apache.cassandra.db.marshal.UTF8Type Keyspace: system: Able to see the list of keys generate within the CLI [default@store] list users; Using default limit of 100 --- RowKey: foo = (column=age, value=3339, timestamp=1308182349595000) = 
(column=email, value= f...@email.com , timestamp=1308182349594000) = (column=userName, value=foo, timestamp=1308182349591000) --- RowKey: bar = (column=email, value= b...@email.com , timestamp=1308182355297000) = (column=gender, value=66, timestamp=1308182355299000) = (column=userName, value=bar, timestamp=1308182355295000) --- RowKey: tyler = (column=email, value= ty...@email.com , timestamp=1308182355303000) = (column=sports, value=6261736562616c6c, timestamp=1308182355309000) = (column=userName, value=tyler, timestamp=1308182355302000)
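Because users was created with a BytesType key validator, the key must be supplied as hex rather than as text. The conversion is just an ASCII-to-hex encoding, sketched here (the exact CQL quoting of the hex literal may vary by version):

```python
# Encode the text key the same way the CLI/CQL expect a BytesType key.
key_hex = "tyler".encode("ascii").hex()
print(key_hex)  # 74796c6572  ->  e.g. select * from users where key='74796c6572'

# The same mapping explains the opaque values in `list users`: the sports
# column value shown above decodes back to readable text.
print(bytes.fromhex("6261736562616c6c").decode("ascii"))  # baseball
```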
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On Thu, Jun 16, 2011 at 8:18 AM, AJ a...@dude.podzone.net wrote: Good morning all. Hypothetical Setup: 1 data center RF = 3 Total nodes 3 Problem: Suppose I need maximum consistency for one critical operation; thus I specify CL = ALL for reads. However, this will fail if only 1 replica endpoint is down. I don't see why this failure is necessary all of the time, since the data could have been updated after the node became unavailable, in which case its data is outdated anyway. If only one node goes down and it has the key I need, then the app is not 100% available and it could take some time to make the node available again. Proposal: If all of the *available* replica nodes answer the read operation and the latest value timestamp is clearly AFTER the time the down node became unavailable, then this situation can meet the requirements for *near* 100% consistency, since the value on the down node would be outdated anyway. Clearly, the value was updated some time *after* the node went down or became unavailable. This way, you can have maximum availability when using reads with CL.ALL... or some CL close in meaning to ALL. I say near 100% consistency to leave room for a situation where the unavailable node was only unavailable to the coordinating node for some reason, such as a network issue, and thus still received an update by some other route after it appeared unavailable to the current coordinating node. In a situation like this, there is a chance the read will still not return the latest value. So, this will not be truly 100% consistent, which CL.ALL guarantees. However, I think this logic could justify a new consistency level slightly lower than ALL, such as ALL_AVAIL. What do you think? Is my logic correct? Is there a conflict with the architecture or base principles? This fits with the tunable consistency principle for sure. I don't think this buys you anything that you can't get with quorum reads and writes. -ryan
Re: snitch thrift
On Thu, Jun 16, 2011 at 6:11 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: Hi all! Assuming a node ends up in GC land for a while, there is a good chance that even though it performs terribly and the dynamic snitching will help you to avoid it on the gossip side, it will not really help you much if thrift still accepts requests and the thrift interface has choppy performance. This makes me wonder whether thrift-only, client-mode nodes might be a good idea. Those could GC too, albeit to a lesser degree. I don't think I have seen that this exists today (or is it possible that I have missed a way to configure that?), but it does not seem like a very hard thing to make, and could maybe be good in some usage patterns for the data node as well as the thrift side. It might sometimes be useful, but we can't really know without running some tests. -ryan
Re: Unable to access column family in CLI after building CF in CQL
Ah that works. Thanks everyone for the help. On Thu, Jun 16, 2011 at 9:04 AM, Konstantin Naryshkin konstant...@a-bb.netwrote: The second error (the CQL select) is because you have different Key Validation Class values for your two user columns. users is org.apache.cassandra.db.marshal.BytesType, while users2 is org.apache.cassandra.db.marshal.UTF8Type. The select is failing because you are comparing a String to a bunch of bytes. -- *From: *yikes bigdata yikes.bigd...@gmail.com *To: *user@cassandra.apache.org *Sent: *Thursday, June 16, 2011 3:51:41 PM *Subject: *Unable to access column family in CLI after building CF in CQL Hi, I was following the CQL example on the DataStax website and was able to create a new column family and query it. But when I viewed the column family in the CLI, it gives me the following error. # Unable to read column family created from CQL [default@store] list users2; *users2 not found in current keyspace.* Also, when I try to query the user table from CQL, i'm unable to filter on a key. The user table was created in the CLI but accessible by CQL with a simple select * from users; cqlsh select * from users where key='tyler'; *Bad Request: cannot parse 'tyler' as hex bytes* # In the CLI, the store keyspaces displays two column families . 
[default@store] show keyspaces; Keyspace: store: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Options: [replication_factor:1] Column Families: *ColumnFamily: users* Key Validation Class: org.apache.cassandra.db.marshal.BytesType Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.AsciiType Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: false Built indexes: [] Column Metadata: Column Name: email Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: userName Validation Class: org.apache.cassandra.db.marshal.UTF8Type *ColumnFamily: users2* Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 20.0/14400 Memtable thresholds: 0.267187497/57/1440 (millions of ops/MB/minutes) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Built indexes: [] Column Metadata: Column Name: session_token Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: state Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: password Validation Class: org.apache.cassandra.db.marshal.UTF8Type Column Name: birth_year Validation Class: org.apache.cassandra.db.marshal.LongType Column Name: gender Validation Class: org.apache.cassandra.db.marshal.UTF8Type Keyspace: system: Able to see the list of keys generate within the CLI [default@store] list users; Using default limit of 100 --- RowKey: foo = (column=age, value=3339, timestamp=1308182349595000) = 
(column=email, value=f...@email.com, timestamp=1308182349594000) = (column=userName, value=foo, timestamp=1308182349591000) --- RowKey: bar = (column=email, value=b...@email.com, timestamp=1308182355297000) = (column=gender, value=66, timestamp=1308182355299000) = (column=userName, value=bar, timestamp=1308182355295000) --- RowKey: tyler = (column=email, value=ty...@email.com, timestamp=1308182355303000) = (column=sports, value=6261736562616c6c, timestamp=1308182355309000) = (column=userName, value=tyler, timestamp=1308182355302000)
Re: Cassandra Statistics and Metrics
There's a possibility of using a command-line JMX client with the standard Zabbix agent to request JMX counters without incorporating zapcat into Cassandra or another Java app. I'm investigating this feature right now, and will post results when finished. 2011/6/15 Viktor Jevdokimov vjevdoki...@gmail.com http://www.kjkoster.org/zapcat/Zapcat_JMX_Zabbix_Bridge.html 2011/6/14 Marcos Ortiz mlor...@uci.cu Where can I find the source code? On 6/14/2011 10:13 AM, Viktor Jevdokimov wrote: We're using the open source monitoring solution Zabbix from http://www.zabbix.com/ using zapcat - not only for Cassandra but for the whole system. As the MX4J tools plugin is supported by Cassandra, support for zapcat in Cassandra by default is welcome - we have to use a wrapper to start the zapcat agent. 2011/6/14 Marcos Ortiz mlor...@uci.cu Regards to all. My team and I here at the University are working on a generic solution for Monitoring and Capacity Planning for Open Source Databases, and one of the NoSQL databases that we chose to support is Cassandra. Where can I find all the metrics and statistics of Cassandra? I'm thinking, for example: - Available space - Number of CFs and all kinds of metrics. We are using for this development: Python + Django + Twisted + Orbited + jQuery. The idea behind it is to build a Comet-based web application on top of these technologies. Any advice is welcome -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186 -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On 6/16/2011 10:05 AM, Ryan King wrote: I don't think this buys you anything that you can't get with quorum reads and writes. -ryan QUORUM <= ALL_AVAIL <= ALL == RF
RE: Propose new ConsistencyLevel.ALL_AVAIL for reads
I think this would add a lot of complexity behind the scenes and be conceptually confusing, particularly for new users. The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. Perhaps the simpler approach, which is fairly trivial and does not require any Cassandra change, is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact that you are using CL=ALL for reads, I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Dan -Original Message- From: Ryan King [mailto:r...@twitter.com] Sent: June-16-11 12:05 To: user@cassandra.apache.org Subject: Re: Propose new ConsistencyLevel.ALL_AVAIL for reads On Thu, Jun 16, 2011 at 8:18 AM, AJ a...@dude.podzone.net wrote: Good morning all. Hypothetical Setup: 1 data center RF = 3 Total nodes 3 Problem: Suppose I need maximum consistency for one critical operation; thus I specify CL = ALL for reads. However, this will fail if only 1 replica endpoint is down. I don't see why this failure is necessary all of the time, since the data could have been updated after the node became unavailable, in which case its data is outdated anyway. If only one node goes down and it has the key I need, then the app is not 100% available and it could take some time to make the node available again. Proposal: If all of the *available* replica nodes answer the read operation and the latest value timestamp is clearly AFTER the time the down node became unavailable, then this situation can meet the requirements for *near* 100% consistency, since the value on the down node would be outdated anyway.
Clearly, the value was updated some time *after* the node went down or became unavailable. This way, you can have maximum availability when using reads with CL.ALL... or some CL close in meaning to ALL. I say near 100% consistency to leave room for a situation where the unavailable node was only unavailable to the coordinating node for some reason, such as a network issue, and thus still received an update by some other route after it appeared unavailable to the current coordinating node. In a situation like this, there is a chance the read will still not return the latest value. So, this will not be truly 100% consistent, which CL.ALL guarantees. However, I think this logic could justify a new consistency level slightly lower than ALL, such as ALL_AVAIL. What do you think? Is my logic correct? Is there a conflict with the architecture or base principles? This fits with the tunable consistency principle for sure. I don't think this buys you anything that you can't get with quorum reads and writes. -ryan
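Dan's fallback suggestion — attempt the read at ALL and, on an unavailable error, retry at QUORUM — can be sketched client-side. The names here are illustrative (do_read stands in for whatever call your driver exposes; this is not a real driver API):

```python
# Hypothetical client-side downgrade: try CL=ALL first, fall back to QUORUM
# only when the cluster reports that not all replicas are available.

class UnavailableException(Exception):
    pass

def read_with_fallback(do_read):
    try:
        return do_read("ALL")
    except UnavailableException:
        return do_read("QUORUM")

# Simulated driver call: ALL fails because one of three replicas is down.
def fake_read(cl):
    if cl == "ALL":
        raise UnavailableException("1 of 3 replicas down")
    return ("value", cl)

print(read_with_fallback(fake_read))  # ('value', 'QUORUM')
```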
Re: Cassandra Statistics and Metrics
This is what I use: http://code.google.com/p/simple-cassandra-monitoring/ Disclaimer: I did it myself, don't expect too much :P On Thu, 16-06-2011 at 19:35 +0300, Viktor Jevdokimov wrote: There's a possibility of using a command-line JMX client with the standard Zabbix agent to request JMX counters without incorporating zapcat into Cassandra or another Java app. I'm investigating this feature right now, and will post results when finished. 2011/6/15 Viktor Jevdokimov vjevdoki...@gmail.com http://www.kjkoster.org/zapcat/Zapcat_JMX_Zabbix_Bridge.html 2011/6/14 Marcos Ortiz mlor...@uci.cu Where can I find the source code? On 6/14/2011 10:13 AM, Viktor Jevdokimov wrote: We're using the open source monitoring solution Zabbix from http://www.zabbix.com/ using zapcat - not only for Cassandra but for the whole system. As the MX4J tools plugin is supported by Cassandra, support for zapcat in Cassandra by default is welcome - we have to use a wrapper to start the zapcat agent. 2011/6/14 Marcos Ortiz mlor...@uci.cu Regards to all. My team and I here at the University are working on a generic solution for Monitoring and Capacity Planning for Open Source Databases, and one of the NoSQL databases that we chose to support is Cassandra. Where can I find all the metrics and statistics of Cassandra? I'm thinking, for example: - Available space - Number of CFs and all kinds of metrics. We are using for this development: Python + Django + Twisted + Orbited + jQuery. The idea behind it is to build a Comet-based web application on top of these technologies. Any advice is welcome -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186 -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: snitch thrift
Seems like a more robust solution would be to implement dynamic-snitch-like behavior in the client. Hector has done this for a few months now. https://github.com/rantav/hector/blob/master/core/src/main/java/me/prettyprint/cassandra/connection/DynamicLoadBalancingPolicy.java On Thu, Jun 16, 2011 at 9:12 AM, Ryan King r...@twitter.com wrote: On Thu, Jun 16, 2011 at 6:11 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: Hi all! Assuming a node ends up in GC land for a while, there is a good chance that even though it performs terribly and the dynamic snitching will help you to avoid it on the gossip side, it will not really help you much if thrift still accepts requests and the thrift interface has choppy performance. This makes me wonder whether thrift-only, client-mode nodes might be a good idea. Those could GC too, albeit to a lesser degree. I don't think I have seen that this exists today (or is it possible that I have missed a way to configure that?), but it does not seem like a very hard thing to make and could maybe be good in some usage patterns on the data-node side as well as the thrift side. It might be sometimes useful, but we can't really know without running some tests. -ryan -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
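The general shape of client-side dynamic-snitch behavior (not Hector's actual algorithm, just an illustration of the idea): track a smoothed latency score per host and route requests to the cheapest one, so a node stuck in GC is naturally avoided.

```python
class LatencyAwareBalancer:
    """Toy latency-aware host selection, loosely in the spirit of
    Hector's DynamicLoadBalancingPolicy (not its actual algorithm):
    keep an exponentially weighted moving average of response time
    per host and route to the lowest-scoring one."""

    def __init__(self, hosts, alpha=0.2):
        self.alpha = alpha                      # smoothing factor
        self.scores = {h: 0.0 for h in hosts}   # EWMA latency per host

    def record(self, host, latency_ms):
        old = self.scores[host]
        self.scores[host] = (1 - self.alpha) * old + self.alpha * latency_ms

    def pick(self):
        # Prefer the host with the lowest smoothed latency.
        return min(self.scores, key=self.scores.get)

balancer = LatencyAwareBalancer(["10.0.0.1", "10.0.0.2"])
balancer.record("10.0.0.1", 5.0)
balancer.record("10.0.0.2", 250.0)  # e.g. a node momentarily stuck in GC
```

Any real implementation also needs decay back toward neutral so a recovered node is retried, which this sketch omits.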
Re: need some help with counters
On Jun 13, 2011, at 5:10 AM, aaron morton wrote: I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. AFAIK that's not a great application for counters. You would need range support in the secondary indexes so you could get the first X rows ordered by a column value. To be honest, depending on scale, I'd consider a sorted set in redis for that. It does. Thanks Aaron. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 11 Jun 2011, at 00:36, Ian Holsman wrote: On Jun 9, 2011, at 10:04 PM, aaron morton wrote: I may be missing something but could you use a column for each of the last 48 hours all in the same row for a url ? e.g. { /url.com/hourly : { 20110609T01:00:00 : 456, 20110609T02:00:00 : 4567, } } yes.. that would work better... I was storing all the different times in the same row. { /url.com : { H-20110609T01:00:00 : 456, H-20110609T02:00:00 : 4567, D-20110609 : 5678, } } I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. Increment the current hour only. Delete the older columns either when a read detects there are old values or as a maintenance job. Or as part of writing values for the first 5 minutes of any hour. yes.. I thought of that. The problem with doing it on read is there may be a case where an old URL never gets read.. so it will just sit there taking up space.. the maintenance job is the route I went down. The row will get spread out over a lot of sstables which may reduce read speed. If this is a problem consider a separate CF with more aggressive GC and compaction settings. Thanks! Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 10 Jun 2011, at 09:28, Ian Holsman wrote: So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible.
I could always use a logfile and run the archiver off that as a worst case I guess. Would doing so many deletes screw up the db/cause other problems? --- Ian Holsman - 703 879-3128 I saw the angel in the marble and carved until I set him free -- Michelangelo On 09/06/2011, at 4:22 PM, Ryan King r...@twitter.com wrote: On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman had...@holsman.net wrote: Hi Ryan. you wouldn't have your version of cassandra up on github would you?? No, and the patch isn't in our version yet either. We're still working on it. -ryan
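The hourly-bucket scheme discussed in this thread can be sketched as plain functions: one derives the column name for a given hour, and one lets a maintenance job decide which columns have fallen out of the 48-hour window. Column names follow the `H-20110609T01:00:00` format quoted above; the fixed-width format means lexicographic comparison orders them chronologically.

```python
from datetime import datetime, timedelta

def hour_column(ts):
    """Column name for the hourly counter bucket containing timestamp ts."""
    return "H-" + ts.strftime("%Y%m%dT%H:00:00")

def expired_columns(columns, now, window_hours=48):
    """Columns a maintenance job should delete: hourly buckets older
    than the retention window. Fixed-width names sort chronologically,
    so a string comparison against the cutoff name is enough."""
    cutoff = hour_column(now - timedelta(hours=window_hours))
    return [c for c in columns if c.startswith("H-") and c < cutoff]

now = datetime(2011, 6, 9, 2, 0)
cols = ["H-20110607T01:00:00", "H-20110609T01:00:00", "D-20110609"]
```

The maintenance job would iterate rows, call `expired_columns`, and issue the deletes in batches, which matches the route Ian said he went down.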
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On 6/16/2011 10:58 AM, Dan Hendry wrote: I think this would add a lot of complexity behind the scenes and be conceptually confusing, particularly for new users. I'm not so sure about this. Cass is already somewhat sophisticated and I don't see how this could trip up anyone who can already grasp the basics. The only thing I am adding to the CL concept is the concept of available replication nodes, versus total replication nodes. But, don't forget; a competitor to Cass is probably in the works this very minute so constant improvement is a good thing. The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below. Perhaps the simpler approach which is fairly trivial and does not require any Cassandra change is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. It's not so trivial, esp since you would have to build that into your client at many levels. I think it would be more appropriate (if this idea survives) to put it into Cass. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Very true. Specifying quorum for BOTH reads/writes provides the 100% consistency because of the overlapping of the availability numbers. But, only if the # of available nodes is not < RF.
Upon further reflection, this idea can be used for any consistency level. The general thrust of my argument is: If a particular value can be overwritten by one process regardless of its prior value, then that implies that the value in the down node is no longer up-to-date and can be disregarded. Just work with the nodes that are available. Actually, now that I think about it... ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value is later than the latest unavailability time of all unavailable replica nodes for that value's row key. Unavailable is defined as a node's Cass process that is not reachable from ANY node in the cluster in the same data center. If the node in question is available to at least one node, then the read should fail as there is a possibility that the value could have been updated some other way. After looking at the code, it doesn't look like it will be difficult. Instead of skipping the request for values from the nodes when CL nodes aren't available, it would have to go ahead and request the values from the available nodes as usual and then look at the timestamps, which it does anyway, and compare it to the latest unavailability time of the relevant replica nodes. The code that keeps track of what nodes are down simply records the time it went down. But, I've only been looking at the code for a few days so I'm not claiming to know everything by any stretch. Dan Thanks for your reply. I still welcome critiques.
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On Thu, Jun 16, 2011 at 1:05 PM, AJ a...@dude.podzone.net wrote: On 6/16/2011 10:58 AM, Dan Hendry wrote: I think this would add a lot of complexity behind the scenes and be conceptually confusing, particularly for new users. I'm not so sure about this. Cass is already somewhat sophisticated and I don't see how this could trip up anyone who can already grasp the basics. The only thing I am adding to the CL concept is the concept of available replication nodes, versus total replication nodes. But, don't forget; a competitor to Cass is probably in the works this very minute so constant improvement is a good thing. There are already many competitors. The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below. If this is your constraint, then you should just use CL.ONE. Perhaps the simpler approach which is fairly trivial and does not require any Cassandra change is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. It's not so trivial, esp since you would have to build that into your client at many levels. I think it would be more appropriate (if this idea survives) to put it into Cass. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Very true.
Specifying quorum for BOTH reads/writes provides the 100% consistency because of the overlapping of the availability numbers. But, only if the # of available nodes is not < RF. No, it will work as long as the available nodes is >= RF/2 + 1 Upon further reflection, this idea can be used for any consistency level. The general thrust of my argument is: If a particular value can be overwritten by one process regardless of its prior value, then that implies that the value in the down node is no longer up-to-date and can be disregarded. Just work with the nodes that are available. Actually, now that I think about it... ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value is later than the latest unavailability time of all unavailable replica nodes for that value's row key. Unavailable is defined as a node's Cass process that is not reachable from ANY node in the cluster in the same data center. If the node in question is available to at least one node, then the read should fail as there is a possibility that the value could have been updated some other way. Node A can't reliably and consistently know whether node B and node C can communicate. After looking at the code, it doesn't look like it will be difficult. Instead of skipping the request for values from the nodes when CL nodes aren't available, it would have to go ahead and request the values from the available nodes as usual and then look at the timestamps, which it does anyway, and compare it to the latest unavailability time of the relevant replica nodes. The code that keeps track of what nodes are down simply records the time it went down. But, I've only been looking at the code for a few days so I'm not claiming to know everything by any stretch. -ryan
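The quorum arithmetic behind this exchange is standard: a QUORUM operation needs floor(RF/2) + 1 replicas, and reads are guaranteed to see the latest write whenever the read and write replica counts overlap, i.e. R + W > RF. A quick sketch:

```python
def quorum(rf):
    """Number of replicas that must respond for a QUORUM operation."""
    return rf // 2 + 1

def overlap_guaranteed(r, w, rf):
    """True when every read set of size r must intersect every write
    set of size w among rf replicas (the R + W > RF rule)."""
    return r + w > rf
```

So with RF=3, QUORUM reads and writes each need 2 replicas and always overlap, while CL.ONE on both sides (1 + 1 = 2, not > 3) gives no such guarantee.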
Visiting Auckland
So long as the Volcanic Ash stays away I'll be visiting Auckland next week on the 23rd and 24th. Drop me an email if you would like to meet to talk about things Cassandra. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On 6/16/2011 2:37 PM, Ryan King wrote: On Thu, Jun 16, 2011 at 1:05 PM, AJ a...@dude.podzone.net wrote: snip The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below. If this is your constraint, then you should just use CL.ONE. My constraint is a CL = All Available. So, CL.ONE will not work. Perhaps the simpler approach which is fairly trivial and does not require any Cassandra change is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. It's not so trivial, esp since you would have to build that into your client at many levels. I think it would be more appropriate (if this idea survives) to put it into Cass. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Very true. Specifying quorum for BOTH reads/writes provides the 100% consistency because of the overlapping of the availability numbers. But, only if the # of available nodes is not < RF. No, it will work as long as the available nodes is >= RF/2 + 1 Yes, that's what I meant. Sorry for any confusion. Restated: But, only if the # of available nodes is not < RF/2 + 1. Upon further reflection, this idea can be used for any consistency level.
The general thrust of my argument is: If a particular value can be overwritten by one process regardless of its prior value, then that implies that the value in the down node is no longer up-to-date and can be disregarded. Just work with the nodes that are available. Actually, now that I think about it... ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value is later than the latest unavailability time of all unavailable replica nodes for that value's row key. Unavailable is defined as a node's Cass process that is not reachable from ANY node in the cluster in the same data center. If the node in question is available to at least one node, then the read should fail as there is a possibility that the value could have been updated some other way. Node A can't reliably and consistently know whether node B and node C can communicate. Well, theoretically, of course; that's the nature of distributed systems. But, Cass does indeed make that determination when it counts the number of available replica nodes before it decides if enough replica nodes are available. But, this is obvious to you I'm sure so maybe I don't understand your statement. After looking at the code, it doesn't look like it will be difficult. Instead of skipping the request for values from the nodes when CL nodes aren't available, it would have to go ahead and request the values from the available nodes as usual and then look at the timestamps, which it does anyway, and compare it to the latest unavailability time of the relevant replica nodes. The code that keeps track of what nodes are down simply records the time it went down. But, I've only been looking at the code for a few days so I'm not claiming to know everything by any stretch. -ryan
Re: Force a node to form part of quorum
It would be great if Cassandra put this on their roadmap. There are a lot of durability benefits to incorporating DC awareness into the write consistency equation. MongoDB has this feature in their upcoming release: http://www.mongodb.org/display/DOCS/Data+Center+Awareness#DataCenterAwareness-Tagging%28version1.9.1%29 On Thu, Jun 16, 2011 at 6:57 AM, aaron morton aa...@thelastpickle.com wrote: Short answer: No. Medium answer: No, all nodes are equal. It could create a single point of failure if a QUORUM could not be formed without a specific node. Writes are sent to every replica. Reads with Read Repair enabled are also sent to every replica. For reads the closest UP node as determined by the snitch and possibly re-ordered by the Dynamic Snitch is asked to return the actual data. This replica must respond for the request to complete. If it's a question about maximising cache hits see https://github.com/apache/cassandra/blob/cassandra-0.8.0/conf/cassandra.yaml#L308 Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 16 Jun 2011, at 05:58, A J wrote: Is there a way to favor a node to always participate (or never participate) towards fulfillment of read consistency as well as write consistency ? Thanks AJ
Re: Force a node to form part of quorum
It would be great if Cassandra put this on their roadmap. There are a lot of durability benefits to incorporating DC awareness into the write consistency equation. You may be interested in the discussion here: https://issues.apache.org/jira/browse/CASSANDRA-2338 -- / Peter Schuller
Re: Easy way to overload a single node on purpose?
Having a ping column can work if every key is replicated to every node. It would tell you the cluster is working, sort of. Once the number of nodes is greater than the RF, it tells you a subset of the nodes works. The way our check works is that each node checks itself, so in this context we're not concerned about whether the cluster is up, but that each individual node is up. So the symptoms I saw, the node actually going down etc, were probably due to many different events happening at the time, and will be very hard to recreate? On Thu, Jun 16, 2011 at 6:16 AM, aaron morton aa...@thelastpickle.com wrote: DEBUG 14:36:55,546 ... timed out Is logged when the coordinator times out waiting for the replicas to respond, the timeout setting is rpc_timeout in the yaml file. This results in the client getting a TimedOutException. AFAIK there are no global everything-is-good/bad flags to check. e.g. AFAIK a node will not mark itself down if it runs out of disk space. So you need to monitor the free disk space and alert on that. Having a ping column can work if every key is replicated to every node. It would tell you the cluster is working, sort of. Once the number of nodes is greater than the RF, it tells you a subset of the nodes works. If you google around you'll find discussions about monitoring with munin, ganglia, cloud kick and Ops Centre. If you install mx4j you can access the JMX metrics via HTTP, Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 16 Jun 2011, at 10:38, Suan Aik Yeo wrote: Here's a weird one... what's the best way to get a Cassandra node into a half-crashed state? We have a 3-node cluster running 0.7.5. A few days ago this happened organically to node1 - the partition the commitlog was on was 100% full and there was a No space left on device error, and after a while, although the cluster and node1 was still up, to the other nodes it was down, and messages like: DEBUG 14:36:55,546 ...
timed out started to show up in its debug logs. We have a tool to indicate to the load balancer that a Cassandra node is down, but it didn't detect it that time. Now I'm having trouble purposefully getting the node back to that state, so that I can try other monitoring methods. I've tried to fill up the commitlog partition with other files, and although I get the No space left on device error, the node still doesn't go down and show the other symptoms it showed before. Also, if anyone could recommend a good way for a node itself to detect that it's in such a state I'd be interested in that too. Currently what we're doing is making a describe_cluster_name() thrift call, but that still worked when the node was down. I'm thinking of something like reading/writing to a fixed value in a keyspace as a check... Unfortunately Java-based solutions are out of the question. Thanks, Suan
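A canary check along the lines suggested (write a known value, read it back) can be expressed client-agnostically. Here `write_fn` and `read_fn` are placeholders for whatever non-Java client is in use (e.g. a Thrift or pycassa call against a dedicated health-check column), so this shows only the shape of the check, not a working monitor; the demo uses an in-memory dict as a stand-in:

```python
import time

def node_healthy(write_fn, read_fn, key="__canary__"):
    """Probe a node by writing a fresh timestamp token and reading it
    back. write_fn/read_fn are hypothetical client callables; any
    exception (e.g. no space left on device) or a stale read counts
    as unhealthy."""
    token = str(time.time())
    try:
        write_fn(key, token)
        return read_fn(key) == token
    except Exception:
        return False

# Demo with an in-memory stand-in for a real client:
store = {}
healthy = node_healthy(store.__setitem__, store.__getitem__)
```

Unlike `describe_cluster_name()`, which answers even on a wedged node, this exercises the full write/read path, which is exactly what failed in the commitlog-full scenario above.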
compression for regular column names?
Hi all, As a way of gaining familiarity with Cassandra I am migrating a table that is currently stored in a relational database and mapping it into a Cassandra column family. We add about 700,000 new rows a day to this table, and the average disk space used per row is ~ 300 bytes including indexes. The mapping from table to column family is straight forward - there is a one-one relationship between table columns and column family column names. The relational table has 19 columns. The length of the names of the columns is nearly 200 bytes whereas the average amount of data per row is only 130 bytes. Initially I used the identity map for this translation - i.e. my Cassandra column names were the same as the relational column names. I then found out I could save a lot of disk space by using single letter column names instead of the original relational names. I.e. use 'L' instead of 'LINK_IDENTIFIER' for a column name. The procedure I use to determine space used is: 1. rm -rf the cassandra var-lib directory 2. start cassandra, create keyspace, column families, etc. 3. insert records 4. stop cassandra 5. re-start cassandra 6. measure disk space with du -s the cassandra var-lib directory This seems to replace the commit logs with .db files. My questions are: 1. Is this a common practice (i.e. making the client responsible for shortening the column names) when dealing with a large number of fixed column names and a high volume of inserts? Is there any way that Cassandra can help out here? 2. Is there another way to transform the commit logs into .db files without stopping and starting the server? Thanks, ER
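The client-side name shortening described here is easy to centralize in one mapping so application code never sees the single-letter names. A sketch (only LINK_IDENTIFIER → 'L' comes from the thread; the other column names are invented for illustration):

```python
# Long relational column name -> short Cassandra column name.
# Only LINK_IDENTIFIER is from the thread; the rest are made up.
SHORT_NAMES = {
    "LINK_IDENTIFIER": "L",
    "CREATED_AT": "C",
    "SOURCE_URL": "S",
}
LONG_NAMES = {v: k for k, v in SHORT_NAMES.items()}

def shorten(row):
    """Rewrite a dict keyed by relational names into short names (writes)."""
    return {SHORT_NAMES[k]: v for k, v in row.items()}

def widen(row):
    """Inverse mapping, applied to reads."""
    return {LONG_NAMES[k]: v for k, v in row.items()}

row = {"LINK_IDENTIFIER": "abc123", "CREATED_AT": "2011-06-16"}
```

Keeping both directions in one module makes the translation lossless and keeps the short names a pure storage-layer concern.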
Re: compression for regular column names?
On Thu, Jun 16, 2011 at 3:41 PM, E R pc88m...@gmail.com wrote: Hi all, As a way of gaining familiarity with Cassandra I am migrating a table that is currently stored in a relational database and mapping it into a Cassandra column family. We add about 700,000 new rows a day to this table, and the average disk space used per row is ~ 300 bytes including indexes. The mapping from table to column family is straight forward - there is a one-one relationship between table columns and column family column names. The relational table has 19 columns. The length of the names of the columns is nearly 200 bytes whereas the average amount of data per row is only 130 bytes. Initially I used the identity map for this translation - i.e. my Cassandra column names were the same as the relational column names. I then found out I could save a lot of disk space by using single letter column names instead of the original relational names. I.e. use 'L' instead of 'LINK_IDENTIFIER' for a column name. The procedure I use to determine space used is: 1. rm -rf the cassandra var-lib directory 2. start cassandra, create keyspace, column families, etc. 3. insert records 4. stop cassandra 5. re-start cassandra 6. measure disk space with du -s the cassandra var-lib directory This seems to replace the commit logs with .db files. My questions are: 1. Is this a common practice (i.e. making the client responsible for shortening the column names) when dealing with a large number of fixed column names and a high volume of inserts? Is there any way that Cassandra can help out here? Yes, we're working on a new, compressed format (CASSANDRA-674). 2. Is there another way to transform the commit logs into .db files without stopping and starting the server? nodetool flush. -ryan
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On Thu, Jun 16, 2011 at 2:12 PM, AJ a...@dude.podzone.net wrote: On 6/16/2011 2:37 PM, Ryan King wrote: On Thu, Jun 16, 2011 at 1:05 PM, AJ a...@dude.podzone.net wrote: snip The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below. If this is your constraint, then you should just use CL.ONE. My constraint is a CL = All Available. So, CL.ONE will not work. That's a solution, not a requirement. What's your requirement? Perhaps the simpler approach which is fairly trivial and does not require any Cassandra change is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. It's not so trivial, esp since you would have to build that into your client at many levels. I think it would be more appropriate (if this idea survives) to put it into Cass. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Very true. Specifying quorum for BOTH reads/writes provides the 100% consistency because of the overlapping of the availability numbers. But, only if the # of available nodes is not < RF. No, it will work as long as the available nodes is >= RF/2 + 1 Yes, that's what I meant. Sorry for any confusion. Restated: But, only if the # of available nodes is not < RF/2 + 1.
Upon further reflection, this idea can be used for any consistency level. The general thrust of my argument is: If a particular value can be overwritten by one process regardless of its prior value, then that implies that the value in the down node is no longer up-to-date and can be disregarded. Just work with the nodes that are available. Actually, now that I think about it... ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value is later than the latest unavailability time of all unavailable replica nodes for that value's row key. Unavailable is defined as a node's Cass process that is not reachable from ANY node in the cluster in the same data center. If the node in question is available to at least one node, then the read should fail as there is a possibility that the value could have been updated some other way. Node A can't reliably and consistently know whether node B and node C can communicate. Well, theoretically, of course; that's the nature of distributed systems. But, Cass does indeed make that determination when it counts the number of available replica nodes before it decides if enough replica nodes are available. But, this is obvious to you I'm sure so maybe I don't understand your statement. Consider this scenario: given nodes, A, B and C and A thinks C is down but B thinks C is up. What do you do? Remember, A doesn't know that B thinks C is up, it only knows its own state. -ryan
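Ryan's scenario (A thinks C is down while B thinks C is up) is easy to make concrete: each coordinator only has its own failure-detector view, so two coordinators can compute different "available" sets for the same replicas and would apply the proposed rule differently. A toy illustration:

```python
def available_to(coordinator_view, replicas):
    """Replicas a coordinator believes are up, given only its own
    failure-detector view (a dict of node -> believed-up bool)."""
    return {n for n in replicas if coordinator_view.get(n, False)}

replicas = {"A", "B", "C"}
view_a = {"A": True, "B": True, "C": False}  # A thinks C is down
view_b = {"A": True, "B": True, "C": True}   # B thinks C is up

avail_via_a = available_to(view_a, replicas)
avail_via_b = available_to(view_b, replicas)
```

Since the two views disagree, a rule defined in terms of "all available replicas" gives different answers depending on which node coordinates the read, which is the core of the objection.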
Re: jsvc hangs shell
Anton Belyaev anton.belyaev at gmail.com writes: I guess it is not trivial to modify the package to make it use JSW instead of JSVC. I am still not sure the JSVC itself is the culprit. Maybe something is wrong in my setup. I am seeing similar behavior using the Brisk Debian packages for Maverick: http://www.datastax.com/docs/0.8/brisk/install_brisk_packages#installing-the-brisk-packaged-releases Not sure if it's my configuration, but I verified it on two separate installs. -Ken
Brisk .rpm packages for CentOS/RH/Fedora
Regards to all Cassandra users I don't know if Brisk has its own mailing list, so I ask here. Does Brisk have .rpm packages for Red Hat and derived distributions (CentOS/Fedora)? If so, where can I find them? Thanks a lot for your time. -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) http://marcosluis2186.posterous.com
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
How would your solution deal with complete network partitions? A node being 'down' does not actually mean it is dead, just that it is unreachable from whatever is making the decision to mark it 'down'. Following from Ryan's example, consider nodes A, B, and C but within a fully partitioned network: all of the nodes are up but each thinks all the others are down. Your ALL_AVAILABLE consistency level would boil down to consistency level ONE for clients connecting to any of the nodes. If I connect to A, it thinks it is the last one standing and translates 'ALL_AVALIABLE' into 'ONE'. Based on your logic, two clients connecting to two different nodes could each modify a value then read it, thinking that it's 100% consistent yet it is actually *completely* inconsistent with the value on other node(s). I suggest you review the principles of the infamous CAP theorem. The consistency levels as they stand now allow for an explicit trade-off between 'available and partition tolerant' (ONE read/write) OR 'consistent and available' (QUORUM read/write). Your solution achieves only availability and can guarantee neither consistency nor partition tolerance. On Thu, Jun 16, 2011 at 7:50 PM, Ryan King r...@twitter.com wrote: On Thu, Jun 16, 2011 at 2:12 PM, AJ a...@dude.podzone.net wrote: On 6/16/2011 2:37 PM, Ryan King wrote: On Thu, Jun 16, 2011 at 1:05 PM, AJ a...@dude.podzone.net wrote: snip The Cassandra consistency model is pretty elegant and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below.
If this is your constraint, then you should just use CL.ONE. My constraint is a CL = All Available. So, CL.ONE will not work. That's a solution, not a requirement. What's your requirement? Perhaps the simpler approach which is fairly trivial and does not require any Cassandra change is to simply downgrade your read from ALL to QUORUM when you get an unavailable exception for this particular read. It's not so trivial, esp since you would have to build that into your client at many levels. I think it would be more appropriate (if this idea survives) to put it into Cass. I think the general answer for 'maximum consistency' is QUORUM reads/writes. Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE for writes: this itself strikes me as a bad idea if you require 'maximum consistency for one critical operation'. Very true. Specifying quorum for BOTH reads/writes provides the 100% consistency because of the overlapping of the availability numbers. But, only if the # of available nodes is not < RF. No, it will work as long as the available nodes is >= RF/2 + 1 Yes, that's what I meant. Sorry for any confusion. Restated: But, only if the # of available nodes is not < RF/2 + 1. Upon further reflection, this idea can be used for any consistency level. The general thrust of my argument is: If a particular value can be overwritten by one process regardless of its prior value, then that implies that the value in the down node is no longer up-to-date and can be disregarded. Just work with the nodes that are available. Actually, now that I think about it... ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value is later than the latest unavailability time of all unavailable replica nodes for that value's row key. Unavailable is defined as a node's Cass process that is not reachable from ANY node in the cluster in the same data center.
If the node in question is available to at least one node, then the read should fail as there is a possibility that the value could have been updated some other way. Node A can't reliably and consistently know whether node B and node C can communicate. Well, theoretically, of course; that's the nature of distributed systems. But, Cass does indeed make that determination when it counts the number of available replica nodes before it decides if enough replica nodes are available. But, this is obvious to you I'm sure so maybe I don't understand your statement. Consider this scenario: given nodes, A, B and C and A thinks C is down but B thinks C is up. What do you do? Remember, A doesn't know that B thinks C is up, it only knows its own state. -ryan
cassandra crash
All: Why did Cassandra crash after printing the following log?

INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-206-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-207-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-137-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-205-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-139-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-138-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-208-Data.db
INFO [GC inspection] 2011-06-16 14:22:59,562 GCInspector.java (line 110) GC for ParNew: 385 ms, 26859800 reclaimed leaving 117789112 used; max is 118784

Best Regards
Donna li
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
UPDATE to my suggestion is below.

On 6/16/2011 5:50 PM, Ryan King wrote: On Thu, Jun 16, 2011 at 2:12 PM, AJ a...@dude.podzone.net wrote: On 6/16/2011 2:37 PM, Ryan King wrote: On Thu, Jun 16, 2011 at 1:05 PM, AJ a...@dude.podzone.net wrote: snip

The Cassandra consistency model is pretty elegant, and this type of approach breaks that elegance in many ways. It would also only really be useful when the value has a high probability of being updated between a node going down and the value being read. I'm not sure what you mean. A node can be down for days, during which time the value can be updated. The intention is to use the nodes available even if they fall below the RF. If there is only 1 node available for accepting a replica, that should be enough given the conditions I stated and updated below. If this is your constraint, then you should just use CL.ONE. My constraint is CL = ALL AVAILABLE, so CL.ONE will not work. That's a solution, not a requirement. What's your requirement?

Ok. And this updates my suggestion, removing the need for ALL_AVAIL. This adds logic to cope with unavailable nodes and still achieve consistency for a specific situation. The general requirement is to completely eliminate read failures for reads specifying CL = ALL for values that have been subject to a specific data update pattern. The specific data update pattern consists of a value that has been updated (or added) in the face of one or more, but fewer than RF, unavailable replica nodes (at least 1 replica node is available). If a particular data value (column value) is updated after the latest down node, this implies the new value is independent of any replica values that are currently unavailable. Therefore, in this situation, the number of available replicas is irrelevant. After querying all *available* replica nodes, the value with the latest timestamp is consistent if that timestamp is later than the time at which the last replica node became unavailable.
snip Well, theoretically, of course; that's the nature of distributed systems. But Cassandra does indeed make that determination when it counts the number of available replica nodes before it decides whether enough replica nodes are available. But this is obvious to you, I'm sure, so maybe I don't understand your statement. Consider this scenario: given nodes A, B and C, A thinks C is down but B thinks C is up. What do you do? Remember, A doesn't know that B thinks C is up; it only knows its own state. What kind of network configuration would have this kind of scenario? This method only applies within a data center, which should be OK since replication across data centers seems to be mostly for fault tolerance... but I will have to think about this. -ryan
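To make the updated proposal concrete, here is a toy sketch of the check being described (plain Python; the `Replica` type, `down_since` field, and function name are hypothetical illustrations, not Cassandra internals): read all *available* replicas, take the value with the latest timestamp, and accept it only if that timestamp is later than the time the last replica became unavailable.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Replica:
    value: Optional[str]         # None if the node is unreachable
    timestamp: float             # write timestamp of this node's copy
    down_since: Optional[float]  # when the node was marked down, else None

def all_avail_read(replicas):
    """Return the freshest value iff it is provably newer than any copy
    that could be hiding on a down node, per the proposal above."""
    up = [r for r in replicas if r.down_since is None]
    if not up:
        raise RuntimeError("no replicas available")
    latest = max(up, key=lambda r: r.timestamp)
    last_down = max((r.down_since for r in replicas if r.down_since is not None),
                    default=float("-inf"))
    if latest.timestamp > last_down:
        return latest.value
    raise RuntimeError("cannot guarantee consistency: a down replica may be fresher")
```

Note this sketch inherits exactly the weakness Ryan and Dan raise: `down_since` is only one node's local opinion, so under a partition two coordinators can disagree about who is "unavailable".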
Re: Brisk .rpm packages for CentOS/RH/Fedora
Yes, there is a Brisk list: brisk-us...@googlegroups.com Packages are available via rpm.datastax.com On Thu, Jun 16, 2011 at 8:21 PM, Marcos Ortiz Valmaseda mlor...@uci.cu wrote: Regards to all Cassandra users. I don't know if Brisk has its own mailing list, so I ask here. Does Brisk have .rpm packages for Red Hat and related distributions (CentOS/Fedora)? If so, where can I find them? Thanks a lot for your time. -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) http://marcosluis2186.posterous.com
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On 6/16/2011 7:56 PM, Dan Hendry wrote: How would your solution deal with complete network partitions? A node being 'down' does not actually mean it is dead, just that it is unreachable from whatever is making the decision to mark it 'down'. Following from Ryan's example, consider nodes A, B, and C within a fully partitioned network: all of the nodes are up, but each thinks all the others are down. Your ALL_AVAILABLE consistency level would boil down to consistency level ONE for clients connecting to any of the nodes. If I connect to A, it thinks it is the last one standing and translates ALL_AVAILABLE into ONE. Based on your logic, two clients connecting to two different nodes could each modify a value and then read it, thinking that it's 100% consistent, yet it is actually *completely* inconsistent with the value on the other node(s).

Help me out here. I'm trying to visualize a situation where the clients can access all the C* nodes but the nodes can't access each other. I don't see how that can happen on a regular ethernet subnet in one data center. Well, I'm sure there is a case that you can point out. Ok, I will concede that this is an issue for some network configurations.

I suggest you review the principles of the infamous CAP theorem. The consistency levels, as they stand now, allow for an explicit trade-off between 'available and partition tolerant' (ONE read/write) OR 'consistent and available' (QUORUM read/write). Your solution achieves only availability and can guarantee neither consistency nor partition tolerance. It looks like CAP may triumph again. Thanks for the exercise, Dan and Ryan.
Re: Cassandra JVM GC settings
It would help if you can provide some log messages from the GCInspector so people can see how much GC is going on. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 17 Jun 2011, at 02:46, Sebastien Coutu wrote: Hi Everyone, I'm seeing Cassandra GC a lot and I would like to tune the young space and the tenured space. Would anyone have recommendations on the NewRatio or NewSize/MaxNewSize to use for an environment where Cassandra has several column families and in which we are doing a mixed load of reading and writing? The JVM has 8G of heap space assigned to it and there are 9 nodes in this cluster. Thanks for the comments! Sébastien Coutu
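For reference, the knobs being discussed live in the JVM options (typically set in cassandra-env.sh). The fragment below is an illustrative starting point only, assumed rather than taken from this thread; the right values depend on the GCInspector output Aaron asks for, and the flags shown are standard HotSpot options of that era.

```shell
# Hedged sketch of cassandra-env.sh-style GC settings for an 8G heap.
# All numbers are illustrative starting points, not recommendations.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"   # young generation; tune against ParNew pause times

JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE} -Xmx${MAX_HEAP_SIZE} -Xmn${HEAP_NEWSIZE}"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1"
# Observe before tuning further:
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```

Setting an explicit -Xmn overrides -XX:NewRatio, so pick one mechanism or the other, not both.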
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
Help me out here. I'm trying to visualize a situation where the clients can access all the C* nodes but the nodes can't access each other. I don't see how that can happen on a regular ethernet subnet in one data center. Well, I'm sure there is a case that you can point out. Ok, I will concede that this is an issue for some network configurations.

First rule of designing/developing/operating distributed systems: assume anything and everything can and will happen, regardless of network configuration or hardware. This specific situation actually HAS happened to me. Our Cassandra nodes accept client connections on one ethernet interface on one network (the production network) yet communicate with each other on a separate ethernet interface on a separate network which is Cassandra-specific. This was done mainly due to the relatively large inter-node Cassandra bandwidth requirements in comparison to client bandwidth requirements. At one point, the switch for the Cassandra network went down, so clients could connect yet the Cassandra nodes could not talk to each other. (We write at ONE and read at ALL, so everything behaved as expected.)

On Thu, Jun 16, 2011 at 11:00 PM, AJ a...@dude.podzone.net wrote: On 6/16/2011 7:56 PM, Dan Hendry wrote: How would your solution deal with complete network partitions? A node being 'down' does not actually mean it is dead, just that it is unreachable from whatever is making the decision to mark it 'down'. Following from Ryan's example, consider nodes A, B, and C within a fully partitioned network: all of the nodes are up, but each thinks all the others are down. Your ALL_AVAILABLE consistency level would boil down to consistency level ONE for clients connecting to any of the nodes. If I connect to A, it thinks it is the last one standing and translates ALL_AVAILABLE into ONE.
Based on your logic, two clients connecting to two different nodes could each modify a value and then read it, thinking that it's 100% consistent, yet it is actually *completely* inconsistent with the value on the other node(s). Help me out here. I'm trying to visualize a situation where the clients can access all the C* nodes but the nodes can't access each other. I don't see how that can happen on a regular ethernet subnet in one data center. Well, I'm sure there is a case that you can point out. Ok, I will concede that this is an issue for some network configurations. I suggest you review the principles of the infamous CAP theorem. The consistency levels, as they stand now, allow for an explicit trade-off between 'available and partition tolerant' (ONE read/write) OR 'consistent and available' (QUORUM read/write). Your solution achieves only availability and can guarantee neither consistency nor partition tolerance. It looks like CAP may triumph again. Thanks for the exercise, Dan and Ryan.
Re: client API
The Thrift Java compiler creates code that is not compliant with Java 5. https://issues.apache.org/jira/browse/THRIFT-1170 So you may have trouble getting the Thrift API to run. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 17 Jun 2011, at 03:14, karim abbouh wrote: I use JDK 1.6 to install and launch Cassandra on a Linux platform, but can I use JDK 1.5 for my Cassandra client?
Re: Docs: Token Selection
But, I'm thinking about using OldNetworkTopStrat. NetworkTopologyStrategy is where it's at. A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 17 Jun 2011, at 01:39, AJ wrote: Thanks Eric! I've finally got it! I feel like I've just been initiated or something by discovering this secret. I kid! But, I'm thinking about using OldNetworkTopStrat. Do you, or anyone else, know if the same rules for token assignment apply to ONTS?

On 6/16/2011 7:21 AM, Eric tamme wrote: AJ, sorry I seemed to miss the original email on this thread. As Aaron said, when computing tokens for multiple data centers, you should compute them independently for each data center, as if it were its own Cassandra cluster. You can have overlapping token ranges between multiple data centers, but no two nodes can have the same token, so for subsequent data centers I just increment the tokens. For two data centers with two nodes each using RandomPartitioner, calculate the tokens for the first DC normally, but in the second data center, increment the tokens by one.

In DC 1: node 1 = 0, node 2 = 85070591730234615865843651857942052864
In DC 2: node 1 = 1, node 2 = 85070591730234615865843651857942052865

For RowMutations this will give each data center a local set of nodes that it can write to for complete coverage of the entire token space. If you are using NetworkTopologyStrategy for replication, it will give an offset mirror replication between the two data centers so that your replicas will not get pinned to a node in the remote DC. There are other ways to select the tokens, but the increment method is the simplest to manage and continue to grow with. Hope that helps. -Eric
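Eric's increment scheme can be sketched in a few lines (plain Python, illustrative only): compute the evenly spaced RandomPartitioner tokens for each DC as if it were its own ring, then offset each subsequent DC by its index so no two nodes share a token.

```python
# RandomPartitioner tokens span [0, 2**127). Evenly spaced tokens for a DC
# of n nodes are i * 2**127 // n; offsetting by the DC index keeps tokens
# unique across data centers while preserving the mirrored layout.
RING = 2 ** 127

def dc_tokens(nodes_in_dc, dc_index):
    return [i * RING // nodes_in_dc + dc_index for i in range(nodes_in_dc)]

print("DC1:", dc_tokens(2, 0))  # matches Eric's DC 1 tokens above
print("DC2:", dc_tokens(2, 1))  # matches Eric's DC 2 tokens above
```

Running this reproduces the two-node, two-DC assignment quoted in the thread.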
Re: Propose new ConsistencyLevel.ALL_AVAIL for reads
On 6/16/2011 9:36 PM, Dan Hendry wrote: Help me out here. I'm trying to visualize a situation where the clients can access all the C* nodes but the nodes can't access each other. I don't see how that can happen on a regular ethernet subnet in one data center. Well, I'm sure there is a case that you can point out. Ok, I will concede that this is an issue for some network configurations.

First rule of designing/developing/operating distributed systems: assume anything and everything can and will happen, regardless of network configuration or hardware. This specific situation actually HAS happened to me. Our Cassandra nodes accept client connections on one ethernet interface on one network (the production network) yet communicate with each other on a separate ethernet interface on a separate network which is Cassandra-specific. This was done mainly due to the relatively large inter-node Cassandra bandwidth requirements in comparison to client bandwidth requirements. At one point, the switch for the Cassandra network went down, so clients could connect yet the Cassandra nodes could not talk to each other. (We write at ONE and read at ALL, so everything behaved as expected.)

Funny, but that's the exact same setup I'm running. But I'm not a network guy and kind of assumed it wasn't so typical. Plus, lately I've had my mind on a cloud setup.
Re: Docs: Token Selection
On 6/16/2011 9:45 PM, aaron morton wrote: But, I'm thinking about using OldNetworkTopStrat. NetworkTopologyStrategy is where it's at. Oh yeah? It didn't look like it would serve my requirements. I want 2 full production geo-diverse data centers, with each serving as a failover for the other. Random Partitioner. Each DC holds 2 replicas from the local clients and 1 replica goes to the other DC. It doesn't look like I can do a yin-yang setup like that with NTS. Am I wrong? A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
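For context, a symmetric NTS layout fixes the replica count per data center in the keyspace definition, regardless of which DC a write originates in. A hedged sketch (0.8-era cassandra-cli syntax, keyspace name hypothetical; verify against your version):

```
create keyspace MyKS
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC1:2, DC2:2}];
```

This is the crux of AJ's question: because NTS pins per-DC counts globally, an origin-dependent "2 replicas local, 1 remote" split is not directly expressible; the nearest NTS equivalents are symmetric layouts such as 2+2 or the 2+1 from the thread at the top of this digest, which favors one DC.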