Re: what's the difference between repair CF separately and repair the entire node?
On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu springri...@gmail.com wrote:

> Me neither; I don't want to repair one CF at a time. The node repair took a week and was still running. compactionstats and netstats showed nothing running on any node, and there was no error message, no exception; really no idea what it was doing.

To add to the list of things repair does wrong in 0.7: if one of the nodes participating in the repair (so any node that shares a range with the node on which repair was started) goes down, even for a short time, then the repair will simply hang forever doing nothing. And no specific error message will be logged. That could be what happened. Again, recent releases of 0.8 fix that too.

-- Sylvain

> I stopped it yesterday. Maybe I should run repair again while disabling compaction on all nodes? Thanks!

On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller peter.schul...@infidyne.com wrote:

> > I think it is a serious problem since I cannot repair. I am using cassandra on production servers. Is there some way to fix it without upgrading? I have heard that 0.8.x is still not quite ready for production environments.
>
> It is a serious issue if you really need to repair one CF at a time. However, looking at your original post it seems this is not necessarily your issue. Do you need to, or was your concern rather the overall time repair took?
>
> There are other things that are improved in 0.8 relative to 0.7. In particular, (1) in 0.7 compaction, including the validating compactions that are part of repair, is non-concurrent, so if your repair starts while there is a long-running compaction going it will have to wait, and (2) semi-related, the merkle tree calculation that is part of repair/anti-entropy may happen out of sync if one of the participating nodes happens to be busy with compaction. This in turn causes additional data to be sent as part of repair. That might be why your immediately following repair took a long time, but it's difficult to tell.
>
> If you're having issues with repair and large data sets, I would generally say that upgrading to 0.8 is recommended. However, if you're on 0.7.4, beware of https://issues.apache.org/jira/browse/CASSANDRA-3166
>
> --
> / Peter Schuller (@scode on twitter)
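For anyone trying to tell whether a repair like this is actually doing anything, the two commands mentioned above can be polled on every node that shares a range with the node being repaired. A sketch, with the JMX port taken from elsewhere in this thread and the host name a placeholder:

    # validation compactions appear here while merkle trees are built
    ./bin/nodetool -h node1 -p 8080 compactionstats

    # inter-node streams appear here while repaired data is transferred
    ./bin/nodetool -h node1 -p 8080 netstats

If both stay empty on all replicas for a long stretch, the repair has most likely hung as described above.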
Re: what's the difference between repair CF separately and repair the entire node?
Is 0.8 ready for production use? As far as I know, many companies, including reddit.com, are currently using 0.7; how do they get around the repair problem?
Re: what's the difference between repair CF separately and repair the entire node?
On Wed, Sep 14, 2011 at 9:27 AM, Yan Chunlu springri...@gmail.com wrote:

> Is 0.8 ready for production use?

Some related discussion here: http://www.mail-archive.com/user@cassandra.apache.org/msg17055.html but my personal answer is yes.

> As far as I know, many companies, including reddit.com, are currently using 0.7; how do they get around the repair problem?

Repair problems in 0.7 don't hit everyone equally. For some people it works relatively well, even if not in the most efficient way. Also, for some workloads (if you don't do many deletes, for instance), you can set a big gc_grace_seconds value (say a month) and only run repair that often, which can make repair's inefficiencies more bearable. That being said, I can't speak for many companies, but I do advise evaluating an upgrade to 0.8.

-- Sylvain
Re: what's the difference between repair CF separately and repair the entire node?
It was mentioned in another thread that Twitter uses 0.8 in production; for me that was a fairly strong testimonial...

On Sep 14, 2011 9:28 AM, Yan Chunlu springri...@gmail.com wrote:

> Is 0.8 ready for production use? As far as I know, many companies, including reddit.com, are currently using 0.7; how do they get around the repair problem?
Re: what's the difference between repair CF separately and repair the entire node?
Thanks a lot for the help! I have read the post and think 0.8 might be good enough for me, especially 0.8.5. Also, changing gc_grace_seconds is an acceptable solution.
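Since gc_grace_seconds came up: it can be changed per column family from cassandra-cli without a restart. A sketch, with a made-up CF name; in the CLI the attribute is called gc_grace and is given in seconds, so a month is roughly 2592000:

    update column family MyCF with gc_grace = 2592000;

The usual caveat applies: repair then needs to complete at least once every gc_grace_seconds, or deleted data may reappear.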
segment fault with 0.8.5
Just tried the cassandra 0.8.5 binary version and got a segmentation fault. I am using the Sun JDK, so this is not CASSANDRA-2441. The OS is Debian 5.0.

java -version
java version "1.6.0_04"
Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode)

uname -a
Linux mao 2.6.27.59 #1 SMP Mon Jul 25 14:30:33 CST 2011 i686 GNU/Linux

I also found that the format of the configuration file cassandra.yaml is different; are the two formats compatible? Thanks!
Re: segment fault with 0.8.5
On Wed, Sep 14, 2011 at 3:43 PM, Yan Chunlu springri...@gmail.com wrote:

> I also found that the format of the configuration file cassandra.yaml is different; are the two formats compatible?

The format of the 0.8.5 cassandra.yaml is different from what? You didn't mention what you are comparing it to.

I recently migrated a simple cassandra DB from 0.7.0 to 0.8.5 and found quite a few differences in the structure of cassandra.yaml. The biggest one that affected us was that cassandra.yaml could no longer hold the definition of a keyspace, which we used for the embedded cassandra we bring up for testing.

--
Roshan
Blog: http://roshandawrani.wordpress.com/
Twitter: @roshandawrani http://twitter.com/roshandawrani
Skype: roshandawrani
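To make the difference concrete: a 0.7-era cassandra.yaml could still carry a keyspaces section (loadable into an embedded instance, or via schematool/JMX), roughly like the sketch below with invented names, while 0.8 keeps the schema solely in the system tables and expects it to be created through the API or cassandra-cli:

    keyspaces:
        - name: TestKeyspace
          replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
          replication_factor: 1
          column_families:
              - name: TestCF
                compare_with: BytesType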
Re: Cassandra cluster on ec2 and ebs volumes
[moving to user@]

On Wed, Sep 14, 2011 at 6:22 AM, Giannis Neokleous gian...@generalsentiment.com wrote:

> Hello,
>
> We currently have a cluster running on ec2 and all of the data is on the instance disks. We also have some old data, now static, that we want to serve from a different cluster, still running on ec2. We want the ability to turn this cluster on and off at any time without having to reinsert any of the data.
>
> Is it possible to set up cassandra on ec2 so that the data can live on ebs volumes that are attached/detached every time we want to bring the cluster down? Reloading the sstables will not work for us because we want to be able to turn on the cluster and have it serving data within minutes.
>
> Does anyone have this kind of setup working right now, and if so, how reliable is it?
>
> Thanks,
> -Giannis

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: segment fault with 0.8.5
That's a pretty old JDK. You should upgrade.

On Wed, Sep 14, 2011 at 5:13 AM, Yan Chunlu springri...@gmail.com wrote:

> just tried the cassandra 0.8.5 binary version and got a segmentation fault ... java version "1.6.0_04" ...

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
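On Debian 5, assuming the non-free repository is enabled, updating the Sun JDK should be along these lines (package name as it was shipped in lenny; verify against your mirror):

    apt-get update
    apt-get install sun-java6-jdk
    java -version    # should now report something well past 1.6.0_04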
Re: Error in upgrading cassandra to 0.8.5
On 09/13/2011 05:21 PM, Jonathan Ellis wrote:

> More or less. NEWS.txt explains the upgrade procedure in more detail.

When moving from 0.7.x to 0.8.5, do I need to scrub all sstables post-upgrade? NEWS.txt doesn't mention anything about that, but your comment here seems to indicate so:
https://issues.apache.org/jira/browse/CASSANDRA-2739?focusedCommentId=13071490&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13071490

/ Jonas
Re: Index search in provided list of rows (list of rowKeys).
Why is it radical? It would be the same get_indexed_slices search, but within a specified set of rows. Mostly it would be one more search expression, over row IDs and not only column values. Usually, the more restrictions you can specify in a search query, the faster the search can be (not slower, at least).

About moving to another engine: Sphinx has its advantages (quite fast) and disadvantages (painful integration, lots of limitations). My company is currently using it in production, so moving to another search engine is a big step, though it will be considered. What I want to discuss is the common task of searching in Cassandra. Maybe I am missing some already well known solution for it (a silver bullet)? I see only 2 solutions:

1) Use an external search engine that indexes all storage fields.
Advantages: supports full text search; some engines have nice search features like sorting by relevance.
Disadvantages: to support range scans it stores column values, which means that a huge part of the cassandra data will also be stored in the search engine's metadata; engines usually have their own sets of limitations.

2) Use Cassandra's embedded index search.
Advantages: doesn't need to index all columns that are used for filtering; filtering is performed at the storage layer, close to the data.
Disadvantages: no full text search support; requires creating and maintaining secondary indexes.

The two solutions are exclusive: you can choose only one, and there is no way to use a combination of the two (except intersection on the client side, which is not a solution). So the API that was discussed would open up the possibility of using that combination. For me it looks like a third solution. Could it really change the way we search in Cassandra?

Evgeny.
Get CL ONE / NTS
Hello,

I have 2 datacenters. Cassandra is configured as follows:
- RackInferringSnitch
- NetworkTopologyStrategy for the CF
- strategy_options: DC1:3 DC2:3

Data is written using CL LOCAL_QUORUM, so data written from one datacenter will eventually be replicated to the other datacenter. Data is always written exactly once.

On the other side, I'd like to improve the read path. I'm currently using CL ONE, since data is only written once (i.e. the timestamp is more or less meaningless in my case).

This is where I have some doubts: if data is written on DC1 and tentatively read from DC2 while the data is still not replicated, or only partially replicated (for whatever good reason, since replication is async), what is the behavior of a get with CL ONE / NTS?

1/ Will I have an error because DC2 does not have any copy of the data?
2/ Will Cassandra try to get the data from DC1 if nothing is found in DC2?
3/ In case of partial replication to DC2, will I sometimes see errors about servers not holding the data in DC2?
4/ Does a get at CL ONE fail as soon as the fastest server to answer says it does not have the data, or does it wait until all servers say they do not have the data?

Thanks a lot,
- Pierre
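For reference, the setup described above corresponds to a keyspace defined along these lines in cassandra-cli (keyspace name invented, DC names as reported by the snitch):

    create keyspace MyKeyspace
        with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
        and strategy_options = [{DC1:3, DC2:3}];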
Nodetool removetoken taking days to run.
Hi,

So, here's the backstory: We were running Cassandra 0.7.4 and at one point in time had a node in the ring at 10.84.73.18. We removed this node from the ring successfully in 0.7.4; it stopped showing in the nodetool ring command. But occasionally we'd still get weird log entries about failing to write/read to IP 10.84.73.18.

We upgraded to Cassandra 0.8.4. Now nodetool ring shows this old node:

10.84.73.18  datacenter1  rack1  Down  Leaving  ?  6.71%  32695837177645752437561450928649262701

So I started a nodetool removetoken on 32695837177645752437561450928649262701 last Friday. It's still going strong this morning, on day 5:

./bin/nodetool -h 10.84.73.47 -p 8080 removetoken status
RemovalStatus: Removing token (32695837177645752437561450928649262701). Waiting for replication confirmation from [/10.84.73.49,/10.84.73.48,/10.84.73.51].

Should I just be patient? Or is something really weird with this node?

Thanks-
ryan
Re: Error in upgrading cassandra to 0.8.5
Added to NEWS:

- After upgrading, run nodetool scrub against each node before running repair, moving nodes, or adding new ones.

2011/9/14 Jonas Borgström jonas.borgst...@trioptima.com:

> When moving from 0.7.x to 0.8.5, do I need to scrub all sstables post-upgrade? NEWS.txt doesn't mention anything about that, but your comment here seems to indicate so: https://issues.apache.org/jira/browse/CASSANDRA-2739?focusedCommentId=13071490&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13071490
>
> / Jonas

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
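In practice that amounts to something like the loop below against every node in the ring, once each node is running the new version; host names here are placeholders:

    for host in node1 node2 node3; do
        ./bin/nodetool -h $host scrub
    done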
Re: what's the difference between repair CF separately and repair the entire node?
On Tue, Sep 13, 2011 at 3:57 PM, Peter Schuller peter.schul...@infidyne.com wrote:

> It is a serious issue if you really need to repair one CF at a time.

Why is it serious to repair one CF at a time? If I cannot do it at a CF level, does that mean I cannot use more than 50% of my disk space? Is this specific to this problem, or is that a general statement? I ask because I am planning on doing this so I can limit the max disk overhead to one CF's worth (plus some factor). I am going to be testing this in the next couple of weeks or so.
selective replication
Has anyone done any work on what I'll call "selective replication" between DCs? I want to use Cassandra to replicate data to another virtual DC (for analytical purposes), but only inserts, not deletes.

Picture two data centers: DC1 for OLTP of short lived data (say a 90 day window) and DC2 for OLAP (years of data). DC2 would probably be a Brisk setup. In this scenario, clients would get/insert/delete from DC1 (the OLTP system), and DC1 would replicate only inserts to DC2 (the OLAP system) for analytics.

I don't have any experience (yet) with multi-DC replication, but I don't think this is possible. Thoughts?
Re: Exception in Hadoop Word Count sample
You're using a 0.8 wordcount against a 0.7 Cassandra?

On Wed, Sep 14, 2011 at 2:19 PM, Tharindu Mathew mcclou...@gmail.com wrote:

> I see $subject. Can anyone help me rectify this? Stacktrace:
>
> Exception in thread "main" org.apache.thrift.TApplicationException: Required field 'replication_factor' was not found in serialized data! Struct: KsDef(name:wordcount, strategy_class:org.apache.cassandra.locator.SimpleStrategy, strategy_options:{replication_factor=1}, replication_factor:0, cf_defs:[CfDef(keyspace:wordcount, name:input_words, column_type:Standard, comparator_type:AsciiType, default_validation_class:AsciiType), CfDef(keyspace:wordcount, name:output_words, column_type:Standard, comparator_type:AsciiType, default_validation_class:AsciiType), CfDef(keyspace:wordcount, name:input_words_count, column_type:Standard, comparator_type:UTF8Type, default_validation_class:CounterColumnType)])
>     at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
>     at org.apache.cassandra.thrift.Cassandra$Client.recv_system_add_keyspace(Cassandra.java:1531)
>     at org.apache.cassandra.thrift.Cassandra$Client.system_add_keyspace(Cassandra.java:1514)
>     at WordCountSetup.setupKeyspace(Unknown Source)
>     at WordCountSetup.main(Unknown Source)
>
> --
> Regards,
> Tharindu
> blog: http://mackiemathew.com/

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: selective replication
This has been proposed a few times; there are some good use cases for it, and there is no current mechanism for it, but it has been discussed as a possible enhancement.

Adrian

On Wed, Sep 14, 2011 at 11:06 AM, Todd Burruss bburr...@expedia.com wrote:

> Has anyone done any work on what I'll call "selective replication" between DCs? I want to use Cassandra to replicate data to another virtual DC (for analytical purposes), but only inserts, not deletes.
Re: Nodetool removetoken taking days to run.
On Wed, Sep 14, 2011 at 8:54 AM, Ryan Hadley r...@sgizmo.com wrote:

> So I started a nodetool removetoken on 32695837177645752437561450928649262701 last Friday. It's still going strong this morning, on day 5. Should I just be patient? Or is something really weird with this node?

5 days seems excessive unless there is a very large amount of data per node. I would check nodetool netstats, and if the streams don't look active, issue a 'removetoken force' against 10.84.73.47 and accept that you may possibly need to run repair to restore the replica count.

-Brandon
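Spelled out with the IP from this thread, the check and the forced removal would look something like:

    ./bin/nodetool -h 10.84.73.47 -p 8080 netstats
    # if no streams are active:
    ./bin/nodetool -h 10.84.73.47 -p 8080 removetoken force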
Re: what's the difference between repair CF separately and repair the entire node?
> > It is a serious issue if you really need to repair one CF at a time.
>
> Why is it serious to repair one CF at a time? If I cannot do it at a CF level, does that mean I cannot use more than 50% of my disk space? Is this specific to this problem, or is that a general statement? I ask because I am planning on doing this so I can limit the max disk overhead to one CF's worth (plus some factor). I am going to be testing this in the next couple of weeks or so.

The bug in 0.7 causes data to be streamed for all CFs when doing a repair on one. So if you specifically need to repair one CF at a time, for example because you're trying to repair a small CF quite often while leaving a huge CF with less frequent repairs, you have an issue. If you just want to repair the entire keyspace, it doesn't affect you. I'm not sure how this relates to the 50% disk space bit, though.

--
/ Peter Schuller (@scode on twitter)
Re: Index search in provided list of rows (list of rowKeys).
The way to add more restrictions to the query is to specify them in the index_clause. The index clause is applied to the set of all rows in the database, not a subset; applying it to a subset would implicitly be supporting a sub-query. Currently it does select then project; this would be select, then select, then project.

Right now I would use Solandra, or do the entire search in Sphinx and get the row keys for the resulting documents. In the future you may be able to use this: https://issues.apache.org/jira/browse/CASSANDRA-2915

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2011, at 12:46 AM, Evgeniy Ryabitskiy wrote:

> Why is it radical? It would be the same get_indexed_slices search, but within a specified set of rows. Mostly it would be one more search expression, over row IDs and not only column values.
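For contrast, this is what the built-in index path gives you today from cassandra-cli: an indexed equality query over all rows, with no way to pre-restrict the candidate row keys (CF and column names invented; the column must have a secondary index):

    get users where state = 'UT';

Any intersection with a set of row keys coming from Sphinx currently has to happen on the client side.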
Re: Nodetool removetoken taking days to run.
On Sep 14, 2011, at 2:08 PM, Brandon Williams wrote:

> 5 days seems excessive unless there is a very large amount of data per node. I would check nodetool netstats, and if the streams don't look active, issue a 'removetoken force' against 10.84.73.47 and accept that you may possibly need to run repair to restore the replica count.

Hi Brandon,

Thanks for the reply. Quick question though:

1. We write all data to this ring with a TTL of 30 days.
2. This node hasn't been out of the ring for at least 90 days, more like 120 days.

So, if I did a nodetool removetoken force, would I still have to be concerned about running a repair?

Also, after this node is removed, I'm going to rebalance with nodetool move. Would that remove the repair requirement too?

Thanks-
Ryan
Configuring the keyspace correctly - NTS
Okay, in a previous post it was stated that I could use a NetworkTopologyStrategy in a single data centre by setting up my keyspace with:

create keyspace KeyspaceDEV
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options=[{datacenter1:3}];

whereby my understanding is that [{datacenter1:3}] represents:
- 1 datacentre
- 3 nodes in that datacentre

My infrastructure team were recommended, instead of using datacenter1, to use the second octet of the IP address, x.130.x.x:

[{130:3}]

However, when trying to access the keyspace the following error was returned:

May not be enough replicas present to handle consistency level

When I rebuilt the keyspace using the datacenter1 semantics, it worked fine. My guess is that there is some correlation between the 130 value and either the rpc_address or listen_address. Am I correct in thinking this? I don't have access to the server configuration, so I'm just going out on a whim here trying to figure out why using the 130 from the IP address would cause the error.

Anthony
Re: Get CL ONE / NTS
Your current approach to consistency opens the door to some inconsistent behavior.

> 1/ Will I have an error because DC2 does not have any copy of the data?

If you read from DC2 at CL ONE and the data is not replicated, it will not be returned.

> 2/ Will Cassandra try to get the data from DC1 if nothing is found in DC2?

Not at CL ONE. If you use CL EACH_QUORUM then the read will go to all the DCs. If DC2 is behind DC1, you will get the data from DC1.

> 3/ In case of partial replication to DC2, will I sometimes see errors about servers not holding the data in DC2?

Depending on the API call and the client, working at CL ONE you will see either errors or missing data.

> 4/ Does a get at CL ONE fail as soon as the fastest server to answer says it does not have the data, or does it wait until all servers say they do not have the data?

Yes.

Consider using LOCAL_QUORUM for writes and reads; it will make things a bit more consistent without adding inter-DC overhead to the request latency. It is still possible to not get data in DC2 if it is totally disconnected from the DC1. Writing at LOCAL_QUORUM and reading at EACH_QUORUM will mean you can always read, but requests in DC2 will fail if DC1 is not reachable.

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2011, at 1:33 AM, Pierre Chalamet wrote:

> Hello, I have 2 datacenters. Cassandra is configured as follows: RackInferringSnitch, NetworkTopologyStrategy for the CF, strategy_options: DC1:3 DC2:3 ...
RE: Get CL ONE / NTS
After reading the Cassandra source code, I will try to answer myself. It's a good kind of exercise :)

> 1/ Will I have an error because DC2 does not have any copy of the data?

I've not been able to find how endpoints are determined for the read request, but I guess the endpoints just come from the current datacenter.

> 2/ Will Cassandra try to get the data from DC1 if nothing is found in DC2?

Probably not, given 1/.

> 3/ In case of partial replication to DC2, will I sometimes see errors about servers not holding the data in DC2?

It seems to depend on RR. If read_repair_chance is set to 1 (the default value), RR happens all the time: the answer is no. If read_repair_chance is below 1, it seems CL.ONE will fail if the single read request fails.

> 4/ Does a get at CL ONE fail as soon as the fastest server to answer says it does not have the data, or does it wait until all servers say they do not have the data?

It seems to depend on RR, as in 3/.

Are the answers right?

- Pierre
RE: Get CL ONE / NTS
Thanks Aaron, I hadn't seen your answer before writing mine. I do agree on 2/; I might have a read error instead.

Good suggestion to use EACH_QUORUM; it could be a good trade-off to read at this level if ONE fails. Maybe using LOCAL_QUORUM is the right answer and will avoid headaches after all. Are you advising that CL.ONE is not worth the game when considering read performance?

By the way, I do not have a consistency problem at all: data is only written once (and if more than once, it is always the same data) and read several times across DCs. I only have replication problems. That's why I'm more inclined to use CL.ONE for reads if possible.

Thanks,
- Pierre
Re: Configuring the keyspace correctly - NTS
The strategy_options for NTS accept the data centre name and the RF:

[{dc_name : dc_rf}]

where the DC name comes from the snitch, so…

- SimpleSnitch (gotta love this guy, in there day in day out putting in the hard yards) puts all the nodes in "datacenter1", which is why that's in the defaults.
- RackInferringSnitch (or the Hollywood Snitch, as I call it) puts them in a DC named after the second octet of the IP. So "130" in your case.
- PropertyFileSnitch does what's in the cassandra-topology.properties file.
- EC2Snitch uses the EC2 region.
- The Brisk snitch does its thing.

If you want to use 130 you should be using the RackInferringSnitch; if you want to use human names, use either the SimpleSnitch or the PropertyFileSnitch. The PropertyFileSnitch has a default catch-all DC; see the cassandra-topology.properties file.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
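For completeness, the cassandra-topology.properties mapping used by the PropertyFileSnitch follows a "node IP = data centre : rack" pattern, roughly like this sketch with invented addresses:

    192.168.1.1=DC1:RAC1
    192.168.1.2=DC1:RAC2
    # nodes not listed get the catch-all assignment
    default=DC1:RAC1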
Re: Get CL ONE / NTS
> Are you advising that CL.ONE is not worth the game when considering read performance?

Consistency is not performance; it's a whole new thing to tune in your application. If you have performance issues, deal with those as performance issues: better code / data model / hardware.

> By the way, I do not have a consistency problem at all: data is only written once

Nobody expects a consistency problem. Its chief weapon is surprise. Surprise and fear. Its two weapons are fear and surprise. And so forth: http://www.youtube.com/watch?v=Ixgc_FGam3s

If you write at LOCAL_QUORUM in DC 1 and DC 2 is down at the start of the request, a hint will be stored in DC 1. Some time later, when DC 2 comes back, that hint will be sent to DC 2. If in the meantime you read from DC 2 at CL ONE, you will not get that change. With read repair enabled it will repair in the background, and you may get a different response on the next read (I am guessing here; I cannot remember exactly how RR works cross-DC).

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Re: Configuring the keyspace correctly - NTS
Great, that makes perfect sense. I apologise for not getting this right; it seems I'm doing someone else's job here.

Anthony
Re: Nodetool removetoken taking days to run.
On Wed, Sep 14, 2011 at 4:25 PM, Ryan Hadley r...@sgizmo.com wrote:

> 1. We write all data to this ring with a TTL of 30 days.
> 2. This node hasn't been out of the ring for at least 90 days, more like 120 days.
>
> So, if I did a nodetool removetoken force, would I still have to be concerned about running a repair?

There have probably been some writes that thought that node was part of the replica set, so you may still be missing a replica in that regard. If you're only holding the data for 30 days, though, it might not be worth the trouble of repairing; instead, bet that not all of the live replicas will die in the next month.

> Also, after this node is removed, I'm going to rebalance with nodetool move. Would that remove the repair requirement too?

If you intend to replace the node, it's better to bootstrap the new node at the dead node's token minus one, and then do the removetoken force. This would actually obviate the need to repair (except for one key; you can move the node to the old token once it has been removed), assuming that your consistency level was greater than ONE for writes, or your clients always replayed any failures. This holds true for moving to the old token as well.

-Brandon
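A sketch of that replacement sequence using the token from this thread; the new node takes the dead node's token minus one in cassandra.yaml before bootstrapping, and "newnode" is a placeholder:

    # on the new node, in cassandra.yaml:
    initial_token: 32695837177645752437561450928649262700

    # after it has bootstrapped, complete the removal from any live node:
    ./bin/nodetool -h 10.84.73.47 -p 8080 removetoken force

    # optionally move the new node onto the old token afterwards:
    ./bin/nodetool -h newnode move 32695837177645752437561450928649262701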
Re: Configuring the keyspace correctly - NTS
Aaron, when using the RackInferringSnitch, is the octet taken from the rpc_address or the listen_address? I just noticed that when I tried to configure this locally on my laptop, I had to use 0 (127.0.0.1) instead of 160 (192.160.202.235).

Anthony