Re: Counter question
Like everything else in Cassandra, if you need full consistency you need the right combination of write consistency level and read consistency level. If W = write consistency level, R = read consistency level, and N = replication factor, then you need W + R > N.

Shimi

On Thu, Mar 29, 2012 at 10:09 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:

Hi! Asking again, as I didn't get responses :)

I have a ring with 3 nodes and a replication factor of 2. I have a counter CF with the following definition:

CREATE COLUMN FAMILY tk_counters
  with comparator = 'UTF8Type'
  and default_validation_class = 'CounterColumnType'
  and key_validation_class = 'CompositeType(UTF8Type,UUIDType)'
  and replicate_on_write = true;

In my code (Java, Hector), I increment a counter and then read it. Is it possible that the value read will be the value before the increment? If yes, how can I ensure it does not happen? All my reads and writes are done with consistency level ONE. If this is a consistency issue, can I perform only the operations on the tk_counters column family with a higher consistency level?

What does replicate_on_write mean? I thought it should help, but maybe even if it replicates after the write, my read happens before replication finishes and returns the value from a not-yet-updated node.

My increment code is:

Mutator<Composite> mutator = HFactory.createMutator(keyspace, CompositeSerializer.get());
mutator.incrementCounter(key, "tk_counters", columnName, inc);
mutator.execute();

My read counter code is:

CounterQuery<Composite, String> query = createCounterColumnQuery(keyspace, CompositeSerializer.get(), StringSerializer.get());
query.setColumnFamily("tk_counters");
query.setKey(key);
query.setName(columnName);
QueryResult<HCounterColumn<String>> r = query.execute();
return r.get().getValue();

Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736 | Mob: +972 54 8356490 | Fax: +972 2 5612956
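The W + R > N rule can be made concrete with a small sketch (plain Java; the class and method names are mine, not from Hector or Cassandra):

```java
// Illustrative only: models Cassandra's consistency overlap rule W + R > N.
public class ConsistencyMath {

    /** True when every read quorum overlaps every write quorum. */
    public static boolean readSeesWrite(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    /** Replica count contacted for a given level on a cluster with the given RF. */
    public static int replicasFor(String level, int replicationFactor) {
        switch (level) {
            case "ONE":    return 1;
            case "QUORUM": return replicationFactor / 2 + 1;
            case "ALL":    return replicationFactor;
            default: throw new IllegalArgumentException(level);
        }
    }

    public static void main(String[] args) {
        int rf = 2; // Tamar's setup: 3 nodes, RF = 2
        // ONE + ONE with RF 2: 1 + 1 = 2, not > 2 -> a stale read is possible.
        System.out.println(readSeesWrite(replicasFor("ONE", rf), replicasFor("ONE", rf), rf));
        // QUORUM + QUORUM with RF 2: 2 + 2 > 2 -> the read must see the increment.
        System.out.println(readSeesWrite(replicasFor("QUORUM", rf), replicasFor("QUORUM", rf), rf));
    }
}
```

So with RF = 2, reading and writing at ONE gives 1 + 1 = 2, which is not greater than 2; the read in Tamar's code can legally return the pre-increment value.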
Re: Counter question
You set the consistency level with every request. Usually a client library will let you set a default one for all write/read requests. I don't know if Hector lets you set a default consistency level per CF; take a look at the Hector docs or ask on the Hector mailing list.

Shimi

On Thu, Mar 29, 2012 at 11:47 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:

Can this be set on a per-CF basis? Only this CF needs a higher consistency level.

Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736 | Mob: +972 54 8356490 | Fax: +972 2 5612956

On Thu, Mar 29, 2012 at 10:44 AM, Shimi Kiviti <shim...@gmail.com> wrote: [...]
Re: Row iteration over indexed clause
Yes. Use get_indexed_slices (http://wiki.apache.org/cassandra/API).

On Tue, Mar 13, 2012 at 2:12 PM, Vivek Mishra <mishra.v...@gmail.com> wrote:

Hi, is it possible to iterate and fetch in chunks using the thrift API when querying with secondary indexes?

-Vivek
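The usual chunking pattern applies: request a page of rows with a count, then reissue the query with the last key of the previous page as the new start_key, dropping the duplicated boundary row. Sketched below in plain Java against a stand-in sorted store (fetchPage stands in for thrift's get_indexed_slices; all names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Illustrative paging loop; fetchPage stands in for get_indexed_slices,
// which takes a start_key and a count in its IndexClause.
public class IndexPager {

    /** One page of at most 'count' keys, starting at startKey (inclusive). */
    static List<String> fetchPage(SortedMap<String, String> rows, String startKey, int count) {
        List<String> page = new ArrayList<>();
        for (String key : rows.tailMap(startKey).keySet()) {
            if (page.size() == count) break;
            page.add(key);
        }
        return page;
    }

    /** Iterate the whole index in chunks, reusing the last key as the next start. */
    public static List<String> fetchAll(SortedMap<String, String> rows, int pageSize) {
        List<String> all = new ArrayList<>();
        String start = "";
        boolean first = true;
        while (true) {
            List<String> page = fetchPage(rows, start, pageSize);
            if (!first && !page.isEmpty()) page.remove(0); // drop the repeated boundary row
            all.addAll(page);
            // A short page means the index is exhausted.
            if (page.size() < (first ? pageSize : pageSize - 1)) break;
            start = all.get(all.size() - 1);
            first = false;
        }
        return all;
    }
}
```

With the real API, the duplicated first row appears because start_key is inclusive, so each follow-up request must skip it, exactly as the loop above does.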
Re: Composite column docs
On Thu, Jan 5, 2012 at 9:13 PM, aaron morton <aa...@thelastpickle.com> wrote:

What client are you using?

I am writing a client.

For example, pycassa has some sweet documentation: http://pycassa.github.com/pycassa/assorted/composite_types.html

It is sweet documentation, but it doesn't help me. I need lower-level documentation.

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/01/2012, at 12:48 AM, Shimi Kiviti wrote:

Is there a doc for using composite columns with thrift? Is https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/marshal/CompositeType.java the only doc? Does the client need to add the length to the get / get_slice... queries, or is it taken care of on the server side?

Shimi
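On the length question: with CompositeType the client does encode the framing itself. Each component is written as a 2-byte big-endian length, the component bytes, and one trailing end-of-component byte (0 for an exact match). A sketch of that layout (my code, derived from a reading of CompositeType.java; verify against your Cassandra version):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Encodes a composite column name the way CompositeType expects it on the wire:
// for each component: [2-byte big-endian length][component bytes][1 end-of-component byte].
public class CompositeEncoder {

    public static byte[] encode(byte[][] components) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] c : components) {
            out.write((c.length >> 8) & 0xFF); // high byte of the length
            out.write(c.length & 0xFF);        // low byte of the length
            out.write(c, 0, c.length);
            out.write(0); // end-of-component byte: 0 = exact match
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] enc = encode(new byte[][] {
            "user".getBytes(StandardCharsets.UTF_8),
            "42".getBytes(StandardCharsets.UTF_8),
        });
        // Each component costs 3 bytes of framing: (2+4+1) + (2+2+1) = 12 bytes.
        System.out.println(enc.length);
    }
}
```

So yes: over raw thrift the client builds this byte layout for get / get_slice; higher-level clients like pycassa hide it behind their serializers.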
Composite column docs
Is there a doc for using composite columns with thrift? Is https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/marshal/CompositeType.java the only doc? Does the client need to add the length to the get / get_slice... queries, or is it taken care of on the server side?

Shimi
Re: CassandraDaemon deactivate doesn't shutdown Cassandra
The problem doesn't exist after the column family is truncated, or if durable_writes=true.

Shimi

On Tue, Oct 11, 2011 at 9:30 PM, Shimi Kiviti <shim...@gmail.com> wrote:

I am running an embedded Cassandra (0.8.7), and calling CassandraDaemon.deactivate() after writing rows (at least one) doesn't shut down Cassandra. If I run only reads, it does shut down even without calling CassandraDaemon.deactivate(). Does anyone have any idea what can cause this problem?

Shimi
CassandraDaemon deactivate doesn't shutdown Cassandra
I am running an embedded Cassandra (0.8.7), and calling CassandraDaemon.deactivate() after writing rows (at least one) doesn't shut down Cassandra. If I run only reads, it does shut down even without calling CassandraDaemon.deactivate(). Does anyone have any idea what can cause this problem?

Shimi
Re: Cassandra Capistrano recipes
Modify your Capistrano script to install an init script. If you use Debian or Red Hat you can copy or adapt these:
https://github.com/Shimi/cassandra/blob/trunk/debian/init
https://github.com/Shimi/cassandra/blob/trunk/redhat/cassandra
and set up Capistrano to call /etc/init.d/cassandra start/stop/restart.

Shimi

On Thu, Jul 7, 2011 at 4:27 AM, R Headley <headle...@yahoo.com> wrote:

Hi, I'm using Capistrano with Cassandra and was wondering if anyone has a recipe (or recipes) for, in particular, starting Cassandra as a daemon. Running the 'bin/cassandra' shell script (without the '-f' switch) doesn't quite work, as this only runs Cassandra in the background; logging out will kill it.

Thanks, Richard
Re: Read time get worse during dynamic snitch reset
I finally found some time to get back to this issue. I turned on DEBUG logging on the StorageProxy, and it shows that all of these requests are read from the other datacenter.

Shimi

On Tue, Apr 12, 2011 at 2:31 PM, aaron morton <aa...@thelastpickle.com> wrote:

Something feels odd. From Peter's nice write-up of the dynamic snitch, http://www.mail-archive.com/user@cassandra.apache.org/msg12092.html

The RackInferringSnitch (and the PropertyFileSnitch) derive from AbstractNetworkTopologySnitch and should... In the case of the NetworkTopologyStrategy, it inherits the implementation in AbstractNetworkTopologySnitch, which sorts by AbstractNetworkTopologySnitch.compareEndPoints(), which:
(1) Always prefers itself to any other node. So "myself" is always closest, no matter what.
(2) Else, always prefers a node in the same rack to a node in a different rack.
(3) Else, always prefers a node in the same DC to a node in a different DC.

AFAIK the (data) request should be going to the local DC even after the DynamicSnitch has reset the scores, because the underlying RackInferringSnitch should prefer local nodes.

Just for fun, check that the rack and DC assignments are what you thought using the operations on the o.a.c.db.EndpointSnitchInfo bean in JConsole. Pass in the IP address for the nodes in each DC. If possible, can you provide some info on the IPs in each DC?

Aaron

On 12 Apr 2011, at 18:24, shimi wrote:

On Tue, Apr 12, 2011 at 12:26 AM, aaron morton <aa...@thelastpickle.com> wrote:

The reset interval clears the latency tracked for each node so a bad node will be read from again. The scores for each node are then updated every 100ms (default) using the last 100 responses from a node. How long does the bad performance last for?

Only a few seconds, but there are a lot of read requests during this time.

What CL are you reading at? At Quorum with RF 4 the read request will be sent to 3 nodes, ordered by proximity and wellness according to the dynamic snitch. (For background, see the recent discussion on the dynamic snitch: http://www.mail-archive.com/user@cassandra.apache.org/msg12089.html)

I am reading with a CL of ONE, read_repair_chance=0.33, RackInferringSnitch, and keys_cached = rows_cached = 0.

You can take a look at the weights and timings used by the DynamicSnitch in JConsole under o.a.c.db.DynamicSnitchEndpoint. Also, at DEBUG log level you will be able to see which nodes the request is sent to.

Everything looks OK. The weights are around 3 for the nodes in the same data center and around 5 for the others. I will turn on the DEBUG level to see if I can find more info.

My guess is the DynamicSnitch is doing the right thing and the slowdown is a node with a problem getting back into the list of nodes used for your read. It's then moved down the list as its bad performance is noticed.

Looking at the DynamicSnitch MBean I don't see any problems with any of the nodes. My guess is that during the reset time there are reads that are sent to the other data center.

Hope that helps
Aaron

Shimi

On 12 Apr 2011, at 01:28, shimi wrote:

I finally upgraded 0.6.x to 0.7.4. The nodes have been running with the new version for several days across 2 data centers. I noticed that the read time on some of the nodes increases by x50-60 every ten minutes. There was no indication in the logs of anything that happened at the same time. The only thing that I know runs every 10 minutes is the dynamic snitch reset, so I changed dynamic_snitch_reset_interval_in_ms to 20 minutes and now I have the problem once every 20 minutes.

I am running all nodes with:

replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
strategy_options:
  DC1: 2
  DC2: 2
replication_factor: 4

(DC1 and DC2 are derived from the IPs.) Is anyone familiar with this kind of behavior?

Shimi
Re: Combining all CFs into one big one
On Sun, May 1, 2011 at 9:48 PM, Jake Luciani <jak...@gmail.com> wrote:

If you have N column families you need N * memtable size of RAM to support this. If that's not an option you can merge them into one as you suggest, but then you will have much larger SSTables, slower compactions, etc. I don't necessarily agree with Tyler that the OS cache will be less effective... But I do agree that if the sizes of sstables are too large for you then more hardware is the solution...

If you merge CFs that are hardly accessed with one that is accessed frequently, then when you read the SSTable you load data that is hardly accessed into the OS cache. Another thing you should be aware of: if you need to run any of the nodetool CF tasks, and you really need it only for a specific CF, running it on that specific CF is better and faster.

Shimi

On Sun, May 1, 2011 at 1:24 PM, Tyler Hobbs <ty...@datastax.com> wrote:

When you have a high number of CFs, it's a good idea to consider merging CFs with highly correlated access patterns and similar structure into one. It is *not* a good idea to merge all of your CFs into one (unless they all happen to meet this criteria). Here's why: besides big compactions and long repairs that you can't break down into smaller pieces, the main problem is that your caching will become much less efficient. The OS buffer cache will be less effective because rows from all of the CFs will be interspersed in the SSTables. You will no longer be able to tune the key or row cache to only cache frequently accessed data. Both of these will tend to cause a serious increase in latency for your hot data.

Shouldn't these kinds of problems be solved by Cassandra?

They are mainly solved by Cassandra's general solution to any performance problem: the addition of more nodes. There are tickets open to improve compaction strategies, put bounds on SSTable sizes, etc. (for example, https://issues.apache.org/jira/browse/CASSANDRA-1608), but the addition of more nodes is a reliable solution to problems of this nature.

On Sun, May 1, 2011 at 7:28 AM, David Boxenhorn <da...@taotown.com> wrote:

Shouldn't these kinds of problems be solved by Cassandra? Isn't there a maximum SSTable size?

On Sun, May 1, 2011 at 3:24 PM, shimi <shim...@gmail.com> wrote:

Big sstables, long compactions; in a major compaction you will need free disk space equal to the size of all the sstables (which you should have anyway).

Shimi

On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn <da...@taotown.com> wrote:

I'm having problems administering my cluster because I have too many CFs (~40). I'm thinking of combining them all into one big CF. I would prefix the current CF name to the keys, repeat the CF name in a column, and index the column (so I can loop over all rows, which I have to do sometimes, for some CFs). Can anyone think of any disadvantages to this approach?

--
Tyler Hobbs
Software Engineer, DataStax
http://datastax.com/
Maintainer of the pycassa Cassandra Python client library
http://github.com/pycassa/pycassa

--
http://twitter.com/tjake
Re: Tombstones and memtable_operations
You can use memtable_flush_after_mins instead of the cron.

Shimi

2011/4/19 Héctor Izquierdo Seliva <izquie...@strands.com>:

On Wed, 20-04-2011 at 08:16 +1200, aaron morton wrote:

I think there may be an issue here: we are counting the number of columns in the operation, and when deleting an entire row we do not have a column count. Can you let us know what version you are using and how you are doing the delete?

Thanks
Aaron

I'm using 0.7.4. I have a file with all the row keys I have to delete (around 100 million), and I just go through the file and issue deletes through Pelops. Should I manually issue flushes with a cron every x time?

On 20 Apr 2011, at 04:21, Héctor Izquierdo Seliva wrote:

OK, I've read about gc_grace_seconds, but I'm not sure I understand it fully. Until gc_grace_seconds have passed and there is a compaction, the tombstones live in memory? I have to delete 100 million rows and my insert rate is very low, so I don't have a lot of compactions. What should I do in this case? Lower the major compaction threshold and memtable_operations to some very low number?

Thanks

On Tue, 19-04-2011 at 17:36 +0200, Héctor Izquierdo Seliva wrote:

Hi everyone. I've configured memtable_operations = 0.02 in one of my column families and started deleting keys. I have already deleted 54k, but there hasn't been any flush of the memtable. Memory keeps piling up and eventually nodes start to do stop-the-world GCs. Is this the way this is supposed to work, or have I done something wrong?

Thanks!
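Aaron's diagnosis (a full-row delete carries no column count, so memtable_operations never trips) can be illustrated with a toy model. This is not Cassandra's code, just a sketch of the suspected accounting:

```java
// Toy model of memtable_operations accounting. Under the suspected bug, a
// full-row delete contributes 0 to the operation count, so the flush
// threshold never trips no matter how many rows are deleted.
public class MemtableModel {
    private double opsInMillions = 0;
    private final double flushThreshold; // memtable_operations, in millions of columns
    private int flushes = 0;

    public MemtableModel(double flushThresholdMillions) {
        this.flushThreshold = flushThresholdMillions;
    }

    /** columnCount is 0 for a full-row delete (no columns are named). */
    public void record(int columnCount) {
        opsInMillions += columnCount / 1_000_000.0;
        if (opsInMillions >= flushThreshold) {
            flushes++;
            opsInMillions = 0;
        }
    }

    public int flushes() { return flushes; }

    public static void main(String[] args) {
        MemtableModel m = new MemtableModel(0.02); // Héctor's setting: 0.02 = 20k ops
        for (int i = 0; i < 54_000; i++) m.record(0); // 54k row deletes, each counted as 0
        System.out.println(m.flushes()); // no flush ever fires, so memory piles up
    }
}
```

Under this model 54k deletes never reach the 20k-operation threshold, matching the observed behavior; a time-based trigger like memtable_flush_after_mins sidesteps the counting entirely.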
Re: Cassandra 0.7.4 Bug?
I had the same thing. A node restart should solve it.

Shimi

On Sun, Apr 17, 2011 at 4:25 PM, Dikang Gu <dikan...@gmail.com> wrote:

+1. I also met this problem several days ago, and I haven't got a solution yet...

On Sun, Apr 17, 2011 at 9:17 PM, csharpplusproject <csharpplusproj...@gmail.com> wrote:

Often, I see the following behavior:
(1) Cassandra works, all nodes are up etc.
(2) a 'move' operation is run on one of the nodes
(3) following this 'move' operation, even after a couple of hours/days where it is obvious the operation has ended, the node which had 'moved' remains with a status of *?*

Perhaps it's a bug?

shalom@host:/opt/cassandra/apache-cassandra-0.7.4$ bin/nodetool -host 192.168.0.5 ring
Address        Status  State   Load       Owns    Token
                                                  127605887595351923798765477786913079296
192.168.0.253  Up      Normal  88.66 MB   25.00%  0
192.168.0.4    Up      Normal  558.2 MB   50.00%  85070591730234615865843651857942052863
192.168.0.5    Up      Normal  71.03 MB   16.67%  113427455640312821154458202477256070485
192.168.0.6    Up      Normal  44.71 MB   8.33%   127605887595351923798765477786913079296

shalom@host:/opt/cassandra/apache-cassandra-0.7.4$ bin/nodetool -host 192.168.0.4 move 92535295865117307932921825928971026432

shalom@host:/opt/cassandra/apache-cassandra-0.7.4$ bin/nodetool -host 192.168.0.5 ring
Address        Status  State   Load       Owns    Token
                                                  127605887595351923798765477786913079296
192.168.0.253  Up      Normal  171.17 MB  25.00%  0
192.168.0.4    *?*     Normal  212.11 MB  54.39%  92535295865117307932921825928971026432
192.168.0.5    Up      Normal  263.91 MB  12.28%  113427455640312821154458202477256070485
192.168.0.6    Up      Normal  26.21 MB   8.33%   127605887595351923798765477786913079296

--
Dikang Gu
0086 - 18611140205
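The Owns column in the ring output is just arc length on the token ring: with the RandomPartitioner the ring spans [0, 2^127), and a node owns (its token minus the previous node's token) / 2^127 of it. A small sketch (my code) that reproduces the 54.39% shown for 192.168.0.4 after the move:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Ownership on a RandomPartitioner ring: a node owns the arc from the previous
// node's token (exclusive) to its own token, out of the full ring of 2^127.
public class RingOwnership {
    static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

    /** Percentage of the ring owned by the node at 'token', given the previous token. */
    public static double ownsPercent(BigInteger prevToken, BigInteger token) {
        BigInteger arc = token.subtract(prevToken).mod(RING); // mod handles wrap-around
        return new BigDecimal(arc.multiply(BigInteger.valueOf(100)))
                .divide(new BigDecimal(RING), 2, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        // 192.168.0.4 after the move: previous token 0, new token from the ring output.
        System.out.println(ownsPercent(BigInteger.ZERO,
                new BigInteger("92535295865117307932921825928971026432")));
    }
}
```

The percentages in the second ring output check out against this formula, which suggests the move itself completed and only the gossiped status is stale, consistent with a restart fixing it.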
Re: Read time get worse during dynamic snitch reset
On Tue, Apr 12, 2011 at 12:26 AM, aaron morton <aa...@thelastpickle.com> wrote:

The reset interval clears the latency tracked for each node so a bad node will be read from again. The scores for each node are then updated every 100ms (default) using the last 100 responses from a node. How long does the bad performance last for?

Only a few seconds, but there are a lot of read requests during this time.

What CL are you reading at? At Quorum with RF 4 the read request will be sent to 3 nodes, ordered by proximity and wellness according to the dynamic snitch. (For background, see the recent discussion on the dynamic snitch: http://www.mail-archive.com/user@cassandra.apache.org/msg12089.html)

I am reading with a CL of ONE, read_repair_chance=0.33, RackInferringSnitch, and keys_cached = rows_cached = 0.

You can take a look at the weights and timings used by the DynamicSnitch in JConsole under o.a.c.db.DynamicSnitchEndpoint. Also, at DEBUG log level you will be able to see which nodes the request is sent to.

Everything looks OK. The weights are around 3 for the nodes in the same data center and around 5 for the others. I will turn on the DEBUG level to see if I can find more info.

My guess is the DynamicSnitch is doing the right thing and the slowdown is a node with a problem getting back into the list of nodes used for your read. It's then moved down the list as its bad performance is noticed.

Looking at the DynamicSnitch MBean I don't see any problems with any of the nodes. My guess is that during the reset time there are reads that are sent to the other data center.

Hope that helps
Aaron

Shimi

On 12 Apr 2011, at 01:28, shimi wrote:

I finally upgraded 0.6.x to 0.7.4. The nodes have been running with the new version for several days across 2 data centers. I noticed that the read time on some of the nodes increases by x50-60 every ten minutes. There was no indication in the logs of anything that happened at the same time. The only thing that I know runs every 10 minutes is the dynamic snitch reset, so I changed dynamic_snitch_reset_interval_in_ms to 20 minutes and now I have the problem once every 20 minutes.

I am running all nodes with:

replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
strategy_options:
  DC1: 2
  DC2: 2
replication_factor: 4

(DC1 and DC2 are derived from the IPs.) Is anyone familiar with this kind of behavior?

Shimi
Read time get worse during dynamic snitch reset
I finally upgraded 0.6.x to 0.7.4. The nodes have been running with the new version for several days across 2 data centers. I noticed that the read time on some of the nodes increases by x50-60 every ten minutes. There was no indication in the logs of anything that happened at the same time. The only thing that I know runs every 10 minutes is the dynamic snitch reset, so I changed dynamic_snitch_reset_interval_in_ms to 20 minutes and now I have the problem once every 20 minutes.

I am running all nodes with:

replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
strategy_options:
  DC1: 2
  DC2: 2
replication_factor: 4

(DC1 and DC2 are derived from the IPs.) Is anyone familiar with this kind of behavior?

Shimi
Re: nodetool cleanup - results in more disk use?
The bigger the file, the longer it will take for it to be part of a compaction again. Compacting a bucket of large files takes longer than compacting a bucket of small files.

Shimi

On Mon, Apr 4, 2011 at 3:58 PM, aaron morton <aa...@thelastpickle.com> wrote:

mmm, interesting. My theory was:

t0 - major compaction runs; there is now one sstable
t1 - x new sstables have been created
t2 - minor compaction runs and determines there are two buckets, one with the x new sstables and one with the single big file. The bucket of many files is compacted into one; the bucket of one file is ignored.

I can see that it takes longer for the big file to be involved in compaction again, and when it finally is, it takes more time. But minor compactions of new SSTables would still happen at the same rate, especially if they are created at the same rate as previously. Am I missing something, or am I just reading the docs wrong?

Cheers
Aaron

On 4 Apr 2011, at 22:20, Jonathan Colby wrote:

hi Aaron -

The DataStax documentation brought to light the fact that over time, major compactions will be performed on bigger and bigger SSTables. They actually recommend against performing too many major compactions, which is why I am wary to trigger them: http://www.datastax.com/docs/0.7/operations/scheduled_tasks

Performing Major Compaction: A major compaction process merges all SSTables for all column families in a keyspace, not just similar sized ones as in minor compaction. Note that this may create extremely large SSTables that result in long intervals before the next minor compaction (and a resulting increase in CPU usage for each minor compaction). Though a major compaction ultimately frees disk space used by accumulated SSTables, during runtime it can temporarily double disk space usage. It is best to run major compactions, if at all, at times of low demand on the cluster.

On Apr 4, 2011, at 1:57 PM, aaron morton wrote:

Cleanup reads each SSTable on disk and writes a new file that contains the same data, with the exception of rows that are no longer in a token range the node is a replica for. It's not compacting the files into fewer files or purging tombstones, but it is re-writing all the data for the CF. Part of the process will trigger GC if needed to free up disk space from SSTables no longer needed.

AFAIK having fewer, bigger files will not cause longer minor compactions. Compaction thresholds are applied per bucket of files that share a similar size; there are normally more smaller files and fewer larger files.

Aaron

On 2 Apr 2011, at 01:45, Jonathan Colby wrote:

I discovered that a garbage collection cleans up the unused old SSTables. But I still wonder whether cleanup really does a full compaction. This would be undesirable if so.

On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:

I ran nodetool cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this? I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used, similar to a compaction, but I was expecting the disk usage to go back down when it finished. I hope cleanup doesn't trigger a major compaction. I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
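Aaron's point that compaction thresholds apply per bucket of similarly sized files can be sketched as follows (my simplified model of 0.7's size-tiered grouping; the exact interval bounds are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified size-tiered bucketing: an sstable joins the first bucket whose
// average size is within a factor of two of its own size, else starts a new one.
public class SizeTieredBuckets {

    public static List<List<Long>> bucket(List<Long> sizes) {
        List<List<Long>> buckets = new ArrayList<>();
        for (long size : sizes) {
            List<Long> target = null;
            for (List<Long> b : buckets) {
                long avg = b.stream().mapToLong(Long::longValue).sum() / b.size();
                if (size >= avg / 2 && size <= avg * 2) { target = b; break; }
            }
            if (target == null) { target = new ArrayList<>(); buckets.add(target); }
            target.add(size);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // One huge post-major-compaction file plus a stream of small new ones:
        // the big file sits alone in its own bucket, so only the small files
        // keep getting minor-compacted, as in Aaron's theory.
        System.out.println(bucket(List.of(10_000L, 40L, 50L, 45L, 60L)));
    }
}
```

Here the 10,000-unit file never shares a bucket with the 40-60 unit files, so its presence does not slow the minor compactions of the new sstables; it just waits much longer before being compacted again itself.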
index file contains a different key or row size
It makes sense to me that compaction should solve this as well, since compaction creates new index files. Am I missing something here?

WARN [CompactionExecutor:1] 2011-04-04 14:50:54,105 CompactionManager.java (line 602) Row scrubbed successfully but index file contains a different key or row size; consider rebuilding the index as described in http://www.mail-archive.com/user@cassandra.apache.org/msg03325.html

Shimi
Re: urgent
How did you solve it?

On Sun, Apr 3, 2011 at 7:32 PM, Anurag Gujral <anurag.guj...@gmail.com> wrote:

Now it is using all three disks. I want to understand why the recommended approach is to use one single large volume/directory and not multiple ones; can you please explain in detail? I am using SSDs, and using three small ones is cheaper than using one large one.

Please suggest.
Thanks,
Anurag

On Sun, Apr 3, 2011 at 7:31 AM, aaron morton <aa...@thelastpickle.com> wrote:

Is this still a problem? Are you getting errors on the server? It should be choosing the directory with the most space. BTW, the recommended approach is to use a single large volume/directory for the data.

Aaron

On 2 Apr 2011, at 01:56, Anurag Gujral wrote:

Hi All, I have set up a Cassandra cluster with three data directories, but Cassandra is using only one of them, and that disk is out of space. Why is Cassandra not using all three data directories?

Please suggest.
Thanks,
Anurag
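On the "directory with the most space" point, the selection policy is easy to sketch. The code below is my illustration of such a policy, not Cassandra's actual implementation (which also has to weigh whether the file being written fits):

```java
// Pick the data directory with the most usable space, skipping any that cannot
// hold the file about to be written. Illustrative policy, not Cassandra's code.
public class DataDirPicker {

    /** freeBytes[i] is the usable space of directory i; returns the chosen index, or -1. */
    public static int pick(long[] freeBytes, long neededBytes) {
        int best = -1;
        for (int i = 0; i < freeBytes.length; i++) {
            if (freeBytes[i] < neededBytes) continue; // this volume would not fit the file
            if (best == -1 || freeBytes[i] > freeBytes[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        long GB = 1L << 30;
        // Three SSD volumes; the second has the most headroom for a 10 GB flush.
        System.out.println(pick(new long[] {5 * GB, 120 * GB, 40 * GB}, 10 * GB));
    }
}
```

This also hints at the downside of several small volumes: once a single large flush or compaction output exceeds the free space of every individual volume, no directory qualifies, even if the combined free space would suffice.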
Re: Exceptions on 0.7.0
I didn't solve it. Since it is a test cluster, I deleted all the data. I copied some sstables from my production cluster and tried again; this time I didn't have the problem. I am planning on removing everything from this test cluster. I will start all over again with 0.6.x, then load it with tens of GB of data (not an sstable copy) and test the upgrade again. I made the mistake of not backing up the data files before I upgraded.

Shimi

On Tue, Feb 22, 2011 at 2:24 PM, David Boxenhorn <da...@lookin2.com> wrote:

Shimi, I am getting the same error that you report here. What did you do to solve it?

David

On Thu, Feb 10, 2011 at 2:54 PM, shimi <shim...@gmail.com> wrote:

I upgraded the version on all the nodes, but I still get the exceptions. I ran cleanup on one of the nodes, but I don't think there is any cleanup going on. Another weird thing that I see is:

INFO [CompactionExecutor:1] 2011-02-10 12:08:21,353 CompactionIterator.java (line 135) Compacting large row 333531353730363835363237353338383836383035363036393135323132383 73630323034313a446f20322e384c20656e67696e657320686176652061646a75737461626c65206c696674657273 (725849473109 bytes) incrementally

In my production version the largest row is 10259. It shouldn't be different in this case. The first exception is thrown on 3 nodes during compaction. The second exception (Internal error processing get_range_slices) is thrown all the time by a fourth node. I disabled gossip and any client traffic to it, and I still get the exceptions. Is it possible to boot a node with gossip disabled?

Shimi

On Thu, Feb 10, 2011 at 11:11 AM, aaron morton <aa...@thelastpickle.com> wrote:

You should be able to repair: install the new version and kick off nodetool repair. If you are uncertain, search for CASSANDRA-1992 on the list; there has been some discussion. You can also wait till some peeps in the States wake up if you want to be extra sure.

The number is the number of columns the iterator is going to return from the row.
I'm guessing that, because this is happening during compaction, it asked for the maximum possible number of columns.

Aaron

On 10 Feb 2011, at 21:37, shimi wrote:

On 10 Feb 2011, at 13:42, Dan Hendry wrote:

Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements = keys) in the CF?

Dan

No. I was too puzzled by the numbers.

On Thu, Feb 10, 2011 at 10:30 AM, aaron morton <aa...@thelastpickle.com> wrote:

Shimi, you may be seeing the result of CASSANDRA-1992. Are you able to test with the most recent 0.7 build? https://hudson.apache.org/hudson/job/Cassandra-0.7/

Aaron

I will. I hope the data was not corrupted.

*From:* shimi [mailto:shim...@gmail.com]
*Sent:* February-09-11 15:06
*To:* user@cassandra.apache.org
*Subject:* Exceptions on 0.7.0

I have a 4-node test cluster where I test the port to 0.7.0 from 0.6.x. On 3 out of the 4 nodes I get exceptions in the log. I am using RP. Changes that I made:
1. changed the replication factor from 3 to 4
2. configured the nodes to use the dynamic snitch
3. RR of 0.33

I ran repair on 2 nodes before I noticed the errors. One of them is having the first error and the other the second. I restarted the nodes but I still get the exceptions. The following exception I get from 2 nodes:

WARN [CompactionExecutor:1] 2011-02-09 19:50:51,281 BloomFilter.java (line 84) Cannot provide an optimal Bloom Filter for 1986622313 elements (1/4 buckets per element).
ERROR [CompactionExecutor:1] 2011-02-09 19:51:10,190 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.EOFException
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:105)
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:34)
	at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
	at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
	at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
	at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131
EOFException: attempted to skip x bytes
)
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:69)
	... 19 more

Shimi
Re: Exceptions on 0.7.0
On 10 Feb 2011, at 13:42, Dan Hendry wrote:

Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements = keys) in the CF?

Dan

No. I was too puzzled by the numbers.

On Thu, Feb 10, 2011 at 10:30 AM, aaron morton <aa...@thelastpickle.com> wrote:

Shimi, you may be seeing the result of CASSANDRA-1992. Are you able to test with the most recent 0.7 build? https://hudson.apache.org/hudson/job/Cassandra-0.7/

Aaron

I will. I hope the data was not corrupted.

*From:* shimi [mailto:shim...@gmail.com]
*Sent:* February-09-11 15:06
*To:* user@cassandra.apache.org
*Subject:* Exceptions on 0.7.0

I have a 4-node test cluster where I test the port to 0.7.0 from 0.6.x. On 3 out of the 4 nodes I get exceptions in the log. I am using RP. Changes that I made:
1. changed the replication factor from 3 to 4
2. configured the nodes to use the dynamic snitch
3. RR of 0.33

I ran repair on 2 nodes before I noticed the errors. One of them is having the first error and the other the second. I restarted the nodes but I still get the exceptions. The following exception I get from 2 nodes:

WARN [CompactionExecutor:1] 2011-02-09 19:50:51,281 BloomFilter.java (line 84) Cannot provide an optimal Bloom Filter for 1986622313 elements (1/4 buckets per element).
ERROR [CompactionExecutor:1] 2011-02-09 19:51:10,190 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.EOFException
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:105)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:34)
        at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
        at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
        at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at com.google.common.collect.Iterators$7.computeNext(Iterators.java:604)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.ColumnIndexer.serializeInternal(ColumnIndexer.java:76)
        at org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:50)
        at org.apache.cassandra.io.LazilyCompactedRow.<init>(LazilyCompactedRow.java:88)
        at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:136)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:280)
        at org.apache.cassandra.db.ColumnSerializer.deserialize
Re: Exceptions on 0.7.0
I upgraded the version on all the nodes but I still get the exceptions. I ran cleanup on one of the nodes but I don't think there is any cleanup going on. Another weird thing that I see is: INFO [CompactionExecutor:1] 2011-02-10 12:08:21,353 CompactionIterator.java (line 135) Compacting large row 333531353730363835363237353338383836383035363036393135323132383 73630323034313a446f20322e384c20656e67696e657320686176652061646a75737461626c65206c696674657273 (725849473109 bytes) incrementally. In my production version the largest row is 10259. It shouldn't be different in this case. The first exception is being thrown on 3 nodes during compaction. The second exception (Internal error processing get_range_slices) is being thrown all the time by a fourth node. I disabled gossip and any client traffic to it and I still get the exceptions. Is it possible to boot a node with gossip disabled? Shimi

On Thu, Feb 10, 2011 at 11:11 AM, aaron morton aa...@thelastpickle.com wrote: You should be able to install the new version and kick off nodetool repair. If you are uncertain search for cassandra-1992 on the list, there has been some discussion. You can also wait till some peeps in the states wake up if you want to be extra sure. The number is the number of columns the iterator is going to return from the row. I'm guessing that, because this is happening during compaction, it asked for the maximum possible number of columns. Aaron
Exceptions on 0.7.0
(CollatingIterator.java:217)
        at org.apache.cassandra.db.RowIteratorFactory$3.getReduced(RowIteratorFactory.java:136)
        at org.apache.cassandra.db.RowIteratorFactory$3.getReduced(RowIteratorFactory.java:106)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.RowIterator.hasNext(RowIterator.java:49)
        at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1294)
        at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:438)
        at org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:473)
        at org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:2868)
        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:280)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:78)
        ... 21 more

Any idea what went wrong? Shimi
Re: Do you have a site in production environment with Cassandra? What client do you use?
Same here, Hector with Java. Shimi

On Fri, Jan 14, 2011 at 9:13 PM, Dan Kuebrich dan.kuebr...@gmail.com wrote: We've done hundreds of gigs in and out of cassandra 0.6.8 with pycassa 0.3. Working on upgrading to 0.7 and pycassa 1.03. I don't know if we're using it wrong, but the connection object is tied to a particular keyspace, a constraint that isn't that awesome; we have a number of keyspaces used simultaneously. Haven't looked into it yet.

On Fri, Jan 14, 2011 at 1:52 PM, Mike Wynholds m...@carbonfive.com wrote: We have one in production with Ruby / fauna Cassandra gem and Cassandra 0.6.x. The project is live but is stuck in a sort of private beta, so it hasn't really been run through any load scenarios. ..mike.. -- Michael Wynholds | Carbon Five | 310.821.7125 x13 | m...@carbonfive.com

On Fri, Jan 14, 2011 at 9:24 AM, Ertio Lew ertio...@gmail.com wrote: Hey, If you have a site in production environment or considering so, what is the client that you use to interact with Cassandra? I know that there are several clients available out there according to the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting the most required features for performance). Also preferably tell about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
Re: Reclaim deleted rows space
Am I missing something here? It is already possible to trigger major compaction on a specific CF. Shimi

On Thu, Jan 6, 2011 at 4:50 AM, Tyler Hobbs ty...@riptano.com wrote: Although it's not exactly the ability to list specific SSTables, the ability to only compact specific CFs will be in upcoming releases: https://issues.apache.org/jira/browse/CASSANDRA-1812 - Tyler

On Wed, Jan 5, 2011 at 7:46 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I was wondering if it made sense to have a JMX operation that can compact a list of tables by file name. This opens it up for power users to have more options than compacting an entire keyspace.
Re: Reclaim deleted rows space
On Wed, Jan 5, 2011 at 11:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Pretty sure there's logic in there that says don't bother compacting a single sstable.

No. You can do it. Based on the log I have a feeling that it triggers an infinite compaction loop.
Re: Reclaim deleted rows space
According to the code it makes sense. submitMinorIfNeeded() calls doCompaction(), which calls submitMinorIfNeeded(). With minimumCompactionThreshold = 1, submitMinorIfNeeded() will always run compaction. Shimi

On Thu, Jan 6, 2011 at 10:26 AM, shimi shim...@gmail.com wrote: On Wed, Jan 5, 2011 at 11:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Pretty sure there's logic in there that says don't bother compacting a single sstable. No. You can do it. Based on the log I have a feeling that it triggers an infinite compaction loop.
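The mutual recursion described above can be sketched as a toy model. This is a simplified illustration, not the actual Cassandra source: the class name, the sstable-size bookkeeping, and the 10-round safety cap are all invented. The point it demonstrates is that with minimumCompactionThreshold = 1, the single sstable produced by a compaction always qualifies for compaction again, so doCompaction() and submitMinorIfNeeded() keep re-entering each other:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the submitMinorIfNeeded()/doCompaction() mutual recursion.
// Each sstable is represented only by its size in GB.
public class CompactionLoopDemo {
    static int minimumCompactionThreshold = 1;
    static int rounds = 0;

    // Returns how many compaction rounds ran before the loop stopped.
    static int submitMinorIfNeeded(List<Integer> sstables) {
        // With threshold 1, even a single sstable is always "enough" to
        // compact; the artificial cap of 10 rounds stands in for "forever".
        if (sstables.size() >= minimumCompactionThreshold && rounds < 10) {
            return doCompaction(sstables);
        }
        return rounds;
    }

    static int doCompaction(List<Integer> sstables) {
        rounds++;
        // Merge all inputs into one output sstable of the combined size.
        int merged = sstables.stream().mapToInt(Integer::intValue).sum();
        sstables.clear();
        sstables.add(merged);
        // Re-check whether another compaction is needed, as described above.
        return submitMinorIfNeeded(sstables);
    }

    public static void main(String[] args) {
        List<Integer> sstables = new ArrayList<>(List.of(100, 100));
        // With threshold 1 this only stops because of the 10-round cap;
        // with threshold 2 it stops after the single useful merge.
        System.out.println("rounds=" + submitMinorIfNeeded(sstables));
    }
}
```

With minimumCompactionThreshold set back to 2, the same model performs one merge and halts, matching Jonathan's expectation that a lone sstable is not worth compacting.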
Re: maven cassandra plugin
I use Capistrano for install, upgrades, start, stop and restart. I use it for other projects as well. It is very useful for automated tasks that need to run on multiple machines. Shimi

On 2011 1 6 21:38, B. Todd Burruss bburr...@real.com wrote: has anyone created a maven plugin, like cargo for tomcat, for automating starting/stopping a cassandra instance?
Re: Reclaim deleted rows space
How is minor compaction triggered? Is it triggered only when a new SSTable is added? I was wondering if triggering a compaction with minimumCompactionThreshold set to 1 would be useful. If this can happen, I assume it will do compaction on files with similar size and remove deleted rows from the rest. Shimi

On Tue, Jan 4, 2011 at 9:56 PM, Peter Schuller peter.schul...@infidyne.com wrote: I don't have a problem with disk space. I have a problem with the data size. [snip] Bottom line is that I want to reduce the number of requests that go to disk. Since there is enough data that is no longer valid I can do it by reclaiming the space. The only way to do it is by running major compaction. I can wait and let Cassandra do it for me, but then the data size will get even bigger and the response time will be worse. I can do it manually but I prefer it to happen in the background with less impact on the system. Ok - that makes perfect sense then. Sorry for misunderstanding :) So essentially, for workloads that are teetering on the edge of cache warmness and are subject to significant overwrites or removals, it may be beneficial to perform much more aggressive background compaction even though it might waste lots of CPU, to keep the in-memory working set down. There was talk (I think in the compaction redesign ticket) about potentially improving the use of bloom filters such that obsolete data in sstables could be eliminated from the read set without necessitating actual compaction; that might help address cases like these too. I don't think there's a pre-existing silver bullet in a current release; you probably have to live with the need for greater-than-theoretically-optimal memory requirements to keep the working set in memory. -- / Peter Schuller
Re: Bootstrapping taking long
In my experience, most of the time it takes for a node to join the cluster is spent in anticompaction on the other nodes. The streaming part is very fast. Check the other nodes' logs to see if any node is doing anticompaction. I don't remember how much data I had in the cluster when I needed to add/remove nodes. I do remember that it took a few hours. The node will join the ring only when it finishes the bootstrap. Shimi

On Tue, Jan 4, 2011 at 12:28 PM, Ran Tavory ran...@gmail.com wrote: I asked the same question on the IRC but no luck there, everyone's asleep ;)... Using 0.6.6 I'm adding a new node to the cluster. It starts out fine but then gets stuck in the bootstrapping state for too long. More than an hour and still counting. $ bin/nodetool -p 9004 -h localhost streams Mode: Bootstrapping Not sending any streams. Not receiving any streams. It seemed to have streamed data from other nodes and indeed the load is non-zero, but I'm not clear what's keeping it right now from finishing. $ bin/nodetool -p 9004 -h localhost info 51042355038140769519506191114765231716 Load : 22.49 GB Generation No: 1294133781 Uptime (seconds) : 1795 Heap Memory (MB) : 315.31 / 6117.00 nodetool ring does not list this new node in the ring, although nodetool can happily talk to the new node; it's just not listing itself as a member of the ring. This is expected when the node is still bootstrapping, so the question is still how long might the bootstrap take and whether it is stuck. The data isn't huge so I find it hard to believe that streaming or anticompaction are the bottlenecks. I have ~20G on each node and the new node already has just about that, so it seems that all data had already been streamed to it successfully, or at least most of the data... So what is it waiting for now? (same question, rephrased... ;) I tried: 1. Restarting the new node. No good. All logs seem normal but at the end the node is still in bootstrap mode. 2. 
As someone suggested I increased the rpc timeout from 10k to 30k (RpcTimeoutInMillis) but that didn't seem to help. I did this only on the new node. Should I have done that on all (old) nodes as well? Or maybe only on the ones that were supposed to stream data to that node. 3. Logging level at DEBUG now but nothing interesting going on except for occasional messages such as [1] or [2] So the question is: what's keeping the new node from finishing the bootstrap and how can I check its status? Thanks [1] DEBUG [Timer-1] 2011-01-04 05:21:24,402 LoadDisseminator.java (line 36) Disseminating load info ... [2] DEBUG [RMI TCP Connection(22)-192.168.252.88] 2011-01-04 05:12:48,033 StorageService.java (line 1189) computing ranges for 28356863910078205288614550619314017621, 56713727820156410577229101238628035242, 85070591730234615865843651857942052863, 113427455640312821154458202477256070484, 141784319550391026443072753096570088105, 170141183460469231731687303715884105727 -- /Ran
Re: Reclaim deleted rows space
I think I didn't make myself clear. I don't have a problem with disk space. I have a problem with the data size. I have a simple CRUD application. Most of the requests are reads, but there are updates/deletes, and as time passes the number of deleted rows becomes big enough to free some disk space (a matter of days, not hours). Since not all of the data can fit in RAM (and I have a lot of RAM), the rest is served from disk. Since disk is slow, I want to reduce as much as possible the number of requests that go to the disk. The more requests go to the disk, the longer the disk wait time gets and the more time it takes to return a response. Bottom line is that I want to reduce the number of requests that go to disk. Since there is enough data that is no longer valid, I can do it by reclaiming the space. The only way to do it is by running major compaction. I can wait and let Cassandra do it for me, but then the data size will get even bigger and the response time will be worse. I can do it manually, but I prefer it to happen in the background with less impact on the system. Shimi

On Tue, Jan 4, 2011 at 2:33 PM, Peter Schuller peter.schul...@infidyne.com wrote: This is what I thought. I was wishing there might be another way to reclaim the space. Be sure you really need this first :) Normally you just let it happen in the bg. The problem is that the more data you have, the more time it will take Cassandra to respond. Relative to what though? There are definitely important side-effects of having very large data sets, and part of that involves compactions, but in a normal steady state type of system you should never be in the position to wait for a major compaction to run. Compactions are something that is intended to run every now and then in the background. It will result in variations in disk space within certain bounds, which is expected. 
Certainly the situation can be improved and the current disk space utilization situation is not perfect, but the above suggests to me that you're trying to do something that is not really intended to be done. Reclaim space of deleted rows in the biggest SSTable requires Major compaction. This compaction can be triggered by adding x2 data (or x4 data in the default configuration) to the system or by executing it manually using JMX. You can indeed choose to trigger major compactions by e.g. cron jobs. But just be aware that if you're operating under conditions where you are close to disk space running out, you have other concerns too - such as periodic repair operations also needing disk space. Also; suppose you're overwriting lots of data (or replacing by deleting and adding other data). It is not necessarily true that you need 4x the space relative to what you otherwise do just because of the compaction threshold. Keep in mind that compactions already need extra space anyway. If you're *not* overwriting or adding data, a compaction of a single CF is expected to need up to twice the amount of space that it occupies. If you're doing more overwrites and deletions though, as you point out you will have more dead data at any given point in time. But on the other hand, the peak disk space usage during compactions is lower. So the actual peak disk space usage (which is what matters since you must have this much disk space) is actually helped by the deletions/overwrites too. Further, suppose you trigger major compactions more often. That means each compaction will have a higher relative spike of disk usage because less data has had time to be overwritten or removed. So in a sense, it's like the disk space demands is being moved between the category of dead data retained for longer than necessary and peak disk usage during compaction. Also keep in mind that the *low* peak of disk space usage is not subject to any fragmentation concerns. 
Depending on the size of your data compared to e.g. column names, that disk space usage might be significantly lower than what you would get with an in-place updating database. There are lots of trade-offs :) You say you have to wait for deletions though which sounds like you're doing something unusual. Are you doing stuff like deleting lots of data in bulk from one CF, only to then write data to *another* CF? Such that you're actually having to wait for disk space to be freed to make room for data somewhere else? In case of a system that deletes data regularly, which needs to serve customers all day and the time it takes should be in ms, this is a problem. Not in general. I am afraid there may be some misunderstanding here. Unless disk space is a problem for you (i.e., you're running out of space), there is no need to wait for compactions. And certainly whether you can serve traffic 24/7 at low-ms latencies is an important consideration, and does become complex when disk I/O is involved, but it is not about disk *space*. If you have important performance
Re: Bootstrapping taking long
You will have something new to talk about in your talk tomorrow :) You said that the anticompaction was only on a single node? I think that your new node should get data from at least two other nodes (depending on the replication factor). Maybe the problem is not in the new node. In old versions (I think prior to 0.6.3) there was a case of stuck bootstrap that required a restart of the new node and the nodes which were supposed to stream data to it. As far as I remember that case was resolved. I haven't seen this problem since then. Shimi

On Tue, Jan 4, 2011 at 3:01 PM, Ran Tavory ran...@gmail.com wrote: Running nodetool decommission didn't help. Actually the node refused to decommission itself (b/c it wasn't part of the ring). So I simply stopped the process, deleted all the data directories and started it again. It worked in the sense that the node bootstrapped again, but as before, after it had finished moving the data nothing happened for a long time (I'm still waiting, but nothing seems to be happening). Any hints how to analyze a stuck bootstrapping node?? thanks

On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory ran...@gmail.com wrote: Thanks Shimi, so indeed anticompaction was run on one of the other nodes from the same DC, but to my understanding it has already ended. A few hours ago... I see plenty of log messages such as [1], which ended a couple of hours ago, and I've seen the new node streaming and accepting the data from the node which performed the anticompaction, and so far it was normal, so it seemed that data is at its right place. But now the new node seems sort of stuck. None of the other nodes is anticompacting right now or had been anticompacting since then. The new node's CPU is close to zero, its iostats are almost zero, so I can't find another bottleneck that would keep it hanging. On the IRC someone suggested I'd maybe retry to join this node, e.g. decommission and rejoin it again. I'll try it now... 
[1] INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')] INFO [COMPACTION-POOL:1] 2011-01-04 04:34:18,683 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3874-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3873-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3876-Data.db')] INFO [COMPACTION-POOL:1] 2011-01-04 04:34:19,132 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-951-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-976-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-978-Data.db')] INFO [COMPACTION-POOL:1] 2011-01-04 04:34:26,486 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')] On Tue, Jan 4, 2011 at 12:45 PM, shimi shim...@gmail.com wrote: In my experience most of the time it takes for a node to join the cluster is the anticompaction on the other nodes. The streaming part is very fast. Check the other nodes logs to see if there is any node doing anticompaction. I don't remember how much data I had in the cluster when I needed to add/remove nodes. I do remember that it took a few hours. The node will join the ring only when it will finish the bootstrap. Shimi On Tue, Jan 4, 2011 at 12:28 PM, Ran Tavory ran...@gmail.com wrote: I asked the same question on the IRC but no luck there, everyone's asleep ;)... Using 0.6.6 I'm adding a new node to the cluster. 
It starts out fine but then gets stuck in the bootstrapping state for too long. More than an hour and still counting. $ bin/nodetool -p 9004 -h localhost streams Mode: Bootstrapping Not sending any streams. Not receiving any streams. It seemed to have streamed data from other nodes, and indeed the load is non-zero, but I'm not clear what's keeping it from finishing right now. $ bin/nodetool -p 9004 -h localhost info 51042355038140769519506191114765231716 Load : 22.49 GB Generation No: 1294133781 Uptime (seconds) : 1795 Heap Memory (MB) : 315.31 / 6117.00 nodetool ring does not list this new node in the ring; although nodetool can happily talk to the new node, it's just not listing itself as a member of the ring. This is expected when the node is still bootstrapping, so the question is still how long the bootstrap might take and whether it is stuck. The data isn't huge so I find it hard to believe that streaming or anticompaction are the bottlenecks. I have ~20G on each node
Reclaim deleted rows space
Let's assume I have: * a single 100GB SSTable file * min compaction threshold set to 2 If I delete rows which are located in this file, is the only way to clean the deleted rows by inserting another 100GB of data, or by triggering a painful major compaction? Shimi
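For what it's worth, the situation can be illustrated with a toy model (a hypothetical sketch of size-tiered minor compaction, not Cassandra internals): tombstones in an SSTable are only purged when that SSTable takes part in a compaction, and with a min compaction threshold of 2, a minor compaction of similarly-sized files won't fire until a second file of comparable size exists — hence the dilemma between writing another ~100GB and forcing a major compaction.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of size-tiered minor compaction (NOT Cassandra internals):
// deleted rows in an SSTable are only reclaimed when that SSTable is
// compacted, and a minor compaction of a "bucket" of similarly-sized
// SSTables needs at least minCompactionThreshold files in the bucket.
public class CompactionSketch {
    static final int MIN_COMPACTION_THRESHOLD = 2;

    // Sizes (GB) of SSTables in one bucket of similar size.
    static boolean minorCompactionFires(List<Integer> bucketSizesGb) {
        return bucketSizesGb.size() >= MIN_COMPACTION_THRESHOLD;
    }

    public static void main(String[] args) {
        List<Integer> bucket = new ArrayList<>();
        bucket.add(100);  // the single 100GB SSTable from the question
        System.out.println(minorCompactionFires(bucket));  // false: deletes linger

        bucket.add(100);  // only after another ~100GB of writes...
        System.out.println(minorCompactionFires(bucket));  // true: tombstones can go
    }
}
```

The alternative, a major compaction, simply forces every SSTable into one compaction regardless of bucket sizes — which is why it reclaims the space immediately but is expensive.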
iterate over all the rows with RP
Is the same connection required when iterating over all the rows with Random Partitioner, or is it possible to use a different connection for each iteration? Shimi
Re: iterate over all the rows with RP
So if I use a different connection (Thrift via Hector), will I get the same results? It makes sense when you use OPP, and I assume it is the same with RP. I just wanted to make sure this is the case and that there is no state which is kept. Shimi On Sun, Dec 12, 2010 at 8:14 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is the same connection required when iterating over all the rows with Random Partitioner, or is it possible to use a different connection for each iteration? In general, the choice of RPC connection (I assume you mean the underlying thrift connection) does not affect the semantics of the RPC calls. -- / Peter Schuller
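Peter's point, that no per-connection state is involved, can be sketched like this (a hypothetical stand-in for get_range_slices, not the real Thrift API): the range query is stateless on the server side, so the only cursor is the last key the client saw, and any connection can issue the next call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Why switching connections mid-iteration is harmless: the range query is
// stateless, so the iteration cursor lives entirely on the client.
// (Hypothetical stand-in for get_range_slices, not the real Thrift API.)
public class StatelessScan {
    static final NavigableMap<String, String> ROWS = new TreeMap<>();

    // Stateless range query: up to `count` rows with key >= start.
    static List<String> rangeSlice(String start, int count) {
        List<String> out = new ArrayList<>(ROWS.tailMap(start, true).keySet());
        return out.subList(0, Math.min(count, out.size()));
    }

    public static void main(String[] args) {
        for (String k : new String[]{"a", "b", "c", "d"}) ROWS.put(k, "v");

        // "Connection #1" fetches the first page...
        List<String> page1 = rangeSlice("", 2);            // [a, b]
        // ...and "connection #2" resumes from page1's last key: exactly what
        // a single connection would have gotten (note the repeated last key).
        List<String> page2 = rangeSlice(page1.get(1), 2);  // [b, c]
        System.out.println(page1 + " " + page2);
    }
}
```

Since no state is kept server-side, a pooled client like Hector can route each call to a different connection without changing the results.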
FatClient Gossip error and some other problems
) at java.util.TimerThread.run(Timer.java:462) INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node /X.X.X.X is now part of the cluster Does anyone have any idea how I can clean up the problematic node? Does anyone have any idea how I can get rid of the Gossip error? Shimi
Re: FatClient Gossip error and some other problems
I was patient (although it is hard when you have millions of requests which are not served in time). I was waiting for a long time. There was nothing in the logs or in JMX. Shimi On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek gdusba...@gmail.com wrote: On Mon, Sep 20, 2010 at 09:51, shimi shim...@gmail.com wrote: I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter). I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5). My old cluster was unbalanced since I was using Random Partitioner and I bootstrapped all the nodes without specifying their tokens. Since I wanted the cluster to be balanced, I first added all the new nodes one after the other (with the right tokens this time) and then ran decommission on all the old ones, one after the other. One of the decommissioned nodes began throwing too many open files errors while it was decommissioning, taking other nodes down with it. After the second try I decided to stop it and run removetoken on its token from one of the other nodes. After that everything went well, except that in the end one of the nodes looked unbalanced. I decided to run repair on the cluster. What I got was totally unbalanced nodes with way too much data compared to what they were supposed to have; each node had 2x-4x more data. I ran cleanup, and all of them except the one which was unbalanced to begin with got back to the size they were supposed to be. Now whenever I try to run cleanup on this node I get: INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java (line 339) AntiCompacting ... 
INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used; max is 6552551424 INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) Pool NameActive Pending INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) STREAM-STAGE 0 0 INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) RESPONSE-STAGE0 0 INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) ROW-READ-STAGE8 717 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) LB-OPERATIONS 0 0 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) MISCELLANEOUS-POOL0 0 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) GMFD 0 2 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) CONSISTENCY-MANAGER 0 1 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) LB-TARGET 0 0 INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) ROW-MUTATION-STAGE0 0 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) MESSAGE-STREAMING-POOL0 0 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) LOAD-BALANCER-STAGE 0 0 INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) FLUSH-SORTER-POOL 0 0 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 0 0 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) AE-SERVICE-STAGE 0 0 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) FLUSH-WRITER-POOL 0 0 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) HINTED-HANDOFF-POOL 0 0 INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) CompactionManager n/a 0 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 SSTableDeletingReference.java (line 104) Deleted ... 
INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 SSTableDeletingReference.java (line 104) Deleted ... INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 SSTableDeletingReference.java (line 104) Deleted ... INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 SSTableDeletingReference.java (line 104) Deleted ... and after that I saw an increase in the node response time and in the number of ROW-READ-STAGE pending tasks. Since there was no indication that something was wrong or that the node was doing anything (logs, nodetool and JMX), the only thing I could do was restart the server. I don't know if this is related, but every hour I see this error (I think it is the IP of the machine that I couldn't decommission properly): INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402
Re: Bootstrap question
If I have problems with never-ending bootstrapping I do the following; I try each one, and if it doesn't help I try the next. It might not be the right thing to do, but it worked for me. 1. Restart the bootstrapping node 2. If I see streaming 0/ I restart the node and all the streaming nodes 3. Restart all the nodes 4. If there is data in the bootstrapping node I delete it before I restart. Good luck Shimi On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro antho...@alumni.caltech.edu wrote: So still waiting for any sort of answer on this one. The cluster still refuses to do anything when I bring up new nodes. I shut down all the new nodes and am waiting. I'm guessing that maybe the old nodes have some state which needs to get cleared out? Is there anything I can do at this point? Are there alternate strategies for bootstrapping I can try? (For instance can I just scp all the sstables to all the new nodes and do a repair, would that actually work?). Anyone seen this sort of issue? All this is with 0.6.3 so I assume eventually others will see this issue. -Anthony On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote: Okay, so things were pretty messed up. I shut down all the new nodes, then the old nodes started doing the "half the ring is down" garbage which pretty much requires a full restart of everything. So I had to shut everything down, then bring the seed back, then the rest of the nodes, so they finally all agreed on the ring again. Then I started one of the new nodes, and have been watching the logs; so far 2 hours since the Bootstrapping message appeared in the new node's log and nothing has happened. No anticompaction messages anywhere; there's one node compacting, but it's on the other end of the ring, so nowhere near that new node. I'm wondering if it will ever get data at this point. Is there something else I should try? The only thing I can think of is deleting the system directory on the new node, and restarting, so I'll try that and see if it does anything. 
-Anthony On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote: On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: Is the fact that 2 new nodes are in the range messing it up? Probably. And if so how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, then bringing up nodes 2,4, waiting for them to finish, then bringing up 3,5?). Yes. You might have to restart the old nodes too to clear out the confusion. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Anthony Molinaro antho...@alumni.caltech.edu
get_range_slices return the same rows
I wrote code that iterates over all the rows using get_range_slices. For the first call I use a KeyRange from to . For all the others I use a range from the last key that I got in the previous iteration to . I always get the same rows that I got in the previous iteration. I tried changing the batch size but I still get the same results. I tried it both on a single node and on a cluster. I use RP with version 0.6.3 and Hector. Does anyone know how this can be done? Shimi
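The usual paging pattern can be sketched as follows (a toy model, not the real Thrift API or Hector calls): under RandomPartitioner rows come back in token order, the next KeyRange must start at the last key of the previous page, and that key is returned again as the first row of the new page, so the client has to skip it; forgetting either step is a common way to end up seeing the same rows repeatedly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of paging all rows under RandomPartitioner (hypothetical stand-in
// for get_range_slices): rows are ordered by token, each new KeyRange starts
// at the last key of the previous page, and that key is returned again as
// the first row of the new page, so the client must drop it.
public class RangePager {
    static final NavigableMap<Long, String> BY_TOKEN = new TreeMap<>();

    // Toy "token": a stand-in for the partitioner's hash of the key.
    static long token(String key) { return key.hashCode() & 0x7fffffffL; }

    static void insert(String key) { BY_TOKEN.put(token(key), key); }

    // Stateless range query: up to `count` rows whose token >= token(startKey).
    static List<String> rangeSlice(String startKey, int count) {
        long from = startKey.isEmpty() ? Long.MIN_VALUE : token(startKey);
        List<String> out = new ArrayList<>();
        for (String k : BY_TOKEN.tailMap(from, true).values()) {
            if (out.size() == count) break;
            out.add(k);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"r1", "r2", "r3", "r4", "r5"}) insert(k);

        List<String> all = new ArrayList<>();
        String start = "";                      // empty start = beginning of the ring
        while (true) {
            List<String> page = rangeSlice(start, 2);
            if (!start.isEmpty() && !page.isEmpty()) page.remove(0); // skip repeated row
            if (page.isEmpty()) break;          // no new rows: iteration is done
            all.addAll(page);
            start = all.get(all.size() - 1);    // resume from the last key seen
        }
        System.out.println(all.size());         // 5: each row exactly once
    }
}
```

If the start key is fed back without being advanced, or compared in key order rather than token order, every call returns the same page, which matches the symptom described above.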