Cassandra 1.2.19 and Java 8
Hello,

We still have an installation of Cassandra on the 1.2.19 release, running on Java 7. We do plan on upgrading to a newer version, but in the meantime there have been some questions internally about running 1.2 on Java 8 until the upgrade can be fully completed.

I seem to remember speaking to someone a while back who advised against running the 1.2 + Java 8 combination. Unfortunately, I can't remember the exact reasoning behind the recommendation. It could have just been that no one was really doing it, and therefore it wasn't fully tested.

Does anyone here have experience with Cassandra 1.2 and Java 8 in production? Any known issues or gotchas?

Cheers!
-Tim

--
Tim Heckman
Operations Engineer
PagerDuty, Inc.
Re: Ghost compaction process
Does `nodetool compactionstats` show nothing running as well? Also, for posterity, what are some details of the setup (C* version, etc.)?

-Tim

--
Tim Heckman
Operations Engineer
PagerDuty, Inc.

On Sun, Jun 7, 2015 at 6:40 PM, Arturas Raizys artu...@noantidot.com wrote:
> Hello,
>
> I'm having a problem where on one node I have a continuous compaction process running and consuming CPU. nodetool tpstats shows 1 compaction in progress, but if I try to query the system.compactions_in_progress table, I see 0 records. This never-ending compaction slows the node down and it becomes laggy.
>
> I'm willing to hire a contractor to solve this problem if anyone is interested.
>
> Cheers,
> Arturas
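For anyone scripting this check later: a small sketch of pulling the pending-task count out of `nodetool compactionstats` output, so it can be compared against what tpstats and system.compactions_in_progress report. The sample output below is canned for illustration; in practice, pipe in the real command.

```shell
# Canned stand-in for:  nodetool -h <host> compactionstats
sample_output='pending tasks: 1
          compaction type   keyspace   table   completed   total   unit   progress'

# Extract the pending-task count; cross-check it against:
#   nodetool tpstats                                  (thread pool view)
#   SELECT * FROM system.compactions_in_progress;     (system table view)
pending=$(printf '%s\n' "$sample_output" | awk -F': ' '/pending tasks/ {print $2}')
echo "$pending"
```

When tpstats shows an active compaction but both of the other views show nothing, that mismatch itself is worth including in any bug report.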
Re: Does nodetool repair stop the node to answer requests ?
On Thu, Jan 22, 2015 at 10:22 AM, Jan cne...@yahoo.com wrote:
> Running a 'nodetool repair' will 'not' bring the node down.

It's not something that happens during normal operation, but if something goes sideways and resource usage climbs, a repair can definitely cripple a node.

> Your question: does a nodetool repair make the server stop serving requests, or does it just use a lot of resources but still serve requests?
> Answer: NO, the server will not stop serving requests. It will use some resources, but not enough to affect the server serving requests.

I don't think this is right. I've personally seen repair operations cause real bad things to happen to an entire Cassandra cluster. The only mitigation was to shut that misbehaving node down, after which normal operations continued within the cluster.

> hope this helps
> Jan

Cheers!
-Tim
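As an aside: when a repair does start to cripple a node, shutting the node down isn't the only lever. A sketch of throttling the two big background resource consumers via nodetool, written as a dry run that prints the commands first. The host name and MB/s values are illustrative, not recommendations.

```shell
# Dry-run helper: prints the nodetool throttle commands rather than
# running them; drop the `echo`s to apply them for real.
throttle_for_repair() {
  host="$1"; compact_mbps="$2"; stream_mbps="$3"
  echo nodetool -h "$host" setcompactionthroughput "$compact_mbps"
  echo nodetool -h "$host" setstreamthroughput "$stream_mbps"
}

# Hypothetical host; 16 MB/s for compaction, 50 MB/s for streaming.
throttle_for_repair cassandra-node-1 16 50
```

Both settings take effect without a restart, which makes them a gentler first response than killing the node mid-repair.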
Re: nodetool repair exception
On Sat, Dec 6, 2014 at 8:05 AM, Eric Stevens migh...@gmail.com wrote:
> The official recommendation is 100k:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
>
> I wonder if there's an advantage to this over unlimited if you're running servers which are dedicated to your Cassandra cluster (which you should be for anything production).

There is the potential to have monitoring systems, and other small agents, running on production systems. I could see the limit simply as a stop-gap to prevent Cassandra from starving the system of free file descriptors. In theory, if there isn't a proper watchdog on your monitors, an unlimited setting could prevent an issue from raising an alert. That's just one potential advantage I could think of, however.

Cheers!
-Tim

> On Fri, Dec 05, 2014 at 2:39:24 PM, Robert Coli rc...@eventbrite.com wrote:
>> On Wed, Dec 3, 2014 at 6:37 AM, Rafał Furmański rfurman...@opera.com wrote:
>>> I see "Too many open files" exceptions in the logs, but I'm sure that my limit is now 150k. Should I increase it? What's a reasonable limit of open files for cassandra?
>>
>> Why provide any limit? ulimit allows unlimited?
>> =Rob
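To make the stop-gap measurable: a Linux-only sketch comparing a process's open-descriptor count against its soft limit via /proc. The `CassandraDaemon` pattern for finding the pid is an assumption about how the process appears in ps output.

```shell
# Report "<open-count> <soft-limit>" for a given pid (Linux /proc only).
fd_usage() {
  pid="$1"
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  echo "$open $limit"
}

# Against Cassandra it would look something like:
#   fd_usage "$(pgrep -f CassandraDaemon | head -n 1)"
# Demonstrated here against the current shell:
fd_usage $$
```

Alerting when the first number approaches, say, 80% of the second catches the "Too many open files" condition before it hits the logs.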
Re: full gc too often
On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote:
> Hi all,
>
> I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a problem with full GC: sometimes one or two nodes will full GC more than once per minute, taking over 10 seconds each time, and then the node becomes unreachable and the latency of the cluster increases.
>
> Grepping GCInspector's log, I found that when a node is running fine without GC trouble there are two kinds of GC:
>
> ParNew GC in less than 300ms, which clears Par Eden Space and slightly grows CMS Old Gen / Par Survivor Space (because GCInspector only logs GCs longer than 200ms, there are only a small number of ParNew GCs in the log).
>
> ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen by a lot and grows Par Eden Space a little; it runs once every 1-2 hours.
>
> However, sometimes ConcurrentMarkSweep behaves strangely, as these lines show:
>
> INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
> INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
> INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
> INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
> INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
> INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
> INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
> INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
> INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
> INFO [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms. CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0
>
> Each time, Old Gen is reduced only a little and Survivor Space is cleared, but the heap is still full, so another full GC follows very soon and then the node goes down. If I restart the node, it runs fine without GC trouble.
>
> Can anyone help me find out why full GC can't reduce CMS Old Gen? Is it because there are too many objects in the heap that can't be recycled? I think reviewing the table schema design and adding new nodes to the cluster is a good idea, but I still want to know if there is any other reason causing this trouble.
>
> Thanks,
> Philo Yang

How much total system memory do you have? How much is allocated for heap usage? How big is your working data set?

The reason I ask is that I've seen problems with lots of GC activity and no room gained, and it was memory pressure: not enough heap. We decided that just increasing the heap size was a bad idea, as we relied on free RAM being used for filesystem caching. So some vertical and horizontal scaling allowed us to give Cassandra more heap space, as well as distribute the workload, to try and avoid further problems.

Cheers!
-Tim
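One way to confirm the pattern from the log alone: pull the before/after CMS Old Gen sizes out of a GCInspector line and compute what the collection actually reclaimed. A sketch using the first line from the excerpt above; note the negative result, meaning Old Gen actually grew across a 12-second pause, which is the classic sign of a heap too full to collect.

```shell
# One GCInspector line from the log excerpt above; the numbers around
# "->" are the region size before and after the collection.
line='ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0'

before=$(printf '%s' "$line" | sed -n 's/.*CMS Old Gen: \([0-9]*\) -> .*/\1/p')
after=$(printf '%s' "$line" | sed -n 's/.*CMS Old Gen: [0-9]* -> \([0-9]*\);.*/\1/p')
echo "reclaimed $((before - after)) bytes"   # prints "reclaimed -40 bytes"
```

Running this across the whole log excerpt shows every one of those 10+ second pauses reclaiming effectively nothing, which is what points at memory pressure rather than a GC tuning problem.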
Re: Cassandra DC2 nodes down after increasing write requests on DC1 nodes
Hello Gabriel,

On Sun, Nov 16, 2014 at 7:25 AM, Gabriel Menegatti gabr...@s1mbi0se.com.br wrote:
> I said that load was not a big deal, because OpsCenter shows this load as green, not as yellow or red at all. Also, our servers have many processors/threads, so I *think* this load is not problematic.

I've seen Cassandra clusters fall over with less load on the boxes, so I'm not sure how much I trust OpsCenter here. That said, the impact depends on the system resources available to you. How many CPU cores do these systems have? How much total and free memory? Are the underlying disks SSDs or spinning platters of rust?

> My assumption is that for some reason the 10 DC2 nodes are not able to handle the volume of requests from DC1, as it was 30 nodes. Even so, in my point of view the load of the DC2 nodes should go really high before Cassandra goes down, but it's not doing so.
>
> Regards,
> Gabriel

That would make sense if the nodes are under-provisioned for the work you are trying to throw at them. The load averages and the OOM in the heap seem to indicate that may be the problem. However, without more details it's hard to say.

Cheers!
-Tim
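For concreteness, a sketch of gathering the answers to those three questions on a Linux box. The paths are Linux-specific, and the rotational flag is only a rough SSD-versus-spinning heuristic.

```shell
# CPU cores available
cores=$(getconf _NPROCESSORS_ONLN)
echo "cores: $cores"

# Total and free memory, in MB
awk '/^MemTotal|^MemFree/ {printf "%s %d MB\n", $1, $2/1024}' /proc/meminfo

# Per block device: 0 = non-rotational (SSD), 1 = rotational (spinning)
cat /sys/block/*/queue/rotational 2>/dev/null
```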
Repair/Compaction Completion Confirmation
Hello,

I am looking to change how we trigger maintenance operations in our C* clusters. The end goal is to schedule and run the jobs using a system that is backed by Serf to handle the event propagation.

I know that when issuing some operations via nodetool, the command blocks until the operation is finished. However, is there a way to reliably determine whether or not the operation has finished without monitoring that invocation of nodetool? In other words, when I run 'nodetool repair', what is the best way to reliably determine that the repair is finished without running something equivalent to a 'pgrep' against the command I invoked?

I am curious about trying to do the same for major compactions too.

Cheers!
-Tim
Re: Repair/Compaction Completion Confirmation
On Mon, Oct 27, 2014 at 1:44 PM, Robert Coli rc...@eventbrite.com wrote:
> On Mon, Oct 27, 2014 at 1:33 PM, Tim Heckman t...@pagerduty.com wrote:
>> I know that when issuing some operations via nodetool, the command blocks until the operation is finished. However, is there a way to reliably determine whether or not the operation has finished without monitoring that invocation of nodetool? In other words, when I run 'nodetool repair' what is the best way to reliably determine that the repair is finished without running something equivalent to a 'pgrep' against the command I invoked? I am curious about trying to do the same for major compactions too.
>
> This is beyond a FAQ at this point, unfortunately; non-incremental repair is awkward to deal with and probably impossible to automate. In The Future [1] the correct solution will be to use incremental repair, which mitigates but does not solve this challenge entirely.
>
> As brief meta commentary, it would have been nice if the project had spent more time optimizing the operability of the critically important thing you must do once a week [2].
>
> https://issues.apache.org/jira/browse/CASSANDRA-5483
>
> =Rob
>
> [1] http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
> [2] Or, more sensibly, once a month with gc_grace_seconds set to 34 days.

Thank you for getting back to me so quickly. Not the answer that I was secretly hoping for, but it is nice to have confirmation. :)

Cheers!
-Tim
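For what it's worth, one fragile workaround is to watch the system log for repair session lifecycle messages instead of the nodetool process. A sketch of the pattern match follows; the log line is illustrative and the exact message text varies across Cassandra versions, so treat the patterns as assumptions to verify against your own system.log before automating anything on them.

```shell
# Illustrative log line; in practice the input would come from
# something like:  tail -F /var/log/cassandra/system.log
log='INFO [AntiEntropySessions:1] RepairSession.java [repair #8b9fed80] session completed successfully'

# Classify the line; the match strings are assumptions, not a stable API.
case "$log" in
  *"session completed successfully"*) status=finished ;;
  *"session failed"*)                 status=failed ;;
  *)                                  status=running ;;
esac
echo "$status"
```

The fragility is exactly why this stays a workaround: the log format is not a contract, and a missed line means a scheduler that waits forever.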
Reading SSTables Potential File Descriptor Leak 1.2.18
Hello,

I ran into a problem today where Cassandra 1.2.18 exhausted its number of permitted open file descriptors (65,535). This node has 256 tokens (vnodes) and runs in a test environment with relatively little traffic/data. As best I could tell, the majority of the open file descriptors were for a single SSTable '.db' file.

Looking in the error logs I found quite a few exceptions that looked to be identical:

ERROR [ReadStage:3817] 2014-09-19 07:00:11,056 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:3817,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: unable to seek to position 29049 in /mnt/var/lib/cassandra/data/path/to/file.db (1855 bytes) in read-only mode

Upon further investigation, it turns out this file became 'read-only' after the Cassandra node was gracefully restarted last week. I'd imagine that is a discussion for another email thread. I fixed the issue by running:

nodetool scrub Keyspace
nodetool repair Keyspace

Attached to this email is one of the log entries/stacktraces for this exception. Before opening a JIRA ticket I thought I'd reach out to the list to see if anyone has seen similar behavior, and do a bit of source-diving to try and verify that the descriptor is actually leaking.

Cheers!
-Tim

ERROR [ReadStage:3817] 2014-09-19 07:00:11,056 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:3817,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: unable to seek to position 29049 in /mnt/var/lib/cassandra/data/IzanagiQueue/WorkQueue/IzanagiQueue-WorkQueue-ic-1-Data.db (1855 bytes) in read-only mode
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1626)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: unable to seek to position 29049 in /mnt/var/lib/cassandra/data/IzanagiQueue/WorkQueue/IzanagiQueue-WorkQueue-ic-1-Data.db (1855 bytes) in read-only mode
    at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:306)
    at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:42)
    at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1048)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.setToRowStart(IndexedSliceReader.java:130)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:91)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:44)
    at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:104)
    at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:272)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1398)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1130)
    at org.apache.cassandra.db.Table.getRow(Table.java:348)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1070)
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1622)
    ... 3 more
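A sketch of the check that surfaced this in the first place: rank the node's open files by descriptor count, so a single .db file held open thousands of times stands out. The live lsof invocation is left commented out (the pid-lookup pattern is an assumption), and the aggregation is demonstrated on canned file names, since the useful part is the `sort | uniq -c | sort -rn` shape.

```shell
# Against a live node it would look something like:
#   pid=$(pgrep -f CassandraDaemon | head -n 1)
#   lsof -p "$pid" | awk '{print $NF}' | sort | uniq -c | sort -rn | head

# The same aggregation over canned names, showing the leaked-file shape:
top=$(printf '%s\n' a.db a.db a.db b.db | sort | uniq -c | sort -rn | head -n 1)
echo "$top"   # the most-opened file floats to the top with its count
```

A healthy node shows small counts per SSTable; a leak shows one file's count climbing over time.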
Failed to enable shuffling error
Hello,

I'm looking to convert our recently upgraded Cassandra cluster from a single token per node to using vnodes. We've determined, based on our data consistency and usage patterns, that shuffling will be the best way to convert our live cluster.

However, when following the instructions for doing the shuffle, we aren't able to enable shuffling on the other 4 nodes in the cluster. We get the error message 'Failed to enable shuffling', which looks to be a generic string printed when a JMX IOException is caught. Unfortunately, the underlying error is not printed, so I'm effectively troubleshooting in the dark.

I've done some mailing list diving, as well as Google skimming, and none of the suggestions seemed to work. I've confirmed that a firewall is not the cause, as I am able to establish a TCP socket (using telnet) from one node to the other. I've also double-checked the JMX-specific settings that are being set for Cassandra, and those look good. I'm going with the most open settings now to try and get this working:

-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

I also tried playing with the 'java.rmi.server.hostname' setting, but none of the options set seemed to make a difference (hostname, FQDN, public IPv4 address, private EC2 address).

Without any further information from the 'cassandra-shuffle' utility, I'm pretty much out of ideas. Any suggestions would be greatly appreciated!

Cheers!
-Tim
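One way to narrow down whether this is JMX-wide or shuffle-specific: probe the same JMX port with plain nodetool from the same source host, since it goes through the same RMI path as cassandra-shuffle. A sketch; the host names are hypothetical.

```shell
# Probe JMX reachability on each peer using nodetool's own JMX client.
check_jmx() {
  for host in "$@"; do
    if nodetool -h "$host" -p 7199 version >/dev/null 2>&1; then
      echo "$host: JMX ok"
    else
      echo "$host: JMX unreachable"
    fi
  done
}

check_jmx node1 node2 node3
```

If nodetool succeeds where cassandra-shuffle fails, the java.rmi.server.hostname angle is worth revisiting: RMI hands the client back a second connection address that can differ from the one originally dialed, so a raw telnet to 7199 succeeding proves less than it appears to.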
Re: Failed to enable shuffling error
On Mon, Sep 8, 2014 at 11:19 AM, Robert Coli rc...@eventbrite.com wrote:
> On Mon, Sep 8, 2014 at 11:08 AM, Tim Heckman t...@pagerduty.com wrote:
>> I'm looking to convert our recently upgraded Cassandra cluster from a single token per node to using vnodes. We've determined, based on our data consistency and usage patterns, that shuffling will be the best way to convert our live cluster.
>
> You apparently haven't read anything else about shuffling, or you would have learned that no one has ever successfully done it in a real production cluster. ;)

I've definitely seen the horror stories that have come out of shuffle. :) We plan on giving this a trial run on production-sized data before actually doing it on our production hardware.

>> Unfortunately, the underlying error is not printed so I'm effectively troubleshooting in the dark.
>
> This mysterious error is protecting you from a probably quite negative experience with shuffle.

We're still at the exploratory stage, on systems that are not production-facing but contain production-like data. Based on our placement strategy we have some concerns that the new-datacenter approach may be riskier or more difficult. We're just trying to gauge both paths and see what works best for us.

>> I've done some mailing list diving, as well as Google skimming, and all the suggestions did not seem to work.
>
> What version of Cassandra are you running? I would not be surprised if shuffle is in fact completely broken in the 2.0.x release, not only hazardous to attempt. Why do you believe that you want to shuffle and/or enable vnodes? How large is the cluster and how large is it likely to become?
>
> =Rob

We're still back on the 1.2 version of Cassandra, specifically 1.2.16 for the majority of our clusters, with one cluster on 1.2.18 since it was built after that release. The cluster I'm testing this on is a 5-node cluster with a placement strategy such that all nodes contain 100% of the data.

In practice we have six clusters of similar size that are used for different services. These different clusters may need additional capacity at different times, so it's hard to answer the maximum-size question. For now let's just assume that the clusters may never see an 11th member... but no guarantees.

We're looking to use vnodes to ease the administrative work of scaling out the cluster, plus the improvements to streaming data during repairs, among others. Shuffle looks like it may be easier than adding a new datacenter and then having to adjust the schema for the new datacenter to come to life. And we weren't sure whether the same pitfalls of shuffle would affect us, given that all data is on all nodes.

Thanks for the quick reply, Rob.

-Tim
Re: Failed to enable shuffling error
On Mon, Sep 8, 2014 at 1:45 PM, Jonathan Haddad j...@jonhaddad.com wrote:
> I believe shuffle has been removed recently. I do not recommend using it for any reason.

We're still using the 1.2.x branch of Cassandra, and will be for some time due to the Thrift deprecation. Has it only been removed from the 2.x line?

> If you really want to go vnodes, your only sane option is to add a new DC that uses vnodes and switch to it.

We use the NetworkTopologyStrategy across three geographically separated regions. Doing it this way feels a bit more risky given our replication strategy. Also, I'm not sure where all we have our current datacenter names defined across our different internal repositories, so there could be quite a large number of changes going this route.

> The downside in the 2.0.x branch to using vnodes is that repairs take N times as long, where N is the number of tokens you put on each node. I can't think of any other reasons why you wouldn't want to use vnodes (but this may be significant enough for you by itself). 2.1 should address the repair issue for most use cases.
>
> Jon

Thank you for the notes on the behaviors in the 2.x branch. If we do move to the 2.x version, that's something we'll be keeping in mind.

Cheers!
-Tim

> On Mon, Sep 8, 2014 at 1:28 PM, Robert Coli rc...@eventbrite.com wrote:
>> On Mon, Sep 8, 2014 at 1:21 PM, Tim Heckman t...@pagerduty.com wrote:
>>> We're still at the exploratory stage on systems that are not production-facing but contain production-like data. Based on our placement strategy we have some concerns that the new datacenter approach may be riskier or more difficult. We're just trying to gauge both paths and see what works best for us.
>>
>> Your case of RF=N is probably the best possible case for shuffle, but general statements about how much this code has been exercised remain. :)
>>
>>> The cluster I'm testing this on is a 5 node cluster with a placement strategy such that all nodes contain 100% of the data. In practice we have six clusters of similar size that are used for different services. These different clusters may need additional capacity at different times, so it's hard to answer the maximum size question. For now let's just assume that the clusters may never see an 11th member... but no guarantees.
>>
>> With an RF of 3, cluster sizes of under approximately 10 tend to net lose from vnodes. If these clusters are not very likely to ever have more than 10 nodes, consider not using vnodes.
>>
>>> We're looking to use vnodes to help with easing the administrative work of scaling out the cluster. The improvements of streaming data during repairs amongst others.
>>
>> Most of these wins don't occur until you have a lot of nodes, but the fixed costs of having many ranges are paid all the time.
>>
>>> For shuffle, it looks like it may be easier than adding a new datacenter and then have to adjust the schema for a new datacenter to come to life. And we weren't sure whether the same pitfalls of shuffle would affect us while having all data on all nodes.
>>
>> Let us know! Good luck!
>>
>> =Rob

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
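To close the loop on cluster shape: since each line of `nodetool ring` output is one token, counting lines per host is a quick way to confirm which nodes are vnode-enabled (num_tokens lines, 256 by default) versus single-token (one line). Demonstrated over a canned two-host sample; the host-in-first-column layout is an assumption about the 1.2 ring format, so check it against real output first.

```shell
# Canned stand-in for `nodetool ring` output: one line per token,
# host address in the first column.
ring='10.0.0.1 rack1 Up Normal 10.2 GB 33.3% 100
10.0.0.1 rack1 Up Normal 10.2 GB 33.3% 200
10.0.0.2 rack1 Up Normal 10.1 GB 33.3% 300'

# Tokens per host; a vnode node shows many lines, a classic node one.
printf '%s\n' "$ring" | awk '{n[$1]++} END {for (h in n) print h, n[h]}' | sort
```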