Cassandra 1.2.19 and Java 8

2016-01-12 Thread Tim Heckman
Hello,

We still have an installation of Cassandra on the 1.2.19 release,
running on Java 7. We do plan on upgrading to a newer version, but in
the meantime there have been some questions internally about running
1.2 on Java 8 until the upgrade can be fully completed.
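
As a side note, a quick way to confirm which JVM a node is actually running (as opposed to the host's default) is something along these lines, assuming a single Cassandra process per host:

readlink /proc/"$(pgrep -f CassandraDaemon)"/exe   # the java binary the node was started with
java -version                                      # the host's default JVM, which may differ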

I seem to remember speaking to someone a while back who advised
against running the 1.2 + Java 8 combination. Unfortunately, I can't
remember the exact reasoning behind the recommendation. It may simply
have been that no one was really doing it, and therefore it hadn't
been fully tested.

Does anyone here have experience with Cassandra 1.2 and Java 8 in
production? Any known issues or gotchas?

Cheers!
-Tim

--
Tim Heckman
Operations Engineer
PagerDuty, Inc.


Re: Ghost compaction process

2015-06-08 Thread Tim Heckman
Does `nodetool compactionstats` show nothing running as well? Also, for
posterity, what are some details of the setup (C* version, etc.)?
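
To compare what the different views report, I'd run something like this (piping the query through cqlsh; adjust to your setup):

nodetool compactionstats                                       # pending/active compactions with progress
nodetool tpstats | grep -i compaction                          # the thread pool view you mentioned
echo "SELECT * FROM system.compactions_in_progress;" | cqlsh   # the table you queried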

-Tim

--
Tim Heckman
Operations Engineer
PagerDuty, Inc.


On Sun, Jun 7, 2015 at 6:40 PM, Arturas Raizys artu...@noantidot.com
wrote:

 Hello,

 I'm having a problem where on 1 node there is a continuous compaction
 process running and consuming CPU. nodetool tpstats shows 1 compaction in
 progress, but if I query the system.compactions_in_progress table, I
 see 0 records. This never-ending compaction slows the node down and it
 becomes laggy.
 I'm willing to hire a contractor to solve this problem if anyone is
 interested.


 Cheers,
 Arturas



Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Tim Heckman
On Thu, Jan 22, 2015 at 10:22 AM, Jan cne...@yahoo.com wrote:
 Running a  'nodetool repair'  will 'not'  bring the node down.

It's not something that happens during normal operation. If something
goes sideways, and the resource usage climbs, a repair can definitely
cripple a node.

 Your question:
 does a nodetool repair make the server stop serving requests, or does it
 just use a lot of resources but still serve requests

 Answer: NO, the server will not stop serving requests.
 It will use some resources but not enough to affect the server serving
 requests.

I don't think this is right. I've personally seen repair operations
cause really bad things to happen to an entire Cassandra cluster. The
only mitigation was to shut the misbehaving node down, after which
normal operations resumed within the cluster.
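
If resource usage during repair is the worry, one mitigation worth knowing about is the throttling exposed through nodetool (values here are illustrative, not recommendations, and availability depends on your version):

nodetool setcompactionthroughput 16   # MB/s cap on compaction; 0 disables the throttle
nodetool setstreamthroughput 100      # megabits/s cap on streaming during repair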

 hope this helps
 Jan

Cheers!
-Tim


Re: nodetool repair exception

2014-12-06 Thread Tim Heckman
On Sat, Dec 6, 2014 at 8:05 AM, Eric Stevens migh...@gmail.com wrote:
 The official recommendation is 100k:
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

 I wonder if there's an advantage to this over unlimited if you're running
 servers which are dedicated to your Cassandra cluster (which you should be
 for anything production).

There is the potential to have monitoring systems, and other small
agents, running on systems in production. I could see the limit simply
as a stop-gap that keeps Cassandra from being able to starve the system
of free file descriptors. In theory, if there isn't a proper watchdog on
your monitoring agents, that kind of starvation could prevent an issue
from ever raising an alert. That's just one potential advantage I could
think of, though.
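
For reference, the usual way to pin the cap is via limits.d; a minimal sketch (values illustrative, paths vary by distro):

# /etc/security/limits.d/cassandra.conf
cassandra - nofile 100000

# confirm what the running process actually got (single Cassandra process assumed):
grep -i 'open files' /proc/"$(pgrep -f CassandraDaemon)"/limits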

Cheers!
-Tim

 On Fri Dec 05 2014 at 2:39:24 PM Robert Coli rc...@eventbrite.com wrote:

 On Wed, Dec 3, 2014 at 6:37 AM, Rafał Furmański rfurman...@opera.com
 wrote:

 I see “Too many open files” exception in logs, but I’m sure that my limit
 is now 150k.
 Should I increase it? What’s the reasonable limit of open files for
 cassandra?


 Why provide any limit? ulimit allows unlimited?

 =Rob



Re: full gc too often

2014-12-04 Thread Tim Heckman
On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote:

 Hi,all

 I have a cluster on C* 2.1.1 and jdk 1.7_u51. I'm having trouble with full
gc: sometimes one or two nodes will run a full gc more than once per minute,
taking over 10 seconds each time; the node then becomes unreachable and the
latency of the cluster increases.

 Grepping the GCInspector log, I found that when a node is running fine
without gc trouble there are two kinds of gc:
 ParNew GC in less than 300ms, which clears Par Eden Space and grows
CMS Old Gen / Par Survivor Space a little (because GCInspector only logs
gc longer than 200ms, there are only a small number of ParNew GCs in the log)
 ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen a lot and
grows Par Eden Space a little; it runs about once every 1-2 hours.

 However, sometimes ConcurrentMarkSweep behaves strangely, as shown here:

 INFO  [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms.  CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
 INFO  [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms.  CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
 INFO  [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms.  CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
 INFO  [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms.  CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
 INFO  [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms.  CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
 INFO  [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms.  CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
 INFO  [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms.  CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
 INFO  [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms.  CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
 INFO  [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms.  CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
 INFO  [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms.  CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0

 Each time, Old Gen shrinks only a little and Survivor Space is cleared,
but the heap is still full, so another full gc happens very soon and then
the node goes down. If I restart the node, it runs fine without gc
trouble.

 Can anyone help me figure out why full gc can't reduce CMS Old Gen? Is it
because there are too many objects in the heap that can't be collected? I
think reviewing the table schema design and adding new nodes to the cluster
is a good idea, but I still want to know if there is any other reason
causing this trouble.

How much total system memory do you have? How much is allocated for heap
usage? How big is your working data set?

The reason I ask is that I've seen problems where lots of GC activity
gained no room back, and the cause was memory pressure: there simply wasn't
enough heap. We decided that just increasing the heap size was a bad idea,
as we relied on free RAM being used for filesystem caching. So some vertical
and horizontal scaling allowed us to give Cassandra more heap space, as well
as distribute the workload, to try and avoid further problems.
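
If it helps, these are roughly the things I'd look at to answer those questions (the cassandra-env.sh path varies by install):

free -m                                                                # total and free system memory
grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh   # configured heap sizes
nodetool info                                                          # current heap usage and data load on the node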

 Thanks,
 Philo Yang

Cheers!
-Tim


Re: Cassandra DC2 nodes down after increasing write requests on DC1 nodes

2014-11-16 Thread Tim Heckman
Hello Gabriel,

On Sun, Nov 16, 2014 at 7:25 AM, Gabriel Menegatti
gabr...@s1mbi0se.com.br wrote:
 I said that load was not a big deal, because ops center shows these loads as
 green, not as yellow or red at all.

 Also, our servers have many processors/threads, so I *think* this load is
 not problematic.

I've seen Cassandra clusters fall over with less load than that on the
boxes, so I'm not sure how much I trust OpsCenter here.

However, the impact depends on the system resources you have
available. How many CPU cores do these systems have, how much total and
free memory do they have, and are the underlying disks SSDs or spinning
platters of rust?
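
Something along these lines gives a quick picture (the device name below is just an example):

nproc                                  # CPU cores
free -m                                # total and free memory
cat /sys/block/sda/queue/rotational    # 0 = SSD, 1 = spinning disk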

 My assumption is that for some reason the 10 DC2 nodes are not able to
 handle the volume of requests from DC1, as it has 30 nodes. Even so, from my
 point of view the load of the DC2 nodes should go really high before
 Cassandra goes down, but it's not doing so.

That would make sense if the nodes are under-provisioned for the work
you are trying to throw at them. The load averages and the OOM in the heap
seem to indicate that may be the problem. However, without more details
it's hard to say.

 Regards,
 Gabriel

Cheers!
-Tim


Repair/Compaction Completion Confirmation

2014-10-27 Thread Tim Heckman
Hello,

I am looking to change how we trigger maintenance operations in our C*
clusters. The end goal is to schedule and run the jobs using a system that
is backed by Serf to handle the event propagation.

I know that when issuing some operations via nodetool, the command blocks
until the operation is finished. However, is there a way to reliably
determine whether or not the operation has finished without monitoring that
invocation of nodetool?

In other words, when I run 'nodetool repair' what is the best way to
reliably determine that the repair is finished without running something
equivalent to a 'pgrep' against the command I invoked? I am curious about
trying to do the same for major compactions too.
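
For completeness, the closest workaround I know of is log scraping, roughly like this (the log path and exact message wording are assumptions and may differ by version):

tail -F /var/log/cassandra/system.log | grep --line-buffered -i 'repair.*session completed'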

Cheers!
-Tim


Re: Repair/Compaction Completion Confirmation

2014-10-27 Thread Tim Heckman
On Mon, Oct 27, 2014 at 1:44 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Oct 27, 2014 at 1:33 PM, Tim Heckman t...@pagerduty.com wrote:

 I know that when issuing some operations via nodetool, the command blocks
 until the operation is finished. However, is there a way to reliably
 determine whether or not the operation has finished without monitoring that
 invocation of nodetool?

 In other words, when I run 'nodetool repair' what is the best way to
 reliably determine that the repair is finished without running something
 equivalent to a 'pgrep' against the command I invoked? I am curious about
 trying to do the same for major compactions too.


 This is beyond a FAQ at this point, unfortunately; non-incremental repair
 is awkward to deal with and probably impossible to automate.

 In The Future [1] the correct solution will be to use incremental repair,
 which mitigates but does not solve this challenge entirely.

 As brief meta commentary, it would have been nice if the project had spent
 more time optimizing the operability of the critically important thing you
 must do once a week [2].

 https://issues.apache.org/jira/browse/CASSANDRA-5483

 =Rob
 [1] http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
 [2] Or, more sensibly, once a month with gc_grace_seconds set to 34 days.


Thank you for getting back to me so quickly. Not the answer that I was
secretly hoping for, but it is nice to have confirmation. :)

Cheers!
-Tim


Reading SSTables Potential File Descriptor Leak 1.2.18

2014-09-23 Thread Tim Heckman
Hello,

I ran into a problem today where Cassandra 1.2.18 exhausted its permitted
number of open file descriptors (65,535). This node has 256 tokens (vnodes)
and runs in a test environment with relatively little traffic/data.

As best I could tell, the majority of the open file descriptors were for a
single SSTable '.db' file. Looking in the error logs, I found quite a few
exceptions that appeared to be identical:

ERROR [ReadStage:3817] 2014-09-19 07:00:11,056 CassandraDaemon.java (line
191) Exception in thread Thread[ReadStage:3817,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: unable to
seek to position 29049 in /mnt/var/lib/cassandra/data/path/to/file.db (1855
bytes) in read-only mode

Upon further investigation, it turns out this file became 'read-only' after
the Cassandra node was gracefully restarted last week. I'd imagine this is
a discussion for another email thread.

I fixed the issue by running:

nodetool scrub Keyspace
nodetool repair Keyspace

Attached to this email is one of the log entries/stacktrace for this
exception.

Before opening a JIRA ticket I thought I'd reach out to the list to see if
anyone has seen similar behavior, and to do a bit of source-diving to try
and verify that the descriptor is actually leaking.
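
In the meantime, a rough way to tally open descriptors per file (assuming a single Cassandra process and that lsof is installed):

lsof -p "$(pgrep -f CassandraDaemon)" | awk '{print $NF}' | sort | uniq -c | sort -rn | head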

Cheers!
-Tim
ERROR [ReadStage:3817] 2014-09-19 07:00:11,056 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:3817,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: unable to seek to position 29049 in /mnt/var/lib/cassandra/data/IzanagiQueue/WorkQueue/IzanagiQueue-WorkQueue-ic-1-Data.db (1855 bytes) in read-only mode
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1626)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: unable to seek to position 29049 in /mnt/var/lib/cassandra/data/IzanagiQueue/WorkQueue/IzanagiQueue-WorkQueue-ic-1-Data.db (1855 bytes) in read-only mode
        at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:306)
        at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:42)
        at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1048)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.setToRowStart(IndexedSliceReader.java:130)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:91)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:44)
        at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:104)
        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
        at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:272)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1398)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1130)
        at org.apache.cassandra.db.Table.getRow(Table.java:348)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1070)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1622)
        ... 3 more


Failed to enable shuffling error

2014-09-08 Thread Tim Heckman
Hello,

I'm looking to convert our recently upgraded Cassandra cluster from a
single token per node to using vnodes. We've determined, based on our
data consistency and usage patterns, that shuffling will be the best
way to convert our live cluster.

However, when following the instructions for doing the shuffle, we
aren't able to enable shuffling on the other 4 nodes in the cluster.
We get the error message 'Failed to enable shuffling', which looks to
be a generic string printed when a JMX IOException is caught.
Unfortunately, the underlying error is not printed so I'm effectively
troubleshooting in the dark.

I've done some mailing list diving, as well as Google skimming, and
none of the suggestions seemed to work.

I've confirmed that a firewall is not the cause as I am able to
establish a TCP socket (using telnet) from one node to the other. I've
also double-checked the JMX-specific settings that are being set for
Cassandra and those look good. I'm going with the most open settings
now to try and get this working:

-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

I also tried playing with the 'java.rmi.server.hostname' setting, but
none of the values I tried seemed to make a difference (hostname, fqdn,
public IPv4 address, private EC2 address).
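
For what it's worth, one way to sanity-check the JMX/RMI path itself is a remote nodetool call, since it goes over the same JMX port (the host below is a placeholder):

nodetool -h <other-node-ip> -p 7199 version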

Without any further information from the 'cassandra-shuffle' utility
I'm pretty much out of ideas. Any suggestions would be greatly
appreciated!

Cheers!
-Tim


Re: Failed to enable shuffling error

2014-09-08 Thread Tim Heckman
On Mon, Sep 8, 2014 at 11:19 AM, Robert Coli rc...@eventbrite.com wrote:
 On Mon, Sep 8, 2014 at 11:08 AM, Tim Heckman t...@pagerduty.com wrote:

 I'm looking to convert our recently upgraded Cassandra cluster from a
 single token per node to using vnodes. We've determined that based on
 our data consistency and usage patterns that shuffling will be the
 best way to convert our live cluster.


 You apparently haven't read anything else about shuffling, or you would have
 learned that no one has ever successfully done it in a real production
 cluster. ;)

I've definitely seen the horror stories that have come out of shuffle.
:) We plan on giving this a trial run on production-sized data before
actually doing it on our production hardware.


 Unfortunately, the underlying error is not printed so I'm effectively
 troubleshooting in the dark.


 This mysterious error is protecting you from a probably quite negative
 experience with shuffle.

We're still at the exploratory stage on systems that are not
production-facing but contain production-like data. Based on our
placement strategy we have some concerns that the new datacenter
approach may be riskier or more difficult. We're just trying to gauge
both paths and see what works best for us.


 I've done some mailing list diving, as well as Google skimming, and
 all the suggestions did not seem to work.


 What version of Cassandra are you running? I would not be surprised if
 shuffle is in fact completely broken in 2.0.x release, not only hazardous to
 attempt.

 Why do you believe that you want to shuffle and/or enable vnodes? How large
 is the cluster and how large is it likely to become?

We're still back on the 1.2 line of Cassandra, specifically 1.2.16 for
the majority of our clusters, with one cluster that came into being
after the 1.2.18 release.

The cluster I'm testing this on is a 5 node cluster with a placement
strategy such that all nodes contain 100% of the data. In practice we
have six clusters of similar size that are used for different
services. These different clusters may need additional capacity at
different times, so it's hard to answer the maximum size question. For
now let's just assume that the clusters may never see an 11th
member... but no guarantees.

We're looking to use vnodes to help ease the administrative work of
scaling out the cluster, along with the improvements to streaming data
during repairs, among others.

For shuffle, it looks like it may be easier than adding a new
datacenter and then having to adjust the schema for the new datacenter
to come to life. And we weren't sure whether the same pitfalls of
shuffle would affect us given that all data lives on all nodes.
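
For the record, this is roughly the kind of schema change I mean for the new-datacenter route (keyspace and datacenter names here are made up), piped through cqlsh:

echo "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};" | cqlsh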

 =Rob


Thanks for the quick reply, Rob.

-Tim


Re: Failed to enable shuffling error

2014-09-08 Thread Tim Heckman
On Mon, Sep 8, 2014 at 1:45 PM, Jonathan Haddad j...@jonhaddad.com wrote:
 I believe shuffle has been removed recently.  I do not recommend using
 it for any reason.

We're still using the 1.2.x branch of Cassandra, and will be for some
time due to the thrift deprecation. Has it only been removed from the
2.x line?

 If you really want to go vnodes, your only sane option is to add a new
 DC that uses vnodes and switch to it.

We use the NetworkTopologyStrategy across three geographically
separated regions, and doing it this way feels a bit riskier given our
replication strategy. Also, I'm not sure of all the places our current
datacenter names are defined across our different internal
repositories, so there could be quite a large number of changes going
this route.

 The downside in the 2.0.x branch to using vnodes is that repairs take
 N times as long, where N is the number of tokens you put on each node.
 I can't think of any other reasons why you wouldn't want to use vnodes
 (but this may be significant enough for you by itself)

 2.1 should address the repair issue for most use cases.

 Jon

Thank you for the notes on the behaviors in the 2.x branch. If we do
move to the 2.x version that's something we'll be keeping in mind.

Cheers!
-Tim

 On Mon, Sep 8, 2014 at 1:28 PM, Robert Coli rc...@eventbrite.com wrote:
 On Mon, Sep 8, 2014 at 1:21 PM, Tim Heckman t...@pagerduty.com wrote:

 We're still at the exploratory stage on systems that are not
 production-facing but contain production-like data. Based on our
 placement strategy we have some concerns that the new datacenter
 approach may be riskier or more difficult. We're just trying to gauge
 both paths and see what works best for us.


 Your case of RF=N is probably the best possible case for shuffle, but
 general statements about how much this code has been exercised remain. :)


 The cluster I'm testing this on is a 5 node cluster with a placement
 strategy such that all nodes contain 100% of the data. In practice we
 have six clusters of similar size that are used for different
 services. These different clusters may need additional capacity at
 different times, so it's hard to answer the maximum size question. For
 now let's just assume that the clusters may never see an 11th
 member... but no guarantees.


 With RF of 3, cluster sizes of under approximately 10 tend to net lose from
 vnodes. If these clusters are not very likely to ever have more than 10
 nodes, consider not using Vnodes.


 We're looking to use vnodes to help with easing the administrative
 work of scaling out the cluster. The improvements of streaming data
 during repairs amongst others.


 Most of these wins don't occur until you have a lot of nodes, but the fixed
 costs of having many ranges are paid all the time.


 For shuffle, it looks like it may be easier than adding a new
 datacenter and then have to adjust the schema for a new datacenter
 to come to life. And we weren't sure whether the same pitfalls of
 shuffle would effect us while having all data on all nodes.


 Let us know! Good luck!

 =Rob




 --
 Jon Haddad
 http://www.rustyrazorblade.com
 twitter: rustyrazorblade