I have no errors in my system.log, just these types of warnings occasionally:

WARN [pool-1-thread-1] 2011-11-08 00:03:44,726 Memtable.java (line 167) setting live ratio to minimum of 1.0 instead of 0.9511448007676252
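For what it's worth, that warning appears benign: it says the measured memtable live ratio (heap bytes used per serialized byte) came out below 1.0, and Cassandra clamps it to a floor of 1.0 rather than trust the measurement. A minimal sketch of the clamp, assuming that reading of the log line (the class and constant names here are illustrative, not Cassandra's actual code):

```java
public class LiveRatioClamp {
    // Floor applied to the measured live ratio: a ratio below 1.0 would mean
    // the in-memory form is smaller than the serialized form, which is
    // treated as a measurement artifact rather than a real value.
    static final double MIN_LIVE_RATIO = 1.0;

    static double clampLiveRatio(double measured) {
        return Math.max(MIN_LIVE_RATIO, measured);
    }

    public static void main(String[] args) {
        // The value from the warning above gets clamped up to 1.0.
        System.out.println(clampLiveRatio(0.9511448007676252)); // 1.0
        System.out.println(clampLiveRatio(2.5));                // 2.5
    }
}
```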
I did find the problem with my data drive consumption being so large: I did not know that running scrub after the upgrade would take a snapshot of the data. Once I removed all the snapshots, the data drive is back down to where I expect it to be, although the Load numbers reported by ring are much larger than what is on the data drive. I've also upgraded to 1.0.2 and re-run scrub, and now I can run cfstats again, so thanks for that. I'm still confused about why the hints CF has become so large on a few of the nodes, though:

Column Family: HintsColumnFamily
SSTable count: 11
Space used (live): 127490858389
Space used (total): 72123363085
Number of Keys (estimate): 1408
Memtable Columns Count: 43174
Memtable Data Size: 44376138
Memtable Switch Count: 103
Read Count: 494
Read Latency: NaN ms.
Write Count: 30970531
Write Latency: NaN ms.
Pending Tasks: 0
Key cache capacity: 14
Key cache size: 10
Key cache hit rate: NaN
Row cache: disabled
Compacted row minimum size: 88149
Compacted row maximum size: 53142810146
Compacted row mean size: 6065512727

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Friday, November 04, 2011 9:29 AM
To: user@cassandra.apache.org
Subject: Re: Problem after upgrade to 1.0.1

One possibility: if you're overloading the cluster, replicas will drop updates to avoid OOMing. (This is logged at WARN level.) Before 1.x Cassandra would just let that slide, but with 1.0 it started recording hints for those.

On Thu, Nov 3, 2011 at 7:17 PM, Bryce Godfrey <bryce.godf...@azaleos.com> wrote:
> Thanks for the help so far.
>
> Is there any way to find out why my HintsColumnFamily is so large now, since it wasn't this way before the upgrade and it seems to just keep climbing?
>
> I've tried invoking o.a.c.db.HintedHandOffManager.countPendingHints(), thinking I have a bunch of stale hints from upgrade issues, but it just eventually times out.
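Invoking an operation like countPendingHints() by hand goes through JMX. As a rough sketch of the invoke/getAttribute pattern, the snippet below runs it against the local platform MBean server using a standard java.lang MBean so it is self-contained; for Cassandra you would instead open a JMXConnector to the node (default JMX port 7199) and use the HintedHandoffManager object name — both of those specifics are assumptions here, so verify them against your install:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmxInvoke {
    public static void main(String[] args) throws Exception {
        // Against Cassandra you would connect remotely instead, e.g. to
        // service:jmx:rmi:///jndi/rmi://<node>:7199/jmxrmi (assumed default port),
        // and use ObjectName "org.apache.cassandra.db:type=HintedHandoffManager".
        MBeanServerConnection mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("java.lang:type=Memory");

        // invoke(...) is the same call shape you would use for countPendingHints():
        // operation name plus parameter arrays (empty here, since gc() takes none).
        mbs.invoke(name, "gc", new Object[0], new String[0]);

        // Reading an attribute works the same way as reading cfstats-style values.
        Object heap = mbs.getAttribute(name, "HeapMemoryUsage");
        System.out.println(heap != null); // true
    }
}
```

Note that if countPendingHints() times out, a longer client-side timeout won't help much — the server is scanning the whole hints CF to answer, which is what thrashed the node.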
> Plus the node it gets invoked against gets thrashed and stops responding, forcing me to restart Cassandra.
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Thursday, November 03, 2011 5:06 PM
> To: user@cassandra.apache.org
> Subject: Re: Problem after upgrade to 1.0.1
>
> I found the problem and posted a patch on https://issues.apache.org/jira/browse/CASSANDRA-3451. If you build with that patch and rerun scrub, the exception should go away.
>
> On Thu, Nov 3, 2011 at 2:08 PM, Bryce Godfrey <bryce.godf...@azaleos.com> wrote:
>> A restart fixed the load numbers; they are back to where I expect them to be now, but disk utilization is double the Load number. I'm also still getting the cfstats exception from any node.
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> Sent: Thursday, November 03, 2011 11:52 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Problem after upgrade to 1.0.1
>>
>> Does restarting the node fix this?
>>
>> On Thu, Nov 3, 2011 at 1:51 PM, Bryce Godfrey <bryce.godf...@azaleos.com> wrote:
>>> Disk utilization is actually about 80% higher than what is reported by nodetool ring across all my nodes on the data drive.
>>>
>>> Bryce Godfrey | Sr. Software Engineer | Azaleos Corporation | T: 206.926.1978 | M: 206.849.2477
>>>
>>> From: Dan Hendry [mailto:dan.hendry.j...@gmail.com]
>>> Sent: Thursday, November 03, 2011 11:47 AM
>>> To: user@cassandra.apache.org
>>> Subject: RE: Problem after upgrade to 1.0.1
>>>
>>> Regarding load growth, presumably you are referring to the load as reported by JMX/nodetool. Have you actually looked at the disk utilization on the nodes themselves?
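On the disk-utilization question: snapshots (like the ones scrub takes) sit under each column family's data directory and count toward raw disk usage but not toward the Load that ring reports, which can account for exactly this kind of gap. A hedged sketch that totals bytes under directories named `snapshots` — the keyspace/CF paths in it are made up for the demo:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class SnapshotUsage {
    // Sum the sizes of all files that live under any directory named
    // "snapshots" beneath the given data directory.
    static long snapshotBytes(Path dataDir) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(dataDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Count the file only if some ancestor directory is "snapshots".
                for (Path p = file.getParent(); p != null && !p.equals(dataDir); p = p.getParent()) {
                    if (p.getFileName().toString().equals("snapshots")) {
                        total[0] += attrs.size();
                        break;
                    }
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        // Mock layout with hypothetical names: one snapshotted sstable (counted)
        // and one live sstable (not counted, since it contributes to Load).
        Path data = Files.createTempDirectory("cassdata");
        Path snap = Files.createDirectories(data.resolve("Keyspace1/Hints/snapshots/pre-scrub"));
        Files.write(snap.resolve("Hints-1-Data.db"), new byte[4096]);
        Files.write(data.resolve("Keyspace1/Hints/Hints-2-Data.db"), new byte[1024]);
        System.out.println(snapshotBytes(data)); // 4096
    }
}
```

In practice `nodetool clearsnapshot` (or removing the snapshot directories) is what recovers the space, as noted at the top of the thread.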
>>> Potential issue I have seen: http://www.mail-archive.com/user@cassandra.apache.org/msg18142.html
>>>
>>> Dan
>>>
>>> From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com]
>>> Sent: November-03-11 14:40
>>> To: user@cassandra.apache.org
>>> Subject: Problem after upgrade to 1.0.1
>>>
>>> I recently upgraded from 0.8.6 to 1.0.1 and everything seemed to go just fine with the rolling upgrade. But now I'm seeing extreme load growth on one of my nodes (and others are growing faster than usual also). I attempted to run cfstats against the extremely large node that was seeing 2x the load of the others, and I get the error below. I also went into the o.a.c.db.HintedHandoffManager MBean and attempted to list pending hints, to see if they were growing out of control for some reason, but that just eventually times out for any node. I'm not sure what to do next with this issue.
>>>
>>> Column Family: HintsColumnFamily
>>> SSTable count: 3
>>> Space used (live): 12681676437
>>> Space used (total): 10233130272
>>> Number of Keys (estimate): 384
>>> Memtable Columns Count: 117704
>>> Memtable Data Size: 115107307
>>> Memtable Switch Count: 66
>>> Read Count: 0
>>> Read Latency: NaN ms.
>>> Write Count: 21203290
>>> Write Latency: 0.014 ms.
>>> Pending Tasks: 0
>>> Key cache capacity: 3
>>> Key cache size: 0
>>> Key cache hit rate: NaN
>>> Row cache: disabled
>>> Compacted row minimum size: 30130993
>>> Compacted row maximum size: 9223372036854775807
>>>
>>> Exception in thread "main" java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
>>>         at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:170)
>>>         at org.apache.cassandra.db.DataTracker.getMeanRowSize(DataTracker.java:395)
>>>         at org.apache.cassandra.db.ColumnFamilyStore.getMeanRowSize(ColumnFamilyStore.java:293)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>>>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
>>>         at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
>>>         at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
>>>         at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
>>>         at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
>>>         at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
>>>         at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1404)
>>>         at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
>>>         at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
>>>         at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
>>>         at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:600)
>>>         at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
>>>         at sun.rmi.transport.Transport$1.run(Transport.java:159)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
>>>         at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
>>>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
>>>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>         at java.lang.Thread.run(Thread.java:662)
>>>
>>> Bryce Godfrey | Sr. Software Engineer | Azaleos Corporation | T: 206.926.1978 | M: 206.849.2477
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
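The cfstats failure in the quoted message follows directly from the `Compacted row maximum size: 9223372036854775807` line: that value is Long.MAX_VALUE, the marker for a row too large to fit the histogram's biggest bucket, and once a bucket has overflowed the mean can no longer be computed — hence the IllegalStateException (fixed by CASSANDRA-3451 plus a rerun of scrub). A simplified sketch of that behaviour, assuming only the general bucketed-histogram idea (this is not Cassandra's actual EstimatedHistogram, whose bucket offsets and messages differ):

```java
public class TinyEstimatedHistogram {
    // Bucket upper bounds; values beyond the last bound land in an extra
    // "overflow" slot at the end of buckets[].
    private final long[] offsets = {10, 100, 1000, 10000};
    private final long[] buckets = new long[offsets.length + 1];

    void add(long value) {
        for (int i = 0; i < offsets.length; i++) {
            if (value <= offsets[i]) { buckets[i]++; return; }
        }
        buckets[offsets.length]++; // too big for any bucket: overflow
    }

    boolean isOverflowed() { return buckets[offsets.length] > 0; }

    // Mean of bucket ceilings weighted by count. Undefined once overflowed,
    // which is the condition cfstats tripped over.
    long mean() {
        if (isOverflowed())
            throw new IllegalStateException("histogram overflowed");
        long count = 0, sum = 0;
        for (int i = 0; i < offsets.length; i++) {
            count += buckets[i];
            sum += buckets[i] * offsets[i];
        }
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        TinyEstimatedHistogram h = new TinyEstimatedHistogram();
        h.add(50);
        System.out.println(h.mean()); // 100 (ceiling of the bucket holding 50)
        h.add(53142810146L);          // a ~53 GB hint row overflows every bucket
        System.out.println(h.isOverflowed()); // true; mean() would now throw
    }
}
```

This also explains why only some nodes fail: only the ones with a monster hint row ever overflow the histogram.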