Re: weird behavior with RAID 0 on EC2
I've seen the same behaviour (SLOW ephemeral disk) a few times. You can't do anything with a single slow disk except stop using it. Our solution was always: replace the m1.xlarge instance asap and everything is good.
-Rudolf.

On 31.03.2013, at 18:58, Alexis Lê-Quôc wrote:
Alain, can you post your mdadm --detail /dev/md0 output here as well as your iostat -x -d when that happens? A bad ephemeral drive on EC2 is not unheard of.
Alexis | @alq | http://datadog.com
P.S. Also, disk utilization is not a reliable metric; iostat's await and svctm are more useful imho.

On Sun, Mar 31, 2013 at 6:03 AM, aaron morton aa...@thelastpickle.com wrote:
> Ok, if you're going to look into it, please keep me/us posted.
It's not on my radar.
Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 2:43 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
Ok, if you're going to look into it, please keep me/us posted. It happened twice for me, the same day, within a few hours on the same node, and only happened to 1 node out of 12, making this node almost unreachable.

2013/3/28 aaron morton aa...@thelastpickle.com
I noticed this on an m1.xlarge (cassandra 1.1.10) instance today as well: 1 or 2 disks in a raid 0 running at 85 to 100%, the others at 35 to 50ish. Have not looked into it.
Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 26/03/2013, at 11:57 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
We use C* on m1.xlarge AWS EC2 servers, with 4 disks (xvdb, xvdc, xvdd, xvde) forming a logical RAID 0 (md0). I usually see their utilization increase in the same way. This morning there was a normal minor compaction followed by dropped messages on one node (out of 12). Looking closely at this node I saw the following: http://img69.imageshack.us/img69/9425/opscenterweirddisk.png
On this node, one of the four disks (xvdd) started working hard while the others worked less intensively. This is quite weird since I have always seen these 4 disks used in exactly the same way at every moment (as you can see on 5 other nodes, or when node .239 comes back to normal). Any idea what happened and how it can be avoided?
Alain
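For anyone hitting this, the diagnostics Alexis asks for can be gathered roughly like this (a minimal sketch; the md0 and xvdb..xvde device names follow the layout Alain describes and may differ on your instances):

  # Show the RAID 0 array layout and the state of each member disk
  sudo mdadm --detail /dev/md0
  # Extended per-device stats every 5 seconds; watch await and svctm
  # rather than %util, as suggested above
  iostat -x -d 5 xvdb xvdc xvdd xvde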
Re: Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS
Which version of Cassandra was your data initially created with? A bug in Cassandra 1.1.2 and earlier could cause out-of-order sstables and inter-level overlaps in CFs with Leveled Compaction. Your sstables generated with 1.1.3 and later should not have this issue [1] [2]. In case you have old Leveled-compacted sstables (generated with 1.1.2 or earlier, including 1.0.x), you need to run an offline scrub using Cassandra 1.1.4 or later via the /bin/sstablescrub command, which will fix the out-of-order sstables and inter-level overlaps caused by previous versions of LCS. You need to take the nodes down in order to run the offline scrub.

The data was originally created on a 1.1.2 cluster with STCS (i.e. NOT leveled compaction). After the upgrade to 1.1.4 we changed from STCS to LCS w/o problems. Then we ran more tests and created more (and very big) keys with millions of columns. The assertion only shows up with one particular CF containing these big keys. So, from your explanation, I don't think an offline scrub will help.
Thanks,
-Rudolf.
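For reference, a minimal sketch of the offline scrub described above (the service commands and the keyspace/CF names are placeholders, not from this thread; sstablescrub takes the keyspace and column family as arguments):

  # sstablescrub is an offline tool: Cassandra must be stopped on the node
  sudo service cassandra stop
  # Scrub one column family (hypothetical keyspace/CF names)
  bin/sstablescrub MyKeyspace MyColumnFamily
  sudo service cassandra start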
Re: Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS
> Could you, as Aaron suggested, open a ticket?
Done: https://issues.apache.org/jira/browse/CASSANDRA-4644
Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS
Hi, I'm getting 5 identical assertions while running 'nodetool cleanup' on a Cassandra 1.1.4 node with Load=104G and 80m keys. From system.log:

ERROR [CompactionExecutor:576] 2012-09-10 11:25:50,265 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[CompactionExecutor:576,1,main]
java.lang.AssertionError
  at org.apache.cassandra.db.compaction.LeveledManifest.promote(LeveledManifest.java:214)
  at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.handleNotification(LeveledCompactionStrategy.java:158)
  at org.apache.cassandra.db.DataTracker.notifySSTablesChanged(DataTracker.java:531)
  at org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:254)
  at org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:992)
  at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:200)
  at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
  at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

After 3 hours the job is done and there are 11390 compaction tasks pending. My question: can these assertions be ignored, or do I need to worry about them?
Thanks for your help and best regards,
-Rudolf.
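For reference, the cleanup run and the pending-task count mentioned above correspond to commands like these (a sketch; the host flag is an assumption):

  # Remove data this node no longer owns; this is the run that produced the assertions
  nodetool -h localhost cleanup
  # Watch the pending compaction tasks (11390 above) drain over time
  nodetool -h localhost compactionstats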
StackOverflowError with repair after bulkloading SSTables
Hi, I'm currently testing the restore of a Cassandra 1.1.2 snapshot. The steps to reproduce the problem:
- snapshot a 3-node production cluster (1.1.2) with RF=3 and LCS (leveled compaction), about 8GB of data per node
- create a new 3-node cluster (node1, node2, node3)
- stop node1 / copy the data (SSTables) from the snapshot (just one node) / start node1
- Cassandra opens 1185 SSTable files (*-hd-); pending compaction tasks: 247
- before Cassandra starts compacting, run: nodetool repair -pr

The error messages in system.log:

INFO [AntiEntropySessions:1] 2012-07-20 10:53:16,743 AntiEntropyService.java (line 666) [repair #1c59b930-d259-11e1--a0b0843ee1fe] new session: will sync /10.241.65.232, /10.54.26.250, /10.251.33.166 on range (113427455640312821154458202477256070485,0] for highscores.[highscore]
INFO [AntiEntropySessions:1] 2012-07-20 10:53:16,747 AntiEntropyService.java (line 871) [repair #1c59b930-d259-11e1--a0b0843ee1fe] requesting merkle trees for highscore (to [/10.54.26.250, /10.251.33.166, /10.241.65.232])
INFO [AntiEntropyStage:1] 2012-07-20 10:53:17,085 AntiEntropyService.java (line 206) [repair #1c59b930-d259-11e1--a0b0843ee1fe] Received merkle tree for highscore from /10.54.26.250
INFO [AntiEntropyStage:1] 2012-07-20 10:53:17,104 AntiEntropyService.java (line 206) [repair #1c59b930-d259-11e1--a0b0843ee1fe] Received merkle tree for highscore from /10.251.33.166
ERROR [ValidationExecutor:1] 2012-07-20 10:53:17,865 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[ValidationExecutor:1,1,main]
java.lang.StackOverflowError
  at com.google.common.collect.Sets$1.iterator(Sets.java:578)
  (repeating 1024 times)

The repair command does not return, and it increases the Active/Pending counters of AntiEntropySessions in tpstats; the counters never go back to 0. After some time, compaction starts as usual w/o problems. Am I doing something wrong? The error seems bound to LCS; there is no problem with STCS. There is plenty of space in the Java heap (7G) and on disk (1.7TB). RAM is 15G and swap is 20G. This is an Amazon m1.xlarge instance with Ubuntu/Lucid Linux.
Thanks for any hints or help,
Rudolf VanderLeeden
Scoreloop/RIM
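For anyone reproducing this, the steps above look roughly like the following on node1 (a sketch; the snapshot and data-directory paths are assumptions, the keyspace/CF names are taken from the log lines):

  sudo service cassandra stop
  # Copy the snapshotted sstables of one production node into the data directory
  cp /backup/snapshot/highscores/highscore/* /var/lib/cassandra/data/highscores/highscore/
  sudo service cassandra start
  # Before compactions kick in, run a primary-range repair
  nodetool -h localhost repair -pr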
Re: 2 nodes throwing exceptions trying to compact after upgrade to 1.1.2 from 1.1.0
See https://issues.apache.org/jira/browse/CASSANDRA-4411 The bug is related to LCS (leveled compaction) and has been fixed.

On 16.07.2012, at 20:32, Bryce Godfrey wrote:
This may not be directly related to the upgrade to 1.1.2, but I was running on 1.1.0 for a while with no issues, and I did the upgrade to 1.1.2 a few days ago. 2 of my nodes started throwing lots of promote exceptions, and then lots of beforeAppend exceptions every few minutes from then on. This is on the high-update CF that's using leveled compaction and compression. The other 3 nodes are not experiencing this. I can send entire log files if desired. These 2 nodes now have much higher load #'s than the other 3, and I'm assuming that's because they are failing with the compaction errors?

INFO [CompactionExecutor:1783] 2012-07-13 07:35:23,268 CompactionTask.java (line 109) Compacting [SSTableReader(path='/opt/cassandra/data/MonitoringData/Properties/MonitoringData-Properties-hd-392322-Data$
ERROR [CompactionExecutor:1783] 2012-07-13 07:35:29,696 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[CompactionExecutor:1783,1,main]
java.lang.AssertionError
  at org.apache.cassandra.db.compaction.LeveledManifest.promote(LeveledManifest.java:214)
  at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.handleNotification(LeveledCompactionStrategy.java:158)
  at org.apache.cassandra.db.DataTracker.notifySSTablesChanged(DataTracker.java:531)
  at org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:254)
  at org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:978)
  at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:200)
  at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
  at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
  at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
  at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
  at java.util.concurrent.FutureTask.run(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)

INFO [CompactionExecutor:3310] 2012-07-16 11:14:02,481 CompactionTask.java (line 109) Compacting [SSTableReader(path='/opt/cassandra/data/MonitoringData/Properties/MonitoringData-Properties-hd-369173-Data$
ERROR [CompactionExecutor:3310] 2012-07-16 11:14:04,031 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[CompactionExecutor:3310,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(150919285004100953907590722809541628889, 5b30363334353237652d383966382d653031312d623131632d3030313535643031373530325d5b436f6d70757465725b4d5350422d$
  at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
  at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
  at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
  at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
  at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
  at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
  at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
  at java.util.concurrent.FutureTask.run(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)
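To check whether a CF is actually on LCS and thus affected by CASSANDRA-4411, something like this should work with the 1.1-era cassandra-cli (a sketch; the keyspace/CF names are taken from the log paths above, and the exact describe output varies by version):

  $ cassandra-cli -h localhost
  [default@unknown] use MonitoringData;
  [default@MonitoringData] describe Properties;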
Re: 2 nodes throwing exceptions trying to compact after upgrade to 1.1.2 from 1.1.0
Stay with 1.1.2 and create your CF with compaction_strategy_class='SizeTieredCompactionStrategy'.

On 16.07.2012, at 22:17, Bryce Godfrey wrote:
Thanks, is there a way around this for now or should I fall back to 1.1.0?

From: Rudolf van der Leeden [mailto:rudolf.vanderlee...@scoreloop.com]
Sent: Monday, July 16, 2012 12:55 PM
To: user@cassandra.apache.org
Cc: Rudolf van der Leeden
Subject: Re: 2 nodes throwing exceptions trying to compact after upgrade to 1.1.2 from 1.1.0

See https://issues.apache.org/jira/browse/CASSANDRA-4411 The bug is related to LCS (leveled compaction) and has been fixed.

On 16.07.2012, at 20:32, Bryce Godfrey wrote:
[original message and stack traces quoted in full above]
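A sketch of what moving an existing CF to size-tiered can look like with the 1.1-era cassandra-cli (keyspace/CF names taken from the log paths above; note the cli attribute is compaction_strategy, while compaction_strategy_class is the underlying thrift attribute, so verify against your version):

  $ cassandra-cli -h localhost
  [default@unknown] use MonitoringData;
  [default@MonitoringData] update column family Properties with compaction_strategy = 'SizeTieredCompactionStrategy';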