Re: weird behavior with RAID 0 on EC2

2013-03-31 Thread Rudolf van der Leeden
I've seen the same behaviour (SLOW ephemeral disk) a few times. 
You can't do anything with a single slow disk except stop using it. 
Our solution was always: replace the m1.xlarge instance ASAP, and everything is 
good.
-Rudolf.

On 31.03.2013, at 18:58, Alexis Lê-Quôc wrote:

 Alain,
 
 Can you post your mdadm --detail /dev/md0 output here, as well as your iostat 
 -x -d output when that happens? A bad ephemeral drive on EC2 is not unheard of.
 
 Alexis | @alq | http://datadog.com
 
 P.S. also, disk utilization is not a reliable metric, iostat's await and 
 svctm are more useful imho.
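 To make the await comparison concrete, here is a small sketch (not from the 
 thread) that flags an md0 member whose await is far above its siblings' 
 average. The awk field position assumes the classic sysstat iostat -x -d 
 layout where await is column 10; adjust it for your sysstat version.

```shell
# Flag RAID 0 member disks (xvdb..xvde) whose 'await' is more than 3x the
# group average. Pipe `iostat -x -d` device lines into this function.
detect_slow_disk() {
  awk '$1 ~ /^xvd[b-e]$/ {
         await[$1] = $10; sum += $10; n++        # column 10 = await (sysstat 9.x)
       }
       END {
         if (n == 0) exit
         avg = sum / n
         for (d in await)
           if (await[d] > 3 * avg)
             print d " looks slow (await=" await[d] ")"
       }'
}

# usage: iostat -x -d 5 | detect_slow_disk
```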
 
 
 On Sun, Mar 31, 2013 at 6:03 AM, aaron morton aa...@thelastpickle.com wrote:
 Ok, if you're going to look into it, please keep me/us posted.
 
 It's not on my radar.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 28/03/2013, at 2:43 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 
 Ok, if you're going to look into it, please keep me/us posted.
 
 It happened twice for me on the same day, within a few hours, on the same 
 node. It only happened to 1 node out of 12, making that node almost unreachable.
 
 
 2013/3/28 aaron morton aa...@thelastpickle.com
 I noticed this on an m1.xlarge (Cassandra 1.1.10) instance today as well: 1 
 or 2 disks in a RAID 0 running at 85 to 100%, the others at 35 to 50%. 
 
 Have not looked into it. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 26/03/2013, at 11:57 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 
 We use C* on m1.xlarge AWS EC2 servers, with 4 disks (xvdb, xvdc, xvdd, xvde) 
 that are part of a logical RAID 0 array (md0).
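 
 For reference, such an array is typically assembled along these lines (a 
 sketch only; the device names are from this setup, but the mount point, 
 filesystem, and options are assumptions, and the commands need root):

```shell
# Build a 4-member RAID 0 from the ephemeral disks and mount it for Cassandra.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.xfs /dev/md0                      # or ext4, per your setup
mount /dev/md0 /var/lib/cassandra      # assumed Cassandra data directory

# Per-member utilization can then be compared with:
iostat -x -d 5
```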
 
 Their utilization always used to increase in the same way. This morning there 
 was a normal minor compaction followed by dropped messages on one node (out of 
 12).
 
 Looking closely at this node I saw the following:
 
 http://img69.imageshack.us/img69/9425/opscenterweirddisk.png
 
 On this node, one of the four disks (xvdd) started working much harder while 
 the others worked less intensively.
 
 This is quite weird since I have always seen these 4 disks being used in 
 exactly the same way at every moment (as you can see on 5 other nodes, or 
 when node .239 comes back to normal).
 
 Any idea what happened and how it can be avoided?
 
 Alain
 
 
 
 



Re: Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS

2012-09-11 Thread Rudolf van der Leeden

 Which version of Cassandra has your data been created initially with?
 A bug in Cassandra 1.1.2 and earlier could cause out-of-order sstables
 and inter-level overlaps in CFs with Leveled Compaction. Your sstables
 generated with 1.1.3 and later should not have this issue [1] [2].
 In case you have old Leveled-compacted sstables (generated with 1.1.2
 or earlier, including 1.0.x) you need to run an offline scrub using
 Cassandra 1.1.4 or later via the bin/sstablescrub command, so it will fix
 out-of-order sstables and inter-level overlaps caused by previous
 versions of LCS. You need to take nodes down in order to run offline
 scrub.
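 
 A minimal sketch of that offline scrub per node (the service commands assume 
 a packaged install, and the keyspace/CF names are placeholders, not from this 
 thread):

```shell
# Offline scrub: the node must be down while sstablescrub runs.
nodetool drain                      # flush memtables before stopping
sudo service cassandra stop

# Run sstablescrub once per affected column family:
bin/sstablescrub MyKeyspace MyColumnFamily

sudo service cassandra start
```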


The data was originally created on a 1.1.2 cluster with STCS (i.e. NOT
leveled compaction).
After the upgrade to 1.1.4 we changed from STCS to LCS w/o problems.
Then we ran more tests and created more and very big keys with millions of
columns.
The assertion only shows up with one particular CF containing these big
keys.
So, from your explanation, I don't think an offline scrub will help.

Thanks,
-Rudolf.


Re: Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS

2012-09-11 Thread Rudolf van der Leeden
 Could you, as Aaron suggested, open a ticket?


Done:  https://issues.apache.org/jira/browse/CASSANDRA-4644


Assertions running Cleanup on a 3-node cluster with Cassandra 1.1.4 and LCS

2012-09-10 Thread Rudolf van der Leeden
Hi,

I'm getting 5 identical assertions while running 'nodetool cleanup' on a
Cassandra 1.1.4 node with Load=104G and 80m keys.
From  system.log :

ERROR [CompactionExecutor:576] 2012-09-10 11:25:50,265
AbstractCassandraDaemon.java (line 134) Exception in thread
Thread[CompactionExecutor:576,1,main]
java.lang.AssertionError
at
org.apache.cassandra.db.compaction.LeveledManifest.promote(LeveledManifest.java:214)
at
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.handleNotification(LeveledCompactionStrategy.java:158)
at
org.apache.cassandra.db.DataTracker.notifySSTablesChanged(DataTracker.java:531)
at
org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:254)
at
org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:992)
at
org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:200)
at
org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
at
org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

After 3 hours the job is done and there are 11390 compaction tasks pending.
My question: Can these assertions be ignored? Or do I need to worry about
it?

Thanks for your help and best regards,
-Rudolf.


StackOverflowError with repair after bulkloading SSTables

2012-07-20 Thread Rudolf van der Leeden
Hi,

I'm currently testing the restore of a Cassandra 1.1.2 snapshot.

The steps to reproduce the problem:

 - snapshot a 3-node production cluster (1.1.2) with RF=3 and LCS (leveled 
compaction) == 8GB data/node
 - create a new 3-node cluster (node1,2,3)
 - stop node1 / copy data (SSTables) from the snapshot (just one node) / start 
node1
 - Cassandra is opening 1185 SSTable files (*-hd-),  pending compaction 
tasks: 247
 - before Cassandra starts compactions, run:  nodetool repair -pr
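
The steps above can be sketched roughly as follows (hostnames, paths, and the 
snapshot tag are placeholders; the real layout on your nodes will differ):

```shell
# Restore one production node's snapshot onto a fresh node, then repair
# before compactions kick in.
ssh node1 'sudo service cassandra stop'

# Copy the snapshotted SSTables into the new node's data directory
# (<tag> is the snapshot name created on the production cluster):
rsync -a prod-node1:/var/lib/cassandra/data/highscores/snapshots/<tag>/ \
         node1:/var/lib/cassandra/data/highscores/

ssh node1 'sudo service cassandra start'

# Repair only this node's primary range:
ssh node1 'nodetool repair -pr'
```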

The error messages in system.log :

 INFO [AntiEntropySessions:1] 2012-07-20 10:53:16,743 AntiEntropyService.java 
(line 666) [repair #1c59b930-d259-11e1--a0b0843ee1fe] new session: will 
sync /10.241.65.232, /10.54.26.250, /10.251.33.166 on range 
(113427455640312821154458202477256070485,0] for highscores.[highscore]
 INFO [AntiEntropySessions:1] 2012-07-20 10:53:16,747 AntiEntropyService.java 
(line 871) [repair #1c59b930-d259-11e1--a0b0843ee1fe] requesting merkle 
trees for highscore (to [/10.54.26.250, /10.251.33.166, /10.241.65.232])
 INFO [AntiEntropyStage:1] 2012-07-20 10:53:17,085 AntiEntropyService.java 
(line 206) [repair #1c59b930-d259-11e1--a0b0843ee1fe] Received merkle tree 
for highscore from /10.54.26.250
 INFO [AntiEntropyStage:1] 2012-07-20 10:53:17,104 AntiEntropyService.java 
(line 206) [repair #1c59b930-d259-11e1--a0b0843ee1fe] Received merkle tree 
for highscore from /10.251.33.166
ERROR [ValidationExecutor:1] 2012-07-20 10:53:17,865 
AbstractCassandraDaemon.java (line 134) Exception in thread 
Thread[ValidationExecutor:1,1,main]
java.lang.StackOverflowError
at com.google.common.collect.Sets$1.iterator(Sets.java:578)  
(repeating 1024 times) 

The repair command does not return, and it increases the Active/Pending 
counters of AntiEntropySessions in tpstats. 
The counters never go back to 0.
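
A small sketch for pulling those counters out (it assumes the Cassandra 1.1 
tpstats column layout of Pool / Active / Pending / Completed):

```shell
# Extract the Active and Pending counts for AntiEntropySessions from
# `nodetool tpstats` output.
antientropy_counters() {
  awk '$1 == "AntiEntropySessions" { print "active=" $2, "pending=" $3 }'
}

# usage: nodetool tpstats | antientropy_counters
```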

After some time compaction starts as usual w/o problems.

Am I doing something wrong? The error seems tied to LCS; there is no problem 
with STCS.
There is plenty of space in Java HEAP (7G) and on the disk (1.7TB). 
RAM is 15G and SWAP is 20G. This is an Amazon m1.xlarge instance with 
Ubuntu/Lucid Linux.

Thanks for any hints or help,
Rudolf VanderLeeden
Scoreloop/RIM



Re: 2 nodes throwing exceptions trying to compact after upgrade to 1.1.2 from 1.1.0

2012-07-16 Thread Rudolf van der Leeden
See  https://issues.apache.org/jira/browse/CASSANDRA-4411
The bug is related to LCS (leveled compaction) and has been fixed.


On 16.07.2012, at 20:32, Bryce Godfrey wrote:

 This may not be directly related to the upgrade to 1.1.2, but I was running 
 on 1.1.0 for a while with no issues, and I did the upgrade to 1.1.2 a few 
 days ago.
  
 2 of my nodes started throwing lots of promote exceptions, and then a lot of 
 the beforeAppend exceptions from then on every few minutes.  This is on the 
 high update CF that’s using leveled compaction and compression.  The other 3 
 nodes are not experiencing this.  I can send entire log files if desired.
 These 2 nodes now have much higher load #’s then the other 3, and I’m 
 assuming that’s because they are failing with the compaction errors?
  
 $
 INFO [CompactionExecutor:1783] 2012-07-13 07:35:23,268 CompactionTask.java 
 (line 109) Compacting 
 [SSTableReader(path='/opt/cassandra/data/MonitoringData/Properties/MonitoringData-Properties-hd-392322-Data$
 ERROR [CompactionExecutor:1783] 2012-07-13 07:35:29,696 
 AbstractCassandraDaemon.java (line 134) Exception in thread 
 Thread[CompactionExecutor:1783,1,main]
 java.lang.AssertionError
 at 
 org.apache.cassandra.db.compaction.LeveledManifest.promote(LeveledManifest.java:214)
 at 
 org.apache.cassandra.db.compaction.LeveledCompactionStrategy.handleNotification(LeveledCompactionStrategy.java:158)
 at 
 org.apache.cassandra.db.DataTracker.notifySSTablesChanged(DataTracker.java:531)
 at 
 org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:254)
 at 
 org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:978)
 at 
 org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:200)
 at 
 org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at 
 org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
  
 INFO [CompactionExecutor:3310] 2012-07-16 11:14:02,481 CompactionTask.java 
 (line 109) Compacting 
 [SSTableReader(path='/opt/cassandra/data/MonitoringData/Properties/MonitoringData-Properties-hd-369173-Data$
 ERROR [CompactionExecutor:3310] 2012-07-16 11:14:04,031 
 AbstractCassandraDaemon.java (line 134) Exception in thread 
 Thread[CompactionExecutor:3310,1,main]
 java.lang.RuntimeException: Last written key 
 DecoratedKey(150919285004100953907590722809541628889, 
 5b30363334353237652d383966382d653031312d623131632d3030313535643031373530325d5b436f6d70757465725b4d5350422d$
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
 at 
 org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at 
 org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at 
 org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)



Re: 2 nodes throwing exceptions trying to compact after upgrade to 1.1.2 from 1.1.0

2012-07-16 Thread Rudolf van der Leeden
Stay with 1.1.2 and create your CF with  
compaction_strategy_class='SizeTieredCompactionStrategy' 
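
For illustration, a hedged sketch using cassandra-cli (keyspace and column 
family names are placeholders; note the cli attribute is spelled 
compaction_strategy, whereas CQL uses compaction_strategy_class):

```shell
# Switch an existing CF back to size-tiered compaction via cassandra-cli.
cassandra-cli -h localhost <<'EOF'
use MyKeyspace;
update column family Properties
  with compaction_strategy = 'SizeTieredCompactionStrategy';
EOF
```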


On 16.07.2012, at 22:17, Bryce Godfrey wrote:

 Thanks, is there a way around this for now or should I fall back to 1.1.0?
  
  
 From: Rudolf van der Leeden [mailto:rudolf.vanderlee...@scoreloop.com] 
 Sent: Monday, July 16, 2012 12:55 PM
 To: user@cassandra.apache.org
 Cc: Rudolf van der Leeden
 Subject: Re: 2 nodes throwing exceptions trying to compact after upgrade to 
 1.1.2 from 1.1.0
  
 See  https://issues.apache.org/jira/browse/CASSANDRA-4411
 The bug is related to LCS (leveled compaction) and has been fixed.
  
  