[ANNOUNCE] Polidoro - A Cassandra client in Scala
Hi all, We've open sourced Polidoro. It's a Cassandra client in Scala, built on top of Astyanax and in the style of Cascal. Find it at https://github.com/SpotRight/Polidoro
-Lanny Ripple
SpotRight, Inc - http://spotright.com
Re: Decommission an entire DC
That one is documented -- http://www.datastax.com/documentation/cassandra/1.2/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html
On Wed, Jul 24, 2013 at 3:33 AM, Cyril Scetbon cyril.scet...@free.fr wrote: And if we want to add a new DC? I suppose we should add all the nodes and alter the replication factor of the keyspace after that, but can anyone confirm it and maybe give me some tips? FYI, we have 2 DCs with between 10 and 20 nodes in each and a 2 TB database (local replication factor included). Thanks -- Cyril SCETBON
On Jul 24, 2013, at 12:04 AM, Omar Shibli o...@eyeviewdigital.com wrote: All you need to do is decrease the replication factor of DC1 to 0, and then decommission the nodes one by one. I've tried this before and it worked with no issues. Thanks,
On Tue, Jul 23, 2013 at 10:32 PM, Lanny Ripple la...@spotright.com wrote: Hi, We have a multi-DC setup using DC1:2, DC2:2. We want to get rid of DC1. We're in the position where we don't need to save any of the data on DC1. We know we'll lose a (tiny; already checked) bit of data, but our processing is such that we'll recover over time. How do we drop DC1 and just move forward with DC2? Using nodetool decommission or removetoken looks like we'll eventually end up with a single DC1 node containing the entire DC's data, which would be slow and costly. We've speculated that setting DC1:0 or removing it from the schema would do the trick, but without finding any hits while searching on that idea I hesitate to just do it. We can drop DC1's data but have to keep a working ring in DC2.
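A minimal sketch of the sequence Omar describes (drop DC1 from the keyspace's replication settings, then decommission its nodes one at a time); the keyspace name mykeyspace is a placeholder:

  -- CQL3 (Cassandra 1.2): removing DC1 from the replication map is equivalent to DC1:0
  ALTER KEYSPACE mykeyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 2};

  # then, on each DC1 node in turn:
  nodetool decommission

With DC1 no longer owning any replicas, decommissioning its nodes should stream little or nothing, avoiding the pile-up of the entire DC's data onto a single DC1 node described in the original question below.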
Decommission an entire DC
Hi, We have a multi-DC setup using DC1:2, DC2:2. We want to get rid of DC1. We're in the position where we don't need to save any of the data on DC1. We know we'll lose a (tiny; already checked) bit of data, but our processing is such that we'll recover over time. How do we drop DC1 and just move forward with DC2? Using nodetool decommission or removetoken looks like we'll eventually end up with a single DC1 node containing the entire DC's data, which would be slow and costly. We've speculated that setting DC1:0 or removing it from the schema would do the trick, but without finding any hits while searching on that idea I hesitate to just do it. We can drop DC1's data but have to keep a working ring in DC2.
Re: Thrift message length exceeded
Good catch, since that bug also would have shut us down. The original problem is that prior to Cass 1.1.10 it looks like the cassandra.yaml values
* thrift_framed_transport_size_in_mb
* thrift_max_message_length_in_mb
were ignored (in favor of effectively no limits). We went from 1.1.5 to 1.2.3 and these were suddenly turned on for us (and way too low for our data). Also have confirmed your supplied patch2 works for us. -ljr
On Apr 22, 2013, at 6:57 AM, Oleksandr Petrov oleksandr.pet...@gmail.com wrote: I've submitted a patch that fixes the issue for 1.2.3: https://issues.apache.org/jira/browse/CASSANDRA-5504 Maybe you guys know a better way to fix it, but that helped me in the meantime.
On Mon, Apr 22, 2013 at 1:44 AM, Oleksandr Petrov oleksandr.pet...@gmail.com wrote: If you're using Cassandra 1.2.3 and the new Hadoop interface, which makes a call to next(), you'll get an eternal loop reading the same things over and over from your Cassandra nodes (you may see it if you enable debug output). next() clears key(), which is required for wide-row iteration. Setting the key back fixed the issue for me.
On Sat, Apr 20, 2013 at 3:05 PM, Oleksandr Petrov oleksandr.pet...@gmail.com wrote: Tried to isolate the issue in a testing environment. What I currently have, the setup for the test:

  CREATE KEYSPACE cascading_cassandra
    WITH replication = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
  USE cascading_cassandra;
  CREATE TABLE libraries (
    emitted_at timestamp,
    additional_info varchar,
    environment varchar,
    application varchar,
    type varchar,
    PRIMARY KEY (application, environment, type, emitted_at)
  ) WITH COMPACT STORAGE;

Next, insert some test data (just for example):

  ["INSERT INTO libraries (application, environment, type, additional_info, emitted_at) VALUES (?, ?, ?, ?, ?);"
   ["app" "env" "type" 0 #inst "2013-04-20T13:01:04.935-00:00"]]

If the keys (e.g. "app", "env", "type") are all the same across the dataset, it works correctly. As soon as I start varying the keys (e.g. app1, app2, app3, or others), I get the error with Message Length Exceeded. Does anyone have some ideas? Thanks for the help!
On Sat, Apr 20, 2013 at 1:56 PM, Oleksandr Petrov oleksandr.pet...@gmail.com wrote: I can confirm running into the same problem. Tried ConfigHelper.setThriftMaxMessageLengthInMb(), tuning server side, and reducing/increasing the batch size. Here's a stacktrace from Hadoop/Cassandra, maybe it could give a hint:

  Caused by: org.apache.thrift.protocol.TProtocolException: Message length exceeded: 8
    at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
    at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
    at org.apache.cassandra.thrift.Column.read(Column.java:528)
    at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:507)
    at org.apache.cassandra.thrift.KeySlice.read(KeySlice.java:408)
    at org.apache.cassandra.thrift.Cassandra$get_paged_slice_result.read(Cassandra.java:14157)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_paged_slice(Cassandra.java:769)
    at org.apache.cassandra.thrift.Cassandra$Client.get_paged_slice(Cassandra.java:753)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator.maybeInit(ColumnFamilyRecordReader.java:438)

On Thu, Apr 18, 2013 at 12:34 AM, Lanny Ripple la...@spotright.com wrote: It's slow going finding the time to do so, but I'm working on that. We do have another table that has one or sometimes two columns per row. We can run jobs on it without issue.
I looked through the org.apache.cassandra.hadoop code and don't see anything that's really changed since 1.1.5 (which was also using thrift-0.7), so it's something of a puzzler what's going on.
On Apr 17, 2013, at 2:47 PM, aaron morton aa...@thelastpickle.com wrote: Can you reproduce this in a simple way? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 18/04/2013, at 5:50 AM, Lanny Ripple la...@spotright.com wrote: That was our first thought. Using maven's dependency tree info we verified that we're using the expected (cass 1.2.3) jars:

  $ mvn dependency:tree | grep thrift
  [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
  [INFO] |  \- org.apache.cassandra:cassandra-thrift:jar:1.2.3:compile

I've also dumped the final command run by the hadoop we use (CDH3u5) and verified it's not sneaking thrift in on us.
On Tue, Apr 16, 2013 at 4:36 PM, aaron morton aa...@thelastpickle.com wrote: Can you confirm that you are using the same thrift version that ships with 1.2.3? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton
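For reference, a sketch of the two server-side knobs named at the top of this thread as they appear in cassandra.yaml; the values here are illustrative only (the 1.2-era defaults were 15 and 16), and the max message length must be larger than the frame size:

  # cassandra.yaml -- illustrative values, not recommendations
  thrift_framed_transport_size_in_mb: 64
  thrift_max_message_length_in_mb: 65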
Re: Thrift message length exceeded
That was our first thought. Using maven's dependency tree info we verified that we're using the expected (cass 1.2.3) jars:

  $ mvn dependency:tree | grep thrift
  [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
  [INFO] |  \- org.apache.cassandra:cassandra-thrift:jar:1.2.3:compile

I've also dumped the final command run by the hadoop we use (CDH3u5) and verified it's not sneaking thrift in on us.
On Tue, Apr 16, 2013 at 4:36 PM, aaron morton aa...@thelastpickle.com wrote: Can you confirm that you are using the same thrift version that ships with 1.2.3? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 16/04/2013, at 10:17 AM, Lanny Ripple la...@spotright.com wrote: A bump to say I found this http://stackoverflow.com/questions/15487540/pig-cassandra-message-length-exceeded so others are seeing similar behavior. From what I can see of org.apache.cassandra.hadoop nothing has changed since 1.1.5, when we didn't see such things, but it sure looks like a bug has slipped in (or been uncovered) somewhere. I'll try to narrow down to a dataset and code that can reproduce it.
On Apr 10, 2013, at 6:29 PM, Lanny Ripple la...@spotright.com wrote: We are using Astyanax in production but I cut back to just Hadoop and Cassandra to confirm it's a Cassandra (or our use of Cassandra) problem. We do have some extremely large rows, but we went from everything working with 1.1.5 to almost everything carping with 1.2.3. Something has changed. Perhaps we were doing something wrong earlier that 1.2.3 exposed, but surprises are never welcome in production.
On Apr 10, 2013, at 8:10 AM, moshe.kr...@barclays.com wrote: I also saw this when upgrading from C* 1.0 to 1.2.2, and from hector 0.6 to 0.8. Turns out the Thrift message really was too long. The mystery to me: Why no complaints in previous versions? Were some checks added in Thrift or Hector?
-----Original Message----- From: Lanny Ripple [mailto:la...@spotright.com] Sent: Tuesday, April 09, 2013 6:17 PM To: user@cassandra.apache.org Subject: Thrift message length exceeded
Hello, We have recently upgraded to Cass 1.2.3 from Cass 1.1.5. We ran sstableupgrades and got the ring on its feet and we are now seeing a new issue.
When we run MapReduce jobs against practically any table we find the following errors: 2013-04-09 09:58:47,746 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2013-04-09 09:58:47,899 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2013-04-09 09:58:48,021 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2013-04-09 09:58:48,024 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4a48edb5 2013-04-09 09:58:50,475 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-04-09 09:58:50,477 WARN org.apache.hadoop.mapred.Child: Error running child java.lang.RuntimeException: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:390) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:313) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:444) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:460) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260) Caused by: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393) at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363) at org.apache.cassandra.thrift.Column.read(Column.java:528
Re: Thrift message length exceeded
It's slow going finding the time to do so, but I'm working on that. We do have another table that has one or sometimes two columns per row. We can run jobs on it without issue. I looked through the org.apache.cassandra.hadoop code and don't see anything that's really changed since 1.1.5 (which was also using thrift-0.7), so it's something of a puzzler what's going on.
On Apr 17, 2013, at 2:47 PM, aaron morton aa...@thelastpickle.com wrote: Can you reproduce this in a simple way? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 18/04/2013, at 5:50 AM, Lanny Ripple la...@spotright.com wrote: That was our first thought. Using maven's dependency tree info we verified that we're using the expected (cass 1.2.3) jars:

  $ mvn dependency:tree | grep thrift
  [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
  [INFO] |  \- org.apache.cassandra:cassandra-thrift:jar:1.2.3:compile

I've also dumped the final command run by the hadoop we use (CDH3u5) and verified it's not sneaking thrift in on us.
On Tue, Apr 16, 2013 at 4:36 PM, aaron morton aa...@thelastpickle.com wrote: Can you confirm that you are using the same thrift version that ships with 1.2.3? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 16/04/2013, at 10:17 AM, Lanny Ripple la...@spotright.com wrote: A bump to say I found this http://stackoverflow.com/questions/15487540/pig-cassandra-message-length-exceeded so others are seeing similar behavior. From what I can see of org.apache.cassandra.hadoop nothing has changed since 1.1.5, when we didn't see such things, but it sure looks like a bug has slipped in (or been uncovered) somewhere. I'll try to narrow down to a dataset and code that can reproduce it.
On Apr 10, 2013, at 6:29 PM, Lanny Ripple la...@spotright.com wrote: We are using Astyanax in production but I cut back to just Hadoop and Cassandra to confirm it's a Cassandra (or our use of Cassandra) problem. We do have some extremely large rows, but we went from everything working with 1.1.5 to almost everything carping with 1.2.3. Something has changed. Perhaps we were doing something wrong earlier that 1.2.3 exposed, but surprises are never welcome in production.
On Apr 10, 2013, at 8:10 AM, moshe.kr...@barclays.com wrote: I also saw this when upgrading from C* 1.0 to 1.2.2, and from hector 0.6 to 0.8. Turns out the Thrift message really was too long. The mystery to me: Why no complaints in previous versions? Were some checks added in Thrift or Hector?
-----Original Message----- From: Lanny Ripple [mailto:la...@spotright.com] Sent: Tuesday, April 09, 2013 6:17 PM To: user@cassandra.apache.org Subject: Thrift message length exceeded
Hello, We have recently upgraded to Cass 1.2.3 from Cass 1.1.5. We ran sstableupgrades and got the ring on its feet and we are now seeing a new issue.
When we run MapReduce jobs against practically any table we find the following errors: 2013-04-09 09:58:47,746 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2013-04-09 09:58:47,899 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2013-04-09 09:58:48,021 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2013-04-09 09:58:48,024 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4a48edb5 2013-04-09 09:58:50,475 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-04-09 09:58:50,477 WARN org.apache.hadoop.mapred.Child: Error running child java.lang.RuntimeException: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:390) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:313) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:444) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:460) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143
Re: CorruptedBlockException
Saw this in earlier versions. Our workaround was: disable; drain; snap; shutdown; delete; link from snap; restart (sketched after this thread). -ljr
On Apr 11, 2013, at 9:45, moshe.kr...@barclays.com wrote: I have formulated the following theory regarding C* 1.2.2 which may be relevant: Whenever there is a disk error during compaction of an SSTable (e.g., bad block, out of disk space), that SSTable's files stick around forever after, and do not subsequently get deleted by normal compaction (minor or major), long after all its records have been deleted. This causes disk usage to rise dramatically. The only way to make the SSTable files disappear is to run "nodetool cleanup" (which takes hours to run). Just a theory so far....
From: Alexis Rodríguez [mailto:arodrig...@inconcertcc.com] Sent: Thursday, April 11, 2013 5:31 PM To: user@cassandra.apache.org Subject: Re: CorruptedBlockException
Aaron, It seems that we are in the same situation as Nury; we are storing a lot of files of ~5MB in a CF. This happens in a test cluster, with one node using cassandra 1.1.5; we have the commitlog on a different partition than the data directory. Normally our tests use nearly 13 GB of data, but when the exception on compaction appears our disk space ramps up to:

  # df -h
  Filesystem            Size  Used Avail Use% Mounted on
  /dev/sda1             440G  330G   89G  79% /
  tmpfs                 7.9G     0  7.9G   0% /lib/init/rw
  udev                  7.9G  160K  7.9G   1% /dev
  tmpfs                 7.9G     0  7.9G   0% /dev/shm
  /dev/sdb1             459G  257G  179G  59% /cassandra

  # cd /cassandra/data/Repository/
  # ls Files/*tmp* | wc -l
  1671
  # du -ch Files | tail -1
  257G  total
  # du -ch Files/*tmp* | tail -1
  34G   total

We are using cassandra 1.1.5 with one node. Our schema for that keyspace is:

  [default@unknown] use Repository;
  Authenticated to keyspace: Repository
  [default@Repository] show schema;
  create keyspace Repository
    with placement_strategy = 'NetworkTopologyStrategy'
    and strategy_options = {datacenter1 : 1}
    and durable_writes = true;
  use Repository;
  create column family Files
    with column_type = 'Standard'
    and comparator = 'UTF8Type'
    and default_validation_class = 'BytesType'
    and key_validation_class = 'BytesType'
    and read_repair_chance = 0.1
    and dclocal_read_repair_chance = 0.0
    and gc_grace = 864000
    and min_compaction_threshold = 4
    and max_compaction_threshold = 32
    and replicate_on_write = true
    and compaction_strategy = 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
    and caching = 'KEYS_ONLY'
    and compaction_strategy_options = {'sstable_size_in_mb' : '120'}
    and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

In our logs:

  ERROR [CompactionExecutor:1831] 2013-04-11 09:12:41,725 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:1831,1,main]
  java.io.IOError: org.apache.cassandra.io.compress.CorruptedBlockException: (/cassandra/data/Repository/Files/Repository-Files-he-4533-Data.db): corruption detected, chunk at 43325354 of length 65545.
  at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
  at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
  at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
  at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
  at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
  at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
  at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
  at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
  at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
  at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
  at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
  at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
  at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
  at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
  at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
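A sketch of the disable/drain/snapshot/link workaround mentioned at the top of this thread (the same sequence spelled out in the "lots of extra bytes on disk" thread below); the keyspace, CF, snapshot tag, and data path are placeholders:

  nodetool disablethrift
  nodetool disablegossip
  nodetool drain
  nodetool snapshot -t rescue MyKeyspace
  # stop the node, then replace the live sstables with hard links from the snapshot
  cd /var/lib/cassandra/data/MyKeyspace/MyCF
  rm *.db
  ln snapshots/rescue/* .
  # restart the node

Hard-linking back from the just-taken snapshot avoids copying any data, since the snapshot files live on the same filesystem as the live sstables.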
Re: Thrift message length exceeded
We are using Astyanax in production but I cut back to just Hadoop and Cassandra to confirm it's a Cassandra (or our use of Cassandra) problem. We do have some extremely large rows but we went from everything working with 1.1.5 to almost everything carping with 1.2.3. Something has changed. Perhaps we were doing something wrong earlier that 1.2.3 exposed but surprises are never welcome in production. On Apr 10, 2013, at 8:10 AM, moshe.kr...@barclays.com wrote: I also saw this when upgrading from C* 1.0 to 1.2.2, and from hector 0.6 to 0.8 Turns out the Thrift message really was too long. The mystery to me: Why no complaints in previous versions? Were some checks added in Thrift or Hector? -Original Message- From: Lanny Ripple [mailto:la...@spotright.com] Sent: Tuesday, April 09, 2013 6:17 PM To: user@cassandra.apache.org Subject: Thrift message length exceeded Hello, We have recently upgraded to Cass 1.2.3 from Cass 1.1.5. We ran sstableupgrades and got the ring on its feet and we are now seeing a new issue. When we run MapReduce jobs against practically any table we find the following errors: 2013-04-09 09:58:47,746 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2013-04-09 09:58:47,899 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2013-04-09 09:58:48,021 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2013-04-09 09:58:48,024 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4a48edb5 2013-04-09 09:58:50,475 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-04-09 09:58:50,477 WARN org.apache.hadoop.mapred.Child: Error running child java.lang.RuntimeException: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:390) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:313) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:444) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:460) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260) Caused by: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393) at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363) at 
org.apache.cassandra.thrift.Column.read(Column.java:528) at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:507) at org.apache.cassandra.thrift.KeySlice.read(KeySlice.java:408) at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12905) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734) at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:346) ... 16 more 2013-04-09 09:58:50,481 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task The message length listed on each failed job differs (not always 106). Jobs that used to run fine now fail with code compiled against cass 1.2.3 (and work fine if compiled against 1.1.5 and run against the 1.2.3 servers in production). I'm using the following setup to configure the job: def cassConfig(job: Job) { val conf = job.getConfiguration
Thrift message length exceeded
Hello, We have recently upgraded to Cass 1.2.3 from Cass 1.1.5. We ran sstableupgrades and got the ring on its feet and we are now seeing a new issue. When we run MapReduce jobs against practically any table we find the following errors: 2013-04-09 09:58:47,746 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2013-04-09 09:58:47,899 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2013-04-09 09:58:48,021 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2013-04-09 09:58:48,024 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4a48edb5 2013-04-09 09:58:50,475 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-04-09 09:58:50,477 WARN org.apache.hadoop.mapred.Child: Error running child java.lang.RuntimeException: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:390) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:313) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:444) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:460) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260) Caused by: org.apache.thrift.TException: Message length exceeded: 106 at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393) at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363) at org.apache.cassandra.thrift.Column.read(Column.java:528) at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:507) at org.apache.cassandra.thrift.KeySlice.read(KeySlice.java:408) at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12905) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734) at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:346) ... 16 more 2013-04-09 09:58:50,481 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task The message length listed on each failed job differs (not always 106). 
Jobs that used to run fine now fail with code compiled against cass 1.2.3 (and work fine if compiled against 1.1.5 and run against the 1.2.3 servers in production). I'm using the following setup to configure the job:

  def cassConfig(job: Job) {
    val conf = job.getConfiguration()
    ConfigHelper.setInputRpcPort(conf, "" + 9160)
    ConfigHelper.setInputInitialAddress(conf, Config.hostip)
    ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner")
    ConfigHelper.setInputColumnFamily(conf, Config.keyspace, Config.cfname)
    val pred = {
      val range = new SliceRange()
        .setStart("".getBytes("UTF-8"))
        .setFinish("".getBytes("UTF-8"))
        .setReversed(false)
        .setCount(4096 * 1000)
      new SlicePredicate().setSlice_range(range)
    }
    ConfigHelper.setInputSlicePredicate(conf, pred)
  }

The job consists only of a mapper that increments counters for each row and associated columns, so all I'm really doing is exercising ColumnFamilyRecordReader. Has anyone else seen this? Is there a workaround/fix to get our jobs running? Thanks
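If the limit is being enforced client side, the ConfigHelper knob Oleksandr mentions in the replies earlier in this digest can be set in the same job configuration. A hedged sketch, assuming the method takes a (Configuration, int) pair; the value 64 is illustrative, not a recommendation:

  // raise the client-side Thrift message length limit for the Hadoop job
  ConfigHelper.setThriftMaxMessageLengthInMb(conf, 64)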
Re: lots of extra bytes on disk
We occasionally (twice now on a 40 node cluster over the last 6-8 months) see this. My best guess is that Cassandra can fail to mark an SSTable for cleanup somehow. Forced GC's or reboots don't clear them out. We disable thrift and gossip; drain; snapshot; shutdown; clear data/Keyspace/Table/*.db and restore (hard-linking back into place to avoid data transfer) from the just created snapshot; restart. On Mar 28, 2013, at 10:12 AM, Ben Chobot be...@instructure.com wrote: Some of my cassandra nodes in my 1.1.5 cluster show a large discrepancy between what cassandra says the SSTables should sum up to, and what df and du claim exist. During repairs, this is almost always pretty bad, but post-repair compactions tend to bring those numbers to within a few percent of each other... usually. Sometimes they remain much further apart after compactions have finished - for instance, I'm looking at one node now that claims to have 205GB of SSTables, but actually has 450GB of files living in that CF's data directory. No pending compactions, and the most recent compaction for this CF finished just a few hours ago. nodetool cleanup has no effect. What could be causing these extra bytes, and how to get them to go away? I'm ok with a few extra GB of unexplained data, but an extra 245GB (more than all the data this node is supposed to have!) is a little extreme.
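One way to see the discrepancy Ben describes is to compare what Cassandra reports for the column family against what the filesystem holds; a minimal sketch, with the keyspace/CF names and data path as placeholders:

  # what Cassandra believes it is using (the "Space used" lines)
  nodetool cfstats | grep -A 6 'Column Family: MyCF'
  # what is actually on disk
  du -sh /var/lib/cassandra/data/MyKeyspace/MyCF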
Re: TimeUUID Order Partitioner
A UUID can be created from two Longs. You could MD5 your strings, giving you 128 hashed bits, and then make UUIDs out of that. Using Scala:

  import java.nio.ByteBuffer
  import java.security.MessageDigest
  import java.util.UUID

  val key = "Hello, World!"
  val md = MessageDigest.getInstance("MD5")
  val dig = md.digest(key.getBytes("UTF-8"))
  val bb = ByteBuffer.wrap(dig)
  val msb = bb.getLong
  val lsb = bb.getLong
  val uuid = new UUID(msb, lsb)

On Mar 26, 2013, at 3:22 PM, aaron morton aa...@thelastpickle.com wrote: Any idea? Not off the top of my head. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 26/03/2013, at 2:13 AM, Carlos Pérez Miguel cperez...@gmail.com wrote: Yes it does. Thank you, Aaron. Now I realize that the system keyspace uses strings as keys, like Ring or ClusterName, and I don't know how to convert these types of keys into UUIDs. Any idea? Carlos Pérez Miguel
2013/3/25 aaron morton aa...@thelastpickle.com: The best thing to do is start with a look at ByteOrderedPartitioner and AbstractByteOrderedPartitioner. You'll want to create a new TimeUUIDToken extends Token<UUID> and a new UUIDPartitioner that extends AbstractPartitioner. Usual disclaimer that ordered partitioners cause problems with load balancing. Hope that helps. - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 25/03/2013, at 1:12 AM, Carlos Pérez Miguel cperez...@gmail.com wrote: Hi, I store in my system rows where the key is a UUID version 1, TimeUUID. I would like to maintain rows ordered by time. I know that in this case it is recommended to use an external CF where column names are UUIDs ordered by time. But in my use case this is not possible, so I would like to use a custom Partitioner in order to do this. If I use ByteOrderedPartitioner, rows are not correctly ordered because of the way a UUID stores the timestamp. What is needed in order to implement my own Partitioner? Thank you. Carlos Pérez Miguel
Re: TimeUUID Order Partitioner
Ah, TimeUUID. Not as useful for you then, but still something for the toolbox.
On Mar 27, 2013, at 8:42 AM, Lanny Ripple la...@spotright.com wrote: A UUID can be created from two Longs. You could MD5 your strings, giving you 128 hashed bits, and then make UUIDs out of that. Using Scala:

  import java.nio.ByteBuffer
  import java.security.MessageDigest
  import java.util.UUID

  val key = "Hello, World!"
  val md = MessageDigest.getInstance("MD5")
  val dig = md.digest(key.getBytes("UTF-8"))
  val bb = ByteBuffer.wrap(dig)
  val msb = bb.getLong
  val lsb = bb.getLong
  val uuid = new UUID(msb, lsb)

On Mar 26, 2013, at 3:22 PM, aaron morton aa...@thelastpickle.com wrote: Any idea? Not off the top of my head. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 26/03/2013, at 2:13 AM, Carlos Pérez Miguel cperez...@gmail.com wrote: Yes it does. Thank you, Aaron. Now I realize that the system keyspace uses strings as keys, like Ring or ClusterName, and I don't know how to convert these types of keys into UUIDs. Any idea? Carlos Pérez Miguel
2013/3/25 aaron morton aa...@thelastpickle.com: The best thing to do is start with a look at ByteOrderedPartitioner and AbstractByteOrderedPartitioner. You'll want to create a new TimeUUIDToken extends Token<UUID> and a new UUIDPartitioner that extends AbstractPartitioner. Usual disclaimer that ordered partitioners cause problems with load balancing. Hope that helps. - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
On 25/03/2013, at 1:12 AM, Carlos Pérez Miguel cperez...@gmail.com wrote: Hi, I store in my system rows where the key is a UUID version 1, TimeUUID. I would like to maintain rows ordered by time. I know that in this case it is recommended to use an external CF where column names are UUIDs ordered by time. But in my use case this is not possible, so I would like to use a custom Partitioner in order to do this. If I use ByteOrderedPartitioner, rows are not correctly ordered because of the way a UUID stores the timestamp. What is needed in order to implement my own Partitioner? Thank you. Carlos Pérez Miguel