Re: Restoring all cluster from snapshots
On Mon, Jun 8, 2015 at 6:22 AM, Anton Koshevoy nowa...@gmail.com wrote: - sudo rm -rf /db/cassandra/cr/data0*/system/* This removes the schema. You can't load SSTables for column families which don't exist. =Rob
Re: Cassandra crashes daily; nothing on the log
It could be the linux kernel killing Cassandra b/c of memory usage. When this happens, nothing is logged in Cassandra. Check the system logs: /var/log/messages Look for a message saying Out of Memory... kill process... On Mon, Jun 8, 2015 at 1:37 PM, Paulo Motta pauloricard...@gmail.com wrote: try checking your system logs (generally /var/log/syslog) to check if the cassandra process was killed by the OS oom-killer 2015-06-06 15:39 GMT-03:00 Brian Sam-Bodden bsbod...@integrallis.com: Berk, 1 GB is not enough to run C*, the minimum memory we use on Digital Ocean is 4GB. Cheers, Brian http://integrallis.com On Sat, Jun 6, 2015 at 10:50 AM, graffit...@yahoo.com wrote: Hi all, I've installed Cassandra on a test server hosted on Digital Ocean. The server has 1GB RAM, and is running a single docker container alongside C*. Somehow, every night, the Cassandra instance crashes. The annoying part is that I cannot see anything wrong with the log files, so I can't tell what's going on. The log files are here: http://pastebin.com/Zquu5wvd Do you have any idea what's going on? Can you suggest some ways I can try to troubleshoot this? Thanks! Berk -- Cheers, Brian http://www.integrallis.com
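[A quick way to confirm the OOM-killer theory from the replies above; log paths vary by distro, so adjust as needed:]

```shell
# Look for kernel OOM-killer activity. Which file exists depends on the
# distro: /var/log/syslog on Debian/Ubuntu, /var/log/messages on RHEL/CentOS.
grep -iE "out of memory|killed process" /var/log/syslog /var/log/messages 2>/dev/null

# The kernel ring buffer records the same event and works everywhere:
dmesg | grep -iE "out of memory|killed process"
```

If the JVM was killed, you will see a line naming the java process and its memory score; nothing appears in Cassandra's own logs because the process is terminated from outside.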
Re: sstableloader usage doubts
On Mon, Jun 8, 2015 at 6:58 AM, ZeroUno zerozerouno...@gmail.com wrote: So you mean that refresh needs to be used if the cluster is running, but if I stopped cassandra while copying the sstables then refresh is useless? So the error No new SSTables were found during my refresh attempt is due to the fact that the sstables in my data dir were not new because already loaded, and not to the files not being found? Yes. You should be able to see logs of it opening the files it finds in the data dir. So... if I stop the two nodes on the first DC, restore their sstables' files, and then restart the nodes, nothing else needs to be done on the first DC? Be careful to avoid bootstrapping, but yes. And on the second DC instead I just need to do nodetool rebuild -- FirstDC on _both_ nodes? Yes. =Rob
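[The restore-then-rebuild sequence discussed above could look roughly like this; the paths, service name, and keyspace/DC names are hypothetical and must be adjusted to your cluster:]

```shell
# On each node of the first DC: stop the node, restore the snapshot files
# into the data directory, then start it again. Cassandra opens the
# restored sstables on startup, so no "nodetool refresh" is needed.
sudo service cassandra stop
cp /backups/snapshot/mykeyspace/mycf/* /var/lib/cassandra/data/mykeyspace/mycf/
sudo service cassandra start   # make sure the node does not re-bootstrap

# On each node of the second DC, stream everything from the first DC:
nodetool rebuild -- FirstDC
```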
Re: Restoring all cluster from snapshots
Rob, thanks for the answer. I just followed the instructions from http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html If I don't remove the system table data, the test cluster starts interfering with the production cluster. How can I avoid this situation? On June 8, 2015 at 9:48:30 PM, Robert Coli (rc...@eventbrite.com) wrote: On Mon, Jun 8, 2015 at 6:22 AM, Anton Koshevoy nowa...@gmail.com wrote: - sudo rm -rf /db/cassandra/cr/data0*/system/* This removes the schema. You can't load SSTables for column families which don't exist. =Rob
RE: Restoring all cluster from snapshots
Yes, you shouldn’t delete the system directory. The next steps are: reconfigure the test cluster with new IP addresses, clear the gossip information, and then boot the test cluster. If you are running Cassandra on VMware, then you may also want to look at this solution from Trilio Data (http://www.triliodata.com/wp-content/uploads/2015/04/Cassandra-Trilio-Data-Sheet4.pdf), where you can create a Cassandra backup and restore it to a test cluster. Regards, Sanjay _ Sanjay Baronia VP of Product Solutions Management TrilioData (c) 508-335-2306 sanjay.baro...@triliodata.com To see Trilio in action, email i...@triliodata.com to request a demo. From: Anton Koshevoy [mailto:nowa...@gmail.com] Sent: Monday, June 8, 2015 4:42 PM To: user@cassandra.apache.org Subject: Re: Restoring all cluster from snapshots Rob, thanks for the answer. I just followed the instructions from http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html If I don't remove the system table data, the test cluster starts interfering with the production cluster. How can I avoid this situation? On June 8, 2015 at 9:48:30 PM, Robert Coli (rc...@eventbrite.com) wrote: On Mon, Jun 8, 2015 at 6:22 AM, Anton Koshevoy nowa...@gmail.com wrote: - sudo rm -rf /db/cassandra/cr/data0*/system/* This removes the schema. You can't load SSTables for column families which don't exist. =Rob
Deserialize the collection type data from the SSTable file
Hi, Cassandra users: I have a question about how to deserialize the new collection type data in Cassandra 2.x (the exact version is C* 2.0.10). I created the following example table in CQLSH:

CREATE TABLE coupon (
    account_id bigint,
    campaign_id uuid,
    ...
    discount_info map<text, text>,
    ...
    PRIMARY KEY (account_id, campaign_id)
)

The other columns can be ignored in this case. Then I inserted one row of test data like this:

insert into coupon (account_id, campaign_id, discount_info) values (111, uuid(), {'test_key':'test_value'});

After this, I got the SSTable files. I used sstable2json to check the output:

$ ./resources/cassandra/bin/sstable2json /xxx/test-coupon-jb-1-Data.db
[{"key": "006f","columns": [["0336e50d-21aa-4b3a-9f01-989a8c540e54:","",1433792922055000], ["0336e50d-21aa-4b3a-9f01-989a8c540e54:discount_info","0336e50d-21aa-4b3a-9f01-989a8c540e54:discount_info:!",1433792922054999,"t",1433792922], ["0336e50d-21aa-4b3a-9f01-989a8c540e54:discount_info:746573745f6b6579","746573745f76616c7565",1433792922055000]]}]

What I want is to get the {test_key : test_value} key/value pair that I put into the discount_info column. I followed the sstable2json code and tried to deserialize the data myself, but to my surprise I cannot make it work; I tried several ways but kept getting exceptions. From what I researched, I know that Cassandra stores campaign_id + discount_info + another ByteBuffer as a composite column name in this case. When I deserialize this column name, I get the following dumped out as a String: 0336e50d-21aa-4b3a-9f01-989a8c540e54:discount_info:746573745f6b6579. It has 3 parts: the first part is the uuid for campaign_id. The 2nd part is discount_info, the static name I defined in the table. The 3rd part is a byte array of length 46, which I am not sure what it is. The corresponding value part of this composite column is another byte array of length 10, hex 746573745f76616c7565 if I dump it out.
Now, here is what I did, and I am not sure why it doesn't work. First, I assumed the value part stores the real value I put in the Map, so I did the following:

ByteBuffer value = ByteBufferUtil.clone(column.value());
MapType<String, String> result = MapType.getInstance(UTF8Type.instance, UTF8Type.instance);
Map<String, String> output = result.compose(value);
// it gave me the following exception:
// org.apache.cassandra.serializers.MarshalException: Not enough bytes to read a map

Then I thought that the real value must be stored as part of the column name (the 3rd part of 46 bytes), so I did this:

MapType<String, String> result = MapType.getInstance(UTF8Type.instance, UTF8Type.instance);
Map<String, String> output = result.compose(third_part.value);
// I got the following exception:
// java.lang.IllegalArgumentException
//     at java.nio.Buffer.limit(Buffer.java:267)
//     at org.apache.cassandra.utils.ByteBufferUtil.readBytes(ByteBufferUtil.java:587)
//     at org.apache.cassandra.utils.ByteBufferUtil.readBytesWithShortLength(ByteBufferUtil.java:596)
//     at org.apache.cassandra.serializers.MapSerializer.deserialize(MapSerializer.java:63)
//     at org.apache.cassandra.serializers.MapSerializer.deserialize(MapSerializer.java:28)
//     at org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:142)

I can get all the other non-collection type data, but I cannot get the data from the Map. My questions are:
1) How does Cassandra store collection data in the SSTable files? From the length of the bytes, it is most likely part of the composite column name. If so, why did I get the exceptions above?
2) sstable2json doesn't deserialize the real data from the collection type, so I don't have an example to follow. Am I using the wrong way to compose the Map type data?
Thanks
Yong
Re: Deserialize the collection type data from the SSTable file
I'm not sure why sstable2json doesn't work for collections, but if you're into reading raw sstables, we use the following code with good success: https://github.com/coursera/aegisthus/blob/77c73f6259f2a30d3d8ca64578be5c13ecc4e6f4/aegisthus-hadoop/src/main/java/org/coursera/mapreducer/CQLMapper.java#L85 Thanks, Daniel On Mon, Jun 8, 2015 at 1:22 PM, java8964 java8...@hotmail.com wrote: Hi, Cassandra users: I have a question about how to deserialize the new collection type data in Cassandra 2.x (the exact version is C* 2.0.10). [...]
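[For what it's worth, the two hex strings in the sstable2json output above are just the UTF-8 bytes of the map key and value. In C* 2.0, as far as I can tell, each map entry is stored as its own cell: the map key is the last component of the composite column name and the map value is the cell value. That would also explain the "Not enough bytes to read a map" error, since MapType.compose expects a fully serialized map (an entry count followed by length-prefixed keys and values), not a single entry. A quick decode, assuming xxd is installed:]

```shell
# Decode the last composite component of the column name (the map key)
# and the cell value (the map value) from the sstable2json output:
printf '%s' 746573745f6b6579 | xxd -r -p; echo        # prints: test_key
printf '%s' 746573745f76616c7575 | xxd -r -p >/dev/null  # (placeholder, see below)
printf '%s' 746573745f76616c7565 | xxd -r -p; echo    # prints: test_value
```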
Re: Restoring all cluster from snapshots
I think you just have to do a DESC KEYSPACE mykeyspace; from one node of the production cluster, then copy the output and import it into your dev cluster using cqlsh -f output.cql. Take care: at the start of the output you might want to change DC names, RF, or the strategy. Also, if you don't want to restart nodes, you can load data by using nodetool refresh mykeyspace mycf. C*heers Alain 2015-06-08 22:42 GMT+02:00 Anton Koshevoy nowa...@gmail.com: Rob, thanks for the answer. I just followed the instructions from http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html If I don't remove the system table data, the test cluster starts interfering with the production cluster. How can I avoid this situation? On June 8, 2015 at 9:48:30 PM, Robert Coli (rc...@eventbrite.com) wrote: On Mon, Jun 8, 2015 at 6:22 AM, Anton Koshevoy nowa...@gmail.com wrote: - sudo rm -rf /db/cassandra/cr/data0*/system/* This removes the schema. You can't load SSTables for column families which don't exist. =Rob
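[The export-and-import step could be sketched like this; the host names and keyspace/table names are made up, and cqlsh's -e flag assumes a recent cqlsh (otherwise pipe the statement in on stdin):]

```shell
# Export the schema from the production cluster:
cqlsh prod-host -e "DESC KEYSPACE mykeyspace" > mykeyspace.cql

# Review the top of mykeyspace.cql and adjust DC names, replication
# factor, or strategy for the dev cluster, then import it:
cqlsh test-host -f mykeyspace.cql

# Pick up restored sstable files without restarting the node:
nodetool refresh mykeyspace mycf
```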
Re: Restoring all cluster from snapshots
On Mon, Jun 8, 2015 at 2:52 PM, Sanjay Baronia sanjay.baro...@triliodata.com wrote: Yes, you shouldn’t delete the system directory. Next steps are …reconfigure the test cluster with new IP addresses, clear the gossiping information and then boot the test cluster. If you don't delete the system directory, you run the risk of the test cluster nodes joining the source cluster. Just start a single node on the new cluster, empty, and create the schema on it. Then do the rest of the process. =Rob
Re: DSE 4.7 security
Cassandra authorization is at the keyspace and table level. Click on the GRANT link on the doc page, to get more info: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/grant_r.html Which says *Permissions to access all keyspaces, a named keyspace, or a table can be granted to a user.* There is no finer-grain authorization at the row, column, or cell level. You might want to open a Jira for this valuable feature. -- Jack Krupansky On Sun, Jun 7, 2015 at 5:19 PM, Moshe Kranc moshekr...@gmail.com wrote: The DSE 4.7 documentation says: You use the familiar relational database GRANT/REVOKE paradigm to grant or revoke permissions to access Cassandra data. Does this mean authorization is per table? What if I need finer grain authorization, e.g., per row or even per cell (e.g., a specific column in a specific row may not be seen by users in a group)? Do I need to implement this in my application, because Cassandra does not support it?
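[For completeness, keyspace- and table-level grants (the finest granularity available, per the doc page above) look like this; the role, keyspace, and table names here are invented:]

```shell
# Keyspace-wide read access:
cqlsh -u cassandra -e "GRANT SELECT ON KEYSPACE sales TO analyst;"

# Write access to a single table only:
cqlsh -u cassandra -e "GRANT MODIFY ON sales.orders TO importer;"
```

Anything finer, such as per-row or per-column visibility, has to be enforced in the application layer.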
auto clear data with ttl
I have C* 2.1.5 and store some data with a TTL. I reduced gc_grace_seconds to zero, but it seems to have no effect. Did I miss something? -- Ranger Tsao
Re: auto clear data with ttl
With gc_grace at zero, tombstones will be removed without any delay once compaction runs. So it's possible that the SSTables containing the tombstones simply haven't been compacted yet. You can either wait for compaction to happen or run a manual compaction, depending on your compaction strategy. Manual compaction does have some drawbacks, so please read about it first. Sent from my iPhone On Jun 8, 2015, at 7:26 PM, 曹志富 cao.zh...@gmail.com wrote: I have C* 2.1.5 and store some data with a TTL. I reduced gc_grace_seconds to zero, but it seems to have no effect. Did I miss something? -- Ranger Tsao
Re: auto clear data with ttl
Thank you. I have changed unchecked_tombstone_compaction to true. A major compaction would produce one big SSTable, so I don't think it is a good choice. -- Ranger Tsao 2015-06-09 11:16 GMT+08:00 Aiman Parvaiz ai...@flipagram.com: With gc_grace at zero, tombstones will be removed without any delay once compaction runs. So it's possible that the SSTables containing the tombstones simply haven't been compacted yet. You can either wait for compaction to happen or run a manual compaction, depending on your compaction strategy. Manual compaction does have some drawbacks, so please read about it first. Sent from my iPhone On Jun 8, 2015, at 7:26 PM, 曹志富 cao.zh...@gmail.com wrote: I have C* 2.1.5 and store some data with a TTL. I reduced gc_grace_seconds to zero, but it seems to have no effect. Did I miss something? -- Ranger Tsao
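[Enabling the flag is an ALTER TABLE; the options below assume size-tiered compaction and a hypothetical keyspace/table:]

```shell
cqlsh -e "ALTER TABLE mykeyspace.mycf WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'unchecked_tombstone_compaction': 'true'
};"

# Watch tombstones being purged as single-sstable compactions run:
nodetool compactionstats
```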
C* 2.0.15 - java.lang.NegativeArraySizeException
Hi everyone I am running C* 2.0.9 and decided to do a rolling upgrade. Added a node of C* 2.0.15 in the existing cluster and saw this twice: Jun 9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,658 INFO CompactionExecutor:4 CompactionTask.runMayThrow - Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-37-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-40-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-42-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-38-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-39-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-44-Data.db')] Jun 9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,669 ERROR CompactionExecutor:4 CassandraDaemon.uncaughtException - Exception in thread Thread[CompactionExecutor:4,1,main] Jun 9 02:27:20 prod-cass23.localdomain *java.lang.NegativeArraySizeException* Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366) Jun 9 02:27:20 prod-cass23.localdomain at 
org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.db.compaction.CompactionTask.createCompactionWriter(CompactionTask.java:316) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:162) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) Jun 9 02:27:20 prod-cass23.localdomain at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198) Jun 9 02:27:20 prod-cass23.localdomain at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) Jun 9 02:27:20 prod-cass23.localdomain at java.util.concurrent.FutureTask.run(FutureTask.java:262) Jun 9 02:27:20 prod-cass23.localdomain at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) Jun 9 02:27:20 prod-cass23.localdomain at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) Jun 9 02:27:20 prod-cass23.localdomain at java.lang.Thread.run(Thread.java:745) Jun 9 02:27:47 prod-cass23.localdomain cassandra: 2015-06-09 02:27:47,725 INFO main StorageService.setMode - JOINING: Starting to bootstrap... As you can see this happened first time even before Joining. 
Second occasion stack trace: Jun 9 02:32:15 prod-cass23.localdomain cassandra: 2015-06-09 02:32:15,097 ERROR CompactionExecutor:6 CassandraDaemon.uncaughtException - Exception in thread Thread[CompactionExecutor:6,1,main] Jun 9 02:32:15 prod-cass23.localdomain java.lang.NegativeArraySizeException Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335) Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462) Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448) Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432) Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366) Jun 9 02:32:15 prod-cass23.localdomain at org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134) Jun 9 02:32:15 prod-cass23.localdomain at
Re: Hbase vs Cassandra
Hi All, Thanks for all the input. I posted the same question in the HBase forum and got more responses. Posting the consolidated list here. Our case is that a central team builds and maintains the platform (Cassandra as a service). We have a couple of use cases that fit Cassandra, like time-series data. But as a platform team, we need to know more features and use cases which fit or are best handled in Cassandra, and also to understand the use cases where HBase performs better (we might need to offer it as a service too). *Cassandra:* 1) From 2013, but both can still be relevant: http://www.pythian.com/blog/watch-hbase-vs-cassandra/ 2) Here are some use cases from PlanetCassandra.org of companies who chose Cassandra over HBase after evaluation, or migrated to Cassandra from HBase. The eComNext interview cited on the page touches on time-series data: http://planetcassandra.org/hbase-to-cassandra-migration/ 3) From googling, the most popular advantages of Cassandra over HBase are that it is easy to deploy, maintain, and monitor, and has no single point of failure. 4) From our six months of research and POC experience with Cassandra, CQL is pretty limited. Though CQL is targeted at real-time reads and writes, there are cases where we need to pull out data differently and are OK with a little more latency, but Cassandra doesn't support that. We need MapReduce or Spark for those. Then the debate starts: why Cassandra and why not HBase, if we need Hadoop/Spark for MapReduce anyway? We expected a few more technical features/use cases that are best handled by Cassandra (and how they work). *HBase:* 1) As for #4, you might be interested in reading https://aphyr.com/posts/294-call-me-maybe-cassandra Not sure if there is a comparable article about HBase (anybody know?), but it can give you another perspective about what else to keep an eye on regarding these systems. 2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe 3) http://blog.parsely.com/post/1928/cass/ *Anyone have any comments on this?* 4) 1. No killer features compared to HBase. 2. Terrible tooling!!! Ambari/Cloudera Manager rule; Netflix has its own tool for Cassandra, but it doesn't support vnodes. 3. Rumors say it is fast when it works ;) the reason: it can silently drop data you try to write. 4. Time series is a nightmare. The easiest approach is to just replicate the data to HDFS, partition it by hour/day, and run Spark/Scalding/Pig/Hive/Impala. 5) Migrated from Cassandra to HBase. Reasons: Scans are fast with HBase. It fits better with the time-series data model; please look at OpenTSDB. Cassandra models it with large rows. Server-side filtering: you can use it to filter some of your time-series data on the server side. HBase has better integration with Hadoop in general. We had to write our own bulk loader using MapReduce for Cassandra; HBase already has a tool for that. There is a nice integration with Flume and Kite. High availability didn't matter for us; 10 seconds down is fine for our use cases. HBase started to support eventually consistent reads. 6) Coprocessor framework (custom code inside RegionServers and Masters), which Cassandra is missing, afaik. Coprocessors have been widely used by HBase users (Phoenix SQL, for example) since inception (in 0.92). * HBase's security model is more mature and aligns well with Hadoop/HDFS security. Cassandra provides just basic authentication/authorization/SSL encryption: no Kerberos, no end-to-end data encryption, no cell-level security. 7) Another point to add is the new HBase read high availability using the timeline-consistent region replicas feature from HBase 1.0 onward, which brings HBase closer to Cassandra in terms of read availability during node failures. You have a choice for read availability now. https://issues.apache.org/jira/browse/HBASE-10070 8) HBase can do range scans, and one can attack many problems with range scans. Cassandra can't do range scans. 9) HBase is a distributed, consistent, sorted key-value store.
The sorted bit allows for range scans in addition to the point gets that all K/V stores support. Nothing more, nothing less. It happens to store its data in HDFS by default, and we provide convenient input and output formats for map reduce. *Neutral:* 1) http://khangaonkar.blogspot.com/2013/09/cassandra-vs-hbase-which-nosql-store-do.html 2) The fundamental differences that come to mind are: * HBase is always consistent. Machine outages lead to inability to read or write data on that machine. With Cassandra you can always write. * Cassandra defaults to a random partitioner, so range scans are not possible (by default) * HBase has a range partitioner (if you don't want that the client has to prefix the rowkey with a prefix of a hash of the rowkey). The main feature that set HBase apart are range scans. * HBase is much more tightly integrated with Hadoop/MapReduce/HDFS, etc. You can map reduce directly into HFiles and map those into HBase instantly. * Cassandra has a dedicated company supporting (and promoting) it. * Getting
Re: Hbase vs Cassandra
Hi Jens, Not all the points listed were from me. I posted the HBase vs Cassandra question in both forums and consolidated the responses here for the discussion. On Mon, Jun 8, 2015 at 2:27 PM, Jens Rantil jens.ran...@tink.se wrote: Hi, Some minor comments: 2. Terrible tooling!!! Ambari/Cloudera Manager rule; Netflix has its own tool for Cassandra but it doesn't support vnodes. Not entirely sure what you mean here, but we ran Cloudera for a while and Cloudera Manager was buggy and hard to debug. Overall, our experience wasn't very good. This was definitely also due to us not knowing how all the Cloudera packages were configured. *This is one of the responses I got from the HBase forum. DataStax OpsCenter is there, but it seems it doesn't support the latest Cassandra versions (we tried it a couple of times and there were bugs too).* HBase is always consistent. Machine outages lead to inability to read or write data on that machine. With Cassandra you can always write. Sort of true. You can decide write consistency and throw an exception if the write didn't go through consistently. However, do note that Cassandra will never roll back failed writes, which means writes aren't atomic (as in ACID). *If I understand correctly, you mean that when we write with QUORUM and Cassandra writes to a few machines but fails on a few others, it throws an exception if QUORUM isn't satisfied, leaving the data inconsistent with no rollback?* We chose Cassandra over HBase mostly due to ease of manageability. We are a small team, and my feeling is that you will want dedicated people taking care of a Hadoop cluster if you are going down the HBase path. A Cassandra cluster can be handled by a single engineer and is, in my opinion, easier to maintain. *This is the most popular reason for choosing Cassandra over HBase. But this alone is not a sufficient driver.* Cheers, Jens On Mon, Jun 8, 2015 at 9:59 AM, Ajay ajay.ga...@gmail.com wrote: Hi All, Thanks for all the input.
I posted the same question in the HBase forum and got more responses. Posting the consolidated list here.

Our case is that a central team builds and maintains the platform (Cassandra as a service). We have a couple of use cases that fit Cassandra well, like time-series data. But as a platform team, we need to know more features and use cases that fit, or are best handled in, Cassandra. We also want to understand the use cases where HBase performs better (we might need to offer it as a service too).

*Cassandra:*
1) From 2013, but both can still be relevant: http://www.pythian.com/blog/watch-hbase-vs-cassandra/
2) Here are some use cases from PlanetCassandra.org of companies who chose Cassandra over HBase after evaluation, or migrated to Cassandra from HBase. The eComNext interview cited on the page touches on time-series data: http://planetcassandra.org/hbase-to-cassandra-migration/
3) From googling, the most popular advantages of Cassandra over HBase are that it is easy to deploy, maintain, and monitor, and has no single point of failure.
4) From our six months of research and POC experience with Cassandra, CQL is pretty limited. Though CQL is targeted at real-time reads and writes, there are cases where we need to pull out data differently and are OK with a little more latency. But Cassandra doesn't support that; we need MapReduce or Spark for those. Then the debate starts: why Cassandra and not HBase, if we need Hadoop/Spark for MapReduce anyway? I expected a few more technical features/use cases that are best handled by Cassandra (and how it works).

*HBase:*
1) As for #4, you might be interested in reading https://aphyr.com/posts/294-call-me-maybe-cassandra Not sure if there is a comparable article about HBase (anybody know?), but it can give you another perspective about what else to keep an eye on regarding these systems.
2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe
3) http://blog.parsely.com/post/1928/cass/ *Anyone have any comments on this?*
4) 1. No killer features compared to HBase. 2. Terrible!!! Ambari/Cloudera Manager rulezzz. Netflix has its own tool for Cassandra but it doesn't support vnodes. 3. Rumors say it's fast when it works ;) the reason: it can silently drop data you try to write. 4. Time series is a nightmare. The easiest approach is to just replicate the data to HDFS, partition it by hour/day, and run Spark/Scalding/Pig/Hive/Impala.
5) Migrated from Cassandra to HBase. Reasons: Scans are fast with HBase. It fits better with a time-series data model (please look at OpenTSDB; Cassandra models it with large rows). Server-side filtering: you can use it to filter some of your time-series data on the server side. HBase has better integration with Hadoop in general: we had to write our own bulk loader using MapReduce for Cassandra, while HBase already has a tool for that. There is nice integration with Flume and Kite. High availability didn't matter for us; 10 seconds down is fine for our use cases. HBase started to support eventually consistent reads.
Re: Ghost compaction process
Does `nodetool compactionstats` show nothing running as well? Also, for posterity, what are some details of the setup (C* version, etc.)?

-Tim

--
Tim Heckman
Operations Engineer
PagerDuty, Inc.

On Sun, Jun 7, 2015 at 6:40 PM, Arturas Raizys artu...@noantidot.com wrote: Hello, I'm having a problem where on one node a continuous compaction process keeps running and consuming CPU. nodetool tpstats shows 1 compaction in progress, but if I try to query the system.compactions_in_progress table, I see 0 records. This never-ending compaction slows the node down and it becomes laggy. I'm willing to hire a contractor to solve this problem if anyone is interested. Cheers, Arturas
Ghost compaction process
Hello, I'm having a problem where on one node a continuous compaction process keeps running and consuming CPU. nodetool tpstats shows 1 compaction in progress, but if I try to query the system.compactions_in_progress table, I see 0 records. This never-ending compaction slows the node down and it becomes laggy. I'm willing to hire a contractor to solve this problem if anyone is interested. Cheers, Arturas
Re: Ghost compaction process
Hi,

Does `nodetool compactionstats` show nothing running as well? Also, for posterity, what are some details of the setup (C* version, etc.)?

`nodetool compactionstats` does not return anything; it just waits. If I enable DEBUG logging, I see this line popping up while executing `nodetool compactionstats`:

DEBUG [RMI TCP Connection(1856)-127.0.0.1] 2015-06-08 09:29:46,043 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos

I'm running Cassandra 2.1.14 on a 7-node cluster. We're using small VMs with 8GB of RAM and SSDs. Our data size per node with RF=2 is ~40GB. Load is ~1000 writes/second. Most of the data has a TTL of 2 weeks. Cheers, Arturas
Re: Hbase vs Cassandra
Hi,

Some minor comments:

2. terrible!!! Ambari/cloudera manager rulezzz. Netflix has its own tool for Cassandra but it doesn't support vnodes.

Not entirely sure what you mean here, but we ran Cloudera for a while and Cloudera Manager was buggy and hard to debug. Overall, our experience wasn't very good. This was definitely also due to us not knowing how all the Cloudera packages were configured.

HBase is always consistent. Machine outages lead to an inability to read or write data on that machine. With Cassandra you can always write.

Sort of true. You can choose a write consistency level and have an exception thrown if the write didn't go through consistently. However, do note that Cassandra will never roll back failed writes, which means writes aren't atomic (as in ACID).

We chose Cassandra over HBase mostly due to ease of manageability. We are a small team, and my feeling is that you will want dedicated people taking care of a Hadoop cluster if you are going down the HBase path. A Cassandra cluster can be handled by a single engineer and is, in my opinion, easier to maintain.

Cheers, Jens

On Mon, Jun 8, 2015 at 9:59 AM, Ajay ajay.ga...@gmail.com wrote:

Hi All, Thanks for all the input. I posted the same question in the HBase forum and got more responses. Posting the consolidated list here.

Our case is that a central team builds and maintains the platform (Cassandra as a service). We have a couple of use cases that fit Cassandra well, like time-series data. But as a platform team, we need to know more features and use cases that fit, or are best handled in, Cassandra. We also want to understand the use cases where HBase performs better (we might need to offer it as a service too).

*Cassandra:*
1) From 2013, but both can still be relevant: http://www.pythian.com/blog/watch-hbase-vs-cassandra/
2) Here are some use cases from PlanetCassandra.org of companies who chose Cassandra over HBase after evaluation, or migrated to Cassandra from HBase. The eComNext interview cited on the page touches on time-series data: http://planetcassandra.org/hbase-to-cassandra-migration/
3) From googling, the most popular advantages of Cassandra over HBase are that it is easy to deploy, maintain, and monitor, and has no single point of failure.
4) From our six months of research and POC experience with Cassandra, CQL is pretty limited. Though CQL is targeted at real-time reads and writes, there are cases where we need to pull out data differently and are OK with a little more latency. But Cassandra doesn't support that; we need MapReduce or Spark for those. Then the debate starts: why Cassandra and not HBase, if we need Hadoop/Spark for MapReduce anyway? I expected a few more technical features/use cases that are best handled by Cassandra (and how it works).

*HBase:*
1) As for #4, you might be interested in reading https://aphyr.com/posts/294-call-me-maybe-cassandra Not sure if there is a comparable article about HBase (anybody know?), but it can give you another perspective about what else to keep an eye on regarding these systems.
2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe
3) http://blog.parsely.com/post/1928/cass/ *Anyone have any comments on this?*
4) 1. No killer features compared to HBase. 2. Terrible!!! Ambari/Cloudera Manager rulezzz. Netflix has its own tool for Cassandra but it doesn't support vnodes. 3. Rumors say it's fast when it works ;) the reason: it can silently drop data you try to write. 4. Time series is a nightmare. The easiest approach is to just replicate the data to HDFS, partition it by hour/day, and run Spark/Scalding/Pig/Hive/Impala.
5) Migrated from Cassandra to HBase. Reasons: Scans are fast with HBase. It fits better with a time-series data model (please look at OpenTSDB; Cassandra models it with large rows). Server-side filtering: you can use it to filter some of your time-series data on the server side. HBase has better integration with Hadoop in general: we had to write our own bulk loader using MapReduce for Cassandra, while HBase already has a tool for that. There is nice integration with Flume and Kite. High availability didn't matter for us; 10 seconds down is fine for our use cases. HBase started to support eventually consistent reads.
6) The coprocessor framework (custom code inside the RegionServers and Master servers), which Cassandra is missing, AFAIK. Coprocessors have been widely used by HBase users (Phoenix SQL, for example) since their inception (in 0.92). * The HBase security model is more mature and aligns well with Hadoop/HDFS security. Cassandra provides just basic authentication/authorization/SSL encryption: no Kerberos, no end-to-end data encryption, no cell-level security.
7) Another point to add is the new HBase read high-availability feature using timeline-consistent region replicas, available from HBase 1.0 onward, which brings HBase closer to Cassandra in terms of read availability during node failures. You have a choice for read availability now. https://issues.apache.org/jira/browse/HBASE-10070
Re: Ghost compaction process
Hi,

Is it 2.0.14 or 2.1.4? If you are on 2.1.4 I would recommend an upgrade to 2.1.5 regardless of this issue. From the data you provide it is difficult to assess what the issue is. If you are running with RF=2 you can always add another node and kill that one, if that is the only node that shows the problem. With a 40GB load that is not a big issue.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Mon, Jun 8, 2015 at 4:04 AM, Arturas Raizys artu...@noantidot.com wrote: Hi, Does `nodetool compactionstats` show nothing running as well? Also, for posterity, what are some details of the setup (C* version, etc.)? `nodetool compactionstats` does not return anything; it just waits. If I enable DEBUG logging, I see this line popping up while executing `nodetool compactionstats`: DEBUG [RMI TCP Connection(1856)-127.0.0.1] 2015-06-08 09:29:46,043 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos I'm running Cassandra 2.1.14 on a 7-node cluster. We're using small VMs with 8GB of RAM and SSDs. Our data size per node with RF=2 is ~40GB. Load is ~1000 writes/second. Most of the data has a TTL of 2 weeks. Cheers, Arturas
Re: Hbase vs Cassandra
On Mon, Jun 8, 2015 at 11:16 AM, Ajay ajay.ga...@gmail.com wrote: If I understand correctly, you mean that when we write with QUORUM, Cassandra writes to a few machines, fails to write to the others, and throws an exception if it doesn't satisfy QUORUM, leaving the data inconsistent without rolling back?

Yes.

/Jens

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/tink.se Linkedin http://www.linkedin.com/company/2735919 Twitter https://twitter.com/tink
Re: sstableloader usage doubts
On 05/06/15 22:40, Robert Coli wrote:

On Fri, Jun 5, 2015 at 7:53 AM, Sebastian Estevez sebastian.este...@datastax.com wrote: Since you only restored one DC's sstables, you should be able to rebuild them on the second DC. Refresh means "pick up new SSTables that have been directly added to the data directory". Rebuild means "stream data from other replicas to re-create SSTables from scratch".

Sebastian's response is correct; use rebuild. Sorry that I missed that specific aspect of your question!

Thank you both. So you mean that refresh needs to be used if the cluster is running, but if I stopped Cassandra while copying the sstables, then refresh is useless? So the error "No new SSTables were found" during my refresh attempt is due to the fact that the sstables in my data dir were not new because they were already loaded, and not to the files not being found?

So... if I stop the two nodes on the first DC, restore their sstables' files, and then restart the nodes, nothing else needs to be done on the first DC? And on the second DC instead I just need to do nodetool rebuild -- FirstDC on _both_ nodes?

--
01
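Pulling Rob's answers from this thread together, the restore could be sketched roughly as below. This is a hedged summary, not a verified runbook: the DC name `FirstDC` follows the thread, and details like how you avoid bootstrapping depend on your own configuration.

```shell
# Condensed sketch of the two-DC restore procedure discussed above.
restore_plan() {
  cat <<'EOF'
DC1, each node: stop cassandra; restore the snapshot sstables into the data dirs
DC1, each node: start cassandra (sstables present at startup are opened
                automatically, so no `nodetool refresh` is needed); take care
                to avoid bootstrapping, e.g. by presetting the node's tokens
DC2, each node: nodetool rebuild -- FirstDC
EOF
}
restore_plan
```

The key distinction the thread settles: `refresh` is only for sstables dropped into the data directory of a *running* node, while `rebuild` streams the data from replicas in another DC.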
Restoring all cluster from snapshots
Hello all. I need to transfer and start a copy of the production cluster in a test environment. My steps:

- nodetool snapshot -t `hostname`-#{cluster_name}-#{timestamp} -p #{jmx_port}
- nodetool ring -p #{jmx_port} | grep `/sbin/ifconfig eth0 | grep 'inet addr' | awk -F: '{print $2}' | awk '{print $1}'` | awk '{ print $NF }' | tr '\n' ',' | sudo tee /etc/cassandra/#{cluster_name}.conf/tokens.txt
- rsync snapshots to the backup machine
- copy files to the 2 test servers into the same folders as on production
- sudo rm -rf /db/cassandra/cr/data0*/system/*
- paste the list of initial_token values from step 2 into the cassandra.yaml file on each server
- start both test servers

And instead of gigabytes of my keyspaces I see only:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.40.231.3   151.06 KB  256     100.0%            c505db2f-d14a-4044-949f-cb952ec022f6  RACK01
UN  10.40.231.31  134.59 KB  256     100.0%            12879849-ade0-4dcb-84c0-abb3db996ba7  RACK01

And no mention of my keyspaces here:

[cqlsh 5.0.1 | Cassandra 2.1.3 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
cqlsh> describe keyspaces

system_traces  system

cqlsh>

What am I missing in this process?
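As Rob points out in the reply above, wiping the system keyspace deletes the schema, and sstables can't be loaded for column families that don't exist. A hedged sketch of a restore order that avoids this (keyspace name `my_ks` and host placeholders are mine, not from the thread):

```shell
# Hypothetical sketch, not the thread's exact procedure.
#
# 1. Re-create the schema on the fresh cluster *before* loading data:
#      cqlsh PROD_NODE -e "DESCRIBE KEYSPACE my_ks" > my_ks.cql   # on production
#      cqlsh TEST_NODE -f my_ks.cql                               # on the test cluster
# 2. Copy the snapshot sstables into the matching data directories, then:
#      nodetool refresh my_ks my_table
#
# Helper for the token step: extract one node's tokens from `nodetool ring`
# output as a comma-separated list for initial_token. This part is pure text
# processing, demonstrated on sample output below.
extract_tokens() {
  grep "^$1 " | awk '{ print $NF }' | paste -sd, -
}

sample_ring='10.40.231.3  RACK01  Up  Normal  151.06 KB  100.0%  -9123372036854775808
10.40.231.3  RACK01  Up  Normal  151.06 KB  100.0%  42'
printf '%s\n' "$sample_ring" | extract_tokens 10.40.231.3
# prints: -9123372036854775808,42
```

Note that initial_token only applies if the test nodes start with the same num_tokens as production; with a wiped system keyspace the nodes otherwise generate fresh random tokens.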
Re: Avoiding Data Duplication
Some options I can think of:

1 - Depending on your data size and stime query frequency, you may use Spark to perform queries filtering by server time in the log table, maybe within a device-time window to reduce the dataset your Spark job will need to go through. More info on the Spark connector: https://github.com/datastax/spark-cassandra-connector

2 - If dtime and stime are almost always in the same date bucket (day/hour/minute/second), you may create an additional table stime_log with the same structure, but with the date bucket derived from the stime field. Then, when you have an entry where stime and dtime are not from the same bucket, you insert that entry in both the log and stime_log tables. When you want to query entries by stime, you take the distinct union of the results from both tables in your client application. This way, you only duplicate delayed data.

3 - If your data field is big and you can't afford to duplicate it, create an additional table stime_log, but do not store the data field, only the metadata (imei, date, dtime, stime). Then, when you want to query by stime, first query stime_log, and then query the original log table to fetch the data field.

2015-06-05 18:10 GMT-03:00 Abhishek Singh Bailoo abhishek.singh.bai...@gmail.com:

Hello! I have a column family to log data coming from my GPS devices.

CREATE TABLE log(
  imei ascii,
  date ascii,
  dtime timestamp,
  data ascii,
  stime timestamp,
  PRIMARY KEY ((imei, date), dtime)
) WITH CLUSTERING ORDER BY (dtime DESC);

It is the standard schema for modeling time-series data, where:
imei is the unique ID associated with each GPS device
date is the date taken from dtime
dtime is the date-time coming from the device
data is all the latitude, longitude etc. that the device is sending us
stime is the date-time stamp of the server

The reason I put dtime in the primary key as the clustering column is that most of our queries are done on device time. There can be a delay of a few minutes to a few hours (or a few days! in rare cases) between dtime and stime if the network is not available.

However, now we want to query on server time as well, for the purpose of debugging. These queries will not be as common as queries on device time: say, for every 100 queries on dtime there will be just 1 query on stime.

What options do I have?

1. Secondary index - not possible, because stime is a timestamp and CQL does not allow me to use < or > in a query against a secondary index.
2. Data duplication - I can build another column family indexed by stime, but that means I am storing twice as much data. I know everyone says that write operations are cheap and storage is cheap, but how? If I have to buy twice as many machines on AWS EC2, each with its own ephemeral storage, then my bill doubles up!

Any other ideas I can try?

Many Thanks, Abhishek
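Paulo's option 3 (metadata only, no duplicated data field) could look roughly like the following. This is a sketch under assumptions: the table name stime_log and the sdate bucket column are placeholders of mine, mirroring the structure of the original log table.

```
-- Hypothetical server-time index table for option 3 above.
CREATE TABLE stime_log (
  imei  ascii,
  sdate ascii,        -- date bucket derived from stime, not dtime
  stime timestamp,
  dtime timestamp,    -- pointer back into log's clustering key
  PRIMARY KEY ((imei, sdate), stime)
) WITH CLUSTERING ORDER BY (stime DESC);

-- Querying by server time then becomes two steps:
-- 1. SELECT imei, dtime FROM stime_log
--      WHERE imei = ? AND sdate = ? AND stime >= ? AND stime <= ?;
-- 2. SELECT data FROM log
--      WHERE imei = ? AND date = ? AND dtime = ?;   -- date derived from dtime
```

Since stime queries are ~1% of the workload here, paying an extra round trip on those reads in exchange for not duplicating the data column seems like a reasonable trade-off.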
[RELEASE] Apache Cassandra 2.2.0-rc1 released
The Cassandra team is pleased to announce the release of Apache Cassandra version 2.2.0-rc1. Apache Cassandra is a fully distributed database. It is the right choice when you need scalability and high availability without compromising performance. http://cassandra.apache.org/ Downloads of source and binary distributions are listed in our download section: http://cassandra.apache.org/download/ This version is a release candidate[1] for the 2.2 series. As always, please pay attention to the release notes[2] and let us know[3] if you encounter any problems. Enjoy! [1]: http://goo.gl/pBjybx (CHANGES.txt) [2]: http://goo.gl/E1RiHd (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA
[RELEASE] Apache Cassandra 2.1.6 released
The Cassandra team is pleased to announce the release of Apache Cassandra version 2.1.6. We are now calling the 2.1 series stable and suitable for production. Apache Cassandra is a fully distributed database. It is the right choice when you need scalability and high availability without compromising performance. http://cassandra.apache.org/ Downloads of source and binary distributions are listed in our download section: http://cassandra.apache.org/download/ This version is a bug-fix release[1] on the 2.1 series. As always, please pay attention to the release notes[2] and let us know[3] if you encounter any problems. Enjoy! [1]: http://goo.gl/8aR9L2 (CHANGES.txt) [2]: http://goo.gl/dstU4D (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA