Re: Cassandra rack awareness
Hi Rob,

Thanks for sharing the link. I have gone through it and a few other documents as well, but I am still confused. It seems that if we use vnodes and NetworkTopologyStrategy, we should use a single rack configuration in Cassandra; otherwise, it can create hotspots in the ring. Not sure if my understanding is correct.

Regards.

On 28-Feb-2015, at 2:42 am, Robert Coli rc...@eventbrite.com wrote:

On Fri, Feb 27, 2015 at 7:30 AM, Amlan Roy amlan@cleartrip.com wrote:

I am new to Cassandra and trying to set up a Cassandra 2.0 cluster using 4 nodes, 2 each in 2 different racks. All are in the same data centre. This is what I see in the documentation:

To use racks correctly:
- Use the same number of nodes in each rack.
- Use one rack and place the nodes in different racks in an alternating pattern. This allows you to still get the benefits of Cassandra's rack feature, and allows for quick and fully functional expansions. Once the cluster is stable, you can swap nodes and make the appropriate moves to ensure that nodes are placed in the ring in an alternating fashion with respect to the racks.

What I have understood is that, in cassandra-rackdc.properties, I need to use a single rack name even though I have 2 racks, and then place the nodes in such an order that they alternate between racks - RAC1-NODE1, RAC2-NODE1, RAC1-NODE2, RAC2-NODE2. Just wanted to know if this is correct. If yes, how do I enforce this order while adding nodes?

https://issues.apache.org/jira/browse/CASSANDRA-3810

=Rob
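For reference, the single-rack workaround discussed above would look roughly like this in cassandra-rackdc.properties on each node (a sketch assuming GossipingPropertyFileSnitch; the dc/rack names are placeholders):

```
# cassandra-rackdc.properties (same on every node)
# One logical rack, even though the machines sit in two physical racks:
dc=DC1
rack=RAC1
```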
What are the factors that affect the release time of each minor version?
Hi all,

As a user of Cassandra, sometimes there are bugs in my cluster and I hope someone can fix them (of course, if I can fix them myself I'll try to contribute my code :) ). For each bug there is a JIRA ticket tracking it, so users can know when the bug is fixed. However, there is a lag between a bug being fixed and a new minor version being released. Although we can apply the patch from the ticket to our online version and build a special snapshot to solve the trouble in our clusters, or use the latest code directly, I think many users still want to use an official release, for higher reliability and, indeed, more convenience. In addition, updating more frequently can also reduce the trouble caused by unknown bugs. So people may often ask: when will the new version with this patch be released?

As far as I can tell, neither the number of issues resolved in each version nor the time interval between two versions is fixed. So may I ask what factors affect the release time of each minor version? Furthermore, apart from the vote on the dev@cassandra mailing list that I can see, what are the duties involved in releasing a version? If it is not heavy work, could we make releases more frequent? Or could we make a rule to decide when a new version is needed? For example: if the latest version was released two weeks ago, or if 20 issues have been resolved since the latest version, we release a new minor version.

--
Thanks, Phil Yang
Re: Cassandra rack awareness
As far as I know, the main thing about using NetworkTopologyStrategy and different racks is replica placement throughout your cluster. That strategy favours different racks when choosing where a row's replicas will be placed. So, if you have different numbers of nodes in each rack, you will probably end up with an unbalanced cluster (in terms of data volume), not because of the actual row partitioning, but because of the replicas. The effect also depends on your replication factor. (You can sit down and do the math yourself.)

I had an issue like that some time ago, because I was not aware of that behaviour, didn't really care about where my machines were, and was using SimpleStrategy. When I decided to go for NetworkTopologyStrategy, I realized I had a bad physical configuration (4 nodes in one rack, 1 node in another), so I had to fake that last node's rack, as if it were in the same rack as the other nodes; otherwise that node would sit alone in its rack with twice the amount of data the other ones had. (As I said, that could be even worse with a higher replication factor.)

To be honest, I'm not sure I fully understand the documentation you quoted in your first email, especially the last sentence. But my (limited) experience with Cassandra (2.1) tells me that if you start off with a balanced rack setup, you'll be fine. Otherwise, you'll have to change your nodes' physical location, or fake it in the config file, and run repair and cleanup on your entire cluster (which is a pain in the ass) to get a balanced cluster again. I had to do that. =P

On Sat, Feb 28, 2015 at 6:05 AM, Amlan Roy amlan@cleartrip.com wrote:

Hi Rob, Thanks for sharing the link. I have gone through it and a few other documents as well. Still I am confused. It seems, if we use vnodes and NetworkTopologyStrategy, we should use a single rack configuration in Cassandra. Or, it can create hotspots in the ring. Not sure if my understanding is correct. Regards.
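For reference, replica counts with NetworkTopologyStrategy are declared per data centre; a minimal sketch (keyspace and data centre names are placeholders):

```
-- Two replicas in data centre DC1; NetworkTopologyStrategy tries to place
-- those replicas on nodes in distinct racks, which is why racks with uneven
-- node counts end up holding uneven amounts of data.
CREATE KEYSPACE example_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2};
```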
Re: how to make unique constraints in cassandra
As far as I know there is no such thing. You could make that value a single PK for the table, thereby guaranteeing uniqueness, and check on insert with `IF NOT EXISTS` to prevent dups. Of course that works just for one value; if you have multiple values, a compound PK will still let dups in for a given column, so a different data model might be the answer there.

If the values in those columns can be enumerated, then you could always create a bitmask for the collection of values of the must-be-unique columns and use that as part of your PK. If the ranges are too broad, then some sort of hash made up of the values that need to be unique would be my next attempt. You could also simply enforce the uniqueness programmatically, but that will require a read-before-you-write approach. There is also the possibility of using a Cassandra trigger (but I hear they are dangerous and might not buy you anything more than the client-side programmatic approach).

Cheers, Brian
http://integrallis.com

On Fri, Feb 27, 2015 at 9:29 PM, ROBIN SRIVASTAVA srivastava.robi...@gmail.com wrote:

I want to make a unique constraint in Cassandra, i.e. I want all the values in my column family to be unique. Example: name-rahul, phone-123, address-abc. Now I want that in this row no value equal to rahul, 123 or abc gets inserted again. Searching on DataStax I found that I can achieve it by doing a query on the partition key with IF NOT EXISTS, but I haven't found a solution for keeping all three values unique. That means if name-jacob, phone-123, address-qwe comes in, it should also not be inserted into my database, as its phone column has the same value as the row with name-rahul.
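The single-value approach Brian describes can be sketched in CQL like this (table and column names are made up for illustration):

```
-- Phone number as the sole primary key; the lightweight transaction
-- rejects a second insert with the same phone.
CREATE TABLE users_by_phone (
    phone   text PRIMARY KEY,
    name    text,
    address text
);

INSERT INTO users_by_phone (phone, name, address)
VALUES ('123', 'rahul', 'abc')
IF NOT EXISTS;

-- A later insert with phone = '123' returns [applied] = false and
-- leaves the existing row untouched.
```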
Re: Caching the PreparedStatement (Java driver)
Hi,

My earlier question was whether it is safe to cache PreparedStatement (using the Java driver) on the client side, which Olivier confirmed. Now the question is: do we really need to cache the PreparedStatement on the client side? Let's take a scenario as below:

1) Client fires a REST query SELECT * from Test where Pk = val1;
2) REST service prepares a statement SELECT * from Test where Pk = ?
3) Executes the PreparedStatement by setting the values.
4) Assume we don't cache the PreparedStatement.
5) Client fires another REST query SELECT * from Test where Pk = val2;
6) REST service prepares a statement SELECT * from Test where Pk = ?
7) Executes the PreparedStatement by setting the values.

In this case, is there any benefit to using the PreparedStatement? From the Java driver code, Session.prepare(query) doesn't check whether a similar query was prepared earlier or not; it directly calls the server, passing the query, and the server returns a prepared id. Does the server maintain a cache of prepared queries, or does it still perform all the steps to prepare a query when the client prepares the same query more than once (using the same Session and Cluster instance, which I think doesn't matter)?

Thanks
Ajay

On Sat, Feb 28, 2015 at 9:17 AM, Ajay ajay.ga...@gmail.com wrote:

Thanks Olivier. Most of the REST query calls would come from other applications to write/read to/from Cassandra, which means most queries from an application would be the same (same column families but different values).

Thanks
Ajay

On 28-Feb-2015 6:05 am, Olivier Michallat olivier.michal...@datastax.com wrote:

Hi Ajay,

Yes, it is safe to hold a reference to PreparedStatement instances in your client code. If you always run the same pre-defined statements, you can store them as fields in your resource classes.
If your statements are dynamically generated (for example, inserting different subsets of the columns depending on what was provided in the REST payload), your caching approach is valid. When you evict a PreparedStatement from your cache, the driver will also remove the corresponding id from its internal cache. If you re-prepare it later it might still be in the Cassandra-side cache, but that is not a problem.

One caveat: you should be reasonably confident that your prepared statements will be reused. If your query strings are always different, preparing will bring no advantage.

--
Olivier Michallat
Driver tools engineer, DataStax

On Fri, Feb 27, 2015 at 7:04 PM, Ajay ajay.ga...@gmail.com wrote:

Hi,

We are building REST APIs for Cassandra using the Cassandra Java Driver. As per the guidelines below from the documentation, we are caching the Cluster instance (per cluster) and the Session instance (per keyspace), as they are thread-safe.

http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/fourSimpleRules.html

As the Cluster and Session instance(s) are already cached in the application, and as PreparedStatement provides better performance, we thought to build the PreparedStatement for each REST query implicitly (as REST calls are stateless) and cache the PreparedStatement. Whenever a REST query is invoked, we look for a PreparedStatement in the cache, and create and put it in the cache if it doesn't exist. (The cache is in-memory, fixed size, LRU-based.) Is it a safe approach to cache PreparedStatement on the client side? Looking at the Java driver code, the Cluster class stores the PreparedStatements as weak references (to re-prepare when a node is down or a new node is added).

Thanks
Ajay

To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
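The fixed-size LRU cache Ajay describes can be sketched in plain Java. This is illustrative only, not the driver's API: in the REST service the value type would be the driver's PreparedStatement, but a plain type parameter is used here so the example stands alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a fixed-size, access-ordered LRU cache keyed by the
// CQL query string. LinkedHashMap with accessOrder = true keeps entries in
// least-recently-used-first iteration order for us.
class LruCache<V> extends LinkedHashMap<String, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Evict the least recently used entry once the cap is exceeded.
        return size() > maxEntries;
    }
}
```

With a real driver Session one might write something like `cache.computeIfAbsent(cql, q -> session.prepare(q))`. Note that LinkedHashMap is not thread-safe, so a production version would need synchronization or a concurrent cache library.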
Re: how to make unique constraints in cassandra
I agree with Peter. I typically keep in Cassandra just the data that will benefit from its distribution and replication capabilities; most of the applications in which I use Cassandra also use a relational DB, so it's a best-tool-for-the-job type of approach. As for the PK, some identifiers are temporal and some are not, and a PK should use non-temporal ones: phones and addresses are a bad idea, SSNs might be OK, but in general for C* synthetic PKs are best, e.g. UUIDs or TimeUUIDs.

On Sat, Feb 28, 2015 at 8:42 AM, Peter Lin wool...@gmail.com wrote:

Hate to be the one to point this out, but that is not the ideal use case for Cassandra. [...]

--
Cheers, Brian
http://www.integrallis.com
Re: how to make unique constraints in cassandra
Hate to be the one to point this out, but that is not the ideal use case for Cassandra. If you really want to brute force it and make it fit Cassandra, the easiest way is to create a class called Index. The Index class would have name, phone and address fields, and the hashCode and equals methods would need to be overridden; basically the same rules that apply to HashMap keys. Once you have that, you can use IF NOT EXISTS in the insert.

Here's the risk, though. What happens when the phone and address change? The hashCode and equals will no longer match and queries will fail. This is a general anti-pattern for anything that uses maps and is mutable, and it is the reason the default Object.hashCode in Java is system-assigned and there are all sorts of warnings about overriding hashCode and equals. The only case where this works correctly is immutable or temporal databases. I would strongly advise against using a key that is a compound of name+phone+address; if the system needs to perform updates, the key needs to be immutable so that queries won't fail.

On Sat, Feb 28, 2015 at 10:18 AM, Brian Sam-Bodden bsbod...@integrallis.com wrote:

As far as I know there is no such thing. You could make that value a single PK for the table therefore guaranteeing uniqueness and check on insert with `IF NOT EXISTS` to prevent dups. [...]
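The Index class Peter describes might look like the following sketch. The field names come from the thread; everything else, including making the class immutable as he advises, is illustrative.

```java
import java.util.Objects;

// Sketch of the "Index" key class: immutable fields, with equals/hashCode
// derived from all three must-be-unique columns, per the HashMap-key rules
// Peter mentions.
final class Index {
    private final String name;
    private final String phone;
    private final String address;

    Index(String name, String phone, String address) {
        this.name = name;
        this.phone = phone;
        this.address = address;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Index)) return false;
        Index other = (Index) o;
        return Objects.equals(name, other.name)
                && Objects.equals(phone, other.phone)
                && Objects.equals(address, other.address);
    }

    @Override
    public int hashCode() {
        // Combine exactly the fields used in equals(), per the usual contract.
        return Objects.hash(name, phone, address);
    }
}
```

Because the fields are final, an Index value can never drift after insertion, which sidesteps the mutability risk Peter warns about.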
Re: Error on nodetool cleanup
Thanks a lot for pointing this out! Yes, a workaround would be very much appreciated, or an ETA for 2.0.13, so that I can decide whether or not to go for an officially unsupported 2.0.12 to 2.0.11 downgrade, since I really need that cleanup.

Thanks

On Feb 27, 2015 10:53 PM, Jeff Wehrwein j...@refresh.io wrote:

We had the exact same problem, and found this bug: https://issues.apache.org/jira/browse/CASSANDRA-8716. It's fixed in 2.0.13 (unreleased), but we haven't found a workaround for the interim. Please share if you find one!

Thanks,
Jeff

On Fri, Feb 27, 2015 at 6:01 PM, Gianluca Borello gianl...@draios.com wrote:

Hello,

I have a cluster of four nodes running 2.0.12. I added one more node and then went on with the cleanup procedure on the other four nodes, but I get this error (the same error on each node):

INFO [CompactionExecutor:10] 2015-02-28 01:55:15,097 CompactionManager.java (line 619) Cleaned up to /raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-tmp-jb-432-Data.db. 8,253,257 to 8,253,257 (~100% of original) bytes for 5 keys. Time: 304ms.
INFO [CompactionExecutor:10] 2015-02-28 01:55:15,100 CompactionManager.java (line 563) Cleaning up SSTableReader(path='/raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-jb-431-Data.db')
ERROR [CompactionExecutor:10] 2015-02-28 01:55:15,102 CassandraDaemon.java (line 199) Exception in thread Thread[CompactionExecutor:10,1,main]
java.lang.AssertionError: Memory was freed
    at org.apache.cassandra.io.util.Memory.checkPosition(Memory.java:259)
    at org.apache.cassandra.io.util.Memory.getInt(Memory.java:211)
    at org.apache.cassandra.io.sstable.IndexSummary.getIndex(IndexSummary.java:79)
    at org.apache.cassandra.io.sstable.IndexSummary.getKey(IndexSummary.java:84)
    at org.apache.cassandra.io.sstable.IndexSummary.binarySearch(IndexSummary.java:58)
    at org.apache.cassandra.io.sstable.SSTableReader.getIndexScanPosition(SSTableReader.java:602)
    at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:947)
    at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:910)
    at org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:819)
    at org.apache.cassandra.db.ColumnFamilyStore.getExpectedCompactedFileSize(ColumnFamilyStore.java:1088)
    at org.apache.cassandra.db.compaction.CompactionManager.doCleanupCompaction(CompactionManager.java:564)
    at org.apache.cassandra.db.compaction.CompactionManager.access$400(CompactionManager.java:63)
    at org.apache.cassandra.db.compaction.CompactionManager$5.perform(CompactionManager.java:281)
    at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:225)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
INFO [FlushWriter:1] 2015-02-28 01:55:15,111 Memtable.java (line 398) Completed flushing /raid0/cassandra/data/draios/mounted_fs_by_agent1/draios-mounted_fs_by_agent1-jb-132895-Data.db (2513856 bytes) for commitlog position ReplayPosition(segmentId=1425088070445, position=2041)

This happens with all column families, and they are not particularly big, if that matters. How can I reclaim the free space for which I expanded the cluster in the first place?

Thank you
Re: how to make unique constraints in cassandra
I second UUIDs, they're your friend.

On Sat, Feb 28, 2015 at 10:56 AM, Brian Sam-Bodden bsbod...@integrallis.com wrote:

I agree with Peter. I typically keep in Cassandra just the data that will benefit from it's distribution and replication capabilities. [...]

--
Cheers, Brian
http://www.integrallis.com
Re: sstables remain after compaction
Hi Rob, sorry for the late response, festive season here. The Cassandra version is 1.0.8, and thank you, I will read up on the READ_STAGE threads.

Jason

On Wed, Feb 18, 2015 at 3:33 AM, Robert Coli rc...@eventbrite.com wrote:

On Fri, Feb 13, 2015 at 7:45 PM, Jason Wee peich...@gmail.com wrote:

I trigger user defined compaction on big sstables (big as in the size per sstable reaching more than 50GB, some 100GB). Occasionally, after user defined compaction, I see some sstables remain, even after 12 hours have elapsed.

That is unexpected. What version of Cassandra?

You mentioned a thread, could you tell which threads those are or perhaps highlight them in the code?

I'd presume READ_STAGE threads.

=Rob
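For context, user-defined compactions of the kind Jason describes were typically triggered through the CompactionManager MBean over JMX. A rough sketch using the jmxterm CLI follows; the MBean and operation names are as I recall them for that era of Cassandra, and the keyspace and sstable file names are placeholders, so verify all of this against your own build before relying on it:

```
echo "run -b org.apache.cassandra.db:type=CompactionManager \
  forceUserDefinedCompaction myks myks-mycf-hc-1234-Data.db" \
  | java -jar jmxterm.jar -l localhost:7199 -n
```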